12 - 16 Sep 2021
Online Training
Presented by: Eng. Mahmoud Abd El Aziz
www.petroknowledge.com
About Us
Leading Provider of Training Services
to the International Oil, Gas
and Energy Industry
PetroKnowledge offers an extensive range of training seminars, courses
and workshops for the Oil, Gas, Petrochemical and Energy Industries. These
include technical and specialized training seminars in Offshore & Marine
Technology, LNG Process Operations, Refining & Petrochemicals, Geology,
Seismology & Petrophysics, Well Drilling & Completion Technology,
Mechanical & Process Engineering, Maintenance Management, Health
& Safety, and Instrumentation & Process Control, as well as Business and
Management courses, which include International Petroleum Business,
Leadership & Strategic Planning, Finance & Accounting, Project &
Construction Management, Contracts & Procurement Management, and
Human Resource Management.
North America
• Boston, U.S.A.
• Houston TX, U.S.A.
• New York, U.S.A.
Latin America
• Bogota, Colombia
• Mexico City, Mexico
• Rio De Janeiro, Brazil
Middle East
• Abu Dhabi, U.A.E.
• Dubai, U.A.E.
• Muscat, Oman
Maintenance Scheduling
Using Big Data, IoT
and Agent-Based
Simulation
Accurately Predict and Perform Maintenance
When & Where needed
Why Choose this Online Training Course?
Maintenance schedules and actual failure rates often differ, and quite
frequently we need to either shorten the interval between regular maintenance or
send assets for emergency repairs. With Big Data and IoT, maintenance
planning and failure rate prediction are now much easier, and the companies that
take advantage of these concepts are improving their maintenance schedules and
reducing costs and downtime, thereby gaining an edge over their competition.
With the addition of agent-based simulation, machine learning and deep learning
algorithms can be expedited and maintenance predictions made as close to
real life as possible, because we can simulate the behavior of aging assets and of a
new workforce, or the introduction of cutting-edge technology to an aging
workforce, something that is not covered in user manuals but is omnipresent in
today’s industry.
This online training course is designed for all professionals working in the field of
data analysis, oil and gas exploration, geology and reservoir modelling, process
improvement, asset management and maintenance management.
This online training course is suitable for a wide range of professionals but
will greatly benefit:
• Other professionals involved in procurement, maintenance and operations of
assets
This PetroKnowledge online training course will utilize a variety of proven online
learning techniques to ensure maximum understanding, comprehension and retention of
the information presented. The training course is conducted online via an Advanced
Virtual Learning Platform, in the comfort of a location of your choice.
Daily Agenda
DAY1
THE NEED FOR MAINTENANCE
What is Maintenance?
BS 3811:1974
Maintenance is defined as:
• The work undertaken in order to keep or restore a facility
to an acceptable standard level.
or
• The combination of activities by which a facility is kept
in, or restored to, a state in which it can perform to an
acceptable standard.
Maintenance Policies
“To Keep”: Planned Maintenance
- Time Based Maintenance
- Condition Based Maintenance
- Risk Based Maintenance
“To Restore”: Unplanned Maintenance
- Corrective Maintenance
- Run To Failure
- Emergency Maintenance
- Breakdown Maintenance
Preventive maintenance - Time-based PM
❑ Pure time (calendar) based: weekly, monthly, annually, etc.
❑ Used (running) time based: 1000 km, 1000 RH, 3000 RH, etc.
(RH = running hours)
What is Maintenance? Example:
1- How to keep or restore the facility at an acceptable standard
level under certain operating conditions?
• System/equipment description
• Main parameters
• Main items
• Functional block diagram
• Criticality
• Working conditions
2- How to prevent the failures?
Main failures:
PM:
3- How to discover the hidden failures?
Main failures:
Policy:
4- How to detect the early failures?
Main failures:
Policy:
5- How to minimize the risk of failures?
Main failures:
Risk:
Policy:
Maintenance Planner: Experience, Information, Tools
Experience:
❑ Technical
❑ Planning
❑ Analysis
❑ Decision making
❑ Problem solving
❑ Working conditions, etc.
Information:
❑ Catalogue
❑ Forms / reports
❑ Data collection
❑ PM levels
❑ Job plans for each PM level
❑ Resources
❑ Cost rates
❑ CM work orders
❑ Failure analysis, etc.
Tools:
❑ Computer programs
❑ International standards
❑ Management tools, etc.
What is the ratio between Maintenance Cost
& Manufacturing Costs?
• Maintenance costs are a major part of the total operating costs of all
manufacturing or production plants.
Direct cost:
• Spare parts & supplies cost
• Labor cost
• Contract cost
Indirect cost:
• Overhead cost
• Down time cost
Maintenance cost = Direct cost + Overhead cost
[Figure: cost curve; labels: Cost, PM Cost]
What is Maintenance Management (MM)?
Through:
1. Define the target and constraints,
2. Information collecting & analysis,
3. Maintenance planning,
4. Maintenance organization,
5. Motivation & direction,
6. Maintenance control,
7. Corrective actions, and
8. Learned lessons.
Maintenance Management History
[Figure: maintenance generations on a time axis from 1940 to 1990]
1st generation:
• Fix it when it broke
2nd generation:
• Higher plant availability
• Longer equipment life
3rd generation:
• Higher plant availability & reliability
• Greater safety
• Better product quality
• No damage to ...
Other labels: Customer Satisfaction; Time, Cost & Resources
Maintenance Planning Concept:
Before you start maintenance planning, consider...
• Who is the ultimate customer?
• What are the customer needs?
• How long will the maintenance project last?
• Where are we now?
• Where should we end up?
• What are the cost constraints?
• What are the technical challenges?
2- MM Time plans:
• Long term 2 to 10 y Risk 15 to 25%
• Medium term 6m to 1 y Risk 7 to 10%
• Short term 1w to 3 m Risk 3 to 5%
3- MM risk plans:
• Target plan (normal or most likely)
• Optimistic plan (best case)
• Pessimistic plan (worst case)
4- MM Strategic Plans:
• Strategic plan
• Tactical plan
• Operational plan
• Urgent plan
5- MM Planning Level:
• Overall plan “Complete information”
• Partial plan “Incomplete information”
• Urgent plan “Without information”
[Figure: maintenance system as input, process and output]
Inputs:
- Technical constraints
- Financial constraints
- Target
- Information
- Resources
Process: Facility / Plant maintenance processes
Outputs:
- Maintenance at acceptable standard
- Reports
- Maintenance performance indicators
The output is equipment that is up, reliable, and well configured to
achieve the planned operation of the plant.
Sub-system: Water Pump Unit
Control system
380 V, 3 ph
Environment
Figure – Main Components
Pump specifications: Valves specifications:
- -
Motor specifications: -
Coupling specifications:
- -
Bearing specifications: Strainer specifications:
- -
Current PM Program:
Item                 Job plan     Frequency
(1) Motor
(2) Coupling
(3) Pump
(4) Suction line
(5) Discharge line
(6) Valves
Root Cause Failure Analysis:
Item                 Main Failures     Root Cause     MTBF
(1) Motor
(2) Coupling
(3) Pump
(4) Suction line
(5) Discharge line
(6) Valves
1) Motor:
Failure PM PrD CM
Policy Freq. Policy Freq.
2) Coupling:
3) Pump:
4) Suction line:
5) Discharge line:
6) Valves:
Developed PM Program:
Item                 Job plan     Frequency
(1) Motor
(2) Coupling
(3) Pump
(4) Suction line
(5) Discharge line
(6) Valves
Maintenance Planning Steps:
1. System criticality analysis
2. Equipment selection
3. Information collection & analysis
4. Target & constraints definitions
5. Requirements & standard levels
6. Main failures determination
7. Root cause failure analysis (RCFA)
8. Best maintenance policy
9. Maintenance policy planning
10. Work orders
11. Measure
12. Analysis
13. Action
14. Performance evaluation & KPI
15. Improvement
Maintenance Planning Steps:
Step: Description
1. System criticality analysis: HSE, Process, Down time, Cost, MTBF, MTTR, etc.
2. Equipment selection:
   • Critical equipment
   • Non-critical equipment
3. Information collection & analysis: Maintenance catalog, Design
   information, Equipment history, Working conditions, PMs, CMs,
   Troubleshooting, Reliability information, HSE instructions, etc.
4. Target & constraints definitions:
   • Targets: Reliability, Availability, MTBF, MTTR, Down time, Cost,
     HSE level, etc.
   • Constraints: Budget, Spare parts, Tools, Manpower, Information, etc.
5. Requirements & standard levels:
   • Functional levels: Flow rate, Head, Pressure, Power, etc.
   • HSE levels
6. Main failures determination: Functional failures, HSE failures,
   Mechanical failures, Electrical failures, etc.
7. Root Cause Failure Analysis (RCFA): Main failures, Root cause, RRC,
   Mechanism, Probability, MTBF, MTTR, Remedy.
8. Best maintenance policy:
   • Run To Failure (RTF)
   • Time-based (Preventive) PM
   • Condition-based (Predictive) PdM
   • Risk-based (Proactive) PrM
9. Maintenance policy planning: Frequency, Levels, Alarm limits, Tools,
   Job plan, HSE plan, Spare parts, Duration, Manpower, etc.
10. Work orders:
   • W/O #, W/O type, Dates/time, Responsibility, Level, Alarm limits,
     Tools, Job plan, HSE plan, Spare parts, Duration, Manpower,
     Failure, Root cause, etc.
   • Complete feedback.
11. Measure: Running hours, Noise, Vibration, Temperature, Oil level,
   Viscosity, Flow rate, Head, Speed, etc.
12. Analysis: Noise analysis, Vibration analysis, Temperature analysis,
   Oil analysis, Flow rate analysis, Head analysis, Speed analysis, etc.
13. Action:
   - Good condition
   - Call for service (PM)
   - Call for repair (planned CM)
   - Breakdown (unplanned CM)
14. Performance evaluation & KPI: CM/PM, MTBF, MTTR, MTBM, MTTM,
   Reliability, Availability, Maintainability, RAM, Spare parts
   consumption rates, etc.
15. Improvement:
   • Information, Maintenance levels, Tools, Spare parts, Manpower
     skills, Time, HSE, etc.
   • Approach: FMEA, RCM, RBI, PMIS, etc.
Risk analysis
Maintenance Policies
(1) Failure-Based, Reactive (ReM):
- RTF
(2) Time-Based, Preventive (PM):
- Calendar: Weekly, ...
(3) Condition-Based, Predictive (PdM):
- Oil analysis
- Vibration analysis
- Temperature analysis
(4) Risk-Based, Proactive (PaM):
- RCFA
(5) Total-Based, Global (GM):
- OSM
Figure (1): Classification of maintenance policies.
[Venkatesh 2003, Waeyenberg and Pintelon 2004, and Gomaa et al. 2005]
Policy: Approach / Goals
• Reactive: Run to failure (fix it when broke). / Minimize maintenance
  costs for non-critical equipment.
• Preventive: Use-based maintenance program. / Minimize equipment
  breakdown.
• Predictive: Maintenance decision based on equipment condition. /
  Discover hidden failures and improve reliability for critical equipment.
• Proactive: Detection of sources of failures. / Minimize the risk of
  failures for critical systems.
• Global: Integrated approach. / Maximize the system productivity.
Policy: Approach / Goals
• RCFA: Identification of root causes of failures. / Eliminate failures.
• FMECA: Identification of criticality of failures. / Improve equipment
  availability.
• HAZOP: Identification of hazards and problems associated with
  operations. / Improve HSE effect.
• RCM: Determination of best maintenance requirements for critical
  systems. / Preserve system function & improve reliability.
• RBI: Determination of an optimum inspection plan for critical
  systems. / Improve system HSE and availability.
2- PM Management
Maintenance Works
Planned: ≥ 70 %
Unplanned: ≤ 30 %
What are the main Elements of Maintenance Plan?
MAINTENANCE WORK ORDER
• Work order number
Requester Section:
• Plant (or department) name / code
• Equipment name / code
• Equipment priority
• Maintenance type & level (PM / Repair / Overhaul)
• Job scope & description
• Responsibility
Planning Section:
• Manpower types & skills
• Time estimation
• Spare parts
• Special tools
• Expected equipment down time (from xxx to xxx)
• Cost estimation
• Safety instructions
• Responsibility
Craft Feedback:
• Job scope & description
• Manpower types & skills
• Time estimation
• Spare parts
• Special tools
• Actual equipment down time (from xxx to xxx)
• Actual Cost
• Responsibility
Coding:
• Plant (or department), Equipment
• Resources (Manpower, Spare parts, Special tools)
3- Maintenance Control
2- Time control
• Behind schedule (late)
• Ahead of schedule (early)
3- Cost control
• Cost overrun
• Cost under-run
4- Quality control
• Acceptable level
• Non-acceptable level
5- Inventory control
• Over estimation
• Under estimation
6- Resources control
• Over estimation
• Under estimation
Control Steps:
1- What to control?
2- What is the standard (target) performance?
3- What is the actual performance level?
4- Comparison between the actual & target.
5- Detection of variance
6- Identification of causes of variance
7- Corrective actions
8- Learned lessons.
System Effectiveness
Efficiency Availability
MTBM MTTM
Maintenance Control Levels:
- Maintenance Follow-up
- (Actual/Plan)
Productivity:
• It is a combination of both effectiveness &
efficiency.
Productivity index
= Output obtained / Input expended
= Performance achieved / Resources consumed
• Good PM program + good bonus & incentive system: 60 – 80 %
• Good PM program based on RCM + good bonus & incentive system: more than 80 %
Maintenance Performance Evaluation
Availability:
$$A = \frac{S - d}{S} \times 100\%$$
Percentage of downtime:
$$I_d = 100\% - A$$
Mean time between failures:
$$\mathrm{MTBF} = \frac{S - d}{f}$$
Mean time to repair:
$$\mathrm{MTTR} = \frac{d_f}{f}$$
Where:
S = Scheduled production time
d = Downtime
f = Number of failures
d_f = Downtime delays from failures
Example:
Scheduled production time = 31 day
Downtime = 6 day
Number of failures = 3 failure/month
$$A = \frac{31 - 6}{31} \times 100\% = 80.6\%$$
$$I_d = 100\% - 80.6\% = 19.4\%$$
$$\mathrm{MTBF} = \frac{31 - 6}{3} = 8.33 \text{ days}$$
$$\mathrm{MTTR} = \frac{6}{3} = 2 \text{ days}$$
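The worked example can be checked with a short Python sketch. The function name and the dictionary keys are illustrative choices, and the sketch assumes that all downtime is caused by failures (d_f = d) unless a separate value is passed, which is what the example itself does.

```python
def maintenance_kpis(scheduled, downtime, failures, failure_downtime=None):
    """Return availability (%), downtime (%), MTBF and MTTR.

    scheduled        -- S, scheduled production time
    downtime         -- d, total downtime
    failures         -- f, number of failures
    failure_downtime -- d_f, downtime caused by failures
                        (assumed equal to d when omitted)
    """
    if failure_downtime is None:
        failure_downtime = downtime
    availability = (scheduled - downtime) / scheduled * 100
    return {
        "availability_pct": availability,
        "downtime_pct": 100 - availability,
        "mtbf": (scheduled - downtime) / failures,
        "mttr": failure_downtime / failures,
    }


# Data from the example: S = 31 days, d = 6 days, f = 3 failures.
kpis = maintenance_kpis(scheduled=31, downtime=6, failures=3)
for name, value in kpis.items():
    print(f"{name}: {value:.2f}")
```

All time inputs must share one unit (days here); MTBF and MTTR come out in that same unit.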
$$\text{Availability} = A = \frac{\text{Planned production time} - \text{Unplanned downtime}}{\text{Planned production time}} \times 100$$
$$\text{Quality} = Q = \frac{\text{Actual amount of production} - \text{Unaccepted amount}}{\text{Actual amount}}$$
$$\text{Emergency man-hours \%} = \frac{\text{Man-hours spent on emergency jobs}}{\text{Total direct maintenance hours worked}} \times 100$$
• EMPAC www.plant-maintenance.com
• FMMS www.kdr.com.au
• GPS5 www.gps5.com
• IMAINT www.dpsi.com
• IMPACT-XP www.impactxp.com
• IMPOWER www.impower.co.uk
• MAINPAC www.mainpac.com.au
• MAINPLAN www.mainplan.com
• MAXIMO www.maximo.com
• MP2 www.datastream.net
• OEE MANAGER www.zerofailures.co.uk
• OEE SYSTEMS www.oeesystems.com
• OEE TOOLKIT www.oeetoolkit.com
• OEE-IMPACT www.oeeimpact.com
• PEMAC www.pemac.org
• PERFORM OEE www.ssw.ie/performoee.asp
• RAMS www.reliability.com.au
• RCM Turbo www.strategic.com
• REAL-TPI www.abb.com
• SAP-RLINK www.osisoft.com
• TPM Software www.tpmsoftware.com
CMMS
What is the effect of the Good Computerized
Maintenance Package?
CMMS main Steps:
5- PM Case Studies
Case (1):
How to construct the coding & criticality
systems:
EQUIPMENT CODING
Location Equipment Type Equipment Tag #
1 2 3 4 7 8
Propose a coding system and priority rules for the
following equipment:
Systems              Plant Location      Equipment Type       Number of Machines
Productive systems   Machining shop      Turning              4
                                         Milling              2
                                         Drilling             2
                                         Grinding             2
                                         Press                1
                     Foundry shop        Induction furnaces   2
                                         Molding machines     5
                     Welding shop        Arc Welding          1
Supportive systems   Material handling   Fork lift            4
                     Air room            Compressor           2
                     Water room          Pump – 50 HP         2
                                         Pump – 100 HP        2
                     Power room          Diesel generator     2
Equipment Coding Structure:
Location                Equipment Type
01 Machining shop       01 Turning
                        02 Milling
                        03 Drilling
                        04 Grinding
                        05 Press
02 Foundry shop         10 Induction furnaces
                        11 Molding machines
03 Welding shop         20 Arc Welding
04 Material handling    30 Fork lift
05 Air room             40 Compressor
06 Water room           51 Pump – 50 HP
                        52 Pump – 100 HP
07 Power room           06 Diesel generator
Example: 010202 = 01 (Machining shop) + 02 (Milling) + 02 (machine #2)
Example: 065201 = 06 (Water room) + 52 (Pump – 100 HP) + 01 (machine #1)
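The six-digit structure above (two digits each for location, equipment type and tag number) can be decoded mechanically. The following Python sketch is only an illustration: the lookup tables copy the codes listed above as they appear, and the function name `decode_equipment_code` is a hypothetical helper, not part of any CMMS package.

```python
# Lookup tables copied from the coding structure above.
LOCATIONS = {
    "01": "Machining shop", "02": "Foundry shop", "03": "Welding shop",
    "04": "Material handling", "05": "Air room", "06": "Water room",
    "07": "Power room",
}
EQUIPMENT_TYPES = {
    "01": "Turning", "02": "Milling", "03": "Drilling", "04": "Grinding",
    "05": "Press", "10": "Induction furnaces", "11": "Molding machines",
    "20": "Arc Welding", "30": "Fork lift", "40": "Compressor",
    "51": "Pump - 50 HP", "52": "Pump - 100 HP", "06": "Diesel generator",
}


def decode_equipment_code(code):
    """Split a 6-digit code into (location, equipment type, tag number)."""
    if len(code) != 6 or not code.isdigit():
        raise ValueError(f"expected a 6-digit code, got {code!r}")
    location = LOCATIONS[code[0:2]]        # digits 1-2
    equipment = EQUIPMENT_TYPES[code[2:4]]  # digits 3-4
    tag = int(code[4:6])                    # digits 5-6
    return location, equipment, tag


print(decode_equipment_code("010202"))
print(decode_equipment_code("065201"))
```

Unknown location or type codes raise a `KeyError`, which is a deliberate choice here so that mistyped codes are caught instead of silently accepted.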
EIGHT LEVEL DECOMPOSITION:
Level Characterization
0 System
1 Sub-System
2 Major Assembly
3 Assembly
4 Sub-Assembly
5 Component
6 Part
7 Material
EQUIPMENT PRIORITY
Failure effect:
- Effect on HSE
- Effect on Production
- Effect on Cost
Failure Probability:
- Failure Frequency
Example:
Factors % Levels
1- Production 30 V- Very Important
I- Important
N- Normal
2- HSE 30 V- Very Important
I- Important
N- Normal
3- Stand by 15 WO- Without
WS- With Standby
4- Value 5 H- High Value
M- Medium
L- Low
Priority Level   Description
A Group A: Equipment with 100% duty factor, whose
failure involves production losses and potential
safety hazards.
B Group B: Equipment with a ratio duty factor, i.e.,
having some standby, whose failure involves
production losses and potential safety hazards.
C Group C: Equipment with standby, whose failure
involves either production losses or potential safety
hazards.
D Group D: Equipment with standby, whose failure
involves neither production losses nor safety
hazards.
Equipment Priorities
Location   Equipment Type   Priority Level
Machining shop Turning B
Milling B
Drilling B
Grinding B
Press D
Foundry shop Induction furnaces A
Molding machines B
Welding shop Arc Welding A
Material handling Fork lift C
Air room Compressor C
Water room Pump – 50 HP C
Pump – 100 HP C
Power room Diesel generator A
Case (2):
Four Policies:
• Replacement after first failure (after 36 month)
Cost rate:
Replacement $ 10,000 & Repair $ 3,500
Required:
• Select the best maintenance policy
• Estimate the annual budget for the best policy
• Target maintenance plan
Case (3):
4- Cost rates:
2. Oil cost 5 $/liter
3. Filter cost 50 $/unit
Required:
1. Annual materials (oil and filters) requirements
Planning.
2. Annual materials cost
3. Annual PM plans
4. Materials profile (histogram)
5. Maintenance work order for each PM level
Case (4):
The yearly PM programs information for six similar gas
turbines in a power station are as follows:
1- PM information:
2- Working conditions:
• Gas turbine operating conditions: 24 hour/day
• Workers operating conditions: 300 day/year & 8
hour/day
3- CM information:
• Average effort of CM = 380 man-day per gas turbine
• Average annual spare parts CM = $ 12000 per gas
turbine
• Average CM downtime = 15 days/year per gas
turbine
• Average downtime cost rate = $ 1000 per day
4- Cost rates:
• Average labor cost rate = $ 10 per man-day
• Overhead cost = 25 % direct cost (spare parts &
labor)
Required:
1) The size of maintenance labor force.
2) Average system availability.
3) Annual downtime cost losses.
4) Annual maintenance cost.
5) Annual PM plan.
6) Maintenance resource profiles.
7) Monthly PM plans.
8) Maintenance work order
The average down time per year
System Reliability:
Series or chain structure:
$$R_s = R_1 \times R_2 \times R_3 \times \dots$$
Parallel structure:
$$R_s = 1 - (1 - R_1)(1 - R_2)(1 - R_3)\dots$$
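The two structures can be compared numerically with a minimal Python sketch. The component reliabilities 0.90, 0.95 and 0.99 are made-up illustration values, not data from the case study.

```python
from math import prod


def series_reliability(reliabilities):
    """Series (chain) structure: every component must work."""
    return prod(reliabilities)


def parallel_reliability(reliabilities):
    """Parallel structure: the system fails only if all components fail."""
    return 1 - prod(1 - r for r in reliabilities)


# Hypothetical component reliabilities, for illustration only.
components = [0.90, 0.95, 0.99]
r_series = series_reliability(components)
r_parallel = parallel_reliability(components)
print(f"Series:   {r_series:.5f}")
print(f"Parallel: {r_parallel:.5f}")
```

A series system is always less reliable than its weakest component, while a parallel system is more reliable than its best one, which is why standby units raise availability.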
Annual maintenance cost
PM Type   Annual Frequency   Spare parts cost ($1000)   Annual PM cost ($1000)
Y         1                  10                         10 * 1 = 10
S         1                  8                          8 * 1 = 8
3M        2                  5                          5 * 2 = 10
M         8                  2                          2 * 8 = 16
Annual spare parts PM cost per gas turbine = 44
Total annual spare parts PM cost = 44 * 6 = 264
The average annual spare parts CM cost =
$ 12000 * 6 = $ 72,000
Annual spare parts maintenance cost =
264000 + 72000 = $ 336,000
Annual labor cost =
25 workers * 300 day/year * $ 10 per man-day= $ 75,000
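The cost roll-up above can be reproduced with a short Python sketch. The variable names are illustrative. The last two lines apply the 25 % overhead rate and the relation "Maintenance cost = Direct cost + Overhead cost" stated earlier in the course; that step is an assumption about how the case is meant to be completed.

```python
TURBINES = 6

# PM type -> (annual frequency, spare parts cost per PM in $1000)
PM_SPARES = {"Y": (1, 10), "S": (1, 8), "3M": (2, 5), "M": (8, 2)}

pm_per_turbine = sum(freq * cost for freq, cost in PM_SPARES.values())
pm_total = pm_per_turbine * TURBINES   # $1000
cm_total = 12 * TURBINES               # $12,000 CM spares per turbine
spares_total = pm_total + cm_total

# 25 workers * 300 day/year * $10 per man-day, expressed in $1000
labor = 25 * 300 * 10 / 1000

direct = spares_total + labor
overhead = 0.25 * direct               # assumed: 25 % of direct cost
total_maintenance = direct + overhead

print("Spare parts PM per turbine ($1000):", pm_per_turbine)
print("Spare parts total ($1000):", spares_total)
print("Labor ($1000):", labor)
print("Total maintenance cost ($1000):", total_maintenance)
```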
Basic Annual PM Plan
Eq. code    Month #
            1    2    3    4    5    6    7    8    9    10   11   12
G01         Y    M    M    3M   M    M    S    M    M    3M   M    M
G02         M    M    Y    M    M    3M   M    M    S    M    M    3M
G03         M    3M   M    M    Y    M    M    3M   M    M    S    M
G04         S    M    M    3M   M    M    Y    M    M    3M   M    M
G05         M    M    S    M    M    3M   M    M    Y    M    M    3M
G06         M    3M   M    M    S    M    M    3M   M    M    Y    M
Resource analysis:
Man-day     580  230  580  230  580  230  580  230  580  230  580  230
Day/month   24   24   24   24   24   24   24   24   24   24   24   24
Workers     24   10   24   10   24   10   24   10   24   10   24   10
SP cost     26   18   26   18   26   18   26   18   26   18   26   18
DT          33   18   33   18   33   18   33   18   33   18   33   18
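The "SP cost" row of the resource analysis can be derived from the plan itself. The Python sketch below is illustrative: it takes the spare parts cost per PM type from the annual maintenance cost table (Y = 10, S = 8, 3M = 5, M = 2, in $1000) and sums each month's column of the basic plan.

```python
# Spare parts cost per PM type in $1000 (from the annual cost table).
SPARES_COST = {"Y": 10, "S": 8, "3M": 5, "M": 2}

# Basic annual PM plan: one string of 12 monthly PM types per gas turbine.
BASIC_PLAN = {
    "G01": "Y M M 3M M M S M M 3M M M",
    "G02": "M M Y M M 3M M M S M M 3M",
    "G03": "M 3M M M Y M M 3M M M S M",
    "G04": "S M M 3M M M Y M M 3M M M",
    "G05": "M M S M M 3M M M Y M M 3M",
    "G06": "M 3M M M S M M 3M M M Y M",
}


def monthly_spares(plan):
    """Sum the spare parts cost of all PMs scheduled in each month."""
    rows = [row.split() for row in plan.values()]
    return [sum(SPARES_COST[pm] for pm in month) for month in zip(*rows)]


profile = monthly_spares(BASIC_PLAN)
print("Monthly spare parts cost ($1000):", profile)
print("Annual total ($1000):", sum(profile))
```

The same function can be applied to an alternative plan to compare how evenly the load is spread across the year.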
Target Annual PM Plan # 1
Eq. code    Month #
            1    2    3    4    5    6    7    8    9    10   11   12
G01         Y    M    M    3M   M    M    S    M    M    3M   M    M
G02         M    M    Y    M    M    3M   M    M    S    M    M    3M
G03         M    3M   M    M    Y    M    M    3M   M    M    S    M
G04         M    S    M    M    3M   M    M    Y    M    M    3M   M
G05         3M   M    M    S    M    M    3M   M    M    Y    M    M
G06         M    M    3M   M    M    S    M    M    3M   M    M    Y
Resource analysis:
Man-day     455  355  455  355  455  355  355  455  355  455  355  455
Workers     19   15   19   15   19   15   15   19   15   19   15   19
SP cost     23   21   23   21   23   21   21   23   21   23   21   23
DT          28   23   28   23   28   23   23   28   23   28   23   28
Target Annual PM Plan # 2
Eq. code    Month #
            1    2    3    4    5    6    7    8    9    10   11   12
G01         Y    M    M    3M   M    M    M    S    M    3M   M    M
G02         M    M    Y    M    M    3M   M    M    M    S    M    3M
G03         M    3M   M    M    Y    M    M    3M   M    M    M    S
G04         M    S    M    3M   M    M    Y    M    M    3M   M    M
G05         M    M    M    S    M    3M   M    M    Y    M    M    3M
G06         M    3M   M    M    M    S    M    3M   M    M    Y    M
Resource analysis:
Man-day     400  410  400  410  400  410  400  410  400  410  400  410
Workers     17   17   17   17   17   17   17   17   17   17   17   17
SP cost     20   24   20   24   20   24   20   24   20   24   20   24
DT          25   26   25   26   25   26   25   26   25   26   25   26
Monthly Maintenance Plan: Month # 1
Day G01 G02 G03 G04 G05 G06 PM worker
1. Y 20
2. Y 20
3. Y 20
4. Y 20
5. Y 20
6. Y 20
7. Y 20
8. Y 20
9. Y 20
10. Y 20
11. Y 20
12. Y 20
13. Y 20
14. Y 20
15. Y 20
16. SB -
17. M 10
18. M 10
19. SB -
20. M 10
21. M 10
22. SB -
23. M 10
24. M 10
25. SB -
26. M 10
27. M 10
28. SB -
29. M 10
30. M 10
31. SB -
MAINTENANCE WORK ORDER
010120
Requester Section:
Power Station PS03 - Gas Turbine G01 - Priority: A
Maintenance type/level: Annual PM
1- Check ….
2- Clean …..
3- Replace …..
4- Adjust ……
5- Repair …..
Eng. Attia Gomaa
Planning Section:
Labor: 4 Mech. 2 Helper 5 days
5 Elec. 4 Helper 10 days
Spare parts: 2 valve xx1, 4 air filter yy3, .. etc.
Special tools: xxx, yyyy, … etc,
Expected down time (from 01/01 to 15/01/2004)
Cost estimation ($ 10,000)
Safety instructions:
- Check … Eng. Aly Ahmed
Craft Feedback:
1- Check ….
2- Clean …..
3- Replace …..
4- Adjust ……
5- Repair …..
Labor: 3 Mech. 2 Helper 5 days
6 Elec. 3 Helper 11 days
1 Vib. 1 Helper 2 days
Spare parts: 2 valve xx1, 4 air filter yy3, .. etc.
Special tools: Vibrometer, … etc,
Down time (01/01 to 17/01/2004) Actual Cost ($ 12,000)
Eng. Omer Aly
Coding:
Case (5):
Case (6):
Maintenance spare parts cost ($):
Year        1999   2000   2001   2002   Exp. 2003   Forecasting limits 2003
Cost ($)    1450   1300   1200   1000   ?           ?

X           1      2      3      4      5
Y           1450   1300   1200   1000   ?
XY          1450   2600   3600   4000
n = 4
Sum X = 10        Sum X² = 30
Sum Y = 4950      Sum XY = 11650
Normal equations for the trend line Y = a + bX:
4950 = 4 a + 10 b
11650 = 10 a + 30 b
Multiplying the first equation by 3 gives 14850 = 12 a + 30 b;
subtracting the second equation leaves 3200 = 2 a.
a = 1600     b = - 145

X              1      2      3      4      5
A (actual)     1450   1300   1200   1000   -
F (forecast)   1455   1310   1165   1020   875
|A-F|          5      10     35     20
(A-F)²         25     100    1225   400
CLs = 0 ± Z S = 0 ± 48
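The trend-line fit can be checked with a few lines of Python. This is a sketch of ordinary least squares for a straight line Y = a + bX, solved from the same normal equations; the function name `fit_line` is an illustrative choice. The control limits are not recomputed because the value of Z and the exact formula for S are not given.

```python
def fit_line(xs, ys):
    """Least-squares fit of Y = a + b*X via the normal equations."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xx = sum(x * x for x in xs)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    b = (n * sum_xy - sum_x * sum_y) / (n * sum_xx - sum_x ** 2)
    a = (sum_y - b * sum_x) / n
    return a, b


years = [1, 2, 3, 4]                 # X: 1999..2002
costs = [1450, 1300, 1200, 1000]     # Y: spare parts cost ($)

a, b = fit_line(years, costs)
forecasts = [a + b * x for x in years]
forecast_2003 = a + b * 5
squared_errors = [(y - f) ** 2 for y, f in zip(costs, forecasts)]

print(f"a = {a}, b = {b}")
print("Fitted values:", forecasts)
print("Forecast for 2003:", forecast_2003)
print("Sum of squared errors:", sum(squared_errors))
```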
Case (7):
Solution
MAINTENANCE SHUTDOWN PLANNING
USING CPM
Required:
Case (9): Monthly Maintenance Plan for Wire
Production Line
9 6 3 2
W01 W02 W03
1- Activity List
#   Activity                  ID    Duration (day)   Predecessors    Relations (SS, FS, FF, and SF)
1   Preparation               PRP   2                -               -
2   Mech. maintenance # 01    MM1   7                PRP             -
3   Elec. maintenance # 01    EM1   9                MM1             SS 3
4   Mech. maintenance # 02    MM2   6                PRP             -
5   Elec. maintenance # 02    EM2   8                MM2             SS 2
6   Mech. maintenance # 03    MM3   5                PRP             -
7   Elec. maintenance # 03    EM3   7                MM3             SS 2
8   Setup                     STP   1                EM1, EM2, EM3   -
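The earliest dates for this activity list can be sketched with a CPM forward pass in Python. Three assumptions are made and are not stated in the case: a "-" in the Relations column means finish-to-start with zero lag, "SS n" means the successor may start n working days after its predecessor starts, and the calendar (weekly day off and holidays) is ignored, so the results are in working days from the project start.

```python
# Activity -> (duration in working days, [(predecessor, relation, lag)]).
# Assumption: "-" in the Relations column = finish-to-start, zero lag.
ACTIVITIES = {
    "PRP": (2, []),
    "MM1": (7, [("PRP", "FS", 0)]),
    "EM1": (9, [("MM1", "SS", 3)]),
    "MM2": (6, [("PRP", "FS", 0)]),
    "EM2": (8, [("MM2", "SS", 2)]),
    "MM3": (5, [("PRP", "FS", 0)]),
    "EM3": (7, [("MM3", "SS", 2)]),
    "STP": (1, [("EM1", "FS", 0), ("EM2", "FS", 0), ("EM3", "FS", 0)]),
}


def forward_pass(activities):
    """Earliest (start, finish) per activity.

    Activities must be listed so that every predecessor appears
    before its successors, as in the activity list above.
    """
    early = {}
    for name, (duration, links) in activities.items():
        start = 0
        for pred, relation, lag in links:
            pred_start, pred_finish = early[pred]
            if relation == "FS":
                start = max(start, pred_finish + lag)
            elif relation == "SS":
                start = max(start, pred_start + lag)
            else:
                raise ValueError(f"unsupported relation: {relation}")
        early[name] = (start, start + duration)
    return early


early = forward_pass(ACTIVITIES)
project_duration = max(finish for _, finish in early.values())
for name, (start, finish) in early.items():
    print(f"{name}: ES={start:2d} EF={finish:2d}")
print("Project duration (working days):", project_duration)
```

A backward pass over the same structure would give the latest dates and the float of each activity; it is left out to keep the sketch short.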
2- Resource List
3- Resource Allocation
#   Activity                  ID    L01/day   L02/day   SPS (Total)
1   Preparation               PRP   2         1         1
2   Mech. maintenance # 01    MM1   4         -         3
3   Elec. maintenance # 01    EM1   -         5         4
4   Mech. maintenance # 02    MM2   3         -         2
5   Elec. maintenance # 02    EM2   -         4         3
6   Mech. maintenance # 03    MM3   2         -         2
7   Elec. maintenance # 03    EM3   -         3         3
8   Setup                     STP   2         2         1
4- Base Calendar (Working periods)
Saturday   Sunday   Monday   Tuesday   Wednesday   Thursday   Friday
X          X        X        X         X           X
1/01/04
Holidays: 20 to 21 Jan. 2004
Required:
1. Draw the project network (logic diagram)?
2. Draw the corresponding Gantt chart?
3. Construct the corresponding smoothed worker
loading?
4. Construct the corresponding worker leveling?
5. Construct the target action plan?.
6. Construct the cost profile & S-curve?
7. Construct the target master plan?
Case (10): Annual Maintenance Plan for AUC-IT
Labs.
Project Name : AMIT Project start: 1 Jan. 2004
Planning unit : Day 6 DAYS /WEEK
1- Activity List
#   Activity                        ID    Duration (day)   Predecessors    Relations
1   Preparation                     PRP   1                -               -
2   Server maintenance              SRM   3                PRP             -
3   Hardware maintenance Lab #01    HM1   4                SRM             -
4   Software maintenance Lab #01    SM1   5                HM1             SS 2
5   Hardware maintenance Lab #02    HM2   3                SRM             -
6   Software maintenance Lab #02    SM2   4                HM2             SS 1
7   Hardware maintenance Lab #03    HM3   3                SRM             -
8   Software maintenance Lab #03    SM3   4                HM3             SS 1
9   Setup                           STP   1                SM1, SM2, SM3   -
2- Resource List
Resource Code   Resource Description            Unit   Limits/day (Norm)   Limits/day (Max.)   Price (LE/unit)
L01             Hardware Engineer               md     3                   6                   120
L02             Software Engineer               md     4                   8                   100
SPS             Spare parts & supplies cost     -      -                   -                   1000
3- Resource Allocation
#   Activity                        ID    L01/day   L02/day   SPS (Total)
1   Preparation                     PRP   2         1         1
2   Server maintenance              SRM   1         1         1
3   Hardware maintenance Lab #01    HM1   4         -         2
4   Software maintenance Lab #01    SM1   -         5         3
5   Hardware maintenance Lab #02    HM2   3         -         1
6   Software maintenance Lab #02    SM2   -         4         2
7   Hardware maintenance Lab #03    HM3   2         -         1
8   Software maintenance Lab #03    SM3   -         3         2
9   Setup                           STP   2         2         1
B(2) C(2)
G(1) D(2)
Component A B C D E F G
Lead time 1 2 1 1 2 3 2
(week)
On-Hand 10 15 20 10 10 5 0
Required:
1. Time-phased for the gear box structure
2. Gross requirements plan for 50 gear box
3. Net material requirements plan for 50 gear
box.
Case (12): The monthly plan and the actual maintenance
spare parts in ABC Company are as follows:
TOTAL MAINTENANCE CONTROL
Case (13):
Monthly production information on Foundry Shop
FS510 was as follows:
Jan. Feb.
Item 2004 2004
Working days 31 28
Average down time (hr/day) 6 4
Based on these data, determine the different PE
indicators for the productive system.
Basic data
Jan 04 Feb 04 Feb. / Jan.
Item
Average down time (hr/day) 6 4 67 %
Performance Evaluation
Indicator        January 2004    February 2004   Feb. / Jan.
Availability %   18/24 = 75 %    20/24 = 83 %    111 %
OEE 44 % 60 % 136 %
TEEP 37 % 51 % 138 %
NEE 29 % 52 % 179 %
Case (14):
The six-monthly maintenance costs ($1000) for a
productive system are as follows:
Target Costs:
Cost item        Jan   Feb   Mar   Apr   May   Jun   Jul
PM Cost:
  Spare parts    100   100   100   100   100   100   100
  Labor          50    50    50    50    50    50    50
CM Cost:
  Spare parts    200   200   200   200   200   200   200
  Labor          150   150   150   150   150   150   150
DT Cost          300   300   300   300   300   300   300
Actual Costs:
Cost item        Jan   Feb   Mar   Apr   May   Jun   Jul
PM Cost:
  Spare parts    23    38    49    56    68    65    54
  Labor          32    65    96    94    94    90    72
CM Cost:
  Spare parts    231   213   181   185   199   196   157
  Labor          503   370   293   164   201   193   142
DT Cost          407   397   320   290   330   320   362
Based on these data, determine the different
performance evaluation indicators for the
maintenance system.
Target:
Cost item   Jan     Feb    Mar    Apr    May    Jun    Jul    Total
PM Cost     150     150    150    150    150    150    150    1050
CM Cost     350     350    350    350    350    350    350    2450
TM Cost     800     800    800    800    800    800    800    5600
DT Cost     300     300    300    300    300    300    300    2100
TM+DT       1100    1100   1100   1100   1100   1100   1100   7700
PM/TM       0.14    0.14   0.14   0.14   0.14   0.14   0.14   0.955
CM/PM       2.33    2.33   2.33   2.33   2.33   2.33   2.33   16.33
Actual:
Cost item   Jan     Feb    Mar    Apr    May    Jun    Jul    Total
PM Cost     55      103    145    150    162    155    126    896
CM Cost     734     583    474    349    400    369    299    3208
TM Cost     1196    1083   939    789    892    864    787    6550
DT Cost     407     397    320    290    330    320    362    2426
TM+DT       1603    1480   1259   1079   1222   1184   1149   8976
PM/TM       0.05    0.10   0.15   0.19   0.18   0.18   0.16   1.007
CM/PM       13.35   5.66   3.27   2.33   2.47   2.38   2.37   31.82
Change %:
Month #
Cost item
Jan Feb Mar Apr May Jun Jly Total
PM Cost
CM Cost
TM Cost
DT Cost
TM+DT
PM/TM
CM/PM
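The blank "Change %" table can be filled in programmatically. The Python sketch below assumes that Change % = (Actual − Target) / Target × 100, the usual variance convention; the case itself does not spell out the formula. Only the PM, CM and DT cost rows are computed here.

```python
MONTHS = ["Jan", "Feb", "Mar", "Apr", "May", "Jun", "Jul"]

TARGET = {"PM": [150] * 7, "CM": [350] * 7, "DT": [300] * 7}
ACTUAL = {
    "PM": [55, 103, 145, 150, 162, 155, 126],
    "CM": [734, 583, 474, 349, 400, 369, 299],
    "DT": [407, 397, 320, 290, 330, 320, 362],
}


def change_pct(target, actual):
    """Assumed convention: (actual - target) / target * 100, per month."""
    return [round((a - t) / t * 100, 1) for t, a in zip(target, actual)]


changes = {item: change_pct(TARGET[item], ACTUAL[item]) for item in TARGET}
totals = {
    item: round((sum(ACTUAL[item]) - sum(TARGET[item])) / sum(TARGET[item]) * 100, 1)
    for item in TARGET
}

print("Item  " + "  ".join(f"{m:>6}" for m in MONTHS) + "   Total")
for item, row in changes.items():
    cells = "  ".join(f"{v:6.1f}" for v in row)
    print(f"{item:<5} {cells}  {totals[item]:6.1f}")
```

A negative value for PM cost means less preventive work was done than planned, while a positive value for CM or DT cost signals more breakdown work or downtime than targeted.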
Case (15):
Change
Item Target Actual
%
Total labor force 25 30 + 20
(worker)
Annual s. parts cost 336 400 + 19
($1000)
Annual labor cost 75 80 + 6.6
($1000)
Overhead cost ($1000) 514 520 + 1.2
Total m. cost ($1000) 925 1000 + 8.1
Average down time 66 50 - 24.3
Down time cost 66 50 - 24.3
($1000)
Case (16):
The six-monthly maintenance costs ($1000) for a
productive system are as follows:
Target Costs:
Cost item        Jan   Feb   Mar   Apr   May   Jun   Jul
PM Cost:
  Spare parts    100   100   100   100   100   100   100
  Labor          50    50    50    50    50    50    50
CM Cost:
  Spare parts    200   200   200   200   200   200   200
  Labor          150   150   150   150   150   150   150
DT Cost          300   300   300   300   300   300   300
Actual Costs:
Cost item        Jan   Feb   Mar   Apr   May   Jun   Jul
PM Cost:
  Spare parts    23    38    49    56    68    65    54
  Labor          32    65    96    94    94    90    72
CM Cost:
  Spare parts    231   213   181   185   199   196   157
  Labor          503   370   293   164   201   193   142
DT Cost          407   397   320   290   330   320   362
Target:
Actual:
Cost item   Jan    Feb    Mar   Apr   May   Jun   Jul
PM Cost     55     103    145   150   162   155   126
CM Cost     734    583    474   349   400   369   299
DT Cost     407    397    320   290   330   320   362
TM Cost     1196   1083   939   789   892   864   787
7- Machine Failure Analysis
Type of fault                 Parameters
                              Vibration   Temp.   Oil
Out of balance                xxx         -       -
Misalignment / bent shaft     xxx         x       -
Damage of rolling bearing     xxx         xx      x
Damage of journal bearing     xxx         xx      x
Damage of gear box            xxx         x       xx
Belt problems                 xx          -       -
Motor problems                xx          x       -
Mechanical looseness          xxx         x       x
Resonance                     xxx         -       -
Bearing Failure Analysis
Bearing Failure: Causes and Cures
Excessive Loads:
• Excessive loads usually cause premature fatigue.
Tight fits, brinelling and improper preloading can also
bring about early fatigue failure.
Overheating:
• Symptoms are discoloration of the rings, balls, and
cages from gold to blue.
• Temperatures in excess of 400°F can anneal the ring
and ball materials.
• The resulting loss in hardness reduces the bearing
capacity causing early failure.
• In extreme cases, balls and rings will deform. The
temperature rise can also degrade or destroy
lubricant.
True Brinelling:
• Brinelling occurs when loads exceed the elastic limit
of the ring material.
• Brinell marks show as indentations in the raceways
which increase bearing vibration (noise).
• Any static overload or severe impact can cause
brinelling.
False Brinelling:
• False brinelling - elliptical wear marks in an axial
direction at each ball position with a bright finish and
sharp demarcation, often surrounded by a ring of
brown debris – indicates excessive external
vibration.
• Correct by isolating bearings from external vibration,
and using greases containing antiwear additives.
Normal Fatigue Failure:
• Fatigue failure - usually referred to as spalling - is a
fracture of the running surfaces and subsequent
removal of small discrete particles of material.
Reverse Loading:
• Angular contact bearings are designed to accept an
axial load in one direction only.
Contamination:
• Contamination is one of the leading causes of
bearing failure.
Lubricant Failure:
• Discolored (blue/brown) ball tracks and balls are
symptoms of lubricant failure. Excessive wear of
balls, ring, and cages will follow, resulting in
overheating and subsequent catastrophic failure.
Corrosion:
• Red/brown areas on balls, race-way, cages, or
bands of ball bearings are symptoms of corrosion.
Misalignment:
• Misalignment can be detected on the raceway of the
nonrotating ring by a ball wear path that is not parallel
to the raceway edges.
Loose Fits:
• Loose fits can cause relative motion between
mating parts. If the relative motion between mating
parts is slight but continuous, fretting occurs.
Tight Fits:
• A heavy ball wear path in the bottom of the raceway
around the entire circumference of the inner ring and
outer ring indicates a tight fit.
Case (17): Pump Failure Analysis
• Determine the different PE indicators for this system.
• Construct how to analyze and eliminate the bearing
failure.
Failure Analysis:
Pump Station: 8 Centrifugal pumps   Code: 1000
Failure Type: Bearing failure   Part code: xxxxx
(Year 2004)
# of failure   Equipment code   Run time (hr)   Repair time (hr)   Failure Mechanism
1 1007 1250 8 Corrosion
2 1008 1450 6 Corrosion
3 1001 1000 10 Temperature
4 1004 1500 7 Corrosion
5 1006 1000 4 Oil
6 1002 1250 7 Corrosion
7 1003 700 9 Oil
8 1007 600 8 Temperature
9 1008 500 8 Temperature
10 1006 1250 9 Corrosion
11 1001 1000 10 Oil
12 1002 1450 8 Corrosion
13 1005 700 8 Temperature
14 1004 1250 11 Corrosion
15 1005 1000 9 Corrosion
16 1003 700 6 Oil
17 1008 600 9 Temperature
18 1001 1000 8 Oil
Total 18200 145
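The failure log above is enough to compute basic indicators. The Python sketch below estimates MTBF as total run time divided by the number of failures and MTTR as total repair time divided by the number of failures, then counts failures per mechanism. The availability line uses MTBF / (MTBF + MTTR), a common definition of inherent availability that is an assumption here, since the case does not say which formula to use.

```python
from collections import Counter

# (equipment code, run time hr, repair time hr, failure mechanism)
FAILURES = [
    (1007, 1250, 8, "Corrosion"), (1008, 1450, 6, "Corrosion"),
    (1001, 1000, 10, "Temperature"), (1004, 1500, 7, "Corrosion"),
    (1006, 1000, 4, "Oil"), (1002, 1250, 7, "Corrosion"),
    (1003, 700, 9, "Oil"), (1007, 600, 8, "Temperature"),
    (1008, 500, 8, "Temperature"), (1006, 1250, 9, "Corrosion"),
    (1001, 1000, 10, "Oil"), (1002, 1450, 8, "Corrosion"),
    (1005, 700, 8, "Temperature"), (1004, 1250, 11, "Corrosion"),
    (1005, 1000, 9, "Corrosion"), (1003, 700, 6, "Oil"),
    (1008, 600, 9, "Temperature"), (1001, 1000, 8, "Oil"),
]

n_failures = len(FAILURES)
total_run = sum(run for _, run, _, _ in FAILURES)
total_repair = sum(repair for _, _, repair, _ in FAILURES)

mtbf = total_run / n_failures
mttr = total_repair / n_failures
availability = mtbf / (mtbf + mttr)   # assumed inherent-availability formula

by_mechanism = Counter(mechanism for *_, mechanism in FAILURES)

print(f"MTBF = {mtbf:.1f} hr, MTTR = {mttr:.2f} hr")
print(f"Availability = {availability:.4f}")
for mechanism, count in by_mechanism.most_common():
    print(f"{mechanism}: {count} failures ({count / n_failures:.0%})")
```

Ranking the mechanisms this way is a simple Pareto view: corrosion accounts for the largest share of bearing failures, so it is the natural first target for root cause elimination.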
MTBF at which less than 20 % of the pumps are assumed
to fail
[Histogram: failure frequency (axis 1 to 6) against MTBF classes
300–650, 650–900, 900–1150, 1150–1400 and 1400–1650]
Equipment Level:
Remedy:
Maintenance Policy:
Condition Based (every 300 hours):
- Oil analysis
- Temperature analysis
- Vibration analysis
Time Based:
(1) Change oil every 600 hours
(2) Change bearing & oil every 1200 hours
Down time: (1) 1 hr & (2) 8 hr
Cost Analysis:
Cost elements:
110
Maintenance Policy:
I- Vibration analysis:
2- Tool:
• Vibration Equipment: accelerometers, charge amplifier
and analyser.
• Computer program for trend analysis and prediction.
4- Method:
1. Record the vibration spectrum and identify the peaks
that correspond to the bearing components.
2. Record each component peak and frequency.
3. Using the software and the standard limits,
determine the trend of each peak.
4. Determine the bearing state (good, needs service, needs
change).
6- Actions:
1. Bearing is Good
2. Call for bearing change
3. Bearing must be changed immediately
111
II- Temperature analysis:
2- Tool:
• Temperature-measuring equipment such as a
thermocouple or infrared camera.
• Computer program for trend analysis and
prediction.
4- Method:
• Measure the temperature of the bearing on line
and take the average value every day.
• Using the software, analyze the data and
determine the max. & average temperature
values.
• According to the allowable range specified in the SKF
standard, determine the bearing state.
6- Actions:
1. Bearing is good
2. Call for bearing change
3. Bearing must be changed immediately
112
III- Oil analysis:
• Viscosity change
• Acidic content
• Wear rate
2- Tool:
4- Method:
113
5- Limits:
6- Actions:
1. Oil is Good
2. Call for oil change
3. Oil must be changed immediately
114
LEVEL II
ADVANCED MAINTENANCE MANAGEMENT
115
Predictive (Condition-based) Maintenance
by monitoring key equipment parameters "Off-line or On-line"
• Vibration analysis, Oil analysis
• Wear analysis, Noise analysis
• Temperature analysis
• Pressure analysis, Quality analysis
• Efficiency analysis
116
The premise of predictive maintenance is that regular monitoring
of the actual mechanical condition of machine trains and
operating efficiency of process systems will ensure the maximum
interval between repairs; minimize the number and cost of
unscheduled outages created by machine-train failures and
improve the overall availability of operating plants.
117
• Prevention of catastrophic failures and early detection of
incipient machine and systems problems increased the
useful operating life of plant machinery by an average of
30%.
118
required tools and plan the work reduced the time required
from three weeks to five days.
119
They are as follows:
A. Minimizes or eliminates costly downtime - increases
profitable uptime.
B. Minimizes or eliminates catastrophic machinery failures -
damage from catastrophic failure is usually much more
extensive than otherwise would have been.
C. Reduces maintenance costs.
D. Reduces unscheduled maintenance - repairs can be made
at times that least affect production.
E. Reduces spare parts inventories - many parts can be
purchased just in time for repairs to be made during
scheduled machinery shutdowns.
F. Optimizes machinery performance - machinery always
operates within specifications.
G. Reduces excessive electric power consumption caused by
inefficient machinery performance - saves money on
energy requirements.
H. Reduces need for standby equipment or additional floor
space to cover excessive downtime - less capital
investment required for equipment or plant.
I. Increases plant capacity.
J. Reduces depreciation of capital investment caused by
poor machinery maintenance - well maintained machinery
lasts longer and performs better.
K. Reduces unnecessary machinery repairs - machines are
repaired only when their performance is less than
optimal.
L. Minimizes or eliminates the possibility that machinery
repairs were the wrong repairs.
M. Reduces the number of dissatisfied customers or lost
customers due to poor quality - with less than optimal
machine performance, quality always suffers.
N. Reduces rework of goods caused by machines operating
at less than optimal condition.
120
O. Reduces scrap caused by poorly performing machinery.
P. Reduces overtime required to make up for lost production
due to broken down or poorly performing machinery.
Q. Reduces penalties that result from late deliveries caused
by broken down or poorly performing machinery.
R. Reduces warranty claims due to poor product quality
caused by poorly performing machinery.
S. Reduces the possibility of accepting recently purchased
new or used machinery with defects - the invoice is not
paid until the defects are corrected.
T. Increases the likelihood that newly purchased new or used
machinery meets specifications.
U. Increases machinery safety - injuries are often caused by
poorly performing machinery.
V. Reduces safety penalties levied against a company for
unsafe machinery.
W. Reduces insurance rates because well maintained
machinery increases safety.
X. Reduces the time required to make machinery repairs -
advance notice of machinery condition permits more
efficient organization of the repair process.
Y. Increases the speed that machinery can be operated, if
desirable.
Z. Increases the ease of operation of machinery.
121
Advantages & Disadvantages Of PdM:
Advantages:
Disadvantages:
123
All equipment in this classification must be evaluated to determine
whether routine monitoring is cost-effective. In some cases,
replacement costs are lower than the annual costs required to
monitor machinery in this classification. The completed list should
include every machine, system or other plant equipment that has or
could have a serious impact on the availability and process
efficiency of your plant. The next step is to determine the best
method or technique for cost effectively monitoring the operating
condition of each item on the list. To select the best methods for
regular monitoring, you should consider the dynamics of operation
and normal failure modes of each machine or system to be included
in the program. A clear understanding of the operating
characteristics and failure modes will provide the answer to which
predictive maintenance method should be used.
125
Inspection Policy Planning & Control:
❑ Routine Maintenance,
❑ Repair, or
❑ Replace.
126
Vibration Analysis
Vibration analysis is the dominant
technique used for predictive maintenance management. Since the
greatest population of typical plant equipment is mechanical, this
technique has the widest application and benefits in a total plant
program.
127
specific degrading machine components or the failure mode of plant
machinery before serious damage occurs.
128
Vibration Sources
• Mechanical unbalance
• Looseness
• Bent shaft
• Alignment
• Gears
• Couplings
• Blade pass / fluid related
• Motor slot frequency / EM related
• Journal (fluid film) bearings
• Rolling element bearings
• Mechanical resonances
Sam Shearman
National Instruments
Accelerometer Location
129
[Figure: accelerometer signal in the time domain and its power spectrum; labels: Rotation, Blade Pass, Motor EM, Force]
130
Vibration Monitoring System Diagram
[Figure: Machine, Transducers, Acquisition HW (PCI/PXI/CompactPCI), PC, Measurement & Automation SW]
131
Tool Selection:
            Vibrometer    Accelerometer
w/wn        > 2.5         < 0.33
The percentage error: <= 0.5 %
w = measured frequency, wn = natural frequency
The percentage error: % e = 100 (1 – MF)
Case (20):
A vibrometer of 10 Hz natural frequency and 0.68
damping ratio is used to measure the vibration of a
machine with frequency 15 Hz.
1- Is this a successful selection for the measuring
transducer? Why?
2- What is the percentage error in the measured
vibration?
MF = 0.94
132
%e = 100 (1 – 0.94) = 6 %
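The magnification factor in Case (20) can be checked numerically. This is a sketch that assumes the standard displacement ratio of a seismic (vibrometer-type) instrument, MF = r^2 / sqrt((1 - r^2)^2 + (2 z r)^2) with r = w/wn and z the damping ratio. The slides do not print this formula, but it reproduces the stated MF = 0.94. The function names are hypothetical.

```python
import math


def vibrometer_mf(freq_hz, natural_freq_hz, damping_ratio):
    """Magnification factor of a seismic vibrometer (assumed standard formula)."""
    r = freq_hz / natural_freq_hz
    return r ** 2 / math.sqrt((1 - r ** 2) ** 2 + (2 * damping_ratio * r) ** 2)


def percentage_error(mf):
    """Percentage error as defined on the slide: %e = 100 (1 - MF)."""
    return 100.0 * (1.0 - mf)


# Case (20): wn = 10 Hz, damping ratio 0.68, machine vibration at 15 Hz.
mf = vibrometer_mf(15.0, 10.0, 0.68)
err = percentage_error(mf)
ratio_ok = (15.0 / 10.0) > 2.5   # selection rule for a vibrometer: w/wn > 2.5

print(f"w/wn = {15.0 / 10.0:.2f}, rule satisfied: {ratio_ok}")
print(f"MF = {mf:.2f}, error = {err:.1f} %")
```

With w/wn = 1.5 the selection rule is not met, which is consistent with an error of about 6 %, well above the 0.5 % target.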
133
134
ISO 2372-Vibration Severity Range Limits (Velocity)
Machines Belonging to:
mm/s     Class I      Class II       Class III     Class IV
(RMS)    (< 20 HP)    (20-100 HP)    (> 100 HP)    (> 100 HP)
0.28     A            A              A             A
0.45     A            A              A             A
0.71     A            A              A             A
1.12     B            A              A             A
1.80     B            B              A             A
2.80     C            B              B             A
4.50     C            C              B             B
7.10     D            C              C             B
11.2     D            D              C             C
18.0     D            D              D             C
28.0     D            D              D             D
45.0     D            D              D             D
71.0     D            D              D             D
A: Good B: Allowable C: Tolerable D: Not Permissible
Suggested Classifications:
Class I: Small (up to 15kW) machines and subassemblies of larger machines.
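A severity lookup of this kind is easy to automate once the zone boundaries are tabulated. The sketch below assumes the commonly published ISO 2372 boundaries (upper limits of zones A, B and C in mm/s RMS for each class) and assigns a reading that sits exactly on a boundary to the better zone; both are assumptions to verify against the standard itself. `ISO2372_LIMITS` and `severity_zone` are hypothetical names.

```python
# Upper limits (mm/s RMS) of zones A, B and C for each machine class.
# Anything above the zone C limit falls in zone D (not permissible).
ISO2372_LIMITS = {
    "I": (0.71, 1.8, 4.5),
    "II": (1.12, 2.8, 7.1),
    "III": (1.8, 4.5, 11.2),
    "IV": (2.8, 7.1, 18.0),
}

ZONE_NAMES = {"A": "Good", "B": "Allowable", "C": "Tolerable", "D": "Not permissible"}


def severity_zone(velocity_mm_s, machine_class):
    """Return the zone letter (A-D) for an RMS velocity and a machine class."""
    for zone, limit in zip("ABC", ISO2372_LIMITS[machine_class]):
        if velocity_mm_s <= limit:
            return zone
    return "D"


for v, cls in [(0.5, "III"), (1.0, "I"), (4.58, "II"), (20.0, "IV")]:
    zone = severity_zone(v, cls)
    print(f"{v} mm/s, Class {cls}: zone {zone} ({ZONE_NAMES[zone]})")
```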
From: Arne
Comments
ISO 2372 is still valid for power below 15 kW. The "new" standard is
called ISO 10816 and has several parts. Part 1 outlines the basics
and the connection to older and newer standards. Part 3 is the essential
part for all general production machinery such as fans, pumps, etc. In
general, compared with older levels going back to Rathbone or VDI 2056 /
ISO 2372, the levels are reduced from what was the red limit before
down to approx. half, and the lowest levels, called just "A", are slightly
higher but have acquired firm statements like "Delivery status", a much
stronger recommendation than just an "A". Reciprocating / piston and
screw volume machines had Class 5 in ISO 2372, but these levels are
lost in 10816, which only softly asks the user to report
back to ISO about experiences. That has caused this part to be useless,
and the old 2372 Class 5 for such machinery is used a lot here. In plain
terms, 4.5 mm/s rms is the delivery "green" level unless technically
motivated to be something else. The frequency range is now expanded to
cover those frequencies that are relevant, instead of the 10-1000 Hz that
was used in 2372. The unit is still mm/s rms (rms is true rms, not just an
average using a diode and a capacitor in the instrument). Regards Arne
136
Canadian specification CDA/MS/NVSH107
The vibration levels (mm/s)
137
138
Vibration Trouble Shooting Chart
Nature of fault                         Frequency of dominant vibration (Hz = rpm/60)
Rotating members out of balance         1 * rpm
Misalignment & bent shaft               (1 to 2) * rpm
Damaged rolling element bearings        Impact rates for the individual bearing component *;
(ball, roller, etc.)                    vibration at high frequencies (2 to 60 kHz)
Journal bearings loose in housing       (1/2 to 1/3) * rpm
Oil film whirl or whip in journal       Slightly less than half shaft speed (42 to 48%)
bearings
Mechanical looseness                    2 * rpm
* Impact rates f (Hz):
n = number of balls, Bd = ball diameter mm,
Pd = pitch circle diameter mm, β = angle
Required:
1- Construct the vibration trend.
2- Predict the vibration level at time 110 running hours.
3- Using the Canadian specification CDA/MS/NVSH107,
predict the time to “call for service” and to “immediate
repair” starting from the last measurement (at t1)
Hours   100    200    300    400    500    600    700    800    900
mm/s    0.58   1.08   1.58   2.08   2.58   3.08   3.58   4.08   4.58
Required:
1. Find the percentage error in the measured vibration if the
high level corresponds to 25 Hz.
2. Construct the vibration trend and predict the vibration
level after 110 hours.
3. Do the above results change if the vibrometer is fixed with
a waxy material? Why?
4. Using the Canadian specification CDA/MS/NVSH107,
predict the time to “call for service” and to “immediate
repair” starting from the last measurement (at t1)
Solution:
1- Percentage error in the measured vibration level:
Measured frequency w = 25 [Hz]
Vibrometer natural frequency wn = 10 [Hz]
Damping factor Z = 0.7
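The numerical result for this step is not reproduced on the slide. As a hedged worked example, assuming the same standard seismic-instrument magnification factor that yields MF = 0.94 in Case (20), with r = w/wn and the damping factor written as Z:

```latex
r = \frac{w}{w_n} = \frac{25}{10} = 2.5
\qquad
MF = \frac{r^{2}}{\sqrt{(1 - r^{2})^{2} + (2 Z r)^{2}}}
   = \frac{6.25}{\sqrt{(1 - 6.25)^{2} + (2 \times 0.7 \times 2.5)^{2}}}
   = \frac{6.25}{\sqrt{27.5625 + 12.25}}
   \approx 0.9905
\qquad
\%e = 100\,(1 - MF) \approx 0.95\,\%
```

Under this assumption the frequency ratio just meets the w/wn = 2.5 boundary, and the error is close to 1 %.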
141
Construction of vibration trending:
By using least square method:
Time in hours =t Vibration velocity in mm/s=v
n= number of the measured data =9
a &b are constants to be evaluated
v=0.08+0.005t [mm/s]
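The least-squares constants can be verified directly from the nine measurements. The sketch below assumes the first reading is 0.58 mm/s and uses the closed-form formulas for a straight-line fit v = a + b t. The alarm level passed to `time_to_reach` is a placeholder, since the CDA/MS/NVSH107 limits are not reproduced in the text.

```python
HOURS = [100, 200, 300, 400, 500, 600, 700, 800, 900]
VELOCITY = [0.58, 1.08, 1.58, 2.08, 2.58, 3.08, 3.58, 4.08, 4.58]  # mm/s


def fit_line(t, v):
    """Least-squares fit of v = a + b*t; returns (a, b)."""
    n = len(t)
    sum_t = sum(t)
    sum_v = sum(v)
    sum_tt = sum(x * x for x in t)
    sum_tv = sum(x * y for x, y in zip(t, v))
    b = (n * sum_tv - sum_t * sum_v) / (n * sum_tt - sum_t ** 2)
    a = (sum_v - b * sum_t) / n
    return a, b


def predict(t, a, b):
    """Vibration velocity [mm/s] predicted at running time t [hr]."""
    return a + b * t


def time_to_reach(level, a, b):
    """Running time [hr] at which the trend line reaches a given level."""
    return (level - a) / b


a, b = fit_line(HOURS, VELOCITY)
print(f"v = {a:.2f} + {b:.3f} t  [mm/s]")

# Level 110 hours after the last measurement (t1 = 900 hr).
print(f"v(1010 hr) = {predict(900 + 110, a, b):.2f} mm/s")

# HYPOTHETICAL alarm level, for illustration only; take the real
# "call for service" and "immediate repair" limits from the specification.
example_level = 6.0
print(f"Hours after t1 to reach {example_level} mm/s: "
      f"{time_to_reach(example_level, a, b) - 900:.0f}")
```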
142
Case (23):
The following Figure shows the line diagram of a
pumping system.
[Figure: line diagram. Motor (1800 rev/min), coupling, gearbox (ratio 1:10) with Gear 1 on bearings B1 and B2 and Gear 2 on bearings B3 and B4, pump.]
143
1- Unbalance in motor and pump.
Speed ratio = N1 / N2 = Z2 / Z1 (N = shaft speed, Z = number of teeth)
144
3- Bearing 3 outer race:
if it is a ball bearing having
n = number of balls 10,
Bd = ball diameter 5 mm,
Pd = Pitch circle diameter 50 mm,
β= Angle =0.
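The impact-rate formula itself did not survive on the slide. As a sketch, the commonly used outer-race defect frequency for a bearing with a stationary outer ring is f = (n/2) * fr * (1 - (Bd/Pd) * cos β), where fr is the shaft rotation frequency in Hz. That formula and the function name below are assumptions, not the slide's own text. The speed of the Gear 2 shaft depends on how the 1:10 ratio is read, so the shaft frequency is left as an input.

```python
import math


def outer_race_frequency(shaft_hz, n_balls, ball_d_mm, pitch_d_mm,
                         contact_angle_deg=0.0):
    """Outer-race defect (ball pass) frequency in Hz, stationary outer ring."""
    ratio = (ball_d_mm / pitch_d_mm) * math.cos(math.radians(contact_angle_deg))
    return (n_balls / 2.0) * shaft_hz * (1.0 - ratio)


# Bearing data from the case: n = 10, Bd = 5 mm, Pd = 50 mm, angle = 0.
impacts_per_rev = outer_race_frequency(1.0, 10, 5.0, 50.0, 0.0)
print(f"Outer-race impacts per shaft revolution: {impacts_per_rev:.1f}")

# Motor shaft runs at 1800 rev/min = 30 Hz.
motor_hz = 1800 / 60.0

# ASSUMPTION for illustration: the gearbox is a 10:1 reduction, so the
# Gear 2 / pump shaft carrying bearing B3 turns at 180 rev/min = 3 Hz.
pump_hz = motor_hz / 10.0
print(f"B3 outer-race frequency at {pump_hz:.0f} Hz shaft speed: "
      f"{outer_race_frequency(pump_hz, 10, 5.0, 50.0):.1f} Hz")
```

If the gearbox is instead a speed increaser, pass the higher shaft frequency; the 4.5 impacts per revolution are unchanged.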
145
Lubricating oil analysis
4. Solids content : is a general test. All solid materials in the oil are
measured as a percentage of the sample volume or weight. The
presence of solids in a lubricating system can significantly increase the
wear on lubricated parts. Any unexpected rise in reported solids is
cause for concern.
147
11. Spectrographic analysis : allows accurate, rapid measurements of
many of the elements present in lubricating oil. These elements are
generally classified as wear metals, contaminants, or additives. Some
elements can be listed in more than one of these classifications.
Standard lubricating oil analysis does not attempt to determine the
specific failure modes of developing machine-train problems.
Therefore, additional techniques must be used as part of a
comprehensive predictive maintenance program.
12. Wear particle analysis : is related to oil analysis only in that the
particles to be studied are collected through drawing a sample of
lubricating oil. Where lubricating oil analysis determines the actual
condition of the oil sample, wear particle analysis provides direct
information about the wearing condition of the machine-train. Particles
in the lubricant of a machine can provide significant information about
the condition of the machine. This information is derived from the
study of particle shape, composition, size and quantity. Wear particle
analysis is normally conducted in two stages.
The first method used for wear particle analysis is routine monitoring
and trending of the solids content of machine lubricant. In simple terms,
the quantity, composition and size of particulate matter in the
lubricating oil are indicative of the mechanical condition of the machine.
A normal machine will contain low levels of solids with a size less than
10 microns. As the machine’s condition degrades, the number and size
of particles will increase.
148
Five basic types of wear can be identified according to the
classification of particles: rubbing wear, cutting wear, rolling fatigue
wear, combined rolling and sliding wear and severe sliding wear. Only
rubbing wear and early rolling fatigue mechanisms generate particles
predominantly less than 15 microns in size.
(d) Combined rolling and sliding wear results from the moving
contact of surfaces in gear systems. These larger particles result from
tensile stresses on the gear surface, causing the fatigue cracks to spread
deeper into the gear tooth before pitting. Gear fatigue cracks do not
generate spheres. Scuffing of gears is caused by too high a load or
speed. The excessive heat generated by this condition breaks down the
lubricating film and causes adhesion of the mating gear teeth. As the
wear surfaces become rougher, the wear rate increases. Once started,
scuffing usually affects each gear tooth.
150
8- Optimal System Maintenance (OSM)
OSM approaches focus on mathematical modeling and
developing optimal policies to inspect, repair, or replace
equipment based on its specific reliability characteristics.
152
9- Risk Based Inspection (RBI)
"Success is foreseeing failure" - Henry Petroski
153
• RBI studies define inspection programs. Information is generated
on the types of damage that may be expected, appropriate inspection
techniques to be used, where to look for the potential damage, and
how often inspections should take place.
• The highest risk is mostly associated with a small percentage of
plant items. History tells us that 80% of the risk in industrial plants
in general is related to 20% of the pressure equipment. To be more
efficient with inspections and maintenance, it is useful to identify
these 20% higher-risk assets.
• RBI has been used in the nuclear power generation industry for
some time and has also been applied in power generation more broadly,
refineries, petrochemical plants and pipelines.
RBI Targets
The ultimate goals of RBI are:
154
10- Reliability Centered Maintenance (RCM)
The concept of RCM finds its roots in the early 1960s, with RCM
strategies for commercial aircraft developed in the late 1960s,
when wide-body jets were introduced to commercial airline
service. A major concern of airlines was that existing time-based
preventive maintenance programs would threaten the economic
viability of larger, more complex aircraft. The experience of
airlines with the RCM approach was that maintenance costs
remained roughly constant but that the availability and reliability
of their planes improved. RCM is now standard practice for most
of the world's airlines.
There are four features that define & characterize RCM, which
are as follows:
• Preserve function, by addressing system function,
inputs & outputs.
• Identify failure modes that can defeat the function.
• Prioritize function need (via the failure modes).
155
• Select only applicable & effective PM tasks
Benefits of R.C.M
• Improve operating performance.
• Improve quality
• Greater maintenance cost effectiveness
• Increase equipment life
• Better teamwork
• Increase morale
2- Safety-critical components:
156
SM tasks must be performed only when such tasks will
prevent a decrease in reliability and/or deterioration of
safety to unacceptable levels or when the tasks will reduce
the life-cycle cost of ownership.
157
RCM Steps:
1. System selection and information analysis.
2. System boundary definition.
3. System description and functional block diagram.
4. System function and functional failures.
5. Failure mode and effects analysis.
6. Logic (decision) tree analysis.
7. Task selection.
158
Failure Modes and Effects Analysis (FMEA)
• A “Hazard Identification” method
• Involves breaking a system down into sub-systems and
component parts
159
TOTAL PRODUCTIVE MAINTENANCE (TPM)
160
involves operational and maintenance staff working together as
a team to reduce wastage, minimize downtime and improve end-
product quality.
TPM builds on the concepts of JIT, TQM and design to achieve
minimum life-cycle cost (LCC) [Eti et al., 2004]. TPM aims to
obtain the maximum production output with the best levels of
product quality, and doing this at minimum cost to the facility
providing the least risk of breakdown [O’Donoghue and
Prendergast, 2004].
162
3. Autonomous maintenance programs for the production
department;
4. Planned maintenance programs for the maintenance department;
5. Equipment design modifications for maintenance department or
suppliers;
6. Manpower education and training;
7. Manpower motivation and direction; and
8. Performance evaluation and continuous improvement.
Chan et al. (2003) developed a TPM program in four phases, which are
as follows: introduction-preparatory stage, introduction stage of TPM
implementation, introduction-execution stage of TPM implementation,
and finally, establishment stage.
164
[Figure: TPM model linking System configuration, Maintenance planning & control, Production planning & control, OEE, TPM Master plan, Equipment module, Human-resource module and Overall system analysis module]
165
166
Ten steps to equipment reliability
Introduction
1. Identify and classify physical assets in terms of criticality, configuration, bill of material, etc.
2. Determine the functions and required standards of performance of each asset within its present context
3. Determine how each asset can fail to meet a desired standard of performance (failed states)
4. Anticipate the reasonably likely failure modes and mechanisms that cause(d) each failed state
5. Analyse the effect of each failure mode and then evaluate the risk in terms of type and grade
8. Evaluate the residual risk to confirm that it is tolerable with respect to type and grade
Step one The first step in the process is to identify the physical assets and systems in terms of their
role in the business, their location, configuration, bill of material. This information is
gathered from the OEM documentation and may also require a walk down of the
equipment to ensure that we identify the equipment properly in terms of its specific model,
version, standard and optional equipment. We also need to gather the available
information regarding the recommended maintenance, recommended spares, operating
and maintenance manuals.
Step two Determine what the system is supposed to do for the business. In other words, what are
our expectations in terms of functions and performance?
Step three Anticipate in what way the asset could fail to meet our expectations. Functional failures
define the failed state of the asset: what the asset is unable to do.
Step four Anticipate the failure modes that cause functional failures. Failure modes describe the part
that is likely to fail and how it fails.
Step five Anticipate and analyse the physical effect that the failure mode has on the system, how
we will know that the failure mode has occurred and also what needs to be done to
restore the system back to an operational state.
The failure mode and effect is also classified into one of four categories of risk. The risk of
the failure mode and effect is then graded in terms of the likelihood and the severity.
167
Ten steps to equipment reliability continued
Step six In step six we analyse the failure mode mechanism to determine if there is any
relationship between ageing of the part or system and likelihood of the event occurring
and also to consider if the failure mode is preceded by a warning or form of deterioration
that is detectable
Step seven Using a decision diagram or algorithm we propose an appropriate failure management
tactic / task to deal with the failure mode by means of a programmed maintenance task
that prevents, predicts, detects the failure mode or we implement a redesign / one-time
change that reduces the risk of failure or we can decide to tolerate the failure mode
Step eight After selection of a failure management tactic in step seven we now do a risk assessment
again to determine if the task or tactic that we have selected does actually reduce the risk
of failure. The outcome must be a lowered risk unless we have decided to tolerate the
failure mode otherwise the action proposed in step seven is not justified
Step nine Systems, processes, procedures, people, spares, tools are required to implement the
tactics proposed in step seven
Step ten In spite of our best efforts, some failure modes that were not anticipated could occur in the
future. We therefore also need a system that enables data about these failure modes to be
recorded so that we can analyse them using a process such as root cause analysis and
then implement tactics that will reduce the risk of recurrence.
168
Equipment functions
Introduction Assets are not acquired by an organisation for the benefit of ownership. They are acquired
for the benefits derived from their ability to produce the goods and services on which the
organisation bases its existence.
Equipment is acquired to perform specific functions. Maintenance should therefore focus
on preserving the functions in the first place, not the equipment.
Maintenance must have an intimate knowledge of the functions of all equipment in the
plant if it is to fulfil its basic goal.
Primary We have already said that all equipment is acquired to perform one or more specific
functions functions. The function that determines the reason for an item's existence is the primary
function. The primary function of an item of plant is usually derived from the name of the
item.
Secondary All equipment also performs functions that are not always immediately apparent. Due to
functions our focus on the primary functions, these functions are neglected with disastrous effects.
Consider the following examples of secondary functions.
Superfluous In every plant there are a number of equipment items or components that serve no
functions purpose. This is often the case in older plants where additions or changes to the plant or
equipment have occurred over time and functions of some equipment have been taken
over by other equipment. After due consideration superfluous equipment should be
physically removed from the plant.
169
170
Protection systems
Introduction We have all experienced how modern equipment has brought with it an increase in the
type and number of protection systems being introduced. This is made necessary by the
complexity of the equipment and an increase in the ways that equipment can fail.
Types of The table below describes the types and functions of protection systems that you may
systems come across in the analysis of equipment functions.
Special focus Protection systems require special focus because they are easily neglected: they
have no direct impact on operations. They do, however, pose a specific type of risk in that
some of them may fail in such a manner that people are unaware that they are in a failed
state.
171
Performance standards
Introduction The basic objective of maintenance is to ensure that all equipment continues to fulfil its
required functions.
The definition of a failure is the inability of equipment to meet a desired standard of
performance.
Knowledge about the performance standards of equipment and their functions is key to
the development of a maintenance programme.
If you do not know what the standard is, how will you know if the equipment has failed or if it requires
maintenance?
Performance Each item of equipment is designed to perform at a specific standard. The performance
standards standard is usually a quantification of the primary function. On closer inspection,
however, a number of standards could be applicable.
Here are some examples of types of performance standards and their quantification.
Standard Example
Output A boiler generates 10 tons of steam per hour at 1000 kPa.
Quality The liquid filler fills containers to a capacity of 250g with a deviation of
less than 5g
Efficiency The boiler produces 10 000kg of steam from 300kg of coal
Safety The safety belt must withstand a weight of 10 000N
Environment The air scrubber must reduce pollution of the air by dust particles to
below 200 PPM
Design We have already pointed out that equipment is designed to perform at a specific standard.
standards Maintenance on its own cannot improve the performance of equipment.
Operational The users of the equipment often set standards of performance that disregard the
standards designed standard of performance by either under utilising the capacity of the equipment
or exceeding the capacity.
The greatest conflict arises when what the user wants the equipment to do is more than
what it can do.
The question that arises is what standard must be used to determine when the equipment
has failed. Consider the following definition of failures.
A functional failure has occurred when:
• What the user wants the equipment to do is equal to what it can do, and it does not do
what the user wants it to do.
• What the user wants the equipment to do is less than what it can do, and it does not do
what the user wants it to do.
A functional failure has not occurred when:
• The equipment can do what the user wants it to do but it cannot do what it was designed
to do.
172
Performance standards continued
Operating It is clear from the above that you must consider all failures in the operating context of the
Context equipment. Consider the example below.
Example The screw elevator is designed to convey material at a rate of 4 tons per hour.
The reactors are designed for a throughput of 1 ton per hour each.
The performance standard for the screw elevator will be to elevate raw material to a
vertical height of 8 meters at a rate of at least 3 tons per hour.
The performance standard for the screw elevator in a different operating context may be
stated as 4 tons per hour.
Which The desired standard of performance must be used to specify the functions and
standard? performance standards of equipment. If the desired standards exceed the designed
standards, then the equipment must be redesigned or the standards changed.
The requirement for maintenance is determined by the desired functions and standards of
performance.
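The rule above, that failure is judged against the desired standard in its operating context rather than the design figure, can be written as a small check. This is an illustrative sketch using the screw elevator numbers (designed for 4 tons per hour, desired standard of at least 3 tons per hour); the measured rates and function names are hypothetical.

```python
def functional_failure(desired_rate, actual_rate):
    """A functional failure exists when the equipment no longer meets
    the desired standard of performance."""
    return actual_rate < desired_rate


def standard_is_achievable(desired_rate, designed_rate):
    """If the desired standard exceeds the designed standard, maintenance
    alone cannot deliver it: redesign or change the standard."""
    return desired_rate <= designed_rate


DESIGNED = 4.0   # tons per hour, design capacity of the screw elevator
DESIRED = 3.0    # tons per hour, desired standard in this operating context

# Hypothetical measured rates for illustration.
for actual in (3.5, 2.5):
    print(f"actual {actual} t/h: functional failure = "
          f"{functional_failure(DESIRED, actual)}")

print("3 t/h achievable:", standard_is_achievable(DESIRED, DESIGNED))
print("5 t/h achievable:", standard_is_achievable(5.0, DESIGNED))
```

At 3.5 tons per hour the elevator no longer meets its design figure but has not failed in this context; in a context where 4 tons per hour is required, the same reading would be a functional failure.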
173
Records of functions and standards
Introduction You must record the functions and, where known, the performance standards of each item
of equipment. These records are essential for:
• The identification of functional failures.
• The establishment of performance benchmarks for the evaluation of equipment
performance and the effectiveness of maintenance.
• The development of effective maintenance policies and programmes.
Participation The identification of functions and the establishment of performance standards require a
high level of participation and interaction from both maintenance and operations.
Much potential conflict is prevented when consensus is achieved regarding functions and
performance standards between maintenance and operations.
Team Use the following procedure as a guide to set up a team and conduct a meeting that will
identify equipment functions and set performance standards.
Step Action
Preparations
1 Divide the plant up into areas based on current areas of responsibility and rank
these areas in sequence of importance.
2 Select the most important area of the plant and generate a general
information report that lists the applicable equipment specifications and bill of
material for each address.
3 Appoint a team of knowledgeable persons selected from the operations and
maintenance functions.
It is important that people from all levels of the organisation, including
technicians and operators, who have an intimate knowledge of the equipment
actively participate in the team. The only qualification for participation is
knowledge of the equipment functions.
4 Obtain a written mandate from senior management that specifies that the role of
the team is to:
• Identify equipment functions.
• Set performance standards for each equipment item.
Hold the First Meeting
5 Inform the team with regard to the definition of functions and performance
standards as discussed in the preceding topics.
6 Write the address reference of each equipment item on a flip chart and ask the
team to brainstorm a list of functions associated with the equipment related to that
address.
7 Write all the functions listed on the flip chart. Do not filter the list before all the
functions have been listed.
8 Start at the top of the list and discuss each function in turn. Remove invalid
functions from the list.
9 Let the team specify the current required performance standard for each item. (A
process flow diagram of the plant will be most useful during the discussion.)
10 Record by consensus the performance standards for each item.
174
Records of functions and standards
Continuity It is extremely unlikely that the work of the teams will be completed in a single
session. Depending on the complexity of the equipment and the size of the
area, several sessions will be necessary.
It is therefore essential that someone be appointed team leader by the group.
His responsibility will be to manage the process and keep the team moving
forward to achieve its mandate.
Example of It is vital that the work of the team is recorded permanently, on paper or in a database
functions developed for this purpose. These records will form the basis for the analysis of
record functional failures.
175
Section E
Overview
Introduction In this section we introduce the concept of functional failures, failure modes and damage
mechanisms
Topic
Functional failures
Failure modes
176
Functional failures
Introduction Equipment is put into operation to perform a required function at a predefined standard of
performance. The objective of maintenance is to maintain this function at the required
standard of performance.
A failure is defined as having occurred when equipment is no longer able to meet a
required standard of performance.
Functional failures therefore generate the need for maintenance intervention, usually by
executing a corrective maintenance task.
Failures and Functional failures must always relate to the function and the standard of performance set
performance for that function.
standards Much confusion and difference of opinion can occur if this is not done because equipment
often suffers a partial loss of function or a failure that is not associated with a primary
function.
Secondary Some organisations wrongly define equipment as failed only when the primary function is
functions affected.
If we only consider primary functions, the reactor in the above example would not be
considered to have suffered a functional failure if the shell developed a minor leak. This is
because it is still able to produce product at the required standard of performance.
Ignoring secondary functions in this way is often the root cause of the gradual deterioration
of equipment.
It is therefore extremely important that the team comprises members from the
maintenance and operation functions so that all functions and performance standards
associated with equipment are identified.
Operating It is valid at this point to emphasise the importance of setting performance standards
context within the operating context.
The same type of equipment could be considered to have suffered a functional failure in
one operating context, yet be performing satisfactorily in another context.
Identify Use the same team or a team with the same structure to identify functional failures
functional applicable to each function and performance standard.
failures Brief the team about what is required from them and facilitate them to identify all the
functional failures.
Functional failures continued
Example It is vital that the work of the team is recorded permanently to paper or into a database
that is developed for this purpose. These records will form the basis for the identification
of failure modes (causes of failures).
No. Function
2 To contain a pressure of 300 kPa without leaking.
3 To vent pressure above 300 kPa to the atmosphere.
4 To sound an alarm when the pressure exceeds 250 kPa.
5 To maintain the pressure inside the reactor at 200 kPa.
Failure modes
Introduction The previous topics discussed how equipment is put into operation to fulfil a desired
function. A failure is considered to have occurred when equipment is unable to function at
a desired level of performance. Knowledge of the above enables maintenance to know
what its objectives are for each equipment item and when intervention is required.
This knowledge however does not enable maintenance to behave proactively. It is only
when you understand the cause of the functional failures that you can behave proactively
and do something before a failure occurs.
Failure Failure modes identify the specific part that fails and how it fails. Each functional
modes and failure that can occur commonly has multiple failure modes. It is
causes important, however, that you focus on the causes of failures at a level where something can
be done to prevent or deal with them in a proactive manner. You should also only identify
those that have a reasonable probability of occurring.
Types of It is also important that you not only concentrate on the normal wear and tear causes of
Failure failures but also identify failures that fall into the following classifications.
modes
Classification Description
Disassembly Fasteners, locking bushes, lock rings, pins, keys, collars, flanges,
mountings, couplings, pulleys, bearings, etc. that become loose and
cause parts to fall off or move from their intended positions.
Operation Incorrect operating procedures such as:
• Operating at the incorrect load.
• Incorrect speed.
• Starting in the incorrect sequence.
• Incorrect set up.
• Stopping or starting under load.
Operating It is again important that the operating context (operating environment) of the equipment
context be taken into consideration, even when analysing failure modes.
Example
Consider different failure modes associated with two of the same type of vehicles, the one
operates on dirt roads and the other on tarred surfaces.
Information Contrary to popular opinion, the best source of information about possible failure modes is
Sources not the equipment manufacturer, although the manufacturer can also be of assistance.
Other sources of information are:
• The technicians that maintain the equipment.
• The operators that use the equipment.
• Users of the same type of equipment.
• History records.
The most useful source of failure mode information is still the technicians that maintain the
equipment and operators that use the equipment.
It seems that the team has more work ahead!
Probability Only failures that at least have a reasonable probability of occurrence should be listed.
Consider the following candidates for listing:
Identify Consult the above sources for information regarding possible failure modes and record
Functional the possible failure modes. Below is an example of the format in which the information can
Failures be recorded.
Section F
Overview
Introduction In this section we introduce the causes and nature of failure modes
Topic
The physical causes of failure
Psychological error classification
Examples of failure modes and damage mechanisms
The physical causes of failure
Two basic types of causes
PHYSICAL DAMAGE MECHANISMS
HUMAN ERROR
Physical The underlying cause of a failure mode could be related to physical deterioration
causes which is a function of the physical characteristics of the component or system, and
the operating conditions. These are the typical conditions that were considered
when the equipment was designed so the deterioration of the component or
system would be considered normal when it is used under those conditions.
Human error Sometimes abnormal conditions arise when people operate the asset in a manner
causes for which it was not designed and the component or system suffers accelerated
deterioration and it fails rapidly. Abnormal conditions are not limited to operational
errors but could also arise from maintenance work that is not done correctly.
Psychological error classification
The James Reason model
ERROR
  Unintended action
    Slip (attentional failure): Carry out a planned task incorrectly or in the wrong sequence.
    Lapse (memory failure): Miss out a step in a planned sequence of events.
  Intended action
    Mistake
      Rule-based mistakes: Misapplication of a good rule or application of a bad rule.
      Knowledge-based mistakes: Inappropriate response to a novel abnormal situation.
    Violation
      Routine violation
      Exceptional violation
      Sabotage
Un-intended The important aspect to note about this type of error is that it is not related to a
action lack of knowledge or skill. The person who makes this type of error knows what
to do and how to do it. The person's intentions are correct. However,
at some point during the execution the actions deviate from the intentions. This is
likely due to distraction, preoccupation or absent-mindedness. This type of error is
therefore an error of execution, not intention.
Intended With this type of error the person’s intentions are already incorrect. This could be
action because he or she:
• chooses a course of action that is not appropriate for the situation at hand;
• does something habitually that is not the right thing to do;
• does not really know what to do but proceeds anyway; or
• knows the correct course of action or behaviour but chooses to act inappropriately.
Why do we Knowing which type of human error will cause or has caused a failure helps us to select
need to a failure management tactic that is appropriate to deal with it, in the same way that a
know this? doctor needs to make a proper diagnosis before he or she can prescribe a treatment.
Examples of failure modes & damage mechanisms
Failure mode and mechanism
FAILURE MODE DAMAGE MECHANISM
Chemical decay
Abrasion
Mechanical fatigue
Abrasion
Corrosion
Lack of lubrication
Insulation deterioration
Insulation damage
Validation
Damage The above examples show that systems, components and parts may be exposed to
mechanism more than one type of damage mechanism. In this part of the analysis the team
and operating should therefore consider the operating environment to identify which mechanisms are
environment valid under those operating conditions.
Human error caused the hose to be routed incorrectly, which caused the hose to
rub against the cylinder rod, which in turn caused the hose to weaken and rupture. If the
hose had been routed correctly, abrasion would not have occurred.
Human error Note in the example above that the abrasion that weakens the hydraulic hose and
causes it to rupture could originate from incorrect routing of the hose, which causes it
to rub against something else and therefore suffer abrasion. The underlying
cause is then actually human error.
Cause vs We could therefore say that the cause was human error, the first effect was that the
effect hose suffered abrasion, the abrasion weakened the hose, and the final
effect was that the hose ruptured.
Damage The workbook contains a sheet named ‘validation’ which contains a library of
code library known damage mechanisms, some of which are also listed in API 581
Section G
Introduction In this section we introduce the effects, costs and risks of failure and degradation
Topic
Failure effects
Hidden failure risks
Safety and environmental risks
Operational risks
Non operational risks
Probability / severity risk grading matrix
The six failure patterns
Age and usage related failure patterns
Failure patterns not related to age
Infant mortality damage mechanisms
Failure Effects
Introduction It is important that you understand what happens when a specific failure mode is
encountered. This information is necessary to enable you to evaluate the consequences
of the failure so that it can be dealt with effectively.
Properties It is important that you describe specific properties if there are any, when you list failure
effects. This is done so that you know what action to take concerning proactive
maintenance.
Item Property
1 The evidence that a failure has occurred.
2 The ways in which it poses a threat to safety or the environment.
3 The physical damage caused by the failure.
4 The work that must be done to repair the failure.
Note: This property determines the need for corrective maintenance when a
suitable proactive task cannot be performed.
Sources If you consider the types of effects listed in the above table you will agree that the
identification of failure effects also requires participation from both maintenance and
operators.
2 Gas regulator
diaphragm leaks
due to wear.
Hidden failure risks
Introduction Most failures that occur become apparent to users by the observation of one or more
symptoms. These observations include things like audible or visible alarms, vibration, loss
of flow or pressure, sub-standard product or output, leaks, etc.
There are however failures that are not evident to the equipment users when they occur.
Failures that do not become evident to the equipment users on their own are called
hidden failures.
These failures usually relate to protective functions such as standby equipment, alarms or
devices that shutdown equipment in the case of functional failures to functions that are
protected.
Example The following example illustrates protective equipment that is subject to
hidden failures.
The spare wheel on your car.
• You will be unaware that the spare tyre of your car has deflated under normal
circumstances.
• You will only become aware of the deflated state of your spare wheel when you suffer
a flat tyre or if you periodically check the pressure.
Multiple Hidden failures on their own have no direct consequences. Hidden failures however do
Failures increase the risk of multiple failures that usually have serious consequences. Consider
the following scenarios.
Scenario   Wheels on car   Spare wheel   Consequences
1          Inflated        Inflated      None
2          1 deflated      Inflated      Minor
3          Inflated        Deflated      None
4          1 deflated      Deflated      Serious
Scenario 4 is a multiple failure situation.
Note how the consequences escalate to serious in the situation of a multiple failure.
Hidden failure risks continued
Approach You need to consider the following facts when you make decisions about how to maintain
protective functions.
• Hidden failures have no direct consequences other than to increase the risk of a
multiple failure.
• The purpose of a maintenance programme for a hidden function is to reduce the risk of a multiple failure.
• The amount of effort used to prevent a hidden failure is proportional to the
consequences of the multiple failure.
Quantify the We have already illustrated that the only consequence of a failure to a hidden function is
Standard. an increase in the risk of multiple failure.
The performance standards of (hidden) protective functions must therefore be related or
quantified to a statement of the risk of multiple failure.
Risk of Risk in this context is expressed as the probability that the protected function will fail while
Multiple the protective device is in a failed state during the same period.
Failure The calculation of the probability of a multiple failure requires knowledge about the
reliability of the protected function and the availability of the protective function.
Calculation of The example of the motorcar spare is used to illustrate the probability of multiple failure.
Probability
Scenario:
• The average motorcar, under normal conditions, suffers a flat tyre once every 36
months.
• The average spare wheel of a motorcar is in a deflated state for approximately one
month per year.
Calculate the probability of the car suffering a flat tyre while the spare wheel is flat in the
next year.
Step Action
1 Calculate the probability of a failure to the protected function. The probability of
failure in the next year is:
1 divided by 3 = 0.33 (1 failure every 3 years)
2 Calculate the probability that the protective function will be in a failed state.
The spare wheel is flat for 1 month a year. The probability that the wheel will
be flat at any given point during the next year is:
1 divided by 12 = 0.083 (1 month in every 12)
3 Calculate the probability of a multiple failure.
The probability of a multiple failure is the product of:
• The probability of a failure to the protected function
• The probability that the protective function is in a failed state
0.33 x 0.083 = 0.027 (approximately once in 37 years)
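The three steps above can be sketched in a few lines of Python. This is a minimal illustration only: the function name and parameters are hypothetical, and it assumes that the flat tyre and the deflated spare occur independently of each other.

```python
def multiple_failure_probability(years_between_demands, months_failed_per_year):
    """Yearly probability that the protected function fails while the
    protective device is in a failed state (independence is assumed)."""
    # Step 1: probability of a failure to the protected function in a year.
    p_protected = 1.0 / years_between_demands
    # Step 2: fraction of the year the protective device is in a failed state.
    p_protective_failed = months_failed_per_year / 12.0
    # Step 3: the multiple failure probability is the product of the two.
    return p_protected * p_protective_failed


p = multiple_failure_probability(years_between_demands=3, months_failed_per_year=1)
print(f"Probability of a multiple failure in the next year: {p:.4f}")
print(f"Roughly once in {1 / p:.0f} years")
```

With unrounded fractions the product is 1/3 x 1/12 = 1/36, so the sketch reports once in 36 years. The figure of 37 years in the text comes from multiplying the rounded values 0.33 and 0.083.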
Hidden failure risks continued
Assessing The performance standard for a hidden function is the availability required to maintain the
the Risk risk of a multiple failure at an acceptable standard.
Maintenance should assess the probability of all multiple failures and determine if the
combination of the risk and consequences are acceptable.
Reducing the The driver of the motor vehicle in the above example will find himself stranded on the
Risk roadside once in 37 years. The consequences of being stranded are serious and can
hardly be modified.
The probability of this multiple failure must therefore be reduced. In this example and
practically all examples in a plant, increasing the availability of the protective function
reduces the probability.
In the above example, the probability of being stranded by the roadside can be reduced to
once in 74 years by increasing the availability of the spare wheel to 11.5 months per year
instead of the current 11 months per year.
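The same relationship can be turned around to find the availability that a protective function needs in order to meet a chosen risk target. The sketch below is illustrative only: the function name is hypothetical and independence between the two failures is assumed.

```python
def required_availability(target_multiple_failures_per_year, demands_per_year):
    """Availability the protective device needs so that the yearly
    probability of a multiple failure does not exceed the target."""
    # Largest fraction of time the protective device may be in a failed state.
    max_unavailability = target_multiple_failures_per_year / demands_per_year
    # A target looser than the demand rate needs no availability at all.
    return 1.0 - min(max_unavailability, 1.0)


# Target: stranded no more than once in 74 years, with a flat tyre once in 3 years.
availability = required_availability(1 / 74, 1 / 3)
print(f"Spare wheel must be serviceable {availability * 12:.1f} months per year")
```

For the spare wheel this gives about 11.5 months per year, which agrees with the figure quoted above.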
Fail Safe A fail-safe device is a protective device whose failure will become evident to the user of
Devices the equipment on its own under normal circumstances.
Fail-safe protective devices are therefore devices that are able to indicate their own
failures to the equipment users.
Safety and environmental risks
Introduction The injury, maiming and killing of people by industrial "accidents" are totally
unacceptable in the society in which we operate. Damage to the environment is
considered just as serious. It is therefore important that the safety and environmental
consequences of failures be considered first.
Safety risks You must consider a failure mode to have safety consequences if the loss of function
can injure or kill someone.
Example
The rupture of a pipe containing aggressive chemicals can injure or kill people
in the immediate area.
Environmental You must consider a failure mode to have environmental consequences if the loss of
risks function can cause an infringement of any environmental standard, rule, ordinance,
regulation or statute.
Example
The failure of an oil separator in a vehicle wash bay can cause oil to enter the
reclaimed waste water system.
Related to It is noteworthy that most failures with safety or environmental consequences will also
operational have operational consequences. Preventing safety or environmental consequences will
risks therefore in most cases prevent operational consequences as well.
From this perspective it also makes sense to first analyse the failure effects for
environmental and safety consequences.
Evaluating risk The complexities of the factors that influence risk are such that no single individual can
assess if a risk to safety or the environment is acceptable. A risk assessment done by an
individual will be considered either too conservative by some or "reckless" by others.
Evaluation The only way to evaluate risks and the consequences associated with them effectively is
team to use a representative and knowledgeable team of people from the organisation. This
team must comprise:
• Management who are ultimately accountable for all safety and environmental
incidents.
• Maintenance and operational people that are knowledgeable regarding the
operational and failure process of the equipment.
• The people that are exposed to the risk of the consequences.
Prevention In all cases where there is real risk of safety or environmental consequences, a proactive
maintenance task to reduce the risk must be found. If this is not possible, then the
equipment or the process must be changed to reduce the risk or the consequences of
failure.
Operational risks
Introduction The primary function of most equipment is to produce the products or services that
generate the revenue or value that justifies the existence of the organisation.
Operational consequences are next in line for analysis after hidden failures and safety
and environmental consequences have been dealt with.
You must identify the factors that influence operational consequences and deal with these
factors proactively to improve the effectiveness of maintenance in supporting the
operational objectives of the organisation.
Definition All failures that have a direct adverse effect on the operational capability of the
equipment functions are considered to have operational consequences.
Factors You cannot prevent or even begin to manage the operational consequences of functional
failures unless you identify and gain an in depth understanding of the factors that
determine the consequences. The following factors need to be considered.
Operational risks continued
Process flow The consequences of operational failures cannot be assessed unless there is an in depth
knowledge of the process flow within the plant.
This includes knowledge about things like:
• The throughput rate of each process.
• Storage capacity between processes in hoppers, silos, etc.
• Single and parallel processes.
• Batch processes.
• Bottleneck processes.
• Bypass options, etc.
Repair time The severity of most operational consequences is closely related to the time it takes to
restore the function that has failed.
The following determinants of repair time must be considered:
• Organisational factors:
• Response time.
• Skill of the technician.
• Transport availability.
• Special tools availability.
• Users' diagnostic and repair skills.
• Maintainability factors:
• Accessibility.
• Complexity.
• Evidence of failure.
Spares The availability of spares and the lead-time to obtain spares is a major factor in the
analysis of operational consequences. We often put the cart before the horse by first
putting the spares into the inventory that we think we should have or what the equipment
vendor has recommended.
What you should be doing is analysing the consequences of failures, and then based on
those consequences, determine which spares should be carried in stock.
Raw Material The consequences of failures are more severe in plants that process a perishable raw
product. A fruit canning plant is a good example: fruit that has been harvested is perishable, and
fruit that has not been harvested will deteriorate in quality if the process plant is down.
Operational risks continued
Market The market demand for a product or service can also determine the consequences of a
demand failure. A failure to the signal system of an underground railway on a Sunday will be less
severe than it will be on a Monday.
Maintenance programmes should be developed to also utilise periods of low demand for
maintenance.
Cost benefits Proactive maintenance tasks are worth doing to prevent operational consequences if, over
a period of time, the cost of the proactive task is less than the sum of:
• The cost of the operational consequences.
• The cost of repairing the failure.
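The cost comparison above can be sketched as a simple test. This is an illustration under stated assumptions: the function name, parameters and figures are hypothetical, and it assumes that the proactive task prevents all occurrences of the failure over the period considered.

```python
def proactive_task_is_worthwhile(task_cost, tasks_per_year,
                                 failures_per_year_without_task,
                                 operational_cost_per_failure,
                                 repair_cost_per_failure,
                                 years=5):
    """Compare the cost of the proactive task over a period with the cost
    of the failures it is assumed to prevent over the same period."""
    proactive_total = task_cost * tasks_per_year * years
    cost_per_failure = operational_cost_per_failure + repair_cost_per_failure
    failure_total = cost_per_failure * failures_per_year_without_task * years
    return proactive_total < failure_total


# Hypothetical figures: a 400 task done quarterly versus a failure every two years
# that costs 30 000 in lost production and 6 000 to repair.
print(proactive_task_is_worthwhile(400, 4, 0.5, 30_000, 6_000))
```

Both totals are taken over the same period so that a frequent cheap task is compared fairly with an infrequent expensive failure.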
Alternative If a proactive task cannot be justified on economic grounds but the operational
maintenance consequences of the failure are still unacceptable, the following options must be pursued.
approaches • Eliminate or reduce the consequences of the failure.
• Modify the proactive task to make it cost effective.
• Increase the interval at which the failure mode occurs.
Do not implement the alternative options unless you are completely satisfied that the
process and equipment in its current configuration cannot satisfy your needs.
Non-operational risks
Introduction There are some failure modes that do not affect safety, the environment or the
operational capacity of the equipment. These failure modes must however still be
analysed for their economic consequences.
Types Failure modes that do not have operational consequences fall into the following
categories:
• Protected functions where a failure to an item of equipment does not affect
operational capability because there is a stand-by unit that can be switched over to
while the main item is being repaired.
• Secondary damage resulting from a failure mode. An example is where a protected
function like a lubrication pump is allowed to run to failure and in the process pollutes
the system with metal particles.
Maintenance A programmed maintenance task is worth doing if the cost of the proactive task over
approach time is less than the repair costs of the failure that is being prevented.
Probability / Severity risk grading matrix
Risk matrix
                       Likelihood/Probability
Severity/Consequence   A: Certain   B: Almost Certain   C: Likely   D: Unlikely   E: Very Unlikely
1: Catastrophic        25           23                  20          16            11
2: Major               24           21                  17          12            7
3: Serious             22           18                  13          8             4
4: Moderate            19           14                  9           5             2
5: Minor               15           10                  6           3             1
Risk grading Risks are graded in accordance with an assessment made in terms of the probability
that a failure mode will occur versus the severity or consequence of the event
occurring
Severity Severity can be expressed in monetary values such as $0-$10k, $10k-$50k, $50k-
values $100k, $100k-$500k, etc.
Grading The matrix is used as a guideline to grade each failure mode prior to proposing a
failure failure management tactic and then also to assess the residual risk after a tactic has
modes been proposed
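Grading a failure mode before and after a tactic amounts to two look-ups in the matrix. The sketch below encodes the grades shown in the matrix above. The function name and the example failure mode are hypothetical.

```python
# Rows are severity 1 (Catastrophic) to 5 (Minor).
# Columns are likelihood A (Certain) to E (Very Unlikely).
RISK_MATRIX = {
    1: [25, 23, 20, 16, 11],
    2: [24, 21, 17, 12, 7],
    3: [22, 18, 13, 8, 4],
    4: [19, 14, 9, 5, 2],
    5: [15, 10, 6, 3, 1],
}
LIKELIHOODS = "ABCDE"


def risk_grade(severity, likelihood):
    """Return the grade for a severity (1-5) and a likelihood ('A'-'E')."""
    return RISK_MATRIX[severity][LIKELIHOODS.index(likelihood)]


# Hypothetical failure mode: Major severity, Likely before a tactic is applied
# and Very Unlikely afterwards.
initial_risk = risk_grade(2, "C")
residual_risk = risk_grade(2, "E")
print(f"Initial grade {initial_risk}, residual grade {residual_risk}")
```

The difference between the two grades gives a rough indication of how much risk the proposed tactic removes.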
The six failure patterns
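Cumulative probability density curves
Figure: the six failure patterns, A to F, each plotted as likelihood of failure against time. Percentage shown for each pattern:
A 4%
B 2%
C 5%
D 7%
E 14%
F 68%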
Fallacies In the early years of maintenance it was believed that most equipment wore out with use
about and therefore became more likely to fail over time.
failures Research done by Nowlan, Heap and Matheson in the nineteen
seventies found that this is not generally true.
Random The other three patterns, D, E and F, are not age related. The probability of failure does not
failure increase with ageing or usage. This does not mean that a failure will never occur; it
patterns just means that the likelihood of the failure occurring in any given period remains
constant.
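The difference between a random and an age-related pattern can be illustrated numerically. The sketch below assumes a Weibull life distribution, which is an illustrative choice and is not taken from the course material. A shape parameter of 1 gives a constant likelihood of failure, and a shape parameter above 1 gives a likelihood that rises with age. The function name and the parameter values are hypothetical.

```python
import math


def conditional_failure_probability(age, period, shape, scale):
    """Probability of failing within `period`, given survival to `age`,
    for a Weibull life distribution (an assumed model)."""
    def survival(t):
        return math.exp(-((t / scale) ** shape))

    return 1.0 - survival(age + period) / survival(age)


for age in (0, 4, 8):
    random_pattern = conditional_failure_probability(age, 1, shape=1.0, scale=10)
    age_related = conditional_failure_probability(age, 1, shape=3.0, scale=10)
    print(f"age {age}: random {random_pattern:.3f}, age related {age_related:.3f}")
```

For the random pattern the printed value is the same at every age, while for the age-related pattern it grows as the item gets older.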
Why do we Knowing which failure pattern to associate with a failure mode helps us to select a
need to failure management tactic that is most appropriate to deal with the failure mode.
know this?
Age and usage related failure patterns
Conditions Age related failure modes are more likely to occur under conditions where the
components are stressed or exposed to conditions that lead to deterioration.
Below are some examples:
Abrasion A loss of material caused by solid particles entrapped between surfaces in close
proximity, solid particles sliding or rubbing against a surface, two surfaces
rubbing against each other especially in the absence of or contamination of
lubricants
Erosion A loss of material caused by the moving contact or flow of liquid or gas over a solid
surface. Cavitation is an accelerated form of erosion
Mechanical The end state of a mechanical fatigue failure mode is a fracture. Mechanical fatigue
fatigue fractures are preceded by cracks that develop over a relatively long period, and
eventually the remaining structure fractures when there is insufficient material left
in the cross section to support the load.
Mechanical fatigue happens under conditions where a component is subjected to
cyclic stress. If the stress in a steel component is limited to below 350 MPa it is
unlikely that mechanical fatigue will develop. Aluminum components are prone to
fatigue failure at lower levels of stress. Fatigue cracks and fractures are often
initiated by stress concentrations.
Failure patterns not related to age
Failure patterns D, E, F
Figure: failure patterns with the percentage shown for each: A 4%, C 5%, D 7%, E 14%, F 68%.
More complex equipment with likely failure causes including:
- electronics
- hydraulics
- pneumatics
- some mechanical items like rolling element bearings
Introduction Reliability engineers have over time become increasingly aware that not all failures to
equipment are age related.
This phenomenon has been brought about by the technology that has been introduced
into industry as well as
This brings new challenges to maintenance to find appropriate proactive tasks to deal with
this circumstance.
Failure The diagram below shows the current known failure patterns of equipment. Patterns D, E
patterns not and F show no relationship between age on the horizontal axis and probability of failure
age related on the vertical axis.
The outstanding characteristic of most of these patterns is that after an initial settling in
period, the probability of failure remains unchanged over time. You are dealing here with
a random failure pattern.
Occurrence Failure patterns not related to age are mainly applicable to:
of random • Electrical and electronic equipment.
failures
• Hydraulic and pneumatic equipment.
• Rolling element bearings.
• Mechanical seals
• Failure modes caused by human error
Infant mortality damage mechanisms
Failure
pattern
‘F’
Many causes Infant mortality features prominently in the hall of failures! Many things can go
wrong with a part or a system even before the system is commissioned. The poor
procurement process, transport, storage and delivery practices of most
organisations are a hotbed for infant mortality. Then comes the installation of the
part or system, the commissioning, operation and maintenance and then the
chances of survival become even smaller. In older designs, components were
probably more robust and there was greater dimensional tolerance. This has all
changed on modern equipment. Modern equipment is less tolerant of transient
currents, dirt ingress and temperature fluctuations. Packaging, transport, storage,
installation and commissioning processes have to be more compliant than before,
otherwise infant mortality will be over-represented in the failure data of the
business.
DAY 3
FAILURE MANAGEMENT STRATEGY DEVELOPMENT
Overview
Introduction In this chapter we introduce how failures, degradation and their costs are
managed by applying tactics
Section Description
A Risk-based approaches to failure management
B Select proactive maintenance on the basis of costs and risks
C Preventive maintenance tasks and intervals
D Predictive maintenance tasks and intervals
E Failure detection and function testing tasks and intervals
F Repair after failure strategies
Section A
Overview
Introduction In this section we introduce the different types of programmed maintenance tactics and
the criteria that are used to select them.
Topic
Three basic types of failure management tactics
Three basic types of failure management tactics
Tactic
types
Repair after Also known as run to failure, this is a tactic where the failure mode is allowed to run its full
failure course and is then repaired by doing a corrective task.
Three basic types of programmed maintenance
Programmed maintenance types
Condition- The risk of suffering an unanticipated failure mode is reduced by performing regular
based inspections, evaluations, monitoring of the equipment condition in order to be able to
maintenance execute a planned corrective task
Function The risk of suffering a multiple failure or the consequences of a hidden failure is
testing / reduced by verifying that a failure mode has not occurred in the period preceding the
failure task. This type of programmed task only applies to hidden failures or failures that
finding operators are unable or unwilling to report.
Section B
Select proactive maintenance on the basis of costs and risks
Overview
Introduction In this section we introduce the different types of programmed maintenance tactics and
the criteria that are used to select them.
Topic
Tactics selection diagram
The tactics selection diagram
Diagram
Failure mode The tactic selection diagram is read from left to right, top to bottom. The first step
in the diagram commences with a failure mode that has been identified either in
anticipation of a future event that may happen or an event that has already
happened, or an event for which we have a current tactic in place
Questions The user of the diagram proceeds through a series of questions, to each of which he or
she may only answer a definitive yes or no, until there are no more questions and a
specific tactic has been selected.
Section C
Predictive maintenance task and intervals
Overview
Topic
Condition-based questions of the tactics selection diagram
Programmed condition-based tasks
Types of condition-monitoring
Condition-based questions of the tactics selection diagram
Diagram
Failure mode An event that causes a loss of function, performance, efficiency, quality, injury,
death, environmental incident, production loss
Is the failure Potential failures are conditions that precede the failure mode. They must be
mode observable, measurable and capable of being evaluated and assessed, and must give a clear indication
preceded by that the failure mode is going to occur in the near future. Techniques that could be
one or more employed include condition monitoring, human senses, performance monitoring and
failure product monitoring.
conditions Examples: Alignment, arcing, acoustic emissions, bowing, bubbling, bouncing,
bulging, burred, clearance, contamination, crack, cut, color, collapse, chatter,
discoloration, deformation, distortion, damage, exposed, electrical resistance,
electrical current, eddy current, flattening, flaking, flow, flexing, frayed, holed,
hardness, humming, interference, play, pressure, pitting, knock, kinked, nipped,
long, level, looseness, magnetism, noise, play, peeling, roughness, torn, thinning,
thickness, twisted, soft, sagging, specific gravity, slack, short, sound, submersed,
tension, tight, tear, stalled, temperature, taste, worn, vibration, X-ray
Condition-based questions of the tactics selection diagram cont.
Is the lead time of the potential failure/s long enough for planned corrective action?
The time from when the potential failure is detected until the failure mode is
likely to occur must be long enough to plan and take corrective action that would
avoid or significantly reduce the consequences of the failure.
Specify the programmed condition-based tactic/s at interval/s that are less than the PF lead time
Start the description of the task with an action word such as: Verify, inspect,
measure, record.
Example 1: Record the density pump shaft bearing temperature and verify that it does not
exceed 70°C
Example 2: Verify that the drive belts are free of cracks
Example 3: Verify that the lubrication oil flow is 200-250 lpm
Example 4: Verify that the lining is free of excessive wear or damage
Propose an interval for the task that is shorter than the PF interval
Is the residual risk of unanticipated failure tolerable and the tactic cost effective?
The selection of one or more condition-based tasks is usually adequate to
reduce the risk of failure. However, in rare cases where safety is
involved it may be necessary to specify a PM task as well.
EXAMPLE
The OEM recommends that the seat belt should be replaced once every three
years and has printed an expiry date on the identification tag.
The seat belt can also suffer damage or deterioration before the expiry date.
The seat belt could therefore suffer more than one failure mode; see FM1 and
FM2 below:
FM1: Seat belt damaged.
FM2: Seat belt expires.
FM1, seat belt damaged, is dealt with by two condition-based tasks:
CBM task 1: An inspection of the seat belt by the operator every shift.
CBM task 2: An inspection of the seat belt by the technician every 500 hours.
FM2, seat belt expires, is dealt with by a preventive tactic:
PM task 1: A replacement of the seat belt by the technician every 3 years.
The risk in the above case is only reduced to a tolerable level by the proposal
of multiple tasks.
Programmed condition-based tasks
Introduction Condition based tasks are based on the fact that many failure modes do not occur
instantaneously, but actually develop over a period of time.
Condition based maintenance tasks involve periodic monitoring or checking of
equipment for potential failures, so that actions can be taken to either prevent the failure
or prevent or reduce the consequences of the failure.
These failure characteristics create an opportunity for maintenance to observe the point
at which the potential for failure has developed and take action to either prevent the
failure or the consequences of the failure.
There are four basic types of condition-based maintenance tactics:
Condition monitoring: Use specialised equipment to monitor aspects of the system such as
temperature, pressure, flow, electrical potential, current, viscosity, density, acidity, color,
contamination, wear, etc. to identify potential failures of the system.
Performance monitoring: Use built-in monitoring systems and test equipment (BITE) to monitor things like
temperature, pressure, flow, electrical potential, current, viscosity, density, acidity, color,
contamination, wear, etc. to identify potential failures of the system.
Product monitoring: Monitor the products produced by the system in terms of characteristics like mass, size,
composition, color, taste, temperature, viscosity, density, acidity, bacteria count,
hardness, strength, ductility, etc. as a means to identify potential failures of the system.
Inspection: Use the human senses of sight, smell, touch, taste and hearing to identify potential
failures. Most condition-based tasks are based on these senses.
Potential failure examples
Here are some examples of potential failure observations that can be made:
• Hot spots on bus bars showing insecure connections.
• Abnormal noise emanating from an automotive water pump.
• Abnormal temperature from a bearing.
• Abnormal low pressure in a hydraulic system.
• Brass particles in the lubricating oil of an ore crusher.
• Cracks in the frame of a front-end loader.
• Scour damage to a hydraulic cylinder rod.
Characteristics The diagram below is a model of the factors associated with condition based
maintenance
Terminology The following terminology is associated with the diagram of the condition based
maintenance model:
Term: Definition
Deterioration level: The degree of deterioration of the item associated with its potential for
failure.
Running time: The cumulative age of the item. It can be expressed in calendar time or
operational utilisation.
Potential failure: The point at which the onset of failure can be observed.
Failed: The point at which the item has actually failed.
Lead time: The interval of time or operational utilisation between the observation
of the potential failure and the actual failure.
Task interval If a condition based task is to be successful, it is obvious from the above model that you
must detect potential failures and deal with them before the point of failure. In practical
terms, the interval at which you perform condition monitoring, or checking, must be shorter
than the lead-time.
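The rule above can be sketched as a simple check. This is a minimal illustration, not part of the course material: the function names and the example figures (a 6-week lead time, intervals of 2 and 8 weeks) are assumptions chosen for the example.

```python
def is_valid_cbm_interval(task_interval: float, lead_time: float) -> bool:
    """Return True when the checking interval is shorter than the lead time."""
    if task_interval <= 0 or lead_time <= 0:
        raise ValueError("task_interval and lead_time must be positive")
    return task_interval < lead_time


def guaranteed_checks_in_lead_time(task_interval: float, lead_time: float) -> int:
    """Minimum number of scheduled checks that fall inside one lead-time window."""
    if task_interval <= 0 or lead_time <= 0:
        raise ValueError("task_interval and lead_time must be positive")
    return int(lead_time // task_interval)


# Hypothetical figures: a potential failure gives 6 weeks of warning.
print(is_valid_cbm_interval(2, 6), guaranteed_checks_in_lead_time(2, 6))
print(is_valid_cbm_interval(8, 6), guaranteed_checks_in_lead_time(8, 6))
```

With an 8-week interval no check is guaranteed to fall inside the 6-week warning window, so the potential failure can be missed entirely.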
Effectiveness criteria
The objective of all proactive tasks is to prevent or reduce the consequences of functional
failures.
Condition based maintenance tasks are only effective when they achieve this objective.
The basis for evaluating the effectiveness in terms of operational or economical
consequences is that the cost of doing the task must be less than the cost of the
consequences that the task is intended to prevent.
When the condition-based task is applied to prevent safety consequences, it is only
effective if the task reduces the risk of the functional failure to an acceptable level.
What CBM achieves
Condition based maintenance rarely prevents a failure from occurring. The component
that reveals a potential failure is usually doomed to fail and nothing can prevent it.
The biggest benefit achieved from condition based maintenance is the time it makes
available to maintenance to plan the task to correct the failure so that the task is executed:
• At and within the least disruptive time to operations.
• With the least impact on safety and the environment.
• At the lowest possible cost and consumption of resources.
Planning actions
When a potential failure is identified the planning function must, depending on the lead
time available:
• Register a defect and create a corrective task in the CMMS.
• Verify that the spares are available by reserving current stock, registering a
requirement with stores or expediting an existing requirement.
• Schedule the corrective task into the forthcoming weekly schedule, planned outage or,
in an emergency, negotiate an outage with operations.
Types of condition monitoring
Condition monitoring
The term condition monitoring is used to define the application of specialised equipment
to monitor the condition of equipment.
The monitoring process involves the use of equipment to measure changes in
characteristics associated with the equipment functions. The specific techniques are
classified as follows:
• Dynamic
• Chemical
• Temperature
• Particle distribution
• Physical
• Electrical
• Acoustic
The selection of the most appropriate condition monitoring technique must be done with
forethought. There are as many successes as there are failures in the use of techniques.
Section D
Preventive maintenance task and intervals
Overview
Topic
Preventive maintenance questions of the tactics selection diagram
Programmed restoration and discard task intervals
Preventive maintenance questions of the tactics selection diagram
Diagram
Is there an age where the likelihood of failure increases rapidly?
A yes reply to this question means there is a high level of certainty about when the
likelihood of failure increases, and the team must then specify the age or usage
at which this happens.
Do not confuse 'age' in this question with MTBF. Every physical part has an
MTBF but not every part has a specific age where the likelihood of failure
increases rapidly.
The benefit of PM is that it is simple to implement by programming the CMMS to
produce replacement work orders at the prescribed interval. Once set up it will
continue to produce work orders without anybody needing to make assessments
and judgments about whether a part or component should be replaced or not.
Will a programmed replenishment, adjustment, calibration, top-up or clean prevent the failure?
This question asks whether something can be done at fixed intervals to maintain
an item or system without replacing it. The most common examples of this are
cleaning, lubricating, tensioning, adjusting, calibrating, replenishing and topping up.
House-keeping and cleaning tasks that are performed at fixed intervals are
covered by this question.
Preventive maintenance questions of the tactics selection diagram
cont.
Specify a programmed refurbishment at an age before the likelihood of failure increases rapidly
Start the task instruction with a verb such as top-up, clean, adjust, align, train,
tighten, tension, grease, oil, lubricate, decontaminate, drain, wipe, polish, apply,
refuel, backwash, rinse, brush, grind, re-groove, restore, update, refresh,
resurface, deglaze, reshape, remove, de-burr, trim, secure, reset, and then
specify the object, component or system and the interval at which this needs to
be done.
Is the residual risk of unanticipated failure tolerable and the tactic cost effective?
If the risk of failure cannot be reduced by a programmed refurbishment, then
proceed to consider a programmed replacement.
No additional tactics required
If the risk was reduced sufficiently after a refurbishment task was proposed then
no additional tactics are required.
Will a programmed replacement prevent the failure?
Having already answered yes to the question regarding whether there is an
increasing likelihood of failure, this question asks whether something can be
replaced at fixed intervals. Failure modes associated with consumable items
such as filters, lubricants in small quantities and fluids are candidates for PM.
Failure modes related to components with limited shelf lives that suffer
chemical decay, such as seat belts, lifting slings made of synthetic materials
and automotive tyres, could also be candidates.
Specify a programmed replacement at an age before the likelihood of failure increases rapidly
Start the task instruction with the verb 'Replace' and then specify the object or
component that should be replaced and the interval at which it has to be
replaced to prevent it from failing.
Is the residual risk of unanticipated failure tolerable and the tactic cost effective?
If the tactics / tasks that have been proposed to deal with the failure mode have
not succeeded in reducing the risk then further options can be considered by
following the logic of the diagram.
Programmed restoration and discard task intervals
Introduction Refurbishing or replacing components on a fixed time basis was at one time considered to
be the best practice approach for maintaining assets. The economic disadvantages of this
approach, however, far outweigh the benefits.
Criteria The fixed interval replacement or reworking of components is only feasible if:
• There is an identifiable point at which the item shows a rapid increase in the
conditional probability of failure. The failure distribution of the item must therefore have
a small standard deviation.
• Most of the items under consideration survive to that identifiable point. If there are
safety or environmental consequences involved, then all the items must survive to that
identifiable point.
Failure distribution
The diagram below shows how most age related failures follow a normal distribution. In
this distribution the first failures occur at 2000 hours, the greatest number survive to 2600
hours and the longest survivors to 3200 hours.
Feasibility Fixed interval discard or restoration of items that follow the above distribution pattern is
feasible but rarely economical. This is because, depending on the tolerance level for
failures and the standard deviation of the failure distribution, the majority of items would
be discarded or restored long before they have achieved their potential life.
Using the functions of a normal distribution in the example above, 97.7% of the items in
the example survive beyond 2200 hours and 84% beyond 2400 hours.
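The survival figures quoted above can be checked with a short sketch. It assumes a mean life of 2600 hours and a standard deviation of 200 hours; the standard deviation is not stated in the text and is inferred here from the first failures (2000 hours) and the longest survivors (3200 hours) lying three standard deviations either side of the mean.

```python
import math


def normal_survival(t: float, mean: float, std_dev: float) -> float:
    """Probability that a normally distributed life exceeds t."""
    z = (t - mean) / std_dev
    # Survival = 1 - Phi(z), written with the complementary error function.
    return 0.5 * math.erfc(z / math.sqrt(2.0))


MEAN_LIFE_H = 2600.0  # from the text: the greatest number survive to 2600 hours
STD_DEV_H = 200.0     # assumption: (3200 - 2000) / 6

for hours in (2200, 2400):
    share = normal_survival(hours, MEAN_LIFE_H, STD_DEV_H)
    print(f"{hours} h: {share:.1%} of items survive beyond this age")
```

Under these assumptions, replacing every item at 2200 hours would discard items that on average had a further 400 hours of life, which illustrates why the approach is rarely economical.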
Safety and environmental
When there are safety and environmental consequences at stake, the decision in the
previous paragraph will not be acceptable because there is still a substantial risk that
some failures will occur.
The only available option under these circumstances is to perform the programmed
discard, or restoration task, before any failures can occur.
Safe-life limits A safe-life interval must be applied to prevent all possibility of failure. This requires that
the component is discarded or restored before any failures can occur.
The diagram below shows how a safe life limit is determined.
Determining the interval
Determining the safe life interval is extremely difficult because it is often impossible to
obtain history about previous failures. On the one hand, you cannot allow failures with
safety consequences to occur, yet on the other you need history to determine how to prevent them.
This is an unacceptable situation.
The only way to determine the safe life interval in this case is to perform a representative
number of experiments on test rigs that simulate the equipment in the operating context.
The costs involved are, in all but a few cases, not justified. Alternative approaches
must rather be explored.
Section E
Failure detection and function testing task intervals
Overview
Introduction In this section we introduce detective maintenance, which is also known as
function testing.
Topic
Detective maintenance questions of the tactics selection diagram
Function testing and failure finding tasks
Function testing and failure finding interval calculation
The four important reliability functions
Detective maintenance questions of the tactics selection diagram
Diagram
Is the loss of A yes reply to this question means there is a high level of certainty of when the
function likelihood of failure increases and the team must then specify the age or usage
hidden under at which this happens.
normal Do not confuse 'age' in this question with MTBF. Every physical part has an
circumstances MTBF but not every part has a specific age where the likelihood of failure
increases rapidly
The benefit of PM is that it is simple to implement by programming the CMMS to
produce replacement work orders at the prescribed interval. Once set up it will
continue to produce work orders without anybody needing to make assessments
and judgments about whether a part or component should be replaced or not.
Is a This question asks whether something can be done at fixed intervals to maintain
programmed an item or system without replacing it. The most common example of this is
function test cleaning, lubricating, tensioning, adjusting, calibrating, replenishing, topping up.
or failure House-keeping and cleaning tasks that are performed at fixed intervals are
finding task covered by this question.
feasible
Function testing and failure finding tasks
Introduction Failure finding and function testing tactics are also known as "detective" tasks. Failure
finding tasks are not preventive, predictive or proactive tasks because their intention is
to detect a failure mode after it has occurred.
Function testing and failure finding are only applicable to hidden failures.
Objective The objective of a failure finding task is to secure the availability required from a protective
function to reduce the risk of a multiple failure and associated consequences to an
acceptable level.
Interval The interval of failure finding tasks is a function of the reliability of the protected function
and the risk tolerance of the organisation to the multiple failure.
Example The application of these functions is illustrated in the spare wheel example:
Given the following information, calculate the interval at which you must inspect your
spare wheel.
• You on average suffer a puncture once every three years.
• You do not want to get caught with a flat tyre and a flat spare more than once in 74
years.
An analysis of the above problem reveals the following factors:
• The reliability (MTBF) of the protected function is three years.
• Risk tolerance to a multiple failure is one in 74 years.
Interval of failure finding task:
\[
\text{Interval} = \frac{\text{Mean Time Between Failure}}{\text{Risk Tolerance}} = \frac{3}{74} \approx 0.04\ \text{years} \approx 14.7\ \text{days}
\]
The interval at which you should test your spare wheel for failures is two weeks.
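The spare wheel arithmetic can be reproduced with a minimal sketch. The function name and the 365-day year are assumptions made for the example; the ratio is treated as a fraction of a year, as in the worked example above.

```python
DAYS_PER_YEAR = 365  # assumption: calendar days, ignoring leap years


def simple_failure_finding_interval(mtbf_years: float,
                                    tolerable_multiple_failure_years: float) -> float:
    """Interval in years: MTBF of the protected function over the risk tolerance."""
    if mtbf_years <= 0 or tolerable_multiple_failure_years <= 0:
        raise ValueError("both inputs must be positive")
    return mtbf_years / tolerable_multiple_failure_years


interval_years = simple_failure_finding_interval(3, 74)
interval_days = interval_years * DAYS_PER_YEAR
print(f"{interval_years:.3f} years, about {interval_days:.1f} days")
```

The result rounds to the two-week inspection interval recommended in the example.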
Conclusion The following conclusions can be made regarding the above example:
• The lower the reliability (MTBF) of the protected function the higher the risk of multiple
failure.
• The lower the availability of the protective function the higher the risk of a multiple
failure.
• Either improving the reliability of the protected function or increasing the availability of
the protective function can decrease the risk.
• The availability of the protective function is determined by the interval of the failure-
finding task.
Reliability of the protection system
With the above in mind a person could conclude that the interval of a failure finding task
must be determined with only the above factors in mind.
There is however also the factor of the reliability of the protective function to be
considered.
The practical application of the above interval means that you will inspect the status of the
spare wheel 26 times during the course of one year. What will your reaction be to the
following two situations after a year of doing a failure-finding task?
• The wheel was functional 26 out of 26 times.
• The wheel required inflation 26 out of 26 times.
Experience It is good practice to use the example calculation when determining the interval of failure
finding tasks for the first time.
The experience gained from executing these tasks over time must however be applied to
adjust the intervals to reflect the historical trends.
In the case of safety and environmental consequences the interval should not be
lengthened (that is, the testing frequency decreased) without a rigorous analysis being done first.
Impractical intervals
The outcome of a multiple failure finding task analysis or experience could result in an
impractical failure finding interval. This can be driven by a high demand for the availability
of the protective device or low reliability of the protected function.
In the case of standby plant, the context of the protective function changes to a primary
function once it takes over from the function that has failed. From that point onward, it will
only be as reliable as the function it replaced.
Double the protection
The best solution to the problem discussed above is to add another protective function to
decrease the risk of multiple failures. An additional standby pump, spare wheel, high level
alarm, safety valve, temperature cut out, etc. are examples of typical protective devices
that can be added. In extreme cases it may even be necessary to add more than one
additional protective function.
Feasibility Failure finding tasks are done with one single purpose in mind namely, to determine if the
protective device will prevent or reduce the consequences of a failure to the protected
function.
It must therefore be technically feasible to check the protective device to determine if it is
functioning correctly. The preferred method of checking is in position. There are also
cases where the equipment must be removed or even dismantled before it can be
checked.
Other checks There are situations where none of the above types of check is possible. These fall into
three categories:
Type Description
Dismantle The item must be dismantled and the parts inspected.
This type of checking often introduces the very failures that you are
trying to prevent.
Destruct The item can only be verified by its destruction.
Fusible links, rupture discs, shear pins, etc. fall into this category. In
these situations you are dependent on the integrity of the material and
process used in the manufacture of these components.
Samples from current stock can be tested to destruction locally or at a
laboratory. The correct specifications must be adhered to in the
acquisition of these components.
Beware of cheap substitutes being purchased or manufactured by
ignorant technicians.
Impossible It is not possible to verify the function at all.
These items must be regarded with a high level of suspicion. If the
integrity cannot be verified by your own organisation or an appropriate
authority, the equipment must be redesigned or the device replaced by
one that can be checked
Function testing and failure finding interval calculations
Single protection system risk based interval
The formula below is suitable for singular protection systems and uses failure
probabilities as the basis for determining the interval:
\[
\text{FFI} = \frac{2 \times \text{Mted} \times \text{Mtive}}{\text{MMF}}
\]
Parallel redundant protection system risk based interval
The formula below is suitable for parallel redundant protection systems and uses failure
probabilities as the basis for determining the interval:
\[
\text{FFI} = \text{Mtive} \times \left\{ \frac{(n+1)\,\text{Mted}}{\text{MMF}} \right\}^{1/n}
\]
MMF: Tolerable mean time between multiple failures
Mted: MTBF of The protected function – The demand for protection
Mtive: MTBF of The protective device – The device protecting the demand
n: The number of parallel redundant protective devices
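The single and parallel redundant formulas can be sketched in code as follows. The function names and the input figures (Mted of 5 years, Mtive of 10 years, MMF of 1000 years) are hypothetical and serve only to show how the terms combine.

```python
def ffi_single(m_ted: float, m_tive: float, m_mf: float) -> float:
    """Failure-finding interval for a single protective device: 2 x Mted x Mtive / MMF."""
    return 2.0 * m_ted * m_tive / m_mf


def ffi_parallel(m_ted: float, m_tive: float, m_mf: float, n: int) -> float:
    """Failure-finding interval for n parallel redundant protective devices."""
    if n < 1:
        raise ValueError("n must be at least 1")
    return m_tive * ((n + 1) * m_ted / m_mf) ** (1.0 / n)


# Hypothetical inputs, all in years.
M_TED, M_TIVE, M_MF = 5.0, 10.0, 1000.0

print(f"single device:  {ffi_single(M_TED, M_TIVE, M_MF):.3f} years")
for n in (1, 2, 3):
    print(f"{n} in parallel:  {ffi_parallel(M_TED, M_TIVE, M_MF, n):.3f} years")
```

With n = 1 the parallel formula reduces to the single-device formula, and adding redundant devices lengthens the permissible interval for the same tolerable risk.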
Function testing and failure finding interval calculations continued
Single The formula below optimizes the interval of inspection in relation to the cost of the
protection multiple failure
system cost
based interval
1/2
FFI for multiple failure modes in a single protective device
The formula below is suitable for protection systems that have multiple failure modes
that need to be considered:
\[
\text{FFI} = \frac{2 \times \text{Mted}}{\left( \frac{1}{M_1} + \frac{1}{M_2} + \frac{1}{M_3} + \dots \right) \times \text{MMF}}
\]
M1, M2 , M3 : MTBF of individual failure modes associated with the protective device
Mted: MTBF of The protected function – The demand for protection
MMF: Tolerable mean time between multiple failures
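A minimal sketch of the multiple failure mode formula follows. The function name and the figures are hypothetical; the sketch only shows that the individual failure mode MTBFs are combined as a sum of reciprocals.

```python
def ffi_multiple_failure_modes(m_ted: float,
                               failure_mode_mtbfs: list[float],
                               m_mf: float) -> float:
    """FFI = 2 x Mted / ((1/M1 + 1/M2 + ...) x MMF)."""
    if not failure_mode_mtbfs:
        raise ValueError("at least one failure mode MTBF is required")
    combined_rate = sum(1.0 / mtbf for mtbf in failure_mode_mtbfs)
    return 2.0 * m_ted / (combined_rate * m_mf)


# Hypothetical inputs in years: one protective device with two failure modes.
print(ffi_multiple_failure_modes(5.0, [10.0], 1000.0))
print(ffi_multiple_failure_modes(5.0, [10.0, 10.0], 1000.0))
```

Each additional failure mode raises the combined failure rate of the protective device and therefore shortens the interval.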
Function testing and failure finding interval calculations continued
Voting systems
The formula below is used to determine the function test interval for voting systems:
\[
\text{FFI} = \text{Mtive} \times \left\{ \frac{(n-r)!\,(r+1) \times \text{Mted}}{n! \times \text{MMF}} \right\}^{1/r}
\]
Mtive: MTBF of The protective device – The device protecting the demand
Mted: MTBF of The protected function – The demand for protection
MMF: Tolerable mean time between multiple failures
n: The number of units in parallel
k: The number of units needed to activate the system
r: The number of units that must be failed for the system to fail
Therefore: r = n - k + 1
Optimising function testing and failure finding intervals
The diagram below shows the relationship between the cost factors and how increasing
or decreasing the interval of failure finding or function testing has an impact on the total
cost.
[Figure: cost curves labelled Total Cost, Annualised cost of multiple failure and Annualised cost of failure-finding. Axis label: Cost]
Failures per cycle data
An inspector selected 37 switches from a production batch and exposed
them to accelerated life testing. Each switch was operated until it failed, and the
number that failed during each incremental number of cycles was recorded in a
table like the one below. There is obviously a relationship between the number of
cycles the switch is operated and the likelihood of failure.
This information is also plotted on a histogram (bar graph) as illustrated in the next
diagram.
Cycles | Failures | Cumulative | % Cumulative | Survivors | No. Failed/Sample Size (Probability Density)
1000 | 0 | 0 | 0.00 | 37 | 0
Total | 37 | | | | 1
Reliability
functions
The four important reliability functions continued
Failure distribution
This is basically a histogram of the number of switches that failed in each period.
[Figure: Failure Distribution histogram. Vertical axis: Failed Items (0 to 8). Horizontal axis: Operating Cycles (1000 to 10000)]
Some conclusions
Thirty-seven switches were put into operation on the test bench at the same time
and the number of cycles they operated before they failed was recorded. The test
was continued until all the failures had been recorded.
This information was plotted onto a histogram. We can see that by 3000 cycles the
number of switches that had failed was 0+1+3 = 4.
The number of survivors at this point is 37-4 = 33. As proportions of the
sample size, the failures at this point are 4/37, or approximately
11%, and the survivors are 33/37, or approximately 89%.
We could therefore state that a switch has an 89% probability of surviving
to 3000 cycles. Another way of stating this is to say that the reliability
of the switch at this point is 89%.
There is no guarantee that it will last longer but there is an 89% chance that it will
survive beyond this point. As time passes this reliability figure keeps falling.
Referring to this table we can see that at 5000 cycles:
The cumulative number of failures is 17.
The proportion of cumulative failures to the sample size (37) is 46%.
The proportion of survivors is about 100%-46% = 54%.
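The cumulative and survival figures can be reproduced with a short sketch. The per-period failure counts below are illustrative assumptions: the full table is not reproduced here, so the counts were chosen only to agree with the totals quoted in the text (4 failures by 3000 cycles, 17 by 5000 cycles, 37 in all).

```python
from itertools import accumulate

CYCLES = list(range(1000, 11000, 1000))
# Illustrative counts per 1000-cycle period; an assumption, not the course data.
FAILURES = [0, 1, 3, 6, 7, 7, 6, 4, 2, 1]

SAMPLE_SIZE = sum(FAILURES)
CUMULATIVE = list(accumulate(FAILURES))


def survival_at(cycles: int) -> float:
    """Proportion of the sample still working at the end of the given period."""
    index = CYCLES.index(cycles)
    return (SAMPLE_SIZE - CUMULATIVE[index]) / SAMPLE_SIZE


for cycles, failed in zip(CYCLES, CUMULATIVE):
    print(f"{cycles:>6} cycles: {failed:>2} failed, "
          f"{failed / SAMPLE_SIZE:.0%} cumulative, {survival_at(cycles):.0%} surviving")
```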
The four important reliability functions continued
Probability density function
The curve in this example is drawn from the values derived from dividing the number of
failures in each period by the sample size. The result shows the probability that a switch fails in the
corresponding period.
[Figure: probability density curve. Vertical axis: Probability of failure (0.000 to 0.200). Horizontal axis: Cycles (0 to 10000)]
Cumulative probability function
This diagram was drawn from the cumulative number of failures divided by the sample
size:
[Figure: Cumulative Probability of Failure. Vertical axis: Probability (0.000 to 1.200). Horizontal axis: Cycles]
Significance The probability density function (pdf) and cumulative distribution function (cdf) are
two of the most important statistical functions in reliability and are very closely
related. When these functions are known, almost any other reliability measure of
interest can be derived or obtained.
The four important reliability functions continued
Survival probability
Survival function is defined as the probability that an item survives beyond time ‘t’
without experiencing a failure.
Survival probability values are calculated by subtracting the cumulative probability
values in each period from 1.
[Figure: Survival Probability. Vertical axis: Probability of Survival (0.000 to 1.200). Horizontal axis: Cycles]
Significance The survival probability is the proportion of units that survive beyond a specified
time. These estimates of survival probabilities are frequently referred to
as reliability estimates. ... The cumulative failure probabilities are the likelihood of
failing instead of surviving.
The four important reliability functions continued
Hazard rate
The hazard rate is calculated by dividing the failure distribution density by the survival
probability.
[Figure: Hazard Rate. Vertical axis: Hazard Rate (0.000 to 2.500). Horizontal axis: Cycles (1000 to 9000)]
Significance Hazard rate is defined as the ratio of the density function to the survival function.
For the density function of the time to failure, f(t), and the reliability function, R(t),
the hazard rate function for any time, t, can be defined as
h(t) = f(t) / R(t)
where f(t) is the probability density function (PDF) representing a failure
distribution and R(t) is the survival function.
In words: it is the probability that an item that has survived to time ‘t’ will succumb to the event
in the next instant.
Average failure rate is the number of units that fail during an interval divided
by the number of units alive at the beginning of the interval.
In the limit of smaller time intervals, the average failure rate measures the rate of
failure in the next instant of time for those units surviving to time ‘t’ (that is,
conditional on survival), known as the instantaneous failure rate.
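The relationship h(t) = f(t) / R(t) can be sketched on period data as follows. The failure counts are illustrative assumptions (chosen to total 37 switches), and the sketch assumes the survival probability is taken at the end of each period, which leaves the hazard rate undefined for the final period when no survivors remain.

```python
# Illustrative failures per 1000-cycle period; an assumption, not the course data.
FAILURES_PER_PERIOD = [0, 1, 3, 6, 7, 7, 6, 4, 2, 1]


def reliability_table(failures: list[int]) -> list[dict]:
    """Density, cumulative probability, survival and hazard rate for each period."""
    sample_size = sum(failures)
    rows = []
    cumulative = 0
    for count in failures:
        cumulative += count
        survivors = sample_size - cumulative
        density = count / sample_size        # f(t)
        survival = survivors / sample_size   # R(t), taken at the end of the period
        hazard = density / survival if survivors > 0 else None  # h(t) = f(t) / R(t)
        rows.append({"density": density, "cdf": cumulative / sample_size,
                     "survival": survival, "hazard": hazard})
    return rows


rows = reliability_table(FAILURES_PER_PERIOD)
for period, row in enumerate(rows, start=1):
    hazard = "undefined" if row["hazard"] is None else f"{row['hazard']:.3f}"
    print(f"{period * 1000:>6} cycles  f={row['density']:.3f}  "
          f"R={row['survival']:.3f}  h={hazard}")
```

The hazard rate climbs steeply in the later periods because the same number of failures is shared among fewer and fewer survivors.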
The four important reliability functions continued
Two-parameter Weibull function
In probability theory and statistics, the Weibull distribution /ˈveɪbʊl/ is a continuous
probability distribution. It is named after Waloddi Weibull, who described it in detail in
1951, although it was first identified by Fréchet (1927) and first applied by Rosin &
Rammler (1933) to describe a particle size distribution.
\[
R(t) = e^{-\left( \frac{t}{\eta} \right)^{\beta}}
\]
η Characteristic life
β Shape parameter
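The two-parameter Weibull reliability function can be evaluated with a few lines of code. This is a minimal sketch; the function name is an assumption, and the example values (η of 66.7 weeks, β of 5, age of 26 weeks) are the ones quoted later in this section.

```python
import math


def weibull_reliability(t: float, eta: float, beta: float) -> float:
    """R(t) = exp(-(t / eta) ** beta) for the two-parameter Weibull distribution."""
    if t < 0 or eta <= 0 or beta <= 0:
        raise ValueError("t must be non-negative; eta and beta must be positive")
    return math.exp(-((t / eta) ** beta))


ETA_WEEKS = 66.7  # characteristic life quoted in the text

print(f"R(26 weeks), beta=5:    {weibull_reliability(26, ETA_WEEKS, 5):.3f}")
print(f"R(eta), any beta:       {weibull_reliability(ETA_WEEKS, ETA_WEEKS, 3.44):.3f}")
```

At an age equal to the characteristic life η the reliability is always exp(-1), about 37%, whatever the value of β.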
Article
www.weibull.com
http://www.isograph-software.com/fourohfour.htm
Significance In the description that follows, k > 0 is the shape parameter and λ > 0 is the scale
parameter of the distribution; these correspond to β and η above.
Its complementary cumulative distribution function is a stretched exponential
function.
The Weibull distribution is related to a number of other probability distributions; in
particular, it interpolates between the exponential distribution (k = 1) and the
Rayleigh distribution (k = 2). If the quantity X is a "time-to-failure", the Weibull
distribution gives a distribution for which the failure rate is proportional to a power
of time. The shape parameter, k, is that power plus one, and so this parameter
can be interpreted directly as follows:
A value of k < 1 indicates that the failure rate decreases over time. This happens if
there is significant "infant mortality", or defective items failing early and the failure
rate decreasing over time as the defective items are weeded out of the population.
A value of k = 1 indicates that the failure rate is constant over time. This might
suggest random external events are causing mortality, or failure.
A value of k > 1 indicates that the failure rate increases with time. This happens if
there is an "aging" process, or parts that are more likely to fail as time goes on.
In the field of materials science, the shape parameter k of a distribution of
strengths is known as the Weibull modulus. See: www.weibull.com
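The three cases above can be illustrated with the Weibull hazard function, h(t) = (β/η)(t/η)^(β-1), which is the standard form for the two-parameter distribution. In this sketch the argument named beta plays the role of the shape parameter k and eta the role of the scale parameter λ; the function names and the sample ages are assumptions.

```python
def weibull_hazard(t: float, eta: float, beta: float) -> float:
    """Failure rate of the two-parameter Weibull distribution at age t."""
    if t <= 0 or eta <= 0 or beta <= 0:
        raise ValueError("t, eta and beta must be positive")
    return (beta / eta) * (t / eta) ** (beta - 1)


def hazard_trend(beta: float) -> str:
    """Classify the failure-rate trend implied by the shape parameter."""
    if beta < 1:
        return "decreasing (infant mortality)"
    if beta == 1:
        return "constant (random failures)"
    return "increasing (ageing)"


ETA_WEEKS = 66.7  # characteristic life used elsewhere in the text
for beta in (0.5, 1, 3.44):
    early = weibull_hazard(10, ETA_WEEKS, beta)
    late = weibull_hazard(50, ETA_WEEKS, beta)
    print(f"beta={beta}: h(10)={early:.4f}, h(50)={late:.4f}, {hazard_trend(beta)}")
```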
The four important reliability functions continued
Probability density function for various beta values
The value of the Weibull shape parameter β helps to determine the sharpness of the
curve.
[Figure: probability density curves for β = 0.5, 1, 2, 3.44, 5 and 10. Vertical axis: Probability Density (0 to 0.06). Horizontal axis: Week No. (0 to 100)]
Significance In cases of a sharp curve we can be fairly certain that the failure will occur at close
to the η value. The figure shows a set of pdf curves with the same η value of 66.7
weeks we used earlier and different β values. The next figure shows the
corresponding survival probability or reliability curves. From the latter we see that
when β is 5, till the 26th week, the reliability is 99%.
With high β values usage or time-based maintenance tactics may be effective. If
the failure distribution is exponential on the other hand, time based strategies are
not effective.
β values that are less than 1 indicate premature or early failures. In such cases
the hazard rate falls with increasing age and exhibits infant mortality
characteristics.
The four important reliability functions continued
Survival probability for various β values
This slide shows the corresponding survival probability or reliability curves.
[Figure: survival probability curves for β = 0.5, 1, 3.44, 5 and 10. Vertical axis: Survival Probability. Horizontal axis: Week No. (0 to 100)]
Significance From these curves we see that when β is 5 the reliability is still 99% at the 26th week.
Section F
Repair after failure strategies
Overview
Introduction In this section we introduce the default maintenance tactics that may apply when no
proactive maintenance is feasible or cost effective.
Topic
S/H/E questions of the tactics selection diagram
Once-off change and RAF questions of the tactics selection diagram
Reliability centred maintenance
S/H/E questions of the tactics selection diagram
Diagram
Does the loss of function cause an intolerable risk to safety
The user of the diagram will only get to this question if the risk of the failure mode
could not be reduced by a condition-based task, a preventive task or a
detective task.
Specify a combination of programmed tactics or a once-off change to reduce the risk to a tolerable level
Because of the intolerable safety, health or environmental risk, the team is obliged
to review the previous options that were presented. If no single programmed tactic or
combination of programmed tactics reduces the risk, the team must
consider a once-off change (re-design) to reduce the risk to a tolerable level.
Once-off change and RAF questions of the tactics selection diagram
Diagram
Can the risk of failure be reduced by means of a once-off change?
The team has reached this block because none of the programmed tactic options
were feasible, either technically or economically. A final choice has to be made
about whether to re-design the equipment, process or procedures, or to train the
people that operate or maintain the equipment.
Is the once-off change cost effective?
It costs money to develop and implement a once-off change, so it should be
justified on the basis of cost. This is calculated by comparing a 'repair after
failure' tactic with the once-off change. This comparison must be made over a
reasonable period of time such as 4 years. Modifications to the system should
be approached with a high degree of caution because they often spawn
unintended consequences, such as new unanticipated failure modes and risks
that add to the maintenance workload.
Specify the once-off change
The team writes a short description of the change that is required. It is then
assigned to a person such as the reliability engineer, or to engineering, for further
investigation, development and implementation.
Repair after failure
The failure will be dealt with by means of corrective maintenance at the time
and place it occurs.
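The default questions above can be sketched as a small decision function. This is only an illustration under stated assumptions: the function name, the parameters and the cost figures are hypothetical, and the cost test is a plain total-cost comparison over the 4-year period mentioned in the text, without discounting.

```python
def default_tactic(intolerable_she_risk, change_can_reduce_risk,
                   change_cost=0.0, raf_cost_per_year=0.0,
                   raf_cost_per_year_after_change=0.0, years=4):
    """Pick the default tactic when no programmed tactic was feasible.

    All parameter names and cost inputs are hypothetical illustrations.
    """
    if intolerable_she_risk:
        # Safety, health or environment: the risk must be reduced, so the
        # team revisits programmed tactics or specifies a once-off change.
        return "programmed tactics or once-off change"
    if not change_can_reduce_risk:
        return "repair after failure"
    # Compare 'repair after failure' with the once-off change over the period.
    cost_without_change = raf_cost_per_year * years
    cost_with_change = change_cost + raf_cost_per_year_after_change * years
    if cost_with_change < cost_without_change:
        return "once-off change"
    return "repair after failure"


# Hypothetical figures: a 10 000 modification that cuts yearly repair
# cost from 5 000 to 1 000 pays for itself within 4 years.
choice = default_tactic(False, True, change_cost=10_000,
                        raf_cost_per_year=5_000,
                        raf_cost_per_year_after_change=1_000)
```

In practice the cost of any new failure modes introduced by the modification would be added to the "with change" side of the comparison.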
Reliability centred maintenance
Introduction We discussed the selection of the most appropriate task to deal with each type of failure
mode during the previous section. In this section you are shown how a typical RCM
decision tree is used to achieve the same objective.
Definition of RCM
RCM is a systematic analysis approach whereby the equipment design is evaluated in
terms of possible failures, the consequences of these failures, and the recommended
maintenance procedure that should be implemented.
History of RCM
Messrs Nowlan, Heap and Matteson pioneered RCM in the civil aviation industry, where
it became known as MSG3. This represented the work of Maintenance Steering
Group Version 3. It was subsequently incorporated into the Logistic Support Analysis
process by the military and weapons industry.
In industry it is known as RCM.
Implementing RCM
You were introduced to the fundamentals of RCM in this chapter. In-depth training in the
practical application of RCM lies outside the scope of this course. It is essential that all
persons involved with the development of maintenance programmes undergo formal
training. The investment required to develop a maintenance program is substantial and
every effort must be made to ensure that the program will be as effective as possible.
RCM seven questions
The RCM decision logic is based on seven basic questions that should be considered for
each critical item of equipment in the plant, i.e.:
1. What are the functions and associated performance standards of the asset in its
present operating context?
2. In what way does it fail to fulfil its functions?
3. What causes each functional failure?
4. What happens when each failure occurs?
5. In what way does each failure matter?
6. What can be done to predict or prevent each failure?
7. What should be done if a suitable proactive task cannot be found?
RCM data All the data generated from the RCM process can be captured in a database. There are
a number of databases available in the marketplace that can be used for this purpose.
Closing the loop
The full benefits of the RCM process are realised when actual failure data that is
accumulated in using the equipment is recorded and compared with the failure modes
identified during the RCM analysis. This information serves the following purposes:
• To identify failure modes that were not identified during the analysis so that they
can be further analysed.
• The degree to which actual failure modes conform to anticipated failure modes
will determine the confidence that we should have in the outputs of the original
analysis.
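The comparison described above can be sketched with simple set operations. This is a minimal illustration: the function name, the failure mode labels and the conformance measure (the share of observed failure modes that the analysis anticipated) are assumptions, not a prescribed RCM metric.

```python
def compare_failure_modes(anticipated, observed):
    """Return (unanticipated failure modes, conformance ratio or None)."""
    anticipated, observed = set(anticipated), set(observed)
    # Failure modes seen in service that the analysis did not identify.
    unanticipated = sorted(observed - anticipated)
    # Fraction of observed failure modes that were anticipated.
    conformance = len(observed & anticipated) / len(observed) if observed else None
    return unanticipated, conformance


# Hypothetical example data.
rcm_modes = {"bearing seized", "seal leak", "coupling wear"}
failure_history = ["bearing seized", "seal leak", "shaft fracture", "seal leak"]

to_analyse, conformance = compare_failure_modes(rcm_modes, failure_history)
```

Here `to_analyse` lists the failure modes that should be sent back for further analysis, and a low `conformance` value would reduce confidence in the outputs of the original analysis.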
DAY 4
FAILURE MANAGEMENT STRATEGY IMPLEMENTATION
Overview
Introduction In this chapter we introduce how work management brings strategy to life and
enables organisations to reap the benefits of work management.
Section Description
A Implementing maintenance tactics
B Aggressive defect reporting to feed the backlog
C Planning for quality time and safety
D Budget for spare parts and make stocking decisions
E Schedule maintenance to minimise operational downtime
F Use appropriate metrics to drive defect elimination
G Work logistics and preparation
Section A
Implementing maintenance tactics
Overview
Introduction The maintenance program must be organised in such a manner that the PM tasks can be
scheduled and executed in consideration of plant operations and the availability of
resources and materials.
Topic
Categorise programmed maintenance tasks
Non critical equipment
Grouping programmed tasks
On-line programmed maintenance schedule
Off-line window programmed maintenance schedule
Off-line shutdown programmed maintenance schedule
Corrective maintenance guidelines for critical equipment
Categorise programmed maintenance tasks
Introduction The result of conducting the RCM analysis is a list of maintenance tasks that are
appropriate to each plant address or equipment item. These tasks cannot be managed in
their ‘native’ format.
Categories and Terms
It is important that the tasks generated by the selection process are categorised correctly
so that they can be grouped and scheduled in such a way that they can be controlled
effectively. The terms used in the classification of the PM tasks are as follows:
Term Definition
On-line tasks Tasks that must be done while the equipment is in operation. Dynamic condition monitoring tasks, lubrication, walk-around checks etc. are usually done with the equipment on-line.
Off-line tasks Tasks that cannot be done while the equipment is in operation.
Window tasks Tasks that are of short enough duration that they can be scheduled to coincide with existing operational non-utilisation periods (non-production windows).
Shutdown tasks The duration of these tasks or the network of tasks is too long to be integrated into a non-utilisation window. The equipment must specifically be taken off-line for maintenance, and downtime due to maintenance is incurred in the process.
Unit A group of equipment items performing a process in a plant. The outage of one item of equipment in the unit will result in all the other items going down because these items are usually also electrically interlocked. In a parallel process, a unit is usually the largest grouping of equipment that can be shut down without stopping the total output of the plant.
Routine or Schedule A document, usually in the form of a list of activities or inspections, that a person is expected to perform on specified equipment at a predetermined interval. It is also referred to as a "PM schedule".
Skill or Trade Both terms mean the same in the context of this document. Many PM tasks can be done successfully and cost effectively by the equipment users and they should therefore also be classed as a specific type of skill or trade in the system. It is a management principle that a task is always allocated to the lowest appropriate level of skill.
Procedure The PM tasks that originate from the analysis process are listed per plant item because
that is the level at which the analysis was done.
Where applicable, you must group these PM tasks by unit. The tasks per unit are then
further categorised into on-line and off-line. In the next stage:
• The on-line tasks are passed to the on-line grouping and schedule process.
• The off-line tasks are passed to the off-line grouping process where they are grouped
into shutdown and window categories.
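The first stage of this procedure can be sketched as follows. It is a minimal illustration only: the task records, the field names (`unit`, `online`, `name`) and the example tasks are hypothetical.

```python
from collections import defaultdict


def split_tasks_by_unit(tasks):
    """Group PM tasks by unit, then split each unit's tasks into on-line and off-line."""
    grouped = defaultdict(lambda: {"on-line": [], "off-line": []})
    for task in tasks:
        category = "on-line" if task["online"] else "off-line"
        grouped[task["unit"]][category].append(task["name"])
    return dict(grouped)


# Hypothetical task list, one record per task from the analysis.
pm_tasks = [
    {"unit": "Reactor 1", "name": "vibration reading on agitator drive", "online": True},
    {"unit": "Reactor 1", "name": "inspect agitator seal", "online": False},
    {"unit": "Reactor 2", "name": "check gearbox oil level", "online": True},
]

by_unit = split_tasks_by_unit(pm_tasks)
# by_unit["Reactor 1"]["off-line"] is then passed to the off-line grouping process.
```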
Non critical equipment
Introduction This topic deals with the selection and development of maintenance routines for non-
critical equipment.
Criticality C & D Items
The shaded block in the diagram below shows that equipment items with a criticality of C
and D need not be the subject of an analysis process. The cost of this analysis is not
justified due to the minor nature of their failure consequences.
Task selection criteria
The basic approach for these equipment items is that of run to failure. This must not be
interpreted as neglect! Maintenance programmes are developed in accordance with
manufacturers' recommendations.
Integration It is important that the maintenance tasks and routines for this equipment are integrated
into the proactive maintenance programme. This is essential to ensure that the labour,
spares and scheduling requirements of these routines are visible in the overall
programme.
Procedure Prepare, per unit, separate lists of tasks or routines for on-line, window and shutdown work.
Categorise these lists by:
• Skill or trade
• Interval
Grouping programmed tasks
Introduction The proactive maintenance tasks that were grouped by unit and categorised
as off-line tasks are categorised further in this stage in preparation for inclusion
into the schedule.
The Process The shaded block in the diagram below shows the relative position of the
current stage in the maintenance programme development process.
Process flow and policy
An intimate knowledge of the plant process flow and operation policy is a prerequisite to the
development of the maintenance program. It is not possible to develop an effective
maintenance program if you do not have an intimate knowledge of the operational
environment.
A representation of the plant in block or flow diagram format will enable you to visualise
the operation and the ramifications of non-utilisation windows.
Grouping Procedure
You must categorise and group the off-line PM tasks as follows.
Step Action
1 Categorise the list of tasks per unit by interval. You will now have a list of tasks per unit by interval.
2 Analyse the operations schedule and determine if there are regular non-utilisation windows available that coincide with the interval of the tasks per unit.
3 Identify tasks that coincide with the non-utilisation windows.
4 Match the duration of the tasks with the duration of the windows. Perform a basic critical path analysis for complex tasks.
5 Categorise tasks that fit the interval and duration profile of non-utilisation windows as window tasks.
6 Categorise window tasks by skill or trade.
7 Where applicable, group minor tasks of similar skill and interval onto routines for a unit or area identified by a PAR.
8 Allocate a Document Identification Reference to each routine.
9 Categorise remaining tasks as shutdown tasks.
10 Categorise shutdown tasks by skill or trade.
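The window and shutdown split in the procedure above can be sketched as follows. The sketch rests on two assumptions that the text does not spell out: a task "coincides" with a window when its interval is a whole multiple of the window interval, and it "fits" when its duration does not exceed the window duration. The field names and example tasks are hypothetical.

```python
def classify_offline_tasks(tasks, window_interval_weeks, window_duration_hours):
    """Split one unit's off-line tasks into (window tasks, shutdown tasks)."""
    window, shutdown = [], []
    for task in tasks:
        # Assumption: the task interval must be a multiple of the window interval.
        coincides = task["interval_weeks"] % window_interval_weeks == 0
        # Assumption: the task must finish within the window.
        fits = task["duration_hours"] <= window_duration_hours
        (window if coincides and fits else shutdown).append(task)

    # Categorise each list by skill or trade, then by interval.
    def by_skill(task):
        return (task["skill"], task["interval_weeks"])

    return sorted(window, key=by_skill), sorted(shutdown, key=by_skill)


# Hypothetical off-line tasks for a unit with a 3-hour window every 4 weeks.
offline_tasks = [
    {"name": "inspect agitator seal", "interval_weeks": 4, "duration_hours": 2, "skill": "fitter"},
    {"name": "test level switch", "interval_weeks": 8, "duration_hours": 1, "skill": "electrician"},
    {"name": "replace vessel liner", "interval_weeks": 52, "duration_hours": 16, "skill": "fitter"},
]

window_tasks, shutdown_tasks = classify_offline_tasks(offline_tasks, 4, 3)
```

Complex tasks would still need the basic critical path analysis mentioned in the procedure before their duration can be compared with the window.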
On-line programmed maintenance schedule
Introduction The on-line proactive maintenance schedule is a schedule of all the PM tasks that can be
done while the equipment is in operation.
The PM tasks that originate from the analysis done for class A and B equipment and the
manufacturers' recommendations for the class C and D equipment are integrated into the
on-line schedule.
The Process The shaded block in the diagram below shows the two sources from which tasks that are
included into the on-line schedule can originate.
Characteristics On-line maintenance tasks usually comprise programmed inspection procedures, failure
finding tasks and the taking of dynamic condition monitoring readings on equipment.
They are usually conducted at a relatively short interval of between one and seven days.
The production programme is not affected in any way by the on-line schedule.
Design criteria On-line tasks are designed with the following objectives in mind:
• To achieve the efficient utilisation of resources.
• To ensure that basic asset care tasks are performed and observations of deviations
from standards are identified in a consistent manner.
• To minimise the administrative work.
To achieve these objectives, on-line tasks are grouped onto routines or schedules.
Procedure The on-line schedule is prepared as follows:
Step Action
1 Bring together onto a single list the on-line tasks for class A, B, C and D equipment.
2 Group the tasks by plant or area of responsibility.
Example
All on-line tasks for each unit or PAR in the Bleach Plant example will be grouped into a single list for the Bleach Plant.
3 Combine all tasks of the same interval using the same skill onto routines or "schedules".
Note:
Equipment users are part of the skill base and must be allocated their share of the PM work.
4 Order the activities listed on the routine in the sequence in which the person performing the schedule will perform them.
5 Identify each routine or schedule developed in this way by means of a unique Document Identification Reference and register this document into your database for control purposes.
6 Create a scheduled task for each routine in the scheduling topic of your CMMS or schedule the task on a manual pegboard.
7 Allocate the Document Identification Reference of the routine to the scheduled task in your CMMS or to the pin on the pegboard.
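The grouping part of this procedure can be sketched as follows. It is an illustration under assumptions: the record fields, the `RTN-0001` style of Document Identification Reference and the example tasks are hypothetical, and registering the routine in a CMMS is outside the sketch.

```python
from collections import defaultdict
from itertools import count


def build_online_routines(tasks, prefix="RTN"):
    """Combine on-line tasks with the same area, interval and skill into routines."""
    groups = defaultdict(list)
    for task in tasks:
        groups[(task["area"], task["interval_days"], task["skill"])].append(task)

    serial = count(1)
    routines = []
    for (area, interval_days, skill), members in sorted(groups.items()):
        # Order the activities in the sequence the person will perform them.
        members.sort(key=lambda t: t["sequence"])
        routines.append({
            "reference": f"{prefix}-{next(serial):04d}",  # hypothetical reference format
            "area": area,
            "interval_days": interval_days,
            "skill": skill,
            "activities": [t["name"] for t in members],
        })
    return routines


# Hypothetical on-line tasks for the Bleach Plant.
online_tasks = [
    {"area": "Bleach Plant", "interval_days": 1, "skill": "operator", "sequence": 2, "name": "check gearbox oil level"},
    {"area": "Bleach Plant", "interval_days": 1, "skill": "operator", "sequence": 1, "name": "listen for abnormal noise on drive"},
    {"area": "Bleach Plant", "interval_days": 7, "skill": "fitter", "sequence": 1, "name": "take vibration reading"},
]

routines = build_online_routines(online_tasks)
```

Each entry in `routines` corresponds to one scheduled task, with the reference carried across to the CMMS task or pegboard pin.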
Off-line window programmed maintenance schedule
Introduction The off-line proactive maintenance schedule is a schedule of all the PM tasks that can be
done in known periods of non-utilisation of the equipment. These windows usually result
from units in the plant operating in parallel.
The process The shaded block in the diagram below shows the two sources from which tasks, that are
included into the off-line window schedule, originate.
Characteristics Off-line window proactive maintenance tasks usually comprise routine inspections,
static condition monitoring and discard tasks.
They require the opening up or partial disassembly of equipment and can therefore only
be done with the equipment in an off-line state.
They are usually conducted at an interval of between 1 and 12 weeks.
Design Criteria Off-line tasks are designed with the following objectives in mind:
• To achieve the efficient utilisation of resources.
• To make the best use of non-utilisation windows.
• Not to extend beyond the duration of the non-production windows.
To achieve these objectives, off-line window tasks are grouped into units that contain all
the equipment items that are in an off-line state for the duration of the window.
Step Action
1 Bring together on a list all the tasks that are applicable to the equipment that
is in the off-line state during each specific non-utilisation window in a plant.
Example:
Each of the three Reactors in the Bleach Plant is taken off-line for 3
hours every 4 weeks.
The list then contains all the proactive tasks that must be done to a
reactor at 4-weekly intervals.
Off-line shutdown programmed maintenance schedule
Introduction The off-line shutdown proactive maintenance schedule is a schedule of all the PM tasks
that cannot be done on-line or in the existing periods of non-utilisation of the equipment.
The plant has to be taken out of commission for a period of time while these tasks are
performed.
The Process The shaded block in the diagram below shows the two sources from which tasks that are
included into the off-line shutdown schedule can originate.
Characteristics Off-line shutdown proactive maintenance tasks usually comprise static condition
monitoring tasks and condition-based restoration and discard tasks. Some of these tasks,
such as pressure vessel tests, are also of a statutory nature.
Shutdown proactive maintenance schedules usually include work that would have been
done in off-line windows that immediately precede or succeed the shutdown.
Shutdown proactive maintenance is most applicable to continuous process plants found
in the chemical and paper industries.
A large amount of corrective work is usually included into a shutdown for proactive
maintenance.
They are usually conducted at an interval of between 12 and 52 weeks.
Design Criteria Most organisations incur a loss of revenue while a shutdown is in progress. For this
reason, shutdown proactive tasks are designed in such a manner that:
• The overall duration of the shutdown is minimised.
• Management are at all times aware of the current progress with regard to the
shutdown.
• Logistic requirements are well specified and managed to prevent delays, especially
on the critical path.
To achieve these objectives, off-line shutdown tasks are grouped at plant level, i.e. all
the equipment in the affected plant that is in an off-line state for the duration of the
shutdown.
Step Action
1 Bring together on a list all the proactive tasks that are applicable to the
equipment that is in the off-line state during the shutdown.
Example:
The Bleach Plant is shutdown for 3 days once every 12 months.
2 Sub-divide this list of tasks by trade.
3 Develop routines per equipment item per trade.
4 Identify each routine or schedule developed in this way by means of a unique
Document Identification Reference and register this document into your
database for control purposes.
5 Create a planned task for each routine in your CMMS but do not schedule
these tasks on the pegboard.
6 Allocate the Document Identification Reference of the schedule to the
planned task.
7 Analyse and estimate the duration, man-hours and spares requirements of
each task that will be conducted during the non-utilisation window.
8 Identify the precedence relationships between the tasks and develop a
master network for the shutdown.
9 Perform a critical path analysis for the shutdown and report the outcome to
senior maintenance and operations management.
10 Register a master task to represent all the tasks that will be done during the
plant shutdown into the CMMS.
11 Allocate the master task identifier to each of the tasks that will be done during
the non-utilisation window that is represented by the master task.
12 Schedule the master task in the CMMS based on the estimated start date
and time of the shutdown.
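The precedence network and critical path analysis in the procedure above can be sketched as follows. This is a minimal forward-pass calculation, assuming an acyclic network with positive task durations and no resource limits. The task names and durations are hypothetical.

```python
def critical_path(tasks):
    """tasks maps name -> (duration_hours, [predecessor names]).

    Returns (shutdown duration in hours, list of task names on the critical path).
    """
    finish, via = {}, {}

    def earliest_finish(name):
        if name not in finish:
            duration, predecessors = tasks[name]
            start, latest_pred = 0, None
            for pred in predecessors:
                pred_finish = earliest_finish(pred)
                if pred_finish > start:
                    start, latest_pred = pred_finish, pred
            finish[name] = start + duration
            via[name] = latest_pred  # predecessor that determines the start
        return finish[name]

    last = max(tasks, key=earliest_finish)
    path = [last]
    while via[path[-1]] is not None:
        path.append(via[path[-1]])
    return finish[last], path[::-1]


# Hypothetical master network for a shutdown.
shutdown_network = {
    "isolate plant": (4, []),
    "open vessel": (6, ["isolate plant"]),
    "inspect vessel": (8, ["open vessel"]),
    "replace pump seals": (5, ["isolate plant"]),
    "close and test": (6, ["inspect vessel", "replace pump seals"]),
}

duration_hours, path = critical_path(shutdown_network)
```

Tasks on `path` are the ones where a logistics delay extends the shutdown, which is why they deserve the closest attention when reporting to senior maintenance and operations management.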
Shutdown Schedule
Shutdowns are rarely scheduled at fixed intervals or at a fixed start date and time. One or
more of the following factors can influence the starting point of a shutdown.
Factor Reason
Market demand The effect of the shutdown on revenue can be reduced if it is conducted in a period of low demand for the product or service of a plant.
Current performance If the performance of the plant in terms of reliability, output, quality or operating efficiency has deteriorated or is deteriorating, a shutdown could be justified to restore performance. This will be the case if the increase in performance achieved after the shutdown will recover the loss incurred during the shutdown.
Major failure It could be a wise decision to initiate the shutdown because the duration of the major failure that stops the plant is comparable to the duration of the shutdown that was due in the near future.
Extension of non-utilisation windows Shutdowns are often scheduled to commence with a non-utilisation window. This again reduces the overall impact.
Industrial action Strikes and stay-aways are good opportunities to stage a shutdown. Unfortunately there is little prior warning and the work force may picket the establishment, making it impossible for shutdown workers and contractors to gain access to the premises.
Corrective maintenance guidelines for critical equipment
Introduction The failure modes effects and consequences analyses will reveal some failure modes that
cannot be prevented by the execution of PM tasks. It is furthermore inevitable that despite
our efforts, failures will occur, making corrective maintenance necessary.
The process The shaded block in the diagram below shows that the need for corrective maintenance
guidelines originates from the process of the identification of appropriate PM tasks.
Objective of CBM
Condition-based maintenance or condition monitoring tasks have one main objective and
that is to enable you to deal effectively with the consequences of a failure by:
• Enabling you to identify the point of potential failure so that you can use the lead time to
prepare and deal with the consequences of that failure. Dealing with the failure in
most cases involves performing a corrective maintenance task.
Objective of failure finding
Failure finding tasks have one main objective and that is to reduce the risk of the
consequences of suffering a multiple failure by:
• Identifying a failure of a protective function so that it can be restored by the execution
of a corrective maintenance task.
Proactive approach
The approach followed by many organisations with regard to corrective maintenance is to
wait till the failure happens and then to deal with it.
This reactive approach is not necessary if you consider the information that is at your
disposal.
• The failure modes analysis, when done properly, will identify each failure mode that
has a reasonable chance of occurring.
• The part that is subject to the failure mode is identified from the bill of material.
• The corrective repair procedure, and the resources needed to perform the task can be
established.
The point being made here is that the only unknown factor about the corrective task is the
point in time at which it will be done.
Procedure Corrective maintenance guidelines must be developed for criticality class A and B
equipment in the establishment so that you can deal effectively with these failures when
they occur by ensuring that the logistic requirements for each corrective task have been
planned. Use the following procedure as a guide.
Step Action
1 Prepare a list of all functional failures that require corrective maintenance work per
plant address reference.
2 Develop appropriate routines (task specifications) to correct each failure according
to the skill required.
3 Identify each routine by means of a unique Document Identification Reference and
register this routine as a controlled document in your CMMS database.
4 Register a corrective task in the CMMS. Do not schedule the task.
5 Allocate the Document Identification Reference of the routine to the task in the
CMMS.
6 Do the logistics planning for the task.
Benefits The result of performing the above procedure is a set of planned corrective maintenance
tasks for each possible failure mode on Criticality Class A and B equipment. The benefits
of this approach are:
• Any logistic requirements such as a specific skill, facility or special tool are made
visible and action can be taken to correct deficiencies in the current resource pool.
• The analysis of the work content that precedes the development of the task makes it
possible to estimate the duration and resulting consequences on the operational
capability of the plant.
• The repair procedure that is documented in the routine will enable the task to be
performed to the correct standard.
• The strategic spares requirement is a result of this process.
Section B
Aggressive defect reporting to feed the backlog
Overview
Introduction In this chapter we discuss the importance of getting early visibility of the outstanding
maintenance workload so that we can select jobs that need to be done during the
shutdown. We explore other sources for maintenance jobs and develop selection
criteria for shutdown jobs.
Topic
Notifications and defect reports
Notification process
Notification form
The impact of backlog
The positive effect of a healthy backlog
Opinion survey: Work priorities
Work priorities
Notifications and defect reports
Workload Visibility
One of the biggest problems that undermine our ability to manage the maintenance
workload is the lack of workload visibility. This is a situation where failures, potential
failures and defects (the outstanding workload or backlog) are not recorded in the
CMMS. This makes it difficult to plan a shutdown because the backlog must be
identified by means of personal interviews with the maintenance and operating
crew members, in the hope that they can remember what needs to be done.
Examples Below are examples of the kinds of things that you have to record and manage.
They are referred to as Notifications for the remainder of this section.
Requests
• Fabricate a ladder and stand to give the operator access to the top of the collection tank.
• Install an additional emergency stop button on the bucket elevator.
Potential failures
• An abnormal noise on the drive of the screw conveyor.
• The coupling rubbers show cracks.
• The oil level is below the ‘normal’ mark.
• The guard is loose.
Failures, defects
• The screw conveyor has tripped on overload.
• The gearbox is leaking oil.
• The drive end bearing of the screw conveyor has seized.
• The drive shaft oil seal of the gearbox has torn.
• The level control circuit has failed.
Deviations
• The SNPX dosing pump is over-dosing.
• The ring gear automatic lube system cycle time is set too long.
What is Recorded
Any condition that has potential consequences in terms of any of the following must be
recorded and managed by means of a formal system:
• Asset condition.
• Production / service capability or output.
• Product or service quality.
• Safety, health or environment.
• Costs or resource utilisation.
Why are these Conditions Recorded?
• So that the organisation can take timeous action to either eliminate or reduce the
consequences of failures.
• To achieve effective and efficient utilisation of maintenance resources through
appropriate prioritisation and planning of the outstanding workload.
• To have full visibility of the maintenance workload and backlog at all times.
Common Problems
Here are some of the problems that currently exist in organisations:
• Lack of common understanding: People have a tendency to only record
things when they have deteriorated to a degree that poses an immediate threat
to operations or safety. Potential failures tend to be ‘ignored’ or not formally
recorded, with the result that they eventually become emergencies that need
to be dealt with in an unplanned manner. This leads to an inefficient use of
maintenance resources and extended downtimes.
• Roman messenger syndrome: Equipment operators and maintainers that
diligently report defects on equipment are ‘punished’ for their efforts. People
that report potential failures should be applauded for their efforts. Those that
allow things to deteriorate to a point where they become emergencies should
be reprimanded. Unfortunately we do the opposite. The people that are
rewarded in organisations are often those that are the most adept at dealing
with emergencies. Those that are proactive and avoid emergencies tend to be
overlooked.
• First line crafts work allocation: Craftsmen are often allocated to a section of
the plant and are expected to keep it in "good running order". This policy often
ignores the fact that individuals tend to have their own definition of what is
"good running order". This problem is further compounded by the fact that
maintenance events are now managed in an informal manner.
• Duplication of systems: Most organisations have no standard means to
record defects. In all cases, even the primary document on which a defect is
recorded differs for each organisational area or period. This causes much
duplication of effort in procedure writing and training.
• Limited access to the notification system: Any person that is likely to
identify a potential failure, failure, defect or deviation should have immediate
unrestricted access to a simple system where these things can be recorded.
People should be applauded for reporting things and those that never do
should be reprimanded. Unfortunately we do the opposite.
Notification Process
Notification The diagram below represents a typical "best practice" Notification Management
process Process. You can adapt the detail of this process to suit your specific requirements.
[Diagram: Notification Process, with columns for Originator, Supervisor / Designate and Planning. Recoverable box text:
Originator: Complete notification and specify when required. Record notification in system, keep record for follow up and forward request.
Supervisor or Representative: Assess notification, priority, required action and response. Requester Priority ‘Immediate’: create an unplanned Job Card, cross-reference it to the notification, issue the Job Card to the artisan and follow up till complete. Requester Priority <> ‘Immediate’: authorise notification.
Planning: Record notification and Job Card in database for follow up. Assess notification and supervisor priority, assign WO status and logistics planning priority. Planning Priority = 1: plan logistics and issue job to supervisor as soon as possible. Planning Priority >= 2: plan logistics and assign to Master Schedule when capacity allows. Shutdown priority: assign job to ‘Shutdown’ backlog.]
Types of notifications
The diagram above shows that the priority that a requester assigns to a notification
determines the procedure that will be followed in the management of the notification.
The requester priority determines whether the notification will be managed as an
unplanned or planned notification.
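The routing rule described above can be sketched as a small function. This is one reading of the labels in the process diagram, not a definitive implementation: the function name, the parameter names and the action strings are assumptions.

```python
def route_notification(requester_priority, planning_priority=None, shutdown_job=False):
    """Return (stream, next action) for a notification.

    requester_priority: label chosen by the requester, e.g. "Immediate".
    planning_priority:  priority assigned by planning (1 is most urgent); hypothetical input.
    shutdown_job:       True if the work can only be done in a shutdown; hypothetical input.
    """
    if requester_priority == "Immediate":
        # Unplanned stream, handled directly by the supervisor.
        return ("unplanned", "create job card, issue to artisan and follow up till complete")
    if shutdown_job:
        return ("planned", "assign job to shutdown backlog")
    if planning_priority == 1:
        return ("planned", "plan logistics and issue job to supervisor as soon as possible")
    return ("planned", "plan logistics and assign to master schedule when capacity allows")


stream, action = route_notification("Next Week", planning_priority=2)
```

Only the first test depends on the requester. Everything after it is decided by the supervisor and planning, which is why the requester priority alone determines whether the notification is unplanned or planned.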
Notification form
Introduction All information regarding notifications must be captured in a consistent manner and
format irrespective of where the notification originates.
This will establish a common format for communication that will reduce
misunderstanding and misinterpretation. This will also standardise the training
required to get members of the maintenance, loss control and operations functions to
complete the notification correctly.
Dual purpose job card
Some organisations use the job card as a means of recording notifications. This is not
considered best practice for the following reasons:
• A job card's main function is to issue a job instruction to a technician. A
notification must be interpreted and converted to a job instruction before it
can become a job. Very often, there is a big difference between the
original notification and the actual job that gets done.
• Some notifications do not result in any maintenance work being done at
all. This is the case where the perceived defect was caused by a lack of
knowledge on the side of the equipment user.
• There is not a one to one relationship between a notification and the job
card. A single notification could result in a host of job cards.
• The creation of job cards must be limited to the planner and maintenance
supervisor as they are the people responsible for proper planning and
instructions to technicians. It is sometimes better that the technician
performing the task does not see the original notification, as it could be
misleading.
Design Criteria The notification form must be designed in such a way that the requester can supply
sufficient information to support analysis and proper planning.
The form can either be designed as an on-screen form in the CMMS or a paper form
with carbon copies that can be retained by the requester for follow up.
Each notification must furthermore be identified by a unique serial number that is
printed on the form, or generated by the CMMS.
Cross Reference The unique number assigned to the notification must also be cross-referenced to each
job card that originates from the notification.
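The numbering and cross-referencing rules above can be sketched as a small data model. This is only an illustration, not a CMMS schema: the class names, field names and the simple counters used to generate serial numbers are all assumptions made for the example.

```python
from dataclasses import dataclass, field
from itertools import count

# Hypothetical serial number generators; a real CMMS would issue these itself.
_notification_numbers = count(1)
_job_card_numbers = count(1)


@dataclass
class JobCard:
    number: int
    notification_number: int  # cross-reference back to the originating notification
    instruction: str


@dataclass
class Notification:
    description: str
    requester_priority: int
    number: int = field(default_factory=lambda: next(_notification_numbers))
    job_cards: list = field(default_factory=list)

    def raise_job_card(self, instruction):
        """Create a job card that carries this notification's unique number."""
        card = JobCard(next(_job_card_numbers), self.number, instruction)
        self.job_cards.append(card)
        return card


# One notification may lead to several job cards, or to none at all.
leak = Notification("Pump gland leaking", requester_priority=3)
leak.raise_job_card("Repack pump gland")
leak.raise_job_card("Clean and inspect bund area")
no_work = Notification("Conveyor will not start", requester_priority=2)
print(leak.number, [card.notification_number for card in leak.job_cards])
print(no_work.number, len(no_work.job_cards))
```

Keeping the notification and the job card as separate records is what allows one notification to produce many job cards, or none.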
Notification Form This is an example of a Notification form that can be adapted to suit your specific
requirements.
FAILURE OBSERVATION
Arced Blown Inaccurate Stopped
Bent Contaminated Jammed Noisy
Blocked Corroded Leaking Vibration
Burnt Cracked Loose Product Damage
Missing Overheated Torn Tripped
REQUESTER PRIORITY
1 Immediate 3 Next Week 4 Other Date
2 In 48 hrs 4 Next Outage Time
The impact of backlog
Work The diagram below illustrates the dynamics of the PM and corrective maintenance and
management
[Diagram: the work management cycle. Work flows through the steps Identify, Plan, Schedule, Allocate and Execute, and waits in a silo labelled "Work Load (Backlog)". Annotations on the diagram:
• Priorities, etc. determine how long this cycle takes.
• If this cycle gets too long then potential failures develop into breakdowns before we can deal with them.
• Breakdown work is done inefficiently and consumes more resources, which means less capacity.
• This in turn reduces the planned work capacity.
• Capacity determines the rate at which jobs exit.
• Jobs exit at a slower rate.]
The sources In the diagram above the backlog is depicted as a silo of work to which work is added by
of work in the programmed maintenance that becomes due and defect reports coming in that require
backlog corrective maintenance. These two are the sources of most of the outstanding work that
builds up in the backlog.
The impact of The pipe at the bottom of the diagram depicts how completed work exits from the silo.
priorities and The time that outstanding work spends in the backlog is a function of its priority and the
logistics capability of planning to obtain the logistics, schedule and allocate the work so that it
gets done. High priority jobs should exit first while lower priority work may stay in backlog
for a longer period of time.
What The rate at which jobs exit from the silo also depends on the capacity of the
happens maintenance work execution team to do planned work. When the team has to deal with
when the excessive breakdowns or emergency work then it means they have less capacity for
cycle gets too planned work. This planned work spends more time in the backlog so potential failures
long deteriorate and become failures that have to be dealt with as breakdowns.
More breakdowns and emergency work means more work is done inefficiently, so the
capacity for planned work is eroded and a vicious cycle develops.
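The vicious cycle described above can be illustrated with a toy simulation. This is a deliberately simplified sketch: jobs are handled first-in first-out (priorities are ignored), every job takes one unit of capacity when done as planned work, and a job that waits longer than `max_wait` weeks becomes a breakdown that costs `breakdown_cost` units. All names and numbers are hypothetical.

```python
from collections import deque


def simulate_backlog(weeks, arrivals, capacity, max_wait, breakdown_cost=3):
    """Return one (week, backlog_size, breakdowns, planned_done) tuple per week."""
    backlog = deque()  # each entry is the week in which the job entered the backlog
    history = []
    for week in range(weeks):
        backlog.extend([week] * arrivals)
        # Jobs that have waited too long deteriorate into breakdowns.
        breakdowns = 0
        while backlog and week - backlog[0] > max_wait:
            backlog.popleft()
            breakdowns += 1
        # Breakdowns are dealt with first and consume extra capacity.
        remaining = max(0, capacity - breakdowns * breakdown_cost)
        planned_done = min(remaining, len(backlog))
        for _ in range(planned_done):
            backlog.popleft()
        history.append((week, len(backlog), breakdowns, planned_done))
    return history


# Capacity slightly below the arrival rate: the backlog ages, breakdowns
# appear, and the capacity left for planned work collapses.
for row in simulate_backlog(weeks=16, arrivals=10, capacity=8, max_wait=2):
    print(row)
```

In this model a shortfall of only two jobs per week goes unnoticed for several weeks, after which breakdowns absorb all the available capacity.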
Shutdown Shutdown work remains in the backlog until it is completed. Shutdown work is however
work in reported separately because including it in the normal backlog reporting will give a
backlog skewed impression of the day to day backlog.
The positive effect of a healthy backlog
% Planned % Scheduled
Stable A healthy backlog enables an organisation to maintain a stable workforce that has
workforce sufficient planned work to ensure a high level of utilization and deliver quality work.
Reliability The equipment reliability benefits from having potential failures attended to before they
deteriorate to failures.
Programmed The programmed maintenance is also executed on time so deterioration is detected and
maintenance dealt with in a planned manner.
Opinion survey: Work priorities
In our organisation:
Work priorities
Objective The objective of this topic is to enable you to develop and implement a procedure to
manage the prioritisation of maintenance jobs consistently so that maintenance
resources are utilised effectively.
Prioritisation ranks jobs in terms of importance and, to a degree, the sequence in
which resources are allocated to a job.
What must you All maintenance jobs that are not emergencies, such as:
prioritise? • Corrective maintenance tasks.
• Routine preventative maintenance tasks.
• Safety, health and environment related tasks.
• Requests for additions and modifications.
Any job must be prioritised before it is put into the work-in-progress database of the
CMMS.
Why? The maintenance department does not have an unlimited supply of resources.
Resources are allocated to jobs that make the largest contribution to the goals of the
business first.
Common The following problems exist in organisations that do not follow the correct approach
Problems to prioritisation.
• Jobs are not prioritised and a "first in first out" approach is used.
• No distinction is made between requester and planning priorities.
• The priority of a job is determined by the rank of the requester.
• Job priorities are changed to accommodate changes to the job schedule.
• No formal guidelines exist for prioritising maintenance jobs.
The above deficiencies lead to the ineffective use of maintenance resources and
poor performance in the organisation.
Requester You must distinguish between requester and planning priorities in your organisation.
Priorities Below is a table of typical requester priorities assigned to work requests.
Requester Priority
Immediate /within 48 hours
Next week.
Next major outage.
Decision Table The table below shows how the factors that influence the priority of a maintenance
job can be graded and weighted to guide the allocation of priorities to maintenance
jobs.
Priority Decision Table
Factor                  Grading
Equipment Criticality   A            B           C
Consequences            Serious      Moderate    Minor
Type of Job             RM & S/H/E   Repair      Request
Requester Priority      Immediate    Next Week   Next outage
Weighting               3            2           1
Note
Requester Priority 1 does not feature in the grading. Jobs that require immediate
response (emergencies) are, by default, priority "0" and therefore classified as
"Unplanned". The above factors and grading only apply to "Planned" jobs routed to
the planning function.
Grading Table This table is a useful tool to determine the priority of a maintenance job using the
above factors, grading and weights.
Planning Priority
Weight   Priority
>9       1
8-9      2
5-7      3
=4       4
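The decision table and grading table above can be combined into a small calculation. The sketch below assumes that the total weight of a job is the sum of the weights of its four gradings, which gives totals from 4 to 12 and matches the bands in the grading table; the text does not state this explicitly. The dictionary keys and function names are illustrative only.

```python
# Weight of each grading, taken from the Priority Decision Table (3, 2, 1).
WEIGHTS = {
    "equipment_criticality": {"A": 3, "B": 2, "C": 1},
    "consequences": {"Serious": 3, "Moderate": 2, "Minor": 1},
    "type_of_job": {"RM & S/H/E": 3, "Repair": 2, "Request": 1},
    "requester_priority": {"Immediate": 3, "Next Week": 2, "Next outage": 1},
}


def total_weight(job):
    """Sum the weights of the job's grading on each factor (assumed method)."""
    return sum(WEIGHTS[factor][job[factor]] for factor in WEIGHTS)


def planning_priority(job):
    """Map the total weight to a planning priority using the grading table."""
    weight = total_weight(job)
    if weight > 9:
        return 1
    if weight >= 8:
        return 2
    if weight >= 5:
        return 3
    return 4  # the lowest possible total is 4


job = {
    "equipment_criticality": "A",
    "consequences": "Moderate",
    "type_of_job": "Repair",
    "requester_priority": "Next Week",
}
print(total_weight(job), planning_priority(job))
```

Emergencies (priority "0") never reach this calculation, because they are handled as unplanned work.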
Section C
Introduction In this section we discuss how work should be planned for efficiency and
effectiveness.
Topic
The benefits and outcomes of planning
Man
Method
Machine
Material
Measurement
Analysis and scoping introduction
Scoping and planning considerations
Logistic requirements
Assessment checklist
Detail planning
Planned work package
Introduction to estimating
Analytical estimates
Other estimating techniques
Comparative estimating
The benefits and outcomes of planning
Forward Planning is the technique of picturing ahead every step in a long series of separate
looking operations and so indicating that for each step the planning done is sufficient to cause it
to happen in the right place at the right time.
Complexity of The complexity of the planning function is due to the relative complexity of shutdown
shutdowns execution, which is a time of accelerated activity, with numerous vendors, contractors,
and heavy equipment engaged in multiple tasks in close quarters.
Three facets There are three major facets to “planning,” as it is loosely called: planning, scheduling,
and control. The prime goal of the shutdown planning process is to produce a detailed,
overall time-based plan—not merely a work list. Planning must deliver the following
Planning Planning is a make or break activity when it comes to shutdowns. Many organisations
achievements have paid a heavy price for neglecting this function. Below are some of the reasons
and
outcomes
The reason for planning Critical outcomes
Man
Introduction As far as maintenance automation has come, we’ll never eliminate the need for skilled
workers in the maintenance process. In fact, many manufacturers are finding a shortage
of highly-skilled maintainers with enough experience to perform today’s complex
maintenance procedures. In order to overcome this challenge, companies must focus on
maintaining the efficiency of their workforce if they hope to increase productivity.
Workers need focused, timely, and frequent training in an environment where quality and
efficiency are common goals.
Skills There are some tasks that anyone can learn to do, but is just anyone going to do those
tasks well? Of course not! Workers with the proper training, experience, and interests will
perform better than those who are just there to get a paycheck.
Finding the right workers and making sure they are properly trained and suited to their
role in the shutdown ensures that quality work will be delivered.
Experience No amount of training can replace a worker with years of experience. It is experience
that will help workers identify and resolve problems, suggest improvements to
processes, and maintain superior efficiency. This means that experienced workers are
best placed in roles in the shutdown where they can coach and have influence on the
people assigned to a team.
Self-discipline Completing maintenance tasks under pressure of time with a high level of accuracy
takes practice and discipline. Craftsmen and technicians must work as a team towards
the common goal of quality maintenance work on time. Without this level of commitment,
quality and efficiency suffer.
Institutional Quality maintenance work must be an institutional goal and a commitment shared by all
habits shutdown team members.
Method
Introduction A lot of the work done in shutdowns is done at a low frequency which means that the
technicians and craftsmen may have little or no experience in performing the work. The
work could also be complex and require a number of steps that have to be performed in
a specific sequence. For this type of work it may then be necessary to invest time in
writing procedures that ensure that the work will be carried out to the correct standard of
quality, even if the technician or craftsman may not have a lot of experience in doing the
work.
Below are some of the considerations that should be given to defining the work method.
Elements An effective procedure has completeness and accuracy, appropriate level of detail,
conciseness, consistent presentation, and administrative control.
Complete- Completeness and accuracy are difficult elements for the writer to accomplish and for the
ness and reviewer to evaluate. They:
accuracy • Depend on thorough research and analysis of the work during the procedure
development stage and a detailed review of the completed procedure by
knowledgeable and responsible plant staff before approval
• Ensure that the procedure’s goal is achieved and all conditions are satisfied
Complete- Completeness is not a function of the procedure’s length or level of detail. Rather, it is a
ness function of whether a procedure has enough information for the user to perform the task
safely and correctly. One way to test for completeness and accuracy is to have a typical
user simulate or perform tasks using the written procedure. This may be a dry run,
simulation, or actual use.
Level of detail The level of detail is based on the responsibilities, training, experience level, and
capabilities of the intended users. Level of detail also is determined by the criticality
and potential hazards of the work and ease or frequency of performance. Proper
level of detail contributes to ease of use and comprehension. Care should be taken
to ensure the procedure does not become cumbersome, thereby affecting its
effectiveness.
You have included the proper level of detail when the least experienced,
trained user can safely perform the procedure as written.
Conciseness Conciseness demands eliminating detail and language that do not contribute to
work performance, safety, or quality; include only “need-to-know,” and omit “nice-
to-know,” information. “Need to know” means just the information required to
safely and efficiently perform the task. For example, when measuring the internal
bore of a sleeve, the maintainer uses an internal micrometer.
It is ‘nice to know’ how the micrometer works.
Consistent This element ensures that the procedure is readily comprehensible. It demands the use of:
presentation • A consistent terminology for naming components and operations
• A standard, effective format and page layout
• A vocabulary and sentence structure suitable for the intended user
Format Procedure writing is straightforward if you prepare properly and follow a well-
thought-out and functional format. The format should guide the user to the final
goal or destination. Additionally, the format should guide the writer during the
development of the procedure. A procedure written using standard format is like a
road map:
When travelling to a new location, a traveller uses a road map:
• Before leaving, in order to know what to expect in terms of traffic and the
types of roads (city streets, local roads, or superhighways), and to get a
feel for the number of turns, the distance, and the estimated duration of
the journey.
• While driving, to check that the proper turns were made, to look ahead
for rest or fuel stops, and to estimate progress.
If travelling to a location again and again, the map may not be needed as often.
But if trouble arises (detours, construction, or a traffic jam) or if it has been some
time since the last trip, the map can be used to solve the problem or answer a
question. Procedures can be used in the same way.
Page layout Procedure formats vary according to user needs, acting to guide the reader
through the procedure to extract and use the information in an efficient manner.
The way you present the procedure steps and words on the page is important.
The user sees the overall layout before reading the individual steps or words.
Even if the procedure is well written and clear, the user may decide against
reading the procedure if the text is packed too densely on the page.
Shorter lines Research results demonstrate that it is easier to read and understand shorter lines
of text of text. This is because we tend to take in a few words at a time, moving our eyes
across the page in a jerking motion. In addition, a page with text laid out from
margin to margin looks intimidating, especially if the lines are closely spaced. This
is sometimes referred to as a “gray” page.
Open page An open page, with shorter, adequately spaced lines, is seen by the user to be
friendlier and easier to read. However, this may lead to a procedure with many
pages. In an effort to save paper or to reduce the number of pages, sometimes
the temptation is to use every available inch of the page.
• TURN ON lubrication system power switch on solid state controller.
• While adjusting air pressure using the air pressure regulator, PRESS and HOLD
MIST PRESSURE button on solid state controller keypad.
• When header pressure is 15 inches H2O, RELEASE PRESSURE button.
GRAY Page with text running margin to margin and no spacing between steps
Shorter Fewer pages do not necessarily result in a shorter procedure. Rather, this
procedure method results in a darker, more difficult-to-use document which may not even
be read. These two competing criteria of document length and page darkness
are in direct conflict and must be balanced.
An open, easy-to-read page is more important than the desire to shorten the
number of procedure pages.
• Line spacing and length are often a function of the font type or size.
Choosing a font type and size is a somewhat subjective human factor.
Rule of thumb As a rule of thumb, however, 12 point fonts are easily read under most lighting
conditions. Anything smaller than 8 point may be hard for most users to read. An
open style gives the page a professional look, makes information easier to find and
read, and helps to increase the users’ confidence in the procedure. A type of page
layout which effectively uses space and is easy to follow is the “T-format”. The
T-format divides the page into two columns which can vary in width depending on the
type of information you intend to put in each.
Technique For example, as shown in Figure 4-2, the narrower left column can be used to
identify the person performing the step. The wider, right column contains the
actions. Notice how the wider left margin results in shorter lines which can be
more easily read by the user. In the example, the procedure step is shorter since
the actor is not identified in each step. This technique may be used for writing
procedures when different persons or organizations (or “actors”) have
responsibility for the actions required to execute the procedure.
ACTOR        ACTION
Operator A   • CLOSE transfer valve V-123.
Operator B   • VERIFY V-123 is CLOSED on monitor 3-1.
             • PRESS START button to start charging sequence.
             • NOTIFY Operator A that charging sequence has started.
Operator A   • THROTTLE V-456 to maintain 120 psig on gauge 1-2.
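Where procedures are held electronically, a T-format page like the example above can be produced mechanically from a list of steps. The following is a minimal sketch, not a prescribed tool: the function name, the column width and the choice to print the actor only when it changes are assumptions made for illustration.

```python
def render_t_format(steps, left_header="ACTOR", right_header="ACTION", left_width=13):
    """Lay out (actor, action) pairs in two columns.

    The actor is printed only on the first of a run of consecutive steps,
    which keeps the left column uncluttered and the action lines short.
    """
    lines = [f"{left_header:<{left_width}}{right_header}"]
    previous = None
    for actor, action in steps:
        label = actor if actor != previous else ""
        lines.append(f"{label:<{left_width}}{action}")
        previous = actor
    return "\n".join(lines)


steps = [
    ("Operator A", "CLOSE transfer valve V-123."),
    ("Operator B", "VERIFY V-123 is CLOSED on monitor 3-1."),
    ("Operator B", "PRESS START button to start charging sequence."),
    ("Operator B", "NOTIFY Operator A that charging sequence has started."),
    ("Operator A", "THROTTLE V-456 to maintain 120 psig on gauge 1-2."),
]
print(render_t_format(steps))
```

The same function can lay out major operations instead of actors by passing `left_header="OPERATION"`.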
Major This format is also helpful when major operations occur in the procedure. T-format helps to
operations organize longer procedures into more easily handled topics or units. In the example shown
as Figure 4-3, the major operations are identified in the left column. This helps to guide the
readers to the proper step if they are interrupted or if the procedure is performed over a long
period of time.
OPERATION     ACTION
Preparation   • VERIFY valve V-111 is CLOSED.
              • OBSERVE ambient room temperature (gauge 3-3).
Charging      • OPEN discharge V123.
              • OPEN drum spigot.
Machine
Introduction Part of planning work is to determine what tools are necessary for the correct and timely
execution of the work.
Right tools As equipment becomes more complex, having the right tools to do the work becomes even
for the work more important. Some work that needs to be done in a shutdown may require
specialised tools. There is a difference between standard tools and specialised tools but
sometimes these differences are blurred.
Standard These are the basic tools that every individual within a craft should have in his or her
craft tools possession or toolbox, as it were. Think about a doctor and a stethoscope comes to
mind. It would be quite a surprise to encounter a doctor that does not have a
stethoscope in his possession. By the same standard you would expect a plumber to
have possession of a pipe wrench and an electrician a multi-meter.
Standard tool Some organisations develop a standard tool list for every type of trade, craft, technician
list or engineer. They then use the list to audit the tools that the people have in their
possession and the condition of the tools as a means to verify that people are properly
equipped to perform the work correctly. Not having the correct tool or using excessively
worn or damaged tools poses not only a risk to the quality of the work but could also
damage equipment.
Specialised Any tool that is required for the successful execution of the work that is not part of the
tools standard tool list for that specific craft is classified as a specialised tool. The resource
planner must determine from the detail plan whether the tool is available in the
specialised tool store and, if not, take the necessary steps to make it available at the time
it is required, either through purchase or through hire.
Conclusion Having the right tool for the work at hand goes a long way to ensure that work will not
suffer from poor quality due to a tool deficiency.
Material
Introduction The quality of a coat is only as good as the fabric it is made of!
Not having the right material and spare parts at the right time could significantly impact
the quality of the work.
Parts at risk It is unlikely that a problem will be encountered with parts that have a high turnover.
Problems are more likely to occur with parts that have a low turnover because of the
following:
• Due to the low turnover of the part it could lie on the shelf in the store for a long
time and nobody would know that it is incorrect.
• The part could deteriorate or suffer some form of damage while it is in storage
and nobody is the wiser for it.
Quality of There have been cases reported where counterfeit parts have been discovered after
parts they had failed. These were premature failures that were reported to the vendor who
after investigation discovered that the part was counterfeit.
Parts Supply chain workers do not always have the know-how that needs to go with the
damaged transport and storage of certain spare parts so parts are damaged prior to installation.
during
storage or
transport
Task bill of Resource or task planners leave critical items such as seals, retainers, thread
material locking compound, serrated washers and cotter pins off the task bill of material, with
the result that items that should have been used only once and discarded have to be
re-used.
What to do Some of the more obvious things that should be dealt with in the work planning and
preparation are to:
• Verify the serviceability, correctness and condition of low-turnover parts
well in advance and then reserve them by tagging for the specific task.
• Maintain regular communications with supply chain with regard to parts and
vendor issues and risks.
• Review and update task bill of materials to ensure consumable items are
included in the package and so avoid reuse of items intended for once-off use.
Measurement
Introduction To be able to perform in accordance with modern expectations for durability and
reliability, modern equipment is built to higher standards of tolerances and components
are generally stressed at higher levels.
Original Equipment owners, users and operators can only expect to achieve the same levels of
equipment durability, reliability and performance from the equipment if the same standard that went
standards into designing and building the equipment is applied during the maintenance of the
equipment.
What Below are examples of measurable standards that may be relevant to a maintenance
standards task:
• Fastening torque of bolts and nuts;
• Fastening sequence of bolts and nuts;
• Surface finish of machined or ground surfaces
• Alignment tolerance
• Clearance/interference tolerance
• Pre-load clearance
• Run-out tolerance
• Rotating unbalance
• Pressure
• Temperature
• Current
• Potential
• Resistance
• Viscosity
• Concentration
• Impedance
• Vibration
• Velocity
• Etc.
Task Any measurable standards that could have an impact on the quality of a maintenance
standards task must be specified on the works order or the task procedure that is attached to the
works order.
It is not best practice to specify ‘In accordance with OEM manual’ on the works order
unless the technicians in the workshop and the field have direct, 24/7 access to the OEM
manual that contains the standards.
Analysis and scoping introduction
Definition A plan, according to Webster, is a method for achieving an end. Note the word method in
the definition. This means that a plan will always include a method by which the end is
achieved.
Planned Job A planned job package is a job for which the detail and all logistic support elements
Library have been identified. These packages should be stored in a library of planned jobs
from where they can be reused. This is also known as a planned job template. This
library of templates should expand over time, as more and more jobs are planned.
The templates themselves should improve in accuracy every time they are used: as
actual data about the job is fed back, the template should be reviewed.
Corporate The library of planned jobs is a corporate resource and should be available to all the
Resource planning offices and departments. This is also something that should be properly
indexed and available on the intranet.
Tip.
The library should contain both generic and unique plans. A unique plan would be an
adaptation of a generic plan.
Example:
Generic plan: Overhaul wet-end of all-metal centrifugal pump.
Unique plan: Overhaul wet-end of Warman Type AH 12/10 pump.
Discussion To what degree do you use planned job templates? Which jobs that you currently do in
shutdowns could benefit from templating?
Scoping and planning considerations
Engineering Major jobs and modifications will usually require engineering. The planning of these
jobs must be delayed until the engineering has been done. It is therefore important
to identify jobs that require engineering as soon as possible to avoid the planning
being delayed at the end.
What needs to be Maintainers and operations people are notorious for their vague and hare-brained
done requests. They are likely to jump to conclusions very quickly about what they think the
problem is or what needs to be done. Half the battle is won and a lot of
misunderstanding avoided by accurately specifying what must be done!
Search the library Once the planner has established what really needs to be done he consults the
planned work or boilerplate library to determine if such a job perhaps already exists
either in generic or unique form. Should a unique template not be available, the
planner will have a discussion with the requester and visit the job site.
Site visit A site visit to clarify the requirements of a new job request should be carried out in the
company of the maintenance team leader or technician to clarify things like:
• Are there obstructions that may first need to be removed?
• Is the job symmetric in that the steps to remove the item can be reversed to
install the item?
• Where is the location and how much space is there around the equipment?
• Lifting points?
• Movement of spares, tools and support equipment in and out of the
workplace?
• Are there confined spaces involved?
• Lighting?
• Ventilation?
• Will there be people working above or below this job that may cause hazards?
• How close are services, for example power, water, compressed air, etc.?
Visualize the Form a mental picture of how the job will proceed in consultation with the technician
work and maintenance supervisor. Break the job down into main elements of about 15 – 45
minutes. Understanding the steps involved enables us to analyse the job method and
through this determine the types and numbers of skills and other resources that can
work on the job simultaneously and also estimate the duration of the job.
Logistic requirements
Definition Job logistic requirements are the things that are necessary for the execution of a job.
The unavailability of any of these elements can cause the execution of a job to be
delayed. It is therefore essential that all the job logistic requirements be identified.
Services like water, compressed air, 110 Volt power may be accessible at any point
on a site. If they are always on tap then they in themselves do not need to be
specified as requirements. However, if the person doing the job needs to run an air
hose from the point of supply to the job then the hose of a specified length and
connections at each end becomes a logistic requirement. Obviously, the planner does
not concern himself with the logistic requirements of jobs that will be contracted out at
fixed prices.
Skills and labour These are the number and time for which persons with appropriate skills will be
required for the job, for example coded welders, riggers, fitters, etc.
Spares The parts, material and consumables necessary to do the job. Determining which of
the items are stocked or direct purchases will also prove helpful in the planning. Items
need to be properly identified in terms of either part number, drawing number or
standardised description.
Example:
BEARING, deep groove ball SKF 6208 2RS
GASKET, John Thompson Drawing JTX164351
SET SCREW, M10x40mm, Hexagon head, Grade 8.8, Steel
Specialised tools Specialised tools are things that the average technician does not carry in his toolbox
and of which the number is limited.
Examples:
Heavy lifting equipment, portable lifting equipment, scaffolding, power hand tools,
impact wrenches, test and measuring equipment, NDT and condition monitoring
equipment, wheel pullers, hydraulic presses, handling equipment, ladders, temporary
lighting equipment, ventilation equipment as well as personal protective clothing that
may not be part of the standard issue.
Documentation There are very few jobs that cannot benefit from documentation. Documentation in the
form of drawings or maintenance procedures is essential on many jobs undertaken
during a shutdown. We need to distinguish between two types of documentation.
General Standards.
These are general documents that spell out certain standards that apply across the
board to all jobs that are performed and include things like:
• Alignment standards for couplings
• V-belt and chain drives
• Fasteners. The torque tables for all fasteners used on the site
Specified documentation.
These are documents that relate to the specific job, such as
maintenance procedures, engineering drawings, process and instrumentation
drawings, etc.
Assessment checklist
Checklist Below is a checklist that you can adapt to your requirements (from “Shutdowns and
Outages” by Joel Levitt).
Job Safety Safety must always be incorporated into the planning process. No job plan is
complete before due consideration has been given to things like:
• Pre-work safety inspections.
• Lock out.
• Fall protection anchor points.
• Fire watches.
• Fire fighting equipment.
• Ladder and scaffold use.
• Confined space entry.
The first option is to eliminate the hazard. Second best is to incorporate safety
actions into the job plan.
Discussion Are you satisfied that these issues are properly covered in your work planning?
Detail planning
Detailed task The table below lists some of the activities that are part of detailed job planning. The
planning sequence in which they appear is not necessarily the sequence in which they occur.
• Visualize people doing the job. Try to see how the work would proceed.
• Select and describe the best method to accomplish the job (consult with experts if necessary). This
“best method” can be an area of contention so if the job is an important element of the shutdown, a
round table discussion with experts (including some people who have done that job or similar jobs)
before starting planning, is desirable.
• Consider lockout, confined space entry, fall protection and all other safety factors.
• Determine the job sequence by specific and logical tasks or steps (again the most accurate job step
is between 15 minutes and 4 hours).
• On complex jobs, use PM software to determine the optimum schedule and coordination of crafts
and crews.
• Determine labour resource requirements including:
o Required trades
o Detailed trade sets if trade is not detailed enough (some crafts have sub-sets such as
instrumentation as a sub-set of electrician or high pressure vessel welding as a sub-set
of welder)
• List license requirements (code welder, asbestos removal).
• Sequence and timing of different trades (you don’t want people standing around waiting for their
piece of the job to be ready unless it makes sense).
• Establish labour-hours for each activity of the job sequence.
• Determine whether special or extra allowances are required for working at height, in heat or cold, near moving
machinery, etc.
• Determine whether contract resources are needed. This can be done at this stage or later, when all
the jobs are aggregated.
• Determine time needed from operations for cleaning, dewatering or decontamination (this type of
work should be done as part of the plant shutdown, but may need to be done for a specific work
area).
• Determine time needed for cooling down and heating up to make the asset accessible for work.
• Prepare the Bill of Materials for the job, listing necessary materials, and parts required. Be as
specific as possible with part numbers and quantities. Establish an acquisition plan. Determine which
items are in stock, and whether to make the part or buy it. For direct-order items (not authorised for
stocking) the planner prepares the associated Purchase Order Requests or material requisitions.
• Determine what in-house or outsourced pre-fabrication is needed. Create Work Orders (or purchase
orders) for fabrication. Get estimates for lead-time or look at actual lead-times for past jobs that were
similar. Consult with the prefabricator if indicated.
• Determine any rebuilding requirements. Create Work Orders for rebuilds. Source the rebuilding.
Prepare paperwork for the rebuilding if it will be outsourced and moved off-site.
• Identify and quantify bulk works (for example mass rebuilding of valves).
• Is large equipment needed? If rentals are needed, prepare Purchase Order Requests with Work
Order references for equipment rentals.
• Identify special tools and equipment required, including safety items. Identify where large numbers of
normal tools will be needed.
• Consider how to get parts, people and equipment to the job location, together with ladders,
scaffolding, rigging, cranes, and other heavy equipment. Consider if there will be storage problems
when the items are not in use. Is the ground safe for the loads imposed by heavy lifting equipment?
• Coordinate related work of other groups if important by preparing Cross Work Orders, or add unique
Tasks on the same WO (Work Order) if only minor support is needed.
• Consider disposal issues (for both liquid and solids, asbestos, intermediate products, oils, and other
contaminants).
• Estimate total cost in terms of labour, materials and external charges.
• Perform a risk analysis for what can go wrong. Could the job scope suddenly change? Could the job
uncover significant additional work? If so, create a contingency plan.
• Conduct a risk assessment or JSA. What is there about this job that presents a risk to the whole
shutdown, to the personnel, the public or the environment? Have a contingency section in the plan
where you outline the risk and if it cannot be eliminated how you plan to manage it.
Plan work package During work preparation, the planner assembles and documents all the above
planning efforts within a “Planned Job Package”. All factors that may delay or hinder
effective job completion should be anticipated and steps taken for their avoidance.
Different jobs would involve different levels of complexity and completeness in their
planned job packages. A minor job might only need a work order with a bill of material
and appropriate clearances. A major job might require 50 work orders and all the
elements shown below. The Planned Job Package could include:
• Job Planning Sheet with sequenced tasks detailed by trade and skill level. Contractor as well as
in-house resources must be included.
• Duration and labour-hour estimates for each task. For complex, critical and longer duration jobs a
Gantt chart should be prepared to show trade sequencing and simultaneous tasks.
• Lockout and tag-out, confined space entry, instructions that are not included on the work order.
• Outside specifications. Standards that apply to this job such as SABS, DIN, ISO, ANSI, SAE, etc.
• Job Safety Analysis describing hazards, safety requirements and PPE requirements.
• Other reference documents that are required such as prints, sketches, photos, specifications, sizes,
tolerances, engineering drawings.
• Requirements for decontamination, dewatering, cooling and removal of product.
• Operating, maintenance and other manuals or sections of manuals as required.
• Details of pre-shutdown fabrication. Estimates of the lead-time for the prefab work. The plan should
be to perform as much pre-shutdown fabrication and other preparation as possible.
• All clearances and all required permits, both legal and company, completed to the point of safe
feasibility. These documents might include flame permits, confined space entry, open line permits,
etc. The final lock outs are made by the responsible artisan and operator.
• A Bill of Materials including availability, commitments and staging location. The listing should
distinguish between authorised stock items, direct purchases, indirect purchases (contractor will
buy), in-house fabrication, and outside fabrication.
• Sign-off sheets for job completion from contractor, data entry, safety and others as appropriate.
Introduction to estimating
Introduction Work estimating is an integral part of maintenance management. In fact, the whole
concept of planning as applied to people depends on knowing how long jobs will take.
Reliable time estimates are the basis of all of the following maintenance management
activities:
• Planning the day-to-day workload of technicians.
• Calculating the outstanding workload on each trade.
• Determining how long schedules will take so that production downtime can be
planned.
• Assessing total Programmed maintenance workloads, so that peaks and
troughs can be smoothed out.
• Planning long-term resource requirements.
• Providing a yardstick against which the productivity of workshops can be
measured.
There is a host of work measurement techniques available, each with differing levels
of accuracy and applicability to maintenance work. However, it is important to place
the whole subject of work measurement in perspective.
This definition contains three key words that have a profound significance for work
measurement. They are:
• Competent Worker
• Specified Job
• Defined Level Of Performance.
Competent worker A time estimate or time standard is set for a worker who is suitably skilled to do the
job. It is not set for the super-skilled or very highly skilled worker (unless everyone
likely to do the job falls into that category). Equally, the time is not set for the least
skilled or least qualified person in the team.
It is important to visualise Mr. Average Skill when using any of the estimating
techniques, or to study as wide a range of people as possible when using any of the
measurement techniques that are based on direct observation.
Specified task It is obviously essential to have the clearest possible idea of exactly what is to be
done when determining how long a job should take. This is not usually a problem
when preparing time estimates for maintenance schedules, because schedules
consist of detailed step-by-step instructions. (In fact, if it is not possible to derive an
accurate and reliable time estimate for a maintenance schedule, there is usually
something wrong with the content of the schedule.)
In the case of non-Programmed maintenance jobs, it is sometimes difficult or
impossible to define the method with any degree of clarity until after the job has been
done. Even when the method is known in general terms, all sorts of unforeseen
problems (such as stuck bolts or missing parts) can interfere with the job. However,
even under these conditions, it is still possible to employ techniques that provide
statistically acceptable estimates in "ball park" terms.
Estimates can also go wrong when the scope of work changes dramatically. The initial
scope to change the idler rollers on a conveyor morphs into replacing the idler frames
and the rollers plus the deck plates! This scope change dictates that a new estimate
is required.
Defined level of performance The last key concept concerns the speed (or rate) at which the
technician will perform the task in question. Obviously different people will work at
different speeds. If work measurement is to be done sensibly, it is essential to define some
sort of yardstick and set some sort of standards regarding working speed.
Technicians in a maintenance department obviously do not work at the same speed
as workers on an assembly line and the planner will take this into consideration in the
estimating process.
Estimates Skilled planners and trades people can look at a job, mentally compare it with jobs
accomplished in the past, and come up with an estimate. We make estimates all the
time. Estimates are different from guesses in that they are based on experience. They
suffer from the same problems as guesses in that they vary in accuracy with the
experience and skills of the estimator.
Great mechanics often make terrible estimators. The skill to do work does not always
translate to the skill to estimate the time needed to do the job. There are several aids
to estimating, including estimating tables.
Some estimates are based on some quantity, as when a steel fabricator estimates
based on a number of kilograms of steel or an excavator estimates based on cubic
metres of excavation (also called scaling).
Analytical estimates
Definition Analytical estimating looks at the work content of each job at the job element level,
and an estimate is made for each element. Allowances are then added for travel, fatigue (after a
number of hours) and working conditions such as working at height.
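The definition above can be sketched in a few lines of Python. This is only an illustration: the element names, times and allowance percentages are hypothetical, and treating fatigue and working conditions as percentages of the basic time is an assumption, not a rule taken from these notes.

```python
from dataclasses import dataclass


@dataclass
class Element:
    """One job element with its estimated basic time in hours."""
    description: str
    hours: float


def analytical_estimate(elements, travel_hours=0.0, fatigue_pct=0.0, conditions_pct=0.0):
    """Sum the element times, then add travel time and percentage allowances."""
    basic = sum(e.hours for e in elements)
    # Assumption: allowances are applied as a percentage of the basic time.
    allowance = basic * (fatigue_pct + conditions_pct) / 100.0
    return basic + allowance + travel_hours


# Hypothetical job broken down into elements.
elements = [
    Element("Isolate and lock out pump", 0.5),
    Element("Remove coupling guard and coupling", 1.0),
    Element("Replace mechanical seal", 2.5),
    Element("Realign and test run", 1.0),
]
estimate = analytical_estimate(elements, travel_hours=0.5, fatigue_pct=10, conditions_pct=5)
print(f"{estimate:.2f} labour-hours")
```

Keeping the elements separate from the allowances makes it easy to re-use the same element times when the same job is done under different working conditions.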
Other Estimating Techniques
Historical estimates The labour-hours charged against work or individual jobs are recorded and
accumulated. They are then averaged after elimination of skewed highs and lows.
The resultant averages reflect size and condition of the facility, condition of the
equipment, skill level of the maintenance work force and current state of job
preparation and materials support. Because the work order system is the source of
historical averages, it is often difficult to obtain reliable data on which to calculate the
averages. In too many CMMS implementations, work content is charged to whichever
work order is handy, not necessarily the right one. Another disadvantage is that the
standards include all the lost time typical of that plant.
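A minimal sketch of the averaging step follows. The notes do not say how "skewed highs and lows" are identified, so dropping a fixed number of the highest and lowest records (a trimmed mean) is an assumption, and the recorded hours are made-up values.

```python
def historical_estimate(hours, trim=1):
    """Average recorded labour-hours after dropping the `trim` highest and lowest values."""
    if len(hours) <= 2 * trim:
        raise ValueError("not enough history to trim")
    kept = sorted(hours)[trim:len(hours) - trim]
    return sum(kept) / len(kept)


# Hypothetical labour-hours booked against the same job on past work orders.
recorded = [4.0, 4.5, 5.0, 4.2, 12.0, 1.0, 4.8]
print(f"{historical_estimate(recorded):.1f} labour-hours")
```

The result is only as good as the bookings behind it: hours charged to the wrong work order distort the average, as the paragraph above warns.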
Universal maintenance standards Predetermined motion times, time study, and standard data evolved
into Universal Maintenance Standards (UMS) and even these have faded from common usage.
Although they are the most accurate method, such standards are too time-consuming
and expensive to set up as well as maintain. Each job is studied by an industrial
engineer and analysed for best practices. Only consider this technique if you have to
change thousands of filters, rebuild thousands of valves, or perform some production-like
job. Because of these problems, the standards are not generally recommended.
Construction trade estimates Construction trade estimates are developed for contractors to use when
bidding on construction jobs. While they are not recommended for in-house use for maintenance,
they can be used as guides. There are some specific estimating sources for the
chemical industry that are better than others.
The reason for the lack of standard methods is that construction is not among our
most efficient industries and the estimates reflect that level of efficiency. The
standards also include engineering safety factors in the interests of the bidder, and
most relate to construction jobs rather than maintenance jobs. If this is the only guide
you have access to, use it by all means. One idea is to compare the estimated time
for a few known smaller jobs to the actual time taken for those jobs. Use the ratio
between the two on unknown jobs to make the published estimate more relevant to
your conditions.
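The ratio idea can be illustrated as follows. The job figures are hypothetical, and using the ratio of total actual hours to total published hours (rather than an average of per-job ratios) is one possible choice, not something the notes prescribe.

```python
def calibration_ratio(known_jobs):
    """Ratio of total actual hours to total published hours for jobs with known history."""
    published = sum(p for p, _ in known_jobs)
    actual = sum(a for _, a in known_jobs)
    return actual / published


# Hypothetical (published estimate, actual hours) pairs for small, well-known jobs.
known_jobs = [(2.0, 2.6), (4.0, 5.0), (1.5, 1.8)]
ratio = calibration_ratio(known_jobs)

# Apply the ratio to the published estimate of a job with no local history.
published_unknown = 8.0
adjusted = published_unknown * ratio
print(f"ratio {ratio:.2f}, adjusted estimate {adjusted:.1f} hours")
```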
Flat rate manuals Some industries like automotive and mobile equipment have repair times, techniques
and steps that are well documented. When you buy a truck, for an extra $100 or so
you can get a manual of repair steps, times and other useful information. Consult the
manufacturers of the equipment or components for information on trucks, construction
equipment, seals, bearings, etc.
These times tend to be very well thought through but they assume a level of tooling
and work conditions that you might not be able to duplicate. The OEMs use the flat
rate manuals to reimburse dealers for warranty repairs or to charge customers for
work done. You can be sure the times are quick (maybe too quick if you have older,
dirty, or rusty equipment). Some dealers pay their mechanics the flat rate time multiplied
by their pay rate for all the jobs they complete that week.
Engineered performance standards These standards are a special case of a flat rate manual that was
compiled by the US Navy in the 1970s. They employed an army of industrial engineers to time study all
maintenance work performed on Navy bases. The result was a library of manuals with
accurate estimates for work such as painting, carpentry and electrical work, even
standards for pier building!
Comparative Estimating
Introduction Data collected from reliable job history and times derived from analytical estimates
can be recorded in a spreadsheet of time ranges by trade or discipline. This
information is then readily available for future use. Instead of the planner having to
search through history to determine how long a job took in the past or having to re-
estimate a recurring job again and again, he or she can consult a table like the one
illustrated below.
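[Table: comparative estimating benchmarks, with jobs slotted into time ranges by trade or discipline.
Equipment and categories named in the table: Reunert Mod 64 Screw Conveyor, Warman 10/12 Pump,
Rema Belt Conveyor, Ultco Bucket Elevator, Reactor, General Fitting. Benchmark jobs named in the
table: replace inlet hose, replace rotary seal, replace spreader, replace exhaust box, replace sight
glass, replace top gasket, replace bottom gasket, replace rupture disc, set toolrest gap on
grindstone, train conveyor, replace shear pin.]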
Application The table above also creates a means by which jobs that have not been done before
can be estimated by making a comparison between the ‘new’ job and those for which
there already are benchmarks. This method is known as Comparative Estimating.
Section D
Budget for spare parts and make stocking decisions
Overview
Introduction The maintenance program is only as good as the availability of spare parts. Spares are
required to support the programmed maintenance as well as the corrective maintenance
Topic
Strategic spares requirements
Scheduling pegboard
Requirements planning
Requirements forecasting
Strategic Spares Requirements
Introduction The proactive maintenance schedule enables you to develop a requirements profile for
spares that will be consumed over time.
The spares requirements in terms of corrective maintenance can however not be forecast
in a deterministic way because the point in time at which the corrective task will occur is
not known.
Overall process The diagram below shows that strategic spares requirements are derived from corrective
task plans for critical equipment.
Cost Factors When a spare is to be available off the shelf, it implies that the item must usually be held
in the inventory. Inventory items have certain cost consequences that you must be aware
of:
Factor Reasoning
Cash flow The purchase of the item results in an outflow of cash unless the item
is held under a consignment stock agreement.
Interest If the item was purchased with borrowed money, interest will have to
be paid on the loan.
If the item was purchased using own funds, a loss of earned interest is
incurred.
Depreciation The value of the item will depreciate over time. If it is subject to a shelf
life, the risk of a total loss is even higher.
Storage Most inventory items require storage under roof on shelves. This
storage space and shelving is a capital outlay that does not earn
revenue and costs interest.
Administration The computer systems and administrative personnel that are
responsible for accounting for the inventory are an on-going cost
factor.
Stock maintenance Many inventory items require programmed maintenance while they are
in storage. The rotating parts of all mechanical equipment such as
pumps, gearboxes, electric motors should be turned from time to time
to prevent corrosion and brinelling of rolling element bearings.
Technology Technological advances to processes and equipment can result in the
modification or replacement of equipment in the plant. The result of this
is that some strategic spares may have to be scrapped.
Maintenance must give full consideration to all the direct and indirect costs of putting
spares into the inventory. This matters all the more because maintenance makes the
decision about what gets put in the inventory and must therefore be held accountable
for all these costs.
The following criteria can be used as a guide to determine if a spare should be held in
stock:
• If the corrective task that requires the spare originates from an on-condition task:
o And the lead-time from the point of potential failure to the actual failure is less than the
procurement lead-time of the spare, and the consequences suffered in the time between the
actual failure and receipt of the spare are greater than the purchase and holding cost of the
spare in one year, then place the spare in the inventory.
o And the lead-time from the point of potential failure to the actual failure is more than the
procurement lead-time of the spare, then do not place the item in inventory.
• If the corrective task that requires the spare does not originate from an on-condition task:
o And the consequences that the organisation can suffer in the procurement lead-time are more
than the purchase and holding cost of the spare over one year, then place the spare in inventory.
o And the consequences that the organisation can suffer in the procurement lead-time are less
than the purchase and holding cost of the spare over one year, then do not place the spare in
inventory.
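The stocking criteria above can be expressed as a small decision function. This is a sketch under stated assumptions: the function and parameter names are hypothetical, the caller supplies the consequence cost already expressed in money, and the cases the criteria leave open (equal lead-times, or short warning with low consequences) are treated here as "do not stock".

```python
def should_stock(on_condition, consequence_cost, annual_cost_of_spare,
                 pf_interval_days=None, procurement_lead_days=None):
    """Return True if the criteria say the spare should be placed in inventory.

    consequence_cost: cost of the consequences suffered while waiting for the spare.
    annual_cost_of_spare: purchase plus holding cost of the spare over one year.
    pf_interval_days: lead-time from potential failure to actual failure
                      (needed only for on-condition tasks).
    """
    if on_condition:
        if pf_interval_days is None or procurement_lead_days is None:
            raise ValueError("on-condition tasks need both lead-times")
        # Enough warning to buy the spare after the potential failure is detected.
        # Assumption: equal lead-times are treated as sufficient warning.
        if pf_interval_days >= procurement_lead_days:
            return False
    # Otherwise stock only if the consequences outweigh one year of purchase and holding cost.
    return consequence_cost > annual_cost_of_spare


# Hypothetical cases.
print(should_stock(True, 50_000, 8_000, pf_interval_days=10, procurement_lead_days=42))
print(should_stock(True, 50_000, 8_000, pf_interval_days=60, procurement_lead_days=42))
print(should_stock(False, 5_000, 8_000))
```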
Request for Stock All strategic spares requirements must be communicated to the materials department by
means of a formal request that is motivated and authorised by the engineering manager.
Note:
• The Part Number of the item and the Inventory Master Reference number must
be recorded in the Bill of Material of the equipment to which the item applies.
This reference enables you to establish the population of the item.
• The Inventory Master Reference number must be allocated to the material requirements
budget of the planned corrective task that requires the item. This reference enables you to
establish the requirement for the item.
Scheduling pegboard
Introduction A schedule pegboard is a means by which the occurrence of proactive maintenance events
over a period of time is made visible.
Structure A manual version of a pegboard consists of a large board with vertical demarcations of
time and horizontal demarcations for equipment or tasks.
The most popular version displays:
• A list of equipment items or plant addresses that are the subject of proactive
maintenance tasks vertically down the left-hand column.
• A row of pegs or similar markers running horizontally across from each equipment
item or address to indicate occurrences of proactive maintenance tasks.
• A vertical line (usually a length of string) placed across the time demarcations to show the
current date.
Most computerised maintenance management systems can simulate the board described
above as a paper printout or on the computer monitor.
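As a rough illustration of how software might reproduce such a board, the sketch below prints one row per equipment item and one column per week. The equipment names, start weeks and intervals are hypothetical, and the text layout ("X" for a peg, "v" for the current-date string) is an assumption rather than the format of any particular CMMS.

```python
def peg_row(first_week, interval_weeks, weeks):
    """Return a string with 'X' in every week where the task falls due."""
    return "".join(
        "X" if w >= first_week and (w - first_week) % interval_weeks == 0 else "."
        for w in range(1, weeks + 1)
    )


def render_pegboard(tasks, weeks, current_week):
    """Render equipment down the left, weeks across, and a marker for the current week."""
    width = max(len(name) for name in tasks)
    # The 'v' plays the role of the length of string marking the current date.
    rows = [" " * (width + 1) + " " * (current_week - 1) + "v"]
    for name, (first_week, interval_weeks) in tasks.items():
        rows.append(f"{name:<{width}} {peg_row(first_week, interval_weeks, weeks)}")
    return "\n".join(rows)


# Hypothetical tasks: name -> (first due week, interval in weeks).
tasks = {"Pump P-101": (1, 4), "Conveyor C-02": (2, 6), "Gearbox G-7": (3, 12)}
print(render_pegboard(tasks, weeks=12, current_week=5))
```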
Example The diagram below shows the format of a typical schedule pegboard:
Forecasting Computerised systems have the added benefit today that they not only generate the job
cards according to the schedule but also forecast the requirements for materials and
resources based on the schedule.
Requirements planning
Introduction Requirements planning is the process whereby the maintenance function determines,
communicates and satisfies its requirement for maintenance support resources. These
support resources include the skills, facilities and special tools that are essential for the
execution of the maintenance programme.
Objective The objective of doing requirements planning is to ensure that the organisation's
investment in resources is aligned to the expected maintenance workload. The negative
impact of an over- or under-investment in maintenance resources on the organisation's
business performance is discussed in the table below.
Overall Process The diagram below shows that requirements planning originates from the scheduling of
proactive maintenance tasks.
What to plan The successful execution of a maintenance task is dependent on the availability of
certain support elements. The lack of any one of these elements can cause a
maintenance task to be delayed. These elements have already been discussed in the
logistics-planning topic.
Their impact in the operational phase of maintenance is emphasised below.
Element Reasoning
Spares Spares are a key element in the execution of the majority of
maintenance tasks. The right spare in the right quantity at the right
point in time is required.
Skills A technician with the appropriate skill is essential to
perform the task. The staffing level of the organisation must be
matched to the workload.
Facilities Some tasks require the use of specialised equipment such as mobile
cranes, heavy transport etc. that is not kept on site permanently and
must be hired from external sources.
Special tools Modern equipment needs a vast array of special tools for
maintenance. Consider the various condition monitoring and
electronic test equipment that is required.
Downtime The execution of off-line tasks requires that the equipment be made
available to maintenance. You need to be able to communicate to
operations what downtime will be incurred by the execution of off-line
maintenance tasks.
Costs All of the above have financial ramifications that impact on the
business performance and competitiveness of the organisation. The
requirement for funding must be communicated. The budget must be
a reflection of the workload and not an estimate of last year's costs
plus a percentage.
Communicate Requirements The maintenance function needs a means by which these requirements can be
identified and communicated to functions in the organisation that are responsible for their
provision.
Requirements forecasting
Deterministic forecasting The proactive maintenance schedule pegboard is able to support the
maintenance function to generate a deterministic forecast of requirements.
It is called a deterministic forecast because it originates from detail elements that are
accumulated into values or quantities for distinct periods of time.
The requirement for a particular resource or material item is based on scheduled
maintenance tasks that require that particular resource or material item.
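The accumulation described above can be sketched as a simple roll-up of scheduled task occurrences into totals per period and skill. The period labels, skills and hours are hypothetical, and representing each occurrence as a (period, skill, hours) tuple is an assumption made for illustration.

```python
from collections import defaultdict


def forecast_hours(scheduled_tasks):
    """Accumulate labour-hours per (period, skill) from scheduled task occurrences."""
    totals = defaultdict(float)
    for period, skill, hours in scheduled_tasks:
        totals[(period, skill)] += hours
    return dict(totals)


# Hypothetical occurrences taken from the proactive maintenance schedule.
scheduled_tasks = [
    ("M01", "Fitter", 6.0),
    ("M01", "Fitter", 2.5),
    ("M01", "Elect", 3.0),
    ("M02", "Fitter", 4.0),
]
forecast = forecast_hours(scheduled_tasks)
for (period, skill), hours in sorted(forecast.items()):
    print(period, skill, hours)
```

The same roll-up works for material items by replacing the skill with a part reference and the hours with a quantity.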
Selective forecasting The attributes that are assigned to task plans make it possible to extract
forecasts selectively against any management object such as:
• Plant areas.
• Work centres.
• Cost centres.
• Supervisors.
• Repair sections.
• Departments.
• Specific equipment.
• Plant.
Example: Man-hours per Skill The diagram below shows an example of a forecast of the man-hours per
skill for a 12-month period. This information can be used by management to
determine the expected workload in any period in the future.
Skill   Total    M1    M2    M3    M4    M5    M6    M7    M8    M9   M10   M11   M12
Fitter   8450   600   550   590  1200   900   650   700   750   800   600   660   450
Elect    5550   350   300   320   800   700   500   550   660   450   350   320   250
Instr    2885   220   250   235   440   350   210   190   250   150   230   260   100
Plater   3970   350   230   210   600   450   290   300   360   320   450   220   190
Total   20855  1520  1330  1355  3040  2400  1650  1740  2020  1720  1630  1460   990
Example: Jobs by Skill The diagram below shows an example of a forecast of the number of tasks per
skill for a 12-month period. The number of jobs per skill is a useful indicator of
the administrative and supervisory workload in the maintenance environment.
Skill   Total    M1    M2    M3    M4    M5    M6    M7    M8    M9   M10   M11   M12
Fitter   5105   400   366   393   600   200   410   586   500   500   400   450   300
Elect    4175   300   250   270   500   400   390   450   500   350   300   245   220
Instr    1465   110   125   115   220   175   105   115   125    75   115   130    55
Total   11638   860   796   823  1440   875   980  1231  1185   980   918   935   615
Budget The requirement forecast information could be used as a direct input to the
variable maintenance budget.
The staffing levels and the requirements for facilities, special tools, materials etc.
can all be derived from this forecast.
The overhead maintenance costs will however still have to be budgeted for by
conventional means.
Section E
Schedule maintenance to minimise operational downtime
Overview
Introduction In this section we discuss how to determine the capacity for planned work
Topic
Manpower capacity for planned work
Equipment non-utilisation profile
Weekly schedule objectives
Draft weekly schedule
Shutdown maintenance weekly schedules
Not allocated (delayed) priority 1 and 2 work
Weekly schedule: Review meeting
Weekly schedule and implementation
Smoothing the manpower requirement
Manpower capacity for planned work
Definition The manpower capacity for planned work is the quantity of work hours (man-hours)
that are available in a maintenance section for allocation to planned
work in a forthcoming period, i.e. one week.
Rationale The planner must know what the manpower capacity hours are for each
maintenance section so that the weekly schedule can be loaded with an
appropriate amount of work. The planner cannot develop an effective weekly
schedule without this knowledge.
Example Procedure The planner must obtain an estimate of capacity hours from each maintenance
supervisor at least once a week prior to the development of the weekly
schedule. The estimate must consider the factors demonstrated in the
example below. A section comprising fitters is used to explain the procedure.
Step How to calculate manpower capacity
1 Determine the manpower strength of the section.
Example: Manpower strength: 10 Fitters.
2 Multiply the number of crafts by the normal work hours per week.
Example: 10x46 hours = 460 Fitter hours.
This is defined as potential work hours.
3 Determine if any crafts will be in training, on leave and if there is a
regular pattern of absenteeism that you must consider.
Quantify these hours for each craft type and subtract from the potential
hours.
Example: One fitter will be on training for two days and another
will be away on a soccer tour for one day.
460-(1x9x2)-(1x9) = 433 Fitter hours.
This is defined as available work hours.
4 Calculate the productive available work hours. Do this by multiplying the
available hours by the historical productivity percentage of each type of
technician. The productivity percentage is the percentage of
available hours that are recovered to maintenance jobs. In the example
below the productivity percentage is 70%.
Example: 433 Fitter hours x 70% = 303 Fitter hours.
5 Finally, calculate the capacity hours by reducing the productive
available hours by the historic unplanned work and ad-hoc planned
work percentage. The unplanned work percentage is the ratio between
the hours recorded against unplanned and ad-hoc planned jobs and
the total hours recorded to all jobs. In the example below the
unplanned work percentage is 35%.
Example: 303 Fitter hours - (303 x 35%) = 197 Fitter hours.
The 197 capacity fitter hours represents the capacity of the weekly
schedule that you will develop for the fitters in the section.
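The five steps can be reproduced with a short calculation. The figures are those of the fitter example above (a 9-hour working day is implied by the absence calculation); the function name and the rounding to whole hours at steps 4 and 5 are assumptions chosen to match the worked numbers.

```python
def capacity_hours(strength, hours_per_week, absence_hours, productivity_pct, unplanned_pct):
    """Return (potential, available, productive, capacity) hours for one section."""
    potential = strength * hours_per_week                          # step 2
    available = potential - absence_hours                          # step 3
    productive = round(available * productivity_pct / 100)         # step 4
    capacity = round(productive * (1 - unplanned_pct / 100))       # step 5
    return potential, available, productive, capacity


# One fitter on training for two 9-hour days, another away for one 9-hour day.
absence = 1 * 9 * 2 + 1 * 9

potential, available, productive, capacity = capacity_hours(
    strength=10, hours_per_week=46, absence_hours=absence,
    productivity_pct=70, unplanned_pct=35,
)
print(potential, available, productive, capacity)
```

Because absences, productivity and unplanned work change from week to week, the inputs would be refreshed each time the weekly schedule is developed.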
Recalculation The planner must re-calculate the capacity work hours before he/she develops a
weekly schedule due to changes in the personnel environment.
Plannable Hours The diagram below illustrates how capacity (plannable) hours are derived:
[Diagram: Calendar Time broken down into Productive Hours, Unrecovered Hours and Plannable Hours.]
Equipment non-utilisation profile
Definition The equipment non-utilisation profile identifies the periods of time that equipment is not
used by operations in a selected period of time.
Rationale The planner must know what the non-utilisation profile of each item of equipment is so
that he/she can schedule an appropriate number of jobs that require the equipment to be
off-line into a period of non-utilisation.
One of the fundamental objectives of maintenance is to increase the duration of
equipment operating cycles. It therefore makes business sense to squeeze as many
maintenance jobs as possible into the same non-utilisation period. Maintenance must
exploit operations-initiated downtime for the purpose of performing maintenance
whenever possible.
Maintenance must only negotiate for maintenance downtime once all operations idle time
is fully utilised for maintenance.
Non-Utilisation The diagram below shows all the possible opportunities for maintenance that could arise.
The most cost effective opportunities are those created for Programmed maintenance and
planned operations idle time used by operations for cleaning, changeovers, etc.
Opportunities can also arise from unplanned operations stoppages.
[Diagram: Calendar Time divided into actual running time (utilisation), unplanned operations idle
time and maintenance.]
Weekly schedule objectives
Objective The objective of this topic is to enable you to develop a weekly schedule that will:
Why is it Necessary The maintenance environment is characterised by a lack of stability and constant
change to the workload.
This situation is comparable to a marksman trying to hit a ghost. The target, besides
constantly changing shape and size, never presents itself long enough to get an effective
shot or even evaluate if the target was hit.
The weekly schedule overcomes these problems by enabling the maintenance function to
freeze the target, muster the correct firepower, apply sustained fire and then evaluate the
effectiveness of the volley.
The weekly schedule makes a significant contribution to maintenance effectiveness
especially by reducing the high priority backlog and improving the return on the
maintenance resource investment.
Common problems Organisations that do not develop and use weekly schedules that fit the above principles
have a tendency to:
• Use resources ineffectively.
• Do very little real planning.
• Be reactive in their overall approach to maintenance.
• Incur delays in the execution of maintenance jobs.
• Work in disharmony with operations due to conflicting priorities.
• Have, from time to time, too much or not enough work relative to their available manpower.
Preparation The following policies and procedures must be implemented in your organisation before
you can implement a weekly schedule.
• Notification management.
• Maintenance Job Prioritisation.
• Job Visibility Structures.
• Work Order Control Process.
• Job Status and Progress Management.
The process The diagram below illustrates the stages in the process of developing a weekly
schedule:
[Diagram: stages in developing the weekly schedule. Surviving labels: Workforce and Contractors,
Operations, Schedule.]
Draft weekly schedule
Definition The draft weekly schedule is a preliminary list of maintenance jobs that you wish to
allocate to a weekly schedule.
Rationale The planner needs a basis for discussion for the meeting with the maintenance supervisor
and operations representative to develop and approve the weekly schedule.
The draft weekly schedule narrows down the options and allows the meeting to focus on
the significant jobs and issues.
The better the quality of the draft weekly schedule, the more productive the meeting and
the sooner it will achieve consensus.
Deadline The planner must finish the preparation of the draft weekly schedule by midday on
Thursday. If it is not ready by this time, he will not have sufficient time to complete the
weekly schedule process.
Example Procedure The planner will use a procedure similar to the example below to develop the first
draft of the weekly schedule:
Principles The following principles must be applied to the development of the weekly schedule:
• All priority 1 and 2 jobs that have a status planning complete', must be allocated to the
weekly schedule regardless of whether there are sufficient capacity hours.
• The backlog must be kept below 2 to 4 crew weeks.
• A percentage of the capacity hours should be set aside for low priority jobs. Operations
confidence in maintenance will be seriously harmed if priority 3 and 4 jobs are allowed
to remain in backlog indefinitely.
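The first and third principles above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Job` fields, the function name and the `low_priority_share` parameter (the fraction of capacity hours set aside for priority 3 and 4 jobs) are hypothetical and do not come from any particular CMMS.

```python
from dataclasses import dataclass


@dataclass
class Job:
    job_id: str
    priority: int   # 1 (highest) to 4 (lowest)
    status: str     # e.g. "planning complete"
    hours: float    # estimated man-hours


def build_draft_schedule(jobs, capacity_hours, low_priority_share=0.2):
    """Return the jobs proposed for the draft weekly schedule.

    low_priority_share is an assumed, site-specific fraction of the
    capacity hours reserved for priority 3 and 4 jobs.
    """
    ready = [j for j in jobs if j.status == "planning complete"]
    # All ready priority 1 and 2 jobs are allocated, even if this
    # exceeds the capacity hours.
    draft = [j for j in ready if j.priority <= 2]
    # Fill the reserved hours with low priority jobs, priority 3 first.
    reserve = capacity_hours * low_priority_share
    used = 0.0
    for job in sorted((j for j in ready if j.priority > 2), key=lambda j: j.priority):
        if used + job.hours <= reserve:
            draft.append(job)
            used += job.hours
    return draft
```

The reserve is deliberately independent of the hours taken by priority 1 and 2 jobs, so that low priority work is not squeezed out every week.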
Example The example weekly schedule in the diagram displays the key fields that are needed for
the discussion.
Master Schedule
Planner: P Pesci | Supervisor: Fred Fraser | Week: 26
Plant: Bleach Reactor | Skill: Fitters
Columns: Job ID | Description | Status | Address | Offline | Earliest Start | Priority | Hours | Duration
Shutdown maintenance and weekly schedules
Definition Maintenance shutdowns or projects are major maintenance events comprising multiple
activities that are performed in a predefined period of time on equipment or plant that,
under normal circumstances, will be unavailable for maintenance.
Differences Some organisations use a special policy and procedure for managing maintenance
shutdowns. Others disregard all existing procedures because they perceive a shutdown
as something totally unique and therefore something that must be managed in a different
way.
The only real difference lies in the level of activity. This higher level of activity results in an
increase in the capacity hours. Precedence linking of jobs should be normal practice and
not something that is reserved for maintenance projects.
Integration Shutdown jobs must be integrated into the outstanding work database because they are
outstanding jobs.
When a shut-down is due to commence in a given week, you must allocate the shut-down jobs to
the draft weekly schedules that are created per section, supervisor or discipline, just like any
other job.
Not allocated (delayed) priority 1 and 2 work
Definition These are high priority jobs that you cannot allocate to a weekly schedule in a given week
because of a lack of logistic requirements such as spares, materials, special tools, skills,
etc.
Rationale It is a principle that the planner must allocate all outstanding priority 1 and 2 jobs to the
weekly schedule for the forthcoming week. It is for this reason that the planner must focus
all his efforts on planning these jobs first.
An inability to do so is a reason for concern. The planner must report the deviation to
all parties concerned so that corrective action can be taken. The planner must present
proposals to resolve the problem at the forthcoming weekly schedule review meeting.
Query and Use the following example query to extract a list of priority 1 and 2 jobs that cannot be
Report allocated:
• Section: e.g. 'Bleach Plant' and
• Trade: e.g. 'Fitters' and
• Completion Progress: <100% and
• Priority: 1 or 2 and
• Job Status: 'Delayed'
Reason for Generate the report in such a manner that the reason for the delay is displayed. If no such
Inability capability exists in your CMMS, annotate the report so that the reason for the delay is
visible.
Retain the report for future reference.
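For a CMMS that can export the outstanding work list, the example query and the delay-reason report can be sketched as follows. The dictionary keys (`section`, `trade`, `completion_pct`, `priority`, `status`, `delay_reason`) are hypothetical field names chosen for the illustration; your CMMS will have its own.

```python
def delayed_high_priority_jobs(jobs, section, trade):
    """Select priority 1 and 2 jobs that are incomplete and 'Delayed'."""
    return [
        j for j in jobs
        if j["section"] == section
        and j["trade"] == trade
        and j["completion_pct"] < 100
        and j["priority"] in (1, 2)
        and j["status"] == "Delayed"
    ]


def format_report(rows):
    """One line per job, always showing the reason for the delay."""
    return [
        f'{j["job_id"]} (priority {j["priority"]}): '
        f'{j.get("delay_reason", "reason not recorded")}'
        for j in rows
    ]
```

A job without a recorded reason is printed as "reason not recorded", which flags the rows that the planner still has to annotate by hand.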
Weekly schedule: review meeting
Objective The objective of the review meeting is to achieve consensus with the maintenance
supervisor and a representative from the operations department on the contents of the
weekly schedule that will be released for the forthcoming week.
Rationale Participation and commitment are keys to the success of the weekly schedule. You
cannot expect operations and the maintenance supervisor to support the weekly schedule
if they are not given the opportunity to contribute and participate in the process.
The review meeting must be a routine event in the diary of the participants.
Integration It is important that all disciplines within a department or section are represented by their
maintenance supervisors and planners at the review meeting so that the preliminary
planning will enable the task plans of the various disciplines to be integrated within the
same schedule.
This is especially important where various disciplines have to work together on the same
equipment.
Example: The electrician can replace the electric motor slip rings while the fitter is
changing the oil seal on the gearbox of the reactor.
Supporting The planner must make meticulous preparations for the planning meeting. Create an
Documents agenda that lists the standard topics and any issues specific to that particular week.
Make copies of the following documents available to all parties at the meeting.
• A copy of last week's weekly schedule.
• Draft weekly schedule for each section represented at the meeting.
• Priority 1 and 2 jobs that are ‘Delayed’.
• Equipment Non-utilisation profile.
• Plannable work hours per section or supervisor.
Meeting Conduct the review meeting according to the following guideline. Remember to work for
Procedure consensus and commitment between all participants.
Step Action
1 Present the list of Delayed Priority 1 and 2 jobs to the meeting and discuss possible means of
resolving the logistic delays.
2 If: The delay can be resolved.
Then: Make notes on how the delay was resolved and mark the job for inclusion. Update or add the
Planned Start Date and Time.
If: The delay cannot be resolved.
Then: Leave the job unmarked on the Not-Allocated list.
3 Discuss each job on the Draft weekly schedule in broad terms with the maintenance
supervisor and the operations representative to achieve agreement in principle.
• Provisionally mark each job that the parties agree to in principle and then discuss all jobs
in detail and determine:
4 If: The job can be done while the equipment is in operation and the parties agreed to the job.
Then: Mark the job as accepted and update or add a planned start date and time.
If: There are jobs that can only be done while the equipment is off-line.
Then: Analyse the duration of each job and the critical path of hammocked jobs to determine if
the job duration is less than or equal to the non-operational window:
5 If: The job duration is less than or equal to the non-operational window of the equipment and
the parties agree to the job or jobs.
Then: Mark the job as accepted and update or add a planned start date and time.
If: The duration of the job is greater than the non-operational window of the equipment.
Then: Discuss the problem with the parties and stress that all priority 1 and 2 jobs must be done
in the forthcoming week.
6 If: Operations extend the non-operational window of the specific equipment to accommodate
the jobs.
Then: Mark the jobs as accepted, update the planned start dates and times. Edit the Equipment
Non-utilisation profile to reflect the extended non-operational window.
If: Operations are unwilling to extend the non-operational window for the specific equipment.
Then: Mark the job as not accepted.
Notes • The total estimated hours for the draft weekly schedule in Step 3, is always greater than
the capacity hours. This gives the meeting the option to eliminate some jobs from the
draft. The planner must manage this process carefully to prevent the total estimated
hours from becoming less than the capacity hours.
• If at the end of the meeting, the total estimated hours are still greater than the capacity
hours, the supervisor must implement overtime or the meeting must remove low priority
jobs.
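The accept or reject decisions in steps 4 to 6 of the meeting procedure can be written down as a small decision function. This is a sketch only: the function and parameter names are hypothetical, and it assumes that a job the parties did not agree to in principle is simply not accepted, which the procedure does not state explicitly.

```python
def review_job(agreed, needs_offline, duration_h=0.0, window_h=0.0,
               window_extended=False):
    """Return 'accepted' or 'not accepted' for one job on the draft schedule.

    agreed          -- parties agreed to the job in principle (step 3)
    needs_offline   -- job can only be done with the equipment off-line
    duration_h      -- job duration, or critical path of hammocked jobs, in hours
    window_h        -- non-operational window of the equipment, in hours
    window_extended -- operations agreed to extend the window (step 6)
    """
    if not agreed:
        return "not accepted"          # assumption, see the lead-in
    if not needs_offline:
        return "accepted"              # step 4: can be done in operation
    if duration_h <= window_h:
        return "accepted"              # step 5: fits the window
    # Step 6: only accepted if operations extend the window.
    return "accepted" if window_extended else "not accepted"
```

For every accepted job the planner still has to record the planned start date and time, and for step 6 the Equipment Non-utilisation profile must be edited as well; the function only captures the decision itself.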
Weekly schedule and implementation
Objective The consensus and commitment achieved at the planning meeting held between the
planner, maintenance supervisor and the operations representative must be implemented
and converted to action as soon as possible.
Source The planner uses the following documents to create a final weekly schedule.
Documents • Draft weekly schedule with comments added at the meeting.
• Equipment Non-Utilisation Profile with comments added at the meeting.
• List of Delayed (Priority 1 and 2) jobs with comments added at the meeting.
• Notes/minutes taken at the review meeting.
Example The planner uses a procedure similar to the example below to develop a final weekly
Procedure schedule:
Step Procedure
1 Find each job in the outstanding work database of the CMMS that was
accepted during the planning meeting and update the following fields:
• Planned Start Date: As recorded
• Planned Start Time: As recorded
• Equipment On-line Indicator: As recorded
• Job Status: "Allocated"
2 Refer to the notes made against jobs on the Non Allocated Job list that were
accepted and take applicable action to resolve the inabilities that caused the
jobs not to be allocated initially.
3 Use the following selection to generate a final weekly schedule.
• Planned Start Date: This Saturday + 7 Days
• Job Status: "Allocated"
Sort the report by Planned Start Date and Time.
4 Print the final weekly schedule, and check against the source documents that:
• Jobs have been allocated correctly.
• The total estimated and capacity hours are compatible.
Make appropriate corrections where necessary and re-print.
5 Make provision for the signature of the planner, maintenance supervisor and
operations representative.
6 Sign the final weekly schedule and circulate it to the maintenance supervisor
and the operations representative for their signed approval.
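Step 3 of the procedure above (selection and sort) could look like this if the outstanding work list is available as plain records. The field names are hypothetical, and the sketch assumes that "This Saturday + 7 Days" means the seven days starting on the given Saturday; adjust the window if your site defines the schedule week differently.

```python
from datetime import date, datetime, timedelta


def final_weekly_schedule(jobs, week_start):
    """Select 'Allocated' jobs planned to start in the 7 days from week_start.

    jobs       -- records with 'status' and a datetime 'planned_start'
    week_start -- a date, e.g. the Saturday on which the schedule week begins
    """
    week_end = week_start + timedelta(days=7)   # exclusive upper bound
    selected = [
        j for j in jobs
        if j["status"] == "Allocated"
        and week_start <= j["planned_start"].date() < week_end
    ]
    # Sort the report by planned start date and time.
    return sorted(selected, key=lambda j: j["planned_start"])
```

Because the job cards are printed with the same selection criteria, the same function can be reused to produce the list of job cards for the week.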
Status The approved weekly schedule is now the official baseline for planned work in the
forthcoming week. Copies of this document are circulated to all the participants and kept
on record in the planning office for reference purposes.
Job Cards The planner implements the weekly schedule by printing the job cards that represent each
job on the weekly schedule.
The planner uses the same selection criteria used to create the weekly schedule to print
the job cards for the forthcoming week. Issue the job cards and a copy of the weekly
schedule to the maintenance supervisor.
Smoothing the manpower requirement
Introduction One of the major benefits of utilising the weekly schedule is that it reduces the demand for
manpower.
Uneven The maintenance workload is characterised by periods of higher and lower demand. This
Demand creates a situation where over time, a maintenance section can be either perceived as
over or understaffed in terms of manpower depending on the current workload.
Smoothing The weekly schedule employs the use of priorities and maintenance resource planning
the techniques to smooth the peaks of high demand and fill the valleys of low workload
Requirement demand so that the overall demand for maintenance resources can be maintained at the
optimum level.
The diagram below illustrates the concept.
[Diagram: workload plotted against time, with an available manpower line. Surviving labels: Peak workload; Minimum workload; Higher Priority Jobs; Lower Priority Planned Jobs; Next week.]
Benefits and The above scheduling technique ensures the optimal utilisation of available manpower by
Control ensuring that resources are always allocated to higher priority work first. The need for
overtime or additional resources can only be justified when the backlog shows an increase
on a week-by-week basis. The planner must maintain a trend record of
the backlog as it may justify the need for additional resources. The inverse can of course
also be the case where the backlog shows a drop over time in which case it may highlight
an over-supply of manpower.
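A trend record of the backlog can be as simple as one figure per week. The sketch below classifies such a record; the function name and the rule used (a trend only counts if every week moves in the same direction) are assumptions made for the illustration, not a prescribed method.

```python
def backlog_trend(weekly_backlog_hours):
    """Classify a list of week-ending backlog man-hours, oldest first."""
    changes = [
        later - earlier
        for earlier, later in zip(weekly_backlog_hours, weekly_backlog_hours[1:])
    ]
    if changes and all(c > 0 for c in changes):
        return "rising"          # may justify overtime or additional resources
    if changes and all(c < 0 for c in changes):
        return "falling"         # may highlight an over-supply of manpower
    return "no clear trend"
```

In practice a planner would look at a longer period and tolerate the odd week against the trend; the strict rule here just keeps the example short.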
Section F
Use appropriate metrics to drive defect elimination
Overview
Introduction In this section we discuss the type of metrics that can be used to drive defect elimination
and manpower efficiency
Topic
Work management performance indicators (metrics)
Work management performance indicators (metrics)
Performance In the table below is an example of work planning and scheduling data that was collected
data in the period of a week
Performance The work management performance of the maintenance section was calculated in the
table below using the data provided in the table above
# | Performance indicator (calculated from data above) | Formula | Actual performance
1 | Hours booked to planned jobs % | B/A | 78.9%
2 | Schedule work achievement % | C/D | 53.6%
3 | Hours booked to scheduled work % | E/A | 62.5%
4 | Programmed Maintenance work schedule achievement | F/G | 95.0%
5 | Programmed Maintenance compared to total hours booked % | H/A | 29.3%
6 | Programmed Maintenance jobs that are overdue | J | 19
7 | Estimating Index | I/B | 95.0%
8 | Backlog man-hours work | K+L | 770
9 | Backlog workload in weeks | M/B | 3.5
10 | Actual labour utilisation % | A/N | 51.9%
11 | Planned labour utilisation % | D/N | 59.1%
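The formulas in the performance table can be applied to any week's data with a short function. In the sketch below the letters A to N are used exactly as in the table; what each letter stands for is defined in the data table, which is not reproduced here, so the dictionary keys and the result names are assumptions for illustration only.

```python
def work_management_kpis(d):
    """Compute the eleven indicators from a dict keyed by the letters 'A'..'N'."""

    def pct(numerator, denominator):
        return round(100.0 * numerator / denominator, 1)

    return {
        "hours_booked_to_planned_jobs_pct": pct(d["B"], d["A"]),
        "schedule_work_achievement_pct": pct(d["C"], d["D"]),
        "hours_booked_to_scheduled_work_pct": pct(d["E"], d["A"]),
        "programmed_maintenance_achievement_pct": pct(d["F"], d["G"]),
        "programmed_maintenance_vs_total_hours_pct": pct(d["H"], d["A"]),
        "programmed_maintenance_jobs_overdue": d["J"],
        "estimating_index_pct": pct(d["I"], d["B"]),
        "backlog_man_hours": d["K"] + d["L"],
        "backlog_weeks": round(d["M"] / d["B"], 1),
        "actual_labour_utilisation_pct": pct(d["A"], d["N"]),
        "planned_labour_utilisation_pct": pct(d["D"], d["N"]),
    }
```

Feeding the function the same letters every week gives a consistent record from which the trends in these indicators can be tracked.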
Section G
Work logistics and preparation
Overview
Introduction The maintenance organisation needs specific procedures to manage and control
the procurement and issue of spares, and maintenance work.
Topic
Logistic requirements
Spares procurement, issue and control
Planned work responsibilities
Unplanned work responsibilities
Planner and supervisor role distinctions
Logistic requirements
Definition Work logistic requirements are the things that are necessary for the execution of a
job. The unavailability of any of these elements can cause the execution of a job to be
delayed. It is therefore essential that all the job logistic requirements be identified.
Services like water, compressed air, 110 Volt power may be accessible at any point
on a site. If they are always on tap then they in themselves do not need to be
specified as requirements. However, if the person doing the job needs to run an air
hose from the point of supply to the job then the hose of a specified length and
connections at each end becomes a logistic requirement. Obviously, the planner does
not concern himself with the logistic requirements of jobs that will be contracted out at
fixed prices.
Skills and Labour The number of persons with appropriate skills and the time for which they will be
required for the job. Examples: coded welders, riggers, fitters, etc.
Spares The parts, material and consumables necessary to do the job. Determining which of
the items are stocked or direct purchases will also prove helpful in the planning. Items
need to be properly identified in terms of either part number, drawing number or
standardised description.
Example:
BEARING, deep groove ball SKF 6208 2RS
GASKET, John Thompson Drawing JTX164351
SET SCREW, M10x40mm, Hexagon head, Grade 8.8, Steel
Specialised Tools Specialised tools are things that the average technician does not carry in his toolbox
and of which the number is limited.
Examples:
Heavy lifting equipment, portable lifting equipment, scaffolding, power hand tools,
impact wrenches, test and measuring equipment, NDT and condition monitoring
equipment, wheel pullers, hydraulic presses, handling equipment, ladders, temporary
lighting equipment, ventilation equipment as well as personal protective clothing that
may not be part of the standard issue.
Documentation There are very few jobs that cannot benefit from documentation. Documentation in the
form of drawings or maintenance procedures is essential on many jobs undertaken
during a shutdown. We need to distinguish between two types of documentation.
General Standards.
These are general documents that spell out certain standards that apply across the
board to all jobs that are performed and include things like:
• Alignment standards for couplings
• V-belt and chain drives
• Fasteners. The torque tables for all fasteners used on the site
• V-belt tensioning
Specified documentation.
These are documents that need to be referred to for the specific job, such as
maintenance procedures, engineering drawings, process and instrumentation
drawings, etc.
Spares procurement, issue and control
Introduction The effectiveness of the work management process in general is dependent on the
availability of spares and services required for the work that must be performed.
It is important that the responsibility of all role players in this process is well defined so
that delays resulting from misunderstandings and poor performance in this area can be
prevented.
Spares for The following actions must be performed to ensure that the spares required for planned
Planned work will be available at the work place.
Work Planned work is, by definition, all work that originates from the planning function.
Action | Responsible
Identify and specify the spares, services and material requirements of all jobs in the outstanding planned workload. | Planner
Communicate requirement to materials by either booking current stock or generating requests to purchase. (All booking and purchasing to be referenced to specific jobs). | Planner
Draw all spares and materials from the store that are required for jobs on the weekly schedule and additional work that comes up in the forthcoming week. | Planner
Place all spares and materials in staging areas demarcated for each job. | Planner
Issue spares to technicians from staging area when required by the planned job. | Supervisor
Utilise required spares and return all unused spares to the correct staging area. | Skill
Return to stores all unused spares placed in the staging area. (Reference return to stores to job for correct credit). | Planner
Process The diagram below illustrates the stages in the process described in the above
table.
[Diagram: planned spares control process. Surviving labels: Outstanding Planned Workload; Master Schedule.]
Spares for Unplanned work is, by definition, all work that does not originate from the planning
Unplanned function.
Work
Action | Responsible
Identify and specify the spares, services and material requirements of all emergency (unplanned) jobs. | Supervisor
Draw current stock from stores or generate requests to purchase. (All withdrawals and purchasing to be referred to specific jobs). | Supervisor
Issue spares to trades directly when required by the job. | Supervisor
Utilise required spares and return all unused spares to the supervisor. | Skill
Return to stores all unused spares placed in the staging area. (Reference return to stores to job for correct credit). | Supervisor
Process The diagram below illustrates the stages in the unplanned spares control process
described in the above table.
[Diagram: unplanned spares control process. Surviving labels: Unplanned Work Request; Book Current Stock or Request to Purchase; Store; Draw spares as required; Job Cards.]
Planned work responsibilities
Introduction This topic describes the responsibility of planning, the maintenance supervisor and
technician in the planned work control process. The success of the process is
dependent on all the above functions performing the specific actions required.
Actions and The table describes the actions and responsibilities with regard to planned jobs issued
Responsibilities from the planning function.
Step | Action | Responsible
1 | Obtain and put into staging area all spares and material. | Planner
2 | Obtain all external services such as cranes and specialised equipment. | Planner
3 | Print all planned job cards and issue to supervisor. | Planner
4 | Do day-by-day allocation of jobs to trades using loading board. | Supervisor
5 | Coordinate with operations for opportunities to perform work and allocate jobs and trades accordingly. | Supervisor
6 | Supply special tools and facilities required for jobs. | Supervisor
7 | Control issue of spares from staging area or store. | Supervisor
8 | Move job cards to "in process" section of loading board when jobs commence. | Technician
9 | Supervise jobs, coordinate resources, and resolve delays. | Supervisor
10 | Fill in appropriate fields on job card and place in "complete" section of loading board when complete. | Technician
11 | Check each job card is completed correctly. | Supervisor
12 | Sign job off and send to planning for capture. | Supervisor
13 | Check job card details and record into CMMS. | Planner
14 | File hard copy of job card. | Planner
Unplanned work responsibilities
Introduction This topic describes the responsibility of planning, the maintenance supervisor
and the technician in the unplanned work control process. The success of the
process is dependent on all the above functions performing the specific actions
required.
Actions and The table describes the actions and responsibilities with regard to unplanned jobs, which
Responsibilities do not originate from the planning function.
Planner and supervisor role distinctions
Introduction It is clear from the preceding topics that there are very distinct differences between the
role of the planner and the supervisor. Organisations that are setting out to employ or
train either of the two must be aware of this distinction.
Tradition Traditionally very little distinction is made between the role of the planner and that of
the supervisor. Supervisors are usually selected from the ranks of technicians for their
strong technical or leadership qualities. The planner on the other hand seems to be
selected for secretarial or administrative qualities. The trend is also to appoint planners
at lower grades than supervisors. The reason for this debilitating situation is general
ignorance of the importance and benefits of maintenance planning with the resulting
low expectations of the planner role.
Ranking It is clear from what we have learnt about the role expectations of the planner that the
planner should be appointed at a level at least on par with the maintenance supervisor.
It is not a case of the one being more or less important than another. The fact is that
the roles are complementary, of equal importance and key to the success of the
maintenance function.
Characteristics The diagram below contrasts some of the characteristics of the two roles.
Planner Supervisor
Plan Organise
Proactive Reactive
Strategic Tactical
Back-office Front-office
Pull Push
Manage Lead
Stable Situational
Design Use
Prevent React
Premeditated Spontaneous
Guide Direct
Section J
Checklists and practical aspects of work quality control
Overview
Topic
Quality plan
Items that may require quality checklists
Example of a quality checklist
The case for boilerplate tasks
Examples of boilerplate
Quality plan
Quality plan A plan as to how and when “quality events” and “quality materials” are applied to a
shutdown
Quality control The implementation of the “quality events” in the “quality plan”
Quality Quality assurance (QA) is an umbrella term. It refers to the processes used within an
assurance organization to verify that deliverables are of acceptable quality and that they meet
the completeness and correctness criteria established. QA does not refer to specific
deliverables. The preparation of a “quality plan” for a shutdown, the development of
standards and the holding of a “quality event” are all part of QA.
What needs to be Typically what needs to be checked are the deliverables. They are sometimes called
checked or metrics, and their purpose is to specifically describe what is being measured and how
verified? it will be measured according to the quality control plan and process.
Developing Checklists provide a means to determine if the required steps in a process have been
checklists followed. As each step is completed, it is checked off the list. Checklists can be
activity-specific. Sometimes, organizations have standard checklists that they use for
shutdowns. You might also be able to obtain checklists from professional
associations. Remember that checklists are an output of this process but are a tool
and technique of the risk identification process, and are an input to quality control.
Items that may require quality checklists
Introduction Checklists are an important part of a quality control system. They need to be to the
point, specific about what needs to be checked, and must always specify the acceptable
standards so that deviations can be identified and reported.
Typical items that Listed below are some of the things that may require checklists during the shutdown
need checklists of a process plant:
• Flushing activities and system for condensate draining;
• Blinding/de-blinding;
• Water wash procedure;
• Passivation procedure;
• Chemical cleaning of equipment wherever possible prior to shutdown;
• Online cleaning of furnace tubes;
• Hot jobs;
• Flange assembly;
• Repair/replacement;
• Material check;
• Non-destructive testing (NDT), for example, radiography, dye penetrant
testing (DPT), ultrasonic flaw detection, hydrotest, positive material
identification (PMI), etc;
• Stress relieving (SR);
• Compliance with statutory inspection requirements;
• System of second check for completed jobs;
• Preservation of idle process plants/equipment as per standard procedure;
• Equipment handling procedure, for example, tube bundle, safety valve,
control valve, gasket, etc;
• Identification, collection, segregation, and tagging of all types of gaskets
Example of a quality checklist
Welding quality Below is an example of a quality checklist for welding. Note the acceptable standard
checklist column which gives the field supervisor / quality inspector a clear picture of what to
be on the lookout for and enables him or her to spot a deviation. It also acts as a
guideline for the executor of the task
Quality check hold point
Quality hold Where there are serious consequences of failure or risk of costly mistakes, a hold
points point is written into the job procedure. In a shutdown, the job can be broken up to
include a hold point. That means that the job cannot proceed until a specific
inspection or test has been performed. This is to ensure that the quality of the work
done up to that point is verified before the job continues to a next stage.
Example of a The diagram below shows activities in the network before and after the insertion of a
hold point Hold-point. The quality assurance work that is to be done at the hold point must
be carefully defined in terms of the standards that need to be inspected or tested for.
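Inserting a hold point into the job network amounts to placing a new activity between two linked activities, so that the successor cannot start until the inspection or test is done. A minimal sketch, assuming the network is held as a mapping from each activity to the activities that follow it (the function name and the activity names in the usage note are hypothetical):

```python
def insert_hold_point(successors, before, after, hold_name):
    """Return a new network with hold_name inserted between before and after.

    successors maps each activity to the list of activities that follow it.
    The original mapping is left unchanged.
    """
    if after not in successors.get(before, []):
        raise ValueError(f"{after!r} does not directly follow {before!r}")
    updated = {activity: list(nxt) for activity, nxt in successors.items()}
    # Redirect the link before -> after so that it passes through the hold point.
    updated[before] = [hold_name if a == after else a for a in updated[before]]
    updated[hold_name] = [after]
    return updated
```

For example, with a network in which "Weld nozzle" is followed by "Refit insulation", calling `insert_hold_point(network, "Weld nozzle", "Refit insulation", "HOLD: radiography")` makes the insulation job wait for the radiography hold point.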
The case for boilerplate tasks
What is "Boiler plate" originally referred to the rolled steel used to make water boilers.
boilerplate? In the field of printing, the term dates back to the early 1900s. From the 1890s onwards,
printing plates of text for widespread reproduction such as advertisements or syndicated
columns were cast or stamped in steel (instead of the much softer and less durable lead
alloys used otherwise). By analogy, these came to be known as 'boilerplates'.
Legal In contract law, the term "boilerplate language" describes the parts of a contract that are
boilerplate considered standard. A standard form contract or boilerplate contract is a contract
between two parties, where the terms and conditions of the contract are set by one of
the parties, and the other party has little or no ability to negotiate more favorable terms
and is thus placed in a "take it or leave it" position.
Repeatable The point taken from the printing and legal profession is the fact that much of what is
use written into contract documents today is standard blocks or paragraphs of text that occur
in documents. If we apply the same thinking to maintenance work instructions and
procedures then we see that the same principle can be applied.
High If we broke a refinery down to the sub component level, meaning for example electric
populations motors, valves, transmitters, gas detectors, heat exchangers, drive couplings, junction
of items boxes, sections of piping, refractories, stairways, pumps, actuators etc., we would find
that the same types of items are used in a large number of locations in the plant. If we took
this a step further and looked at different plants, in different countries and locations the
same would hold true. The same type of item appears repeatedly.
Maintainable For the lack of a better word we can label these items as ‘maintainable items’. These are
items are the items that perform a specific function that is required in the process of our business.
Lego blocks They are the ‘Lego’ blocks from which the plant is composed. Just like the real ‘Lego’ the
of industry blocks are standard and could be used in a variety of different plants.
Boilerplate Applying the boilerplate principle to maintainable items with some foresight means that we
and should create work procedures and PM tactics at the maintainable item level so that they
maintainable become boilerplate that we can apply again and again wherever that maintainable
items item appears in the plant. Developing maintenance work plans and procedures in this
manner would save a tremendous amount of time and effort during the planning process.
Bills of Everything related to the task, including the bill of material and special tools, should be
material for developed at the maintainable item level from where it can then be built into the
maintainable hierarchy of the equipment.
items
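The boilerplate idea can be illustrated with a small data structure: the task list is defined once per maintainable item type and then attached to every location where that item appears. The class and function names below are hypothetical, as are the equipment tags in the usage note; the three sample steps are taken from the small electric motor example in this course.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TaskStep:
    task: str
    acceptable_standard: str


# Defined once, at the maintainable item level.
SMALL_MOTOR_BOILERPLATE = (
    TaskStep("Verify that the fan cover is:", "In place, secure and free of damage"),
    TaskStep("Cooling fins are:", "Clear of debris"),
    TaskStep("Verify that the cable glands are:", "In place and secure"),
)


def apply_boilerplate(template, locations):
    """Attach a copy of the same task list to each location of the item."""
    return {location: list(template) for location in locations}
```

Calling `apply_boilerplate(SMALL_MOTOR_BOILERPLATE, ["P-101 motor", "P-102 motor"])` gives both locations the same steps, so a later improvement to the template only has to be made once.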
Examples of boilerplate
Task and Below is an example of a section of boilerplate task that has been created for a type of
acceptable small electrical motor
standard
Task | Acceptable standard
Verify that the fan cover is: | In place, secure and free of damage
Cooling fins are: | Clear of debris
Verify that the terminal box and lid is: | Free of holes, gaps or damage
Verify that the terminal box lid and cover screws are: | In place and secure
Verify that the motor mountings, (flange or foot) is: | Free of cracks, fractures or damage
Verify that the motor mounting bolts and nuts are: | In place and secure
Verify that the cable glands are: | In place and secure
Verify that the cable racks / supports are: | Secure and free of damage
Verify that cables are: | Routed and secured out of harm's way
Clean the motor | -
Verify that the motor terminal connections are: | Secure and free of discoloration
Verify that the earth terminal is: | Secure
Verify that the wiring insulation is: | Free of damage or discoloration
Verify that the terminal box interior is: | Free of significant rust or corrosion
Spray the interior of the terminal box with silicon spray | -
Verify that the terminal cover gasket / seal is: | Splash proof
Motor cable identification tag is: | Readable, secure and free of damage
Grease the DE and NDE bearings with two strokes of Total Multis EP2 grade grease using a foot pump | -
Examples of boilerplate continued
Tasks and Below is an example of a section of boilerplate task that has been created for a V-belt
acceptable drive
standards
continued
Task | Acceptable standard
Verify that the quantity of V-belts installed: | Matches the number of grooves in the pulley
Verify that the V-belts are: | Free of significant cracks, fraying or excessive wear
Verify that the drive pulley is: | Free of excessive wear, V-belts are not bottoming in the grooves
Verify that the driven pulley is: | Free of excessive wear, V-belts are not bottoming in the grooves
Verify that the drive pulley is: | Secure, locking device screws and key is in place and secure
Verify that the driven pulley is: | Secure, locking device screws and key is in place and secure
Adjust the V-belts tension as necessary | -
Verify that pulleys are: | In alignment with a straight edge
Verify that the V-belt guard is: | Secured with all bolts in place
Examples of boilerplate continued
Tasks without The example below does not clearly show the acceptable standards as in the previous
acceptable example
standards
# Task Acceptable standards
1 Cleaning of: shell internal
1.1 Shell internal
1.2 Tray internals
1.3 Process nozzles
1.4 Instrument connections
1.5 Orifice assembly: In over flash line
1.6 Coke trap
1.7 Demister
2 Repair/replacement (including hot jobs)
2.1 Shell
2.2 Shell lining
2.3 Tray support ring
2.4 Nozzle/RF Pad
2.5 Internal support structure
2.6 Tray segments
2.7 Down-comer
2.8 Replace demister pad
3 Thickness Survey
3.1 Shell
3.2 Tray intervals
4 Thoroughness of corrosion inhibitor / chemical dosing
5 Tray assembly
5.1 Alignment of tray
5.2 Valves, bubble cap, etc.
5.3 Clamp, washers, bolts
5.4 Downcomer
Seal-pan leak test
Examples of boilerplate continued
Approval
Boilerplate tasks should be reviewed prior to the commencement of the next shutdown
planning cycle to ensure that they are updated with the latest requirements and that any
issues experienced with their use are addressed
DAY 5
ROOT CAUSE OF FAILURE ANALYSIS
Overview
Introduction
In this chapter we introduce how unanticipated failures are identified, analysed and dealt
with in a manner that will ensure that they do not occur again
Section Description
A Failure reporting analysis and corrective system requirements
B Use failure data and Pareto analysis to identify and stratify improvement
D Types of evidence, preservation and use
E Organise the RCFA and apply the process
F Practical RCFA case study using an MS Excel based tool
G Review of failure forensic techniques
H Human factors
I Error management
Section A
Failure reporting analysis and corrective system requirements
Overview
Introduction
In this section we show how failure data related to unanticipated failures is collected,
stratified and analysed to identify the significant few
Topic
Failure reporting analysis and corrective action system introduction
Failure reporting
Responsibilities
Analysis methods
The FRACAS database
Minimum database requirements
Failure reporting, analysis and corrective action system introduction
Introduction
A strong Failure Reporting, Analysis, and Corrective Action System (FRACAS) is the
foundation of a good asset performance improvement effort. It provides the
business elements required to close the loop on Root Cause Failure Analysis (RCFA) and
Reliability Centered Maintenance (RCM) efforts. The FRACAS changes RCFA from what
is often a one-shot exercise into a managed program for systematically improving
equipment performance
FRACAS interfaces
This diagram illustrates how FRACAS feeds into other asset performance improvement
processes
Driven from the top
The FRACAS is an important system that requires management attention just
like any other. Purposeful management for success requires that the FRACAS
be driven from the top down through management policies and procedures to
ensure quality of effort and meaningful results
Policies
The first step in the development of the FRACAS is the establishment of
management policies for equipment and process reliability improvement that
include requirements for reporting, analyzing, and correcting system
failures. The policy statement should include a statement of purpose for the
FRACAS, a statement of personnel responsibilities at all levels, and a
description of the basic elements required in the FRACAS
Failure reporting
Failure reporting
Failures must be reported in ways that lend themselves to analysis with Reliability
Engineering tools such as Weibull Analysis, RCM, and Availability Simulation. The
best reporting schemes use individual failure modes as the basis for failure
reporting. Reporting schemes need to follow the hierarchical structure of the
equipment within the process
Failure modes
Failure modes describe the individual failed components of the maintainable item,
including a descriptor for what happened to the component. Failure modes are the
things that occur and cause the system to lose its ability to produce its desired
outputs
Developing failure modes
Failure modes are best developed using an orderly system that includes a functional
analysis of the equipment used in the process. Equipment is generally broken down
into a hierarchy that shows graphically how the facility is put together to achieve its
business output
Failure reporting continued
Functional breakdown
Failure modes and effects analysis (FMEA)
Failure Modes and Effects Analysis (FMEA) is perhaps the best way of developing
failure modes for inclusion in the FRACAS reporting system. It is an extremely
systematic way of looking at the functions of maintainable items to determine the
most likely causes of their loss of function. The causes of functional failure
are the equipment’s failure modes. A thorough FMEA that considers all the failure
modes present produces the most exact results, but may be too time consuming to
be of practical use in the everyday work environment. A useful group of failure
modes can be generated by developing a list of the most likely failure modes
using a functional breakdown of equipment. Development of the FMEA is best done
by a group of people who work with the equipment day-in and day-out. What is
important is to understand the functions of the equipment and what things break or
fail that cause the equipment to lose its function
Maintainable items
Maintainable items represent the lowest level of the facility hierarchy that can be
further broken down into components. Maintainable items have specific, well
definable functions that enable the system to produce its desired output. It is the loss
of the function of these items that leads to lost production, lost quality, safety issues,
environmental issues, and operational issues. The maintainable item level is where
we set maintenance tactics and strategies to keep system performance at desired
levels
Failure reporting continued
Functions
Functions define the reason for the existence of the maintainable items. Most
maintainable items have one or more primary functions and one or more
secondary functions. Functions describe what the maintainable item does, not
what it is. Functional statements need to be written in a way that makes it easy to
identify what the functional failure is. The best functional statements use everyday
language that we can all understand. Local jargon is acceptable as long as everyone who
uses the FMEA will understand what the jargon represents.
Failure modes
The individual failure modes that can cause the functional failures are then
identified and allocated to the specific functional failures
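The chain from maintainable item to function, functional failure and failure mode can be
sketched as a small data structure. This is only an illustration under our own assumptions:
the class names and the cooling water pump example are hypothetical and are not taken from
any particular FRACAS product.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class FunctionalFailure:
    """One way a function can be lost, with the failure modes allocated to it."""
    description: str
    failure_modes: List[str] = field(default_factory=list)


@dataclass
class Function:
    """What the maintainable item does (not what it is)."""
    statement: str
    primary: bool
    functional_failures: List[FunctionalFailure] = field(default_factory=list)


@dataclass
class MaintainableItem:
    name: str
    functions: List[Function] = field(default_factory=list)

    def reportable_failure_modes(self) -> List[str]:
        """Flat, de-duplicated pick list of failure modes for failure reporting."""
        seen: List[str] = []
        for fn in self.functions:
            for ff in fn.functional_failures:
                for mode in ff.failure_modes:
                    if mode not in seen:
                        seen.append(mode)
        return seen


# Hypothetical maintainable item; all names below are illustrative only.
pump = MaintainableItem(
    name="Cooling water pump",
    functions=[
        Function(
            statement="Transfer cooling water at the required flow rate",
            primary=True,
            functional_failures=[
                FunctionalFailure("Transfers no water",
                                  ["Coupling sheared", "Motor winding burnt out"]),
                FunctionalFailure("Transfers less than the required flow",
                                  ["Impeller worn"]),
            ],
        ),
        Function(
            statement="Contain the cooling water",
            primary=False,
            functional_failures=[
                FunctionalFailure("Leaks water", ["Mechanical seal worn"]),
            ],
        ),
    ],
)

print(pump.reportable_failure_modes())
```

The flat list is what a maintainer would pick from when closing out a failure report, while
the nested structure keeps each failure mode tied to the functional failure it causes.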
Responsibilities for the FRACAS
Facility manager
The facility manager is responsible for establishing policies that require the
development of the FRACAS. The facility manager provides the top-down
impetus for ensuring that everyone in the organization is focused on reporting,
analyzing, and correcting failures.
Program champion
The FRACAS program champion is responsible for developing the written
procedures needed to implement the program. The Champion provides upward and
downward communication of program policies, goals, and results. The Champion
has direct responsibility for ensuring that required training takes place, and that
each individual in the organization understands what his/her roles and goals are
within the FRACAS program.
Operations and maintenance managers
Successful development and use of the FRACAS depends on close cooperation
between the operations and maintenance managers within the organization.
Breakdowns in communication at this level often lead to significant reductions in
the benefits that can be achieved with a well implemented FRACAS. The tone of
communication between these two managers usually sets the tone of
communication between their subordinates.
Maintenance supervisors
Maintenance supervisors also play an important role in developing and sustaining
FRACAS efforts. They are responsible for ensuring that their maintenance
personnel take the necessary time to make sure that information about failed
components is correct, and is in line with the failure modes defined within the
FRACAS reporting system. Again, poor quality of information here will often lead
to poor final reports and information that is not very useful for predicting and
preventing future failures. Good failure reporting requires good communication
between the operations and maintenance supervisors.
Operators
Operators provide initial failure reports for the FRACAS. They need to understand
the importance of giving meaningful and accurate reports about the functional
failures they observe. Operators need to have a thorough understanding of the
maintainable items that are present in the system. It is not reasonable to expect
that operators will know or be able to determine what is causing the functional
failure. It is reasonable to expect that they will be able to describe the functional
failure in enough detail to aid maintainers in the troubleshooting process, and to
provide useful information to the FRACAS analyst.
Maintainers
Maintainers are in a position to have the greatest impact on the outcome of
FRACAS efforts. They are usually in the best position to determine which
components failed, and what happened to them. They may be in a position to
determine what caused the failure mode to occur, but it is not reasonable to
expect that they will be able to determine the cause of every failure mode. The
maintainer has very specific responsibilities that require enumeration.
Responsibilities for the FRACAS
Preserving evidence
The maintainer is usually the first one on the scene to have direct contact with the
failed components. It is his responsibility to document and record the condition of
the components as he finds them. The maintainer needs to be taught preservation
techniques, and how to record conditions around the component using words and
pictures. In no case should the maintainer attempt to clean or alter the condition of
the failed components. The maintainer should protect the evidence by covering it
loosely with some protection like plastic bags to prevent contamination from
outside sources
Recording conditions
The maintainer should record conditions around the failed component. The best
way is to take digital photos and write concise notes about what is found
Identifying likely causes and causal factors
The maintainer may be able to determine what caused the component to fail, as
well as some causal factors that may have led up to the failure. It is important to
allow the maintainer to say “I don’t know” at this point. Frequently the maintainer
will not be able to tell what caused the component to fail during an initial analysis
of the scene. In this case saying “I don’t know” is better than an unfounded guess
as to cause. Determining cause may require further examination by engineering
specialists such as metallurgists and people experienced in determining causes for
the failed components in question.
Failure analyst / reliability engineer
The failure analyst is responsible for screening initial failure reports to determine if
the reports are complete, and whether or not further analysis is required. The
analyst may order a Root Cause Failure Analysis (RCFA) depending on whether
or not the consequences of the failure warrant it. The analyst's determination to
order the RCFA should be driven by policy and guidelines written into the
FRACAS. The analyst is also responsible for ensuring that failure data is analyzed
using available analysis tools on a regular basis to determine whether there need
to be updates to the Preventive and Predictive Maintenance Program, RCFAs for
recurrent failure modes, or RCFAs for failure modes exhibiting infant failures
Analysis methods
Introduction
Well collected failure data allows the analyst to use a variety of analysis methods
to determine how to improve asset performance. A well trained analyst can use
Weibull Analysis, Reliability Centered Maintenance (RCM), Availability Simulation,
and Root Cause Failure Analysis (RCFA) to analyze the data and determine
solutions to asset performance problems.
Weibull analysis
Weibull Analysis, invented in the 1930s by Swedish-born Waloddi Weibull, has
become the statistical analysis method of choice for examining equipment failures.
The low number of data points required for making reasonable decisions, as well
as the ability to look at time-to-failure distributions to determine potential
maintenance tactics, give it substantial advantages over other forms of statistical
analysis for making asset management decisions.
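As an illustration of how few data points are needed, the sketch below fits a two-parameter
Weibull distribution to a handful of times to failure using median rank regression with
Bernard's approximation. It assumes complete (uncensored) data for a single failure mode;
the function name and the failure times are hypothetical, and dedicated reliability software
would normally be used in practice.

```python
import math


def fit_weibull_mrr(times_to_failure):
    """Estimate Weibull shape (beta) and scale (eta) by median rank regression.

    Assumes complete (uncensored) failure data for a single failure mode.
    """
    times = sorted(times_to_failure)
    n = len(times)
    if n < 2:
        raise ValueError("need at least two failure times")
    xs, ys = [], []
    for i, t in enumerate(times, start=1):
        median_rank = (i - 0.3) / (n + 0.4)  # Bernard's approximation
        xs.append(math.log(t))
        ys.append(math.log(-math.log(1.0 - median_rank)))
    # Ordinary least squares for the straight line y = beta * x + intercept
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    sxx = sum((x - x_mean) ** 2 for x in xs)
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    beta = sxy / sxx
    intercept = y_mean - beta * x_mean
    eta = math.exp(-intercept / beta)
    return beta, eta


# Hypothetical times to failure, in operating hours, for one failure mode.
hours = [410.0, 620.0, 790.0, 950.0, 1130.0, 1400.0]
beta, eta = fit_weibull_mrr(hours)
print(f"shape beta = {beta:.2f}, scale eta = {eta:.0f} h")
```

As a rule of thumb, a shape parameter below 1 points to infant failures, a value close to 1
to random failures, and a value above 1 to wear-out, which is what guides the choice of
maintenance tactic.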
RCM
RCM coupled with Availability Simulation allows the analyst to look at a wide
variety of potential maintenance tactics to determine which set of tactics can be
applied to equipment failures to achieve the best combination of profit, safety
criticality, environmental criticality, and operational criticality for meeting the goals
of the business. Availability Simulation changes maintenance decision making from
a day-to-day exercise into a strategic planning exercise which can look far into the
future of the assets.
Root cause failure analysis
RCFA is arguably the most powerful tool available for improving asset
performance. RCFA allows the organization to analyze and eliminate major
failures as well as the small recurring failures that chip away at company profits
each and every day. The FRACAS database is instrumental in ensuring that good
hard data is used to back up the potential causes for failure given during RCFA
exercises. The most important element in successful RCFA programs is the
reliance on hard facts rather than supposition by RCFA participants.
The FRACAS database
Introduction
The FRACAS database is the repository for all gathered failure information. It must
be developed in a way that allows easy entry of failure data, and easy retrieval of
failure data for analysis using the various methods previously described. The
database may take several forms depending on the size and sophistication of the
organization.
Forms of the database
The FRACAS database may take the form of a custom-built database for use in
small organizations, an off-the-shelf database for use across larger organizations,
or in some cases it may be integrated into the facility’s Computerized Maintenance
Management System (CMMS) or Enterprise Asset Management System (EAMS).
Custom built
Small companies or facilities may often opt to develop their own FRACAS
database due to the lack of funds and resources required for purchasing either
off-the-shelf packages or CMMS/EAMS packages. The advantage of this method is
low entry cost as well as development based on the specific needs of the
organization. It is usually maintained by a single dedicated individual. The major
drawback of this type of system is the inability to share and report data across a
larger user base
Off the shelf
There is a large variety of off-the-shelf FRACAS software packages available
today. They are usually more suitable for larger organizations. Most systems have
some form of analysis ability already built into them, and offer the ability to attach
external documents and pictures to enhance failure reporting and analysis. The
available systems can be used in LAN and WAN environments so that they can be
a global solution for a large company. Off-the-shelf systems require either totally
separate data entry, or some combination of separate data entry and import entry
from either a CMMS or an EAMS environment. In most cases the import data entry
is accomplished by exporting data from the CMMS/EAMS to an office product such
as Excel, and then importing the information into the FRACAS database. Most
providers of FRACAS software are constantly updating and improving the software,
and are open to changing the software based on direct inputs from their user base
Minimum database requirements
Introduction
As a minimum the FRACAS database must contain elements that allow the user to
analyze failures using Weibull Analysis, RCM, Availability Simulation, and RCFA.
The following list is meant to represent the absolute minimum requirements for the
database.
Equipment hierarchy
The database must contain the equipment hierarchy down to the maintainable
item level
Failure modes
Failure modes as described in section one should be in the database in a tabular
format. It is helpful if the failure modes are contained in failure mode groups to
minimize the list of failure modes to search when assigning the mode to a given
failure report
Date and time stamp
The exact date and time of the report must be saved so that successful Weibull
Analysis can be accomplished. The lack of specific times will impact the ability of
the analyst to determine exact times to failure for specific failure modes. As an
absolute minimum the date of the failure must be recorded
Failure description
The failure reporter must have the ability to describe what happened in his own
words, including the functional failure of the maintainable item.
Failure impact
The database must contain information about the business impact of the failure in
terms of cost, downtime, safety criticality, environmental criticality, and operational
criticality
Causal factors
Information about what may have caused the failure, or any causal factors that
may have led up to the failure, must be recorded. This information can be vital
when later analysis of the failures is performed.
RCFA follow up
Many organizations that undertake RCFA efforts fail to capitalize on the power of
RCFA because they are unable to close the loop on following up
recommendations. The FRACAS is an excellent place to keep information about
which failures require an RCFA, and who has organizational responsibility for
completing the implementation of RCFA recommendations.
Reporting capabilities
The FRACAS database should allow the analyst to produce a variety of textual
and graphical reports to aid in the analysis of failures. Reporting of Weibull data,
failure frequencies for various failure modes, and database structure are extremely
important.
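The minimum requirements listed above can be read as the fields of a single failure record.
The sketch below is one possible layout, not a prescribed schema: every class, field and
function name is our own assumption, as are the example reports. It also shows why the date
and time stamp matters, since times between failures fall straight out of it.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import List, Optional


@dataclass
class FailureImpact:
    cost: float                      # direct cost of the failure
    downtime_hours: float
    safety_critical: bool
    environment_critical: bool
    operations_critical: bool


@dataclass
class FailureReport:
    equipment_path: List[str]        # hierarchy down to the maintainable item
    failure_mode_group: str
    failure_mode: str
    reported_at: datetime            # exact date and time of the report
    description: str                 # reporter's own words, incl. functional failure
    impact: FailureImpact
    causal_factors: List[str] = field(default_factory=list)
    rcfa_required: bool = False
    rcfa_owner: Optional[str] = None  # who follows up the RCFA recommendations


def hours_between_failures(reports: List[FailureReport], failure_mode: str) -> List[float]:
    """Times between successive reports of one failure mode, in hours."""
    stamps = sorted(r.reported_at for r in reports if r.failure_mode == failure_mode)
    return [(b - a).total_seconds() / 3600.0 for a, b in zip(stamps, stamps[1:])]


# Hypothetical reports for illustration only.
path = ["Plant 1", "Cooling system", "Pump P-101"]
reports = [
    FailureReport(path, "Seals", "Mechanical seal worn", datetime(2021, 6, 1, 8, 0),
                  "Pump leaking water at the shaft",
                  FailureImpact(1200.0, 3.5, False, False, True)),
    FailureReport(path, "Seals", "Mechanical seal worn", datetime(2021, 6, 3, 8, 0),
                  "Pump leaking water at the shaft again",
                  FailureImpact(1500.0, 4.0, False, False, True),
                  causal_factors=["Seal fitted without alignment check"],
                  rcfa_required=True, rcfa_owner="Reliability engineer"),
]

print(hours_between_failures(reports, "Mechanical seal worn"))
```

A list of times between failures like this is the input that a Weibull analysis needs.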
Section B
Use failure data and Pareto analysis to identify and stratify improvement
Overview
Introduction
In this section we show how chronic failures sometimes have a bigger impact on the
wellbeing of an organisation than intermittent failures
Topic
Characteristics of a sporadic event
Characteristics of a chronic event
The Pareto principle
How to perform a Pareto analysis
Quantify losses
Justify RCFA on the basis of ROI
Characteristics of a sporadic event
Introduction
A sporadic or intermittent failure event is known as such because it rarely happens,
but when it does happen it has a high impact on the business or society.
High impact
Due to the high impact, serious damage and significant losses that are incurred,
these events usually attract a lot of attention in the organisation and sometimes,
unfortunately, in the media as well
Media attention
The prevalence of smartphones and social media such as Facebook, Instagram and
WhatsApp enables people to broadcast far and wide the details of events that an
organisation may have wanted to keep to itself, with resulting damage to the
reputation of the organisation
Characteristics of a chronic event
Introduction
Chronic or persistent failure events have a high frequency of occurrence but a low impact
per occurrence
Low impact
Due to the low impact of the individual events, chronic events are likely to go
unnoticed.
Management response
There is usually no management response to chronic events because they do
not feature in any report. They happen so frequently that they have become
part of the landscape and are considered simply to be the cost of doing business.
Chronic versus sporadic event
Cumulative effect
Consider an event that shuts down or slows down the production output of a
company, on average three times in a twelve-hour shift, for five minutes at a
time.
Downtime in a day: 24/12*3*5/60 = 0.5 hours
Downtime in a year: 365.25*0.5 = 183 hours (rounded), or about 8 days
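The same arithmetic can be written out step by step so that the shift length, number of
stops and stop duration are easy to change for another plant. The variable names are our
own; the input values are the ones used in the example above.

```python
SHIFT_HOURS = 12          # length of one shift
STOPS_PER_SHIFT = 3       # average number of short stops per shift
MINUTES_PER_STOP = 5      # duration of each stop

shifts_per_day = 24 / SHIFT_HOURS
downtime_hours_per_day = shifts_per_day * STOPS_PER_SHIFT * MINUTES_PER_STOP / 60
downtime_hours_per_year = 365.25 * downtime_hours_per_day
downtime_days_per_year = downtime_hours_per_year / 24

print(f"Downtime in a day:  {downtime_hours_per_day:.1f} hours")
print(f"Downtime in a year: {downtime_hours_per_year:.1f} hours "
      f"({downtime_days_per_year:.1f} days)")
```

The unrounded results are 182.6 hours and 7.6 days per year, which the example rounds up to
183 hours and 8 days.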
Impact on the business
It is quite certain that management would assemble a board of inquiry and hire
consultants to get to the bottom of any singular event that shuts down the plant for
8 days.
Chronic failure losses could be having a higher impact on the business than a
sporadic event.
Problems versus opportunities
Sporadic events become problems the moment they occur and have to be dealt
with immediately.
Chronic events are opportunities because we may not even know about them
and they do not have to be dealt with immediately.
The real problem
The real problem with chronic events is that they lead to complacency and a
lowering of standards, even though chronic events could be the precursor to a sporadic
event. All it takes is one more causal factor to enter the equation and a chronic
event becomes a sporadic event with dire consequences. This was highlighted in
the space shuttle Challenger inquiry when the investigation discovered that the
O-rings that failed on that fateful day had been a chronic problem. A lower ambient
temperature at the time of launch was the only additional causal factor that was
required to make a chronic event a sporadic event
The Pareto principle
Introduction
The Pareto principle (also known as the 80–20 rule, the law of the vital few, and
the principle of factor sparsity) states that, for many events, roughly 80% of the
effects come from 20% of the causes.
Origin
Management consultant Joseph M. Juran suggested the principle and named it
after Italian economist Vilfredo Pareto, who, while at the University of Lausanne in
1896, published his first work, "Cours d'économie politique".
Empirical findings
Pareto showed through observation and the collection of empirical data that:
In investment: 20% of the invested input is responsible for 80% of the results obtained.
In his garden: 20% of the pea pods contained 80% of the peas.
In sales: 20% of customers contributed 80% of the revenue.
In farm ownership in Italy: 20% of property owners owned 80% of the farm land.
Application to failure modes
Organisations with large asset bases can generate vast amounts of chronic failure
data. The challenge is to find out, amongst all the data, which types of
events are causing most of the failures so that we can focus our attention on
those that are causing most of the losses.
This is done by performing a Pareto analysis, which is based on the Pareto principle.
Data required
As we have seen from the FRACAS database requirements, the database must
contain information about the business impact of the failure in terms of cost,
downtime, safety criticality, environmental criticality, and operational criticality. The
data easiest to come by in the author’s experience is downtime data,
because this is usually captured by operations in a database for the
purpose or can be extracted from the SCADA system.
The problem is that the details of the failure are usually recorded in the CMMS
from works order data while the downtime data sits with operations, so bringing
these two together into one set of data takes a bit of work.
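One way to bring the two sources together is to match each downtime event to a works order
raised on the same equipment at about the same time. The sketch below assumes that both
extracts carry an equipment identifier and time stamps; the field names, records and the
one-hour tolerance are hypothetical and would need to be adapted to the actual CMMS and
downtime log.

```python
from datetime import datetime, timedelta

# Hypothetical extracts: field names and values are illustrative only.
work_orders = [  # from the CMMS
    {"wo": "WO-1001", "equipment": "DT-07",
     "raised": datetime(2021, 3, 2, 9, 15), "failure_mode": "Piping"},
    {"wo": "WO-1002", "equipment": "DT-11",
     "raised": datetime(2021, 3, 4, 14, 0), "failure_mode": "Compressor"},
]
downtime_events = [  # from the operations downtime log or SCADA
    {"equipment": "DT-07",
     "start": datetime(2021, 3, 2, 8, 50), "end": datetime(2021, 3, 2, 11, 20)},
    {"equipment": "DT-11",
     "start": datetime(2021, 3, 4, 13, 40), "end": datetime(2021, 3, 4, 15, 10)},
    {"equipment": "DT-07",
     "start": datetime(2021, 3, 9, 6, 0), "end": datetime(2021, 3, 9, 6, 30)},
]


def match_downtime(work_orders, downtime_events, tolerance=timedelta(hours=1)):
    """Attach each downtime event to the first works order raised on the same
    equipment within `tolerance` of the downtime window."""
    matched, unmatched = [], []
    for event in downtime_events:
        hit = None
        for order in work_orders:
            same_unit = order["equipment"] == event["equipment"]
            in_window = (event["start"] - tolerance
                         <= order["raised"]
                         <= event["end"] + tolerance)
            if same_unit and in_window:
                hit = order
                break
        hours = (event["end"] - event["start"]).total_seconds() / 3600
        if hit:
            matched.append({"wo": hit["wo"],
                            "failure_mode": hit["failure_mode"],
                            "downtime_hours": hours})
        else:
            unmatched.append(event)  # needs manual follow-up
    return matched, unmatched


matched, unmatched = match_downtime(work_orders, downtime_events)
print(len(matched), "matched,", len(unmatched), "unmatched")
```

Events that cannot be matched are kept in a separate list so that they can be followed up by
hand rather than silently dropped.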
Obtain and prepare failure data
Example raw data
The raw data illustrated below is of poor quality as it has not been coded in the
manner proposed in the preceding section on the FRACAS
Filtered data
After some filtering it was found that ‘A/C Failures’ were a recurring phenomenon in this
failure data, which comprised about 1000 records.
Data quality
An observation of the above data shows that failures are not being captured in a
consistent manner and that there are no structures in place for coding the data.
Obtain and prepare data continued
Improved data
The data was improved by analyzing the ‘work carried out’, making an assumption
about the most likely failure mode and adding two columns to store this information. The
downtime was also calculated from the start and end time of the events and the values
rounded up
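The clean-up step described above can be sketched as follows. We assume here that downtime
is rounded up to whole hours and that the two added columns hold a failure mode group and a
failure mode; neither detail is stated in the text, and the record itself is hypothetical.

```python
import math
from datetime import datetime


def downtime_hours_rounded_up(start: datetime, end: datetime) -> int:
    """Downtime in whole hours, rounded up (assumed rounding unit: one hour)."""
    if end < start:
        raise ValueError("end time is before start time")
    return math.ceil((end - start).total_seconds() / 3600)


# Hypothetical raw record as it might come out of the works order history.
record = {
    "work_carried_out": "Replaced leaking A/C hose",
    "start": datetime(2021, 5, 3, 7, 40),
    "end": datetime(2021, 5, 3, 10, 5),
}

# The two added columns hold the analyst's assumption about the failure mode.
record["failure_mode_group"] = "A/C"
record["failure_mode"] = "Piping"
record["downtime_hours"] = downtime_hours_rounded_up(record["start"], record["end"])

print(record["failure_mode"], record["downtime_hours"], "hours")
```

Because the failure mode is inferred from free text, it remains an assumption and should be
treated as such in any later analysis.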
Perform a Pareto analysis
Procedure
The purpose of the Pareto analysis in this context is to determine which 20% of the
failure modes account for 80% of the losses. These are quite basic calculations and
it is best to import the data into MS Excel and perform the analysis there
Pareto analysis example
The table below shows the results of a Pareto analysis performed on the example data set
using the procedure specified above.
Outcome of the analysis
From the data set of chronic failures it was determined that the failure modes listed
from ‘Piping’ up to ‘Compressor’ represent 78% of all the downtime incurred. The
reliability team must therefore focus their attention on those specific failure modes.
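The calculation behind a Pareto table is simple enough to sketch in a few lines: total the
downtime per failure mode, sort in descending order and accumulate the percentages. The
records below are invented for illustration and are not the example data set used in this
section; the function name and the 80% threshold parameter are likewise our own.

```python
from collections import defaultdict


def pareto(records, threshold=0.80):
    """Rank failure modes by total downtime and flag the 'vital few'.

    Returns rows of (failure_mode, hours, share, cumulative_share, vital).
    A mode is flagged as vital while the cumulative share of the modes
    ranked above it is still below the threshold.
    """
    totals = defaultdict(float)
    for mode, hours in records:
        totals[mode] += hours
    grand_total = sum(totals.values())
    rows, cumulative = [], 0.0
    for mode, hours in sorted(totals.items(), key=lambda kv: kv[1], reverse=True):
        share = hours / grand_total
        vital = cumulative < threshold
        cumulative += share
        rows.append((mode, hours, share, cumulative, vital))
    return rows


# Hypothetical downtime records: (failure mode, downtime hours).
records = [
    ("Piping", 20), ("Piping", 17), ("Compressor", 22), ("Condenser", 10),
    ("Belt", 12), ("Electrical", 8), ("Evaporator", 5), ("Controls", 4),
    ("Other", 2),
]

for mode, hours, share, cumulative, vital in pareto(records):
    flag = "focus" if vital else ""
    print(f"{mode:<12}{hours:>6.0f}{share:>8.0%}{cumulative:>8.0%}  {flag}")
```

The same steps map directly onto a sorted column, a percentage column and a running total in
MS Excel.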
Quantify losses
Event costs
To quantify losses it is necessary to analyse all the cost factors
and each factor's contribution to the cost. Depending on the scale and type
of business, some factors carry a higher weight in the basket of costs that are
incurred with a failure. For the purpose of this discussion downtime or production
loss, which is an opportunity loss, will also be counted as a cost
Defect and failure true cost (Life-time Reliability)
From a study undertaken by an Australian company ‘Life-time Reliability’ this is a list of
the direct costs associated with a specific failure event
Quantify losses continued
Losses
The table below lists the indirect costs of the incident. Even if a person took this with
a pinch of salt, it is very clear from the example that there are many costs related to a
failure that an organisation would usually not consider. Not because they are unable to
determine them but because they probably do not want to know about them!
Justify RCFA on the basis of ROI
Introduction
Before spending money on an RCFA, or anything else for that matter, an
organisation should calculate whether the expense or the investment is justified.
Organisations that are operating in a competitive environment would be paying a
lot of attention to this, and even if an expenditure may seem intuitively justifiable
they would still want to see the ‘numbers crunched’. Using ROI to justify each
project or improvement is also a good means to rank and prioritize investments
when there is more than one on offer and funds are limited, as they usually are.
Quantify the loss
To keep the calculation straightforward we will only consider the revenue loss
caused by the failures related to ‘Piping’ using the data from the Pareto analysis
example. From the example we can see that the failure mode ‘Piping’ represents
37 hours of dump truck downtime per month. The dump trucks have a carrying
capacity of 200 tons of nickel ore per load and the cycle time for a load is one hour.
The nickel price is $11k per ton and the nickel concentration in laterite ore is about
1.5%. The demand for nickel has remained high and stable for the last five years
and the company is able to sell every ton it produces. From this data we are going
to calculate the revenue loss.
Quantify the loss in life cycle terms
Considering that this type of equipment has a life expectancy of at least 10 years,
the revenue loss over the life of the fleet for this single failure mode will amount to
$146 520k
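The life cycle figure can be reproduced from the inputs given above. The sketch assumes that
every hour of downtime costs one full load and that all the nickel contained in the ore is
recovered and sold; the variable names are our own.

```python
DOWNTIME_HOURS_PER_MONTH = 37      # 'Piping' failure mode, from the Pareto example
CYCLE_TIME_HOURS = 1               # one load per hour
ORE_TONS_PER_LOAD = 200
NICKEL_FRACTION = 0.015            # about 1.5% nickel in laterite ore
NICKEL_PRICE_USD_PER_TON = 11_000
FLEET_LIFE_YEARS = 10

# Assumption: each lost hour is one lost load, and all contained nickel is sold.
lost_loads_per_month = DOWNTIME_HOURS_PER_MONTH / CYCLE_TIME_HOURS
lost_nickel_tons_per_month = lost_loads_per_month * ORE_TONS_PER_LOAD * NICKEL_FRACTION
loss_per_month = lost_nickel_tons_per_month * NICKEL_PRICE_USD_PER_TON
loss_per_year = loss_per_month * 12
loss_over_life = loss_per_year * FLEET_LIFE_YEARS

print(f"Lost nickel per month: {lost_nickel_tons_per_month:,.0f} tons")
print(f"Revenue loss per month: ${loss_per_month:,.0f}")
print(f"Revenue loss per year:  ${loss_per_year:,.0f}")
print(f"Revenue loss over life: ${loss_over_life:,.0f}")
```

This works out to 111 tons of nickel and $1 221k of revenue lost per month, $14 652k per
year and $146 520k over ten years.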
ROI for the RCFA
In order to determine the ROI for this exercise we need to determine how much money
will be spent on eliminating or at least reducing the frequency of the event. The following
factors would need to be taken into account
ROI of RCFA
The return on investment of 3709% looks unbelievable but is true. There is almost
no human endeavor that is able to give this value of ROI. The sad situation is that few
organisations actually calculate the returns that are available to them by investing in
reliability improvement
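As a reminder of how such a percentage is obtained, the sketch below uses the common
definition of ROI as net gain divided by cost. The cost figure is a placeholder chosen by us
for illustration, so the result is not expected to reproduce the 3709% quoted above, and the
benefit assumes that the RCFA eliminates the failure mode completely.

```python
def roi_percent(benefit: float, cost: float) -> float:
    """Return on investment as a percentage: net gain divided by cost."""
    if cost <= 0:
        raise ValueError("cost must be positive")
    return (benefit - cost) / cost * 100.0


benefit = 146_520_000      # life cycle revenue loss avoided, from the text
assumed_cost = 4_000_000   # hypothetical cost of the RCFA and its corrective actions

print(f"ROI = {roi_percent(benefit, assumed_cost):.0f}%")
```

Ranking competing improvement projects then amounts to sorting them by this figure.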
Section D
Types of evidence, preservation and use
Overview
Introduction
In this section we discuss how to collect, preserve and use evidence
Topic
The importance of physical evidence, data and documentation
Paper and data based evidence
Parts data and evidence
People data and evidence
Position data and evidence
Dossier of evidence
The importance of physical evidence, data and documentation
Introduction
The integrity of any analytical process is only as good as the physical evidence, data
and documentation that support it.
Examples of professions that value evidence
The following professions rely heavily on data and evidence and therefore attach a
high value to it:
• Accountants
• Tax investigators
• Forensic accountants
• Detectives
• Pathologists
• Crime scene investigators
• Doctors
• Air transport safety board investigators
• Prosecutors
• Lawyers
• Attorneys
Maintainers
Unfortunately maintainers do not have a propensity to collect data and evidence.
Few technicians see any value in a failed part. Good housekeeping practices
dictate that it should be disposed of at the conclusion of the work. The question
of why or how the part failed is rarely asked. The approach is to replace whatever has
failed, get the machine on line again and move on.
Examples
Most evidence and data regarding a failure event can be classified into one of four
categories, also known as the four Ps:
• Paper
• Parts
• People
• Position
We will look a little closer at each of them in the following pages
Paper and data based evidence
Functions and performance standards
If we are to identify root causes then it is important that we understand what
functions are failing and how they are failing. From the below we can see that some
functions can suffer a total loss or a partial loss. There is a fundamental difference
between degraded performance and a total loss of performance, and this usually
relates to the primary function of the asset. Safety, health and environmental
excursions are usually caused by the loss or degradation of secondary functions.
The impact of failure modes on the system can be quantified much more easily when we
understand the functional failures that they cause.
PM program
There could be a causal relationship between the failures and the current
maintenance program. The program or the execution could be deficient. The
in-house program as well as the vendor recommended program should be on hand
for the analysis
Maintenance manual
The manual provided by the vendor for maintaining the equipment. It usually
contains the vendor recommended PM as well as procedures for the execution of
component exchanges, repairs, adjustments, calibration and troubleshooting.
Operating manual
The manual provided by the vendor for operating the equipment. It usually
contains the vendor recommended operating procedures including safety
precautions, set up and adjustment, start up, operation, and shut down of the
equipment.
Maintenance history
It is regrettable but true that history has shown a strong causal relationship
between current failures and previous maintenance work, so it would be unwise
not to consider this in the course of an analysis.
Modification history
Any change should be considered, whether it be to the physical equipment, the operating
parameters, set points, performance standards, firmware, software, process raw materials,
feedstock, process chemicals, concentrations, fuel or energy source, or any
procedural or practical change related to cleaning, mode of storage, transport,
start-up, shut down or vendor selection.
Any organizational changes in roles and responsibilities, training, staff selection
and appointments, or working hours should also be considered.
Paper and data based evidence continued
Pictorial drawing
Below is a pictorial drawing of an A/C system that could be useful as a basic reference
when analysing the chronic failures of an A/C system
Paper and data based evidence continued
Conclusion
Comprehensive and well organised paper or electronic data forms the basis of a
professional failure analysis. It is also a case of rather too much than too little.
Whatever is found to be superfluous to the investigation can be discarded
within seconds. Getting hold of information that should have been at hand but
is not can delay progress by days.
Parts data and evidence
Introduction
Parts in the first instance represent the physical components or items that were
destroyed, damaged or rendered defective as a result of the failure, and also any
other parts that were replaced in the course of the repair.
Failed part
The part that failed and was presumed to have initiated the chain of events that led
to the failure is key to the analysis, so it would be a high priority for preservation. It
should be labelled as such. Please note however that at this point it is still a
presumption and will remain as such until it can be validated as a fact.
Samples Where it is relevant it is also important to take samples of lubricants, fuels, partially
processed or damaged products that were in the equipment or process at the time
of failure.
Filters, The filters or the contents of filters and scrapings off magnetic plugs where relevant
magnetic must also be collected, bagged or bottled and tagged.
plugs
Parts that These are parts that were replaced because they were presumed to have suffered
suffered damage during the course of the failure. They should be tagged as such to avoid
collateral any confusion.
damage
Parts These are parts that were replaced because it is standard practice to do so.
replaced as Example:
standard The oil seals are replaced whenever the pinion has been removed
practice These should be labelled as such.
Parts Parts could have been replaced because the opportunity arose, whether or not
replaced by there was significant wear to justify it.
opportunity
Packaging Parts should be labelled with durable labels and all writing should be done with a
and storage permanent marking pen. If possible the parts should be placed in a bin which
should also be labelled and stored under cover in a secure area to prevent
tampering or weather damage. In some cases wrapping could also be appropriate
to prevent contamination.
People data and evidence
Introduction Evidence that can be obtained from people can be very valuable to an analysis but
it tends to degrade rapidly if it is not collected promptly. The facilitator
should therefore move fast to ensure that this evidence is collected as soon as
possible after an event, before people's observations fade from their short-term
memories.
Apply People can quickly clam up when they feel that anything they say may be used
appropriate as evidence against them. It is important to put them at ease and emphasize that
interviewing you are only asking them to share their first-hand observations with you and that the
skills information will only be used as a means to determine the root causes of the event.
Physical Only obtain information from the physical witnesses, in other words individuals who
witnesses saw, heard, felt or smelt what happened first-hand. No second-hand, hearsay
information must be allowed to contaminate the evidence.
Conducting Memory recall has been considered a credible source in the past, but has recently
interview come under attack as forensics can now support psychologists in their claim that
memories and individual perceptions can be unreliable, manipulated, and biased.
People are also highly susceptible to suggestive interviewing where ideas about
what was observed are planted by the interviewer.
The facilitator must at all times be mindful that evidence obtained from people is
not always reliable. Other factors that because of issues such as:
Recording of There is nothing inherently wrong with recording an interview but this must be done
interviews with the permission of the person being interviewed and the purpose must be to
enhance accuracy and save time and not to put the witness on edge or make him or
her less forthcoming in the provision of testimony.
Position data and evidence
Introduction Position data and evidence relate to spatial, environmental, time and
operational phase data that may be relevant to the investigation, such as:
Location of This relates to the locality of the equipment or parts of the equipment before,
occurrence during and after the failure event.
Timing of The chronological point in time of previous events and the current event.
occurrence
Operational In what phase of operation did the failure occur? Examples are loading, unloading, tramming,
phase of landing, take-off, cruising, rising, descending, start-up, shutdown, accelerating,
occurrence coasting, braking, winding, unwinding, recharging, discharging, etc.
Physical What was the position and orientation relative to other items or reference points?
orientation
Performance This is data that can usually be downloaded from a SCADA or other performance
data at monitoring system and relates to things like velocity, RPM, pressure, temperature,
occurrence flow, viscosity, density, level, size, concentration, acidity, etc.
Position of Knowing the position of people at the time of the event helps to verify any eyewitness
people at statements against their viewpoints
occurrence
Dossier of evidence
Dossier of The facilitator must keep a dossier or register of all evidence that is pertinent to the
evidence investigation. The whereabouts of physical evidence must be indicated.
Records of The facilitator will also keep record of all meetings and consultation events that were
attendance conducted during the course of the investigation
Evidence of The records of attendance are evidence that stakeholders and subject matter
consultation experts were consulted during the course of the investigation
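A dossier of this kind can be kept in any form. The sketch below shows one possible structure in Python, assuming a simple in-memory register. All class names, field names and sample entries are hypothetical and are not taken from the course material.

```python
from dataclasses import dataclass, field


@dataclass
class EvidenceItem:
    """One entry in the dossier (field names are illustrative assumptions)."""
    reference: str    # tag or label number written on the item
    category: str     # e.g. "parts", "people", "position", "paper"
    description: str
    location: str     # whereabouts of the physical evidence


@dataclass
class Consultation:
    """Record of one meeting or consultation event and who attended."""
    subject: str
    attendees: list[str]


@dataclass
class Dossier:
    evidence: list[EvidenceItem] = field(default_factory=list)
    consultations: list[Consultation] = field(default_factory=list)

    def whereabouts(self, reference: str) -> str:
        """Return the recorded location of a tagged item."""
        for item in self.evidence:
            if item.reference == reference:
                return item.location
        raise KeyError(f"no evidence registered under {reference!r}")

    def people_consulted(self) -> set[str]:
        """Everyone who attended at least one recorded consultation."""
        return {name for c in self.consultations for name in c.attendees}


# Hypothetical sample entries.
dossier = Dossier()
dossier.evidence.append(
    EvidenceItem("P-001", "parts", "Failed part, presumed initiator", "Bin 3, secure store")
)
dossier.consultations.append(
    Consultation("First-hand observations", ["Operator A", "Fitter B"])
)
print(dossier.whereabouts("P-001"))
print(sorted(dossier.people_consulted()))
```

Keeping attendance alongside the evidence register means the same record can later show which stakeholders and subject matter experts were consulted.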
Section E
Overview
Topic
Criteria for an RCFA technique to qualify as rigorous
The basic structure of root cause failure analysis
The event modes
Separating the facts from the hypothesis
Criteria for an RCFA technique to qualify as rigorous
Introduction With the exception of a military standard for FRACAS and a Department of Energy
guideline, there are no formally recognised standards for root cause failure analysis. Based
on research work from an unknown source, the table below provides
some insight into the functionality of some of the processes used to identify
the root causes of failures.
Method/Tool              Type    Defines   Defines all     Provides causal  Delineates  Considers chronic  Proprietary
                                 problem   causal          path to root     evidence    events and         software
                                           relationships   causes                       quantifies losses  required
Events & causal factors  Method  Yes       Limited         No               No          No                 No
Change analysis          Tool    Yes       No              No               No          No                 No
Barrier analysis         Tool    Yes       No              No               No          No                 No
Tree diagrams            Method  Yes       No              No               No          No                 Yes
Why-Why chart            Method  Yes       No              No               No          No                 No
Pareto                   Tool    Yes       No              No               No          No                 No
Storytelling             Method  Limited   No              No               No          No                 No
Fault tree               Method  Yes       Yes             No               No          No                 Yes
FMEA                     Tool    Yes       No              No               No          No                 Yes
Apollo reality charting  Method  Yes       Yes             Yes              Yes         No                 Yes
PROACT                   Method  Yes       Yes             Yes              Yes         Yes                Yes
RCFA                     Method  Yes       Yes             Yes              Yes         Yes                No
Problem The process clearly defines the problem and its significance to the problem
definition owners in terms that are relevant for the business. This can be expressed either in
terms of risk or in financial terms so that management is able to verify that the
investment in the investigation was justified.
Combination The process delineates the known causal relationships that combined to cause
of causal the problem. As there is rarely only one cause for a problem, it is necessary to
relationships make visible all the causal relationships that were active at the time of incident so
that they can all be dealt with.
Causal It must establish causal relationships between the root cause(s) and the defined
relationships problem. This means that there must be a clear causal link between the root
between root cause and the problem. Otherwise we may be eliminating a root cause that has
cause and nothing to do with the problem!
problem
Presentation Evidence must form the basis for the identification of causes. The facilitator or
of evidence analyst must demonstrate that his or her findings and conclusions are based on
irrefutable evidence. There must be clear distinction at all times between facts
supported by evidence and assumptions for which there is no evidence.
Presents The recommendations must clearly explain how the proposed solutions will
solutions prevent recurrence of the defined problem.
Report It must clearly document all the above criteria in a final report so others can easily
follow the logic of the analysis.
The basic structure of root cause failure analysis
Introduction The diagram below describes the basic structure of a root cause analysis.
Loss of There must be a clear statement of the problem that needs to be solved or what aspect of
function the business needs to be improved by undertaking this project or investigation. The
problem could be related to things like:
• Production downtime
• Property damage
• Poor reliability
• Poor maintainability
• Product quality
• Product yield
• Energy or fuel efficiency
• Reagents or chemical use efficiency
• Environmental
Describe the The failure modes that we know have caused the loss of function in the past are
modes identified. Chronic events or losses may have a number of failure modes. An
intermittent loss or event will usually have only one failure mode.
Hypothesize The very first thing that the team needs to determine is the physical root cause of
the physical failure. The objective in this step is to determine which part / component / system
roots and failed and then determine how it failed. It is more important to first determine
verify how it failed than to determine why it failed. Using brainstorming techniques,
generate a list of ideas about possible causes. Each idea (a hypothesis, also known
as a conjecture) is validated to be either true or not true on the basis of evidence.
Proving what is not true is just as important as proving what is true.
Hypothesize Once the physical root or roots have been identified and verified the next step is to
the human determine whether there is a causal relationship between the physical root cause of failure and
roots and the actions or inactions of people. Human roots could originate from any of the
verify following people:
• Operators
• Maintainers
• Designers
• Buyers
• Transporters
• Refuel people
• Cleaners
• Manufacturers
• Suppliers
• Vendors
• Warehouse people
• Packers
• Un-packers
Or anybody else that has an impact on the part, component or system.
Hypothesize Very few people have the inclination to wilfully harm equipment except for psychopaths, and
the latent these are usually not in our employment. Normal people’s work behaviour is shaped by
roots and the environment in which they live and work. It therefore makes sense that, having
verify established that some human behaviour or misbehaviour has led to a failure, the
organisation should also apply some introspection and question whether there is
something about the organisation that motivates people to behave in a certain manner.
No root cause analysis can be considered complete unless this aspect is dealt with.
Latent roots could be related to things like:
• Production bonus schemes
• Procurement policies
• Organisation structure
• Lines of authority
• Delineation of responsibilities
• Lack of supervision
• Documented procedures
• Work standards
• Training
• Supervision
• Shift arrangements
• Availability of tools
• Availability of spares
• Availability of transport
• Availability of PPE
• Availability of lifting equipment
• PM tactics
• Budget constraints
• Workforce turnover
The event modes
Introduction The event modes describe the causal relationship between the failure and the
individual failure modes that have occurred in the past.
Must be In the example above we have had recurring AC failures over time. From the work
facts order history we have established that the AC has failed for a number of reasons.
Failure The failure modes depicted above represent the individual events that have
modes happened in the past and that have caused the AC to lose its cooling function.
These could have common causes or they could have causes unique to that
specific failure mode.
Chronic Chronic events could have multiple failure modes, such as those displayed in the diagram.
events A sporadic event would usually have only one failure mode.
The Many failure investigations go wrong from the outset when the facilitator allows
fact conventional wisdom, conjecture, assumptions, opinions and ignorance to be treated as
line fact.
From the outset of an analysis and throughout every step of the process there must be a
clear distinction between facts and everything else.
No harm can come from regarding what may be a fact as an assumption but a lot of
harm can come from treating an assumption or conjecture as a fact.
Separating the facts from the hypothesis
Introduction The diagram below shows how the failure modes which have been verified as fact
are separated from the hypotheses which have yet to be verified.
Dropping the As successive levels of hypotheses are generated, causal relationships are proposed and verified
fact line or disproved, and the fact line drops until all the root causes are identified.
Section F
Practical RCFA case study using an MS Excel based tool
Overview
Introduction In this section we show how what we have learned about failure modes and
causes can be put to use to conduct an actual RCA using an MS Excel template
Topic
Case study
Project registration
Problem statement
First level cause and effect analysis
Second level cause and effect analysis
Third level cause and effect analysis
Fourth level cause and effect analysis (1)
Fourth level cause and effect analysis (2)
Fifth level cause and effect analysis (1)
Root causes and recommendations
Case study
Background An emerald mine near the Kafubu river in Zambia cleans emeralds in the emerald
concentrate by ‘cooking’ and agitating the concentrate in a caustic soda solution in
a number of electrically heated kilns.
Failures A number of failure events had occurred over the last year, resulting in loss of output,
expensive repairs and risk of product theft. In discussion with operators
and maintainers it became apparent that the following failures were occurring more
frequently than in the past:
1. Leaks through inner stainless steel housings of the kilns.
2. Fractures of the agitator drive shafts.
3. Oil leaking from the output shaft seals of the agitator gearboxes. (The gearboxes
had subsequently been packed with grease in an attempt to overcome the
problem.)
Cost factors A kiln is only capable of processing one batch per day on a permanent dayshift
basis. That means that any failure that exceeds two hours results in the loss of a
full cycle.
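The two-hour rule lends itself to a simple calculation. The sketch below, in Python, assumes each stoppage falls within a single dayshift and affects one kiln. The function names, the stoppage durations and the value per batch are hypothetical and are not figures from the case study.

```python
# From the text: a stoppage longer than two hours loses the day's cycle.
CYCLE_LOSS_THRESHOLD_H = 2.0


def cycles_lost(downtime_hours: float) -> int:
    """Cycles lost by one kiln for a single stoppage within one dayshift."""
    return 1 if downtime_hours > CYCLE_LOSS_THRESHOLD_H else 0


def output_loss(stoppages_h: list[float], value_per_batch: float) -> float:
    """Value of output lost over a set of recorded stoppages."""
    return sum(cycles_lost(h) for h in stoppages_h) * value_per_batch


# Hypothetical stoppage durations (hours) and batch value.
stoppages = [0.5, 3.0, 2.0, 6.5]
print(output_loss(stoppages, value_per_batch=1000.0))  # prints 2000.0
```

Only the 3.0 h and 6.5 h stoppages exceed the threshold, so two cycles are counted as lost. Stoppages that run over several days would need a different rule, which the text does not specify.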
Batch size The amount of concentrate and the caustic soda that is charged per cycle is set by
the CCR operator. Batches are drawn out of feed hoppers. There are feed hoppers
for the caustic soda as well as for the emerald concentrate. A batching system
draws and weighs the batches and discharges the batches into each kiln as
required. The caustic soda is loaded first as can be seen from the above timeline.
Temperature The cooking time and temperature are set by the control room operator.
Access Access to the kilns and surrounding area is strictly controlled for theft reasons. No
control person may enter the area without a security escort. The work area is clearly
visible from the control room which has large windows and is elevated above the
kilns.
Kiln The diagram below illustrates the configuration of the kiln. The kiln is supported at the top
by a trunnion mechanism that enables hydraulic rams to tip it and discharge the contents
into a flume.
[Figure: kiln configuration. Labelled parts: agitator drive, cast iron outer housing, stainless steel inner housing, agitator, heating element blocks placed around circumference]
Project registration
Introduction As would be the practice in any organisation, an investigation starts with the
formal registration of a project in the computerised maintenance management or
other system that the organisation uses for the purpose.
Noteworthy Note that the function loss / undesirable outcome is written in broad terms so as to
points be able to accommodate all failure modes that have been identified as causal to the
loss.
Quantification The annual loss that is incurred by each failure mode is calculated and recorded
of loss
Problem statement
Introduction A problem statement is completed for each failure mode. The example below shows the
information regarding the first failure mode, ‘Kiln stainless steel housings leak’. The
rest of the form is self-explanatory.
First level cause and effect analysis
Introduction This form is used for every successive level of the cause and effect analysis. The
example below is for the first level where we have listed ‘kiln stainless steel housings
leak’ as a fact. Note the fact line which is used to distinguish between the facts and
the hypotheses that were generated.
Second level cause and effect analysis
Introduction The hypothesis ‘shell cracked’ from the level 1 sheet was proved to be true on the
basis of evidence. It has now been inserted above the fact line. The team then
proceeded to list hypotheses for what could have caused the shell to crack in the
table below the fact line.
Third level cause and effect analysis
Introduction The hypothesis ‘Thermal fatigue crack’ from the level 2 sheet was proved to be true
on the basis of evidence. It has now been inserted above the fact line. The team
then proceeded to list hypotheses for what could have caused the thermal fatigue in
the table below the fact line.
Physical root ‘Thermal fatigue crack’ has been identified as the physical root cause of failure.
cause We now need to determine how a thermal fatigue crack could develop.
Somewhere in the life cycle of the shell it must have been exposed to conditions
that could give rise to a thermal fatigue crack.
Fourth level cause and effect analysis (1)
Introduction The hypothesis ‘Cooking temperature set too high’ from the level 3 sheet was
proved to be true on the basis of evidence. It has now been inserted above the fact
line. The team then proceeded to list hypotheses for what could have caused the
cooking temperature to be set too high in the table below the fact line.
Fourth level cause and effect analysis (2)
Introduction The hypothesis ‘Kiln is washed at too high temperature’ from the level 3 sheet was
also proved to be true on the basis of evidence. It has now been inserted above the
fact line. The team then proceeded to list hypotheses for what could have caused the
kiln to be washed at too high a temperature in the table below the fact line.
Fifth level cause and effect analysis (1)
Introduction The hypothesis ‘Temperature is intentionally set too high’ from the level 4 sheet was proved
to be true on the basis of evidence. It has now been inserted above the fact line. The team
then proceeded to list hypotheses for what could have caused the temperature to be
intentionally set too high.
Root causes and recommendations
Introduction The physical root, human roots and latent root causes of the failure were identified
using the preceding process and have now been recorded on the root causes form
below. All that remains now is for the team to propose actions aimed at eliminating
the root causes.
Section G
Topic
Introduction to corrosion
Uniform corrosion
Pitting corrosion
Crevice corrosion
Filiform corrosion
Galvanic corrosion
Erosion corrosion
Cavitation
Fretting corrosion
Inter-granular corrosion
Environmental cracking
Stress corrosion cracking
Hydrogen embrittlement
Hydrogen embrittlement of stainless steel
Wear
Mechanical fatigue
Characteristics of fatigue
Mechanical stress overload
Fastener failures
Lubricant failures
Chemical decay
Bearing failures
Soft foot
Fatigue endurance
Introduction to corrosion
Highly The science of corrosion prevention and control is highly complex, exacerbated by the
complex fact that corrosion takes many different forms and is affected by numerous outside
factors. Corrosion professionals must understand the effects of environmental conditions
such as soil resistivity, humidity, and exposure to salt water on various types of materials;
the type of product to be processed, handled, or transported; required lifetime of the
structure or component; proximity to corrosion-causing phenomena such as stray current
from rail systems; appropriate mitigation methods; and other considerations before
determining the specific corrosion problem and specifying an effective solution.
Thorough The first step in effective corrosion control, however, is to have a thorough knowledge of
knowledge of the various forms of corrosion, the mechanisms involved, how to detect them, and how
the basics and why they occur.
Natural Simply put, corrosion is the natural deterioration that results when a surface reacts with
deterioration its environment. Different surfaces, environments and other factors add complexity to
the equation
Ten basic There are 10 primary forms of corrosion, but it is rare that a corroding structure or
Forms component will suffer from only one. The combination of metals used in a system and the
wide range of environments encountered often cause more than one type of attack. Even
a single alloy can suffer corrosion from more than one form depending on its exposure to
different environments at different points within a system.
Basic cause All forms of corrosion, with the exception of some types of high-temperature
corrosion, occur through the action of the electrochemical cell (Figure 1). The
elements that are common to all corrosion cells are an anode where oxidation and
metal loss occur, a cathode where reduction and protective effects occur, metallic
and electrolytic paths between the anode and cathode through which electronic
and ionic current flows, and a potential difference that drives the cell. The driving
potential may be the result of differences between the characteristics of dissimilar
metals, surface conditions, and the environment, including chemical
concentrations. There are specific mechanisms that cause each type of attack,
different ways of measuring and predicting them, and various methods that can be
used to control corrosion in each of its forms.
Electro- In a corrosion cell, electrons flow through a metallic path from sites where anodic
chemical reactions are occurring to sites where they allow cathodic reactions to occur. Ions
cell (charged particles) flow through the electrolyte to balance the flow of electrons. Anions
(negatively charged ions from cathodic reactions) flow toward the anode and cations
(positively charged ions from the anode itself) flow toward the cathode. The anode
corrodes and the cathode does not. There is also a voltage, or potential, difference
between the anode and cathode. Source: NACE International Basic Corrosion Course
Handbook, p. 2:9.
Uniform Corrosion
Introduction Uniform corrosion is characterized by corrosive attack proceeding evenly over the
entire surface area, or a large fraction of the total area. General thinning takes place
until failure. On the basis of tonnage wasted, this is the most important form of
corrosion.
Predictable However, uniform corrosion is relatively easily measured and predicted, making
disastrous failures relatively rare. In many cases, it is objectionable only from an
appearance standpoint. As corrosion occurs uniformly over the entire surface of
the metal component, it can be practically controlled by cathodic protection, use
of coatings or paints, or simply by specifying a corrosion allowance. In other cases
uniform corrosion adds color and appeal to a surface. Two classics in this
respect are the patina created by naturally tarnishing copper roofs and the rust
hues produced on weathering steels.
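Because uniform thinning is predictable, a corrosion allowance can be checked with simple arithmetic. The sketch below, in Python, assumes a constant corrosion rate estimated from two thickness readings, with remaining life taken as the thickness margin divided by that rate. The function names and all figures are hypothetical.

```python
def corrosion_rate_mm_per_year(
    earlier_mm: float, later_mm: float, years_between: float
) -> float:
    """Average thinning rate between two wall thickness readings."""
    if years_between <= 0:
        raise ValueError("years_between must be positive")
    return (earlier_mm - later_mm) / years_between


def remaining_life_years(
    current_mm: float, minimum_mm: float, rate_mm_per_year: float
) -> float:
    """Years until the wall reaches its minimum required thickness."""
    if rate_mm_per_year <= 0:
        raise ValueError("rate must be positive to estimate a remaining life")
    return max(current_mm - minimum_mm, 0.0) / rate_mm_per_year


# Hypothetical readings: 12.0 mm when installed, 10.0 mm eight years later,
# with 7.0 mm as the minimum required wall thickness.
rate = corrosion_rate_mm_per_year(12.0, 10.0, 8.0)
print(rate)                                   # 0.25 mm per year
print(remaining_life_years(10.0, 7.0, rate))  # 12.0 years
```

The estimate only holds while the attack stays uniform. Localized forms such as pitting can penetrate a wall long before the average thickness reaches its minimum.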
Breakdown The breakdown of protective coating systems on structures often leads to this form of
of protective corrosion. Dulling of a bright or polished surface, etching by acid cleaners, or oxidation
coatings (discoloration) of steel are examples of surface corrosion. Corrosion-resistant alloys and
stainless steels can become tarnished or oxidized in corrosive environments. Surface
corrosion can indicate a breakdown in the protective coating system, however, and
should be examined closely for more advanced attack. If surface corrosion is permitted
to continue, the surface may become rough and surface corrosion can lead to more
serious types of corrosion.
Pitting corrosion
Introduction Pitting corrosion is a localized form of corrosion by which cavities or "holes" are
produced in the material. Pitting corrosion, which is almost a common denominator
of all types of localized corrosion attack, may assume different shapes. It can produce pits
with their mouth open (uncovered) or covered with a semi-permeable membrane of
corrosion products. Pits can be either hemispherical or cup-shaped.
Difficult to Pitting is considered to be more dangerous than uniform corrosion damage because
detect it is more difficult to detect, predict and design against. Corrosion products often
cover the pits. A small, narrow pit with minimal overall metal loss can lead to the
failure of an entire engineering system.
Catastrophic Apart from the localized loss of thickness, corrosion pits can also be harmful by
failure acting as stress risers. Fatigue and stress corrosion cracking may initiate at the
base of corrosion pits. One pit in a large system can be enough to produce the
catastrophic failure of that system. An extreme example of such catastrophic
failure happened in Guadalajara, Mexico, in 1992, where a single pit in a gasoline line
running over a sewer line was enough to create great havoc in the city, killing 215 people.
Trough Pits can have different shapes. The shapes below are classed as trough pits.
pits
Crevice corrosion
Introduction Crevice corrosion is a localized form of corrosion usually associated with a stagnant
solution on the micro-environmental level. Such stagnant microenvironments tend to occur
in crevices such as those formed under gaskets, washers, insulation material,
fastener heads, surface deposits, disbonded coatings, threads, lap joints and clamps.
Crevice corrosion is initiated by changes in local chemistry within the crevice:
• The cathodic oxygen reduction reaction cannot be sustained in the crevice area, giving it
an anodic character in the concentration cell. This anodic imbalance can lead to the
creation of highly corrosive micro-environmental conditions in the crevice, conducive to
further metal dissolution. This results in the formation of an acidic micro-environment,
together with a high chloride ion concentration.
• Metal ions produced by the anodic corrosion reaction readily hydrolyze, giving off protons
(acid) and forming corrosion products.
Environmental All forms of concentration cell corrosion can be very aggressive, and all result from
differences environmental differences at the surface of a metal. Even the most benign atmospheric
environments can become extremely aggressive as illustrated in this example of aircraft
corrosion (courtesy Mike Dahlager). This advanced form of crevice corrosion is called
'pillowing'.
Filiform corrosion
Introduction Filiform corrosion is a special form of crevice corrosion in which the aggressive chemistry
build-up occurs under a protective film that has been breached. This type of corrosion occurs
under painted or plated surfaces when moisture permeates the coating.
Initiation Filiform corrosion normally starts at small, sometimes microscopic, defects in the
coating. Lacquers and "quick-dry" paints are most susceptible to the problem. Their use
should be avoided unless absence of an adverse effect has been proven by field
experience. Where a coating is required, it should exhibit low water vapor transmission
characteristics and excellent adhesion. Zinc-rich coatings should also be considered for
coating carbon steel because of their cathodic protection quality.
Galvanic corrosion
Introduction Galvanic corrosion (also called 'dissimilar metal corrosion' or, wrongly, 'electrolysis') refers
to corrosion damage induced when two dissimilar materials are coupled in a corrosive
electrolyte. It occurs when two (or more) dissimilar metals are brought into electrical
contact under water. When a galvanic couple forms, one of the metals in the couple
becomes the anode and corrodes faster than it would all by itself, while the other becomes
the cathode and corrodes more slowly than it would alone.
Mechanism Either (or both) metal in the couple may or may not corrode by itself (themselves).
When contact with a dissimilar metal is made, however, the self corrosion rates
change:
Corrosion of the anode accelerates. Corrosion of the cathode decelerates or even
stops.
Galvanic coupling is the foundation of many corrosion monitoring techniques.
History The driving force for corrosion is a potential difference between the different
materials. The bimetallic driving force was discovered in the late part of the
eighteenth century by Luigi Galvani in a series of experiments with the exposed
muscles and nerves of a frog that contracted when connected to a bimetallic
conductor. The principle was later put into a practical application by Alessandro
Volta who built, in 1800, the first electrical cell, or battery: a series of metal disks of
two kinds, separated by cardboard disks soaked with acid or salt solutions. This is
the basis of all modern wet-cell batteries, and it was a tremendously important
scientific discovery, because it was the first method found for the generation of a
sustained electrical current.
Further The principle was also engineered into the useful protection of metallic structures by Sir
development Humphry Davy and Michael Faraday in the early part of the nineteenth century. The
sacrificial corrosion of one metal such as zinc, magnesium or aluminum is a widespread
method of cathodically protecting metallic structures.
Effect of In a bimetallic couple, the less noble material will become the anode of this corrosion cell
nobility and tend to corrode at an accelerated rate, compared with the uncoupled condition. The
more noble material will act as the cathode in the corrosion cell. Galvanic corrosion can
be one of the most common forms of corrosion as well as one of the most destructive.
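The effect of nobility can be sketched as a lookup of electrode potentials. The Python example below uses textbook standard electrode potentials (volts versus the standard hydrogen electrode) as a simplifying assumption. In practice a galvanic series measured in the actual electrolyte, such as seawater, should be used, because passive films and other surface conditions change the ranking. The function name is hypothetical.

```python
# Standard electrode potentials in volts versus the standard hydrogen electrode.
# Assumption: these stand in for a galvanic series in the real electrolyte.
STANDARD_POTENTIAL_V = {
    "magnesium": -2.37,
    "aluminium": -1.66,
    "zinc": -0.76,
    "iron": -0.44,
    "copper": 0.34,
}


def galvanic_couple(metal_a: str, metal_b: str) -> tuple[str, str, float]:
    """Return (anode, cathode, driving potential in volts) for a bimetallic couple.

    The less noble metal (lower potential) becomes the anode and corrodes
    at an accelerated rate. The more noble metal becomes the cathode.
    """
    if metal_a == metal_b:
        raise ValueError("a galvanic couple needs two dissimilar metals")
    e_a = STANDARD_POTENTIAL_V[metal_a]
    e_b = STANDARD_POTENTIAL_V[metal_b]
    anode, cathode = (metal_a, metal_b) if e_a < e_b else (metal_b, metal_a)
    return anode, cathode, abs(e_a - e_b)


anode, cathode, volts = galvanic_couple("iron", "zinc")
print(f"{anode} corrodes sacrificially, protecting {cathode} ({volts:.2f} V)")
```

With iron and zinc the zinc is the anode, which is why zinc is used for sacrificial protection of steel. With iron and copper the roles reverse and the iron corrodes faster.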
Erosion corrosion
Introduction Erosion corrosion is an acceleration in the rate of corrosion attack in metal due to the relative
motion of a corrosive fluid and a metal surface. The increased turbulence caused by pitting on the
internal surfaces of a tube can result in rapidly increasing erosion rates and eventually a leak.
Causes Erosion corrosion can also be aggravated by faulty workmanship. For example, burrs left
at cut tube ends can upset smooth water flow, cause localized turbulence and high flow
velocities, resulting in erosion corrosion. A combination of erosion and corrosion can lead
to extremely high pitting rates. This is an important problem in offshore well systems and in
process industries where components come into contact with sand-bearing liquids.
Materials Materials selection plays an important role in minimizing erosion corrosion damage.
selection Caution is in order when predicting erosion corrosion behavior on the basis of hardness.
High hardness in a material does not necessarily guarantee a high degree of resistance to
erosion corrosion. Design features are also particularly important.
Counter-measures It is generally desirable to reduce the fluid velocity and promote laminar flow;
increased pipe diameters are useful in this context. Rough surfaces are generally
undesirable. Designs creating turbulence, flow restrictions and obstructions are
undesirable. Abrupt changes in flow direction should be avoided. Tank inlet pipes
should be directed away from the tank walls, towards the center. Welded and
flanged pipe sections should always be carefully aligned. Impingement plates or
baffles designed to bear the brunt of the damage should be easily replaceable.
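The effect of pipe diameter on velocity can be checked with a short calculation. The sketch below assumes a full circular pipe, a fixed volumetric flow rate and water at about 20 °C (density 998 kg/m³, viscosity 1.0e-3 Pa·s); the flow rate and diameters are hypothetical values chosen only for illustration. At constant flow, doubling the diameter cuts the mean velocity to a quarter and halves the Reynolds number.

```python
import math


def mean_velocity(flow_m3_s, diameter_m):
    """Mean velocity (m/s) in a full circular pipe: v = Q / A."""
    area_m2 = math.pi * diameter_m ** 2 / 4.0
    return flow_m3_s / area_m2


def reynolds_number(flow_m3_s, diameter_m,
                    density_kg_m3=998.0, viscosity_pa_s=1.0e-3):
    """Reynolds number Re = rho * v * D / mu (defaults: water near 20 °C)."""
    velocity = mean_velocity(flow_m3_s, diameter_m)
    return density_kg_m3 * velocity * diameter_m / viscosity_pa_s


flow = 0.002  # m3/s, hypothetical duty
for diameter in (0.05, 0.10):  # m, hypothetical pipe sizes
    v = mean_velocity(flow, diameter)
    re = reynolds_number(flow, diameter)
    print(f"D = {diameter:.2f} m: v = {v:.2f} m/s, Re = {re:.0f}")
```

Pipe flow is usually treated as laminar only below a Reynolds number of roughly 2300, so in most water services a larger diameter lowers velocity and turbulence intensity rather than producing truly laminar flow.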
Other counter-measures The thickness of vulnerable areas should be increased. Replaceable ferrules, with a
tapered end, can be inserted into the inlet side of heat exchanger tubes, to prevent
damage to the actual tubes. Abrasive particles in fluids can be removed by filtration or
settling, while water traps can be used in steam and compressed air systems to decrease
the risk of impingement by droplets. De-aeration and corrosion inhibitors are additional
measures that can be taken.
Cavitation
Introduction Cavitation occurs when a fluid's operational pressure drops below its vapor pressure,
causing gas pockets and bubbles to form and collapse. This can occur in what can be a
rather explosive and dramatic fashion. In fact, this can actually produce steam at the
suction of a pump in a matter of minutes. When a process fluid is supposed to be water in
the 20-35°C range, this is entirely unacceptable. Additionally, this condition can form an
airlock, which prevents any incoming fluid from offering cooling effects, further
exacerbating the problem.
Other causes Cavitation can also be caused by processes involving sudden expansion, which can
lead to dramatic pressure drops. This form of corrosion will eat out the volutes and impellers of
centrifugal pumps with ultrapure water as the fluid. It will eat valve seats. It will contribute to
other forms of erosion corrosion, such as those found in elbows and tees.
Counter-measures Cavitation should be designed out by reducing hydrodynamic pressure gradients and
designing to avoid pressure drops below the vapor pressure of the liquid and air ingress. The
use of resilient coatings and cathodic protection can also be considered as
supplementary control methods.
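The cavitation condition described above (local pressure falling below vapor pressure) can be expressed as a simple check. The sketch below is an assumption-laden illustration: it estimates the vapor pressure of water with the Antoine equation, using commonly tabulated constants that are roughly valid from 1 to 100 °C, and compares it with a local absolute pressure. The function names are hypothetical, and a real pump assessment would use the manufacturer's required NPSH rather than this bare comparison.

```python
def water_vapor_pressure_kpa(temp_c):
    """Approximate vapor pressure of water (kPa) from the Antoine equation.

    Constants give pressure in mmHg for temperature in °C and are
    roughly valid between 1 and 100 °C.
    """
    a, b, c = 8.07131, 1730.63, 233.426
    pressure_mmhg = 10 ** (a - b / (c + temp_c))
    return pressure_mmhg * 0.133322  # convert mmHg to kPa


def cavitation_risk(local_pressure_kpa_abs, temp_c):
    """True if the local absolute pressure is below the vapor pressure."""
    return local_pressure_kpa_abs < water_vapor_pressure_kpa(temp_c)


# Water at 25 °C: compare a low suction pressure with atmospheric pressure.
for pressure in (2.0, 101.325):  # kPa absolute, hypothetical readings
    print(pressure, cavitation_risk(pressure, 25.0))
```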
Fretting corrosion
Introduction Fretting corrosion refers to corrosion damage, initially at the asperities of contact
surfaces. Fretting corrosion damage occurs due to one or more factors but
mostly is induced under load and in the presence of repeated relative surface
motion. In vehicles, engine vibration is the most likely cause of the vibration that
starts the fretting corrosion process but other factors such as thermal cycling and
loom anchor points should also be taken into consideration.
Characteristics Pits or grooves and oxide debris characterize this damage, typically found in
machinery, bolted assemblies and ball or roller bearings. Contact surfaces exposed to
vibration during transportation are exposed to the risk of fretting corrosion.
Counter-measures Lubricants can reduce the friction when surfaces are moving relative to each other
under load. Reducing tolerances to prevent movement, or applying mounting fluids such as
Loctite, also reduces the relative movement of loaded surfaces.
Inter-granular corrosion
Introduction The microstructure of metals and alloys is made up of grains, separated by grain
boundaries. Intergranular corrosion is localized attack along the grain boundaries, or
immediately adjacent to grain boundaries, while the bulk of the grains remain largely
unaffected. This form of corrosion is usually associated with chemical segregation effects
(impurities have a tendency to be enriched at grain boundaries) or specific phases
precipitated on the grain boundaries. Such precipitation can produce zones of reduced
corrosion resistance in the immediate vicinity.
Mechanism The attack is usually related to the segregation of specific elements or the formation of a
compound in the boundary. Corrosion then occurs by preferential attack on the grain-
boundary phase, or in a zone adjacent to it that has lost an element necessary for
adequate corrosion resistance - thus making the grain boundary zone anodic relative to
the remainder of the surface. The attack usually progresses along a narrow path along
the grain boundary and, in a severe case of grain-boundary corrosion, entire grains may
be dislodged due to complete deterioration of their boundaries.
Effect In any case the mechanical properties of the structure are seriously affected. A classic
example is the sensitization of stainless steels or weld decay. Chromium-rich grain
boundary precipitates lead to a local depletion of Cr immediately adjacent to these
precipitates, leaving these areas vulnerable to corrosive attack in certain electrolytes.
Reheating a welded component during multi-pass welding is a common cause of this
problem. In austenitic stainless steels, titanium or niobium can react with carbon to form
carbides in the heat affected zone (HAZ) causing a specific type of intergranular corrosion
known as knife-line attack. These carbides build up next to the weld bead where they
cannot diffuse due to rapid cooling of the weld metal. The problem of knife-line attack
can be corrected by reheating the welded metal to allow diffusion to occur.
Aluminium Many aluminum base alloys are susceptible to intergranular corrosion, either because
phases anodic to aluminum are present along grain boundaries or because of
copper-depleted zones adjacent to grain boundaries in copper-containing alloys.
Wrought aluminium Exfoliation corrosion is a further form of intergranular corrosion associated with
high-strength aluminum alloys. Alloys that have been extruded or otherwise worked heavily,
with a microstructure of elongated, flattened grains, are particularly prone to this damage.
Corrosion products building up along these grain boundaries exert pressure between the
grains and the end result is a lifting or leafing effect. The damage often initiates at end
grains encountered in machined edges, holes or grooves and can subsequently progress
through an entire section.
Environmental cracking
Propagation The cracks form and propagate approximately at right angles to the direction of the
tensile stresses at stress levels much lower than those required to fracture the material in
the absence of the corrosive environment. As cracking penetrates further into the
material, it eventually reduces the supporting cross section of the material to the point
of structural failure from overload. SCC occurs in metals exposed to an environment
where, if the stress was not present or was at much lower levels, there would be no
damage. If the structure, subject to the same stresses, were in a different environment
(noncorrosive for that material), there would be no failure. Examples of SCC in the
nuclear industry are cracks in stainless steel piping systems and stainless steel valve
stems.
Stress cells Stress cells can exist in a single piece of metal where a portion of the metal's
microstructure possesses more stored strain energy than the rest of the metal. Metal
atoms are at their lowest strain energy state when situated in a regular crystal array.
Deviations from this lowest-strain state include the following.
Grain boundaries By definition, metal atoms situated along grain boundaries are not located in a regular
crystal array (i.e. a grain). Their increased strain energy translates into an electrode
potential that is anodic to the metal in the grains proper. Thus, corrosion can selectively
occur along grain boundaries.
Highly localized stress Regions within a metal subject to a high local stress will contain metal atoms at a
higher strain energy state. As a result, high-stress regions will be anodic to low-stress
regions and can corrode selectively. For example, bolts under load are
subject to more corrosion than similar bolts that are unloaded. A good rule of
thumb is to select fasteners that are cathodic (i.e. higher on the Electrochemical
Series) to the metal being fastened in order to prevent fastener corrosion.
Cold worked Regions within a metal subjected to cold-work contain a higher concentration of
dislocations, and as a result will be anodic to non-cold-worked regions. Thus, cold-
worked sections of a metal will corrode faster. For example, nails that are bent will often
corrode at the bend, or at their head where they were worked by the hammer.
Stress corrosion cracking
Introduction Stress corrosion cracking (SCC) is the cracking induced from the combined influence of
tensile stress and a corrosive environment. The impact of SCC on a material usually falls
between dry cracking and the fatigue threshold of that material. The required tensile
stresses may be in the form of directly applied stresses or in the form of residual stresses.
The problem itself can be quite complex.
Intergranular SCC of an Inconel heat exchanger tube with the crack following the grain
boundaries is illustrated below
Origins Cold deformation and forming, welding, heat treatment, machining and grinding can
introduce residual stresses. The magnitude and importance of such stresses is often
underestimated. The residual stresses set up as a result of welding operations tend to
approach the yield strength. The build-up of corrosion products in confined spaces can
also generate significant stresses and should not be overlooked. SCC usually occurs in
certain specific alloy-environment-stress combinations.
A complex issue The situation with buried pipelines is a good example of the complexity of
stress corrosion cracking.
Unexpected Usually, most of the surface remains unattacked, but with fine cracks penetrating into
failure the material. In the microstructure, these cracks can have an intergranular or a
transgranular morphology. Macroscopically, SCC fractures have a brittle appearance.
SCC is classified as a catastrophic form of corrosion, as the detection of such fine cracks
can be very difficult and the damage not easily predicted. Experimental SCC data is
notorious for a wide range of scatter. A disastrous failure may occur unexpectedly, with
minimal overall material loss.
Examples The catastrophic nature of this severe form of corrosion attack has been repeatedly
illustrated in many newsworthy failures, including the following:
• Swimming pool roof collapse in Uster, Switzerland
• EL AL Boeing 747 crash in Amsterdam
Chloride SCC One of the most important forms of stress corrosion that concerns the nuclear industry is
chloride stress corrosion. Chloride stress corrosion is a type of intergranular corrosion
and occurs in austenitic stainless steel under tensile stress in the presence of oxygen,
chloride ions, and high temperature. It is thought to start with chromium carbide
deposits along grain boundaries that leave the metal open to corrosion. This form of
corrosion is controlled by maintaining low chloride ion and oxygen content in the
environment and use of low carbon steels.
Caustic SCC Despite the extensive qualification of Inconel for specific applications, a number of
corrosion problems have arisen with Inconel tubing. Improved resistance to caustic stress
corrosion cracking can be given to Inconel by heat treating it at 620°C to 705°C,
depending upon prior solution treating temperature. Other problems that have been
observed with Inconel include wastage, tube denting, pitting, and intergranular attack.
Hydrogen embrittlement
Introduction This is a type of deterioration which can be linked to corrosion and corrosion-control
processes. It involves the ingress of hydrogen into a component, an event that can
seriously reduce the ductility and load-bearing capacity, cause cracking and catastrophic
brittle failures at stresses below the yield stress of susceptible materials. Hydrogen
embrittlement occurs in a number of forms but the common features are an applied
tensile stress and hydrogen dissolved in the metal.
Sources Sources of hydrogen causing embrittlement have been encountered in the making
of steel, in processing parts, in welding, in storage or containment of hydrogen
gas, and related to hydrogen as a contaminant in the environment that is often a
by-product of general corrosion. It is the latter that concerns the nuclear industry.
Hydrogen may be produced by corrosion reactions such as rusting, cathodic
protection and electroplating. Hydrogen may also be added to reactor coolant to
remove oxygen from reactor coolant systems. A further source is the use of cathodic
protection for corrosion protection if the process is not properly controlled.
Hydrogen embrittlement of stainless steel
Introduction Hydrogen diffuses along the grain boundaries and combines with the carbon, which is
alloyed with the iron, to form methane gas. The methane gas is not mobile and collects
in small voids along the grain boundaries where it builds up enormous pressures that
initiate cracks. Hydrogen embrittlement is a primary reason that the reactor coolant is
maintained at a neutral or basic pH in plants without aluminum components.
Mechanism If the metal is under a high tensile stress, brittle failure can occur. At normal room
temperatures, the hydrogen atoms are absorbed into the metal lattice and diffuse
through the grains, tending to gather at inclusions or other lattice defects. If stress
induces cracking under these conditions, the path is transgranular. At high
temperatures, the absorbed hydrogen tends to gather in the grain boundaries and
stress-induced cracking is then intergranular. The cracking of martensitic and
precipitation hardened steel alloys is believed to be a form of hydrogen stress corrosion
cracking that results from the entry into the metal of a portion of the atomic hydrogen
that is produced in the following corrosion reaction.
Not permanent Hydrogen embrittlement is not a permanent condition. If cracking does not occur and
the environmental conditions are changed so that no hydrogen is generated on the
surface of the metal, the hydrogen can rediffuse from the steel, so that ductility is
restored.
Wear
Introduction Wear is the removal of material from the surface of a solid body as a result of
the mechanical action of the counter-body.
Wear may combine the effects of various physical and chemical processes
that take place during friction between two counteracting materials: micro-cutting,
micro-ploughing, plastic deformation, cracking, fracture, melting and chemical
interaction.
Abrasive wear Abrasive wear occurs when a harder material is rubbing against a softer material.
If there are only two rubbing parts involved in the friction process the wear is
called two body wear.
In this case the wear of the softer material is caused by the asperities on the harder
surface.
If the wear is caused by a hard particle (grit) trapped between the rubbing surfaces it
is called three body wear. The particle may be either free or partially embedded into
one of the mating materials.
At the micro-level, abrasive action results in one of the following wear modes:
Ploughing. The material is shifted to the sides of the wear groove. The material is
not removed from the surface.
Cutting. A chip forms in front of the cutting asperity/grit. The material is removed
(lost) from the surface in the volume equal to the volume of the wear track (groove).
Cracking (brittle fracture). The material cracks in the subsurface regions
surrounding the wear groove. The volume of the lost material is higher than the
volume of the wear track.
Adhesive wear Adhesive wear is a result of micro-junctions caused by welding between the opposing
asperities on the rubbing surfaces of the counter-bodies. The load applied to the
contacting asperities is so high that they deform and adhere to each other, forming
micro-joints.
The motion of the rubbing counter-bodies results in rupture of the micro-joints. The
welded asperity ruptures in the non-deformed (non-cold worked) regions.
Thus some of the material is transferred to the counter-body. This effect is called
scuffing or galling.
When considerable areas of the rubbing surfaces are joined during friction, seizure of
one of the bodies by the counter-body may occur.
Fatigue wear Fatigue wear of a material is caused by cyclic loading during friction. Fatigue
occurs if the applied load is higher than the fatigue strength of the material.
Fatigue cracks start at the material surface and spread to the subsurface regions.
The cracks may connect to each other, resulting in separation and delamination of
pieces of the material.
One type of fatigue wear is fretting wear, caused by cyclic sliding of
two surfaces across each other with a small amplitude (oscillating). The friction
force produces alternating compression-tension stresses, which result in surface
fatigue.
Fatigue of a bearing overlay may result in the propagation of the cracks up to the
intermediate layer and total removal of the overlay.
Erosive wear Erosive wear is caused by impingement of particles (solid, liquid or gaseous),
which remove fragments of materials from the surface due to momentum effect.
Erosive wear may be caused by cavitation in lubrication oil. The cavitation voids
(bubbles) may form when the oil exits from the convergent gap between the
bearing and journal surfaces. The oil pressure rapidly drops providing conditions
for voids formation (the pressure is lower than the oil vapor pressure). The bubbles
(voids) then collapse producing a shock wave, which removes particles of the
bearing material from the bearing.
Mechanical fatigue
Development Fatigue occurs when a material is subjected to repeated loading and unloading. If
the loads are above a certain threshold, microscopic cracks will begin to form at
the stress concentrators such as the surface, persistent slip bands (PSBs),
interfaces of constituents in the case of composites, and grain interfaces in the
case of metals.[1] Eventually a crack will reach a critical size, the crack will
propagate suddenly, and the structure will fracture. The shape of the structure will
significantly affect the fatigue life; square holes or sharp corners will lead to
elevated local stresses where fatigue cracks can initiate. Round holes and smooth
transitions or fillets will increase the fatigue strength of the structure.
The American Society for Testing and Materials defines fatigue life, Nf, as the
number of stress cycles of a specified character that a specimen sustains before
failure of a specified nature occurs. For some materials, notably steel and titanium,
there is a theoretical value for stress amplitude below which the material will not fail
for any number of cycles, called a fatigue limit, endurance limit, or fatigue strength.
Engineers have used any of three methods to determine the fatigue life of a
material: the stress-life method, the strain-life method, and the linear-elastic
fracture mechanics method. One method to predict fatigue life of materials is the
Uniform Material Law (UML).[5] UML was developed for fatigue life prediction of
aluminium and titanium alloys by the end of the 20th century and extended to high-strength
steels and cast iron.
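As an illustration of the stress-life method, the sketch below uses Basquin's relation, sigma_a = sigma_f' * (2 * N_f) ** b, which is one common way to describe an S-N curve; it is not necessarily the form used in this course. The fatigue strength coefficient, the exponent and the optional fatigue limit are hypothetical parameters chosen only to make the example run, and real values must come from test data for the material in question.

```python
import math


def cycles_to_failure(stress_amplitude_mpa,
                      fatigue_strength_coeff_mpa=1000.0,  # hypothetical sigma_f'
                      basquin_exponent=-0.085,            # hypothetical b
                      fatigue_limit_mpa=None):
    """Estimate fatigue life N_f (cycles) from Basquin's relation.

    If a fatigue limit is given and the stress amplitude does not exceed
    it, the life is treated as infinite.
    """
    if stress_amplitude_mpa <= 0:
        raise ValueError("stress amplitude must be positive")
    if fatigue_limit_mpa is not None and stress_amplitude_mpa <= fatigue_limit_mpa:
        return math.inf
    # Solve sigma_a = sigma_f' * (2 N_f)^b for the number of reversals 2 N_f.
    reversals = (stress_amplitude_mpa / fatigue_strength_coeff_mpa) ** (1.0 / basquin_exponent)
    return reversals / 2.0


for stress in (500.0, 400.0, 250.0):  # MPa, hypothetical load levels
    print(stress, cycles_to_failure(stress, fatigue_limit_mpa=300.0))
```

Lower stress amplitudes give longer lives, and any amplitude at or below the assumed fatigue limit returns an infinite life, matching the behaviour described for steel and titanium.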
Characteristics of fatigue
In metal alloys In metal alloys, and for the simplifying case when there are no macroscopic or
microscopic discontinuities, the process starts with dislocation movements at the
microscopic level, which eventually form persistent slip bands that become the
nucleus of short cracks.
Random event Fatigue is a process that has a degree of randomness (stochastic), often showing
considerable scatter even in seemingly identical samples in well-controlled
environments.
Mechanical stress overload
Introduction Mechanical stress overload failures are not caused by cyclic stress. They happen
when a part is stressed beyond its yield point by a force.
Characteristics Unlike mechanical fatigue of metal, there are no beach marks visible at the
fracture face. The fracture face also tends to have a granular texture that is
uniform in appearance as there is a large fast fracture zone.
Whereas fatigue fractures usually occur at a right angle, overload fractures tend
to occur diagonally.
Fastener failures
Embedment Embedment happens when the bolt or nut rests on a surface that contains foreign
material, dirt or high spots that flatten over time and cause the bolt to lose tension
and work itself loose.
Stress overload In a tensile overload failure, the bolt will stretch and ‘neck down’ prior to rupture. One of
the fracture faces will form a cup and the other a cone. This type of failure indicates that
either the bolt was inadequate for the installation or it was preloaded beyond the
material’s yield point.
Torsional shear (twisting) Fasteners are not normally subjected to torsional stress. This sort of failure is
usually seen in driveshafts, input shafts and output shafts. However, we have seen torsional
shear failure when galling takes place between the male and female threads (always
due to using the wrong lubricant or no lubricant) or when the male fastener is
misaligned with the female thread. The direction of failure is obvious and, in most
cases, failure occurs on disassembly.
Impact shear Fracture from impact shear is similar in appearance to torsional shear failure, with flat
failure faces and obvious directional traces. Failures due to impact shear occur in bolts
loaded in single shear, like flywheel bolts and ring gear bolts. Usually the failed bolts were
called upon to locate the device as well as to clamp it and, almost always, the bolts were
insufficiently preloaded on installation. Fasteners are designed to clamp parts together,
not to locate them. Location is the function of dowels. Another area where impact failures
are common is in connecting rod bolts, when a catastrophic failure elsewhere in the
engine (debris from a failing camshaft or crankshaft) impacts the connecting rod.
Cyclic fatigue failure originated by hydrogen embrittlement Some of the high strength ‘quench and
temper’ steel alloys used in fastener manufacture are subject to ‘hydrogen embrittlement.’ L-19®,
H-11, 300M, Aeromet and other similar alloys are particularly susceptible and extreme care must be
exercised in manufacture. The spot on the first photo is typical of the origin of this
type of failure. The second is an SEM photo at 30X magnification.
Cyclic fatigue cracks propagated from a rust pit (stress corrosion) Again, many of the high strength
steel alloys are susceptible to stress corrosion. The photos illustrate such a failure. The first
picture is a digital photo with an arrow pointing to the double origin of the fatigue cracks. The
second photograph at 30X magnification shows a third arrow pointing to the juncture of the cracks
propagating from the rust pits. L-19, H-11, 300M and Aeromet are particularly susceptible to
stress corrosion and must be kept well oiled and never exposed to moisture,
including sweat. Inconel 718, ARP 3.5 and Custom age 625+ are immune to both
hydrogen embrittlement and stress corrosion.
Cyclic fatigue cracks initiated by improper installation preload Many connecting rod bolt failures
are caused by insufficient preload. When a fastener is insufficiently preloaded during installation,
the dynamic load may exceed the clamping load, resulting in cyclic tensile stress and eventual
failure. The first picture is a digital photo of such a failure with the bolt still in the rod. The
arrows indicate the location of a cut made to free the bolt. The third arrow shows the origin of the
fatigue crack in the second picture, an SEM photo at 30X magnification that clearly shows
the origin of the failure (1) and the telltale ‘thumbprint’ or ‘beach mark’ (2). Finally, (3) marks
the tracks of the outwardly propagating fatigue cracks and the point where the bolt
(unable to carry any further load) breaks away.
Lubricant failures
Introduction Lubricants are a cocktail of hydrocarbons and other chemical additives that give them the
characteristics required by the application.
Additive depletion Lubricants degrade when their additives become depleted, and the
lubricant loses some of the properties that are required for the
application.
Contamination Lubricants also degrade with time or usage when contaminants build up in the
oil. Some of these contaminants, especially the solid particles, can be removed
from suspension by means of filters. Contaminants that go into solution, such as
fuel and moisture, cannot be removed by filtering.
Chemical decay
Introduction Components that contain rubber and other elastomers are likely to suffer chemical
decay. This usually appears first as cracks on the surface, which is also known as
dry rot. Eventually the structure crumbles and loses structural integrity.
Bearing failures
Clearance table (pre-load) The table below shows the nominal bore diameter in millimetres and the
radial clearance values in micrometres (μm)
How big is a micron? The diagram below shows the size of common objects to give
some perspective on the internal clearance (pre-load) of bearings
Life expectancy and internal clearance The diagram below shows that a bearing’s life expectancy
depends on the correctness of the preload at installation. From the diagram one can conclude
that excessive pre-load is more harmful than too little pre-load and that correct
preload will provide the longest life
Causes The table below shows the percentage of bearings that fail due to natural causes and how
many fail prematurely
Soft foot
Introduction Any condition where less than perfect surface contact is made between the
underside of the machine’s feet and the surface of the base plate, or frame, is
called soft foot. Soft foot is similar to being seated at a wobbling table. The table
wobbles because at least one leg does not come in perfect contact with the floor.
For a table, this is considered an inconvenience; with industrial equipment, this
condition will result in misalignment and equipment damage.
Effect Soft foot is a commonly misunderstood term and a topic that can be considered on
its own, separate from alignment. The effects of soft foot can be so prevalent in the
alignment process, however, that it must be eliminated before making any
alignment corrections. Pre-alignment checks and procedures include eliminating
soft foot; however, soft foot should be checked in each stage of the alignment
process.
Angular soft foot – Angular soft foot can occur when the foot is touching the base
on either the outside or inside portion of the foot, but the other side of the foot is
bent away creating an angle between the base and the bottom of the foot. In both
cases, tightening the hold-down bolts will result in a distortion of the machine’s
frame as the foot is drawn down to the base.
Squishy foot – Squishy foot, sometimes called spring foot, exists when the gap
between the foot and base has already been filled with shims. The machine will
appear to be free of soft foot problems until the hold-down bolts are tightened.
Tightening the hold-down bolts can compress shims that are creased, bent, or
otherwise damaged. This condition can distort the machine’s frame as the foot is
drawn down to the base.
Stress-induced soft foot – Perhaps the most difficult soft foot condition to detect is
caused by forces that are external to the machine. This is referred to as stress-
induced soft foot. It can be the result of pipe strain or stresses induced by the
electrical connections as well as drastic misalignment. Stress-induced forces can
be created during any stage of the alignment process; therefore, eliminating this
kind of soft foot may require more than one check.
Frame distortion – Frame distortion can be caused by uncorrected soft foot. This
condition exists when the soft foot is forced to mate with the base. With lighter
framed motors, frame distortion can bend the motor housing. On larger motors,
frame distortion can lead to premature failure of components, such as bearings.
Distorted bearing housing – Machine frame distortion can distort the bearing
housing. This can result in excessive wear on the top and bottom of the outer race
and lead to premature failure.
Fretting corrosion – Vibration can loosen the bolts holding a motor to its
foundation. A motor with a soft foot is more likely to cause fretting corrosion and
repetitive impact damage to its foundation and bolts. This corrosion will, in turn,
worsen the soft foot condition.
Fatigue endurance
Introduction The diagram below shows the relationship between stress and the fatigue
endurance life for carbon steel and aluminium. The higher the stress, the lower the
life. In the case of carbon steel, the fatigue endurance life is infinite provided that
the stress does not exceed 300 MPa.
Effect Keeping stress levels within the design range ensures an almost infinite fatigue life. Every
instance of overload reduces the endurance life.
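One simple way to put a number on how overloads consume endurance life is the Palmgren-Miner linear damage rule, which is introduced here as an illustration and is not a method stated in this course. Each block of loading uses up the fraction n / N of the life, where n is the number of cycles applied at a stress level and N is the number of cycles to failure at that level; failure is predicted when the fractions add up to about 1. The load history below is hypothetical, and the rule ignores load-sequence effects.

```python
import math


def miner_damage(load_blocks):
    """Sum the damage fractions n / N over (cycles_applied, cycles_to_failure) pairs.

    Use math.inf as cycles_to_failure for stress levels below the
    endurance limit; such blocks contribute no damage.
    """
    damage = 0.0
    for cycles_applied, cycles_to_failure in load_blocks:
        if cycles_to_failure <= 0:
            raise ValueError("cycles to failure must be positive")
        damage += cycles_applied / cycles_to_failure
    return damage


# Hypothetical history: long service below the endurance limit plus two overload blocks.
history = [
    (1_000_000, math.inf),  # normal operation, no damage
    (500, 10_000),          # moderate overload
    (50, 1_000),            # severe overload
]
damage = miner_damage(history)
print(f"damage fraction used: {damage:.2f}, remaining: {1.0 - damage:.2f}")
```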
Section H
Human factors
Overview
Topic
Human performance in maintenance
Major incidents that were maintenance related
Human error introduction
Slips and lapses
Rule based mistakes
Knowledge based mistakes
Violations
Local error-provoking factors
Vigilance decrement
Bias
Non-detection
Recognition failures
Premature exits
Errors are pervasive
Human performance in maintenance
High probability of error It is no surprise that maintenance work attracts more than its fair share
of human performance problems.
Major incidents that were maintenance related
Three Mile Island 1979 The loss-of-coolant near-disaster at the Three Mile Island nuclear power plant in Pennsylvania (1979)
Bhopal 1984 The calamitous discharge of methyl isocyanate at a pesticide plant near the Indian city of Bhopal (1984)
Mount Osutaka 1985 The crash of a Japan Air Lines B747 into the side of Mount Osutaka (1985)
Piper Alpha 1988 The explosion on the Piper Alpha oil and gas platform in the North Sea (1988)
Phillips 66 Houston 1989 The explosion at the Phillips 66 Houston Chemical Complex in Pasadena, Texas (1989)
BAC1-11 Oxfordshire 1990 The blow-out of a flight deck window on a BAC1-11 over Oxfordshire (1990)
Embraer 120 Eagle Lake 1991 The in-flight structural break-up of an Embraer 120 at Eagle Lake, Texas (1991)
Blocked pitot tube 1996 A blocked pitot tube contributing to the total loss of a B757 at Puerto Plata in the Dominican Republic (1996)
Oxygen generator fire DC9 1996 The oxygen generator fire in the hold of a DC9 over Florida (1996)
Human error introduction
The James Reason model
ERROR
• Unintended action
  • Slip (attentional failure): carry out a planned task incorrectly or in the wrong sequence
  • Lapse (memory failure): miss out a step in a planned sequence of events
• Intended action
  • Mistake
    • Rule-based mistakes: misapplication of a good rule or application of a bad rule
    • Knowledge based mistakes: inappropriate response to a novel abnormal situation
  • Violation
    • Routine violation
    • Exceptional violation
    • Sabotage
Unintended action The important aspect to note about this type of error is that it is not related to a
lack of knowledge or skill. The person who makes this type of error knows what
to do and how to do it. The person’s intention is correct. However,
at some point during the execution the actions deviate from the intention. This is
likely due to distraction, preoccupation or absent-mindedness. This type of error is
therefore an error of execution, not of intention.
There are two categories of unintended actions:
• Slips
• Lapses
Intended action With this type of error the person’s intentions are already incorrect. This could be
because he or she:
• chooses a course of action that is not appropriate for the situation at hand
• does something habitually that is not the right thing to do
• does not really know what to do but proceeds anyway
• knows the correct course of action or behaviour but chooses to act inappropriately.
There are two categories of intended actions:
• Mistakes
• Violations
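The classification above can be read as a short decision sequence. The sketch below is one possible encoding of it, offered as an aid to understanding rather than as an established tool. The function name, the parameter names and the order of the questions are assumptions made for this illustration.

```python
from enum import Enum


class ErrorType(Enum):
    SLIP = "slip"
    LAPSE = "lapse"
    RULE_BASED_MISTAKE = "rule-based mistake"
    KNOWLEDGE_BASED_MISTAKE = "knowledge-based mistake"
    VIOLATION = "violation"


def classify_error(action_as_intended: bool,
                   step_omitted: bool = False,
                   knew_action_was_wrong: bool = False,
                   rule_existed: bool = True) -> ErrorType:
    """Classify an error using the unintended/intended action split.

    action_as_intended:    did the person do what he or she set out to do?
    step_omitted:          was a planned step left out (memory failure)?
    knew_action_was_wrong: did the person deliberately act inappropriately?
    rule_existed:          was there a rule or procedure for the situation?
    """
    if not action_as_intended:
        # Error of execution: the intention was correct, the doing went wrong.
        return ErrorType.LAPSE if step_omitted else ErrorType.SLIP
    if knew_action_was_wrong:
        return ErrorType.VIOLATION
    # Error of intention without deliberate wrongdoing: a mistake.
    if rule_existed:
        return ErrorType.RULE_BASED_MISTAKE
    return ErrorType.KNOWLEDGE_BASED_MISTAKE


if __name__ == "__main__":
    # A fitter forgets to refit a locking pin after being distracted.
    print(classify_error(action_as_intended=False, step_omitted=True).value)
```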
Slips and lapses
Introduction Slips and lapses are the most common types of error. They are part of
everyday life and are a major cause of incidents and accidents. They are mainly
caused by lack of attention, distractions, preoccupations and absent-mindedness.
[Figure: paired illustrations labelled Correct and Incorrect]
Rule based mistakes
Introduction Mistakes occur when somebody takes a course of action that is inappropriate.
Rule based Rule based mistakes occur when a person believes he or she is following the
mistakes correct course of action when doing a task (‘applying a rule’) but in fact the course
of action is inappropriate.
Misapplication of a good rule A person selects a course of action because it has been successful in the past
(a ‘good rule’). However, some subtle variation on this occasion means that the course of
action, although undertaken deliberately, is wrong.
An experienced lubricator was given on-the-job training to always fill the rollstand
gearboxes with SAE 90 GL-3 gear oil (the good rule). Some time later the
company started replacing some of the gearboxes with hypoid gear units, and
the lubricator continued to use GL-3 oil instead of GL-5 in these units.
GOOD RULE: ‘Always top up the rollstand gearboxes with SAE 90 GL-3 oil’
Correct application of good rule: Rollstand gearboxes with bevel gears require SAE 90 GL-3 oil
Incorrect application of good rule: Rollstand gearboxes with hypoid gears require SAE 90 GL-5 oil
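One way to guard against this kind of misapplication is to tie the rule to the condition that makes it valid, in this case the gear type, rather than to the equipment name. The sketch below is a minimal illustration of that idea. The lookup table contains only the two cases given in the example above, and the function and table names are hypothetical.

```python
# Lubricant by gear type, taken from the rollstand gearbox example.
OIL_BY_GEAR_TYPE = {
    "bevel": "SAE 90 GL-3",
    "hypoid": "SAE 90 GL-5",
}


def select_gear_oil(gear_type: str) -> str:
    """Return the oil for a gear type, refusing to guess for unknown types."""
    key = gear_type.strip().lower()
    if key not in OIL_BY_GEAR_TYPE:
        # An unknown gear type is a novel situation: stop and ask.
        raise ValueError(
            f"No lubricant rule for gear type {gear_type!r}; check the specification"
        )
    return OIL_BY_GEAR_TYPE[key]


if __name__ == "__main__":
    for gear in ("bevel", "hypoid"):
        print(gear, "->", select_gear_oil(gear))
```

Raising an error for an unknown gear type, instead of falling back to a default oil, reflects the point that a rule should not be stretched to situations it was never written for.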
Application of a bad rule The normally chosen or prescribed course of action is incorrect. Although it has
been used with some success in the past, it is not best practice.
Examples:
• Increase the trip current setting of a current overload protective device
above the prescribed level when an electric motor has tripped a number
of times.
• Apply belt dressing to a v-belt drive to compensate for slippage due to
worn pulley grooves.
• Use low tensile strength bolts when high tensile bolts are not readily
available
Knowledge based mistakes
Introduction Mistakes occur when a person takes a course of action that is inappropriate.
Knowledge based mistakes Knowledge based mistakes occur when a person is confronted with a situation which has not occurred before
and which has not been anticipated (in other words, one for which there are no
‘rules’).
Some causes A person has no ‘rules’ or procedure to fall back on. In situations like this the person
has to make a decision about an appropriate course of action, and a mistake
occurs when that selected course of action is wrong. A common mistake managers
make is the belief that ‘I know, therefore my subordinates know’. In fact, if a crisis
occurs late at night when all senior people are off-site, the requisite knowledge is
useless if it is not in the mind of the person who has to take the first steps to deal
with the crisis.
How to The first and most obvious way to reduce knowledge based mistakes is to improve
reduce the knowledge of operators and maintainers with regard to the physical assets that
they operate and maintain.
Violations
Introduction Violations occur when somebody knowingly and deliberately commits an error.
Routine violations Routine violations occur when ‘rules’ are routinely not followed or adhered to. E.g.
‘Operators never flush the density correction pumps with process water when they
shut down, although the SOP specifies it, because they think it is not really
necessary as it has not caused problems in the past.’
Exceptional violations Exceptional violations occur when ‘rules’ are sometimes not followed or adhered to. E.g.
‘Operators sometimes don’t flush the density correction pumps with process water when
the shutdown is only of short duration, although the SOP specifies it.’
Sabotage Sabotage occurs when a person acts intentionally to cause loss or damage. E.g.
‘The operator allows the dump truck to run out of fuel because he or she knows it
will put the truck out of operation for at least two hours, allowing him or her to have
a rest.’
Local error-provoking factors
Introduction Human errors are not caused by ‘bad luck’. They are shaped by situation and task
factors that are part of the environment in which a task is performed. Error-producing
conditions in the workplace are commonly referred to as local factors: those
present in the immediate surroundings at the time.
Team beliefs Whatever beliefs the team may have will be upheld by the individual team
member. If the team feels it is infallible then the team members will not consider it
necessary to verify each other’s work.
Time Under extreme time pressure some people may think that quality may be
pressure sacrificed in order to reduce the time the work takes.
Unworkable Sometimes work and especially safety procedures are an obstacle to the
procedures performance of the work. In this case the work either stops or people carry on with
the work and violate the procedure.
Spares availability The unavailability of spares such as gaskets, seals and retainers means
that these items have to be reused where they should have been replaced. This
compromises the quality of the work.
‘Can-do’ Some people are very optimistic with regard to their capabilities. They take on
attitudes work for which they may not be competent.
Demographics Workforce age could be a factor. Older people are less likely to violate but may
be more likely to suffer slips and lapses.
Documentation Documented task procedures with quantified acceptable standards are essential
to ensure that work is done correctly. Lack of documentation leads to
assumptions and errors.
Technical Technical support that is available in the form of documentation, OEM product
support support, call centres and field technicians help to reduce knowledge based
errors.
Unsuitable The right tool for the job is essential to ensure the quality of the work.
tools
Circadian low points Work done between midnight and 05:00 in the morning has a higher
likelihood of error because people’s energy and ability to concentrate are at a low
level.
Poor Maintenance people are not known for their communication skills. They tend to
communication be the quiet type and are not always willing to share their knowledge. The
communication between different disciplines and between maintenance and
operations is often not what it should be.
Shift Communication between the incoming and outgoing shift could be critical
handovers especially if there is maintenance work in progress. Specific rules need to be in
place to ensure that this happens.
Inexperience High staff turnover can at times leave the section in a situation in which most of the
staff lack experience of the specific work or equipment.
Task Tasks that happen at a low frequency are a risk for errors because nobody has
frequency experience in doing the task and it becomes a novel task for anybody that does
it.
Design deficiencies Sometimes the task is especially error prone due to a design issue. Knowing about
these deficiencies is already a step in the right direction to deal with them.
Housekeeping and tool control ‘A place for everything and everything in its place’ reduces the risk of old parts
getting mixed up with new parts, tools being left behind in the equipment, and dirt
and other contaminants entering the equipment.
Vigilance decrement
Introduction People have a natural tendency to become less vigilant and observant if the work
becomes repetitive or has a low level of activity or is not challenging mentally.
Repetitive work A maintainer who follows the same inspection list day after day, week after week will
soon become bored and allow his or her mind to wander to other activities that are
more interesting.
Few hits When the ratio of ‘hits’ (defects found) to the number of inspections is low,
maintainers are likely to become complacent and allow their attention to wander.
See what is expected Maintainers who do the same inspection repeatedly begin to develop a mental
image of what to expect, so even when there is a defect they are unlikely to see it.
Bias
Introduction Ideally, work-related decisions should be taken following a careful, rational process
in which we consider all options, evaluate each in turn and then select the best
course of action. This is often not the case, because everybody suffers from bias.
That is why bias should always be taken into consideration.
Confirmation When confronted with an unfamiliar problem we often develop a theory to explain
bias the situation. Once we have such an idea we tend to search for information that
will confirm what we suspect and ignore information to the contrary. People rarely
try to prove themselves wrong! The initial incorrect fault diagnosis can block
attempts to consider other possibilities.
Example For instance, a truck was reported to be pulling to the left when brakes were
applied. The maintainer considered that this was because the brakes were
binding on one of the left wheels. After time-consuming and unsuccessful
attempts to correct the problem, a quite different problem was eventually
discovered. The brakes on the right-hand side were not working!
Non-detection
Introduction Despite new techniques and technology for detecting faults, we still rely on the
human eyeball for most fault-finding tasks. Non-detection errors typically involve a
failure to notice a visible fault during an inspection
Other Other factors include inexperience and not being sufficiently trained in knowing
factors what to look out for in the way of signs and symptoms. At other times we do not
take into account the physiological limitations of the human visual system.
On a long, boring inspection the mind will tend to wander to other matters.
Recognition failures
Examples Below are two examples of object misidentification, one of which had tragic
consequences.
Tunisian ATR-72 airplane ran out of fuel due to an incorrect fuel gauge being installed
The investigation into the crash into the sea of a Tunisian ATR-72 airplane near Sicily on
August 6, 2005 arrived at an astonishing conclusion: the wrong type of fuel gauge was
installed in the aircraft. The gauge was designed for another type of aircraft, the much
smaller ATR-42. The gauges have exactly the same appearance, with the part number
indicated on the casing being the only distinguishing feature. Thirteen people died and ten
people, including the pilot, received prison sentences of up to ten years.
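When two parts look identical, the part number is the only reliable check, so it helps if the check does not depend on someone noticing a difference by eye. The sketch below shows one way such a check could look. It is an illustration only: the part numbers are placeholders invented for this example, not real ATR part numbers, and the applicability table and function name are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Part:
    part_number: str
    description: str


# Hypothetical applicability list: approved fuel gauge part numbers per aircraft type.
APPROVED_FUEL_GAUGES = {
    "ATR-72": {"FQI-72-EXAMPLE"},
    "ATR-42": {"FQI-42-EXAMPLE"},
}


def is_approved(part: Part, aircraft_type: str) -> bool:
    """True only if the part number is listed for this aircraft type."""
    return part.part_number in APPROVED_FUEL_GAUGES.get(aircraft_type, set())


if __name__ == "__main__":
    gauge = Part("FQI-42-EXAMPLE", "Fuel quantity indicator")
    if not is_approved(gauge, "ATR-72"):
        print(f"STOP: {gauge.part_number} is not approved for ATR-72")
```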
Premature exits
Introduction As the name implies, premature exits involve terminating a job before all the
actions are complete, like getting into the shower with your socks on. As we
approach the end of a routine task, our minds jump ahead to the next activity, and
may lead us to leave out some late step in the first task.
Driving off with the fuel nozzle still inserted in the tank
Leaving the isolator in the locked position after the work is completed
Leaving the isolation valve in the closed position when the work is completed
Errors are pervasive
Introduction We need to accept the fact that errors are pervasive. Humans are fallible beings that
will continue to make errors. The first step to error management is to acknowledge
this and then act to reduce the risk.
Specification/design errors
Manufacturing errors
Acquisition/procurement errors
Installation errors
Commissioning errors
Operating errors
Maintenance errors
Normal deterioration
Specification and design errors Sometimes the equipment is not fit for purpose because of an error that occurred
during the functional specification or design stage.
Manufacturing error The equipment is not fit for purpose due to an error made during manufacturing:
processing, tooling, machining, drilling, grinding, milling, heating, cooling,
converting, forming, extruding, punching, riveting, welding, folding, soldering,
coating, painting, sealing, printing, mixing, etc.
Supply chain An error can occur during the acquisition of parts, materials, chemicals, fuels,
errors lubricants, feedstock, raw material, etc.
Transport There are many opportunities for damage during packaging, transport and storage
and storage of parts and materials including the expiry of shelf-life.
Installation errors A part that has reached this stage without damage could suffer damage
during installation.
Commissioning Some parts or equipment require special care and attention for a period after
errors installation and start-up to accommodate things like curing and bedding in.
Operating It could be that the equipment is novel to the operators and through some error
errors of omission or commission the equipment suffers damage.
Normal What we are seeking is to reduce the human error factors so that we only need
deterioration to deal with normal deterioration
Section I
Error management
Overview
Topic
Error management introduction
Improve people’s knowledge
Excessive dependence on memory
Interruptions
Tiredness, fatigue and sleepiness
Inadequate coordination
Unfamiliar work
Ambiguity
Highly routine work
Inadequate design
Aspects of good design
Error resistant design
Spares, tools and equipment factors
Task procedures
Tasks that are more likely to be error prone
Criteria for good reminders
Error management teams
Error management introduction
Introduction There is nothing new about trying to manage error. All responsible organizations
involved in hazardous operations have long employed a wide variety of error
management (EM) measures. In maintenance organizations, these include:
History These techniques have evolved over many decades. Though some are tried and
tested, they have collectively failed to prevent a steady rise in errors. Their
limitations include being piecemeal rather than principled, reactive rather than
proactive, and fashion-driven rather than theory-driven. They also ignore the
substantial developments that have occurred over the last 20 years in
understanding the nature and varieties of human error
Considerations Below are some of the principles and considerations of error management
and principles
1. Human error is both universal and inevitable
2. Errors are not intrinsically bad
3. You cannot change the human condition, but you can change the conditions in which humans work
4. The best people can make the worst mistakes
5. People cannot easily avoid those actions they did not intend to commit
6. Errors are consequences rather than causes
7. Many errors fall into recurrent patterns
8. Safety-significant errors can occur at all levels of the system
9. Error management is about managing the manageable
10. Error management is about making good people excellent
11. There is no one best way
12. Effective error management aims at continuous reform rather than local fixes
13. Managing error management is the most challenging and difficult part of the EM process
Improve people’s knowledge
Introduction Error management starts by improving people’s knowledge about human error, its
causes and ways in which it can be prevented or controlled.
What knowledge? Below are some of the things about human error that people need to learn:
• The limitations of short-term memory
• How fatigue and lack of sleep can increase the risk of human error
• Distractions
• Pre-occupations
• Ambiguity
See the Once people become more aware of their own vulnerability they are in a better
signs position to recognize the danger signs and take action before an error occurs
Excessive dependence on memory
Introduction Our memories are not always as reliable as we think, particularly when we are tired.
Memory lapses are the most common errors in maintenance. It is tempting fate to interrupt
a part-completed job without adequate reminders to tell you, and others, of its stage of
progress.
You run the risk of a memory lapse every time you try to keep a critical task step in mind to
perform later without any reminders. It is better to assume that you will forget, and take
precautions, than to hope that you will remember.
Precautions If a person accepts that he or she is likely to forget something then that person
against loss needs to take precautions such as taking notes, using a diary, setting an alarm on
the cellphone, using an electronic diary. Creating a habit of using checklists and to-
do lists is better than constantly having to
Interruptions
Counter- The most likely error is an omission. It is very important that you are aware of
measures these risks and take steps to combat them. An obvious counter-measure is to
anticipate the ‘now where was I?’ question when you take up the task again, and
to leave behind a clear reminder of exactly where you had to stop.
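The counter-measure described above amounts to recording, at the moment of interruption, which steps are done and which step comes next. The sketch below shows one simple way to do that in software. It is an illustration under assumed names: the TaskChecklist class and its methods are hypothetical and do not refer to any particular maintenance system.

```python
from dataclasses import dataclass


@dataclass
class TaskChecklist:
    steps: list[str]
    completed: int = 0   # number of steps signed off so far
    note: str = ""       # reminder left behind at the point of interruption

    def sign_off(self) -> None:
        """Sign off the next step in sequence."""
        if self.completed >= len(self.steps):
            raise IndexError("all steps are already signed off")
        self.completed += 1

    def interrupt(self, note: str) -> None:
        """Record why and in what state the job was left."""
        self.note = note

    def resume_point(self) -> str:
        """Answer the 'now where was I?' question."""
        if self.completed == len(self.steps):
            return "Task complete"
        message = f"Resume at step {self.completed + 1}: {self.steps[self.completed]}"
        return f"{message} ({self.note})" if self.note else message


if __name__ == "__main__":
    job = TaskChecklist(["Isolate pump", "Replace seal", "Remove isolation"])
    job.sign_off()
    job.interrupt("called away to a breakdown, seal housing open")
    print(job.resume_point())
```

Because the reminder is stored with the job rather than in the maintainer's head, it is also available to a colleague who takes over the task.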
Tiredness, fatigue and sleepiness
Introduction Fatigue is a subjective feeling of tiredness which is distinct from weakness and has
a gradual onset. Unlike weakness, fatigue can be alleviated by periods of rest.
Physical fatigue Physical fatigue is the transient inability of a muscle to maintain optimal physical
performance, and is made more severe by intense physical exercise.
Micro sleep A micro-sleep is a temporary episode of sleep or drowsiness which may last for a
fraction of a second or up to 30 seconds where an individual fails to respond to
some arbitrary sensory input and becomes unconscious. Micro sleeps occur when
an individual loses awareness and subsequently gains awareness after a brief
lapse in consciousness, or when there are sudden shifts between states of
wakefulness.
Implications for errors Fatigue can increase your chances of making errors, particularly memory lapses.
Sleepy people are also more irritable and harder to work with. Micro-sleeps
during activities that require constant alertness, such as driving a motor vehicle or
flying an aeroplane, can lead to catastrophic events.
Inadequate coordination
Poor communications Sometimes people fear that they will give offence if they are seen to check the
work of colleagues too thoroughly or ask too many questions.
Coordination danger signs include rushed shift handovers, a lack of adequate
communication, not asking questions because you feel silly or do not want to
offend a work colleague, and working with unfamiliar people.
Unfamiliar work
Introduction If you are performing a task that is not part of your normal duties, even if you used to
perform it in previous years, you are entering a danger zone for error. If an
unfamiliar task is being performed on the basis of ‘trial and error’, you must
recognize that your chances of getting it wrong are greatly increased.
Lack of recent experience A further point to note is that a significant number of incidents have involved
supervisors helping out by getting involved in hands-on work. Although such
people may be technically qualified to perform the work, and indeed highly
motivated to do a proper job, their practical skills may be degraded.
Ambiguity
Introduction Any situation in which you are unsure of what is going on should be a sign to call a
halt and clarify the task. Such situations are particularly common in team-based
work environments, where ‘diffusion of responsibility’ can result in people assuming
that someone else knows what is going on and has taken charge.
Counter Just as with poor coordination, ambiguity driven errors are likely to persist as long
measures as people are unwilling to communicate and get clarification when situations of
uncertainty arise.
Highly routine work
Introduction Any procedure that you can perform ‘with your eyes closed’, such as opening and
closing access covers or checking oil levels, is a danger zone for slips and lapses.
Because we are so familiar with such tasks, our attention may wander elsewhere,
leaving our actions largely under the control of the ‘mental autopilot’.
Counter While we cannot stop such tasks from being ‘on automatic’, we can remain vigilant
measures and spot the errors that will occur from time to time.
Inadequate design
Introduction Investigation of the crash into the sea of a Tunisian ATR-72 airplane near Sicily on August 6,
2005 arrived at an astonishing conclusion. The wrong type of fuel gauge was installed in
the aircraft. The gauge was designed for another type of aircraft, the much smaller ATR-42.
Ambiguity The level indicated by the gauge was wrong, causing the pilot to think that the
plane needed less fuel than it did. Then, when refuelling at Bari airport, less fuel than
was needed was loaded. While in flight the airplane ran out of
fuel, although the gauge still showed fuel available. Thirteen people died and ten
people received prison sentences ranging from eight to ten years for their role in
the accident.
Aspects of good design
Introduction The way that equipment has been designed can go a long way in reducing human
error.
Innovative The possibilities for innovation are not, by any means, exhausted. Technological
development is always offering new opportunities for innovative design. But
innovative design always develops in tandem with innovative technology, and can
never be an end in itself.
Usefulness Anything designed is meant to be used. It has to satisfy certain criteria, not only
functional, but also psychological and aesthetic. Good design emphasises
usefulness while disregarding anything that could detract from it.
Aesthetic Aesthetic quality is integral to usefulness because the things we use every day affect our
person and our well-being. Only well-executed objects can be beautiful.
Under- It clarifies the structure. Better still, it can make the design outcome talk. At best,
standable the thing designed is self-explanatory.
Honest It does not make the design outcome more innovative, powerful, or valuable than
it really is. It does not attempt to manipulate the consumer with promises that
cannot be kept.
Unobtrusive Designed items fulfilling a purpose are like tools. They are neither decorative
objects nor works of art. Their design should therefore be neutral and restrained
in order to leave room for the user’s self-expression.
Durable It avoids being fashionable and therefore never appears antiquated. Unlike
fashionable design, it lasts many years, even in today’s throwaway society.
Thorough Nothing must be arbitrary or left to chance. Care and accuracy in the design
process show respect towards the consumer.
Need for Maintenance work should be able to be executed with minimum requirement for
special tools specialised tools
Field repairs Field repairs should not require precision work to be done.
Error resistant design
Introduction Below are a few examples of design that was done specifically with error reduction
in mind
Spares, tools and equipment factors
Introduction Many errors occur when technicians have to make do with standard tools instead
of tools that are required for the specialised work they have to do.
8 Are replacements available when special tools are sent away for
servicing or calibration?
9 Is there a procedure and a person that ensures that all tools are
issued and returned in a clean and functional state?
Task procedures
Introduction Task procedures are an essential element of maintenance and operations regardless of the
level of knowledge and experience of the people.
Basic Procedures should meet the following basic criteria in that they should specify:
requirements • Tasks to be performed by the operator / maintainer
• Instrument readings and samples to be taken
• Conditions to be maintained
• Safety precautions
• Safe operating limits for critical parameters
• Critical operating and maintenance parameters
• Results of exceeding safe limits
• Corrective and emergency actions
Basic • Be accurate
requirements • Be understandable
continued • Use familiar language
• Include input from process and design engineers and operations and
maintenance personnel
• Reflect how operations are actually performed
• Be thoroughly documented
• Be dated and/or have a revision number on every page
• Be reviewed and updated at regular intervals to capture procedural,
equipment, critical operating parameter, software, and process changes
• Be approved
Ease of access Procedure users must be able to quickly and easily obtain current, approved procedures
to prepare for and perform their jobs. Needed procedures must be readily accessible
(available) at all times. Procedures may be available as printed (hard-copy) documents,
they may be viewed on computer screens, or they may be printed, as needed, from
electronic files. The current, approved procedures must be available to ensure that only
up-to-date procedures are used to perform operations and maintenance tasks.
Clarity In addition to being readily available, procedures must be clear. They must be
written concisely in a straightforward manner and must consider both the difficulty
and importance of the task(s) being described. They must also consider the skills,
experience level, and needs of the user. If the user does not understand a
procedure, or does not have confidence in its accuracy, the procedure will most
likely not be used or it will be used incorrectly. Procedure training will foster
understanding and use of procedures.
Procedure management and control Many of the guidelines and regulations cited address the need to ensure
procedures are current and accurate. This means that a procedure management
system should be in place to implement and guide the development, review,
approval, distribution, accessibility, and updating of procedures. (See Chapter 3,
How to Design an Operating and Maintenance Procedure Management System.)
As mentioned in Chapter 1, we should treat procedures with the same respect as
we do equipment and process materials. They are a major investment. Revisions
or modifications to procedures should be analyzed, tracked, and approved in the
same manner as mechanical or technological changes.
Reviews and To ensure that procedures are accurate and reflect current practices, they must be
audits periodically reviewed. Revisions caused by changes or improvements in
equipment, process technology, standard practice, or facility status must be
incorporated as they occur. This is a function of your procedure management
system. The effect of changes in environmental and safety regulations on
procedures must not be overlooked. A Management of Change system directly
supports and controls these revisions.
Tasks that are more likely to be error prone
Introduction Recent psychological research has identified a number of task properties that are likely
to increase the probability that a particular step will be omitted. Some of the more
important of these features are as follows:
Informational The greater the informational loading of a particular task step—that is, the higher
loading the demands imposed upon short term memory—the more likely it is that items
within that step will be omitted.
Functionally Procedural steps that are functionally isolated—that is, ones that are not
isolated obviously cued by preceding actions nor follow in a direct linear succession from
steps them—are more likely to be left out.
Recursive Recursive or repeated procedural steps are particularly prone to omission. In the
steps case where two similar steps are required to achieve a particular goal, it is the
second of these two steps that is most likely to be neglected.
Steps that Necessary steps that follow the achievement of the main goal of a task are likely to be
follow the omitted. This is an instance of a general principle: steps located near the end of a task
achievement sequence are more prone to omission. Such “premature exits” are due in part to the
of main goal actor's preoccupation with the next task, particularly when the current activity involves
largely routine actions.
Lack of Steps in which the item to be acted upon is concealed or lacking in conspicuity
conspicuity are liable to omission.
Interruptions Steps following unexpected interruptions are especially prone to omission. This
can occur because the person loses her place in the action sequence and
believes herself to be further along than she actually is, or because some
unrelated action is unconsciously “counted in” as part of the task sequence
Planned Tasks that involve planned departures from standard operating procedures or
departures from habitual action sequences are liable to strong habit intrusions in which the
from habitual currently intended actions are supplanted by a more frequently used routine in
action that context, and thus omitted.
Combining A number of these omission provoking properties can combine in a single task
step. When this occurs, the effects are additive and the result is a recurrent error
trap that predictably snares a large number of people.
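Since the effects of these properties are described as additive, a simple count of how many apply to each step gives a rough way to find the steps most likely to be omitted. The sketch below is a hedged illustration of that idea. The property names are shorthand for the features listed above, and giving every property an equal weight of one is an assumption made for this example, not something the research cited here specifies.

```python
# Shorthand names for the omission-provoking properties described above.
OMISSION_PROPERTIES = (
    "high_informational_load",
    "functionally_isolated",
    "recursive",
    "after_main_goal",
    "low_conspicuity",
    "after_interruption",
    "planned_departure_from_habit",
)


def omission_risk_score(step_properties: set[str]) -> int:
    """Count the omission-provoking properties of one step (equal weights assumed)."""
    unknown = step_properties - set(OMISSION_PROPERTIES)
    if unknown:
        raise ValueError(f"Unknown properties: {sorted(unknown)}")
    return len(step_properties)


def rank_steps(steps: dict[str, set[str]]) -> list[tuple[str, int]]:
    """Return (step, score) pairs, highest score first."""
    scored = [(name, omission_risk_score(props)) for name, props in steps.items()]
    return sorted(scored, key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    task = {
        "Remove access cover": set(),
        "Refit second O-ring": {"recursive", "low_conspicuity"},
        "Remove isolation lock": {"after_main_goal"},
    }
    for name, score in rank_steps(task):
        print(score, name)
```

Steps with the highest scores are the natural candidates for reminders or independent checks.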
Criteria for good reminders
Reminders In order to work effectively, reminders (memory aids to prevent the omission of
necessary task steps) should satisfy all of the conditions described below
Error management teams
[Figure: error management team diagram; surviving labels: PLANT, Events]
Section G
Discussion of software and templates to support analysis
Overview
Introduction In this section we discuss some of the tools that can be used to guide and record
the outcomes of a root cause analysis
Topic
Comparison of software tools and processes
Criteria for an RCFA technique to qualify as rigorous
Introduction With the exception of a military standard for FRACAS and a Department of Energy
guideline, there are no formally recognised standards for root cause failure analysis (RCFA).
Based on research work from an unknown source, the table below provides
some insight into the functionality of some of the processes used to identify
the root causes of failures.
Method/Tool              Type    Defines  Defines all    Provides causal  Delineates  Considers chronic  Proprietary
                                 problem  causal         path to root     evidence    events and         software
                                          relationships  causes                       quantifies losses  required
Events & causal factors  Method  Yes      Limited        No               No          No                 No
Change analysis          Tool    Yes      No             No               No          No                 No
Barrier analysis         Tool    Yes      No             No               No          No                 No
Tree diagrams            Method  Yes      No             No               No          No                 Yes
Why-Why chart            Method  Yes      No             No               No          No                 No
Pareto                   Tool    Yes      No             No               No          No                 No
Storytelling             Method  Limited  No             No               No          No                 No
Fault tree               Method  Yes      Yes            No               No          No                 Yes
FMEA                     Tool    Yes      No             No               No          No                 Yes
Apollo reality charting  Method  Yes      Yes            Yes              Yes         No                 Yes
PROACT                   Method  Yes      Yes            Yes              Yes         Yes                Yes
RCFA                     Method  Yes      Yes            Yes              Yes         Yes                No
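For readers who want to query the comparison rather than read it, the table can be held as plain data. The sketch below transcribes the ratings exactly as tabulated above; the names `CRITERIA`, `TABLE`, `fully_meets` and `methods_meeting_all` are hypothetical, and treating only "Yes" (not "Limited") as meeting a criterion is an assumption.

```python
# Column order of the five functional criteria in the comparison table.
CRITERIA = (
    "defines_problem",
    "defines_all_causal_relationships",
    "provides_causal_path_to_root_causes",
    "delineates_evidence",
    "considers_chronic_events_and_quantifies_losses",
)

# name -> (type, ratings in CRITERIA order, proprietary software required)
TABLE = {
    "Events & causal factors": ("Method", ("Yes", "Limited", "No", "No", "No"), "No"),
    "Change analysis": ("Tool", ("Yes", "No", "No", "No", "No"), "No"),
    "Barrier analysis": ("Tool", ("Yes", "No", "No", "No", "No"), "No"),
    "Tree diagrams": ("Method", ("Yes", "No", "No", "No", "No"), "Yes"),
    "Why-Why chart": ("Method", ("Yes", "No", "No", "No", "No"), "No"),
    "Pareto": ("Tool", ("Yes", "No", "No", "No", "No"), "No"),
    "Storytelling": ("Method", ("Limited", "No", "No", "No", "No"), "No"),
    "Fault tree": ("Method", ("Yes", "Yes", "No", "No", "No"), "Yes"),
    "FMEA": ("Tool", ("Yes", "No", "No", "No", "No"), "Yes"),
    "Apollo reality charting": ("Method", ("Yes", "Yes", "Yes", "Yes", "No"), "Yes"),
    "PROACT": ("Method", ("Yes", "Yes", "Yes", "Yes", "Yes"), "Yes"),
    "RCFA": ("Method", ("Yes", "Yes", "Yes", "Yes", "Yes"), "No"),
}


def fully_meets(name):
    """True if the method is rated "Yes" on every functional criterion."""
    _kind, ratings, _proprietary = TABLE[name]
    return all(rating == "Yes" for rating in ratings)


def methods_meeting_all(allow_proprietary=True):
    """List methods rated "Yes" throughout, optionally excluding those
    that require proprietary software."""
    result = []
    for name, (_kind, _ratings, proprietary) in TABLE.items():
        if not fully_meets(name):
            continue
        if proprietary == "Yes" and not allow_proprietary:
            continue
        result.append(name)
    return result


if __name__ == "__main__":
    print(methods_meeting_all())
    print(methods_meeting_all(allow_proprietary=False))
```

Keeping the ratings as strings rather than booleans preserves the "Limited" entries, so a stricter or looser reading can be applied later without re-entering the table.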
Problem definition: The process clearly defines the problem and its significance to the problem
owners in terms that are relevant for the business. This can be expressed either in
terms of risk or in financial terms so that management is able to verify that the
investment in the investigation was justified.
Combination of causal relationships: The process delineates the known causal relationships that
combined to cause the problem. As there is rarely only one cause for a problem, it is
necessary to make visible all the causal relationships that were active at the time of the
incident so that they can all be dealt with.
Causal relationships between root cause and problem: It must establish causal relationships
between the root cause(s) and the defined problem. This means that there must be a clear
causal link between the root cause and the problem. Otherwise we may be eliminating a root
cause that has nothing to do with the problem!
Presentation of evidence: Evidence must form the basis for the identification of causes. The
facilitator or analyst must demonstrate that his or her findings and conclusions are based on
irrefutable evidence. There must be a clear distinction at all times between facts
supported by evidence and assumptions for which there is no evidence.
Presents solutions: The recommendations must clearly explain how the proposed solutions will
prevent recurrence of the defined problem.
Report: It must clearly document all the above criteria in a final report so others can easily
follow the logic of the analysis.
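The six criteria above lend themselves to a simple review checklist. The sketch below is one possible way to record which criteria an analysis has met; the class `RigourChecklist` and its field names are hypothetical and merely paraphrase the criteria headings.

```python
from dataclasses import dataclass, fields


@dataclass
class RigourChecklist:
    """One flag per criterion; field names paraphrase the headings above."""
    problem_defined: bool = False
    causal_relationships_delineated: bool = False
    root_cause_linked_to_problem: bool = False
    evidence_presented: bool = False
    solutions_prevent_recurrence: bool = False
    report_documents_all: bool = False


def unmet_criteria(checklist):
    """Return the names of the criteria not yet satisfied, in order."""
    return [f.name for f in fields(checklist) if not getattr(checklist, f.name)]


def is_rigorous(checklist):
    """An analysis qualifies only when every criterion is satisfied."""
    return not unmet_criteria(checklist)


if __name__ == "__main__":
    review = RigourChecklist(problem_defined=True, evidence_presented=True)
    print(unmet_criteria(review))
    print(is_rigorous(review))
```

Listing the unmet criteria by name, rather than returning a single pass/fail flag, tells the facilitator which part of the analysis still needs work.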