Module 2

Maintenance Engineering
Maintenance is not merely preventive maintenance, although this aspect is an
important ingredient. Maintenance is not lubrication, although lubrication is one
of its primary functions. Nor is maintenance simply a frenetic rush to repair a
broken machine part, although this is often the dominant maintenance activity.
Maintenance Engineering is the discipline and profession of applying engineering
concepts to the optimization of equipment, procedures, and departmental budgets
to achieve better maintainability, reliability, and availability of equipment. It is a
routine and recurring activity of keeping a particular machine or facility at its
normal operating condition so that it can effectively deliver its expected
performance or service without causing any loss of time on account of accidental
damage or breakdown.
A person working in the field of maintenance engineering must have in-depth
knowledge of or experience in basic equipment operation, logistics,
probability, and statistics. Maintenance engineers not only monitor the existing
systems and equipment, they also recommend improved systems and help decide
when systems are outdated and in need of replacement. The maintenance function
also involves looking after the safety aspects of certain equipment where the
failure of a component may cause a major accident. For example, a poorly
maintained pressure vessel such as a steam boiler may cause a serious accident.
Good maintenance engineering is vital to the success of any manufacturing or
processing operation, regardless of size. The maintenance engineer is responsible
for the efficiency of daily operations and for discovering and solving any
operational problems in the plant.
Aim of Maintenance Engineering
The aim of maintenance engineering is to ensure the following:
1. The machines and/or facilities are always in an optimal working condition
at the lowest possible cost.
2. The time schedule of delivering to the customers is not affected because of
non-availability of machinery/service in working condition.
3. The performance of the machinery/facility is dependable and reliable.
4. The loss of performance of the machinery/facility is kept to a minimum in the
event of a breakdown.
5. The maintenance cost is properly monitored to control overhead costs.
6. The life of equipment is prolonged while keeping an acceptable level of
performance to avoid unnecessary replacements.
Maintenance is also related to profitability through equipment output and its
running cost. Maintenance work enhances the equipment performance level and
its availability in optimum working condition but adds to its running cost.
The responsibility of the maintenance function should, therefore, be to ensure that
production equipment/facilities are available for use for maximum time at
minimum cost over a stipulated time period such that the minimum standard of
performance and the safety of personnel and machines are not sacrificed. The objective
of maintenance work should be to strike a balance between the availability and
the overall running costs.
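The balance between availability and running cost can be made concrete with a small calculation. The sketch below assumes the common definition of availability as uptime divided by total scheduled time. The function names and all figures are hypothetical and chosen only for illustration.

```python
def availability(uptime_hours: float, downtime_hours: float) -> float:
    """Fraction of scheduled time during which the equipment was usable."""
    total = uptime_hours + downtime_hours
    if total <= 0:
        raise ValueError("total scheduled time must be positive")
    return uptime_hours / total


def cost_per_available_hour(running_cost: float, uptime_hours: float) -> float:
    """Running cost (including maintenance) spread over the hours actually available."""
    if uptime_hours <= 0:
        raise ValueError("uptime_hours must be positive")
    return running_cost / uptime_hours


# Hypothetical month: 700 h available, 20 h lost to breakdowns and repairs.
print(f"Availability: {availability(700, 20):.1%}")
print(f"Cost per available hour: {cost_per_available_hour(14_000, 700):.2f}")
```

Tracking both figures side by side shows whether extra maintenance spending is buying a worthwhile gain in availability.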
Responsibilities and Functions of Maintenance Group
Although actual maintenance practice may be unique to a particular facility or
industry, it is nevertheless reasonable to divide tasks and responsibilities into two
broad categories:
• Primary functions, which demand daily work by the department.
• Secondary functions, which are assigned to the maintenance department for reasons of
expediency, know-how, or precedent.
1. Primary Functions
1.1 Maintenance of Existing Plant Equipment
This activity represents the physical reason for the existence of the maintenance
group. Responsibility here is simply to make necessary repairs to production
machinery quickly and economically and to anticipate these repairs and employ
preventive maintenance where possible to prevent them. For this, a staff of skilled
craftsmen capable of performing the work must be trained, motivated, and
constantly retained to assure that adequate maintenance skills are available to
perform effective maintenance. In addition, adequate records for proper
distribution of expense must be kept.
1.2 Maintenance of Existing Plant Buildings and Grounds
The repairs to buildings and to the external property of any plant—roads, railroad
tracks, in-plant sewer systems, and water supply facilities—are among the duties
generally assigned to the maintenance engineering group. Repairs and minor
alterations to buildings (roofing, painting, glass replacement, servicing of electrical or
plumbing systems, or the like) are most logically the purview of maintenance
engineering personnel. Road repairs and the maintenance of tracks and switches,
fences, or outlying structures may also be so assigned. It is important to isolate
cost records for general clean-up from routine maintenance and repair so that
management will have a true picture of the expense required to maintain the
plant and its equipment.
1.3 Equipment Inspection and Lubrication
Traditionally, all equipment inspections and lubrication have been assigned to the
maintenance organization. While inspections that require special tools or partial
disassembly of equipment must be retained within the maintenance organization,
the use of trained operators or production personnel in this critical task will
provide more effective use of plant personnel. The same is true of lubrication.
Because of their proximity to the production systems, operators are ideally suited
for routine lubrication tasks.
1.4 Utilities Generation and Distribution
In any plant generating its own electricity and providing its own process steam,
the powerhouse assumes the functions of a small public utilities company and
may justify an operating department of its own. However, this activity logically
falls within the realm of maintenance engineering. It can be administered either
as a separate function or as part of some other function, depending on
management’s requirements.
1.5 Alterations and New Installations
Three factors generally determine to what extent this area involves the
maintenance department: plant size, multi-plant company size, and company policy.
In a small plant of a one-plant company, this type of work may be handled by
outside contractors. But its administration and that of the maintenance force
should be under the same management. In a small plant within a multi-plant
company, the majority of new installations and major alterations may be
performed by a companywide central engineering department. In a large plant a
separate organization should handle the major portion of this work. The industry
must permit flexibility between corporate and plant engineering groups when
installations and repairs are done outside the maintenance engineering
department. However, the handling of all new work by a separate organization
from maintenance management and policies would be counterproductive.
2. Secondary Functions
2.1 Storekeeping
In most plants it is essential to differentiate between mechanical stores and
general stores. The administration of mechanical stores normally falls within the
maintenance engineering group’s area because of the close relationship of this
activity with other maintenance operations.
2.2 Plant Protection
This category usually includes two distinct subgroups: guards or watchmen, and fire
control squads. Incorporation of these functions with maintenance engineering is
generally common practice. The inclusion of the fire-control group is important
since its members are almost always drawn from the craft elements.
2.3 Waste Disposal
This function and that of yard maintenance are usually combined as specific
assignments of the maintenance department.
2.4 Salvage
If a large part of plant activity concerns off-grade products, a special salvage unit
should be set up. But if salvage involves mechanical equipment, scrap lumber,
paper, containers, etc., it should be assigned to maintenance.
2.5 Insurance Administration
This category includes claims, process equipment and pressure-vessel inspection,
liaison with underwriters’ representatives, and the handling of insurance
recommendations. These functions are normally included with maintenance since
it is here that most of the information will originate.
2.6 Other Services
The maintenance engineering department often seems to be a catchall for many
other odd activities that no other single department can or wants to handle. But
care must be taken not to dilute the primary responsibilities of maintenance with
these secondary services. Whatever responsibilities are assigned to the
maintenance engineering department, it is important that they be clearly defined
and that the limits of authority and responsibility be established and agreed upon
by all concerned.
Types of Maintenance
Broadly, there are two main categories of maintenance, each of which can be further
sub-divided into various maintenance-type groups.
• Proactive / Preventive Maintenance
• Responsive / Corrective Maintenance / Breakdown Maintenance
1. Preventive Maintenance
Preventive maintenance refers to fixing problems before they appear, so that the
maintenance prevents the failure rather than responds to it. Its essence is the
inspection of equipment at regular intervals to check the machine’s condition and
take any necessary action.
Figure 2.1 Work-flow of Preventive Maintenance
Types of Preventive Maintenance
Preventive maintenance is categorized into the following five types:
• Time-Based Maintenance (TBM)
• Predictive Maintenance (PDM)
• Failure Finding Maintenance (FFM)
• Condition-Based Maintenance (CBM)
• Risk-Based Maintenance (RBM)
1.1 Time-Based Maintenance (TBM)
Time-based maintenance or TBM calls for maintenance at a fixed time. Typically,
a specified interval is scheduled, and maintenance work is carried out to restore
equipment efficiency and performance following the equipment manufacturer’s
maintenance plan. Time-based maintenance also requires the replacement of
items based on their service life capability.
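As a minimal sketch of time-based scheduling, the snippet below generates upcoming service dates at a fixed interval. The function name `tbm_schedule`, the 90-day interval, and the start date are assumptions made for illustration; in practice the interval would come from the equipment manufacturer’s maintenance plan.

```python
from datetime import date, timedelta


def tbm_schedule(last_service: date, interval_days: int, count: int) -> list[date]:
    """Return the next `count` service dates, spaced `interval_days` apart."""
    if interval_days <= 0:
        raise ValueError("interval_days must be positive")
    return [last_service + timedelta(days=interval_days * k) for k in range(1, count + 1)]


# Hypothetical example: last serviced on 1 January 2024, serviced every 90 days.
for due in tbm_schedule(date(2024, 1, 1), interval_days=90, count=3):
    print(due.isoformat())
```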
1.2 Predictive Maintenance (PDM)
This type of maintenance, as the name implies, involves forecasting a piece of
equipment’s probability of failure and planning maintenance to prevent it. The following
information should be stored and examined by the organization in order to
accurately forecast the equipment’s ability to function and execute predictive
maintenance:
a. Equipment history
b. All records of downtime, defects, performance, etc.
c. Equipment condition with respect to working time.
After analysis of the above data, together with experience from similar
equipment, the maintenance dates are fixed.
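One simple way to turn equipment history into a maintenance date is sketched below. It estimates the mean time between recorded failures and plans the next intervention at a fraction of that value. The 0.8 safety factor, the function names, and the failure log are all assumptions for illustration; real predictive maintenance programmes typically use richer statistical models.

```python
from statistics import mean


def mean_time_between_failures(failure_hours: list[float]) -> float:
    """Average operating hours between consecutive recorded failures."""
    if len(failure_hours) < 2:
        raise ValueError("need at least two recorded failures")
    ordered = sorted(failure_hours)
    gaps = [later - earlier for earlier, later in zip(ordered, ordered[1:])]
    return mean(gaps)


def next_maintenance_hour(failure_hours: list[float], safety_factor: float = 0.8) -> float:
    """Assumed rule: plan maintenance at last failure + safety_factor * MTBF."""
    mtbf = mean_time_between_failures(failure_hours)
    return max(failure_hours) + safety_factor * mtbf


# Hypothetical failure log, in cumulative operating hours.
history = [1200.0, 2500.0, 3700.0, 5000.0]
print(f"MTBF: {mean_time_between_failures(history):.0f} h")
print(f"Plan maintenance at about {next_maintenance_hour(history):.0f} operating hours")
```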
1.3 Failure Finding Maintenance (FFM)
In failure finding maintenance, potential hidden failures are searched for at regular
intervals and, if discovered, are repaired to prevent major breakdowns. It is
therefore less a specific type of maintenance than a functional check. Failure finding
maintenance increases the system reliability.
1.4 Condition-Based Maintenance (CBM)
In this technique, the real asset condition is tracked and the need for additional
maintenance is determined. Based on visual examination, predetermined tests,
performance data, etc., the equipment condition is examined during this type of
maintenance. Maintenance is planned when a failure or a hint of declining
performance is detected.
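The decision rule in condition-based maintenance can be sketched as a threshold check on monitored data. The example below flags maintenance when the average of the most recent readings exceeds a limit. The vibration values, the limit of 4.5, and the three-reading window are hypothetical; actual limits depend on the equipment and the monitoring method used.

```python
def needs_maintenance(readings: list[float], limit: float, window: int = 3) -> bool:
    """Flag maintenance when the mean of the latest `window` readings exceeds `limit`."""
    if window <= 0 or len(readings) < window:
        return False  # not enough data to judge the condition
    recent = readings[-window:]
    return sum(recent) / window > limit


# Hypothetical vibration readings (mm/s) showing a rising trend.
vibration = [2.1, 2.2, 2.4, 3.9, 4.6, 5.1]
print(needs_maintenance(vibration, limit=4.5))
```

Averaging over a short window rather than reacting to a single reading reduces false alarms from one-off measurement spikes.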
1.5 Risk-Based Maintenance (RBM)
RBM follows the philosophy of prioritizing maintenance for the assets that carry the
most risk should they fail. This philosophy determines the most economical use of the
maintenance resources and minimizes the risk of failure.
RBM strategy works on the following steps:
1. Data collection
2. Risk assessment and evaluation of the consequence and probability of failure
3. Ranking of risks
4. Creating an inspection plan based on those risk ranking matrices
5. Maintenance planning and mitigation of risks
Equipment carrying greater risk and more serious failure consequences is monitored
and maintained more frequently. This philosophy provides a systematic approach to
determine the most appropriate asset maintenance plans in the most economical
way.
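The risk assessment and ranking steps can be illustrated with a short sketch. It assumes the common convention that risk is the product of failure probability and failure consequence. The asset names, probabilities, and costs are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Asset:
    name: str
    failure_probability: float  # assumed probability of failure per year, 0 to 1
    consequence_cost: float     # assumed cost of one failure, in any currency

    @property
    def risk(self) -> float:
        # Assumed convention: risk = probability x consequence.
        return self.failure_probability * self.consequence_cost


def rank_by_risk(assets: list[Asset]) -> list[Asset]:
    """Return assets ordered from highest to lowest risk."""
    return sorted(assets, key=lambda a: a.risk, reverse=True)


assets = [
    Asset("Cooling pump", 0.10, 50_000),
    Asset("Steam boiler", 0.02, 1_000_000),
    Asset("Conveyor belt", 0.30, 5_000),
]
for asset in rank_by_risk(assets):
    print(f"{asset.name}: risk = {asset.risk:,.0f}")
```

Note that the boiler ranks first despite having the lowest failure probability, because its consequence dominates. This is why RBM looks at both factors together.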
2. Corrective maintenance
Corrective maintenance is any maintenance task that resolves a problem with a
piece of equipment and returns it to operational condition. This is also known as
reactive/responsive maintenance. Corrective maintenance work can be both
planned and unplanned.
Figure 2.2 Workflow of typical corrective maintenance philosophy
Normally there are three situations that call for corrective maintenance:
• If a piece of equipment or part breaks down
• If any issue is identified during condition monitoring
• If routine inspection discovers any potential fault
There are two types of corrective maintenance:
• Planned or Scheduled corrective maintenance
• Unplanned or Unscheduled corrective maintenance
2.1 Planned corrective maintenance
Planned corrective maintenance is the corrective action that is not immediate but
planned or scheduled according to the severity and nature of the observed defects.
The risks and costs involved are the major parameters that determine the
planned corrective maintenance schedule. This is also known as deferred
corrective maintenance.
Example: An air conditioner is not providing proper cooling due to refrigerant gas leakage,
so a work order is created to repair it during the next inspection.
2.2 Unplanned corrective maintenance
Unplanned corrective maintenance needs immediate attention due to some kind
of critical failure, which must be repaired without delay because the resulting
downtime directly affects cost.
This philosophy is also known as Immediate Corrective Maintenance.
Example: A pump is inspected and serviced every 200 hours, but it breaks
down after 150 hours of operation and calls for an emergency repair. Similar
cases are examples of unplanned corrective maintenance.
Preventive vs Corrective Maintenance
S.N | Preventive Maintenance (PM) | Corrective Maintenance (CM)
1 | PM aims to prevent failure and is scheduled before the equipment fails. | CM refers to maintenance after failure of the asset.
2 | PM involves proper planning to prevent equipment failure, hence it is complex. | It is a simpler, less complex process.
3 | It prevents equipment failure and is normally less expensive. | Normally more expensive, as the equipment has already failed and needs replacement or extensive repairs.
4 | The chance of equipment failure is reduced, so production and time loss are negligible. | Production and time are lost due to asset failure.
5 | PM is performed at regular intervals. | It is only performed when a breakdown occurs.
6 | PM reduces the need for corrective actions. | CM increases the need for preventive actions.
7 | Equipment lifespan and efficiency are increased by regular preventive maintenance. | Overall equipment lifecycle and efficiency are reduced.
8 | PM is better for the safety of employees and the working environment. | CM is hazardous from a safety standpoint.
9 | A smaller number of technicians perform PM, decreasing the workload. | CM requires a greater number of employees or technicians, which increases the workload.
3. Miscellaneous Maintenance Types
Opportunistic Maintenance is often advantageous in machines with multiple
malfunctioning components. When a piece of equipment or a system needs to be
taken apart for maintenance on one or a few worn-out parts, the opportunity can
be used to maintain or replace adjacent worn parts, even if they have not yet
failed. It is not a specific maintenance system but a way of utilizing
an opportunity that may come up at any time.
Window Maintenance is a set of activities that are carried out when a machine
or equipment is not required for a definite period of time.
Design-out maintenance is a set of activities that are used to eliminate the cause
of failure, simplify maintenance tasks, or raise machine performance from the
maintenance point of view by redesigning those parts and facilities which are
vulnerable to frequent occurrence of failure.
Maintenance Tools
1. P-F Curve (Potential failure and Functional failure)
2. FMEA (Failure Mode and Effects Analysis)
3. PMO (Planned Maintenance Optimization)
4. SCADA (Supervisory Control and Data Acquisition)
5. Lean Six Sigma
6. RCA (Root Cause Analysis)
1. P-F Curve
A P-F curve is a graph that shows the health of equipment over time to identify
the interval between potential failure and functional failure. The eventual failure
of any equipment is inevitable. Wear and tear naturally occur with continual
usage. In the same way that a pair of shoes eventually wears out after 500 km of
walking, key plant equipment (e.g. pumps, motor bearings) will ultimately
reach its functional failure point.
The good news is that the functional failure point (i.e. the end of equipment life)
takes a long time to occur. The P-F curve helps to characterize the behaviour of
equipment over time. It’s used to assess the maximum usage that can be gained
from the equipment.
Potential failure and functional failure
There are two main points of the P-F curve that need to be identified.
• Potential failure indicates the point at which we notice that equipment is
starting to deteriorate and fail.
• Functional failure is the point at which equipment has reached its useful
limit and is no longer operational.
These two points define what is called the P-F interval: the time between when
the failure is initially noticed and when the equipment fails completely.
Figure 2.3 P-F interval
How to create a P-F curve
The basic parts of the P-F curve are given above. Actual data can be expected to
vary on a case-by-case basis. For instance, the lifespan of a heavy-duty pump
might not be the same as that of a mechanical bandsaw. It then follows that
expected failure points, and therefore P-F interval values, will differ between
types of equipment, so care must be taken when building P-F curves.
For example, assume that a pump that’s been normally operating for eight
months suddenly produces more noise than usual. Unusual noise can be a
sign of failure. With the inspection and confirmation of maintenance personnel,
we can then say that the first noticed sign of failure (i.e. the potential failure point)
occurred at eight months.
Note that the actual start of deterioration might have happened before the
eight-month mark. So, we can assume that the actual start of failure happened
some time before point P. However, it is only the potential point of failure that we
can measure in time with certainty as it was the first event when noticeable
symptoms of failure were recorded.
For the same example, we can suppose that the pump continues to operate
for another six months until it totally breaks down—that is the functional failure
point at 14 months.
Figure 2.4 Creation of P-F curve
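The pump example reduces to a simple subtraction, sketched below. The function name `pf_interval` is an assumption for illustration; the figures of 8 and 14 months come from the example above.

```python
def pf_interval(potential_failure: float, functional_failure: float) -> float:
    """Time between the first detected symptom (P) and loss of function (F)."""
    if functional_failure < potential_failure:
        raise ValueError("functional failure cannot precede potential failure")
    return functional_failure - potential_failure


# Pump example: symptoms first noticed at 8 months, total breakdown at 14 months.
print(pf_interval(potential_failure=8, functional_failure=14), "months")
```

The resulting six months is the window available to plan and carry out maintenance before the pump stops working.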
How to maximize the curve
Now that we’ve visualized how the P-F curve relates to real-life scenarios, we have
the chance to prepare for the inevitable functional failure. The idea is to balance
our resources to prolong the P-F interval economically.
Common practice is to maximize the use of the P-F curve with condition-based
maintenance (CBM). By applying CBM and proactively checking the
condition of the equipment, we are able to infer the rate of deterioration over time.
Maintenance personnel are then able to plan and assess whether it is cost-efficient
to mitigate the causes of failure given the projected P-F interval.
The P-F curve and CBM
At the early signs of failure, it may be helpful to perform routine CBM tasks to
assess the health of the equipment. Continuing with our pump example, a P-F
curve coupled with CBM tasks to monitor pressure and flow rate conditions may
resemble Figure 2.5. A maintenance team can attach condition
monitoring sensors to the equipment after the point of potential failure to assess
how much more useful life can be obtained from the equipment.
Figure 2.5 Maximization of P-F interval
2. FMEA (Failure Mode and Effects Analysis)
FMEA stands for Failure Mode and Effects Analysis, and the name tells a lot about
the process. FMEA is a structured method that aims to identify potential failures
and their corresponding outcomes. The FMEA process is considered a bottom-up
approach; the analysis starts with specific data that builds up to form a more
general plan of action. In this case, each component of the observed system is
thoroughly examined for likely breakdown causes. For every identified breakdown
scenario, corresponding effects should then be pointed out. This allows the
organization to have an extensive map of failure modes and effects, organized
according to their level of impact on the business. Developing an FMEA process
equips organizations with a strategy to identify potential breakdowns before they
even occur. This process of risk assessment can streamline the efforts of
maintenance teams towards efficiently increasing reliability.
FMEA can be broadly classified into two categories: Design FMEA (DFMEA)
and Process FMEA (PFMEA). Each of these would concentrate on different areas,
potentially coming up with more specialized findings.
2.1 Design FMEA (DFMEA)
Design FMEA relates to the way that a system, product, or service was
conceptualized. As the name suggests, DFMEA focuses on the design aspect of a
developmental process. It is primarily beneficial in testing out new product ideas
before introducing them to real-life scenarios.
2.2 Process FMEA (PFMEA)
The nature of PFMEA differs slightly as it looks into current processes and
procedures that an organization is already performing. PFMEA would typically
address potential failures that can have significant impacts on usual operations.
Some examples of business impacts are process stalls, human errors, and
environmental and safety hazards. Because of its nature, PFMEA can be
performed more effectively when historical data is available.
FMEA Work Principle
FMEA works by collecting as much information from the production floor as
possible. Maintenance and reliability teams, being closest to the equipment and
processes, are valuable assets to provide a collection of ideas on how failures can
potentially occur. The effects of each potential failure are then assessed. Finally,
the severity of each of the effects is then rated and evaluated to form a weighted
scale.
Figure 2.6 A sample of an FMEA matrix
By assigning weights, FMEA effectively becomes an objective decision
criterion that the organization’s functions can align to. A Risk Priority Number
(RPN) refers to the risk value that each outcome amounts to. The RPN becomes
the basis of whether or not teams should take actions to address a potential
failure. It is important for the relevant teams to have the same level of
understanding of RPN values and their corresponding actions.
Failure Modes
Failure modes describe the specific ways by which failures can occur. Different
forms of FMEA will focus on different specific areas. For example, the object that
is assumed to fail can be a component of a piece of equipment, the equipment itself, a
subsystem, a system, or even a certain process.
A set of failure modes can tell a lot about the functional impact of a failure. In
such cases, the levels of service that failure events allow are categorized into their
respective groups. For example, to establish the level of functionality after a
failure, a set of failure modes can resemble the following categories:
a. Full function failure
b. Partial or degraded function failure
c. Intermittent function failure
d. Over function failure
e. Unintended function failure
When performing FMEA for more specific applications like physical equipment,
failure modes tend to become more specific. For example, take the case of
specialized equipment, such as a centrifugal pump. Failure modes for a pump can
include hydraulic failure, mechanical failure, corrosion, or human intervention.
As we could imagine, the list goes on as more types of equipment are analysed.
Effects in FMEA
Effects describe the consequences or repercussions of an identified failure event.
These effects can reflect a failure’s impact on safety, productivity, and overall
reliability. To get the whole picture of the extent of a failure, it would help to think
of an effect in at least three levels:
1. Local Effect
Local effects are the consequences of failure to the item being observed or items
immediately adjacent to it. When looking at physical assets, for example, local
effects might consider the consequences of having a faulty component such as a
pump used in water circulation.
2. Next Higher-Level Effect
As the name suggests, the next higher-level effects would consider the impacts of
failure on the larger subsystem that it affects. Following our equipment example,
the next higher-level effect of having a broken pump could describe consequences
to the larger cooling system it belongs to.
3. End Effect
The end effect is the highest-level effect that considers repercussions to the whole
system, facility, or organization. Again, continuing from our example, the end
effect of a compromised cooling system might be delayed production schedules or
worse, a complete standstill of operations.
With the failure modes and effects mapped out, the risk of each scenario
occurring could be assessed more systematically. The steps to take given the
failure modes and effects would then become more apparent after evaluating the
other components of the FMEA process.
The Components of an FMEA Process
With long lists of failure modes and corresponding effects, the next challenge is
building strategies to handle each scenario. An objective approach to quantifying
the seriousness of failure events would be to identify the other components of the
FMEA process.
Corresponding weights and values are assigned to each of the identified
components to form a matrix. Examples of these components are listed below. The
final value after evaluating this set of criteria gives a Risk Priority Number
(RPN). We can think of the RPN as a score that gives an idea of what needs our
attention more urgently.
1. Probability of Failure
Think of the probability of failure as the likelihood that a component, piece of equipment,
or process would fail. A higher rating would mean that failure is almost certain.
In assessing the probability of failure, it is important to consider the entirety of
the lifetime of an asset. The seasonal fluctuations of failure events also need to be
considered where applicable.
Accurately identifying the rating for this criterion calls for a
combination of worker experience and robust historical data. Using
a CMMS (Computerized Maintenance Management System) to collect historical
data can lead the maintenance teams to data-driven assessments.
On a scale of 1 to 10, the numerical ratings typically correspond to
the following descriptions of the likelihood of failure:
• 1 – Extremely unlikely or no chance of experiencing failure or breakdowns
• 2 to 4 – Minimal chance of failure, as experience would show
• 5 to 7 – An occasional chance of failure
• 8 to 9 – A high chance of failure
• 10 – Failure is inevitable within the time frame observed
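The rating bands above can be encoded as a small lookup so that every assessor applies them in the same way. This is only a sketch: the function name `describe_occurrence` and the shortened descriptions are assumptions for illustration.

```python
def describe_occurrence(rating: int) -> str:
    """Map a 1-10 probability-of-failure rating to its descriptive band."""
    if not 1 <= rating <= 10:
        raise ValueError("rating must be between 1 and 10")
    if rating == 1:
        return "extremely unlikely"
    if rating <= 4:
        return "minimal chance"
    if rating <= 7:
        return "occasional chance"
    if rating <= 9:
        return "high chance"
    return "inevitable"


for rating in (1, 3, 6, 9, 10):
    print(rating, describe_occurrence(rating))
```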
2. Detectability
Detectability answers the question: “Will there be a warning to allow the failure
event to be avoided?” In this component of FMEA, easily detectable failure events
are given a low rating, while events with no chance of detection are given the
highest rating.
For example, with highly reliable sensors in place, a faulty HVAC system
might have a relatively low detectability rating. Components with absolutely no
way of detection should be given a high rating to reflect a potentially bigger
problem.
3. Severity
Another factor considered in the calculation of risk priority is the severity of an
identified failure event. Severity attempts to quantify the seriousness of an effect
or series of effects, given a failure mode.
Because of the subjective nature of evaluating severity, this criterion is
usually set by the company. One of the common ways to approximate the
seriousness of failure effects is to calculate the monetary implications caused by
a breakdown. Other factors, such as safety hazards and nonconformance with
government regulations, also influence severity.
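The text does not state how the three ratings are combined into an RPN. A widely used convention, assumed in the sketch below, is to multiply severity, probability of occurrence, and detectability, each rated from 1 to 10. An organization may weight the components differently, so treat this as one possible scheme rather than the definitive formula.

```python
def risk_priority_number(severity: int, occurrence: int, detectability: int) -> int:
    """RPN under the assumed convention: severity x occurrence x detectability."""
    ratings = {
        "severity": severity,
        "occurrence": occurrence,
        "detectability": detectability,
    }
    for name, value in ratings.items():
        if not 1 <= value <= 10:
            raise ValueError(f"{name} must be between 1 and 10, got {value}")
    return severity * occurrence * detectability


# Hypothetical failure mode: serious effect, occasional, moderately hard to detect.
print(risk_priority_number(severity=8, occurrence=3, detectability=5))
```

Under this convention the RPN ranges from 1 to 1000, and a hard-to-detect failure raises the score because detectability is rated high when there is little or no warning.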
Starting Procedure of the FMEA Process
As with most company-wide initiatives, introducing a process such as FMEA is
usually first approved by higher management. While suggestions and proactive
measures may start from the actual workers, starting the FMEA process would
require collective action that involves the whole organization.
After a green light has been signalled from the higher-ups, the process of
gathering all requirements then follows. The following high-level steps are
commonly found in FMEA processes. Think of this as a general checklist for
starting with FMEA:
1. Identify the component, equipment, system, or process to analyze.
2. Assign a team and team leader that would kick off the process. In this stage,
it is important to involve the right people closest to the operation.
3. Describe what is being analyzed.
4. Identify potential failure modes.
5. Identify the effects related to the failure modes.
6. Set the criteria for evaluating the risk of each failure mode and their effects.
This would include the probability of occurrence, detectability, and severity.
7. Design a method of prioritization based on the calculated Risk Priority
Number from previously evaluated components.
8. Take the necessary actions to eliminate or reduce identified risks.
9. Measure the success of risk reduction after implementing the established
actions.
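The prioritization in steps 6 to 8 can be sketched as a small worksheet. The example assumes the RPN is the product of the three ratings and that action is required at or above a threshold of 100. Both the threshold and the failure modes listed are hypothetical and would be set by each organization.

```python
from dataclasses import dataclass


@dataclass
class FailureMode:
    description: str
    severity: int       # 1-10
    occurrence: int     # 1-10
    detectability: int  # 1-10, high means hard to detect

    @property
    def rpn(self) -> int:
        # Assumed convention: RPN is the product of the three ratings.
        return self.severity * self.occurrence * self.detectability


def prioritize(modes: list[FailureMode], action_threshold: int = 100) -> list[FailureMode]:
    """Return failure modes at or above the threshold, highest RPN first."""
    flagged = [m for m in modes if m.rpn >= action_threshold]
    return sorted(flagged, key=lambda m: m.rpn, reverse=True)


worksheet = [
    FailureMode("Pump seal leak", severity=6, occurrence=5, detectability=4),
    FailureMode("Bearing seizure", severity=9, occurrence=3, detectability=6),
    FailureMode("Gauge misreading", severity=3, occurrence=4, detectability=2),
]
for mode in prioritize(worksheet):
    print(mode.rpn, mode.description)
```

Re-rating the same failure modes after corrective actions and comparing the new RPN values gives a simple measure of risk reduction for step 9.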
Some Tips for Successful FMEA Implementation
The general steps to initiate FMEA in an organization have been enumerated in
the previous section. However, we are still left with a huge amount of freedom in
terms of how we could apply a process such as FMEA to our organization. To improve
the chances of success in this endeavour, here are a
few tips that might help:
1. Tailor-Fit the Evaluation Criteria
The way FMEA is carried out can vary widely between organizations, and the
differences exist for a reason. Companies will have their own business
strategies and therefore focus on different aspects of their operations. To reflect
this in your FMEA process, you should be strategic in assigning weights and
identifying categories in your decision criteria.
2. Be Consistent With the Rating Scales Used Within the Organization
After identifying the criteria that best reflect your company’s objectives, it then
helps to stick with a rating pattern. The consistency of your rating scales would
be an effective way of aligning the organization towards common goals.
Consistency allows you to seamlessly work within various groups and functions.
3. Identify Specific Risk Control Processes for Each Failure Mode
At the end of the day, FMEA is essentially a risk assessment tool. For it to be an
effective tool, it is not enough to identify the failure modes and corresponding
effects. Specific processes and procedures need to be in place to eliminate, or at
least reduce, the risks identified.
4. Engage the Team
A lot of the information that goes into the FMEA process relies on the day-to-day
experience of the workforce. By engaging the right people, you can ensure the
reliability of your data. It is important to recognize the value that comes with the
collective experience of your team.
5. Use Tools You Already Have
You can only gather so much information within the limits of
human capacity. An advantage that you might not realize you have is the tools that are
running 24/7 in the background. CMMS programs that record equipment
performance even when you’re not looking can provide you with data sets that you
might have overlooked manually. Coupled with a more-than-capable workforce,
these tools can maximize the potential you already have.
Common FMEA Mistakes
Now that we’ve explored some useful tips for successful FMEA implementation, it
also helps to look into potential pitfalls. Look out for these points when
implementing FMEA procedures:
1. Timing Is Key – Do Not Perform FMEA Too Late
FMEA processes should be performed as early as the conceptualization stages of
a new design or concept. The idea behind this process is to identify risks before
they even occur. You want to be miles ahead of a potential breakdown. Starting
FMEA while in the middle of design implementation complicates what would have
been simple changes to the initial concept.
2. Unclear Ownership
The required actions you identify are only as good as the actual execution. Without
clear accountability within the team, it doesn’t really matter how many plans you
lay out. Clear ownership of the required actions should be established as part of
the process.
3. Not Having Commitment
FMEA is not a one-time process. The effectiveness of implementing FMEA relies
on the continual execution of agreed actions. Moreover, the real success of FMEA
is establishing a culture that is always seeking continuous improvement.
4. Lacking Proper Documentation
Without a disciplined documentation process, you risk starting from scratch each
time a scenario recurs. Actions or significant events that have transpired should
be properly documented as a way to assist future teams who will face similar
situations.
5. Failing to Identify the Root Cause
Coming up with solutions assumes that you have identified the root cause of a
problem. Without the diligence to confirm the root cause, time and effort could
easily be wasted. What’s worse, you might be convinced that you have rectified
an issue, only to find out the hard way that the problem still persists.
Conclusion
FMEA is an achievable process that offers substantial benefits to organizations of
all types and sizes. Through FMEA, the risks associated with failure events are
exhaustively and systematically identified.
While the seriousness of failure effects can seem subjective, FMEA offers methods
that quantify the repercussions of failures. This then allows the organization to
perform actions that effectively reduce or even eliminate risks. Having a
comprehensive FMEA process sets up organizations to be prepared.
3. Planned Maintenance Optimization (PMO)
Planned Maintenance Optimization (PMO) is a method of improving maintenance
strategies based on existing preventive maintenance (PM) routines and available
failure history.
While most companies have identified the need for a preventive maintenance (PM)
program, the effective execution of such maintenance activities can be challenging
given the everyday demands of a facility. Unforeseen circumstances that require
urgent attention can easily derail planned activities and can potentially disrupt a
smoothly running plant. PMO provides a method through which maintenance
activities are carried out more efficiently. By performing PMO, a new maintenance
strategy is derived from existing PM tasks. Starting from those tasks, the schedule
and frequency of the routines are modified based on the failure history of the
equipment. The resulting strategy can be similar to one produced by RCM while
taking less time to develop.
The three phases of PMO
The PMO process can be summarized in three phases:
a. Data collection
Any attempt at optimization starts with good, reliable data. Data on equipment
performance, particularly on failure history over time, must be collected. A
minimum time period must be set to ensure that enough insight is obtained from
the data. Tools such as a CMMS program can make this process easier and more
accurate.
b. Data analysis, review, and recommendations
The collected data must be analysed to identify which equipment is the most
critical. Some points to consider are criticality to the plant’s operations, cost to
repair, MTBF (mean time between failures), and MTR (mean time to repair, more
commonly abbreviated MTTR).
The information gathered from analysing the data must then be reviewed against
existing PM routines. Some key points to review are: 1) whether the PM routines
are scheduled correctly to align with the MTBF and MTR data points, and 2)
whether failure points are within acceptable tolerances set by original equipment
manufacturer (OEM) specifications or industry standards. Any substantial
deviation found in these checks points to a possible improvement from a
maintenance standpoint.
Based on the review, recommendations for modifying the PM tasks should
be made. Schedules and frequencies of activities need to be optimized to meet
MTBF and MTR constraints. Any missing maintenance activities, as well as
redundancies in tasks, need to be addressed accordingly.
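As an illustration of the analysis and review step, the following minimal Python sketch computes MTBF and MTR from a small failure log and compares the MTBF with a PM interval. The record layout, the sample values, the 500-hour interval, and the names `FailureRecord`, `mtbf_hours`, and `mtr_hours` are assumptions made for this example; a real CMMS export will be structured differently.

```python
from dataclasses import dataclass


@dataclass
class FailureRecord:
    """One failure event for an asset (hypothetical record layout)."""
    uptime_hours_before_failure: float  # running time since the previous repair
    repair_hours: float                 # time taken to restore the asset


def mtbf_hours(records):
    """Mean time between failures: total operating time / number of failures."""
    return sum(r.uptime_hours_before_failure for r in records) / len(records)


def mtr_hours(records):
    """Mean time to repair: total repair time / number of repairs."""
    return sum(r.repair_hours for r in records) / len(records)


# Illustrative data only, not taken from any real plant.
history = [
    FailureRecord(400.0, 4.0),
    FailureRecord(350.0, 6.0),
    FailureRecord(450.0, 5.0),
]

pm_interval_hours = 500.0  # assumed current PM interval

mtbf = mtbf_hours(history)
mtr = mtr_hours(history)

# If the PM interval is longer than the MTBF, the asset tends to fail
# before the PM task comes due, so the schedule deserves a review.
pm_interval_too_long = pm_interval_hours > mtbf
print(f"MTBF={mtbf} h, MTR={mtr} h, review PM interval: {pm_interval_too_long}")
```

With these sample numbers the asset fails, on average, before its PM is due, which is the kind of mismatch the review is meant to surface.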
c. Agreement and execution
Agreed action items must be delegated properly. Identified task owners should be
accountable for any required action and monitored for progress. Note that the
PMO process is a continuous effort and reviews should be carried out regularly.
Benefits of applying PMO
Regular maintenance activities are clearly a key part of ensuring a plant’s
reliability. PMO further increases the benefits of maintenance activities by
substantially reducing their cost.
In the laboratory and life sciences industry, a PMO program is estimated to reduce
overall maintenance costs by around 25%. Payback periods of investing in a PMO
strategy are estimated at around 12 to 24 months, just considering the measured
savings from maintenance costs.
Aside from the improvements in uptime and reliability that come with a robust
maintenance strategy, PMO methods enable company resources to be spent more
wisely without sacrificing the quality of execution of maintenance tasks.
Optimization of an existing preventive maintenance plan
There are a few different approaches to preventive maintenance optimization
(PMO).
One of the easiest methods is simply asking the technicians. Odds are, they’ve
been performing the same PM tasks for a while, and they’d probably have some
insight on what could be done better. If any of their tasks seems irrelevant, they’ll
let us know if we ask. While this is the simplest way to track down superfluous
PM tasks, it’s not the most precise, and it is pretty subjective. That said, it’s easy
to do when we’re just starting to optimize PM in our facility.
The next way is a bit more precise, though it is based on some industry
assumptions. A few decades ago, a maintenance and engineering manager named
John Day, Jr. proposed the 6:1 rule. This rule asserts that for every six PM tasks
we perform, we should find one corrective maintenance (CM) task.
This rule isn’t perfect, but it can give a good starting point for optimizing the PM.
If we’re performing more than six PMs for each CM, we may want to scale back a
bit, but only after doing some research. If it looks like the PM:CM ratio is too high,
it is advised to analyse the types of failures we’re preventing. If they don’t pose too
much of a threat, scaling back relevant PM tasks could be a good idea.
On the other hand, if we have more CM than the ratio dictates, we might be facing
one of these two possibilities:
1. We’re not doing enough PM.
2. We’re doing too much of the wrong PM.
Again, some extra analysis will help us make the right choice here. Do some
digging into the preventive maintenance tasks being performed and see if they
address the right issues. If they do, our PM timing or quantity may be off. If they
do not, we’ll want to scale those tasks back and replace them with more relevant ones.
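The 6:1 check described above can be sketched in a few lines of Python. This is only an illustration: the function names, the tolerance band of one either side of the target, and the wording of the messages are assumptions for this example, not part of the rule itself.

```python
def pm_to_cm_ratio(pm_tasks_completed, cm_tasks_found):
    """Return the number of PM tasks performed per corrective task found."""
    if cm_tasks_found == 0:
        return float("inf")  # no corrective work found at all
    return pm_tasks_completed / cm_tasks_found


def assess_ratio(ratio, target=6.0, tolerance=1.0):
    """Compare a PM:CM ratio with the 6:1 guideline.

    The tolerance band is a hypothetical choice; the rule is a starting
    point for analysis, not a pass/fail test.
    """
    if ratio > target + tolerance:
        return "possibly too much PM: review which failures the tasks prevent"
    if ratio < target - tolerance:
        return "too much CM: either too little PM or the wrong PM"
    return "close to 6:1"


# Example: 120 PM tasks in a period produced 10 corrective work orders.
ratio = pm_to_cm_ratio(120, 10)
print(ratio, "->", assess_ratio(ratio))
```

Either outcome is a prompt for the further analysis described above rather than a decision in itself.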
A similar approach involves tracking the hours spent on PM and on emergency
maintenance for a single asset. If we have more emergency repair hours than PM hours,
we’ve probably got a problem and will want to do some analysis on the root cause.
These last two approaches give us more precision, but they do take more planning,
so keep that in mind when we start streamlining our PM.
Conclusion
Maintenance activities, particularly PM activities, are already proven concepts
that increase the overall performance of a plant. With continuous practice, PMO
is a tool that can help execute PM activities more efficiently and effectively.
4. SCADA System
Supervisory Control and Data Acquisition (SCADA) is a computer control system
that is used to monitor and control plant processes. This software uses data
communications, a graphical user interface, and extended management to monitor
and control systems.
The biggest manufacturing companies in the world are also known to be the
most data-driven establishments. In an age of growing technological capabilities,
the importance of collecting data is pushed to the limit with the use of systems
such as SCADA. By collecting and monitoring real-time data, SCADA software
shows an overview of how each key piece of equipment in the plant is performing.
Sensors on the equipment send signals through remote terminal units (RTUs) and
programmable logic controllers (PLCs). RTUs and PLCs give the supervisory control
and data acquisition system the ability to pinpoint anomalies in system functions
based on the collected data, thereby allowing the user to promptly take action on
the issue. SCADA allows maintenance personnel to make more informed
decisions. A modern SCADA system is applicable to a wide variety of industries—
oil and gas, energy, manufacturing, and virtually any corporation that benefits
from accurate and timely data monitoring.
Key Components of a SCADA System
Think of SCADA software as a bridge that links equipment with operators and
maintenance personnel. The system requires some key components to facilitate
the transmission of data from the physical equipment to the operator’s display
screen.
Figure 2.7 Example of a supervisory control system
a. Sensors and Manual Input
Digital or analogue sensors serve as measuring tools that collect data from various
parts of the plant. SCADA sensors may range from simple binary options, such as
an on or off signal, to more complex tools that measure flow rate, temperature,
and pressure. In addition, technicians or operators at the remote or central
location can manually input data into the system.
b. Conversion Units
Data collected by sensors is only useful if it can be converted into a form that is
easily comprehensible. Remote terminal units (RTUs) and programmable logic
controllers (PLCs) are the devices that can translate the collected data into usable
information. Since information is collected throughout an entire system, the sheer
amount of data can be great.
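One common form of this conversion is scaling a raw analogue signal into engineering units. The sketch below assumes a transmitter that outputs a 4 to 20 mA loop current varying linearly with the measured quantity; the function name `scale_4_20ma` and the 0 to 10 bar range are hypothetical, chosen only to illustrate the idea.

```python
def scale_4_20ma(current_ma, low, high):
    """Linearly map a 4-20 mA loop current onto the range [low, high].

    4 mA corresponds to `low` and 20 mA to `high`. A reading outside
    4-20 mA is treated as a signal fault rather than a valid measurement.
    """
    if not 4.0 <= current_ma <= 20.0:
        raise ValueError(f"signal out of range: {current_ma} mA")
    return low + (current_ma - 4.0) * (high - low) / 16.0


# A 12 mA reading on an assumed 0-10 bar pressure transmitter is mid-scale.
pressure_bar = scale_4_20ma(12.0, 0.0, 10.0)
print(pressure_bar)
```

Rejecting out-of-range currents instead of clamping them lets a broken wire or failed sensor show up as a fault rather than as a plausible-looking value.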
c. Human-Machine Interface (HMI)
Data feeds that are converted by the RTUs and programmable logic controllers
meet at a master unit known as the supervisory system or the human-machine
interface (HMI). This interface brings useful information to the maintenance team.
At this point, one operator can have a complete picture of an entire process or
system. The data is presented in an easily digestible format, and the employee can
take control of certain pieces of equipment to make repairs or isolate failures.
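While a human-machine interface and SCADA are closely related, they are not the
same thing: the HMI is the layer operators see and interact with, whereas SCADA is
the wider system that also covers data acquisition, communication, and control.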
d. Communication Infrastructure Network
All the SCADA components are located throughout the plant and must be linked
together by a communication infrastructure network. Conventionally, telephone
lines and circuits have served as this network, with newer wireless options now
available that use radio, cellular, or satellite links.
Figure 2.8 Communication infrastructure network
How It Works
As mentioned earlier, think of SCADA software as a bridge that links equipment
with operators and maintenance personnel. The system requires some key
components to facilitate this transmission of data from the physical equipment to
the operator’s display screen. This, in turn, allows maintenance technicians to
perform certain tasks or monitor and control asset behaviour along the way.
Some typical tasks include checking sensors and other devices that may be
installed at remote substations or monitoring and control stations, tracking
machine system events for future reference, and changing the level or speed of
industrial processes from a central spot.
In essence, modern SCADA systems allow technicians to stay in one spot,
yet extend their virtual reach to many different assets, locations, and systems.
The automation presented by SCADA systems allows greater efficiency and better
decision making for the management team, leading to greater productivity,
increased safety, and revenue generation.
A wide variety of companies, organizations, and businesses can use
supervisory control and data acquisition systems in order to improve efficiency,
share quality data across departments, and better identify and address system
issues. Both private and public sector organizations can benefit from SCADA
systems, from small, basic manufacturing plants to large, multi-million-dollar
corporations. Companies within the oil and gas, power, water, and transportation
industries often employ SCADA systems, as well as businesses in energy, food,
healthcare, and recycling. Modern systems can be configured to do everything—
from managing the operations of freezer and refrigeration systems at a food
distribution company to reducing downtime on a production line at a
manufacturing plant. SCADA systems help businesses comply with health and
safety regulations, meet government compliance requirements, boost efficiency,
and save money.
Benefits of SCADA
Since SCADA systems provide flexible, scalable means to monitor and control
what’s happening throughout complex industrial processes, on a shop floor, or
within remote substations, they can make a significant contribution to maintenance
and reliability efforts.
Reliability-centred maintenance (RCM) is a maintenance strategy that treats
equipment and asset reliability as a top priority. The strategy is designed to
optimize the maintenance system of an entire organization with the intent to
improve efficiency and timely production. RCM typically focuses on identifying and
prioritizing different failure modes. This focus helps in scheduling activities that
will prevent major system failure. It’s easy to see how SCADA software can go hand
in hand with an RCM system, as SCADA provides a great deal of automated
information and data on the performances of various assets and machinery in a
plant. SCADA also allows human intervention early in the process, preventing the
failures that an RCM strategy is designed to seek and identify. Here are some real-
life SCADA applications.
a. Automating Electric Distribution
Electrical utilities around the world are facing an increase in the demand for
power, as well as a lower-than-ever tolerance for outages. Maintaining these
complex electrical grids and equipment to be as perfectly reliable as possible is a
significant challenge. SCADA systems play an important role in helping utilities
achieve that goal.
In this industry, sensors can collect information from various points at each
substation. Additional manual data can be added by either the central or
substation staff. Besides collecting data and sending alerts when certain
conditions signal an outage or potential for an outage, a well-programmed SCADA
system can automate certain repairs. In addition, a SCADA system can pinpoint
the location of other problems, minimizing the time it takes for a technician to
locate and diagnose the problem. Finally, various backup and redundant checks
and processes can increase reliability throughout the power grid.
b. Identifying Failures on a Production Floor
Being able to identify potential failures before they occur is critical to maintaining
a world-class level of uptime on any shop floor. For example, let’s look at the
failure of a steam turbine component. If the overspeed trip device is not working
properly, it may have no immediate effect on your overall system.
However, a SCADA sensor that detects this fault does not repair the component
itself. Instead, it alerts the maintenance department so that the device can be
repaired before the turbine’s load drops suddenly and the rotor accelerates.
If the turbine is allowed to deteriorate to this level of failure, a company may incur
damage from flying blades or even employee injury from the malfunction.
c. Manage Sensitivity and Security in IT
SCADA systems help IT and telecom organizations better control sensitive systems
and monitor remote environments. Sensors can provide around-the-clock “eyes”
on things like the temperature of servers or the humidity in rooms with sensitive
IT equipment to avoid or minimize damage caused by environmental factors.
In addition, SCADA systems work well within security applications such as alarm
contact closures, magnetic door sensors, and motion detectors.
d. Nerve Centre of Alternative Energy
Wind-powered energy is rising in popularity as the world demands more renewable
energy options. SCADA systems play a critical role in modern wind farms. The
system brings each piece of turbine equipment together with substations and
weather monitoring equipment.
These systems can even track all activity on a wind farm within a 10-minute
window, which gives the human operator near real-time activity tracking. If
anything looks amiss in the system, action can be taken almost immediately to
protect the equipment and safety of the surrounding community. In addition, the
SCADA system can track energy output and any functionality errors that can be
used as evidence for warranty claims on equipment.
Data is sent through a fibre-optic network that summarizes not only the
performance of the turbines and other equipment but also the performance of the
wind itself. Meteorological equipment must be incorporated into the system in
order to determine if lower power production is due to equipment issues or low
winds.
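The reasoning in the last paragraph, separating low wind from equipment problems, can be sketched as a comparison between measured output and the output expected at the measured wind speed. Everything specific in this Python sketch is an assumption for illustration: the simplified power curve, the cut-in, rated, and cut-out speeds, the 2000 kW rating, and the 80% threshold. A real wind farm would use the manufacturer's power curve for each turbine.

```python
def expected_power_kw(wind_speed_ms, cut_in=3.0, rated_speed=12.0,
                      cut_out=25.0, rated_power_kw=2000.0):
    """Very simplified, hypothetical power curve.

    Zero output below cut-in or above cut-out speed, a cubic ramp between
    cut-in and rated speed, and constant rated power above rated speed.
    """
    if wind_speed_ms < cut_in or wind_speed_ms > cut_out:
        return 0.0
    if wind_speed_ms >= rated_speed:
        return rated_power_kw
    fraction = (wind_speed_ms**3 - cut_in**3) / (rated_speed**3 - cut_in**3)
    return rated_power_kw * fraction


def diagnose_output(actual_kw, wind_speed_ms, min_ratio=0.8):
    """Classify a reading using the meteorological data.

    If the turbine delivers at least `min_ratio` of what the wind allows,
    any low output is attributed to the wind; otherwise it is flagged.
    """
    expected = expected_power_kw(wind_speed_ms)
    if actual_kw >= min_ratio * expected:
        return "explained by wind conditions"
    return "possible equipment issue"


print(diagnose_output(0.0, 2.0))      # calm day, no output expected
print(diagnose_output(1000.0, 15.0))  # strong wind but only half of rated power
```

Without the wind measurement, both readings above would simply look like low production.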
Origins of SCADA
Before the technological age dawned, industrial plants and manufacturing
facilities employed teams of technicians to manually control and check on all their
equipment to keep it operating. This resource- and labour-intensive
system used paper-based records, pushbuttons, and analog dials to perform
monitoring and control.
For smaller facilities, this manual system was workable for a time. However,
as companies grew in size and reach, it became more difficult for them to rely on
nonautomatic industrial processes, especially over longer distances. The earliest
automation tools were timers and relays, which reduced the number of trips
technicians needed to make to remote locations.
As growth continued, though, organizations found that timers and relays
had their limitations. They were not flexible enough and took up a great deal of
physical space. Around 1950, computers were entering the picture and brought a
new level of hope for industrial plants. One of the most useful initial developments
was telemetry, which allowed off-site monitoring and data transmission. SCADA
originated in the 1970s with the advent of microprocessors and programmable
logic controllers (PLCs).
Today, the biggest manufacturing companies in the world are also known
to be the most data-driven. In an age of growing technological capabilities, the
importance of collecting data is pushed to the limit with the use of systems such
as SCADA.
By collecting and monitoring real-time data, SCADA shows an overview of
how each key piece of equipment in the plant is performing. Sensors on the
equipment send signals through remote terminal units (RTUs) and programmable
logic controllers. This gives the system the ability to pinpoint anomalies in system
functions based on the collected data, thereby allowing the company to promptly
take action on the issue.
SCADA allows maintenance personnel to make more informed decisions.
The system is applicable to a wide variety of industries, including oil and gas,
energy, and manufacturing. Virtually any corporation that benefits from accurate
and timely data monitoring will benefit from SCADA.
The Evolution of SCADA
Like just about all industrial computer systems, SCADA was first implemented on
huge mainframe computers. As a result, these were standalone systems,
functioning and housed in a single location. They were known as
monolithic systems.
A couple of decades later, SCADA adapted to the shrinking computer
hardware, PC-based software, and local area networks. This allowed modern
systems to break free of initial walls and join similar systems to share information
and data more efficiently. However, many of these systems were still proprietary
and closed. These were known as distributed systems.
By the turn of the millennium, SCADA joined the ranks of other computer
systems in a more open environment. These networked SCADA systems ran on
Ethernet, which allowed multiple systems, vendors, and partners to join in the
network and connect to the SCADA system.
Although technology has continued to advance, many industrial plants still
use proprietary technology, making data transfer cumbersome. Some companies,
however, have evolved to offer a SCADA system of linked and connected devices.
Historically, the challenge with SCADA systems was finding an efficient
communication infrastructure network to connect the devices. Early technologies
relied on proprietary systems to link the components. Though effective to some
extent, those systems had limitations in keeping up with the times.
Meanwhile, information technologies have been emerging over the last few
decades. The development of structured query language (SQL) databases was
significant in advancing data management. Modern SCADA systems incorporate
SQL capabilities, further linking to enterprise resource planning systems for a
smoother and more holistic operation.
Market research analysts also state that the industrial control systems
market, which includes SCADA systems, is projected to reach $181.6 billion by
2024. The Industrial Internet of Things, cloud technology, and evolving web
technology will no doubt have an impact on future SCADA systems.
Conclusion
Modern SCADA systems, together with developments in information technology, are
becoming an integral part of plant systems. These systems use Industrial
Internet of Things technology such as sophisticated and simple sensors to collect
a wealth of data from key components and industrial processes throughout an
organization, regardless of geographical location. Technology then allows the
transfer and conversion of that data into usable information. Human-machine
interface (HMI) systems display and deliver the data to skilled technicians who
can make data-driven decisions quickly and efficiently.
SCADA systems can work together with RCM and maintenance strategies
that focus on predictive maintenance. SCADA provides the data and technology
to allow a great deal of automation and data collection, which means that
problems and failures can be spotted at a point before they cause major equipment
damage, shut down an entire production line, cause a serious accident, or result
in an environmental catastrophe. As technology continues to develop into the
future, the potential for SCADA systems and related processes is great in helping
companies increase revenue and safety.
5. Lean Six Sigma
Lean Six Sigma is a process that aims to systematically eliminate waste and
reduce variation.
Over the past few decades, streamlining processes has been identified as
the key to unlocking the maximum efficiency of a plant. Some studies,
particularly in manufacturing industries, have correlated being ‘lean’ with
inventory management.
Using this correlation, studies have found that since the 1980s, major
manufacturing companies have shown an increasing trend towards being lean. This
reinforced the need to focus on improving processes, and encouraged the continued
application of Lean Six Sigma practices.
In the maintenance setting, the same concepts of being more efficient are
becoming more of a requirement than an option. The philosophy of
continually improving processes by taking out redundant steps, while still
consistently maintaining high standards, is increasingly recognized as a driver
of the overall performance of a plant.
Lean Six Sigma combines two concepts that go together to increase
effectiveness and efficiency: Lean aims to eliminate waste, while Six Sigma
aims to reduce variation in processes.
The Lean method
It’s estimated that maintenance activities can make up 15 to 70% of the total cost
of production of a factory. Such a large share of total spend creates an equally
large incentive to remove any non-value-adding activity.
Being lean focuses on identifying and eliminating unnecessary steps of a
process. Various practices that promote a lean system have been developed to
provide a guide for process improvements. Though most of these were historically
developed for the manufacturing industries (e.g. 5S principle, Kaizen, Value-Stream
Process), their application to maintenance procedures has started to
pick up.
The key to lean maintenance is identifying the causes of waste proactively.
Having a CMMS that measures your biggest causes of waste is a good start
for collecting insightful data. The time between failures and the time
it takes to repair equipment can be key areas to investigate. Moreover, proactive
approaches to maintenance such as PM and PdM are shown to reduce unplanned
downtime.
The Six Sigma method
The Six Sigma method is a framework to ensure that processes are created to
consistently provide high-quality output.
The five main phases of the method are abbreviated as DMAIC:
• Define the problems you want to address and the objectives you want to
achieve. This process involves the identification of resources, benefits, and
timelines.
• Measure your baseline metrics as a comparison for future progress. Agree
on the methods to collect data accurately and consistently.
• Analyse your data to identify root causes. It is important to establish cause-
and-effect relationships between incidents and root causes to get to the
source of the problem.
• Improve the process by implementing solutions to the identified problems.
This phase may include testing and prototyping to ensure that the root
causes are identified accurately.
• Control the improved process by putting systems in place to monitor the
progress and effectiveness of implemented solutions.
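To make the Measure and Control phases more concrete, the sketch below derives limits from baseline data and flags later readings that fall outside them. It is a simplified three-sigma check, not a full control-chart procedure. The repair-time figures and the function names `control_limits` and `out_of_control` are assumptions for this example.

```python
import statistics


def control_limits(baseline):
    """Return (lower, centre, upper) as mean -/+ 3 sample standard deviations."""
    centre = statistics.mean(baseline)
    sigma = statistics.stdev(baseline)
    return centre - 3 * sigma, centre, centre + 3 * sigma


def out_of_control(values, lower, upper):
    """Return the values that fall outside the control limits."""
    return [v for v in values if v < lower or v > upper]


# Measure: baseline repair times in hours (illustrative data only).
baseline_repair_hours = [4.0, 5.0, 6.0, 5.0, 4.0, 6.0]
lower, centre, upper = control_limits(baseline_repair_hours)

# Control: check repair times recorded after the improvement was made.
flagged = out_of_control([5.5, 9.0, 4.5], lower, upper)
print(f"limits: {lower:.2f} to {upper:.2f} h, flagged: {flagged}")
```

A flagged value is a prompt to investigate whether the process has drifted, which ties the Control phase back to the Analyse phase.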
The Eight Kinds of Waste
At the end of the day, Lean Six Sigma aims to remove useless steps from your
processes and to provide consistent quality of service across tasks. This is a
constant process that needs continuous effort. Keeping an eye on the eight main
kinds of waste—easily remembered as DOWNTIME—can keep you in check on
how lean the plant is running.
i. Defects – errors and any output with poor quality
ii. Overproduction – producing more than the required amount
iii. Waiting – any unplanned downtime or idle time
iv. Non-utilized talent – any indications of overstaffing or unused workforce
v. Transportation – unnecessary distances travelled from one location to
another
vi. Inventory – inefficient storage management
vii. Motion – unnecessary movement of people or equipment
viii. Extra processing – various processes that have no added value
Conclusion
Though the philosophy behind Lean Six Sigma was developed for the
manufacturing industries, its application to maintenance is more relevant than
ever. Maintenance activities are essential to a plant’s overall performance and
Lean Six Sigma offers methods to perform maintenance activities with consistently
high standards while reducing unnecessary costs.
6. Root Cause Analysis
Root cause analysis (RCA) is a systematic process of identifying the origin of an
incident.
When feeling under the weather, it’s perfectly natural to address any pain
or discomfort by some sort of first aid treatment or a superficial remedy. However,
if you consult a medical professional, then the approach might be a little more
thorough. You might find yourself being asked a series of specific questions about
your condition, and might even go through some laboratory tests to get to the
source of your illness.
The same is true for plant and maintenance incidents. While an immediate
response is usually required, there is always value in performing a systematic
analysis of possible root causes.
RCA is the process that aims to identify the cause of a particular event. In
the plant setting, this event usually refers to any potential problems that will
disrupt standard operations. At a very high level, the usual suspects (i.e., usual
causes of problems) can be categorized as:
• Technical issues affecting physical parts
• Human causes, or when an assigned individual does not perform a task
correctly
• System causes, or lapses in processes
The general process of RCA requires you to describe what happened, why and
how it happened, and what steps are needed to prevent the same event from
happening in the future. The process can get very complex depending on the
situation. Thankfully, some common methods were developed to aid in identifying
the root cause.
Common Methods Used in RCA
RCA makes use of a number of methods that help teams to brainstorm and
pinpoint likely causes of issues in the facility. The following methods can assist
maintenance teams when performing root cause analysis.
a. 5 Whys
The name of the method pretty much explains the steps: ask why and ask it again.
Asking “why?” five times usually gets to the bottom of the problem, but don’t let
the name stop you from asking more times. The idea is to drill down to the details
of an event until you are left with the actual root cause. The 5 Whys method is the
simplest RCA tool. It’s often best for operators and others performing the day-to-
day labour in the facility.
Figure 2.9 Example involving a faulty mixer subjected to 5 Whys
b. Fault Tree Analysis
A more visual method to determine root causes is by using a fault tree diagram. A
fault tree diagram starts by having the problem at the topmost block. The
immediate causes preceding the problem event are listed, then they branch out to
form the second layer of the diagram. Each immediate cause branches out to its
own prior causes. This process is continued until the most basic events are
identified, which then become your potential root causes. The same mixer problem
can be represented by the following fault tree diagram:
Figure 2.10 Fault tree diagram of the faulty mixer
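The branching structure of a fault tree maps naturally onto a nested data structure. The Python sketch below stores each event with its immediate causes and walks the tree to collect the basic events, which are the candidate root causes. The events listed are hypothetical and are not taken from the figure; the sketch also omits the AND/OR gates that a formal fault tree uses to combine causes.

```python
# Each key is an event; its value holds the immediate causes of that event.
# An empty dict marks a basic event with no further causes identified.
fault_tree = {
    "Mixer stops unexpectedly": {          # top event (hypothetical)
        "Motor overheats": {
            "Blocked cooling vents": {},
            "Bearing lacks lubrication": {},
        },
        "Power supply interrupted": {
            "Loose terminal connection": {},
        },
    }
}


def basic_events(tree):
    """Return the leaf events of the tree in depth-first order."""
    leaves = []
    for event, causes in tree.items():
        if causes:
            leaves.extend(basic_events(causes))
        else:
            leaves.append(event)
    return leaves


for cause in basic_events(fault_tree):
    print("Potential root cause:", cause)
```

Each potential root cause found this way still needs to be confirmed against evidence from the equipment before corrective action is chosen.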
c. Fishbone Diagram (AKA Ishikawa Diagram)
Another visual method to identify root causes is by using a fishbone diagram (also
known as an Ishikawa diagram, named after its creator, Kaoru Ishikawa). It starts
by specifying the problem on the rightmost part of the diagram. The factors
contributing to the main problem are then listed as categories. Specific causes
under each category are then listed down to identify the source of the problem.
As a general guide, the following categories are used as starting points:
• Environmental
• People
• Equipment/material
• Procedures
Applying these basic categories as a starting point, the mixer problem can be
translated into a fishbone diagram.
Figure 2.11 Fishbone diagram of the faulty mixer
Additional RCA Tools
The following two methods, FMEA and the Pareto method, tend to be more
forward-looking than most other RCA tools, and work best when performed on a
routine basis rather than only after an equipment failure.
a. FMEA
Failure Mode and Effects Analysis is a method for identifying ways in which assets
might fail. One takes stock of the potential failure modes that individual assets
might experience and analyses how those failures might impact business
processes.
FMEA differs from the other RCA tools discussed so far because it looks
forward at what might happen rather than hypothesizing over a failure that
already occurred. However, it can still be useful when it comes to finding root
causes. Facilities that take the time to perform FMEA will have a ready-to-use
database of potential causes and effects to draw upon when analysing a failure
event, ultimately expediting the process.
b. The Pareto Method
The Pareto method is based on what’s commonly called the Pareto principle, which
states that roughly 80% of all problems result from 20% of all causes. When drawn
up into a chart, potential causes of the problem are listed from left to right in
descending order of frequency or impact (greatest on the left, least on the right).
Each cause is represented in the diagram as a bar, and that bar’s height
represents its frequency.
In addition to the bars in the chart, a line is also charted across the diagram
to show the cumulative impact of each cause (ascending from left to right). A
Pareto diagram can be used to visualize data from FMEA in a way that helps
maintenance teams target the most important issues first. That way, the team
spends less time on tasks that don’t matter.
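The construction of a Pareto chart can be sketched numerically as follows. The failure counts are hypothetical, and the function names `pareto_table` and `vital_few` as well as the 80% threshold parameter are assumptions made for this illustration, not part of any standard tool.

```python
def pareto_table(cause_counts):
    """Return (cause, count, cumulative %) rows sorted by descending count."""
    total = sum(cause_counts.values())
    rows = []
    running = 0
    ordered = sorted(cause_counts.items(), key=lambda kv: kv[1], reverse=True)
    for cause, count in ordered:
        running += count
        rows.append((cause, count, round(100 * running / total, 1)))
    return rows


def vital_few(cause_counts, threshold=80.0):
    """Return the leading causes needed to reach the cumulative threshold."""
    selected = []
    for cause, _count, cumulative in pareto_table(cause_counts):
        selected.append(cause)
        if cumulative >= threshold:
            break
    return selected


# Hypothetical number of breakdowns attributed to each cause.
failures = {
    "bearing wear": 42,
    "misalignment": 28,
    "lubrication": 15,
    "electrical": 8,
    "operator error": 5,
    "other": 2,
}

for row in pareto_table(failures):
    print(row)
print(vital_few(failures))
```

The count column gives the bar heights and the cumulative column gives the line drawn across the chart. In this made-up data set, the first three causes account for 85% of the breakdowns, so they would be targeted first.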
Advantages of Effective Root Cause Analysis
Effective root cause analysis helps maintenance teams focus on fixing the core
causes of problems rather than constantly treating symptoms. A few ways in
which RCA achieves that include the following.
a. More Efficient Problem Resolution
Whenever a machine breaks down, maintenance teams often focus solely on
bringing it back online. In fact, about 56% of all facilities use a run-to-failure
maintenance strategy with at least some of their assets.
However, without researching the root causes of these breakdowns, they are
unlikely to go away. Odds are, the asset will break down again in the future. When
performed correctly, RCA helps maintenance teams focus their preventive
maintenance on the most important tasks. Given that as much as half of all PMs
ultimately accomplish nothing, that could translate into vastly reduced
maintenance costs.
b. Puts Everyone on the Same Page
When getting to the root of a problem, it’s common for individuals to blame other
people, departments, etc. One goal of RCA is to avoid this type of situation where
everyone blames one another for problems instead of looking at core systemic
issues.
The problem here is that issues related to human error need to be resolved
with adequate processes and controls—the issue won’t necessarily be solved by
removing a given human being from the situation since any other person could
make the same mistake. As such, the root cause is related to processes and
procedures, not people. Proper RCA avoids this problem by helping the team work
together to identify issues that are related to systems, processes, and machines
while driving toward actionable plans. Ultimately, it helps people get on the same
page.
c. Builds a Culture of Continuous Improvement
By focusing on identifying root causes of problems, maintenance teams switch
their perspective from maintaining the status quo toward continuous
improvement. In fact, a core facet of Kaizen (or continuous improvement) is the
analysis of existing processes, which RCA embodies perfectly.
As maintenance teams perform RCA on failed equipment, that process
naturally translates into finding ways to improve existing processes as well. After
all, the purpose of root cause analysis is to get to the fundamental causes of an
issue and work on repairing those rather than focusing on fixing failed equipment
alone.
d. Overall Better Quality
When root causes are discovered and properly dealt with, equipment runs more
reliably, resulting in fewer breakdowns, overall better processes, and more
consistent output quality.
Implementing RCA
While RCA methods are very common and well-known to the maintenance
community, there can be challenges to making RCA thrive.
The first step to mastering this process is knowing the methods that are
available to conduct RCAs. The next steps are setting the proper mindset and
improving the quality of execution to drive the initiative toward success.
Keep in mind the importance of collecting data accurately and involving the
correct groups to analyse that data. To implement RCA effectively, it should be a
repeatable process that is collaboratively executed by the group.
Tips to Perform Effective Root Cause Analysis
In order to successfully implement RCA and receive its full benefits, it must be
done correctly. The following pointers can help you implement root cause analysis
effectively in your facility.
1. Collect Solid Information
Good information is vital to completing any process successfully, and RCA is no
exception to that rule. In order to get the most out of it, you’ll need to make sure
you’re collecting data from your facility’s processes.
There are several ways to do this, of course. One of the simplest is to
implement a CMMS at your facility if you haven’t already. Computerized
maintenance management systems provide a way to collect data from work orders,
meter readings, and so forth, all of which can be invaluable when analysing an
issue.
As you consistently collect good information on your facility’s equipment
and processes, you’ll make RCA more precise. In addition, the practice of collecting
that information supports proactive RCA as you notice trends in the data leading
up to potential future problems.
2. Create a Repeatable Process
Generally, the most effective processes aren’t necessarily the most perfect, but the
ones that can be easily repeated. While making sure you’re continuously
improving your root cause analysis is important, it’s unlikely to become a regularly
used tool in your facility if it’s not fundamentally repeatable.
Some ways to create a repeatable RCA process include:
• Clearly outlining what triggers RCA in your facility
• Keeping it simple and straightforward
• Fostering a culture of continuous improvement rather than merely meeting
the status quo
• Documenting RCA procedures in a clear, step-by-step format
3. Facilitate Incident Reporting
In order to analyse incidents, you first need to be aware of them. Logging asset
data can help with that, but it’s absolutely vital for your employees to feel free to
report incidents or problems when they occur.
As such, incident reporting in your facility should be fear-free and open to
everyone. One way to accomplish this is to make your incident reporting process
anonymous. Employees can fill out a form without having their own name
attached to it, which helps eliminate the anxiety that’s often associated with
reporting an equipment breakdown, fault, or accident.
4. Prioritize Causes
RCA is most effective when you’re able to prioritize causes. Rather than spreading
your time and efforts across numerous potential causes, you’re able to focus on
resolving the issues that have the most impact (and the greatest cost).
As mentioned above, FMEA and Pareto diagrams can help your team
prioritize the right causes. After figuring out a number of potential causes, it’s
often worthwhile to analyse the potential impact of each one to see where you can
make the greatest difference.
5. Take Your Time
It’s important not to rush the RCA process. While you don’t want to delay it or
spend too much time analysing the issue—resulting in “analysis paralysis”—
neither do you want to rush to a superficial conclusion of what caused your
problem.
Make sure you’ve assessed as many probable causes as are reasonable to
consider and have gotten to the true underlying issues in your facility before
creating a plan of action. Remember, it’s often important to try to find multiple
potential causes rather than stopping after the first since most complex problems
have multiple contributing factors.
6. Get a Qualified Team Together
RCA is best done as a collaborative effort. After all, there may be multiple issues
at play, and it’s important to have a variety of skillsets and expertise at the table.
Potential qualified team members include:
• Maintenance professionals
• Operators
• Reliability engineers
In addition, you’ll want someone who has enough authority to help the team
overcome organizational roadblocks in the investigation process.
Finally, at least one person you select for your RCA team should have solid
investigation skills. They should be the sort of person who is naturally diligent
and impartial with a keen eye for detail.
7. Be Clear on the Problem
Even with a repeatable process and a solid team, RCA will still get you nowhere if
you’re unclear on the actual problems you’re discussing. Before beginning your
discussions, you’ll need to pinpoint exactly what the problem is and how it shows
itself in your processes.
Without that, one of two things might happen:
1. Your team finds a solution to a problem you don’t actually have, or
2. Each member of the team has their own mental concept of the issue, turning
the discussion into an unproductive argument.
Neither result will help you solve the actual issue, so make sure everyone is
clear on the problem before you begin your analysis.
8. Measure Your Results
Finally, it’s important to measure the results of your RCA process in order to gauge
its success. If the same incident occurs again, that’s your cue to perform a more
in-depth analysis or make other adjustments to your process in the future. In the
end, your RCA and other processes will be in a consistent state of improvement.
Common RCA Mistakes
Some of the most common root cause analysis mistakes involve poor definitions
or focusing too much on the wrong thing. Others simply ignore root causes
entirely, rendering the process pointless.
a. Not quite defining the problem
One of the most important parts of root cause analysis is defining the problem
well. When defining the problem at hand, just saying something is wrong isn’t
enough. Rather, you want to dig into the specifics of when it occurs, how prevalent
it is, and any domino effects it causes.
Often, businesses don’t get specific enough about the actual problem, and
that leads their RCA down the wrong path.
b. Focusing on the wrong things
Another frequent issue is a tendency to go on odd tangents instead of focusing on
the root processes.
For example, if you need to rework a corrective maintenance task performed
last week, you might end up looking too much at your people and not enough at
procedures. This might take you down a tangent where you find the work failed
because Jerry wasn’t paying attention, which happened because he was tired from
lack of sleep because the baby wouldn’t stop crying. It might be a root cause, but
there’s nothing actionable there.
A better route would be to see how your maintenance processes might have
failed to account for human error. The job failed because there were no quality
control processes in place in case someone did something wrong. Unlike Jerry’s
sleep schedule, that’s something you can change.
c. Going too narrow
Conversely, instead of branching out on weird tangents, another common mistake
is making your RCA too narrow. If you have a complex systemic issue, you’ll need
to account for multiple contributing causes, not just one. Yet companies often
focus on just one issue to the exclusion of all else.
To solve this problem, use a tool that will help you look at multiple factors,
such as fault tree analysis or a fishbone diagram.
d. Ignoring the results
Once you get some potential root causes, you need to make plans based on your
findings. Teams will sometimes revert to the more superficial “causes” in their
analysis, ignoring the root causes entirely. In doing so, they end up treating
symptoms rather than problems.
Ignoring data is actually pretty common. Oil rigs might ignore 99% of their sensor
data. Management teams might ignore the true causes of their problems.
Basically, when you get to the end of your RCA, address the root problems you
found first.
Conclusion
RCA is a powerful process that enables the organization to identify the source of
a problem. Performing RCA processes effectively can significantly improve a
plant’s performance by implementing correct solutions that last.
Maintenance Costs
Breakdown maintenance stops normal activities, and the machines as well as
the operators are rendered idle until the equipment is brought back to its normal
working condition. It involves a higher cost, because facilities and equipment are
used until they fail to operate, and it carries an associated penalty in terms of
the expediting cost of maintenance and the downtime cost of equipment.
Preventive maintenance will reduce such costs up to a point. Beyond that
point, the cost of preventive maintenance will exceed the downtime cost it
avoids. In such a situation, a firm can opt for breakdown maintenance.
Figure 2.12 Preventive Vs Breakdown Maintenance Costs
Equipment breakdown results in loss of production, costly emergency repairs,
and delays in production schedules, besides keeping men and machinery idle. The
costs of breakdown generally surpass the total cost of preventive maintenance,
which includes the cost of inspection, service, and scheduled repairs, up to the
point ‘M’. Beyond this optimal point, an increasingly higher level of preventive
maintenance is not economically justified, and it is more economical to adopt a
breakdown maintenance policy. The optimal level of maintenance activity ‘M’ is
easily identified on a theoretical basis; to do this, the details of the costs
associated with breakdown and preventive maintenance must be known.
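As a rough numerical illustration of locating the point ‘M’, the sketch below adds the preventive maintenance cost and the breakdown cost at each maintenance level and picks the level with the lowest total. All figures are hypothetical, and the helper name `optimal_level` is an assumption made for this example; real cost curves would come from plant records.

```python
def optimal_level(levels, pm_costs, breakdown_costs):
    """Return (level, total cost) where PM cost plus breakdown cost is lowest."""
    totals = [pm + bd for pm, bd in zip(pm_costs, breakdown_costs)]
    best_index = min(range(len(levels)), key=lambda i: totals[i])
    return levels[best_index], totals[best_index]


# Hypothetical data: maintenance level (e.g. PM inspections per month)
# against cost per month in arbitrary currency units.
levels = [1, 2, 3, 4, 5, 6]
pm_costs = [10, 20, 30, 40, 50, 60]          # rises with more PM
breakdown_costs = [120, 70, 45, 32, 26, 23]  # falls with more PM

best_level, best_total = optimal_level(levels, pm_costs, breakdown_costs)
print(best_level, best_total)
```

With these made-up numbers the total cost falls until level 4 and rises afterwards, so level 4 plays the role of ‘M’; adding more preventive maintenance beyond it costs more than the breakdowns it prevents.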
Various cost components of maintenance include:
1. Downtime (idle time) cost due to equipment breakdown.
2. Cost of spares or other material used for repairs.
3. Cost of maintenance labour and overheads of maintenance departments.
4. Losses due to inefficient operations of machines.
5. Capital required for equipment replacement.
Replacement Economy
The replacement situation arises due to the following reasons:
1. Weak performance of the existing equipment, which needs expensive
maintenance.
2. Failure of the existing equipment because of an industrial accident or some
other reason, or anticipated failure of the existing equipment in the near future.
3. Availability of mechanized or fully automated modern equipment with better
design, which makes the existing equipment outdated.
• When a machine loses its efficiency gradually the maintenance becomes
very expensive. Therefore, the problem is to determine the age at which it is
most economical to replace the item.
• On the other hand, certain items such as bulbs, radio, television, and
computer parts fail suddenly without giving any indication of failure and
they become completely useless. These items are to be replaced immediately
as and when they fail to function.
Replacement problems fall into the following categories, depending upon the life
pattern of the equipment involved:
1. Replacement of the equipment that wears out or becomes obsolete (because of
constant use or new technological developments) with time.
• Determination of economic life of an asset.
• Replacement of an existing asset with a new asset.
2. Replacement of the equipment that fails completely (replacement due to
sudden failure).
• Individual Replacement Policy: Mortality Theorem
• Group Replacement of items that fail completely
3. Other replacement problems
• Recruitment and Promotion Problems
• Equipment Renewal Problems
1. Replacement of the equipment that wears out or becomes obsolete with
time
Costs to be considered:
• Capital recovery cost (average first cost), computed from the first cost
(purchase price) of the machine.
• Average operating and maintenance cost (O & M cost)
• Total cost which is the sum of capital recovery cost (average first cost) and
average maintenance cost
Figure 2.13 Economic Life
The capital recovery cost goes on decreasing with the life of the equipment
and the average operating and maintenance cost goes on increasing with the life
of the equipment. From the beginning, the total cost goes on decreasing up to a
particular life and then it starts increasing.
The economic life of the equipment is the point corresponding to the
minimum total cost. At first, the average cost per period goes on decreasing
the longer the replacement is postponed. However, there comes an age at
which the average cost per period starts to increase. At this age, the
replacement is justified.
A machine loses efficiency with time, and we have to determine the best time
to replace it with a new one. In the case of a vehicle, the maintenance cost
increases as it ages. These costs keep increasing if we postpone
the replacement.
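The economic life calculation can be sketched as below. This is a simplified illustration that ignores the time value of money; the purchase price, the yearly operating and maintenance costs, and the function name `economic_life` are all hypothetical. The average capital recovery cost is taken as (first cost minus salvage value) divided by the age, with salvage assumed to be zero unless given.

```python
def economic_life(first_cost, om_costs, salvage=None):
    """Return (best year, lowest average total cost, table of yearly averages).

    om_costs[i] is the O & M cost incurred in year i + 1.
    salvage[i] is the resale value at the end of year i + 1 (default 0).
    """
    if salvage is None:
        salvage = [0.0] * len(om_costs)
    best_year, best_avg = None, float("inf")
    cumulative_om = 0.0
    table = []
    for year, (om, resale) in enumerate(zip(om_costs, salvage), start=1):
        cumulative_om += om
        avg_capital = (first_cost - resale) / year  # falls with age
        avg_om = cumulative_om / year               # rises with age
        avg_total = avg_capital + avg_om
        table.append((year, avg_capital, avg_om, avg_total))
        if avg_total < best_avg:
            best_year, best_avg = year, avg_total
    return best_year, best_avg, table


# Hypothetical machine: purchase price 10000, rising yearly O & M costs.
om_costs = [1000, 1200, 1600, 2200, 3000, 4000, 5200]
year, avg, table = economic_life(10000, om_costs)
print(year, avg)
```

For these made-up figures the average total cost per year drops from 11000 in year 1 to 3800 in year 5 and rises again from year 6 onwards, so the machine would be replaced at the end of year 5.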
2. Replacement of the equipment that fails completely
A system generally consists of a large number of low-priced components that are
increasingly liable to failure with age. Electronic items like bulbs, resistors, and
tube lights generally fail suddenly rather than deteriorating gradually. The
sudden failure of an item can result in complete breakdown of the system. The
system may contain a collection of such items or just one item, like a single tube
light. The cost of failure in such a case will be considerably more than the cost
of the item itself. In addition, the value of the failed item is so small that the
cost of keeping records of individual ages cannot be justified.
For example, a tube or a condenser in an aircraft costs little, but the failure
of such a low-cost item may cause the airplane to crash. Hence, we use a
replacement policy for such items that minimizes the possibility of
complete breakdown.
The following replacement policies are applicable to this situation.
• Individual replacement policy, in which an item is replaced immediately after
it fails.
• Group replacement policy, which is concerned with items that either work or
fail completely. In this policy, a decision is made as to ‘at what equal
intervals all the items are to be replaced simultaneously, irrespective of
whether they have failed or not, with a provision to replace individually the
items that fail during the fixed group replacement period’.
There is a trade-off between the individual replacement policy and the group
replacement policy. Hence, for a given problem, each of the replacement policies
is evaluated and the most economical policy is selected for implementation. The
optimal period of replacement is determined by calculating the minimum total
cost. The total cost is calculated using: probability of failure at time ‘t’, number of
items failing during time ‘t’, cost of group replacement and the cost of individual
replacement.
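The comparison between the two policies can be sketched as follows. The data are hypothetical, and the function names are assumptions made for this illustration. The sketch also assumes one common textbook convention: items failing during periods 1 to t are replaced individually, and all items are then replaced together at the end of period t. Other conventions count the individual failures slightly differently.

```python
def expected_failures(n_items, fail_prob, horizon):
    """Expected failures in each period when failed items are replaced at once.

    fail_prob[k] is the probability that an item fails during period k + 1
    of its own life.
    """
    failures = []
    for t in range(1, horizon + 1):
        # Original items failing at age t.
        n_t = n_items * fail_prob[t - 1] if t <= len(fail_prob) else 0.0
        # Items installed as replacements in period j, now failing at age t - j.
        for j in range(1, t):
            age = t - j
            if age <= len(fail_prob):
                n_t += failures[j - 1] * fail_prob[age - 1]
        failures.append(n_t)
    return failures


def best_group_interval(n_items, fail_prob, group_cost, individual_cost):
    """Return (interval, average cost per period) for the cheapest group policy."""
    horizon = len(fail_prob)
    failures = expected_failures(n_items, fail_prob, horizon)
    best_t, best_avg = None, float("inf")
    for t in range(1, horizon + 1):
        total = n_items * group_cost + individual_cost * sum(failures[:t])
        if total / t < best_avg:
            best_t, best_avg = t, total / t
    return best_t, best_avg


def individual_policy_cost(n_items, fail_prob, individual_cost):
    """Average cost per period when items are replaced only on failure."""
    mean_life = sum(age * p for age, p in enumerate(fail_prob, start=1))
    return n_items * individual_cost / mean_life


# Hypothetical data: 1000 lamps, failure probabilities by period of life,
# cost 1 per lamp in a group replacement, cost 4 per individual replacement.
fail_prob = [0.1, 0.2, 0.3, 0.4]
best_t, best_avg = best_group_interval(1000, fail_prob, 1.0, 4.0)
indiv = individual_policy_cost(1000, fail_prob, 4.0)
print(best_t, round(best_avg, 2), round(indiv, 2))
```

With these made-up figures, group replacement every 2 periods costs about 1120 per period on average, against about 1333 per period for purely individual replacement, so the group policy would be selected.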
3. Other replacement problems
Apart from industrial replacement problems, replacement principles are also
applicable to the problems of recruitment and staff promotion.
In staffing problems, with fixed total staff and fixed size of staff groups, the
proportion of staff in each group determines the promotion age.
In organizations where staff frequently leave, probability concepts can be
applied to determine the number of candidates to be recruited every year so as
to maintain a constant workforce.
The word renewal means either inserting new equipment in place of old
equipment, or repairing the old equipment so that the probability density
function of its future lifetime is equal to that of new equipment.
Service Life of Equipment
The term ‘service life’ is usually applied to products to indicate the period of time
over which they can function as they were intended, giving users the service they
expect. So, for instance, the service life of a boiler is the length of time it can
function as a boiler i.e., providing heating and hot water.
Service life may be thought to run from the point of sale, i.e., when the
customer buys the product, to the point it is discarded. Some products, however,
are discarded before the end of their service life for various reasons, including the
arrival of better products on the market, boredom, or simply a desire for change.
A product said to have a long service life may suffer the occasional
breakdown during that time. However, if it can be maintained and repaired to
allow it to function as before, this should not normally interfere with the
service life. Poor repairs can, however, adversely affect service life.
Factors that can determine the service life of a product include:
• Quality of manufacture
• Materials used
• Flexibility in use
• Intensity of use
• Operating/environment conditions
• Care in distribution and use
• Built-in obsolescence
• Maintenance and repairs
Manufacturers can use tools and calculations (reliability analysis and
maintainability, for instance) to determine a product’s expected service life.
Specifying a product’s service life represents a commitment on the part of a
manufacturer.