ARP-A Manual Jan 2021 v2.0 A4

Asset Reliability
ARP-A
Practitioner
RELIABILITY ADVOCATE
COURSE MANUAL
www.mobiusinstitute.com
This is designed as a guide only.
In practical situations, there are many variables, so please use this information with care.
Version 1.0
© 2020 - Mobius Institute – All rights reserved
DO NOT COPY OR REPRODUCE IN ANY FORM

MOBIUS INSTITUTE | ARP-A R-01: Getting Started
Contents
Introduction from Jason Tranter
R-01: Getting Started

NINE COMMON MYTHS
KNOW YOUR GOALS
R-02: What Are the Benefits?

WHY RELIABILITY?
WHAT DO WE GAIN WITH RELIABILITY?
WHAT CAN RELIABILITY HELP US REDUCE?
CASE STUDIES IN RELIABILITY
R-03: Assessing the Benefits

HOW TO BEGIN
WHY RELIABILITY?
SET PRIORITIES
THE BUSINESS PROCESS REVIEW
HOW ARE YOU PERFORMING RIGHT NOW?
MEASURING RELIABILITY
MEASURING PROGRESS
R-04: Culture Change

WHY DO WE NEED THEIR SUPPORT?
WHY CAN’T YOU JUST FORCE PEOPLE TO CHANGE?
THE ROLE OF UNIONS
CHANGE MANAGEMENT
WHAT MESSAGES ARE YOU SENDING?
LEARN FROM HISTORY
R-05: Selling to Senior Management

HOW DO WE SELL THE BENEFITS?
R-06: Establishing the Strategy

YOU NEED A PLAN
ESTABLISH A TEAM
SUPPORT FROM HR AND OTHERS
R-07: Understanding Failure

COMMON BELIEFS
COMMON FAILURE PATTERNS
THE REALITY
R-08: Defect Elimination

WHAT IS DEFECT ELIMINATION?
DESIGN FOR RELIABILITY
PROCUREMENT
TRANSPORT
ACCEPTANCE TESTING
OPTIMAL OPERATION
R-09: Asset Strategy

WHAT ARE THE TYPICAL OUTCOMES?
WHO SHOULD DEVELOP THE STRATEGY?
GET ORGANIZED
ANALYZING RELIABILITY DATA
DATA-DRIVEN APPROACHES
INTERPRETING FAILURE MODES
ASSET CRITICALITY RANKING
CRITICALITY RANKING
PREVENTIVE MAINTENANCE OPTIMIZATION
RCM AND FMEA
DECIDING TO PERFORM RCM OR FMEA
INTRODUCING RCFA
BROWN-PAPER PROCESS
VENDOR ANALYSIS
R-10: Work Management

WORK MANAGEMENT FLOW
STRATEGY BASED WORK AND WORK REQUESTS
ESTABLISHING A PRIORITY SYSTEM
PROCESS REQUESTS
JOB PLANNING
WHY PLAN?
JOB SCHEDULING
JOB EXECUTION
COMMISSIONING
CLOSE-OUT AND FEEDBACK
ANALYSIS AND PROCESS MANAGEMENT
SHUTDOWNS, TURNAROUNDS, AND OUTAGES
R-11: Spares Management

ACCURATE DATABASE
CONTROL ACCESS
SELECTION PROCESS
CARING FOR SPARES
R-12: Precision and Proactive Work

PRECISION LUBRICATION
CLEAN AND COOL
PRECISION INSTALLATION
PRECISION ALIGNMENT
PRECISION BALANCING
PRECISION FASTENING
RESONANCE ELIMINATION
5S AND THE VISUAL WORKPLACE
R-13: Condition Monitoring

BASIC APPROACH
VIBRATION MONITORING
ULTRASOUND
ELECTRIC MOTOR TESTING
OIL ANALYSIS
WEAR PARTICLE ANALYSIS
INFRARED ANALYSIS
VISUAL INSPECTIONS
PERFORMANCE MONITORING
NON-DESTRUCTIVE TESTING
R-14: Breaking Out of the Reactive Maintenance Cycle of Doom

STEP ONE
STEP TWO
STEP THREE
STEP FOUR
STEP FIVE
STEP SIX
IF YOU WANT TO STAY IN…
R-15: Continuous Improvement

KEY PERFORMANCE INDICATORS
REVIEW PROGRAM STRATEGY
CONTINUAL EDUCATION
R-16: Implementation Strategy

LAYING THE FOUNDATION
EXPANDING THE PROGRAM
Introduction from Jason Tranter

Thank you for selecting this course to learn about reliability improvement and performance improvement.
I have been personally involved with condition monitoring and reliability improvement since about 1984. I have
observed many companies, mainly in the US and Australia, try to improve reliability, initially through vibration analysis.
In this course, I share the successes and failures of these companies. In more recent years, as the founder and
managing director of Mobius Institute, I have been more actively involved in training people to improve reliability. We provide
training all over the world via e-learning as well as live courses.
This course explains what is involved in improving reliability, which will improve the performance of your facility or
plant. It focuses on the challenges involved and how to overcome them. As you are taking this course, I assume you want
to learn about condition monitoring, work management, reliability improvement, and so on. The course places an emphasis on
the “whys” and “hows.” This is particularly important when it comes to getting the people in your company on board,
without which your reliability program will not work.
This course starts with an introduction that dispels some of the myths of reliability improvement, and the following
modules deal with the benefits, which you will need to know in order to get senior management and critical employees
on board. We will then go through culture change, general strategies, failure patterns, defect elimination, and asset
strategy. We will continue with work and spares management, precision and proactive work, various condition
monitoring technologies, breaking out of reactive maintenance, continuous improvement, and implementation steps.
The goals of this course are for you to:
1. Gain a clear understanding of why you should improve reliability
2. Determine how to get support for the reliability improvement initiative
3. Gain a solid appreciation of what it will take to improve reliability
4. Understand how you can assess equipment health and measure the improvements gained
I hope, when you have completed this course, we will have met these goals. Thank you.
R-01: Getting Started
What is a “reliable plant” and what is it we are trying to achieve? We often say we are trying to achieve improved
reliability and improved performance.
A reliable plant has machinery and equipment that operate at the desired level of performance and quality
when called upon to operate, allowing the company to achieve its goals. Regular maintenance and checks can
contribute to this.
This reliability should be achieved without excessive costs, maintenance, and downtime. Equipment should not
need to be replaced often, and unnecessary redundancy, excess labor, and massive consulting bills should be
eliminated.
A reliable plant is clean, safe, well organized, efficient, dependable, stable, and competitive. As a result, a reliable
plant will have happy customers, happy owners and shareholders, happy regulators and insurers, and happy
employees. It is a safe and rewarding place to work.
In a reliable plant, equipment does not break down unexpectedly, equipment achieves (or gets close to) its design
lifetime, and the majority of maintenance work is planned and scheduled based on condition. In a reliable plant,
everyone is well trained with the required skills and tools (including contractors), and equipment is operated
optimally according to standard operating procedures.
In a reliable plant, work areas and plant/facility equipment are clean and tidy. Work areas and storerooms are
organized and documented, and the database is correct and up to date. Machines are lubricated, aligned, balanced,
and tightened with precision. Work is performed safely and efficiently, following documented procedures.
In a reliable plant, project and design decisions are based on the total cost of ownership: the equipment should
be maintainable, reliable, easy to operate, and energy efficient. Spares and equipment purchase decisions are
based on the total cost of ownership through their operation and disposal. The storeroom contains parts and
spares that can be justified. All work performed should be prioritized and justified.
In a reliable plant, employees are less stressed and less frustrated. The plant is less likely to close, because
reliability helps keep it in business. Work is performed with pride, targets are met, and the plant is safer for all
employees. There is less harm to the environment, and improvements actually get made; in too many plants, problems
are identified and plans are made to fix them, but nothing is done. Managers value your opinion and you are given
ownership of projects.
A reliable plant is what we are trying to help you achieve.
NINE COMMON MYTHS

There are many common myths that hold reliability improvement programs back. Many people have tried to
improve reliability in their plants and have failed. You need to understand the challenges and the ideas that are
in people’s heads regarding reliability in order to change their minds.
Myth #1: Asset or equipment failure cannot be avoided, so we can only reduce the consequences.
It is true that there may be problems due to purchase and design decisions. But no matter what, we can extend
the life of equipment by taking the right steps. We must take proactive steps to improve reliability and extend
asset life.
Why do assets fail? Too often, it’s because we kill them through improper maintenance practices. For example,
when a fan is out of balance, it is forced into a circular (orbiting) motion as it turns, and the bearings take that
extra load. This reduces the life of the structure, the motor, and the fan itself, and it can contribute to other
quality problems. When there is an offset in the alignment between a pump and its motor, the bearings, the seal, the shaft, the
coupling, and the foundations themselves are all being pounded 24/7. Just a little bit of misalignment can suck
the life out of the components.
Likewise, if a bearing is slightly cocked during installation, the rolling elements are pounded with each rotation
and the asset’s life is shortened.
If there is a tiny particle, just a few microns in size, caught in the bearing’s grease, the rolling elements will roll
over it and cause a tiny indentation that can spread until it becomes a spall. The same thing can happen in a
gearbox. The particles that do this type of damage are too small to see or feel.
If part of a machine is on standby, the vibration that transmits to it can cause false brinelling—the rollers chip
away at the raceway of the bearing. This can cause the unit to fail soon after it is put into use. There are many
reasons for equipment failures, but many are self-inflicted. If we change what we do, we can extend the life of the
equipment.
Myth #2: The path to improved performance lies entirely within the maintenance group.
The fact is that everyone is responsible for reliability, from the design phase through operation. As with safety,
everyone has to look at their own actions and observations to make sure everything is conducive to reliability.
Many maintenance personnel feel they are playing whack-a-mole with the machines under their care: as soon as one
problem is fixed, another pops up. They are too busy fixing problems to take proactive steps to improve reliability.
We need to dig deeper and find out why the problems keep popping up. Within the machine, some problems develop very
quickly while others take time. Where are the problems coming from? It starts with the design
department, where corners may have been cut in production. Procurement buys the product because it is cheap.
Because the design of the machine stresses it, maintenance problems arise. Maintenance may do what they need
to do to get the machine going and put off actually improving the machine.
Performing condition monitoring is like asking a doctor to do some tests on your body. Like your doctor, the
condition monitoring professional can tell you if a problem is developing. While informing us of developing
problems is helpful, those problems still exist. Hopefully, we can schedule the right time to do the needed work,
get the right people with the right skills, and order spares. This approach improves the reliability of the plant, but
not of the equipment. The root causes of failure are still there.
What if the design, procurement, maintenance, and spares departments all said, instead of “This will do,” “This
is the best way”? If they then took care of the equipment and maximized the value from the equipment, we
would achieve the design lifetime and reduce our total cost of ownership. The up-front cost of reliability can
be a challenge, particularly for management, but it will pay off in the long run as production, the quality of the
products, and safety improve.
Your doctor does not tell you that a heart attack is coming and suggest that you come in for a new heart in seven
weeks. He or she sees the plaque buildup and the high blood pressure and suggests preventive maintenance,
such as dietary changes and exercise. In the same way, we need to make changes to the way we are operating
the equipment: it should be balanced, aligned, operated in a consistent way, and cleaned. Just as your body will if you
follow your doctor’s advice, the machine will have a longer life and avoid failures. The maintenance department
will be happy because problems won’t pop up all the time. Operations will be happy because the machines are
performing well. Management will be happy because overall costs are going down as a result of all this.
Myth #3: We have to replace or overhaul equipment before it can fail.
The fact is, time-based maintenance can waste a lot of money. Condition monitoring gives a warning of the root
cause and the failure, and we can act accordingly. Some maintenance work should be time based, but other
maintenance work should be condition based.
One of the basic principles of condition monitoring is that we watch for the tell-tale signs and only perform
maintenance when needed. The P-F Interval is what we want to watch. It is shown in a graph with the x-axis
being the passage of time and the y-axis being the condition of the asset. The line will stay steady as long as the
condition is stable. It will start to curve down when a defect is initiated, and will continue downward until failure
if left unchecked. The good thing is that the machine will start warning us during this time by making different
sounds, changing its temperature, etc., and condition monitoring allows us to catch those signals. Our goal is
to detect the problem as early as possible. An early warning of an unavoidable problem gives us time to order
spares and plan the maintenance for a convenient time. The longer the problem goes undetected, the more
the costs of maintenance go up, as do all the risks associated with failure. We are interested in the curve from
Potential failure to Functional failure, namely, the P-F Interval.
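As a rough illustration of the P-F Interval described above, the sketch below takes a series of condition readings and reports the time between the first reading at or below a "potential failure" level and the first reading at or below a "functional failure" level. The function names, the condition scale, and the two threshold values are assumptions made for this example only; in practice the detection point depends on the condition monitoring technology in use.

```python
from typing import Optional, Sequence, Tuple

Reading = Tuple[float, float]  # (time, condition); condition falls as the asset degrades


def first_time_at_or_below(readings: Sequence[Reading], level: float) -> Optional[float]:
    """Return the time of the first reading whose condition is at or below the level."""
    for time, condition in readings:
        if condition <= level:
            return time
    return None


def pf_interval(readings: Sequence[Reading],
                potential_level: float,
                functional_level: float) -> Optional[float]:
    """Time from potential failure (P) to functional failure (F), or None if either is not reached."""
    p_time = first_time_at_or_below(readings, potential_level)
    f_time = first_time_at_or_below(readings, functional_level)
    if p_time is None or f_time is None:
        return None
    return f_time - p_time


# Hypothetical readings: time in weeks, condition on an arbitrary 0-100 scale.
weekly_condition = [(0, 100.0), (10, 100.0), (20, 90.0), (30, 70.0), (40, 40.0), (50, 10.0)]
interval = pf_interval(weekly_condition, potential_level=90.0, functional_level=10.0)
print(f"P-F interval: {interval} weeks")
```

The earlier the defect is detected (the higher the potential failure level we are able to detect), the longer the interval, and the more time there is to order spares and plan the work.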
Myth #4: The solution to improving performance is entirely technical.
The real key to success is developing a culture of reliability. The technical aspect is important, but reliability is a
people issue.
We need a vision and a strategy. Senior management needs to emphasize that safety and reliability are core
aspects of our values and key to our success. Everyone must participate, contribute, and believe. Maintenance
and production should work together to achieve reliability.
Myth #5: The reliability engineer knows the most about equipment reliability.
The people who operate and maintain the equipment day after day know a lot about the equipment. Many times,
they understand why the machines fail and have ideas on how to improve them, but their voices are not heard.
Myth #6: The goal is to improve reliability at all costs.

The goal is actually to maximize the value of the assets in the plant and minimize the impact on safety, health,
and the environment.
To maximize the value, it is necessary to perform a business process review.
Myth #7: You must analyze historical data before you can make improvements.
Analysis is good, but we should be proactive from Day One. Analyzing data does not fix a failing machine.
Myth #8: The best approach is to cut costs.

Cutting costs is not the answer—take care of the fundamental issues and the costs will take care of themselves.
Cutting costs in maintenance and reliability will result in decreased reliability for the plant. If we improve
reliability by making the aforementioned changes, the maintenance costs will go down as a result.
Myth #9: Everyone will always understand that you are delivering value.
Unfortunately, many plants let reliability people go when things are going well, and reliability soon starts going
down without them. To make sure this doesn’t happen to you, you must consistently communicate the benefits
of the program. This benefits you and the company.
What happens if you do not sell the program? When reliability in a plant improves, the program itself can be seen
as a source of savings. Often, a new manager will come in and see cutting reliability as an opportunity to reduce
costs. They may even assume they will be promoted for saving the company money (before reliability plummets).
Because of the resulting failures in the plant, someone will restart the reliability program, only to cut it again
when things are going well, because no one is selling the program and the culture has not been changed. I call
this the reliability roller coaster. Keep selling the program to avoid it!
How does “reliability improvement” compare to other business improvement initiatives?

There are many strategies out there that aim to help businesses achieve better results. However, people can
become confused and tired of flavor-of-the-month programs, which may include Lean, reliability centered
maintenance (RCM), defect elimination (DE), Kaizen, Toyota Production System (TPS), Six Sigma, total quality
management (TQM), and so on. These programs may have different focuses, but there are a lot
of common elements in all of them. There is also a lot of overlap between these programs and what we do. The
profitability of the organization is the ultimate goal for all, but the key is a holistic approach.
I prefer the term “reliability and performance improvement” over “reliability improvement.” Reliability
improvement always leads to performance improvement.
All of these initiatives start from the same base: increasing the uptime of the equipment. This involves stopping
breakdowns and reducing downtime (including planned downtime), minor stoppages, slowdowns, and
changeover times. We need to reduce waste of time, energy, money, and all types of resources. We must also
improve quality. All of this leads to a happy customer and ultimately to business success, which may mean
improved profitability for some organizations. Safety and environmental impact are also important aspects to
consider.
KNOW YOUR GOALS

All businesses exist for a reason. A mining executive once said, “Turn as much rock into as much money as
possible.” An airline’s goal is “as many safe landings as take-offs.” A goal of the waste water industry is that “no
flush comes back.”
How do we achieve these goals? We can improve quality through statistical analysis and a customer focus.
We can reduce waste of time, energy, costs, resources, etc. We can improve production “flow” by managing
bottlenecks, increasing throughput, ending stoppages, and making changeovers more efficient. We must reduce
or manage breakdowns. These goals are the reason we are employed. No company is paid specifically for
meeting a maintenance goal.
We must continuously look for ways to improve our approach. As part of this, we
need to do root cause failure analysis. We can utilize a data-driven approach with reliability analysis, KPIs, or
condition monitoring. Defect elimination is a crucial process as well: eliminating the root causes to proactively
reduce failure.
To achieve the company’s goals, teamwork is essential. Everyone needs to buy
into reliability so that the culture of the company changes. Everyone needs
to know how they can contribute and how they can benefit. Root cause analysis is also a people-centered
process—everyone’s opinion is valuable. We need standardization in written procedures and standard operating
procedures. Finally, we need a customer focus through delivery time, price, quality, and flexibility.
R-02: What Are the Benefits?
WHY RELIABILITY?
Let’s talk about some of the benefits of reliability. It is important for everyone involved in the organization to
understand why we are doing this.
A reliable plant is a safer, more productive, more secure, more environmentally friendly, and more competitive
plant, and its employees are less stressed and more fulfilled. If you are a maintenance person, for example,
reliability allows you to do your work the way you were trained: precision alignment, precision lubrication,
precision bearing installation, etc.
A reliable plant may also use less energy, may have higher quality products, is more insurable (and may have
lower insurance premiums), and can be ISO 55000 certified.
Ultimately, a reliable plant maximizes the value from its assets. A plant that does this is one that remains in
business, giving everyone who works there great security.
WHAT DO WE GAIN WITH RELIABILITY?

Availability: There are many ways we benefit from the reliability program, and the first is improved availability,
or downtime reduction. Through condition monitoring, we minimize the impact to the business by managing
premature failure. We can also reduce premature failure (defect elimination). We improve work management and
spares management. All of this ensures the availability of the equipment so it can do the job it was purchased to
do. This also saves money.
Capacity: From a capacity point of view, what is the plant capable of achieving? In addition to reducing unplanned
downtime, we can reduce planned downtime through better planning. This also eliminates unnecessary work and achieves
higher reliability. Through this we improve asset efficiency since we are reducing idling and minor stoppages. It is
not enough for the equipment to be available; it should be producing at its highest capacity.
Competitiveness: Improving competitiveness is very important for businesses. By increasing availability, capacity,
throughput, and quality, we increase revenue. The organization then has the option of reducing prices or
investing in the business, both of which can result in competitive advantage and satisfied customers.
Asset life: Through our reliability improvement initiative, we are able to extend the asset life. This is especially
critical in an aging plant. Prescribing the correct operation is part of this, so that the machine runs with less
stress and strain.
Employee satisfaction: Employees should have a sense of purpose and job security. Targets should be achieved.
Employees should experience less frustration and greater safety at work.
Safety: A reliable plant is a safe plant. A reliable plant will have fewer serious failures. There is also less
maintenance work, which matters because repairs and installations carry the risk of injury, especially if the work is
rushed and the workers are poorly trained, following incorrect (or no) procedures, and using the wrong tools.
Compliance: Every organization will have regulations they have to comply with to improve safety and reduce
environmental incidents. Reliability can help a plant achieve these benchmarks.
WHAT CAN RELIABILITY HELP US REDUCE?

Costs: We can reduce expenses associated with replacement parts, spares inventory, overtime labor, general
labor (depending upon your circumstances), contractors, energy, fines, and insurance premiums. The company
can also avoid unnecessary capital spending: it may have thought it needed to invest in new equipment (or even a
new plant) to achieve higher capacity, when it really just needed a reliability program.
Waste: By improving product quality, we have less waste. We also produce less waste when we improve
changeover practices and operate machines longer at optimum conditions. We also avoid wasting energy and
have fewer environmental incidents.
Each plant should have a certain capacity when everything is operating at its best. If the plant is achieving that
capacity, it should mean they are making the money they need to, jobs are secure, and everyone is happy.
When the maintenance department schedules downtime, that capacity is temporarily reduced and the company
loses money. Sometimes this is necessary but, as mentioned before, this planned downtime can be reduced.
Unscheduled downtime due to breakdowns further reduces the plant’s capacity. Process rate losses (minor
stoppages, slowdowns, changeovers, etc.) also take a toll on capacity. On top of this, there are quality losses—
some products do not meet quality requirements and cannot be sold. If a new version of a product is coming
out, there are changeover or transition losses. In some cases, companies may experience market losses because
the marketing department underestimates the capacity of the plant in order to avoid disappointing potential
customers. Our goal is to minimize all of these losses.
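The chain of losses described above can be laid out as a simple running subtraction from the plant's best-case capacity. The sketch below is only an illustration: the function name, the units, and every number are made up for the example, and a real plant would take these figures from its own production records.

```python
from typing import List, Tuple


def remaining_capacity(best_case_capacity: float,
                       losses: List[Tuple[str, float]]) -> Tuple[float, List[Tuple[str, float]]]:
    """Subtract each loss in turn and record the capacity left after each one."""
    remaining = best_case_capacity
    steps = []
    for name, amount in losses:
        remaining -= amount
        steps.append((name, remaining))
    return remaining, steps


# Hypothetical figures, in units of product per month.
example_losses = [
    ("planned downtime", 50.0),
    ("unscheduled downtime", 80.0),
    ("process rate losses", 40.0),
    ("quality losses", 20.0),
    ("changeover or transition losses", 10.0),
    ("market losses", 30.0),
]

remaining, steps = remaining_capacity(1000.0, example_losses)
for name, left in steps:
    print(f"after {name}: {left:.0f} units")
print(f"output actually sold: {remaining:.0f} of 1000 units")
```

Listing the losses this way makes it easier to see which category is costing the most capacity, and therefore where a reliability effort is likely to pay off first.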
CASE STUDIES IN RELIABILITY

When Dow Chemical went through all their plants, they found that they were wasting 10% of their revenue and
15% of their resources. They were also disappointing their customers and constantly having to deal with new and
repeated unplanned events.

Aberdeen Group surveyed many companies to assess their achievements in terms of overall equipment effectiveness
(OEE). OEE is measured by looking at the combination of the availability of the equipment, the throughput of the
equipment, and the quality of the product. If a piece of equipment is always available, running at capacity, and
always delivering products of the expected quality, the OEE is 100%. Because of issues with downtime, availability,
throughput, and quality, most plants achieve less than that.
Aberdeen Group looked at the data and compared the highest performers to the worst performers (laggards). They found
that the best performers achieved 26% higher OEE and reduced their maintenance costs by 30% through reliability
practices.
In conclusion, improving reliability makes more sense than just cutting costs, will reduce maintenance and operating
costs, and will increase availability and capacity. It will reduce energy consumption as well as the number of
safety and environmental incidents. It will improve plant morale.
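The OEE measure described above is commonly calculated by multiplying the three factors together, each expressed as a fraction between 0 and 1. The text here only says OEE is a "combination" of the three, so treat the multiplication as the usual convention rather than something stated in this manual. The function name and the example figures are assumptions for illustration.

```python
def oee(availability: float, throughput: float, quality: float) -> float:
    """Overall equipment effectiveness as the product of three fractions (0.0 to 1.0)."""
    for name, value in (("availability", availability),
                        ("throughput", throughput),
                        ("quality", quality)):
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be between 0 and 1, got {value}")
    return availability * throughput * quality


# Hypothetical machine: available 90% of the time, running at 95% of rated
# throughput, with 98% of product meeting the expected quality.
example_oee = oee(0.90, 0.95, 0.98)
print(f"OEE: {example_oee:.1%}")
```

Because the factors multiply, three individually respectable figures can still produce an OEE well below any one of them, which is why most plants fall short of 100%.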
R-03: Assessing the Benefits
We need to step back when looking at a new reliability initiative or evaluating an existing program. We must
assess where we are now and determine where we want to go before we try to go there. We need a way to
quantify the opportunity before we try to sell the opportunity—make the numbers specific to your organization’s
capabilities and needs. We also need to measure our progress.
HOW TO BEGIN
Here is a key way to start the initiative:

• Determine why you need to improve reliability
• Assess your current state
• Set your goals
• Evaluate the value of the gap (between where you are and your goal)
If our goal is industry best practice, we have to take into account the design of our plant and the standards set
by similar plants. This is our target. Then we mark where we currently are. That gap in between is the money
the plant could be making, and this is how we justify the program (in addition to compliance with regulations).
Over time, we will close that gap. Of course, this may lead to management thinking it’s safe to cut the reliability
program, but that gap will widen again if they do. As we get closer to our goal, we can change the goal. If our
original aim was the standard set by a similar plant, we may find we can go further and do better than that plant.
To reach the goal, some up-front investment may be needed for training and equipment.
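One simple way to put a number on the gap is to multiply the output the plant is not yet producing by what each unit is worth. The sketch below assumes a single product and a fixed margin per unit; the function name and all figures are hypothetical, and your own valuation may need to include costs, compliance, and other factors specific to your organization.

```python
def gap_value(current_output: float, target_output: float, margin_per_unit: float) -> float:
    """Value of the gap between current and target output; zero if the target is already met."""
    gap_units = max(target_output - current_output, 0.0)
    return gap_units * margin_per_unit


# Hypothetical plant: 80,000 units per year today, 95,000 at the chosen target,
# earning a margin of 12.00 per unit.
annual_gap = gap_value(current_output=80_000, target_output=95_000, margin_per_unit=12.0)
print(f"Value of the gap: {annual_gap:,.0f} per year")
```

If the target is later raised, for example once the plant matches a similar plant's standard, rerunning the same calculation with the new target shows the value still on the table.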
We can use this information to convince senior leadership of the value of the initiative, set goals and targets,
establish our KPIs, set priorities, and assign criticality. When we know what we want to achieve, we will know
which pieces of equipment are most likely to influence our ability to reach our goals.
WHY RELIABILITY?
Why do we endeavor to improve reliability? How will your plant in particular be affected? Reliability seems to be
common sense, but that might not help you sell the program to senior management.
Ultimately, we improve reliability to add value to the business. To do this, you have to determine how reliability
impacts your business and set priorities.
What is most important to your business? Is your goal to increase capacity, increase uptime, improve quality,
increase throughput, reduce costs, improve dependability, improve safety, reduce environmental incidents, extend the
life of the plant, or something else? Most plants would want all of these things, but you need to prioritize.
What is the value to the shareholders? There are ways to reduce cost, to
improve output, and to ensure compliance.
SET PRIORITIES
Every organization has limited resources and limited time. This is why prioritization is essential. Once you have
set the priorities, you need to decide which machines to focus on when setting up a reliability program. Some
machines impact capacity, uptime, or quality much more than others. If your priority is throughput and you have
a bottleneck, it is necessary to focus on the machine that is causing the bottleneck.
Then how do you prioritize, how do you justify your expenditure, and how do you get support for your program?
When we do a criticality analysis, we identify areas in which the consequences of failure are especially destructive.
This is something to keep in mind when setting priorities, and it is something ignored by many companies.
THE BUSINESS PROCESS REVIEW

It is essential that you know how your business/organization will benefit from improved reliability. How will
improving reliability add value? Make sure you can answer that question yourself before trying to convince senior
management.
Briefly, the business process review involves deciding how to improve the performance of the business. This is
what everyone is employed for, and it could mean improving shareholder value, market confidence, employee
satisfaction, or market share.
The second part of the business process review is to pinpoint any constraints that restrict the plant’s ability to
achieve those goals. Examples of constraints could be skills, culture, technology, capital, cash flow, statutory
issues, assets, raw materials, competition, and utilities. Consider how your plant may be contributing to those
constraints, for example by wasting water. Knowledge of constraints will also help us when looking at the
criticality of equipment and prioritizing.
Third, look at the risks. What are the risks to safety, the environment, customer impact, and your brand and
reputation? Decide how to manage those risks.
As reliability people, we tend to focus a lot on risks and forget to look for opportunities that are available in the
business. Fourth in the business process review, we should look for opportunities to increase output, improve
quality, and reduce cost and waste. Maintenance and production will be concerned about threats to output and
quality. After doing that, be sure to take off the “risk hat” and put on the “opportunity hat”: What is it that we
can do to increase the output, increase the throughput, reduce waste, and so on? If this were your business, you
wouldn’t only try to avoid the bad things; you would also try to make the good things happen. A machine failure
will reduce the output of a plant, but there may be something you can do, especially if you talk to maintenance
and operations, to improve product quality and output. Thinking of opportunities will allow you to look at the
equipment a little differently.
Do you know why you are trying to improve reliability? Can you say it in a clear statement and would senior
management agree with it? Would the “front line” agree?
HOW ARE YOU PERFORMING RIGHT NOW?

Set a measurement standard so that you can measure where you are and your progress. The changes will
encourage you and senior management and show that the effort is worth it. An assessment process would be
helpful at this time to decide what we are good at and where we need to make improvements.
You can come up with some categories for your review and audit, or you can use charts such as radar charts.
Look at each area and rate your plant. Possible categories include leadership and management support, people
and organizational structure, planning and scheduling, preventive and predictive maintenance, production and
maintenance, resources and inventory management, and reliability improvement. Each category can then be
broken down into more detail. You need a way to measure where you are because from now on, ideally once a
year, you need to review these categories and see if you’ve made improvements.
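As a rough illustration of the yearly review described above, the ratings could be kept in a simple structure and compared from one review to the next. The sketch below uses Python. The category names come from the list above, but the 1-to-5 scale, the scores, and the function names are assumptions made purely for the example.

```python
# Hypothetical self-assessment scores on an assumed 1 (weak) to 5 (strong) scale.
baseline = {
    "leadership and management support": 2,
    "people and organizational structure": 3,
    "planning and scheduling": 1,
    "preventive and predictive maintenance": 2,
    "production and maintenance": 3,
    "resources and inventory management": 4,
    "reliability improvement": 1,
}

# Next year's review: two categories improved (again, invented numbers).
latest = {**baseline, "planning and scheduling": 3, "reliability improvement": 2}


def score_changes(before, after):
    """Return the change in score for every category in the earlier review."""
    return {category: after[category] - before[category] for category in before}


def weakest_areas(scores, count=2):
    """Return the lowest-rated categories, i.e. candidates for attention."""
    return sorted(scores, key=scores.get)[:count]


if __name__ == "__main__":
    for category, change in score_changes(baseline, latest).items():
        print(f"{category}: {change:+d}")
    print("Weakest areas at baseline:", weakest_areas(baseline))
```

The same scores could feed a radar chart; the point is only that the review needs a fixed set of categories and a consistent scale so that one year can be compared with the next.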
MEASURING RELIABILITY
What is poor reliability costing you now? Assess how the plant’s areas of weakness are affecting performance. This will
show the value of making improvements.
When you review performance, you can benchmark against the plant’s best performance, against its design
capacity, or against industry best practice.
It is important to know where we are today so we have a reference for comparison. This is especially important if
a discussion arises of removing the reliability program. You can then remind senior management of where they
started from.
If you are not in a position to perform a detailed benchmark, seek outside support.
MEASURING PROGRESS
You need a way to measure your progress. Establish basic KPIs to keep track of how you are performing in each
area over time. Are we improving, and what are the opportunities for improvement? Don’t use too many KPIs
because that will get confusing, but make sure you measure what you want to improve.
Do not use the KPIs as a carrot or a stick. Doing so can result in people adjusting results and priorities to achieve KPIs.
Use them to identify opportunities for improvement and not as opportunities to punish people.
In conclusion, we need to know where we are now and where we want to go before we try to go there. This
process lays the groundwork for everything that follows.
R-04: Culture Change
Culture change is an area that a lot of people overlook.

What is the most challenging aspect of improving reliability? Many would guess that it’s all the technical issues
with condition monitoring, precision maintenance, reliability analysis, and so on. Or would it be the analysis
process through RCM, FMEA, RCFA, and so on? Would it be the financial issues that come from investing in
reliability?
Or do you recognize that the real challenge is dealing with the people in the program? We need everyone on
the same page and speaking the same language about reliability. We need everyone on board, from senior
management to the machine operators.
WHY DO WE NEED THEIR SUPPORT?

People can be a roadblock to the process. We can try to initiate changes but then face skepticism and reluctance
to change. We can buy all the new tools, such as a laser alignment system, but if people do not believe in the
program and are not trained in it, they will not use the tools properly. That’s the negative part.
On the positive side, there are other important benefits to working well with the “plant floor.”
The mechanics, electricians, and operators who spend so much time with the machines know a lot about those
machines. They see what causes machine failure and slowdowns. Some of them may have made suggestions for
improvement in the past but saw nothing happen. We need to engage with all departments and listen to their
suggestions.
Remember to also get the support of operations, maintenance, engineering, purchasing, and those dealing with
safety, product quality, and environmental effects.
WHY CAN’T YOU JUST FORCE PEOPLE TO CHANGE?

You could try it, and you may get some short-term results. But you will create resentment and lose people’s
desire to make improvements. This approach is forcing the horse to water, but the best approach is to make
everyone thirsty so that they want to drink the water—that is, reliability improvement. Get people engaged and
involved in the process. People are not robots. We cannot just reprogram them.
We cannot engineer our way to success. Data analysis does not change behavior. People will change if they want
to change. Most people do not mind change, but they don’t like to be changed. If people feel it is their idea, and it
is in their best interest, and if they participate in the process, they will want to get involved.
How do we create a culture of reliability? First, we need to understand people. Second, we need to understand the
culture-change process. We need to understand the human psyche and personalities. All people are different, but
they fall into certain categories. Some will support us, some will defy us, and some will sit back and wait. We need
to know this so we can plan how to manage each group.
Another way to categorize people is that they are positivers, fence-sitters, and doubters.
Positivers are open to new ideas. They will be enthusiastic about the program and make suggestions. They are
enthusiastic for change to take place, and they will act rather than merely talk. Your success depends on
identifying and engaging with positivers early on.
The fence-sitters are a large group. They are neither very positive nor very negative, and they will go with the
flow. They are easily influenced by other people in the organization.
The doubters actively defy change. They may say things like “We tried that and it didn’t work,” or “It won’t work
here.” They will try to influence the fence-sitters.
Within the doubters, the dragons are the ones who talk the talk but cannot walk the walk. They may appear to
support the program but will allow their doubts to surface soon afterward. Not only will they not do what they
agreed to do, they may actively defy the program. They are dangerous, and you need to figure out who they are
and watch them. Dragons in management roles are particularly dangerous.
The handbrakes, in contrast, are those who actively seek to protect their turf. They are the ones who make
excuses when presented with the program. In some ways, they are easier to deal with than the dragons because
they are easy to spot. Focus your attention on the positivers, try to get the fence-sitters on your side, and watch
out for the dragons. The handbrakes just need to be managed somehow.
The culture-change personalities can usually be broken down this way: The champions will make up about 5%, and
they will lead the change. The rest of the positivers, around 20%, will get involved early on. The fence-sitters make
up around 50%. The handbrakes make up about 20%, and they will only change when there is no other option. The
dragons, at around 5%, will almost never change. Focus on the positivers and the fence-sitters. The doubters may
be the ones who retire early, move to another plant, or face being let go if layoffs become necessary for the plant.
THE ROLE OF UNIONS

When talking about people and personalities, it is a good idea to note the role that unions play in this process.
Unions have a long history, but basically they want people to have security in their jobs. Unions want people
to be safe at work and have opportunities for improvement, skills development, and so on. What we are doing
is perfectly in line with the goals of the true union movement. A profitable plant has the ability to pay people a
higher salary, offer job and pension security, and provide opportunities for improvement. It could be useful to
meet with union leaders and discuss your goals, especially if you can identify a positiver among them.
Unions need to know how their members will benefit from reliability: job security, safer work environment, up-
skilling, reduced frustration, and reduced 3 a.m. and Sunday call-outs. Get union leadership involved and make
them feel included, and they will be willing to work with you.
What is it that everyone, including union leadership, is worried about? They may be worried about job losses—
that if machines do not break down, people are not needed to fix them—so you need to reassure them and
explain the true case. This is something you need to address yourself. On the face of it, the aim of the reliability
program is to reduce costs, especially those associated with unnecessary reactive maintenance, rework, and so
on. What we want is to move people from reactive work to proactive work that will reduce future failures. Some
organizations may actually need to reduce head count due to competition. If this is the case, be honest and give
the reasons for the redundancies. Overtime can be an issue as well. Some people love it for the extra money, but
no one really has a right to overtime, and overtime is an area where the plant may need to reduce costs.
CHANGE MANAGEMENT
To facilitate the culture-change process, we need to identify the positivers and other personalities. Recognize how
individuals will benefit by pinpointing what is wrong now. Then devise training and communication plans to keep
them informed on the progress. Make sure you do not have handbrake personalities in key roles. Try to convince
the fence-sitters to work with you, and don’t waste too much time on the dragons.
What are the key components of change management?

The first is that people must aspire to the vision presented to them. Second, people must be dissatisfied with
the current state. Point out the causes of their dissatisfaction and the ways they can fix those problems. Third,
people must be rewarded for change. An important way of rewarding people is to listen to their suggestions and
recognize them for their ideas. Fourth, people must be appreciated for the changes made. Fifth, people must be
actively involved in the process.
You also have to allay people’s fears. They must understand that they will not lose their jobs—or if they may, it
has to be very well managed. People must understand that cost cutting and improved efficiency allow proactive
tasks to be performed. We must explain that overtime will probably be reduced (some form of compensation
will help, such as a potential increase in base salary if reliability is improved). Do not give false hope. Don’t get
people enthusiastic about reliability if you are not going to listen to their suggestions, give feedback and training,
or perform the proactive work they know the machines need. They will dismiss your program as a flavor-of-the-
month deal.
WHAT MESSAGES ARE YOU SENDING?

In many organizations, the only workers to receive a reward are the “firefighters”: the ones who come in and save
the machine at all hours. If you only praise these people, you are reinforcing the behavior you want to get rid of.
For one thing, they know they will receive less praise if you achieve your goal of fewer breakdowns. Also,
those who do the regular work to improve the machines feel ignored. Instead, when speaking of the person who
came in the middle of the night to get the machine running, acknowledge that your team failed—that person
did a great thing, but it should not have been necessary. Discuss what should be done to avoid that problem in
the future. Make a big deal out of people’s ideas for improvement and get them involved in the change process.
Then, reward those improving the machine.
Reward the positivers. Recognize them in meetings, the blog, or the newsletter. Send them to conferences and
for training and certification. Use the right language when mentioning failures.
Watch the subtle messages you may be sending. If a machine breaks down and you say to the workers, “When
will it be fixed? What can we do to expedite the repairs? What will all this cost?” you send the message that they
need to hurry up and make the repair as cheap as possible. “When will you have it installed? Do you really need
to do laser alignment? Could we get it running and align it later?” You are not asking these people to do a better
job if you say these things.
Instead, try saying, “Is this likely to fail again? What caused it to fail? What can we do next time to get a warning of
failures?” Also say, “Good job following the installation guidelines. Joey will be along when you have it precision
aligned to go through his commissioning checks.”
Explain the benefits. Answer the question “What’s in it for me?” They know the shareholders will benefit, but
explain how they will benefit. For example, a maintenance person will benefit from fewer call-outs that take them
away from family events. They can do most of their work within regular hours. They may experience greater job
satisfaction and less frustration. Increased profitability for the plant means job security for workers. Increased
reliability means the equipment is able to perform its function. No one wants to deal with an irritating machine
that doesn’t work properly. They can also take pride in keeping the machines running. Fewer machine failures
mean improved safety at work.
LEARN FROM HISTORY

You need a plan to change the culture. It must take technology and people into account, have measurable small
steps, and have realistic goals and milestones. It must be easily articulated so that people believe in it.
Learn from the safety initiative. Everyone now understands the benefits of safety, there is strong support from
leadership, and it is everyone’s responsibility. There are signs everywhere to remind people of safety issues.
Further, there is an intolerance for breaches of the safety procedures. In most organizations, a worker will not
be asked to do something to fix a machine if it goes against safety procedures. The same intolerance should
be present for breaches of reliability. Taking the time to fix a problem when it becomes apparent will prevent a
future breakdown which will require more work and more time. Just as safety matters, reliability matters.
In conclusion, we need people to believe in reliability. They must understand what’s in it for them and how they
will benefit. And people need to understand how they can contribute.
R-05: Selling to Senior Management
HOW DO WE SELL THE BENEFITS?

If you do not have the support of senior management, you will struggle to experience true success with your
reliability improvement implementation. There are three ways we can approach this: technical, financial, and personal.
The Technical Approach

We all (hopefully) understand the technical strategies, such as RCM/FMEA analysis,
early warnings via vibration or infrared, precision maintenance, etc. It all makes sense. However, a manager will
not have had your training and experience, and those terms will not mean much to them.
At the end of the day, they just want to know how all of this adds value. You are asking them to support your
program and spend money, and they need to know what’s in it for them. Why should they back this project?
The Financial Approach

From a financial perspective, you need to understand the benefits related to the organization. How will reliability
help it achieve its goals? Call to mind the performance, constraints, risks, and opportunities, which were
discussed in a previous lesson. If you understand these categories, you should be able to make a good case for
the reliability improvement program.
Can you put a value on what it means to improve production output by 1%? Reduce maintenance costs by 1%?
Reduce scrap and spoilage by 1%? And so on. It may take some research to find these answers, but it is important
for you to know them so that you can speak the same language as the manager.
Be familiar with the financial language the manager would know: terms like return on investment, net present
value, hurdle rate, discounted cash flow, payback period, etc. Understand what the company’s expectations
are regarding these concepts. There is a good financial business case to be made for improving reliability and
performance. Be careful not to overreach and promise too much.
Familiarize yourself with financial terms, the concepts behind them, and the time value of money. Decide when
the money should be spent and when the returns will appear. Consider the availability of capital and cash flow.
Management may be considering other opportunities to expand the plant, or another project.
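To make the financial terms above concrete, here is a minimal Python sketch of net present value, payback period, and a simple return on investment. The cash flows, the 10% hurdle rate, and the function names are hypothetical values chosen for illustration. Your finance department will have its own conventions for each calculation.

```python
def npv(rate, cash_flows):
    """Net present value; cash_flows[0] occurs today (year 0)."""
    return sum(cf / (1 + rate) ** year for year, cf in enumerate(cash_flows))


def payback_period(cash_flows):
    """First year in which the cumulative (undiscounted) cash flow is no longer negative."""
    cumulative = 0.0
    for year, cf in enumerate(cash_flows):
        cumulative += cf
        if cumulative >= 0:
            return year
    return None  # the outlay is never recovered within the horizon


def simple_roi(cash_flows):
    """(Total returns - initial outlay) / initial outlay, ignoring the time value of money."""
    outlay = -cash_flows[0]
    return (sum(cash_flows[1:]) - outlay) / outlay


if __name__ == "__main__":
    # Hypothetical program: spend 100,000 now, save 40,000 a year for four years.
    flows = [-100_000, 40_000, 40_000, 40_000, 40_000]
    hurdle_rate = 0.10  # assumed company hurdle rate
    print(f"NPV at hurdle rate: {npv(hurdle_rate, flows):,.0f}")
    print(f"Payback period: {payback_period(flows)} years")
    print(f"Simple ROI: {simple_roi(flows):.0%}")
```

A positive NPV at the company's hurdle rate is the usual signal that a proposal clears the bar. The payback period answers the separate question of how long the cash is tied up, which matters when capital and cash flow are constrained.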
The Personal Approach

Finally, the personal approach: Senior management are human beings, and they have their own goals.
Understand their issues and what they are trying to achieve, what their responsibilities are, how their success
is measured, their priorities, and how they want to be remembered. They have responsibilities to their
shareholders. They do not want to see their company in the newspapers for all the wrong reasons.
When you make someone aware of a problem or certain risks, they need to do something about them. They
do not want to put their name behind a project that won’t succeed, but they have a responsibility to manage
their risks. Remove their fear of failure by giving them the confidence that you can achieve what you say you will
achieve.
Reliability ticks all the boxes: it is an excellent investment, improves safety, reduces risk, keeps the regulators happy,
gives shareholders a better return, and makes the company brand more secure. They will not regret their
decision as long as you implement the program correctly. This is a great opportunity, but you have to help them
see the light.
In conclusion, you must have senior management support. You may want to fly under the radar and make
changes quietly, but this approach has a limited life. Aside from the need for resources, your confidence in the
program when interacting with other people will go a long way toward getting them to work with you to achieve
the company’s goals. The focus should be on the business case you prepare: this is where we are, this is where we
want to go, and this is how I will help you get there. You must maintain that support by constantly communicating
your progress. If you lose their support, they may cut the program.
R-06: Establishing the Strategy
YOU NEED A PLAN

If you want to improve reliability, you need a plan. You need to plan each step and decide how to keep senior
management on board, change the culture, schedule the maintenance work, etc. We can talk about all the great
things you can do, but you have to actually do it. It is not as easy as it sounds, but you can break it down into
small steps that can be measured. Each step forward is a good one. Look back at how far you have come.
You must have a plan that is prioritized according to the greatest value that can be generated. Based on what we
have talked about in previous lessons, you should know how to add value and focus your attention in those areas.
It must be based on the plant’s identified weaknesses, as they will give the biggest bang for the buck. Keep in mind
what’s best for the organization and what people can buy into. You need milestones and a way to measure
progress. You must communicate your progress, even when things don’t quite work out. Keep the
program alive in their minds. Continuously review and improve your plan based on the plant’s priorities and
circumstances.
The plan must be realistic. This is a long-term process. It is easy to lose management support, and it is easy to
lose focus and become distracted. Make sure people do not slip back into their old habits. It is also easy to give in
to the naysayers. You may need help to do this—possibly from a mentor outside the organization.
Avoid these traps: Don’t try it without management support and plant-wide buy-in. Don’t try to be “world class”
one area at a time, and don’t try to engineer your way to success. Don’t try it without a plan. Break out of the
reactive maintenance cycle of doom before progressing further.
The Roadmap to Reliability

In the roadmap to reliability, there are some elements we need. One component is condition-based maintenance,
but there are also run-to-failure and interval-based maintenance. These strategies serve different purposes.
How do you make the decisions necessary to have a functioning program? Some do a detailed RCM analysis on
every asset. It’s much better to understand asset criticality and prioritize the work accordingly. It’s great if you
have good failure data—you can do an analysis to identify where your opportunities are for improvement. Stage
four includes defect management, spares management, planning and scheduling, precision skills, operation, and
ongoing maintenance. This roadmap includes all the things you need to take care of in order to have a successful
program. Stage one includes leadership, culture change, vision and strategy, understanding human nature,
training, and buy-in. Stage five deals with ongoing monitoring and improvement. If you work in a reactive plant, a
problem is knowing where to start. This roadmap gives a rough order that you can use.
We call our implementation roadmap the “reliability success master plan.” It is a transition guide that helps you
get from where you are to where you want to be as smoothly, painlessly, and efficiently as possible. We need to
make the assessment, justify what we are doing, prove the concept, get approval, make a plan, and roll it out.
We need to engage, stabilize our maintenance practices, break out of the reactive maintenance cycle, analyze
the criticality of the equipment, and come up with a detailed asset strategy. Then we can do all the condition
monitoring in parallel with the ongoing measurement of reliability and KPIs, communicating the benefits, and
educating and motivating people.
There is a more complicated version of the implementation roadmap which includes the same steps in more
detail. This will be discussed later in the course.
ESTABLISH A TEAM
Another step is to establish a team. You cannot do this program by yourself. One way to do this is with a steering
committee. Make sure you have a group of people with the right attitude (positivers) that represent the key
departments: operations, production, maintenance, engineering, health and safety, quality, environment, and
reliability (if you have a reliability group). Include unions, if they are present at your plant. This is a way to make sure you have
everyone’s support, rather than this just being a reliability group’s project.
The role of the steering committee is to

• Oversee the process (not do it all themselves)
• Set the goals, mission, and vision
• Be the conduit between management and the program
• Lead the culture-change process
One person needs to be the champion. Normally, there is one person who is the head, or chair, of the
steering committee. There are certain qualities that person should have, and they are leadership qualities, not
management qualities: the ability to motivate and inspire people, and to get people to contribute. They don’t take
credit for all the good things that happen. They should be invisible in some ways and visible in others.
SUPPORT FROM HR AND OTHERS

As part of this, we need support from human resources. The steering committee members, and particularly the champion, are taking
on a big challenge. You will be dealing with people who have been doing things the same way for a long time,
and it is easier for them to continue in that way. HR will monitor you and your team and make sure you are all
right. They will look at things like job descriptions, performance goals, and even titles. If you want someone to do
something that is not in their job description, they may balk. HR can help here.
One important thing is adding “reliability” to more people’s job titles—for example, “reliability technician” instead
of “maintenance technician.” HR can help you create job descriptions related to reliability improvement so that
people know what the expectations are.
Union support is critically important. There should be communication so that they appreciate how important
reliability is to the health, safety, and well-being of their union members.
In conclusion, you must have a plan with milestones, targets, and stretch goals. Everyone needs to understand
the plan, mission, and vision for the future. You must review the plan and assess your progress continuously. Feel
free to use the aforementioned roadmaps to make your own strategy.
R-07: Understanding Failure
When we talk about solving problems and maintenance strategy, there are some things we need to understand
about how failure works. In this module we will talk about how equipment fails, as there are assumptions that
hold the reliability strategy back.
COMMON BELIEFS
What do people believe about the time-to-failure of rotating machinery? Does it wear out to a point where it
could fail at any minute? Does the probability of failure simply increase over time? Or is failure independent of
time—failing randomly due to equipment lifespan, operation, and installation?
What would happen if we took 30 machines, each with a bearing, ran them for a period of time, and waited
for them to fail? Many believe they would run for a certain time, and there would be a small window of time in
which they all started to fail. The probability of failure was low for most of their lifetimes (because none of them
failed during that time), and it went up at a certain time. Someone might decide to allow the machine to run for a
certain number of hours and simply schedule a shutdown to replace the bearing when it reached that window of
time when many of them failed. Replacing the bearing before that time would waste a lot of money.
Let’s graph the probability of failure against time of failure, with the y-axis being the risk of failure and the x-axis
being the months of the year. All the machines run happily for a certain amount of time and fail at around the
same time—these are age-related failures. Let’s say they all fail after November.
If we were to graph that statistically, we would have a sharp upward curve between November and December.
It would be logical to schedule maintenance for that time since our equipment would be likely to fail according
to the pattern, and we don’t want any unexpected failures. This practice of replacing the bearing just before it is
expected to fail is the common idea behind preventive maintenance, or time-based maintenance.
COMMON FAILURE PATTERNS

As you may have guessed, this pattern may not be the reality for much of the equipment in our plants.
Often, bearings or other equipment fail soon after installation or replacement. It can happen soon after intrusive
inspections as well. This type of failure is called infant mortality. Statistically, there is a high probability of failure
after installation, overhaul, or repair of a machine, as well as after an intrusive inspection. After a certain period
of time—weeks or months—that probability dramatically drops. Sometimes, problems suddenly surface after a
planned shutdown. This is because a damaged bearing can be accidentally installed, bearings may be damaged
during maintenance, bearings may not be lubricated properly, or the part may have been installed improperly,
among other reasons.
If we combine the two aforementioned failure patterns, we get dramatic curves at the beginning and end of the
graph, with a long, smooth line in between. This can be called the bathtub curve. We would think, logically, that
if we did a better job with installation and commissioning, we could minimize the number of infant mortality
failures. There would still be a small curve because workers are human. A lot of the maintenance practices
that we talk about in this course deal with reducing infant mortality. Regarding the other curve at the end,
unfortunately, this is not the curve that most of our equipment follows—hence, time-based maintenance may not
always be the best solution.
We also have random failures scattered along the graph. The curve may be straight across but a bit higher—an
equal probability of failure across time. We do not really know how long a machine will run. We are concerned
about what will happen when we start the machine because we have experienced infant mortality failures, but
after that we are concerned about random failures. Why might the equipment fail? Because of the way it was
aligned, installed, operated, lubricated, etc. It may be hot or dirty, or the bolts may not have been tightened down
properly. There are many more reasons. The degree of each of the aforementioned factors leads to a pattern.
That’s where condition-based maintenance comes in. Otherwise, with a random failure pattern like this, when do
we do maintenance work?
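The three patterns described above (infant mortality, random failure, and age-related wear-out) can be sketched numerically. The manual does not name a statistical model. The Python example below swaps in the Weibull hazard function, a common choice in reliability work, whose shape parameter produces a falling, flat, or rising failure rate. The shape and scale values are assumptions picked only to show the three behaviors.

```python
def weibull_hazard(t, shape, scale):
    """Instantaneous failure rate h(t) = (shape/scale) * (t/scale)**(shape - 1), for t > 0."""
    return (shape / scale) * (t / scale) ** (shape - 1)


# Assumed shape values: below 1 the rate falls over time (infant mortality),
# exactly 1 gives a constant rate (random failure), above 1 the rate rises (wear-out).
PATTERNS = {
    "infant mortality": 0.5,
    "random": 1.0,
    "age related": 3.0,
}

if __name__ == "__main__":
    scale_months = 12.0  # hypothetical characteristic life
    for name, shape in PATTERNS.items():
        rates = [weibull_hazard(month, shape, scale_months) for month in (1, 6, 12)]
        print(name, [round(rate, 3) for rate in rates])
```

With a constant rate, a bearing that has already run for eleven months is no more likely to fail next month than a new one, so replacing it on a schedule buys nothing. That is why the random pattern points toward condition-based rather than time-based maintenance.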
THE REALITY
The graph showing the bars that came up to similar levels was not real. This graph is from a real study—all 30
bearings failed at different times. It is a random failure pattern. Therefore, time-based maintenance is not the
best strategy, as you will still experience unplanned downtime. On the other hand, some bearings will keep
running for a long time. With time-based maintenance, you may end up taking out a perfectly good bearing and
putting yourself at risk for infant mortality with its replacement. It is quite likely that you will actually introduce
a fault with this strategy. If a bearing is working well, there is no need to change it. Use condition monitoring to
find out when the root causes of failure are present or to detect early warnings of failure. Much of the time, there
will be a long warning time, allowing you to schedule the maintenance at a convenient time, possibly the next
planned shutdown.
Many studies, including an extensive study by Nolan and Heap, found that roughly 90% of failures are not age
related. There are a number of infant mortality failures, followed by random failures. So what does our reliability
improvement program aim to do? First, it aims to reduce the probability of failure at the start. If we improve
our maintenance and commissioning practices, and the way we care for the spares, there will be fewer infant
mortality failures. By improving the way we operate and lubricate the equipment, we reduce the probability of
failure during the random period. The more we care for the machines, justified according to the criticality of the
equipment and other factors, the more we reduce the likelihood of random failure. We can have confidence that
the plant can be run at the proper speed, achieving the proper level of uptime and the right quality.
68% of equipment followed the pattern shown in the Nowlan and Heap and other studies, and another 14% followed
a pattern in which failure was completely random and there was no infant mortality. Another 7% followed a pattern
in which the probability of infant mortality was even lower than that of random failure down the line. So roughly
90% of the equipment you have is subject to random failures. Therefore, 90% should have condition monitoring,
and we should employ condition-based maintenance (these terms will be clarified later). Only 10% of equipment
follows an age-related failure pattern. For these machines, time-based maintenance may be best.
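The arithmetic behind the "roughly 90%" figure can be checked with a short sketch. The three percentages are the ones quoted above; the dictionary labels are informal descriptions chosen for this example, not official pattern names.

```python
# Shares of equipment by failure pattern, as quoted in the text above.
# The labels are informal descriptions used only for this illustration.
pattern_share = {
    "infant mortality, then random failure": 68,
    "random failure, no infant mortality": 14,
    "low early failure probability, then random failure": 7,
}

random_share = sum(pattern_share.values())   # 68 + 14 + 7
age_related_share = 100 - random_share       # everything else

print(f"Not age related: {random_share}%")
print(f"Age related:     {age_related_share}%")
```

The sum is 89% with 11% remaining, which the text rounds to 90% and 10%.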
What do we do with this information? We use it to establish our asset strategy, or our maintenance plan: decide
which machines should get condition-based maintenance, time-based maintenance, or run-to-fail.
In conclusion, most failures are not age related. You need a strategy that matches the failure modes and patterns
of the equipment you are dealing with. You will therefore need to understand your equipment’s failure modes
and patterns.
R-08: Defect Elimination
WHAT IS DEFECT ELIMINATION?

Assume we have a machine that is critical enough to warrant some care and attention. What if we could put a
force field around the plant and stop all sources of defects from getting into the plant? A defect is anything that
can cause that machine to fail—any possible root cause of failure. With a defect elimination program, we are, in
essence, trying to build a force field around the plant: We make sure anything we buy for the plant is designed for
reliability, maintainability, operability, and so on. If not, we will reject it. And we do not want cheap spares that will
end up giving us problems. We do not want unreliable transportation that may allow new parts to bounce around
in the truck. We need to make sure all contractors have the right skills, especially during shutdown periods. We
do not want unqualified service providers balancing our machines or overhauling our gearboxes. We need to
scrutinize the maintenance strategies of the OEMs we buy from. We need to set up acceptance testing to make
sure anything coming into the plant is fit for purpose. Then, the assets will be placed on our shelves, ready to
replace a machine or be installed into one.
Now we need to put a force field around the machine to protect it from us: It is easy to put contaminated
lubricant into the machine. We need to make sure parts on the shelf do not become damaged. We need to use
precision installation practices. We need to make sure we operate the machine correctly, as closely as possible
to its best efficiency point, and in a consistent way. We should not perform unnecessary PMs, particularly
intrusive maintenance practices that could harm the machine. Likewise, intrusive inspections should be avoided
whenever possible. And we want to use our internal condition monitoring programs to make sure no damage has
resulted from any of the aforementioned actions. Now this machine has the best chance of delivering the maximum
value. Along the way, we keep the machine clean, make any adjustments, replenish lubricants, and continue to
monitor. This is basic defect elimination.
The P-F Interval was introduced earlier in the course: the idea that we can monitor a machine and see the
symptoms of the beginning of failure. Defect elimination comes before, looking at all the areas that could lead
to failure. Ideally, it would help us avoid the P-F curve. We could go further back and
consider the design practices and other aforementioned issues (obviously
not for the present machine, but for future ones). If we do this, the y-axis, which corresponds to the machine’s
condition, will remain high and steady since all the reasons that the machine could fail have been removed.
Bearings and other components do, of course, have a lifespan and they will eventually need to be replaced.
In this section we will summarize

• Designing for reliability
• Value-driven procurement
• Choosing the right contractors and service providers
• Reliability-focused transportation
• Acceptance testing
• Optimal equipment operation
• Operator-driven reliability
In future modules we will discuss

• Taking care of spares
• Work management and spares management
• Reviewing vendors of equipment and services
• Precision maintenance and commissioning

• Proactive tasks: lubrication, cleaning, etc.
• Using RCFA (root cause failure analysis) to eliminate “unique” defects
DESIGN FOR RELIABILITY

In addition to meeting whatever technical requirements a machine has, we consider the life-cycle costs of the
machine. Therefore, we put a high priority on maintainability, operability, and reliability.
Maintainability is the ability to
• Isolate defects or their causes
• Correct defects or their causes
• Repair or replace faulty or worn-out components without having to replace still-working parts
• Prevent unexpected breakdowns
• Maximize a product’s useful life
• Maximize efficiency, reliability, and safety

• Meet new requirements
• Make future maintenance easier
• Cope with a changed environment
https://en.wikipedia.org/wiki/Maintainability
In the design process, making sure the machine will be maintainable is a great thing that will save the plant a lot
of money.
Operability is the ability to keep equipment, a system, or a whole
industrial installation in a safe and reliable functioning condition
according to pre-defined operational requirements. (Adapted from
Wikipedia definition)
Reliability refers to the ability of the machine to run without
failing. Decisions made during the concept part of the project, the
design, the construction, and installation phases determine 95%
of the total cost of ownership (TCO). A lot of organizations, due to
budget constraints or the desire to make the numbers look good,
will prioritize low up-front costs. They are setting themselves up for
future costs.
Some phases of the design process have more influence over
the life of the product than others.
Once the equipment is installed, we have no ability to influence
the design and, thus, the inherent reliability, that is, the best it can
achieve once it is designed, selected, and installed. Most of what
we have talked about so far has been the problems that we inflict
on the equipment after that point. We need to minimize that harm,
but we need to keep design in mind for when the opportunities
arise for acquiring new or repaired equipment.
We have a certain amount of influence at the beginning, and it diminishes over time. To make the right decisions,
we need to understand the importance of design. This is one reason we need senior management support for
this program, since they have so much control over the purchasing decisions of the organization. Involve the
maintainers and operators of the equipment in the decision. It is a common mistake for an organization to build
a new plant and fill it with poorly designed equipment.
PROCUREMENT
The same is true for the procurement process, but it is ongoing: buying new lubricants, bearings, electrical
connectors, etc. There are choices that can be made that will reduce the cost of ownership, and others that will
reduce the cost of purchase. Our goal is to influence the purchases in order to reduce the total cost of ownership.
We need the highest level of maintainability, operability, and reliability. The procurement team may not have that
understanding and may not know what to look for. They are in charge of finding something that meets a certain
specification and then purchasing the option with the lowest price. Our job is to provide a specification
that can only be met by equipment with the required maintainability, reliability, and operability.
We have to achieve the lowest life-cycle costs. Focus on the total cost of ownership, not the purchase cost.
The former includes maintenance, downtime, operation, lubrication, and replacement costs, as well as energy
efficiency.
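The difference between purchase cost and total cost of ownership can be illustrated with a minimal sketch. All figures below are hypothetical and chosen only for illustration. The `Option` class and its field names are assumptions made for this example, and the calculation ignores discounting and end-of-life replacement for simplicity.

```python
from dataclasses import dataclass


@dataclass
class Option:
    """One purchasing option; every figure is a hypothetical example value."""
    name: str
    purchase: float            # one-off purchase price
    annual_maintenance: float  # maintenance and lubrication cost per year
    annual_downtime: float     # cost of lost production per year
    annual_energy: float       # energy cost per year
    years: int                 # service life considered

    def total_cost_of_ownership(self) -> float:
        # Simple undiscounted sum: purchase price plus yearly costs over the life.
        annual = self.annual_maintenance + self.annual_downtime + self.annual_energy
        return self.purchase + annual * self.years


cheap = Option("low-price pump", 10_000, 3_000, 5_000, 4_000, 10)
robust = Option("reliable pump", 18_000, 1_500, 1_000, 3_200, 10)

for option in (cheap, robust):
    print(f"{option.name}: purchase {option.purchase:,.0f}, "
          f"TCO {option.total_cost_of_ownership():,.0f}")
```

With these assumed numbers, the option that is cheaper to buy costs far more to own, which is the point a purchasing specification needs to capture.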
On a related note, one problem the industry faces is that there are companies producing counterfeit parts. These
parts do not meet the stringent standards that the companies advertise. Procurement departments that are
looking for low-cost parts may inadvertently purchase counterfeit bearings, lubricants, filters, etc.
We need to consider the service providers. Look very carefully at the service they are providing, rather than just
telling them to go balance a rotor—give precise instructions. Send someone with them to inspect a gearbox, for
example.
TRANSPORT
You may have a machine that is in good condition when it leaves the vendor but is impacted during transport by
vibration, dust, etc. Do acceptance testing to make sure it is up to your standards. Proactively dealing with root
causes is much better than waiting until the equipment fails and arguing with the OEM about the warranty. Just
take steps to eliminate the possibility of these failures occurring in the first place.
A rolling element bearing that is subjected to vibration while it is not rotating, for example in the back of a
truck, can suffer from something called false brinelling: the rolling elements fret against the raceways and
leave wear marks. The bearing will be degraded and will fail soon after it enters service.
ACCEPTANCE TESTING
Make it part of your purchase requirements that tests will be performed to prove that the equipment is
fit for purpose. We assume that when we buy a part, it is in excellent condition on delivery and is designed for a
long life. We assume it will operate without resonance and that the bearings are in good condition, that there is
no soft foot, etc. But to be proactive in the selection process, we should assign the vendor tests to perform and
also perform some ourselves, with the desired results specified. That puts the supplier on notice to make sure it
fits the requirements. Companies still find a lot of equipment that fails the tests. Imagine what would happen if
you failed to perform the tests.

It is common for fans to not have the best bearing design. There can be resonance, unbalance, misalignment, and
other faults, depending on the circumstances. Again, look at how the equipment is transported.
When you get a motor, look at the windings, connections, stator, rotor, and so on. Do the checks on gearboxes at
certain stages during the overhaul process to make sure they are in good condition.
If you do not check, you are just trusting the supplier. Most OEMs are not actively trying to deceive you, but many
are unaware of the problems, such as resonance and false brinelling, in the products. If it does not meet your
standards, it does not get into the plant.
Also, be sure to check that what was delivered is exactly what you ordered. If you put the wrong product on the
shelf as a spare, you will have the wrong spare when you need the right one.
Perform quality assurance/quality control testing: test your own work as well to make sure everything is as it
should be. Look for the signs of failure and the root causes of failure (such as contaminated lubricants).
OPTIMAL OPERATION
Then we must operate the equipment properly. If we have the right design and have purchased the right
equipment, we should be able to operate the machines under the stresses and loads they were designed to cope
with. Make sure the equipment is operated in a way that enables it to reach its maximum lifespan: consistently,
according to procedures, and without excess waste.
Recognize that the operators of the equipment often understand the events that lead to changeover issues,
quality issues, maintenance issues, and so on. They know when the machine has been making strange vibrations
or noises, and then it fails. Engage with them and take their advice. Operators spend a lot of time with the
equipment, so they are in a great position to give advice and to perform basic maintenance tasks, such as
cleaning and lubrication, as well as condition monitoring.
Hence, we need to educate the operators on how important it is that the equipment is operated correctly. They
need to understand that when they do not operate the equipment in the optimal way, they are sucking the life
out of that equipment. (Be sure to convey this without pointing fingers!) If they have an understanding of how
their actions lead to equipment problems, they will be more likely to operate the equipment correctly. Later, we
will discuss the brown-paper process, which is a collaborative way of bringing people together to get their ideas
on why equipment fails or underperforms, and to come up with solutions together. As part of that process, they
will learn the proper way of operating the equipment, whether that information comes from you or from other
operators.
MCP Consulting, in association with the UK Department of Trade and Industry, found that
• 40-50% of equipment breakdowns are caused by poor operating practices
• 30-40% of breakdowns are caused by poor equipment condition or design

• 10-30% of breakdowns are caused by poor maintenance practices
In an earlier module, we discussed the implementation of this program and mentioned we would come back to
it. As we talk about design issues, procurement issues, working with vendors, acceptance testing, etc., we need to
understand all these opportunities to improve reliability. But, when creating a plan, you have to pick your battles:
are you going to address the design and procurement process, the vendors, acceptance testing, the operators, or
something else? Go back to your plan and decide in what order to address the issues. Make sure you do not give
the idea that certain people are being “picked on” or blamed—their department is just one of the areas that will
be addressed.
In what ways do the operators affect reliability? Operations issues include poor startup, poor shutdown, poor
operation, running equipment beyond its limits, deadheading pumps, not controlling pH in the process stream of
a chemical plant, and not controlling moisture content. In many cases, operators have not been coached on the
proper way of operating the equipment. Make sure operators on all shifts operate equipment consistently.
A pump was designed to meet certain criteria. If you do not operate it at its best efficiency point, there can be
recirculation, cavitation, and other issues that degrade the bearings, the impeller seals, etc. Operators need to
understand these consequences.
The other way we are going to work with operators is by implementing standard operating procedures: everyone
starts up the machine properly, shuts it down, goes through the changeovers, etc., so that the machine is
operated in a steady way.
It is reasonable to expect that, with so much on the line in terms of expense and production requirements, there
will be standard ways of operating the equipment. But what is the best way of operating the equipment? We will
discuss this more with the brown-paper process, but one way to do it is to have each shift document how they
operate the equipment and look at the performance. As a group, you can then decide on the best way, or ways,
to operate the equipment. Each shift may find a slightly better way to operate the machine, and each group tells
the other what they found during the discussion process.
We need to look at something called operator asset care, or operator-driven reliability (ODR). When breaking
out of the reactive maintenance cycle of doom, how do you take a group of maintenance technicians and get
them to perform tasks to extend the life of the equipment when they are already so busy? One way is to engage
the operators and get them to perform certain tests, inspections, and basic maintenance work, utilizing their
experience with the machine. This will save the maintenance team time that they can use on tasks that require
their specific skills. Also, if the operators are performing these basic tasks, there will be fewer failures, which will
free up more time for maintenance.
ODR utilizes operators in some proactive maintenance tasks. Because they work with the machines every day,
they are in a perfect position to perform these tasks. They can even use some basic condition monitoring tools to
check the equipment. They can keep the equipment clean and, maybe, do some lubrication work.
We have to be careful when going about doing this. The maintenance department may feel we are taking jobs
from them and operators may question why they are being asked to do maintenance work. This is why the
implementation plan is important. If you have engaged with them, involved them in the plan, and trained them,
they will be ready when you implement ODR.
Other studies have been done, such as the one by the Japan Institute of Plant Maintenance, which found that
70% of failures are preventable by operators, while 30% require intervention by technical specialists. The point
is not that the operators are to blame for the failures. It is that the operators are in a position to prevent 70% of
failures. This takes a huge burden off the technical specialists and engages people who are already working with
the machines. This is a great opportunity to break out of the reactive maintenance cycle of doom.
In another study, from Whirlpool, 23 RCM analyses identified 1,864 tasks to minimize failures:
68% were performed by operators, while 31% were performed by technicians. In addition, 237 redesigns of process
and/or equipment were needed in order to prevent future failures. Again, ODR is very important. In a further study,
66% of tasks were performed by operators, 32% by maintenance, and 2% of processes and/or equipment were
redesigned.
Operators can perform non-intrusive inspections, cleaning, lubrication, and basic vibration, ultrasound, and
temperature readings. As for lubrication, it is generally better to have a dedicated lubrication team if you can.
ODR is one of the keys to breaking the reactive cycle. It frees up time for the maintenance crew to do other tasks.
Proactive tasks will get done by people who should not be distracted by break-in jobs. It reduces the root causes
of failure. Operators will have a greater connection to the reliability improvement initiative.
The fourth way we can work with operators has to do with production rates and quality, in addition to failure.
Earlier, we discussed four areas: performance, constraints, risks, and opportunities. Here, we are looking at
opportunities. How can we increase the production rate and the quality of what we produce? How can we
reduce waste, keep customers happy, achieve production targets, and achieve higher capacity? People in the
maintenance and reliability departments may think that’s not their purview. That is part of the reason we put a
steering committee together and get senior management support. Why just focus on machines breaking down?
What about the minor stoppages, slowdowns, and changeover and transition losses? We all serve the same goals,
so why not look at these issues as well? The people who work with the equipment will have opinions as to why
these things happen and how to prevent them.
A common measure of the reliability program is the OEE (overall equipment effectiveness)—the combination of
the uptime, production rate, and quality. A lot of maintenance people may be focusing on the uptime, but they
should look at other areas as well.
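OEE is conventionally calculated as the product of three ratios: availability (uptime), performance (production rate), and quality. The sketch below shows that calculation; the input values are hypothetical, and the function name `oee` is an assumption made for this example.

```python
def oee(availability: float, performance: float, quality: float) -> float:
    """Overall equipment effectiveness as the product of three ratios (each 0 to 1)."""
    for value in (availability, performance, quality):
        if not 0.0 <= value <= 1.0:
            raise ValueError("each factor must be between 0 and 1")
    return availability * performance * quality


# Hypothetical values for one production period, for illustration only.
availability = 0.90   # run time / planned production time
performance = 0.95    # actual production rate / ideal production rate
quality = 0.98        # good units / total units produced

score = oee(availability, performance, quality)
print(f"OEE = {score:.1%}")
```

Because the three factors multiply, three individually respectable ratios still combine to an OEE of about 84% here, which is why looking at uptime alone gives an incomplete picture.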
In conclusion, defect elimination lets us get ahead of the root causes of failure. We know what causes failures
in most industries, and you can set proactive tasks to avoid those failures. The extra effort and up-front cost
will reduce the overall cost of ownership. Everyone needs to appreciate this concept, as the temptation to save
money can be overwhelming.
R-09: Asset Strategy

What is our asset strategy? What do we want to determine? We need to be clear on what analysis work has to be
done so that we can develop the right maintenance plan for the equipment.
• First, we need to perform the right maintenance tasks. When do we need to perform the tasks, what are
those tasks, why are we performing them, and who can perform them?
• Second, we need to perform the right monitoring tasks. When do we perform them, which technologies will
we use, and why are we performing them?
• Third, we need to proactively eliminate failures. How are we going to do it, when, why, who is going to do it,
and what methodology are we going to use?
What do we need to know?

• We need to understand failure modes: how they occur, the symptoms that will appear as failure is
occurring, and the rate of deterioration, that is, how quickly the asset will go from good condition to the
first signs of failure, and from those first signs to functional failure (the P-F Interval).
• We need to know the root causes of failure: how they develop, why they develop, the symptoms, and the
lead time—once we observe that a root cause exists, how much time we have before the machine starts to
fail. All failure modes have a root cause.
• We need to eliminate failures: when, why, who will be involved, and what methodology will be used.
How do we get all that information on all those assets?

• There is a lot of common knowledge in the industry regarding what has worked for other people and what
has caused failures in certain assets. It is a common response to think your situation is unique and your
equipment is unique. There may be aspects of your plant that are unique. But if you have electric motors,
pumps, fans, compressors, gearboxes, or couplings, and if the machines have rolling element bearings or
sleeve bearings that are lubricated by oil or grease, if they are susceptible to vibration, or if they get hot or
cold as failure is occurring, then there are common strategies and common knowledge about how failure
occurs and how to avoid it. You may have unique food processing, mining, or other equipment, but if you
are concerned about rolling element bearings, there is already a wealth of knowledge out there. You may
still have unique aspects to your equipment, but you can start with the basic, common faults and use the
time available to focus on what is unique.
• Look at your internal experience: what have you learned from past failures? You have seen the
maintenance records and spoken with the operators, and there is much you can learn there.
• You can perform failure modes, effects, and criticality analysis (FMECA), reliability centered maintenance
(RCM), preventive maintenance optimization (PMO), and root cause failure analysis (RCFA). But when are
we going to use those techniques, why, who will perform them, and which of these methodologies is best?
These approaches take a certain amount of expertise and time, which you may not have.
How do we get that information?

We can do some analysis ourselves. Weibull analysis allows us to learn from failure statistics. Pareto analysis
is another statistical technique that shows what is causing most of our problems. Condition monitoring will give
us information about the nature of failures, through inspections and tests. Another option is to use consultants
who can share their experience with you.
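As a minimal sketch of a Pareto analysis, the example below ranks causes of downtime and accumulates their share of the total. The downtime figures are hypothetical, and the 80% cut-off is the usual Pareto heuristic, used here as an assumption and not as a rule from this manual.

```python
# Hypothetical downtime hours attributed to each cause over one year.
downtime_hours = {
    "poor lubrication": 120,
    "poor alignment": 80,
    "poor balancing": 40,
    "looseness": 25,
    "electrical faults": 20,
    "other": 15,
}

total = sum(downtime_hours.values())
ranked = sorted(downtime_hours.items(), key=lambda item: item[1], reverse=True)

# Build (cause, hours, cumulative percent) rows, largest contributor first.
rows = []
cumulative = 0
for cause, hours in ranked:
    cumulative += hours
    rows.append((cause, hours, 100 * cumulative / total))

# Collect the causes needed to reach 80% of total downtime.
vital_few = []
for cause, hours, cumulative_pct in rows:
    vital_few.append(cause)
    if cumulative_pct >= 80:
        break

for cause, hours, cumulative_pct in rows:
    print(f"{cause:<20} {hours:>4} h   cumulative {cumulative_pct:5.1f}%")
print("Address first:", ", ".join(vital_few))
```

With these assumed figures, three of the six causes account for 80% of the downtime, so they would be tackled first.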
It is important to clarify the concept of failure. There’s the catastrophic failure where the machine goes “bang.”
Functional failure is when the machine no longer performs its function. We also need to look at waste: of time,
energy, materials, and resources. Excessive waste is also a failure.
We need to be concerned about quality, as it has an impact on the happiness of customers. It also impacts waste,
as the plant has to scrap or rework products that do not pass quality checks, and rework takes more energy and
time. We need to understand rate and throughput issues, including slowdowns, minor
stoppages, etc. Also, constant failures, interruptions, and missed targets cause a lot of frustration for workers.
These are sources of failure that must be eliminated.
Consider the value of what you are doing: the ROI and the effect on the company culture. Everything we do has
to add value. We need to understand priorities: determine the most critical asset, and the second most and
third most, and so on, so that we can prioritize the work we are going to do. Pinpoint the gaps that exist in the
people we have (numbers and experience), the skills of the workers, the available tools and technologies, and the
funding.
WHAT ARE THE TYPICAL OUTCOMES?
Condition-Based Maintenance
In condition-based maintenance, we make the decision to perform maintenance based on information that
indicates that the work is required and that the machine will fail if we do not perform the work. To find out if
work is required, we perform inspections, look at performance data, perform classical condition-monitoring tests
such as vibration analysis or infrared, test the oil, etc. How often we perform the tests depends on criticality. All
of this allows us to determine when to replace bearings, replace lubricant, re-align the machine, clean filters,
replace tires or belts, re-lubricate a bearing, and more.
There are a lot of decisions we have to make based on design parameters:
• Logic: CBM, PM, or RTF
• Criticality: To justify the program
• Failure modes: To help us decide which technologies to use

• P-F Interval: To decide how often to test
Let’s quickly review the P-F Interval. A piece of equipment might run in good health for a long time, and then
something happens to initiate failure. At a certain time, we will be able to detect it. We call that “P” because there
is a potential for failure. Over a period of time—possibly seconds for some equipment, or years for others—we
see all the tell-tale signs that the condition is degrading. We need to understand the failure symptoms of every
critical piece of equipment. Is it a change in temperature, a change in vibration, a change in sound, a change in
performance, a leak, and/or a change in the electric current? This will help us decide which condition monitoring
technologies are best (give us the earliest warnings) and how often to use them.
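The link between the P-F Interval and test frequency can be sketched as follows. A common rule of thumb, which is an assumption here and is not stated in this manual, is to test at intervals no longer than half the P-F Interval so that at least two tests fall between P and F. The failure modes and P-F values below are hypothetical.

```python
def inspection_interval(pf_interval_days: float, checks_within_pf: int = 2) -> float:
    """Longest test interval that still places `checks_within_pf` tests inside the P-F Interval."""
    if pf_interval_days <= 0:
        raise ValueError("P-F Interval must be positive")
    if checks_within_pf < 1:
        raise ValueError("at least one check is required")
    return pf_interval_days / checks_within_pf


# Hypothetical P-F Intervals in days, for illustration only.
pf_intervals = {
    "bearing wear (vibration analysis)": 90,
    "lubricant contamination (oil analysis)": 180,
    "belt wear (visual inspection)": 30,
}

for failure_mode, pf_days in pf_intervals.items():
    interval = inspection_interval(pf_days)
    print(f"{failure_mode}: P-F {pf_days} days, test at least every {interval:.0f} days")
```

A more critical asset might justify a larger `checks_within_pf`, which shortens the interval and gives more chances to catch the fault before functional failure.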
Interval-Based Maintenance
What is interval-based maintenance? Some people call it preventive maintenance but, because in some places
that term includes condition-based maintenance, interval-based maintenance is the more precise term. When we
cannot detect onset of failure but the asset is too critical to run-to-fail, we need to perform scheduled restoration
or replacement tasks. For example, we may know how long it takes for a cog to wear out, so we will replace it
at that time. We could also look at the cog and see whether or not it is worn—some interval-based tasks can be
made condition-based (though the inspections would remain interval-based).
Interval-based maintenance works best with age-related faults. We can make the decision to replace our tires
after traveling a certain number of miles, or we can inspect the tires at that time and decide whether to replace
them sooner or later. There is a risk associated with replacing those tires, or that cog, purely based on time:
we may wait too long and have it fail, or we may replace it earlier than needed and waste time, energy, and
money. With a condition-based strategy, we perform tests to determine the condition of the equipment,
but that takes time, money, and expertise. So we have to decide which strategy to use. With interval-based
maintenance, the interval can vary and may be determined by time, running hours, distance traveled, production
runs completed, etc.
Hidden Failure Finding Tasks
We must carefully scrutinize the existing PMs that may be historical or imposed by OEMs. The trouble is,
everything that takes our time takes us away from the proactive steps we should be performing. There may be
existing tasks in your plant that are just a waste of time. They may even be inducing failure in the equipment.
Later, we will discuss PMO (preventive maintenance optimization) in more detail.
Another part of our asset strategy includes hidden failure finding tasks. Some failures can occur without us
knowing, and we only find them when we try to get the machine to run: for example, a safety switch, a pressure
relief valve, or a standby pump. We assume that a standby pump will run when we need it to, but how do we
know that it actually will? We cannot measure the vibration because it is not running. We need a way to look for
these hidden failures. One strategy is to just cycle through the pumps, which is a good idea anyway. We must
identify the risks associated with hidden failures and develop a strategy to manage them.
Perform a Redesign or Improvement Project

Another strategy is to perform some sort of redesign. Sometimes we cannot perform a periodic PM or a CM test,
and the asset is too critical to run-to-fail. But there may be a change to the design that can reduce the number
of failures. This step may come from necessity, due to the criticality of the equipment and our inability to detect
failures. But also, through our natural process we may identify opportunities for redesign to reduce the likelihood
of failure. You may, for example, decide to change the type of coupling used. You can use a different type of
switch or change its location. You can provide additional weather protection, use a variable-speed drive, improve
gear teeth hardness, or use ball bearings instead of roller bearings, and so on.
Run-to-Failure Maintenance or No Maintenance Action

The first is run-to-failure maintenance. This is when you allow equipment to fail because prevention of failure
cannot be justified. This is not reactive maintenance or inability to see problems coming. Run-to-failure
maintenance is a deliberate decision to not perform interval-based or condition-based maintenance. Everything
takes time and, therefore, costs money. If we try to do everything, we probably won’t do anything very well, so
we have to make some hard decisions. For much of the equipment, simply dealing with the failures as they happen
may be the most cost-effective strategy. Be sure to report this decision, as well as your reasoning, to management.
Some assets are simply less important than others. They are not expensive to replace, they do not pose a safety
risk, they are not critical, and they do not cause secondary damage when they fail. You carry spares or you can
get them easily. Therefore, you cannot justify collecting and analyzing condition monitoring data and performing
preventive maintenance tasks on them. That does not mean we deliberately make the machine fail; you will still
lubricate it and take action if it is making noises or heating up.
If you have good data, do use it, but don’t make that your first and only priority. You may have to wait a long
time for good data, so don’t let the lack of it hold you back from being proactive. Collect data along the way.
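The outcomes described in this section can be read as a rough decision sequence. The sketch below is one simplified reading of that logic, not a formal RCM decision diagram. The function name, the yes/no inputs, and the order of the checks are all assumptions made for illustration.

```python
def choose_strategy(critical: bool, hidden_failure: bool,
                    onset_detectable: bool, age_related: bool) -> str:
    """Pick a typical asset-strategy outcome from four simplified yes/no answers."""
    if not critical:
        # Prevention cannot be justified: a deliberate run-to-failure decision.
        return "run-to-failure"
    if hidden_failure:
        # Standby pumps, safety switches, relief valves: test that they work.
        return "hidden failure finding task"
    if onset_detectable:
        # Symptoms give warning, so act on condition.
        return "condition-based maintenance"
    if age_related:
        # No warning, but wear-out is predictable: restore or replace on an interval.
        return "interval-based maintenance"
    # Too critical to run to failure, and neither CM nor PM is workable.
    return "redesign or improvement project"


# Hypothetical assets, for illustration only.
print(choose_strategy(critical=True, hidden_failure=False,
                      onset_detectable=True, age_related=False))
print(choose_strategy(critical=False, hidden_failure=False,
                      onset_detectable=True, age_related=False))
```

In practice these answers are rarely clear-cut, and criticality is a ranking more than a yes/no flag, so treat the function as a way to discuss the logic and not as a substitute for the analysis.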
WHO SHOULD DEVELOP THE STRATEGY?

The reliability group, which you may be part of, is probably the best team to do it. You may choose to work with
maintenance, but they are often more focused on fixing problems than preventing them. We need a focused,
well-trained, dedicated team, ideally with experience or access to people with experience. That team should
be 100% focused on reliability and able to perform the analysis, create standards and guidelines, and perform
condition monitoring tasks.
When you have a reliability group, there is a lot that can go wrong. Don’t let people think that reliability is solely
the responsibility of the reliability group. If that happens, the reliability team will get the blame every time there
is a problem, and no one else will take responsibility for reliability. Just as safety is everyone’s responsibility, so is
reliability. Don’t locate the reliability team in isolation so that there is a disconnect between them and everyone
else. Don’t let them get dragged back into maintenance. Make sure the team works with people to get everyone’s
advice and involvement. Get out of the office: walk around the plant and talk to people. Look at how monitoring
and tasks are being performed. Learn from people, follow suggestions, ask for advice, and share successes.
GET ORGANIZED
It is important to be organized through this process. It does not make sense to develop an asset strategy if you
do not have an up-to-date master asset list or equipment register.
Ideally, there should be a management-of-change process so that any proposed change to the way the equipment
is maintained or operated is documented.
The bill of materials (BOM) is a key part of being organized and in control. This lists the equipment as well as all
the components and parts that make up the equipment. Then, when you do failure modes analysis, you know
what the components are and what can fail. This will also help us keep track of the spares we need.
Documenting a BOM is a big job, and it is often left undone because people do not believe it is worth the time.
When maintenance work is performed, it is a good opportunity to document that information. When the machine
is open, look inside to document the bearings, gears, etc.
ANALYZING RELIABILITY DATA

Now we will talk about analyzing reliability data, and the information we need to help us develop the asset strategy.
There are two basic ways to develop the asset strategy: proactively tackle common problems based on experience
(but not data), and use data to identify which assets are the least reliable, as well as their failure patterns.
There are many common root causes of problems: poor lubrication, poor alignment, poor balancing, poor
installation practices, resonance, poor fastening, poor operating practices, and so on. All of these will lead to
failure, and we do not need root cause failure analysis to tell us that. If we have data, we could use it to justify investing
in a lubrication program, for example, but if we do not have the data, we should not wait for it. The machines are
failing right now.
DATA-DRIVEN APPROACHES
As for the data-driven approach, we can take two basic approaches: creating a bad actor list with Pareto analysis,
or taking an analytical approach while looking at your failure information (how often you are experiencing failure)
and determining which failure curve you have. Weibull analysis will help you there. If you know the reliability of all
the assets, you can create a reliability block diagram and determine what the future availability will be, assuming
that the reliability stays constant. You can perform simulations of possible changes using software.
We would love to have data to use in making decisions, but if we do not have it, we should just start acting now to
improve reliability. If you start measuring now while improving reliability, note that the MTBF may change: perhaps
failure used to be when the machine stopped working and now it is when someone detects it with a vibration
analyzer. We are also affecting the MTBF by the proactive changes we are making. Just be aware that the data will
not be constant due to these factors and take that into account when making decisions.
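As a rough illustration of the reliability block diagram idea, the sketch below combines asset reliabilities for
blocks arranged in series and in parallel. The asset names and reliability values are hypothetical, and the
calculation assumes independent failures and reliabilities that stay constant over the period of interest.

```python
from math import prod

def series(reliabilities):
    """All blocks must work: multiply the reliabilities."""
    return prod(reliabilities)

def parallel(reliabilities):
    """At least one redundant block must work: 1 minus the chance that all fail."""
    return 1.0 - prod(1.0 - r for r in reliabilities)

# Hypothetical line: a feed pump, two redundant conveyors, and a packer.
feed_pump = 0.95
conveyors = parallel([0.90, 0.90])   # either conveyor can carry the load
packer = 0.98

line_reliability = series([feed_pump, conveyors, packer])
print(f"Line reliability: {line_reliability:.4f}")
```

Changing one value (for example, improving the feed pump) and re-running the calculation is a very small version
of the "what if" simulations that dedicated software performs.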

Pareto analysis is a good tool because it will highlight which assets and types of failure are sucking up all your
resources, wasting your time, affecting production, and so on. It is said that 80% of problems come from 20%
of the assets. If you have 10,000 assets, they will not contribute equally to your reliability problem. Some
of them are bad actors—they break down relatively frequently, keeping the maintenance department busy
and frustrating the operators. We need to identify these machines (using our records if we have them). If we
proactively deal with those problems, we could save a lot of time and cost. We can look at a chart and see that,
for example, two assets are failing more than others.
We may have, for example, 76% of failures from three machines. We need to deal with those machines. We could
simply record which machines are failing, but we would like more information: is it the motor, the gearbox, or the
pump? Could it be related to couplings, bearings, or seals? It is useful to know which specific assets are giving us
problems.
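A bad actor list of this kind can be sketched in a few lines. The example below assumes a simple failure log with
one asset tag per unscheduled failure; the tags, counts, and the 80% cut-off are hypothetical and only illustrate
the ranking idea.

```python
from collections import Counter

# Hypothetical failure log: one entry per unscheduled failure, by asset tag.
failure_log = ["P-101", "P-101", "C-201", "P-101", "M-305", "P-101",
               "C-201", "F-410", "P-101", "C-201", "P-101", "G-502"]

def bad_actors(log, threshold=0.80):
    """Return (asset, failures, cumulative share) rows, worst first,
    stopping once the cumulative share reaches the threshold."""
    counts = Counter(log).most_common()
    total = sum(n for _, n in counts)
    rows, running = [], 0
    for asset, n in counts:
        running += n
        rows.append((asset, n, running / total))
        if running / total >= threshold:
            break
    return rows

for asset, n, share in bad_actors(failure_log):
    print(f"{asset}: {n} failures, cumulative {share:.0%}")
```

The same ranking could be done on downtime hours or repair cost instead of failure counts, and at the component
level (motor, gearbox, pump) if the records allow it.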
In a large pulp mill, a study of three years of excellent equipment-level downtime records showed that, of over
12,000 items, 87 items (less than 1%) caused 80% of all unscheduled maintenance downtime. Within those
87, some were probably much worse than others. How hard would it be to determine and deal with the root
cause? Even the existing data would allow this plant to prioritize what they do. In this mill, the amount of
downtime dropped after PM programs were implemented on those machines.
It is important to appreciate good data. Don’t be in a position where you say, “We don’t have good data, but it’s
the only data we have, so we will base all our decisions upon it.”
Ideally, we would like to document and know which asset failed, when it failed, what failed, and the consequence
of the failure. You might collect and document that information in different ways. Make sure you document
the failure in detail so you can use it as a reference later (don’t just write “other,” for example). Data must be
recorded properly, but we have to be realistic. If there are 200 failure codes, they will not be used properly.
You need good data. What is being recorded right now? Are people recording maintenance work against the
correct asset? Ideally, we would like to know which asset failed, when it failed, what failed, and the consequences
of failure. You may need to dedicate one person to gather this information. If we could record the date, the time,
which asset it was, and the nature of the failure, that would give us great information.
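One way to picture such a record is sketched below. The field names and the short code list are hypothetical; the
point is only that each entry captures which asset failed, when, what failed, and the consequence, and that vague
descriptions are flagged.

```python
from dataclasses import dataclass
from datetime import datetime
from enum import Enum

class FailureCode(Enum):
    """A deliberately short, hypothetical code list; long lists tend not to be used."""
    BEARING = "bearing"
    SEAL = "seal"
    COUPLING = "coupling"
    ELECTRICAL = "electrical"
    LUBRICATION = "lubrication"
    OTHER = "other"

@dataclass
class FailureRecord:
    asset_id: str            # which asset failed
    failed_at: datetime      # when it failed
    code: FailureCode        # what failed
    description: str         # nature of the failure, in the technician's words
    downtime_hours: float    # one measure of the consequence

    def is_usable(self) -> bool:
        """Flag vague descriptions that will be of no help as a later reference."""
        vague = {"", "failed again", "it's not working", "other"}
        return self.description.strip().lower() not in vague

record = FailureRecord("P-101", datetime(2021, 1, 14, 9, 30),
                       FailureCode.SEAL, "Mechanical seal leaking at drive end", 6.5)
print(record.is_usable())
```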
Unfortunately, when we look at a chart, we often see things like “It’s not working” or “Failed again.” At least
we know which assets are the troublemakers, but we do not know the nature of the trouble. Be precise when
documenting the failure.
One way to record data is through failure codes. If we have 200 failure codes, however, it is not going to work.
The technician will not have the patience to scroll through them and choose the right code.
INTERPRETING FAILURE MODES

We can now use this information with greater sophistication. When talking about failure modes, we saw some
curves showing things like infant mortality, random failures, the bathtub curve, and age-related faults.
Those curves come from the analysis of mean time between failures (MTBF). There are some issues with using MTBF
data, some of which were mentioned before, but that information can tell us whether failure is likely to be the
result of maintenance activity (infant mortality), or whether it’s random (so we should use condition monitoring)
or age related (so we might use interval-based maintenance).
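In Weibull analysis, the failure pattern is usually read from the shape parameter (beta): a value below 1 indicates
a decreasing failure rate (infant mortality), a value near 1 a constant rate (random), and a value above 1 an
increasing rate (age related). The sketch below estimates beta by median rank regression. It assumes complete
data (no censoring); the running hours and the 0.9 and 1.1 cut-offs are hypothetical, chosen only for illustration.

```python
import math

def weibull_shape(times_to_failure):
    """Estimate the Weibull shape parameter (beta) by median rank regression.

    Assumes complete data (every unit ran to failure, no censoring).
    """
    t = sorted(times_to_failure)
    n = len(t)
    # x = ln(time); y = ln(-ln(1 - F)), with Bernard's median rank approximation for F.
    xs = [math.log(ti) for ti in t]
    ys = [math.log(-math.log(1.0 - (i - 0.3) / (n + 0.4))) for i in range(1, n + 1)]
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    # The least-squares slope of y on x is the shape parameter.
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    sxx = sum((x - x_mean) ** 2 for x in xs)
    return sxy / sxx

def suggested_pattern(beta, low=0.9, high=1.1):
    """Map beta to a failure pattern; the cut-offs are illustrative, not a standard."""
    if beta < low:
        return "infant mortality"
    if beta > high:
        return "age related"
    return "random"

# Hypothetical running hours to failure for one type of fan.
hours = [40, 95, 180, 420, 900, 2100, 5200]
beta = weibull_shape(hours)
print(f"beta = {beta:.2f} -> {suggested_pattern(beta)}")
```

With only a handful of failures the estimate is very uncertain, so treat the result as a prompt for further
investigation and not as proof of a failure pattern.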
A simple case study involves cooling towers that were maintained by a contractor. The refinery experienced a
lot of failures and began inquiring about them. They decided to do some analysis, and they looked at how often
each individual fan was failing. They found that only a limited number of fans were failing, due to infant mortality.
The contractor, while “fixing” the problem, was introducing new problems, and the fan would just fail again.
They decided to bring maintenance in-house and use precision skills, with the right procedures and tools, and
reliability improved. In this case, they had data to prove the situation. Senior management was able to see that
they could save a certain amount of money by bringing maintenance in-house.
ASSET CRITICALITY RANKING

In this section we will talk about asset criticality ranking (ACR). This allows you to prioritize and justify all your
activities.
We have to make a lot of decisions: Should we perform a full RCM on every asset? If not, on which ones? Which
assets do we repair first? Which spares should we keep in inventory? Can we justify condition monitoring, using
more than one technology, online monitoring, acceptance testing, and/or a re-design of an asset? And how do we
justify our decisions? We can use ACR to determine which spares to keep, justify PMs, and so on.
What is criticality?
Criticality is the combination of the frequency of failure and the consequences of failure. To put it another way, it
is the importance of the machine and the likelihood of failure. It is a measurement of the risk we face. Let’s say we
have a
production line, and the product goes from one machine to the next. We know
that the line will shut down if any one of those machines fails. Therefore, we might conclude that they are all equally
critical. But we have to go further and ask, “Which of those machines would we least want to fail?” Of course,
you do not want any of them to fail on that crucial line, but one of them, for example, may be more expensive
to repair than the others. The parts may only be available overseas. It may have a shorter lead time to failure.
Therefore, this machine is more critical than the others.
Are any of the machines less reliable than the others? You probably have one that fails often and is likely to get
the line shut down. Therefore, it is a good idea to look for the root cause of failure, and you may be able to justify
condition monitoring on that machine. Now check if any of the machines affect the product quality more than
the others. Or, will one machine put workers’ safety at risk if it fails? Therefore, these machines are not equally
critical. On a related note, some machines will start making noises before they fail while others will fail without
warning. Their criticality may be equal, but one of them needs extra attention.
CRITICALITY RANKING
How do we define criticality ranking?

Be careful: the size of an asset does not determine its criticality. A small asset may be powering an important
machine. You need to rank criticality for all assets: pumps, motors, switches, gears, bearings, etc. There are a few
ways to define the ranking.
• You can figure it out for yourself. The disadvantages are that you will get zero buy-in and you probably do
not know as much about the consequences as you think you do. You need to involve more people.
• You could have a team meeting, with all the stakeholders, and set the criticality rankings. A problem with
this approach is that everyone believes their machine is the most critical, and the battles may continue long
after the meeting.
• You could perform reliability centered maintenance (RCM) or failure modes, effects, and criticality analysis
(FMECA). You will determine criticality this way. However, it takes a long time and we need to know
criticality now.
• We could keep it simple and label machines “critical,” “essential,” or “non-essential.” However, most of the
decision-making power is lost here.
Instead, include all the stakeholders (in maintenance, operations, safety, health, environment, etc.), define the
consequences we are concerned about, assess the reliability (the probability the failure will develop), and assess
the detectability of the failure.
First, take an asset and decide whether the consequences of its failure are insignificant, minor, moderate, major,
or extreme. On another axis, graph the probability, or likelihood, of failure, from rare to almost certain, within
your chosen time period. The combination of those factors will give you the criticality ranking.
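A minimal sketch of such a matrix is shown below. The 1 to 5 scores, the intermediate likelihood labels
(“unlikely,” “possible,” “likely”), the rule of multiplying the two scores, and the example assets are all
assumptions made for illustration; your stakeholder group would agree on its own scales.

```python
CONSEQUENCE = {"insignificant": 1, "minor": 2, "moderate": 3, "major": 4, "extreme": 5}
LIKELIHOOD = {"rare": 1, "unlikely": 2, "possible": 3, "likely": 4, "almost certain": 5}

def criticality(consequence, likelihood):
    """Criticality as consequence score times likelihood score (1 to 25)."""
    return CONSEQUENCE[consequence] * LIKELIHOOD[likelihood]

# Hypothetical assessments agreed by the stakeholder group.
assets = {
    "Boiler feed pump": ("major", "possible"),
    "Cooling tower fan": ("moderate", "almost certain"),
    "Office HVAC unit": ("minor", "unlikely"),
}

ranking = sorted(assets, key=lambda name: criticality(*assets[name]), reverse=True)
for name in ranking:
    print(f"{name}: {criticality(*assets[name])}")
```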
Next, break the consequences of failure down into categories, such as equipment, people, environment,
production, and product quality/safety. Some consequences carry more weight than others, for example, if your
product quality could affect the health of a customer or a machine failure could be dangerous for workers.
Near the beginning of this course you learned how to decide on the most critical issues for your plant. Now,
break each consequence down into categories (insignificant, minor, moderate, major, extreme). For example,
under “equipment,” the worst case scenario may mean the destruction of equipment, the destruction of other
equipment, and the spare being unavailable in state. The worst case scenario under “people” may mean a single
fatality or multiple fatalities. As a group of stakeholders, we will go through this process asset by asset. We will combine the
scores of reliability and consequences to get the criticality score.
Let’s go further. So far, we have given equal weight to the consequences of failure. A fatality has been scored
equally to secondary machine damage. There are two ways to solve this problem. We could simply use a larger
range of numbers under “people” (say, 1-10), or we could apply a weight to certain consequences. Under the
weighting system, certain consequences could also be scored lower, say, if your plant will not seriously impact the
environment even if the asset experiences its worst case scenario.
To go further, we can change the reliability score to reflect detectability. If a fault in an asset can be easily
detected, there is little probability that the machine will be allowed to fail, and the consequences of failure will
never surface.
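The weighting and detectability adjustments can be sketched as follows. The weights, the 0 to 1 detectability
scale, and the rule for scaling likelihood by detectability are assumptions chosen for illustration and are not
a prescribed formula.

```python
# Hypothetical weights: consequences to people count most, environment least.
WEIGHTS = {"equipment": 1.0, "people": 3.0, "environment": 0.5,
           "production": 1.5, "quality": 2.0}

def weighted_criticality(consequences, likelihood, detectability):
    """Combine per-category consequence scores (1-5), likelihood (1-5), and
    detectability (0.0 = never detected in time, 1.0 = always detected).

    Likelihood is scaled down as detectability improves; this scaling rule
    is an assumption made for illustration.
    """
    consequence_score = sum(WEIGHTS[cat] * score for cat, score in consequences.items())
    effective_likelihood = likelihood * (1.0 - detectability)
    return consequence_score * effective_likelihood

# Hypothetical consequence scores for one gearbox.
gearbox = {"equipment": 4, "people": 2, "environment": 1, "production": 5, "quality": 2}

undetectable = weighted_criticality(gearbox, likelihood=3, detectability=0.0)
monitored = weighted_criticality(gearbox, likelihood=3, detectability=0.8)
print(undetectable, monitored)
```

Keeping the individual category scores, and not only the final number, is what later lets you see why an asset
ranked where it did.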
We have just developed a wealth of useful, actionable information. If you store this information and have a way
of analyzing it, you could go back and see the reasons for the rankings. This would allow you to focus on, say,
machines that are only critical because their failure is undetectable and suggest condition monitoring for them.
You may look at other consequences and think of ways to minimize the risks to the environment or to safety,
even if the machine were to fail.
There is more that we can do. So far, we have talked about the equipment as a unit, but what about the criticality
of the individual components? Each component may have different problems and different criticality. For the
most critical machines, we may want to do a criticality analysis of each component so that we can do something
about them or monitor them. At this point, we are nearly doing RCM. We have spoken about the seriousness
of each consequence; similarly, the probabilities of each consequence developing are not equal. We could go
further and discuss probability and detectability for each consequence. We use criticality to justify digging deeper.
A criticality spectrum has system criticality on one end, followed by basic asset criticality and ACR, and it goes
more into depth until we reach RCM/FMECA at the other end. With this subject, we can start with something basic
and get into more and more detail. But the point is to use criticality to justify your efforts, and you end up with
the information that allows you to make decisions.
PREVENTIVE MAINTENANCE OPTIMIZATION

In this section we will talk about preventive maintenance optimization (PMO). The basic issue here is managing
your resources. You have to use your time on the tasks that will give you the most value. And you do not want to
perform tasks that will harm the machine. In many organizations there will be a history of PMs that have been
developed for many reasons, such as previous failures or OEM requirements, and they have never been reviewed
since. Some PMs were developed because failure modes and condition monitoring were not understood.
Condition monitoring may be a better solution than PMs. In this section and the next, we will talk about the
techniques we can use to develop asset strategy—PMO in this one and a summary of RCM and FMEA in the next.
With PMO, we start with our list of PMs and scrutinize every one of them. For each, we ask: does it add value? If
not, we scrap it.
You will not feel comfortable with some of the necessary decisions, and you may feel there is a risk in
discontinuing those PMs. You may wish to seek the advice of a consultant or even the OEM. The OEM may justify
why they want you to perform certain tasks. With time, confidence, and experience, you will become more
comfortable deleting superfluous PMs.
We must remove all tasks that are wasting time and money and inducing fault conditions. Some PMs are OEM
warranty requirements. Check if the asset is still under warranty. If not, you may not need to perform those tasks.
Decide if the PM is the right strategy—the OEM may not know what the best strategy is. In a few cases, PMs from
the OEM involve having to purchase more services or equipment from them. Don’t be afraid to challenge the
OEM if you think you should be able to use your own equipment.
All of this is about risk management. You are taking a risk by discontinuing a PM,
but the payoff will be worth it if the PM was not needed. Look at the history of the
equipment and understand what the maintenance needs really were. If the PM was
due to a previous failure, check to see if that failure may happen again.
Scrutinize any PM that does not deliver real value. If a PM just says “check pump,”
does the technician know what to look for? Make sure you have desired outcomes you can measure and have the
technician record the measurement.
Watch for any PM that is intrusive. Having to open up a machine introduces a safety issue, and it is an
opportunity for something to go wrong, reducing the reliability of the equipment. People make mistakes. Any
time you open a machine, there is a risk of contamination and damage. This is particularly true if it involves
removing parts, inspecting something, and putting it back together. Things can get misassembled, contaminated,
dropped, etc.
Are there PMs that are based on an age-related failure pattern, but the machine actually has a random failure
pattern? PMO is scrutinizing existing PMs to judge whether they are valuable. The other approach is to start with
a clean piece of paper and look at each asset: what are the failure modes, interval-based tasks, condition-based
tasks, and hidden failure finding tasks we should perform? That creates a whole new set of PMs. Pure PMO takes
what you have and eliminates what you do not need, but does not necessarily create new PMs. Advanced PMO
does go that extra step, possibly replacing time-based tasks with condition-based tasks. This approach comes
close to RCM.
You have to be brave when it comes to PMO. Each time you delete a PM, you hope it does not lead to a problem.
Trust the science, history, and expertise that guide the decisions in terms of infant mortality, random failures, and
age-related failures. Apply the right strategy in the right locations.
Again, PMO purely reduces or removes PM tasks; RCM creates new ones. When you are trying to break out of the
reactive maintenance cycle of doom, while you may not be ready to perform full RCM, PMO is a useful task.
RCM AND FMEA

In this section we will briefly talk about reliability centered maintenance (RCM) and failure modes and effects analysis
(FMEA). RCM and FMEA are basically the study of every failure mode and the consequences of each failure in
order to determine what we can do about them.
Earlier, we discussed how to determine which assets we would run-to-fail, which ones we would perform
condition-based maintenance on, which ones we would perform interval-based maintenance on, how to check
for redesign opportunities, and how to check for hidden failures. RCM and FMEA overlap quite a lot. RCM
starts from a system level, where we note the function of the system and look at each asset within the system
(considering how each asset affects other assets and how they might interact). FMEA narrows the focus a bit by
looking at a machine and a failure mode of that machine. FMEA is classically used in product design to look at the
failure modes and the probability of them, and to decide whether to modify the design of the product. The main
thing to know about RCM and FMEA is that they give us the information we need to set asset strategy. We can use
the criticality ranking to help guide us in this effort.
This work overlaps with PMO. Ultimately, you are trying to set the right asset strategy. PMO starts from the
existing asset strategy of existing PMs and weeds out the ones you don’t need. In a more sophisticated PMO,
we can add back the ones that we do need. RCM and FMEA start with a clean sheet of paper, investigating the
failure modes and determining the PM and CM tasks needed. When we go through a piece of equipment and
determine all the failure modes, we need to decide what to do with each failure mode. These are not necessarily
decisions about the asset, because each piece of equipment can have multiple strategies applied to it due to the
differences between the failure modes and their probabilities and consequences. Classical RCM and FMEA will
deal with each and every failure mode. The process is time consuming, so we can use criticality to decide where
to apply these techniques.
DECIDING TO PERFORM RCM OR FMEA

Any significant number of assets in your plant means a lot of work. If people take the job seriously and think
about every failure mode and every task, you will end up with an extremely thick project report. You will never
have time to perform all those tasks. You have to take a practical approach, whether it is something you do
internally or something you ask a consultant for help with. RCM purists will say you need to consider every single
asset and failure mode. That’s great if you have the resources to do all that analysis work and follow the results.
The practical side has to determine how much analysis we actually do, with the help of criticality. Use criticality to
focus on the assets that most need the analysis. If you’ve performed Pareto and criticality analysis, you can start
from that top 5%, analyze them, and start work on them. Then, do the next 5%, and so on. Hopefully, criticality will
come down after you’ve done some work on that top 5%. People in your plant will see the improvements and feel
a sense of achievement.
The sad fact is that only about 15% of RCM projects are successful. What does success really mean? It means
that an RCM analysis was performed, and it came out with a list of directives that involved proactive tasks as
well as reactive tasks. If we only determine the CM tasks that can detect failure and the PM tasks that prevent
that failure, but we do not address the root causes of failure, the project could be considered a failure. Some
RCM programs only come out with time-based maintenance tasks. We need a set of tasks that eliminate the root
causes and deal with the onset of failure. We need to implement them and make sure they are being performed
on schedule and that they are making a difference. Performance and reliability should improve. That would be a
successful RCM project.
What do RCM and FMEA help us understand?

• The function of the asset
• How it can fail
• What causes it to fail
• The results of failure
• The consequences of failure
This all helps us determine what we should do about it and what to do if it appears we cannot do anything about
it (possibly a redesign).
These are the seven classical steps of RCM:

• Functions: What are the functions?
• Functional failures: In what ways can it fail?
• Failure modes: What causes each failure?
• Failure effects: What are the results of a failure?
• Failure consequences: In what way does the failure matter?
• Proactive tasks and intervals: What should be done?
• No proactive task available: What should be done?
If you are not doing these things, in theory, you are not doing RCM. We will go into this in more detail in the ARP-
Engineer course.
You can use a flow chart to determine prioritization and the questions you need to ask along the way. It starts
with the business review and moves on to criticality questions. Another process helps us deal with the root
causes of failure and the onset of failure. There are additional questions that need to be asked, whether dealing
with the proactive tasks or the onset of failure. We need to understand the technologies that will be required
to perform either the detection or the prevention. What is the probability of success in detecting or preventing
those failures? Will we always detect it with the chosen technology? Can we afford to take measurements
frequently enough to detect it in time? Do we have the skills to perform the CM or PM tasks? How much will this
cost? How much time will be required and do we have that time? Knowing the answers to these questions will
help us plan our program and make the necessary adjustments.
INTRODUCING RCFA
In this module we will talk about root cause (failure) analysis (RCA or RCFA). This topic could fit in two places: here
in asset strategy or in continuous improvement. RCFA is a tool that looks backward to determine the root cause
of poor performance or failure and seeks to fix it.
Even with our RCM strategy, maintenance improvement, and performance improvement, there will be failures.
There will be poor performance in production. We need a logical way to figure out why the problems are
occurring and what we can do about it. RCA or RCFA is a method that is used to address a problem or non-
conformance in order to get to the root cause of the problem. It is used so we can correct or eliminate the cause
and prevent the problem from recurring. In some companies people are frustrated with RCFA because, after
failures were found, nothing was done about them.
Root cause analysis (RCA) analyzes a process, for example, production line problems. RCFA gets to the
root cause of a failure and eliminates it. These are the five basic steps of RCFA:
• Define the failure or process irregularity
• Investigate the root cause
• Create a proposed action plan and define timelines (get approval for the project)
• Implement the proposed action
• Verify improvement and monitor effectiveness
We can also use FRACAS—failure reporting and corrective action system—to help document and track the
progress.
It is not enough to define the root cause of failure. Make sure action is taken to eliminate it.
There are quite a few techniques that can be used to determine the root cause:
five whys, fault tree analysis, Ishikawa or fishbone, and many more.
Five whys: You continue to ask “why” until the answer to that “why” can be considered the root cause. For
example, why won’t my car start? The battery is dead. Why? The alternator is not functioning. Why? The alternator
belt is broken. Why? The belt was well beyond its useful service life. Why? The car was not maintained per the
recommended service schedule. Solution: Replace belts according to the recommended schedule. The point is to
keep asking “why” until you get to something that can be improved and which, if improved, will solve the problem.
You can go through the physical reasons until you get to the
human side of it. Are the people being trained? Can they see the failure? You could keep going deeper, but you
just need to get to a point where you can control the outcome.
The fault tree analysis (FTA) goes through a similar process but asks more questions. The fact is, there might
not be one single root cause. It may be a combination of factors or influences. The car may not start because
the battery is dead, but it may also be out of fuel or have an engine that needs tuning. The result may be
complicated, with “ands” and “ors.” This is quite sophisticated, and we have to ask how much sophistication and
time are warranted. Ideally, the operations department should be empowered to perform RCFA. Everyone should
keep asking “why” and thinking of how to improve that asset and avoid the problem in the future.
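The “ands” and “ors” of a fault tree can be pictured as simple logic gates. The sketch below extends the car
example; the tree structure is hypothetical, in particular the assumption that a poorly tuned engine only prevents
starting in cold weather.

```python
def or_gate(*inputs):
    """The output event occurs if any input event occurs."""
    return any(inputs)

def and_gate(*inputs):
    """The output event occurs only if every input event occurs."""
    return all(inputs)

def car_will_not_start(battery_dead, out_of_fuel, engine_needs_tuning, cold_weather):
    # Hypothetical branch: poor tuning alone is tolerated, but not in cold weather.
    tuning_problem = and_gate(engine_needs_tuning, cold_weather)
    # Top event: any one of the three branches is enough to stop the car starting.
    return or_gate(battery_dead, out_of_fuel, tuning_problem)

print(car_will_not_start(battery_dead=False, out_of_fuel=False,
                         engine_needs_tuning=True, cold_weather=True))
```

Even this small tree shows why there may be no single root cause: the same top event can be reached through
different combinations of contributing factors.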
If we cannot solve the problem logically, we can try the Ishikawa diagram, or fishbone technique. This is a
brainstorming technique to decide on the reason for failure: material, man, machine, method, or environment.
Consider all the issues to do with each category. This exercise may help us find the cause of the failure, but it
could also help us identify things that could cause failures in the future.
The Human Element

Humans are fallible. As part of RCFA, we ask people questions about what they heard or saw, anything they might
have observed before the failure. We use the information they give us to make decisions. Unfortunately, people’s
memories are not completely accurate.
What they remember may not have actually happened. They may not have seen what they were looking at. You
may ask questions in a way that influences them (leading the witness). You have to be careful to not jump to any
conclusions. Don’t just latch on to something that supports your hypothesis.
There is also peer pressure. If, for example, a manager is quite strong willed and voices an opinion on what is
wrong, others may go along with it. Humans are subject to cognitive biases. For example, the recency effect says
that the latest explanation is best, the anchoring effect says that the first explanation is best, and others skew the
memory in different ways.
RCFA Through Condition Monitoring

If you perform condition monitoring, don’t forget that it gives you a lot of good information. Through vibration
analysis, we can detect unbalance, misalignment, resonance, and other problems which can lead to failure. Other
CM technologies detect other root causes. These technologies detect root causes and, later, the failure. If failure
occurs, look back at the CM data to see evidence of the root cause.
Oil analysis and wear particle analysis show us the particles and contamination in the lubrication. Examining
those particles will show us the root cause. Look at the bearings themselves. Do not throw away old bearings—
see what the surfaces tell you: slippage, corrosion from standing still, etc.
RCFA is a key to continuous improvement, but we must learn from our failures. The aforementioned
psychological issues play a role here (human error and cognitive bias). RCFA is also a key component in culture
change—dealing with frustrating, unsafe, and costly failures (this is also a good opportunity to engage with
workers, get their observations, and ask for advice). We can use condition monitoring data to aid in the RCFA
process.
BROWN-PAPER PROCESS
Now we need to talk about the brown-paper process. Basically, the idea is to engage with many people in the
organization to get their ideas, feedback, and contributions to help us develop an asset strategy and target
problems that we can resolve.
We have two big challenges: we need to change the culture and we need to unlock the wealth of knowledge on
the plant floor. Could there be a way to kill two birds with one stone? Using this process, we can.
First, we engage people and involve them in the improvement process because it is motivating and creates
believers. Second, we learn a great deal about problems in the plant and motivate people to help solve them.
These problems include bottlenecks in production, waste, bureaucracy, equipment failure, bad practices,
shortcuts, procedures not followed, and inconsistencies.
How much do you really know about problems in your plant? Generally, people on the front line know a lot more
about these problems than management. The people who work with machines every day know a lot about them.
We need to change these attitudes: “We’ve always done it that way. That’s not the way we do things around here.”
What do you think would happen if you asked employees for their opinions in an unstructured way? What do they
think will happen? They probably believe their feedback will be ignored, and that may have happened in the past.
Here is an employee feedback process that was very successful in one plant:
• Establish a team to handle suggestions
• The team works with the suggestee to quantify the benefits (make sure it is a plausible suggestion that will
not negatively affect another area, and it has value)
• Financial benefits
• Culture change or continuous improvement
• Once per week a meeting is held with senior management
• Suggestee leads the presentation (after coaching from the team)
• Manager says “yes” or “no”; if “no,” explains why
• Suggestee is involved in or leads the implementation (if possible)
• Suggestee reports back on results of initiative
These suggestions can be related to anything: put a cover on a machine so the bearings don’t get wet during
wash-downs, store this part somewhere else because it hurts our backs to move it, change the position of this
switch, etc. It is important to involve the person who made the suggestion in its implementation, as they have
ownership over this idea and will get things done. This process will raise morale and improve reliability. When
you see results, be sure to recognize the person who suggested the improvement.
A BHP mine site tried the above process in Australia. They saw that the
commodity price for iron ore was about to drop, and they needed to do
something in order to avoid having to close mine sites and let staff go. They asked employees for suggestions on
improving the viability of the mine.
• They were amazed (and overwhelmed) at the response
• They received 750 ideas in the first seven days at just one site
• Across the iron ore business, they identified 4700 initiatives
• With the initiatives implemented in the past two years, they have saved A$1.6 billion in recurring costs
One improvement concerned replacing the heavy gear oil in haul trucks: filling 984 liters at 4 liters per minute in
colder months takes a long time. The suggestion was to use the holding tank to heat the oil and hold it closer to
the truck to make it flow faster. The oil now pumps at 50 liters per minute, and the filling time was reduced from
about 4 hours to 20 minutes. This is an example of reducing waste: of time, in this case, and time equals money.
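To see where those numbers come from, here is a small sketch of the arithmetic in Python. The volume and flow rates are the figures quoted above; the function name is only for illustration, and a constant flow rate is assumed.

```python
def fill_time_minutes(volume_liters, flow_liters_per_minute):
    """Time needed to pump a volume of oil at a constant flow rate."""
    return volume_liters / flow_liters_per_minute


# Figures from the haul-truck example
cold_fill = fill_time_minutes(984, 4)     # cold, slow-flowing oil
heated_fill = fill_time_minutes(984, 50)  # heated oil held close to the truck
saved_minutes = cold_fill - heated_fill

print(f"Before: {cold_fill:.0f} min, after: {heated_fill:.0f} min, "
      f"saved: {saved_minutes:.0f} min per fill")
```

At 4 liters per minute the fill takes 246 minutes (a little over 4 hours); at 50 liters per minute it takes just under 20 minutes.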
Some key points: This was not a top-down process. It was bottom-up, and it caught on because people saw
that changes were being made, their voices were heard, and their opinions were respected and acted upon. The
key to their success was empowering the frontline workforce.
Another approach that was used in Australia is the brown-paper process.

They took people from production and maintenance and paid them to come in off-shift for some short meetings.
They covered the walls with brown paper, showing the chart of plant processes, and every time a problem was
identified they added it to the chart with a sticky note. Of course, there were clusters in areas where a lot of
problems existed. They then discussed solutions. This type of communication was effective because people from
different departments and shifts saw how their actions impacted other groups. They could also compare their
strategies. The dialog solved a lot of problems by itself. The visuals helped as well and served as a motivator
during the improvement process. It was also an effective tool to show management. This approach was used by
a dairy company, and they saw their production improve. For a dairy company to increase production, it has to
be able to offer farmers a better price per liter of milk than the competition. Because this company was able to
reduce its costs through improvements made in the plant, it was able to offer a better price. Also, the
morale in the company went up even though there was resistance in the early days.
Here are the steps of the brown-paper process:
• Use the charts to map progress. Keep them in a prominent location

• Add a red sticky note where problems exist
• Replace them with a green sticky note when problems are resolved
• Use the chart as an indicator of progress and a source of motivation
• Invite senior management to visit and see the charts, and have suggestees explain the solutions
The brown-paper process uses small groups at a time. Meetings are repeated so everyone gets an opportunity to
talk. You need a mixture of different trade skills plus operators and supervisors. Hourly staff will have to be paid
to attend, as you may not be able to take them away from their normal job. This is more manageable than an
open “request for ideas” system.
Whatever process you use, the key is that you work with people on the plant floor, utilize their energy, motivate
them, involve them, and get stuff done together.
VENDOR ANALYSIS
Let’s talk about vendor analysis. In the section on PMO, we discussed the fact that we had to scrutinize all
the PMs, including the PMs that come from the OEMs—the vendors. We should also scrutinize our selection
process. If a vendor is aligned with our thinking, in terms of the need for reliability, for acceptance testing, and
for inspections during overhaul, and the idea of condition-based and interval-based maintenance, we should
work with them. They will help us achieve better results in the future. There are a lot of other things to do before
vendor analysis, but keep it in mind in case the opportunity comes up.
This concludes the asset strategy section. This is the phase where we shift from reactive maintenance to world-
class, proactive precision maintenance, which leads to improved reliability and performance. We have to
scrutinize the decisions we’ve already made, and this is what the asset strategy allows us to do—prioritize, use
some logic, and make sure we are focusing where we need to. We will collect data that shows us where we have
poor reliability and which assets are giving us problems. With this strategy, we will achieve better results. Even if
you are already some way along your journey, make sure you have completed these steps.
R-10: Work Management
This module will discuss work management, a broader topic than just planning and scheduling. The goal of the
work management program is to create a streamlined process for managing all maintenance work that reduces
costs (and waste) and ensures jobs are performed efficiently, in harmony with operations.
Work management is a fundamental requirement of a reliability improvement initiative. I do not believe you can
improve reliability without work management: planning and scheduling, as well as the other elements we will
discuss. Work management ensures the right jobs are being done, the jobs are following procedures, etc.
Ultimately, the goal of work management is to get the right people with the right skills doing the job at the right
time, with the right parts, and the right tools, in a safe manner. The job must be efficiently performed with
precision while following all procedures and meeting performance standards. Any discrepancies should be
reported so the next job will go smoothly. Make sure your work management program covers every one of these elements.
WORK MANAGEMENT FLOW

Let’s look at the flow of the process. We start with the asset strategy that we’ve just developed. It will tell us when
to perform inspections, perform CM tasks, look for hidden failures, and perform interval-based maintenance
tasks. We should have a procedure and a schedule in place. We also have the tasks that arise out of the CM
tasks, inspections, and observations. Workers need a way to generate work requests, and we need a way to
process and prioritize them, and then we need to plan the job. We may need certain spares, materials, tools, and
people with the right skills. We should ideally kit the job so the job is ready to go. We look at all job requests and
decide which ones to do, considering criticality and the Pareto analysis. Jobs must be done correctly according to
procedures and the set schedule. At the beginning of the program, it is a good idea to leave room in the schedule
for break-in work. Make sure you have procedures in place for dealing with work requests as they come in, and
for the break-in work. Finally, we have close-out and feedback—making sure the work was done properly with the
right tools, procedures, and materials.
STRATEGY BASED WORK AND WORK REQUESTS

The strategy-based work at the beginning comes from our asset strategy. We also have time-based, production-
run-based, and distance-based tasks that need to be done. We also have to deal with work requests that come
as a result of condition monitoring tests, inspections, and observations. We need a system in place to manage
work requests: recording them, processing them, prioritizing them, and making sure they are done.
ESTABLISHING A PRIORITY SYSTEM

We need a prioritization system. Some jobs are more urgent than others. Among the highest-priority work is
break-in work, followed by maintenance requests of various types, depending on the criticality of the assets
involved and their effect on production. Avoid marking all work as “urgent.” This may result in proactive work
being put off too long. It may be helpful to separate the maintenance team so that one person (or more, when
needed) is always assigned the proactive jobs. These tasks will prevent future failures, and they must be done.
PROCESS REQUESTS
Job requests will come in and they must be processed. These could be time-based jobs that come from the asset
strategy, work requests that result from condition monitoring tests, inspections, or observations by
maintenance/operations, or urgent break-in work.
JOB PLANNING
In the job planning stage, we look at the jobs that have to be done and make a plan so that they get done. Ideally,
planners are experienced in mechanical work and/or electrical work, and a different person would schedule the
jobs. We need people to create procedures detailing how the work should be done, how much time it should
take, the spares and tools needed, and the skills that the workers need. Otherwise, time will be wasted. These
jobs should be done with military precision, have the least impact on operation and production, cost the least
possible amount, and make the greatest use of the mechanical and electrical tradespeople.
Proper job planning saves money and increases safety. With experience, the time needed to plan the jobs will
decrease since we can consult previous job plans and their feedback. We should aim to plan at least 80% of all
jobs (note that break-in work is not classified as planned work). We should also aim for 90% schedule compliance.
Metrics like this give you a guide to how effectively the process is working. Ultimately, the goal is improved
performance via improved reliability, which is partially achieved by following job plans. Each planner/scheduler
should manage 15-20 technicians (30 technicians if planning and scheduling are managed by different people).
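As a rough illustration of how these two metrics might be tracked, here is a minimal Python sketch. The definitions are assumptions for the sake of the example (planned percentage is planned jobs divided by all jobs, with break-in work counted as unplanned; schedule compliance is scheduled jobs completed on schedule divided by all scheduled jobs). The `Job` record and the sample data are hypothetical; your CMMS may define these metrics differently.

```python
from dataclasses import dataclass


@dataclass
class Job:
    planned: bool                # False for break-in work
    scheduled: bool              # True if the job was on the schedule
    completed_on_schedule: bool  # True if it was finished when scheduled


def planned_percentage(jobs):
    """Share of all jobs (including break-in work) that were planned."""
    if not jobs:
        return 0.0
    return 100 * sum(job.planned for job in jobs) / len(jobs)


def schedule_compliance(jobs):
    """Share of scheduled jobs that were completed as scheduled."""
    scheduled = [job for job in jobs if job.scheduled]
    if not scheduled:
        return 0.0
    return 100 * sum(job.completed_on_schedule for job in scheduled) / len(scheduled)


# Hypothetical period: 8 planned jobs (7 finished on schedule) and 2 break-in jobs.
jobs = (
    [Job(True, True, True)] * 7
    + [Job(True, True, False)]
    + [Job(False, False, False)] * 2
)

print(planned_percentage(jobs))   # compare against the 80% target
print(schedule_compliance(jobs))  # compare against the 90% target
```

In this made-up period the plant just reaches the 80% planning target but falls short of 90% schedule compliance (7 of 8 scheduled jobs, or 87.5%).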
What is the normal, day-to-day function of a planner? They are like a stage manager, making sure everything goes
smoothly, and of course they have to manage interruptions. The scheduler deals with production, availability
of job plans, machines, and other resources, and makes things happen. Those are two very different ways of
working, and one person playing both roles will be less efficient.
There may be jobs that can be performed without planning if the task can be done quickly, the steps are well
known (make sure this is true and not a case of the job having consistently been done incorrectly!), and the parts
and tools are available.
WHY PLAN?
Do you think your department is too busy to plan? Planning is actually much more efficient. Take your best
technician(s) out of the pool of people who do maintenance work and have them do the planning and scheduling.
An EPRI study found that with a 40% increase in work planned, utilization increased between 20% and 40%.
Therefore, if the organization has 40 maintenance workers, a 20% increase is equivalent to 8 people. It is like
creating additional resources—and those resources can be doing proactive tasks.
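The arithmetic behind that statement is simple enough to sketch in Python. The workforce size and the 20% to 40% range are the figures quoted above; the function name is just for illustration.

```python
def equivalent_extra_workers(workforce, utilization_gain_percent):
    """Extra capacity, expressed as a number of people, from a utilization gain."""
    return workforce * utilization_gain_percent / 100


low_estimate = equivalent_extra_workers(40, 20)   # lower end of the quoted range
high_estimate = equivalent_extra_workers(40, 40)  # upper end of the quoted range

print(f"Equivalent to {low_estimate:.0f} to {high_estimate:.0f} additional people")
```

For a team of 40, the quoted range works out to the equivalent of 8 to 16 additional people.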
The shorter the P-F Interval of an asset, the more a job is rushed, leaving little time to get the right people and
tools. We might also have to spend more money sourcing the spares and tools.
In contrast, with the asset strategy in place, as well as good condition monitoring, we have a lot more time and
flexibility to plan before there is any risk of the equipment failing. This minimizes the costs, safety risks, and
impact on downtime.
The job plan should include the following:

• The required spares and other materials
• Documentation of the steps required to complete the job
• The tools required
• The need for other equipment
• The skills required by the technician(s)
• The number of technicians required
• All safety procedures, permits, and requirements (e.g., OSHA)

The planner should also create job kits so that all materials and parts are ready for the job to be executed.
JOB SCHEDULING
The scheduler looks at the planned jobs, checks the available equipment with operations, checks the availability
of people skilled to do the work, and schedules the job. The scheduler must coordinate with operations to find
the most convenient time to perform the work. The scheduler must also coordinate with the maintenance
supervisor to determine who should perform the job.
If your plant is still in the reactive maintenance stage, you should strive to plan one day in advance. There will be
a lot of break-in work, but planning will help the other tasks get done as well. Over time, you should be able to
plan work at least one week in advance. The planner should be aware of the work being done. He or she should
periodically leave the office, talk to workers, and get their feedback as to the time allotted for the jobs.
The scheduler:
• Determines the available craft hours
• Compiles a list of jobs
• Determines the remaining craft hours available
• Selects which jobs should be performed based on priority, equipment availability, etc.
• Selects additional jobs in reserve in the event that there is less break-in work
The schedule is then presented to the maintenance manager for approval and reviewed by operations and
production. The scheduler can then prepare the jobs (parts, permits, instructions, etc.).
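The scheduler's steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `PlannedJob` fields, the "1 is most urgent" priority convention, the break-in allowance, and the simple "highest priority first" fill are all hypothetical, and a real schedule would also weigh skills, permits, and production windows.

```python
from dataclasses import dataclass


@dataclass
class PlannedJob:
    name: str
    priority: int                   # assumed convention: 1 = most urgent
    craft_hours: float
    equipment_available: bool = True


def build_schedule(jobs, available_hours, break_in_allowance_hours):
    """Fill the available craft hours by priority and keep a reserve list.

    Hours are held back for break-in work. Jobs that are ready but do not
    fit go on the reserve list, to be used if less break-in work arrives.
    """
    remaining = available_hours - break_in_allowance_hours
    scheduled, reserve = [], []
    for job in sorted(jobs, key=lambda j: j.priority):
        if not job.equipment_available:
            continue  # cannot be scheduled until operations releases the asset
        if job.craft_hours <= remaining:
            scheduled.append(job)
            remaining -= job.craft_hours
        else:
            reserve.append(job)
    return scheduled, reserve, remaining


# Hypothetical week: 40 craft hours, 8 of them held back for break-in work.
jobs = [
    PlannedJob("Align pump P-101", 1, 10),
    PlannedJob("Replace conveyor bearing", 2, 16),
    PlannedJob("Overhaul fan F-3", 2, 4, equipment_available=False),
    PlannedJob("Rebuild gearbox seal", 3, 8),
    PlannedJob("Grease route B", 4, 6),
]
scheduled, reserve, remaining = build_schedule(jobs, 40, 8)
print([job.name for job in scheduled])
print([job.name for job in reserve])
```

In this example the seal rebuild does not fit in the remaining hours, so it goes on the reserve list, while the shorter grease route still fits.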
Time will be wasted if the equipment is not available, or the technician has to hunt for parts or tools, or if they are
unclear about the steps required to perform the job. Maintenance, operations, and schedulers need to work very
closely together to make sure the technicians and the machine are ready at the same time.
JOB EXECUTION
And then the job must be executed. The goal here is to do it once and do it right. All jobs must be performed
safely, with precision, and following documented procedures.
As mentioned earlier, we have to deal with break-in work. Break-in work is urgent, so space must be left in the
schedule to deal with it. Check for pre-made job plans for that work, and do it as efficiently as possible. When
break-in work crops up, we must talk to operators to find out what happened and whether it has happened
before. If it is related to safety, stop and make sure you completely understand the situation. Complete all the
necessary steps required to control risks. Check that you have the manuals, drawings, parts, and tools that you
might need. Notify the appropriate people. We do not want to delay the work, but it is important to take a
moment to consider how to do the job as well and efficiently as possible. Break-in work must still be performed
with precision.
COMMISSIONING
After performing the work, we have to start the machine again. Commissioning
is a series of processes by which equipment is tested to verify its functionality according to its design objectives
or specifications. Do this while starting up the machine. Vibration analysts or other technicians may be involved
here.
The actual requirements for correctly commissioning equipment are beyond the scope of this course. As an
example, a large manufacturer recorded their OEE and saw that there were often problems during and after
shutdowns, partially due to incorrect commissioning. After they implemented correct commissioning processes,
those problems went away.
CLOSE-OUT AND FEEDBACK

After doing the work, we need to learn from what went well and what did not. Close-out and feedback are
essential elements of an effective work management process. The planner can get this information by going out
and talking to people, asking how the job went, and asking how the procedure could be improved.
The following should be performed:

• Enter comments made into equipment history
• Update job instructions for next time
• Track equipment condition and patterns
• Adjust frequency of inspections if required
• Manage spare parts
• Record hours and track maintenance costs
• Adjust job duration estimates if required
ANALYSIS AND PROCESS MANAGEMENT

We take the information that we have learned to continuously improve the process.
SHUTDOWNS, TURNAROUNDS, AND OUTAGES

Shutdowns, turnarounds, and outages are important parts of the reliability improvement picture. The normal
work can be managed in a certain way, but there is a level of urgency during shutdowns that may introduce
other problems into the equipment. With our experience, we should be able to plan this process so that it runs
more smoothly. As we improve reliability, there should be fewer jobs to perform. Also, you need to make sure
contractors follow your procedures so they do not introduce problems. You can do this through supervision,
always using the same contractors, specifying a skill set, etc. Other than this, shutdowns are beyond the scope of
this course.
In conclusion, you cannot succeed with condition monitoring unless you have effective work management. Even if
you know that something has to be done, it will not get done unless it is planned and scheduled. We can improve
reliability and save money if we execute all jobs in an organized, efficient manner. But we need a process and we
must follow that process.
R-11: Spares Management
This module is important for three reasons: we can reduce the costs associated with spares management, we can
make the planning and scheduling process much more efficient, and we can improve reliability by caring for our
spares. Our goals are to reduce maintenance costs, reduce the planner and scheduler’s workload, and increase
equipment reliability by minimizing our spares inventory, taking care of those spares, and ensuring we have quick
access to the right parts and materials when we need them.
Where does it fit into our work management process? When planning the job, we need access to the spares,
materials, and tools in order to work efficiently. A lot of time can be wasted here if these things are unavailable or
we cannot find them. Correctly managing the spares can reduce the planner’s workload by up to 70%.
Spares and materials make up 40-60% of all maintenance costs in most organizations—this amount of money
must be managed correctly. The holding cost alone could be 30% of the purchase price. Is there a tax burden
where you live because you have millions of dollars of inventory sitting around, which are considered assets? We
should only hold the spares that we really need. There is a temptation to keep spares “just in case” or because
you got a deal by buying in bulk—but you may never need those extra spares.
Spares management is important so that we can stop searching for spares that are not there or are in the wrong
place or hidden. They need to be stored in a convenient location so that we do not waste time fetching them. Also,
one person or department cannot keep the spares without the knowledge of others. Another important aspect of
spares management is to prevent the wrong spare from being used (and reused). We also have to make sure that
the spares are in perfect condition when they come off the shelf; they should not be sitting there degrading,
vibrating, and getting wet and dusty.
Inefficient spares management affects maintenance and reliability in many ways, such as technicians having to
wait for spares, search for them, and travel to collect them.
It is important to have an accurate database, control access to the
spares, keep the spares in suitable locations, and have a process to
select which spares will be kept in inventory. We may even sell off some spares to get rid of the liability and
free up space.
ACCURATE DATABASE
All available parts should be entered correctly. The database needs to be maintained so that people have
confidence in it. Otherwise, people will waste time searching for spares, hold spares in their own possession, or
order new spares when there may be spares in the inventory.
Ideally, there will also be an accurate master asset list, or equipment register, that uses a structured naming
system. We don’t want to hold spares for equipment we no longer have. We should also have an accurate bill of
materials.
CONTROL ACCESS
Ensure that spares cannot be taken without the withdrawal being logged in the database. We cannot maintain an accurate
database if people can take spares or other materials without updating the database. To make sure of this,
make it easy to record this information. One option, especially useful for spares that are used frequently, is a
“supermarket” system of searching, accessing, and documenting use of spares and materials. Some companies
have “vending machines” for smaller spares.
Think about the location where the spares are kept. If people have to travel long distances to get spares and
materials, a lot of time (and energy) is wasted. It also creates frustration and means the equipment is down
longer.
SELECTION PROCESS
Entire courses can be taught on the selection process. Establishing a selection process is extremely important.
You need balance: don’t hold too many spares, but make sure the necessary spares are there. We need a way of
understanding what spares we need, particularly for critical equipment. Through our asset strategy process, and
with the bill of materials, and understanding failure modes, we should be able to determine what parts we are
likely to need. Check if any of these parts are also needed by other machines.
It is possible to hold the parts on our shelves, but we could also explain to the vendors and suppliers that we
need fast access to certain parts and spares. They can keep them in stock. Spares will be much easier to manage
once you have reduced break-in work, since you will have a longer warning time.
Critical spares are not just the spares associated with critical assets. Remember that criticality is a combination
of consequence and likelihood of failure, and the likelihood is a combination of reliability and detectability. In
determining critical spares, we have to consider the failure modes, which parts are likely to fail, the likelihood of
that failure occurring, the lead time to failure, and the likelihood of us detecting it.
We also have to consider the probability that our assessment of reliability and detectability is correct. Say we
have a good vibration program and, because of this, we are confident that we will detect a developing fault. But
what if the analyst misses the fault for some reason? Analysts are human. Then we have a problem, because
we assumed we would have enough lead time to order the spares. We have to be realistic about detectability. Are we
taking our measurements on schedule? Are we sure we are taking them frequently enough?
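To make the effect of detectability concrete, here is a minimal Python sketch. It is an illustration only, not a prescribed ranking method: it assumes likelihood can be approximated as the probability of failure multiplied by the probability that the developing fault is missed, and that criticality is consequence multiplied by that likelihood. The function names, the consequence scale, and the probabilities are all hypothetical.

```python
def undetected_failure_likelihood(p_failure, p_detection):
    """Chance that the part fails AND the developing fault was not caught in time.

    Assumed model: the two probabilities are treated as independent.
    """
    return p_failure * (1.0 - p_detection)


def spare_criticality(consequence, p_failure, p_detection):
    """Assumed score: consequence of failure times likelihood of an undetected failure."""
    return consequence * undetected_failure_likelihood(p_failure, p_detection)


# Same part, same consequence (arbitrary units), two views of detectability.
optimistic = spare_criticality(consequence=100, p_failure=0.2, p_detection=0.9)
realistic = spare_criticality(consequence=100, p_failure=0.2, p_detection=0.6)

print(f"Assuming 90% detection: {optimistic:.1f}")
print(f"Assuming 60% detection: {realistic:.1f}")
```

With these made-up numbers, lowering the assumed detection rate from 90% to 60% raises the score fourfold, which is why an overly confident view of the condition monitoring program can leave a critical spare off the shelf.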
Consider which parts will be required if failure occurs: catastrophic versus detected, secondary damage versus
repair/overhaul. Also consider redundancy: Unit B will operate if Unit A fails, but what if Unit B also fails due to a
hidden fault? You must track which spares are critical so that you do not run low. Some spares may be used in
multiple machines, including non-critical machines.
Consider the lead time to accessing needed spares. Consider the ordering time, delivery time, repair time, impact
on production, availability of substitutes, possibility of repairing the existing part, and the option of having the
spare made locally, among other issues. It is extremely important that we get this right—we want to have the
right spares controlled in a database.
CARING FOR SPARES

Make sure spares are being taken care of so that they are fit for purpose when we need them. Let’s say we have
a motor that has passed acceptance testing, is sealed, and is perfect for our machine. What state is it in when
we go to use it, after it has been stored some time? Has it been subject to vibration, so that the bearings are
damaged? Has the seal been broken, so that it is dusty or wet?
Consider the shelf life of your spares. Some parts degrade with time, especially in certain environments.
Are the parts vibrating? This can cause false brinelling of the bearings. It is best to isolate the equipment from
vibration. Another solution is to regularly turn the shaft so that the rolling elements don’t rest in the same place and
so that the lubricant moves around. There are other recommendations for environmental conditions and the like.
We will further discuss lubricant management later. We want to make sure lubricants are not sitting in the rain or
getting dirty. Contamination may even get in through the seal.
We should apply the 5S principle when caring for spares: sort, set in order, shine, standardize, and sustain. The
storeroom needs to be kept clean and organized. Be sure to store the tools properly too.
The bottom line is, it is essential that spares are cared for so they do not degrade. Inventory levels should be
maintained so that spares do not pass their use-by date. The investment in environmentally and vibration-
controlled storage areas is worth it.
In conclusion, we can reduce the holding and purchase costs, improve work efficiency, and improve equipment
reliability. We do this by ensuring that we have the right spares, available when needed, in top condition, without
having too many spares. This is one of the keys to a successful initiative—actually, to a well-run business.
R-12: Precision and Proactive Work
Our goal is to do everything possible to reduce the likelihood of future failures. Do the job correctly the first time
and take steps to eliminate root causes of failure (from a maintenance perspective).
The examples provided in this section mainly deal with rotating machinery, but the principles apply to all
equipment: electrical connections, transformers, steam traps, structures, extruders, spray booths, mobile
vehicles, mining equipment, and more.
PRECISION LUBRICATION
Precision lubrication is key to rotating machinery and hydraulic equipment. This should be one of the areas
you look at as early as possible in your reliability improvement initiative. A huge number of failures arise from
poor lubrication. Use the correct type and volume of lubricant and eliminate all forms of contamination. There
is probably no single action that improves reliability in the plant more than ensuring that machines (especially
bearings and gears) are lubricated correctly.
As the rolling elements travel along the raceway, there is so much pressure at the contact point that the metal
surfaces actually deflect slightly. That is part of the design. As long as the bearing was chosen correctly, it is
designed to withstand the load. But that assumes the correct lubricant is being used in the correct volume. The
lubricant protects the bearing. The
same is true of gearboxes. The load on the teeth as they mesh together should be bearable if they are properly
lubricated.
There is less than 1 μm between the rolling elements and the raceway (and between the gear teeth). If you look
under a microscope, those surfaces are rough. The only thing holding those surfaces apart is the lubricant. The
lubricant is under so much pressure, and the gap is so small, that it can temporarily turn into a solid. Bearings
are designed to last for a very long time if they are used and lubricated correctly. But what if there is
contamination (particles or liquid) or it is the wrong lubricant? That oil or grease has certain chemical
properties, which need to be correct for the application or else there will be wear, drastically reducing the life
of the bearing or gear.

If we have too much lubricant, or liquefaction, or if someone has pumped in too much grease, that grease can
physically damage the rolling elements. It can cause an increase in temperature and/or ooze out and get into the
windings of the motors, for example.
Sometimes, the person chosen to grease the bearing is not chosen for the right reasons. It may be considered
an easy job, so a person is chosen who is capable of walking around and squirting grease into a bearing. It is an
important job, and we should make that job a point of pride for someone who is skilled in it. There are tools we
can use to ensure the right amount of lubricant is being used. We can use ultrasound to listen to the bearing as
grease is being pumped in to make sure there is neither too much nor too little. It is worth the investment, as it is
very easy to over- or under-lubricate the bearings.
Managing Contamination
We need to manage the contamination in grease or oil. It is very difficult to see whether oil has been
contaminated by water until there is way too much.

Oil with 0.1% water still appears normal. However, at 0.1%, we are already down to about 25% of the bearing’s
original life. We will have a 75% reduction in the life of the bearing or gear because there is water in the oil that
we cannot even see.
Where is the water coming from? Water contamination can occur because of wash-downs, poorly sealed
reservoirs, condensation, and breathers. Leaving sealed drums of oil stored outside in the rain can allow water
to “breathe” into the drum as it heats and cools. The oil going into the machine may be already contaminated by
water.
Water in the oil can corrode the bearings and also adversely affect the oil itself. It can cause fluid breakdown,
such as additive precipitation, oil oxidation, acid formation, thickening, varnish, and sludge. Water contamination
can also cause a reduction in lubricating film thickness, leading to accelerated metal surface fatigue.
When bearings fail, be sure to take them out and investigate. If we see corrosion, there is probably water in the
oil, and we need to find out how it is getting in there. In a standby machine, water can pool around the rollers
and cause corrosion.
Particle Contamination
Now let’s look at particle contamination. As mentioned earlier, there is just a 1-μm gap between the bearing and
gear surfaces. What would happen if a hard particle got into that gap? Because the gap is so small, only small
particles get in there. Rolling elements or gear teeth roll over the particle, and the surfaces are damaged. While
the indentation left by the particle is small, the rolling elements keep rolling over it, and it spreads and gets
worse. A big spall in a bearing may have started as a small indentation.
A study was performed on helicopter gearboxes and bearings, using filters of different sizes to clean the oil. They
began with 40-μm filters and moved on to 25-μm filters, with little improvement. The situation was a bit better at
10 μm, but not by much. It was at the 3-μm mark that significant results were seen. Make sure your filter is fine
enough to catch particles of this size. Those are the ones that get into tiny gaps and damage the bearings. Some
companies tried this and found their filters blocked with too many particles. They need to investigate where all
those particles are coming from.
We have to filter the oil before it goes into the machine, because it is probably already contaminated when you
purchase it. Make sure you keep the storage area clean so the oil does not get dirty. We need to eliminate the
ingress during operation (using breathers or sight gauges, for example). Don’t contaminate the oil when samples
are taken. Finally, filter the particles out of the oil in the working gearbox.
Particles can easily be transferred when dirty oil cans are used. Purchasing and using a proper oil container will
not help if you leave it sitting outside and getting dirty.
It is a good idea to color code the drums and the machines that use that oil. We can also use sight glasses to
clearly see the oil level and avoid the need for dipsticks, which can introduce contaminants. There are specialized
products out there to test the oil.
Breathers, Filters, and Cleanliness

Breathers are important because the air in an industrial facility will contain moisture and dust particles—and
maybe worse, depending upon the industry. Old breathers may keep bugs out, but not much else.
Be aware of the way filters work and their limitations. Most filters only remove half the particles of the designated
size. Beware of counterfeit oil filters.
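The "half the particles" figure can be related to the beta ratio that filter suppliers often quote. The beta ratio is not covered in this manual, so treat the sketch below as supplementary: it assumes the usual definition (particles upstream divided by particles downstream at the rated size), and the function name is illustrative only.

```python
def capture_efficiency(beta_ratio: float) -> float:
    """Fraction of particles at the rated size that the filter captures.

    Assumes beta_ratio = particles upstream / particles downstream
    at the rated particle size.
    """
    if beta_ratio < 1:
        raise ValueError("beta ratio must be at least 1")
    return 1.0 - 1.0 / beta_ratio


# A beta ratio of 2 corresponds to removing half the particles.
for beta in (2, 10, 75, 200, 1000):
    print(f"beta = {beta:>4}: captures {capture_efficiency(beta):.1%} at the rated size")
```

When comparing filters, it may be worth asking the supplier for the beta ratio at the particle size you care about rather than relying on the nominal micron rating alone.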
CLEAN AND COOL

Aside from taking care of the lubricants, you need to keep the machine clean and cool. Dirty equipment results in
contamination and increased heat. A general rule of thumb states that for every 10°C hotter a motor runs, the
life of the motor will be cut in half. It can be hard to keep a motor clean in a dirty factory, but if you are having
reliability issues with the motor, excess heat due to dirtiness may be the reason.
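The 10°C rule of thumb can be turned into a quick estimate. The sketch below is only an illustration of that rule; the function name and the example temperatures are assumptions, and real motor life depends on many other factors.

```python
def relative_motor_life(excess_temp_c: float, halving_step_c: float = 10.0) -> float:
    """Remaining life as a fraction of the life at the reference temperature.

    Applies the rule of thumb that life halves for every `halving_step_c`
    degrees Celsius of extra running temperature.
    """
    return 0.5 ** (excess_temp_c / halving_step_c)


for excess in (0, 5, 10, 20, 30):
    print(f"{excess:>2} C hotter: about {relative_motor_life(excess):.0%} of the original life")
```

Under this rule, a motor running 20°C hotter because of a dirt-clogged casing would be expected to last roughly a quarter as long.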
If any spout is tilting upward, rain and dust will get inside the machine.
Even if you work in a harsh, dirty environment, it is important to remove the dust from the motors periodically.
This is something that operators can do.
PRECISION INSTALLATION

Now we will talk about precision installation and precision shaft alignment. Installation has a huge impact on the
reliability of the machine. It is essential that all components are installed correctly, to the correct tolerances.
The standards and procedures discussed in the work management phase will help us here.
A study was done on the reasons bearings fail. Poor handling and installation were identified as a major cause
of bearing faults—approximately 16%, though results will vary. Poor installation can also lead to fatigue. Poor
installation can damage the bearing. In this case, the lip was broken off. In another case, the bearing was
hammered into place, chipping the raceway.
When we replace a bearing, we can look at the wear patterns and see if it was misaligned, skewed, or damaged
when put into place, or placed under excessive load. If the inner race is cocked on the shaft at any angle, it puts
the raceways and the rolling elements under additional load, reducing the life. If the outer race is cocked, the
raceways are also put under load.
PRECISION ALIGNMENT
We have to look at alignment as well, especially when dealing with rotating machinery. Shaft and belt alignment
and soft foot correction are key to ensuring rotating machinery runs smoothly. Misalignment adds significant
load on bearings, gears, couplings, shafts, seals, and other components. Even if a machine looks like it is aligned,
it doesn’t mean it is precision aligned.
We can have an offset misalignment, and it will generate forces with every rotation that will reduce the life
of those components. These repetitive, once-per-revolution loads stress the shaft, the seals, the coupling,
the bearings, and the foundations. If there is a gap—an angle between the two shafts—it is called angular
misalignment, or gap misalignment. This also reduces the life of the machine with each rotation. The patterns on
the bearing will tell us what the case was, but we need to prevent it.
A study was performed in which it was found that the life of a bearing was reduced drastically in relation to the
angle between the shafts. For 10 minutes of a degree, the life was reduced to about 30%. Five minutes reduced
the life by half. What do we mean by minutes? Imagine we have 360 degrees, and a degree is broken up into
60 minutes. So 5/60 of a degree of an angle between the two shafts is enough to reduce the life to one-half of
the expected L10 life of a bearing. There is no way, with a straightedge or even dial indicators, to achieve these
tolerances.
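To put those angles in perspective, the sketch below converts minutes of arc to degrees and to the difference in gap across a coupling face. The 100 mm coupling diameter and the function names are hypothetical, chosen only for illustration.

```python
import math

ARCMIN_PER_DEGREE = 60


def arcmin_to_degrees(arcmin: float) -> float:
    """Convert minutes of arc to degrees (60 minutes in one degree)."""
    return arcmin / ARCMIN_PER_DEGREE


def gap_difference_mm(arcmin: float, coupling_diameter_mm: float) -> float:
    """Difference in face gap across the coupling for a given shaft angle."""
    angle_rad = math.radians(arcmin_to_degrees(arcmin))
    return coupling_diameter_mm * math.tan(angle_rad)


COUPLING_DIAMETER_MM = 100.0  # hypothetical coupling size
for arcmin in (5, 10):
    print(
        f"{arcmin:>2} minutes = {arcmin_to_degrees(arcmin):.4f} degrees, "
        f"gap difference = {gap_difference_mm(arcmin, COUPLING_DIAMETER_MM):.3f} mm"
    )
```

With these assumed dimensions, five minutes of arc is a gap difference of well under a fifth of a millimeter, which is why a straightedge cannot resolve it.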
Just because you have purchased a laser alignment system does not mean you have precision alignment. You
need to set tolerances and make sure the machine feet have proper shims. Your workers need to be motivated
to notice that last little thin shim.
Just because the vibration is not indicating a problem does not mean you have precision alignment.
Misalignment is tricky to diagnose.
The answer is to precision align the machines. Ask those doing the alignment work what alignment tolerances
they are using. Do they consider thermal growth? The machine can go out of alignment when it gets hot. What
do they do if they become bolt bound or base bound? What type of shims do they use, and do they measure the
thickness? How are they moving the machines?
If they are using dial indicators, we need to check the aforementioned issues and see if they handle “bar sag.”
Check for parallax/reading errors and calculation errors.
We need the belts and pulleys to be in alignment with the correct tension on the belts.
Soft Foot
What is soft foot? Ideally, when you put the motor on the base, the base itself is perfectly flat and the four feet of
the motor are on exactly the same plane. Soft foot is the condition where that is not the case. That could be
because the base itself is not flat, the feet are not on the same plane, the feet are bent, they have crud
underneath them, there are too many shims, the shims are bent, etc. Soft foot not only makes it much harder to
achieve precision alignment, but when you tighten the bolts you are actually distorting the frame, and that can
affect the tolerances within. In the case of a motor, it changes the air gap between the stator and rotor.
There may be a high spot on the foundation under one foot. When we tighten the other bolts, we will squeeze
the base down.
PRECISION BALANCING

In this module we will talk about precision balancing. Precision balancing is another pillar of a successful
reliability improvement program. It is a bit tricky, but rotors must be precision balanced, ideally to ISO 1940
G 1.0, to ensure there will be minimal vibration and force exerted on the machine and structure. When a shaft is
balanced, every time it rotates it puts the load on the bearings that the bearings were engineered to take, and
therefore on the structure as well. If there is any amount of unbalance, it moves in a circular motion, adding load.
A rotor that is balanced spins smoothly, and the bearings that were designed to deal with it will run for a very
long time. If we add an unbalanced weight, it creates an unbalanced force that rotates with the rotor. Those
forces increase with the square of the speed, so if we double the speed the forces will increase by a factor of
four. Therefore, machines that operate at higher speeds are more sensitive to unbalance. To balance the rotor,
we do some tests and find out where the unbalanced weight is. If we can remove it, we do so. If not, we add an
equal weight on the opposite side. We can choose to balance it a little, or we can balance it tightly, which is
the G 1.0.
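The speed-squared relationship and the meaning of a balance grade can be sketched numerically. The example below assumes the usual centrifugal force formula (force = mass x radius x angular speed squared) and the standard definition of the balance quality grade (grade in mm/s = permissible specific unbalance x angular speed). The rotor mass, unbalance weight, and speeds are hypothetical, and the current edition of the standard should be checked before setting acceptance limits.

```python
import math


def rpm_to_rad_per_s(rpm: float) -> float:
    return rpm * 2.0 * math.pi / 60.0


def unbalance_force_n(mass_g: float, radius_mm: float, rpm: float) -> float:
    """Rotating force (newtons) from an unbalance mass at a given radius."""
    omega = rpm_to_rad_per_s(rpm)
    return (mass_g / 1000.0) * (radius_mm / 1000.0) * omega ** 2


def permissible_unbalance_g_mm(grade_mm_s: float, rotor_mass_kg: float, rpm: float) -> float:
    """Permissible residual unbalance (g*mm) for a balance grade such as G 1.0."""
    omega = rpm_to_rad_per_s(rpm)
    specific_unbalance = grade_mm_s * 1000.0 / omega  # g*mm per kg of rotor
    return specific_unbalance * rotor_mass_kg


# Hypothetical example: 10 g of unbalance at 150 mm radius.
for rpm in (1500, 3000):
    print(f"{rpm} rpm: force = {unbalance_force_n(10, 150, rpm):.1f} N")

# Hypothetical 50 kg rotor at 3000 rpm, comparing a loose and a tight grade.
for grade in (6.3, 1.0):
    limit = permissible_unbalance_g_mm(grade, 50, 3000)
    print(f"G {grade}: permissible residual unbalance = {limit:.0f} g*mm")
```

Doubling the speed quadruples the force, and the tighter grade allows proportionally less residual unbalance for the same rotor and speed.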
Bearings get pounded with every rotation because they attempt to stop a shaft from moving them in a circular
motion. When the unbalance gets bad enough, the vibration will be high enough for analysts to take notice.
But for all the time the machine has been running before then, the whole structure has been vibrating, possibly
affecting local processes and product quality. We need to precision balance and align the machine, and if it goes
out of balance or alignment, we have to correct it before it gets too bad. A buildup of dirt can cause unbalance, so
the components will need to be cleaned before the machine can be balanced.
We can observe axial motion as well in an unbalanced machine.
Rotors can be balanced “in-situ,” where the rotor never leaves the machine. We take vibration readings and
balance the machine. Another way is to remove the rotor and “shop balance” it. You may be able to do this at
your site; if not, you can send it to be done. There are pros and cons to both methods. No matter what, you need
to set the tolerances and have evidence that those results were achieved.
PRECISION FASTENING
In this module we will talk about fastening. This will mostly deal with rotating machinery, but fastening also
applies to electrical equipment.
There is a right way and a wrong way to fasten two items together. If we apply too much torque, we can damage the
fastener or the joint. Too little torque is also harmful. If we do not bolt things together correctly with the right combination of bolts,
washers, and so on, it causes problems. One may imagine those performing this work are doing it correctly, since
they have been doing it so long, but it is an opportunity to provide training. Be sure to not point fingers at anyone
for how they have been doing it, but just explain the right way and show why.
We can look at obvious issues of bolts failing. When doing electrical connections, we have to have the two
materials bolted together without the washers in between—we need the current to flow straight through.
If bearings are not bolted down tightly, they will rattle around with the vibration. If that machine were perfectly
balanced and aligned, it may just sit there and not rattle even if the hold-down bolts were loose, but in practice,
it is best to keep the bolts tightened to the correct torque. If workers are using torque wrenches, do they know
what the torque should be?
We can have other problems as well due to weakness, cracks, corrosion, or loose bolts that can cause, for
example, a running motor to sway. It increases the vibration and the vibration can lead to failure.
RESONANCE ELIMINATION

Resonance control is slightly related to fastening. Resonance is present in every structure that you deal with.
Resonance, however, is often overlooked by maintenance people, reliability specialists, and even vibration
analysts. Resonance causes amplified vibration. If the machine generates vibration at a frequency which is close
to its natural frequency, that vibration will be amplified, causing more movement. That movement can lead to
cracking, damaged bearings, and failure.
The machine attached to a certain reciprocating compressor experiences resonance, causing the compressor to
sway. This is not a case of weakness or looseness—it’s only because it is operating close to its natural frequency.
If this machine could be operated at a lower or higher speed, it would not vibrate like this.
If you see a machine starting and speeding up, when it gets close to the natural frequency of its support beam,
it will bounce. It takes a driving force to make it bounce, such as a little bit of unbalance. The bouncing will stop
as the speed continues to increase. Sadly, it is not uncommon for the operating speed to be close to the natural
frequency. It is very common for pumps, fans, and other machines. The machine may sway, rock forward and
backward, bounce, twist, or exhibit another mode, none of which are good for the reliability and health of the
equipment. You can solve this problem by adding mass, stiffening the machine, changing the operating speed, or
in another way. But first we have to spot resonance and understand it.
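A very simplified way to see why adding mass, stiffening, or changing speed helps is to treat the machine and its support as a single mass on a spring. This is an idealized model, not something taken from this manual, and the stiffness, mass, damping, and speed values below are hypothetical.

```python
import math


def natural_frequency_hz(stiffness_n_per_m: float, mass_kg: float) -> float:
    """Natural frequency of a single mass on a spring."""
    return math.sqrt(stiffness_n_per_m / mass_kg) / (2.0 * math.pi)


def amplification(frequency_ratio: float, damping_ratio: float) -> float:
    """How much a forced vibration is magnified.

    frequency_ratio = forcing frequency / natural frequency.
    """
    r = frequency_ratio
    return 1.0 / math.sqrt((1.0 - r ** 2) ** 2 + (2.0 * damping_ratio * r) ** 2)


RUNNING_SPEED_HZ = 30.0  # hypothetical: 1800 rpm
DAMPING_RATIO = 0.05     # hypothetical: lightly damped structure

cases = {
    "as installed": (2.0e7, 500.0),
    "stiffened":    (4.0e7, 500.0),
    "mass added":   (2.0e7, 1000.0),
}
for label, (stiffness, mass) in cases.items():
    fn = natural_frequency_hz(stiffness, mass)
    gain = amplification(RUNNING_SPEED_HZ / fn, DAMPING_RATIO)
    print(f"{label:>12}: natural frequency = {fn:5.1f} Hz, amplification = {gain:4.1f}x")
```

In this toy model, stiffening raises the natural frequency and adding mass lowers it. Either change moves it away from the running speed and reduces the amplification, but a real structure has many modes and should be tested before it is modified.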
5S AND THE VISUAL WORKPLACE

In this module we will talk about 5S and the visual workplace. This is part of continuous improvement, but it is
also a proactive step to improve the way we work, improve cleanliness, improve the organization, reduce waste,
and improve efficiency. The 5S principles were mentioned in the spares management section, but they apply to
other areas of the plant. They are sort, set in order, shine, standardize, and sustain. These principles may seem
like common sense, and they are.
There is no point trying to get everything else right if the workshop is disorganized. This is how it should look:
• No materials on floors, except in designated shipping, receiving, and “work in progress” areas, and
especially no materials in aisles

• Floors, racks, and shelves free of dust and debris
• No materials extending outside shelves or bins into the aisles
• No materials on top of or below racks, unless designed for storage
• All boxes aligned with shelves and neatly stacked
• No broken boxes or bins
• No unwrapped bearings, or other parts which depend on their packaging for protection
• The inside of all bins wiped clean
• A place for everything and everything in its place
http://veleda.ca/Basics.html
This is especially important for culture change and morale. A disorganized workspace sends the wrong message
when talking about reliability improvement. Workers should be able to clearly see what the procedures are and
where things go, and they should be able to find what they need and not trip over things. People should not be
allowed to put things in the wrong places. This will not happen overnight, but the process needs to begin. Putting
time into this (or anything) will create time in the future—more time for proactive tasks.
If you have tools hanging up or stored in a convenient way, workers can easily find the tools. It is also easy to see
if tools are missing.
The Visual Workplace

A visual workplace removes confusion and clutter. It ensures everyone knows what is right, what is wrong, and
whether procedures are being followed.
In another module, we showed a picture of a lubrication storage area with color-coded barrels. This is also an
example of 5S in action. It is clean and organized. This system makes human error less likely. We can purchase
products that let us easily see oil levels, in a reservoir for example.
Work instructions and procedures should be clearly displayed and illustrated at the place where that work is
being done. Shelves should be labeled, with arrows and colors for clarity. Add minimum and maximum values,
such as weight limitations. On the floor, mark where things are supposed to go. Don’t rely on people’s memories.
One creative idea used at a plant was to place an angled sign on top of the tool cabinet so that tools could not be
placed on top and workers were more likely to put them away.
Mark gauges with the desired pressure range.
To conclude this discussion of precision and proactive work, every single proactive task you perform will reduce
the likelihood of future failures and will help us break free from the reactive maintenance cycle. Every
maintenance task that is performed with precision also reduces the likelihood of future failures. We have to
create an environment where these tasks can be performed, and where the right way to do things is enforced and
encouraged.
R-13: Condition Monitoring
The goals of our condition monitoring program are to monitor critical equipment to detect when the root
causes of failure exist, and to be forewarned when functional failure will occur. We utilize the condition-based
maintenance (CBM) philosophy to plan maintenance based on the condition of the equipment, not its age.
Where does CBM fit into our work management process? It comes at the beginning as part of our asset strategy.
Condition-based tests will be performed, and if the analysts find problems, they will submit work requests.
BASIC APPROACH
Condition monitoring is taking action based on evidence of reduced health or a root cause that will lead to failure.
We can find this evidence by using performance data, process data (temperature, flow, pressure, etc.), (ideally
nonintrusive) inspections, or observation of any indication of condition.
We need to use criticality analysis to determine whether monitoring is justifiable and whether multiple
technologies (or online systems) should be used. FMEA and RCFA, or common knowledge about common
components, will tell us which technologies are effective for which failure modes. We need to know, as closely as
possible, the P-F Interval, or lead time to failure, of a machine to determine the monitoring rate. In this module,
we will look at a number of technologies that we can use.
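As a rough illustration of how the P-F interval drives the monitoring rate, the sketch below uses a commonly quoted rule of thumb: test at intervals of no more than half the P-F interval. That fraction, the function names, and the 90-day example are assumptions for illustration, not values from this manual.

```python
def max_test_interval_days(pf_interval_days: float, fraction: float = 0.5) -> float:
    """Longest test interval allowed under the assumed rule of thumb."""
    if pf_interval_days <= 0:
        raise ValueError("P-F interval must be positive")
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be between 0 and 1")
    return pf_interval_days * fraction


def worst_case_warning_days(pf_interval_days: float, test_interval_days: float) -> float:
    """Shortest warning we could get if the fault becomes detectable just after a test."""
    return pf_interval_days - test_interval_days


pf_days = 90.0  # hypothetical P-F interval for one failure mode
interval = max_test_interval_days(pf_days)
print(f"Test at least every {interval:.0f} days")
print(f"Worst-case warning before failure: {worst_case_warning_days(pf_days, interval):.0f} days")
```

The worst-case warning time is what has to cover planning, ordering parts, and scheduling the repair, so a shorter test interval may be justified on critical equipment.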
VIBRATION MONITORING
We will start with vibration monitoring. Vibration analysis has been successfully used for many years to detect the
nature and severity of the fault. As a wide range of faults develop, the vibration changes as a result. Challenges
include making sure you are taking the measurement frequently enough, that you are testing in the correct
locations, that you have the settings correctly set, that you are able to detect the changes when analyzing the
data, and that you recognize the fault condition conveyed in the vibration. This is not the easiest technology to
use. It takes training, skill, and experience to do it right. But the systems are becoming easier to use and guidance
is available. Machine learning systems are also becoming more effective. Especially with online monitoring, you
can have software look at the data and at least indicate if there is a significant change. Sometimes, it can even
diagnose the fault, assess the severity, make a recommendation, and send the information to the maintenance
management system so that the work order is generated.
Machines always generate vibration. When we measure vibration, what we are really looking for is change:
increased amplitudes, frequency changes, etc. Different failure modes generate different vibration patterns. If a
fan is not balanced or a pulley is eccentric, the
vibration will change in predictable ways. If the bearings are worn, there is a crack in the outer race, or there is
excessive clearance, the vibration will change in predictable ways. And if two machine components are coupled
together and there is wear in the coupling, or there is looseness, or too much force is being applied, or the shafts
are not aligned properly, the vibration will change in predictable ways.
Measuring Vibration
We have a sensor, which we place on the machine. The vibration is transmitted into that sensor, an electrical
signal goes into our device, and it can be presented to us either as a number that can be compared against
an alarm limit, as a chart, or as a more complicated pattern that we can look at as a time waveform or a spectrum.
There are different levels of complexity. Because a machine vibrates up and down, side to side, and axially, we
might place that sensor in a number of locations, on the motor and the pump, for example, and in all directions.
When selecting a place to put the sensor, we need a good transmission path between the components we are
testing and the sensor.
We also need repeatability, meaning we test the machine in the same way every time. If the machine is running
at a different speed or under a different load, or the sensor is not in the same place, then that will change the
measurement. The analyst is left wondering whether the change was in the machine’s condition or in the test conditions. Vibration
lets us see inside the machine. The vibration from the fan, the bearings, the rotor bars, and the rotor transmits
up through the bearings to where the sensor is. The vibration analysis process can then break that vibration up
according to the different components.
Vibration from the shaft creates a very simple sine wave. This is one frequency. If the machine is going faster,
the cycles bunch together. If there is more vibration, maybe due to unbalance or misalignment, that once-
per-revolution vibration goes up. If, for example, I have twelve blades on the fan, there are twelve cycles per
revolution. If the blades are in good condition, it would be a little bit of vibration. The bearings themselves,
under good conditions, do not generate vibration. If there is a problem, they will generate vibration at a different
frequency. The vibration of all those components added together gives us our raw time waveform.
The Fast Fourier Transform (FFT) takes that waveform and breaks it up into the individual frequencies. The result
is a spectrum, in which the peaks represent those individual frequencies. This is
what vibration analysts do: look at the spectrum to see if the amplitudes of the peaks have changed and diagnose
the problem based on the pattern. The pattern relates to the type of fault. The amplitude relates to the severity.
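The idea of adding the components together and then separating them again can be sketched in a few lines. The example below builds a waveform from a once-per-revolution shaft component plus a twelve-blade component, then uses a plain discrete Fourier transform to recover the peaks. The FFT is simply a much faster way of computing the same result. The speed, amplitudes, and sample settings are hypothetical.

```python
import cmath
import math

SAMPLE_RATE_HZ = 1024   # hypothetical sampling rate
N_SAMPLES = 512         # gives 2 Hz spacing between spectrum lines
SHAFT_HZ = 30.0         # hypothetical running speed (1800 rpm)
BLADES = 12


def make_waveform():
    """Raw time waveform: shaft vibration plus blade-pass vibration."""
    wave = []
    for n in range(N_SAMPLES):
        t = n / SAMPLE_RATE_HZ
        shaft = 1.0 * math.sin(2 * math.pi * SHAFT_HZ * t)
        blades = 0.3 * math.sin(2 * math.pi * BLADES * SHAFT_HZ * t)
        wave.append(shaft + blades)
    return wave


def amplitude_spectrum(wave):
    """Return (frequency_hz, amplitude) pairs up to half the sample rate."""
    n = len(wave)
    spectrum = []
    for k in range(n // 2):
        total = sum(x * cmath.exp(-2j * math.pi * k * i / n) for i, x in enumerate(wave))
        # The 2/n scaling is correct for every line except 0 Hz, which is zero here.
        spectrum.append((k * SAMPLE_RATE_HZ / n, 2 * abs(total) / n))
    return spectrum


spectrum = amplitude_spectrum(make_waveform())
peaks = [(freq, amp) for freq, amp in spectrum if amp > 0.05]
for freq, amp in peaks:
    print(f"peak at {freq:6.1f} Hz, amplitude {amp:.2f}")
```

The two peaks appear at the shaft speed and at twelve times the shaft speed, with the amplitudes that were put into the waveform. A real measurement would also contain noise and many other components.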
Bearings (and gears) can generate unique frequencies, and when they start to fail they generate high frequencies,
which we can detect. The amplitudes are low, so we will need special techniques. We can tell from the frequency
whether the problem is in the outer race, the inner race, or the rolling elements.
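The reason the frequency points to a particular part of the bearing is that each part has its own characteristic frequency, set by the bearing geometry and the shaft speed. The sketch below uses the widely published kinematic formulas, assuming the outer race is stationary and the inner race turns with the shaft. The bearing dimensions are hypothetical; in practice, analysts usually take these values from the bearing manufacturer or from their analysis software.

```python
import math


def bearing_defect_frequencies(
    shaft_hz: float,
    n_elements: int,
    element_dia_mm: float,
    pitch_dia_mm: float,
    contact_angle_deg: float = 0.0,
) -> dict:
    """Characteristic frequencies (Hz), assuming a stationary outer race."""
    ratio = (element_dia_mm / pitch_dia_mm) * math.cos(math.radians(contact_angle_deg))
    return {
        "outer race (BPFO)": 0.5 * n_elements * shaft_hz * (1 - ratio),
        "inner race (BPFI)": 0.5 * n_elements * shaft_hz * (1 + ratio),
        "rolling element (BSF)": (pitch_dia_mm / (2 * element_dia_mm)) * shaft_hz * (1 - ratio ** 2),
        "cage (FTF)": 0.5 * shaft_hz * (1 - ratio),
    }


# Hypothetical bearing: 9 balls, 7.94 mm ball diameter, 39.04 mm pitch diameter.
freqs = bearing_defect_frequencies(shaft_hz=30.0, n_elements=9,
                                   element_dia_mm=7.94, pitch_dia_mm=39.04)
for name, value in freqs.items():
    print(f"{name:>22}: {value:6.1f} Hz ({value / 30.0:.2f} x shaft speed)")
```

These frequencies are generally not whole-number multiples of the shaft speed, which helps distinguish them from unbalance, misalignment, and blade-pass peaks.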
If we look at the raw time waveform of a gear with damaged teeth, we can see spikes in vibration every time they mesh.
We can also use phase analysis—comparing the vibration vertically and horizontally—and we can see if the
motion is circular or elliptical. Phase analysis is a great tool for distinguishing between misalignment, unbalance,
and other problems.
Pros and Cons

The pros of vibration analysis are that it provides great detail on the nature and severity of faults, provides an
early warning of faults, and covers a wide range of fault conditions.
The cons are that it requires training to perform correctly, the sensor must make contact with the machine (this
may be difficult or unsafe in some cases, and a sensor may have to be permanently mounted), the equipment
can be expensive, testing the machine correctly is time consuming, the analysis takes time (the newer software
can cut that down), and accuracy is difficult to achieve—training and experience are needed for that.
ULTRASOUND
Now we will discuss airborne and structure-borne ultrasound. There is a simple way and a more sophisticated
way to approach ultrasound. Basically, the operator can wear headphones, listen to the sound, and look at the
meter that shows the amplitude level. With ultrasound, you are listening to frequencies that are so high that you
cannot normally hear them. The machines (steam traps, bearings, electrical equipment, and other applications)
can generate sound at frequencies above 20 kHz, which is above our ears’ ability to detect. The instrument
transforms the frequencies into ones we can hear, and that is what the operator is listening to. If the sensor
makes contact with the item being tested, it is structure-borne ultrasound.
The instruments also have a readout of the amplitude, which can also be used as a gauge: how “loud” it is, and
how it has changed from last time. The sensors and instruments come in different shapes, sizes, costs, and
capabilities.
Ultrasound can be used to listen to the bearing in order to detect the earliest signs of failure and for signs of poor
lubrication. Ultrasound can be used during the process of greasing a bearing. As grease is being applied, you can
hear the sound change, and you know the bearing has enough grease and you can stop before it has too much.
With airborne ultrasound, the sensor does not make contact with the machine. The operator waves the
instrument in the air to catch the highly directional sound waves (very convenient in a noisy factory). This is
especially helpful in finding leaks.
Different problems have unique sounds. Once a person is familiar with the sounds, they can determine the
problems. Ultrasound can be used in all types of applications: mechanical, electrical, and process. The sounds can
also be recorded and analyzed later.
Pros and Cons

The pros of ultrasound are that it is very versatile, the entry point is relatively inexpensive (depending upon the
instrument you want), the sound is directional, it is complementary to vibration and infrared, and it is good for
operator-driven reliability.
The main con is that ultrasound is highly subjective—it is up to the analyst to decipher the sound. It is qualitative,
not quantitative. Also, it can lull you into a false sense of security, as you may think this technology will solve all
your problems. There are a lot of faults that ultrasound cannot detect.
ELECTRIC MOTOR TESTING

Now we are going to talk about electric motor testing. We can use vibration, ultrasound, and infrared on electric
motors, but there are special tests we can perform on them utilizing measurements of current and voltage.
In addition to a couple of normal mechanical faults, we can detect problems with the windings, insulation, the
rotor, the stator, the power supply, and the connections. We can test the motor when it is running and when it is
not running.
A study was conducted in 1985 and found that 41% of motor failures were due to bearing failures, but 47% were
due to rotor and stator failures. Like all tests, you have to determine whether electric motor testing is necessary.
We can do RCM analysis and determine that there is a failure mode due to broken rotor bars, cracked end
rings, poor power supply, or problems with connections, and decide to use condition-based maintenance to
detect those problems. Or you can look at the history and how the machines are used in your plant.
The way an electrical motor works is we apply voltage to the stator. That creates a rotating magnetic field.
Current is induced in the rotor, which is sitting in the middle of that magnetic field, turning it into a magnet
which is attracted to that rotating magnetic field, and it starts spinning. We apply
three phases of voltage to the motor, and we want that voltage to be smooth
and sinusoidal. If that magnetic interaction between the rotor and the stator is smooth, the current will be
smooth and sinusoidal. However, there are situations when the voltage applied to the motor is not clean. It can
have harmonics. There may be connection problems where the voltage on one of the phases is lower than the
others. There might not be balance between the three phases. If there is a problem with the stator or the rotor—
broken rotor bars, cracked end rings, etc.—then as the rotor turns, it changes the magnetic interaction, which
changes the current. We can see problems with the current signature, and we get sidebands in the spectrum. We
can perform these tests on one phase of the current or on all three, plus analyze the vibration.
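As an illustration of where those sidebands are expected, the sketch below uses the widely published relationship for induction motors with rotor bar problems: sidebands around the line frequency, spaced at twice the slip frequency. The formula and the motor data are supplementary assumptions, not taken from this manual.

```python
def synchronous_speed_rpm(line_hz: float, poles: int) -> float:
    """Speed of the rotating magnetic field."""
    return 120.0 * line_hz / poles


def slip(line_hz: float, poles: int, rotor_rpm: float) -> float:
    """Fraction by which the rotor lags the rotating field."""
    sync = synchronous_speed_rpm(line_hz, poles)
    return (sync - rotor_rpm) / sync


def rotor_bar_sidebands_hz(line_hz: float, poles: int, rotor_rpm: float) -> tuple:
    """Lower and upper sideband frequencies around the line frequency."""
    s = slip(line_hz, poles, rotor_rpm)
    return line_hz * (1 - 2 * s), line_hz * (1 + 2 * s)


# Hypothetical motor: 60 Hz supply, 4 poles, running at 1750 rpm.
lower, upper = rotor_bar_sidebands_hz(60.0, 4, 1750.0)
print(f"synchronous speed: {synchronous_speed_rpm(60.0, 4):.0f} rpm")
print(f"slip: {slip(60.0, 4, 1750.0):.2%}")
print(f"look for sidebands near {lower:.2f} Hz and {upper:.2f} Hz")
```

Because the sidebands sit only a few hertz from the line frequency, the current spectrum needs fine frequency resolution and the motor needs to be under reasonable load for them to be visible.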
MCSA, ESA, and MCA

Motor current signature analysis (MCSA) involves measuring the current on one phase of the motor with a
current clamp (or the flux) while the motor is running, and analyzing that signal with a vibration analyzer. We look
for any distortion.
Electrical signature analysis (ESA) involves the measurement of three-phase voltage and current. We learn about
the power supply (voltage) and the current (the motor itself). This can be done in the motor control center, which
is useful if the motor is in a hazardous environment. Some systems can also detect mechanical faults through the
current.
Motor circuit analysis (MCA) systems test the motor when it is not running. This can be performed on new
motors, motors that have just been rewound, and motors in inventory. This system applies a low voltage to the
connections, and we manually turn the rotor. If everything is in balance, we get a smooth, cyclic pattern. If we do
not get that, there may be problems. This technique measures resistance, inductance, and capacitance, looking
for fluctuations as the shaft turns.
Pros and Cons

Pros of motor testing are that it helps us understand the condition of the rotor, stator, connections, insulation,
and power supply. Off-line and on-line tests can be performed. The motor can be tested remotely.
Cons include safety issues when performing on-line tests, the technique requires training, and you must assess
the ROI and the likelihood of motor failures before justifying the program.
OIL ANALYSIS

Now we will talk about oil analysis and wear particle analysis. By now you know how important oil is to the
machine. It needs to have the correct chemical properties, it needs to be there in the right volume and
viscosity, and it should not be contaminated. That’s what oil analysis allows us to check.
At the microscopic level, the surfaces of bearings and gears are rough. The oil reduces the friction and keeps
the components cool and clean. The machine designers usually specify which oil to use. You may be able
to switch to a better lubricant, but just make sure it is the right one. If
there is contamination or insufficient viscosity, or the surfaces are too close together, we will have wear. As
mentioned before, hard particles damage the surfaces of gears and bearings. Water contamination can cause
corrosion.
Oil analysis has been in active use for many years. Kits and mini-laboratories can be purchased for on-site
testing, and/or oil samples can be sent off-site for testing and analysis. Tests are performed on lubricating oils
(combustion engines and non-combustion rotating machinery) and on hydraulic oils. Analysts are looking for
three main things:
• Check the chemistry of the lubricant to make sure it has its additives and viscosity and make sure it is able
to do its job. Otherwise, we change it (condition-based oil changes)
• Check for contaminants: particles, water, fuel, soot, process material, etc.
• Check for wear: if there is metal-to-metal contact and pieces of metal are being shed, we can use oil
analysis to detect some of those particles
Pros and Cons
The pros of oil analysis are that we can understand the condition of the lubricant, detect contamination of the
lubricant, and detect wear of the lubricated components.
The cons are that it requires an investment to take the samples correctly, a cost is associated with the lab
service, the test results can be complicated (choose your lab carefully), and you must understand the technique’s
limitations.
WEAR PARTICLE ANALYSIS

Wear particle analysis complements the standard oil analysis. Wear particle analysis, or wear debris analysis, is a
very important tool if you have critical gearboxes, oil-lubricated bearings, diesel engines, and other oil-lubricated
components, plus hydraulic fluids. When wear occurs, metal particles are shed as a result. Rubbing, sliding,
particles stripping off pieces of metal, extra load, and friction all create different particles of different sizes,
shapes, and colors. We can tell if there was heat, excessive load, etc.
We can look through a microscope at the oil that was taken (it may need to go through some preparation first) and
see these shapes and deduce what is happening in the machine. We can see fatigue chunks, coal, gear spalling,
red oxides, filter fibers, and so on. Standard oil analysis tests, which reveal elements like iron, lead, and tin,
can usually detect those wear metals only when the particles are under a certain size (say, 10μm). If
they are bigger, the process used to detect them does not work. This is the domain of wear particle analysis.
Pros and Cons
Pros of wear particle analysis are that it allows us to understand the condition of bearings, gears, and hydraulic
components, we can detect sources of contamination, and it can be performed rather inexpensively.
Cons are that it requires an investment to take the oil samples correctly, there is a cost associated with the oil lab
service (if you go that route), and the test results can be complicated.
INFRARED ANALYSIS
In this module we will talk about infrared (IR) analysis (thermography). There are a couple of tools we can use to
measure temperature change. We can use thermal imaging or a simple spot radiometer. As the spot radiometer is
moved around, its laser beam indicates the center of the area it is measuring. The thermal imaging view shows a
color-coded range of temperatures. The further the spot radiometer is from the target, the larger the area
measured. This will decrease the accuracy, as the tool will take the average reading of everything in its range. This is a tool an
operator can use to check the temperature of bearings and other components.
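The effect of distance on a spot radiometer can be sketched with a little arithmetic. Many handheld instruments
quote a distance-to-spot (D:S) ratio. The 12:1 ratio, the function name, and the distances below are illustrative
assumptions, not values from this course, so check the data sheet of your own device.

```python
def spot_diameter_mm(distance_mm: float, ds_ratio: float = 12.0) -> float:
    """Approximate width of the area a spot radiometer averages over.

    ds_ratio is the distance-to-spot ratio of the instrument
    (12:1 is an assumed example, not a recommendation).
    """
    if distance_mm < 0 or ds_ratio <= 0:
        raise ValueError("distance must be >= 0 and ratio must be > 0")
    return distance_mm / ds_ratio


if __name__ == "__main__":
    # Hypothetical distances from the operator to the bearing housing.
    for distance in (300, 600, 1200, 2400):
        width = spot_diameter_mm(distance)
        print(f"{distance} mm away: spot is about {width:.0f} mm wide")
```

If the spot is wider than the component, the reading blends in the background temperature, which is the loss of
accuracy described above.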
Infrared cameras are becoming more affordable. We simply move the camera around to see a thermal image of the
components. However, it is easy to be misled into thinking that something is too hot or too cold due to the settings
of the device and the emissivity of the object. If you have components that are similar to each other and should be the
same temperature, it is easy to see if one is hotter than the other. However, check to make sure the heat is not
simply being reflected off something else. The component itself may not be hot. Something could look hot when
it is actually your reflection bouncing off the surface and coming back to the camera. Wind can affect the reading,
as can reflections from sunlight, humidity changes, etc. This technique looks very simple, but it is easy to make a
mistake and jump to the wrong conclusions.
You can even hook an infrared camera up to a smartphone (remember that you get what you pay for). We can
use infrared for mechanical applications, electrical applications, and others, see the temperature gradient, and
assess whether that temperature is acceptable. If not, what is causing the change? Some cameras combine the
visual and thermal images so you can see the components more clearly.
If you have similar components that should all be the same temperature and one is clearly hotter than the
others, check for a problem.
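As a rough sketch, the comparison of similar components can be written down as a simple rule: take the typical
temperature of the group and flag anything well above it. The component names, the readings, and the 10°C
threshold are all hypothetical; the right threshold depends on the equipment and its load.

```python
from statistics import median


def flag_hot_components(readings_c: dict[str, float],
                        max_delta_c: float = 10.0) -> list[str]:
    """Return components running hotter than the group median by more than max_delta_c."""
    if not readings_c:
        return []
    baseline = median(readings_c.values())
    return sorted(
        name for name, temp in readings_c.items()
        if temp - baseline > max_delta_c
    )


if __name__ == "__main__":
    # Hypothetical readings from three identical fuses in one cabinet.
    readings = {"fuse A": 41.0, "fuse B": 42.5, "fuse C": 63.0}
    print(flag_hot_components(readings))
```

A flagged component is only a prompt to investigate. Reflections and emissivity can still mislead, so confirm
before raising a work request.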
Pros and Cons

A big pro of thermography is its versatility: you can check the air coming from steam traps, see if there is a
buildup of material in pipes or tanks, look for electrical shorts, check for wear on bearings, and more. This
technology is relatively affordable but, again, check the quality of the device. It is easy to use if you know what
you are doing, and it works well with vibration and ultrasound.
The cons are that it is not as simple to use as people think; you really
need to understand the technology to do it right. You also have to
understand the equipment being tested, and its process, to interpret the results and diagnose the faults (or lack
of them). Make sure you also understand thermal properties. Note that heat in mechanical applications usually
indicates a late-stage fault condition. Infrared is a good tool to test electrical components, but you need to be
very careful when doing so. You can install thermal imaging windows (made of a special material that allows the
image to be taken) in cabinets so you do not have to open them.
VISUAL INSPECTIONS
This is the last of the series on condition monitoring, and we will discuss visual inspections, performance
monitoring, and non-destructive testing. By visual inspections, I mean any sort of observation a human can make.
Aside from technology, we should use our eyes, nose, hands (when safe), and ears to detect problems. We can
do this in two ways. First, as part of a preventive maintenance task, a person can go out and perform a visual
inspection of the equipment, ideally looking for something specific. Second, any time we perform a condition
monitoring task (vibration, infrared, or whatever), the technician should perform visual inspections as well. Listen
to the machine. Look around the machine for water, oil, or a steam leak. Are the bolts loose, is there cracking,
or are there rubber particles underneath the coupling? Is there an unusual smell? If it is safe to do so, touch the
bearings. That information can provide work requests and/or help the technician diagnose the problem. A slowed
beat from a motor, for example, can be difficult to detect through vibration analysis, but it is audible.
When you start your reliability program, you should be walking through the plant and using your senses to
check for yourself. Aside from visual inspections of the machines themselves, watch how the technicians are
performing their tasks and make sure everyone is following the 5S guidelines. If we sense something unusual,
such as a burning smell or a hot bearing, we can avoid failure and waste.
Visual inspection can be part of a PM program; however, it is essential that the goal of the inspection is clear and
that information can be recorded rather than checked off. Don’t tell the technician to check the temperature. Tell
them to make sure the temperature is between 25° and 32°C, or to record the temperature. The trouble with logs
is that they often go unchecked. Everyone should be encouraged to record and report observations.
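The difference between "check the temperature" and "record the temperature and compare it with a range" can be
sketched as follows. The 25°C to 32°C limits come from the example above; the asset tag, the class, and the function
names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class InspectionResult:
    asset: str
    reading_c: float
    within_limits: bool


def record_temperature(asset: str, reading_c: float,
                       low_c: float = 25.0, high_c: float = 32.0) -> InspectionResult:
    """Store the actual reading and whether it falls inside the stated range."""
    return InspectionResult(asset, reading_c, low_c <= reading_c <= high_c)


if __name__ == "__main__":
    # Hypothetical pump tag and readings from two inspection rounds.
    for reading in (30.0, 35.5):
        result = record_temperature("P-101", reading)
        status = "OK" if result.within_limits else "REPORT"
        print(f"{result.asset}: {result.reading_c} C -> {status}")
```

Because the reading itself is kept, someone can later trend it, which a simple tick in a box does not allow.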
PERFORMANCE MONITORING
It is easy for us to get wrapped up in the technologies, but we should not forget performance monitoring. This is
part of condition monitoring, both in terms of assessing the condition of the equipment and diagnosing the fault.
We improve reliability to improve performance. A change in performance is an important thing for us to take
note of and we should find the cause.
The performance of plant equipment provides an indication of health and the need for improvement (and it may
explain changes in CM data). By watching the performance of the equipment, we can tell how it is being operated.
We might see, for example, that the pump is being operated outside its proper range. This might result in
recirculation, cavitation, or other problems. By not operating the equipment properly, we are putting it under
additional stress and load, which will impact its reliability. Those working with the machines may not even be
aware that they are not operating properly.
NON-DESTRUCTIVE TESTING
Non-destructive testing (NDT) is a set of noninvasive techniques used to determine the health of the equipment
or to take a measurement. Most of the techniques mentioned so far are non-destructive tests. There are a
number of NDT techniques:
• Magnetic particle testing (MT)
• Liquid penetrant testing (PT)
• Radiographic testing (RT)
• Ultrasonic testing (UT)
• Electromagnetic testing (ET)
• Laser testing methods (LM)
• Leak testing (LT)
The ultrasonic testing listed here is different from the ultrasound we have discussed. Infrared is also NDT. We
want to detect cracks, corrosion, or anything that indicates a developing problem that must be dealt with.
Magnetic particle inspection detects surface and shallow subsurface discontinuities in ferromagnetic
materials. A person applies a magnetic field and then applies fine iron particles, which gather at any discontinuity and reveal the problem.
Liquid penetrant testing is used to find cracks and bad welds. The first step is to clean the part. The second is to
apply the penetrant, which will sit on the surface of the material and go into any cracks that are there. We then
re-clean the surface, leaving the penetrant in the crack. We apply a developer, which brings the penetrant back to
the surface so we can see where the crack is located. Ultraviolet light may be used with a fluorescent penetrant to see the crack.
Radiographic testing involves the use of either x-rays or gamma rays to view the internal structure of a
component. The rays go through to a receiver on the other side that detects the rays, and if there is a crack or
another problem, we can see it. These systems have become more sophisticated, and even automated, so this
technique has become easier.

With ultrasonic testing, we push a high-frequency wave into the material. The wave will bounce off any cracks
and also the other end of the material (allowing us to find out the thickness).
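The thickness measurement follows from simple arithmetic: the wave travels to the far wall and back, so the
thickness is half the round-trip distance. The sketch below assumes a sound velocity of about 5,900 m/s, an
approximate figure for longitudinal waves in steel; the function name and the echo time are hypothetical.

```python
def wall_thickness_mm(round_trip_time_us: float,
                      velocity_m_per_s: float = 5900.0) -> float:
    """Thickness from the echo off the far wall.

    thickness = velocity * round-trip time / 2
    (m/s times microseconds gives micrometres, so multiply by 1e-3 for mm).
    """
    if round_trip_time_us < 0 or velocity_m_per_s <= 0:
        raise ValueError("time must be >= 0 and velocity must be > 0")
    return velocity_m_per_s * round_trip_time_us * 1e-3 / 2


if __name__ == "__main__":
    # Hypothetical echo time of 4 microseconds on a steel plate.
    print(f"{wall_thickness_mm(4.0):.1f} mm")
```

An echo that arrives earlier than the back-wall echo points to a reflector inside the material, such as a crack.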
There is a lot more we can say, but this is just an awareness course. We go into these areas in a lot more depth
in the ARP-Engineer course. In conclusion, the condition monitoring technologies can provide an early warning of
fault conditions, as well as the root causes of future fault conditions. It is important to understand the strengths
and weaknesses of each technology. It is also important to correctly assess which technologies should be used,
and how frequently they should be used, for each asset.
R-14: Breaking Out of the Reactive Maintenance Cycle of Doom
We have talked about several techniques so far for breaking out of the reactive maintenance cycle. Now we will
bring them all together. You will not have success in terms of reliability and performance improvement unless
you can break out of this reactive maintenance cycle. It is hard to break out unless you take specific steps.
What is the reactive maintenance cycle of doom?
We suffer from preventable failures occurring. Resources are taken by those breakdowns, making it difficult
to find time and technicians for proactive work. Because we are in a rush, repairs are performed poorly, or
temporary repairs are done. Therefore, there is a lot of repeat work and no RCFA to determine why the failures
are happening. No action is taken to prevent them from recurring. Some people may realize the cause of the
failures, but their suggestions are pushed aside. There are headcount and budget reductions. Morale in the plant
declines and standards drop. The backlog grows and PMs are missed. As a result, preventable failures occur as
the cycle begins all over again.

We have to break out of that cycle. Let’s discuss six ways to do it.
STEP ONE
First, change the reliability culture. This is not easy because you will have to get everyone’s support, but it is
necessary. Eliminate the resistance and create bottom-up drive.
People need to understand how they benefit if reliability improves. They need to understand how they can
get involved in the improvement process, and finally, they need to be involved in the process—ask for their
suggestions and their opinions on how to implement their suggestions. We discussed the brown-paper review,
which is a way to get people from all departments involved in the process.
Ensure management are on board. To do this, you will need a strong business case, and education helps. Get
some wins on the board: start with a few visible projects, publicize the benefits, and reward the participants.
STEP TWO
Second, find the low-hanging fruit. Which assets are failing the most? Focus your attention where you will gain
the greatest impact.
STEP THREE
The third step is work management. Get more work done by the same team. Do a better job and reduce future
problems. Reduce costs and improve safety.
Take your best person off the tools and have him or her plan and schedule jobs. Plan at least one day in advance.
The planner should not have any other duties (that person’s focus will make everyone else more efficient).
Each planner/scheduler should manage 15-20 maintenance technicians. Planned and scheduled jobs are more
efficient than unplanned jobs. Due to the 20% gain in efficiency, you will gain 8 people on a 40-person team.
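The arithmetic behind those figures can be sketched as below. The 20% gain, the 40-person team, and the 15-20
technicians per planner are the numbers quoted above; the function names are hypothetical, and the gain you
actually achieve will depend on your plant.

```python
import math


def equivalent_extra_people(team_size: int, efficiency_gain_pct: float) -> float:
    """Extra capacity, in people, from an efficiency gain across the team."""
    return team_size * efficiency_gain_pct / 100


def planners_needed(technicians: int, technicians_per_planner: int) -> int:
    """Whole planners required for a given planner-to-technician ratio."""
    if technicians_per_planner <= 0:
        raise ValueError("technicians_per_planner must be > 0")
    return math.ceil(technicians / technicians_per_planner)


if __name__ == "__main__":
    print(equivalent_extra_people(40, 20))   # capacity gained, in people
    print(planners_needed(40, 20))           # at 20 technicians per planner
    print(planners_needed(40, 15))           # at 15 technicians per planner
```

Even after taking planners off the tools, the team comes out ahead on capacity if the efficiency gain is realized.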
Operations must contribute and buy in to the maintenance plan. Mend any fences that need to be mended so
you can all work together. Note that you do not need a fancy computer maintenance management system at this
stage. It may overburden your team. You need a system, but it does not have to be that sophisticated.
STEP FOUR
Fourth, we need communication and cooperation to ensure planned jobs get done and that there is a focus on
adding value.
Maintenance needs to see things through the eyes of operations. Operations is there to produce a product
that generates revenue. None of us exist unless operations provides the product or operates the equipment
to provide the service. Operations gets frustrated when they miss targets due to equipment failure (or to
maintenance).
Likewise, operations needs to see things through the eyes of maintenance. Maintenance needs access to
machines today so they are available (and safe) tomorrow. Operations needs to understand how to operate the
equipment properly so it does not generate future failures.
As part of this, we need efficient morning meetings to lay out the plan. We need cooperation and agreement on
the maintenance plan. Afterward, we need feedback on whether the plan was correct. And we need to institute
standard operating procedures.
STEP FIVE
Fifth, we need to do everything to eliminate the root causes of failure. We really need a laser focus on this.
If we can identify through Pareto analysis which equipment is causing us the most problems, and find out why that
is happening, we can start eliminating the root causes. To do this, we also need to understand criticality, do a bit of
RCFA, and understand through common knowledge that lubrication and shaft alignment, for example, are essential.
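A Pareto analysis of this kind can be sketched in a few lines: rank the assets by downtime and keep going down the
list until the running total reaches a chosen share. The asset tags, the downtime hours, and the 80% cutoff are
hypothetical illustration data, not figures from this course.

```python
def pareto(downtime_hours: dict[str, float],
           cutoff: float = 0.8) -> list[tuple[str, float]]:
    """Return the worst assets, with cumulative share, until cutoff is reached."""
    total = sum(downtime_hours.values())
    if total <= 0:
        return []
    ranked = sorted(downtime_hours.items(), key=lambda item: item[1], reverse=True)
    result = []
    running = 0.0
    for asset, hours in ranked:
        running += hours
        result.append((asset, running / total))
        if running / total >= cutoff:
            break
    return result


if __name__ == "__main__":
    # Hypothetical downtime over the last 12 months, in hours.
    history = {
        "Pump P-101": 120, "Conveyor C-3": 60, "Fan F-7": 10,
        "Mixer M-2": 6, "Compressor K-1": 4,
    }
    for asset, share in pareto(history):
        print(f"{asset}: cumulative {share:.0%} of downtime")
```

In this made-up data set, two of the five assets account for most of the downtime, so that is where the root cause
work would start.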
We need to understand criticality so we can deal with the machine that has been wasting our time, costing a lot
of money, and generating a lot of waste. This will allow us to make substantial improvements within 12 months.
To break out of the reactive maintenance cycle, it is crucial that we assign one person to perform nothing but
proactive work. Take another one of your best, most positive people off the tools for this. Otherwise, the failures
will keep occurring.
Start prioritizing and instituting proactive tasks. Make sure bearings are installed properly, machines are aligned
properly, electrical connections are made properly, etc. Use the PMO technique to eliminate unnecessary PMs. Do
this even if you are not ready or able to do the full RCM at this time.
Follow the 5S system and the visual workplace. Getting things clean
and organized has a psychological benefit and also makes things more
efficient. Organize the storage areas with labels, and the lubricant storage, and make sure we can easily see the
oil levels. Can workers tell what the pressure level should be?
STEP SIX
Sixth, utilize condition monitoring. Initially, we can use basic techniques to detect failures that will occur in the
near future to begin to break the back of reactive maintenance. Feed the planning and scheduling process.
Start with a small program internally, since you may not have the skills or budget for more. Handheld vibration
analysis, ultrasound, or simple IR are good places to start. Alternatively, use outside consultants for more
sophisticated testing and troubleshooting.
You will have to make a decision about which technologies to use. Criticality shows us which machines to test,
and common knowledge will tell us which technologies are best for that equipment and its failure modes.
IF YOU WANT TO STAY IN…

How do you ensure that you don’t break out of the reactive maintenance cycle of doom?
Only reward people for emergency work. Push people to get jobs done quickly. “Git ‘er done; we’ll do it properly
later.” All of these will keep you within that cycle.
You will stay in the cycle if you do these things: Don’t give responsibility to people. Assume you’re the smartest
person around. Control everything; don’t delegate. Don’t ask for suggestions. Ignore suggestions when they are
given. Don’t hold people responsible or accountable.
If you really want to stay in, buy alignment equipment, lubrication systems, etc., but don’t provide training and
don’t address culture change. Defer training if there is budget pressure.
Finally, repeat these mantras: This is the way we have always done things. That’s how things are in industrial
plants. We tried that. That won’t work here.
And make sure you blame operations (who blame maintenance), poor design, old plant, new plant, dirty plant,
clean plant, regulators, OEMs, lack of time, lack of money, old staff, new staff, unions, management, or anyone
else.
Seriously, stop the blame game and commit to making changes and working as a team. If you do not do anything
differently, don’t expect a different result. In conclusion, reliability can be improved in any plant. The culture can
change. The improvements can be sustained. There will be a positive ROI.
R-15: Continuous Improvement
Continuous improvement is an important part of the reliability improvement initiative. We will not get it right on
day one, so we have to improve the program. But you may wonder when the reliability improvement program
ends. The answer is “Never.” If we take the focus off these reliability improvement initiatives, we will slip back into
old habits.
Take a leaf from the “safety” book. Companies do not have a three-year safety program and then finish thinking
about safety. Reliability improvement is a living program.
We continue to look for opportunities to improve or, at the very least, make sure we do not lose ground. We
have to continue to refine and review our understanding of the business, the criticality ranking (it may have
changed due to resolved problems), and the reliability strategy. In another module we assessed our strengths
and weaknesses, and we also looked at the business: what it was trying to achieve, our constraints, our risks, and our
opportunities. However, business changes, as do economic conditions, the availability of capital, the competition,
etc., so we need to keep up.
Continue to perform RCFA to learn from failures and make further improvement. Record KPIs, monitor progress,
and occasionally set new targets (auditing and benchmarking). And continue to educate. People forget things and
lose their awareness of issues, and their skills need to be refreshed. Employees change their positions, and new
employees come in.
We need to continue to communicate the wins and the mistakes—we need to learn from mistakes and be
encouraged by the wins. This is also important so that senior management sees the value in what we are doing
and so that people on the plant floor continue to be energetic.
We need to improve the results achieved and sustain the momentum of the reliability improvement initiative.
KEY PERFORMANCE INDICATORS

Let’s talk about the key performance indicators (KPIs). How do you know if you are making progress unless you
are taking measurements and seeing that they are improving, staying stagnant, or falling?
Measuring and reporting your progress is motivating, for you and everyone else. This also helps us identify
opportunities for improvement. It is especially important to senior management. Don’t assume that senior
management knows about the progress you are making. You have to make sure they know. KPIs help you
measure your performance and enable you to demonstrate that improvements are being made.
There are leading and lagging KPIs. Lagging metrics look backward. They measure the effect your program had
in the past. For example, mean time between failures (MTBF) deals with the failures that you have experienced
in the past. Maintenance costs also deal with things that happened in the past, but they are not necessarily an
indication of what we can expect in the future.
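As a minimal sketch of a lagging KPI, MTBF is commonly calculated as operating time divided by the number of
failures in that period. That definition, the function name, and the numbers below are assumptions for
illustration; agree on the exact equation within your own organization.

```python
from typing import Optional


def mtbf_hours(operating_hours: float, failure_count: int) -> Optional[float]:
    """Mean time between failures over a period, or None if nothing failed."""
    if operating_hours < 0 or failure_count < 0:
        raise ValueError("inputs must not be negative")
    if failure_count == 0:
        return None  # no failures in the period, so MTBF is undefined
    return operating_hours / failure_count


if __name__ == "__main__":
    # Hypothetical pump: 4,000 operating hours and 5 failures last year.
    print(mtbf_hours(4000, 5))
```

The result describes last year only. It says nothing about whether the causes of those failures have been removed.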
Leading KPIs provide an indication of what you can expect to achieve in the future. If our oil is clean, we can
expect fewer failures. If condition monitoring tasks are performed on schedule, we can expect improvement. Also
look at the current number of planned jobs.
What should we measure? We need safety-related KPIs (this might relate to OSHA in the US). There are a number
of maintenance-related KPIs, including availability, utilization, OEE, PM compliance, and schedule compliance. We
spoke about total capacity near the beginning of the course. Those issues that lead to reduced capacity can be
measured with KPIs.
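Two of these KPIs can be sketched with simple ratios. The formulas below (tasks completed on time divided by
tasks scheduled, and uptime divided by uptime plus downtime) are common simplified definitions used here as
assumptions; the function names and figures are hypothetical.

```python
def compliance_pct(completed_on_time: int, scheduled: int) -> float:
    """PM or schedule compliance as a percentage of scheduled tasks."""
    if scheduled <= 0:
        raise ValueError("scheduled must be > 0")
    return 100 * completed_on_time / scheduled


def availability_pct(uptime_hours: float, downtime_hours: float) -> float:
    """Share of the period in which the asset was able to run."""
    total = uptime_hours + downtime_hours
    if total <= 0:
        raise ValueError("total hours must be > 0")
    return 100 * uptime_hours / total


if __name__ == "__main__":
    # Hypothetical month: 45 of 50 PMs done on time, 950 h up, 50 h down.
    print(f"PM compliance: {compliance_pct(45, 50):.1f}%")
    print(f"Availability: {availability_pct(950, 50):.1f}%")
```

Writing the equation down like this is one way to make sure everyone calculates the KPI the same way.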
There are a few “dos” and “don’ts” with KPIs. Be careful what you measure because you tend to get what you
measure.
• People tend to focus on achieving goals, potentially to the exclusion of all else
• You want the KPIs to be indicators of the desired outcome (they should be aligned with your strategy and
what the business wants to achieve)
• Only have KPIs of measurable metrics (and ensure that everyone agrees on the equation)
• Don’t measure too many things
• Balance your metrics—don’t just focus on one area
• Update them at least annually
• Make sure people do not adjust goals or activity just to meet KPIs
REVIEW PROGRAM STRATEGY

We also need to review our program strategy. We will not get it right on day one, but we need to review the
criticality rankings, our asset strategies, our work and spares management program, our targets and KPIs, and
the status of the culture of reliability. Reliability is a long journey and the review process will enable us to take
stock of what has been achieved and what comes next.
CONTINUAL EDUCATION
Continual education is also important in continuous improvement. We provide training so people have the skills
and awareness, and so that they buy in to the program.
Donald Rumsfeld famously said, “There are known knowns; there are things we know that we know. There are
known unknowns; that is to say, there are things that we now know we don’t know. But there are also unknown
unknowns—there are things we do not know we don’t know.” Some people may think they are doing something
correctly, but what they think they know may not be correct.
Are you sure you know what you think you know? And what about everyone else in the organization? This quote
was attributed to Mark Twain, among others: “It ain’t what you don’t know that gets you into trouble. It’s what you
think you know that just ain’t so.”
Hopefully you have learned some things from this course. But think about a course you took in the past: how
much did you remember about the course two days later? Six weeks or six months later? Our memories may
be faulty. We have to refresh our knowledge with training, conferences, books, e-learning, etc.
Our memory is not perfect, and if we don’t always use it, we may lose it. Think of how often people are retrained
in safety topics and procedures. Therefore, we need to refresh our memories and improve our knowledge
frequently.
Managers often worry, what if we train people and they leave? They should say, what if we don’t train them and
they stay? We have problems when people do not understand the technologies, the principles, etc. If you are
worried about people leaving, you may need to pay them more.
In conclusion, reliability improvement is an endless process, just like safety improvement. If you relax for a
minute, the plant may slip back into its old habits and you will once again experience poor performance. It is
therefore essential that you measure, analyze, and communicate continuously.
R-16: Implementation Strategy
This module acknowledges that you now understand the reliability improvement process, developing the asset
strategy, justifying the program, and so on. You are now ready to take the exam if that is your desire. What we
would like to do in this module is tell you about a method that we have of implementing the program. Different
companies have different approaches to implementation.
I briefly mentioned the implementation roadmap, or our “reliability success master plan.” If you follow the steps
and stages on the roadmap, you will be successful.
Let’s address the elephant in the room. How can a single roadmap fit everyone’s situation?
Everyone’s situation is different: different industries, different people, different management, different goals,
different implementations, different ages of plant, different regulations, and so on. But to a large degree,
everyone’s situation is the same. We all need management support to make this work. You must achieve a
culture of reliability. You must understand your organization’s goals. You must know what you are good at, and
you must recognize your weaknesses. You must eliminate the root causes of failure. You must measure and sell
the progress of your program—even though you may be great at your job, you cannot trust that management
will see your value unless you continue to sell yourself.
Most implementations struggle in the same places: trying to focus on technical solutions, focusing too heavily on
reliability analysis (instead of reliability improvement), leaving it to outside consultants, doing a bit here and a bit
there without a plan, trying to force reliability down people’s throats, improving reliability for reliability’s sake, etc.
We developed this implementation roadmap to make sure we do not fall into those traps and that we deal with
all the things that need to be dealt with for a successful strategy.
LAYING THE FOUNDATION

To summarize our strategy, the green blocks on our chart are the foundational blocks. They fuel the engine
that will make all of this work. Then there’s the “go,” as we roll it out. Then we have steps to prevent future
failures from occurring, and we are monitoring what’s going on. At the bottom is transformation: measuring,
communicating, getting people on board, and this is a never-ending process.
Why can’t we just jump to world-class? A speedy approach in too many areas is often an uncoordinated
approach. It just won’t work.
The purple boxes on our chart are all about work management, spares management, CBM, optimized operation, and
defect elimination. Of course we would like to do all that. But you have to get the groundwork right so that you are ready.
We need some information, and we need the time, and we need maintenance under control so that we can
properly plan. We must break out of the reactive maintenance cycle or nothing else will work. To do this, we must
lay the groundwork. These steps are setting us up for success. We need the right support and to get the people
on board.
The beginning is where we determine why the business needs to change. We assess where we are,
understanding the business needs and determining what the gap is. We set the KPIs. We are establishing the
business case. We go on to make sure we have internal support and senior management support.
In order to do that, we develop pilot projects to prove that reliability works. We choose visible projects, get the
right people involved, take on the projects, and measure the benefits. We will use our initial success to refine and
justify the program, get support, and motivate people.

As early as possible, we involve other people. Walking around the plant and talking to people, we ask for
suggestions and opinions, and we identify the positive people. Use the training sessions and meetings to bring them together,
and make sure they know what’s in it for them.
All of this—the communication, measurement, benchmarking, etc.—feeds into the continuous improvement
program.
EXPANDING THE PROGRAM

Once we have senior management support, we can now expand the program. We can establish the steering
committee, formalize the plan, get buy-in on the plan, and define the preliminary asset strategy. When ready, we
make the launch.
To break free of the reactive maintenance cycle of doom, it is necessary to get maintenance under control,
implement spares and work management, employ the 5S strategy, establish relationships, enforce precision
procedures, and perform basic proactive tasks and monitoring. We can then go back to our asset strategy and
make sure we are clear on what is critical. Review your Pareto analysis and the suggestions from the field.
We can now transition to world-class reliability improvement, and we can get to some tasks we were not ready for
at the beginning. Through this process you are reducing waste and costs, and you are improving quality,
throughput, availability of equipment, and the capacity of the plant.
But never forget, you must keep measuring, educating, reinforcing, communicating, justifying, and improving.
In conclusion, sadly, lots of programs fail. It is hard to improve reliability
without guidance. Our master plan is an attempt to provide you some guidance. It is important to learn from
other people’s experience, good and bad. I hope the strategy I have just laid out, as well as the course in general,
has helped and motivated you and shown you what you can achieve if you put the right effort and the right focus
in the right places.

Introduction from Jason Tranter 7

R-01: Getting Started 9

KNOW YOUR GOALS 15

R-02: What Are the Benefits? 17

WHAT DO WE GAIN WITH RELIABILITY? 18

WHAT CAN RELIABILITY HELP US REDUCE? 19

CASE STUDIES IN RELIABILITY 19

R-03: Assessing the Benefits 21

THE BUSINESS PROCESS REVIEW 23

HOW ARE YOU PERFORMING RIGHT NOW? 24

R-04: Culture Change 27

WHY CAN’T YOU JUST FORCE PEOPLE TO CHANGE? 28

THE ROLE OF UNIONS 30

WHAT MESSAGES ARE YOU SENDING? 31

LEARN FROM HISTORY 32

R-05: Selling to Senior Management 33

R-06: Establishing the Strategy 37

SUPPORT FROM HR AND OTHERS 40

R-07: Understanding Failure 41

COMMON FAILURE PATTERNS 43

R-08: Defect Elimination 47

DESIGN FOR RELIABILITY 49

R-09: Asset Strategy 57

WHO SHOULD DEVELOP THE STRATEGY?  61

ANALYZING RELIABILITY DATA 62

INTERPRETING FAILURE MODES 66

ASSET CRITICALITY RANKING 67

PREVENTIVE MAINTENANCE OPTIMIZATION 72

RCM AND FMEA 73

DECIDING TO PERFORM RCM OR FMEA 73

R-10: Work Management 83

STRATEGY BASED WORK AND WORK REQUESTS 84

ESTABLISHING A PRIORITY SYSTEM 85

CLOSE-OUT AND FEEDBACK 89

ANALYSIS AND PROCESS MANAGEMENT  89

SHUTDOWNS, TURNAROUNDS, AND OUTAGES 90

R-11: Spares Management 91

CARING FOR SPARES 94

R-12: Precision and Proactive Work 97

CLEAN AND COOL 101

PRECISION INSTALLATION 102

PRECISION ALIGNMENT 103

PRECISION BALANCING 107

PRECISION FASTENING 109

RESONANCE ELIMINATION 109

5S AND THE VISUAL WORKPLACE 110

R-13: Condition Monitoring 115

VIBRATION MONITORING 116

ELECTRIC MOTOR TESTING 120

OIL ANALYSIS 122

WEAR PARTICLE ANALYSIS 123

INFRARED ANALYSIS 124

VISUAL INSPECTIONS 125

PERFORMANCE MONITORING 126

NON-DESTRUCTIVE TESTING 127

R-14: Breaking Out of the Reactive Maintenance Cycle of Doom 129

STEP TWO 131

STEP THREE 131

STEP FOUR 131

STEP FIVE 131

STEP SIX 133

IF YOU WANT TO STAY IN… 133

Introduction from Jason Tranter 7

R-01: Getting Started 9

KNOW YOUR GOALS 15

R-02: What Are the Benefits? 17

WHAT DO WE GAIN WITH RELIABILITY? 18

WHAT CAN RELIABILITY HELP US REDUCE? 19

CASE STUDIES IN RELIABILITY 19

R-03: Assessing the Benefits 21

THE BUSINESS PROCESS REVIEW 23

HOW ARE YOU PERFORMING RIGHT NOW? 24

R-04: Culture Change 27

WHY CAN’T YOU JUST FORCE PEOPLE TO CHANGE? 28

THE ROLE OF UNIONS 30

WHAT MESSAGES ARE YOU SENDING? 31

LEARN FROM HISTORY 32

R-05: Selling to Senior Management 33

R-06: Establishing the Strategy 37

SUPPORT FROM HR AND OTHERS 40

R-07: Understanding Failure 41

COMMON FAILURE PATTERNS 43

R-08: Defect Elimination 47

DESIGN FOR RELIABILITY 49

R-09: Asset Strategy 57

WHO SHOULD DEVELOP THE STRATEGY? 61

ANALYZING RELIABILITY DATA 62

INTERPRETING FAILURE MODES 66

ASSET CRITICALITY RANKING 67

PREVENTIVE MAINTENANCE OPTIMIZATION 72

RCM AND FMEA 73

DECIDING TO PERFORM RCM OR FMEA 73

R-10: Work Management 83

STRATEGY BASED WORK AND WORK REQUESTS 84

ESTABLISHING A PRIORITY SYSTEM 85

CLOSE-OUT AND FEEDBACK 89

ANALYSIS AND PROCESS MANAGEMENT 89

SHUTDOWNS, TURNAROUNDS, AND OUTAGES 90

R-11: Spares Management 91

CARING FOR SPARES 94

R-12: Precision and Proactive Work 97

CLEAN AND COOL 101

PRECISION INSTALLATION 102

PRECISION ALIGNMENT 103

PRECISION BALANCING 107

PRECISION FASTENING 109

RESONANCE ELIMINATION 109

5S AND THE VISUAL WORKPLACE 110

R-13: Condition Monitoring 115

VIBRATION MONITORING 116

ELECTRIC MOTOR TESTING 120

OIL ANALYSIS 122

WEAR PARTICLE ANALYSIS 123

INFRARED ANALYSIS 124

VISUAL INSPECTIONS 125

PERFORMANCE MONITORING 126

NON-DESTRUCTIVE TESTING 127

R-14: Breaking Out of the Reactive Maintenance Cycle of Doom 129

STEP TWO 131

STEP THREE 131

STEP FOUR 131

STEP FIVE 131

STEP SIX 133

IF YOU WANT TO STAY IN… 133

R-15: Continuous Improvement 135

REVIEW PROGRAM STRATEGY 137

CONTINUAL EDUCATION 137

R-16: Implementation Strategy 141

EXPANDING THE PROGRAM 145