A. Origin of RCM
A Brief History of RCM. Reliability Centered Maintenance originated in the airline industry in the 1960s. By the late 1950s, the cost of maintenance activities in this industry had become high enough to warrant a special investigation into the effectiveness of those activities. Accordingly, in 1960, a task force was formed, consisting of representatives of both the airlines and the FAA, to investigate the capabilities of preventive maintenance. The establishment of this task force subsequently led to the development of a series of guidelines for airlines and aircraft manufacturers to use when establishing maintenance schedules for their aircraft.
This led to the 747 Maintenance Steering Group (MSG) document MSG-1, Handbook: Maintenance Evaluation and Program Development, published by the Air Transport Association in 1968. MSG-1 was used to develop the maintenance program for the Boeing 747, the first maintenance program to apply RCM concepts. MSG-2, the next revision, was used to develop the maintenance programs for the Lockheed L-1011 and the Douglas DC-10. The success of this approach is demonstrated by comparing the maintenance requirements of the DC-8, maintained using standard maintenance techniques, with those of the DC-10, maintained using MSG-2 guidelines. The DC-8 has 339 items that require an overhaul, versus only seven items on the DC-10. As another example, the original Boeing 747 required 66,000 labor hours on major structural inspections before a major heavy inspection at 20,000 operating hours. In comparison, the DC-8, a smaller and less sophisticated aircraft maintained using the standard maintenance programs of the day, required more than 4 million labor hours before reaching 20,000 operating hours.
In 1974 the US Department of Defense commissioned United Airlines to write a report on the processes used in the civil aviation industry to develop maintenance programs for aircraft. This report, written by Stan Nowlan and Howard Heap and published in 1978, was entitled Reliability Centered Maintenance, and it has become the report upon which all subsequent Reliability Centered Maintenance approaches have been based. What Nowlan and Heap found was that many types of failures could not be prevented no matter how intensive the maintenance activities were. Additionally, it was discovered that for many items the probability of failure did not increase with age. Consequently, a maintenance program based on age will have little, if any, effect on the failure rate.
B. Process Oriented Programming
Process-oriented programming is a programming paradigm that separates the concerns of data structures and the concurrent processes that act upon them. The data structures in this case are typically persistent, complex, and large scale: the subject of general-purpose applications, as opposed to the specialized processing of specialized data sets seen in high-performance computing (HPC) applications. The model allows the creation of large-scale applications that partially share common data sets. Programs are functionally decomposed into parallel processes that create and act upon logically shared data.
The paradigm was originally invented for parallel computers in the 1980s, especially computers built
with transputer microprocessors by INMOS, or similar architectures. Occam was an early process-oriented
language developed for the Transputer.
Some derivations have evolved from the message-passing paradigm of Occam to enable uniform efficiency when porting applications between distributed-memory and shared-memory parallel computers. The first such derived example appears in the programming language Ease, designed at Yale University[1][2] in 1990. Similar models have appeared since in the loose combination of SQL databases and object-oriented languages such as Java, often referred to as object-relational models and widely used in large-scale distributed systems today. The paradigm is likely to appear on desktop computers as microprocessors increase the number of processors (multicore) per chip. The Actor model might usefully be described as a specialized kind of process-oriented system in which the message-passing model is restricted to the simple fixed case of one infinite input queue per process, to which any other process can send messages.
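The restricted message-passing style just described (one input queue per process, to which anyone may send) can be sketched in a few lines. This is an illustrative toy using Python threads and queues, not any particular process-oriented language; the `Process` class and the logger example are hypothetical.

```python
import threading
import queue

class Process:
    """A toy actor-style process: one thread, exactly one input queue."""

    def __init__(self, handler):
        self.inbox = queue.Queue()          # the single input queue
        self.handler = handler              # called for each received message
        self.thread = threading.Thread(target=self._run, daemon=True)
        self.thread.start()

    def send(self, msg):
        """Any other process may deposit a message here."""
        self.inbox.put(msg)

    def _run(self):
        while True:
            msg = self.inbox.get()
            if msg is None:                 # sentinel: stop the process
                break
            self.handler(msg)

# Usage: a logger process that accumulates the messages it receives.
received = []
logger = Process(received.append)
logger.send("hello")
logger.send("world")
logger.send(None)
logger.thread.join()
```

Because the queue is the only point of contact, the data each process owns never needs locking by its peers, which is the essential design choice of the paradigm.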
C. Task Oriented Programs
Task-Oriented Programming (TOP) is a programming paradigm for developing interactive multi-user systems. It is exactly what its name suggests: programming with the "task" as the central concept. The idea is that a computer program supports a person or a group of people in accomplishing a certain task, or can autonomously perform some task.
If we can precisely define the task that needs to be accomplished, or can decompose a complex task into smaller, simpler tasks, we should have enough information to generate executable systems that support it. This means that the technical infrastructure of an interactive system (handling events, storing data, encoding/decoding, etc.) can be factored out and solved generically.
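The decomposition idea above can be made concrete with a tiny sketch of task combinators. The names `then` and `both` are purely illustrative inventions for this example; they are not part of any real TOP framework (such as iTasks), which would also generate user interfaces and storage automatically.

```python
def then(*tasks):
    """Sequential composition: each task receives the previous result."""
    def combined(value):
        for task in tasks:
            value = task(value)
        return value
    return combined

def both(task_a, task_b):
    """Parallel composition, modelled here as simple result pairing."""
    def combined(value):
        return (task_a(value), task_b(value))
    return combined

# A "complex task" built from simpler ones: normalise a name, then
# produce both a greeting and the name's length.
normalise = lambda s: s.strip().title()
greet = lambda s: f"Hello, {s}!"
length = lambda s: len(s)

workflow = then(normalise, both(greet, length))
```

The point is that once the task structure is declared, everything else about executing it can in principle be derived mechanically.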
TOP is used for developing applications in which people collaborate closely on the internet to accomplish a common goal. The tasks that need to be done to achieve this goal are described at a very high level of abstraction. This means that one does not need to worry about the technical realization that makes the collaboration possible: the technical realization is generated fully automatically from the abstract description. TOP can therefore be seen as a model-driven approach; the tasks described form a model from which the technical realization is generated.
D. Maintenance Intervals
RCM is used to develop scheduled maintenance plans that will provide an acceptable level of operability, with an acceptable level of risk, in an efficient and cost-effective manner. No aircraft is so tolerant of neglect that it is safe in the absence of an effective inspection and maintenance programme. The processes that affect an aircraft are deterioration with age (e.g. fatigue, wear and corrosion) as well as chance failures (e.g. tyre burst, excess structural loads).
Aircraft maintenance can be defined in a number of ways, and the following definitions may help in understanding its different aspects: "Those actions required for restoring or maintaining an item in a serviceable condition, including servicing, repair, modification, overhaul, inspection and determination of condition." "Maintenance is the action necessary to sustain or restore the integrity and performance of the airplane." "Maintenance is the process of ensuring that a system continually performs its intended function at its designed-in level of reliability and safety."
Well-chosen preventive maintenance intervals can yield significant cost benefits by increasing the availability of a system and reducing the total maintenance costs. The question of how often a task should be performed, however, is important to consider. If the preventive maintenance interval is too short, then the costs associated with preventive maintenance can be too high. If, on the other hand, the interval is too long, then the costs associated with corrective maintenance can be too high. RCM provides calculations to help determine the optimum maintenance interval, based on the probability of occurrence of a failure event and the costs of performing different types of maintenance.
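The interval trade-off described above can be sketched with the classic age-replacement cost model, which is one possible form such a calculation can take (the text does not specify which model RCM tools use). Weibull failure times and the cost figures below are purely illustrative assumptions.

```python
import math

def reliability(t, beta, eta):
    """Weibull survival function R(t) = exp(-(t/eta)**beta)."""
    return math.exp(-((t / eta) ** beta))

def cost_rate(T, beta, eta, c_p, c_f, steps=1000):
    """Expected cost per unit time when replacing preventively at age T.

    c_p: cost of a planned (preventive) replacement.
    c_f: cost of an unplanned (failure) replacement, usually c_f > c_p.
    """
    R_T = reliability(T, beta, eta)
    # Expected cycle length = integral of R(t) from 0 to T (trapezoid rule).
    dt = T / steps
    cycle = sum(
        0.5 * dt * (reliability(i * dt, beta, eta)
                    + reliability((i + 1) * dt, beta, eta))
        for i in range(steps)
    )
    expected_cost = c_p * R_T + c_f * (1.0 - R_T)
    return expected_cost / cycle

def best_interval(beta, eta, c_p, c_f, candidates):
    """Pick the candidate interval with the lowest expected cost rate."""
    return min(candidates, key=lambda T: cost_rate(T, beta, eta, c_p, c_f))
```

Note that for items whose failure rate does not increase with age (shape parameter at or below 1), the cost rate keeps falling as the interval grows, i.e. no finite preventive interval pays off, which matches Nowlan and Heap's finding quoted earlier.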
Maintenance will consist of a mixture of preventive and corrective work, including precautionary work to ensure that there have been no undetected chance failures. There will be inspection to monitor the progress of wear-out processes, in addition to scheduled or preventive work to anticipate and prevent failures, scheduled repair work, and on-condition maintenance. In general terms, for preventive work to be worthwhile, two conditions should be met: the item must be restored to its original reliability after the maintenance action, and the cost of the maintenance action must be less than the cost of the failure it is intended to prevent.
An FMEA is an engineering analysis, done by a cross-functional team of subject matter experts, that thoroughly analyzes product designs or manufacturing processes early in the product development process, and finds and corrects weaknesses before the product gets into the hands of the customer. An FMEA should be the guide to the development of a complete set of actions that will reduce the risk associated with the system, subsystem, component or manufacturing/assembly process to an acceptable level.
Failure Mode and Effects Analysis (FMEA) is a method designed to identify and fully understand potential failure modes and their causes, and the effects of failure on the system or end users, for a given product or process; to assess the risk associated with the identified failure modes, effects and causes, and prioritize issues for corrective action; and to identify and carry out corrective actions to address the most serious concerns.
The primary objectives of FMEA are to identify and prevent safety hazards; minimize loss of product performance or performance degradation; improve test and verification plans (in the case of System or Design FMEAs); improve process control plans (in the case of Process FMEAs); consider changes to the product design or manufacturing process; identify significant product or process characteristics; develop preventive maintenance plans for in-service machinery and equipment; and develop online diagnostic techniques.
There are several types of FMEA, each with its own primary objective. For System FMEAs, the objective is to improve the design of the system. For Design FMEAs, the objective is to improve the design of the subsystem or component. For Process FMEAs, the objective is to improve the design of the manufacturing process.
Why perform Failure Mode and Effects Analysis? Historically, the sooner a failure is discovered, the less it costs; if a failure is discovered late in product development or launch, the impact is exponentially more devastating. FMEA is one of many tools used to discover failure at its earliest possible point in product or process design. Discovering a failure early in product development (PD) using FMEA provides several benefits: multiple choices for mitigating the risk, a higher capability of verification and validation of changes, collaboration between design of the product and the process, improved Design for Manufacturing and Assembly (DFM/A), and lower-cost solutions through the use of legacy knowledge, tribal knowledge, and standard work. Ultimately, this methodology is effective at identifying and correcting process failures early on, so that the costly consequences of poor performance can be avoided.
Failure Mode and Effects Analysis (FMEA) and Failure Modes, Effects and Criticality Analysis (FMECA) are methodologies designed to identify potential failure modes for a product or process, to assess the risk associated with those failure modes, to rank the issues in terms of importance, and to identify and carry out corrective actions to address the most serious concerns. The term "failure mode" combines two words that each have a unique meaning. The Concise Oxford English Dictionary defines "failure" as the act of ceasing to function or the state of not functioning, and "mode" as a way in which something occurs. A "failure mode" is thus the manner in which an item or operation potentially fails to meet or deliver the intended function and its associated requirements; it may include failure to perform a function within defined limits, inadequate or poor performance of the function, intermittent performance of a function, and/or performing an unintended or undesired function. An "effect" is the consequence of the failure on the system or end user. This can be a single description of the effect on the top-level system and/or end user, or three levels of effects (local, next-higher level, and end effect).
For Process FMEAs, consider the effect at the manufacturer or assembly level, as well as at the system or end-user level. There can be more than one effect for each failure mode; however, the FMEA team will typically use the most serious of the end effects for the analysis.
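One common way to turn these ratings into a prioritised action list, not mandated by the text above but widely used in FMEA practice, is the Risk Priority Number (RPN = Severity x Occurrence x Detection, each on a 1-10 scale). The failure modes and ratings below are hypothetical examples.

```python
def rpn(severity, occurrence, detection):
    """Risk Priority Number for one failure mode (each rating 1-10)."""
    for rating in (severity, occurrence, detection):
        if not 1 <= rating <= 10:
            raise ValueError("ratings must be between 1 and 10")
    return severity * occurrence * detection

def prioritise(failure_modes):
    """Sort failure modes so the highest-risk items come first."""
    return sorted(failure_modes, key=lambda fm: rpn(*fm[1:]), reverse=True)

# (name, severity, occurrence, detection) -- illustrative values only
modes = [
    ("seal leak",       7, 4, 3),   # RPN 84
    ("bearing seizure", 9, 2, 6),   # RPN 108
    ("sensor drift",    4, 6, 2),   # RPN 48
]
ranked = prioritise(modes)
```

Teams then work corrective actions from the top of the ranked list downward, re-scoring after each action.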
Severity classification is assigned for each failure mode of each unique item and entered on the FMECA matrix,
based upon system level consequences. A small set of classifications, usually having 3 to 10 severity levels, is
used.
There are two types of criticality analysis: qualitative and quantitative. To use qualitative criticality analysis to evaluate risk and prioritize corrective actions, the analysis team must a) rate the severity of the potential effects of failure and b) rate the likelihood of occurrence for each potential failure mode. It is then possible to compare failure modes via a criticality matrix, which identifies severity on the horizontal axis and occurrence on the vertical axis.
To use quantitative criticality analysis, the analysis team considers the
reliability/unreliability for each item at a given operating time and identifies the portion of the item’s
unreliability that can be attributed to each potential failure mode. For each failure mode, they also rate the
probability that it will result in system failure. The team then uses these factors to calculate a quantitative
criticality value for each potential failure and for each item.
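The quantitative calculation described above can be sketched in the style of MIL-STD-1629A, where a failure mode's criticality is C_m = alpha * beta * lambda_p * t (alpha: fraction of the item's failures taking this mode; beta: probability the mode causes system failure; lambda_p: item failure rate; t: operating time). The valve, its failure rate, and the mode ratios below are illustrative assumptions, not data from the text.

```python
def mode_criticality(alpha, beta, lambda_p, t):
    """Criticality number C_m for a single failure mode."""
    return alpha * beta * lambda_p * t

def item_criticality(modes, lambda_p, t):
    """Item criticality C_r: the sum of its failure-mode criticalities."""
    return sum(mode_criticality(a, b, lambda_p, t) for a, b in modes)

# Hypothetical valve: failure rate 5e-6 failures/hour over a 1000 h mission.
# Modes as (alpha, beta): e.g. 70% of failures are "stuck open", which
# causes system failure 20% of the time; 30% are "leak", 90% critical.
modes = [(0.7, 0.2), (0.3, 0.9)]
c_r = item_criticality(modes, lambda_p=5e-6, t=1000.0)
```

Items are then ranked by C_r (together with their severity class) to decide where corrective action is most urgent.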
C. Fault Tree Analysis
Fault tree analysis (FTA) is a top-down, deductive failure analysis in which an undesired state of a system is analyzed using Boolean logic to combine a series of lower-level events. This analysis method is mainly used in the fields of safety engineering and reliability engineering to understand how systems can fail, to identify the best ways to reduce risk, and to determine (or get a feel for) event rates of a safety accident or a particular system-level (functional) failure. FTA is used in the aerospace, nuclear power, chemical and process, pharmaceutical, petrochemical and other high-hazard industries, but is also used in fields as diverse as risk factor identification relating to social service system failure. FTA is also used in software engineering for debugging purposes, where it is closely related to the cause-elimination technique used to detect bugs.
In aerospace, the more general term "system failure condition" is used for the "undesired state" / top event of
the fault tree. These conditions are classified by the severity of their effects. The most severe conditions require
the most extensive fault tree analysis. These system failure conditions and their classification are often
previously determined in the functional hazard analysis.
Fault tree analysis can be used to: understand the logic leading to the top event / undesired state; show compliance with the (input) system safety / reliability requirements; prioritize the contributors leading to the top event, creating the critical equipment/parts/events lists for different importance measures; monitor and control the safety performance of a complex system (e.g., is a particular aircraft safe to fly when fuel valve x malfunctions, and for how long is it allowed to fly with the valve malfunctioning?); minimize and optimize resources; and assist in designing a system. The FTA can be used as a design tool that helps to create (output / lower-level) requirements, and as a diagnostic tool to identify and correct causes of the top event. It can also help with the creation of diagnostic manuals / processes.
Event symbols are used for primary events and intermediate events. Primary events are not further developed on the fault tree. Intermediate events are found at the output of a gate.
Basic event - failure or error in a system component or element (example: switch stuck in open
position)
External event - normally expected to occur (not of itself a fault)
Undeveloped event - an event about which insufficient information is available, or which is of no
consequence
Conditioning event - conditions that restrict or affect logic gates (example: mode of operation in effect)
Gate symbols describe the relationship between input and output events. The symbols are derived from
Boolean logic symbols:
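As a minimal illustration of how the two basic gates combine event probabilities, here is a sketch that assumes statistically independent basic events (a standard simplifying assumption in FTA). The fuel-supply top event and its probabilities are hypothetical.

```python
def and_gate(*probs):
    """Output occurs only if ALL inputs occur: product of probabilities."""
    result = 1.0
    for p in probs:
        result *= p
    return result

def or_gate(*probs):
    """Output occurs if ANY input occurs: 1 - product of complements."""
    result = 1.0
    for p in probs:
        result *= (1.0 - p)
    return 1.0 - result

# Hypothetical top event: "loss of fuel supply" occurs if the pump fails
# OR both redundant valves fail.
p_pump = 1e-3
p_valve = 1e-2
p_top = or_gate(p_pump, and_gate(p_valve, p_valve))
```

Note how the AND gate over the redundant valves drives their joint contribution down to 1e-4, which is exactly the risk-reduction argument redundancy is meant to make visible in a fault tree.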
Mean time to failure (MTTF) is a basic measure of reliability for non-repairable systems: it denotes the expected time until the first failure of a piece of equipment. MTTF is a statistical value, meant to be the mean over a long period of time and a large number of units. It is closely related to another term, mean time between failures (MTBF). Technically, MTBF should be used only in reference to repairable items, while MTTF should be used for non-repairable items; when MTTF is used as a measure, repair is not an option. In practice, however, MTBF is commonly used for both repairable and non-repairable items.
As a metric, MTTF represents how long a product can reasonably be expected to perform in the field based on
specific testing. It is important to note, however, that the mean time to failure metrics provided by companies
regarding specific products or components may not have been collected by running one unit continuously until
failure. Instead, MTTF data is often collected by running many units, even many thousands of units, for a
specific number of hours.
One of the main situations where terms like MTTF are extremely important is when hardware or other products are used in mission-critical systems, where it becomes valuable to know their general reliability. For non-repairable items, MTTF is a statistic of great interest to engineers and others assessing these components as parts of larger systems.
Mean Time Between Failures (MTBF) is a reliability term used to express the expected operating time between failures of a product, often quoted in hours. It is the most common inquiry about a product's life span, and is important in the decision-making process of the end user. MTBF is more important for industries and integrators than for consumers; most consumers are price driven and will not take MTBF into consideration, nor is the data often readily available. On the other hand, when equipment such as media converters or switches must be installed into mission-critical applications, MTBF becomes very important. In addition, MTBF may be an expected line item in an RFQ (Request For Quote); without the proper data, a manufacturer's piece of equipment could be immediately disqualified.
Mean time between failures (MTBF) is the predicted elapsed time between inherent failures of a mechanical or electronic system during normal system operation. MTBF can be calculated as the arithmetic mean (average) time between failures of a system. The term is used for repairable systems. The definition of MTBF depends on the definition of what is considered a failure: for complex, repairable systems, failures are considered to be those out-of-design conditions which place the system out of service and into a state for repair.
Failures which can be left in an unrepaired condition, or maintained, and which do not place the system out of service, are not considered failures under this definition. In addition, units that are taken down for routine scheduled maintenance or inventory control are not considered within the definition of failure.[3] The higher the MTBF, the longer a system is likely to work before failing.
Mean time between failures (MTBF) describes the expected time between two failures for a repairable system. For example, suppose three identical systems start functioning properly at time 0 and work until they all fail: the first fails at 100 hours, the second at 120 hours and the third at 130 hours. The MTBF of the system is the average of the three failure times, which is 116.667 hours. If the systems were non-repairable, their MTTF would be 116.667 hours.
In general, MTBF is the "up time" between two failure states of a repairable system during operation. For each observation, the "down time" is the instant the system went down, which is after (i.e. greater than) the "up time", the instant it went up. The difference ("down time" minus "up time") is the amount of time the system was operating between those two events. Referring to the figure above, the MTBF of a component is the sum of the lengths of the operational periods divided by the number of observed failures:

MTBF = (sum of operational periods) / (number of observed failures)
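The calculation just described, the sum of the operating periods divided by the number of observed failures, can be sketched directly, using the three-system example from the text as input:

```python
def mtbf(periods):
    """MTBF from observed cycles.

    periods: list of (up_time, down_time) pairs, where down_time is the
    instant the unit failed and up_time the instant it (re)started.
    """
    total_up = sum(down - up for up, down in periods)
    return total_up / len(periods)

# The example from the text: three identical systems all start at time 0
# and fail at 100 h, 120 h and 130 h respectively.
example = [(0.0, 100.0), (0.0, 120.0), (0.0, 130.0)]
result = mtbf(example)   # (100 + 120 + 130) / 3
```

For repairable systems the same function applies per repair cycle of a single unit; for non-repairable units, as noted above, the identical figure would be reported as MTTF.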
Failure Pattern B is known as the wear-out curve: it consists of a low level of random failures, followed by a sharp increase in failures at the end of the item's life. This pattern accounts for approximately 2% of failures.
Failure Pattern E is known as the random pattern: it is a consistent level of random failures over the life of the equipment, with no pronounced increases or decreases related to the age of the equipment. This pattern accounts for approximately 11% of failures.
Failure Pattern F is known as the infant mortality curve: it shows a high initial failure rate followed by a random level of failures. This pattern accounts for 68% of failures.
A. Weibull Distribution
The Weibull distribution. named for its inventor, Waloddi Weibull, this distribution is widely used in reliability
engineering and elsewhere due to its versatility and relative simplicity. The most general expression of the
Weibull pdf is given by the three-parameter Weibull distribution expression,
This is one of the most important aspects of the effect of β on the Weibull distribution. As indicated by the plot, Weibull distributions with β < 1 have a failure rate that decreases with time, also known as infantile or early-life failures. Weibull distributions with β close to or equal to 1 have a fairly constant failure rate, indicative of useful life or random failures. Weibull distributions with β > 1 have a failure rate that increases with time, also known as wear-out failures. These comprise the three sections of the classic "bathtub curve." A mixed Weibull distribution with one subpopulation with β < 1, one subpopulation with β = 1 and one subpopulation with β > 1 would have a failure rate plot that is identical to the bathtub curve. An example of a bathtub curve is shown in the following chart.
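The effect of β on the failure rate can be checked numerically with the two-parameter Weibull hazard function, h(t) = (β/η)·(t/η)^(β−1). This is a sketch with the scale parameter η fixed at 1.0; the sample times and shape values are illustrative.

```python
def weibull_hazard(t, beta, eta=1.0):
    """Instantaneous failure rate of a two-parameter Weibull distribution."""
    return (beta / eta) * (t / eta) ** (beta - 1.0)

# beta < 1: early-life failures -- the rate falls with time.
# beta = 1: useful life -- the rate is constant (exponential case).
# beta > 1: wear-out -- the rate rises with time.
falling  = weibull_hazard(0.5, beta=0.5) > weibull_hazard(2.0, beta=0.5)
constant = weibull_hazard(0.5, beta=1.0) == weibull_hazard(2.0, beta=1.0)
rising   = weibull_hazard(0.5, beta=3.0) < weibull_hazard(2.0, beta=3.0)
```

Summing three such hazards, one from each β regime, is exactly the mixed-population construction of the bathtub curve described above.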
B. Gamma Distribution
In probability theory and statistics, the gamma distribution is a two-parameter family of
continuous probability distributions. The exponential distribution, Erlang distribution, and chi-squared
distribution are special cases of the gamma distribution. There are three different parametrizations in common
use: