
1. RCM (Reliability Centered Maintenance)

A. Origin of RCM

A Brief History of RCM: Reliability Centered Maintenance originated in the airline industry in the 1960s. By the late 1950s, the cost of maintenance activities in this industry had become high enough to warrant a special investigation into the effectiveness of those activities. Accordingly, in 1960, a task force was formed, consisting of representatives of both the airlines and the FAA, to investigate the capabilities of preventive maintenance. The establishment of this task force subsequently led to the development of a series of guidelines for airlines and aircraft manufacturers to use when establishing maintenance schedules for their aircraft.

This led to the 747 Maintenance Steering Group (MSG) document MSG-1; Handbook: Maintenance Evaluation
and Program Development from the Air Transport Association in 1968. MSG-1 was used to develop the
maintenance program for the Boeing 747 aircraft, the first maintenance program to apply RCM concepts. MSG-
2, the next revision, was used to develop the maintenance programs for the Lockheed L-1011 and the Douglas
DC-10. The success of this program is demonstrated by comparing maintenance requirements of a DC-8
aircraft, maintained using standard maintenance techniques, and the DC-10 aircraft, maintained using MSG-2
guidelines. The DC-8 aircraft has 339 items that require an overhaul, versus only seven items on the DC-10. As another example, the original Boeing 747 required 66,000 labor hours on major structural inspections before a major heavy inspection at 20,000 operating hours. In comparison, the DC-8, a smaller and less sophisticated aircraft maintained using the standard programs of the day, required more than 4 million labor hours before reaching 20,000 operating hours.

In 1974 the US Department of Defense commissioned United Airlines to write a report on the processes used in
the civil aviation industry for the development of maintenance programs for aircraft. This report, written by
Stan Nowlan and Howard Heap and published in 1978, was entitled Reliability Centered Maintenance, and has
become the report upon which all subsequent Reliability Centered Maintenance approaches have been based.
What Nowlan and Heap found was that many types of failures could not be prevented no matter how intensive the maintenance activities were. Additionally, it was discovered that for many items the probability of failure did not increase with age. Consequently, a maintenance program based on age will have little, if any, effect on the failure rate.

B. Process Oriented Programs

Process-oriented programming is a programming paradigm that separates the concerns of data structures and
the concurrent processes that act upon them. The data structures in this case are typically persistent, complex,
and large scale - the subject of general purpose applications, as opposed to specialized processing of specialized
data sets seen in high-performance computing (HPC) applications. The model allows the creation of large scale applications
that partially share common data sets. Programs are functionally decomposed into parallel processes that
create and act upon logically shared data.
The paradigm was originally invented for parallel computers in the 1980s, especially computers built
with transputer microprocessors by INMOS, or similar architectures. Occam was an early process-oriented
language developed for the Transputer.
Some derivations have evolved from the message passing paradigm of Occam to enable uniform efficiency when porting applications between distributed memory and shared memory parallel computers. The first such derived example appears in the programming language Ease, designed at Yale University[1][2] in 1990. Similar models have appeared since in the loose combination of SQL databases and object-oriented languages such as Java, often referred to as object-relational models and widely used in large scale distributed systems today.
The paradigm is likely to appear on desktop computers as microprocessors increase the number of processors
(multicore) per chip. The Actor model might usefully be described as a specialised kind of process-oriented
system in which the message-passing model is restricted to the simple fixed case of one infinite input queue per
process to which any other process can send messages.
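
As a rough illustration of the style (a minimal sketch, not drawn from the sources above), the following Python fragment decomposes a small program into two processes that communicate only over explicit channels, in the spirit of Occam-style message passing; the producer/consumer structure and all names are invented for the example.

# Illustrative sketch: two processes sharing no state, communicating over channels.
from multiprocessing import Process, Queue

def producer(channel, n):
    """Send n integers down the channel, then a sentinel marking the end."""
    for i in range(n):
        channel.put(i)
    channel.put(None)  # sentinel: no more data

def consumer(channel, results):
    """Receive integers until the sentinel arrives, then report their sum."""
    total = 0
    while True:
        item = channel.get()
        if item is None:
            break
        total += item
    results.put(total)

if __name__ == "__main__":
    channel, results = Queue(), Queue()
    p = Process(target=producer, args=(channel, 10))
    c = Process(target=consumer, args=(channel, results))
    p.start(); c.start()
    p.join(); c.join()
    print("sum received over the channel:", results.get())  # prints 45
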
C. Task Oriented Programs

Task-Oriented Programming (TOP) is a programming paradigm for developing interactive multi-user systems.
It is exactly what its name says: programming with the "task" as the central concept. The idea is that a computer program supports a person or a group of people in accomplishing a certain task, or can autonomously perform some task.
If we can precisely define the task that needs to be accomplished, or can decompose a complex task into smaller, simpler tasks, we should have enough information to generate executable systems that support it. This means that the technical infrastructure of an interactive system (handling events, storing data, encoding/decoding, etc.) can be factored out and solved generically.

Task Oriented Programming (TOP for short) is a relatively new programming paradigm. It is used for developing applications in which human beings collaborate closely over the internet to accomplish a common goal. The tasks that need to be done to achieve this goal are described at a very high level of abstraction, which means that one does not need to worry about the technical realization that makes the collaboration possible. The technical realization is generated fully automatically from the abstract description. TOP can therefore be seen as a model-driven approach: the tasks described form a model from which the technical realization is generated, as loosely sketched below.
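
The following toy Python sketch is only meant to suggest the flavour of decomposing a job into smaller tasks and composing them; the Task type and the then/both combinators are invented for the example, whereas real TOP frameworks (such as iTasks) generate the full interactive application from such task descriptions.

# Toy sketch of task decomposition and composition (all names are invented).
from typing import Any, Callable

Task = Callable[[Any], Any]   # a task maps an input value to a result value

def then(first: Task, second: Task) -> Task:
    """Sequential composition: do `first`, feed its result into `second`."""
    return lambda value: second(first(value))

def both(left: Task, right: Task) -> Task:
    """Parallel composition: give the same input to both tasks, pair the results."""
    return lambda value: (left(value), right(value))

# Decompose "handle a maintenance request" into smaller tasks (invented example).
enter_request: Task = lambda _: {"item": "fuel valve", "hours": 1200}
assess_risk: Task = lambda req: "high" if req["hours"] > 1000 else "low"
plan_inspection: Task = lambda req: "inspect " + req["item"] + " at next check"

workflow = then(enter_request, both(assess_risk, plan_inspection))
print(workflow(None))   # ('high', 'inspect fuel valve at next check')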

D. Maintenance Intervals

RCM is used to develop scheduled maintenance plans that will provide an acceptable level of operability, with an acceptable level of risk, in an efficient and cost-effective manner. No aircraft is so tolerant of neglect that it is safe in the absence of an effective inspection and maintenance programme. The processes that affect an aircraft are deterioration with age (e.g. fatigue, wear and corrosion) as well as chance failures (e.g. tyre burst, excess structural loads). Aircraft maintenance can be defined in a number of ways, and the following definitions may help in understanding its different aspects: “Those actions required for restoring or maintaining an item in a serviceable condition, including servicing, repair, modification, overhaul, inspection and determination of condition.” “Maintenance is the action necessary to sustain or restore the integrity and performance of the airplane.” “Maintenance is the process of ensuring that a system continually performs its intended function at its designed-in level of reliability and safety.” Well-chosen preventive maintenance intervals can yield significant cost benefits by increasing the availability of a system and reducing the total maintenance costs. The question of how often a task should be performed, however, is important to consider. If the preventive maintenance interval is too short, then the costs associated with preventive maintenance can be too high. If, on the other hand, the interval is too long, then the costs associated with corrective maintenance can be too high. RCM provides calculations to help determine the optimum maintenance interval, based on the probability of occurrence of a failure event and the costs of performing the different types of maintenance; a sketch of one such calculation follows.
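
One widely used formulation (a sketch only, not a formula quoted from this text) is the classic age-replacement cost model: with Weibull-distributed failure times, the expected cost per operating hour of replacing preventively at age T is [C_p * R(T) + C_f * (1 - R(T))] divided by the integral of R(t) from 0 to T, and the optimum interval is the T that minimizes this rate. In the Python sketch below, the Weibull parameters and the cost figures are invented for illustration.

# Minimal sketch of the age-replacement cost model (illustrative numbers only).
# cost_rate(T) = (C_p * R(T) + C_f * (1 - R(T))) / integral_0^T R(t) dt,
# where R(t) = exp(-(t/eta)**beta) is the two-parameter Weibull reliability.
import numpy as np

def weibull_reliability(t, beta, eta):
    """Two-parameter Weibull reliability (survival) function R(t)."""
    return np.exp(-(t / eta) ** beta)

def cost_rate(T, beta, eta, cost_preventive, cost_failure, steps=2000):
    """Expected maintenance cost per operating hour when replacing at age T."""
    t = np.linspace(0.0, T, steps)
    R = weibull_reliability(t, beta, eta)
    dt = t[1] - t[0]
    expected_cycle_length = np.sum((R[:-1] + R[1:]) * 0.5) * dt  # trapezoid rule
    expected_cycle_cost = (cost_preventive * R[-1]               # survived to T
                           + cost_failure * (1.0 - R[-1]))       # failed before T
    return expected_cycle_cost / expected_cycle_length

if __name__ == "__main__":
    beta, eta = 2.5, 1000.0        # wear-out behaviour (beta > 1), hours (assumed)
    cp, cf = 1000.0, 10000.0       # preventive vs. corrective cost (assumed)
    candidates = np.arange(100.0, 2000.0, 10.0)
    rates = [cost_rate(T, beta, eta, cp, cf) for T in candidates]
    best = candidates[int(np.argmin(rates))]
    print(f"approximate optimum preventive interval: {best:.0f} hours")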

Maintenance will consist of a mixture of preventive and corrective work, including precautionary work to ensure that there have been no undetected chance failures. There will be inspection to monitor the progress of wear-out processes, in addition to scheduled or preventive work to anticipate and prevent failures, scheduled repair work, and on-condition maintenance. In general terms, for preventive work to be worthwhile, two conditions should be met: the item must be restored to its original reliability after the maintenance action, and the cost of the maintenance action must be less than the cost of the failure it is intended to prevent.

2. Failure Models and Measurement of Reliability

A. Failure Mode and Effects Analysis (FMEA)

An FMEA is an engineering analysis done by a cross-functional team of subject matter experts that thoroughly analyzes product designs or manufacturing processes early in the product development process, in order to find and correct weaknesses before the product gets into the hands of the customer. An FMEA should be the guide to the development of a complete set of actions that will reduce the risk associated with the system, subsystem, component or manufacturing/assembly process to an acceptable level.

Failure Mode and Effects Analysis (FMEA) is a method designed to: identify and fully understand potential failure modes and their causes, and the effects of failure on the system or end users, for a given product or process; assess the risk associated with the identified failure modes, effects and causes, and prioritize issues for corrective action; and identify and carry out corrective actions to address the most serious concerns.

The primary objectives of FMEA are to identify and prevent safety hazards, minimize loss of product performance or performance degradation, improve test and verification plans (in the case of System or Design FMEAs), improve process control plans (in the case of Process FMEAs), consider changes to the product design or manufacturing process, identify significant product or process characteristics, develop preventive maintenance plans for in-service machinery and equipment, and develop online diagnostic techniques.

The types of FMEA and their primary objectives are as follows: for System FMEAs, the objective is to improve the design of the system; for Design FMEAs, the objective is to improve the design of the subsystem or component; and for Process FMEAs, the objective is to improve the design of the manufacturing process. Why perform Failure Mode and Effects Analysis (FMEA)? Historically, the sooner a failure is discovered, the less it costs; if a failure is discovered late in product development or launch, the impact is exponentially more severe. FMEA is one of many tools used to discover failure at its earliest possible point in product or process design. Discovering a failure early in Product Development (PD) using FMEA provides the benefits of multiple choices for mitigating the risk, higher capability of verification and validation of changes, collaboration between design of the product and the process, improved Design for Manufacturing and Assembly (DFM/A), lower-cost solutions, and better use of legacy knowledge, tribal knowledge and standard work. Ultimately, this methodology is effective at identifying and correcting process failures early on, so that the consequences of poor performance can be avoided.
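
In common FMEA practice, issues are often prioritized with a Risk Priority Number (RPN), the product of Severity, Occurrence and Detection ratings on 1-10 scales. The following Python sketch illustrates that ranking; the failure modes and the ratings assigned to them are invented for illustration.

# Illustrative sketch: ranking FMEA worksheet rows by RPN = S x O x D.
from dataclasses import dataclass

@dataclass
class FailureMode:
    item: str
    mode: str
    severity: int     # 1 (no effect) .. 10 (hazardous without warning)
    occurrence: int   # 1 (remote) .. 10 (very high)
    detection: int    # 1 (almost certain detection) .. 10 (cannot detect)

    @property
    def rpn(self) -> int:
        return self.severity * self.occurrence * self.detection

worksheet = [
    FailureMode("fuel valve", "sticks closed", severity=9, occurrence=3, detection=4),
    FailureMode("tyre", "burst on landing", severity=8, occurrence=2, detection=7),
    FailureMode("avionics fan", "bearing wear", severity=4, occurrence=6, detection=3),
]

# Highest RPN first: these rows are the first candidates for corrective action.
for fm in sorted(worksheet, key=lambda f: f.rpn, reverse=True):
    print(f"{fm.item:12s} {fm.mode:18s} RPN = {fm.rpn}")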

B. Failure Mode Effect and Criticality Analysis

Failure Mode and Effect Analysis (FMEA) and Failure Modes, Effects and Criticality Analysis (FMECA) are
methodologies designed to identify potential failure modes for a product or process, to assess the risk
associated with those failure modes, to rank the issues in terms of importance and to identify and carry out
corrective actions to address the most serious concerns. The term “failure mode” combines two words that
both have unique meanings. The Concise Oxford English Dictionary defines the word “failure” as the act of
ceasing to function or the state of not functioning. “Mode” is defined as a way in which something occurs. A “failure mode” is the manner in which the item or operation potentially fails to meet or deliver the intended function and its associated requirements; this may include failure to perform a function within defined limits, inadequate or poor performance of the function, intermittent performance of a function, and/or performing an unintended or undesired function. An “effect” is the consequence of the failure on the system or
end user. This can be a single description of the effect on the top level system and/or end user, or three levels
of effects (local, next-higher level, and end effect).
For Process FMEAs, consider the effect at the manufacturing or assembly level, as well as at the system or end-user level. There can be more than one effect for each failure mode; however, the FMEA team will typically use the most serious of the end effects for the analysis.
Severity classification is assigned for each failure mode of each unique item and entered on the FMECA matrix,
based upon system level consequences. A small set of classifications, usually having 3 to 10 severity levels, is
used.

There are two types of criticality analysis: qualitative and quantitative.

 To use qualitative criticality analysis to evaluate risk and prioritize corrective actions, the analysis team must a) rate the severity of the potential effects of failure and b) rate the likelihood of occurrence for each potential failure mode. It is then possible to compare failure modes via a Criticality Matrix, which identifies severity on the horizontal axis and occurrence on the vertical axis.

 To use quantitative criticality analysis, the analysis team considers the reliability/unreliability of each item at a given operating time and identifies the portion of the item’s unreliability that can be attributed to each potential failure mode. For each failure mode, they also rate the probability that it will result in system failure. The team then uses these factors to calculate a quantitative criticality value for each potential failure mode and for each item, as sketched below.
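
A minimal sketch of that quantitative calculation, in the style of MIL-STD-1629A (the failure rates, mode ratios and operating time below are invented for illustration): each mode criticality is the product of the part failure rate, the fraction of the part's failures attributable to that mode, the conditional probability that the mode leads to system failure, and the operating time, and the item criticality is the sum over the item's modes.

# Illustrative sketch of quantitative criticality; all numbers are invented.
# Mode criticality: C_m = beta * alpha * lambda_p * t
# Item criticality: C_r = sum of C_m over the item's failure modes

def mode_criticality(beta, alpha, lambda_p, t):
    """beta: P(system failure | mode occurs); alpha: failure mode ratio;
    lambda_p: part failure rate (failures/hour); t: operating time (hours)."""
    return beta * alpha * lambda_p * t

pump_modes = [
    # (description, beta, alpha, lambda_p, t) -- values invented for the example
    ("seal leak",        0.3, 0.5, 2.0e-5, 1000.0),
    ("bearing seizure",  1.0, 0.3, 2.0e-5, 1000.0),
    ("impeller erosion", 0.1, 0.2, 2.0e-5, 1000.0),
]

item_criticality = 0.0
for name, beta, alpha, lam, t in pump_modes:
    c_m = mode_criticality(beta, alpha, lam, t)
    item_criticality += c_m
    print(f"{name:18s} C_m = {c_m:.5f}")
print(f"item criticality C_r = {item_criticality:.5f}")
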
C. Fault Tree Analysis
Fault tree analysis (FTA) is a top-down, deductive failure analysis in which an undesired state of a system is analyzed using Boolean logic to combine a series of lower-level events. This analysis method is mainly used in the fields of safety engineering and reliability engineering to understand how systems can fail, to identify the best ways to reduce risk, or to determine (or get a feeling for) the event rates of a safety accident or a particular system-level (functional) failure. FTA is used in the aerospace, nuclear power, chemical and process, pharmaceutical, petrochemical and other high-hazard industries, but it is also used in fields as diverse as risk factor identification relating to social service system failure. FTA is also used in software engineering for debugging purposes and is closely related to the cause-elimination technique used to detect bugs.
In aerospace, the more general term "system failure condition" is used for the "undesired state" / top event of
the fault tree. These conditions are classified by the severity of their effects. The most severe conditions require
the most extensive fault tree analysis. These system failure conditions and their classification are often
previously determined in the functional hazard analysis.
Fault tree analysis can be used to: understand the logic leading to the top event / undesired state; show compliance with the (input) system safety / reliability requirements; prioritize the contributors leading to the top event, creating the critical equipment/parts/events lists for different importance measures; monitor and control the safety performance of the complex system (e.g., is a particular aircraft safe to fly when fuel valve x malfunctions? For how long is it allowed to fly with the valve malfunction?); minimize and optimize resources; assist in designing a system (the FTA can be used as a design tool that helps to create output / lower-level requirements); and function as a diagnostic tool to identify and correct causes of the top event (it can help with the creation of diagnostic manuals / processes).

Event symbols are used for primary events and intermediate events. Primary events are not further developed on the fault tree. Intermediate events are found at the output of a gate.

The primary event symbols are typically used as follows:

 Basic event - failure or error in a system component or element (example: switch stuck in open
position)
 External event - normally expected to occur (not of itself a fault)
 Undeveloped event - an event about which insufficient information is available, or which is of no
consequence
 Conditioning event - conditions that restrict or affect logic gates (example: mode of operation in effect)

Gate symbols describe the relationship between input and output events. The symbols are derived from
Boolean logic symbols:

The gates work as follows:

 OR gate - the output occurs if any input occurs.
 AND gate - the output occurs only if all inputs occur (inputs are independent).
 Exclusive OR gate - the output occurs if exactly one input occurs.
 Priority AND gate - the output occurs if the inputs occur in a specific sequence specified by a conditioning event.
 Inhibit gate - the output occurs if the input occurs under an enabling condition specified by a conditioning event.
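
Assuming the basic events are independent, these gates combine probabilities in a simple way: for an AND gate the output probability is the product of the input probabilities, and for an OR gate it is one minus the product of their complements. The following Python sketch evaluates a small, invented fault tree (two redundant fuel pumps plus a shared filter) on that basis.

# Minimal sketch: evaluating a small fault tree with independent basic events.
# P(AND) = product of the P(i);  P(OR) = 1 - product of (1 - P(i)).
from math import prod

def p_and(*probs):
    """AND gate: the output occurs only if all inputs occur."""
    return prod(probs)

def p_or(*probs):
    """OR gate: the output occurs if any input occurs."""
    return 1.0 - prod(1.0 - p for p in probs)

# Basic event probabilities per flight hour (invented for the example).
p_pump_a_fails = 1e-3
p_pump_b_fails = 1e-3
p_filter_clogs = 1e-4

# Top event: loss of fuel flow = (both pumps fail) OR (filter clogs).
p_top = p_or(p_and(p_pump_a_fails, p_pump_b_fails), p_filter_clogs)
print(f"P(top event) = {p_top:.2e}")   # about 1.01e-04
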
D. Mean Time to Failure (MTTF)

Mean time to failure (MTTF) denotes the expected time to failure for a non-repairable system and is a basic measure of reliability for such systems. It is the mean time expected until the first failure of a piece of equipment. MTTF is a statistical value and is meant to be the mean over a long period of time and a large number of units. Technically, MTBF should be used only in reference to repairable items, while MTTF should be used for non-repairable items; however, MTBF is commonly used for both. Mean time to failure is closely related to mean time between failures (MTBF): the difference is that MTBF is used for products that can be repaired and returned to use, whereas MTTF is used for non-repairable products. When MTTF is used as a measure, repair is not an option.
As a metric, MTTF represents how long a product can reasonably be expected to perform in the field based on
specific testing. It is important to note, however, that the mean time to failure metrics provided by companies
regarding specific products or components may not have been collected by running one unit continuously until
failure. Instead, MTTF data is often collected by running many units, even many thousands of units, for a
specific number of hours.
One of the main situations where terms like MTTF are extremely important is when hardware pieces or other
products are used in mission-critical systems. Here it becomes valuable to know about general reliability for
these items. For non-repairable items, MTTF is a statistic that is of great interest to engineers and others
assessing these pieces as parts of larger systems.
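
To make the testing description above concrete: one common way to turn such a test into an MTTF estimate, assuming a constant failure rate (an exponential model), is to divide the total unit-hours accumulated on test by the number of failures observed. The figures in the sketch below are invented for illustration.

# Minimal sketch: estimating MTTF from a time-terminated test, assuming a
# constant failure rate (exponential model). All numbers are invented.
units_on_test = 1000        # identical, non-repairable units
hours_per_unit = 500.0      # each unit run for 500 hours or until it fails
failures_observed = 4

# With few failures relative to the number of units, the total time on test is
# approximately units_on_test * hours_per_unit (failed units run slightly less).
total_unit_hours = units_on_test * hours_per_unit
mttf_estimate = total_unit_hours / failures_observed
print(f"estimated MTTF: about {mttf_estimate:,.0f} hours")   # about 125,000 hours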

E. Mean time between failures (MTBF)

Mean Time Between Failure (MTBF) is a reliability term used to express the expected failure behaviour of a product, often quoted together with the number of failures per million hours. It is the most common inquiry about a product’s life span, and is important in the decision-making process of the end user. MTBF is more important for industries and integrators than for consumers. Most consumers are price driven and will not take MTBF into consideration, nor are the data often readily available. On the other hand, when equipment such as media converters or switches must be installed in mission-critical applications, MTBF becomes very important. In addition, MTBF may be an expected line item in an RFQ (Request For Quote); without the proper data, a manufacturer’s piece of equipment could be immediately disqualified. Mean time between failures (MTBF) is the predicted elapsed time between inherent failures of a mechanical or electronic system during normal system operation. MTBF can be calculated as the arithmetic mean (average) time between failures of a system. The term is used for repairable systems. The definition of MTBF depends on the definition of what is considered a failure: for complex, repairable systems, failures are considered to be those out-of-design conditions which place the system out of service and into a state for repair.
Failures that occur but can be left or maintained in an unrepaired condition, and do not place the system out of service, are not considered failures under this definition. In addition, units that are taken down for routine scheduled maintenance or inventory control are not considered within the definition of failure. [3] The higher the MTBF, the longer a system is likely to work before failing. Mean time between failures (MTBF) describes the
expected time between two failures for a repairable system. For example, three identical systems starting to
function properly at time 0 are working until all of them fail. The first system failed at 100 hours, the second
failed at 120 hours and the third failed at 130 hours. The MTBF of the system is the average of the three failure
times, which is 116.667 hours. If the systems are non-repairable, then their MTTF would be 116.667 hours.
In general, MTBF is the "up-time" between two failure states of a repairable system during operation. For each observation, the "down time" is the instantaneous time the system went down, which is after (i.e. greater than) the moment it went up, the "up time". The difference ("down time" minus "up time") is the amount of time it was operating between these two events. The MTBF of a component is then the sum of the lengths of the operational periods divided by the number of observed failures:
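
In symbols, with the i-th operating period running from its "up time" to the following "down time":

\[
\mathrm{MTBF} \;=\; \frac{\sum_{i}\left(\text{start of downtime}_i \;-\; \text{start of uptime}_i\right)}{\text{number of failures}}
\]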

F. Failure Rate Patterns


In the 1960s the failure rate of jet aircraft was high even with the extensive maintenance programs that were put in place to prevent failures. The programs required overhauls, rebuilds and detailed inspections, which required the various components to be disassembled. All of these activities were based on an estimated safe life of the equipment. Under the guidance of the FAA, extensive engineering studies were conducted on all of the aircraft in service to determine the source of failures. United Airlines pioneered and published a report on the failures which turned the industry on its head. It concluded that only 9% of the failures were related to the age of the aircraft; the rest were random in nature or induced by the very maintenance work that was put in place to prevent them. As a result of these findings, many of the extensive maintenance programs were reduced and the reliability of the aircraft went up. The report from United Airlines highlighted six distinct failure patterns of equipment. Understanding these patterns illustrates why the reduction in maintenance could result in improved performance.

 Failure Pattern A is known as the bathtub curve and has a high probability of failure when the equipment is new, followed by a low level of random failures, and then a sharp increase in failures at the end of its life. This pattern accounts for approximately 4% of failures.

 Failure Pattern B is known as the wear-out curve and consists of a low level of random failures, followed by a sharp increase in failures at the end of its life. This pattern accounts for approximately 2% of failures.

 Failure Pattern C is known as the fatigue curve and is characterized by a gradually increasing level of failures over the course of the equipment’s life. This pattern accounts for approximately 5% of failures.
 Failure Pattern D is known as the initial break-in curve and starts off with a very low level of failures followed by a sharp rise to a constant level. This pattern accounts for approximately 7% of failures.

 Failure Pattern E is known as the random pattern and consists of a consistent level of random failures over the life of the equipment, with no pronounced increases or decreases related to the life of the equipment. This pattern accounts for approximately 11% of failures.

 Failure Pattern F is known as the infant mortality curve and shows a high initial failure rate followed by
a random level of failures. This pattern accounts for 68% of failures.

3. Probability Distribution Functions and their Application in Reliability Evaluation

A. Weibull Distribution

The Weibull distribution, named for its inventor, Waloddi Weibull, is widely used in reliability engineering and elsewhere due to its versatility and relative simplicity. The most general expression of the Weibull pdf is given by the three-parameter form.
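
Defined for t ≥ γ (and zero otherwise), this standard three-parameter pdf reads:

\[
f(t) \;=\; \frac{\beta}{\eta}\left(\frac{t-\gamma}{\eta}\right)^{\beta-1} \exp\!\left[-\left(\frac{t-\gamma}{\eta}\right)^{\beta}\right]
\]

where: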

 β is the shape parameter, also known as the Weibull slope
 η is the scale parameter
 γ is the location parameter

Frequently, the location parameter is not used, and the value for this parameter can be set to zero. When this is
the case, the pdf equation reduces to that of the two-parameter Weibull distribution. There is also a form of the
Weibull distribution known as the one-parameter Weibull distribution. This in fact takes the same form as the
two-parameter Weibull pdf, the only difference being that the value of β is assumed to be known beforehand.
This assumption means that only the scale parameter needs be estimated, allowing for analysis of small data
sets. It is recommended that the analyst have a very good and justifiable estimate for β before using the one-
parameter Weibull distribution for analysis.
As was mentioned previously, the Weibull distribution is widely used in reliability and life data analysis due to
its versatility. Depending on the values of the parameters, the Weibull distribution can be used to model a
variety of life behaviors. An important aspect of the Weibull distribution is how the values of the shape
parameter, β, and the scale parameter, η, affect such distribution characteristics as the shape of the pdf curve,
the reliability and the failure rate. 
Weibull Shape Parameter, β
The Weibull shape parameter, β, is also known as the Weibull slope. This is because the value of β is equal to
the slope of the line in a probability plot. Different values of the shape parameter can have marked effects on
the behavior of the distribution. In fact, some values of the shape parameter will cause the distribution
equations to reduce to those of other distributions. For example, when β = 1, the pdf of the three-parameter
Weibull reduces to that of the two-parameter exponential distribution. The parameter β is a pure number.
The following figure shows the effect of different values of the shape parameter, β, on the shape of
the pdf (while keeping γ constant). One can see that the shape of the pdf can take on a variety of forms based on
the value of β.
Looking at the same information on a Weibull probability plot, one can easily understand why the Weibull
shape parameter is also known as the slope.  The following plot shows how the slope of the Weibull probability
plot changes with β. Note that the models represented by the three lines all have the same value of η.
Another characteristic of the distribution where the value of β has a distinct effect is the failure rate. The
following plot shows the effect of the value of β on the Weibull failure rate.

This is one of the most important aspects of the effect of β on the Weibull distribution. As is indicated by the plot, Weibull distributions with β < 1 have a failure rate that decreases with time, also known as infantile or early-life failures. Weibull distributions with β close to or equal to 1 have a fairly constant failure rate, indicative of useful life or random failures. Weibull distributions with β > 1 have a failure rate that increases with time, also known as wear-out failures. These comprise the three sections of the classic "bathtub curve." A mixed Weibull distribution with one subpopulation with β < 1, one subpopulation with β = 1 and one subpopulation with β > 1 would have a failure rate plot that was identical to the bathtub curve. An example of a bathtub curve is shown in the following chart.
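
As a small numerical check on these three regimes (a sketch only; the characteristic life and the sample times are chosen arbitrarily), the two-parameter Weibull failure rate h(t) = (β/η)(t/η)^(β−1) can be evaluated for β < 1, β = 1 and β > 1:

# Minimal sketch: Weibull failure (hazard) rate h(t) = (beta/eta) * (t/eta)**(beta-1)
# evaluated for beta < 1, beta = 1 and beta > 1 (eta and the times are arbitrary).
def weibull_hazard(t, beta, eta):
    return (beta / eta) * (t / eta) ** (beta - 1.0)

eta = 1000.0                           # characteristic life in hours (assumed)
times = [10.0, 100.0, 500.0, 1000.0, 2000.0]

for beta in (0.5, 1.0, 3.0):           # early-life, random, wear-out
    rates = [weibull_hazard(t, beta, eta) for t in times]
    if rates[0] > rates[-1]:
        trend = "decreasing"
    elif rates[0] < rates[-1]:
        trend = "increasing"
    else:
        trend = "constant"
    print(f"beta = {beta}: failure rate is {trend} with time")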

Weibull Scale Parameter, η
A change in the scale parameter, η, has the same effect on the distribution as a change of the abscissa scale.
Increasing the value of η while holding β constant has the effect of stretching out the pdf. Since the area under
a pdf curve is a constant value of one, the "peak" of the pdf curve will also decrease with the increase of η, as
indicated in the following figure.
 If η is increased, while β and γ are kept the same, the distribution gets stretched out to the right and its
height decreases, while maintaining its shape and location.
 If η is decreased, while β and γ are kept the same, the distribution gets pushed in towards the left (i.e.,
towards its beginning or towards 0 or γ), and its height increases.
 η has the same unit as T, such as hours, miles, cycles, actuations, etc.

B. Gamma Distribution
In probability theory and statistics, the gamma distribution is a two-parameter family of
continuous probability distributions. The exponential distribution, Erlang distribution, and chi-squared
distribution are special cases of the gamma distribution. There are three different parametrizations in common
use:

1. With a shape parameter k and a scale parameter θ.
2. With a shape parameter α = k and an inverse scale parameter β = 1/θ, called a rate parameter.
3. With a shape parameter k and a mean parameter μ = kθ = α/β.
In each of these three forms, both parameters are positive real numbers. A gamma distribution is a general type
of statistical distribution that is related to the beta distribution and arises naturally in processes for which the
waiting times between Poisson distributed events are relevant. The gamma distribution has been used to
model the size of insurance claims and rainfalls. This means that aggregate insurance claims and the amount of
rainfall accumulated in a reservoir are modelled by a gamma process – much like the exponential
distribution generates a Poisson process. The gamma distribution is also used to model errors in multi-
level Poisson regression models, because the combination of the Poisson distribution and a gamma distribution
is a negative binomial distribution. In wireless communication, the gamma distribution is used to model the
multi-path fading of signal power. In oncology, the age distribution of cancer incidence often follows the gamma distribution, whereas the shape and scale parameters predict, respectively, the number of driver events and the time interval between them. In neuroscience, the gamma distribution is often used to describe the distribution of inter-spike intervals. In bacterial gene expression, the copy number of a constitutively expressed protein often follows the gamma distribution, where the scale and shape parameter are, respectively, the mean number of bursts per cell cycle and the mean number of protein molecules produced by a single mRNA during its lifetime. In genomics, the gamma distribution was applied in the peak calling step (i.e. in the recognition of signal) in ChIP-chip and ChIP-seq data analysis. The gamma distribution is widely used as a conjugate prior in Bayesian statistics. It is the conjugate prior for the precision (i.e. inverse of the variance) of a normal distribution. It is also the conjugate prior for the exponential distribution.
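
For reference, in the first (shape k, scale θ) parametrization the pdf takes the standard form below; the shape-rate form follows by substituting θ = 1/β.

\[
f(x;\,k,\theta) \;=\; \frac{x^{\,k-1}\, e^{-x/\theta}}{\Gamma(k)\,\theta^{k}}, \qquad x > 0,\; k > 0,\; \theta > 0
\]
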
C. Log-Normal Distribution
In probability theory, a log-normal (or lognormal) distribution is a continuous probability distribution of
a random variable whose logarithm is normally distributed. Thus, if the random variable X is log-normally
distributed, then Y = ln(X) has a normal distribution. Likewise, if Y has a normal distribution, then
the exponential function of Y, X = exp(Y), has a log-normal distribution. A random variable which is log-
normally distributed takes only positive real values. The distribution is occasionally referred to as the Galton
distribution or Galton's distribution, after Francis Galton. The log-normal distribution also has been associated
with other names, such as McAlister, Gibrat and Cobb–Douglas. A log-normal process is the statistical
realization of the multiplicative product of many independent random variables, each of which is positive. This
is justified by considering the central limit theorem in the log domain. The log-normal distribution is the maximum entropy probability distribution for a random variate X for which the mean and variance of ln(X) are specified. It is a general case of Gibrat's distribution. A log-normal distribution results if the variable is the product of a large number of independent, identically distributed variables, in the same way that a normal distribution results if the variable is the sum of a large number of independent, identically distributed variables.
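
For reference, if ln(X) is normally distributed with mean μ and standard deviation σ, the log-normal pdf takes the standard form:

\[
f(x) \;=\; \frac{1}{x\,\sigma\sqrt{2\pi}}\, \exp\!\left(-\frac{(\ln x-\mu)^{2}}{2\sigma^{2}}\right), \qquad x > 0
\]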
