Reliability Basics
Abstract
Increasingly, managers and engineers who are responsible for manufacturing and other industrial pursuits are incorporating a reliability focus into their strategic and tactical plans and initiatives. This trend is affecting numerous functional areas, including machine/system design and procurement, plant operations and plant maintenance. With its origins in the aviation industry, reliability engineering, as a discipline, has historically been focused primarily on assuring product reliability. Increasingly, these methods are being employed to assure the production reliability of manufacturing plants and equipment, often as an enabler to Lean Manufacturing. This presentation provides an introduction to the most relevant and practical of these methods for plant reliability engineering, including:
• Basic reliability calculations for failure rate, MTBF, availability, etc.
• An introduction to the exponential distribution, the cornerstone of the reliability methods.
• Identifying failure time-dependencies using the versatile Weibull system.
• Developing an effective field data collection system.

Introduction
The origins of the field of reliability engineering, or at least the demand for it, can be traced back to the point at which man began to depend upon machines for his livelihood. The Noria, for instance, is an ancient pump thought to be the world’s first sophisticated machine. Utilizing hydraulic energy from the flow of a river or stream, the Noria used buckets to transfer water to troughs and other distribution devices to irrigate fields and provide water to communities. If the community Noria failed, the people who depended upon it for their supply of food were at risk. Survival has always been a great source of motivation for reliability and dependability.

While the origins of its demand are ancient, reliability engineering as a technical discipline truly flourished along with the growth of commercial aviation following World War II. It rapidly became apparent to managers of aviation industry companies that crashes are bad for business. Karen Bernowski, editor of Quality Progress, described in one of her editorials research into the media value of death by various means, conducted by MIT statistics professor Arnold Barnett and reported in 1994. Barnett evaluated the number of New York Times front-page news articles per 1,000 deaths by various means. He found that cancer-related deaths yielded 0.02 front-page news articles per 1,000 deaths, homicide yielded 1.7 per 1,000 deaths, AIDS yielded 2.3 per 1,000 deaths, and aviation-related accidents yielded a whopping 138.2 articles per 1,000 deaths!

The cost and high-profile nature of aviation-related accidents helped to motivate the aviation industry to participate heavily in the development of the reliability engineering discipline. Likewise, due to the critical nature of military equipment in defense, reliability engineering techniques have long been employed to assure operational readiness. Many of our standards in the reliability engineering field are MIL Standards or have their origins in military activities.

Reliability engineering deals with the longevity and dependability of parts, products and systems. More pointedly, it is about controlling risk. Reliability engineering incorporates a wide variety of analytical techniques designed to help engineers understand the failure modes and patterns of these parts, products and systems. Traditionally, the reliability engineering field has focused upon product reliability and dependability assurance. In recent years, organizations that deploy machines and other physical assets in production settings have begun to apply various reliability engineering principles for the purpose of production reliability and dependability assurance.

Increasingly, production organizations deploy reliability engineering techniques like reliability centered maintenance (RCM), including failure modes and effects (and criticality) analysis (FMEA/FMECA), root cause analysis (RCA), condition-based maintenance, improved work planning schemes, etc. These same organizations are beginning to adopt life cycle cost-based design and procurement strategies, change management schemes and other advanced tools and techniques in order to control the root causes of poor reliability. However, the adoption of the more quantitative aspects of reliability engineering by the production reliability assurance community has been slow. This is due in part to the perceived complexity of the techniques and in part to the difficulty in obtaining useful data.

The quantitative aspects of reliability engineering may, on the surface, seem complicated and daunting. In reality, however, a relatively basic understanding of the most fundamental and widely applicable methods can enable the plant reliability engineer to gain a much clearer understanding of where problems are occurring, their nature and their impact on the production process, at least in the quantitative sense. Used properly, quantitative reliability engineering tools and methods enable the plant reliability engineer to apply the frameworks provided by RCM, RCA, etc., more effectively by eliminating some of the guesswork otherwise involved in their application. However, engineers must be particularly clever in applying these methods. The operating context and environment of a production process incorporate more variables than the somewhat one-dimensional world of product reliability assurance, owing to the combined influence of design engineering, procurement, production/operations, maintenance, etc., and it is difficult to create effective tests and experiments that model the multidimensional aspects of a typical production environment.
$$R(6) = 2.718281828^{-(0.1 \times 6)}$$
$$R(6) = 0.5488 \approx 55\%$$
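The exponential calculations in this section can be checked with a few lines of Python. This is a minimal sketch, assuming the motor example's failure rate of λ = 0.1 failures per year (that is, an MTBF of 10 years); the function names are illustrative only. It evaluates the reliability R(t) = e^(−λt), the pdf λe^(−λt) and the cdf 1 − R(t).

```python
import math


def failure_rate(mtbf):
    """Constant failure rate lambda = 1/MTBF (exponential assumption)."""
    return 1.0 / mtbf


def reliability(t, lam):
    """R(t) = e^(-lambda*t): probability of surviving to time t."""
    return math.exp(-lam * t)


def pdf(t, lam):
    """pdf(t) = lambda * e^(-lambda*t): failure density at time t."""
    return lam * math.exp(-lam * t)


def cdf(t, lam):
    """F(t) = 1 - R(t): cumulative fraction of the population failed by time t."""
    return 1.0 - reliability(t, lam)


# Motor example: lambda = 0.1 failures/year, i.e. an assumed MTBF of 10 years.
lam = failure_rate(10.0)
print(f"R(6)   = {reliability(6, lam):.4f}")  # R(6)   = 0.5488
print(f"pdf(3) = {pdf(3, lam):.5f}")          # pdf(3) = 0.07408
print(f"F(10)  = {cdf(10, lam):.4f}")         # F(10)  = 0.6321 (63.2% failed at t = MTBF)
```

The last line shows why, under the exponential assumption, the MTBF corresponds to the point at which about 63.2% of the population has failed.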
We often speak of projected bearing life as the L10 life. Strictly speaking, this is the point in time at which 10% of a population of bearings should be expected to fail (90% survival rate). In reality, only a fraction of the bearings actually survive to the L10 point. We’ve come to accept that as the objective life for a bearing when, perhaps, we should set our sights on the L63.2 point, indicating that our bearings are lasting, on average, to the projected MTBF, assuming, of course, that the bearings follow the exponential distribution. We’ll discuss that issue later in the Weibull analysis section of the paper.

$$\mathrm{pdf}(t) = \lambda e^{-\lambda t}$$

Where:
pdf(t) = Life frequency distribution for a given time (t)
e = Base of the natural logarithms (2.718281828)
λ = Failure rate (1/MTBF, or 1/MTTF)

In our electric motor example, the probability density of failure at three years is calculated as follows:

$$\mathrm{pdf}(3) = 0.1 \times 2.718281828^{-(0.1 \times 3)}$$
$$\mathrm{pdf}(3) = 0.1 \times 0.7408$$
$$\mathrm{pdf}(3) = 0.07408 \approx 7.4\%$$

Because pdf(t) is a density (failures per year), this corresponds to roughly a 7.4% chance that a motor fails during a one-year interval around year three.

In our example, if we assume a constant failure rate, which follows the exponential distribution, the life distribution, or pdf, for the industrial electric motors is expressed in Figure 3. Don’t be confused by the declining nature of the pdf. Yes, the failure rate is constant, but the pdf mathematically assumes failure without replacement, so the population from which failures can occur is continuously shrinking, asymptotically approaching zero.

The cumulative distribution function (cdf) is simply the cumulative proportion of failures one might expect over a period of time. For the exponential distribution, the failure rate is constant, so the relative rate at which failed components are added to the cdf remains constant. However, as the population declines as a result of failure, the actual number of mathematically estimated failures decreases as a function of the declining population. Much as the pdf asymptotically approaches zero, the cdf asymptotically approaches one (Figure 4).

Figure 4 - Failure rate and the cumulative distribution function.

The declining failure rate portion of the bathtub curve, often called the infant mortality region, and the wear out region will be discussed in the following section addressing the versatile Weibull distribution.

Weibull Distribution
Originally developed by Waloddi Weibull, a Swedish mathematician, Weibull analysis is easily the most versatile distribution employed by reliability engineers. While it is called a distribution, it is actually a tool that enables the reliability engineer first to characterize the probability density function (failure frequency distribution) of a set of failure data and then to classify the failures as early life, constant (exponential) or wear out (Gaussian or log normal). This is done by plotting time-to-failure data on special plotting paper, with the log of the times/cycles/miles to failure plotted on a log-scaled x-axis versus the cumulative percent of the population represented by each failure on a log-log scaled y-axis (Figure 5).
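The plotting procedure described above is commonly automated with median-rank regression. The sketch below is one common variant, not necessarily the procedure behind this paper's figures: it assumes complete (uncensored) failure times, estimates each point's cumulative fraction failed with Bernard's approximation (i − 0.3)/(n + 0.4), transforms to the Weibull plotting axes x = ln(t) and y = ln(ln(1/(1 − F))), and fits a least-squares line. The slope estimates the shape parameter β, the scale parameter η (characteristic life) follows from the intercept, and R² summarizes how well a single line fits. The function name and the failure times in the usage lines are hypothetical.

```python
import math


def weibull_rank_regression(times):
    """Fit a two-parameter Weibull by median-rank regression (y on x).

    Assumes complete failure data with at least two distinct times.
    Returns (beta, eta, r_squared).
    """
    t = sorted(times)
    n = len(t)
    xs, ys = [], []
    for i, ti in enumerate(t, start=1):
        f = (i - 0.3) / (n + 0.4)                        # Bernard's median-rank estimate of F(t_i)
        xs.append(math.log(ti))                          # log-scaled x-axis
        ys.append(math.log(math.log(1.0 / (1.0 - f))))   # log-log scaled y-axis
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    beta = sxy / sxx                     # slope of the plotted line = shape parameter
    intercept = my - beta * mx           # intercept = -beta * ln(eta)
    eta = math.exp(-intercept / beta)    # characteristic life (63.2% failed)
    r_squared = sxy ** 2 / (sxx * syy)   # coefficient of determination of the fit
    return beta, eta, r_squared


# Hypothetical times to failure (years), for illustration only.
failures = [1.2, 2.9, 4.1, 5.6, 7.0, 9.3, 12.5]
beta, eta, r2 = weibull_rank_regression(failures)
print(f"beta = {beta:.2f}, eta = {eta:.1f} years, R^2 = {r2:.3f}")
```

As a rough reading, β below 1 points to early life failures, β near 1 to constant (random) failures, and β above 1 to wear out.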
In our previously discussed example of electric motors, we assumed the exponential distribution. However, if Weibull analysis revealed early life failures by yielding a β shape parameter of 0.5, the estimate of reliability at six years’ time would be ~46%, not the ~55% estimated assuming the exponential distribution. In order to reduce these early life failures, we would need to lean on our suppliers to provide better built and delivered quality and reliability, store motors better to avoid rust, corrosion, fretting and other static wear mechanisms, and do a better job of installing and starting up new or rebuilt machines.

The Multi-Slope Weibull Plot
Frequently, when drawing a best-fit regression line through the data points on a Weibull plot, the correlation is poor, meaning the actual data points stray a great distance from the regression line. This is assessed by examining the coefficient of correlation, R, or, more conservatively, the coefficient of determination, R², which indicates the proportion of the data's variability explained by the line. When the correlation is poor, the reliability engineer should examine the data to evaluate whether two or more patterns exist, which can denote major differences in failure modes, operating context, etc. Often, this produces two or more estimates of beta (Figure 8).
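The motor comparison above (~46% versus ~55% at six years) can be reproduced with the two-parameter Weibull reliability function R(t) = exp(−(t/η)^β). The paper does not state the scale parameter η; this sketch assumes η = 10 years, equal to the exponential MTBF, an assumption that reproduces both quoted figures. With β = 1 the Weibull reduces to the exponential case.

```python
import math


def weibull_reliability(t, beta, eta):
    """Two-parameter Weibull reliability R(t) = exp(-(t/eta)^beta)."""
    return math.exp(-((t / eta) ** beta))


ETA = 10.0  # assumed characteristic life in years (not stated in the paper)
for beta in (0.5, 1.0):
    print(f"beta = {beta}: R(6) = {weibull_reliability(6.0, beta, ETA):.3f}")
# beta = 0.5: R(6) = 0.461
# beta = 1.0: R(6) = 0.549
```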
Series Systems
Before discussing series systems, we should discuss reliability block diagrams. Not a complicated tool to use, reliability block diagrams simply map a process from start to finish. For a series system, subsystem A is followed by subsystem B and so forth. In the series system, the ability to employ subsystem B depends upon the operating state of subsystem A. If subsystem A is not operating, the system is down regardless of the condition of subsystem B (Figure 9).

To calculate the system reliability for a serial process, one needs only to multiply the estimated reliability of subsystem A at time (t) by the estimated reliability of subsystem B at time (t). The basic equation for calculating the system reliability of a simple series system is:

$$R_s(t) = R_1(t) \times R_2(t) \times \cdots \times R_n(t)$$

Where:
Rs(t) – System reliability for given time (t)
R1-n(t) – Subsystem or sub-function reliability for given time (t)

So, for a simple system with three subsystems, or sub-functions, each having an estimated reliability of 0.90 (90%) at time (t), the system reliability is calculated as 0.90 × 0.90 × 0.90 = 0.729, or about 73%.

Figure 10 - Simple parallel system – the system reliability is increased to 99% due to the redundancy.

To calculate the reliability of an active parallel system, where both machines are running, use the following simple equation:

$$R_s(t) = 1 - \left[(1 - R_1(t)) \times (1 - R_2(t)) \times \cdots \times (1 - R_n(t))\right]$$

Where:
Rs(t) – System reliability for given time (t)
R1-n(t) – Subsystem or sub-function reliability for given time (t)

The simple parallel system in our example, with two components in parallel, each having a reliability of 0.90, has a total system reliability of 1 – (0.1 × 0.1) = 0.99. So the system reliability was significantly improved. There are some shortcut methods for calculating parallel system reliability when all subsystems have the same estimated reliability. More often, systems contain parallel and serial subcomponents as depicted in Figure 11. The calculation of standby systems requires knowledge about the reliability of the switching mechanism. In the interest of simplicity and brevity, this topic will be reserved for a future paper.
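The series and active-parallel formulas above are easy to script. This is a minimal sketch with illustrative function names; it also includes the shortcut for n identical active-parallel units, 1 − (1 − R)^n, and composes the two rules for a hypothetical mixed arrangement (one 0.95 block in series with a redundant pair of 0.90 blocks). Standby redundancy, which depends on switch reliability, is not modeled.

```python
from math import prod


def series_reliability(rs):
    """Series system: every block must work, Rs = R1 * R2 * ... * Rn."""
    return prod(rs)


def active_parallel_reliability(rs):
    """Active parallel system: fails only if every block fails,
    Rs = 1 - (1 - R1)(1 - R2)...(1 - Rn)."""
    return 1.0 - prod(1.0 - r for r in rs)


def identical_parallel_reliability(r, n):
    """Shortcut for n identical active-parallel blocks: Rs = 1 - (1 - R)^n."""
    return 1.0 - (1.0 - r) ** n


print(round(series_reliability([0.90, 0.90, 0.90]), 3))    # 0.729
print(round(active_parallel_reliability([0.90, 0.90]), 2))  # 0.99

# Hypothetical mixed system: a 0.95 block in series with a redundant 0.90 pair.
pair = active_parallel_reliability([0.90, 0.90])
print(round(series_reliability([0.95, pair]), 4))           # 0.9405
```

Reducing each parallel group to a single equivalent block and then multiplying along the series path is the usual way to handle combined diagrams like the one in Figure 11.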
$$R_s = R(r \le k) = \sum_{r=0}^{k} \frac{n!}{r!\,(n-r)!}\,(1-p)^{r}\,p^{\,n-r}$$

Where:
Rs = System reliability given the actual number of failures (r) is less than or equal to the maximum allowable (k)
r = The actual number of failures
k = The maximum allowable number of failures
n = The total number of units in the system
p = The probability of survival, or the subcomponent reliability for a given time (t)

P(0) = 0.6561
P(1) = 0.2916

would likely be very different (Figure 12). For certain, some failure modes would still be mathematical, but many, and arguably most, would exhibit a time dependency. This kind of information would arm reliability engineers and managers with a powerful set of options for mitigating failure risk with a high degree of precision. Naturally, this ability depends upon the effective collection and subsequent analysis of field data.
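The binomial expression for Rs = R(r ≤ k) can be evaluated directly. This sketch assumes identical, independent units. The inputs n = 4 and p = 0.90 are an inference, chosen because they reproduce the listed values P(0) = 0.6561 and P(1) = 0.2916; the allowable k = 1 in the last line is likewise an assumption for illustration.

```python
from math import comb


def prob_failures(r, n, p):
    """P(r): probability that exactly r of n identical units fail by time t,
    where p is each unit's reliability (probability of survival)."""
    return comb(n, r) * (1.0 - p) ** r * p ** (n - r)


def reliability_at_most_k_failures(k, n, p):
    """Rs = P(r <= k): the system survives if no more than k units fail."""
    return sum(prob_failures(r, n, p) for r in range(k + 1))


n, p = 4, 0.90  # assumed inputs that reproduce P(0) and P(1)
print(round(prob_failures(0, n, p), 4))                   # 0.6561
print(round(prob_failures(1, n, p), 4))                   # 0.2916
print(round(reliability_at_most_k_failures(1, n, p), 4))  # 0.9477 (assuming k = 1)
```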
reliability engineering methods in future issues of Reliability World magazine at more detailed and applied levels, emphasizing the needs of the plant reliability engineer. If your interest in reliability engineering methods is high, I encourage you to pursue professional certification by the American Society for Quality as a Certified Reliability Engineer (CRE).

References
Troyer, D. (2006) Strategic Plant Reliability Management Course Book, Noria Publishing, Tulsa, OK.
Bernowski, K. (1997) “Safety in the Skies,” Quality Progress, January.
Dovich, R. (1990) Reliability Statistics, ASQ Quality Press, Milwaukee, WI.
Krishnamoorthi, K.S. (1992) Reliability Methods for Engineers, ASQ Quality Press, Milwaukee, WI.
MIL-STD-721
IEC Standard 300-3-3
DOE Standard NE-1004-92

Appendix – Select Reliability Engineering Terms from MIL-STD-721

Availability – A measure of the degree to which an item is in the operable and committable state at the start of the mission, when the mission is called for at an unknown (random) time.

Capability – A measure of the ability of an item to achieve mission objectives given the conditions during the mission.

Dependability – A measure of the degree to which an item is operable and capable of performing its required function at any (random) time during a specified mission profile, given the availability at the start of the mission.

Failure – The event, or inoperable state, in which an item, or part of an item, does not, or would not, perform as previously specified.

Maintainability – The measure of the ability of an item to be retained in or restored to specified condition when maintenance is performed by personnel having specified skill levels, using prescribed procedures and resources, at each prescribed level of maintenance and repair.

Maintenance, Corrective – All actions performed, as a result of failure, to restore an item to a specified condition. Corrective maintenance can include any or all of the following steps: localization, isolation, disassembly, interchange, reassembly, alignment, and checkout.

Maintenance, Preventive – All actions performed in an attempt to retain an item in a specified condition by providing systematic inspection, detection and prevention of incipient failures.

Mean-Time-Between-Failure (MTBF) – A basic measure of reliability for repairable items: the mean number of life units during which all parts of the item perform within their specified limits, during a particular measurement interval under stated conditions.

Mean-Time-To-Failure (MTTF) – A basic measure of reliability for non-repairable items: the mean number of life units during which all parts of the item perform within their specified limits, during a particular measurement interval under stated conditions.

Mean-Time-To-Repair (MTTR) – A basic measure of maintainability: the sum of corrective maintenance times at any specified level of repair, divided by the total number of failures within an item repaired at that level, during a particular interval under stated conditions.

Mission Reliability – The ability of an item to perform its required functions for the duration of a specified mission profile.

Reliability – (1) The duration or probability of failure-free performance under stated conditions. (2) The probability that an item can perform its intended function for a specified interval under stated conditions. For non-redundant items, this is equivalent to definition (1). For redundant items, this is the definition of mission reliability.
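The MTBF and MTTR definitions above reduce to simple averages over a maintenance log, and one common summary built from them is inherent availability, MTBF / (MTBF + MTTR). This is a minimal sketch using a hypothetical log (8,000 operating hours, four failures); the function names are illustrative. Inherent availability counts only repair time as downtime, so it is not identical to the mission-oriented MIL-STD-721 availability definition.

```python
def mtbf(operating_hours, failures):
    """MTBF: total operating time divided by the number of failures."""
    return operating_hours / failures


def mttr(repair_hours):
    """MTTR: total corrective maintenance time divided by the number of repairs."""
    return sum(repair_hours) / len(repair_hours)


def inherent_availability(mtbf_h, mttr_h):
    """Inherent availability: uptime fraction when only repair time is downtime."""
    return mtbf_h / (mtbf_h + mttr_h)


# Hypothetical log: 8,000 operating hours and four failures with these repair times.
repairs = [6.0, 10.0, 4.0, 8.0]
m_tbf = mtbf(8000.0, len(repairs))
m_ttr = mttr(repairs)
print(f"MTBF = {m_tbf:.0f} h, failure rate = {1 / m_tbf:.4f} per h")  # MTBF = 2000 h, failure rate = 0.0005 per h
print(f"MTTR = {m_ttr:.1f} h")                                        # MTTR = 7.0 h
print(f"Ai   = {inherent_availability(m_tbf, m_ttr):.4f}")            # Ai   = 0.9965
```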