
Lean RW Final.

qxp 4/16/2007 9:56 AM Page 213

/ Reliability Basics

Reliability Engineering Principles


for the Plant Engineer
By: Drew D. Troyer, CRE, CMRP, Noria Corporation

Abstract
Increasingly, managers and engineers who are responsible for manufacturing and other industrial pursuits are incorporating a reliability focus into their strategic and tactical plans and initiatives. This trend is affecting numerous functional areas, including machine/system design and procurement, plant operations and plant maintenance. With its origins in the aviation industry, reliability engineering, as a discipline, has historically been focused primarily on assuring product reliability. Increasingly, these methods are being employed to assure the production reliability of manufacturing plants and equipment, often as an enabler to Lean Manufacturing. This presentation provides an introduction to the most relevant and practical of these methods for plant reliability engineering, including:
• Basic reliability calculations for failure rate, MTBF, availability, etc.
• An introduction to the exponential distribution, the cornerstone of the reliability methods.
• Identifying failure time-dependencies using the versatile Weibull system.
• Developing an effective field data collection system.

Introduction
The origins of the field of reliability engineering, at least the demand for it, can be traced back to the point at which man began to depend upon machines for his livelihood. The Noria, for instance, is an ancient pump thought to be the world’s first sophisticated machine. Harnessing hydraulic energy from the flow of a river or stream, the Noria used buckets to transfer water to troughs and other distribution devices to irrigate fields and provide water to communities. If the community Noria failed, the people who depended upon it for their supply of food were at risk. Survival has always been a great source of motivation for reliability and dependability.

While the origins of its demand are ancient, reliability engineering as a technical discipline truly flourished along with the growth of commercial aviation following World War II. It rapidly became apparent to managers of aviation industry companies that crashes are bad for business. Karen Bernowski, editor of Quality Progress, revealed in one of her editorials research into the media value of death by various means, which was conducted by MIT statistics professor Arnold Barnett and reported in 1994. Barnett evaluated the number of New York Times front page news articles per 1000 deaths by various means. He found that cancer related deaths yielded 0.02 front page news articles per 1000 deaths, homicide yielded 1.7 per 1000 deaths, AIDS yielded 2.3 per 1000 deaths, and aviation related accidents yielded a whopping 138.2 articles per 1000 deaths!

The cost and high profile nature of aviation related accidents helped to motivate the aviation industry to participate heavily in the development of the reliability engineering discipline. Likewise, due to the critical nature of military equipment in defense, reliability engineering techniques have long been employed to assure operational readiness. Many of our standards in the reliability engineering field are MIL Standards or have their origins in military activities.

Reliability engineering deals with the longevity and dependability of parts, products and systems. More pointedly, it is about controlling risk. Reliability engineering incorporates a wide variety of analytical techniques designed to help engineers understand the failure modes and patterns of these parts, products and systems. Traditionally, the reliability engineering field has focused upon product reliability and dependability assurance. In recent years, organizations that deploy machines and other physical assets in production settings have begun to deploy various reliability engineering principles for the purpose of production reliability and dependability assurance.

Increasingly, production organizations deploy reliability engineering techniques like reliability centered maintenance (RCM), including failure modes and effects (and criticality) analysis (FMEA/FMECA), root cause analysis (RCA), condition-based maintenance, improved work planning schemes, etc. These same organizations are beginning to adopt life cycle cost-based design and procurement strategies, change management schemes and other advanced tools and techniques in order to control the root causes of poor reliability. However, the adoption of the more quantitative aspects of reliability engineering by the production reliability assurance community has been slow. This is due in part to the perceived complexity of the techniques and in part to the difficulty in obtaining useful data.

The quantitative aspects of reliability engineering may, on the surface, seem complicated and daunting. In reality, however, a relatively basic understanding of the most fundamental and widely applicable methods can enable the plant reliability engineer to gain a much clearer understanding about where problems are occurring, their nature and their impact on the production process, at least in the quantitative sense. Used properly, quantitative reliability engineering tools and methods enable the plant reliability engineer to more effectively apply the frameworks provided by RCM, RCA, etc., by eliminating some of the guesswork otherwise involved with their application. However, engineers must be particularly clever in their application of the methods because the operating context and environment of a production process incorporates more variables than the somewhat one-dimensional world of product reliability assurance due to the combined influence of design engineering, procurement, production/operations, maintenance, etc., and the difficulty in creating effective tests and experiments to model the multidimensional aspects of a typical production environment.

2007 Conference Proceedings 213



Despite the increased difficulty in applying quantitative reliability methods in the production environment, it is nonetheless worthwhile to gain a sound understanding of the tools and apply them where appropriate. Quantitative data helps to define the nature and magnitude of a problem/opportunity, which provides vision to the reliability engineer in his or her application of other reliability engineering tools. This paper and presentation will provide an introduction to the most basic reliability engineering methods that are applicable to the plant engineer who is interested in production reliability assurance. It presupposes a basic understanding of algebra, probability theory and univariate statistics based upon the Gaussian (normal) distribution (e.g., measures of central tendency, measures of dispersion and variability, confidence intervals, etc.). It should be made clear that this paper is a brief introduction to reliability methods. It is by no means a comprehensive survey of reliability engineering methods, nor is it in any way new or unconventional. The methods described herein are routinely used by reliability engineers and are core knowledge concepts for those pursuing professional certification by the American Society for Quality (ASQ) as a reliability engineer (CRE). Several books on reliability engineering are listed in the bibliography of this article. The author of this article has found Reliability Methods for Engineers, by K.S. Krishnamoorthi, and Reliability Statistics, by Robert Dovich, to be particularly useful and user friendly references on the subject of reliability engineering methods. Both are published by the ASQ press.

Before discussing methods, one should familiarize him or herself with reliability engineering nomenclature. For convenience, a highly abridged list of key terms and definitions is provided in the appendix of this article. For a more exhaustive definition of reliability terms and nomenclature, refer to MIL-STD-721 and other related standards. The definitions contained in the appendix are from MIL-STD-721.

Basic Mathematical Concepts in Reliability Engineering
Many mathematical concepts apply to reliability engineering, particularly from the areas of probability and statistics. Likewise, many mathematical distributions can be used for various purposes, including the Gaussian (normal) distribution, the log-normal distribution, the Rayleigh distribution, the exponential distribution, the Weibull distribution and a host of others. For the purpose of this brief introduction, we’ll limit our discussion to the exponential distribution and the Weibull distribution, the two most widely applied to reliability engineering. In the interest of brevity and simplicity, important mathematical concepts such as distribution goodness-of-fit and confidence intervals have been excluded.

Failure Rate and Mean-Time-Between/To-Failure (MTBF/MTTF)
The purpose of quantitative reliability measurements is to define the rate of failure relative to time and to model that failure rate in a mathematical distribution for the purpose of understanding the quantitative aspects of failure. The most basic building block is the failure rate, which is estimated using the following equation:

$$\lambda = \frac{r}{T}$$

Where:
λ = Failure rate (sometimes referred to as the hazard rate)
T = Total running time/cycles/miles/etc. during an investigation period for both failed and non-failed items.
r = The total number of failures occurring during the investigation period.

For example, if five electric motors operate for a collective total time of 50 years with five functional failures during the period, the failure rate is 0.1 failures per year.

The basic calculation to estimate Mean-Time-Between-Failure (MTBF) and Mean-Time-To-Failure (MTTF), both measures of central tendency, is simply the reciprocal of the failure rate function. It is calculated using the following equation:

$$\theta = \frac{T}{r}$$

Where:
θ = Mean-Time-Between/To-Failure
T = Total running time/cycles/miles/etc. during an investigation period for both failed and non-failed items.
r = The total number of failures occurring during the investigation period.

The MTBF for our industrial electric motor example is 10 years, which is the reciprocal of the failure rate for the motors. Incidentally, we would estimate MTBF for electric motors that are rebuilt upon failure. For smaller motors that are considered disposable, we would state the measure of central tendency as MTTF.

The failure rate is a basic component of many more complex reliability calculations. Depending upon the mechanical/electrical design, operating context, environment and maintenance effectiveness, a machine’s failure rate as a function of time may decline, remain constant, increase linearly or increase geometrically (Figure 1). The importance of failure rate versus time will be discussed in more detail later.
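The failure rate and MTBF estimates for the five-motor example can be reproduced with a few lines of Python. This is only an illustrative sketch: the function names `failure_rate` and `mtbf` are chosen here for convenience and do not come from the paper.

```python
def failure_rate(failures: int, total_time: float) -> float:
    """Estimate the failure rate, lambda = r / T."""
    if total_time <= 0:
        raise ValueError("total running time must be positive")
    return failures / total_time


def mtbf(failures: int, total_time: float) -> float:
    """Estimate MTBF/MTTF, theta = T / r (the reciprocal of lambda)."""
    if failures <= 0:
        raise ValueError("at least one failure is needed to estimate MTBF")
    return total_time / failures


# Motor example: 50 collective running years and 5 functional failures.
lam = failure_rate(failures=5, total_time=50.0)
theta = mtbf(failures=5, total_time=50.0)
print(f"failure rate = {lam:.2f} per year, MTBF = {theta:.1f} years")
```

Note that T is the running time accumulated by all units, failed or not, so the same functions apply when the units have run for different lengths of time.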


Figure 1 - Different failure rate versus time scenarios.

The “Bathtub” Curve
Individuals who have received only basic training in probability and statistics are probably most familiar with the Gaussian or normal distribution, which is associated with the familiar bell-shaped probability density curve. The Gaussian distribution is generally applicable to data sets where the two most common measures of central tendency, mean and median, are approximately equal. Surprisingly, despite the versatility of the Gaussian distribution in modeling probabilities for phenomena ranging from standardized test scores to the birth weights of babies, it is not the dominant distribution employed in reliability engineering. The Gaussian distribution has its place in evaluating the failure characteristics of machines with a dominant failure mode, but the primary distribution employed in reliability engineering is the exponential distribution.

When evaluating the reliability and failure characteristics of a machine, we must begin with the much maligned “bathtub” curve, which reflects the failure rate versus time (Figure 2). In concept, the bathtub curve effectively demonstrates a machine’s three basic failure rate characteristics: declining, constant or increasing. Regrettably, the bathtub curve has been harshly criticized in the maintenance engineering literature because it fails to effectively model the characteristic failure rate for most machines in an industrial plant, which is generally true at the macro level. Most machines spend their lives in the early life, or infant mortality, and/or the constant failure rate regions of the bathtub curve. We rarely see systemic time-based failures in industrial machines. Despite its limitations in modeling the failure rates of typical industrial machines, the bathtub curve is a useful tool for explaining the basic concepts of reliability engineering.

Figure 2 - The much maligned “bathtub” curve.

The human body is an excellent example of a system that follows the bathtub curve. People, and other organic species for that matter, tend to suffer a high failure rate (mortality) during their first years of life, particularly the first few years, but the rate decreases as the child grows older. Assuming a person reaches puberty and survives his or her teenage years, his or her mortality rate becomes fairly constant and remains there until age (time) dependent illnesses begin to increase the mortality rate (wearout). Numerous influences affect mortality rates, including prenatal care and mother’s nutrition, quality and availability of medical care, environment and nutrition, lifestyle choices and, of course, genetic predisposition. These factors can be metaphorically compared to factors that influence machine life. Design and procurement are analogous to genetic predisposition; installation and commissioning are analogous to prenatal care and mother’s nutrition; and lifestyle choices and availability of medical care are analogous to maintenance effectiveness and proactive control over operating conditions.

The Exponential Distribution
The exponential distribution, the most basic and widely used reliability prediction formula, models machines with a constant failure rate, or the flat section of the bathtub curve. Most industrial machines spend most of their lives in the constant failure rate region, so it is widely applicable. Below is the basic equation for estimating the reliability of a machine that follows the exponential distribution, where the failure rate is constant as a function of time.

$$R(t) = e^{-\lambda t}$$

Where:
R(t) = Reliability estimate for a period of time, cycles, miles, etc. (t).
e = Base of the natural logarithms (2.718281828)
λ = Failure rate (1/MTBF, or 1/MTTF)

In our electric motor example, if one assumes a constant failure rate, the likelihood of running a motor for six years without a failure, or the projected reliability, is 55%. This is calculated as follows:

$$R(6) = 2.718281828^{-(0.1 \times 6)}$$
$$R(6) = 0.5488 \approx 55\%$$
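The six-year projection for the motors can be checked numerically. The sketch below assumes the constant failure rate of 0.1 failures per year from the motor example; the helper name `exp_reliability` is illustrative only.

```python
import math


def exp_reliability(t: float, failure_rate: float) -> float:
    """Exponential reliability, R(t) = e^(-lambda * t)."""
    return math.exp(-failure_rate * t)


lam = 0.1  # failures per year, from the motor example
r_six_years = exp_reliability(6.0, lam)
r_at_mtbf = exp_reliability(1.0 / lam, lam)  # evaluated at t = MTBF = 10 years

print(f"R(6)  = {r_six_years:.4f}")
print(f"R(10) = {r_at_mtbf:.4f}")
```

Evaluating the same function at t = 1/λ shows that, under the exponential assumption, a little over a third of the population is still expected to be running when the MTBF is reached.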


In other words, after six years, about 45% of the population of identical motors operating in an identical application can probabilistically be expected to fail. It is worth reiterating at this point that these calculations project the probability for a population. Any given individual from the population could fail on the first day of operation while another individual could last 30 years. That is the nature of probabilistic reliability projections.

A characteristic of the exponential distribution is that the MTBF occurs at the point at which the calculated reliability is 36.79%, or the point at which 63.21% of the machines have already failed. In our motor example, after 10 years, 63.21% of the motors from a population of identical motors serving in identical applications can be expected to fail. In other words, the survival rate is 36.79% of the population.

We often speak of projected bearing life as the L10 life. This is the point in time at which 10% of a population of bearings should be expected to fail (90% survival rate). In reality, only a fraction of the bearings actually survive to the L10 point. We’ve come to accept that as the objective life for a bearing when, perhaps, we should set our sights on the L63.21 point, indicating that our bearings are lasting, on average, to projected MTBF, assuming, of course, that the bearings follow the exponential distribution. We’ll discuss that issue later in the Weibull analysis section of the paper.

The probability density function (pdf), or life distribution, is a mathematical equation that approximates the failure frequency distribution. It is the pdf, or life frequency distribution, that yields the familiar bell-shaped curve in the Gaussian, or normal, distribution. Below is the pdf for the exponential distribution.

$$\mathrm{pdf}(t) = \lambda e^{-\lambda t}$$

Where:
pdf(t) = Life frequency distribution for a given time (t)
e = Base of the natural logarithms (2.718281828)
λ = Failure rate (1/MTBF, or 1/MTTF)

In our electric motor example, the actual likelihood of failure at three years is calculated as follows:

$$\mathrm{pdf}(3) = 0.1 \times 2.718281828^{-(0.1 \times 3)}$$
$$\mathrm{pdf}(3) = 0.1 \times 0.7408$$
$$\mathrm{pdf}(3) = 0.07408 \approx 7.4\%$$

In our example, if we assume a constant failure rate, which follows the exponential distribution, the life distribution, or pdf, for the industrial electric motors is expressed in Figure 3. Don’t be confused by the declining nature of the pdf function. Yes, the failure rate is constant, but the pdf mathematically assumes failure without replacement, so the population from which failures can occur is continuously reducing, asymptotically approaching zero.

Figure 3 - The probability density function (pdf).

The cumulative distribution function (cdf) is simply the cumulative number of failures one might expect over a period of time. For the exponential distribution, the failure rate is constant, so the relative rate at which failed components are added to the cdf remains constant. However, as the population declines as a result of failure, the actual number of mathematically estimated failures decreases as a function of the declining population. Much like the pdf asymptotically approaches zero, the cdf asymptotically approaches one (Figure 4).

Figure 4 - Failure rate and the cumulative distribution function.

The declining failure rate portion of the bathtub curve, which is often called the infant mortality region, and the wear out region will be discussed in the following section addressing the versatile Weibull distribution.

Weibull Distribution
Originally developed by Waloddi Weibull, a Swedish mathematician, Weibull analysis is easily the most versatile distribution employed by reliability engineers. While it is called a distribution, it is actually a tool that enables the reliability engineer to characterize the probability density function (failure frequency distribution) of a set of failure data and to classify the failures as early life, constant (exponential) or wear out (Gaussian or log normal). This is done by plotting time-to-failure data on special plotting paper, with the times/cycles/miles to failure on a log-scaled x-axis versus the cumulative percent of the population represented by each failure on a log-log scaled y-axis (Figure 5).
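The Weibull plotting paper described above can be imitated numerically: transform each failure time to ln(t), transform its cumulative percent failed F to ln(-ln(1 - F)), and fit a straight line. The sketch below makes several assumptions that are not in the paper: cumulative percentages are assigned with Bernard's median rank approximation, the line is fitted by ordinary least squares, all units in the sample have failed (no censoring), and the failure times are hypothetical.

```python
import math


def weibull_plot_fit(times_to_failure):
    """Fit a straight line to Weibull-plot coordinates.

    Returns (beta, eta): the slope (shape parameter) and the
    characteristic life implied by the intercept.
    """
    t = sorted(times_to_failure)
    n = len(t)
    if n < 2:
        raise ValueError("need at least two failure times")
    xs, ys = [], []
    for i, ti in enumerate(t, start=1):
        # Assumption: Bernard's approximation for the median rank.
        f = (i - 0.3) / (n + 0.4)
        xs.append(math.log(ti))                    # log-scaled x-axis
        ys.append(math.log(-math.log(1.0 - f)))    # log-log scaled y-axis
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    sxx = sum((x - x_mean) ** 2 for x in xs)
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    beta = sxy / sxx
    intercept = y_mean - beta * x_mean
    # The line is y = beta*ln(t) - beta*ln(eta), so eta = exp(-intercept/beta).
    eta = math.exp(-intercept / beta)
    return beta, eta


# Hypothetical times to failure, in years, for illustration only.
sample = [2.1, 3.4, 4.9, 6.2, 8.0, 9.7, 12.5, 16.3]
beta, eta = weibull_plot_fit(sample)
print(f"shape (beta) = {beta:.2f}, characteristic life = {eta:.1f} years")
```

Commercial Weibull software typically offers other ranking and fitting methods and handles suspended (non-failed) units, so this should be read as a sketch of the plotting idea rather than a substitute for those tools.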


Figure 5 - The simple Weibull plot - annotated.

Once plotted, the linear slope of the resultant curve is an important variable, called the shape parameter, represented by β, which is used to adjust the exponential distribution to fit a wide number of failure distributions. In general, if the β coefficient, or shape parameter, is less than 1.0, the distribution exhibits early life, or infant mortality, failures. If the shape parameter exceeds about 3.5, the data are time dependent and indicate wearout failures. This data set typically assumes the Gaussian, or normal, distribution. As the β coefficient increases above ~3.5, the bell-shaped distribution tightens, exhibiting increasing kurtosis (peakedness at the top of the curve) and a smaller standard deviation. Many data sets will exhibit two or even three distinct regions. It is common for reliability engineers to plot, for example, one curve representing the shape parameter during run in and another curve to represent the constant or gradually increasing failure rate. In some instances, a third distinct linear slope emerges to identify a third shape, the wearout region. In these instances, the pdf of the failure data do in fact assume the familiar bathtub curve shape (Figure 6). Most mechanical equipment used in plants, however, exhibits an infant mortality region and a constant or gradually increasing failure rate region. It is rare to see a curve representing wearout emerge. The characteristic life, or η (lower case Greek “Eta”), is the Weibull approximation of the MTBF. It is always the point in time, miles or cycles at which 63.21% of the units under evaluation have failed, which is the MTBF/MTTF for the exponential distribution.

Figure 6 - Depending upon the shape parameter, the Weibull failure density curve can assume several distributions, which is what makes it so versatile for reliability engineering.

As a caveat to tie this tool back to maintenance and operations excellence, if we were to more effectively control the forcing functions that lead to mechanical failure in bearings, gears, etc., such as lubrication, contamination control, alignment, balance, appropriate operation, etc., more machines would actually reach their fatigue life. Machines that reach their fatigue life will exhibit the familiar wearout characteristic.

Using the β coefficient to adjust the failure rate equation as a function of time yields the following general equation:

$$h(t) = \frac{\beta}{t}\left(\frac{t}{\theta}\right)^{\beta}$$

Where:
h(t) = Failure rate (or hazard rate) for a given time (t)
θ = Estimated MTBF/MTTF
β = Weibull shape parameter from plot.

And the following reliability function:

$$R(t) = e^{-\left(\frac{t}{\theta}\right)^{\beta}}$$

Where:
R(t) = Reliability estimate for a period of time, cycles, miles, etc. (t)
e = Base of the natural logarithms (2.718281828)
θ = Estimated MTBF/MTTF
β = Weibull shape parameter from plot.

And the following probability density function (pdf):

$$\mathrm{pdf}(t) = \frac{\beta}{t}\left(\frac{t}{\theta}\right)^{\beta} e^{-\left(\frac{t}{\theta}\right)^{\beta}}$$

Where:
pdf(t) = Probability density function estimate for a period of time, cycles, miles, etc. (t)
e = Base of the natural logarithms (2.718281828)
θ = Estimated MTBF/MTTF
β = Weibull shape parameter from plot.

It should be noted that when β equals 1.0, the Weibull distribution takes the form of the exponential distribution on which it is based.

To the uninitiated, the mathematics required to perform Weibull analysis may look daunting. But once you understand the mechanics of the formulas, the math is really quite simple. Moreover, software will do most of the work for us today, but it is important to have an understanding of the underlying theory so that the plant reliability engineer can effectively deploy the powerful Weibull analysis technique.
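The three Weibull expressions (hazard rate, reliability and pdf) translate directly into code. The sketch below uses illustrative function names and the motor example's MTBF of 10 years as θ; with β set to 1.0 it should return the same values as the exponential calculations for the motors.

```python
import math


def weibull_hazard(t: float, theta: float, beta: float) -> float:
    """h(t) = (beta / t) * (t / theta) ** beta, for t > 0."""
    return (beta / t) * (t / theta) ** beta


def weibull_reliability(t: float, theta: float, beta: float) -> float:
    """R(t) = exp(-(t / theta) ** beta)."""
    return math.exp(-((t / theta) ** beta))


def weibull_pdf(t: float, theta: float, beta: float) -> float:
    """pdf(t) = h(t) * R(t)."""
    return weibull_hazard(t, theta, beta) * weibull_reliability(t, theta, beta)


theta = 10.0  # years, from the motor example

# With beta = 1.0 the Weibull expressions collapse to the exponential case.
print(f"h(6)   = {weibull_hazard(6.0, theta, 1.0):.4f}")
print(f"R(6)   = {weibull_reliability(6.0, theta, 1.0):.4f}")
print(f"pdf(3) = {weibull_pdf(3.0, theta, 1.0):.5f}")
```

Writing the pdf as the product of the hazard rate and the reliability mirrors the structure of the equations themselves and keeps the three functions consistent with one another.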


In our previously discussed example of electric motors, we assumed the exponential distribution. However, if Weibull analysis revealed early life failures by yielding a β shape parameter of 0.5, the estimate of reliability at six years would be ~46%, not the ~55% estimated assuming the exponential distribution. In order to reduce early life failures, we would need to lean on our suppliers to provide better built and delivered quality and reliability, store the motors better to avoid rust, corrosion, fretting and other static wear mechanisms, and do a better job of installing and starting up new or rebuilt machines.

Conversely, if Weibull analysis revealed that the motors exhibited predominantly wearout related failures, yielding a β shape parameter of 5.0, the estimate of reliability at six years would be ~93%, instead of the ~55% estimated assuming the exponential distribution. For time-dependent wearout failures, we can perform scheduled overhaul or replacement, assuming we have a good estimate of the MTBF/MTTF after we’ve reached the wearout region and a sufficiently small standard deviation so as to make high confidence rebuild/replace decisions that aren’t exceedingly costly. In our motor example, assuming a β shape parameter of 5.0, the failure rate begins to rapidly increase after about five or six years, so we may want to edit our data to just focus upon the wearout region when estimating time-based replacement or rebuild time. Alternatively, we can improve the design, targeting the dominant failure mode(s) with the objective of decreasing “stress-strength” interferences. In other words, we can attempt to eliminate the machine’s frailties through design modification, the goal being to eliminate whatever is causing the time-dependent failures.

Assuming everything is constant except the β shape parameter, Figure 7 illustrates the difference the β shape parameter has on the estimate of reliability, assuming β shape values of 0.5 (early life), 1.0 (constant, or exponential) and 5.0 (wearout) for a range of time estimates. This graphic visually illustrates the concept of decreasing risk versus time (β = 0.5), constant risk versus time (β = 1.0) and increasing risk versus time (β = 5.0).

Figure 7 - Various reliability projections as a function of time for different Weibull shape parameters.

The Multi-Slope Weibull Plot
Frequently, when drawing a best-fit regression line through the data points on a Weibull plot, the coefficient of correlation is poor, meaning the actual data points stray a great distance from the regression line. This is assessed by examining the coefficient of correlation R, or more conservatively, R², which denotes data variability. When correlation is poor, the reliability engineer should examine the data to evaluate whether two or more patterns exist, which can denote major differences in failure modes, operating context, etc. Often, this produces two or more estimates of beta (Figure 8).

Figure 8 - An example of a multi-beta Weibull plot.

As we see in our example in Figure 8, the data set works better when two distinct regression lines are drawn. The first line exhibits a beta shape parameter of 0.5, suggesting early life failures. The second line exhibits a beta shape parameter of 3.0, suggesting that the risk of failure increases as a function of time. It is common for complex equipment, particularly mechanical equipment, to experience “run-in” failures when new or recently rebuilt. As such, the risk of failure is highest just following initial start-up. Once the system works through its run-in period, which can take minutes, hours, days, weeks, months or years, depending upon the system type, the system enters a different risk pattern. In this example, the system enters a period where the risk of failure increases as a function of time once it exits its run-in period.

The multi-beta plot offers the reliability engineer a more precise estimate of risk as a function of time. Armed with this knowledge, he or she is better positioned to take mitigating actions. For example, during the early life period, we’d be inclined to improve the precision with which we manufacture/rebuild, install and start up. Moreover, we might add monitoring techniques and/or increase our monitoring frequency during the high risk period. Following the run-in period, we might introduce monitoring techniques that are targeted at the time-dependent wearout failures that are believed to affect the system, increase monitoring frequency accordingly or schedule “hard-time” preventive maintenance actions in some cases.
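The effect of the shape parameter on the motor example, and the kind of comparison Figure 7 presents, can be reproduced with a short script. The sketch assumes θ = 10 years for all three cases, as in the motor example; the labels, dictionary names and chosen time points are illustrative.

```python
import math


def weibull_reliability(t: float, theta: float, beta: float) -> float:
    """R(t) = exp(-(t / theta) ** beta)."""
    return math.exp(-((t / theta) ** beta))


THETA = 10.0  # years, assumed identical for every case
SHAPES = {"early life": 0.5, "constant": 1.0, "wearout": 5.0}

# Reliability estimate at six years for each shape parameter.
reliability_at_six = {
    label: weibull_reliability(6.0, THETA, beta) for label, beta in SHAPES.items()
}

for label, beta in SHAPES.items():
    curve = ", ".join(
        f"R({t})={weibull_reliability(t, THETA, beta):.2f}" for t in (2, 6, 10, 14)
    )
    print(f"beta={beta:<3} ({label}): {curve}")
```

Because every curve passes through the same value at t = θ, the shape parameter changes when failures accumulate rather than the characteristic life itself.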


Estimating System Reliability
Once the reliability of components or machines has been established relative to the operating context and required mission time, plant engineers must assess the reliability of a system or process. Again, for the sake of brevity and simplicity, we’ll discuss system reliability estimates for series, parallel and shared-load redundant systems (r/n systems).

Series Systems
Before discussing series systems, we should discuss reliability block diagrams. Not a complicated tool to use, reliability block diagrams simply map a process from start to finish. For a series system, subsystem A is followed by subsystem B and so forth. In the series system, the ability to employ subsystem B depends upon the operating state of subsystem A. If subsystem A is not operating, the system is down regardless of the condition of subsystem B (Figure 9).

To calculate the system reliability for a serial process, one needs only to multiply the estimated reliability of subsystem A at time (t) by the estimated reliability of subsystem B at time (t). The basic equation for calculating the system reliability of a simple series system is:

$$R_s(t) = R_1(t) \times R_2(t) \times \dots \times R_n(t)$$

Where:
Rs(t) = System reliability for given time (t)
R1-n(t) = Subsystem or sub-function reliability for given time (t)

So, for a simple system with three subsystems, or sub-functions, each having an estimated reliability of 0.90 (90%) at time (t), the system reliability is calculated as 0.90 × 0.90 × 0.90 = 0.729, or about 73%.

Figure 9 - Simple serial system.

Parallel Systems
Often, design engineers will incorporate redundancy into critical machines. Reliability engineers call these parallel systems. These systems may be designed as active parallel systems or standby parallel systems. The block diagram for a simple two component parallel system is shown in Figure 10.

Figure 10 - Simple parallel system: the system reliability is increased to 99% due to the redundancy.

To calculate the reliability of an active parallel system, where both machines are running, use the following simple equation:

$$R_s(t) = 1 - \left[ (1 - R_1(t)) \times (1 - R_2(t)) \times \dots \times (1 - R_n(t)) \right]$$

Where:
Rs(t) = System reliability for given time (t)
R1-n(t) = Subsystem or sub-function reliability for given time (t)

The simple parallel system in our example with two components in parallel, each having a reliability of 0.90, has a total system reliability of 1 – (0.1 × 0.1) = 0.99. So the system reliability was significantly improved. There are some shortcut methods for calculating parallel system reliability when all subsystems have the same estimated reliability. More often, systems contain parallel and serial subcomponents as depicted in Figure 11. The calculation of standby systems requires knowledge about the reliability of the switching mechanism. In the interest of simplicity and brevity, this topic will be reserved for a future paper.

Figure 11 - Combination system with parallel and serial elements.

r out of n Systems (r/n Systems)
An important concept to plant reliability engineers is the concept of r/n systems. These systems require that r units from a total population of n be available for use. A great industrial example is coal pulverizers in an electric power generating plant. Often, the engineers design this function in the plant using an r/n approach. For instance, a unit may have four pulverizers and require that three of the four be operable to run at the unit’s full load. This reliability calculation can be reduced to a simple cumulative binomial distribution calculation, the formula for which is:


$$R_s = P(r \le k) = \sum_{r=0}^{k} \frac{n!}{r!\,(n-r)!}\,(1-p)^{r}\,p^{\,n-r}$$

Where:
Rs = System reliability given the actual number of failures (r) is less than or equal to the maximum allowable (k)
r = The actual number of failures
k = The maximum allowable number of failures
n = The total number of units in the system
p = The probability of survival, or the subcomponent reliability for a given time (t).

This equation is somewhat more complicated. In our pulverizer example, with n = 4 units, a maximum of k = 1 allowable failure and a subcomponent reliability of 0.90, the equation works out as a summation of the following:

P(0) = 0.90^4 = 0.6561
P(1) = 4 × 0.10 × 0.90^3 = 0.2916

So, the likelihood of completing the mission time (t) is 0.9477 (0.6561 + 0.2916), or approximately 95%.

… would likely be very different (Figure 12). For certain, some failure modes would still be mathematical, but many, and arguably most, would exhibit a time dependency. This kind of information would arm reliability engineers and managers with a powerful set of options for mitigating failure risk with a high degree of precision. Naturally, this ability depends upon the effective collection and subsequent analysis of field data.

Field Data Collection


To employ the reliability analysis methods described herein, the
engineer requires data. It is imperative to establish field data Figure 12 - Good field data collection enables you to break the random trap.
collection systems to support your reliability management initiatives.
Likewise, as much as possible, you’ll want to employ common Conclusions
nomenclature and units so that your data can be parsed effectively This brief introduction to reliability engineering methods is intended
for more detailed analysis. Collect the following information: to expose the otherwise uninitiated plant engineer to the world of
• Basic System Information quantitative reliability engineering. The subject is quite broad,
however, and I’ve only touched on the major reliability methods that I
• Operating Context
believe are most applicable to the plant engineer. I encourage you to
• Environmental Context
further investigate the field of reliability engineering methods,
• Failure Data concentrating on the following topics, among others:
• More detailed understanding of the Weibull distribution and its
A good general system for data collection is described in the IEC applications
standard 300-3-2. In addition to providing instructions for collecting • More detailed understanding of the exponential distribution and
field data, it provides a standard taxonomy of failure modes. Other its applications
taxonomies have been established, but the IEC standard represents a
• The Gaussian distribution and its applications
good starting point for your organization to define its own. Likewise,
• The log-normal distribution and its applications
DOE standard NE-1004-92 offers a very nice standard nomenclature
of failure causes. • Confidence intervals (binomial, chi-square/Poisson, etc.)
• Beta distribution and its applications
An important benefit derived from your efforts to collect good field • Bayesian applications of reliability engineering methods
data is that it enables you to break the “random trap.” As I • Stress-strength interference analysis
mentioned earlier, the bathtub curve has been much maligned –
• Testing options and their applicability to plant reliability
particularly in the reliability-centered maintenance literature. While
engineering
it’s true that Weibull analysis reveals that few complex mechanical
systems exhibit time-dependent wear out failures, the reason, at • Reliability growth strategies and management
least in part, is due to the fact that the reliability of complex systems • More detailed understanding of field data collection
is affected by a wide variety of failure modes and mechanisms.
When these are lumped together, there is a “randomizing” effect, Most important, spend time learning how to apply reliability
which makes the failures appear to lack any time dependency. engineering methods to plant reliability problems. I’ll be addressing
However, if the failure modes were analyzed individually, the story


reliability engineering methods in future issues of Reliability World magazine at more detailed and applied levels, emphasizing the needs of the plant reliability engineer. If your interest in reliability engineering methods is high, I encourage you to pursue professional certification by the American Society for Quality as a reliability engineer (CRE).

References
Troyer, D. (2006) Strategic Plant Reliability Management Course Book, Noria Publishing, Tulsa, Oklahoma.
Bernowski, K. (1997) “Safety in the Skies,” Quality Progress, January.
Dovich, R. (1990) Reliability Statistics, ASQ Quality Press, Milwaukee, WI.
Krishnamoorthi, K.S. (1992) Reliability Methods for Engineers, ASQ Quality Press, Milwaukee, WI.
MIL Standard 721
IEC Standard 300-3-2
DOE Standard NE-1004-92

Appendix – Select Reliability Engineering Terms from MIL STD 721

Availability – A measure of the degree to which an item is in the operable and committable state at the start of the mission, when the mission is called for at an unknown (random) time.

Capability – A measure of the ability of an item to achieve mission objectives given the conditions during the mission.

Dependability – A measure of the degree to which an item is operable and capable of performing its required function at any (random) time during a specified mission profile, given the availability at the start of the mission.

Failure – The event, or inoperable state, in which an item, or part of an item, does not, or would not, perform as previously specified.

Failure, Dependent – Failure which is caused by the failure of an associated item(s). Not independent.

Failure, Independent – Failure which occurs without being caused by the failure of any other item. Not dependent.

Failure Mechanism – The physical, chemical, electrical, thermal or other process which results in failure.

Failure Mode – The consequence of the mechanism through which the failure occurs, i.e. short, open, fracture, excessive wear.

Failure, Random – Failure whose occurrence is predictable only in the probabilistic or statistical sense. This applies to all distributions.

Failure Rate – The total number of failures within an item population, divided by the total number of life units expended by that population, during a particular measurement interval under stated conditions.

Maintainability – The measure of the ability of an item to be retained or restored to specified condition when maintenance is performed by personnel having specified skill levels, using prescribed procedures and resources, at each prescribed level of maintenance and repair.

Maintenance, Corrective – All actions performed, as a result of failure, to restore an item to a specified condition. Corrective maintenance can include any or all of the following steps: localization, isolation, disassembly, interchange, reassembly, alignment, and checkout.

Maintenance, Preventive – All actions performed in an attempt to retain an item in a specified condition by providing systematic inspection, detection and prevention of incipient failures.

Mean-Time-Between-Failure (MTBF) – A basic measure of reliability for repairable items: the mean number of life units during which all parts of the item perform within their specified limits, during a particular measurement interval under stated conditions.

Mean-Time-To-Failure (MTTF) – A basic measure of reliability for non-repairable items: the mean number of life units during which all parts of the item perform within their specified limits, during a particular measurement interval under stated conditions.

Mean-Time-To-Repair (MTTR) – A basic measure of maintainability: the sum of corrective maintenance times at any specified level of repair, divided by the total number of failures within an item repaired at that level, during a particular interval under stated conditions.

Mission Reliability – The ability of an item to perform its required functions for the duration of a specified mission profile.

Reliability – (1) The duration or probability of failure free performance under stated conditions. (2) The probability that an item can perform its intended function for a specified interval under stated conditions. For non-redundant items this is equivalent to definition (1). For redundant items, this is the definition of mission reliability.
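
The Failure Rate, MTBF and MTTR definitions in this appendix reduce to simple ratios, which can be sketched in Python as follows. This is an illustrative sketch only: the function names and the sample figures (5 failures over 10,000 operating hours) are hypothetical and not drawn from the paper, and MTBF is computed as life units per failure, which is an assumed reading of the definition for a single measurement interval.

```python
def failure_rate(failures, life_units):
    """Failures divided by life units expended (e.g., operating hours)."""
    if life_units <= 0:
        raise ValueError("life_units must be positive")
    return failures / life_units


def mtbf(failures, life_units):
    """Assumed computation: life units expended per failure (repairable items)."""
    if failures <= 0:
        raise ValueError("at least one failure is needed to estimate MTBF")
    return life_units / failures


def mttr(corrective_times):
    """Sum of corrective maintenance times divided by the number of failures.

    Assumes one recorded corrective maintenance time per failure.
    """
    if not corrective_times:
        raise ValueError("at least one repair record is needed")
    return sum(corrective_times) / len(corrective_times)


if __name__ == "__main__":
    # Hypothetical field data: 5 failures in 10,000 operating hours,
    # with the repair duration in hours recorded for each failure.
    print(failure_rate(5, 10_000))
    print(mtbf(5, 10_000))
    print(mttr([4.0, 6.0, 2.0, 8.0, 5.0]))
```

Estimates like these are only as good as the field data behind them, and the life units (hours, cycles, miles) must be recorded consistently across the item population.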
