Guidelines for Equipment Reliability
SEMATECH
Technology Transfer 92031014A-GEN
SEMATECH and the SEMATECH logo are registered service marks of SEMATECH, Inc.
Product names and company names used in this publication are for identification purposes only and may be trademarks or service
marks of their respective companies.
Abstract: This guideline was developed by a task force comprised of reliability experts and users of
reliability methodologies from the SEMI/SEMATECH member companies. The document was
written to address the needs of semiconductor equipment manufacturers and their customers. It
includes a description of the principles of a cost-effective reliability program, instructions on how
to get started, and details on what needs to be done. A large portion of the document is dedicated
to analysis and testing methodologies. These include: Failure Modes and Effects Analysis
(FMEA), Fault Tree Analysis (FTA), Component Failure Analysis (CFA), Human Reliability
Analysis (HRA); and Reliability Testing, Component Testing, Accelerated Testing (Sudden Death,
Step-Stress Testing), Burn-in Testing, Life Testing, Environmental Stress Screening, Qualification
Testing, and Acceptance Testing.
Keywords: Life Cycle Phases, Reliability Testing, RAMP, Failure, FRACAS, Failure Modes and Effects
Analysis, Quality Function Deployment (QFD), Design of Experiment, Cost of Ownership, Infant
Mortality, Reliability Qualification Testing (RQT), Taguchi, User’s Groups, Reliability Block
Diagram Modeling (RBD), Environmental Stress Screening (ESS), Fault Tree Analysis (FTA)
Table of Contents
1 SUMMARY
2 THE RELIABILITY IMPROVEMENT PROCESS AND EQUIPMENT LIFE CYCLE
2.1 Introduction
2.2 The Equipment Life Cycle
2.3 Life Cycle Phases
2.4 Life Cycle Cost
2.5 The Reliability Improvement Process
2.6 Applying the Reliability Improvement Process
2.7 Summary
2.8 References
3 IMPLEMENTATION OF THE RELIABILITY IMPROVEMENT PROCESS
3.1 Introduction
3.2 Management’s Role
3.3 Applying the Reliability Improvement Process
3.4 Specific Applications of the Reliability Improvement Process
3.4.1 Starting with Equipment in the Design Phase
3.4.2 Starting with Equipment in the Prototype Phase
3.4.3 Starting with Equipment in the Pilot Production Phase
3.4.4 Starting with Equipment in the Production and Operation Phase
3.4.5 Starting with Equipment in the Phase Out Phase
3.5 Functional Responsibilities
3.6 Where to Begin
3.7 Reliability Plans
3.8 Application of Resources and Communicating Value
3.9 Summary
3.10 References
4 ACTIVITIES AND TOOLS IN THE RELIABILITY IMPROVEMENT PROCESS
4.1 Introduction
4.2 Reliability Activities
List of Tables
Table 3-1. Reliability Improvement Process Applied at Six Different Starting Points
Table 3-2. Reliability Improvement Process Activities
Table 3-3. Reliability Improvement Process Activities for the Design Phase
Table 3-4. Reliability Improvement Process Activities for the Prototype Phase
Table 3-5. Reliability Improvement Process Activities for the Pilot Production Phase
Table 3-6. Reliability Improvement Process Activities for the Production and Operation Phase
Table 3-7. Reliability Improvement Process Activities for the Phase-Out Phase
Table 3-8. Design Phase Reliability Improvement Process Activities
Table 3-9. Prototype Phase Reliability Improvement Process Activities
Table 3-10. Pilot Production Phase Reliability Improvement Process Activities When Initiated in Pilot Production Phase
Table 3-11. Production and Operation Phase Reliability Improvement Process Activities When Initiated in Production and Operation Phase
Table 3-12. Phase-Out Phase Reliability Improvement Process Activities When Initiated in Phase-Out Phase
Table 3-13. Current Product Line Status
Acknowledgements
To assist in the development of these guidelines, a task force of representatives from the
semiconductor industry was assembled to provide guidance in the structure and content. Their
contributions and dedication to this effort have been excellent and beyond the call of duty. Our
thanks to each of the task force members, reviewers, and contributors for their commitment to
such an ambitious effort. It has made the development of these guidelines more enjoyable and
possible.
TASK FORCE MEMBERS
SEMATECH
Dr. Vallabh H. Dhudshia, Texas Instruments, Inc.
David Seekon, National Semiconductor Corp.
Mario Villacourt
SEMI/SEMATECH
Dr. Michael McGraw
William J. Spencer
President and Chief Executive Officer
Preface
These guidelines have been written for use by semiconductor equipment suppliers and customers.
They are intended as a road map that these groups can refer to for assistance in improving the
reliability of their semiconductor manufacturing equipment as part of a long-term strategy aimed
at regaining an increased worldwide market share.
Although there is an abundance of reliability information available in text books, military
handbooks and standards, and guidebooks directed at specific products, there is no concise,
single source document available for the semiconductor equipment industry. The purpose of
these guidelines is to fill this gap. To assist in this effort, a task force consisting of
representatives from the semiconductor industry was assembled to provide guidance in the
structure and content of these guidelines. The guidelines do not provide comprehensive
instruction on the details of reliability engineering; rather they provide a description of the
principles of a cost-effective reliability program, instructions on how to get started, and details on
what needs to be done. Descriptions of necessary program activities and reliability concepts are
provided along with references for those who desire additional information.
The guidelines focus on hardware reliability, while recognizing that software reliability is an
important aspect of reliability for a large segment of semiconductor manufacturing equipment.
Because other guidelines address software reliability, that topic is discussed only briefly
here.
The guidelines:
• Are intended to be of value to managers, reliability engineers, and designers
• Are not a "detailed how-to" document, but rather a "roadmap of how to"
• Are centered around a continuous improvement process referred to as the
Reliability Improvement Process
• Cover the entire equipment life cycle as it applies to the semiconductor equipment
industry
Even though emphasis is placed on designing in reliability, the guidelines show how to
incorporate reliability into every phase of the equipment life cycle.
1 SUMMARY
These guidelines focus on a continuous improvement process referred to as the Reliability
Improvement Process, and the Equipment Life Cycle. These two concepts are introduced and
discussed in Section 2.0 of the guidelines. Knowledge of the equipment life cycle is important
because it provides a basis for understanding how and where reliability engineering enters into
the process of designing, producing, and operating the equipment. In this document, the life
cycle has been broken into six distinct phases, each representing a unique portion of the life
cycle. These six life cycle phases are:
1. Concept and Feasibility Phase
2. Design Phase
3. Prototype (alpha-site) Phase
4. Pilot Production (beta-site) Phase
5. Production and Operation Phase
6. Phase-out Phase
These phases provide the framework for tracking reliability improvement throughout the
equipment life cycle and provide guidance on when and where to apply resources. Life cycle cost
concepts are introduced to show how expenditures and cost of ownership are affected
when reliability work is initiated in different phases of the life cycle.
The Reliability Improvement Process provides a means for systematically improving reliability
throughout the equipment life cycle. It is an iterative process of setting goals, evaluating,
comparing, and improving that is directed toward continuous reliability improvement. It consists of
five basic steps:
1. Establish reliability goals and requirements for equipment
2. Apply reliability engineering or improvement activities, as needed
3. Conduct an evaluation of the equipment or equipment design
4. Compare the results of the evaluation to the goals and requirements and make a
decision for the next step
5. Identify problems and root causes
The process then returns to Step 2, and repeats Steps 2 through 5 until goals and requirements are
met.
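The iteration described above can be sketched in a few lines of Python. This is only an illustrative sketch and not part of the guideline: the function name, the use of a single numeric measure as the goal (here mean time between failures, in hours), and the `improve` and `evaluate` callbacks are all assumptions introduced for the example, as are the numbers.

```python
from typing import Callable, Tuple


def reliability_improvement_loop(
    goal_mtbf_hours: float,          # Step 1: goal/requirement (hypothetical metric)
    improve: Callable[[], None],     # Step 2: apply improvement activities
    evaluate: Callable[[], float],   # Step 3: evaluate the equipment or design
    max_cycles: int = 10,
) -> Tuple[bool, int, float]:
    """Repeat Steps 2 through 5 until the goal is met or max_cycles is reached."""
    measured = 0.0
    for cycle in range(1, max_cycles + 1):
        improve()
        measured = evaluate()
        if measured >= goal_mtbf_hours:  # Step 4: compare results to the goal
            return True, cycle, measured
        # Step 5 (identify problems and root causes) would happen here and
        # feed the next call to improve(); it is not modeled in this sketch.
    return False, max_cycles, measured


# Toy usage with invented numbers: each cycle raises MTBF by 25%.
state = {"mtbf_hours": 200.0}


def toy_improve() -> None:
    state["mtbf_hours"] *= 1.25


def toy_evaluate() -> float:
    return state["mtbf_hours"]


met, cycles, final_mtbf = reliability_improvement_loop(400.0, toy_improve, toy_evaluate)
print(met, cycles, final_mtbf)  # prints: True 4 488.28125
```

The `max_cycles` limit is a design choice of the sketch; in practice the decision to stop iterating would be a management judgment rather than a fixed count.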
2.1 Introduction
The reliability improvement process and the equipment life cycle form the basis for these
guidelines and are introduced in this section. The reliability improvement process is an iterative
process that provides:
• An effective and systematic way to include reliability in equipment design
• A structure for making reliability improvements throughout the equipment life
cycle
The reliability improvement process provides a means for making revolutionary advancements
when it is applied to equipment early in the design stage or during major design upgrades, and for
making evolutionary improvements to existing equipment.
Knowledge of the equipment life cycle is important because it provides:
• The framework for applying the reliability improvement process
• A basis for understanding the best practice for improving equipment reliability
and the cost of the improvement
Life cycle costs are introduced in this section to provide a perspective on the impact of initiating
the reliability improvement process early in the equipment life cycle. A thorough knowledge of
life cycle costs and life cycle phase relationships helps to achieve better equipment at lower total
costs.
Knowledge of the equipment life cycle enables proper planning and execution of the activities
and functions necessary for designing, manufacturing, and operating reliable equipment in a
cost-effective manner.
1. Concept and Feasibility.
During this phase, marketing and sales personnel, customer service representatives,
design and reliability engineers, and manufacturing engineers work together with the
customer to:
• Determine the need for new equipment
• Establish reliability goals
• Evaluate the feasibility of meeting these goals
• Estimate resource requirements
• Examine alternative design concepts
• Select those concepts to be studied in more detail during the design phase
• Estimate cost trade-offs
The concept and feasibility phase, and the design phase that follows, are the optimal
times for using design-for-reliability practices.
2. Design. The alternative design concepts selected during the concept and feasibility phase
are explored in more detail by the design engineers during this phase of the life cycle. A
design disclosure package is prepared and evaluated by all concerned parties. Reliability
and manufacturing engineers, as well as quality assurance and field service personnel are
generally called on by the design engineers for input concerning parts selection,
components, serviceability, and manufacturing processes. Also, reliability goals set for
the equipment during the concept and feasibility phase are translated into requirements
very early in the design phase. Requirements are useful in making preliminary reliability
allocations to subsystems and components to understand cost impacts.
This phase of the life cycle can be separated into two parts: preliminary design and final
design.
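The preliminary reliability allocation mentioned for the design phase can be illustrated with a minimal sketch. It assumes the simplest textbook scheme, equal apportionment across independent subsystems in series; the guideline does not prescribe this method, and the function name and the numbers are hypothetical.

```python
def equal_allocation(system_goal: float, n_subsystems: int) -> float:
    """Reliability each of n independent series subsystems must reach
    so that their product equals the system-level goal."""
    if not 0.0 < system_goal <= 1.0:
        raise ValueError("system_goal must be in (0, 1]")
    if n_subsystems < 1:
        raise ValueError("n_subsystems must be at least 1")
    return system_goal ** (1.0 / n_subsystems)


# Invented example: a 0.90 system reliability goal split across 4 subsystems.
per_subsystem = equal_allocation(0.90, 4)
print(round(per_subsystem, 4))  # roughly 0.974 per subsystem
```

Even this rough allocation shows why requirements help expose cost impacts early: each subsystem must be noticeably more reliable than the system goal itself.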
3. Prototype.
Multiple design alternatives may require prototyping and testing if serious questions exist
about the best overall choice. It is common for reliability engineers to have responsibility
for performing these tests. However, manufacturing personnel will have responsibility
for determining that parts and components conform to specifications within financial
guidelines.
During the prototype phase, design, reliability, test, and manufacturing engineers, as well
as quality assurance personnel:
• Build and test one or more prototypes of a design
• Present the test results for a pilot production design review
• Redesign as needed to fix weaknesses or make other desirable changes
• Conduct additional design reviews as appropriate
The design reviews should include another critical design review to give the customer an
opportunity to review the latest design being considered.
Concurrent with redesigns and design reviews, reliability engineers, quality assurance
personnel, and manufacturing engineers will develop quality assurance plans, design
inspection and testing programs, set up production facilities, and develop production
plans in preparation for the pilot production phase.
4. Pilot Production. This phase of the life cycle serves as a bridge between the prototype
phase and the production and operation phase. This is the first opportunity for the
equipment to be evaluated in an extended customer environment, and is therefore
commonly called beta-site evaluation. In fact, it may be the first time that the equipment
is exposed to a customer’s processes.
The purpose of the pilot production phase is to help identify and correct problems with
the equipment before full-scale production begins. Design and reliability engineers
should evaluate the actual level of equipment reliability and determine what needs to be
accomplished to meet requirements in a cost-effective manner.
During the pilot production phase, project management, reliability engineers,
manufacturing and test personnel, and customer service representatives:
• Qualify the equipment manufacturing process
• Establish field trials and customer applications of equipment
• Monitor the equipment’s performance
• Identify root causes of failures
• Implement a "corrective action" program for reliability problems
• Determine cost of ownership
Prior to the production and operation phase of the life cycle, reliability and design
engineers should evaluate equipment reliability and make the appropriate
recommendations. If the actual equipment reliability level is less than desired, specific reliability
improvement activities that were identified in the corrective action program should be
implemented. This is the last opportunity to make design changes and other
improvements before full-scale production.
Design reviews conducted at this point are often broken down into:
• Qualification Review - verify that the final design meets requirements
• Production Readiness Review - determine readiness for full
production
• Reliability Budget Review - verify the reliability goal allocations
If any design changes are made at this point, another critical design review may be
appropriate.
5. Production and Operation. This phase of the life cycle represents the time when units
are produced and sold. All major reliability problems should have been identified and
corrected prior to the production and operation phase. A formal program must be in place
for collecting and analyzing field service data, performance data for the customer’s
units, and the associated cost impact.
During the production and operation phase, field service personnel, management, quality
assurance personnel, and reliability engineers:
• Implement a field tracking, customer feedback, and customer satisfaction
program
• Provide training and technical assistance to customers
• Document and employ installation testing and operation procedures
• Identify and report operation and maintenance problems
• Record failure data in a formal database
• Manage continuous improvement efforts
• Determine cost of ownership impacts
Recorded failure data should account for uncertainty due to variations in site, product
vintage, and customer procedures.
After proper review, decisions are made for resource allocation for continuous
improvement in the reliability process. The supplier and customer should function as partners in
these efforts and may participate in user groups.
Once equipment is in the field, it is important to continually monitor reliability, analyze
failures and identify root causes, implement corrective actions, and eliminate known
causes of failure both for the current and the next generation of equipment.
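Recording failure data in a formal database, and keeping site and product vintage with each record, can be sketched as follows. This is a hedged illustration only: the record layout, the field names, the choice of mean time between failures (MTBF) as the summary metric, and all of the data are assumptions made for the example and do not come from the guideline.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, Iterable, Optional


@dataclass(frozen=True)
class FieldObservation:
    """One hypothetical reporting period for one installed unit."""
    site: str
    vintage: str
    operating_hours: float
    failure_count: int


def mtbf_by(
    observations: Iterable[FieldObservation], attribute: str
) -> Dict[str, Optional[float]]:
    """Group observations by an attribute and return operating hours per failure.

    A group with no recorded failures maps to None, because MTBF cannot be
    estimated from zero failures.
    """
    hours: Dict[str, float] = defaultdict(float)
    failures: Dict[str, int] = defaultdict(int)
    for obs in observations:
        group = getattr(obs, attribute)
        hours[group] += obs.operating_hours
        failures[group] += obs.failure_count
    return {g: (hours[g] / failures[g] if failures[g] else None) for g in hours}


# Invented data for illustration only.
observations = [
    FieldObservation("Fab A", "rev 1", 1200.0, 4),
    FieldObservation("Fab A", "rev 2", 800.0, 1),
    FieldObservation("Fab B", "rev 1", 1000.0, 5),
    FieldObservation("Fab B", "rev 2", 900.0, 0),
]
by_site = mtbf_by(observations, "site")        # Fab A: 400.0, Fab B: 380.0
by_vintage = mtbf_by(observations, "vintage")  # rev 2: 1700.0
```

In this made-up data set the two sites look similar while the two vintages differ widely, which is the kind of variation the recorded data should make visible.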
6. Phase Out. The equipment product line is approaching the end of its useful life during
this final phase of the life cycle. The end of useful life naturally occurs earlier for the
supplier than it does for the customer. The end of useful equipment life for the customer
can occur due to obsolescence, wear, or a change in business plans. To remain
competitive, the supplier must make plans for the next generation of equipment before
phasing out current generation production.
The information gained during the six phases of the life cycle should be retained so that it
can be used to improve future generations of similar or new equipment.
This completes the life cycle for the current generation of equipment. Each new
generation of equipment would experience basically the same life cycle.
Supplier Cost Implications. The early life cycle phases typically represent the smallest portion
of those total life cycle costs borne by the supplier, yet generally represent the region where the
greatest impact on equipment reliability can be made. As a design moves toward completion,
design details become increasingly fixed. Thus, the cost in time and dollars to correct reliability
problems increases. Figure 2-1 shows that, typically, toward the end of the design/development
macro phase of the life cycle, only 15% of the life cycle costs have been consumed, but approximately
95% of the total life cycle costs have been determined (i.e., locked in).[2] Thus, changes made to
improve reliability after the design/development macro phase have little impact on overall life
cycle costs, but can be very expensive in terms of design changes, retrofits, service calls,
warranty claims, and customer goodwill. This does not imply that equipment already in
the production/operation macro phase should be ignored in terms of improving reliability.
Reliability improvement activities should continue throughout the life cycle.
[Figure 2-1: chart of % locked-in costs and % total costs (both on 0 to 100 scales) across the Concept/Feasibility, Design/Development, and Production/Operation macro phases. Labeled values: 3%, 12%, 85%, 95%; Production (35%); Operation (50%).]
Although reliability improvements made earlier in the life cycle can increase initial supplier
costs, they generally result in lower support costs for the supplier and lower operational costs for
the customer. Also, early improvement could reduce the supplier’s costs of production, warranty,
and service.
Life cycle costs include both equipment supplier costs, which are passed on to the customer in
the purchase price of the equipment, and all costs incurred by the customer over the equipment
life. Supplier costs plus the supplier’s gross profit margin are referred to as acquisition costs, and
include:
• Research and development
• Marketing and sales
• Testing and manufacturing
• Supplier shipping and installation
• Supplier training and support
• Supplier service and spare parts
• Warranty costs
• Continuous improvement
Costs incurred by the customer are referred to as operational costs, and include:
• Customer installation and training
• Operating costs
• Customer service costs and spares inventory
• Customer performed maintenance
• Customer space costs
• Scheduled maintenance
• Equipment improvements and upgrades
• Down time and scrap costs
• Disposal costs
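The relationship between the cost categories listed above can be written as a short calculation. The sketch below is an assumption-laden illustration: the function names are hypothetical, the gross profit is treated as a fixed amount rather than a percentage, only a few of the listed categories are included, and every figure (in arbitrary currency units) is invented.

```python
from typing import Dict


def acquisition_cost(supplier_costs: Dict[str, float], gross_profit: float) -> float:
    """Supplier costs plus the supplier's gross profit (the purchase price)."""
    return sum(supplier_costs.values()) + gross_profit


def life_cycle_cost(acquisition: float, operational_costs: Dict[str, float]) -> float:
    """Acquisition costs plus all costs incurred by the customer."""
    return acquisition + sum(operational_costs.values())


# Invented figures for illustration only.
supplier_costs = {
    "research and development": 200.0,
    "testing and manufacturing": 500.0,
    "warranty": 50.0,
}
operational_costs = {
    "operating": 600.0,
    "scheduled maintenance": 150.0,
    "down time and scrap": 400.0,
}

acquisition = acquisition_cost(supplier_costs, gross_profit=250.0)
total = life_cycle_cost(acquisition, operational_costs)
print(acquisition, total)  # prints: 1000.0 2150.0
```

In this invented example the operational costs exceed the purchase price, so a comparison based on purchase price alone would miss more than half of the total.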
Life cycle cost implications to both the supplier and the customer are discussed in the following
paragraphs.
Customer Cost Implications. Improvements in reliability made by the supplier early in the
equipment life cycle may result in higher development costs being passed on to the customer in
the equipment acquisition costs. However, this increase can be more than offset by lower
operational costs, because increased reliability and up time result in greater
productivity.
Figure 2-2 illustrates how a reliability program impacts acquisition and operational costs. As this
figure indicates, acquisition costs may increase due to efforts to improve reliability.
[Figure 2-2: comparison of acquisition costs, operational costs, and total life cycle costs.]
However, operational costs and, even more important, total life cycle costs decrease. It is
important for the customer to make equipment purchase decisions based on total life cycle costs
and not just on initial purchase price.
Optimizing Life Cycle Costs. Increasing acquisition costs to improve equipment reliability and
lower operational and total life cycle costs is clearly a recommended practice. However, there is
a point at which increasing acquisition costs to obtain higher levels of reliability is no longer
beneficial. Figure 2-3 shows an optimal point beyond which total life cycle costs begin
increasing with further improvements in reliability.
[Figure 2-3: life cycle costs versus reliability, showing the acquisition cost curve, the operational cost curve, the total life cycle cost curve, and the optimized cost point.]
When this occurs, a more reliable technology is required for further improvement.
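The idea of an optimal point can be made concrete with a toy model. The two cost curves below are invented purely for illustration (the guideline gives no formulas): acquisition cost is assumed to rise steeply as reliability approaches 1, operational cost is assumed to fall in proportion to unreliability, and the function names and constants are hypothetical.

```python
def toy_acquisition_cost(reliability: float) -> float:
    """Assumed curve: cost grows without bound as reliability approaches 1."""
    return 10.0 / (1.0 - reliability)


def toy_operational_cost(reliability: float) -> float:
    """Assumed curve: cost is proportional to unreliability."""
    return 4000.0 * (1.0 - reliability)


def toy_total_cost(reliability: float) -> float:
    return toy_acquisition_cost(reliability) + toy_operational_cost(reliability)


# Search candidate reliability levels from 0.50 to 0.99 in steps of 0.01.
candidates = [i / 100 for i in range(50, 100)]
best = min(candidates, key=toy_total_cost)

print(best, round(toy_total_cost(best), 2))
```

With these assumed curves the minimum falls at a reliability of 0.95; pushing to 0.99 would more than double the total cost, which mirrors the point where further gains call for a different technology rather than more spending.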
Reliability insights from a technology used in one generation of equipment should be
documented so they can be used to improve the next generation. Improvements in technology
transfer between equipment generations will generally produce a decrease in the life cycle costs
in each succeeding generation of equipment as shown in Figure 2-4.
[Figure 2-4: life cycle costs versus reliability, with one curve each for Generation 1, Generation 2, Generation 3, and Generation 4.]
4. Compare the results of the evaluation to the goals and requirements and make a
decision to move either to the next step or the next phase
5. Identify problems and root causes
The process then returns to Step 2, and Steps 2 through 5 are repeated until goals and
requirements are met.
The reliability improvement process steps are shown in the flowchart in Figure 2-5.
[Figure 2-5: flowchart of the reliability improvement process. Step 1: Establish Goals/Requirements. Step 2: Reliability Engineering/Improvements. Step 3: Conduct Evaluation. Step 4: Are Goals/Requirements Met? If yes, Go/No Go Decision on Next Phase. If no, Step 5: Identify Problems & Root Causes, then return to Step 2.]
1. Establish Reliability Goals and Requirements. The first step in the reliability
improvement process is to establish reliability goals and requirements. A distinction is made
between goals and requirements. Goals are more internally driven and may or may not be
met. Requirements, on the other hand, are more specific and are customer driven.
Requirements are usually included as deliverables in contractual agreements. Goals are
the starting point, but are modified to satisfy customer requirements early in the
equipment life cycle.
All goals have certain common characteristics. The following criteria can be used to
assist in establishing goals[3]:
• Attainability: Goals should be set at levels reasonably attainable within
the available time span. Large goals over long periods should be avoided
to maintain interest and commitment. Subgoals over shorter times are
more attainable and more cost effective.
• Supportability: Support and resources must be available at the time they
are needed to achieve goals. Advance planning is needed to determine the
resources and the extent to which they can or will be provided.
• Acceptability: Goals must be acceptable to those who will be actively
involved in pursuing these goals. Acceptance is influenced by relevance,
perceived importance, reasonableness, and desirability of outcome.
• Measurability: Goals provide standards against which performance may
be assessed and, therefore, should be selected for suitability and defined in
a way that enables measurement. To make them measurable, goals must
be defined qualitatively, quantitatively, and in terms of performance
parameters, values, and time scales.
2. Reliability Engineering and Improvements. Once goals and requirements have been
established, design-for-reliability practices or reliability improvement activities are
applied to enhance the reliability of equipment in any phase of the life cycle, including
equipment already in existence.
There are some basic practices that can be applied to improve reliability. These include:
• Simplicity. Simplification of equipment configuration is one of the basic
principles of designing-for-reliability. Added parts or features increase the
number of failure modes. A common practice in simplification is referred
to as component integration (the use of a single component to perform
multiple functions).
• Redundancy. Another reliability improvement practice is to include more
than one way to accomplish a function by having certain components or
subassemblies in parallel, rather than in series. Beyond a certain point,
redundancy may be the only cost-effective way to design reliable
equipment.
• Proven Components and Methods. To the extent possible, designers
should use components and methods that have been shown to work in
similar applications. Using proven components can minimize analyses
and testing to verify reliability, thus reducing time and costs of
demonstrating reliability of the equipment.
• Derating. Derating is the practice of using components or materials at
environmental conditions or loads that are less severe than their limiting
condition. Under these conditions, the component or material is expected
to be more reliable.
• Eliminating Known Causes of Failure (Fault Avoidance). This can be
accomplished through screening and burn-in procedures to eliminate weak
components before equipment is actually shipped to the customer.
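The effect of simplicity and redundancy on reliability can be shown with the standard series and parallel formulas. The sketch below assumes that component failures are independent and that each reliability value is the probability of surviving a fixed mission time; the function names and the 0.95 figures are illustrative and not taken from the guideline.

```python
from math import prod
from typing import Sequence


def series_reliability(reliabilities: Sequence[float]) -> float:
    """All components must work: multiply the individual reliabilities."""
    return prod(reliabilities)


def parallel_reliability(reliabilities: Sequence[float]) -> float:
    """At least one component must work: one minus the chance that all fail."""
    return 1.0 - prod(1.0 - r for r in reliabilities)


# Simplicity: each added series part lowers system reliability.
two_in_series = series_reliability([0.95, 0.95])           # about 0.9025
three_in_series = series_reliability([0.95, 0.95, 0.95])   # about 0.8574

# Redundancy: two of the same parts in parallel raise it.
two_in_parallel = parallel_reliability([0.95, 0.95])       # about 0.9975
```

The comparison is idealized: real redundant designs also add switching or sensing elements whose own reliability would have to be included in the model.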
4. Are Goals and Requirements Met? Results of the evaluation process are compared to
reliability goals and requirements. If goals and requirements are not met, the problems
and root causes should be identified as described in Step 5, and reliability improvement
activities should be initiated. If goals and requirements are met or exceeded, then
approval can be given to move to the next phase of the life cycle, or goals and requirements can
be updated and additional analyses carried out. For example, if the equipment is in the
concept and feasibility or design phase of the life cycle, sensitivity analyses can be
conducted to evaluate design and cost trade-offs such as:
• Design complexity versus reliability
• Maintainability versus reliability
• Increased costs versus reliability
If goals are, or can be, exceeded by a significant margin, then the supplier should
capitalize on the situation by turning it into a competitive leadership position.
Upon completing design trade-off studies, approval can be given to move to the next
phase of the equipment life cycle where the reliability improvement process is again
initiated.
5. Identify Problems and Root Causes. If reliability goals and requirements are not met,
the reasons need to be identified and corrective actions should be taken. Test data on
prototypes or actual equipment in the field can be used to supplement information on
equipment reliability generated from predictive modeling. Testing can also help to
identify causes of failure and any potential reliability problems.
A key tool useful for reporting and analyzing failure data is the failure reporting,
analysis, and corrective action system (FRACAS). This tool is discussed in more detail in
Sections 2.0 and 3.0.
Test data and all reported failures should be investigated to verify that a failure occurred.
Failure verification can be performed by subjecting the component to the same conditions
as those reported when the "failure" occurred.
The reliability improvement process now returns to Step 2, where reliability improvement
and growth activities are initiated, or upgrades and modifications to reliability goals and
requirements are made. Reliability growth activities generally fall into the following
major categories:
• Strengthening the existing design, by testing or modeling (or both) to
identify optimal design changes to improve reliability. The process of
identifying weak areas can be aided by performing sensitivity studies using
the reliability model of the system.
• Redesigning part or all of the system (fault tolerance), which includes
studying ergonomics, enhancing software, adding redundancy, and
incorporating error detection techniques.
• Eliminating known causes of failure (fault avoidance), which includes
using screening and burn-in procedures to eliminate weak components,
derating parts, and using more reliable parts.
Steps 2 through 5 are repeated until goals and requirements are met. The process may
require several cycles of goal setting, evaluating, comparing, and improving. Approval
can then be given to move to the next phase of the life cycle, where the reliability
improvement process is again applied.
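The repetition of Steps 2 through 5 can be sketched as a simple loop. This is only an illustration under stated assumptions: the function name `improvement_cycle`, the callbacks, the `max_cycles` limit, and the 50% improvement per round are all hypothetical and do not come from this guideline.

```python
from typing import Callable, Tuple


def improvement_cycle(
    goal: float,
    evaluate: Callable[[], float],
    improve: Callable[[], None],
    max_cycles: int = 10,
) -> Tuple[bool, int, float]:
    """Repeat evaluate/compare/improve until the goal is met or cycles run out.

    Returns (goal_met, evaluations_performed, last_evaluated_value).
    """
    if max_cycles < 1:
        raise ValueError("max_cycles must be at least 1")
    observed = 0.0
    for cycle in range(1, max_cycles + 1):
        observed = evaluate()          # Step 3: conduct evaluation
        if observed >= goal:           # Step 4: are goals and requirements met?
            return True, cycle, observed
        improve()                      # Step 5, then back to Step 2
    return False, max_cycles, observed


# Hypothetical example: each improvement round raises MTBF by 50%.
state = {"mtbf_hours": 100.0}


def evaluate_mtbf() -> float:
    return state["mtbf_hours"]


def apply_improvement() -> None:
    state["mtbf_hours"] *= 1.5


result = improvement_cycle(300.0, evaluate_mtbf, apply_improvement)
```

In this made-up case the 300-hour goal is first met on the fourth evaluation. A real program would replace the callbacks with modeling, testing, and design activities, and the cycle limit with a management decision.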
2.7 Summary
Knowledge of the equipment life cycle is important because it provides a basis for understanding
how and where reliability engineering enters into the process of designing, producing, and
operating the equipment. The equipment life cycle is broken into distinct phases, each
representing a unique portion of the equipment life. These phases provide the framework for
tracking reliability throughout the life cycle of the equipment and guidance on when and where to
apply resources. Awareness of life cycle costs helps equipment owners understand the impact on
expenditures and cost of ownership when reliability is initiated at different life cycle phases.
The reliability improvement process provides a means for systematically improving reliability
throughout the equipment life cycle. Optimal benefits are realized when reliability is designed
into a piece of equipment. However, it is important to improve reliability throughout the life of
the equipment to meet reliability goals and objectives.
The reliability improvement process is an iterative process of setting goals, evaluating
(predicting) reliability, comparing the results to the goals, and improving the equipment.
Central to the reliability improvement process are data collection and analysis, design
improvements, and improvements to operations and maintenance procedures.
About Section 3.0
The next section provides details on preparing for and implementing the reliability improvement
process. It includes a discussion of the various activities associated with each step of the
improvement process and each phase of the life cycle. In preparation for this discussion, the
following questions may assist in assessing current reliability practices and focus.
1. Is the importance of reliability conveyed throughout the company?
2. Is the approach to reliability improvement reactive or proactive?
3. Is the equipment development process life cycle oriented?
4. Have specific goals and requirements been established for equipment
reliability and its growth?
5. Does the organization have technical and executive managers who
champion the reliability cause?
6. Is demonstrated achievement of reliability goals a part of the criteria for
deciding when equipment is ready for release to market?
7. Does the organization collect data that can readily be used in measuring and
providing guidance for equipment reliability performance?
8. Do indicators of reliability performance exist for all equipment?
9. Are these indicators routinely monitored to ensure achievement of
improvement goals?
10. Is a closed-loop failure reporting and corrective action system in place?
2.8 References
1. SI Staff, "Selecting a Product: The Task at Hand," Semiconductor International,
March 1991, pages 7-8.
2. J. E. Arsenault and J. A. Roberts, Reliability and Maintainability of Electronic
Systems, Potomac, MD: Computer Science Press, 1980.
3. W. Grant Ireson and Clyde F. Coombs, Jr., Editors in Chief, Handbook of Reliability
Engineering and Management, McGraw-Hill, 1988.
3.1 Introduction
To ensure that maximum benefits are achieved when implementing the reliability improvement
process, it is important to have an understanding of:
• Management’s role in the implementation process
• The activities associated with applying the process
• Functional responsibilities in the implementation process
• Where to start the process
• How to use limited resources and communicate the value of the process
Each of these topics is discussed in this section. Primary focus is given to applying the
reliability improvement process. Activities associated with applying the reliability improvement
process to equipment in the concept and feasibility phase and continuing throughout its life cycle
are discussed first. Later, the discussion focuses on activities associated with applying the
reliability improvement process to equipment in an advanced phase (other than concept and
feasibility) of the life cycle.
Table 3-1. Reliability Improvement Process Applied at Six Different Starting Points
→ Concept/Feasibility
Design
Prototype (α-site)
Pilot Production (β-site)
Production/Operation
Phase Out
Once goals have been established, a reliability program plan is created that documents how these
goals will be achieved. It defines:
• Activities to be performed
• Resources required to fulfill the activities
• Schedule for these activities
• Procedures by which the activities will be performed
• Organizations and interfaces required to perform the activities
The program plan provides management and the customer with a means of measuring progress
and assuring that requirements will be accomplished.
Step 2. Reliability Engineering and Improvements. In the concept and feasibility phase, Step
2 of the reliability improvement process focuses first on developing alternative design concepts.
All possible alternatives should be identified and evaluated to ensure that those selected for the
design phase are capable of fulfilling goals and requirements. Functional block diagrams are
used to develop the basic concepts for the equipment and to evaluate their feasibility. The
functional block diagram is updated as the concept changes.
The next step is to develop a preliminary model of the equipment using the functional block
diagrams. The initial model is created at a gross level; that is, the equipment is broken into a few
(approximately 10 to 20) major subsystems. This model is used to make initial predictions of the
equipment reliability (Step 3).
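A gross-level prediction of this kind can be sketched as follows. The sketch assumes a series model (any subsystem failure stops the equipment) with constant failure rates, so subsystem failure rates simply add. Those assumptions, the function name `series_mtbf`, and the subsystem names and values are illustrative only and are not prescribed by this guideline.

```python
from typing import Dict


def series_mtbf(subsystem_mtbf_hours: Dict[str, float]) -> float:
    """Predict equipment MTBF for subsystems in series.

    Assumes constant failure rates, so the equipment failure rate is the
    sum of the subsystem failure rates (1 / MTBF for each subsystem).
    """
    if not subsystem_mtbf_hours:
        raise ValueError("at least one subsystem is required")
    if any(m <= 0 for m in subsystem_mtbf_hours.values()):
        raise ValueError("MTBF values must be positive")
    total_rate = sum(1.0 / m for m in subsystem_mtbf_hours.values())
    return 1.0 / total_rate


# Hypothetical subsystem MTBF values (hours) from historical data or expert judgement.
example_subsystems = {
    "wafer_handler": 1000.0,
    "vacuum_system": 2000.0,
    "controller": 2000.0,
}
predicted_mtbf = series_mtbf(example_subsystems)  # approximately 500 hours
```

Because the rates add, the predicted equipment MTBF is always lower than that of the weakest subsystem, which is one reason the model is later expanded only where it matters most.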
A reliability allocation is conducted to apportion the equipment reliability goal among the individual
major subsystems. This is done to make equipment reliability requirements more manageable and
to establish individual reliability requirements for each major subsystem. Since no detailed
information on the equipment is yet available, the allocation process is approximate; it is used to
guide the designer when developing various concepts.
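One simple way to perform an approximate allocation is to split the equipment failure-rate goal among subsystems in proportion to assigned weights (for example, relative complexity). The sketch below assumes a series model with constant failure rates. The weighting scheme, the function name `allocate_mtbf`, and the example values are assumptions for illustration, not a method defined in this guideline.

```python
from typing import Dict


def allocate_mtbf(system_mtbf_goal: float, weights: Dict[str, float]) -> Dict[str, float]:
    """Allocate an equipment MTBF goal to subsystems in a series model.

    Each subsystem receives a share of the allowed equipment failure rate
    proportional to its weight; a larger weight means a larger share of the
    failure-rate budget and therefore a lower required MTBF.
    """
    if system_mtbf_goal <= 0:
        raise ValueError("system MTBF goal must be positive")
    if not weights or any(w <= 0 for w in weights.values()):
        raise ValueError("weights must be non-empty and positive")
    total_weight = sum(weights.values())
    system_rate = 1.0 / system_mtbf_goal
    return {
        name: 1.0 / (system_rate * w / total_weight)
        for name, w in weights.items()
    }


# Hypothetical weights: the wafer handler is judged twice as failure-prone.
allocation = allocate_mtbf(
    500.0, {"wafer_handler": 2.0, "vacuum_system": 1.0, "controller": 1.0}
)
# wafer_handler needs about 1000 h MTBF; the other two about 2000 h each.
```

The allocated subsystem failure rates sum back to the equipment goal, so meeting every subsystem requirement implies meeting the equipment requirement under the same assumptions.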
In this phase, the equipment has not been built, so other sources of data are required. Historical
data can be used for those subsystems that are similar to previous generations of equipment. For
those subsystems for which no historical data is available, expert judgement can be used. Expert
judgement takes the opinion of individuals that are considered to be knowledgeable about a
subsystem or component and uses this knowledge to create initial reliability values.
Another reliability engineering activity available for identifying conceptual design weaknesses is
a failure modes and effects analysis (FMEA). This is a technique for systematically identifying,
analyzing and documenting the possible failure modes within a design and the effects of such
failures on equipment performance.
The process of setting up an FMEA is initiated in this step, but it is used later in Step 5 to help
identify problems and root causes.
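An FMEA worksheet can be represented as a small data structure. The sketch below ranks failure modes with a risk priority number (severity × occurrence × detection, each rated 1 to 10). That ranking is a common FMEA convention but is an assumption here: this guideline does not prescribe it at this point, and the class name, field names, and example entries are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class FailureMode:
    subsystem: str
    mode: str
    effect: str
    severity: int    # 1 (negligible) to 10 (most severe)
    occurrence: int  # 1 (rare) to 10 (frequent)
    detection: int   # 1 (easily detected) to 10 (hard to detect)

    def __post_init__(self) -> None:
        for rating in (self.severity, self.occurrence, self.detection):
            if not 1 <= rating <= 10:
                raise ValueError("ratings must be between 1 and 10")

    @property
    def rpn(self) -> int:
        """Risk priority number used only to rank entries."""
        return self.severity * self.occurrence * self.detection


def rank_failure_modes(modes: List[FailureMode]) -> List[FailureMode]:
    """Return failure modes ordered from highest to lowest RPN."""
    return sorted(modes, key=lambda m: m.rpn, reverse=True)


# Hypothetical worksheet entries for a conceptual design.
worksheet = [
    FailureMode("robot arm", "drops wafer", "wafer breakage", 8, 3, 4),
    FailureMode("vacuum pump", "seal leak", "loss of vacuum", 6, 5, 2),
    FailureMode("controller", "software hang", "process abort", 7, 4, 6),
]
ranked = rank_failure_modes(worksheet)
```

Keeping the worksheet in a structured form makes it easier to update as the design changes and to revisit in Step 5.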
Step 3. Conduct Evaluation. The subsystem failure data and the reliability prediction model
are used to evaluate the reliability of the conceptual design. A reality check assures that the
predicted reliability value makes sense. Evaluate the following:
• Predicted versus the anticipated reliability value
• Historical and expert opinion data used to calculate equipment reliability
• Reliability prediction model
Conceptual design review(s) of the concepts that will be carried to the design phase are
conducted at this point. These design reviews are also useful in evaluating the current level of
the predicted reliability of the concepts being considered.
Step 4. Are Goals and Requirements Met? A comparison is made between established goals
and the predicted reliability values. If the goals are not met, continue to Step 5 where problems
and root causes are identified. If the goals are met or exceeded, approval is eventually given to
move to the design phase of the life cycle, where goals may be modified to meet customer
requirements.
Step 5. Identify Problems and Root Causes. If goals are not met, problems and root causes
should be identified. Sensitivity analyses can be conducted to direct attention to those
subsystems that have the greatest impact on the equipment reliability.
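A basic sensitivity analysis can be sketched by improving one subsystem at a time and observing the change in predicted equipment MTBF. The sketch assumes a series model with constant failure rates and a hypothetical 50% reduction in each subsystem's failure rate. The function name `mtbf_sensitivity` and the example values are illustrative assumptions.

```python
from typing import Dict, List, Tuple


def mtbf_sensitivity(
    subsystem_mtbf_hours: Dict[str, float], rate_reduction: float = 0.5
) -> List[Tuple[str, float]]:
    """Rank subsystems by the equipment MTBF gained when each one is improved.

    For each subsystem, its failure rate is reduced by `rate_reduction`
    (a fraction strictly between 0 and 1) while the others stay unchanged.
    Returns (subsystem, MTBF gain in hours), largest gain first.
    """
    if not 0.0 < rate_reduction < 1.0:
        raise ValueError("rate_reduction must be between 0 and 1 (exclusive)")
    base_rate = sum(1.0 / m for m in subsystem_mtbf_hours.values())
    base_mtbf = 1.0 / base_rate
    gains = []
    for name, mtbf in subsystem_mtbf_hours.items():
        improved_rate = base_rate - rate_reduction * (1.0 / mtbf)
        gains.append((name, 1.0 / improved_rate - base_mtbf))
    return sorted(gains, key=lambda item: item[1], reverse=True)


# Hypothetical subsystem values (hours).
ranking = mtbf_sensitivity(
    {"wafer_handler": 1000.0, "vacuum_system": 2000.0, "controller": 2000.0}
)
```

Here halving the wafer handler failure rate raises the predicted equipment MTBF by roughly 167 hours, more than twice the gain from either other subsystem, so attention would go there first.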
If an FMEA was developed in Step 2, use it to examine the potential failure modes identified and
to establish possible root causes.
The reliability improvement process now returns to Step 2 (reliability improvement and growth
activities are initiated). These might include:
• Adding high-level redundancy
• Using proven high reliability components and parts
• Forming partnerships with sub-tier suppliers
• Derating
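The effect of adding redundancy can be estimated with a simple calculation. The sketch assumes active (parallel) redundancy of identical units that fail independently, with no switching or common-cause failures. These are simplifying assumptions, and the function name `parallel_reliability` is hypothetical.

```python
def parallel_reliability(unit_reliability: float, n_units: int) -> float:
    """Reliability of n identical, independent units in active parallel.

    The redundant group fails only if every unit fails, so
    R_group = 1 - (1 - R_unit) ** n.
    """
    if not 0.0 <= unit_reliability <= 1.0:
        raise ValueError("unit_reliability must be between 0 and 1")
    if n_units < 1:
        raise ValueError("n_units must be at least 1")
    return 1.0 - (1.0 - unit_reliability) ** n_units


# Hypothetical example: duplicating a unit with 0.90 mission reliability.
single = parallel_reliability(0.90, 1)   # about 0.90
duplex = parallel_reliability(0.90, 2)   # about 0.99
```

The gain has to be weighed against the added cost, complexity, and maintenance burden of the extra unit, which is the kind of trade-off the sensitivity analyses are meant to expose.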
Once the conceptual design improvements have been selected and incorporated, both the
functional block diagram and the reliability prediction model are re-evaluated. The model and the
data used in the model are changed to reflect the conceptual design improvements. If an FMEA
was initiated, it is also updated to reflect design changes.
Steps 2 through 5 are repeated until goals are met and approval is given to move to the design
phase of the life cycle.
At the end of concept and feasibility phase, the following objectives have been met:
• Reliability goals have been established and allocated to major subsystems
• A reliability program plan has been initiated
• Conceptual designs that form the basis of the equipment design are determined
• Feasibility that selected conceptual designs will meet goals is demonstrated
Table 3-2 summarizes the activities associated with applying the reliability improvement process
to the concept and feasibility phase. There are three designators used for the activities:
Design
Step 1. Establish Goals and Requirements. The reliability goals
established in the concept and feasibility phase of the life cycle are modified and become
reliability requirements in the design phase. Requirements need to be well-defined so that they
are understandable by design engineers and manufacturers. Requirements should be broad in
nature and be both qualitative (e.g., definition of responsibilities and program requirements) and
quantitative (e.g., mean time between failures and uptime).
Concept/Feasibility
→ Design
Prototype (α-site)
Pilot Production (β-site)
Production/Operation
Phase Out
If a critical component is used for the first time and the life data is not available, run a simulated
life test to generate the life data under the expected use conditions.
Step 3. Conduct Evaluation. Use the subsystem and component failure data, and the updated
reliability prediction model, to evaluate the reliability of the current equipment design. As was
the case in the concept and feasibility phase, evaluate the following:
• Data sources and their validity
• Predicted versus the anticipated reliability value
• Historical and expert opinion data used in determining equipment reliability
• Reliability prediction model
Conduct design review(s) of the design(s) that will be carried to the prototype phase at this time.
These reviews are often broken down into:
• Requirements Review - review the equipment’s design requirements
• Preliminary Design Review - evaluate the preliminary design against requirements
• Critical Design Review - provide design to the customer(s) for review
Step 4. Are Goals and Requirements Met? Compare the reliability requirements and the
predicted reliability values. If requirements are not met, continue to Step 5 where problems and
root causes are identified. If requirements are met, approval is given to move to the prototype
phase of the life cycle.
Step 5. Identify Problems and Root Causes. If requirements are not met, sensitivity analyses
can be conducted to direct attention to those subsystems and components that have the greatest
impact on the equipment reliability. Evaluate the FMEA that was developed in Step 2 to
determine potential failure modes of the subsystems and components.
The process now returns to Step 2, where reliability improvement activities are initiated.
Steps 2 through 5 are repeated until requirements are met. Approval can then be given
to move to the prototype phase of the life cycle.
At the end of the design phase, the following objectives have been met:
• The core architecture of the equipment design has been finalized
• Design(s) have been chosen for prototype
Table 3-3 summarizes the activities associated with applying each step of the reliability
improvement process to the design phase.
Prototype
Step 1. Establish Goals and Requirements. At this point in the life
cycle, requirements have been established and little remains to be done other than to upgrade
these as the design moves toward completion and prototypes are built. Modeling as well as
failure data analyses can be used to appraise current equipment reliability levels and evaluate
what levels are achievable.
Concept/Feasibility
Design
→ Prototype (α-site)
Pilot Production (β-site)
Production/Operation
Phase Out
As was the case in the previous two phases, the reliability program plan is updated.
Step 2. Reliability Engineering and Improvements. The
functional block diagram is again updated in the prototype phase to reflect any design changes.
Subsystems and components having the greatest impact on equipment reliability are further
expanded in the reliability prediction model. If reliability requirements were revised in Step 1,
re-allocation to major subsystems and components may be necessary. For those subsystems and
components that are modeled in more detail, reliability allocations need to be made to lower
levels. If more than one prototype is built, a reliability model for each prototype design may be
needed.
Conduct a test to generate subsystem and system level reliability data for each of the prototypes.
Aspects of the test program that are considered include:
• Test objectives
• Test parameters
• Test sample size
• Test duration
• Test environments
Component tests are useful for identifying basic weaknesses in critical components, whereas
system tests are useful in exploring the effects of component interactions. Results from
component tests alone should not be used for predicting system reliability performance, since
component tests rarely duplicate system interactions.
A failure reporting, analysis, and corrective action system (FRACAS) can be initiated to record failure data
gathered during the testing program. The FRACAS is a closed-loop reporting system that is
useful in:
• Identifying failures and establishing a historical data base
• Analyzing failures to determine the cause
• Documenting the corrective action required to minimize reoccurrence of the
failures
Maximum benefits from a FRACAS are realized when it is implemented early in a test program
and is directly coupled to the modeling effort. Failures identified during in-house testing (e.g.,
prototype tests) are easier to analyze than failures in the field. Furthermore, it is more cost
effective to identify and correct failures earlier in the life cycle.
The actual failure modes uncovered during testing should be recorded in the FRACAS
and compared to the predicted failure modes established in the FMEA. Where differences occur,
the reasons should be identified.
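The comparison between recorded and predicted failure modes can be sketched with a minimal record type. The class `FailureReport`, its fields, the "closed" rule, and the example data are hypothetical and serve only to illustrate the closed-loop idea; an actual FRACAS would carry far more information.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List, Optional, Set, Tuple

Mode = Tuple[str, str]  # (subsystem, failure mode)


@dataclass
class FailureReport:
    report_id: int
    subsystem: str
    failure_mode: str
    root_cause: Optional[str] = None
    corrective_action: Optional[str] = None

    @property
    def closed(self) -> bool:
        """The loop is closed once a cause and a corrective action are recorded."""
        return self.root_cause is not None and self.corrective_action is not None


def compare_with_fmea(
    reports: Iterable[FailureReport], predicted: Set[Mode]
) -> Dict[str, List[Mode]]:
    """List observed modes the FMEA missed and predicted modes not yet observed."""
    observed = {(r.subsystem, r.failure_mode) for r in reports}
    return {
        "unpredicted": sorted(observed - predicted),
        "not_observed": sorted(predicted - observed),
    }


# Hypothetical prototype test data.
fmea_modes = {("robot arm", "drops wafer"), ("vacuum pump", "seal leak")}
test_reports = [
    FailureReport(1, "vacuum pump", "seal leak", "worn O-ring", "shorten PM interval"),
    FailureReport(2, "controller", "software hang"),
]
differences = compare_with_fmea(test_reports, fmea_modes)
```

Unpredicted modes point to gaps in the FMEA, while predicted modes that never appear may indicate either a conservative analysis or insufficient test coverage.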
Step 3. Conduct Evaluation. Reliability of the various prototypes is evaluated
based on the test data.
Results of the prototype test are then presented for a design review prior to pilot production.
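A first-pass evaluation of prototype test data can be sketched as a point estimate of MTBF: total operating time divided by the number of failures. This assumes a constant failure rate and gives no confidence level; the function name `estimate_mtbf` and the example hours and failure counts are hypothetical.

```python
from typing import Optional, Sequence


def estimate_mtbf(operating_hours: Sequence[float], failures: int) -> Optional[float]:
    """Point estimate of MTBF from pooled test time.

    Returns None when no failures were observed, because the point
    estimate is undefined in that case.
    """
    if failures < 0 or any(h < 0 for h in operating_hours):
        raise ValueError("hours and failure count must be non-negative")
    if failures == 0:
        return None
    return sum(operating_hours) / failures


# Hypothetical test: three prototypes, four failures in total.
prototype_hours = [400.0, 350.0, 450.0]
observed_mtbf = estimate_mtbf(prototype_hours, failures=4)   # 300.0 hours
requirement_met = observed_mtbf is not None and observed_mtbf >= 250.0
```

With short tests and few failures the estimate is highly uncertain, so it is better treated as an input to the design review than as proof that a requirement has been met.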
Step 4. Are Goals and Requirements Met? Compare the
results of the testing of the prototype(s) to the requirements to see if they have been met. If the
requirements are not met, move to Step 5, where problems and root causes are identified. If
requirements are met, then a design review is performed, including a management go/no go
decision to continue to the pilot production phase of the life cycle.
Step 5. Identify Problems and Root Causes. A sensitivity
analysis is conducted to direct attention to those subsystems and components that have the
greatest impact on the equipment reliability. Root causes of the failures recorded in the
FRACAS are identified and corrective actions implemented. A more detailed failure analysis
might also be performed on those subsystems and components that are failing at a significantly
higher rate than previously anticipated.
The process now returns to Step 2, where improvement activities are initiated. If a FRACAS was
initiated, it might identify corrective actions that could be implemented to eliminate failures.
Other possibilities include:
• Derating
• Procedural changes
• Process changes
A preventive maintenance (PM) program can be developed for subsystems and components that
degrade equipment performance. Partnerships established with suppliers are continually
nurtured and purchased subsystems and components are continually evaluated. Human
capabilities and limitations are considered and changes are made to the equipment to eliminate
failures due to human errors. The software reliability program is continued. For critical
subsystems and components, the optimal operating range is found and the impact of the optimal
range on other components is evaluated.
Steps 2 through 5 are repeated until requirements are met. Approval can then be given to move
to the pilot production phase of the life cycle.
At the end of the prototype phase, the following objectives have been met:
• The prototype(s) has been tested and evaluated to determine its capability of
achieving the requirements. This includes redesigning and re-evaluating until a
go/no go decision is reached
• The core subsystem and component designs are finalized.
Table 3-4 summarizes the activities associated with applying the reliability improvement process
to the prototype phase.
Table 3-4. Reliability Improvement Process Activities for the Prototype Phase
Reliability Improvement
Process Step                Activities
1. Establish Goals          - Update reliability requirements (E1)
   and Requirements         - Update reliability program plan (E2)
2. Reliability              - Update functional block diagram (E3)
   Engineering and          - Expand reliability model, as needed (E4)
   Improvements             - Re-allocate subsystem and component reliability requirements (E5)
                            - Establish test plan (T1)
                            - Conduct Prototype test (T2)
                            - Establish FRACAS (E17)
                            - Perform human reliability analysis (D2)
                            - Develop preventive maintenance program (E10)
                            - Continue to evaluate the reliability of purchased components (E11)
                            - Perform ergonomics studies (E12)
                            - Conduct software reliability studies (E13)
                            - Update Life Cycle Cost (AT19)
3. Conduct                  - Evaluate prototype reliability (T2)
   Evaluation               - Conduct design review(s) (E7)
4. Are Goals and            - Compare reliability requirements to predicted values
   Requirements Met?        - If requirements are not met, continue to Step 5
                            - If requirements are met move to pilot production phase of life cycle
5. Identify Problems        - Perform sensitivity analyses (E8)
   and Root Causes          - Evaluate FRACAS to identify problems and root causes (E17)
                            - Evaluate FMEA to identify potential failure modes (E14)
                            - Perform failure analyses on critical components (E16)
Pilot Production
Step 1. Establish Goals and Requirements. During the pilot
production phase, upgrades are made to goals and requirements, as appropriate, and the reliability
program plan is updated to reflect these, as well as other, changes. Modeling and failure data
analyses are used to assess current and potential levels of equipment performance.
Concept/Feasibility
Design
Prototype (α-site)
→ Pilot Production (β-site)
Production/Operation
Phase Out
Table 3-5. Reliability Improvement Process Activities for the Pilot Production Phase
Reliability Improvement
Process Step                Activities
1. Establish Goals          - Update reliability requirements, as needed (E1)
   and Requirements         - Update reliability program plan (E2)
2. Reliability              - Update functional block diagram, if needed (E3)
   Engineering and          - Update reliability model, if needed (E4)
   Improvements             - Re-allocate reliability requirements, as needed (E5)
                            - Upgrade testing program, as needed (T1)
                            - Implement FRACAS, if not already done (E17)
                            - Perform human reliability analyses (D2)
                            - Perform software reliability studies (E13)
                            - Perform ergonomic studies (E12)
                            - Update preventive maintenance program, as needed (E10)
                            - Continue to evaluate reliability of purchased components (E11)
                            - Update Life Cycle Cost (AT19)
3. Conduct                  - Conduct tests of equipment (T2)
   Evaluation               - Evaluate equipment reliability (E6)
                            - Conduct design review(s) (E7)
4. Are Goals and            - Compare reliability requirements to observed values
   Requirements Met?        - If requirements are not met, continue to Step 5
                            - If requirements are met move to production & operations phase of life cycle
5. Identify Problems        - Perform sensitivity analyses (E8)
   and Root Causes          - Evaluate FRACAS (E17)
                            - Evaluate FMEA (E14)
                            - Perform failure analyses on critical components (E16)
Production/Operation
Step 1. Establish Goals and Requirements. Final updates to reliability requirements and the
reliability program plan are made at this point. All major reliability problems should have been
identified and corrected prior to full-scale production and deployment of the equipment.
Concept/Feasibility
Design
Prototype (α-site)
Pilot Production (β-site)
→ Production/Operation
Phase Out
Step 2. Reliability Engineering and Improvements. Functional block diagrams and the
reliability model are updated to reflect any design changes that occurred during the pilot
production phase. The FRACAS data base is updated to reflect failure modes uncovered during
pilot production testing. The observed failures are also used to update the reliability model.
A field tracking and customer feedback program is initiated to record operation and maintenance
problems in the field. This information should account for uncertainty due to variations in site,
equipment vintage, and customer procedures.
Step 3. Conduct Evaluation. Evaluation of the equipment’s performance at this
point consists primarily of feedback from maintenance records. However, the effect of
pending corrective actions should be taken into account when predicting the equipment’s future performance.
Step 4. Are Goals and Requirements Met? Here again, if
requirements are not being met, then problems and root causes are identified in Step 5. If
requirements are being met, then it is important to continually monitor equipment performance
and to implement a process of continuous improvement until decisions are made to phase out the
current generation of equipment and begin development of the next generation.
Step 5. Identify Problems and Root Causes. Failures and
problems reported during full-scale production and deployment in the field are fed through the
FRACAS to verify the failure(s) and to identify root causes and corrective actions. Pareto
analyses can be used to prioritize problems.
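A Pareto analysis of reported failures can be sketched as a count by cause, sorted from most to least frequent, with a running cumulative percentage. The function name `pareto_table` and the example causes are hypothetical; a real analysis would draw its categories from the FRACAS data base.

```python
from collections import Counter
from typing import Iterable, List, Tuple


def pareto_table(failure_causes: Iterable[str]) -> List[Tuple[str, int, float]]:
    """Return (cause, count, cumulative percent) rows, most frequent cause first."""
    counts = Counter(failure_causes)
    total = sum(counts.values())
    rows = []
    cumulative = 0
    for cause, count in counts.most_common():
        cumulative += count
        rows.append((cause, count, 100.0 * cumulative / total))
    return rows


# Hypothetical field failure reports, one entry per verified failure.
field_failures = ["seal leak"] * 5 + ["software hang"] * 3 + ["misalignment"] * 2
pareto_rows = pareto_table(field_failures)
```

In this made-up data set the top two causes account for 80% of the failures, which is where corrective action would be directed first. Counts could also be weighted by downtime or cost when frequency alone is misleading.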
The process now returns to Step 2, where improvements and corrective actions are implemented.
Steps 2 through 5 are repeated until requirements are met.
At the end of the equipment’s production and operation phase, the following objectives have been
met:
• The equipment is manufactured in a manner that uniformly meets the customer
and supplier requirements.
• Continuous improvement goals and requirements are established and
demonstrated.
Table 3-6 summarizes the activities associated with applying the reliability improvement process
to the production and operation phase of the life cycle.
Table 3-6. Reliability Improvement Process Activities for the Production and
Operation Phase
Reliability Improvement
Process Step                Activities
1. Establish Goals          - Final update of reliability requirements, if needed (E1)
   and Requirements         - Final update of reliability program plan (E2)
2. Reliability              - Update FRACAS data base (E17)
   Engineering and          - Implement field tracking, customer feedback (D1) and corrective action
   Improvements               program
                            - Update human reliability analyses (D2)
                            - Update software reliability studies (E13)
                            - Update ergonomic studies (E12)
                            - Update preventive maintenance program, as needed (E10)
                            - Continue to evaluate reliability of purchased components (E11)
                            - Update Life Cycle Cost, if required (AT19)
3. Conduct                  - Assess equipment reliability based on the field data (E6)
   Evaluation               - Evaluate feedback from field tracking and maintenance records (D1)
4. Are Goals and            - Compare requirements to observed values
   Requirements Met?        - If requirements are not met, continue to Step 5
                            - If requirements are met:
                              * Continually monitor equipment performance
                              * Implement process of continuous improvement
                              * Revise goals and requirements, as appropriate (E1)
                              * Eventually phase out current generation equipment
5. Identify Problems        - Perform sensitivity analyses (E8)
   and Root Causes          - Perform failure analyses on field failures (E16)
Phase Out
Step 1. Establish Goals and Requirements. At this point in the life
cycle, there are no goals or requirements to establish. A general goal would be to set
requirements for subsystems and components to be carried over to the next generation of
equipment. Also, it is important to have documented and retained all the information gained
during the life cycle phases of the current generation of equipment so that similar mistakes will
not be repeated.
Concept/Feasibility
Design
Prototype (α-site)
Pilot Production (β-site)
Production/Operation
→ Phase Out
Table 3-7. Reliability Improvement Process Activities for the Phase-Out Phase
Reliability Improvement
Process Step                Activities
1. Establish Goals and      - Set requirements for subsystems and components to be carried to next generation
   Requirements               of equipment
                            - Document and retain all information gathered during generation of equipment
                              being phased out
2. Reliability Engineering  - Offer phase-out alternatives to customers of equipment being phased out
   and Improvements         - Phase out current generation equipment in stages
3. Conduct Evaluation       - Assess reliability of the current generation (E6) and carry information to the next
                              generation of equipment
4. Are Goals and            - There are no goals or requirements to meet
   Requirements Met?
5. Identify Problems and    - Retain all information on equipment being phased out so that it can be used in
   Root Causes                future generations of equipment
3.4.1 Starting with Equipment in the Design Phase
When equipment has reached the design phase, the basic concept has already been established
and is fixed in the minds of the design engineers. It is more difficult to incorporate customer
needs into the design in this phase than in the concept and feasibility phase. However, it is not
too late, and it is clearly important, to incorporate customer needs and requirements when
establishing reliability goals.
If a reliability program plan has not been initiated, do so at this time.
If the functional block diagrams and the corresponding reliability model were not initiated in the
concept and feasibility phase, develop them now. Equipment reliability requirements are then
allocated to individual major subsystems in the model. Failure data are collected for use in the
reliability model.
Other activities associated with applying the reliability improvement process to the remainder of
the process steps and life cycle phases are identical to those discussed earlier and are listed in
Tables 3-4 through 3-7. Therefore, they are not listed again here.
Table 3-8 summarizes the activities associated with applying the reliability improvement process
to equipment that is in the design phase. The activities listed in Table 3-8 are similar to those
listed in Table 3-3; the difference is in the activities listed under Steps 1 and 2.
3.4.3 Starting with Equipment in the Pilot Production Phase
For equipment in the pilot production phase of the life cycle, the focus should be on appraising
the actual level of equipment reliability (from available data) and determining what levels are
desired and obtainable. This is still an important step in the context of customer
requirements.
A reliability program plan can still be created to identify and tie together all of the reliability
improvement process activities that will be performed during the pilot production phase and
subsequent phases of the equipment life cycle.
The majority of this effort should be directed at making needed design improvements once the
equipment is evaluated. It is not too late to incorporate some design-for-reliability practices.
The focus should be on reliability growth activities directed at the existing design.
A method for collecting, tracking, and storing reliability data should be established. A FRACAS
can be initiated and used to track reported failures during pilot production, and to identify
corrective actions necessary to eliminate these failures. It is still not too late to initiate an FMEA.
Ergonomic studies can be used very effectively at this point.
Table 3-10 summarizes the activities associated with applying the reliability improvement
process to equipment starting in the pilot production phase.
Table 3-10. Pilot Production Phase Reliability Improvement Process Activities When
Initiated In Pilot Production Phase
3.4.4 Starting with Equipment in the Production and Operation Phase
For equipment in the production and operation phase of the life cycle, the design is fixed. There
is no opportunity to make major design changes at this time. Thus, the focus of Step 1 should be
on appraising the actual level of reliability of equipment in this phase, and evaluating the levels
that are desired and whether these levels are achievable. Upgrades to existing equipment can be
made based on failure data analyses.
Although rather late in the life cycle, creating a reliability program plan to track the activities to
be performed during this phase and the phase out period of the life cycle is still beneficial.
Efforts should focus on making needed improvements to the existing design and on reliability
growth activities since it is too late to design reliability into the system.
Table 3-11 summarizes the activities associated with applying the reliability improvement
process to equipment that is in the production and operation phase of the life cycle. The
activities associated with applying the improvement process to the phase-out phase of the life
cycle are identical to those discussed earlier and listed in Table 3-7 and, therefore, are not listed
here.
3.4.5 Starting with Equipment in the Phase-Out Phase
It is much too late to make any changes to the equipment during the phase-out phase. The goal in
this phase is limited to collecting the reliability data of the equipment in order to gain insight into
the next generation of equipment. This information can save tremendous amounts of time and
money in the concept and feasibility phase of the next generation.
There are no reliability engineering or reliability improvements to be made at this point. Phase-
out alternatives should be offered to customers of current generation equipment.
Table 3-12 summarizes the activities involved in applying the reliability improvement process to
equipment that is in the phase-out phase of the life cycle. This table is identical to Table 3-7.
Table 3-12. Phase-Out Phase Reliability Improvement Process Activities When
Initiated in Phase-Out Phase
Reliability Improvement Process Step: Activities
1. Establish Goals and Requirements
- Set requirements for subsystems and components to be carried to next generation of equipment
- Document and retain all information gathered during generation of equipment being phased out
2. Reliability Engineering and Improvements
- Offer phase-out alternatives to customers of equipment being phased out
- Phase out current generation equipment in stages
3. Conduct Evaluation
- Create reliability model of subsystems and components carried to next generation equipment (E4)
4. Are Goals and Requirements Met?
- There are no goals or requirements to meet
5. Identify Problems and Root Causes
- Retain all information on equipment being phased out so that it can be used in future generations of equipment
Figure 3-1. Multiple Equipment and Their Life Cycle Phase Status
[Figure: time lines for Equipment A, B, and C, each running through the life cycle phases from Concept and Feasibility to Phase Out, with "Today" marked on the Time axis.]
Equipment C has the greatest potential for cost-effective improvements in reliability because it is
in the earliest phase of its life cycle. However, this does not mean that it is too late to improve
the reliability of Equipment A and B. Reliability improvements can and should be considered in
every phase of the life cycle. However, when starting a reliability improvement process, it is
generally advantageous to choose equipment that will show immediate successes. If sufficient
resources exist, address all equipment in all life cycle phases. Because it is unlikely that this is
the situation, the following priorities are recommended:
1. Equipment in the Production and Operation Phase. Although this is a reactive
strategy, it is the most customer oriented, and is capable of demonstrating quick
benefits. Another benefit of starting with equipment in this phase is that data on
the equipment in the field is available and can be used to determine current
reliability performance. If you are unable to determine your current situation, it
is difficult to set realistic goals and determine whether they have been met. It is
also important to assess the impact of upgrades to equipment in this phase using
the reliability model and existing failure data.
2. Equipment in the Design Phase. This is a proactive strategy and has the greatest
long-term benefits. In this phase, it is difficult to determine what the reliability
performance of the equipment will be unless the previous generation has a
database and a significant number of similar parts. If this information exists, it
can be used with modeling to evaluate potential performance of designs being
considered.
3. Equipment in the Prototype or Pilot Production Phase. These phases are
reactive, and their benefits fall between those of the prior two phases. There is some amount
of data available; therefore, the anticipated reliability performance of the
equipment in the field can be determined. The drawback with these phases is the
expense and time involved if major design changes are necessary.
4. Equipment in the Concept and Feasibility Phase. This is a proactive strategy and the
least expensive phase. Significant reliability improvements can be made to
equipment in this phase with minimal use of resources. However, as with the
design phase, the lack of data makes it difficult to determine reliability
performance.
5. In general, ignore equipment in or near Phase Out. Activities should be limited
to customer requests. However, if the product that is being phased out has future
generations that are significant to the company’s strategic plan, collecting data
and analyzing failures of the product will yield tremendous insight into
development of the next generation.
When making a choice, choose equipment that you know will have future generations. As
mentioned in Section 1.0, the cost of improving equipment reliability will decrease as it moves
from generation to generation.
Knowing the reliability performance of existing equipment is essential to evaluating current
equipment status and for setting reliability goals for current and future equipment. It is difficult to
set realistic and attainable performance goals without this knowledge. Table 3-13 illustrates the
type of reliability performance information that is available for the three equipment lines shown
in Figure 3-1.
Mean time between failures (MTBFp) and mean time to repair (MTTR) are the two measures of
reliability performance used in this illustration. SEMI Standard E10-90[6] provides several other
measures of reliability. Table 3-13 indicates that the MTBFp and MTTR values are known for
Equipment A. Actual data are not available for Equipment B and C because they are in early
stages of development. However, Equipment B has predicted values based on the design and
Equipment C has goals that it is targeted to meet.
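The two measures can be estimated from field data with simple ratios. The following Python sketch is illustrative only: the record layout, function names, and numbers are assumptions, and MTBFp is taken here as productive time divided by the number of failures, which should be checked against the SEMI E10 definitions before use.

```python
from dataclasses import dataclass


@dataclass
class FailureEvent:
    """One failure taken from field service reports (hypothetical record layout)."""
    repair_hours: float  # time spent restoring the equipment


def mtbf(productive_hours: float, failure_count: int) -> float:
    """Mean productive time between failures: productive time / number of failures."""
    if failure_count == 0:
        raise ValueError("no failures observed; MTBF cannot be estimated this way")
    return productive_hours / failure_count


def mttr(events: list[FailureEvent]) -> float:
    """Mean time to repair: total repair time / number of repairs."""
    if not events:
        raise ValueError("no repair events recorded")
    return sum(e.repair_hours for e in events) / len(events)


# Illustrative numbers only, not data from these guidelines.
events = [FailureEvent(2.0), FailureEvent(1.0), FailureEvent(3.0)]
print(mtbf(productive_hours=600.0, failure_count=len(events)))  # 200.0
print(mttr(events))  # 2.0
```

For equipment without field data, the same functions can be fed predicted or goal values so that all equipment lines are reported in the same units.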
Reliability and design engineers determine current reliability performance by collecting and
analyzing data received from a number of sources, including
• Field service reports
• Customer feedback
• In-house testing
In situations where data is not available, but reliability performance needs to be determined,
preliminary engineering judgements, mathematical predictions, and consensus using the opinions
of experts can be used as a first cut at data values.
As discussed previously, one of the cornerstones of reliability improvement is the reliability data
reporting system. It is an organized means of gathering factual data about equipment
performance, both good and bad. Although useful data estimates can be determined during the
concept and feasibility phase as well as the design and development phases of the equipment life
cycle, the most meaningful data is collected during the production and operation phase, when the
equipment is operating in its intended environment. Nevertheless, information gathered in any
phase of the life cycle can be used to ensure that the reliability goals are attained with minimal
time and expense commitments.
Section 4.0 discusses in detail the activities associated with data collection and analysis. These
activities include determining:
• What data to collect
• How to use this data
• The most effective format to use when collecting data
• How to transform the data into failure rates
• How to get numerical values for human errors
It is important to note that an effective reliability improvement process includes a central
database that includes data collected for all equipment of the same model or type and accounts
for uncertainty due to variations in site, equipment vintage, and customer procedures.
3.9 Summary
The role management plays in the reliability improvement process is vital. Management has
unique responsibilities in the establishment and implementation of the process. Management
also assigns individuals to the role of reliability champions. The executive champion provides
reliability leadership with the full support of upper management. The technical champion
establishes the reliability improvement process and is responsible for its success.
The five steps of the reliability improvement process can be applied to a piece of equipment no
matter what phase it is in. This section discussed the activities associated with each step of the
reliability improvement process for each phase of the life cycle.
This section also included a discussion on how to select a piece of equipment to implement a
reliability program based on the life cycle phases. The section also covered the importance of
data, the choice of activities when resources are limited, rules for the reliability program plan,
and suggestions on how to communicate the value of the reliability effort to key decision makers
and participants in the reliability program.
Section 4.0 provides more detailed descriptions of the reliability-related activities and presents
some of the tools and techniques available in planning, developing, and implementing a
reliability improvement program.
3.10 References
1. MIL-HDBK-217E, Reliability Prediction of Electronic Equipment.
2. Non-Electronics Part Reliability Data, Reliability Analysis Center, Rome, NY,
1991.
3. RMS Committee, RMS, Reliability, Maintainability & Supportability
Guidebook, SAE G-11, Society of Automotive Engineers, Inc, Warrendale, PA,
1990.
4. DOD 4245.7-M, Transition from Development to Production, September, 1985.
5. William W. Everett, et al., Reliability by Design, A Guide to Reliability
Management, Issue 1, AT&T, Indianapolis, IN, November 1990.
6. SEMI E10-90, Guideline for Definition and Measurement of Equipment
Reliability, Availability, and Maintainability (RAM), SEMI 1990.
4.1 Introduction
The first two sections of these guidelines provided an overview of the reliability improvement
process and the equipment life cycle. This section provides a description of the activities and
tools that are part of the reliability improvement process. The reliability activities are grouped as:
• Engineering
• Data
• Testing
Engineering activities form the foundation of the reliability improvement process. Data activities
also play an important role because the engineering activities depend on data. Testing activities
provide a valuable source of data and information. There are three designators used for the
activities: E (engineering), D (data), and T (testing). These designators followed by a number
provide the location of the activity in this section.
Some of the activities stand alone; that is, they do not require any formally recognized tools of
the trade. However, many of the activities use standard methods and techniques, referred to as
tools. These tools come from various academic disciplines such as probability and statistics,
and reliability engineering. The designator used for the tools is AT, followed by a number.
Even though safety and maintainability goals are not addressed in these guidelines, some mention
of these goals is necessary because of the key interactive role they play with reliability. Designers
should identify safety, maintainability, and reliability goals at the same time. Since
maintainability is built into equipment, it is primarily addressed in the concept and feasibility and
design phases. Maintainability is achieved by carefully considering and balancing numerous
factors such as basic physical configuration and layout of the design, test provisions for quick
fault location, interchangeability of replaceable parts, adequate maintenance procedures, and skill
levels of technicians. As with reliability, pertinent data is collected to estimate the maintainability
measures and to ensure that the maintainability goals are being achieved.
It is important to remember that setting reliability goals is not a one-time affair; it is a continuous
process of gradual improvements that are made toward the goals over time.
Applicable Tools
AT4 Competitive Benchmarking
AT11 Quality Function Deployment (QFD)
References
Ireson, W., C. Coombs, Jr., editors, Handbook of Reliability Engineering and Management, New
York:McGraw-Hill, 1988, pp. 2.3-2.8.
During this activity, the equipment is depicted by clear, abbreviated schematics that show the
major subsystems, components, or parts of the equipment and the critical support systems such as
power, actuation signals, control, and cooling. A functional block diagram is used to show how
the equipment subsystems, components, and parts interact with one another and with the support
systems. A block diagram provides a clear picture of how the equipment functions and can be
used to create a reliability model. It also helps create an understanding of what makes the
equipment work and what causes it to fail. If alternative concepts or designs have been created,
each one should have its own functional block diagram.
[Figure: reliability improvement process steps. Step 1: Establish Goals/Requirements. Step 2: Reliability Engineering/Improvements. Step 3: Conduct Evaluation. Step 4: Are Goals/Requirements Met? (Go/No Go Decision on Next Phase; No leads to Step 5). Step 5: Identify Problems & Root Causes.]
An example of a functional block diagram is given in the icon above. The functional block
diagram represents a hypothetical personal computer (PC). As can be seen from the diagram, the
PC has two hard disk drives (HD1, HD2) and two floppy drives (FD1, FD2). The keyboard, IO
board, ram card, disk controller card, and video control card all derive their power from the
power supply via the mother board. The CRT (monitor) is a separate unit with its own power
supply.
The need for schematics and flow diagrams is well recognized, but typically these are too
complex to use directly. It is important to construct diagrams that depict clearly and simply how
the equipment functions. Subsystems, components, support systems, and human actions that lead
to equipment failure should be obvious when the functional block diagram is constructed
properly.
The reliability modeling described here is not concerned with time degradation; that is, the
equipment is neither being broken in nor at the point of wear out. This idea can be explained
more clearly by discussing the typical "bathtub" curve seen in many reliability texts. The
following figure shows a typical bathtub curve, also known as a failure rate curve, over the life of
a part, component, subsystem or the equipment. The early part of the curve, where the failure rate
is decreasing, is often called burn-in or the break-in stage. The later part of the curve, where the
failure rate is increasing, is typically called the wear-out stage. As was mentioned, the reliability
model discussed in these guidelines assumes that the components, parts, subsystems and the
equipment itself are in the constant failure rate portion of this curve. This allows one to assume
that all components, parts, subsystems, and the equipment have a constant failure rate; that is, the
failure rate does not change over time. The model also assumes that the components, parts, and
subsystems being modeled are repairable and that the repaired items are as good as new.
[Figure: bathtub curve. Vertical axis: Failure Rate. Horizontal axis: Time.]
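Under the constant failure rate assumption described above, the probability of operating without failure over an interval and the MTBF both follow directly from the failure rate. The following Python sketch assumes an exponential time-to-failure model, which is the standard consequence of a constant failure rate; the function names and the numeric value are illustrative only.

```python
import math


def reliability(failure_rate: float, hours: float) -> float:
    """Probability of operating for `hours` without failure at a constant failure rate."""
    return math.exp(-failure_rate * hours)


def mtbf_from_rate(failure_rate: float) -> float:
    """With a constant failure rate, MTBF is the reciprocal of the rate."""
    if failure_rate <= 0:
        raise ValueError("failure rate must be positive")
    return 1.0 / failure_rate


rate = 0.005  # failures per hour; illustrative value only
print(mtbf_from_rate(rate))                 # 200.0
print(round(reliability(rate, 200.0), 4))   # 0.3679
```

Note that the chance of running one full MTBF interval without failure is only about 37 percent under this model, which is why MTBF should not be read as a guaranteed failure-free period.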
If a block diagram is used to model the equipment, the equipment model will consist of series
blocks (when the failure of one subsystem, component, or part causes the equipment to fail),
parallel blocks (when every subsystem, component, or part must fail for the equipment to fail) or
a combination of these.
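The series and parallel rules can be sketched numerically as follows. This Python example assumes that block failures are independent and that each block is described by its probability of working over the period of interest; the function names and values are illustrative, not part of RAMP or any other tool.

```python
from math import prod


def series(reliabilities: list[float]) -> float:
    """Series blocks: the equipment works only if every block works."""
    return prod(reliabilities)


def parallel(reliabilities: list[float]) -> float:
    """Parallel blocks: the equipment fails only if every block fails."""
    return 1.0 - prod(1.0 - r for r in reliabilities)


# Two blocks, each working with probability 0.9 (illustrative values).
print(round(series([0.9, 0.9]), 4))    # 0.81
print(round(parallel([0.9, 0.9]), 4))  # 0.99

# A combination: two parallel blocks feeding one series block.
print(round(series([parallel([0.9, 0.9]), 0.95]), 4))  # 0.9405
```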
The following paragraphs discuss how to create a reliability model. The first step involves clearly
defining what is meant by equipment failure. For example, one might define failure as any
occurrence that causes the equipment to be down for more than a given period of time (e.g., 6
minutes) or any occurrence that results in wafer damage. This step also involves identifying all of
the failure mechanisms that lead to the defined equipment failure. If, for example, equipment
failure is defined as a down time of 6 minutes or more, all failure mechanisms that cause the
equipment to be down at least 6 minutes are included in the reliability model. If equipment
failure is defined as any occurrence that results in wafer damage, all failure mechanisms that
result in wafer scrap are identified. Field data is often useful in defining what is meant by
equipment failure and in identifying mechanisms that lead to failure.
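The effect of the failure definition on what is counted can be illustrated with a short Python sketch. The event record and function names are hypothetical, and the downtime rule is implemented as "6 minutes or more"; the same event log gives different failure counts under the two definitions.

```python
from dataclasses import dataclass


@dataclass
class Event:
    """One logged equipment event (hypothetical record layout)."""
    downtime_minutes: float
    wafer_damaged: bool


def is_failure_by_downtime(event: Event, threshold_minutes: float = 6.0) -> bool:
    """Definition A: the equipment was down for at least the threshold."""
    return event.downtime_minutes >= threshold_minutes


def is_failure_by_wafer_damage(event: Event) -> bool:
    """Definition B: the occurrence resulted in wafer damage."""
    return event.wafer_damaged


# Illustrative log, not field data.
log = [
    Event(2.0, False),
    Event(15.0, False),
    Event(1.0, True),
    Event(6.0, True),
    Event(30.0, False),
]
print(sum(is_failure_by_downtime(e) for e in log))      # 3
print(sum(is_failure_by_wafer_damage(e) for e in log))  # 2
```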
The next step involves creating the reliability model. Fault trees and reliability block diagrams
are the tools that are used to do this. RAMP is a software package that has been created to help in
the documentation and analysis of a reliability model. It uses reliability block diagrams. RAMP
allows one to create the reliability model on a personal computer, provides a means of
documenting failures, and performs the Boolean algebra necessary to solve the model. A
reasonable starting point in the creation of the model is to initially create a coarse model made up
of the equipment’s major subsystems. If a block diagram is used as the modeling tool, the model
would consist of approximately 10 to 20 major subsystems; that is, in the model, one block
would represent each major subsystem. Later versions of the model add detail only to those
subsystems that are identified as being important; that is, only those subsystems that cause the
equipment to fail are broken down into components and parts. Adding detail to unimportant
subsystems for the sake of completeness simply increases the modeling effort without adding to
the usefulness of the results. Careful examination of field data helps determine the appropriate
level of detail for the model. In general, the model should not be more detailed than the available
information will support. If the modeling effort is for equipment not yet in the field, field data for
a previous generation of equipment can yield valuable insights into improvements in the next
generation.
Once the model is completed, it can be transformed into an equation for quantification, which is
discussed in engineering activity E6. The equipment reliability is calculated using the failure data
collected for the subsystems, components and parts.
The following paragraphs discuss tips that will make the modeling effort easier.
1. Think carefully about the subsystem divisions for the equipment being modeled.
The choice of subsystems will vary from company to company and equipment to
equipment; however, it is best to base the choice on functional considerations
not on parts count methods. Choose subsystems based on the functions they
perform. Group components and parts under the subsystems that make
functional sense.
2. Avoid parts list modeling. That is, do not represent the equipment as a collection
of parts. It is important to include failure modes such as operator errors, software
failures and failures that are the result of drifting out of specification. In
addition, valuable insight into the equipment is gained by thinking about failure
modes and interactions between different subsystems. Parts list modeling does
not encourage this kind of thinking.
3. It is best to begin by modeling an existing piece of equipment. Good reliability
modeling practice comes through experience. If the first model created is for
equipment that is well understood, the model can be validated in terms of the
failure rate and failure mechanisms. Also, introduction of a reliability modeling
program will almost always cause the data collection and data management
procedures to be revised. It is generally better to sort out data problems with an
existing system than with a new system.
4. No matter what phase of the life cycle the equipment is in, it is best to keep the
model as simple as possible. As the model becomes more complicated, it
becomes more difficult to interpret.
5. As the reliability process proceeds, continually change, expand, and improve the
model. This allows the model to be used throughout the life of the equipment.
Applicable Tools
AT7 Fault Tree Analysis (FTA)
AT12 Reliability, Analysis and Modeling Program (RAMP) Software
AT15 Reliability Block Diagram Modeling (RBD)
References
Campbell, J.R., Iman, R., Longsine, D., Thompson, B., A Tutorial on Reliability Modeling Using
RAMP, Albuquerque, NM:SETEC, Sandia National Laboratories, SETEC91-030, 1991.
MIL-HDBK-217E, Reliability Prediction of Electronic Equipment, Griffiss AFB,NY:Rome Air
Development Center, October 1986.
References
Ireson, W., C. Coombs, Jr., editors, Handbook of Reliability Engineering and Management, NY,
McGraw-Hill, 1988, pp. 18.34-18.42.
Juran, J., F. Gryna, editors, Juran’s Quality Control Handbook, Fourth edition, NY, McGraw-
Hill, 1988, pp. 13.21-13.22.
Kapur, K., L. Lamberson, Reliability in Engineering Design, NY, John Wiley and Sons, 1977,
pp. 405-422.
Lloyd, D., M. Lipow, Reliability: Management, Methods, and Mathematics, Second edition,
Milwaukee, WI, The American Society for Quality Control, 1984, pp. 25-27, 267-270.
O’Connor, P., Practical Reliability Engineering, Third edition, NY, John Wiley and Sons, 1991,
pg. 136.
[Figure: reliability block diagram containing blocks A, B, C, D, and E]
Equipment Failure = A + B + [ C * D ] + E
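The equation above can be quantified by enumerating every combination of block states. The Python sketch below assumes that "+" denotes Boolean OR, that "*" denotes Boolean AND, and that block failures are independent; the function names and probabilities are illustrative and do not represent how RAMP solves the model.

```python
from itertools import product


def equipment_fails(a: bool, b: bool, c: bool, d: bool, e: bool) -> bool:
    """Equipment Failure = A + B + [C * D] + E, with + as OR and * as AND."""
    return a or b or (c and d) or e


def failure_probability(p: dict[str, float]) -> float:
    """Sum the probability of every block-state combination that fails the equipment.

    Assumes independent block failures; p maps block name to failure probability.
    """
    names = ["A", "B", "C", "D", "E"]
    total = 0.0
    for states in product([False, True], repeat=len(names)):
        weight = 1.0
        for name, failed in zip(names, states):
            weight *= p[name] if failed else 1.0 - p[name]
        if equipment_fails(*states):
            total += weight
    return total


# Illustrative failure probabilities, not data from these guidelines.
p = {"A": 0.01, "B": 0.02, "C": 0.1, "D": 0.1, "E": 0.01}
print(round(failure_probability(p), 4))  # 0.0491
```

Enumeration is practical only for small models, since the number of combinations doubles with each added block; it is used here because it mirrors the Boolean equation term for term.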
Applicable Tools
AT7 Fault Tree Analysis (FTA)
AT12 Reliability, Analysis and Modeling Program (RAMP) Software
AT15 Reliability Block Diagram Modeling (RBD)
References
Campbell, J., R. Iman, D. Longsine, B. Thompson, A Tutorial on Reliability Modeling Using
RAMP, Albuquerque, NM, SETEC, Sandia National Laboratories, SETEC91-030, 1991, pp. 43-
50.
References
Arsenault, J., J. Roberts, editors, Reliability & Maintainability of Electronic Systems, Potomac,
MD, Computer Science Press, 1980, pp. 280-293, 365-393.
Boothroyd, G., P. Dewhurst, Product Design For Assembly, Wakefield, RI:Boothroyd Dewhurst,
Inc.
Burgess, J.A., "Improving Product Reliability," Quality Progress, December 1987, pp. 47-54.
Davidson, J., editor, The Reliability of Mechanical Systems, London, Mechanical Engineering
Publications Limited for The Institution of Mechanical Engineers, 1988, pp. 47-57.
O’Connor, P., Practical Reliability Engineering, Third Edition, NY, John Wiley & Sons, 1991,
pp. 219-220, 117-125, 328-329.
Skrabec, Q. Jr., "The Transition for 100% Inspection to Process Control," Quality Progress,
April 1989, pp. 35-36.
Smith, J. R., "Reliability Analysis By Simulation," 41st Annual Quality Congress Transactions,
May 4-6, 1987, pp. 654-662.
Tunner, J., "Total Manufacturing Process Control-The High Road To Product Control," Quality
Progress, October 1987, pp. 43-50.
Vanderbei, K., et.al., Reliability by Design, Indianapolis, IN:AT&T, 1990, pp. 105-114, 61-71.
MIL-STD-470B, Maintainability Program for System and Equipment, Irvine, CA:Global
Engineering Documents, 30 May 1989.
One of the most important steps in the supplier reliability program is measurement and feedback.
Measurements provide a means of determining if the supplier is meeting the agreed upon
reliability requirements. Feedback gives the supplier the necessary information to improve the
product.
Finally, for every product that does not meet the reliability requirements, suppliers are asked
what corrective action they will take. They should be able to provide answers to the following
questions:
• What caused the product to not meet the reliability requirements?
• What changes need to be made to make the product meet the requirements?
• How will these changes be made foolproof?
• How will the customer know that these changes have been made?
It is the customer’s responsibility to ensure that the supplier works to find the root cause of each
failure to meet the requirements and takes the necessary action to permanently eliminate the
cause. The role of the supplier in improving reliability of the equipment is critical. For the
supplier to continuously improve the product’s reliability, the customer must demand it.
Applicable Tools
AT6 Environmental Stress Screening (ESS)
AT8 Life Testing
AT18 User Groups
References
Broeker, E., "Build a Better Supplier-Customer Relationship," Quality Progress, September
1989, pp. 67-68.
Juran, J., F. Gryna, editors, Juran’s Quality Control Handbook, Fourth edition, NY:McGraw-
Hill, 1988, pp. 15.1-15.46, 30.18-30.21.
Klock, J., "How to Manage 3,555 (or Fewer) Suppliers," Quality Progress, June 1990,
pp. 43-47.
Richardson, J., "Vendor Quality Assurance in a Process Industry," Quality Progress, November
1984, pp. 60-63.
[Figure: pictogram signs labeled Ground Transportation, Coffee Shop, First Aid, Baggage Claim, and Phones]
Ergonomics (or Human Factors Engineering) is a discipline concerned with designing equipment,
operations, and work environments to match human capabilities and limitations. Ultimately,
everything that one designs has an impact on the human in one way or another. Someone will
have to fabricate the equipment, package it, distribute it, unpack it and prepare it for use, operate
or use it, service and maintain it, and finally dispose of it. For this reason, designers should be
constantly alert to the human factors implications of their proposed design. Keep in mind that the
ultimate success of the equipment depends on how well the user performs the tasks associated
with it.
The intent of human factors engineering in this document is to identify and resolve human-equipment
interface problems wherever they arise. Philosophically, then, human factors engineering looks at
a design from the standpoint of user efficiency, or total human-equipment output effectiveness.
Inherent in this philosophy are the following objectives:
• To make the user’s contribution to the equipment output as efficient as possible so
that the basic equipment output is not compromised by human failures.
• To make the combined user-equipment involvement as safe as possible so that
neither human nor equipment failures will compromise the user’s health or
damage the hardware. Inherent in this objective is the avoidance of injury to
others and of damage to adjacent hardware.
• To minimize the stress that the equipment imposes on the user as he or she uses,
operates, services, or maintains it. This includes such stresses as an undue energy
demand, frustration in trying to deal with the equipment at any point in the
human-equipment interaction, and worry about whether one is using the
equipment properly.
• To maximize the acceptability of the equipment, not only in terms of its
attractiveness, but also in terms of giving users the feeling that the equipment
allows them to use it efficiently and keep it in good working order with a
minimum of effort.
The methods of ergonomics are based on a logical and systematic process of: (1) establishing the
proper role of the human with the equipment, (2) designing the human-equipment interfaces to fit
the human’s capabilities and limitations, (3) evaluating and testing to see that the design does fit
these capabilities and limitations, and (4) properly training the human to operate the equipment.
If the equipment uses ergonomically sound human-equipment interfaces, the following items
have been accomplished:
• The equipment conforms to populational stereotypes and user expectations
• It is easy to learn how to operate the equipment
• Easily perceived displays and simple controls allow effective and efficient
communication between humans and the equipment
• The tasks allocated to humans and the equipment are based on known relative
strengths and weaknesses
• Relevant information is provided to the user by the equipment which avoids
reliance on the user’s memory
• Effective and efficient performance of equipment functions is facilitated
Whenever practicable, human engineering specialists should be used to help identify and solve
human engineering problems. However, this is not always possible. There are numerous human
factors references available; however, most of these references are directed to human factors or
human engineering specialists. The reference provided at the end of this activity has been
directed specifically toward the engineer or designer and provides a number of guidelines to
assist designers in doing their own human engineering. Its purpose is to provide a general
reference to key human factors questions and human-equipment interface design suggestions in a
form that engineers and designers can utilize with a minimum of searching or study.
References
Woodson, W., Human Factors Design Handbook Information and Guidelines for the Design of
Systems, Facilities, Equipment, and Products for Human Use, New York:McGraw-Hill Book
Company, 1981.
4. Equipment Not Degraded: An indicated software or firmware problem that does not severely degrade the equipment or any essential function; restart acceptable.
5. Minor Fault: All other minor problems/non-functional faults due to software or firmware problems.
An example set of data collection, analysis, and reporting process flow steps includes:
STEP 1: Begin test sequence.
STEP 2: Collect equipment and execution data for each failure.
STEP 3: Send collected data to analysis personnel at end of test sequence.
STEP 4: Respond to queries from analysis personnel for more information.
STEP 5: Record failure and management status data.
STEP 6: Update software operational reliability data base.
STEP 7: Generate failure/fault count summary reports.
STEP 8: Update software operational reliability model.
STEP 9: Generate software operational reliability measures, graphs.
STEP 10: Provide summary of results to management on a regular basis.
The references provide more detail about software reliability.
References
Ireson, W., C. Coombs, Jr., editors, Handbook of Reliability Engineering and Management,
NY:McGraw-Hill, 1988.
Musa, J., A. Iannino, K. Okumoto, Software Reliability: Measurement, Prediction, Application,
NY:McGraw-Hill, 1987.
SETEC, "Software Reliability for SEMI/SEMATECH Companies (Draft)," SEMATECH,
SETEC-91-032, December 20, 1991.
The complexity of the equipment and the availability of data dictate the FMEA analysis approach
that will be used. There are two primary approaches for accomplishing an FMEA. One is the
hardware approach which lists individual hardware components and analyzes their possible
failure modes. The other is the functional approach which recognizes that every component is
designed to perform a number of functions that can be classified as outputs. These outputs are
listed and their failure modes are analyzed. For complex systems, a combination of the functional
and hardware approaches may be used. The FMEA may start at the highest equipment level and
proceed down to lower levels (top-down) or start at the lowest level and proceed to the highest
equipment level (bottom-up). The hardware approach is normally used when hardware
components can be uniquely identified from schematics, drawings, and other engineering and
design data. This approach is generally done bottom-up. The functional approach is normally
used when hardware components cannot be uniquely identified or when equipment complexity
requires analysis from the highest equipment level down through succeeding levels. This
approach is generally done top-down.
An FMEA analysis is used to:
• Ensure that all conceivable failure modes and their effects are understood
• Assist in the identification of design weaknesses
• Select design alternatives
• Select design improvements
• Prioritize corrective actions
Findings of the FMEA analysis are recorded in a tabular format in FMEA worksheets. MIL-STD-
1629A describes the worksheets in detail.
References
Sundararajan, C., Guide to Reliability Engineering Data, Analysis, Applications, Implementation,
and Management, NY:Van Nostrand Reinhold, 1991, pp. 146-152.
MIL-HDBK-338-1A, Electronic Reliability Design Handbook, Irvine, CA:Global Engineering
Documents, 12 October 1988, pp. 7-100 to 7-121.
MIL-STD-1629A, Procedures for Performing a Failure Mode, Effects, and Criticality Analysis,
Washington, DC:Department of Defense, 24 November 1980.
[Figure: FRACAS flow diagram. Recoverable labels: Test, Failure Report, Failure Review Board, Quality Assurance, Database, Report, Analysis (Failure Investigation, Cause Investigation), Development, Implementation, Verification.]
and the reporting of delinquencies to management. The failure cause for each failure is clearly
stated.
The objectives of a FRACAS are to:
• Assess historical reliability performance
• Develop a pattern of deficiencies
• Provide engineering data for corrective action
• Develop statistical data for
− component failure rates and downtime
− component selection suitability criteria
− component application reviews
− future designs and design reviews
− product improvement programs
− spares provisioning
− life cycle costing
• Develop contractual performance data
• Provide warranty information
• Furnish safety and regulatory compliance data
• Assess liability-claim information
References
A Reliability Guide to Failure Reporting, Analysis, and Corrective Action Systems, Milwaukee,
WI:American Society for Quality Control, 1977.
MIL-STD-785B, Reliability Program for Systems and Equipment Development and Production,
Task 104, Philadelphia, PA:Naval Publications and Forms Center, 1980.
Society of Automotive Engineers Data Activity D1: Data Collection and Data
One of the building blocks for FRACAS is the collection of data and managing that data with a
data base management system. Together, they provide an organized way to gather factual data
about equipment performance - both good and bad.
Based on the reliability model for the equipment, a shopping list for data is established. Each
component or subsystem modeled in the fault tree or block diagram requires data in the form of a
failure probability or frequency. Several types of data are needed to determine the failure
probability and to assess product reliability:
• Cumulative operating time
• Number of failures
• Conditions present at the time of failure
There are three methods used for collecting reliability data. The first method involves the use of
a standardized reporting form that is filled out by engineers and technicians who are involved in
equipment testing, troubleshooting, and repair. These forms need to be simple to use and ask
only for needed information. An example of a reliability reporting form is on the following page.
To give them a better understanding of the final use and importance of the data, the personnel
involved in collecting the data, final test technicians, and field service engineers should be part
of the team that designs the data collection form and should be involved in analyzing the data.
The second method involves the use of customer database and equipment tracking information.
This requires an excellent ongoing customer-supplier relationship. Great care must be taken to
ensure compatibility between the supplier's data and the data of multiple customers. Simply agreeing to
SEMI E10-90 specifications will not suffice, although basing the specifications on E10-90 makes
the data industry compatible. In addition, a standard way of attributing failures and assists to
subsystems and components should be devised. Including key customer equipment engineers
in evaluating the validity of the collected data is very useful.
The third method is to use the on-board CPU power to monitor and track equipment status,
faults, and errors. Customers agree to allow the information to be downloaded to a floppy disk
and removed from the site. The ability to time stamp and match this information to customer
data base information provides useful data.
[Example reliability reporting form not reproduced. Surviving field labels: Impact/Effect/Consequences of Problem; Remarks.]
If there is no equipment in the field from which to collect data, several other sources of data are
available:
• Historical data
• Sub-tier supplier data
• In-house data
• Expert judgement
Historical data is data that has been collected for a previous generation of equipment or similar
equipment. The use of this data is limited to those subsystems and components that are similar
to those in current equipment. This data also requires that attention is paid to trends; that is, if
the subsystem or component had been undergoing improvements or if the methods of collecting
the data were changing, these must be accounted for. When a subsystem or component is
purchased from a supplier, that supplier should be able to supply the data that has been collected
for that part up to this point in time. Once a testing program exists for the equipment, in-house
data is available. For those subsystems and components that have none of the previous sources
of data available, expert judgement can be used to create initial reliability values. Expert
judgement takes the opinion of individuals who are considered to be knowledgeable about a
subsystem or component and uses this knowledge to create failure rates. It should be noted that
these sources of data do not always represent the environment and operating conditions that the
equipment will see in the field. Thus, the preferred source of data is always field data.
When collecting data, it is important to keep all of the data. This makes it possible to represent
the subsystem and component failure rates over a range of values and more accurately represents
the variety of environments and users that the subsystem and component will see.
It cannot be stressed enough that the validity of the reliability model and its predictions depends
on the validity of the data. A statement commonly used by software users is, "Garbage In,
Garbage Out," which is just as applicable here. As soon as possible, replace historical and expert
judgement data with data collected during testing and operation in the field.
At this time it is important to discuss how the collected data is translated into failure rates that
are used to improve the equipment’s reliability. In a typical piece of equipment, some
components are under stress or used continuously while others are used cyclically. Thus, failure
rates can be defined as a function of time (per hour) or cycle (per wafer). In either case, the
collected data includes the number of cycles, wafers, or hours during which the failures occurred.
Failures are evaluated to assure that the failures were genuine and resulted in equipment
shutdown or lost production time. Once the evaluation is done, translating data into failure rates
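is fundamentally simple. Suppose that a database includes 25 machines operating over a 9-month
period. If component A failed 20 times and the average operational time for the 25
machines was 70 percent (that is, the utilization factor is 0.70), the failure rate for component A
would be
$$\lambda_A = \frac{20}{25 \times (9\ \text{mo.}) \times (30\ \text{days/mo.}) \times (24\ \text{hr./day}) \times 0.70} = \frac{20}{113{,}400\ \text{hr.}} \approx 1.8 \times 10^{-4}\ \text{failures/hr.}$$
Suppose a second component, B, failed 12 times, but its failures relate to wafers processed, and the
machine averages 10 wafers/hr. The failure rate of component B would be
$$\lambda_B = \frac{12}{25 \times (9\ \text{mo.}) \times (30\ \text{days/mo.}) \times (24\ \text{hr./day}) \times (10\ \text{wafers/hr.}) \times 0.70} = \frac{12}{1{,}134{,}000\ \text{wafers}} \approx 1.1 \times 10^{-5}\ \text{failures/wafer processed.}$$
Alternatively, expressed per hour, it would be
$$\lambda_B \approx (1.1 \times 10^{-5}\ \text{failures/wafer}) \times (10\ \text{wafers/hr.}) = 1.1 \times 10^{-4}\ \text{failures/hr.}$$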
The key, of course, is knowing or estimating the utilization factor. This can be determined by
tabulating and averaging the operational times of all 25 machines. It can also come from groups
of machines, given general production information.
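The arithmetic in this example can be sketched in a few lines of Python. This is only an illustration: the function and variable names are assumptions and do not come from this guideline, and the inputs are the figures used in the example (25 machines, 9 months of 30 days, a utilization factor of 0.70, and 10 wafers/hr).

```python
def operating_hours(machines, months, utilization,
                    days_per_month=30, hours_per_day=24):
    """Total operating hours accumulated by a group of machines."""
    return machines * months * days_per_month * hours_per_day * utilization


def failure_rate(failures, exposure):
    """Point estimate: failures per unit of exposure (hours or wafers)."""
    if exposure <= 0:
        raise ValueError("exposure must be positive")
    return failures / exposure


# Figures from the worked example in the text.
hours = operating_hours(machines=25, months=9, utilization=0.70)

# Component A: failure rate per operating hour.
rate_a_per_hour = failure_rate(20, hours)

# Component B: failure rate per wafer, then converted to a per-hour rate.
wafers_per_hour = 10
rate_b_per_wafer = failure_rate(12, hours * wafers_per_hour)
rate_b_per_hour = rate_b_per_wafer * wafers_per_hour

print(f"operating hours:        {hours:,.0f}")
print(f"component A, per hour:  {rate_a_per_hour:.2e}")
print(f"component B, per wafer: {rate_b_per_wafer:.2e}")
print(f"component B, per hour:  {rate_b_per_hour:.2e}")
```

Keeping the exposure calculation separate from the rate calculation makes it easy to switch between time-based and cycle-based failure rates once the utilization factor is known.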
Applicable Tools
AT18 User Groups
References
Bigelow, J., "Tailored Data Collection," Quality, August 1991, pp. 21-22.
Burgess, J.A., "Improving Product Reliability," Quality Progress, December 1987, pp. 47-54.
SEMI E10-90, Guideline For Definition And Measurement Of Equipment Reliability, Availability,
and Maintainability (RAM), SEMI 1990, pp. 69-75.
many failures will occur in components and parts that have not been sufficiently proven out; this
makes failure tracking difficult. Another disadvantage in starting equipment testing too early is
that if too many component and part failures occur, the remainder will be subjected to too many
start operations, which are perhaps more severe than steady-state operation. Consequently, a false
impression of the failure distributions will occur, compared with those expected in operation.
Equipment testing focuses on "Is the component or part reliable within the subsystem or
equipment?" Equipment testing does not eliminate component testing, but helps to pinpoint the
faulty components or parts, so that they may be replaced or modified by superior products.
Equipment testing is a way of realistically evaluating reliability as well as guiding component
and part improvement by systematically discovering problems and weaknesses.
There are several tools that are useful for testing subsystems and equipment. As with component
tests, accelerated testing can be used to gather reliability data in a shorter period of time. It can
also be used with Environmental Stress Screening (ESS) for subsystems and Reliability
Development/Growth Testing (RD/GT) for both subsystems and equipment. ESS is not done at
the equipment level; however, it is useful at the subsystem level. ESS can be used to stimulate
failures by stressing the subsystem to detect and remove early failures. RD/GT is used to identify
and correct failure modes and then to verify that the failure has been eliminated. Reliability
Qualification Testing (RQT) is used to verify that critical subsystems and the equipment meet
design goals and comply with contractual/program objectives. Life testing can be used to
evaluate the useful life or reliability of a subsystem or the equipment. Burn-In Testing is used to
screen out defects during a subsystem’s or equipment’s infant mortality period.
Reliability Demonstration Tests are used to demonstrate, often to the customer, that the
equipment is capable of meeting its specified performance and reliability for a stated period of
operation. This type of test can be very expensive and requires careful planning and execution.
The equipment and its associated subsystems, components, and parts that are going to be tested,
and the test conditions to be used must be closely controlled to ensure the validity of the final
results. It is often the practice to disassemble the items totally after the tests are completed to
inspect each one for wear, damage, or signs of impending failure.
A tool that is very useful for reliability demonstration tests is Reliability Qualification Testing
(RQT). RQT is used to verify that the equipment will meet design goals and comply with
contractual/program requirements.
Applicable Tools
AT1 Accelerated Testing
AT2 Burn-In Testing
AT6 Environmental Stress Screening (ESS)
AT8 Life Testing
AT13 Reliability Development/Growth Testing (RD/GT)
AT14 Reliability Qualification Testing (RQT)
References
Burgess, J.A., "Improving Product Reliability," Quality Progress, December 1987, pp. 51-52.
Lloyd, D.K., M. Lipow, RELIABILITY: Management, Methods, and Mathematics, Second
Edition, Milwaukee, WI:The American Society for Quality Control, 1991, pp. 349-354.
The left decreasing portion of the curve is the infant mortality period, where a disproportionate
number of failures occur early in the equipment’s lifetime. The flat part represents the constant
failure rate during the useful life of the equipment. The right increasing portion is the wear-out
period. It is useful to know, as closely as possible, where the infant mortality ends and the wear
out starts, even when burn-in tests are not performed.
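One common way to illustrate the three regions of this curve is to add a decreasing, a constant, and an increasing failure-rate term. The sketch below assumes Weibull-type terms with made-up parameters; neither the functional form nor the numbers come from this guideline, and real equipment data would be needed to locate the actual transitions.

```python
def weibull_hazard(t, shape, scale):
    """Weibull hazard rate h(t) = (shape/scale) * (t/scale)**(shape - 1), t > 0."""
    return (shape / scale) * (t / scale) ** (shape - 1)


def bathtub_hazard(t):
    """Illustrative bathtub curve: sum of three assumed failure-rate terms."""
    infant = weibull_hazard(t, shape=0.5, scale=2000.0)    # decreasing (shape < 1)
    useful = 1.0e-4                                         # constant failure rate
    wearout = weibull_hazard(t, shape=5.0, scale=20000.0)   # increasing (shape > 1)
    return infant + useful + wearout


for hours in (10, 100, 1000, 5000, 15000, 25000):
    print(f"{hours:>6} hr  {bathtub_hazard(hours):.2e} failures/hr")
```

With these assumed parameters the printed rate falls over the first few thousand hours and rises again late in life, which is the qualitative shape described above.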
Burn-in has proven to be an effective means of screening out defects during a component’s infant
mortality period. The typical burn-in test combines electrical stresses with temperature cycling
for short periods of time to activate temperature and voltage failure mechanism dependencies.
The two types of burn-in tests are static and dynamic. In static burn-in, a bias may be applied to
the device under test at very high temperatures. In dynamic burn-in, entire circuit cards may be
operated to simulate actual equipment operation.
Screening out the infant mortality failures results in more reliable components. Because most of
the failures occur during the infant mortality phase of the component’s life, this method of testing
results in reliability improvement of the equipment.
Burn-in tests are usually conducted on 100% of the production units to weed out production
errors related to minor variations in workmanship and process fluctuations that result from
engineering changes. Burn-in tests also discover some residual design errors. In these tests, the
stresses applied are usually within published performance constraints, and are applied for short
periods of time. Their purpose is to prevent production-related errors from being shipped.
Products that have undergone burn-in tests should be failure free.
References
Klinger, D., Y. Nakada, M. Menendez, AT&T Reliability Manual, NY:Van Nostrand Reinhold,
1990, pp. 52-57.
Punches, K., "Burn-In and Strife Testing," Quality Progress, May 1986, pp. 93-94.
For each cause ask, "Why does it happen?" and list responses as branches off the major causes.
The causes shown as branches can have sub-causes, indicated by sub-branches, and so on.
References
Ishikawa, K., Guide to Quality Control, White Plains, NY:Quality Resources, 1982, pp. 8-29.
O’Connor, P., Practical Reliability Engineering, Third Edition, New York:John Wiley & Sons,
1991, pp. 311-312.
The Memory Jogger, Methuen, MA:GOAL/QPC, 1988, pp. 24-29.
The process itself is straightforward and simple; Industry Week outlines the benchmarking
process with a list of 10 steps. However, the simplicity of the process belies its true power. One
aspect of benchmarking that sets it apart is that it directs a company’s focus outside its own
walls, squarely at the marketplace and its competition. This leads to setting goals that
are geared toward being the best in the world, not just slightly better than last year.
Another benefit of benchmarking is that it can provide the blueprints for how a company can leap
ahead of even the best of its competitors. Improvements are not only in the equipment but in
secondary and supporting systems and processes.
Other benefits of benchmarking include:
• Identifying the keys for success for each area studied
• Providing specific quantitative targets
• Creating an awareness of state-of-the-art approaches
• Cultivating a culture where change, adaptation, and continuous improvement are
actively sought out
• Spotting emerging competitors and seeing where the company should be going in
the future
References
Altany, D., "Copycats," Industry Week, November 5, 1990, pp. 11-18.
Camp, R., Benchmarking: The Search For Best Practices That Lead To Superior Performance,
Milwaukee, WI:ASQC Quality Press, 1989.
Pryor, L., Beating The Competition: A Practical Guide To Benchmarking, Washington
DC:Kaiser Associates, 1988.
Competitive Benchmarking: What It Is And What It Can Do For You, Stamford, CT:Xerox
Corporate Quality Office, Reference No. 700P90201, May 1987.
References
Bailey, R., R. Gilbert, "STRIFE Testing for Reliability Improvement," PROCEEDINGS -
Institute of Environmental Sciences, Vol. 1, 1981, pp. 119-123.
Bird, C., "Unit Level Environmental Screening," PROCEEDINGS - Institute of Environmental
Sciences, May 1980, pp. 63-64.
Punches, K., "Burn-In and Strife Testing," Quality Progress, May 1986, pp. 93-94.
Tustin, W., "Shake and Bake the Bugs Out," Quality Progress, Sept. 1990, pp. 61-64.
MIL-STD-785B, Reliability Program For Systems And Equipment Development And Production,
Task 301 Environmental Stress Screening, 3 July 1986, pp. 301-1 to 301-2.
MIL-STD 810E, Environmental Test Methods And Engineering Guidelines, 14 July 1989.
RMS Committee, RMS Reliability, Maintainability & Supportability Guidebook, SAE G-11,
Warrendale, PA:Society of Automotive Engineers, Inc., 1990, pp. 203-209.
[Icon: fault tree with top event Equipment Failure; subsystems SS1, SS2, SS3; components C1 to C5; parts P1 to P4.]
FTA is used to determine the various combinations of events; that is, component-level failures,
that could result in equipment failure. Component-level failures include hardware failures,
human errors, and software errors. A failure can range from noncompliance with specifications
to the inability of a component to perform its intended function. Component-level failures, in
fault tree (FT) terminology, are called primary events. Equipment failure refers to an undesired
state of the equipment, such as when the equipment stops functioning or makes bad products.
Equipment failure, in fault tree terminology, is called the top event. A fault tree is not a model of
all possible equipment failures or all possible causes of equipment failure. A fault tree is tailored
to its top event; that is, the fault tree only includes those failures that cause that top event to
occur.
Construction of a FT begins by defining what the top event is, for example, failure of the
equipment at less than 1000 hours. The next step involves determining the various ways that this
failure can occur. This is initially done at a fairly gross level. (For example, equipment failure
due to failure of the wafer handler subsystem). Once the equipment is modeled at a gross level;
that is, the model consists of 10 to 20 major subsystems, the next step is to determine which of
the subsystems should be modeled in more detail. If a particular subsystem rarely fails and it is
anticipated that this situation will not change, it would be a waste of time and effort to model it.
Concentrate instead on those subsystems that cause the equipment to frequently or
catastrophically fail. Those subsystems that are targeted as a reliability problem for the
equipment are broken into more detail. For example, the wafer handler subsystem could be
broken into the arm, associated software, and electrical components. Only those portions of the
wafer handler subsystem that significantly contribute to failure of that subsystem are broken into
more detail. This process is continued for all identified subsystems until all potential ways of
failing the equipment are identified.
The remainder of the description of this tool will focus on a general description of fault tree
analysis and the Boolean algebra necessary to quantify the fault tree into an equipment failure
rate. The references at the end of the description provide more detailed information.
At the top of the FT the top event is listed within a rectangle. The icon at the beginning of this
tool description has labeled its top event Equipment Failure. Next, the question, "How can the
equipment fail?" is asked. All those events; that is, subsystems, that can cause equipment failure
are placed in the FT under the top event, see Subsystem 1 (SS1), Subsystem 2 (SS2), and
Subsystem 3 (SS3) in the icon. Gates are used to connect the events. The gate between the top
event, equipment failure, and the subsystem events, SS1, SS2, and SS3, indicates that failure of
SS1, SS2, or SS3 will cause the equipment to fail. Some of the symbols used in a fault tree
include:
Primary Events
    Basic Event    A basic failure requiring no further development.
Gates
    AND Gate       Output fault occurs if all the input faults occur.
Transfer Symbols
    Transfer In    Indicates that the tree is developed further on another page.
There are other less-used events and gates that are described in texts on FTA. As can be seen in
the icon, SS1 fails if component 1 or 2 (C1 or C2) fails. C2 fails only if both parts 1 and 2 (P1
and P2) fail. SS3 fails if components 3, 4, and 5 (C3, C4, and C5) all fail. Failure of C4 requires
either part 3 (P3) or part 4 (P4) to fail.
Once construction of the fault tree is completed, it is translated into an equation that is used to
quantify the equipment failure rate. Fault trees are based on Boolean algebra. Boolean algebra is
the mathematical manipulation of events derived from logical reasoning. The references discuss
Boolean algebra in detail; it will not be discussed here. The Boolean equations for the icon fault
tree are:
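The equations themselves do not appear in this copy. Based only on the gate description given for the icon, and writing + for OR and · for AND, they would read: Equipment Failure = SS1 + SS2 + SS3, SS1 = C1 + C2, C2 = P1 · P2, SS3 = C3 · C4 · C5, and C4 = P3 + P4. The Python sketch below evaluates that logic. It assumes that C1, C3, C5, SS2, and P1 to P4 are independent basic events (SS2 is not developed further in the description), and every probability value and function name is hypothetical.

```python
from itertools import product

# Basic events assumed for this sketch (SS2 is treated as undeveloped).
BASIC_EVENTS = ("P1", "P2", "P3", "P4", "C1", "C3", "C5", "SS2")


def top_event(state):
    """Evaluate the tree; state maps event name -> True if that event failed."""
    c2 = state["P1"] and state["P2"]          # AND gate
    c4 = state["P3"] or state["P4"]           # OR gate
    ss1 = state["C1"] or c2                   # OR gate
    ss3 = state["C3"] and c4 and state["C5"]  # AND gate
    return ss1 or state["SS2"] or ss3         # OR gate under the top event


def and_gate(*probs):
    """Probability that all independent input events occur."""
    result = 1.0
    for p in probs:
        result *= p
    return result


def or_gate(*probs):
    """Probability that at least one independent input event occurs."""
    none_occur = 1.0
    for p in probs:
        none_occur *= 1.0 - p
    return 1.0 - none_occur


def top_event_probability(p):
    """Gate-by-gate quantification (valid here: no basic event is repeated)."""
    c2 = and_gate(p["P1"], p["P2"])
    c4 = or_gate(p["P3"], p["P4"])
    ss1 = or_gate(p["C1"], c2)
    ss3 = and_gate(p["C3"], c4, p["C5"])
    return or_gate(ss1, p["SS2"], ss3)


def brute_force_probability(p):
    """Cross-check: sum the probability of every failing combination."""
    total = 0.0
    for flags in product((False, True), repeat=len(BASIC_EVENTS)):
        state = dict(zip(BASIC_EVENTS, flags))
        weight = 1.0
        for name, failed in state.items():
            weight *= p[name] if failed else 1.0 - p[name]
        if top_event(state):
            total += weight
    return total


# Hypothetical basic-event failure probabilities, for illustration only.
PROB = {"P1": 0.05, "P2": 0.05, "P3": 0.02, "P4": 0.02,
        "C1": 0.01, "C3": 0.10, "C5": 0.10, "SS2": 0.005}

print(f"gate-by-gate: {top_event_probability(PROB):.6f}")
print(f"brute force:  {brute_force_probability(PROB):.6f}")
```

The gate-by-gate calculation is exact only because each basic event appears once in this tree; trees with repeated events need the Boolean reduction described in the references.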
[Icon: Pareto diagram; vertical axis: No. of Failures.]
The Pareto diagram is a vertical or horizontal bar chart used to quantify and identify problems
and determine which problems should be worked on first. The bars are used to present a graphic
picture of the problems related to equipment. The bars are arranged in descending order of
importance from left to right. Analyzing failure data and using that data to create a Pareto
diagram allows for determining how to solve the largest proportion of the overall reliability
problem with the most economical use of resources.
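As a minimal sketch of how failure data can be turned into the ordering a Pareto diagram displays, the Python example below counts failures by cause, sorts the counts in descending order, and adds a cumulative percentage. The failure log and the function name are hypothetical and serve only to illustrate the idea.

```python
from collections import Counter


def pareto_table(failure_log):
    """Return (cause, count, cumulative percent) rows, largest count first."""
    counts = Counter(failure_log)
    total = sum(counts.values())
    rows = []
    cumulative = 0
    for cause, count in counts.most_common():
        cumulative += count
        rows.append((cause, count, 100.0 * cumulative / total))
    return rows


# Hypothetical failure log: one entry per recorded failure.
log = (["robot servo"] * 12 + ["wafer sensor"] * 7
       + ["elevator door"] * 4 + ["vacuum pump"] * 2)

for cause, count, cum_pct in pareto_table(log):
    print(f"{cause:<14} {'#' * count:<12} {count:>3} {cum_pct:6.1f}%")
```

The cumulative column shows how much of the overall problem is covered by working on the first few causes, which is the basis for deciding what to work on first.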
References
Harrington, H., The Improvement Process, New York:McGraw-Hill, 1987, pp. 108-110, 207.
Ishikawa, K., Guide to Quality Control, White Plains, NY:Quality Resources, 1982, pp. 42-49.
O'Connor, P., Practical Reliability Engineering, Third Edition, New York:John Wiley & Sons,
1991, pp. 270-271.
The Memory Jogger, Second Edition, Methuen, MA:GOAL/QPC, 1988, p. 17.
If equipment, subsystems, components, or parts have a tolerance (or specification) width, and are
produced by a process that generates variation in the parameter(s) of interest, it is important that
the process variation be less than the tolerance width. The ratio of the tolerance to the process
variation is called the process capability index, and is expressed as
$$C_p = \frac{T}{6\sigma}$$
where T is the tolerance width and 6σ represents an interval of six standard deviations or, plus or
minus three standard deviations from the process mean. A Cp of 1 indicates that a process will
generate approximately 3 out-of-specification units in 1000, given the following assumptions.
The first assumption is that the process is normally distributed and stable. Any systematic
divergence, due for example to set-up errors, movement of the process mean during the
manufacturing cycle, or other causes, could significantly affect the output. Therefore, the use of
Cp to characterize a production process is appropriate only for processes that are under statistical
control; that is, there are no special causes of variation such as those just mentioned, only
common causes. Common cause variation is the random variation inherent in the process, when
it is under statistical control. The Cp index also assumes that the tolerance center and the process
mean coincide; that is, the process average is centered on the nominal value.
The Cpk index uses the Cp index as a starting point for stating a process’s capability; however, it
accounts for the process center not being the nominal value. Cpk is expressed as
$$C_{pk} = (1 - K)\,C_p$$
where
$$K = \frac{|D - \bar{x}|}{T/2}$$
(that is, $D - \bar{x}$ if $D > \bar{x}$; otherwise $\bar{x} - D$). D is the design center, $\bar{x}$ is the process mean, and T is
the tolerance width.
Ideally Cp = Cpk.
There are several things to keep in mind when using Cp and Cpk indices:
• If the process is not stable, Cp and Cpk are meaningless statistics.
• Not all processes can be assumed to be normally distributed. A naive user may
incorrectly assess the fraction of process output that will be out of specification.
• Cp and Cpk do not yield the same information about a process.
• Both Cp and Cpk are closely tied to traditional 0-1 loss and do not account for
losses incurred for being off-target; each measures distance from specifications
not distance from target.
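The two indices can be computed directly from the definitions above. The following Python sketch is illustrative only: the measurements and specification limits are invented, the function names are assumptions, and the design center D is assumed to sit at the midpoint of the tolerance band. As the cautions above note, the results are meaningful only for a stable, approximately normal process.

```python
import statistics


def cp(tolerance_width, sigma):
    """Process capability index: tolerance width over six standard deviations."""
    return tolerance_width / (6.0 * sigma)


def cpk(tolerance_width, sigma, design_center, process_mean):
    """Cpk = (1 - K) * Cp, with K = |D - mean| / (T / 2)."""
    k = abs(design_center - process_mean) / (tolerance_width / 2.0)
    return (1.0 - k) * cp(tolerance_width, sigma)


# Hypothetical measurements of one parameter and its specification limits.
measurements = [10.02, 9.98, 10.05, 10.01, 9.97, 10.03, 10.00, 9.99, 10.04, 10.01]
lower_limit, upper_limit = 9.85, 10.15

T = upper_limit - lower_limit              # tolerance width
D = (upper_limit + lower_limit) / 2.0      # design center (assumed midpoint)
mean = statistics.mean(measurements)
sigma = statistics.stdev(measurements)     # sample standard deviation

print(f"Cp  = {cp(T, sigma):.2f}")
print(f"Cpk = {cpk(T, sigma, D, mean):.2f}")
```

When the process mean coincides with the design center, K is zero and the two indices are equal; any offset makes Cpk smaller than Cp.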
References
Gitlow, H., S. Gitlow, A. Oppenheim, R. Oppenheim, Tools and Methods for the Improvement of
Quality, Boston, MA:IRWIN, 1989, pp. 451-457.
Kane, V. E., "Process Capability Indices," Journal of Quality Technology, Vol. 18, No. 1,
January 1986, pp. 41-52.
O’Connor, P., Practical Reliability Engineering, Third Edition, New York:John Wiley & Sons,
1991, pp. 302-303.
Sullivan, L., "Reducing Variability: A New Approach to Quality," Quality Progress, July 1985,
pp. 15-21.
The Memory Jogger, Second Edition, Methuen, MA:GOAL/QPC, 1988, pp. 64-68.
[Icon: QFD matrix showing the Correlation Matrix, Priority, How?, What?, Relationship Matrix, Importance Ratings, and How much? regions.]
The focus of QFD is almost entirely on the customer; that is, the voice of the customer. The
attitude promoted by QFD is one of problem avoidance rather than problem solving.
QFD is best used in a team or group context. The information required to complete a QFD
matrix is usually found in many different disciplines or skill sets. The information needed
stretches from a few simple (but presumably accurate) statements of customer needs, all the way
to the most detailed manufacturing process description. Therefore, it is not a methodology that
can be effectively used by a single person.
Advantages of QFD include:
• Promoting careful planning of the equipment through all life cycle phases in such
a way that attention is paid to customer needs
• Eliminating spurious engineering and process requirements; that is, those that
have no role in meeting customer needs
• Shortening the time it takes to move through the concept and feasibility to
production and operation phases by avoiding later life cycle changes that stretch
out the cycle time
• Identifying problem areas early, exposing areas for improvement, and providing
documentation for these activities
Difficulties with QFD include:
• Being semi-quantitative, QFD doesn’t replace good engineering judgement and
good sense
• An inability to compensate for an inaccurate or incomplete list of customer needs
• Not being designed to promote innovation in the sense of new or radical product
ideas
• Requiring the use of a wide variety of expertise and a team environment
The basics of the QFD matrix are simple, although in practice it is a great deal of work to collect
the information necessary to create the matrix. Generally, the QFD matrix consists of seven parts:
• What?
• How?
• Relationship matrix
• Priority
• Correlation matrix
• Importance ratings
• How much?
What? is a collection of simple statements of customer wants, needs, or requirements; that is, the
voice of the customer. These statements are easy for the customer to identify with and to
understand. They accurately and simply list the group of characteristics or properties that make
the customer happy.
How? is a list of engineering, design and technical properties that are necessary to develop the
equipment. The What? list becomes the titles for the QFD matrix rows, and the How? list
becomes the titles for the columns, see the icon at the beginning of the QFD discussion.
The relationship matrix is used to relate the What? rows to the How? columns. A relevance
number or symbol is assigned to the intersections of the rows and columns. This results in
establishing the relationship between what the customer wants and how the equipment is going to
meet those wants.
Usually an extra column, called priority, is placed just to the left of the relationship matrix. It is
used to assign importance weights to the customer wants; that is, to determine which of the
customer wants are the most important to the customer. This determines which characteristics
will get the most focus. The determination is made with the customer, or at least with some very
good knowledge of what the customer wants.
Engineering, design, and technical properties are not independent of one another. Therefore, it is
necessary to examine how they relate to one another. This results in the roof of the house of
quality which is the correlation matrix. It is also necessary to determine if the properties are
correlated positively or negatively. An example of negatively correlated properties would be
strength and flexibility.
The matrix is usually expanded further to include the importance ratings and the How much?
column. The importance ratings contain numbers derived from the matrix values and the priority
column. They are used to indicate the importance of each of the properties with respect to the
customer wants. The How much? column contains the target values for every property listed in
the How? column. It answers the question, "How much is enough?"
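The text does not specify how the importance ratings are derived from the matrix values and the priority column. One widely used convention, assumed here, is a weighted column sum: each relationship value is multiplied by the priority of its What? row, and the products are added down each How? column. The Python sketch below uses a hypothetical 9/3/1/0 relationship scale and invented wants, properties, and priorities.

```python
def importance_ratings(priorities, relationship):
    """rating[j] = sum over rows i of priorities[i] * relationship[i][j]."""
    if len(priorities) != len(relationship):
        raise ValueError("one priority is needed per What? row")
    n_hows = len(relationship[0])
    if any(len(row) != n_hows for row in relationship):
        raise ValueError("all rows must have one value per How? column")
    return [sum(p * row[j] for p, row in zip(priorities, relationship))
            for j in range(n_hows)]


# Hypothetical QFD data, for illustration only.
whats = ["high uptime", "low particle count", "easy maintenance"]
hows = ["MTBF", "enclosure sealing", "MTTR"]
priorities = [5, 4, 3]                  # customer importance weights
relationship = [                        # 9 = strong, 3 = medium, 1 = weak, 0 = none
    [9, 1, 3],
    [0, 9, 0],
    [1, 0, 9],
]

ratings = importance_ratings(priorities, relationship)
for how, rating in sorted(zip(hows, ratings), key=lambda pair: pair[1], reverse=True):
    print(f"{how:<18} {rating}")
```

Sorting the columns by rating gives a quick view of which engineering properties carry the most weight with respect to the customer wants.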
References
Akao, Y., editor, Quality Function Deployment: Integrating Customer Requirements into Product
Design, Norwalk, CT:Productivity Press, 1990.
Hauser, J., D. Clausing, "The House of Quality," Harvard Business Review, May-June 1988,
pp. 63-73.
Ryan, N., editor, Taguchi Methods and QFD: Hows and Whys for Management, Dearborn,
MI:ASI Press, 1988, pp. 63-110.
Modeling produces its maximum economic benefit when performed during the design phase of
the equipment life cycle. However, modeling can also provide economic benefits when applied
to existing equipment.
The development of a system model depends heavily on the user’s understanding of the
equipment that is being modeled. However, proper utilization of the model also requires the
analyst to have a working knowledge of several concepts in the areas of statistics, probability,
and reliability.
Version 1.0 of RAMP provides the capability for developing, editing, and evaluating reliability
models for equipment used in semiconductor manufacturing. This capability is supported by an
integrated data management system and an integrated graphics output capability.
The following features were included to make the software as user friendly as possible:
• Menu driven. All options available to the user can be accessed from on-screen
menus.
• Help screens. Context-sensitive help is available to the user at all times.
• Mouse support. Mouse support is provided on all screens where use of the mouse
significantly improves the user interface.
Figure 4-1. A Block Model Developed in RAMP for the SETEC Generic Wafer
Handler System
A system model for the equipment is easily developed in RAMP in the form of a block diagram.
Figure 4-1 gives an example of a block model representation of a SETEC generic wafer handler
as developed in RAMP by the analyst. The system is represented with 14 components in series
(7 of which are shown in Figure 4-1). Component failure rate information, including a
characterization of the uncertainty, is entered into the component data library in RAMP. RAMP
converts the block diagram model in Figure 4-1 to a mathematical equation and uses random
selection techniques to sample the component failure rates from the component data library. The
output from RAMP provides complete sensitivity and uncertainty analysis results for various
performance measures that are associated with a reliability analysis of the system being modeled,
including
• System MTBF: The system MTBF is for the modeled system. A range of values
for the MTBF and the distribution associated with that range is provided.
• Component contribution to system failure: The fractional contribution that a
component makes to the failure of the system.
• Component contribution to subsystem failure: The fractional contribution that a
component makes to the failure of the subsystem.
• Subsystem contribution to system failure: The fractional contribution that a
subsystem makes to the failure of the system.
• Reliability improvement: The value of reliability improvement for a component is
the system MTBF (in hours) that would result if the failure rate for that
component were zero (that is, the component were perfectly reliable or nearly so).
• Uncertainty importance: Uncertainty importance provides a measure of the
contribution of a component to the uncertainty in the probability of system failure.
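The sampling idea behind these measures can be sketched in Python. This is not RAMP's actual algorithm, which is not described here. The sketch assumes a series system with constant failure rates, so the system failure rate is the sum of the component rates and the system MTBF is its reciprocal. It also assumes lognormal uncertainty on each component failure rate, with invented medians and error factors for four hypothetical components.

```python
import math
import random
import statistics

# Hypothetical data: name -> (median failure rate per hour, error factor).
COMPONENTS = {
    "robot servo": (3.0e-3, 3.0),
    "robot wafer sensor": (2.5e-3, 3.0),
    "elevator door": (2.0e-3, 2.0),
    "sensor amplifiers": (1.5e-3, 2.0),
}


def sample_rate(rng, median, error_factor):
    """Draw a failure rate from a lognormal distribution whose 95th
    percentile equals median * error_factor (an assumed parameterization)."""
    sigma = math.log(error_factor) / 1.645
    return rng.lognormvariate(math.log(median), sigma)


def simulate(components, n_samples=5000, seed=1):
    """Return (sampled system MTBF values, mean fractional contributions)."""
    rng = random.Random(seed)
    mtbf_samples = []
    contribution = {name: 0.0 for name in components}
    for _ in range(n_samples):
        rates = {name: sample_rate(rng, median, ef)
                 for name, (median, ef) in components.items()}
        system_rate = sum(rates.values())       # series system: rates add
        mtbf_samples.append(1.0 / system_rate)
        for name, rate in rates.items():
            contribution[name] += (rate / system_rate) / n_samples
    return mtbf_samples, contribution


def percentile(values, pct):
    """Nearest-rank percentile of a list of numbers."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100.0 * len(ordered)))
    return ordered[rank - 1]


mtbf, contribution = simulate(COMPONENTS)
print(f"mean MTBF: {statistics.mean(mtbf):.0f} hr")
print(f"5th / 95th percentile: {percentile(mtbf, 5):.0f} / {percentile(mtbf, 95):.0f} hr")
for name, share in sorted(contribution.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name:<20} {share:.2f}")
```

Fixing the random seed makes the run repeatable, and the sorted contributions correspond to the ordering a Pareto display of component contributions would show.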
Results produced by RAMP are available in various types of displays, including
• Histograms: A histogram is a graphical presentation of sample data using classes (that is,
intervals) on the x axis and relative frequency on the y axis.
• Cumulative distribution functions (CDFs): A CDF is a graph of the cumulative relative
frequency (cumulative fraction) of observations less than or equal to a given value.
• Pareto diagrams: A Pareto diagram is a bar chart with the displayed values ordered from the
largest to the smallest. RAMP orders displayed values based on the mean. The 5th and 95th
percentiles are also displayed when they are available.
• Summary statistics: A written list of all the statistics calculated by RAMP is displayed, such as
the average MTBF, standard deviation for MTBF, and selected quantiles of the uncertainty
distribution for MTBF.
• Input samples: This option allows an analyst to view or print input failure rates as sampled from
component failure rate distributions.
• Output results from samples: This option allows the analyst to view or print the numerical
results that are calculated for each of the sampled failure rates.
• Statistical results: This option allows an analyst to view or print selected statistical results, such
as the mean value for all components.
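The histogram and CDF definitions above translate directly into code. The following is a minimal sketch of those two definitions only, not of RAMP's displays; the function names and the sample values in the usage note are hypothetical.

```python
from bisect import bisect_right

def ecdf(samples):
    """Return F, where F(x) is the fraction of samples less than or equal to x."""
    ordered = sorted(samples)
    n = len(ordered)
    return lambda x: bisect_right(ordered, x) / n

def relative_histogram(samples, edges):
    """Relative frequency of samples in each class [edges[i], edges[i+1])."""
    counts = [0] * (len(edges) - 1)
    for x in samples:
        for i in range(len(counts)):
            if edges[i] <= x < edges[i + 1]:
                counts[i] += 1
                break
    return [count / len(samples) for count in counts]
```

For example, with five made-up MTBF samples `[60, 80, 90, 120, 200]`, `ecdf(...)(90)` is the fraction of samples at or below 90 hrs, and `relative_histogram(..., [50, 100, 150, 250])` gives the relative frequency in each of three classes.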
Based on the characterization of the failure rates in the component data library for the SETEC
generic wafer handler system shown in Figure 4-1, the summary statistics produced by RAMP
give a mean value for MTBF of 93 hrs with about a 5 percent chance of being less than 50 hrs
and a 5 percent chance of exceeding 178 hrs. A graph of the estimated cumulative distribution
function for MTBF that is produced by RAMP is given in Figure 4-2.
The Pareto diagram in Figure 4-3 identifies the components that are the dominant contributors to
the failure of the system, such as the robot servo, robot wafer sensor, elevator door, and sensor
amplifiers. The Pareto diagram uses three horizontal bars with each component name rather than
the usual one bar. This is done to display the uncertainty associated with the contribution of each
component to system failure. The three bars represent the 95th percentile, the mean, and the 5th
percentile of the distribution of the component’s contribution to system failure.
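The three-bar Pareto ranking described above can be sketched as follows. This assumes a series system, where a component's fractional contribution to system failure is its failure rate divided by the sum of all rates; the function name and the crude order-statistic percentiles are illustrative choices, not RAMP's documented method.

```python
import statistics

def pareto_rows(rate_samples):
    """Rank components by mean fractional contribution to system failure.

    rate_samples: list of dicts, each mapping component name -> sampled failure rate.
    Returns (name, p05, mean, p95) rows ordered from largest to smallest mean.
    """
    names = list(rate_samples[0])
    rows = []
    for name in names:
        # Fraction of the total series failure rate due to this component, per sample.
        fractions = sorted(s[name] / sum(s.values()) for s in rate_samples)
        n = len(fractions)
        rows.append((
            name,
            fractions[int(0.05 * (n - 1))],  # crude 5th percentile
            statistics.fmean(fractions),
            fractions[int(0.95 * (n - 1))],  # crude 95th percentile
        ))
    return sorted(rows, key=lambda row: row[2], reverse=True)
```

Each returned row corresponds to one component's three bars, and the rows are ordered by the mean, as the text says RAMP orders them.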
Now assume that the engineers involved with the SETEC generic wafer handler have developed
a new and improved elevator that improves its MTBF by a factor of 2. The component data
library is modified to reflect the new MTBF for the elevator.
In addition, the engineers would like to evaluate the impact on system reliability of a design
change that would incorporate redundancy by adding another robot wafer sensor in parallel.
Because the sensors are in parallel, they must both fail before they cause the system to fail, thus
improving the system MTBF. The block diagram model is modified to include this desired
design change. The modified block diagram is shown in Figure 4-4.
Figure 4-4. A Revised Block Diagram for the SETEC Generic Wafer Handler
System, Showing the Addition of the Redundant Wafer Sensor
(block labels recovered from the figure: WHS-ROB, WSENP)
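The effect of a redundant sensor on system MTBF can be checked with a short calculation. The sketch below assumes constant (exponential) failure rates, independent failures, two identical sensors, and no repair of a failed sensor before the system fails, so the result is strictly a mean time to first system failure. The rest of the system is lumped into a single series failure rate. These assumptions and the rates in the usage note are hypothetical and are not the values behind Figure 4-4.

```python
def mtbf_series(rate_rest, rate_sensor):
    """Mean time to failure with a single sensor in series with the rest of the system."""
    return 1.0 / (rate_rest + rate_sensor)

def mtbf_with_redundant_sensor(rate_rest, rate_sensor):
    """Mean time to failure with two identical sensors in parallel.

    System reliability is R(t) = exp(-rate_rest*t) * (2*exp(-rate_sensor*t) - exp(-2*rate_sensor*t));
    integrating R(t) from 0 to infinity gives the closed form below.
    """
    return 2.0 / (rate_rest + rate_sensor) - 1.0 / (rate_rest + 2.0 * rate_sensor)
```

With a hypothetical `rate_rest = 0.008` per hour and `rate_sensor = 0.002` per hour, the single-sensor value is 100 hrs and the redundant value is about 117 hrs. The result stays below the 125 hrs that a perfectly reliable sensor would give.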
After these modifications, the summary statistics produced by RAMP give a mean value for
MTBF of 137 hrs, an increase of 47 percent over the original 93 hrs. There is approximately a 5 percent chance of the
MTBF being less than 64 hrs and a 5 percent chance of it exceeding 249 hrs. A graph of the
estimated cumulative distribution function for MTBF that is produced by RAMP is given in
Figure 4-5.
Figure 4-5. An Estimate of the Cumulative Distribution Function for MTBF after
Modifying the Generic Wafer Handler System
The new Pareto diagram is given in Figure 4-6 and shows that the wafer sensor is no longer a
problem and has dropped out of the top ten list of components contributing to system failure. In
addition, the elevator door has now dropped behind the sensor amplifiers in the rankings.
This example has illustrated how RAMP provides a prediction of the system MTBF (including
the uncertainty in the prediction) after making two improvements in the system. Thus, modeling
has provided a tool for adopting a proactive position rather than a reactive position with respect
to making changes in the system to improve its reliability. That is, the analyst now has a good
idea of how the proposed changes will affect the performance of the system and knows where to
expend the company’s resources to provide an even greater improvement prior to committing
those resources.
This simple example provided a flavor of how RAMP works and demonstrated the usefulness of
modeling. Modeling alone does not make a system reliable, but it does provide an organized
means of understanding the system as well as being a tool to guide the wise expenditure of
resources for improved reliability.
References
Campbell, J., R. Iman, D. Longsine, B. Thompson, A Tutorial on Reliability Modeling Using
RAMP, SETEC91-030, Albuquerque, NM:Sandia National Laboratories.
Campbell, J., B. Thompson, D. Longsine, P. O’Connell, R. Iman, RAMP User’s Reference
Manual, SETEC91-030, Albuquerque, NM:Sandia National Laboratories.
Reliability Block Diagram (RBD) models are one of the tools that can be used to create a
reliability model of equipment. One of the easiest ways to describe the basic ideas used in the
creation of RBD models is to create a simple RBD. For a more detailed description of the
diagrams, see the sources listed in the references. Construction of a reliability block diagram
begins by defining what is meant by equipment failure; for example, equipment failure may be
defined as any failure that causes the equipment to be down for 8 minutes or longer. Once this is
done, the next step is to determine the various ways that this failure can occur. This is initially
done at a gross level; that is, 10 to 20 subsystems are defined that can lead to equipment failure.
A block diagram model that consists of 3 subsystems (SS1, SS2, and SS3) in series follows:

SS1 --- SS2 --- SS3

In this example SS2 is not a significant contributor to the unreliability of the equipment, so it will
not be broken into any more detail. SS1 and SS3, however, are contributors to equipment
unreliability. SS1 fails if component 1 or 2 (C1 or C2) fails. SS3 fails if components 3, 4, and 5
(C3, C4, and C5) all fail. The block diagram model now looks like:

                        +--- C3 ---+
                        |          |
C1 --- C2 --- SS2 ------+--- C4 ---+
                        |          |
                        +--- C5 ---+

Further analysis reveals that C2 fails if parts 1 and 2 (P1 and P2) both fail. C4 fails if part 3 or
part 4 (P3 or P4) fails. The block diagram model now looks like:

                                +------ C3 ------+
      +--- P1 ---+              |                |
C1 ---+          +--- SS2 ------+--- P3 --- P4 --+
      +--- P2 ---+              |                |
                                +------ C5 ------+

Once construction of the model is complete, it is translated into a Boolean equation, which is then
used to quantify the equipment reliability. The references discuss Boolean algebra in detail, so it
will not be discussed here. The Boolean equation for the RBD is:
Equipment Failure = C1 + P1 * P2 + SS2 + [C3 * (P3 + P4) * C5]
Expanding and using the associative and distributive laws gives
Equipment Failure = C1 + P1 * P2 + SS2 + C3 * P3 * C5 + C3 * P4 * C5.
Each of the terms in this equation represents a way that the equipment can fail. For example, if
part 1 and part 2 (P1 and P2) fail, the equipment fails.
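The expansion can be checked mechanically. The sketch below encodes the unexpanded Boolean failure equation from the text and finds, by brute force, the smallest sets of failed blocks that fail the equipment; these should match the five terms of the expanded equation. The function names are illustrative, and brute-force enumeration is only practical for small models like this one.

```python
from itertools import combinations

BLOCKS = ["C1", "P1", "P2", "SS2", "C3", "P3", "P4", "C5"]

def equipment_fails(failed):
    """Boolean failure equation from the text; `failed` is a set of block names."""
    f = lambda name: name in failed
    return (
        f("C1")
        or (f("P1") and f("P2"))
        or f("SS2")
        or (f("C3") and (f("P3") or f("P4")) and f("C5"))
    )

def minimal_cut_sets(blocks, fails):
    """Smallest sets of failed blocks that fail the equipment (brute force)."""
    cuts = []
    # Visit candidates in order of increasing size so supersets of known cuts are skipped.
    for size in range(1, len(blocks) + 1):
        for combo in combinations(blocks, size):
            candidate = set(combo)
            if fails(candidate) and not any(cut <= candidate for cut in cuts):
                cuts.append(candidate)
    return cuts
```

Calling `minimal_cut_sets(BLOCKS, equipment_fails)` yields five sets: `{C1}`, `{SS2}`, `{P1, P2}`, `{C3, P3, C5}`, and `{C3, P4, C5}`, one per term of the expanded equation.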
Now that the reliability block diagram has been translated into an equation, it is time to quantify the
probability that the equipment fails as a function of its subsystems, components, and parts. Often
the term probability is used when what is really meant is frequency. Probabilities must lie
between 0 and 1, whereas a frequency can be any number greater than or equal to 0, depending on the
number of failures and the time scale used. For example, if a component fails twice per year, its
frequency is 2/yr, or about 0.17/mo.
Using the previous example, the probability of equipment failure can be written,
P(Equipment Failure) = P(C1 + P1 * P2 + SS2 + C3 * P3 * C5 + C3 * P4 * C5).
But, how does one deal with the right-hand side of the equation? Considering the basic laws of
probability and the small probability approximation, and assuming that the events are
independent, the example equation becomes:
P(Equipment Failure) = P(C1) + P(P1)*P(P2) + P(SS2) + P(C3)*P(P3)*P(C5)
+ P(C3)*P(P4)*P(C5).
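How good the small probability approximation is depends on the size of the block probabilities. The sketch below compares the summed form of the equation with an exact value obtained by enumerating every combination of failed and working blocks, assuming independence as the text does. The function names and the probabilities used in the usage note are hypothetical.

```python
from itertools import product

def approx_failure_probability(p):
    """Small-probability approximation: sum of the five term probabilities."""
    return (
        p["C1"]
        + p["P1"] * p["P2"]
        + p["SS2"]
        + p["C3"] * p["P3"] * p["C5"]
        + p["C3"] * p["P4"] * p["C5"]
    )

def exact_failure_probability(p):
    """Exact value for independent blocks, by enumerating all failure states."""
    names = list(p)
    total = 0.0
    for states in product([False, True], repeat=len(names)):
        failed = dict(zip(names, states))
        fails = (
            failed["C1"]
            or (failed["P1"] and failed["P2"])
            or failed["SS2"]
            or (failed["C3"] and (failed["P3"] or failed["P4"]) and failed["C5"])
        )
        if fails:
            # Probability of this exact state under independence.
            weight = 1.0
            for name in names:
                weight *= p[name] if failed[name] else 1.0 - p[name]
            total += weight
    return total
```

With every block probability set to 0.01, the two values agree to within about 0.0001. With every probability set to 0.5, the summed form exceeds 1, which shows why the approximation is restricted to small probabilities.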
References
Campbell, J., R. Iman, D. Longsine, B. Thompson, A Tutorial on Reliability Modeling Using
RAMP, Albuquerque, NM:Sandia National Laboratories, SETEC91-030, pp. 9 - 31.
Klinger, D., Y. Nakada, M. Menendez, AT&T Reliability Manual, New York:Van Nostrand
Reinhold, 1990, pp. 78-91.
MIL-STD-756B, Reliability Modeling and Prediction, Washington, DC:Department of Defense,
18 November 1981, pp. 1001-1 to 1001-11.
proportional to the mean squared error of Υ about its targeted value τ. Therefore the
fundamental measure of variability is the mean squared error and not the variance. The
concept of quadratic loss emphasizes the importance of continuously reducing performance
variation.
5. The final quality and cost of a manufactured product are determined to a large extent by the
engineering design of the product and its manufacturing process. The number of
manufacturing imperfections in a product, hence the manufacturing cost of a product, is
significantly affected by the product’s design and the design of the process used to produce
the product. Generally, a product’s field performance is affected by environmental variables
as well as human variations in operating the product, product deterioration, and
manufacturing imperfections. Note that these sources of variation are chronic problems.
Manufacturing imperfections are the deviations of the actual parameters of a manufactured
product from their nominal values. These imperfections are caused by inevitable
uncertainties in a manufacturing process and are responsible for performance variation
across different units of a product. Dealing with variations due to environmental factors and
product deterioration can be done only in the product’s concept and design phases.
The manufacturing costs and imperfections in a product are largely determined by the design
of the manufacturing process. Increasing process controls can reduce manufacturing
imperfections; however, process controls cost money. It is, therefore, necessary to reduce
both manufacturing imperfections and process controls. Once the process is under statistical
control, it can be improved. Without a stable process it is almost impossible to discover a
means of reducing variation due to chronic problems.
6. Performance variation can be reduced by exploiting the nonlinear effects between a
product’s and/or process’s parameters and the product’s desired performance
characteristics. Due to the importance of the product and process design, quality control
must begin in the concept phase of the life cycle and continue through all phases. There are
two types of quality control methods:
• Off-line, which are technical aids for quality and cost control in product and
process design. These are used to improve product quality and manufacturability,
and to reduce product development, manufacturing, and lifetime costs.
• On-line, which are technical aids for quality and cost control in manufacturing.
As with performance characteristics, all specifications of product and process parameters
should be stated in terms of ideal values and tolerances around these ideal values. The idea
is not to produce products whose parameters are barely inside the tolerance intervals. Such
products are likely to be of poor quality due to the interdependencies of the parameters. A
product performs best when all parameters of the product are at their ideal values. Further,
the knowledge of ideal values of product and process parameters encourages continuous
quality improvements.
Taguchi has introduced a three-step approach to assign nominal values and tolerances to
product and process parameters:
• System design
• Parameter design
• Tolerance design
System design involves applying scientific and engineering knowledge to produce a basic
functional prototype design. The prototype model defines the initial setting of the product or
process parameters. System design requires an understanding of both the customer’s needs
and the manufacturing environment. A product cannot satisfy the customer’s needs unless it
is designed to do so. Designing for manufacturability requires an understanding of the
manufacturing environment.
Parameter design involves identifying the settings of product or process parameters that
reduce the sensitivity of engineering designs to the sources of variation. Adjustment of the
mean value of a performance characteristic to its targeted value is usually a much easier
engineering problem than the reduction of performance variation. The utilization of
nonlinear effects of product or process parameters on the performance characteristics to
reduce the sensitivity of engineering designs to the sources of variation is the essence of
parameter design. Because parameter design reduces performance variation by reducing the
influence of the sources of variation rather than by controlling them, it is a very cost-
effective technique for improving engineering designs. It is economically advantageous for
a designer to provide designs that are tolerant to statistical variations.
Tolerance design involves determining tolerances around the nominal settings identified by
parameter design. Industry commonly assigns tolerances using convention rather than
science. Narrow tolerances increase manufacturing costs while wide tolerances increase
performance variation. Thus, tolerance design is a trade-off between society’s loss due to
performance variation and the increase in manufacturing costs.
7. Statistically planned experiments can be used to identify the settings of product (and
process) parameters that reduce performance variation. This is the portion of Taguchi’s
methodology that is subject to criticism. Engineers tend to like Taguchi’s statistical methods
because he has made a serious effort to develop methods that are easy for a non-statistical
expert to use. However, Taguchi’s experiments can be enormous and extremely inefficient.
Taguchi’s approach to the use of statistically planned experiments for parameter design
involves classification of the variables that affect the performance characteristics of a product
or process into two categories: design parameters and sources of noise. Design parameters are those product or
process parameters whose nominal settings can be chosen by the responsible engineer.
These nominal settings define the product or process design specifications and vice versa.
The sources of noise are all those variables that cause the performance characteristics to
deviate from their targeted values. The noise factors are those sources of noise that can be
systematically varied in a parameter design experiment. The key noise factors, those that
represent the major sources of noise affecting a product’s performance in the field and a
process’s performance in the manufacturing environment, should be identified and included in
the experiment.
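The quadratic loss and parameter design ideas above can be sketched in a few lines. This is a minimal illustration, not Taguchi's experimental procedure: the response model, the target value, and the candidate settings are all hypothetical, and every design setting is simply crossed with every noise level. The setting whose outputs have the lowest mean squared error about the target is selected.

```python
import statistics

TARGET = 10.0  # hypothetical targeted value (tau)

def response(setting, noise):
    """Hypothetical nonlinear performance model: output flattens as the setting grows."""
    return 12.0 * (1.0 - 2.0 ** -(setting + noise))

def mean_squared_error(values, target):
    """Average squared deviation from target; equals variance + (mean - target)^2."""
    return statistics.fmean((v - target) ** 2 for v in values)

def best_setting(settings, noise_levels, target=TARGET):
    """Cross every design setting with every noise level; keep the lowest-MSE setting."""
    scored = []
    for s in settings:
        outputs = [response(s, z) for z in noise_levels]
        scored.append((mean_squared_error(outputs, target), s))
    return min(scored)[1]
```

For settings `[1, 2, 3, 4]` and noise levels `[-0.5, 0.0, 0.5]`, `best_setting` returns 3. At that setting the curve is flat enough to damp the noise, yet the mean output is still close to the target, which is the nonlinear effect that parameter design exploits.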
References
Barker, T.B., "Quality Engineering By Design: Taguchi’s Philosophy," Quality Progress,
December 1986, pp. 32-42.
Gitlow, H., S. Gitlow, A. Oppenheim, R. Oppenheim, Tools and Methods for the Improvement of
Quality, Boston, MA:IRWIN, 1989, pp. 491-507.
Gunter, B., "A Perspective on the Taguchi Methods," Quality Progress, June 1987, pp. 44-52.
Kackar, R.N., "Taguchi’s Quality Philosophy: Analysis and Commentary," Quality Progress,
December 1986, pp. 21-29.
Miller, K.L., D. Woodruff, "A Design Master’s End Run Around Trial and Error," Business
Week/Quality, October 15, 1991, pg. 24.
Phadke, M.S., Quality Engineering Using Robust Design, Englewood Cliffs, NJ:Prentice Hall,
1989.
Port, O., J. Carey, "Quality: A Field With Roots That Go Back To The Farm," Business
Week/Quality, October 15, 1991, pg. 15.
Ross, P.J., Taguchi Techniques for Quality Engineering: Loss Function, Orthogonal Experiments,
Parameter and Tolerance Design, New York, NY:McGraw-Hill Book Company, 1988.
Taguchi, G., Introduction To Quality Engineering: Designing Quality into Products and Process,
White Plains, NY:Asian Productivity Organization, 1987.
References
Campbell, J., R. Iman, D. Longsine, B. Thompson, A Tutorial on Reliability Modeling Using
RAMP, SETEC91-030, Albuquerque, NM:Sandia National Laboratories.
Campbell, J., B. Thompson, D. Longsine, P. O’Connell, R. Iman, RAMP User’s Reference
Manual, SETEC91-030, Albuquerque, NM:Sandia National Laboratories.
Cost of Ownership Model, SEMATECH Technology Transfer # 91020473B-GEN,
Austin, TX:SEMATECH, January 24, 1991
http://www.sematech.org