
Lessons Learned

Abstract

Failures in avionics systems result from specific errors and mistakes, and studying their root causes yields lessons that help us build better systems. Important causes of failure include inadequate capture of requirements, insufficient margins, differences between test and flight conditions, polarity or sign reversals, incorrect application of components, and disregard of early warning signals. Systemic deficiencies such as overconfidence, complacency, unsustainable project schedules, flaws in the review process and inadequate documentation also contribute.

Introduction

The success of launch vehicles and satellites also depends on the performance of the avionics system. Worldwide, about 30% of the launch failures in the decade since 2000 were caused by malfunctions of avionics systems, including software (Table 1). In addition, during the final phase of the launch campaign of many missions, anomalies in the electronics systems have caused anxiety and launch delays. Many good lessons have been learned from design, development, qualification and acceptance testing, final integrated tests and flight experience. A few cases are discussed here.

Lesson 1: Space is unforgiving; thousands of good decisions can be undone by a single engineering flaw or workmanship error, and these errors and flaws can result in catastrophe

It is always the simple stuff that kills you. Launch vehicle failures are quite different from failures of other systems such as T & M equipment, computers or communication systems. In fact, launch failures are considered accidents rather than reliability failures caused by random component failures; they are mainly due to design errors or workmanship errors in fabrication. The history of ISRO launch failures shows that very simple mistakes were the reasons behind these accidents. Table 1 gives the summary of ISRO launch failures. Attending to minute details during design, and avoiding deviations and mistakes during fabrication, are the key factors that ensure mission success.

Failure reason                    Percentage
Propulsion                        54%
Guidance and Navigation           4%
Software and computing systems    21%
Electrical systems                8%
Structures                        0%
Ordnance                          0%
Pneumatics & Hydraulics           0%

Table 1 Worldwide scenario of launch failures

Lesson 2: Robust design is essential to mission success

We all know that the first and foremost factor determining the reliability of a system is good design. The major factors of a good avionics system design are the following:

- All requirements are specified clearly and unambiguously at the outset.
- A very clear specification document is made for the interfaces between the systems.
- The right electronic parts are selected and applied correctly, as recommended by the part manufacturer.
- Sufficient de-rating is provided for the parts.
- The system is designed for testability.
- The PCB layout is good, with good grounding, guarding of low-level signals against noise, timing design with adequate margins, and attention to signal integrity issues.
- The thermal design is good, with adequate margins.
- The mechanical packaging is good with respect to shock and vibration.

Lesson 3: System requirements to be adequately captured

One major concern during the system design phase is that the requirements of the system and its sub-systems are not adequately defined and detailed. Normally the scope and major specifications are properly defined; however, detailed requirements down to the minute level are not written initially, and many new requirements are often discovered during evaluation of the proto model. Giving special attention to the details of the requirements, and discussing them thoroughly, goes a long way towards a good system. This is all the more important for software-intensive systems and for systems with FPGA devices.

One classic example is the much-publicised Mars Polar Lander failure. The three legs of the lander, which were kept in the stowed position, had to be deployed at about 1500 metres above the ground, and the engine had to be shut down within 50 ms of the legs touching the Martian surface. Shock sensors mounted on the legs were used to sense touchdown and then shut down the engine. The system designers were aware that, when the stowed legs were deployed, the leg sensors would produce a momentary signal similar to the one produced on touchdown. However, the designers failed to implement the requirement that processing of the leg sensor data shall not begin until 12 metres above the ground. As a result, during the landing phase the engine was shut down when the leg sensors generated the false momentary signal at leg deployment, at about 1500 metres, and hence a 2-year-long mission was lost.

Lesson 4: Wrong application of avionics parts is a major concern

Incorrect usage, wrong application, inadequate de-rating and failure to follow the component manufacturer's guidelines are major causes of malfunctions in avionics systems. The problems caused by these reasons are actually far more numerous than the system failures caused by component quality and reliability issues. A few case studies are given below:

- Recently, in an avionics package, an overshoot of up to 22 V lasting about 2 to 3 ms was observed on the +15 V supply line when the package was switched ON. Analysis showed that the datasheet of the Interpoint DC-DC converter MHF+ 2815D specifies that the capacitance across its output shall be less than 10 µF, while about 80 µF had been placed across the supply line in the package. Such an overshoot can be catastrophic, as the absolute maximum voltage specification of many devices is 16 V or 18 V.
- Data corruption in EEPROM devices has been seen many times, due to incorrect data protection schemes employed in the circuits.
- In certain cases, improper termination of unused pins has resulted in intermittent malfunctioning of circuits.
- Inadequate or incorrect power-on reset circuits have caused problems in many packages.
- Special care is required in layout design for the timing capacitors used with certain devices, as these inputs may be more susceptible to noise.
- Recently, in one of the telemetry packages, excessive spikes were observed in the output data because the RS-485 opto-coupled transceiver device was not provided with the bypass capacitors recommended by its manufacturer.
- Disregarding the inverse current gain of transistors in a relay driver circuit resulted in sneak paths in the system, which were detected only after about 15 years of usage.
- Tinning and hand soldering of surface-mount CDR-type ceramic capacitors was a regular practice in many work centres, and this has resulted in package failures even at the launch pad. Chip capacitor manufacturers recommend avoiding hand soldering, mainly because the thermal gradient during soldering causes cracks in the layers of multilayer ceramic capacitors, which may develop into capacitor shorts.
- Not adhering to workmanship practices during the assembly of devices onto PCBs is a major problem that causes many latent failures.
- Many of the new devices are very fast, and extreme care is needed during layout. Line lengths should be kept very short, and proper terminations are to be provided for each interconnection to avoid signal integrity issues such as overshoot and ringing. It is now very common for qualification models to pass their tests and for flight models to develop problems later, especially during thermal tests; this is attributable to such signal integrity issues.

The important point is that design engineers should read and understand all the datasheets, application guidelines, precautions, layout guidelines and good workmanship practices for every device used in the design. Even copying an already proven circuit from an old design may cause problems in a new design unless all of these are properly understood and applied.
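The DC-DC converter case in this lesson suggests a simple design-review aid: list every datasheet limit next to the value the design actually applies, and flag the violations. The following Python sketch is only an illustration under stated assumptions. The `Limit` class, its `derating` field and the third (passing) entry are hypothetical; only the 80 µF, 10 µF, 22 V and 16 V figures come from the case study.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Limit:
    """One datasheet limit and the value the design actually applies."""
    name: str
    applied: float          # worst-case value seen in the design
    rated_max: float        # maximum allowed by the datasheet
    derating: float = 1.0   # fraction of the rating the project allows (assumed)

    @property
    def allowed(self) -> float:
        # Rating after the project's derating factor is applied.
        return self.rated_max * self.derating

    @property
    def ok(self) -> bool:
        return self.applied <= self.allowed


def violations(limits):
    """Return the limits that the design exceeds."""
    return [lim for lim in limits if not lim.ok]


checks = [
    # Figures from the DC-DC converter case study.
    Limit("converter output capacitance (uF)", applied=80.0, rated_max=10.0),
    Limit("load device supply voltage (V)", applied=22.0, rated_max=16.0),
    # Hypothetical healthy line, added only to show a passing check.
    Limit("steady-state supply voltage (V)", applied=15.0, rated_max=16.0),
]

for lim in violations(checks):
    print(f"VIOLATION: {lim.name}: {lim.applied} > {lim.allowed}")
```

A table of this kind is only as good as the limits entered into it, so it complements reading the datasheets rather than replacing it.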

Lesson 5: Use FPGA devices in critical applications with extreme care

FPGA devices have been in use in our launch vehicle projects for more than a decade. Initially, low-capacity devices (1 to 8 K gates) were used; currently, designs of 100 to 200 K gates are implemented in FPGA devices. This high gate count and the associated design complexity is one of the major problems in designing with FPGA devices. The design methodology followed for FPGA design matches neither the good software engineering practices followed in software design nor ASIC design methodologies. In both of those fields, mature engineering practices exist to avoid mistakes during design and to find and correct errors and bugs before the product is released. Today the FPGA design process is approached casually, because design errors are felt to be easy to correct compared with the cost of correction in an ASIC. Usage of FPGA designs for space applications is increasing overall, but the design methodology has not improved significantly and the risk involved is not fully appreciated. It is important to follow a good design methodology, as envisaged in the DO-254 standard for complex electronics, with documents such as a requirements specification and a design document, third-party independent verification and validation, and a very detailed review process.

Some of the very important lessons learned from the usage of FPGA designs for onboard applications are given below:

- The power-ON behaviour of FPGA devices has to be studied very carefully.
- Initialising all the flip-flops in an FPGA device is good, but may not be necessary, as it may increase the usage of routing resources. However, all the flip-flops driving the outputs shall be initialised during power ON.
- Using an external Schmitt trigger inverter or buffer is recommended to route the reset signal to the input of the FPGA.
- Asynchronous assertion and synchronous de-assertion of the reset make the FPGA design more robust and tolerant of oscillator start-up delays and other internal delays.
- Internal clock buffers are to be used to route the reset signal as well as the clock inside the FPGA.
- As the available clock buffers are finite, limit the number of clocks used inside the FPGA. Avoid derived clocks and gated clocks, as these force the designers to use normal lines rather than clock buffers to drive the clock inputs of flip-flops. This increases clock skew and reduces the testability of the circuits.
- Metastability issues may develop when signals are transferred across clock domains. Necessary precautions have to be taken to avoid this.
- Unused inputs such as test and mode pins have to be properly terminated.
- During synthesis, the SAFE option shall be enabled to ensure recovery from illegal states.
- The design documents, VHDL code, test benches, synthesis tool configuration, fuse maps, etc. have to be version and configuration controlled.

The important lesson is that the designers of FPGA designs should be experts in the area, with detailed knowledge of all the intricacies of the devices, the design methodologies, and the verification and validation practices, and above all should be experts in the design and verification tools. All designs have to be thoroughly reviewed by a team of experts in both FPGA design and system design.

Lesson 6: Experience with industrial-grade Plastic Encapsulated Microcircuit (PEM) devices is really good

It is true that more, and more important, lessons are learned from failures. However, we can also learn lessons from successes. There was a myth that only mil-grade or space-grade devices are suitable for launch vehicle applications. However, the bold decision to use industrial-grade PEM devices in launch vehicles, especially in less critical telemetry applications, has paid dividends. The major benefits of using industrial-grade devices are availability, lower cost and increased component functionality. The failure rates and reliability levels are comparable to those of mil-grade devices. The important lessons learned from the usage of PEMs are as follows:

- Select components from reputed manufacturers only.
- Employ the components only after proper evaluation and qualification tests to establish margins.
- Provide protection against moisture during storage.

Lesson 7: Take care of interfaces

One major problem in large systems, such as launch vehicles and satellites, lies in defining and implementing proper electrical and mechanical interfaces between systems. Maintaining proper interfaces between the different working teams is equally important and demanding. In fact, there was a major failure because one team provided data in FPS units and the other team interpreted the numbers in MKS units. It is important to have a well-defined interface document in which all the interface specifications are given without ambiguity. This document has to be reviewed and approved by all the teams working on the project.
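One way to make a unit mismatch of this kind hard to commit in software is to never pass a bare number across a team interface. The Python sketch below illustrates the idea under stated assumptions: the `Quantity` class, the `receive` function and the small unit table are hypothetical and stand in for whatever the project's interface document actually defines.

```python
from dataclasses import dataclass

# Conversion factors to SI units. The set of unit symbols is illustrative.
_TO_SI = {
    "m": ("m", 1.0),
    "ft": ("m", 0.3048),            # international foot
    "N": ("N", 1.0),
    "lbf": ("N", 4.4482216152605),  # pound-force
}


@dataclass(frozen=True)
class Quantity:
    """A number that is never exchanged without its unit."""
    value: float
    unit: str

    def to_si(self) -> "Quantity":
        if self.unit not in _TO_SI:
            raise ValueError(f"unit {self.unit!r} is not in the interface document")
        si_unit, factor = _TO_SI[self.unit]
        return Quantity(self.value * factor, si_unit)


def receive(q: Quantity, expected_si_unit: str) -> float:
    """Accept an interface value: convert it to SI and reject the wrong dimension."""
    si = q.to_si()
    if si.unit != expected_si_unit:
        raise ValueError(f"expected {expected_si_unit}, got {si.unit}")
    return si.value


# A value supplied in FPS units is converted instead of being misread.
thrust_n = receive(Quantity(100.0, "lbf"), "N")
print(round(thrust_n, 2))
```

The receiving side states the unit it expects, so a value in the wrong system of units is either converted or rejected, and is never silently reinterpreted.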

Lesson 8: Demonstrate design margins

Demonstrating the design margins is as important as robust design for ensuring mission success, and it is to be done in the early phase of the project itself. Design margins must cover the environment, interfaces, tolerances and uncertainties. It is important to determine the margins analytically before testing the system. A proper derating analysis of every component in the system, with regard to voltage, current, power and thermal characteristics, gives good assurance about electrical stresses. Vibration analysis of the chassis or packaging, and of the mounting details of the unit on the launch vehicle, has to be done to ascertain the available margins. After design analysis has ensured that sufficient margins exist, the actual systems have to be put to the required tests.

Currently, about 60 to 70% of the semiconductor devices used in onboard applications are industrial-grade parts.

Lesson 9: There is no alternative to testing

A product or subsystem is designed and developed to perform defined functions while meeting the requirements, specifications, interface definitions, environmental conditions, etc. Testing is the only process that validates that all of these are satisfactorily met. Many good lessons have been learnt with regard to testing:

- The system has to be tested for everything it should do and everything it should not do.
- The tests are to be representative of flight conditions: test as you fly, and fly as you test. A frequent cause of maiden flight failures is that the ground tests do not truly represent the flight conditions.
- It may not always be possible to meet the above guidelines. In such cases, a systematic analysis of the differences between the test and flight conditions has to be carried out to understand the limitations of the ground test, assess the risks involved and find ways to mitigate them.
- Use real flight systems instead of simulations wherever feasible, as simulations may miss some important points.
- Assumptions used in tests and simulations are to be fully understood.
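The systematic analysis of differences between test and flight conditions can start from something as simple as a side-by-side comparison of the two sets of conditions. The Python sketch below shows one possible form. The function name `condition_gaps` and all the condition names and values are hypothetical, chosen only for illustration.

```python
def condition_gaps(test, flight):
    """Compare test and flight conditions parameter by parameter.

    Returns a dict mapping each parameter that differs, or that is missing on
    one side, to a (test_value, flight_value) pair. None marks a missing entry.
    """
    gaps = {}
    for key in sorted(set(test) | set(flight)):
        t, f = test.get(key), flight.get(key)
        if t != f:
            gaps[key] = (t, f)
    return gaps


# Hypothetical conditions, for illustration only.
ground_test = {
    "power_source": "ground supply",
    "ambient_pressure": "1 atm",
    "sensor_inputs": "simulated",
}
flight = {
    "power_source": "onboard battery",
    "ambient_pressure": "vacuum",
    "sensor_inputs": "real sensors",
    "vibration": "launch levels",
}

# Each listed gap is an item for risk assessment and mitigation.
for name, (t, f) in condition_gaps(ground_test, flight).items():
    print(f"{name}: test={t!r} flight={f!r}")
```

The comparison only lists the gaps; judging the risk of each gap and deciding how to mitigate it remains an engineering review task.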

Lesson 10: Test-induced failures are also a concern

While it is essential to test every subsystem as described above, it is also extremely important to ensure that the testing is done very carefully. Many avionics systems have failed during testing due to operational errors, faulty test equipment, wrong test conditions, improper power-up sequences, etc. Some lessons learned are:

- All tests are to be done based on a test plan that has been reviewed and approved by experts from the design and test agencies.
- The test equipment or checkout systems should have the necessary safety interlocks to prevent accidental damage to the test article.
- Never perform a new test for the first time on the flight system. It is to be performed first on a ground model, and only then on the flight unit.
- Some common problems are: excess neutral-to-earth voltage, isolation degradation between onboard and ground systems, improper over-voltage or over-current settings, air conditioning not working, thermal chambers without humidity control, inadequate ESD control, and poor training, inexperience or fatigue of the operators.

It is to be noted that realising flight hardware that meets all the quality norms is not an easy task. Test-induced failure is a reality, and hence extreme care is to be exercised while testing the flight systems.
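Three of the rules in Lesson 10 (an approved test plan, safe protection settings, and ground model first) lend themselves to a software gate in the checkout system. The Python sketch below is one possible shape for such a gate, not a description of any existing checkout software. The function `authorize_test`, its parameters and the example values are all hypothetical.

```python
class InterlockError(Exception):
    """Raised when a test step is refused by the checkout software."""


def authorize_test(test_name, unit_kind, approved_plan, ground_model_history,
                   supply_voltage_limit, max_safe_voltage):
    """Return True if the test may run; raise InterlockError otherwise."""
    # Rule 1: only tests from the reviewed and approved plan may run.
    if test_name not in approved_plan:
        raise InterlockError(f"{test_name!r} is not in the approved test plan")
    # Rule 2: the over-voltage protection setting must not exceed the safe limit.
    if supply_voltage_limit > max_safe_voltage:
        raise InterlockError("over-voltage protection is set above the safe limit")
    # Rule 3: a test never runs for the first time on a flight unit.
    if unit_kind == "flight" and test_name not in ground_model_history:
        raise InterlockError(f"{test_name!r} has never been run on a ground model")
    return True


plan = {"power-on sequence"}
history = set()  # nothing has been run on the ground model yet

try:
    authorize_test("power-on sequence", "flight", plan, history,
                   supply_voltage_limit=16.0, max_safe_voltage=18.0)
except InterlockError as err:
    print("refused:", err)

history.add("power-on sequence")  # recorded after a successful ground-model run
print(authorize_test("power-on sequence", "flight", plan, history,
                     supply_voltage_limit=16.0, max_safe_voltage=18.0))
```

A gate like this addresses operational errors only; faulty test equipment and facility problems still need their own checks.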