
Original Article

Operation-oriented reliability and availability evaluation for onboard high-speed train control system with dynamic Bayesian network

Lei Jiang1,2, Yiliu Liu2, Xiaomin Wang1 and Mary Ann Lundteigen2

Proc IMechE Part O: J Risk and Reliability
1–15
© IMechE 2018
Article reuse guidelines: sagepub.com/journals-permissions
DOI: 10.1177/1748006X18800630
journals.sagepub.com/home/pio

Abstract
The reliability and availability of the onboard high-speed train control system are important to guarantee operational efficiency and railway safety. Failures occurring in the onboard system may result in serious accidents. In analyzing the effects of failure, it is important to consider the operation of the onboard system. This article presents a systemic approach to evaluate the reliability and availability of the onboard system based on a dynamic Bayesian network, taking into account dynamic failure behaviors, imperfect coverage factors, and temporal effects in the operational phase. Case studies are presented and compared for onboard systems with different redundancy strategies, that is, the triple modular redundancy, hot spare double dual, and cold spare double dual. Dynamic fault trees of the three kinds of onboard system are constructed and mapped into dynamic Bayesian networks. Forward and backward inferences are conducted not only to evaluate the reliability and availability but also to recognize the vulnerabilities of the onboard systems. A sensitivity analysis is carried out to evaluate the effects of failure rates subject to uncertainties. To improve the reliability and availability, more attention should be paid to the recovery mechanism. Finally, the proposed approach is validated with field data from one railway bureau in China, and some industrial impacts are provided.

Keywords
Operation oriented, reliability, availability, train control system, dynamic Bayesian network

Date received: 31 January 2018; accepted: 4 August 2018

1 Key Laboratory of Traffic Information Engineering and Control, Southwest Jiaotong University, Chengdu, China
2 Department of Mechanical and Industrial Engineering, Norwegian University of Science and Technology, Trondheim, Norway

Corresponding author:
Xiaomin Wang, Key Laboratory of Traffic Information Engineering and Control, Southwest Jiaotong University, Chengdu, Sichuan 610031, China.
Email: xmwang@swjtu.edu.cn

Introduction

The main purpose of high-speed railways is to provide safe, efficient, and punctual services to travelers between cities. The train movements are controlled by the train control system, which consists of onboard and trackside subsystems. Recent national and international standards and regulations have introduced new requirements to ensure a higher level of interoperability and flexibility in the exploitation of the railway infrastructure. These requirements have resulted in practical standards for train control system, such as the European Train Control System (ETCS) and Chinese Train Control System (CTCS). The standards allow integration with traditional systems installed in the railway infrastructure, but also specify a totally new train control philosophy and architecture. For this reason, both CTCS and ETCS are introduced with different levels, depending on the specific implementation selected. The CTCS has five levels, level 0–level 4, while ETCS has four levels, level 0–level 3.1,2 Transitions between these levels are performed according to defined functions and procedures. Some of these system levels, such as CTCS level 3 (CTCS-3) and ETCS level 2 (ETCS-2), use wireless communication (Global System for Mobile Communications–Railway (GSM-R) or Long Term Evolution–Railways (LTE-R)) to realize continuously bidirectional information transmission between the subsystems.

The CTCS/ETCS systems, both onboard and trackside, are safety critical. In this article, the attention is

directed mainly to the onboard system, as the higher levels of CTCS and ETCS will have many of the trackside functions transferred to the onboard systems. The onboard system has two main purposes: to avoid too high speed and to prevent violating a red (stop) signal. A failure occurring in the onboard system may have the potential to cause serious accidents.

Achieving a high degree of reliability and availability has obvious advantages in the safety and efficiency for the train control system. For this reason, both active and standby redundancy strategies are introduced for the onboard system. For example, the triple modular redundancy (TMR), hot standby double-dual (HDD), and cold standby double-dual (CDD) architectures are widely applied. One important feature in a redundancy system is the dependencies among components, which cause the dynamic failure behaviors. These dependencies have critical effects on the reliability and availability evaluation and should be modeled when considering the use of spares such as hot spare, warm spare, and cold spare.3 Another significant feature in a redundant system is the coverage factor, the probability that a single failure entails a complete system failure.4 The imperfect coverage factor can be used to model the inaccuracy of a recovery mechanism in a redundant system.

Models and approaches for reliability and availability analysis of high-speed train control system have been studied in the past decade. For example, Flammini et al.5 combined fault tree (FT) and Bayesian network (BN) to evaluate the reliability of ETCS-2. A multiformalism model is used to perform an availability analysis of an ETCS-2 at early phases of its development cycle.6 For the ETCS-2 onboard system, the components have adopted active redundancy strategy and the FT model is utilized to perform a reliability analysis. Di et al.7 utilized the reliability block diagram (RBD) and Markov model to evaluate the reliability, availability, and maintainability of CTCS-3 onboard system. Su and Che8 modeled CTCS-3 onboard system by mapping FT to BN and analyzed the multistate situations. For the CTCS-3 onboard system, the components have adopted standby redundancy strategy. Qiu et al.9,10 proposed a simulation approach to model the availability of ETCS-2 as a system of systems. Bernardi et al.11 presented a model-driven approach for the development of formal maintenance and reliability models for the availability evaluation of train control system. Some studies formulated non-Markovian models for the moving-block control (communications-based train control (CBTC) or ETCS level 3), and their works are more detailed in the communication availability.12–14

For the onboard system, the studies discussed above mainly focus on the reliability and availability analysis in the design phase. Since railways require a long-term and sustainable strategy, performance during operation and maintenance needs to be analyzed. Based on the corrective maintenance, the field data can be used for reliability and availability evaluation. Moreover, the uniqueness of the operation and maintenance of high-speed trains should be taken into account. For example, the onboard system cannot be repaired online in working time considering the safety and efficiency of railway network, and it is only repaired in the train inspection and repair station. The onboard system may degrade on account of the wireless communication timeout. In the operational phase, the dynamic failure behaviors and imperfect coverage factors of redundant onboard system also have significant effects on the reliability and availability. Such behaviors should be reflected in an effective reliability and availability analysis model.

BN and dynamic Bayesian network (DBN) have been proved effective in modeling the complexity of industrial system.15–17 Boudali and Dugan18 proposed a discrete-time BN reliability formalism for system reliability prediction and diagnosis. Cai et al.19 presented a real-time fault diagnosis methodology of complex systems using object-oriented BN. Neil and Marquez20 utilized a hybrid BN framework to evaluate the effects of corrective maintenance time, logistic delay, and preventive maintenance on the availability of renewable systems. The static BN represents a joint probability distribution at a fixed time slice, while DBN can extend the static BN at different time slices to model dynamics of random variables. For instance, Liang et al.21 applied a DBN to evaluate the reliability of warship and dealt with the dynamic failure and multistage missions. Cai et al.22 proposed a DBN model to analyze the reliability and availability of subsea blowout preventer. An availability-based engineering resilience metric was proposed from the perspective of reliability engineering using DBN model.23 Generally, BN and DBN models can be obtained by mapping the traditional reliability method, such as RBD, FT, and dynamic fault tree (DFT).24–26

In this article, we aim to adopt the DBN approach to evaluate the reliability and availability of onboard high-speed train control systems in the operation phase, taking into account dynamic failure behaviors, imperfect coverage factors, and multistates. Based on the corrective maintenance, the field data regarding onboard system are collected to calculate the failure rate of the components. In practice, different redundant strategies of the onboard system (TMR, HDD, and CDD) are adopted. The reliabilities and availabilities of those onboard systems are presented and compared to provide some industrial impacts. The DBN forward, backward, and sensitivity analysis will be conducted to support the maintenance decision making. Finally, the availability obtained is validated by the field data.

The remainder of this article is organized as follows: the next section introduces the structure and operation of an onboard system. Then, the corresponding DBN model and model validation are presented. Next, a case study of reliability and availability evaluation of CTCS-3 onboard system will be conducted. Finally, conclusions are given in the final section.

Figure 1. The function modules of the onboard system.

System description

Structure of an onboard system
High-speed train control depends on the bidirectional information transmission between the trackside and onboard subsystem. Through wireless communication system, information about train speed and location can be transmitted to the radio block center (RBC), and meanwhile, the movement authorities, generated by RBC, can be sent to the onboard system. The onboard system consists of the following components: vital computer (VC), radio transmission unit (RTU), balise transmission module (BTM), and driver machine interface (DMI). Generally, the onboard system can be divided into five functional modules: wireless communication processing module (WP), lineside data processing module (LP), train and driver interface (TD), kernel information processing module (KN), and power and bus (PB), as shown in Figure 1. The VC is the core component to realize KN function. KN processes the information from WP, LP, and TD and conducts safe control, protection, and supervision of the train. The other components of CDD, HDD, and TMR onboard system adopted cold standby, hot standby, and parallel structure, respectively.

Operation of an onboard system
Any failure in the five function modules can result in the failure of the onboard system. According to the fail-safe concept in railway, the train should be stopped and then the driver needs to restart the onboard system. The working timetable for onboard system in 1 day is shown in Figure 2. Availability means the onboard system is operative in the working hours, and unavailability comes from the recovery time when the primary system suffers a failure. The average working time for a high-speed train is 18 h in 1 day. For the rest of the time, maintenance is required for the train in an inspection and repair station. The onboard system cannot be repaired online, namely, it is kept operating during the working hours only by the redundancy strategy. Meanwhile, the onboard system can be fully repaired during the maintenance hours. The recovery time is influenced by some dynamic factors, such as the recovery mechanism and the operational condition.

Field data can be collected, containing failure mode/effect/cause and recovery time. These field data are very useful for the calculation of failure rate and the validation of results.

DBN-based reliability and availability modeling

Introduction on BN and DBN
A BN consists of two parts, corresponding to a directed acyclic graph (DAG) and a joint probability distribution.27 Precisely, a BN is a triplet $\langle (V, E), P \rangle$, where (V, E) are the variables (nodes) and edges (arcs) of a DAG and P is the probability distribution for every V.28 The DAG realizes the qualitative analysis of dependence among V. Meanwhile, the conditional probabilistic table (CPT) over V accomplishes the quantitative analysis. For discrete random variables $V = \{X_1, X_2, \ldots, X_N\}$, the joint probability distribution can be given by

$$P(V) = P(X_1, X_2, \ldots, X_N) = \prod_{X_i \in V} P(X_i \mid Pa(X_i)) \tag{1}$$

Figure 2. The working timetable of the onboard system.

where N is the number of random variables in the graph and $Pa(X_i)$ is the parent of $X_i$.

DBNs can model the dynamic behavior of random variables by extending the static BN with the temporal dependencies at different time slices. Intra-slice arcs at the same time slice and temporal arcs in different time slices are two types of conditional dependencies between nodes. Generally, two-time-slice temporal Bayesian networks (2TBN) are considered in modeling the temporal evolution, assuming that temporal arcs between the consecutive times t − 1 and t satisfy the first-order Markov process. The 2TBN defines $P(X_t \mid X_{t-1})$ by means of a DAG as follows29

$$P(X_t \mid X_{t-1}) = \prod_{i=1}^{N} P(X_t^i \mid Pa(X_t^i)) \tag{2}$$

where $X_t^i$ is the ith node in time slice t and $Pa(X_t^i)$ indicates the parent of $X_t^i$, which can only stand in slices t − 1 and t.

Note that the nodes in the first time slice do not associate any parameters, while each node from the second time slice has an associated CPT.

Thereby, the joint distribution can be defined by "unrolling" the 2TBN with T time slices as follows

$$P(X_{1:T}) = \prod_{t=1}^{T} \prod_{i=1}^{N} P(X_t^i \mid Pa(X_t^i)) \tag{3}$$

DBN structure modeling
The structure modeling presents the mapping rules from DFT into DBN. Here, we briefly model the OR gate, AND gate, 2oo3 voting gate, and spare gate as they will be used later in the case study. For those static logic gates (the OR gate, AND gate, and 2oo3 voting gate), mapping rules are described in the previous study.24 As indicated in Figure 3(b), the relationship between C1, C2, and A is linked by intra-slice arcs. Each node involves two states denoted by Working (W) or Failed (F). The significant feature, coverage factor c, in a redundant system is taken into account to model the inaccuracy of a recovery mechanism. The coverage factor c is presented as c = probability {system recovers | fault occurs}.30 Then, DBN extends the BN by incorporating temporal dependencies at different time slices. For instance, the node C1(t) is extended to C1(t + Δt) with a temporal arc. Similarly, the DBNs of OR gate and 2oo3 voting gate are shown in Figure 3(a) and (c), respectively.

Dynamic logic gates are designed to express the time sequence and failure behaviors of the systems. The priority AND (PAND) gate, the functional dependency (FDEP) gate, and the spare gate are commonly used in DFT modeling. The mapping rules of spare gate are described based on the previous study.25,26 Generally, a spare gate consists of two types of elements: the primary modules and one or multiple redundant modules. For example, in Figure 3(d), the DBN structure is similar to that in Figure 3(a) and (b), but the former one demonstrates that component S at t + Δt time slice is dependent on both P at t time slice and S at t time slice. Assume that the primary P is active at the t time slice with failure rate λ, and the failure rate of one spare S is λ in active state or αλ at inactive state, where α is the dormancy factor. Hot and cold spares can be modeled by setting α equal to 1 and 0, respectively. Whenever the P fails, a replacement is initiated and the S will be powered up to keep the system functional.

Determination of DBN parameters
DBN parameters are based on the prior probabilities of root nodes and the CPT of intermediate nodes and leaf nodes. The node C1 in Figure 3(b) is demonstrated as an example. Assuming C1 follows the exponential distribution with failure rate λ, it can be obtained

$$P\{C1(t + \Delta t) = F \mid C1(t) = W\} = 1 - e^{-\lambda t}$$

Considering a repair action, the availability of C1 can also be obtained. If the repair rate of C1 is μ, it can be obtained

Figure 3. Mapping the (a) OR gate, (b) AND gate, (c) 2oo3 voting gate, and (d) spare gate to DBNs.

$$P\{C1(t + \Delta t) = W \mid C1(t) = F\} = 1 - e^{-\mu t}$$

The CPT for C1 at t + Δt time slice given C1 at t time slice is provided in Tables 1 and 2.

Table 1. The CPT of C1 at t + Δt time slice without repair.

C1 at t    C1 at t + Δt
           W                      F
W          $e^{-\lambda t}$       $1 - e^{-\lambda t}$
F          0                      1

Table 2. The CPT of C1 at t + Δt time slice with repair.

C1 at t    C1 at t + Δt
           W                      F
W          $e^{-\lambda t}$       $1 - e^{-\lambda t}$
F          $1 - e^{-\mu t}$       $e^{-\mu t}$

For the spare gate shown in Figure 3(d), the CPTs of node S without and with repair are given in Tables 3 and 4, where α is the dormancy factor.
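
As a minimal sketch of how the CPT entries for a single component and for a spare could be computed, the Python snippet below builds the one-step transition tables. The function names, the argument `dt` (the time-slice length, assumed here to be what the exponent t in the expressions above denotes), and the numeric values in the usage lines are illustrative assumptions, not taken from the article. For the spare with a working primary, the failure entry is taken as the complement of $e^{-\alpha\lambda t}$ so that every row sums to 1.

```python
import math


def component_cpt(lam, dt, mu=None):
    """Transition CPT of a two-state component over one time slice.

    Outer keys are the state at t, inner keys the state at t + dt.
    lam is the failure rate; mu is the repair rate (None means no repair).
    """
    stay_w = math.exp(-lam * dt)
    # Without repair, a failed component stays failed with probability 1.
    stay_f = 1.0 if mu is None else math.exp(-mu * dt)
    return {
        "W": {"W": stay_w, "F": 1.0 - stay_w},
        "F": {"W": 1.0 - stay_f, "F": stay_f},
    }


def spare_cpt(lam, alpha, dt, mu=None):
    """Transition CPT of spare S, keyed by (primary state at t, S state at t)."""
    dormant = math.exp(-alpha * lam * dt)  # primary works, spare is dormant
    active = math.exp(-lam * dt)           # primary failed, spare is active
    stay_f = 1.0 if mu is None else math.exp(-mu * dt)
    failed_row = {"W": 1.0 - stay_f, "F": stay_f}
    return {
        ("W", "W"): {"W": dormant, "F": 1.0 - dormant},
        ("W", "F"): dict(failed_row),
        ("F", "W"): {"W": active, "F": 1.0 - active},
        ("F", "F"): dict(failed_row),
    }


# Illustrative values only: lam in 1/h, dt = 126 h.
cold = spare_cpt(lam=6.0e-6, alpha=0.0, dt=126.0)  # cold spare: alpha = 0
hot = spare_cpt(lam=6.0e-6, alpha=1.0, dt=126.0)   # hot spare: alpha = 1
```

With α = 0 the dormant spare cannot fail while the primary works, and with α = 1 the spare rows no longer depend on the primary state, which matches the cold and hot spare cases described above.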

Table 3. The CPT of S at t + Δt time slice without repair.

P at t    S at t    S at t + Δt
                    W                          F
W         W         $e^{-\alpha\lambda t}$     $1 - e^{-\alpha\lambda t}$
W         F         0                          1
F         W         $e^{-\lambda t}$           $1 - e^{-\lambda t}$
F         F         0                          1

Table 4. The CPT of S at t + Δt time slice with repair.

P at t    S at t    S at t + Δt
                    W                          F
W         W         $e^{-\alpha\lambda t}$     $1 - e^{-\alpha\lambda t}$
W         F         $1 - e^{-\mu t}$           $e^{-\mu t}$
F         W         $e^{-\lambda t}$           $1 - e^{-\lambda t}$
F         F         $1 - e^{-\mu t}$           $e^{-\mu t}$

Evaluation and validation
Through the forward inference, the reliability and availability of different redundancy strategies can be obtained. Meanwhile, the posterior probabilities of each node are generated by the backward inference after evidence is introduced. A sensitivity analysis is carried out with the assumption that the prior probabilities of five function modules are subject to the uncertainty of 10%. Moreover, the effects of coverage factor on reliability and availability will be calculated.

The validation of the proposed approach is a significant procedure to prove that it is reasonable for the reliability and availability evaluation of the actual system. In this article, the validations are accomplished in two ways: a partial validation of the model usability should satisfy three axioms proposed by Jones et al.,31 and the availabilities obtained from the proposed approach are validated by analyzing the field data of one railway bureau in China.

Case study
We choose a CTCS-3 onboard system as an example to illustrate the proposed approach. Before the reliability and availability analysis, three assumptions should be made:

1. All components are mutually independent;
2. Failures of all components in the system follow the exponential distributions with a constant failure rate, because they are mainly electronic products;
3. All components are considered "as good as new" after repairs.

CTCS-3 onboard system
Figure 4 shows the hierarchical architecture of the CTCS-3 onboard system.32,33 According to the system requirements specification, the mean time between failure (MTBF) shall be not less than $10^5$ h and the availability of CTCS-3 onboard system shall be not less than 0.9999.32 Since the transition between CTCS-3 and CTCS-2 is possible, a CTCS-3 onboard system also includes the components of CTCS-2. The five functional modules are realized by the corresponding

Figure 4. The architecture of the CTCS-3 onboard system.



Table 5. The partial specification of field data.

Failure time   Failure mode                      Failure effect       Recovery time (min)   Failure cause
21 May 2015    Invalid BTM port                  Stopped the train    11                    BTM1 failure
18 May 2015    Wireless communication timeout    Degraded to CTCS-2   10                    MT1 failure
18 May 2015    Error on C3CU                     Stopped the train    10                    C3CU failure
17 May 2015    Error on train interface          Stopped the train    18                    VDX2 failure

components shown in the physical layer. It should be noted that

1. WP function is realized by mobile terminal (MT) and RTU. RTU is used for processing the messages, received or transmitted by MT.
2. LP function is conducted by BTM and track circuit reader (TCR). BTM is designed for processing the telegrams, received by BTM antenna. TCR receives the messages from TCR antenna and transmits the messages to VC.
3. TD function is conducted by DMI and train interface unit (TIU). TIU consists of vital digital input/output unit (VDX) and relay unit (RLU).
4. KN function accomplishes safe control, protection, and supervision of the train. CTCS-3 vital computer (C3VC) and CTCS-2 vital computer (C2VC) are the core computing system for preventing the train from overspeed or overrunning in CTCS-3 and CTCS-2, respectively. Speed and distance processing (SDP) unit measures the speed and distance.
5. PB function is conducted by POWER and BUS.

Field data have been collected from one railway bureau in China. The partial specification of field data is shown in Table 5. It should be mentioned that the field data are related to specific railway lines and operational environment. The failure number of components in a given runtime is obtained from the field data. The failure rate can therefore be expressed as

$$\lambda = \frac{N}{N_m T} \tag{4}$$

where N is the number of component failures, $N_m$ is the total number of the onboard systems, and T is the runtime.

The DFT of CTCS-3 onboard system
The DFT is constructed based on the system structure and the expert experience. Failure of the CTCS-3 onboard system is considered as the top event. There are five intermediate events, that is, lineside data processing failure, wireless communication failure, PB failure, kernel information processing failure, and the TD failure. The relationship between events is shown in Figure 5. The spare gate is either cold or hot spare gate. Specifically, VDX1 and VDX2 are connected by an OR gate because if there is any fault in the VDX1 or VDX2, the train will be stopped unconditionally.

Mapping the DFT to DBN
In this article, the DBNs for CDD, TMR, and HDD onboard systems are analyzed using GeNIe 2.1 academic software developed at the University of Pittsburgh.34 To establish the DBNs, the mapping rules from DFTs to DBNs are utilized. The DBN of HDD or CDD onboard system with two time slices is shown in Figure 6. In contrast, the DBN of TMR onboard system should replace the C3VC and C2VC with 2oo3 voting architecture. The top event is mapped into the corresponding child node (CTCS-3 onboard), and the basic events are translated into the corresponding parent nodes in DBNs. The failure rates of the components are based on both field data and expert experience, as shown in Table 6.

In consideration that the degradation from CTCS-3 to CTCS-2 level is possible, the child node (CTCS-3 onboard) has three states corresponding to Working (W), Degraded (D), and Failed (F), whereas the other nodes only have two states (W and F). It should be mentioned that the degradation of the node (CTCS-3 onboard) occurs only if the wireless communication failure happens, and the CPT of this node is shown in Table 7.

In the case study, Δt is set to be 1 week, that is, 126 h. Since failures follow the exponential distribution, the initial states are in the perfect functioning in 0th week, and the failure probabilities of those nodes are assigned to 0. According to the DBN parameter modeling, the CPT of the nodes in other time slices can be calculated.

Since the component is considered "as good as new" after repair action, the average availability is calculated as MUT/(MUT + MDT), where the MUT is the average working time and the MDT is the average down time. Considering the operational situation of onboard system shown in Figure 2, the MDT mainly depends on the mean waiting time of the component in 18 h after it failed. Therefore, the mean waiting time for both primary and spare component in hot spare and

Figure 5. The DFT of CDD, HDD, and TMR onboard system.

Table 6. The failure rates and prior and posterior probabilities of components.

No.  Component  Failure rate  CDD                   HDD                   TMR
                λ (h^-1)      Prior     Posterior   Prior     Posterior   Prior     Posterior
1    C3CPU1     7.45E-06      8.56E-02  1.54E-01    8.56E-02  1.61E-01    8.56E-02  1.82E-01
2    C3CPU2                   7.71E-03  6.14E-02    8.56E-02  1.61E-01    –         –
3    C2CPU1     6.00E-06      7.02E-02  1.21E-01    7.02E-02  1.25E-01    7.02E-02  1.39E-01
4    C2CPU2                   5.12E-03  4.08E-02    7.02E-02  1.25E-01    –         –
5    SDP1       5.00E-07      6.27E-03  8.63E-03    6.27E-03  8.09E-03    6.27E-03  8.30E-03
6    SDP2                     2.93E-05  2.33E-04    6.27E-03  8.09E-03    6.27E-03  8.30E-03
7    SDU1       2.50E-07      3.14E-03  4.31E-03    3.14E-03  4.04E-03    3.14E-03  4.15E-03
8    SDU2                     1.46E-05  1.17E-04    3.14E-03  4.04E-03    3.14E-03  4.15E-03
9    RTU1       1.80E-06      2.24E-02  2.24E-02    2.24E-02  2.24E-02    2.24E-02  2.24E-02
10   RTU2                     2.51E-04  2.51E-04    2.24E-02  2.24E-02    2.51E-04  2.51E-04
11   MT1        6.00E-06      7.28E-02  7.28E-02    7.28E-02  7.28E-02    7.28E-02  7.28E-02
12   MT2                      7.28E-02  7.28E-02    7.28E-02  7.28E-02    7.28E-02  7.28E-02
13   BTM1       2.07E-06      2.57E-02  3.66E-02    2.57E-02  3.49E-02    2.57E-02  3.60E-02
14   BTM2                     3.31E-04  2.64E-03    2.57E-02  3.49E-02    2.57E-02  3.60E-02
15   TCR1       2.30E-06      2.86E-02  4.09E-02    2.86E-02  3.91E-02    2.86E-02  4.03E-02
16   TCR2                     4.08E-04  3.25E-03    2.86E-02  3.91E-02    2.86E-02  4.03E-02
17   POWER1     6.00E-06      7.28E-02  1.28E-01    7.28E-02  1.12E-01    7.28E-02  1.17E-01
18   POWER2                   7.28E-02  1.28E-01    7.28E-02  1.12E-01    7.28E-02  1.17E-01
19   Bus1       4.00E-06      4.92E-02  7.32E-02    4.92E-02  7.13E-02    4.92E-02  7.39E-02
20   Bus2                     1.22E-03  9.68E-03    4.92E-02  7.13E-02    4.92E-02  7.39E-02
21   DMI1       5.00E-06      6.11E-02  9.29E-02    6.11E-02  9.14E-02    6.11E-02  9.49E-02
22   DMI2                     1.88E-03  1.50E-02    6.11E-02  9.14E-02    6.11E-02  9.49E-02
23   VDX1       2.00E-06      2.49E-02  1.98E-01    2.49E-02  1.49E-01    2.49E-02  1.64E-01
24   VDX2                     2.49E-02  1.98E-01    2.49E-02  1.49E-01    2.49E-02  1.64E-01
25   RLU        1.50E-06      1.87E-02  1.49E-01    1.87E-02  1.12E-01    1.87E-02  1.23E-01
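
The link between the failure rates and the prior probabilities in Table 6 can be illustrated with a short Python sketch, under the assumption that a non-repaired component with constant failure rate λ has failure probability 1 − e^{−λT} after T operating hours, with 126 h per week as in the case study. The function names and the counts passed to `failure_rate` are hypothetical and serve only to show the form of equation (4).

```python
import math

HOURS_PER_WEEK = 126.0  # 18 working hours per day, 7 days per week


def failure_rate(n_failures, n_systems, runtime_h):
    """Equation (4): lambda = N / (N_m * T)."""
    return n_failures / (n_systems * runtime_h)


def prior_failure_probability(lam, weeks):
    """Failure probability of a non-repaired component after `weeks` weeks."""
    return 1.0 - math.exp(-lam * weeks * HOURS_PER_WEEK)


# Failure rates (per hour) taken from Table 6.
rates = {"MT1": 6.00e-6, "RTU1": 1.80e-6, "BTM1": 2.07e-6}
priors = {name: prior_failure_probability(lam, 100) for name, lam in rates.items()}

# Hypothetical counts, for illustration only: 5 failures, 63 systems, 9000 h each.
lam_example = failure_rate(5, 63, 9000.0)
```

For these three components the computed values at the 100th week come out close to the priors listed in Table 6 (7.28E-02, 2.24E-02, and 2.57E-02). Other rows, such as the redundant computing units, need not follow this simple expression.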

Figure 6. The DBN of HDD or CDD onboard system.

Table 7. The CPT of the node (CTCS-3 onboard).

LP      WP      PB      KN      TD      P(system | LP, WP, PB, KN, TD)
W  F    W  F    W  F    W  F    W  F    W  D  F
1  0    1  0    1  0    1  0    1  0    1  0  0
1  0    0  1    1  0    1  0    1  0    0  1  0
0  1    1  0    1  0    1  0    1  0    0  0  1
0  1    0  1    0  1    0  1    0  1    0  0  1
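
Table 7 lists only some of the parent configurations. One possible way to fill in the full deterministic CPT is sketched below in Python. The rule that a failure of LP, PB, KN, or TD dominates a simultaneous WP failure is an assumption made for this sketch (it is consistent with the rows shown, but the article does not list every combination), and the names `system_state` and `build_cpt` are illustrative.

```python
from itertools import product

MODULES = ("LP", "WP", "PB", "KN", "TD")


def system_state(states):
    """Map module states ('W' or 'F') to the system state 'W', 'D' or 'F'."""
    # Assumption: a failure of LP, PB, KN or TD always fails the system.
    if any(states[m] == "F" for m in MODULES if m != "WP"):
        return "F"
    # Only a wireless communication (WP) failure degrades CTCS-3 to CTCS-2.
    if states["WP"] == "F":
        return "D"
    return "W"


def build_cpt():
    """Enumerate all 2**5 parent configurations into a deterministic CPT."""
    cpt = {}
    for combo in product("WF", repeat=len(MODULES)):
        state = system_state(dict(zip(MODULES, combo)))
        cpt[combo] = {s: float(s == state) for s in "WDF"}
    return cpt


cpt = build_cpt()
```

Each key of `cpt` is a tuple of module states in the order LP, WP, PB, KN, TD, so the four rows of Table 7 correspond to the keys with all modules working, only WP failed, only LP failed, and all modules failed.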

Figure 7. The (a) reliability, (b) degraded state, and (c) availability evaluation of three onboard systems.

2oo3 redundancy system is 9 h, whereas the values are 9 and 4.5 h in cold spare redundancy system.

Results and discussions
Reliability and availability evaluation. The reliabilities and availabilities within 100 weeks are evaluated by the forward inference, as shown in Figure 7. The coverage factors for the redundant system are assigned to 0.95. As indicated in Figure 7(a), the reliabilities decrease with time. The reliability of CDD system is higher than that of HDD system, whereas the TMR system is between them. Moreover, the reliabilities of CDD, TMR, and HDD system at 100th week are 0.81, 0.785, and 0.771, respectively. As indicated in Figure 7(b), the occurrence probabilities of degraded state for the CDD, TMR, and HDD system increase to 0.065, 0.064, and 0.063 at 100th week, respectively.

As shown in Figure 7(c), the availability of CDD, TMR, and HDD system is 0.999923, 0.999909, and 0.999902, respectively. The availabilities of the three onboard systems approach steady values in 10 weeks. The high availabilities indicate that the onboard systems can recover rapidly when the primary system suffers a failure. The availabilities accord with the design specification that they should be greater than 0.9999.

By setting the failure probability of CTCS-3 onboard node to 1, the posterior probabilities of the components are obtained by backward inference. The prior probability and posterior probability for three onboard systems at 100th week are listed in the 4th–9th columns of Table 6. The difference between posterior probability and prior probability is shown in Figure 8. It can be seen that the values of VDX, RLU, C3CPU, C2CPU, and POWER are much higher than those of other components, meaning that the five components should be given more attention to improve the reliability of CTCS-3 onboard system. In addition, the wireless communication failure only leads the system to degraded state so that the values of MT and RTU are 0.

The failure probability for five function modules in different time slices. As indicated in Figure 9, the failure probabilities

Figure 8. The difference between posterior probability and prior probability.

Figure 9. The failure probability for five function modules in different time slices.

for five function modules increase with time. WP, KN, and TD have higher failure probabilities than LP and PB. In addition, LP has a negligible effect on system failure. Specifically, WP is mainly responsible for the degraded state of CTCS-3 onboard system, meaning that a higher failure probability of WP lowers the efficiency of the system. It can be concluded that KN and TD are more critical than LP and PB as their failure probabilities are much higher. It should be noted that the failure probability of KN is higher than TD at the 100th week for HDD system, whereas the failure probability is lower in CDD and TMR system.

Sensitivity analysis. The variations of failure probabilities for each onboard system at 100th week are calculated with the assumption that the failure rates of each function module change 10%. Effects of changes in each function part are shown in Figure 10. It can be concluded that the order of effects on the systems failure probabilities is as follows: TD > WP > KN > PB > LP and KN > WP > TD > PB > LP for CDD and HDD system, respectively. However, the order is KN > TD > WP > PB > LP for TMR system. Moreover, it shows that the KN has the large fluctuation for the three kinds of CTCS-3 onboard system, whereas others have little fluctuation.

Figure 10. The effects of changes in each function module.

Effects of coverage factor on reliability and availability. From the above analysis, the value of coverage factor is 0.95. To analyze the effects of coverage factor, the values are assigned to 0.9, 0.925, 0.95, 0.975, and 1 to calculate the reliability and availability of onboard system at 100th week, as shown in Figure 11. It can be seen that the reliability and availability increase with the increasing coverage factor, meaning that the recovery mechanism is significant for the three onboard systems. Furthermore, the effect on HDD system is larger than that on CDD system, whereas the TMR system is between them. The difference in reliability and availability shrinks with the increasing coverage factor. Specifically, the availabilities of the three onboard systems reach the same value, that is, 0.999949. To achieve high reliability and availability, more attention should be paid to the recovery mechanism.

Validation of the model. To validate the model usability, the DBN of HDD system is taken as an example. When the parent node "BTM1" is set to 50% from 0%, the reliability of system decreases to 0.743 from 0.771. When, in addition to this change, the parent node "BTM2" is set to 50%, the reliability decreases to 0.55. When the parent nodes "TCR1" and "TCR2" are additionally set to 50%, the reliability decreases to 0.392. In addition, the sensitivity analysis is also a validation of the model usability. Therefore, the exercise of increasing the influencing nodes gives a partial validation to the proposed model.

The availabilities obtained can be validated by the field data from a railway bureau in China. Through the

Figure 11. The effects of coverage factor to reliability and availability.


Jiang et al. 13

Figure 12. Availability for the CDD onboard system.

description of system, the corrective maintenance data and they do not fail before operation. Therefore, the
have been analyzed between May 2015 and November cost of CDD is lower than HDD and TMR onboard
2016. The CDD onboard system has been adopted for system. The choice of CDD onboard system is a rea-
electrical multiple unit (EMU) and the number of sonable trade-off of reliability and cost, especially when
EMU is 63. According to Table 5, the total recovery the number of EMU is very large. The CDD onboard
time is 2683 min and the number of system failure is system requires a manual switch to activate the redun-
155. The starting time for the availability calculation is dant component. One failure can cause the train to stop
1 May 2015. Then, the availabilities can be obtained by and the driver needs to restart the system, which could
A = MUT/(MUT + MDT). For example, the first affect the efficiency of train operation. However, this
failure occurs on May 17. The total time is feature makes the CDD onboard system satisfy the
1,088,640 min and the MDT is 18 min. Therefore, the fail-safe concept.
availability is 0.999983. Similarly, all the availabilities In this article, the simplifying assumption of con-
are calculated and the average availability is 0.999934 stant failure rates (exponential distribution for compo-
during the operational time, as shown in Figure 12. It nents) is made. However, in the operational phase, the
can be concluded that most of the availabilities are failure rate may increase with time, due to the aging
between 0.999949 (coverage factor c = 1) and 0.999923 and degradation of the components. Future works
(coverage factor c = 0.95). When the coverage factor c should try to develop the models in BN with handling
is assigned to 0.975, the availability obtained by the of nonconstant failure rates.
proposed approach is 0.999935, which is approximately Another important issue is that this article mainly
equal to the average availability based on the field data. focuses on the onboard system, ignoring the relevance
The practical exercise gathering the field data performs of trackside and communication systems. Therefore,
a partial validation for the model. future works should focus on the reliability and avail-
ability evaluation of the whole high-speed train control
system in order to handle the interdependencies and
Discussions. The CDD, HDD, and TMR onboard high- dynamic problems between onboard and trackside
speed train control systems are adopted in the opera- system.
tional phase, due to the tiny discrepancy of reliability
and availability within 100 weeks. Then, the choice of
redundancy strategies becomes an additional issue. In Conclusion
the HDD and TMR onboard systems, all the redun- The reliability and availability of onboard train control
dant components are affected by operational stresses, system are significant for the performance of high-
and the cost of high-reliability components is a real speed railway network. According to the architecture
problem. For CDD onboard system, the redundant and operational situation of onboard system, the DFTs
components are shielded from the operational stresses and DBNs are constructed. The DBN forward,
backward, and sensitivity analyses are conducted to support maintenance decision making. The DBN-based approach provides a powerful reliability and availability evaluation for the onboard system. The main achievements can be summarized as follows:

1. The reliability and availability of the CDD, TMR, and HDD onboard systems are evaluated and compared in the operational phase.
2. Based on the system architecture and operation of the onboard system, problems such as dynamic failure behavior and imperfect coverage factors have been solved.
3. The availability results are validated by the field data from one railway bureau.
4. To improve the reliability and availability of the onboard system, more attention should be paid to the VDX, RLU, C3CPU, C2CPU, and POWER. Based on sensitivity analysis, the effects of failure rate have been researched.
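The field-data check used in the validation relies on the relation A = MUT/(MUT + MDT). As a minimal sketch of that arithmetic, the Python snippet below reproduces the worked figure quoted for the first failure. It assumes that the quoted total time of 1,088,640 min equals MUT + MDT; the function and parameter names are illustrative and are not taken from the article.

```python
def availability(total_time_min: float, down_time_min: float) -> float:
    """Return A = MUT / (MUT + MDT).

    Assumption: total_time_min is the whole observation window,
    i.e. MUT + MDT, so MUT is obtained by subtraction.
    """
    if total_time_min <= 0:
        raise ValueError("total_time_min must be positive")
    if not 0 <= down_time_min <= total_time_min:
        raise ValueError("down_time_min must lie in [0, total_time_min]")
    up_time_min = total_time_min - down_time_min  # MUT
    return up_time_min / total_time_min


# Values quoted in the text for the first failure (17 May):
# total time 1,088,640 min, MDT 18 min.
first_failure = availability(total_time_min=1_088_640, down_time_min=18)
print(f"{first_failure:.6f}")  # prints 0.999983
```

Under the same assumption, the average field availability of 0.999934 and the model value of 0.999935 at c = 0.975 differ by about one part in a million, which is the sense in which the text calls them approximately equal.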
Declaration of conflicting interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Funding
The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This research was supported by the Open Project Fund for the Center of National Railway Intelligent Transportation System Engineering and Technology (Grant No. RITS2018KF02), the Major Project for Science and Technology of Sichuan Province (Grant No. 2017GZDZX0002), and the Key Research and Development Projects for Science and Technology of Sichuan Province (Grant No. 2018GZ0195).

ORCID iD
Lei Jiang https://orcid.org/0000-0002-0819-3885
Xiaomin Wang https://orcid.org/0000-0003-4934-4288
