1 s2.0 S0957415821001100 Main

Mechatronics 79 (2021) 102633
Contents lists available at ScienceDirect
Mechatronics
journal homepage: www.elsevier.com/locate/mechatronics
Data-driven fault tolerant predictive control for temperature regulation in

data center with rack-based cooling architecture✩
Kai Jiang a,b , Masoud Kheradmandi b , Chuan Hu c , Souvik Pal b , Fengjun Yan a,b ,∗
a
Department of Mechanical Engineering, McMaster University, Hamilton, ON L8S 4L7, Canada
b
Computing Infrastructure Research Center, McMaster University, Hamilton, ON L8S 4L7, Canada
c
Department of Mechanical Engineering, University of Texas at Austin, Austin, TX 78712, USA
ARTICLE INFO ABSTRACT
Keywords: Recently increasing amount of data centers (DCs) have been constructed for the development of information
Data center and communications technology (ICT). Meanwhile, more energy consumption is required to support the
Temperature control operation of DCs, where nearly thirty percent of the consumption is used for cooling system. Therefore,
Data-driven model
this paper studies an efficient cooling architecture named rack mountable cooling unit (RMCU) for DCs, and
Fault tolerant control
provides a novel data-driven fault tolerant predictive controller to regulate the temperature accordingly. In
order to circumvent the complicated physics modeling in DC, a data-driven modeling method is developed. In
this method, multiple local linear models in form of auto-regressive exogenous (ARX) model are selected to
describe the strongly nonlinear system. Besides, the multiple sub-models are identified through partial least
square (PLS) and corresponding training data is partitioned by fuzzy c-means (FCM). Then, on the basis of
developed data-driven model, a predictive controller considering actuator faults is designed to manage the
temperature in DC. Finally, real experimental results are presented and demonstrate the superior performance
of our proposed controller.
1. Introduction predictable airflow paths in these cooling methods are able to address
the above problems and reduce the power consumption [5].
With the growing demand of novel ICT, more DC facilities are Except for the development of cooling architecture, control design
required to support the services [1]. Since the power consumption of IT of regulating temperature in DCs is also a critical and challenging
equipments (ITEs) in the DC would be dissipated in the form of heat, issue [6–8]. Firstly, the DC system is a nonlinear multi-input-single-
appropriate cooling system is necessary to remove the heat from the output (MISO) system, and thus advanced controller is necessary to
indoor facility and release it outside [2]. Generally, there are three ensure the temperature stability. Moreover, the complicated airflow
different cooling structures for DC cooling system, which include room- and heat transfer in DC further increase the difficulty of modeling and
based, row-based and rack-based cooling. Here room-based cooling control. Therefore, many researchers are attempting to explore effi-
means the cold air is supplied to the computer room directly and cool cient control strategies to overcome the drawbacks, and some relevant
down the whole servers; row-based cooling denotes that several cooling works [9–11] about cooling system control have been carried out so
units are mounted between the racks and each of them delivers cold air far.
to chill a row of servers; rack-based cooling indicates the cooling unit The initial control strategy used in cooling system is on–off control,
is installed within a single rack [3]. The schematic diagram of three which means the cooling devices would be turned off or on when the
different cooling architectures is shown in Fig. 1. Nowadays room- temperature below or above the set point. This on–off control is the
based cooling architecture is widely used in DCs due to the simplicity simplest control in cooling system, while the frequently on–off would
of manufacturing and handling, while the problems of cold air bypass lead to temperature instability and reduce the lifetime of cooling equip-
and hot air recirculation in such DCs deteriorate the energy efficiency ments [12]. Proportional–integral–derivative (PID) control is another
significantly [4]. Therefore, row-based and rack-based cooling archi- effective technique for cooling system control in DCs. For instance,
tectures are developed and become prevalent. The shorter and more Baptiste et al. designed a PID controller for DC optimization in [13],
✩ This paper was recommended for publication by Associate Editor Marcel Francois Heertjes.
∗ Corresponding author at: Department of Mechanical Engineering, McMaster University, Hamilton, ON L8S 4L7, Canada.
E-mail address: yanfeng@mcmaster.ca (F. Yan).
https://doi.org/10.1016/j.mechatronics.2021.102633
Received 8 December 2020; Received in revised form 22 June 2021; Accepted 21 July 2021
Available online 10 August 2021
0957-4158/© 2021 Elsevier Ltd. All rights reserved.
K. Jiang et al. Mechatronics 79 (2021) 102633
Fig. 1. Schematic diagram of DC with room-based, row-based and rack-based cooling architecture.
where the controller was mainly utilized to manage the air speed. deal with three typical actuator faults (aging of fans, failure of fans and
Self-tuning PID control was proposed to control computer room air valve deactivation) in rack-based cooling architecture.
conditioning (CRAC) units in DCs as well [14]. In this work, the PID The main contributions in this work are illustrated as follows. (1)
control was adopted to regulate temperature automatically and a fuzzy The data-driven model based on the algorithm of FCM and PLS is
logic method was used to tune the parameters in PID controller. Al- adopted to represent the nonlinear dynamics and used for further
though PID control had achieved good effectiveness in cooling system, control issues in DC system. (2) Three types of actuator faults are
the problem of large temperature oscillation was still a threat for server considered in this work, and corresponding fault factors are identified
security and probably resulted in energy inefficiency. The excessively during the control process to compensate the faults. (3) Different cost
high temperature would decrease the lifetime of servers and even lead functions are switched for the data-driven MPC framework according to
to breakdown. The low temperature would result in energy waste and the three actuator faults and thus improving the control performance.
further increase the total cost of DC. The rest of this article is organized as following. Section 2 presents
Lately, model predictive control (MPC) in DC cooling system has the fundamental of DC and the rack-based cooling system. Then the
attracted a great deal of interests. Decentralized MPC was employed to methodology of data-driven modeling is introduced in Section 3. The
control the blower speed and supplied water temperature in multiple detailed development of fault tolerant predictive control and experi-
CRAC units [15]. This method was able to meet the constraints of tem- mental results in real system are provided in Section 4 and Section 5
separately. Finally, conclusion is depicted in Section 6.
perature in each zone and minimize the power consumption of CRAC
units at the same time. A novel controller based on MPC was proposed
2. Description of DC with rack-based cooling architecture
for cooling system in [16], in which indirect fresh air was utilized
and the power consumption was reduced by 30%. Besides, the authors
Cooling system plays a significant role in DC to regulate temper-
in [17] investigated DC control in cyber–physical perspective through
ature and further protect servers. Currently, there are three types of
MPC. This work considered computational network, thermal network
cooling methods in DCs, which contains air cooling, liquid cooling
and the effect of electricity market, and MPC was used to implement the
and phase-change-material cooling [18]. Considering the operational
global optimization in DC. At last the simulation indicated this method
feasibility, cooling security and equipment cost, the method of air
could achieve great cost saving. As discussed above, MPC is extensively
cooling is widely adopted in DC cooling system. In light of cooling
applied in DCs and obtains satisfactory effect. However, most of the
architecture, room-based, row-based and rack-based cooling units are
existing works concerning MPC in DCs are limited by physics modeling,
regarded as three typical architectures. As mentioned above, row-based
which is quite complicated and time-consuming. Moreover, inaccurate and rack-based cooling system are more energy-efficient over room-
physics models owing to lots of modeling assumptions usually lead to based cooling system. Besides, in row-based and rack-based cooling
the decrease of control performance. systems, rack-based cooling system is more suitable to DCs with stand-
To tackle the problems, in this work data-driven predictive control is alone high-density racks due to the flexibility and scalability. Thus,
put forward for the thermal management of DC with rack-based cooling the detailed investigation relating to rack-based cooling architecture is
architecture. Meanwhile, we also consider the actuator faults during the conducted in this work.
controller design. In order to simulate the nonlinear characteristics in The main concept of rack-based cooling system in DCs is to extract
DC system, the data-driven model is established on multiple linear ARX the heat generated by servers out of rack through the heat exchange
models which is identified via partial least square (PLS). Besides, the between chilled water and hot air. The airflow from hot aisle to cold
algorithm of FCM is utilized to partition the training data for multiple aisle is driven by several fans and goes through a heat exchanger
modeling. Finally, appropriate weights derived through membership filled with circulating chilled water, then the heat of hot air would
matrix in FCM are used to combine the multiple sub-models together. be taken away via the chilled water. The circulating water is supplied
According to such data-driven model, predictive controller is developed and cooled through a chiller outside. The whole system is depicted in
to implement the temperature regulation in DC. Additionally, fault Fig. 2. The RMCU is also represented in Fig. 3, which comprises fans,
tolerant strategy is creatively considered in the data-driven control to heat exchanger, valve, power source and some pipes. Moreover, the
2
the total number of time lapse in historical outputs and inputs.

𝑁𝑥 𝑁𝑢
∑ ∑
𝑥(𝑘) = 𝛼𝑖 𝑥(𝑘 − 𝑖) + 𝛽𝑗 𝑢(𝑘 − 𝑗) + 𝜎 + 𝑏(𝑘), (1)
𝑖=1 𝑗=0
As the thermal dynamic in DC is a quite complex process, only a

linear ARX model is insufficient to represent the system. Additionally,
the complicated system is later found to be locally linear in short time
series. Thus, multiple local linear models are put forward to approxima-
tively simulate the nonlinearities in DC system. This method is widely
used in many applications [20,21], and it is proved to be efficient to
represent the nonlinear system [22]. Moreover, the proposed model is
assumed to be globally identifiable and the data used for training is
assumed to be clean in this work.
The algorithm of PLS is utilized for parameter identification in ARX
Fig. 2. Detailed configuration of DC with rack-based cooling architecture.
model in this work, which is a popular multivariate statistical method
of data-driven modeling. The main procedure of PLS comprises two
steps: one is data preprocessing and the other one is model identifica-
tion [23,24]. The data preprocessing is conducted through the approach
of principle component analysis (PCA) that has the ability of projecting
the latent variables in high-dimension to low-dimension. The parameter
identification is carried out by a typical method of regression called
ordinary least square (OLS). Owing to the superiority of PCA, the
method of PLS is quite suitable to handle the modeling issue with high
correlation. The main equations in PLS are expressed as following.
𝑋 = 𝑇𝑋 𝑃 T + 𝐸𝑋 , (2)
𝑌 = 𝑇𝑌 𝑄T + 𝐸𝑌 , (3)
Fig. 3. Internal structure of rack mountable cooling unit. where 𝑋 and 𝑌 represent inputs and outputs, 𝑇𝑋 and 𝑇𝑌 stand for
principle component of inputs and outputs respectively, 𝐸𝑋 and 𝐸𝑌
denote the model residuals of 𝑋 and 𝑌 , 𝑃 means the matrix of loadings
location of RMCU in the rack could be adjusted based on the demand of 𝑋 and 𝑄 is the matrix of loadings of 𝑌 . Then, to find the rela-
of customers. tionship between inputs and outputs, a function linked two principle
components is required:
According to Figs. 2 and 3, we can figure out the thermal control
system in DC clearly. The measured plant output or the control objec- 𝑌 = 𝑇𝑋 𝑆 T + 𝐸𝑍 , (4)
tive is the temperature in rack. The inputs to the plant or the outputs of
the controller are the air flow rate controlled by fans and the water flow where 𝑆 is the parameter of two principle components, and 𝐸𝑍 is model
residual. At last, by applying Lagrange multiplier, all the parameters are
rate controlled by valve. In addition, the temperature of inlet water is
derived and the final relationship of inputs and outputs is demonstrated
considered as measured disturbance, which is determined by the chiller
as below:
system in the building.
𝑋 𝑇 𝑇𝑋 𝑌 𝑇 𝑇𝑌 𝑌 𝑇 𝑇𝑋
𝑃 = , 𝑄= , 𝑆= (5)
‖𝑇𝑋 ‖2 ‖𝑇𝑌 ‖2 ‖𝑇𝑋 ‖2
3. Data-driven modeling ‖ ‖ ‖ ‖ ‖ ‖
𝑌 = 𝜃𝑋 + 𝑏(𝑘), (6)
Although the individual rack is just a small part of DC, the thermal
dynamics inside including heat conduction, convection, and radiation 𝜃 = 𝑃 𝑆T, (7)
are quite complicated and thus difficult to model through first principle.
Moreover, the different types of servers in the rack have different in which 𝜃 is defined as the model parameters. Besides, if the objective
thermal characteristics, which would further increase the modeling function is organized as the form of ARX model, model parameters can
difficulty. Therefore, a data-driven model is developed in this program be depicted as: 𝜃 = [𝛼1 ⋯ 𝛼𝑁𝑦 𝛽1 ⋯ 𝛽𝑁𝑢 𝜎] and 𝑋 = [𝑥(𝑘 − 1)𝑇 ⋯ 𝑥(𝑘 −
to tackle the modeling problem and used for further predictive control. 𝑁𝑦 )𝑇 𝑢(𝑘)𝑇 ⋯ 𝑢(𝑘 − 𝑁𝑢 )𝑇 1]T .
The framework of data-driven model is based on ARX model, which In order to obtain multiple models, the dataset for training need
to be divided into some clusters and then the model parameters are
is an efficient linear discrete model to represent the relationship be-
calculated based on each cluster. In this article, the approach of FCM is
tween inputs and outputs [19]. The simple model structure of ARX
used for clustering, which is a typical method of clustering and widely
model is also easily applied for further controller or observer design.
adopted for data processing, image segmentation and so on [25,26].
Besides, the ARX model is suitable for the implementation of con-
The idea of FCM is to obtain the partitioned data and corresponding
trol algorithm in industrial micro controllers such as Raspberry Pi or
membership matrix by minimizing a objective function that is the sum
Arduino.
of weighted absolute distances between cluster centers and each data
The typical ARX model is formulated as following. As seen in
point. It is noteworthy that the membership matrix would be utilized
Eq. (1), the output at specific step is a function of historical outputs, for later combination of the identified ARX models. The mathematical
current inputs, historical inputs, constant factor and modeling bias. 𝑥(𝑘) equation of objective function is depicted as follows.
in the equation denotes output vector at 𝑘th step, 𝑢(𝑘) stands for input
vector, and 𝑏(𝑘) means the modeling bias. The parameters 𝛼𝑖 and 𝛽𝑖 are ∑
𝐺 ∑
𝐷
𝐽𝐹 𝐶𝑀 = (𝜇𝑗,𝑖 )𝑞 ‖𝑥𝑖 − 𝑐𝑗 ‖. (8)
system coefficients. 𝜎 represents the constant factor. 𝑁𝑥 and 𝑁𝑢 express 𝑖=1 𝑗=1
3
where 𝑥𝑖 means the data point and 𝐺 stands for the amount of data
items. 𝑐𝑗 denotes the center of clusters and the total number of clusters
is 𝐷. 𝜇𝑗,𝑖 expresses the membership matrix and it means the degree of
𝑥𝑖 belonging to the 𝑗th cluster. 𝑞 is defined as a scaling factor on each
fuzzy membership. Moreover, some constraints should be considered
within the minimization of objective function, which is described as
following:
𝜇𝑗,𝑖 ∈ [0, 1], 𝑓 𝑜𝑟 1 ≤ 𝑖 ≤ 𝐺, 1 ≤ 𝑗 ≤ 𝐷,
∑
𝐷
𝜇𝑗,𝑖 = 1, 𝑓 𝑜𝑟 1 ≤ 𝑖 ≤ 𝐺, 1 ≤ 𝑗 ≤ 𝐷,
𝑗=1 (9)
∑
𝐺
0< 𝜇𝑗,𝑖 < 𝐺, 𝑓 𝑜𝑟 1 ≤ 𝑖 ≤ 𝐺, 1 ≤ 𝑗 ≤ 𝐷.
𝑖=1
After setting the initial value of membership matrix 𝜇𝑗,𝑖 and cluster
centers 𝑐𝑗 , the fuzzy clustering is carried out through a method of
iterative optimization which is worked for updating the membership
Fig. 4. Schematic diagram of developed data-driven modeling.
matrix and cluster centers. The equations are shown as below:
1
𝜇𝑗,𝑖 = ∑ ‖𝑥𝑖 −𝑐𝑗 ‖
, (10)
𝐷 2∕(𝑞−1)
𝑘=1 ( ‖𝑥𝑖 −𝑐𝑘 ‖ )
4. Fault tolerant predictive control
∑𝐺
𝑖=1 (𝜇𝑗,𝑖 )𝑞 𝑥 𝑖 To accurately regulate the temperature in DCs, many control strate-
𝑐𝑗 = ∑𝐺 . (11)
𝑞 gies such as on–off control, PID control and MPC are employed. Par-
𝑖=1 (𝜇𝑔,𝑖 )
ticularly, MPC is widely used for DC management due to the ability
The terminal condition for iteration is max𝑖𝑗 |𝜇𝑗,𝑖 (𝑠 + 1) − 𝜇𝑗,𝑖 (𝑠)| < 𝜖 of dealing with time series prediction, constrained optimization and
or 𝑠 < 𝑁𝑠𝑒𝑡 , where 𝜖 is defined as a termination criterion, 𝑠 means the nonlinear systems [27–31]. In our previous work [32], MPC control
iterative step and 𝑁𝑠𝑒𝑡 is the set point of terminal steps. is also utilized to implement the temperature regulation based on data-
As the total sub-models may contribute to the future prediction, it driven model. However, during the real industrial operation, we found
is better to identify them simultaneously [22]. The detailed computing that some occasional actuator faults would deteriorate the performance
trick is represented as follows: of data-driven predictive controller and further threaten the security
and reliability of entire system. Consequently, in order to mitigate
𝑌 = [𝜃1 𝜃2 ⋯ 𝜃𝐷 ] × [𝑈1 ⊙𝑋 𝑈2 ⊙ 𝑋 ⋯ 𝑈𝐷 ⊙ 𝑋]T , (12)
the effects of actuator faults and ensure the safety of DCs, a novel
in which ⊙ means Hadamard product, and 𝑈 stands for an augment of fault tolerant part is considered in the MPC framework. At last, three
membership matrix. The extended matrix is shown as below: experimental cases demonstrate the effectiveness and practicability of
proposed controller.
⎡ 𝜇𝑗,1 𝜇𝑗,1 ⋯ 𝜇𝑗,1 ⎤
𝑈𝑗 = ⎢ ⋮ ⋮ ⋱ ⋮ ⎥ . (13)
⎢ ⎥ 4.1. Model predictive control
⎣𝜇𝑗,𝐺 𝜇𝑗,𝐺 ⋯ 𝜇𝑗,𝐺 ⎦𝐺×𝐷
Then based on the operation introduced above, all the parameters 𝜃 MPC is known as an advanced process control where the control
and sub-models could be derived through PLS. In addition, to represent actions are acquired by predicting the future states in a finite horizon
the total nonlinearities in DC system, all the sub-models should be and optimizing corresponding objective function iteratively [33]. Such
combined into one equation through appropriate weights which are re- predictive characteristic based on system model brings superior control
lated to the membership matrix mentioned above. The entire modeling performance [34]. Moreover, the ability of dealing with state and input
diagram is described in Fig. 3, and the final integrated ARX model is constraints is another strength of MPC, and thus it is widely applied for
shown as below: practical industry [35–37].
First, we transform the proposed data-driven model to a general
𝑁𝑥 𝑁𝑢
∑
𝐷 ∑ ∑ discrete form (Based upon the validated experiments, 𝑁𝑥 is selected
𝑥𝑘 = 𝜔𝑑,𝑘 [ 𝛼𝑖,𝑑 𝑥𝑘−𝑖 + 𝛽𝑗,𝑑 𝑢𝑘−𝑗 + 𝜁𝑑 𝜆 + 𝜎𝑑 ]. (14)
𝑑=1 𝑖=1 𝑗=0
as 1, 𝑁𝑢 is set as 0 in the data-driven model):
Specifically, 𝑥𝑘 stands for the temperature in the rack, and 𝑢𝑘 𝑥𝑘 = 𝛼𝑥𝑘−1 + 𝛽𝑢𝑘 + 𝜆, (15)
denotes the control inputs including air flow rate and water flow rate.
where 𝑥 is temperature, 𝑢 = [𝐹𝑎 , 𝐹𝑤 ]T represents the vector of air flow
In this work 𝜔𝑑,𝑘 means the weight of each sub-models, and the model
rate and water flow rate, 𝛼 and 𝛽 are the coefficients, 𝜆 denotes the
weight is related to the membership matrix in FCM (𝜔𝑑,𝑖 = 𝜇𝑑,𝑖 ).
disturbance and it includes the model bias. The main challenge of MPC
Besides, temperature of input water and workload of the whole rack is to calculate the inputs sequences through constrained optimization of
are considered as measurable disturbances and defined as vector 𝜆. cost function at each sampling time. The procedure could be formulated
𝜁𝑔 represents the coefficient vector of disturbances and 𝜎𝑑 is constant as below:
factor. It should be pointed out that modeling bias 𝑏(𝑘) is eliminated
∑
𝑀
during the calculation of PLS. min 𝐽= (𝑥𝑘 − 𝑥𝑟𝑒𝑓 )T 𝛶 (𝑥𝑘 − 𝑥𝑟𝑒𝑓 ) + 𝛥𝑢T𝑘 𝛷𝛥𝑢𝑘 ,
In summary, the proposed data-driven model has the following 𝑘=1
advantages. (1) It is suitable for the complex system which is difficult s.t. 𝑥𝑘 = 𝛼𝑥𝑘−1 + 𝛽𝑢𝑘 + 𝜆, (16)
to model through first principles. (2) The multiple linear models used
𝑢𝑘,𝑚𝑖𝑛 ≤ 𝑢𝑘 ≤ 𝑢𝑘,𝑚𝑎𝑥 ,
in this method are efficient to describe the nonlinear dynamics in the
system. (3) The final form of ARX model could be applied for the 𝛥𝑢𝑘,𝑚𝑖𝑛 ≤ 𝛥𝑢𝑘 ≤ 𝛥𝑢𝑘,𝑚𝑎𝑥 ,
controller design conveniently. The schematic diagram of developed where 𝐽 means the cost function, 𝑀 is the prediction horizon, 𝛥𝑢𝑘 =
data-driven modeling is described in Fig. 4. 𝑢𝑘 −𝑢𝑘−1 indicates the change of inputs in each step, 𝑥𝑟𝑒𝑓 is the reference
4
value of state, 𝛶 and 𝛷 are weighting parameters, 𝑢𝑘,𝑚𝑖𝑛 represents the in which 𝜈𝑎𝑐𝑡 means the real actuator action, and 𝜈𝑐𝑜𝑚 is defined as
lower bound of inputs and 𝑢𝑘,𝑚𝑎𝑥 is the upper bound of inputs. the control command. 𝛿 represents the threshold for fault diagnosis
In the cost function, the part of (𝑥𝑘 − 𝑥𝑟𝑒𝑓 ) indicates the error which is determined by repeated experiments. In general cases of
between actual temperature and reference temperature. The reason fault identification, the real actuator information are usually estimated
why we minimize the error between actual temperature and reference through input observers such as Kalman filter, sliding mode observer,
temperature is to maintain the actual temperature at setpoint. Then we moving horizon estimation and so on. However, for real industrial
also minimize the changes of inputs 𝛥𝑢𝑘 , since we cannot ignore the applications, the customers have high demands of information accuracy
physical constraints of actuators. Moreover, the constraints of changes and reliability, especially the input and output data. Consequently,
of fan speed could also decrease the energy consumption to some several physical sensors are employed to measure the actuator actions
extent. in this program. Furthermore, for the measurement of fan speed, it is
The optimizing procedure in MPC is known as receding horizon quite convenient to use the fan’s own internal hall effect sensor. The
optimization. At this time-step, once the optimization process yields a water flow rate is measured through the system-provided flow meter.
finite control sequence, the first control action in this sequence will be These two approaches of obtaining actuators information are all more
applied to the system. Then the same procedure would be repeated con- reliable than input observers without increasing costs.
tinuously with new measurements for the following sampling instant.
As we can see in Fig. 3, there are five fans in the cooling unit.
Thus we need to identify five fault factors of fans at the same time and
Remark 1. In this work the stability of data-driven predictive control
combine them together. The simple equation of fault factors is shown
under the studied operating conditions is concluded from the experi-
as follows.
ments, therefore, the specific stability analysis is not elaborated. More [ ] [ ]
detailed information with regard to stabilizing predictive control could 𝜌 1∕5(𝜌𝑎,1 + 𝜌𝑎,2 + 𝜌𝑎,3 + 𝜌𝑎,4 + 𝜌𝑎,5 )
𝜌= 𝑎 = . (19)
be found in [28,38,39]. 𝜌𝑤 𝜌𝑤
Here, for the healthy system, all the fault factors of fans 𝜌𝑎,𝑖 = 1,
4.2. Fault tolerant control and then the final fault factors of fans 𝜌𝑎 = 1. For one of the fans
fails to work, the final fault factors of fans 𝜌𝑎 = 0.8 and it means the
As mentioned above, some actuator faults during the real industrial
performance of airflow actuator decreases by 20%.
operation usually degrade the performance of our developed controller
considerably. Therefore, fault tolerant control (FTC) is proposed in
4.2.3. Fault tolerant MPC
this work to compensate for the adverse influence induced by actuator
The fault factor would be identified in each sampling instant, and
faults. Basically, FTC could be categorized into active FTC and passive
the state–space model is updated accordingly. Consequently, the prob-
FTC [40–42]. These two methods have different design philosophies,
lem of fault tolerant MPC could be reformulated as below:
but have the same control objective [43]. Active FTC indicates that
the designed controller could reorganize the control structure online ∑
𝑀
according to the system failures. Besides, a fault diagnosis scheme min 𝐽∗ = (𝑥𝑘 − 𝑥𝑟𝑒𝑓 )T 𝛶 (𝑥𝑘 − 𝑥𝑟𝑒𝑓 ) + 𝛥𝑢T𝑘 𝛷𝛥𝑢𝑘 ,
𝑘=1
is necessary in the active FTC to offer the real-time faults of the
s.t. 𝑥𝑘 = 𝛼𝑥𝑘−1 + 𝛽𝜌𝑘 𝑢𝑘 + 𝜆, (20)
system [44,45]. On the contrary, passive FTC is able to dispose of
the system faults without any faults information and control structure 𝑢𝑘,𝑚𝑖𝑛 ≤ 𝑢𝑘 ≤ 𝑢𝑘,𝑚𝑎𝑥 ,
adjustments [46]. This kind of controller is fixed and insensitive to 𝛥𝑢𝑘,𝑚𝑖𝑛 ≤ 𝛥𝑢𝑘 ≤ 𝛥𝑢𝑘,𝑚𝑎𝑥 .
failures because of the pre-designed redundancy and robustness, and
thus it is named as passive FTC. Additionally, it is also noteworthy that the fault of valve deacti-
The actuators in this system are direct current fans and electrical vation would result in the change of cost function in fault tolerant
motorized ball valve, which are utilized to control the air flow rate and MPC. When the fault of valve deactivation occurs, the water flow rate
water flow rate respectively. Based on the feedbacks from engineers, stays as a constant and cannot be controlled any more. Thus the MISO
there are three common actuator faults including fans aging, fans fail- system changes to a single-input-single-output (SISO) system, and the
ure and valve deactivation. Generally, it is difficult to deal with three cost function should be adjusted accordingly. The updated fault tolerant
different actuator faults simultaneously in a fixed control structure, MPC problem is represented as:
therefore, active FTC is employed in this work instead of passive FTC. ∑
𝑀
The detailed methodologies are shown as following. min 𝐽# = (𝑥𝑘 − 𝑥𝑟𝑒𝑓 )T 𝛶 (𝑥𝑘 − 𝑥𝑟𝑒𝑓 ) + 𝛥𝑢∗T ∗
𝑘 𝛷𝛥𝑢𝑘 ,
𝑘=1
4.2.1. Fault model s.t. 𝑥𝑘 = 𝛼𝑥𝑘−1 + 𝛽𝜌𝑘 𝑢∗𝑘 + 𝜆, (21)
In order to model the actuator faults, a fault factor is proposed in
𝑢∗𝑘,𝑚𝑖𝑛 ≤ 𝑢∗𝑘 ≤ 𝑢∗𝑘,𝑚𝑎𝑥 ,
the previous data-driven model:
𝛥𝑢∗𝑘,𝑚𝑖𝑛 ≤ 𝛥𝑢∗𝑘 ≤ 𝛥𝑢∗𝑘,𝑚𝑎𝑥 ,
𝑥𝑘 = 𝛼𝑥𝑘−1 + 𝛽𝜌𝑘 𝑢𝑘 + 𝜆, (17)
where 𝑢∗𝑘 = [𝐹𝑎 , 𝜈𝑎𝑐𝑡,𝑤 ]T , and 𝜈𝑎𝑐𝑡,𝑤 is the constant measurement of
where 𝜌 = 𝑑𝑖𝑎𝑔(𝜌𝑎 , 𝜌𝑤 ) is the fault factor vector which denotes the
water flow rate under valve deactivation and no longer a control input.
ratios between the real actuator responses and control commands from
Finally, by applying a switch trigger and this updated cost function,
controller. If the actuator responses are equal to the control commands,
the fault of valve deactivation is addressed. To clearly represent the
it means the system is healthy and 𝜌 = 𝐼; if the actuators lose
procedure of fault tolerant MPC in this DC system, a flowchart is
functionality completely 𝜌 = 0.
provided in Fig. 5.
4.2.2. Fault identification
To compensate for the faults and work out appropriate control 5. Experimental results
actions in real time, it is essential to do the fault identification. The
detailed procedure of fault identification in this article is online cal- All the experiments are implemented in a real single-rack DC with
culation or estimation of fault factor. The formula of fault factor is rack-based cooling architecture which is shown in Fig. 6. The experi-
presented as below: mental DC system comprises 30 servers, cooling unit, several sensors,
{𝜈 pipes and chilled water supply system from the building. The control al-
𝑎𝑐𝑡
𝜈𝑐𝑜𝑚
, if ||𝜈𝑎𝑐𝑡 − 𝜈𝑐𝑜𝑚 || ≥ 𝛿 gorithm is programmed in Python environment and conducted through
𝜌𝑖 = (18)
1, if ||𝜈𝑎𝑐𝑡 − 𝜈𝑐𝑜𝑚 || ≤ 𝛿, Raspberry Pi Zero which is regarded as the master micro-controller
5
Fig. 7. Measurements under all working condition for model training.
Fig. 5. Flowchart of the fault tolerant MPC algorithm in this work.
Fig. 8. Validation result of data-driven model.
faults are carried out to test and verify the effectiveness of designed
controller. The detailed experiment results are presented as follows.
5.1. Case I: Aging of fans
The first type of actuator faults is aging of fans, which indicates the
performance of total fans degrade by a certain percent over time. In
Fig. 6. Experimental DC with rack-based cooling architecture. this experiment, the fault is injected at 400 s and the fans’ performance
affected by fault is assumed to degrade by 30 percent. The comparison
of results between data-driven predictive controller with fault tolerant
in this work. Besides, considering reducing the computational burden control and without fault tolerant control are depicted in Fig. 9. More-
of Raspberry Pi, an Arduino board (Nano) is used as the slave micro- over, the control inputs and measured disturbance are shown in Fig. 10
controller whose function is to send actuators commands of Raspberry as well.
Pi and collect data from sensors. Additionally, the IT workload in this As seen in Fig. 9, temperatures regulated by both of the controllers
experiment is set as a constant value. are able to track the set point properly before fault occurs, which rep-
In order to obtain an efficient data-driven model, the experiment resents the good predictive capability of data-driven model and control
under all working condition is carried out and the relevant measure- accuracy of developed predictive controller. When fault is injected at
ments are utilized for model training. The experiment costs nearly 80 h 400 s, the temperature controlled by data-driven predictive controller
and includes 10 cycles covering different air flow, water flow and IT without fault tolerant control is still 0.3∼0.4 ◦ C above setpoint and
workload. The experimental cycle is represented in Fig. 7. Then a group keeps rising. On the contrary, the proposed controller with fault toler-
of data from another experiment is used to validate the performance of ant control works properly under fault and the temperature could still
data-driven model, and the result of validation is depicted in Fig. 8. follow the setpoint well.
As we can see in this figure, the error of proposed model is roughly According to our experiments, the energy consumption of each
around 0.2 ◦ C, which is within the acceptable boundary and further server would increase 2% when the ambient temperature increases
demonstrates the data-driven model is able to describe the DC system. 1 ◦ C. Therefore, the proposed controller could save the energy con-
As mentioned above, three kinds of actuator faults comprising aging sumption by nearly 1% compared with the controller without fault
of fans, failure of fans and valve deactivation are investigated in this tolerant control. In our experiment, the IT workload of the whole rack
paper. Therefore, three groups of experiment concerning the actuator is 3000 W, so the saved energy is 30Wh per hour. Besides, the steady
6
Fig. 9. The comparative performance of data-driven predictive controller with fault Fig. 11. The comparative performance of data-driven predictive controller with fault
tolerant control and without fault tolerant control in case I. tolerant control and without fault tolerant control in case II.
Fig. 12. Control inputs and measured disturbance in case II.

Fig. 10. Control inputs and measured disturbance in case I.
fault tolerant control shows a bad adaptability to the actuator fault.

temperature is also important for the system reliability when the faults Furthermore, the differences of control commands shown in Fig. 12 ex-
occur. press the internal reason why the controller with fault tolerant control
Besides, the control inputs after fault occurrence are described in performs better. It is also noteworthy that the control performance is
Fig. 10. In this case, the controller with fault tolerant control seems to quite sensitive to the change of input water temperature and the tem-
provide more aggressive control commands, but they are necessary to perature fluctuates greatly at the beginning period, thus the controlled
stabilize the temperature when the faults occur. In this system, fans are temperatures in the rack overshoot and oscillate around the set point.
cheap and easy to replace, while the lifetime or reliability of servers is
much more important than that of the fans. 5.3. Case III: Valve deactivation
5.2. Case II: Failure of fans The third common actuator fault in cooling unit is the valve de-
activation. The valve fault is also injected around 400 s and the
The second type of actuator faults means that one or some of pre-arranged stopped water flow rate is set as 0.000333 m3 ∕s or 20
the fans stop working suddenly in cooling unit, which is one of the L∕min (it can also be set as another value). The comparison of effects
major faults in fans. As the other fans still work normally under this between two controllers is presented in Fig. 13 and corresponding
situation, the temperature in the system could be regulated by some control commands are shown in Fig. 14.
proper control algorithms. In this case, we assume that one of the fans As Fig. 13 describes, the temperatures could follow the set point
stops working after the system reaches the steady state and the fault precisely, and there are not obvious overshoot and undershoot before
is injected at a specific point-in-time through the code in advance. the injection of fault. When the valve fault occurs, the temperature
The comparative experiments of data-driven predictive controller with managed through controller without fault tolerant control appears a
fault tolerant control and without fault tolerant control are conducted big oscillation due to the change of control structure. In addition, the
respectively and the results are described in Fig. 11. Besides, the control control input of controller without fault tolerant control varies signifi-
inputs are represented in Fig. 12 as well. cantly as well. However, the behavior of controller with fault tolerant
In Fig. 11, the effectiveness of controller with fault tolerant control control is absolutely excellent even under the occurrence of fault owing
is satisfying after the injection of fault, while the controller without to the timely adjustment of cost function in control algorithm.
7
architecture. First, a data-driven based on combined ARX models is

established, which is identified through PLS and FCM. Then, according
to such data-driven model, fault tolerant predictive control is im-
plemented. In this work, this common actuator faults containing fan
aging, fan failure and valve deactivation are considered to validate
the controller. Finally, the comparative results from real experiments
reveal that the proposed controller works properly under different fault
cases. In the future work, we would investigate the problem of combi-
nation of different faults. In addition, more approaches of data-driven
modeling for industrial applications and strategies for power-efficient
optimization under temperature constraint in DC will be explored.
CRediT authorship contribution statement
Kai Jiang: Conceptualization, Methodology, Software, Validation,

Writing – original draft. Masoud Kheradmandi: Methodology, Data
curation. Chuan Hu: Writing – review & editing. Souvik Pal: Project
administration. Fengjun Yan: Supervision, Conceptualization, Funding
Fig. 13. The comparative performance of data-driven predictive controller with fault acquisition.
tolerant control and without fault tolerant control in case III.
Declaration of competing interest
The authors declare that they have no known competing finan-

cial interests or personal relationships that could have appeared to
influence the work reported in this paper.
References
[1] Ebrahimi K, Jones GF, Fleischer AS. A review of data center cooling technology,
operating conditions and the corresponding low-grade waste heat recovery
opportunities. Renew Sustain Energy Rev 2014;31:622–38.
[2] Cheung H, Wang S, Zhuang C, Gu J. A simplified power consumption model
of information technology (IT) equipment in data centers for energy system
real-time dynamic simulation. Appl Energy 2018;222:329–42.
[3] Moazamigoodarzi H, Tsai PJ, Pal S, Ghosh S, Puri IK. Influence of cooling
architecture on data center power consumption. Energy 2019;183:525–35.
[4] Lu T, Lü X, Remes M, Viljanen M. Investigation of air management and
energy performance in a data center in Finland: Case study. Energy Build
2011;43(12):3360–72.
[5] Moazamigoodarzi H, Pal S, Ghosh S, Puri IK. Real-time temperature predictions
Fig. 14. Control inputs and measured disturbance in case III. in IT server enclosures. Int J Heat Mass Transfer 2018;127:890–900.
[6] Boucher TD, Auslander DM, Bash CE, Federspiel CC, Patel CD. Viability of
dynamic cooling control in a data center environment. J Electron Packag
2006;128(2):137–44.
As we can see in the three cases of experiments, the data-driven [7] Zhang H, Shao S, Xu H, Zou H, Tian C. Free cooling of data centers: A review.
fault tolerant predictive controller behaves quite well in the respects Renew Sustain Energy Rev 2014;35:171–82.
of response time, fault tolerance, overshoot and undershoot. Besides, [8] Liu Q, Ma Y, Alhussein M, Zhang Y, Peng L. Green data center with IoT
sensing and cloud-assisted smart temperature control system. Comput Netw
the comparison of predictive controller without fault tolerant con-
2016;101:104–12.
trol further demonstrate the remarkable performance of developed [9] Wang J, Zhang Q, Yoon S, Yu Y. Impact of uncertainties on the supervisory
controller. control performance of a hybrid cooling system in data center. Build Environ
2019;148:361–71.
Remark 2. In this work, we also conducted the experiment considering [10] Patterson MK, Weidmann R, Leberecht M, Mair M, Libby RM. An investigation
different actuator faults simultaneously. However, we found that the into cooling system control strategies for data center airflow containment
architectures. In: International electronic packaging technical conference and
temperature cannot be controlled at the setpoint even we tuned the exhibition, vol. 44625; 2011. p. 479–88.
actuator outputs to the maximum. Therefore, we plan to apply more [11] Li Y, Wen Y, Tao D, Guan K. Transforming cooling optimization for green data
powerful fans or use the input water with lower temperature to solve center via deep reinforcement learning. IEEE Trans Cybern 2019;50(5):2002–13.
this problem in the future work. [12] Parolini L, Sinopoli B, Krogh BH, Wang Z. A cyber–physical systems ap-
proach to data center modeling and control for energy efficiency. Proc IEEE
2012;100(1):254–68.
Remark 3. PID control was compared with data-driven predictive
[13] Durand-Estebe B, Le Bot C, Mancos JN, Arquis E. Data center optimization using
control for DC temperature regulation in our previous work [32]. PID regulation in CFD simulations. Energy Build 2013;66:154–64.
During the experiments, we found that PID controller had some draw- [14] Deng J, Yang L, Cheng X, Liu W. Self-tuning PID-type fuzzy adaptive control for
backs such as large overshoots, undershoots and oscillations. Thus, we CRAC in datacenters. In: International conference on computer and computing
abandoned the PID controller in our DC productions, and removed the technologies in agriculture; 2013. p. 215–25.
experiment related to PID controller in this work. [15] Zhou R, Wang Z, Bash CE, McReynolds A. Modeling and control for cooling
management of data centers with hot aisle containment. In: Proceedings of ASME
2011 international mechanical engineering congress & exposition; 2011.
6. Conclusion
[16] Ogawa M, Fukuda H, Kodama H, Endo H, Sugimoto T, Kasajima T, et al.
Development of a cooling control system for data centers utilizing indirect fresh
This paper presents a novel data-driven fault tolerant predictive air based on model predictive control. In: International congress on ultra modern
controller for temperature regulation of DC with rack-based cooling telecommunications and control systems and workshops; 2015. p. 132–7.
8
[17] Parolini L, Sinopoli B, Krogh BH. Model predictive control of data centers in the [41] Yin S, Luo H, Ding SX. Real-time implementation of fault-tolerant control systems
smart grid scenario. IFAC Proc Vol 2011;44(1):10505–10. with performance optimization. IEEE Trans Ind Electron 2013;61(5):2402–11.
[18] Ding T, guang He Z, Hao T, Li Z. Application of separated heat pipe system in [42] Lan J, Patton RJ. A new strategy for integration of fault estimation within
data center cooling. Appl Therm Eng 2016;109:207–16. fault-tolerant control. Automatica 2016;69:48–59.
[19] Corbett B, Mhaskar P. Subspace identification for data-driven modeling and [43] Jiang J, Yu X. Fault-tolerant control systems: A comparative study between active
quality control of batch processes. AIChE J 2016;62(5):1581–601. and passive approaches. Annu Rev Control 2012;36(1):60–72.
[20] Tóth R, Lyzell C, Enqvist M, Heuberger PS, Van den Hof PM. Order and structural [44] Wang R, Wang J. Fault-tolerant control with active fault diagnosis for four-
dependence selection of LPV-ARX models using a nonnegative garrote approach. wheel independently driven electric ground vehicles. IEEE Trans Veh Technol
In: Proceedings of the 48h IEEE conference on decision and control. IEEE; 2009, 2011;60(9):4276–87.
p. 7406–11. [45] Wang J, Zhang J, Qu B, Wu H, Zhou J. Unified architecture of active fault
[21] Tóth R. Modeling and identification of linear parameter-varying systems, vol. detection and partial active fault-tolerant control for incipient faults. IEEE Trans
403. Springer; 2010. Syst Man Cybern Syst 2017;47(7):1688–700.
[22] Siam A, Brandon C, Prashant M. Data-based modeling and control of nylon-6, 6 [46] Niemann H, Stoustrup J. Passive fault tolerant control of a double inverted
batch polymerization. IEEE Trans Control Syst Technol 2013;21(1):94–106. pendulum-a case study. Control Eng Pract 2005;13(8):1047–59.
[23] Aumi S, Mhaskar P. Integrating data-based modeling and nonlinear control tools
for batch process control. AIChE J 2012;58(7):2105–19.
[24] Aumi S, Mhaskar P. An adaptive data-based modeling approach for predictive
control of batch systems. Chem Eng Sci 2013;91:11–21. Kai Jiang received the B.Sc. degree in nuclear engineering
[25] Pal NR, Pal K, Keller JM, Bezdek JC. A possibilistic fuzzy c-means clustering and technology from the South China University of Tech-
algorithm. IEEE Trans Fuzzy Syst 2005;13(4):517–30. nology, Guangzhou, China, in 2014, and the M.Sc. degree
[26] Jin D, Bai X, Wang Y. Integrating structural symmetry and local homoplasy in thermal engineering from Shanghai Maritime University,
information in intuitionistic fuzzy clustering for infrared pedestrian segmentation. Shanghai, China, in 2016.
IEEE Trans Syst Man Cybern Syst 2019. He is currently working toward the Ph.D. degree in me-
[27] Mayne DQ. Model predictive control: Recent developments and future promise. chanical engineering with McMaster University, Hamilton,
Automatica 2014;50(12):2967–86. Canada. His research interests include thermal management
[28] Mayne DQ, Rawlings JB, Rao CV, Scokaert PO. Constrained model predictive in data centers, diesel engine after treatment systems, and
control: Stability and optimality. Automatica 2000;36(6):789–814. connected and automated vehicles.
[29] Farina M, Giulioni L, Scattolini R. Stochastic linear model predictive control with
chance constraints–a review. J Process Control 2016;44:53–67.
[30] Gao T, Yin S, Qiu J, Gao H, Kaynak O. A partial least squares aided intel-
ligent model predictive control approach. IEEE Trans Syst Man Cybern Syst Fengjun Yan received the B.Sc. degree in control science
2018;48(11):2013–21. and engineering from the Department of Control Science
[31] Cannon M. Efficient nonlinear model predictive control algorithms. Annu Rev and Engineering, Harbin Institute of Technology, Harbin,
Control 2004;28(2):229–37. China, in 2004, the M.Sc. degree in automotive engineering
[32] Shi S. Rack-based Data Center Temperature Regulation Using Data-driven Model from the Department of Automotive Engineering, Tsinghua
Predictive Control [Master’s thesis], McMaster University; 2019. University, Beijing, China, in 2006, and the Ph.D. degree in
[33] Laurí D, Rossiter JA, Sanchis J, Martínez M. Data-driven latent-variable mechanical engineering from the Department of Mechanical
model-based predictive control for continuous processes. J Process Control and Aerospace Engineering, Ohio State University, Colum-
2010;20(10):1207–19. bus, OH, USA, in 2012.
[34] Piga D, Forgione M, Formentin S, Bemporad A. Performance-oriented model He was an Engineer with FEV. Inc., Dalian, China, from
learning for data-driven MPC design. IEEE Control Syst Lett 2019;3(3):577–82. 2006 to 2007. He joined McMaster University, Hamilton,
[35] Evans MA, Cannon M, Kouvaritakis B. Robust MPC tower damping for variable ON, Canada, as an Assistant Professor in 2012 and was pro-
speed wind turbines. IEEE Trans Control Syst Technol 2014;23(1):290–6. moted to Associate Professor in 2017. His research interests
[36] Sakhdari B, Azad NL. Adaptive tube-based nonlinear MPC for economic au- include dynamic system modeling, nonlinear system esti-
tonomous cruise control of plug-in hybrid electric vehicles. IEEE Trans Veh mation and control, internal combustion engine advanced
Technol 2018;67(12):11390–401. combustion mode control, hybrid power train control, and
[37] Incremona GP, Ferrara A, Magni L. MPC for robot manipulators with integral intelligent vehicles. Dr. Yan served as an Associate Editor
sliding modes generation. IEEE/ASME Trans Mechatronics 2017;22(3):1299–307. of the American Control Conference and the American
[38] Scokaert PO, Mayne DQ, Rawlings JB. Suboptimal model predictive control Society of Mechanical Engineers (ASME) Dynamic Systems
(feasibility implies stability). IEEE Trans Automat Control 1999;44(3):648–54. and Control Conference, an Organizer of the ATS-TC Invited
[39] Berberich J, Köhler J, Müller MA, Allgöwer F. Data-driven model predictive Session of the American Control Conference and the ASME
control with stability and robustness guarantees. IEEE Trans Automat Control Dynamic System Control Conference.
2021;66(4):1702–17.
[40] Zhang Y, Jiang J. Bibliographical review on reconfigurable fault-tolerant control
systems. Annu Rev Control 2008;32(2):229–52.

1 s2.0 S0957415821001100 Main

Uploaded by

Document Information

Original Title

Copyright

Available Formats

Share this document

Share or Embed Document

Sharing Options

Did you find this document useful?

Is this content inappropriate?

Copyright:

Available Formats

1 s2.0 S0957415821001100 Main

Uploaded by

Copyright:

Available Formats

Mechatronics 79 (2021) 102633

Contents lists available at ScienceDirect

Data-driven fault tolerant predictive control for temperature regulation in

ARTICLE INFO ABSTRACT

the total number of time lapse in historical outputs and inputs.

As the thermal dynamic in DC is a quite complex process, only a

Fig. 7. Measurements under all working condition for model training.

Fig. 5. Flowchart of the fault tolerant MPC algorithm in this work.

Fig. 8. Validation result of data-driven model.

5.1. Case I: Aging of fans

Fig. 12. Control inputs and measured disturbance in case II.

fault tolerant control shows a bad adaptability to the actuator fault.

architecture. First, a data-driven based on combined ARX models is

CRediT authorship contribution statement

Kai Jiang: Conceptualization, Methodology, Software, Validation,

The authors declare that they have no known competing finan-

You might also like