
To be presented at the American Control Conference, Denver, CO, June 4–6, 2003

Data Compression Issues with Pattern Matching in Historical Data


Ashish Singhal

Dale E. Seborg

Department of Chemical Engineering


University of California, Santa Barbara, CA 93106

Abstract

It is a common practice in the process industries to compress plant data before they are archived. However, compression may alter the data in a manner that makes it difficult to extract useful information from them. In this paper, we evaluate the effectiveness of a new pattern matching technique1 for applications involving compressed historical data. We also compare several data compression methods with regard to efficiency, data reconstruction, and suitability for pattern matching applications.

This section briefly describes some of the popular compression methods for time-series data. Because the accuracy of retrieved data depends not only on the method used for compression but also on the method used for reconstruction, some simple reconstruction techniques, including zero-order hold and linear interpolation, are also discussed briefly.

1 Introduction

Due to advances in computer technology, large amounts of data produced by industrial plants are recorded as frequently as every second using commercially available data historians.2,3 Although storage media are inexpensive, the cost of building high-bandwidth networks is still high. Thus, to minimize the cost of transmitting large amounts of data over company networks or the Internet, the data must be compressed.
One of the classic papers on compression of process data
was published by Hale and Sellars.2 They provided an excellent overview of the issues in the compression of process data
and also described piecewise linear compression methods.
Other researchers have developed several algorithms to
compress time varying signals in efficient ways. Bristol4
modified the piecewise linear compression methods of Hale
and Sellars2 to propose a swinging door data compression
algorithm. Mah et al.5 proposed a complex piecewise linear online trending (PLOT) algorithm that performed better
than the classical box-car, backward slope and swinging door
methods. Bakshi and Stephanopoulos6 compressed process
data using wavelet methods. Recently, Misra et al.7 developed an online data compression method using wavelets
where the algorithm computes and updates the wavelet decomposition tree before receiving the next data point.
In this paper, six data compression methods are evaluated not only on the basis of how accurately they represent process data, but also on the basis of how they affect pattern matching.

2.1 Data compression methods


The box-car method is a simple piecewise linear compression method. It records a data point when its value is significantly different from the last recorded value.2 Because the decision to record a future value depends only on the last recorded value and the recording limits, the box-car algorithm performs best when the process runs for long periods of steady-state operation.8
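As a rough illustration (a simplified sketch, not the exact algorithm of Hale and Sellars2), a box-car pass can be written as follows, where `limit` plays the role of the recording limit:

```python
import numpy as np

def boxcar_compress(x, limit):
    """Simplified box-car pass: keep a sample only when it differs from the
    last *recorded* value by more than the recording limit."""
    kept_idx = [0]                      # always record the first point
    last = x[0]
    for i in range(1, len(x)):
        if abs(x[i] - last) > limit:    # significant change -> record it
            kept_idx.append(i)
            last = x[i]
    return np.array(kept_idx), x[np.array(kept_idx)]
```

During long steady-state stretches nothing exceeds the limit, so almost no points are stored, which is why box-car compresses steady operation so effectively.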

2 Popular data compression and reconstruction methods for time-series data

The backward slope method is also a piecewise linear compression method; it exploits the trending nature of a process variable by projecting the recording limit into the future on the basis of the slope of the two previously recorded values.2
The combination method combines the box-car and
backward slope algorithms.2 This algorithm handles
cases when the system is at steady state as well as
when process variables exhibit trends.
Data averaging is a common compression technique in which the time-series data are simply averaged over a specified period of time. In this case, the compression is performed off-line rather than online.
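A minimal sketch of this kind of block averaging (the 1.25-minute period used later in the case study would correspond to `block` samples at the plant's sampling rate):

```python
import numpy as np

def average_compress(x, block):
    """Off-line averaging compression: replace each block of `block`
    consecutive samples by its mean, giving a compression ratio of `block`."""
    n = (len(x) // block) * block       # drop an incomplete final block
    return x[:n].reshape(-1, block).mean(axis=1)
```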
Wavelet-based compression. Wavelet transforms can be used to compress time-series data by thresholding the wavelet coefficients.7,8 With hard thresholding, only those wavelet coefficients whose magnitudes exceed a specified threshold are retained; this is the approach used in this research.
For data compression, only the non-zero thresholded
wavelet coefficients are stored. These thresholded coefficients can then be used to reconstruct data when

Present address: Johnson Controls, Inc., 507 E. Michigan St., Milwaukee, WI 53202. Email: Ashish.Singhal@jci.com
Corresponding author. Email: seborg@engineering.ucsb.edu

needed. In the present study, the recording limits on


each of the process variables will be used as threshold
values.
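The idea can be sketched with a single level of the Haar transform (the paper does not specify the wavelet family, so Haar here is purely illustrative): small detail coefficients are zeroed by hard thresholding and need not be stored, and the signal is rebuilt from what remains.

```python
import numpy as np

def haar_decompose(x):
    """One level of the orthonormal Haar transform (len(x) must be even)."""
    a = (x[0::2] + x[1::2]) / np.sqrt(2.0)   # approximation coefficients
    d = (x[0::2] - x[1::2]) / np.sqrt(2.0)   # detail coefficients
    return a, d

def hard_threshold(c, thresh):
    """Hard thresholding: keep only coefficients whose magnitude exceeds thresh."""
    return np.where(np.abs(c) > thresh, c, 0.0)

def haar_reconstruct(a, d):
    """Invert haar_decompose."""
    x = np.empty(2 * len(a))
    x[0::2] = (a + d) / np.sqrt(2.0)
    x[1::2] = (a - d) / np.sqrt(2.0)
    return x
```

Thresholded-away detail coefficients smooth out small fluctuations while the approximation coefficients preserve the signal's overall shape.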

vidual data windows in the candidate pool are called records.


After the candidate pool has been formed, a person familiar
with the process can then perform a more detailed examination of the records. The number of observations by which the window is moved through the historical data is denoted by w, and is set equal to one-tenth to one-fifth of the length of the snapshot data window.1 A detailed description of the similarity factors and the pattern matching methodology is provided by Singhal11 and Singhal and Seborg.1
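In sketch form, the window generation step might look like the following, with `m` the snapshot length and `w` the step size (function and variable names are mine, not from the paper):

```python
def moving_windows(history, m, w):
    """Yield (start index, window) pairs for a window of m observations
    stepped through the historical data w observations at a time."""
    for start in range(0, len(history) - m + 1, w):
        yield start, history[start:start + m]
```

Each window would then be scored against the snapshot with the similarity factors, and the top-scoring windows collected into the candidate pool.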

Compression using commercial PI software (OSI


Software, www.osisoft.com). Because PI is widely
used for data archiving, it is informative to compare
the commercially available software with the classical
techniques. In particular, the BatchFile Interface for
the PI software was used in this research for data
compression.



Two important metrics are used to quantify the effectiveness


of a pattern matching technique. But first, several definitions
are introduced:

2.2 Data reconstruction methods

All of the data compression methods described in the previous section produce lossy compression, i.e., it is not possible to reconstruct the compressed data to exactly match the
original data. The accuracy by which compressed data can
describe the original uncompressed data depends not only on
the compression algorithm, but also on the method of data
reconstruction. The simplest method is the zero-order hold (ZOH), in which the value of a variable is held at the last recorded value until the next recording, producing a staircase-like reconstruction. Linear interpolation (LIN) can overcome part of this limitation by interpolating between recordings; it can provide more accurate reconstruction both when the process is at steady state and when process variables exhibit trends.
More sophisticated methods such as spline interpolation,
and expectation-maximization algorithm for data reconstruction have also been proposed.9, 10 But these methods are sensitive to the amount of missing data, and do not perform well
when a significant amount of data are missing.9, 10
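The two simple reconstruction rules can be sketched as follows, given recorded times and values (a minimal illustration, not a historian's actual interpolation code):

```python
import numpy as np

def zoh_reconstruct(t_rec, x_rec, t_query):
    """Zero-order hold: hold each recorded value until the next recording."""
    idx = np.searchsorted(t_rec, t_query, side="right") - 1
    return x_rec[np.clip(idx, 0, len(x_rec) - 1)]

def lin_reconstruct(t_rec, x_rec, t_query):
    """Linear interpolation between consecutive recorded points."""
    return np.interp(t_query, t_rec, x_rec)
```

For a steadily ramping variable, ZOH reproduces a staircase while LIN recovers the ramp exactly, which is one reason LIN usually gives the lower reconstruction error.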

3.1 Performance measures for pattern matching

NP : The size of the candidate pool, i.e., the number of historical data windows that have been labeled similar to the snapshot data by a pattern matching technique. The data windows collected in the candidate pool are called records.

N1 : The number of records in the candidate pool that are actually similar to the current snapshot, i.e., the number of correctly identified records.

N2 : The number of records in the candidate pool that are actually not similar to the current snapshot, i.e., the number of incorrectly identified records. By definition, N1 + N2 = NP.

NDB : The total number of historical data windows that are actually similar to the current snapshot. In general, NDB ≠ NP.

The first metric, the pool accuracy p, characterizes the accuracy of the candidate pool:

3 Pattern matching approach


p ≜ (N1/NP) × 100%    (1)

In this article, the pattern matching methodology described


by Singhal11 and Singhal and Seborg1 is used to compare
historical and current snapshot datasets. First, the user defines the snapshot data that serves as a template for searching
the historical database. The snapshot specifications consist
of: (i) the relevant process variables, and (ii) duration of
the abnormal situation. These specifications can be arbitrarily chosen by the user; no special plant tests or pre-imposed
conditions are necessary.
In order to find periods of operation in historical data
that are similar to the snapshot data, a window of the same
size as the snapshot data is moved through the historical data.
The similarity between the snapshot and the historical data
in the moving window is characterized by the S PCA and S dist
similarity factors.1, 12 The PCA similarity factor compares
two datasets by comparing the angles between the subspaces
spanned by the datasets, while the distance similarity factor
compares datasets by calculating the Mahalanobis distance
between their centers.1
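As an illustration of the subspace-angle idea behind S_PCA (a simplified sketch of Krzanowski's similarity factor,12 reduced to equal weighting of the first k principal components):

```python
import numpy as np

def pca_similarity(X1, X2, k):
    """Simplified PCA similarity factor: average squared cosine of the
    angles between the k-dimensional principal subspaces of two datasets
    (rows = observations, columns = variables)."""
    def loadings(X):
        Xc = X - X.mean(axis=0)                      # mean-center each variable
        _, _, vt = np.linalg.svd(Xc, full_matrices=False)
        return vt[:k].T                              # p x k loading matrix
    L1, L2 = loadings(X1), loadings(X2)
    return float(np.trace(L1.T @ L2 @ L2.T @ L1)) / k
```

Two datasets whose dominant directions of variability coincide score near 1 even if their means differ, which is why a separate distance-based factor is needed to detect mean shifts.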
The historical data windows with the largest values of the
similarity factors are collected in a candidate pool. The indi-


A second metric, the pattern matching efficiency η, characterizes how effective the pattern matching technique is in locating similar records in the historical database. It is defined as:

η ≜ (N1/NDB) × 100%    (2)

Because an effective pattern matching technique should ideally produce large values of both p and η, the average of the two quantities, ξ, is used as a measure of the overall effectiveness of pattern matching:

ξ ≜ (p + η)/2    (3)
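The three performance measures reduce to a few lines of code (variable names are mine; the Greek symbols for the efficiency and the average were garbled in this copy, so they are spelled out):

```python
def matching_metrics(n1, n2, n_db):
    """Pool accuracy p, pattern matching efficiency, and their average:
    n1 correct records, n2 incorrect records, n_db similar windows
    actually present in the historical database."""
    n_p = n1 + n2                    # candidate-pool size N_P
    p = 100.0 * n1 / n_p             # share of the pool that is correct
    eta = 100.0 * n1 / n_db          # share of the similar windows found
    return p, eta, (p + eta) / 2.0
```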

4 Simulation case study: continuous stirred tank reactor example

In order to compare the effect of data compression on pattern matching, a case study was performed for a simulated
chemical reactor. A nonlinear continuous stirred tank reactor

(CSTR) with cooling jacket dynamics, variable liquid level, and a first-order irreversible reaction, A → B, was simulated.
The dynamic model of Russo and Bequette13 based on the
assumptions of perfect mixing and constant physical parameters was used for the simulation. In the simulation study,
white noise is added to several measurements and process
variables in order to simulate the variability present in real
world processes.14


for a given method and each process variable are proportional to their standard deviations. For example, the OSI PI recording limits were chosen as 3σi, while the recording limits for the box-car method were adjusted to produce the same compression ratio as the PI method. Thus, the recording limits for the box-car method were 2.23σi.
The effectiveness of a compression-reconstruction
method was characterized in two ways: (i) reconstruction error, and (ii) degree of similarity between the original data and
the reconstructed data. The S PCA and S dist similarity factors
were used to quantify the similarity between the original and
reconstructed data.

4.1 Generation of recording limits

For the simulation study, 95% Shewhart chart limits were used to calculate the recording limits. The chart limits were constructed using representative data that included small disturbances, as described by Johannesmeyer et al.14 The high and low limits for each variable were calculated using these data.14
The recording limits for each variable were specified by calculating the Shewhart chart limits around the nominal value of each variable. The standard deviation for the ith process variable, σi, was determined using the methodology described above. Then the recording limit for that variable was set equal to cσi, where c is a scaling factor. The value of c was specified differently for each compression method, as described later. The value of the standard deviation, σi, for each measured variable is reported by Singhal.11
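A sketch of the limit generation, assuming σi is estimated as the sample standard deviation of each variable in the representative data:

```python
import numpy as np

def recording_limits(X, c):
    """Recording limit c * sigma_i for each process variable, where sigma_i
    is the sample standard deviation of column i of the representative data."""
    sigma = np.std(X, axis=0, ddof=1)
    return c * sigma
```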


Different data compression methods were first compared on


the basis of reconstruction error. The recording limits for the OSI PI method were set to 3σ, and data compression was performed using PI's proprietary algorithm. The compression ratio was calculated for each of the 28 datasets. The
average compression ratio obtained for the 28 datasets was
14.8. The recording limits for all other methods were then
adjusted using numerical root finding techniques, such as the
bisection method, to obtain an average compression ratio of
approximately 14.8 for each method. The results presented
in Table 1 show that the PI algorithm provides the best
reconstruction of the compressed data, while wavelet-based
compression is second best. Except for the box-car method,
linear interpolation provided better reconstruction than zero-order hold. The common practice of averaging data provides the worst reconstruction.
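The limit adjustment can be sketched as a bisection on a single variable's box-car recording limit (illustrative only; the paper applies the idea per method across 28 datasets):

```python
import numpy as np

def boxcar_cr(x, limit):
    """Compression ratio achieved by a simple box-car pass with the given limit."""
    kept, last = 1, x[0]
    for v in x[1:]:
        if abs(v - last) > limit:
            kept += 1
            last = v
    return len(x) / kept

def limit_for_cr(x, target_cr, n_iter=60):
    """Bisection on the recording limit: a larger limit keeps fewer points,
    so the achieved CR is nondecreasing in the limit."""
    lo, hi = 0.0, float(np.max(x) - np.min(x))   # CR(lo) = 1; CR(hi) is maximal
    for _ in range(n_iter):
        mid = 0.5 * (lo + hi)
        if boxcar_cr(x, mid) < target_cr:
            lo = mid                              # limit too tight: CR too low
        else:
            hi = mid
    return hi    # smallest limit found whose CR meets or exceeds the target
```

Because CR is a step function of the limit, only a discrete set of ratios is achievable; the bisection converges to the smallest limit whose CR reaches the target.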

5 Results and discussion

The data compression methods described in Section 2 were


compared on the basis of the reconstruction error as well as
the compression ratio. The compression ratio (CR) is defined
as

CR ≜ (No. of data points in original dataset) / (No. of data points in compressed dataset)    (4)

5.1 Comparison of different methods with respect to compression and reconstruction


5.2 Effect of data compression on pattern matching

Because the present research is concerned with pattern


matching, it is interesting to investigate the effect of data
compression on pattern matching. It is obvious that data
compression affects pattern matching because the original
and reconstructed data sets are not the same. In order to
evaluate the effect of different compression methods on the
effectiveness of the proposed pattern matching methodology,
similarity factors between the original and reconstructed data were calculated. For scaling purposes,
the original dataset was considered to be the snapshot dataset
while the reconstructed dataset was considered to be the historical dataset. The average values for S PCA , S dist and their
combination, S F = 0.67 S PCA + 0.33 S dist , are presented in
Table 2. Although the averaging compression method performed worst in terms of reconstruction error (cf. Table 1),
it produced compressed datasets that show a high degree of
similarity to the original ones, as indicated by high S PCA and
S dist values. The wavelet compression method produces low
MSE values as well as high S PCA and S dist values. These
results demonstrate that wavelet-based compression is very

and the mean squared error (MSE) of reconstruction is defined as

MSE ≜ (1/mn) Σ_{i=1}^{n} Σ_{j=1}^{m} ε_{i,j}^2    (5)

where m is the number of measurements in the original dataset; n is the number of variables; ε_{i,j} = x_{i,j} − x̂_{i,j}, where x_{i,j} represents the jth measurement of the ith variable in the original data, and x̂_{i,j} is the corresponding reconstructed value.
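Both figures of merit are straightforward to compute from the original and reconstructed arrays:

```python
import numpy as np

def compression_ratio(n_original, n_compressed):
    """Eq. (4): original points stored per compressed point."""
    return n_original / n_compressed

def reconstruction_mse(X, X_hat):
    """Eq. (5): mean squared reconstruction error over all m measurements
    of all n variables."""
    err = np.asarray(X) - np.asarray(X_hat)
    return float(np.mean(err ** 2))
```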
If the recording limit constant, c, is the same for all methods, then the resulting compression ratios will be different for each method. These types of results would indicate how effective each method is for compressing data. However, in order to compare the methods with respect to reconstruction accuracy, it is easier to analyze the results if all methods have the same compression ratio. A constant compression ratio requires adjusting the recording limits individually for each method. Because the accuracy of the data reconstruction is a key concern, the recording limits for each method were varied in order to achieve the same compression ratio. As mentioned in the previous section, the recording limits

Table 1. Data compression and reconstruction results for the CSTR example for a constant compression ratio.

Compression method | Recording limit constant (c) | Reconstruction method | CR | MSE
Box-Car | 2.2295 | Linear | 14.84 | 5.23
Box-Car | 2.2295 | Zero-order hold | 14.84 | 4.91
Backward-slope | 2.7744 | Linear | 14.83 | 4.09
Backward-slope | 2.7744 | Zero-order hold | 14.83 | 8.83
Combination | 2.2003 | Linear | 14.86 | 5.28
Combination | 2.2003 | Zero-order hold | 14.63 | 7.94
Averaging (over 1.25 min) | NA | Linear | 14.6 | 24.69
Averaging (over 1.25 min) | NA | Zero-order hold | 14.63 | 60.35
Wavelet | 2.2669 | Wavelet | 14.83 | 2.61
PI | 3.0 | PI | 14.83 | 0.33

accurate both in terms of reconstruction error and the similarity of the reconstructed and original datasets.
Although the PI algorithm produces a very low MSE,
it does not represent the data very well for pattern matching.
The wavelet method produces both a low MSE and high similarity factor values. The wavelet transform preserves the essential dynamic features of the signal in the detail coefficients
while retaining the correlation structure between the variables in the approximation coefficients. These two features
of the wavelet transform produce low MSE and high S PCA
values between the original and reconstructed data. These
features also minimize mean shifts and result in high S dist
values. By contrast, the PI method records data very accurately and produces very low MSE values, but its variable
sampling rates disrupt the correlation structure between variables and produce low S PCA values. Variable sampling also
affects the mean value of the reconstructed data and produces
low S dist values. The detailed results for different operating
conditions for the CSTR case study are reported by Singhal.11



the entire database was analyzed for one set of snapshot data,
the analysis was repeated for a new snapshot dataset. A total of 28 different snapshot datasets, one for each of the 28
operating conditions, were used for pattern matching.11
Table 3 compares the pattern matching results for historical and snapshot data compressed using different methods. The best pattern matching results were obtained when
the data were compressed using the wavelet method. The
optimum NP values were determined by choosing the value of NP for which ξ had the largest value. Table 3 indicates
that pattern matching is adversely affected by data compression when the data are compressed using either the averaging
method or the combination of box-car and backward slope
compression methods. By contrast wavelet-based compression has very little effect on pattern matching because similar
results are obtained for both compressed and uncompressed
data. Table 4 presents results for the situation when the snapshot data are not compressed while the historical data are
compressed using the wavelet method. The p, η, and ξ values
in Table 4 are slightly lower compared to those in Table 3.
Thus, if the historical data are compressed, it may be beneficial to compress the snapshot data as well to obtain better
pattern matching.

5.3 Pattern matching in compressed historical data

The historical data for the CSTR example described in


Section 4 were compressed using three different methods:
wavelets, averaging, and a combination of the box-car and
backward slope methods. The performance of the proposed
pattern matching technique for compressed historical data
was then evaluated. As described by Singhal,11 and Singhal and Seborg,1 a data window that was the same size as the
snapshot data (S) was moved through the historical database,
100 observations at a time (i.e., w = 100). The ith moving
window was denoted as Hi . For pattern matching, the compressed data were reconstructed using the linear interpolation
method.
The same compression method was used for both the
snapshot and historical data. The snapshot data were then
scaled to zero mean and unit variance. The historical data
were scaled using the scaling factors for the snapshot data.
Similarity factors were then calculated for each Hi . After the

6 Conclusions

A variety of data compression methods have been compared


and evaluated for pattern matching applications using a case
study approach. Classical methods such as box-car, backward slope and data averaging compression methods do not
accurately represent data either in terms of reconstruction error or similarity with the original dataset. Data compressed
using the PI software represent the original data very accurately, but produce somewhat lower similarity factor values. Compression using the wavelet method produces reconstruction errors that are higher than those obtained with PI, but much lower than those of conventional compression methods such as box-car. Data compressed using wavelets also show a
high degree of similarity with the original data.
For pattern matching applications, it is beneficial to compress the snapshot data prior to performing pattern matching.

Table 2. Effect of different data compression and reconstruction methods on pattern matching for the CSTR example.

Compression method | Recording limit constant (c) | Reconstruction method | S_PCA | S_dist | SF*
Box-Car | 2.2295 | Linear | 0.88 | 0.67 | 0.81
Box-Car | 2.2295 | Zero-order hold | 0.87 | 0.83 | 0.86
Backward-slope | 2.7744 | Linear | 0.84 | 0.63 | 0.77
Backward-slope | 2.7744 | Zero-order hold | 0.83 | 0.39 | 0.68
Combination | 2.20025 | Linear | 0.87 | 0.67 | 0.80
Combination | 2.20025 | Zero-order hold | 0.85 | 0.79 | 0.83
Averaging (over 1.25 min) | NA | Linear | 0.92 | 0.99 | 0.94
Averaging (over 1.25 min) | NA | Zero-order hold | 0.93 | 0.97 | 0.94
Wavelet | 2.2669 | Wavelet | 0.95 | >0.99 | 0.97
PI | 3.0 | PI | 0.88 | 0.71 | 0.82

*SF = 0.67 S_PCA + 0.33 S_dist

Table 3. Effect of data compression on pattern matching for the CSTR example when both the snapshot and historical data are compressed using the same method.

Compression method | Similarity factor | Opt. NP | p (%) | η (%) | ηmax (%) | ξ (%)
Original data | S_PCA only | 34 | 43 | 90 | 99 | 66
Original data | S_dist only | 25 | 41 | 68 | 97 | 54
Original data | SF | 14 | 75 | 72 | 88 | 74
Combination | S_PCA only | 41 | 30 | 78 | 99 | 54
Combination | S_dist only | 59 | 19 | 75 | 100 | 47
Combination | SF | 15 | 65 | 67 | 91 | 66
Averaging | S_PCA only | 21 | 49 | 65 | 95 | 57
Averaging | S_dist only | 24 | 40 | 65 | 96 | 53
Averaging | SF | 17 | 64 | 73 | 92 | 68
Wavelet | S_PCA only | 34 | 38 | 82 | 99 | 60
Wavelet | S_dist only | 52 | 25 | 83 | 100 | 54
Wavelet | SF | 16 | 71 | 76 | 92 | 73

SF = 0.67 S_PCA + 0.33 S_dist

For the simulated case study, data compression had only a minor effect on the effectiveness of a new pattern matching strategy.11

Acknowledgements

The authors thank OSI Software for providing financial support and the data archiving software PI, and Gregg LeBlanc at OSI for providing software support during the research. Financial support from ChevronTexaco Research and Technology Co. is also acknowledged.

References

(1) Singhal, A. and Seborg, D. E. Pattern Matching in Multivariate Time Series Databases Using a Moving Window Approach. Ind. Eng. Chem. Res., 2002. 41, 3822–3838.

(2) Hale, J. C. and Sellars, H. L. Historical Data Recording For Process Computers. Chemical Eng. Prog., 1981. 77(11), 38–43.

(3) Kennedy, J. P. Building an Industrial Desktop. Chemical Engr., 1996. 103(1), 82–86.

(4) Bristol, E. H. Swinging Door Trending: Adaptive Trend Recording? In Advances in Instrumentation and Control, volume 45. Instrument Society of America, Research Triangle Park, NC, 1990, 749–754.

(5) Mah, R. S. H.; Tamhane, A. C.; Tung, S. H. and Patel, A. N. Process Trending With Piecewise Linear Smoothing. Comput. Chem. Engr., 1995. 19, 129–137.
Table 4. Effect of data compression on pattern matching when snapshot data are not compressed and historical data are compressed.

Compression method | Similarity factor | Opt. NP | p (%) | η (%) | ηmax (%) | ξ (%)
Original data | S_PCA only | 34 | 43 | 90 | 99 | 66
Original data | S_dist only | 25 | 41 | 68 | 97 | 54
Original data | SF | 14 | 75 | 72 | 88 | 74
Combination | S_PCA only | 48 | 26 | 76 | 100 | 51
Combination | S_dist only | 40 | 25 | 67 | 99 | 46
Combination | SF | 15 | 59 | 63 | 91 | 61
Averaging | S_PCA only | 60 | 23 | 85 | 100 | 54
Averaging | S_dist only | 16 | 52 | 57 | 92 | 54
Averaging | SF | 16 | 63 | 70 | 92 | 66
Wavelet | S_PCA only | 39 | 31 | 75 | 99 | 53
Wavelet | S_dist only | 15 | 50 | 53 | 91 | 52
Wavelet | SF | 14 | 68 | 67 | 88 | 68

SF = 0.67 S_PCA + 0.33 S_dist

(6) Bakshi, B. R. and Stephanopoulos, G. Compression of Chemical Process Data Through Functional Approximation and Feature Extraction. AIChE J., 1996. 42, 477–492.

(7) Misra, M.; Kumar, S.; Qin, S. J. and Seemann, D. Error Based Criterion for On-Line Wavelet Data Compression. J. Process Control, 2001. 11, 717–731.

(8) Watson, M. J.; Liakopoulos, A.; Brzakovic, D. and Georgakis, C. A Practical Assessment of Process Data Compression Techniques. Ind. Eng. Chem. Res., 1998. 37, 267–274.

(9) Nelson, P. R. C.; Taylor, P. A. and MacGregor, J. F. Missing Data Methods in PCA and PLS: Score Calculations with Incomplete Observations. Chemometrics and Intel. Lab. Syst., 1996. 19, 45–65.

(10) Roweis, S. EM Algorithms for PCA and SPCA. In Neural Information Processing Systems 11 (NIPS'98). 1997, 626–632.

(11) Singhal, A. Pattern Matching in Multivariate Time-Series Data. Ph.D. Dissertation, University of California, Santa Barbara, CA, 2002.

(12) Krzanowski, W. J. Between-Groups Comparison of Principal Components. J. Amer. Stat. Assoc., 1979. 74(367), 703–707.

(13) Russo, L. P. and Bequette, B. W. Effect of Process Design on the Open-Loop Behavior of a Jacketed Exothermic CSTR. Comput. Chem. Eng., 1996. 20, 417–426.

(14) Johannesmeyer, M. C.; Singhal, A. and Seborg, D. E. Pattern Matching in Historical Data. AIChE J., 2002. 48, 2022–2038.
