Stream Processing Algorithms That Model Behavior Change
Agostino Capponi (acapponi@cs.caltech.edu)
Mani Chandy (mani@cs.caltech.edu)
Computer Science Department, California Institute of Technology, USA
Caltech Computer Science Technical Report CSTR:2005.004, February 2005

Abstract

This paper presents algorithms that fuse information in multiple event streams to update models that represent system behavior. System behaviors vary over time; for example, an information network varies from heavily loaded to lightly loaded conditions; patterns of incidence of disease change at the onset of pandemics; file access patterns change from proper usage to improper use that may signify insider threat. The models that represent behavior must be updated frequently to adapt to changes rapidly; in the limit, models must be updated continuously with each new event. Algorithms that adapt to changes in behavior must depend on the appropriate length of history: algorithms that give too much weight to the distant past will not adapt to changes in behavior rapidly; algorithms that don't consider enough past information may conclude incorrectly, from noisy data, that behavior has changed while the actual behavior remains unchanged. Efficient algorithms are incremental: the computational time required to incorporate each new event should be small and ideally independent of the length of the history.

Keywords: stream processing, parameter estimation, sense and respond systems, incremental computation, behavioral change.

1 Introduction

1.1 Overview

A sense and respond system (1) estimates the history of global states of the environment from information in streams of events and other data, (2) detects critical conditions (threats or opportunities) by analyzing this history, and (3) then responds in a timely fashion to these conditions. A sense and respond system can be specified by a set of rules where each rule is a pair: a condition and a response. The condition is a predicate on the history of estimated global states and the response is an action. A sense and respond system learns about its environment from information in event streams and other data sources. The environment is represented by a model, and algorithms continuously update models as new events arrive on an event stream. The learned model is used to determine best responses to critical conditions.

In some problem areas, critical conditions are signaled by changes in the behavior of a system. Information assurance systems monitor applications, such as email, and usage of information assets, such as files, to get alerts when changes in behavior that may signal misuse are detected. Financial applications detect potential noncompliance with regulations by monitoring changes in patterns of income and expenditure. Pharmaceutical companies monitor changes in patterns of problems reported by customers to detect potential problems with products.

These applications develop and continuously update models of system behavior. As system behavior changes, model parameters change too, and significant changes in parameters indicate probable changes in behavior. The systems of interest consist of groups of entities. In the pharmaceutical example, for instance, the system consists of all customers who have bought a product, and the events in the system are activities by customers such as the logging of a complaint or an indication of satisfaction. The system generates a stream of events: the sequence of events generated by all the customers collectively. Successive events may be generated by different entities; for example, a complaint may be registered by one customer followed by complaints by many other customers before an event is generated by the first customer again. Filtering algorithms, such as the Kalman filtering algorithm, assume a model of the evolution of state over time, such as

    x_k = f(x_{k-1}) + v
    y_k = g(x_k) + w                                    (1)

where x_k is the state of the system at time k, y_k is a signal at time k, and v and w are random variables. In the examples considered in this paper, such relationships between the signals at successive times may not exist because the signals are generated by different entities. Therefore, filtering algorithms are less appropriate than other kinds of statistical algorithms.
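As a concrete illustration of the kind of state-space model in equation (1) that filtering algorithms assume, the sketch below simulates a scalar linear system. The particular choices of f, g, and the noise scales are our assumptions for the sketch, not taken from the paper.

```python
import random

# Illustrative instance of equation (1): x_k = f(x_{k-1}) + v, y_k = g(x_k) + w.
# f, g, and the noise standard deviations are assumptions made for this sketch.
def f(x):
    return 0.9 * x          # assumed state-transition function

def g(x):
    return 2.0 * x          # assumed observation function

def simulate(steps, x0=1.0, seed=0):
    rng = random.Random(seed)
    x, signals = x0, []
    for _ in range(steps):
        x = f(x) + rng.gauss(0.0, 0.1)              # process noise v
        signals.append(g(x) + rng.gauss(0.0, 0.1))  # observation noise w
    return signals

signals = simulate(100)
```

In the settings considered in this paper, successive signals come from different entities, so no such transition structure links y_k to y_{k+1}; this is why the paper turns to other statistical estimators instead.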
1.2 Model of Behavior

A signal is represented by a point in a multidimensional space where the dimensions are attributes of behavior. The dimensions in the pharmaceutical example dealing with blood sugar monitors include the age of the product, the strength of the battery, the type of erroneous reading, the length of experience with this type of product, and so on. Our algorithm is fed a stream of signals (sometimes called event information) and thus is continuously fed new points in this space. A model is a surface in this space, and a metric of the fitness of the model is the average mean-square distance of points from the surface.

The system may change its behavior and the change may be gradual or abrupt. In the pharmaceutical example, a change may be caused by the introduction of a defective component in some batches of the product. The signals that are generated after the change reflect the changed behavior. The algorithm updates model parameters with each signal it receives with the goal of maintaining an accurate model at all times. Fig. 1 illustrates a changing behavior in 3-dimensional space. The black dots represent signals generated before a change and the circles represent signals after the change, where the collection of black dots falls near one hyperplane and the white dots near another.

Fig. 1: Two surfaces corresponding to two different models of behaviour. The points marked with o belong to one surface and points marked with * to the other.

1.3 Incremental Algorithms for Stream Processing

A stream processing algorithm takes a sequence of events as its inputs and scans this sequence only once in increasing order of occurrence [?, ?]. The computational complexity measures are the space used by the algorithm and the time required to process each new event. An incremental algorithm is one for which the time required to fuse a single event with the history of events is small compared with the length of the history; we seek algorithms in which the time is independent of the length of the history, or is a low-degree polylog or polynomial. For example, consider the computation of a moving-point average over a window. When the window moves by one value, the computation of the new moving-point average can be done in time independent of the length of the window: merely add the leading-edge value and subtract the trailing-edge value. We present algorithms for adapting to behavioral change that come close to being incremental.

1.4 Related work

The field of adaptive stream processing has received much attention recently [?, ?, ?, ?]. For many applications today, e.g. sensor network applications, data are produced continuously and algorithms must provide accurate answers in real time. The main techniques for continuous stream processing are summarization, which provides concise representations of a data set using data structures at the expense of accuracy in the answer, and adaptive algorithms, which update the structure as new data arrive in a reasonable amount of time. An adaptive approach proposed in [?] processes continuous streams and provides answers within a guaranteed approximation factor. Another interesting approach in [?] uses sensor network observations to provide estimates of the temperature of the surroundings within a user-defined confidence interval. Correlations between consecutive sensor readings are used to achieve fast computation. The algorithm described in this paper extends earlier work on stream processing to deal with changing behaviors.

1.5 Types of Models

A common model for a set of data points is regression. One of the variables of the model is identified to be a dependent variable and the other variables are independent variables. A model predicts the value of the dependent variable given the values of all the independent variables. The differences between the values of the dependent variables in the actual set of data points and the values predicted by the model are the errors of the model.

We are not trying to predict one variable given values of others; we are trying to estimate models of behavior as behaviors change. For our purposes, all variables are equivalent. One approach when dealing with n variables is to use n separate regression models, where each regression model singles out one of the variables to be the dependent variable. A change in any of the n regression models signifies a change in behavior.

An alternate approach is orthogonal regression [?] in which error is defined as the minimum distance of a data point from the surface. As a trivial illustrative example, consider a model with two parameters x and y, where the model is represented by the line x + y = 1. Consider a data point (1, 1). Let y be the dependent variable in a regression model. The value of y predicted by the model when x = 1 is y = 0. Hence, the error in the standard regression model corresponding to the point (1, 1) is the difference (1) between the actual (1) and predicted (0) values. By contrast, the error in the orthogonal regression model is the shortest distance from the point (1, 1) to the line, and this error is 1/√2.
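The two error definitions in the worked example above can be checked directly; the snippet below computes both errors for the point (1, 1) against the model x + y = 1.

```python
import math

# Errors of the data point (1, 1) relative to the model x + y = 1,
# as in the worked example in the text.
x0, y0 = 1.0, 1.0

# Standard regression (y dependent): the model predicts y = 1 - x,
# so the error is the vertical difference between actual and predicted y.
regression_error = abs(y0 - (1.0 - x0))

# Orthogonal regression: perpendicular distance from (1, 1)
# to the line x + y - 1 = 0.
orthogonal_error = abs(x0 + y0 - 1.0) / math.sqrt(1**2 + 1**2)

print(regression_error, orthogonal_error)  # 1.0 and 1/sqrt(2) ≈ 0.7071
```

The orthogonal error is smaller here, and unlike the regression error it does not privilege any one coordinate, which matches the paper's view that all variables are equivalent.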
Another model is the degenerate case of a surface where the model is represented by a single point. As in the general case, the error corresponding to a data point is the minimum distance of the data point from the surface which, in this case, degenerates to the distance of the data point from the point p that represents the model. The problem simplifies to finding the point p that minimizes total error. As signals arrive on the stream, point p is recomputed with each new signal, and a behavior change corresponds to a significant change in the value of p.

If all the data points are uniformly distributed in a sphere around a point, then a model which is a single point is better than a model which is a surface. Indeed, in this case every hyperplane through point p is optimal; thus the hyperplanes carry no more information than the single point p. If, however, the points lie near an extended surface, then a surface model is better than a point model. One solution is to have a general model where the number of dimensions of the surface can vary; for instance, in a 3-parameter (and hence 3-dimensional space) model, the modeling surface could change continuously between being a plane, a line and a point. In this paper we restrict ourselves to the case where points fall closer to a surface than to a point.

Regression models are represented by equations in which the independent variables can get arbitrarily large. In many systems, the ranges of variables are limited by physical constraints. For example, in an information assurance system monitoring file accesses there is a physical limit to the rate at which files can be accessed by a single process. Limiting the ranges of variables changes error estimates. Consider the trivial illustrative example given earlier with variables x and y, where the model is x + y = 1. Consider the error due to a data point (0, 2) for two cases: (a) the unlimited variable range [−∞, +∞] and (b) the ranges of x and y limited to [0, 1]. The error in the former case is the minimum distance of the point from the line, and this distance is 1/√2. The error in the latter case is the minimum distance of the point from the line segment, and this is the distance (1 unit) from the point (0, 2) to the end (0, 1) of the segment.

A change in behavior may be indicated by a change in the ranges of variables even if there is no change in the surface that represents the model: a change in the length of a line segment may be significant even if the line itself does not change. In this paper we do not consider this issue.

2 Theory

2.1 Exponential smoothing and sliding window

A key issue is that of determining the weight to be given to old information in estimating models: too much weight given to old information results in algorithms that do not update models rapidly; but the more information that is used, the better the estimates in the presence of noise. Popular algorithms for dealing with different emphases on newer and older data are sliding window and exponential smoothing. A sliding window algorithm with window size W estimates a model using only the most recent W data points; it treats all W data points in the window with equal weight, and effectively gives zero weight to points outside the window. An exponential smoothing algorithm with weight α gives a weight of α^k to a data point k units in the past, where α is positive and at most 1; thus an algorithm using a small value of α forgets faster. We refer to α as the forgetting factor.

Incremental stream-processing algorithms can be obtained for both sliding window and exponential smoothing. Appendix 2 shows how this is done for the exponential smoothing case. The proof for the sliding window is similar.

An important issue is that of determining the appropriate α to use at each time T. The value of α can range from 1 (in which case all points from 0 to T are weighted equally) to 0 (in which case only the data point that arrived at T is considered). Small values of α adapt to changes rapidly because they give less weight to old data, whereas large values of α are better at smoothing out noise.

One approach is to change the relative weights given to old and new data when a change is estimated. For instance, suppose the algorithm estimates at time 103 that with high probability a change occurred at time 100; the algorithm then reduces the weight given to signals received before 100 and increases the weight given to signals received after 100. A disadvantage of this approach is that if the algorithm estimates that a behavioral change has taken place when, in reality, no change has occurred, then the algorithm discards valuable old data needlessly. The same approach can be used with sliding windows.

2.2 Experimental Setup

At any point in time, the behavior of a system is captured by a model which is represented by a bounded surface. Our algorithm attempts to estimate the true model given a sequence of noisy signals. We call the model and the surface estimated from signals the estimated values, as opposed to the true values. The true model is changed at some point in time during the experiment, and we evaluate whether the estimated model follows the true model accurately.

At each point in time, a signal is generated as follows. First a point q is generated randomly on the true bounded surface, then a scalar error term e is generated randomly using the given error distribution, and finally a data point r is generated where r = q + e·v, where v is the unit normal to the true surface at point q.
2.2.1 Algorithm

Input at time T: a sequence of T − 1 points that arrived in the time interval [0, T − 1], and a new point that arrived at time T.
Output at time T: an estimate of the surface (a hyperplane in a linear model) at time T.
Goal: minimize the deviation between the estimated and true surfaces.

The true model changes over time, and the manner of change is described separately.

2.2.2 Angle between planes

One measure of efficacy of fit of the estimated model to the true model is the angle between the surfaces. The inner product of the unit normals to the hyperplanes representing the estimated and true models is the cosine of the angle between the hyperplanes, and we use this as a measure of goodness. The cosine is 1 if the hyperplanes are parallel and is 0 if they are orthogonal.

2.2.3 Comparison of distances of points from true and estimated surfaces

Another measure of goodness of fit is represented by the differences in distances of data points from the true and estimated surfaces. Let D_{k,t} be the minimum distance of the data point that arrived at time k from the true surface at time t, and let d_{k,t} be the minimum distance of the data point that arrived at time k from the surface estimated at time t. Let E be defined as follows:

    E = Σ_k (D_{k,k} − d_{k,k})²                        (2)

Now E is a measure of goodness: the smaller the value of E, the better the resulting estimate. We call E the relative distance error. Notice that this parameter can be nonzero even if the true and estimated hyperplanes are parallel, because this error term is zero if and only if the two hyperplanes are the same.

Appendix 1 shows that the solution that minimizes E can be obtained by solving either a convex optimization problem (finding a minimum eigenvalue) or a square system of equations. For now we present results using only the second method.

3 Experiments

We restrict ourselves to linear models; thus, the surface is a hyperplane in a multidimensional space. In each of our experiments we assume that we are given the true model; we generate noisy data from the true model; compute an estimated model from the noisy data; and compare the true and estimated models. At an arbitrary point in time, we change the true model. Noisy data is now generated using the new true model. (The distribution of noise terms is assumed to remain unchanged even though the true model changes.) Since the estimation algorithm has no specific indication that the true model has changed, the algorithm uses data before the change as well as points after the change. Therefore, the estimated model may not be close to the new true model during and immediately after the change. We would like the estimated model to become close to the true model as the time since the last change increases.

The set of experiments has been restricted to 2-dimensional surfaces. Noise is assumed to be Gaussian. This is not a necessary assumption; in fact the algorithm may be applied with any white noise vector. We consider cases where the noise is low (σ² = 1) or high (50 ≤ σ² ≤ 100). We study the effect of different values of α on the accuracy of the model. We choose values of α as follows. We pick a positive integer w that we call the window size (not to be confused with the window in the sliding window algorithm) and a positive number θ (which is at most 1) that we call the threshold. The value of θ was set to 0.5 in the experiments. Given w and θ, we pick an α such that the total weight assigned to all the signals w or more time units into the past is exactly (up to a rounding error) θ. For instance, if w = 4 and θ = 0.5 then we know that the first w signals have a total weight of 1/2, the next w have a total weight of 1/4, the next w have a weight of 1/8, and so forth.

3.1 Experiments with changing behaviors

We ran many experiments. In each experiment a change occurs after a certain number of time points. Each change is a translation followed by a rotation of the (true) plane. Fig. 2 illustrates the 2D case; each line is associated with a behavior change. Here we report on the following experiments:

- The true model is changed after 500 time points. The translation is 0.75 and the rotation is 10 degrees.
- The true model is changed after 50 time points. The translation is 0.02 and the rotation is 1 degree.
- The true model is changed after 5 time points. The translation is 0.0018 and the rotation is 0.1 degrees.

Each experiment was run for several thousands of points and thus covered many changes of the true model. For ease of visualization, we only show 1500 points in the figures; however, the same pattern occurs for larger numbers of points.

Fig. 3 shows the cosine and the angle between the true and estimated hyperplanes at different points in time.
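One consistent reading of the rule above is that, with weights α^k normalized over the whole history, the total weight of signals w or more steps old is α^w; setting α^w = θ gives α = θ^(1/w). This closed form is our reconstruction (the paper states the rule only in words), but it reproduces the forgetting factors that appear in the figure legends.

```python
# If a data point k steps old gets weight alpha**k, the normalized total
# weight of all points w or more steps old is alpha**w. Setting
# alpha**w = theta and solving gives alpha = theta ** (1 / w).
# (This closed form is our reconstruction of the rule stated in the text.)
def forgetting_factor(w, theta=0.5):
    return theta ** (1.0 / w)

# With theta = 0.5, windows of 2, 32, and 64 steps reproduce the alpha
# values shown in the figure legends.
print(round(forgetting_factor(2), 5))   # 0.70711
print(round(forgetting_factor(32), 5))  # 0.97857
print(round(forgetting_factor(64), 5))  # 0.98923
```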
Fig. 2: The model of behavioral change used in the experiments.

Fig. 3: The scalar product between the vector of estimated and actual coefficients. A change occurs every 500 iterations. Threshold = 0.5, Noise variance = 100.

Fig. 6: The relative distance error between the estimated and actual plane. A change occurs every 500 iterations. Threshold = 0.5, Noise variance = 100.

Fig. 7: The scalar product between the vector of estimated and actual coefficients. A change occurs every 50 iterations. Threshold = 0.5, Noise variance = 100.

Fig. 8: The relative distance error between the estimated and actual plane. A change occurs every 50 iterations. Threshold = 0.5, Noise variance = 100.

Fig. 9: The scalar product between the vector of estimated and actual coefficients. A change occurs every 5 iterations. Threshold = 0.5, Noise variance = 100.

(Each figure plots curves for the forgetting factors α = 0.70711, α = 0.97857, and α = 0.98923.)
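One step of the experiment pipeline behind these figures can be sketched end-to-end: generate noisy signals from an assumed true line, fit a line by exponentially weighted orthogonal (total least squares) regression, and compute the cosine-of-angle goodness measure. The closed-form 2-D principal-direction solution, the true line, and all constants below are our assumptions, standing in for the method of Appendix 1.

```python
import math
import random

# Exponentially weighted orthogonal (total least squares) fit of a 2-D line:
# the newest point (last in the list) gets weight 1, a point k steps older
# gets weight alpha**k.
def fit_line_normal(points, alpha):
    n = len(points)
    wts = [alpha ** (n - 1 - k) for k in range(n)]
    W = sum(wts)
    mx = sum(w * p[0] for w, p in zip(wts, points)) / W
    my = sum(w * p[1] for w, p in zip(wts, points)) / W
    sxx = sum(w * (p[0] - mx) ** 2 for w, p in zip(wts, points))
    syy = sum(w * (p[1] - my) ** 2 for w, p in zip(wts, points))
    sxy = sum(w * (p[0] - mx) * (p[1] - my) for w, p in zip(wts, points))
    # Principal direction of the weighted scatter matrix [[sxx, sxy], [sxy, syy]];
    # the line's unit normal is perpendicular to it.
    theta = 0.5 * math.atan2(2.0 * sxy, sxx - syy)
    return (-math.sin(theta), math.cos(theta))

# Assumed true model: the line x + y = 1, unit normal (1, 1)/sqrt(2).
true_normal = (1.0 / math.sqrt(2), 1.0 / math.sqrt(2))
rng = random.Random(1)
pts = []
for _ in range(200):
    x = rng.uniform(0.0, 1.0)
    e = rng.gauss(0.0, 0.05)                     # noise along the normal
    pts.append((x + e * true_normal[0], (1.0 - x) + e * true_normal[1]))

nx, ny = fit_line_normal(pts, alpha=0.98923)
cosine = abs(nx * true_normal[0] + ny * true_normal[1])  # goodness measure
```

A cosine near 1 indicates the estimated and true hyperplanes are nearly parallel, which is the behavior the figures show once enough post-change signals have arrived.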
3.2 Adaptive Algorithms

Adaptive algorithms change the relative weights assigned to older and newer data when a change is detected. The figures show that when a change is abrupt, the change can be detected readily and adaptive algorithms work well. If the change is gradual but frequent, then for large values of α the algorithm may come close to complete recovery, but never recovers fully. This clearly appears in Figures 7, 8, 9 and 10.

An alternative strategy is to compare the model at time T with the models at previous times t, where t ranges from T to T − M and M is a constant window size. So far, we have only discussed the case where M = 1, which is sufficient for significant changes. If the algorithm detects a change between any model at a previous time t and the current time T, then the algorithm adapts the weights, giving greater weight to signals after time t and less to signals before time t. These experiments are ongoing, and we will report on them in the full version of the paper.

3.2.1 Number of steps for convergence for different numbers of parameters and noise

We show in the appendix that the computational complexity for handling each new signal value is a constant independent of the length of the history, except for the case of computing the eigenvalue. In our experiments we compute solutions iteratively using the Matlab routine fsolve, using the solution obtained at time t as the initial guess for the computation at time t + 1. Our rationale for choosing such iterative algorithms is that when there is no change in the true model, we expect little or no change in the estimated model, and hence this method of obtaining an initial guess should be very good. We found that about 12 steps were required to converge from a random guess whereas only 4 steps were needed by using the time t value as the initial guess for the time t + 1 computation.
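The warm-start effect described above can be illustrated with a simple Newton iteration standing in for Matlab's fsolve; the test function, tolerance, and starting points are assumptions for the sketch, not the paper's estimation equations.

```python
# Warm-start illustration: Newton's method (a stand-in for the Matlab fsolve
# routine used in the paper) converges in fewer steps when started from the
# previous time step's solution than from a distant guess. The function
# below is an illustrative assumption.
def newton(func, dfunc, x0, tol=1e-10, max_iter=100):
    x, steps = x0, 0
    while abs(func(x)) > tol and steps < max_iter:
        x -= func(x) / dfunc(x)
        steps += 1
    return x, steps

f = lambda x: x ** 3 - 2.0 * x - 5.0      # root near 2.09455
df = lambda x: 3.0 * x ** 2 - 2.0

root, cold_steps = newton(f, df, x0=10.0)        # distant "random" guess
_, warm_steps = newton(f, df, x0=root + 1e-3)    # guess from "previous time step"
print(cold_steps, warm_steps)  # the warm start needs far fewer steps
```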
Fig. 10: Relative distance error against the number of iterations, for α = 0.70711, α = 0.97857, and α = 0.98923.

References

[4] D. Eberly. 3D Game Engine Design. Morgan Kaufmann, 2001.

[5] Amol Deshpande, Carlos Guestrin, Samuel R. Madden, Joseph M. Hellerstein, and Wei Hong. Model-driven data acquisition in sensor networks.