Stream Processing Algorithms that Model Behavior Change

Agostino Capponi (acapponi@cs.caltech.edu) and Mani Chandy (mani@cs.caltech.edu)
Computer Science Department, California Institute of Technology, USA

Caltech Computer Science Technical Report Caltech CSTR:2005.004, February 2005

Abstract: This paper presents algorithms that fuse information in multiple event streams to update models that represent system behavior. System behaviors vary over time; for example, an information network varies from heavily loaded to lightly loaded conditions; patterns of incidence of disease change at the onset of pandemics; file access patterns change from proper usage to improper use that may signify insider threat. The models that represent behavior must be updated frequently to adapt to changes rapidly; in the limit, models must be updated continuously with each new event. Algorithms that adapt to change in behavior must depend on the appropriate length of history: algorithms that give too much weight to the distant past will not adapt to changes in behavior rapidly; algorithms that don't consider enough past information may conclude incorrectly, from noisy data, that behavior has changed while the actual behavior remains unchanged. Efficient algorithms are incremental: the computational time required to incorporate each new event should be small and ideally independent of the length of the history.

Keywords: stream processing, parameter estimation, sense and respond systems, incremental computation, behavioral change.
1 Introduction

1.1 Overview

A sense and respond system (1) estimates the history of global states of the environment from information in streams of events and other data, (2) detects critical conditions (threats or opportunities) by analyzing this history, and (3) then responds in a timely fashion to these conditions. A sense and respond system can be specified by a set of rules where each rule is a pair: a condition and a response. The condition is a predicate on the history of estimated global states and the response is an action. A sense and respond system learns about its environment from information in event streams and other data sources. The environment is represented by a model, and algorithms continuously update models as new events arrive on an event stream. The learned model is used to determine best responses to critical conditions.

In some problem areas, critical conditions are signaled by changes in behavior of a system. Information assurance systems monitor applications, such as email, and usage of information assets, such as files, to get alerts when changes in behavior that may signal misuse are detected. Financial applications detect potential noncompliance to regulations by monitoring changes in patterns of income and expenditure. Pharmaceutical companies monitor changes in patterns of problems reported by customers to detect potential problems with products.

These applications develop and continuously update models of system behavior. As system behavior changes, model parameters change too, and significant changes in parameters indicate probable changes in behavior. The systems of interest consist of groups of entities. In the pharmaceutical example, for instance, the system consists of all customers who have bought a product, and the events in the system are activities by customers such as the logging of a complaint or an indication of satisfaction. The system generates a stream of events: the sequence of events generated by all the customers collectively. Successive events may be generated by different entities; for example, a complaint may be registered by one customer followed by complaints by many other customers before an event is generated by the first customer again. Filtering algorithms, such as the Kalman filtering algorithm, assume a model of the evolution of state over time, such as

x_k = f(x_{k-1}) + v
y_k = g(x_k) + w        (1)

where x_k is the state of the system at time k, y_k is a signal at time k, and v and w are random variables. In the examples considered in this paper, such relationships between the signals at successive times may not exist because the signals are generated by different entities. Therefore, filtering algorithms are less appropriate than other kinds of statistical algorithms.
1.2 Model of Behavior

A signal is represented by a point in a multidimensional space where the dimensions are attributes of behavior. The dimensions in the pharmaceutical example dealing with blood sugar monitors include the age of the product, the strength of the battery, the type of erroneous reading, the length of experience with this type of product, and so on. Our algorithm is fed a stream of signals (sometimes called event information) and thus is continuously fed new points in this space. A model is a surface in this space, and a metric of the fitness of the model is the average mean-square distance of points from the surface.

The system may change its behavior, and the change may be gradual or abrupt. In the pharmaceutical example, a change may be caused by the introduction of a defective component in some batches of the product. The signals that are generated after the change reflect the changed behavior. The algorithm updates model parameters with each signal it receives, with the goal of maintaining an accurate model at all times. Fig. 1 illustrates a changing behavior in 3-dimensional space. The black dots represent signals generated before a change and the circles represent signals after the change, where the collection of black dots falls near one hyperplane and the white dots near another.
Fig. 1: Two surfaces corresponding to two different models of behaviour. The points marked with o belong to one surface and points marked with * to the other.
1.3 Incremental Algorithms for Stream Processing

A stream processing algorithm takes a sequence of events as its inputs and scans this sequence only once, in increasing order of occurrence [?, ?]. The computational complexity measures are the space used by the algorithm and the time required to process each new event. An incremental algorithm is one for which the time required to fuse a single event with the history of events is small compared with the length of the history; we seek algorithms in which the time is independent of the length of the history or is a low-degree polylog or polynomial. For example, consider the computation of a moving average over a window. When the window moves by one value, the computation of the new moving average can be done in time independent of the length of the window: merely add the leading-edge value and subtract the trailing-edge value. We present algorithms for adapting to behavioral change that come close to being incremental.
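As a concrete illustration of such an update (a minimal sketch of our own, not code from the paper), the following Python fragment maintains the windowed average in O(1) time per event; the class and method names are ours.

```python
from collections import deque

class MovingAverage:
    """Average over a sliding window, maintained incrementally.

    Each new value is absorbed in O(1) time, independent of the window
    length W: add the leading-edge value and, once the window is full,
    subtract the trailing-edge value.
    """
    def __init__(self, W):
        self.W = W
        self.window = deque()
        self.total = 0.0

    def update(self, x):
        self.total += x                          # add the leading-edge value
        self.window.append(x)
        if len(self.window) > self.W:
            self.total -= self.window.popleft()  # subtract the trailing-edge value
        return self.total / len(self.window)
```

The same add-and-subtract pattern underlies the incremental computations of Appendix 2, where the running summaries are matrices and vectors rather than a single scalar total.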
1.4 Related work

The field of adaptive stream processing has received much attention recently [?, ?, ?, ?]. For many applications today, e.g. sensor network applications, data are produced continuously and algorithms must provide accurate answers in real time. Two of the main techniques for dealing with continuous stream processing are summarization, which provides concise representations of a data set using data structures at the expense of accuracy in the answer, and adaptive algorithms, which update the structure as new data arrive in a reasonable amount of time. An adaptive approach proposed in [?] processes continuous streams and provides answers within a guaranteed approximation factor. Another interesting approach in [?] uses sensor network observations to provide estimates of the temperature of the surroundings within a user-defined confidence interval. Correlation between consecutive sensor readings is used to achieve fast computation. The algorithm described in this paper extends earlier work on stream processing to deal with changing behaviors.
1.5 Types of Models

A popular way of estimating a model that fits a set of points is regression. One of the variables of the model is identified to be a dependent variable and the other variables are independent variables. A model predicts the value of the dependent variable given the values of all the independent variables. The differences between the values of the dependent variables in the actual set of data points and the values predicted by the model are the errors of the model.

We are not trying to predict one variable given values of others; we are trying to estimate models of behavior as behaviors change. For our purposes, all variables are equivalent. One approach when dealing with n variables is to use n separate regression models, where each regression model singles out one of the variables to be the dependent variable. A change in any of the n regression models signifies a change in behavior.

An alternate approach is orthogonal regression [?], in which error is defined as the minimum distance of a data point from the surface. As a trivial illustrative example, consider a model with two parameters x and y, where the model is represented by the line x + y = 1. Consider a data point (1, 1). Let y be the dependent variable in a regression model. The value of y predicted by the model when x = 1 is y = 0. Hence, the error in the standard regression model corresponding to point (1, 1) is the difference (1) between the actual (1) and predicted (0) values. By contrast, the error in the orthogonal regression model is the shortest distance from the point (1, 1) to the line, and this error is 1/√2.
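The two error notions in this trivial example are easy to check numerically. In the sketch below (ours), the line x + y = 1 is written in normal form a'p + a_0 = 0 with ||a|| = 1, so the orthogonal error of a point p is simply |a'p + a_0|.

```python
import numpy as np

point = np.array([1.0, 1.0])

# Standard regression error, with y as the dependent variable:
# the model predicts y = 1 - x, so at x = 1 it predicts y = 0 and the
# error is the vertical difference between actual and predicted values.
regression_error = point[1] - (1.0 - point[0])       # 1.0

# Orthogonal regression error: the line x + y = 1 in normal form is
# a'p + a0 = 0 with a = (1, 1)/sqrt(2) and a0 = -1/sqrt(2).
a = np.array([1.0, 1.0]) / np.sqrt(2.0)
a0 = -1.0 / np.sqrt(2.0)
orthogonal_error = abs(a @ point + a0)               # 1/sqrt(2), about 0.707
```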
Another model is the degenerate case of a surface where the model is represented by a single point. As in the general case, the error corresponding to a data point is the minimum distance of the data point from the surface, which in this case degenerates to the distance of the data point from the point p that represents the model. The problem simplifies to finding the point p that minimizes total error. As signals arrive on the stream, point p is recomputed with each new signal, and a behavior change corresponds to a significant change in the value of p.

If all the data points are uniformly distributed in a sphere around a point, then a model which is a single point is better than a model which is a surface. Indeed, in this case every hyperplane through point p is optimal; thus the hyperplanes carry no more information than the single point p. If, however, the points lie near an extended surface, then a surface model is better than a point model. One solution is to have a general model where the number of dimensions of the surface can vary; for instance, in a 3-parameter (and hence 3-dimensional space) model, the modeling surface could change continuously between being a plane, a line and a point. In this paper we restrict ourselves to the case where points fall closer to a surface than to a point.

Regression models are represented by equations in which the independent variables can get arbitrarily large. In many systems, the ranges of variables are limited by physical constraints. For example, in an information assurance system monitoring file accesses, there is a physical limit to the rate at which files can be accessed by a single process. Limiting the ranges of variables changes error estimates. Consider the trivial illustrative example given earlier with variables x and y, where the model is x + y = 1. Consider the error due to a data point (0, 2) for two cases: (a) the unlimited variable range [-∞, +∞] and (b) the ranges of x and y limited to [0, 1]. The error in the former case is the minimum distance of the point from the line, and this distance is 1/√2. The error in the latter case is the minimum distance of the point from the line segment, and this is the distance (1 unit) from the point (0, 2) to the end (0, 1) of the segment.

A change in behavior may be indicated by a change in the ranges of variables even if there is no change in the surface that represents the model: a change in the length of a line segment may be significant even if the line itself does not change. In this paper we do not consider this issue.
2 Theory

2.1 Exponential smoothing and sliding window

A key issue is that of determining the weight to be given to old information in estimating models: too much weight given to old information results in algorithms that do not update models rapidly; but the more information that is used, the better the estimates in the presence of noise. Popular algorithms for dealing with different emphases on newer and older data are sliding window and exponential smoothing. A sliding window algorithm with window size W estimates a model using only the most recent W data points; it treats all W data points in the window with equal weight, and effectively gives zero weight to points outside the window. An exponential smoothing algorithm with weight λ gives a weight of λ^k to a data point k units in the past, where λ is positive and at most 1; thus an algorithm using a small value of λ forgets faster. We refer to λ as the forgetting factor.

Incremental stream-processing algorithms can be obtained for both sliding window and exponential smoothing. Appendix 2 shows how this is done for the exponential smoothing case. The proof for the sliding window is similar.

An important issue is that of determining the appropriate λ to use at each time T. The value of λ can range from 1 (in which case all points from 0 to T are weighted equally) to 0 (in which case only the data point that arrived at T is considered). Small values of λ adapt to changes rapidly because they give less weight to old data, whereas large values of λ are better at smoothing out noise.

One approach is to change the relative weights given to old and new data when a change is estimated. For instance, suppose the algorithm estimates at time 103 that with high probability a change occurred at time 100; the algorithm then reduces the weight given to signals received before 100 and increases the weight given to signals received after 100. A disadvantage of this approach is that if the algorithm estimates that a behavioral change has taken place when, in reality, no change has occurred, then the algorithm discards valuable old data needlessly. The same approach can be used with sliding windows.
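To make the exponential-smoothing bookkeeping concrete, here is a minimal sketch (ours) for a scalar stream; the weighted sum and the total weight are each maintained by one multiply-and-add per point, which is the same recurrence applied to the summaries S, s and r in Appendix 2.

```python
def exponentially_smoothed_mean(stream, lam):
    """Weighted mean in which a point k steps in the past has weight
    lam**k.  Both accumulators are updated in O(1) per point:
        total  <- lam * total  + x
        weight <- lam * weight + 1
    Small lam forgets quickly; lam close to 1 smooths out noise.
    """
    total, weight = 0.0, 0.0
    estimates = []
    for x in stream:
        total = lam * total + x
        weight = lam * weight + 1.0
        estimates.append(total / weight)
    return estimates
```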
2.2 Experimental Setup

At any point in time, the behavior of a system is captured by a model which is represented by a bounded surface. Our algorithm attempts to estimate the true model given a sequence of noisy signals. We call the model and the surface estimated from signals the estimated values, as opposed to the true values. The true model is changed at some point in time during the experiment, and we evaluate whether the estimated model follows the true model accurately.

At each point in time, a signal is generated as follows. First a point q is generated randomly on the true bounded surface, then a scalar error term e is generated randomly using the given error distribution, and finally a data point r is generated where r = q + e·v, where v is the unit normal to the true surface at point q.
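A generator of this form can be coded directly from the description above. The sketch below is ours, with hypothetical function and parameter names; it draws signals from a bounded planar patch in 3-space with Gaussian noise applied along the unit normal.

```python
import numpy as np

rng = np.random.default_rng(0)

def generate_signal(basis, origin, normal, sigma, extent=1.0):
    """Generate one noisy signal from a bounded planar true model.

    A point q is drawn uniformly on the bounded surface, a scalar error
    e is drawn from the noise distribution (Gaussian here), and the
    signal is r = q + e * v with v the unit normal at q.
    """
    u = rng.uniform(-extent, extent, size=2)     # coordinates on the patch
    q = origin + u[0] * basis[0] + u[1] * basis[1]
    e = rng.normal(0.0, sigma)                   # scalar error term
    return q + e * normal

# Example true model: the plane z = 0, bounded to a unit patch.
basis = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])
origin = np.zeros(3)
normal = np.array([0.0, 0.0, 1.0])
r = generate_signal(basis, origin, normal, sigma=1.0)
```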
2.2.1 Algorithm

Input at time T: a sequence of T - 1 points that arrived in the time interval [0, T - 1] and a new point that arrived at time T.
Output at time T: an estimate of the surface (a hyperplane in a linear model) at time T.
Goal: minimize the deviation between the estimated and true surfaces.

The true model changes over time, and the manner of change is described separately.

2.2.2 Angle between planes

One measure of efficacy of fit of the estimated model to the true model is the angle between the surfaces. The inner product of the unit normals to the hyperplanes representing the estimated and true models is the cosine of the angle between the hyperplanes, and we use this as a measure of goodness. The cosine is 1 if the hyperplanes are parallel and is 0 if they are orthogonal.

2.2.3 Comparison of distances of points from true and estimated surfaces

Another measure of goodness of fit is represented by the differences in distances of data points from the true and estimated surfaces. Let D_{k,t} be the minimum distance of the data point that arrived at time k from the true surface at time t, and let d_{k,t} be the minimum distance of the data point that arrived at time k from the surface estimated at time t. Let E be defined as follows:

E = \sum_k (D_{k,k} - d_{k,k})^2        (2)

Now E is a measure of goodness: the smaller the value of E, the better the resulting estimate. We call E the relative distance error. Notice that this parameter can be nonzero even if the true and estimated hyperplanes are parallel, because this error term is zero if and only if the two hyperplanes are the same.

Appendix 1 shows that the solution that minimizes E can be obtained by solving the convex optimization problem of the minimum eigenvalue or a square system of equations. For now we present results using only the second method.
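Both measures are straightforward to evaluate when each hyperplane is stored in normal form a'x + a_0 = 0 with ||a|| = 1; a sketch (ours) follows.

```python
import numpy as np

def cosine_between_planes(n_true, n_est):
    """Inner product of the unit normals: 1 if parallel, 0 if orthogonal."""
    n_true = n_true / np.linalg.norm(n_true)
    n_est = n_est / np.linalg.norm(n_est)
    return abs(n_true @ n_est)

def relative_distance_error(points, true_models, est_models):
    """E = sum_k (D_{k,k} - d_{k,k})**2 for hyperplanes a'x + a0 = 0.

    Each model is a pair (a, a0) with ||a|| = 1, so the distance of a
    point x from the hyperplane is |a'x + a0|.
    """
    E = 0.0
    for x, (a_t, a0_t), (a_e, a0_e) in zip(points, true_models, est_models):
        D = abs(a_t @ x + a0_t)   # distance from the true surface
        d = abs(a_e @ x + a0_e)   # distance from the estimated surface
        E += (D - d) ** 2
    return E
```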
3 Experiments

We restrict ourselves to linear models; thus, the surface is a hyperplane in a multidimensional space. In each of our experiments we assume that we are given the true model; we generate noisy data from the true model; compute an estimated model from the noisy data; and compare the true and estimated models. At an arbitrary point in time, we change the true model. Noisy data is now generated using the new true model. (The distribution of noise terms is assumed to remain unchanged even though the true model changes.) Since the estimation algorithm has no specific indication that the true model has changed, the algorithm uses data before the change as well as points after the change. Therefore, the estimated model may not be close to the new true model during and immediately after the change. We would like the estimated model to become close to the true model as the time since the last change increases.

The set of experiments has been restricted to 2-dimensional surfaces. Noise is assumed to be Gaussian. This is not a necessary assumption; in fact the algorithm may be applied with any white noise vector. We consider cases where the noise is low (σ² = 1) or high (50 ≤ σ² ≤ 100). We study the effect of different values of λ on the accuracy of the model. We choose values of λ as follows. We pick a positive integer w that we call the window size (not to be confused with the window in the sliding window algorithm) and a positive number θ (which is at most 1) that we call the threshold. The value of θ was set to 0.5 in the experiments. Given w and θ, we pick a λ such that the total weight assigned to all the signals w or more time units into the past is exactly (up to a rounding error) θ. For instance, if w = 4 and θ = 0.5, then the first w signals have a total weight of 1/2, the next w have a total weight of 1/4, and the next w have a weight of 1/8, and so forth.
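This rule determines λ in closed form: the weights λ^k form a geometric series with total 1/(1 - λ), and the tail from k = w onward sums to λ^w/(1 - λ), a fraction λ^w of the total, so λ^w = θ. A small sketch (ours):

```python
def forgetting_factor(w, theta=0.5):
    """Solve lam**w = theta: the fraction of total weight carried by
    signals w or more time units in the past is exactly theta."""
    return theta ** (1.0 / w)

# Our reading: the legend values in Figs. 3-10 correspond under this
# rule to theta = 0.5 with window sizes w = 2, 32 and 64.
for w in (2, 32, 64):
    print(w, round(forgetting_factor(w), 5))   # 0.70711, 0.97857, 0.98923
```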
3.1 Experiments with changing behaviors

We ran many experiments. In each experiment a change occurs after a certain number of time points. Each change is a translation followed by a rotation of the (true) plane. Fig. 2 illustrates the 2D case; each line is associated with a behavior change. Here we report on the following experiments:

- The true model is changed after 500 time points. The translation is 0.75 and the rotation is 10 degrees.
- The true model is changed after 50 time points. The translation is 0.02 and the rotation is 1 degree.
- The true model is changed after 5 time points. The translation is 0.0018 and the rotation is 0.1 degrees.

Each experiment was run for several thousands of points and thus covered many changes of the true model. For ease of visualization, we only show 1500 points in the figures; however, the same pattern occurs for larger numbers of points.
Fig. 2: The model of behavioral change used in the performed experiments. When the model changes, points are generated using a different line.

Fig. 3: The scalar product between the vector of estimated and actual coefficients. A change occurs every 500 iterations. Threshold = 0.5, noise variance = 1. (Curves for λ = 0.70711, 0.97857, 0.98923.)

Fig. 4: The relative distance error between the estimated and actual plane. A change occurs every 500 iterations. Threshold = 0.5, noise variance = 1. (Curves for λ = 0.70711, 0.97857, 0.98923.)

Fig. 5: The scalar product between the vector of estimated and actual coefficients. A change occurs every 500 iterations. Threshold = 0.5, noise variance = 100. (Curves for λ = 0.70711, 0.97857, 0.98923.)
Fig. 3 shows the cosine and the angle between the true and estimated hyperplanes at different points in time for the low variance case. Fig. 4 shows the relative distance error as a function of time for the low variance case. The next two figures show results for the high variance case. The angle between the true and estimated planes increases sharply at the point of the change and then decreases. The angle at the instant of change is larger for higher values of λ, and this is not surprising because higher values of λ give greater weight to pre-change data. Also, algorithms with higher values of λ take longer, after a change, to reduce the error.

Higher values of λ are less susceptible to noise. This is not apparent from the figures in the low-variance case, but is readily apparent in the high-variance case. Indeed, the algorithm with low λ cannot distinguish between a change to the true model and noise in the case of high noise variance. This suggests, as expected, that only high values of λ should be used in the case of high noise, whether the true model is stationary or not. As discussed earlier, an approach is to adapt the relative weights given to old and new data when the algorithm estimates that a change has occurred.

When changes occur frequently, as in Figs. 7-10, then large values of λ are not very appropriate since they give large weight to data generated according to different past models. High accuracy is given by large λ values, but only when the amount of data generated according to the same model is high; when this does not occur, then smaller λ give better performance.

The experiments are explained quite simply by considering the function \sum_{i=1}^{c} λ^{k-i} - \sum_{i=c+1}^{k} λ^{k-i}, where k is the current time and c denotes the time that the true model changes. The first term is the total weight assigned to pre-change signals and the second to post-change signals. Immediately after the change, the pre-change weights are larger because there are fewer post-change signals. Likewise, the higher the value of λ, the greater the weight given to pre-change signals.
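This balance is easy to tabulate; the sketch below (ours) evaluates both terms directly.

```python
def weight_balance(lam, k, c):
    """sum_{i=1..c} lam**(k-i) - sum_{i=c+1..k} lam**(k-i):
    total weight on pre-change signals minus total weight on
    post-change signals at time k, for a change at time c."""
    pre = sum(lam ** (k - i) for i in range(1, c + 1))
    post = sum(lam ** (k - i) for i in range(c + 1, k + 1))
    return pre - post

# Ten steps after a change at time 500, the slow-forgetting setting
# still gives most of the weight to pre-change data, while the
# fast-forgetting setting has already moved on:
print(weight_balance(0.98923, k=510, c=500))   # positive (about 73)
print(weight_balance(0.70711, k=510, c=500))   # negative (about -3.2)
```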
Fig. 6: The relative distance error between the estimated and actual plane. A change occurs every 500 iterations. Threshold = 0.5, noise variance = 100. (Curves for λ = 0.70711, 0.97857, 0.98923.)

Fig. 7: The scalar product between the vector of estimated and actual coefficients. A change occurs every 50 iterations. Threshold = 0.5, noise variance = 100. (Curves for λ = 0.70711, 0.97857, 0.98923.)

Fig. 8: The relative distance error between the estimated and actual plane. A change occurs every 50 iterations. Threshold = 0.5, noise variance = 100. (Curves for λ = 0.70711, 0.97857, 0.98923.)

Fig. 9: The scalar product between the vector of estimated and actual coefficients. A change occurs every 5 iterations. Threshold = 0.5, noise variance = 100. (Curves for λ = 0.70711, 0.97857, 0.98923.)
3.2 Adaptive Algorithms

Adaptive algorithms change the relative weights assigned to older and newer data when a change is detected. The figures show that when a change is abrupt, the change can be detected readily and adaptive algorithms work well. If the change is gradual but frequent, then for large λ values the algorithm may come close to complete recovery, but never recovers fully. This clearly appears in Figures 7, 8, 9 and 10.

An alternative strategy is to compare the model at time T with the models at previous times t, where t ranges from T to T - M and M is a constant window size. So far, we have only discussed the case where M = 1, which is sufficient for significant, substantial changes. If the algorithm detects a change between any model at a previous time t and the current time T, then the algorithm adapts the weights, giving greater weight to signals after time t and less to signals before time t. These experiments are ongoing, and we will report on them in the full version of the paper.
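In terms of the running summaries S, s and r of Appendix 2, one simple way to realize this reweighting (our sketch, not a method given in the paper) is to apply an extra discount factor to the accumulated history at the moment a change is estimated; when the detection lag is short, the history being discounted is almost entirely pre-change data.

```python
def discount_history(S, s, r, beta):
    """Down-weight everything seen so far by an extra factor beta.

    S, s and r are the running summaries of Appendix 2.  Multiplying
    all three by beta multiplies the weight of every past signal by
    beta, which discounts pre-change data once the algorithm estimates
    that a change has occurred.  beta = 1 leaves the weights unchanged;
    beta = 0 discards the history entirely.
    """
    return beta * S, beta * s, beta * r
```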
3.2.1 Number of steps for convergence for different numbers of parameters and noise

We show in the appendix that the computational complexity for handling each new signal value is a constant independent of the length of the history, except for the case of computing the eigenvalue. In our experiments we compute solutions iteratively using the Matlab routine fsolve, using the solution obtained at time t as the initial guess for the computation at time t + 1. Our rationale for choosing such iterative algorithms is that when there is no change in the true model, we expect little or no change in the estimated model, and hence this method of obtaining an initial guess should be very good. We found that about 12 steps were required to converge from a random guess, whereas only 4 steps were needed when using the time t value as the initial guess for the time t + 1 computation.

Fig. 10: The relative distance error between the estimated and actual plane. A change occurs every 5 iterations. Threshold = 0.5, noise variance = 100. (Curves for λ = 0.70711, 0.97857, 0.98923.)
4 Conclusions and further Work

We have presented an algorithm for estimating the parameters of a linear model. The algorithm combines the method of orthogonal regression with exponential forgetting to compute the best estimate. We have shown that all the computation can be done incrementally and using a very small amount of memory. For the tested models we have discussed the recovery of parameters as a function of the frequency of the behavioral change of the model. In the future we plan to study incremental iterative algorithms for finding the minimum eigenvalue of the matrix S* in Appendix 1 and consequently obtaining the eigenvector corresponding with the best estimate of parameters. Though the latter problem is convex and therefore easy to solve with standard packages for reasonable matrix dimensions, it may still require intensive CPU time if all the history is taken into account. We also plan to compare our approach with standard regression techniques.
References

[1] B. Babcock, S. Babu, M. Datar, R. Motwani, and J. Widom. Models and issues in data stream systems. Invited paper in Proc. of the 2002 ACM Symp. on Principles of Database Systems (PODS 2002). ACM, 2002.

[2] S. Chandrasekaran, O. Cooper, A. Deshpande, M. J. Franklin, J. M. Hellerstein, W. Hong, S. Krishnamurthy, S. R. Madden, V. Raman, F. Reiss, and M. A. Shah. TelegraphCQ: Continuous dataflow processing for an uncertain world. In Conference on Innovative Data Systems Research (CIDR), 2003.

[3] S. Chandrasekaran and M. J. Franklin. Remembrance of streams past: Overload-sensitive management of archived streams. In Proc. of the 30th VLDB Conference, 2004.

[4] D. Eberly. 3D Game Engine Design. Morgan Kaufmann, 2001.

[5] A. Deshpande, C. Guestrin, S. R. Madden, J. M. Hellerstein, and W. Hong. Model-driven data acquisition in sensor networks. In Proc. of the 30th International Conference on Very Large Data Bases, 2004.

[6] D. Kifer, S. Ben-David, and J. Gehrke. Detecting change in data streams. In Proc. of the 30th VLDB Conference, Toronto, Canada, pages 180-191, 2004.

[7] S. Krishnamurthy, S. Chandrasekaran, O. Cooper, A. Deshpande, M. Franklin, J. Hellerstein, W. Hong, S. Madden, V. Raman, F. Reiss, and M. Shah. TelegraphCQ: An architectural status report. IEEE Data Engineering Bulletin, 26(1), March 2003.

[8] C. Olston, J. Jiang, and J. Widom. Adaptive filters for continuous queries over distributed data streams. In Proc. of the ACM Int'l Conf. on Management of Data, 2003.
Appendix 1: Method of the minimum eigenvalue

The best fitting hyperplane with respect to a set of points is obtained by solving the following minimization problem:

min \sum_{i=1}^{T} \lambda^{T-i} (a'y[i] + a_0)^2   subject to ||a|| = 1        (3)

where T denotes the number of points, a = (a_1, a_2, ..., a_p) is the vector of unknowns, a_0 is the offset of the hyperplane from the origin, λ < 1 is the weighting factor, and y[i] denotes the point received at time i. The larger the weight, the more recent the point. Using Lagrange multipliers, the problem can be formulated as finding the p + 1 parameters a_0, a_1, ..., a_p which minimize the expression

E = \sum_{i=1}^{T} \lambda^{T-i} [a'y[i] + a_0]^2 - \gamma (\sum_{i=1}^{p} a_i^2 - 1)        (4)

The gradient vector \nabla E = (\partial E/\partial a_1, \partial E/\partial a_2, ..., \partial E/\partial a_p) is given by

\nabla E = 2 (\sum_{i=1}^{T} \lambda^{T-i} y[i] y[i]') a + 2 (\sum_{i=1}^{T} \lambda^{T-i} y[i]) a_0 - 2\gamma a        (5)

while the expression for the derivative with respect to a_0 is

\partial E/\partial a_0 = \sum_{i=1}^{T} 2 \lambda^{T-i} (y[i]' a + a_0)        (6)

For simplicity of notation, let S = \sum_{i=1}^{T} \lambda^{T-i} y[i] y[i]'. Furthermore, let s = \sum_{i=1}^{T} \lambda^{T-i} y[i] and r = \sum_{i=1}^{T} \lambda^{T-i}.

Setting eq. (6) to zero we have the constraint that a_0 = -(s'a)/r. Substituting the expression for a_0 in eq. (5), we obtain the following condition:

(S - (1/r) s s') a - \gamma a = 0        (7)

It is well known that the eigenvector associated with the minimum eigenvalue of S* = S - (1/r) s s' corresponds with the best estimate for a.
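A direct, non-incremental implementation of this method takes only a few lines; the sketch below (ours) builds the weighted sums S, s and r, forms S* = S - (1/r)ss', and extracts the eigenvector associated with the minimum eigenvalue.

```python
import numpy as np

def fit_hyperplane(Y, lam):
    """Orthogonal regression with exponential forgetting (Appendix 1).

    Y is a (T, p) array of signals y[1..T].  Returns the unit normal a
    (eigenvector of S* with minimum eigenvalue) and a0 = -s'a / r.
    """
    T, p = Y.shape
    weights = lam ** np.arange(T - 1, -1, -1)   # lam**(T-i) for i = 1..T
    S = (weights[:, None] * Y).T @ Y            # sum lam**(T-i) y[i] y[i]'
    s = weights @ Y                             # sum lam**(T-i) y[i]
    r = weights.sum()                           # sum lam**(T-i)
    S_star = S - np.outer(s, s) / r
    eigvals, eigvecs = np.linalg.eigh(S_star)   # S* is symmetric
    a = eigvecs[:, 0]                           # minimum-eigenvalue eigenvector
    a0 = -(s @ a) / r
    return a, a0
```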
Appendix 2: Derivation of Incremental Computations

We have shown in Appendix 1 the method to find the optimal estimate for a and a_0. Many iterative and efficient methods for finding eigenvalues exist in the literature. In the performed experiments, we use a different approach. We consider the (p + 2)-dimensional square system defined by

S a + s a_0 - \gamma a = 0
s' a + r a_0 = 0        (8)
a' a = 1

The existence of one solution is guaranteed by the argument in Appendix 1. However, the returned solution will correspond with an eigenvalue of S*, which is not necessarily the minimum eigenvalue. Consequently, the corresponding eigenvector does not necessarily correspond with the best estimate for the unknown vector of parameters. In the performed experiments the trust-region dogleg method implemented by the standard Matlab routine fsolve has been used to solve the system. The quality of the recovered solution has been shown to be very satisfactory (see Section 3).

The system of equations to be solved at each step is computed incrementally using the following method. A data structure which summarizes all points received up to step i, and into which the point received at step i + 1 only needs to be incorporated, is used. Doing so, the equations at step i + 1 can be computed in an amount of time which is constant with respect to the number of received points.

Using eqs. (5) and (6), it is easy to see that the gradient \nabla E at time T + 1 can be written as

\nabla E = 2 (\lambda S^T + y[T+1] y[T+1]') a + 2 (\lambda s^T + y[T+1]) a_0 - 2\gamma a        (9)

while the partial derivative with respect to a_0 is

\partial E/\partial a_0 = 2 a' (\lambda s^T + y[T+1]) + 2 a_0 (\lambda r^T + 1)        (10)

where S^T denotes the matrix S at time T, s^T is the vector s at time T, and r^T is the value of r at time T. Notice that all these parameters have been defined in Appendix 1.

After solving the system of equations at time T + 1, we can set S^{T+1} = \lambda S^T + y[T+1] y[T+1]', s^{T+1} = \lambda s^T + y[T+1], and r^{T+1} = \lambda r^T + 1. Doing so, we can incrementally compute the system at step T + 2.
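A sketch (ours) of the constant-time summary updates together with a warm-started solve of system (8); SciPy's fsolve is used here as a stand-in for the Matlab routine named above, and the function names are hypothetical.

```python
import numpy as np
from scipy.optimize import fsolve

def update_summaries(S, s, r, y, lam):
    """O(1)-per-signal updates from Appendix 2:
    S <- lam*S + y y',  s <- lam*s + y,  r <- lam*r + 1."""
    return lam * S + np.outer(y, y), lam * s + y, lam * r + 1.0

def solve_square_system(S, s, r, guess):
    """Solve system (8) for (a, a0, gamma), warm-started at `guess`.

    The unknown vector is u = (a_1..a_p, a0, gamma); `guess` would
    normally be the solution from the previous time step, which the
    paper reports cuts the iteration count from about 12 to about 4.
    """
    p = S.shape[0]

    def equations(u):
        a, a0, gamma = u[:p], u[p], u[p + 1]
        eq1 = S @ a + s * a0 - gamma * a     # gradient w.r.t. a = 0
        eq2 = s @ a + r * a0                 # derivative w.r.t. a0 = 0
        eq3 = a @ a - 1.0                    # norm constraint
        return np.concatenate([eq1, [eq2, eq3]])

    return fsolve(equations, guess)
```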
Appendix 3: Other norms than L2

Appendix 2 shows how the calculation of derivatives can be done incrementally when the L2 norm is used as the distance criterion; in that case the problem can be formulated as

min \sum_{i=1}^{T} \lambda^{T-i} ||a'y[i] + a_0||^2   subject to ||a|| = 1        (11)

It can be shown that the systems of equations can be computed incrementally also in the case when the minimization is done using the Ln norm, for n > 2, i.e.

min \sum_{i=1}^{T} \lambda^{T-i} ||a'y[i] + a_0||^n   subject to ||a|| = 1        (12)

Due to space limitations we do not discuss the derivation here, but we restrict ourselves to the following considerations. The key step for the incremental computation in the L2 case is the use of the matrix S^T. The k-th column of this matrix, S_k^T, represents the sum of all vectors y[i], weighted by the k-th component of each vector, at time T. A generalization of this idea leads us to the use of hypermatrices consisting of n different entries for the case when the Ln norm is used. The dimension of each entry is p, with p denoting the dimension of the vector y[i]. Hence, the dimension of an n-hypermatrix is p^n, which can be considered constant with respect to the number of points y. The conclusion is that the use of Ln norms, n > 2, only introduces a multiplicative factor p^{n-2} in the dimension of the space used by the algorithm to estimate the parameters, with respect to the use of the L2 norm.