Attribution Non-Commercial (BY-NC)

95 views

Attribution Non-Commercial (BY-NC)

- Ken Black QA ch16
- S1 mindmap
- Inference Statistical Significance
- PhD Thesis
- 10083-30514-1-PB
- Econometrics - Bivariate Regression Slideshow
- arm n feet
- Anti Bullying Interventions
- School Based Assessment And The Silence Behaviour Among Secondary School Teachers In Kuala Langat, Selangor Darul Ehsan.
- estimation
- chapter9.pdf
- 123165352-estimation
- Chapter 9
- Chapter14_Review_Answers.pdf
- spss ppt
- 7-Analisis Regresi Excel Notes - Stepwise Regression
- Lecture 11.pdf
- Boschken, H. L. (2002). Urban spatial form and policy outcomes in public agencies.UAR.pdf
- Applying Regression Models to Calculate the Q Factor
- Shear Wave Velocity as Function of Spt Penetration Resistance and Vertical Effective Stress at California Bridge Sites

You are on page 1of 10

∗

Hansheng Lei Venu Govindaraju

Linear relation is valuable in rule discovery of stocks, such scaling factor. In some applications, we also need to

as ”if stock X goes up 1, stock Y will go down 3”, etc. consider X2 to be similar to X1 . Obviously, Lp norms

The traditional linear regression models the linear relation cannot capture these types of similarity. Furthermore,

of two sequences perfectly. However, if user asks ”please Lp distance only has relative meaning when used to

cluster the stocks in the NASDAQ market into groups measure (dis)similarity. By ”relative”, we mean that

where sequences have strong linear relationship with each a distance alone between two sequences X1 and X2 ,

other”, it is prohibitively expensive to compare sequences e.g., Distance(X1 , X2 ) = 95.5, cannot give us any

one by one. In this paper, we propose a new model named information about how (dis)similar X1 and X2 are.

GRM (Generalized Regression Model) to gracefully handle Only when we have another distance to compare, e.g.,

the problem of linear sequences clustering. GRM gives a Distance(X1 , X3 ) = 100.5 > 95.5, we can tell that X1 is

measure, GR2 , to tell the degree of linearity of multiple more similar to X2 than to X3 . In conclusion, Lp norms

sequences without having to compare each pair of them. Our as measure of (dis)similarity have two disadvantages:

experiments on the stocks in the NASDAQ market mined • Cannot capture similarity in the case of shifting

out many interesting clusters of linear stocks accurately and and scaling.

eﬃciently using the GRM clustering algorithm.

• Distance only has relative meaning of

1 Introduction. (dis)similarity.

Sequence analysis has attracted a lot of research inter- It is well known that the mean-deviation normalization

ests with a wide range of applications. While matching, can discard the shifting and scaling factors. The mean-

sub-matching, indexing, clustering, rule discovery, etc. deviation normalization is deﬁned as N ormal(X) =

are the basic research problems in this ﬁeld [1] - [8], (X − mean(X))/std(X) . However, it can not tell what

[23, 24], the core problem is how to deﬁne and measure the shifting and scaling factors are. Those factors are

similarity. Currently, there are several popular models exactly what we need to mine the linearity of sequences.

used to deﬁne and measure (dis)similarity of two se- A typical application of linearity is the rule discovery of

quences. Let’s classify them into 4 main categories: stocks: cluster the stocks in the NASDAQ market into

groups where sequences have strong linear relationship

1.1 Lp norms [1, 2]. Given two sequences X = with each other.

[x1 , x2 , · · · , xN ] and Y = [y1 , y2 , · · · , yN ], Lp norm is

N 1

deﬁned as Lp(X, Y ) = ( i=1 |xi − yi | p ). When p=2, 1.2 Transforms [3, 21, 22]. Popularly used trans-

it is the most commonly used Euclidean distance. Lp forms in sequences are the Fourier Transform and

norms are straightforward and easy to calculate. But Wavelet Transform. Both transforms can concentrate

in many cases, the distance of two sequences cannot most of the energy to a small region in the frequency

reﬂect the real (dis)similarity between them. A typical domain. With energy concentrated to some a small re-

case is shifting. For example, suppose sequence X1 = gion, processes can be carried out in this small region in-

[1, 2, · · · , 10] and X2 = [101, 102, · · · , 110]. X2 is the volving only few coeﬃcients, thus dimension is reduced

result of shifting X1 by 100, i.e., adding 100 to each and time is saved. From this point of view, the trans-

element of X1 . The Lp distance between X1 and X2 forms are used actually for feature extraction. However,

is large, but actually they should be considered to be after features are extracted, some type of measure is un-

similar in many applications [10, 16, 17]. Another case avoidable. If Lp norm distance is used, it inherits the

disadvantages stated above.

∗ Center for Uniﬁed Biometrics and Sensors, Computer

Science and Engineering department, State University of 1.3 Time Warping [18, 19, 20]. It deﬁnes the

New York at Buﬀalo, Amherst, NY 14260. Email: distance between sequences Xi = [x1 , x2 , · · · , xi ] and

{hlei,govind@cse.buﬀalo.edu} Y j = [y1 , y2 , · · · , yj ] as D(i, j) = |xi − yi | + min{D(i −

1, j), D(i, j − 1), D(i − 1, j − 1)}. This distance can we can solve β0 and β1 as follows :

be solved using dynamic programming. It has a great

advantage that it can tolerate some local non-alignment (2.2) β0 = y − β1 x

of time phrase so that the two sequences do not have to N

be of the same length. It is more robust and ﬂexible than i=1 (xi − x)(yi − y)

(2.3) β1 = N 2

Lp norms. But it is also sensitive to shifting and scaling. i=1 (xi − x)

And the warping distance only has relative meaning, N N

just like the Lp norms. where x = N1 i=1 xi and y = N1 i=1 yi , the average

of sequence Y and X respectively.

1.4 Linear relation [10, 16, 17]. Linear transform After obtaining β0 and β0 , there will be a question:

is Y = β0 + β1 X. Sequence X is deﬁned to be similar how do we measure how well the regression line ﬁts these

to Y if we can determine such β0 and β1 so that data? To answer this, the R-squared is deﬁned as:

Distance(Y, β0 + β1 X) is minimized and this distance N 2

u

is below a given threshold. Paper [16] solved scaling (2.4) R2 = 1 − N i=1 i

2

factor β1 and shifting oﬀset β0 from a geometrical point i=1 (yi − y)

of view. Although Distance(Y, β0 +β1 X) is invariant to

shifting and scaling, the distance still only has relative From (2.4) we can further derive:

meaning. N

2 [ i=1 (xi − x)(yi − y)]2

In this paper, we propose a new model, named (2.5) R = N N

2 2

GRM (Generalized Regression Model) to measure the i=1 (xi − x) i=1 (yi − y)

degree of the linear relation of multiple sequences at

one time. In addition, based on GRM, we develop tech- The value of R2 is always between 0 and 1. The closer

niques to cluster massive linear sequences accurately the value is to 1, the better the regression line ﬁts the

and eﬃciently. data points. R2 is the measure for the Goodness-of-Fit

The organization of this paper is as follows: Sec- in the traditional regression.

tion §1 is introduction, section §2 provides a basic back- The regression model as (2.1) is called Simple

ground of the traditional regression model. After that, Regression Model, since it involves only one independent

section §3 describes GRM in detail and §4 shows ex- variable X and one dependent variable Y . We can add

amples of how to apply GRM to linearity measure and more independent variables to the model as follows:

clustering of multiple sequences. Section §5 evaluates

(2.6) Y = β0 + β1 X1 + β2 X2 + · · · + βK XK + u

GRM using real stock prices in the NASDAQ market

and discusses the experimental results. Finally, section This is called Multiple Regression Model. β0 , β1 , · · · , βK

§6 will draw conclusions. can be estimated similarly using ﬁrst order conditions.

Linear regression analysis originated from statistics and 3.1 Why not the traditional Regression Model.

has been widely used in econometrics [27, 28]. Let’s We observed that the Simple Regression Model is excel-

use an example to introduce the basic idea of linear lent in testing the linear relation of two sequences. R2

regression analysis. For an instance, to test the linear is a good measure for linear relation. For an instance,

relation between consumption Y and incoming X, we R2 (X1 , X2 ) = 0.95 is statistically strong evidence that

can establish the linear model as: the two sequences are highly linear related to each other,

(2.1) Y = β0 + β1 X + u thus they are very similar (if we think similarity should

be invariant to shifting and scaling). We do not have to

The variable u is called the error term. The regression compare R2 (X1 , X2 ) > R2 (X1 , X3 ) and say X1 is sim-

as (2.1) is termed as ”the regression of Y on X ”. ilar to X2 rather than X3 . Therefore, the meaning of

Given a set of sample data, X = [x1 , x2 , · · · , xN ] and R2 for similarity is not relative, unlike distance-based

Y = [y1 , y2 , · · · , yN ], β0 and β1 can be estimated in the measures. Furthermore, according to equation (2.5), we

sense of minimum-sum-of-squared-error. That is, we know that no matter sequence X2 regresses on X1 or

seek to ﬁnd a line, called regression line, in the Y -X vice versa, the R2 is the same, i.e., R2 is invariant to

space, to ﬁt the points (x1 , y1 ), (x2 , y2 ), · · · , (xN , yN ) as the regression order of sequences. This makes it feasible

well as

N possible. We need to determine β0 and β1 such for R2 to be a measure of similarity.

N

that i=1 u2i = i=1 (yi − β0 − β1 xi ) is minimized, as Fig. 2 shows two pairs of sequences and correspond-

shown in ﬁg. 1 a). Using ﬁrst order conditions[27, 28], ing R2 values. Sequence X1 and X2 are similar with

(xi, yi) (xi, yi) 2000 4000 4000

Y Y ui 1000 2000

ui 2000

0

0 0

e e

Lin Lin -1000

y1 ion y1 ssio

n

ress re -2000 -2000 -2000

Reg Reg 0 100 200 300 400 0 100 200 300 400 -2000 -1000 0 1000 2000

a) Simple Regression Model b) Generalized Regression Model

1500 2000 2000

1000 1000

500

Figure 1: a) In the traditional Simple Regression Model, -500 0 0

sequences Y and X compose a 2-dimensional space. 0 100 200 300 400 0 100 200 300 400 -500 500 1500 2500 3500

(x1 , y1 ), (x2 , y2 ), · · · , (xN , yN ) are points in the space. d) Sequence X3 e) Sequence X4 f) X4 regresses on X3

The regression line is the line that ﬁts these points in the

sense of minimum-sum-of-squared-error. b) In GRM,

Figure 2: Two pairs of sequences. R2 (X1 , X2 ) = 0.91

two sequences also compose a 2-dimensional space. The

and R2 (X3 , X4 ) = 0.31. The values of R2 ﬁt our

error term ui is deﬁned as the vertical distance from the

observation that X1 and X2 are similar, while X3 and

point to the regression line.

X4 are not similar.

R2 (X1 , X2 ) = 0.91, while X3 and X4 are not similar Given K(K ≥ 2) sequences X1 , X2 , · · · , XK and

since R2 (X3 , X4 ) = 0.31. Points in ﬁg. 2 c) are dis-

tributed along the regression line, while points in ﬁg. 2 X1 x11 x12 · · · x1N

f) are scattered everywhere. When we need to test only X2 x21 x22 · · · x2N

two sequences, the Simple Regression Model is suitable. .. = .. .. .. ..

. . . . .

However, when more than two sequences are involved

XK xK1 xK2 ··· xKN

in some applications such as clustering, the Simple Re-

gression Model has to run regression between each pair We ﬁrst organize them into N points in the K-

of sequences. The performance cannot be eﬃcient. One dimensional space:

might be tempted to think that we can use the Mul-

tiple Regression Model. Unfortunately, there exists a x11 x12 x1N

x21 x22 x2N

critical problem in the Multiple Regression Model, the

p1 = . , p2 = . , · · · , pN = ..

measure. We cannot use R2 in the multiple regression .. .. .

model to test whether multiple sequences are similar xK1 xK2 xKN

to each other or not, because it only means the linear

relation between Y and the linear combination of X1 , Then, we seek to ﬁnd a line in the K-dimensional

X2 , · · ·, XK . Moreover, R2 in the multiple regression space that ﬁts the N points in the sense of minimum-

is sensitive to the order of sequences. If we randomly sum-of-squared-error.

choose Xi to substitute Y as dependent variable and let In the traditional regression, the error term is

Y be independent variable, then the regression becomes deﬁned as:

Xi = β0 + β1 X1 + · · · + βi Y + · · · + βK XK + u. The R2

(3.7) ui = yi − (β0 + β1 x1i + · · · + βK xKi )

here will be diﬀerent from that of (2.6), because they

have diﬀerent meanings. It is the distance between yi and the regression hyper-

From a geometrical point of view, equation (2.6) plane in direction of axis Y . This makes sequence

describes a hyper-plane instead of a line in (K + Y unique from any Xi (i = 1, 2, · · · , K). Fig. 1 a)

1)-dimensional space. To test the similarity among shows the ui of the traditional regression in the 2-

multiple sequences, we need a line in the space instead dimensional space. In GRM, we deﬁne the error term

of a hyper-plane. ui as the vertical distance from point (x1i , x2i , · · · , xKi )

Generalizing the idea of Simple Regression Model to to the regression line. Please note that there is no Y

multiple sequences, we propose the GRM (Generalized here anymore, because no sequence is special among its

Regression Model). community. Fig. 1 b) shows the new deﬁned ui in the

case of two-dimensional space.

§7 Appendix gives the details of how to determine Proof. Two sequences X1 and X2 are linear to each

the regression line in K-dimensional space. Here, we other means:

assume we have found the regression line as follows:

(3.10) x2i = β0 + β1 x1i (i = 1, 2, · · · , N )

p(1) − x1 p(2) − x2 p(K) − xK So, we have:

(3.8) = = ··· =

e1 e2 eK N N

t (3.11) x2i = N β0 + x1i

where p(i) is the i -th element of point p, [e1 , e2 , · · · , eK ]

i=1 i=1

is the eigenvector corresponding to the maximum eigen-

value of the scatter matrix(see §7 Appendix). xj (j = That is:

1, 2, · · · , K) is the average of sequence Xj (through out (3.12) x2 = β0 + β1 x1

the rest of the paper, xj always means the average of Subtract (3.10) by (3.12), we get:

sequence Xj ).

Expression (3.8) means that if any point p in K- (3.13) x2i − x2 = β1 (x1i − x1 )

dimensional space satisﬁes equation (3.8), it must lie on On the other side, from (3.10) we know σ(X2 ) = σ(β0 +

the line. All the points that satisfy (3.8) compose a line β1 X1 ) = β1 σ(X1 ), i.e., β1 = σ(X2 )/σ(X1 ). Thus,

in K-dimensional space. This line is the regression line. (3.13) can be rewritten as xσ(X

1i −x1

1)

= xσ(X

2i −x2

2)

. ∆

For the N points p1 , p2 , · · · , pN , some may lie on the

regression line, some may not. But the sum of squared- Lemma 3.2. K sequences X1 , X2 , · · · , XK are linear to

distance from pj (j = 1, 2, · · · , N ) to the regression line each other ⇔ K sequences X1 , X2 , · · · , XK have a linear

is minimized. To guarantee the regression line exists relation as xσ(X 1i −x1

1)

= xσ(X

2i −x2

2)

= · · · = xσ(X

Ki −xK

K)

(i =

uniquely, we need following two assumptions: 1, 2, · · · , N ), where σ(X) denotes the standard deviation

N of sequence X.

• Assumption 1. i=1 (xji − xj ) = 0. This

Proof. From lemma 3.1, this is obvious. ∆

assumption means no sequence is constant. It

guarantees the scatter matrix has eigenvector. Applying lemma 3.1 and 3.2, we can obtain the

following theorem:

• Assumption 2. There exists at least two points

pi , pj among the N points such that pi = pj . This Theorem 3.1. K sequences X1 , X2 , · · · , XK are linear

assumption guarantees the N points determine a to each other ⇔ Points p1 , p2 , · · · , pN are all distributed

line uniquely. on a line in K-dimensional space and this line is the

regression line.

In real applications, it is highly unlikely that a random Proof. From right to left is obvious. Let’s prove from

sequence is constant or all K sequences are exactly the left to right.

same. Therefore, the assumptions will not limit the According to lemma 3.2, if X1 , X2 , · · · , XK are

applications of GRM. linear to each other, then this relation can be expressed

Similar to the traditional regression, after determin- as:

ing the regression line, we need a measure for Goodness- x1i − x1 x2i − x2 xKi − xK

(3.14) = = ··· =

of-Fit. We deﬁne: σ(X1 ) σ(X2 ) σ(XK )

N 2 Recall that point pi = [x1i , x2i , · · · , xKi ]t . Let pi (k) =

2 i=1 ui

(3.9) GR = 1 − K N xki , (3.14) can be rewritten as:

2

j=1 i=1 (xji − xj )

pi (1) − x1 pi (2) − x2 pi (K) − xK

(3.15) = = ··· =

If GR2 is close to 1, then we know the regression line σ(X1 ) σ(X2 ) σ(XK )

ﬁts the N points very well, which further means the K Equation(3.15) means point pi (i = 1, · · · , N ) is on the

sequences have a high degree of linear relationship with line. Therefore, we conclude that p1 , p2 , · · · , pN are

each other. distributed on a line in K-dimensional space.

We need the following two lemmas before we prove Now we need to show the line must be the regression

an important theorem of GRM. line. The N points p1 , p2 , · · · , pN determine the line as

(3.14) and it is unique according to assumption 2. On

Lemma 3.1. Two sequences X1 and X2 are linear to the other hand, our regression line guarantees that the

each other ⇔ Two sequences X1 and X2 have a linear sum of squared-error is minimized. If and only if the

relation as xσ(X

1i −x1

1)

= xσ(X

2i −x2

2)

(i = 1, 2, · · · , N ), where regression line is same as (3.14), the sum of squared-

σ(X) denotes the standard deviation of sequence X. error is 0 (minimized). ∆

Now, let’s derive some properties of GR2 : 18 45 18

14 14

λ

1. GR2 = N

35

and 1 ≥ GR2 ≥ 0. 10 10

pi −m2 25

i=1

6 6

2

2. GR = 1 means the K sequences have exact linear 2 15 2

1 2 3 4 5 6 7 8 9 10 1 2 3 4 5 6 7 8 9 10

relationship to each other. 15 25 35 45

3. GR2 is invariant to the order of X1 , X2 , · · · , XK , 1 2 1 2

sequences and the value of GR2 does not change. Figure 3: An example of applying GRM.1 to two

sequences.

Proof. 1. According to §7 Appendix, we have:

N

N

(3.16) u2i = −et Se + pi − m 2 • Organize the given K sequences with length N into

i=1 i=1 N points p1 , p2 , · · · , pK in K-dimensional space as

shown in §3.2.

Thus,

N • Determine the regression

u2 N line. First, calculate

GR 2

=1 − K Ni=1 i the average m = 1

N i=1 pi . Second, calculate

j=1 i=1 (xji − xj )

2 N t

N 2 the scatter matrix S = i=1 (pi − m)(pi − m) .

u Then, determine the maximum eigenvalue λ and

= 1 − N i=1 i

2 corresponding eigenvector e of S.

i=1 pi − m

t

e Se • Calculate GR2 according to property 1 of GR2 .

= N (recall (3.16))

p − m 2

i=1 i

et λe • Draw conclusion. Suppose we only accept linearity

= N with conﬁdence no less than C (say, C = 85%). If

2

i=1 pi − m GR2 ≥ C, we can conclude that the K sequences

λ are linear to each other with conﬁdence GR2 and

= N (recall that e 2 = 1)

2

i=1 pi − m the linear relation is as (3.8).

N 2 N

From (3.16), we know i=1 ui = −et Se + i=1 4.1 Apply GRM.1 to two sequences: an exam-

2

pi − m ≥ 0, therefore: ple.

N Suppose we want to test two sequences X1 and X2

2 t and

(3.17) pi − m ≥ e Se = λ ≥ 0

i=1

X1 = [ 0, 3, 7, 10, 6, 5, 10, 14, 13, 17];

so we conclude 1 ≥ GR2 ≥ 0.

N

2. If GR2 = 1, then i=1 u2i = 0, which means the X2 = [11, 17, 25, 30, 24, 22, 25, 34, 39, 45],

regression line ﬁts the N points perfectly. There-

as shown in ﬁg. 3 a) and b) respectively. Let’s apply

fore, according to theorem 3.1, the K sequences

GRM.1 step by step.

have exact linear relationship to each other.

First, organize the two sequences into 10 points: (0,

3. This is obvious. ∆ 11), (3, 17), (7, 25), (10, 30), (6, 24), (5, 22), (10, 25),

(14, 34), (13, 39), (17, 45). If we draw the 10 points in

Because GR2 has above important properties, we deﬁne the X1 -X2 space, their distribution is as shown in ﬁg. 3

GR2 as the degree of linear relation and the similarity c).

measure in the sense of linear relation. Second,Ndetermine the regression line. Average

m = N1 i=1 pi = [8.5, 27.2]t . Maximum eigen-

4 Applications of GRM. value λ = 1161.9 and corresponding eigenvector e =

The procedure of applying GRM to measure the linear [0.4553, 0.8904]t .

relation of multiple sequences is described by algorithm Third, calculate GR2 = 0.9896. Since GR2 is as

GRM.1. high as 98.96%, we can conclude that X1 is highly linear

GRM.1: Testing linearity of multiple se- related to X 2 . In addition, we ﬁnd their linear relation

quences is X0.4553

1 −8.5

= X0.8904

1 −27.2

.

σ(X1 )

24 20 18

standard deviations, i.e., e1 = σ(Xe2

2)

= · · · = σ(X K)

eK .

σ(X )

20

16 14 To infer reversely, σ(X

ei

i)

≈ ej j is a strong hint that

16

σ(X )

12 12

10 they may be linear; if σ(X ei

i)

diﬀers greatly from ej j

8 6 , they can not be linear. This is very valuable for

clustering. We call σ(X i)

8

1 2 3 4 5 6 7 8 9 10 1 2 3 4 5 6 7 8 9 10 1 2 3 4 5 6 7 8 9 10

ei the feature value of sequence

a) Sequence X1 b) Sequence X2 c) Sequence X3

Xi .

Based on this, we derive an algorithm for clustering

massive sequences. Given a set of sequences S = {Xi |

Figure 4: Three sequences with low similarity. i = 1, 2, · · · , K}, algorithm GRM.2 works as follows.

Algorithm GRM.2: Clustering of massive

sequences

If we observe ﬁg. 3 a) and b), we can see they are

really similar (linear) to each other. So the conclusion • Apply Algorithm GRM.1 to test whether the given

based on GR2 makes sense. sequences are linear to each other or not. If yes, all

the sequences can go into one cluster and we can

4.2 Apply GRM.1 to multiple sequences: an stop, otherwise, go to next step.

example. • After GRM.1, we have eigenvector [e1 , e2 , · · · , eK ]t .

GRM is intended to test whether multiple sequences Create a feature value sequence F =

are linear to each other or not. Let’s give an example σ(X1 ) σ(X2 ) σ(XK )

( e1 , e2 , · · · , eK ) and sort it in increasing

for testing 3 sequences at a time.

order. After sorting, suppose F = (f1 , f2 , · · · , fK ).

Suppose we have three sequences:

• Start from the ﬁrst feature value f1 in F . Suppose

X1 = [6, 9, 13, 16, 12, 11, 16, 20, 19, 23]; the corresponding sequence is Xi . We only check

the linearity of Xi with the sequences whose feature

X2 = [8, 13, 13, 17, 13, 18, 16, 13, 17, 19]; values in F are close to f1 . Here ”close” means

X3 = [5, 9, 12, 14, 17, 18, 17, 15, 13, 13]. fj /f1 ≤ ξ (According to our experience, ξ = is

enough). We collect those sequences which have

They are shown in ﬁg. 4 a), b) and c) respectively. linearity with Xi with conﬁdence ≥ C into cluster

Following the same procedure, we can calculate GR2 = CM1 . Delete all the sequences in this cluster from

0.7301. This conﬁdence is not much high, thus we can set S, then repeat the similar procedure to obtain

conclude that some sequences are not very linear to next cluster until S becomes empty.

others. If we observe the three sequences in ﬁg. 4,

we can see that none of the three sequences are linear The most time-consuming part in GRM.1 and

2 GRM.2 is to calculate the maximum eigenvalue and cor-

to one other. This example demonstrates that GR is

a good measure again. Actually, we have tested many responding eigenvector of scatter matrix S. Fast algo-

2 rithm [25] can do so with high eﬃciency.

sequences and found GR as linearity measure agrees

with our subjective observation.

5 Experiments.

4.3 Apply GRM to clustering of massive se- Our experiments focus on two aspects: 1) Is the GR2

quences. really a good measure for linearity? If two or multiple

When hundreds or thousands of random sequences sequences have high degree of linearity with each other,

are tested by algorithm GRM.1, one can foresee that will GR2 really be close to 1 or vice versa? 2) Can

GR2 cannot be close to 1 before really calculating it, GRM.2 be used to mine linear stocks in the NASDAQ

because hundreds or thousands of random sequences are market? What is the accuracy and performance of

highly unlikely to be linear to each other. If we know GRM.2 in clustering sequences?

the result before carrying out the test, what is the use of For the ﬁrst concern, we need to test algorithm

GRM.1 in testing massive sequences? Fortunately, we GRM.1. We generated 5000 random sequences of

can make use of algorithm GRM.1 to obtain heuristic real number for experiments. Each sequence X =

information for clustering sequences. [x1 , x2 , · · · , xN ] is a random walk: xt+1 = xt + zt , t =

From theorem 3.1, lemma 3.2 and the formula 1, 2, · · · , where zt is an independent, identically dis-

of the regression line as (3.8), we know that if the tributed (IID) random variable. Our experiments con-

multiple sequences have linear relationship to each ducted on these sequences show that GR2 is a good mea-

other, then the eigenvector is directly related to the sure for linearity. Generally speaking, if GR2 ≥ 0.90,

the sequences are very linear to each other; if 0.80 ≤ 1 1

GR2 ≤ 9.0, the sequences have high degree of linearity, 0.99 0.99

Accuracy

Accuracy

while GR2 < 0.80 means the linearity is low. 0.98 0.98

1 3 5 7 9 11 13 15

0.95

1 3 5 7 9 11 13 15 17

detail here. Number of sequences (x 200), length=365 length of sequence (x 20), # of sequences=2000

a) b)

5.1 Experiments setup.

We downloaded all the NASDAQ stock prices from

Figure 5: a) Relative accuracy of GRM.2 when the

yahoo website ( http://table.ﬁnance.yahoo.com/) start-

number of sequences varies. b) Relative accuracy of

ing from 05-08-2001 and ending at 05-08-2003. It has

GRM.2 when the length of sequences varies.

over 4000 stocks. We used the daily closing prices of

each stock as sequences. These sequences are not of

the same length for various reasons, such as cash div- 3) Repeat 1) until S become empty.

idend, stock split, etc. We made each sequence length After we ﬁnd clusters CM1 , CM2 , · · · , CMl using

365 by truncating long sequences and removing short GRM.2 and clusters CR1 , CR2 , · · · , CRh using BFR, we

sequences. Finally, we have 3140 sequences all with the can calculate the accuracy of GRM.2. The accuracy of

same length 365. The 365 daily closing prices start from the BFR is considered as 1 since it is brute-force. We

05-08-2001, ending at some date (not necessarily at 05- calculate the relative accuracy of GRM.2 as follows.

08-2003, since there are no prices in weekends).

Our task is to mine out clusters of stocks with Accuracy(CM, CR) = [ maxj (Sim(CMi , CRj ))]/l,

similar trends. To solve this problem, there are two i

choices basically. The ﬁrst is GRM.2 and the second is where Sim(CMi , CRj ) = 2 |CMii|+|CRjj | .

|CM ∩CR |

to use the Simple Regression Model and brute-forcedly

Our programs were written using MATLAB 6.0.

calculate linearity of each pair. For convenience, we

Experiments were carried out in a PC with 800MHz

name this method BFR (Brute-Force R2 ). As we

CPU and 384MB of RAM.

discussed in §2, the Simple Regression Model can also

compare two sequences without worrying about scaling

5.2 Experimental results.

and shifting, because R2 is invariable to both scaling

Fig. 5 a) shows the accuracy of GRM.2 relative

and shifting. Intuitively, BFR is more accurate but

to BFR while the number of sequences varies when the

slower, while GRM.2 is faster at the compensation of

length of sequence is ﬁxed at 365. The accuracy remains

some loss of accuracy. Our experiments will compare

at above 0.99 (99%) when the number of sequences

the accuracy and performance of BFR and GRM.2. We

increases to over 3000. Fig. 5 b) shows that the relative

do not have to compare GRM.2 with other methods in

accuracy of GRM.2 stays above 0.99 when the length of

the ﬁeld of sequence analysis, such as [16] - [20], because

sequences varies with the number of sequences ﬁxed at

no method so far can test multiple sequences at a time

2000.

and most of them are sensitive to scaling and shifting,

Fig. 6 a) and b) show the running time of GRM.2

to the best of our knowledge.

and BFR. We can see that GRM.2 is much faster than

In experiments, we require the conﬁdence of linear-

BFR, no matter when the number of sequences varies

ity of sequences in the same cluster to be no less than

while holding the length ﬁxed or vice versa. The faster

85%. Recall that the linearity of is deﬁned by GR2 in

speed is at the compensation of less than 1% accuracy

GRM and R2 in the Simple Regression Model.

of clustering. Note that any clustering algorithm cannot

Given a set of sequences S = {Xi | i = 1, 2, · · · , K},

be 100% correct because the classiﬁcation of some

BFR works as follows:

points are ambiguous. From this point of view, we can

1) Take an arbitrary sequence out from S, say Xi .

conclude that GRM.2 is better than BFR.

Find all sequences from S which has R2 ≥ 85% with Xi .

Fig. 7 shows a cluster of stocks we found out us-

Collect them into a temporary set. Do post-selection

ing GRM.2. The stocks of four companies, A (Ag-

to make sure all sequences have conﬁdence of linearity

ile Software Corporation), M (Microsoft Corporation),

≥ 85% with each other by deleting some sequences from

P (Parametric Technology Corporation) and T (TMP

the set if necessary.

Worldwide Inc.), are of very similar trends. In addition

2) Save this temporary set as a cluster and delete

to the clustering, we found they have following approx-

all sequences in this set from S.

imate linear relation: A−10.96

0.0138 =

M −29.36

0.0140 = P0.0115

−6.37

=

1200 6

Running time (seconds) 38

GRM GRM

1000 5

BFR BFR

800 4

Stock of Microsoft

600 3

32

400 2

200 1

0

0

1 3 5 7 9 11 13 15 1 3 5 7 9 11 13 15 17 26

Number of sequences(x 200), length=365 Length of sequences (x 80), # of sequences=100

a) b)

20

BFR. a) Running time comparison when the number of Stock of Agile Software

sequences varies. b) Running time comparison when the

length of sequences varies.

Figure 8: The regression of Microsoft stock on Agile

70

Software stock.

TMP worldwide

60

a new measure for linearity of multiple sequences. The

meaning of GR2 for linearity is not relative. Based on

Stock prices

50

40 GR2 , algorithm GRM.1 can test the linearity of multiple

30 sequences at a time and GRM.2 can cluster massive

Microsoft sequences with high accuracy as well as high eﬃciency.

20

Agile Software

10 Acknowledgements. The authors appreciate Dr.

0 Parametric Tech.

Jian Pei and Dr. Eamonn Keogh for their invaluable

suggestions and advice on this paper. The authors also

From 05-08-2001 to 05-08-2003 thank Mr. Mark R. Marino and Mr. Yu Cao for their

comments.

Figure 7: A cluster of stocks mined out by GRM.2.

7 Appendix

T −33.62 Determine the Regression Line in K-dimensional

0.0563 space

From the stocks shown in ﬁg. 7, we can see that A,

Suppose p1 , p2 , · · · , pN are the N points in K-

M and P ﬁt each other very well with some shifting.

dimensional space. First we can assume the regression

T ﬁts others with both scaling and shifting. Above

line is l, which can be expressed as:

relation is consistent with this our observation, since

0.0138, 0.014, 0.0115 is close to each other, which means (7.18) l = m + ae,

the scaling factor between each pair of M, P and T is

close to 1, while 0.0563 is about 4 times of 0.014, which where m is a point in the K-dimensiona space, a is an

means the scaling factor between T and M is about 4. arbitrary scalar, and e is a unit vector in the direction

Fig. 8 shows the regression of Microsoft stock on Agile of l.

Software stock. The distribution of the points in the When we project points p1 , p2 , · · · , pN to line l, we

2-dimensional space conﬁrms that the two stocks have have point m+ai e corresponds to pi (i = 1, · · · , N ). The

strong linear relation. squared error for point pi is:

Actually, we found many interesting clusters of

(7.19) u2i = (m + ai e) − pi 2

stocks. In each cluster, every stock has similar trend to

one another. Due to space limitation, we cannot show Thus, the sum of all the squared-error is to be:

all of them here. The results of clustering are valuable

N N

for stocks buyers and sellers.

u2i = (m + ai e) − pi 2

i=1 i=1

6 Conclusion.

N

We propose GRM by generalizing the traditional Simple = ai e − (pi − m) 2

Regression Model. GRM gives a measure GR2 , which is i=1

N

N

N

Since S generally has more than one eigenvectors, we

= a2i e 2 −2 ai et (pi − m) + pi − m 2

should select the eigenvector e which corresponds to the

i=1 i=1 i=1

largest eigenvalue λ.

N N N

= a2i − 2 ai et (pi − m) + pi − m 2 N Finally, we2 need m to complete the solution.

i=1 pi − m should be minimized since it is always

i=1 i=1 i=1

non-negative. To minimize it, m must be the average of

Since e is unit vector, e 2 = 1. p1 , p2 , · · · , pN . It is not diﬃcult to prove this. Readers

In GRM, the sum of squared error must be mini- are referred to [26] for proof.

N With m as the average of the N points and e from

mized. Note that i=1 u2i is a function of m, ai and e.

Partially diﬀerentiating it with respect to ai and setting (7.23), the regression line l is determined. The line in

the derivative to be zero, we can obtain: form of (7.18) is not easy to understand. Let’s write

it in an easier way. Suppose e = [e1 , e2 , · · · , eK ]t and

(7.20) ai = et (pi − m) m = [x1 , x2 , · · · , xK ]t , l can be expressed as: p(1)−x

e1

1

=

p(2)−x2 p(K)−xK

e2 = ··· = eK , where p(i) is the i -th element

Now,

N we should determine vector e to minimize of point p.

2

u

i=1 i . Substituting (7.20) to it, we have:

N N N N

References

u2i = a2i −2 ai ai + pi − m 2

[1] R. Agrawal, C. Faloutsos and A. Swami, Eﬃcient

N N

Similarity Search in Sequence Databases, Proc.of the

4th Intl. Conf. on Foundations of Data Organizations

2 2

= − ai + pi − m

and Algorithms (FODO) (1993), pp. 69–84.

i=1 i=1

[2] B. Yi and C. Faloutsos, Fast Time Sequence Indexing

N N

for Arbitrary Lp Norms, The 26th Intl. Conf. on Very

= − [et (pi − m)]2 + pi − m 2 Large Databases(VLDB) (2000), pp. 385–394.

i=1 i=1 [3] D. Raﬁei and A. Mendelzon, Eﬃcient Retrieval of

N N Similar Time Sequences Using DFT, Proc. of the 5th

= − [et (pi − m)(pi − m)t e] + pi − m 2 Intl. Conf. on Foundations of Data Organizations and

i=1 i=1 Algorithms (FODO) (1998), pp. 69–84.

N [4] R. Agrawal, K. I. Lin, H. S. Sawhne and K. Shim, Fast

= −et Se + pi − m 2 , Similarity Search in the Presence of Noise, Scaling, and

i=1 Translation in Time-Series Databases, Proc. of the 2Ist

VLDB Conference(1995), pp. 490–501.

N t

where S = i=1 (p i − m)(p i − m) , called scatter [5] T. Bozkaya, N. Yazdani and Z. M. Ozsoyoglu, Matching

matrix [26]. and Indexing Sequences of Diﬀerent Lengths, Proc. of

Obviously, the vector e that minimizes above equa- the 6th International Conference on Information and

t Knowledge Management(1997), pp. 128–135.

tion also maximizes e Se. We can use Lagrange mul-

t [6] E. Keogh, A fast and robust method for pattern match-

tipliers to maximize e Se subject to the constraint

ing in sequences database, WUSS(1997).

e 2 = 1. Let: [7] E. Keogh and P. Smyth. A Probabilistic Approach to

Fast Pattern Matching in Sequences Databases, The

(7.21) µ = et Se − λ(et e − 1) 3rd Intl. Conf. on Knowledge Discovery and Data

Mining(1997), pp. 24–30.

Diﬀerentiating µ with respect to e, we have: [8] C. Faloutsos, M. Ranganathan and Y. Manolopoulos,

Fast Subsequence Matching in Time-Series Databases,

∂µ

(7.22) = 2Se − 2λe In Proc. of the ACM SIGMOD Conference on manage-

∂e ment of Data(1994), pp. 419–429.

[9] C. Chung, S. Lee, S. Chun, D. Kim and J. Lee, Sim-

Therefore, to maximize et Se, e must be the eigenvector ilarity Search for Multidimensional Data Sequences,

of the scatter matrix S: Proc. of the 16th International Conf. on Data Engi-

neering(2000), pp. 599–608.

(7.23) Se = λe [10] D. Goldin and P. Kanellakis, On similarity queries for

time-series data: constraint speciﬁcation and imple-

Furthermore, note that: mentation, The 1st Intl.Conf. on the Principles and

practice of Constraint Programming(1995), pp. 137–

(7.24) et Se = λet e = λ 153.

[11] C. Perng, H. Wang, S. Zhang and D. Parker, Land-

marks: a New Model for Similarity-based Pattern

Querying in Sequences Databases, Proc. of the 16th

International Conf. on Data Engineering(2000).

[12] H. Jagadish, A. Mendelzon and T. Milo, Similarity-

Based Queries, The Symposium on Principles of

Database Systems(1995), pp. 36–45.

[13] D. Raﬁei and A. Mendelzon, Similarity-Based Queries

for Sequences Data, Proc. of the ACM SIGMOD Con-

ference on Management of Data(1997), pp. 13–25.

[14] C. Li, P. Yu and V. Castelli, Similarity Search Algo-

rithm for Databases of Long Sequences, The 12th Intl.

Conf. on Data Engineering(1996), pp. 546–553.

[15] G. Das, D. Gunopulos and H. Mannila, Finding simi-

lar sequences, The 1st European Symposium on Prin-

ciples of Data Mining and Knowledge Discovery(1997),

pp. 88–100.

[16] K. Chu and M. Wong, Fast Time-Series Searching

with Scaling and Shifting, The 18th ACM Symp. on

Principles of Database Systems (PODS 1999), pp. 237–

248.

[17] B. Bollobas, G. Das, D. Gunopulos and H. Mannila,

Time-Series Similarity Problems and Well-Separated

Geometric Sets, The 13th Annual ACM Symposium

on Computational Geometry(1997), pp. 454–456.

[18] D. Berndt and J. Cliﬀord, Using Dynamic Time Warp-

ing to Find Patterns in Sequences, Working Notes

of the Knowledge Discovery in Databases Workshop

(1994), pp. 359–370.

[19] B. Yi, H. Jagadish and C. Faloutsos, Eﬃcient Retrieval

of Similar Time Sequences Under Time Warping, Proc.

of the 14th International Conference on Data Engineer-

ing(1998), pp. 23–27.

[20] S. Park, W. Chu, J. Yoon and C. Hsu, Eﬃcient

Similarity Searches for Time-Warped Subsequences in

Sequence Databases, Proc. of the 16th International

Conf. on Data Engineering(2000).

[21] Z. Struzik and A. Siebes, The Haar Wavelet Transform

in the Sequences Similarity Paradigm, PKDD(1999).

[22] K. Chan and W. FU. Eﬃcient Sequences Matching

by Wavelets, The 15th international Conf. on Data

Engineering(1999).

[23] G. Das, K. Lin, H. Mannila, G. Renganathan and

P. Smyt, Rule Discovery from Sequences, Knowledge

Discovery and Data Mining(1998), pp. 16–22.

[24] G. Das, D. Gunopulos, Sequences Similarity Measures,

KDD-2000: Sequences Tutorial.

[25] I. Dhillon, A New O(n2 ) Algorithm for the Symet-

ric Tridiagonal igenvalue/Eigenvector Problem, Ph.D.

Thesis. Univ. of. Calif., Berkerley, 1997.

[26] R. Duda, P. Hart and D. Stork, Pattern Classiﬁcation.

2nd Edition, John Wiley & Sons, 2000.

[27] J. Wooldridge, Introductory Econometrics: a modern

approach, South-Western College Publishing, 1999.

[28] F. Mosteller and J. Tukey, Data Analysis and Regres-

sion: A Second Course in Statistics, Addison-Wesley,

1977.

- Ken Black QA ch16Uploaded byRushabh Vora
- S1 mindmapUploaded by'Tahir Ansari
- Inference Statistical SignificanceUploaded byEvelinasuper
- PhD ThesisUploaded byAmedo Amelo
- 10083-30514-1-PBUploaded byMohd Shafiq
- Econometrics - Bivariate Regression SlideshowUploaded byAdrian Alexander
- arm n feetUploaded bySuchismita Bhattacharjee
- Anti Bullying InterventionsUploaded byKarla Maria
- School Based Assessment And The Silence Behaviour Among Secondary School Teachers In Kuala Langat, Selangor Darul Ehsan.Uploaded byAJHSSR Journal
- estimationUploaded byaminoali
- chapter9.pdfUploaded bynmukherjee20
- 123165352-estimationUploaded byAzhar Siddique
- Chapter 9Uploaded byAniket Batra
- Chapter14_Review_Answers.pdfUploaded byChristine Mae Torriana
- spss pptUploaded bychandanprakash30
- 7-Analisis Regresi Excel Notes - Stepwise RegressionUploaded bynandaeka
- Lecture 11.pdfUploaded byCharles
- Boschken, H. L. (2002). Urban spatial form and policy outcomes in public agencies.UAR.pdfUploaded byHenil Dudhia
- Applying Regression Models to Calculate the Q FactorUploaded bySuriza A Zabidi
- Shear Wave Velocity as Function of Spt Penetration Resistance and Vertical Effective Stress at California Bridge SitesUploaded byMarco Soto
- Regrerssion in BusinessUploaded byshagunparmar
- GB765D2DUploaded byAbhishek Chandna
- Gust i Yanis La Hu Zaman 24010216130090Uploaded byGustiyan IZ
- 4Uploaded byMAYUGAM
- Lecture_1_2018Uploaded byCẩm Hương
- 2.docUploaded bywilliam FELISILDA
- An application of the modified Holmberg–PerssonUploaded byManuel Durant Rodriguez
- Demand ForecastingUploaded byLHM Sinergi Trading
- Syllabus_B.Com_Hons_2017.pdfUploaded byParwati Kumari
- Inferential Sta-WPS OfficeUploaded byXtremo Maniac

- Model- vs. design-based sampling and variance estimationUploaded byFanny Sylvia C.
- ReviewChaps3-4Uploaded byFanny Sylvia C.
- SampleSizeCalcRevisitedUploaded byFanny Sylvia C.
- ReviewChaps1-2Uploaded byFanny Sylvia C.
- Hypo%26PowerLectureUploaded byFanny Sylvia C.
- Non%26ParaBootUploaded byFanny Sylvia C.
- Chapter 21Uploaded byFanny Sylvia C.
- Chapter 20Uploaded byFanny Sylvia C.
- Chapter 14Uploaded byFanny Sylvia C.
- Chapter 13Uploaded byFanny Sylvia C.
- Chapter 12Uploaded byFanny Sylvia C.
- Chapter 11Uploaded byFanny Sylvia C.
- Chapter 8Uploaded byFanny Sylvia C.
- Chapter 10Uploaded byFanny Sylvia C.
- Chapter 9Uploaded byFanny Sylvia C.
- Chapter 6Uploaded byFanny Sylvia C.
- Chapter 5Uploaded byFanny Sylvia C.
- Chapter5p2LectureUploaded byFanny Sylvia C.
- Chapter 7Uploaded byFanny Sylvia C.
- An Ova PowerUploaded byFanny Sylvia C.
- Intro BootstrapUploaded byMichalaki Xrisoula
- Good Article on Standard Error vs Standard DeviationUploaded byAshok Kumar Bharathidasan
- Data Modeling: General Linear Model &Statistical InferenceUploaded byFanny Sylvia C.
- Bio Math 94 CLUSTERING POPULATIONS BY MIXED LINEAR MODELSUploaded byFanny Sylvia C.
- Clustering in the Linear ModelUploaded byFanny Sylvia C.
- R Matrix TutorUploaded byFanny Sylvia C.
- The not so Short Introduction to LaTeXUploaded byoetiker
- Close Out NettingUploaded byFanny Sylvia C.

- Introduction to Statistical Learning R labs and exercises codeUploaded byMatei Stănescu
- project part 4 math 2Uploaded byapi-339771906
- Job AnalysisUploaded bykirandevi1981
- MKT-470-report-Final.docxUploaded bySadman Shabab Ratul
- UntitledUploaded byapi-106078359
- Silabus-S2-MMT 2009-20141-MIUploaded byAdityaRahadianFachrur
- Zero Inflated Models for Overdispersed Count DataUploaded byEugenio Martinez
- Comparison of Empirical ModelsUploaded byHjt Javier
- Chi SquareUploaded byNisrina Puspaningaras
- The Return of Hegemonic Theory: Dominant States and the Origins of International CooperationUploaded byEwing Owls
- Harvard Defendant Expert Report - 2017-12-15 Dr. David Card Expert Report Updated Confid Desigs RedactedUploaded byAlan Beck
- Stat a TutorialUploaded byJennifer Kelley
- Developing 3d Mine Scale GeomechanicalUploaded byJuliet Rodriguez Vilca
- The Tip of the Blade: Self-Injury Among Early AdolescentsUploaded byTiago Cabaço
- Article for ReviewUploaded byreksmanotiya
- Synopsis Corrected.pdfUploaded bySurabhi Suman
- Interactive MarketingUploaded byssunduko
- Attribution ModelsUploaded byjade1986
- QB BCom BBA Business Research MethodsUploaded byMaheswar Kondreddy
- empcaUploaded byshadow12345678910
- A Generalized Definition of the Polychoric Correlation CoefficientUploaded byAndré Luiz Lima
- rteUploaded byayazalisindhi
- Biostatistics Workbook Aug07-1Uploaded byWaElinChawi
- 2012 Schutte Final ThesesUploaded byboudard
- STEP OF PCAUploaded bysuneel kumar rathore
- Outlier Detection Using Unsupervised Learning on High Dimensional DataUploaded byAnonymous 7VPPkWS8O
- Arithmatic Mean,Median & Mode QsUploaded byBrainy12345
- Integrating Accelerated Problem Solving Into the Six SigmaUploaded byAdrian Parfeni
- Univariate Random VariablesUploaded byRadhika Nangia
- 5 Tutorial BayesianRiskUploaded byMurali Dharan

## Much more than documents.

Discover everything Scribd has to offer, including books and audiobooks from major publishers.

Cancel anytime.