
FUNDP (Namur) and ULB (Brussels), Belgium

FNRS Associate Researcher

variables into a smaller set of uncorrelated variables (principal components).

For p random variables $X_1, \dots, X_p$, the goal of PCA is to construct a new set of p axes in the directions of greatest variability.

[Figure: scatter plot with axes X1 and X2]

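The goal is to find a linear transformation $Y = \alpha_1 X_1 + \alpha_2 X_2 + \dots + \alpha_p X_p$ $(= \alpha^T X)$, with $\alpha^T \alpha = 1$, such that the variance of $Y$ $(= \mathrm{Var}(\alpha^T X) = \alpha^T \Sigma \alpha)$ is maximal.

The direction of $\alpha$ is given by the eigenvector corresponding to the largest eigenvalue of matrix $\Sigma$.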
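The eigenvector statement can be checked numerically. The sketch below is only an illustration: it applies power iteration (a technique chosen here for simplicity, not something the slides prescribe) to the correlation matrix with off-diagonal entries .7, .6 and .5 used in the illustration of these slides. The function name `leading_eigenpair` is hypothetical.

```python
import math

def leading_eigenpair(matrix, iterations=500):
    """Power iteration: approximate the largest eigenvalue and its unit eigenvector."""
    n = len(matrix)
    v = [1.0 / math.sqrt(n)] * n
    for _ in range(iterations):
        # Multiply by the matrix, then rescale to unit length.
        w = [sum(matrix[i][j] * v[j] for j in range(n)) for i in range(n)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    # Rayleigh quotient v' C v: the variance of the projection on v.
    cv = [sum(matrix[i][j] * v[j] for j in range(n)) for i in range(n)]
    eigenvalue = sum(v[i] * cv[i] for i in range(n))
    return eigenvalue, v

# Correlation matrix of the illustration (x1, x2, x3).
C = [[1.0, 0.7, 0.6],
     [0.7, 1.0, 0.5],
     [0.6, 0.5, 1.0]]

eigenvalue, direction = leading_eigenpair(C)
print("largest eigenvalue:", round(eigenvalue, 4))
print("first direction:  ", [round(x, 4) for x in direction])
```

The printed direction should be close to the Comp1 loadings that `pca x1-x3` reports on the clean simulated data, up to sampling error.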

first), is the one that has the second highest variance. This corresponds to the eigenvector associated with the second largest eigenvalue.

And so on.

PC is given by $\lambda_i / \sum_{j=1}^{p} \lambda_j$

eigenvalues >1
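As a small worked check, the share of variance carried by each component is its eigenvalue divided by the sum of all eigenvalues, and the "eigenvalues > 1" rule keeps the components whose eigenvalue exceeds one. The sketch below uses the three eigenvalues printed by the `pca x1-x3` run on the contaminated data in these slides; the variable names are illustrative only.

```python
# Eigenvalues copied from the contaminated `pca x1-x3` output in these slides.
eigenvalues = [1.51219, 1.00075, 0.487058]

total = sum(eigenvalues)
proportions = [ev / total for ev in eigenvalues]

# Running sum of the proportions gives the cumulative share.
cumulative = []
running = 0.0
for share in proportions:
    running += share
    cumulative.append(running)

# Components kept under the eigenvalue-greater-than-one rule.
retained = sum(1 for ev in eigenvalues if ev > 1)

print("proportion:", [round(p, 4) for p in proportions])
print("cumulative:", [round(c, 4) for c in cumulative])
print("components with eigenvalue > 1:", retained)
```

The rounded proportions reproduce the Proportion and Cumulative columns of that Stata table.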

covariance matrix, which is sensitive to outliers. Illustration:

. matrix list C

        c1   c2   c3
    r1   1
    r2  .7    1
    r3  .6   .5    1

. corr x1 x2 x3
(obs=1000)

             |       x1       x2       x3

basing the PCA on a robust estimation of the covariance (correlation) matrix.

A well-suited method for this is MCD (Minimum Covariance Determinant), which considers all subsets containing h% of the observations (generally 50%) and estimates $\mu$ and $\Sigma$ on the data of the subset associated with the smallest covariance matrix determinant.

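Intuition

$$\Sigma = \begin{pmatrix} \sigma_x^2 & \sigma_{xy} \\ \sigma_{xy} & \sigma_y^2 \end{pmatrix} \quad \text{and} \quad \det(\Sigma) = \sigma_x^2 \sigma_y^2 - \sigma_{xy}^2$$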
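To make the determinant criterion concrete, here is a minimal sketch of an exhaustive MCD search in two dimensions. It is a toy illustration under stated assumptions: the eight data points are invented, the helper names (`mean_and_cov`, `determinant`, `exhaustive_mcd`) are hypothetical, and enumerating every subset is only feasible for very small samples.

```python
from itertools import combinations

def mean_and_cov(points):
    """Sample mean and 2x2 covariance (sxx, sxy, syy) of (x, y) points."""
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    sxx = sum((x - mx) ** 2 for x, _ in points) / (n - 1)
    syy = sum((y - my) ** 2 for _, y in points) / (n - 1)
    sxy = sum((x - mx) * (y - my) for x, y in points) / (n - 1)
    return (mx, my), (sxx, sxy, syy)

def determinant(cov):
    """det(Sigma) = sxx * syy - sxy^2 for a 2x2 covariance matrix."""
    sxx, sxy, syy = cov
    return sxx * syy - sxy ** 2

def exhaustive_mcd(points, h):
    """Try every subset of size h and keep the one with the smallest determinant."""
    best = min(combinations(points, h),
               key=lambda subset: determinant(mean_and_cov(subset)[1]))
    return list(best), mean_and_cov(best)

# Invented toy data: six points close to the line y = x plus two outliers.
data = [(1.0, 1.1), (2.0, 1.9), (3.0, 3.2), (4.0, 3.9),
        (5.0, 5.1), (6.0, 6.0), (2.0, 9.0), (3.0, 10.0)]

subset, (center, cov) = exhaustive_mcd(data, h=4)  # h = 50% of the observations
print("selected subset:", subset)
print("robust determinant:   ", determinant(cov))
print("classical determinant:", determinant(mean_and_cov(data)[1]))
```

The selected subset leaves out the two outlying points, so the location and scatter computed on it are not pulled towards them.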

containing 50% of the observations: $\binom{200}{100} \approx 9.0549 \times 10^{58}$

Solution: use subsampling algorithms (Rousseeuw and Van Driessen, 1999)

3. Estimation of robust MCD

4. Estimation of robust PCA

global solution

The number of subsets $N^*$ we need to have a probability (Pr) of having at least one clean subset if $\varepsilon$% of outliers corrupt the dataset can be easily derived. Writing $\varepsilon$ as a fraction:

$$\Pr = 1 - \left(1 - (1 - \varepsilon)^p\right)^N$$

where

$(1 - \varepsilon)$ is the probability that one random point in the dataset is not an outlier;

$(1 - \varepsilon)^p$ is the probability that none of the p random points in a p-subset is an outlier;

$1 - (1 - \varepsilon)^p$ is the probability that at least one of the p random points in a p-subset is an outlier;

$\left(1 - (1 - \varepsilon)^p\right)^N$ is the probability that there is at least one outlier in each of the N p-subsets considered (i.e. that all the considered subsets are corrupt);

$1 - \left(1 - (1 - \varepsilon)^p\right)^N$ is the probability that there is at least one clean p-subset among the N considered.

Rearranging we have:

$$N^* = \frac{\log(1 - \Pr)}{\log\left(1 - (1 - \varepsilon)^p\right)}$$
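The rearranged formula is easy to evaluate. The sketch below is a minimal illustration: the function names are hypothetical, and the example values (99% probability, half of the data contaminated, p = 5) are assumptions chosen for the demonstration, not figures taken from the slides.

```python
import math

def prob_at_least_one_clean(n_subsets, outlier_share, p):
    """Pr = 1 - (1 - (1 - share)^p)^N: chance that at least one p-subset is clean."""
    clean_subset = (1.0 - outlier_share) ** p
    return 1.0 - (1.0 - clean_subset) ** n_subsets

def subsets_needed(prob, outlier_share, p):
    """N* = log(1 - Pr) / log(1 - (1 - share)^p), rounded up to an integer."""
    clean_subset = (1.0 - outlier_share) ** p
    return math.ceil(math.log(1.0 - prob) / math.log(1.0 - clean_subset))

# Assumed example: Pr = 0.99, 50% of outliers, subsets of p = 5 points.
n_star = subsets_needed(prob=0.99, outlier_share=0.5, p=5)
print("subsets needed:", n_star)
print("achieved probability:", prob_at_least_one_clean(n_star, 0.5, 5))
```

Because the result is rounded up, the achieved probability is at least the requested one, while one subset fewer falls just short of it.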

Estimate a preliminary $\mu^*$ and $\Sigma^*$.

Compute Mahalanobis distances (MD) using $\mu^*$ and $\Sigma^*$ for all individuals. MD are distributed as $\sqrt{\chi^2_p}$ for Gaussian data.

Sort individuals according to Mahalanobis distances and re-estimate $\mu^*$ and $\Sigma^*$ using the first 50% of the observations.

Repeat the previous step until convergence.
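The iteration just described (estimate, compute distances, sort, re-estimate on the closest half, repeat) can be sketched in two dimensions as follows. This is a simplified illustration, not the implementation behind the `mcd` command: the data are invented, the starting subset is simply passed in by hand, and the function names are hypothetical.

```python
import math

def mean_and_cov(points):
    """Sample mean and 2x2 covariance (sxx, sxy, syy) of (x, y) points."""
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    sxx = sum((x - mx) ** 2 for x, _ in points) / (n - 1)
    syy = sum((y - my) ** 2 for _, y in points) / (n - 1)
    sxy = sum((x - mx) * (y - my) for x, y in points) / (n - 1)
    return (mx, my), (sxx, sxy, syy)

def mahalanobis(point, center, cov):
    """Mahalanobis distance using the closed-form inverse of a 2x2 matrix."""
    sxx, sxy, syy = cov
    det = sxx * syy - sxy ** 2
    dx, dy = point[0] - center[0], point[1] - center[1]
    return math.sqrt((syy * dx * dx - 2 * sxy * dx * dy + sxx * dy * dy) / det)

def concentrate(points, start, max_steps=100):
    """Re-estimate on the closest 50% of points until the subset stops changing."""
    h = len(points) // 2
    subset = list(start)
    for _ in range(max_steps):
        center, cov = mean_and_cov(subset)
        ranked = sorted(points, key=lambda pt: mahalanobis(pt, center, cov))
        new_subset = ranked[:h]
        if sorted(new_subset) == sorted(subset):
            break  # convergence: same subset as in the previous step
        subset = new_subset
    return subset, mean_and_cov(subset)

# Invented toy data: eight points close to y = x and two outliers.
data = [(1.0, 1.1), (2.0, 1.9), (3.0, 3.2), (4.0, 3.9), (5.0, 5.1),
        (6.0, 6.0), (7.0, 7.2), (8.0, 7.9), (2.0, 9.0), (3.0, 10.0)]

# Preliminary estimate from a small starting subset (p + 1 = 3 points here).
subset, (center, cov) = concentrate(data, start=data[:3])
print("final subset:", sorted(subset))
print("distance of (2, 9):", round(mahalanobis((2.0, 9.0), center, cov), 1))
```

In practice the procedure is restarted from many random starting subsets and the solution with the smallest determinant is kept; a single start, as here, only reaches a local solution.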

estimate a robust covariance matrix.

The reason for this is simple: it relies on a non-robust preliminary estimation of the covariance matrix.

1. Compute a variant of MD: $MD_i = \sqrt{(x_i - \mathrm{MED})\,\Sigma^{-1}\,(x_i - \mathrm{MED})'}$.

2. Sort individuals according to MD. Use the subset with the first p+1 points to re-estimate $\mu$ and $\Sigma$.

3. Compute MD and sort the data.

4. Check if the first point out of the subset is an outlier. If not, add this point to the subset and repeat steps 3 and 4. Otherwise stop.

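clear
set obs 1000
* cutoff: square root of the 95th percentile of a chi2 with 5 degrees of freedom
local b=sqrt(invchi2(5,0.95))
drawnorm x1-x5 e
* contaminate the first 100 observations: x1 is replaced by N(5,1) draws
replace x1=invnorm(uniform())+5 in 1/100
* robust distances from the user-written mcd command
mcd x*, outlier
gen RD=Robust_distance
* hadimvo stores the outlier flag in a and the Hadi distance in b
hadimvo x*, gen(a b) p(0.5)
scatter RD b, xline(`b') yline(`b')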

[Figure: Fast-MCD robust distance plotted against Hadi distance (p=.5)]

C = (1 \ .7, 1 \ .6, .5, 1)

. pca x1-x3

Principal components/correlation

    Variable |    Comp1     Comp2     Comp3 | Unexplained
    ---------+------------------------------+------------
          x1 |   0.6021   -0.2227   -0.7667 |           0
          x2 |   0.5815   -0.5358    0.6123 |           0
          x3 |   0.5471    0.8145    0.1931 |           0


(100 real changes made)

. pca x1-x3

Principal components/correlation          Number of obs    =   1000
                                          Number of comp.  =      3
                                          Trace            =      3
                                          Rho              = 1.0000

    Component |  Eigenvalue  Difference  Proportion  Cumulative
    ----------+------------------------------------------------
        Comp1 |     1.51219     .511435      0.5041      0.5041
        Comp2 |     1.00075     .513695      0.3336      0.8376
        Comp3 |     .487058           .      0.1624      1.0000

    Variable |    Comp1     Comp2     Comp3 | Unexplained
    ---------+------------------------------+------------
          x1 |  -0.0261    0.9986    0.0463 |           0
          x2 |   0.7064    0.0512   -0.7059 |           0
          x3 |   0.7073   -0.0143    0.7068 |           0


. mcd x*
The number of subsamples to check is 20

. pcamat covRMCD, n(1000)

Principal components/correlation          Number of obs    =   1000
                                          Number of comp.  =      3
                                          Trace            =      3
                                          Rho              = 1.0000

    Component |  Eigenvalue  Difference  Proportion  Cumulative
    ----------+------------------------------------------------
        Comp1 |     2.24708     1.77368      0.7490      0.7490
        Comp2 |     .473402     .193882      0.1578      0.9068
        Comp3 |      .27952           .      0.0932      1.0000

    Variable |    Comp1     Comp2     Comp3 | Unexplained
    ---------+------------------------------+------------
          x1 |   0.6045   -0.0883   -0.7917 |           0
          x2 |   0.5701   -0.6462    0.5074 |           0
          x3 |   0.5564    0.7581    0.3402 |           0


accurately sum up research excellence?

prize or the Fields Medal;

and Fields Medal winners;

HiCi: Highly cited researchers;

N&S: Articles published in Nature and Science;

PUB: Articles in the Science Citation Index-expanded, and the Social Science Citation Index;

. pca

Principal components/correlation          Number of obs    =    150
                                          Number of comp.  =      5
                                          Trace            =      5
                                          Rho              = 1.0000

    Component |  Eigenvalue  Difference  Proportion  Cumulative
    ----------+------------------------------------------------
        Comp1 |     3.40526     2.53266      0.6811      0.6811
        Comp2 |     .872601     .458157      0.1745      0.8556
        Comp3 |     .414444     .225411      0.0829      0.9385
        Comp4 |     .189033    .0703686      0.0378      0.9763
        Comp5 |     .118665           .      0.0237      1.0000

        Variable |    Comp1     Comp2     Comp3     Comp4     Comp5 | Unexplained
    -------------+--------------------------------------------------+------------
    scoreonalu~i |   0.4244   -0.4816    0.5697   -0.5129   -0.0155 |           0
    scoreonaward |   0.4405   -0.5202   -0.1339    0.6991    0.1696 |           0
     scoreonhici |   0.4829    0.2651   -0.4261   -0.3417    0.6310 |           0
       scoreonns |   0.5008    0.1280   -0.3848   -0.1104   -0.7567 |           0
      scoreonpub |   0.3767    0.6409    0.5726    0.3453    0.0161 |           0


the inertia and is given by:

$\Phi_1 = 0.42\,\mathrm{Al.} + 0.44\,\mathrm{Aw.} + 0.48\,\mathrm{HiCi} + 0.50\,\mathrm{NS} + 0.38\,\mathrm{PUB}$

    Variable    | Corr. ($\Phi_1$, $X_i$)
    ------------+------------------------
    Alumni      |   0.78
    Awards      |   0.81
    HiCi        |   0.89
    N&S         |   0.92
    PUB         |   0.70
    Total score |   0.99
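For a PCA on the correlation matrix, the correlation between a component and a variable equals the variable's loading multiplied by the square root of the component's eigenvalue. The sketch below applies this to the Comp1 loadings and the first eigenvalue of the classical `pca` output in these slides; the dictionary and variable names are illustrative.

```python
import math

# First eigenvalue and Comp1 loadings copied from the classical `pca` output.
first_eigenvalue = 3.40526
comp1_loadings = {
    "Alumni": 0.4244,
    "Awards": 0.4405,
    "HiCi": 0.4829,
    "N&S": 0.5008,
    "PUB": 0.3767,
}

# Correlation between the first component and each variable.
correlations = {name: loading * math.sqrt(first_eigenvalue)
                for name, loading in comp1_loadings.items()}

for name, corr in correlations.items():
    print(f"{name:7s} {corr:.2f}")
```

The rounded values agree with the first five rows of the correlation table; the Total score row cannot be reproduced this way because it is not one of the five variables entering the PCA.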

The number of subsamples to check is 20

. pcamat covMCD, n(150) corr

Principal components/correlation          Number of obs    =    150
                                          Number of comp.  =      5
                                          Trace            =      5
                                          Rho              = 1.0000

    Component |  Eigenvalue  Difference  Proportion  Cumulative
    ----------+------------------------------------------------
        Comp1 |     1.96803     .507974      0.3936      0.3936
        Comp2 |     1.46006     .624132      0.2920      0.6856
        Comp3 |     .835928     .426794      0.1672      0.8528
        Comp4 |     .409133    .0822867      0.0818      0.9346
        Comp5 |     .326847           .      0.0654      1.0000

        Variable |    Comp1     Comp2     Comp3     Comp4     Comp5 | Unexplained
    -------------+--------------------------------------------------+------------
    scoreonalu~i |  -0.4437    0.4991    0.2350    0.6946   -0.1277 |           0
    scoreonaward |  -0.5128    0.4375   -0.0544   -0.5293    0.5123 |           0
     scoreonhici |   0.5322    0.3220   -0.3983    0.3494    0.5765 |           0
       scoreonns |   0.3178    0.6537   -0.1712   -0.3163   -0.5851 |           0
      scoreonpub |   0.3948    0.1690    0.8682   -0.1233    0.2158 |           0


$\Phi_1$ explains 38% of inertia and $\Phi_2$ explains 28% of inertia

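    Variable    | Corr. ($\Phi_1$, ·) | Corr. ($\Phi_2$, ·)
    ------------+---------------------+--------------------
    Alumni      |  -0.05              |   0.78
    Awards      |  -0.01              |   0.83
    HiCi        |   0.74              |   0.88
    N&S         |   0.63              |   0.95
    PUB         |   0.72              |   0.63
    Total score |   0.99              |   0.47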

by the presence of outliers.

obtained either by relying on a robust covariance matrix or by removing multivariate outliers identified through a robust identification method.

