
Peter P. Robejsek

FernUniversität Hagen, 2011



1 KE1
1.1 Multivariate model
Univariate: n objects are observed for one characteristic, giving n data points.
Multivariate: n objects are observed for p > 1 characteristics, giving $n \cdot p$ data points arranged in an $n \times p$ matrix.

1.2 Multidimensional normal distribution

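One-dimensional normal distribution: Let $Z \sim N(\mu, \sigma^2)$. Then Z has the distribution function

$$\Phi(z) = F_Z(z) = P(Z \le z) = \frac{1}{\sqrt{2\pi\sigma^2}} \int_{-\infty}^{z} e^{-\frac{(\xi-\mu)^2}{2\sigma^2}}\, d\xi$$

and the density

$$\varphi(z) = f_Z(z) = \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-\frac{(z-\mu)^2}{2\sigma^2}}$$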

Multidimensional normal distribution: Let $Z = (Z_1,\dots,Z_p)'$ be a p-dimensional random vector with expected value vector $\mu = (\mu_1,\dots,\mu_p)'$ and covariance matrix $\Sigma$, where $\Sigma_{ii} = \sigma_i^2$ and $\Sigma_{ij} = \sigma_{ij} = \Sigma_{ji} = \sigma_{ji}$, so $\Sigma$ is symmetric. Then $Z \sim N(\mu, \Sigma)$ iff $Z = AX + \mu$, where A is the Cholesky factor of $\Sigma = AA'$ and $X = (X_1,\dots,X_p)'$ with the $X_i \sim N(0,1)$ iid standard normal. For positive definite $\Sigma$ the distribution of Z has the density

$$\varphi(z) = f_Z(z) = \frac{1}{\sqrt{(2\pi)^p \det\Sigma}}\, e^{-\frac{1}{2}(z-\mu)'\Sigma^{-1}(z-\mu)}$$

This multidimensional distribution has the following properties:
1. Let $Y \sim N(\mu, \Sigma)$ be $p \times 1$, B an $m \times p$ matrix and c an $m \times 1$ vector. Then $X = BY + c \sim N(B\mu + c;\ B\Sigma B')$, i.e. a linear combination of normally distributed variables is again normally distributed.
2. The marginal distributions of multivariate normal distributions are again normal.
3. The components of Y are pairwise uncorrelated (and, being jointly normal, independent) iff $\mathrm{Corr}(Y) = I_p$.

Multivariate normal distribution of a data matrix: Z is an $n \times p$ matrix, $\mu$ an $n \times p$ matrix, $\Sigma$ an $np \times np$ covariance matrix.
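The representation Z = AX + μ can be tried out directly. The following is a minimal sketch in plain Python, assuming a small hand-picked μ and Σ (both hypothetical values); the function names `cholesky` and `sample_mvn` are illustrative and not part of the course text.

```python
import math
import random

def cholesky(sigma):
    """Lower-triangular A with A A' = sigma (sigma symmetric positive definite)."""
    p = len(sigma)
    a = [[0.0] * p for _ in range(p)]
    for i in range(p):
        for j in range(i + 1):
            s = sum(a[i][k] * a[j][k] for k in range(j))
            if i == j:
                a[i][j] = math.sqrt(sigma[i][i] - s)
            else:
                a[i][j] = (sigma[i][j] - s) / a[j][j]
    return a

def sample_mvn(mu, sigma, rng):
    """Draw one realization of Z = A X + mu with X_i iid N(0, 1)."""
    a = cholesky(sigma)
    p = len(mu)
    x = [rng.gauss(0.0, 1.0) for _ in range(p)]
    return [mu[i] + sum(a[i][k] * x[k] for k in range(p)) for i in range(p)]

# Hypothetical parameters for illustration only.
mu = [1.0, -2.0]
sigma = [[4.0, 1.2], [1.2, 1.0]]
A = cholesky(sigma)
z = sample_mvn(mu, sigma, random.Random(0))
```

Multiplying A by its transpose reproduces Σ, which is a quick way to check the factor before sampling.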

1.3 Estimation
Repeating an experiment concerning the iid variables $X_1,\dots,X_n$ gives the realizations $x_1,\dots,x_n$. A true parameter of the distribution of the $X_i$ might e.g. be $\theta$. We are looking for an estimator $\hat\theta(x_1,\dots,x_n)$, a function of the realizations, that is as close as possible to $\theta$.
We want the estimator to be unbiased, i.e. $E[\hat\theta(X_1,\dots,X_n)] = \theta$. An estimator is asymptotically unbiased if $\lim_{n\to\infty} E[\hat\theta(X_1,\dots,X_n)] = \theta$.
We also want the estimator to be consistent, i.e. the estimates get better with increasing sample size: $P(|\hat\theta_n - \theta| > \varepsilon) \to 0$ as $n \to \infty$ for every $\varepsilon > 0$.
Quality criterion: mean squared error (MSE)
$$\mathrm{MSE}(\theta, \hat\theta) = E[(\hat\theta - \theta)^2] = \mathrm{Var}(\hat\theta) + (E(\hat\theta) - \theta)^2$$
The second term on the right-hand side is the squared bias. Finally, the estimator is called efficient if it has finite variance and no other unbiased estimator has a lower variance.

1.3.1 Method of Moments

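The kth empirical moment is given by $m_k = \frac{1}{n}\sum_{i=1}^{n} x_i^k$. The method of moments equates the empirical moments with the corresponding theoretical moments $E[X^k]$ and solves for the unknown parameters.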

1.3.2 Maximum Likelihood

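Choose as the estimate that parameter value which is most likely given the current sample. The likelihood is defined as the probability (density) of the particular sample. For continuous distributions $L(x_1,\dots,x_n) = \prod_{i=1}^{n} f_{X_i}(x_i)$ (discrete case: $\prod_{i=1}^{n} P(X_i = x_i)$).
Let $X_1,\dots,X_n$ be iid $N(\mu, \sigma^2)$. Maximize the log-likelihood:

$$L(x_1,\dots,x_n) = \prod_{i=1}^{n} f_{X_i}(x_i) = \prod_{i=1}^{n} \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-\frac{(x_i-\mu)^2}{2\sigma^2}}$$

$$\ln L(x_1,\dots,x_n) = \sum_{i=1}^{n} \ln\left(\frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-\frac{(x_i-\mu)^2}{2\sigma^2}}\right) = -\frac{n}{2}\log(2\pi\sigma^2) - \frac{1}{2\sigma^2}\sum_{i=1}^{n}(x_i-\mu)^2$$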
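Setting the derivatives of this log-likelihood with respect to μ and σ² to zero gives the well-known closed-form maximizers: the sample mean and the mean squared deviation with divisor n. The sketch below, using hypothetical data and illustrative function names, evaluates the log-likelihood at these values so that it can be compared with nearby parameter values.

```python
import math

def log_likelihood(xs, mu, sigma2):
    """ln L = -n/2 * log(2 pi sigma^2) - 1/(2 sigma^2) * sum (x_i - mu)^2."""
    n = len(xs)
    return (-0.5 * n * math.log(2 * math.pi * sigma2)
            - sum((x - mu) ** 2 for x in xs) / (2 * sigma2))

def normal_mle(xs):
    """Closed-form maximizers of the normal log-likelihood."""
    n = len(xs)
    mu_hat = sum(xs) / n
    sigma2_hat = sum((x - mu_hat) ** 2 for x in xs) / n  # divisor n, not n - 1
    return mu_hat, sigma2_hat

xs = [2.1, 1.9, 3.2, 2.8, 2.5]          # hypothetical sample
mu_hat, sigma2_hat = normal_mle(xs)
best = log_likelihood(xs, mu_hat, sigma2_hat)
```

Note that the ML estimator of σ² divides by n and is therefore biased, although asymptotically unbiased in the sense of section 1.3.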

1.3.3 Least squares

$\hat\theta = (Y'Y)^{-1}Y'X$, where Y is the data matrix and X the vector as above.

1.4 Testing
1.4.1 Distributions
Let the standard normal distribution N(0,1) have $\alpha$-quantiles $u_\alpha$.
Let $X_1,\dots,X_n$ be iid standard normal variables. The sum of squares of these variables is a chi-squared distributed variable with n degrees of freedom: $\chi^2 = \sum_{i=1}^{n} X_i^2 \sim \chi^2_n$. Let the $\alpha$-quantiles of the $\chi^2_n$ distribution be $\chi^2_{n;\alpha}$.
Let $X_0, X_1,\dots,X_n$ be iid standard normal variables. Then the variable $t = X_0 \big/ \sqrt{\frac{1}{n}\sum_{i=1}^{n} X_i^2}$ follows a t-distribution with n degrees of freedom and $\alpha$-quantiles $t_{n;\alpha}$.
Finally let $X_1,\dots,X_m, Y_1,\dots,Y_n$ be iid standard normal variables. Then the variable $F = \left[\frac{1}{m}\sum_{i=1}^{m} X_i^2\right] \big/ \left[\frac{1}{n}\sum_{i=1}^{n} Y_i^2\right]$ follows an F-distribution with m and n degrees of freedom: $F \sim F_{m,n}$. The $\alpha$-quantiles are $F_{m,n;\alpha}$.
The $\alpha$-quantile of a continuous distribution with density $f_X(x)$ is that value on the x-axis that separates the total probability mass in the proportion $\alpha : 1-\alpha$. One may also refer to it as the $(1-\alpha)$-fractile.

1.4.2 Tests
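When testing the null hypothesis H0 against the alternative hypothesis H1, two types of error can occur:
1. Type 1 error: H0 is wrongly rejected.
2. Type 2 error: H0 is wrongly accepted.
The level of significance $\alpha$ is specified so that the frequency of type 1 errors does not exceed $\alpha$. Let T be a statistic $T(x_1,\dots,x_n)$ such that large values indicate H1 and low values indicate H0. Let the critical value then be $c_{1-\alpha}$ such that $P(T > c_{1-\alpha}) \le \alpha$ for all parameter values in H0.
One-sided hypothesis/test: $H_0: \theta \le \theta_0$ vs. $H_1: \theta > \theta_0$
Two-sided hypothesis/test: $H_0: \theta = \theta_0$ vs. $H_1: \theta \ne \theta_0$
Likelihood quotient test: $H_0: \theta \in \Theta_0$ vs. $H_1: \theta \in \Theta \setminus \Theta_0$
$$\Lambda(x_1,\dots,x_n) = \frac{\sup_{\theta \in \Theta_0} L(x_1,\dots,x_n,\theta)}{\sup_{\theta \in \Theta\setminus\Theta_0} L(x_1,\dots,x_n,\theta)}$$
For the simple hypothesis $\theta = \theta_0$: $\Lambda(x_1,\dots,x_n) = L(\theta_0) / \sup_\theta L(\theta)$.
Using the ML estimator $\hat\theta$ and taking logs: $-2\ln\Lambda(x_1,\dots,x_n) = 2(\ln L(\hat\theta) - \ln L(\theta_0)) \sim \chi^2_q$ (approximately, for large n).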

2 KE 2
2.1 Two sample problems
2.1.1 Estimators
For p characteristics and n objects in a sample from a population with mean vector $\mu$ and covariance matrix $\Sigma$, measure the $y_{ij}$, where i = 1,...,n indexes the objects (rows) and j = 1,...,p the characteristics (columns).
Estimate the mean vector $\mu = (\mu_1,\dots,\mu_p)'$ by the vector $\bar y = (\bar y_1,\dots,\bar y_p)'$, where each $\bar y_j$ is the mean of a column. This estimator is unbiased.
Unbiased estimator for the covariance matrix $\Sigma$:
$$S = \frac{1}{n-1}\sum_{i=1}^{n}(y_i - \bar y)(y_i - \bar y)' = \frac{1}{n-1}\left(\sum_{i=1}^{n} y_i y_i' - n\,\bar y\,\bar y'\right)$$
where $y_i$ denotes the vector of observations for the ith object (the ith row of the data matrix, written as a column). The matrix S has on the diagonal the empirical variances $s_j^2 = \frac{1}{n-1}\sum_{i=1}^{n}(y_{ij} - \bar y_j)^2$ and off the diagonal the empirical covariances $s_{jk} = \frac{1}{n-1}\sum_{i=1}^{n}(y_{ij} - \bar y_j)(y_{ik} - \bar y_k)$ for j,k = 1,...,p.
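The two estimators can be written out in a few lines. This is a minimal sketch in plain Python with a hypothetical 3×2 data matrix; `mean_vector` and `covariance_matrix` are illustrative names.

```python
def mean_vector(data):
    """Column means of an n x p data matrix given as a list of rows."""
    n = len(data)
    p = len(data[0])
    return [sum(row[j] for row in data) / n for j in range(p)]

def covariance_matrix(data):
    """Unbiased empirical covariance matrix S with divisor n - 1."""
    n = len(data)
    p = len(data[0])
    ybar = mean_vector(data)
    return [[sum((row[j] - ybar[j]) * (row[k] - ybar[k]) for row in data) / (n - 1)
             for k in range(p)] for j in range(p)]

data = [[1.0, 2.0], [2.0, 4.0], [3.0, 9.0]]   # hypothetical observations
ybar = mean_vector(data)
S = covariance_matrix(data)
```

For this data the mean vector is (2, 5)' and S has the variances 1 and 13 on the diagonal and the covariance 3.5 off the diagonal.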

2.1.2 Tests for one sample problems

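Simultaneous analysis of the p characteristics (variables). If the $n \cdot p$ data points come from one sample, one has a multivariate one-sample problem.

Testing the structure of the expected value vector:

$H_0: \mu = \mu^*$ vs. $H_1: \mu \ne \mu^*$

a) Known covariance matrix $\Sigma$: use the statistic $T^2 = n(\bar y - \mu^*)'\Sigma^{-1}(\bar y - \mu^*)$ and reject H0 for $T^2 > \chi^2_{p;1-\alpha}$ (i.e. at the level $\alpha$, $\mu$ is significantly different from $\mu^*$).

b) Unknown (estimated) covariance matrix, via Hotelling's $T^2$:
$$\frac{n-p}{p(n-1)}\,T^2 = \frac{n-p}{p(n-1)}\; n(\bar y - \mu^*)'S^{-1}(\bar y - \mu^*)$$
Reject H0 for $T^2 > \frac{p(n-1)}{n-p}\,F_{p,n-p;1-\alpha}$.

Simplified calculation of Hotelling's $T^2$: for a block matrix
$$G = \begin{pmatrix} A & B \\ C & D \end{pmatrix}, \qquad \det G = \begin{cases} \det A \cdot \det(D - CA^{-1}B), & \det A \ne 0 \\ \det D \cdot \det(A - BD^{-1}C), & \det D \ne 0 \end{cases}$$
Let
$$G = \begin{pmatrix} -1 & \sqrt{n}\,(\bar y - \mu^*)' \\ \sqrt{n}\,(\bar y - \mu^*) & (n-1)S \end{pmatrix}$$
The second form gives $\det G = -\det((n-1)S)\,\left(1 + \frac{T^2}{n-1}\right)$, which is solved for $T^2$ without inverting S.

c) Testing for a symmetric expected value vector, $H_0: \mu_1 = \mu_2 = \dots = \mu_p$: compute the transformation $z_i = (y_{i1} - y_{ip},\ y_{i2} - y_{ip},\ \dots,\ y_{i,p-1} - y_{ip})'$, i.e. subtract the last column entry from every entry in every row, re-estimate mean vector and covariance matrix accordingly (one entry, row and column less) and apply the above tests with $\mu^* = 0$ and $p^* = p - 1$.

Testing the structure of the covariance matrix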

To test e.g. whether the p variables display correlation:
$H_0: \Sigma = \Sigma^*$ vs. $H_1: \Sigma \ne \Sigma^*$
Use the statistic: $L = (n-1)\left(\ln\det\Sigma^* - \ln\det S + \mathrm{tr}(S\Sigma^{*-1}) - p\right)$
Correcting for small sample sizes (where n is small in comparison to p): $L' = \left[1 - \frac{1}{6(n-1)}\left(2p + 1 - \frac{2}{p-1}\right)\right] L$
Reject H0 if $L > \chi^2_{p(p+1)/2;\,1-\alpha}$

Simultaneously testing expected value vector and covariance matrix

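$H_0: \mu = \mu^*$ and $\Sigma = \Sigma^*$ vs. $H_1: \mu \ne \mu^*$ or $\Sigma \ne \Sigma^*$
Use the statistic: $\chi^2 = -pn - n\ln\det(S\Sigma^{*-1}) + n\,\mathrm{tr}(S\Sigma^{*-1})$
Reject H0 if $\chi^2 > \chi^2_{p + p(p+1)/2;\,1-\alpha}$
The rejection reflects significant differences in either the covariance matrix or the expected value vector.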
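To illustrate Hotelling's T² for the one-sample problem with estimated covariance matrix, the following sketch computes T² once through S⁻¹ and once through the determinant identity det(A + n d d') / det(A) = 1 + T²/(n−1), with A = (n−1)S and d = ȳ − μ*, which is the shortcut behind the block-determinant calculation. Data and μ* are hypothetical, the helper names are illustrative, and the F quantile is not computed because the standard library has no F distribution.

```python
def det(m):
    """Determinant by Gaussian elimination with partial pivoting."""
    a = [row[:] for row in m]
    n = len(a)
    d = 1.0
    for c in range(n):
        piv = max(range(c, n), key=lambda r: abs(a[r][c]))
        if a[piv][c] == 0.0:
            return 0.0
        if piv != c:
            a[c], a[piv] = a[piv], a[c]
            d = -d
        d *= a[c][c]
        for r in range(c + 1, n):
            f = a[r][c] / a[c][c]
            for k in range(c, n):
                a[r][k] -= f * a[c][k]
    return d

def solve(m, b):
    """Solve m x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(m)
    a = [m[i][:] + [b[i]] for i in range(n)]
    for c in range(n):
        piv = max(range(c, n), key=lambda r: abs(a[r][c]))
        a[c], a[piv] = a[piv], a[c]
        for r in range(n):
            if r != c:
                f = a[r][c] / a[c][c]
                for k in range(c, n + 1):
                    a[r][k] -= f * a[c][k]
    return [a[i][n] / a[i][i] for i in range(n)]

def hotelling_t2(data, mu_star):
    """Return T^2 computed via S^{-1} and via the determinant identity."""
    n, p = len(data), len(data[0])
    ybar = [sum(row[j] for row in data) / n for j in range(p)]
    S = [[sum((row[j] - ybar[j]) * (row[k] - ybar[k]) for row in data) / (n - 1)
          for k in range(p)] for j in range(p)]
    d = [ybar[j] - mu_star[j] for j in range(p)]
    w = solve(S, d)                                   # w = S^{-1} d
    t2 = n * sum(d[j] * w[j] for j in range(p))
    m0 = [[(n - 1) * S[j][k] for k in range(p)] for j in range(p)]
    m1 = [[m0[j][k] + n * d[j] * d[k] for k in range(p)] for j in range(p)]
    t2_det = (n - 1) * (det(m1) / det(m0) - 1.0)
    return t2, t2_det

data = [[1.0, 2.0], [2.0, 4.0], [3.0, 9.0], [4.0, 7.0], [5.0, 8.0]]  # hypothetical
n, p = 5, 2
t2, t2_det = hotelling_t2(data, [2.0, 5.0])
f_stat = (n - p) / (p * (n - 1)) * t2   # to be compared with F_{p, n-p; 1-alpha}
```

Both routes give the same T²; the scaled value `f_stat` is what one would compare with a tabulated F quantile.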

2.1.3 Tests for two sample problems

Assume two samples of sizes $n_1$, $n_2$ from two populations. For these $n_1 + n_2$ objects observe the p relevant characteristics. Let $\bar y^{(1)}, \bar y^{(2)}, S^{(1)}, S^{(2)}$ denote the estimators for the expected value vectors and covariance matrices in the respective samples.

Testing the structure of the expected value vector:

$H_0: \mu^{(1)} = \mu^{(2)}$ vs. $H_1: \mu^{(1)} \ne \mu^{(2)}$

a) For disjoint samples and identical covariance matrices $\Sigma^{(1)} = \Sigma^{(2)}$
Estimate the joint covariance matrix: $S = \frac{1}{n_1+n_2-2}\left((n_1-1)S^{(1)} + (n_2-1)S^{(2)}\right)$
Use the statistic: $T^2 = \frac{n_1 n_2}{n_1+n_2}\,(\bar y^{(1)} - \bar y^{(2)})'S^{-1}(\bar y^{(1)} - \bar y^{(2)})$
Reject H0 if: $T^2 > \frac{p(n_1+n_2-2)}{n_1+n_2-p-1}\,F_{p,\,n_1+n_2-p-1;\,1-\alpha}$

b) For disjoint samples and different covariance matrices, identical sample sizes $n_1 = n_2 = n$
Compute the transformation: $k_i = y_i^{(1)} - \bar y^{(1)} - y_i^{(2)} + \bar y^{(2)}$
Use the statistic: $T^2 = n(n-1)\,(\bar y^{(1)} - \bar y^{(2)})'\left(\sum_{i=1}^{n} k_i k_i'\right)^{-1}(\bar y^{(1)} - \bar y^{(2)})$
Reject H0 if: $T^2 > \frac{p(n-1)}{n-p}\,F_{p,\,n-p;\,1-\alpha}$

c) For disjoint samples and different covariance matrices, different sample sizes $n_1 < n_2$
Compute the transformation: $c_i = \left(y_i^{(1)} - \bar y^{(1)}\right) - \sqrt{n_1/n_2}\,\left(y_i^{(2)} - \frac{1}{n_1}\sum_{j=1}^{n_1} y_j^{(2)}\right)$ for $i = 1,\dots,n_1$
Compute the matrix: $B = \sum_{i=1}^{n_1} c_i c_i'$
Use the statistic: $T^2 = n_1(n_1-1)\,(\bar y^{(1)} - \bar y^{(2)})'B^{-1}(\bar y^{(1)} - \bar y^{(2)})$
Reject H0 if: $T^2 > \frac{p(n_1-1)}{n_1-p}\,F_{p,\,n_1-p;\,1-\alpha}$

d) For correlated samples and different covariance matrices, identical sample sizes $n_1 = n_2 = n$
Correlated samples: the ith vector $y_i^{(1)}$ is somehow related to the ith vector $y_i^{(2)}$.
Compute the transformation: $k_i = y_i^{(1)} - \bar y^{(1)} - y_i^{(2)} + \bar y^{(2)}$
Use the statistic: $T^2 = n(n-1)\,(\bar y^{(1)} - \bar y^{(2)})'\left(\sum_{i=1}^{n} k_i k_i'\right)^{-1}(\bar y^{(1)} - \bar y^{(2)})$
Reject H0 if: $T^2 > \frac{p(n-1)}{n-p}\,F_{p,\,n-p;\,1-\alpha}$

Testing the structure of the covariance matrix:

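$H_0: \Sigma^{(1)} = \Sigma^{(2)}$ vs. $H_1: \Sigma^{(1)} \ne \Sigma^{(2)}$

a) For disjoint samples
Estimate the joint covariance matrix: $S = \frac{1}{n_1+n_2-2}\left((n_1-1)S^{(1)} + (n_2-1)S^{(2)}\right)$
Compute the coefficient: $c = 1 - \frac{2p^2+3p-1}{6(p+1)}\left(\frac{1}{n_1-1} + \frac{1}{n_2-1} - \frac{1}{n_1+n_2-2}\right)$
Use the statistic: $\chi^2 = c\left[(n_1+n_2-2)\ln\det S - (n_1-1)\ln\det S^{(1)} - (n_2-1)\ln\det S^{(2)}\right]$
Reject H0 if: $\chi^2 > \chi^2_{p(p+1)/2;\,1-\alpha}$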

2.2 Correlation analysis

Quantification of the linear interdependencies between characteristics

2.2.1 Correlation of two normally distributed variables

Let X, Y be two random variables that correspond to the characteristics X, Y. The magnitude of their linear interdependence is given by their correlation:
$$\rho = \rho_{XY} = \frac{\mathrm{Cov}(X,Y)}{\sqrt{\mathrm{Var}X\,\mathrm{Var}Y}}, \quad \mathrm{Cov}(X,Y) = E[(X-EX)(Y-EY)],\ \mathrm{Var}X = E[(X-EX)^2]$$
As estimator for the correlation use $r_{XY}$, i.e. Pearson's correlation coefficient:
$$r_{XY} = \frac{s_{XY}}{s_X s_Y} = \frac{\sum_{i=1}^{n} x_i y_i - n\bar x\bar y}{\sqrt{\left(\sum_{i=1}^{n} x_i^2 - n\bar x^2\right)\left(\sum_{i=1}^{n} y_i^2 - n\bar y^2\right)}}$$
Define $R^2 = r_{XY}^2$ as the coefficient of determination. It gives the proportion of the variance of X explained by Y and vice versa.

To test whether two normally distributed variables are uncorrelated (for jointly normally distributed variables this implies independence):
$H_0: \rho_{XY} = 0$ vs. $H_1: \rho_{XY} \ne 0$
Use the statistic: $t = r_{XY}\sqrt{n-2}\big/\sqrt{1 - r_{XY}^2}$
Reject H0 if: $|t| > t_{n-2;\,1-\alpha/2}$

To test whether the correlation has some specified value:
$H_0: \rho_{XY} = \rho^*$ vs. $H_1: \rho_{XY} \ne \rho^*$
Compute Fisher's z-transform: $z = \mathrm{artanh}\, r_{XY} = \frac{1}{2}\ln\frac{1+r_{XY}}{1-r_{XY}}$
and the value $\zeta^* = \frac{1}{2}\ln\frac{1+\rho^*}{1-\rho^*} + \frac{\rho^*}{2(n-1)}$
Use the statistic: $(z - \zeta^*)\sqrt{n-3}$
Reject H0 if: $|z - \zeta^*|\sqrt{n-3} > u_{1-\alpha/2}$

To find a confidence interval for the correlation:
Find first a confidence interval for $\zeta$: $[z_1, z_2] = \left[z - u_{1-\alpha/2}/\sqrt{n-3},\ z + u_{1-\alpha/2}/\sqrt{n-3}\right]$
Apply the inverse Fisher transform $r_i = (e^{2z_i} - 1)/(e^{2z_i} + 1)$ to find the interval $[r_1, r_2]$.
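The quantities of this subsection are easy to compute by hand or in code. The sketch below uses hypothetical data and illustrative function names; the normal quantile comes from `statistics.NormalDist`, while the t quantile for the test decision would have to be looked up in a table.

```python
import math
from statistics import NormalDist

def pearson_r(xs, ys):
    """Pearson's correlation coefficient in the computational form."""
    n = len(xs)
    xbar, ybar = sum(xs) / n, sum(ys) / n
    sxy = sum(x * y for x, y in zip(xs, ys)) - n * xbar * ybar
    sxx = sum(x * x for x in xs) - n * xbar ** 2
    syy = sum(y * y for y in ys) - n * ybar ** 2
    return sxy / math.sqrt(sxx * syy)

def t_statistic(r, n):
    """Statistic for H0: rho = 0, compared with t_{n-2; 1-alpha/2}."""
    return r * math.sqrt(n - 2) / math.sqrt(1 - r * r)

def fisher_interval(r, n, alpha=0.05):
    """Confidence interval for rho via Fisher's z-transform."""
    z = math.atanh(r)
    u = NormalDist().inv_cdf(1 - alpha / 2)
    half = u / math.sqrt(n - 3)
    # tanh(z) equals (e^{2z} - 1) / (e^{2z} + 1), the inverse transform
    return math.tanh(z - half), math.tanh(z + half)

xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]      # hypothetical data
ys = [2.0, 1.0, 4.0, 3.0, 6.0, 5.0]
r = pearson_r(xs, ys)
t = t_statistic(r, len(xs))
r1, r2 = fisher_interval(r, len(xs))
```

With only six observations the interval [r1, r2] is wide, which shows how slowly correlation estimates stabilize.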

2.2.2 Rank correlation between characteristics

Replace the data $x_1,\dots,x_n$ and $y_1,\dots,y_n$ by their rank numbers $R(x_1),\dots,R(x_n)$ and $R(y_1),\dots,R(y_n)$, where the lowest value is awarded rank 1 and identical values are awarded midranks, i.e. the mean of the rank numbers they share.

Spearman's rank correlation coefficient
Let $\bar R(x) = \bar R(y) = (n+1)/2$ be the arithmetic mean of the rank numbers. Then the rank correlation coefficient $r^s_{XY}$ is Pearson's correlation coefficient computed for the rank numbers:
$$r^s_{XY} = \frac{\sum_{i=1}^{n}[R(x_i) - \bar R(x)][R(y_i) - \bar R(y)]}{\sqrt{\left(\sum_{i=1}^{n}(R(x_i) - \bar R(x))^2\right)\left(\sum_{i=1}^{n}(R(y_i) - \bar R(y))^2\right)}} = 1 - \frac{6\sum_{i=1}^{n}(R(x_i) - R(y_i))^2}{n(n^2-1)}$$
The second form holds exactly when there are no ties.

Kendall's rank correlation coefficient (Kendall's $\tau$)
To compute Kendall's $\tau$ for two characteristics X, Y from two continuous distributions, first sort the rank pairs $(R(x_1), R(y_1)),\dots,(R(x_n), R(y_n))$ in increasing order of the $R(x_i)$. Let $q_i \ge 0$ be the number of ranks $R(y_j)$ that follow $R(y_i)$ in this order and are less than $R(y_i)$. Then define Kendall's $\tau$ as:
$$\tau_{XY} = 1 - \frac{4\sum_{i=1}^{n} q_i}{n(n-1)}$$
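Both rank coefficients can be sketched in a few lines. The code below assumes no ties for the shortcut formulas (midranks are still computed as described) and reads q_i as the number of later Y-ranks that are smaller than R(y_i) once the pairs are ordered by their X-ranks; the data and function names are hypothetical.

```python
def midranks(values):
    """Rank 1 for the smallest value; ties receive the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mid = (i + j) / 2 + 1          # mean of the ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = mid
        i = j + 1
    return ranks

def spearman(xs, ys):
    """Shortcut formula 1 - 6 * sum(d^2) / (n (n^2 - 1)); exact without ties."""
    n = len(xs)
    rx, ry = midranks(xs), midranks(ys)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d2 / (n * (n * n - 1))

def kendall_tau(xs, ys):
    """tau = 1 - 4 * sum(q_i) / (n (n - 1)), assuming no ties."""
    n = len(xs)
    ry_sorted = [ry for _, ry in sorted(zip(midranks(xs), midranks(ys)))]
    q = [sum(1 for later in ry_sorted[i + 1:] if later < ry_sorted[i])
         for i in range(n)]
    return 1 - 4 * sum(q) / (n * (n - 1))

xs = [10.0, 20.0, 30.0, 40.0, 50.0]     # hypothetical data
ys = [1.0, 3.0, 2.0, 5.0, 4.0]
rho_s = spearman(xs, ys)
tau = kendall_tau(xs, ys)
```

For this data two of the ten pairs are discordant, so the sum of the q_i is 2, giving τ = 0.6, while Spearman's coefficient is 0.8.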

2.2.3 Partial correlation

Partial correlation is used to analyse whether some apparent correlation between two characteristics X, Y is merely due to the strong correlation of each of them with some common third characteristic U. For example, if there is an apparent correlation between the number of births and the number of storks, controlling for the size of swampy regions might eliminate this correlation. Partialling out the characteristic U means holding U constant:
$$\rho_{(X,Y)/U} = \frac{\rho_{XY} - \rho_{XU}\rho_{YU}}{\sqrt{(1-\rho_{XU}^2)(1-\rho_{YU}^2)}}$$
Estimate the partial correlation by way of Pearson's correlation coefficients for the n triples $(x_1,y_1,u_1),\dots,(x_n,y_n,u_n)$:
$$r_{(X,Y)/U} = \frac{r_{XY} - r_{XU}r_{YU}}{\sqrt{(1-r_{XU}^2)(1-r_{YU}^2)}}$$
To test for zero partial correlation:
$H_0: \rho_{(X,Y)/U} = 0$ vs. $H_1: \rho_{(X,Y)/U} \ne 0$
Use the statistic: $t = r_{(X,Y)/U}\sqrt{n-3}\big/\sqrt{1 - r^2_{(X,Y)/U}}$
Reject H0 if: $|t| > t_{n-3;\,1-\alpha/2}$
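A short numerical sketch shows how an apparent correlation can vanish once U is held constant. The three correlation values are hypothetical, chosen so that r_XY equals the product r_XU · r_YU; the function names are illustrative.

```python
import math

def partial_corr(r_xy, r_xu, r_yu):
    """Empirical partial correlation of X and Y given U."""
    return (r_xy - r_xu * r_yu) / math.sqrt((1 - r_xu ** 2) * (1 - r_yu ** 2))

def partial_t(r_partial, n):
    """Statistic for H0: zero partial correlation, compared with t_{n-3; 1-alpha/2}."""
    return r_partial * math.sqrt(n - 3) / math.sqrt(1 - r_partial ** 2)

# Hypothetical values: X and Y each correlate strongly with U.
r = partial_corr(0.6, 0.8, 0.75)
t = partial_t(r, n=30)
```

Here the raw correlation of 0.6 is fully accounted for by U, so the partial correlation and the t statistic are zero up to rounding.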

2.2.4 Multiple correlation

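The correlation $\rho_{X,(Y_1,\dots,Y_p)}$ between one characteristic X and several characteristics $Y_1,\dots,Y_p$ is called multiple correlation:
$$\rho_{X,(Y_1,\dots,Y_p)} = \sqrt{\rho_{XY}'\,P_Y^{-1}\,\rho_{XY}}$$
where $\rho_{XY} = (\rho_{XY_1}, \rho_{XY_2},\dots,\rho_{XY_p})'$ and
$$P_Y = \begin{pmatrix} 1 & \rho_{Y_1Y_2} & \cdots & \rho_{Y_1Y_p} \\ \rho_{Y_1Y_2} & 1 & \cdots & \rho_{Y_2Y_p} \\ \vdots & \vdots & \ddots & \vdots \\ \rho_{Y_1Y_p} & \rho_{Y_2Y_p} & \cdots & 1 \end{pmatrix}$$
For normally distributed characteristics $X, Y_1,\dots,Y_p$ this can again be estimated by the corresponding Pearson correlation coefficients for the n (p+1)-tuples $(x_1, y_{11},\dots,y_{p1}),\dots,(x_n, y_{1n},\dots,y_{pn})$.

The multiple coefficient of determination $r^2_{X,(Y_1,\dots,Y_p)}$ indicates the amount of variation in X explained by the characteristics $Y_1,\dots,Y_p$.
To test for multiple independence:
$H_0: \rho_{X,(Y_1,\dots,Y_p)} = 0$ vs. $H_1: \rho_{X,(Y_1,\dots,Y_p)} \ne 0$ (i.e. $\rho_{XY_i} \ne 0$ for at least one i)
Use the statistic: $F = \frac{(n-p-1)\,r^2_{X,(Y_1,\dots,Y_p)}}{p\,(1 - r^2_{X,(Y_1,\dots,Y_p)})}$
Reject H0 if: $F > F_{p,\,n-p-1;\,1-\alpha}$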

2.2.5 Canonic correlation

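The canonic correlation is used to describe the correlation between groups of characteristics. Define the canonic correlation between the groups $X_1,\dots,X_p$ and $Y_1,\dots,Y_q$ as the maximum, over arbitrary weight vectors, of the correlation between the linear combinations $U = \alpha_1X_1 + \dots + \alpha_pX_p$ and $V = \beta_1Y_1 + \dots + \beta_qY_q$. The vectors $(\alpha_1,\dots,\alpha_p)$ and $(\beta_1,\dots,\beta_q)$ are also called vector of regression-like parameters resp. vector of the best predictor. Define
$$R_X = \begin{pmatrix} 1 & r_{X_1X_2} & \cdots & r_{X_1X_p} \\ r_{X_1X_2} & 1 & \cdots & r_{X_2X_p} \\ \vdots & \vdots & \ddots & \vdots \\ r_{X_1X_p} & r_{X_2X_p} & \cdots & 1 \end{pmatrix},\quad R_Y = \begin{pmatrix} 1 & r_{Y_1Y_2} & \cdots & r_{Y_1Y_q} \\ r_{Y_1Y_2} & 1 & \cdots & r_{Y_2Y_q} \\ \vdots & \vdots & \ddots & \vdots \\ r_{Y_1Y_q} & r_{Y_2Y_q} & \cdots & 1 \end{pmatrix},\quad R_{XY} = \begin{pmatrix} r_{X_1Y_1} & r_{X_1Y_2} & \cdots & r_{X_1Y_q} \\ r_{X_2Y_1} & r_{X_2Y_2} & \cdots & r_{X_2Y_q} \\ \vdots & \vdots & \ddots & \vdots \\ r_{X_pY_1} & r_{X_pY_2} & \cdots & r_{X_pY_q} \end{pmatrix}$$
Then the canonic correlation is found to be the square root of the largest eigenvalue $\lambda_G$ of the matrix $R_X^{-1}R_{XY}R_Y^{-1}R_{XY}'$:
$$r_{(X_1,\dots,X_p),(Y_1,\dots,Y_q)} = \sqrt{\lambda_G}$$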

2.2.6 Correlation of discrete characteristics

For X continuous and Y discrete one uses serial correlation coefficients. If Y is dichotomous use point-biserial correlation coefficients, else point-polyserial. For X normally distributed use biserial/polyserial correlation coefficients to estimate the correlation between X and Y. For both X and Y discrete use choric correlation coefficients: for X, Y dichotomous use tetrachoric, else polychoric.
To illustrate, consider the point-biserial correlation coefficient between X continuous and Y dichotomous. Write the data in a $2 \times s$ table:
$$\begin{array}{c|ccccc|c} Y \backslash X & x_1 & x_2 & \cdots & x_{s-1} & x_s & \\ \hline 1 & n_{11} & n_{12} & \cdots & n_{1,s-1} & n_{1s} & n_{1.} \\ 0 & n_{01} & n_{02} & \cdots & n_{0,s-1} & n_{0s} & n_{0.} \\ \hline & n_{.1} & n_{.2} & \cdots & n_{.,s-1} & n_{.s} & n \end{array}$$
$x_1,\dots,x_s$ denote the $s \le n$ distinct observed values of X, while Y takes the values 0 and 1. For i = 0,1 and j = 1,...,s the number $n_{ij}$ is the frequency of the observation $(x_j, i)$, with the marginal sums
$n_{.j} = n_{1j} + n_{0j}$ for j = 1,...,s and $n_{i.} = \sum_{j=1}^{s} n_{ij}$ for i = 0,1.
The point-biserial correlation coefficient is then given by:
$$r^{pbis}_{XY} = \frac{\frac{1}{n}\sqrt{n_{1.}n_{0.}}\left(\frac{1}{n_{1.}}\sum_{j=1}^{s} n_{1j}x_j - \frac{1}{n_{0.}}\sum_{j=1}^{s} n_{0j}x_j\right)}{\sqrt{\frac{1}{n}\sum_{j=1}^{s} n_{.j}x_j^2 - \left(\frac{1}{n}\sum_{j=1}^{s} n_{.j}x_j\right)^2}}$$
This corresponds to Pearson's correlation coefficient with the values 0, 1 used for the characteristic Y. Letting $\bar x^{(1)}, \bar x^{(0)}$ be the mean values of x for each case of y and s the standard deviation of x (computed with divisor n), one can also write:
$$r^{pbis}_{XY} = \frac{\frac{1}{n}\sqrt{n_{0.}n_{1.}}\,(\bar x^{(1)} - \bar x^{(0)})}{s}$$
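The claim that the point-biserial coefficient is just Pearson's coefficient with Y coded 0/1 can be checked numerically. The sketch below works on raw observations rather than on the frequency table; the data and function names are hypothetical.

```python
import math

def point_biserial(xs, ys):
    """xs continuous, ys coded 0/1; the standard deviation uses divisor n."""
    n = len(xs)
    x1 = [x for x, y in zip(xs, ys) if y == 1]
    x0 = [x for x, y in zip(xs, ys) if y == 0]
    mean = sum(xs) / n
    s = math.sqrt(sum((x - mean) ** 2 for x in xs) / n)
    diff = sum(x1) / len(x1) - sum(x0) / len(x0)
    return math.sqrt(len(x1) * len(x0)) / n * diff / s

def pearson(xs, ys):
    """Pearson's correlation coefficient from centered sums."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

xs = [1.0, 2.0, 2.0, 3.0, 4.0, 5.0]     # hypothetical measurements
ys = [0, 0, 1, 0, 1, 1]                 # dichotomous characteristic
r_pbis = point_biserial(xs, ys)
r_pearson = pearson(xs, ys)
```

Both functions return the same value up to floating-point rounding.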


2.2.7 Testing for independence of p characteristics

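Let $X_1,\dots,X_p$ be normally distributed characteristics within one population with correlations $\rho_{X_iX_j}$ for i,j = 1,...,p. One can test the hypothesis:
$H_0: \rho_{X_iX_j} = 0$ for all pairs (i,j), $i \ne j$ vs. $H_1$: there is a pair (i,j), $i \ne j$, with $\rho_{X_iX_j} \ne 0$
First compute the empirical correlation matrix, i.e. the p(p-1)/2 pairwise correlations, from the n p-tuples $(x_{11},\dots,x_{p1}),\dots,(x_{1n},\dots,x_{pn})$:
$$R_X = \begin{pmatrix} 1 & r_{X_1X_2} & \cdots & r_{X_1X_p} \\ r_{X_1X_2} & 1 & \cdots & r_{X_2X_p} \\ \vdots & \vdots & \ddots & \vdots \\ r_{X_1X_p} & r_{X_2X_p} & \cdots & 1 \end{pmatrix}$$
Use the statistic: $W = -\left(n - p - \frac{2p+5}{6}\right)\ln\det R_X$
Reject H0 if: $W > \chi^2_{p(p-1)/2;\,1-\alpha}$
This test can also serve as a test for the independence of p samples $x_{j1},\dots,x_{jn}$, j = 1,...,p, of length n.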

3 KE 3 Factor analysis
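Aim: To explain observed characteristics by means of a number of latent variables, the factors. Usually, when analyzing p characteristics $Y_1,\dots,Y_p$, correlations between the characteristics will exist and will point to the existence of some latent factors. First obtain the empirical correlation matrix from the standardized data:

Data matrix and correlation matrix:
$$Y = \begin{pmatrix} y_{11} & y_{12} & \cdots & y_{1p} \\ y_{21} & y_{22} & \cdots & y_{2p} \\ \vdots & \vdots & & \vdots \\ y_{n1} & y_{n2} & \cdots & y_{np} \end{pmatrix}, \qquad R = \begin{pmatrix} r_{11} & r_{12} & \cdots & r_{1p} \\ r_{21} & r_{22} & \cdots & r_{2p} \\ \vdots & \vdots & & \vdots \\ r_{p1} & r_{p2} & \cdots & r_{pp} \end{pmatrix} = \frac{1}{n-1}\,Y_{st}'Y_{st}$$
Compute R from the standardized data matrix $Y_{st}$, where
$$y_{ij}^{st} = \frac{y_{ij} - \bar y_j}{s_j} \quad \text{for } i = 1,\dots,n;\ j = 1,\dots,p$$
and $\bar y_j$ and $s_j$ are the empirical mean and standard deviation of the jth characteristic.

Now assume that every standardized data point $y_{ij}^{st}$ can be expressed as a linear combination of $q'$ factors $F_1,\dots,F_{q'}$:
$$y_{ij}^{st} = l_{j1}f_{i1} + \dots + l_{jq'}f_{iq'} \quad \text{for } i = 1,\dots,n \text{ and } j = 1,\dots,p$$
$$Y_{st} = \tilde F\tilde L', \qquad \tilde F = \begin{pmatrix} f_{11} & \cdots & f_{1q'} \\ \vdots & & \vdots \\ f_{n1} & \cdots & f_{nq'} \end{pmatrix}, \qquad \tilde L = \begin{pmatrix} l_{11} & \cdots & l_{1q'} \\ \vdots & & \vdots \\ l_{p1} & \cdots & l_{pq'} \end{pmatrix}$$
$\tilde L$ is called the factor pattern and contains the factor loadings $l_{jk}$. The loading is an indicator of the relationship between the factor and the characteristic. $f_{ik}$ is called the factor value of the kth factor $F_k$ for object i. $\tilde F$ is the matrix of factor values and is assumed to be standardized (N(0,1)), i.e. each column has mean 0 and variance 1. Both matrices are unknown and must be estimated from the data.

Estimate the loading matrix:
$$R = \frac{1}{n-1}\,Y_{st}'Y_{st} = \frac{1}{n-1}\,\tilde L\tilde F'\tilde F\tilde L' = \tilde L R_F\tilde L'$$
where $R_F = \frac{1}{n-1}\tilde F'\tilde F$ denotes the correlation matrix of the latent factors.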

For independent factors and linear relationships between factors and variables, $R_F = I$ and $R = \tilde L\tilde L'$, and $l_{jk} \in [-1,1]$, since these numbers give the correlation between the jth characteristic and the kth factor (Fundamental Theorem of Factor Analysis). A factor for which only one of the loadings $l_{1k},\dots,l_{pk}$ is significantly different from zero is called a unique factor, whereas several nonzero loadings indicate a common factor. A general factor has all loadings significantly different from zero. The complexity of a characteristic $Y_j$ is the number of high loadings $l_{jk}$ on common factors.

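$\tilde L$ can be decomposed into the loadings on the q common factors and the loadings on the p unique factors ($q' = q + p$):
$$\tilde L = \bar L + \bar U = \begin{pmatrix} l_{11} & \cdots & l_{1q} & 0 & \cdots & 0 \\ \vdots & & \vdots & \vdots & & \vdots \\ l_{p1} & \cdots & l_{pq} & 0 & \cdots & 0 \end{pmatrix} + \begin{pmatrix} 0 & \cdots & 0 & l_{1,q+1} & & 0 \\ \vdots & & \vdots & & \ddots & \\ 0 & \cdots & 0 & 0 & & l_{pq'} \end{pmatrix} = [L\,|\,0_{p\times p}] + [0_{p\times q}\,|\,U]$$
For orthogonal factors this implies:
$$R = \tilde L\tilde L' = (\bar L + \bar U)(\bar L + \bar U)' = \bar L\bar L' + \bar L\bar U' + \bar U\bar L' + \bar U\bar U' = \bar L\bar L' + \bar U\bar U' = LL' + UU'$$
where the elements of the matrix $U^2$ are those parts of the variances of the characteristics that cannot be explained by the common factors $F_1,\dots,F_q$:
$$U^2 = UU' = \mathrm{diag}(l^2_{1,q+1},\dots,l^2_{pq'}) = \mathrm{diag}(\delta_1^2,\dots,\delta_p^2)$$
For orthogonal factors the loadings matrix yields the reduced empirical correlation matrix:
$$\tilde R = R - U^2 = LL'$$
Its diagonal elements $k_j^2 = 1 - \delta_j^2$ are called the communality of the jth characteristic. Communalities indicate what portion of the variance of the jth characteristic is explained by the common factors.
For non-orthogonal factors one obtains $\tilde R = R - U^2 = LR_FL'$ and therefore:
$$k_j^2 = \sum_{k=1}^{q} l_{jk}^2 + 2\sum_{k=1}^{q-1}\sum_{k'=k+1}^{q} l_{jk}\,l_{jk'}\,r_{F_kF_{k'}}$$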
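For orthogonal factors the decomposition into communalities and specific variances is simple arithmetic on the rows of L. The following sketch assumes a hypothetical 3×2 loadings matrix and illustrative function names; it rebuilds R = LL' + U² and shows that the diagonal comes out as 1.

```python
def communalities(L):
    """k_j^2 = sum of squared loadings in row j (orthogonal factors)."""
    return [sum(l * l for l in row) for row in L]

def reproduced_correlation(L):
    """Reduced correlation matrix L L'."""
    p, q = len(L), len(L[0])
    return [[sum(L[j][k] * L[i][k] for k in range(q)) for i in range(p)]
            for j in range(p)]

L = [[0.8, 0.1], [0.7, 0.3], [0.2, 0.9]]     # hypothetical loadings
k2 = communalities(L)                        # communalities k_j^2
u2 = [1 - k for k in k2]                     # specific variances delta_j^2
R_reduced = reproduced_correlation(L)
R = [[R_reduced[j][i] + (u2[j] if i == j else 0.0) for i in range(3)]
     for j in range(3)]
```

The off-diagonal entries of R are determined by the common factors alone, while each diagonal entry splits into communality plus specific variance.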

3.1 Estimation of loadings matrix

For orthogonal factors the loadings matrix yields the reduced empirical correlation matrix $\tilde R = R - U^2 = LL'$. Since infinitely many such matrices L exist, the different approaches make different assumptions for the estimation.

3.1.1 ML-Method and canonic factor analysis

Assumption: The p characteristics are drawn from a jointly normally distributed population $N(\mu, \Sigma)$. The standardized characteristics $Y_1^{st},\dots,Y_p^{st}$ are then p-dimensionally normally distributed $N(0, \tilde\Sigma)$.

Let $y_{st}$ be a $p \times 1$ random vector $\sim N(0, \tilde\Sigma)$, let $\Lambda$ be a $p \times q$ theoretical loadings matrix, let f be a $q \times 1$ random vector $\sim N(0, I)$ and let e be a $p \times 1$ random vector whose components are $\sim N(0, \delta_j^2)$. Assume independence of f and e. The model
$$y_{st} = \Lambda f + e$$
yields the covariance matrix of $y_{st}$ as:
$$\tilde\Sigma = \Lambda\Lambda' + \mathrm{diag}(\delta_1^2,\dots,\delta_p^2)$$
To obtain unique estimators L for the loadings matrix $\Lambda$ and $U^2$ for the characteristic-specific (unexplained) variances $\mathrm{diag}(\delta_1^2,\dots,\delta_p^2)$, the ML method additionally requires that $L'U^{-2}L$ be a diagonal matrix.
Not required: providing ex ante communalities $k_1^2,\dots,k_p^2$.

Required: ex ante specification of the number of common factors q.

The empirical covariance matrix S is an unbiased estimator for $\Sigma$. The elements of S are jointly Wishart-distributed (multivariate $\chi^2$). Maximizing the likelihood in two steps, first w.r.t. $\Lambda$ holding $\mathrm{diag}(\delta_1^2,\dots,\delta_p^2)$ constant, then inserting this conditional ML estimator into the likelihood function and maximizing w.r.t. $\mathrm{diag}(\delta_1^2,\dots,\delta_p^2)$, leads to the eigenvalue problem
$$(R - U^2)\,U^{-2}A = AJ$$
where $J = A'U^{-2}A$ is the diagonal matrix of the eigenvalues of $(R - U^2)U^{-2}$ and A is the matrix of normalized eigenvectors. The ML estimator for $\Lambda$ turns out to be
$$L = AJ^{1/2}$$
It can only be obtained for known U, which in turn can be obtained only numerically.

Canonic factor analysis demands that the q canonic correlations between characteristics and latent factors be maximized. The first canonic correlation is the maximum correlation between a linear combination of characteristics and a linear combination of factors, the second is the same but orthogonal to the first, etc.
For identical numbers of factors both approaches (ML and canonic) lead to identical estimates of the loadings matrix. This can be seen from the eigenvalue problems: multiplying $(R - U^2)U^{-2}A = AJ$ from the left by $U^{-1}$ gives
$$U^{-1}(R - U^2)U^{-1}\,U^{-1}A = U^{-1}AJ, \qquad L = UCJ^{1/2}$$
where J is the eigenvalue matrix of $U^{-1}(R - U^2)U^{-1}$ and $C = U^{-1}A$ the matrix of normalized eigenvectors. This is therefore equivalent to the above statement.

Solve iteratively:
1. Choose appropriate starting values $\delta_{j0}^2$ for the p specific variances $\delta_j^2$ to obtain a starting value $U_0^2$ for the diagonal matrix of specific variances $U^2$. To this end let the communality $k_j^2$ of the jth characteristic $Y_j$ equal some starting value $k_{j0}^2$, e.g.
   o the multiple coefficient of determination, i.e. the squared multiple correlation of the jth characteristic $Y_j$ with the other p-1 characteristics, or
   o the maximum absolute value of the off-diagonal entries $r_{jj'}$ in the jth row of R.
   From this value define $\delta_{j0}^2 = 1 - k_{j0}^2$.
2. Calculate the positive eigenvalues $\lambda_{10},\dots,\lambda_{q0}$ of $U_0^{-1}(R - U_0^2)U_0^{-1}$ and the respective normalized eigenvectors $c_{k0} = (c_{1k0},\dots,c_{pk0})'$.
3. In the tth step calculate $U_t^2 = \mathrm{diag}(\delta_{1t}^2,\dots,\delta_{pt}^2)$, where $\delta_{jt}^2 = 1 - \delta_{j,t-1}^2\sum_{k=1}^{q}\lambda_{k,t-1}\,c_{jk,t-1}^2$, as well as the q largest eigenvalues of $U_t^{-1}(R - U_t^2)U_t^{-1}$ and the corresponding eigenvectors $c_{kt}$.
4. Repeat and stop once the estimated $U_{t+1}$ differs from $U_t$ by no more than some small quantity $\varepsilon$.
Find the loading matrix with $J^{1/2} = \mathrm{diag}(\sqrt{\lambda_{1t}},\dots,\sqrt{\lambda_{qt}})$ and $C = [c_{1t},\dots,c_{qt}]$:
$$L = [l_1,\dots,l_q] = U_tCJ^{1/2}$$

To test which of the q factors contribute significantly to explaining the correlation of the p characteristics, test successively whether the first $\tilde q$ factors suffice to reproduce $\tilde R = R - U_t^2$:
$H_0: \tilde\Sigma = \Lambda\Lambda' + \mathrm{diag}(\delta_1^2,\dots,\delta_p^2)$ with $\Lambda$ a $p \times \tilde q$ matrix vs. $H_1$: $\tilde\Sigma$ an arbitrary positive definite symmetric $p \times p$ matrix
For $\tilde q \le \frac{1}{2}\left(1 + 2p - \sqrt{1 + 8p}\right)$ use the likelihood quotient test.
Use the statistic:
$$\chi^2 = \left(n - 1 - \frac{1}{6}(2p+5) - \frac{2}{3}\tilde q\right)\ln\frac{\det\left(L_{\tilde q}L_{\tilde q}' + U^2_{(\tilde q)}\right)}{\det R}$$
where $L_{\tilde q} = [l_1,\dots,l_{\tilde q}]$ and $U^2_{(\tilde q)} = \mathrm{diag}\left(1 - \sum_{k=1}^{\tilde q} l_{1k}^2,\ \dots,\ 1 - \sum_{k=1}^{\tilde q} l_{pk}^2\right)$.
Reject H0 if: $\chi^2 > \chi^2_{[(p-\tilde q)^2 - (p+\tilde q)]/2;\,1-\alpha}$

3.2 Other methods of estimating the loadings matrix

3.2.1 Principal component analysis/ Principal factor analysis
PCA does not assume the existence of characteristic-specific variances. Rather it reproduces the correlation matrix R of the characteristics itself by orthogonalizing the characteristics. Assume an $n \times 2$ data matrix. Then the n two-dimensional data points lie in the plane. If the characteristics are normally distributed, the points roughly fill an ellipse whose major and minor axes $K_1$ and $K_2$ intersect in the center of mass S of the points. Principal component analysis transforms the coordinate system of the characteristics into that of the principal axes by moving the origin and rotating the axes. To ensure unique solutions one demands that the axes be sorted by size, i.e. the first component accounts for a maximum of the total variance, the second for the maximum of the remaining variance, etc.
For a full set of distinct eigenvalues (algebraic multiplicity 1) the loading vector $l_k$ is the eigenvector of the kth eigenvalue $\lambda_k$ of R, normalized so that its length is $\sqrt{\lambda_k}$. Then the orthogonal principal components are from an N(0,1) population, and
$$R = LL', \qquad L'L = \Lambda := \mathrm{diag}(\lambda_1,\dots,\lambda_p)$$
The magnitude of the kth eigenvalue of R shows what portion of the total variance is explained by the component $K_k$ (the proportion is $\lambda_k/p$, since $\mathrm{tr}\,R = p$).
Principal factor analysis carries out a principal component analysis on the basis of the reduced correlation matrix. Therefore it is not a genuine factor analysis per se.
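For two characteristics the principal components can be written down in closed form, because the correlation matrix [[1, r], [r, 1]] has the eigenvalues 1 + r and 1 − r with eigenvectors along the two diagonals. The sketch below assumes 0 < r < 1 and a hypothetical value r = 0.6; the function name is illustrative.

```python
import math

def pca_loadings_2x2(r):
    """Loadings for R = [[1, r], [r, 1]] with 0 < r < 1."""
    lam = [1 + r, 1 - r]                            # eigenvalues, largest first
    vecs = [[1 / math.sqrt(2), 1 / math.sqrt(2)],   # unit eigenvector for 1 + r
            [1 / math.sqrt(2), -1 / math.sqrt(2)]]  # unit eigenvector for 1 - r
    # column k of L is the kth eigenvector scaled to length sqrt(lambda_k)
    L = [[vecs[k][j] * math.sqrt(lam[k]) for k in range(2)] for j in range(2)]
    return lam, L

lam, L = pca_loadings_2x2(0.6)
LLt = [[sum(L[i][k] * L[j][k] for k in range(2)) for j in range(2)] for i in range(2)]
LtL = [[sum(L[j][a] * L[j][b] for j in range(2)) for b in range(2)] for a in range(2)]
explained = [l / sum(lam) for l in lam]             # share of total variance
```

Here LL' reproduces R, L'L is the diagonal matrix of eigenvalues, and the first component accounts for 80% of the total variance.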

3.2.2 Centroid and Jöreskog method

The centroid method is strongly influenced by subjective choices. It creates an approximation of the loadings matrix generated by a principal factor analysis.
1. Estimate the communalities $k_j^2 = 1 - \delta_j^2$ from the empirical correlation matrix R, e.g. as the maximum correlation in each column of R.
2. Create the reduced correlation matrix $\tilde R = R_0$ by substituting the communalities on the diagonal of R.
3. Choose some rows and columns arbitrarily. In both the row and the column switch the signs of the components, except the diagonal one, to obtain $R_0'$. The only restriction is that the sum of all elements be positive and that the sum of any one column divided by the square root of the sum of all entries be less than 1.
4. The factor loading of $F_1$ on characteristic j is defined as $l_{j1} = -\tilde l_{j1}$ if a sign has been switched for j, else $l_{j1} = \tilde l_{j1}$, where $\tilde l_{j1} = U_{j0}\Big/\sqrt{\sum_{j=1}^{p} U_{j0}}$ and $U_{j0}$ denotes the jth column sum of $R_0'$.
5. Repeat steps 3 and 4 on the residual matrices $R_1, R_2, \dots$ until $R_{t+1}$ is sufficiently close to zero. Then the approximate loadings matrix for q = t factors $F_1,\dots,F_q$ is
$$L = \begin{pmatrix} l_{11} & \cdots & l_{1q} \\ \vdots & & \vdots \\ l_{p1} & \cdots & l_{pq} \end{pmatrix}$$

The Jöreskog method assumes that the structure of the covariance matrix of the standardized characteristics is
$$\tilde\Sigma = \Lambda\Lambda' + \mathrm{diag}(\delta_1^2,\dots,\delta_p^2)$$
and that the $\delta_j^2$ be proportional to the reciprocals of the diagonal elements of the inverted covariance matrix:
$$\mathrm{diag}(\delta_1^2,\dots,\delta_p^2) = \theta\,(\mathrm{diag}\,\tilde\Sigma^{-1})^{-1}$$
In applications the covariance matrix is estimated by the empirical correlation matrix R. Then $\mathrm{diag}\,R^{-1} = \mathrm{diag}(r^{11},\dots,r^{pp})$ is an estimator for $\mathrm{diag}\,\tilde\Sigma^{-1}$.
Compute the eigenvalues $\lambda_1 \ge \dots \ge \lambda_p$ of $R^* = (\mathrm{diag}\,R^{-1})^{1/2}\,R\,(\mathrm{diag}\,R^{-1})^{1/2}$ and define the number of factors q by the criterion
$$\sum_{j=1}^{q}\lambda_j\Big/\sum_{j=1}^{p}\lambda_j > \gamma \quad\text{and}\quad \sum_{j=1}^{q-1}\lambda_j\Big/\sum_{j=1}^{p}\lambda_j \le \gamma$$
for a prescribed proportion, here denoted $\gamma$.
The factor of proportionality can then be estimated by: $\hat\theta = \frac{1}{p-q}\sum_{j=q+1}^{p}\lambda_j$
The characteristic-specific variances can be approximated by the matrix: $\mathrm{diag}(\hat\delta_1^2,\dots,\hat\delta_p^2) = \hat\theta\,(\mathrm{diag}\,R^{-1})^{-1}$
Now compute the eigenvectors $c_1,\dots,c_q$ of $R^*$, normalized to lengths $\sqrt{\lambda_1 - \hat\theta},\dots,\sqrt{\lambda_q - \hat\theta}$, and the estimate of the loading matrix as:
$$L = (\mathrm{diag}\,R^{-1})^{-1/2}\,(c_1,\dots,c_q)$$

3.3 Rotation of factors

The estimation methods for loadings matrices above do not give unique results, and the results are not always easy to interpret. A linear transformation of the q factors (factor rotation) enables us to extract an optimally interpretable loadings matrix.

Thurstone's simple structure:
1. Any row of L contains at least one zero, i.e. any characteristic is explained by at most q-1 factors.
2. Any column contains at least q zeros, i.e. any factor contributes to the explanation of at most p-q characteristics.
3. Any pair of columns of L has some characteristics that load strongly on one but weakly on the other factor.
4. If one extracts more than q = 4 factors, then any pair of columns of L should contain low loadings for a large number of characteristics in both columns.
5. For any pair of columns of L only a small number of characteristics should exhibit high loadings in both columns.
This means that the characteristics should form mutually exclusive clusters, each of which has high loadings for some factors, medium loadings for a few factors and almost zero loadings on the remaining factors.

This simple structure can be obtained e.g. by rotating the factors (i.e. the axes of the coordinate system). If the rotation of the factors maintains their orthogonality, the loadings matrix is multiplied by an orthogonal $q \times q$ rotation matrix $\Delta$: $L_{rot} = L\Delta$. In general this rotation allows the reproduction of the reduced correlation matrix $\tilde R$:
$$L_{rot}L_{rot}' = L\Delta\Delta'L' = LL' = \tilde R$$
The transformation matrix is obtained by multiplying together the transformation matrices of pairs of factors:
$$\Delta = \Delta_{12}\Delta_{13}\cdots\Delta_{1q}\,\Delta_{23}\cdots\Delta_{2q}\cdots\Delta_{(q-1)q}$$
where the transformation matrix $\Delta_{kk'} = (t_{uu'})$ for the factors $F_k, F_{k'}$, $k < k'$ and $-45° \le \theta_{kk'} \le 45°$, is the $q \times q$ identity matrix except for four entries: $t_{kk} = t_{k'k'} = \cos\theta_{kk'}$, $t_{kk'} = -\sin\theta_{kk'}$ and $t_{k'k} = \sin\theta_{kk'}$. All other diagonal entries are ones and all other off-diagonal entries are zeros.

If the angles of rotation are not identical for all factors, i.e. the factors lose their orthogonality, the rotation is oblique. Multiplication of the loadings matrix by the rotation matrix then merely yields the factor structure $L_{fs} = L\Delta$. To obtain the loadings matrix of the oblique factors one must multiply with the inverse of the correlation matrix of the factors ($R_F = \Delta'\Delta$):
$$L_{rot} = L_{fs}R_F^{-1} = L\Delta(\Delta'\Delta)^{-1} = L(\Delta')^{-1}$$
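That an orthogonal rotation leaves the reproduced correlations untouched can be checked with a small numerical example. The sketch uses a hypothetical 3×2 loadings matrix, an arbitrary angle of 20 degrees, and one possible sign convention for the rotation matrix (an assumption, since either sign convention yields an orthogonal matrix).

```python
import math

def rotation_matrix(theta):
    """2 x 2 orthogonal rotation matrix for one pair of factors."""
    c, s = math.cos(theta), math.sin(theta)
    return [[c, -s], [s, c]]

def matmul(a, b):
    """Plain matrix product of two lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
            for i in range(len(a))]

def transpose(a):
    return [list(col) for col in zip(*a)]

L = [[0.8, 0.1], [0.7, 0.3], [0.2, 0.9]]          # hypothetical loadings
L_rot = matmul(L, rotation_matrix(math.radians(20)))
before = matmul(L, transpose(L))                  # L L'
after = matmul(L_rot, transpose(L_rot))           # L_rot L_rot'
```

The matrices `before` and `after` agree entry by entry, so the rotation changes the individual loadings but not the communalities or the reduced correlation matrix.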

3.3.1 Orthogonal Rotation

Varimax method: Maximizes the column sum of the variances of the squared loadings normalized by the square root of the communalities. Quartimax method: maximizes the sum of the fourth power of the loadings. This leads to one factor explaining as much as possible of the information contained in the characteristics.

For varimax, maximize the variance of the squared communality-normalized factor loadings:
$$z_{jk} = \frac{l_{jk}}{k_j} = \frac{l_{jk}}{\sqrt{\sum_{k=1}^{q} l_{jk}^2}}, \qquad j = 1,\dots,p;\ k = 1,\dots,q$$
The criterion for simple structure is the varimax criterion:
$$V = p^2\sum_{k=1}^{q}\left(\frac{1}{p}\sum_{j=1}^{p}\left[z_{jk}^2 - \bar z_{.k}^2\right]^2\right) = p\sum_{k=1}^{q}\sum_{j=1}^{p} z_{jk}^4 - \sum_{k=1}^{q}\left(\sum_{j=1}^{p} z_{jk}^2\right)^2, \qquad \text{where } \bar z_{.k}^2 = \frac{1}{p}\sum_{j=1}^{p} z_{jk}^2$$
The iterative varimax procedure is as follows:
1. For a given loadings matrix L and q factors $F_1,\dots,F_q$ compute the normalized loadings matrix $Z_0$, normalizing each element $z_{jk}^{(0)}$ as above. Also compute the varimax criterion $V_0$.
2. In the tth step compute, for every pair of factors (k, k'), the auxiliary sums of the normalized loadings from which the rotation angle is derived.
3. From these obtain the step-t rotation angle $\theta_{kk'}^{(t)}$ for the factor pair (k, k').
4. Compute $Z_{t+1} = Z_t\Delta_t = Z_t\,\Delta_{12}^{(t)}\cdots\Delta_{(q-1)q}^{(t)}$ as well as $V_{t+1}$. Proceed to the next step if $V_{t+1}$ is substantially greater than $V_t$.
5. Finally compute $L_{rot}$, whose elements are $l_{jk}^{rot} = z_{jk}^{(t+1)}k_j = z_{jk}^{(t+1)}\sqrt{\sum_{k=1}^{q} l_{jk}^2}$.

The quartimax method maximizes the quartimax criterion
$$Q = \sum_{j=1}^{p}\sum_{k=1}^{q} l_{jk}^4$$
to obtain the closest approximation of a unifactor solution, i.e. one factor is supposed to explain as much of the information in the data as possible.
1. First calculate the quartimax criterion $Q_0$ for a given loadings matrix $L_0$.
2. In the tth step calculate for every pair of factors the auxiliary values that give the rotation angle $\theta_{kk'}^{(t)}$.
3. Compute $L_{t+1} = L_t\Delta_t$ and $Q_{t+1}$. Stop if $Q_t$ and $Q_{t+1}$ are almost identical. Then $L_{rot} = L_{t+1}$.
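The varimax idea can be illustrated for q = 2 factors without any closed-form angle formula. The sketch below is an assumption-laden stand-in for steps 2 and 3: it simply searches a grid of angles between −45° and 45° for the one that maximizes V, which is not the procedure of the course text but maximizes the same criterion. The loadings matrix and all function names are hypothetical.

```python
import math

def normalize_rows(L):
    """z_jk = l_jk / k_j with k_j the square root of the communality."""
    return [[l / math.sqrt(sum(v * v for v in row)) for l in row] for row in L]

def varimax_criterion(Z):
    """V = p * sum_k sum_j z_jk^4 - sum_k (sum_j z_jk^2)^2."""
    p, q = len(Z), len(Z[0])
    return sum(p * sum(Z[j][k] ** 4 for j in range(p))
               - sum(Z[j][k] ** 2 for j in range(p)) ** 2 for k in range(q))

def rotate(Z, theta):
    """Multiply Z by the rotation matrix [[cos, -sin], [sin, cos]]."""
    c, s = math.cos(theta), math.sin(theta)
    return [[x * c + y * s, -x * s + y * c] for x, y in Z]

def varimax_two_factors(L, steps=900):
    """Grid search over [-45, 45] degrees (brute force, not the closed form)."""
    Z = normalize_rows(L)
    grid = [math.radians(-45 + 90 * i / steps) for i in range(steps + 1)]
    theta = max(grid, key=lambda t: varimax_criterion(rotate(Z, t)))
    k = [math.sqrt(sum(v * v for v in row)) for row in L]
    L_rot = [[z * kj for z in row] for row, kj in zip(rotate(Z, theta), k)]
    return theta, L_rot

L = [[0.6, 0.6], [0.5, 0.5], [0.6, -0.6], [0.5, -0.5]]   # hypothetical loadings
theta, L_rot = varimax_two_factors(L)
v_before = varimax_criterion(normalize_rows(L))
v_after = varimax_criterion(normalize_rows(L_rot))
```

In this example every characteristic loads equally on both unrotated factors, and the rotation produces a loadings matrix in which each characteristic loads on one factor only.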

3.3.2 Oblique rotation

This method produces correlated factors. An oblique rotation is achieved by a transformation matrix $\Delta$ with entries $(\Delta)_{kk'} = \cos\theta_{kk'}$, where $\theta_{kk'}$ represents the angle between the factor $F_k$ and the rotated factor $F_{k'}^{rot}$. To obtain the loadings matrix of the oblique factors one must multiply with the inverse of the correlation matrix of the factors ($R_F = \Delta'\Delta$):
$$L_{rot} = L_{fs}R_F^{-1} = L\Delta(\Delta'\Delta)^{-1} = L(\Delta')^{-1}$$
The primary factor method starts by combining similar characteristics into subsets $G_1,\dots,G_q$ and from there into composite characteristics:
$$V_k = \sum_{j \in G_k} Y_j^{st} \quad \text{for } k = 1,\dots,q$$
The empirical variance of the composite characteristic $V_k$ turns out to be:
$$s_{V_k}^2 = \sum_{j \in G_k}\sum_{j' \in G_k} \tilde r_{jj'} \quad \text{for } k = 1,\dots,q$$
where $\tilde r_{jj'}$ is the (j, j') element of the reduced empirical correlation matrix. The correlation of the factor $F_k$ with the composite characteristic $V_{k'}$ is
$$r_{V_{k'}F_k} = \sum_{j \in G_{k'}} l_{jk}\Big/ s_{V_{k'}}$$
Then the entries of the transformation matrix are:
$$\cos\theta_{kk'} = r_{V_{k'}F_k}\Big/\sqrt{\sum_{k''=1}^{q} r^2_{V_{k'}F_{k''}}}$$
The correlation matrix of the rotated oblique factors is $R_F = \Delta'\Delta$ with elements $r_{F_k^{rot}F_{k'}^{rot}}$.
The angles between the rotated factors are $\arccos r_{F_k^{rot}F_{k'}^{rot}}$.

3.4 Estimating factor values

A first step in factor analysis is the description of the standardized data via a linear combina tion of factor values where the weights are the factor loadings. Since the reduced empirical =R-U2 where U2 is the diagonal matrix of characteriscorrelation matrix can be written as: R tic specific variances. This is approximately equal to: R =1/(n-1) YstYst-U2 LRFL where L is an arbitrary loadings matrix for q factors and RF the correlation matrix of these factors. Then we obtain an approximation of the standardized data matrix from the unknown factor weight matrix F: = FL' Y st F describes the n objects in terms of q factors and has p-q fewer columns than YSt. For a principal components analysis F is easily obtained from Yst=FL and YStL=FLL=FI=F since L here is an orthogonal matrix. For other cases one must estimate F. First estimation method Assume that some standardized random vector yst can be represented as a linear combination of two perfectly correlated random vectors e,f weighted by L resp. U2: yst =Lf+Ue This gives the theoretical covariance matrix: yst =LL+UU =LL+U2 Peter P. Robejsek FernUniversitt Hagen, 2011 15

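Replace $\Sigma_{y_{st}}$ by the empirical correlation matrix $R$. The respective vector of factor values then is:
$\hat f=L'R^{-1}y_{st}$, i.e. $\hat F=Y_{st}R^{-1}L$
For an orthogonal rotation $L_{rot}=L\Delta$ and for an oblique rotation $L_{rot}=L_{fs}(\Delta'\Delta)^{-1}=L\Delta(\Delta'\Delta)^{-1}=L(\Delta')^{-1}$ we obtain $L=L_{rot}\Delta'$, so that the factor value estimates are $\hat f=\Delta L_{rot}'R^{-1}y_{st}$. The factor values relative to the orthogonally rotated factors are:
$\hat f^{rot}=\Delta'\hat f=\Delta'\Delta L_{rot}'R^{-1}y_{st}=L_{rot}'R^{-1}y_{st}$ ($\hat F_{rot}=Y_{st}R^{-1}L_{rot}$)
and relative to the obliquely rotated factors:
$\hat f^{rot}=\Delta'\hat f=\Delta'\Delta L_{rot}'R^{-1}y_{st}=L_{fs}'R^{-1}y_{st}$ ($\hat F_{rot}=Y_{st}R^{-1}L_{fs}$)

Second estimation method
This method assumes that the nonstandardized data can be represented as a linear combination of mutually uncorrelated random vectors:
$y=Lf+Ue$
where $L$ and $U$ are the loadings matrix and the characteristic-specific matrix derived from the covariance matrix of the nonstandardized data. Here $f$ has $E f=\tilde f$ and $Cov\,f=I$, and $e$ has $E e=0$ and $Cov\,e=I$. Decompose $f$ into a deterministic component $\tilde f$ and a random component $\varepsilon$ with $E\varepsilon=0$, $Cov\,\varepsilon=I$:
$y=L\tilde f+L\varepsilon+Ue$
Then:
$E y=L\tilde f$, $Cov(y)=\Sigma_y=LL'+U^2$
For estimated $L$, $U$ replace $\Sigma_y$ by the empirical covariance matrix $S$. If $S$ and $L'S^{-1}L$ are invertible one obtains the estimators
$\hat{\tilde f}=(L'S^{-1}L)^{-1}L'S^{-1}y$
$\hat\varepsilon=L'S^{-1}(y-L\hat{\tilde f})=L'S^{-1}y-L'S^{-1}y=0$
So $f$ is estimated by $\hat f=\hat{\tilde f}+\hat\varepsilon=\hat{\tilde f}$. Again we have for both orthogonal and oblique rotation $L=L_{rot}\Delta'$. Therefore the factor values w.r.t. the rotated factors can be estimated by:
$\hat f^{rot}=\Delta'\hat f=\Delta'(\Delta L_{rot}'S^{-1}L_{rot}\Delta')^{-1}\Delta L_{rot}'S^{-1}y=\Delta'(\Delta')^{-1}(L_{rot}'S^{-1}L_{rot})^{-1}\Delta^{-1}\Delta L_{rot}'S^{-1}y=(L_{rot}'S^{-1}L_{rot})^{-1}L_{rot}'S^{-1}y$
For standardized data vectors replace the empirical covariance matrix $S$ by the empirical correlation matrix $R$ of the characteristics.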
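The first estimation method can be illustrated with a small sketch. It is only an illustration under assumptions: the function names `solve` and `factor_scores` are hypothetical, the loadings and the correlation matrix are taken as given, and a plain Gauss-Jordan solver stands in for whatever linear algebra routine one would normally use.

```python
def solve(a, b):
    """Solve a x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(a)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [x - factor * y for x, y in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def factor_scores(loadings, corr, y_st):
    """Regression-type estimate f_hat = L' R^{-1} y_st for one object.

    loadings: p x q matrix L, corr: p x p correlation matrix R,
    y_st: standardized observation vector of length p.
    """
    w = solve(corr, y_st)  # w = R^{-1} y_st
    q = len(loadings[0])
    return [sum(loadings[j][k] * w[j] for j in range(len(w))) for k in range(q)]


# Toy example with p = 2 characteristics and q = 1 factor (made-up numbers).
R = [[1.0, 0.5], [0.5, 1.0]]
L = [[0.8], [0.6]]
f_hat = factor_scores(L, R, [1.0, 2.0])
```

Applying `factor_scores` to every row of the standardized data matrix gives the rows of the estimated factor value matrix.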

3.5 Factor analysis summary

Factor analysis entails the following steps:
1. data collection into the data matrix
2. standardization of the data matrix
3. computation of the covariance/correlation matrix
4. estimation of communalities
5. determination of the number of factors, the reduced correlation matrix and the loadings matrix
6. rotation of the loadings matrix
7. interpretation of the factors via the rotated loadings matrix
8. computation of the factor value matrix
Only steps 2 and 3 are objectively fixed; all other steps can be subjectively influenced by the user.

4 KE 4 Scaling procedures
4.1 Introduction
Data matrix vs. distance matrix
For observations of p characteristics in n objects one obtains an $n\times p$ data matrix, every row i of which is a p-dimensional vector of observations for object i and every column k of which is an n-dimensional vector of the values of characteristic k. If the matrix contains both qualitative and quantitative data it is called mixed, otherwise qualitative or quantitative. A distance matrix for n objects is a symmetric $n\times n$ matrix with zeros on the main diagonal and distances between objects off the diagonal, where the entry $d(i,j)=d(j,i)$ indicates the degree of difference between objects i and j.

Possible distance measures
$L_r$ distance: $d(i,j)=\left(\sum_{k=1}^{p}|y_{ik}-y_{jk}|^r\right)^{1/r}$, which gives for
r=1: $d(i,j)=\sum_{k=1}^{p}|y_{ik}-y_{jk}|$ (city block distance)
r=2: $d(i,j)=\sqrt{\sum_{k=1}^{p}(y_{ik}-y_{jk})^2}$ (Euclidean distance)
r=$\infty$: $d(i,j)=\max\{|y_{i1}-y_{j1}|,\dots,|y_{ip}-y_{jp}|\}$ (Tchebycheff distance)
Weighting the Euclidean distance by the inverse of the empirical covariance matrix $S$ gives the Mahalanobis distance:
$d(i,j)=\left((y_i-y_j)'S^{-1}(y_i-y_j)\right)^{1/2}$
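As a rough illustration of the $L_r$ family, the following sketch builds a distance matrix from the rows of a data matrix. The function names are hypothetical, and the Mahalanobis distance is left out because it additionally needs the inverse covariance matrix.

```python
def minkowski(u, v, r):
    """L_r distance between two observation vectors (r >= 1)."""
    return sum(abs(a - b) ** r for a, b in zip(u, v)) ** (1.0 / r)


def chebyshev(u, v):
    """Limit r -> infinity: the largest coordinate difference."""
    return max(abs(a - b) for a, b in zip(u, v))


def distance_matrix(rows, dist):
    """Symmetric n x n matrix of pairwise distances between data rows."""
    n = len(rows)
    return [[dist(rows[i], rows[j]) for j in range(n)] for i in range(n)]


data = [[0.0, 0.0], [3.0, 4.0], [6.0, 8.0]]  # made-up data matrix, n = 3, p = 2
D_city = distance_matrix(data, lambda u, v: minkowski(u, v, 1))
D_eucl = distance_matrix(data, lambda u, v: minkowski(u, v, 2))
D_cheb = distance_matrix(data, chebyshev)
```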

4.2 Scaling of ordinal characteristics

Assume an ordinal characteristic is observed in n objects.

4.2.1 Marginal normalization

Marginal normalization aims to map the levels of the ordinal characteristic onto the real numbers so that the resulting distribution resembles a standard normal distribution as closely as possible. For k different levels, first divide the x-axis below the standard normal density into k intervals such that the probability mass above the ith interval equals the relative frequency of level i, $i=1,\dots,k$. The scaled value $x_i$ is the mean of the standard normal distribution over the ith interval. For $n_i$ occurrences of level i compute the relative frequency $h_i=n_i/n$ and the cumulated frequencies:
$\alpha_j=\sum_{i=1}^{j}h_i$ for $j=1,\dots,k-1$, where $\alpha_k=1$
Compute the $\alpha_j$ quantiles $u_j=\Phi^{-1}(\alpha_j)$ as well as the density values:
$z_j=\varphi(u_j)=\frac{1}{\sqrt{2\pi}}e^{-u_j^2/2}$
The scaled values then are:
$x_1=-z_1/h_1$, $x_k=z_{k-1}/h_k$, $x_j=(z_{j-1}-z_j)/h_j$ for $j=2,\dots,k-1$
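The computation can be sketched in a few lines. This is an illustrative sketch, not a reference implementation: the function name `marginal_normalization` is hypothetical, and it assumes that every level was observed at least once so that all quantiles are finite.

```python
from statistics import NormalDist


def marginal_normalization(counts):
    """Scaled values x_1..x_k for the absolute frequencies n_1..n_k.

    Assumes all counts are positive (every level occurs at least once).
    """
    std = NormalDist()  # standard normal distribution
    n = sum(counts)
    k = len(counts)
    h = [c / n for c in counts]  # relative frequencies h_i
    # z[j] = phi(u_j) at the cut points; z[0] = z[k] = 0 is the density at -inf / +inf
    z = [0.0]
    alpha = 0.0
    for j in range(k - 1):
        alpha += h[j]
        z.append(std.pdf(std.inv_cdf(alpha)))
    z.append(0.0)
    # x_j = (z_{j-1} - z_j) / h_j covers the first, middle and last formula at once
    return [(z[j] - z[j + 1]) / h[j] for j in range(k)]


counts = [10, 20, 10]  # made-up frequencies of three ordered levels
x = marginal_normalization(counts)
```

By construction the frequency-weighted mean of the scaled values is zero, which is a quick sanity check on any result.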

4.2.2 Percentile ranks

Assume a test in which each of n participants can achieve one of the results $i=1,\dots,k$, and let $n_i$ be the number of participants scoring at level i. Compute the absolute cumulated frequencies
$N_j=\sum_{i=1}^{j}n_i$
and the percentile rank:
$P_j\%=100\,P_j=100\,(N_j-n_j/2)/n=50\,(2N_j-n_j)/n$
Obtain the quantiles for the respective percentile ranks from the standard normal distribution:
$u_{P_j}=\Phi^{-1}(P_j)$
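A small worked sketch of the percentile rank computation follows. The function name is hypothetical and the frequencies are made up; all levels are assumed to have positive counts so that $0<P_j<1$.

```python
from statistics import NormalDist


def percentile_rank_scores(counts):
    """Return (percentile rank in %, normal quantile u_Pj) for each level j."""
    n = sum(counts)
    cumulated = 0
    result = []
    for n_j in counts:
        cumulated += n_j                    # N_j
        p_j = (cumulated - n_j / 2) / n     # P_j = (N_j - n_j/2) / n
        result.append((100 * p_j, NormalDist().inv_cdf(p_j)))
    return result


scores = percentile_rank_scores([2, 4, 2])  # n = 8 participants, k = 3 levels
```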

4.2.3 Scaling nominal characteristics

Assume that two categorical characteristics X and Y are observed in n objects. Let X have l levels and Y have c levels. The two-dimensional data can be represented by an $l\times c$ contingency table. Let $n_{ij}$ denote the frequency with which level i of X and level j of Y are observed together. Let $n_{i.}$ denote the frequency of level i of X and $n_{.j}$ the frequency of level j of Y.

        Y:   1     2    ...    c   | sum
X:  1       n11   n12   ...   n1c  | n1.
    2       n21   n22   ...   n2c  | n2.
    ...
    l       nl1   nl2   ...   nlc  | nl.
    sum     n.1   n.2   ...   n.c  | n

If all $n_{i.}$, $n_{.j}$ are greater than zero one can attribute scaled values $x_1,\dots,x_l$ and $y_1,\dots,y_c$ to the levels of the two characteristics that have mean 0 and variance 1:
$\bar x=\frac1n\sum_{i=1}^{l}n_{i.}x_i=\frac1n\sum_{j=1}^{c}n_{.j}y_j=\bar y=0$
$s_x^2=\frac1n\sum_{i=1}^{l}n_{i.}x_i^2=\frac1n\sum_{j=1}^{c}n_{.j}y_j^2=s_y^2=1$
Define vector-valued dummy variables $U=(U_1,\dots,U_l)'$ and $V=(V_1,\dots,V_c)'$ whose realizations are always unit vectors $u=e_{li}=(0,\dots,0,1,0,\dots,0)'$, $v=e_{cj}=(0,\dots,0,1,0,\dots,0)'$, where the 1 is in the ith or jth row respectively. Now we look for random variables $X^*=\alpha'U$ and $Y^*=\beta'V$ where the weights $\alpha,\beta$ are chosen such that the correlation $r_{XY}$ between them in the sample is maximal. The weight vectors result from the normalized vectors of the empirical canonical correlation between U and V. Any observation in the contingency table can be written as $(x_i^*,y_j^*)=(\alpha_i,\beta_j)$, and the scaled values for X and Y result from normalizing $\alpha$ and $\beta$. They are obtained from the $l\times l$ matrix Q with entries
$q_{ik}=q_{ki}=\left(\sum_{j=1}^{c}\frac{n_{ij}n_{kj}}{n_{.j}}-\frac{n_{i.}n_{k.}}{n}\right)\Big/\sqrt{n_{i.}n_{k.}}$
The first canonical correlation between U and V is the square root of the largest eigenvalue $\lambda_G$ of Q: $r_{XY}=\sqrt{\lambda_G}$. Find an eigenvector $f$ belonging to $\lambda_G$ and compute the weights as:
$\alpha_i=f_i/\sqrt{n_{i.}}$
Then the scaled values x result from the normalized $\alpha$'s:
$x_i=\alpha_i\sqrt n\Big/\sqrt{\sum_{i=1}^{l}n_{i.}\alpha_i^2}$ for $i=1,\dots,l$
The weights and the scaled values for characteristic Y are:
$\beta_j=\frac{1}{n_{.j}}\sum_{i=1}^{l}n_{ij}\alpha_i$ for $j=1,\dots,c$
$y_j=\beta_j\sqrt n\Big/\sqrt{\sum_{j=1}^{c}n_{.j}\beta_j^2}$ for $j=1,\dots,c$

4.3 Multidimensional scaling

We are looking for a scale that maps the n objects to vectors $y_i=(y_{i1},\dots,y_{iq})'$, $i=1,\dots,n$, in a q-dimensional vector space.

4.3.1 Principal coordinate method

Starting from a Euclidean distance matrix D, minimize the sum of the differences between the squared entries of D and the squared Euclidean distances of the configuration, $d^*(i,j)=\left(\sum_{k=1}^{q}(y_{ik}-y_{jk})^2\right)^{1/2}$ for $i,j=1,\dots,n$, i.e. minimize the function
$g(y_1,\dots,y_n)=\sum_{i=1}^{n-1}\sum_{j=i+1}^{n}\left(d^2(i,j)-d^{*2}(i,j)\right)$
w.r.t. the vectors $y_k$.
First obtain the matrix A by squaring the entries of D and multiplying by -1/2: $a_{ij}=-\frac12 d^2(i,j)$. Then obtain the positive semidefinite matrix
$B=K_nAK_n$ where $K_n=I_n-\frac1n 1_n1_n'$
Obtain the q largest (nonnegative) eigenvalues $\lambda_1\ge\dots\ge\lambda_q$ of B, sorted by size, as well as the respective eigenvectors $y_{(1)}=(y_{11},\dots,y_{n1})',\dots,y_{(q)}=(y_{1q},\dots,y_{nq})'$. Normalize so that $y_{(l)}'y_{(l)}=\lambda_l$. The rows of $(y_{(1)},\dots,y_{(q)})$ then give the q-dimensional configuration of the objects, with the center of mass in the origin.
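The double-centering step $B=K_nAK_n$ can be sketched without any matrix library, as below. The function name `double_center` is hypothetical; the subsequent eigen decomposition of B is deliberately left out here and would normally be delegated to a linear algebra routine.

```python
def double_center(dist):
    """Return B = K_n A K_n with a_ij = -0.5 * d(i,j)^2 for a symmetric distance matrix."""
    n = len(dist)
    a = [[-0.5 * dist[i][j] ** 2 for j in range(n)] for i in range(n)]
    row_mean = [sum(row) / n for row in a]
    grand_mean = sum(row_mean) / n
    # (K A K)_ij = a_ij - rowmean_i - colmean_j + grandmean; A symmetric, so colmean = rowmean
    return [[a[i][j] - row_mean[i] - row_mean[j] + grand_mean for j in range(n)]
            for i in range(n)]


# Three points on a line at positions 0, 1, 3 (made-up example).
D = [[0.0, 1.0, 3.0],
     [1.0, 0.0, 2.0],
     [3.0, 2.0, 0.0]]
B = double_center(D)
```

For distances that really come from a Euclidean configuration, B equals the matrix of inner products of the centered coordinates, which is why its eigenvectors recover the configuration.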

4.3.2 Kruskals method

Nonmetric multidimensional scaling procedure. Here one requires only the existence of proximity data, i.e. the rank order of the differences for the n(n-1)/2 pairs of objects suffices. E.g. for four objects i,j,k,l one must know whether $d(i,j)<d(k,l)$ or $d(k,l)<d(i,j)$. The actual values $d(i,j)$ are not necessary. A configuration $y_1=(y_{11},\dots,y_{1q})',\dots,y_n=(y_{n1},\dots,y_{nq})'$ for n objects in q dimensions is chosen so that the Euclidean distances between the vectors represent the rank order of the dissimilarities of the object pairs as closely as possible. A quality criterion is:
$g_S(y_1,\dots,y_n)=\inf_{\delta}g(y_1,\dots,y_n)$, where $\delta(i,j)>0$ and $\delta$ is monotone w.r.t. d
$g(y_1,\dots,y_n)=\left(\sum_{i=1}^{n-1}\sum_{j=i+1}^{n}(d^*(i,j)-\delta(i,j))^2\right)^{1/2}$

This stress function is invariant w.r.t. changes in the coordinate system and has values between 0 and
$g_S^{max}(y_1,\dots,y_n)=\left(\sum_{i=1}^{n-1}\sum_{j=i+1}^{n}(d^*(i,j)-\bar d^*)^2\right)^{1/2}$ where $\bar d^*=\frac{2}{n(n-1)}\sum_{i=1}^{n-1}\sum_{j=i+1}^{n}d^*(i,j)$
A configuration is considered very good for $g_S<0.05\,g_S^{max}$; the rating then drops in steps of 0.05 up to values of $0.20\,g_S^{max}$, which are no longer considered satisfactory. For a given dimension q of the representation space choose a starting configuration $y_1^0=(y_{11}^0,\dots,y_{1q}^0)',\dots,y_n^0=(y_{n1}^0,\dots,y_{nq}^0)'$. In the $\nu$th step compute the Euclidean distances $d^*_{\nu-1}(i,j)$ for all pairs $y_i^{\nu-1},y_j^{\nu-1}$ and sort them according to the proximities $d(i,j)$, so that the distance between the pair of objects with lowest proximity comes first. To compute the stress function $g_S$, transform the distances $d^*_{\nu-1}(i,j)$ monotonically into values $\delta_{\nu-1}(i,j)$.

5 KE 5 Classification and Identification

5.1 Cluster analysis
Based on a quantitative data matrix Y or a distance matrix D for the n objects it is possible to construct a set of m classes $K=\{K_1,\dots,K_m\}$, each of which contains at least one and at most all n objects. Clustering itself is the third step, preceded by the choice of the classification type (first step) and the choice of criteria for homogeneity within classes and heterogeneity between classes (second step).

Classification types
Cover: classes may overlap but no class is fully contained in another class: $K_i\cap K_j\notin\{K_i,K_j\}$ for $i\ne j$, i.e. $K_i\cap K_j\ne K_i$ and $K_i\cap K_j\ne K_j$.
Quasi-hierarchy: a sequence of covers. It can be visualized as a tree where the topmost layer has the finest granularity and the lowest layer is most clustered. The classes of one layer always contain the classes of the layer above. If $K_i$ is a class in the quasi-hierarchy K, then the union of all its proper subclasses is either empty or $K_i$ itself: $\bigcup_{K_j\in K,\,K_j\subsetneq K_i}K_j\in\{\emptyset,K_i\}$.
Partition: a cover whose classes do not intersect: $K_i\cap K_j=\emptyset$ for all $i\ne j$.
Hierarchy: a special quasi-hierarchy where classes within levels do not intersect, i.e. a sequence of partitions.
An exhaustive classification contains all n objects within some class. A non-exhaustive classification leaves some objects unclassified. This can make sense for a finite, exogenously given number of classes in order to prevent too much heterogeneity within classes.

The within-class homogeneity of class $K_i$ is measured by a number $h(K_i)\ge0$. Based on a distance matrix D one distinguishes the following indicators:
minimum distance index: $h(K_i)=\min_{j,k\in K_i,\,j\ne k}d(j,k)$, the minimum distance between any two objects (mostly too weak)
maximum distance index: $h(K_i)=\max_{j,k\in K_i}d(j,k)$, the maximum distance between any two objects (mostly too strict)
normalized sum of distances: $h(K_i)=\frac1c\sum_{j\in K_i}\sum_{k\in K_i;\,j<k}d(j,k)$ where $c=|K_i|(|K_i|-1)$ and $|K_i|$ is the number of objects in class i.

The among-class heterogeneity is judged by heterogeneity measures with $v(K_{i_1},K_{i_2})\ge0$, $v(K_i,K_i)=0$ and $v(K_{i_1},K_{i_2})=v(K_{i_2},K_{i_1})$ for all classes $K_i\in K$. For disjoint classes (i.e. members of a partition or classes on one level of a hierarchy) one uses:
single linkage: $v(K_{i_1},K_{i_2})=\min_{j\in K_{i_1},\,k\in K_{i_2}}d(j,k)$, i.e. the minimum inter-class distance between objects (heterogeneity of the most similar members)
complete linkage: $v(K_{i_1},K_{i_2})=\max_{j\in K_{i_1},\,k\in K_{i_2}}d(j,k)$, i.e. the maximum inter-class distance between objects (heterogeneity of the most different members)
average linkage: $v(K_{i_1},K_{i_2})=\frac{1}{|K_{i_1}|\,|K_{i_2}|}\sum_{j\in K_{i_1}}\sum_{k\in K_{i_2}}d(j,k)$

The homogeneity indicators can be used for quasi-hierarchies and covers. Not so the heterogeneity indicators. These must be modified so that overlapping classes are reduced by the overlapping objects before measuring heterogeneity. To judge the quality of classifications that are partitions one uses:
homogeneity based: $g(K)=\sum_{K_i\in K}h(K_i)$
heterogeneity based: $g(K)=\sum_{K_{i_1}\in K}\ \sum_{K_{i_2}\in K;\,i_1<i_2}v(K_{i_1},K_{i_2})$
based on homo- and heterogeneity: $g(K)=\dfrac{\sum_{K_i\in K}h(K_i)}{\sum_{K_{i_1}\in K}\ \sum_{K_{i_2}\in K;\,i_1<i_2}v(K_{i_1},K_{i_2})}$

A partition with only one class can be judged only on the basis of homogeneity. However using the first measure leads to an n object class. Hierarchies are never assessed in their totality but rather level by level based on the levels partition.

5.1.1 Constructing a partition

Based on the distance matrix D for all n objects choose a central object $j_1$ for class $K_1$, so that $K_1^0=\{j_1\}$. Increase the class size successively from the remaining objects: $K_1^1=K_1^0\cup\{j_2\}$, where the object $j_2\in\{1,\dots,n\}$ is always the one with minimum distance to the central object: $d(j_1,j_2)=\min_{j\ne j_1}d(j_1,j)$. Next, $K_1^2=K_1^1\cup\{j_3\}$ where $d(j_1,j_3)=\min_{j\ne j_1,j_2}d(j_1,j)$. Stop the procedure once a homogeneity threshold $\tilde h$ is exceeded, $h(K_1^t)>\tilde h$, or once $h(K_1^t)-h(K_1^{t-1})>\varepsilon$, where $\varepsilon$ is an arbitrary positive number. The process is then repeated for the further classes, based on distance matrices that are reduced by the members of the previous classes.

5.1.2 Constructing a hierarchy

A hierarchy is a sequence of partitions $K^0,K^1,K^2,\dots$ Divisive approaches start with the coarsest partition. If this contains all n objects the resulting hierarchy is exhaustive. Agglomerative approaches start from the finest partition and generate coarser partitions. If the finest partition contains all n objects the resulting hierarchy will be exhaustive. In the following assume that partition 0 is $K_1^0=\{1\},\dots,K_n^0=\{n\}$. In every step of the iteration merge the two classes with minimum heterogeneity until only one class remains. The hierarchy K is a sequence of partitions $K^0,K^1,\dots,K^{n-1}$ such that partition $K^{t-1}$ has n-(t-1) classes $K_1^{t-1},\dots,K_{n-(t-1)}^{t-1}$. This leads to a total of n+(n-1)+(n-2)+...+1 classes, of which only 2n-1 are different and therefore constitute the hierarchy.
Start: $K^0=\{K_1^0,\dots,K_n^0\}=\{\{1\},\dots,\{n\}\}$

In step t generate the partition $K^t=\{K_1^t,\dots,K_{n-t}^t\}$ according to the inter-class heterogeneity in the previous step:
$K_i^t=K_{i_1^*}^{t-1}\cup K_{i_2^*}^{t-1}$ for $i=\min\{i_1^*,i_2^*\}$
$K_i^t=K_{i+1}^{t-1}$ for $i\ge\max\{i_1^*,i_2^*\}$
$K_i^t=K_i^{t-1}$ else
where $i_1^*$ and $i_2^*$ are chosen so that
$v(K_{i_1^*}^{t-1},K_{i_2^*}^{t-1})=\min_{i_1,i_2\in\{1,\dots,n-(t-1)\},\,i_1\ne i_2}v(K_{i_1}^{t-1},K_{i_2}^{t-1})$

The heterogeneity can be measured by single, complete and average linkage, although single linkage is most common as it discovers very broad classes. Its downside: very heterogeneous classes might be merged only because a single object lies in between. The procedure requires the computation of heterogeneities in every step, so it can become computationally costly. The following recursion reduces the heterogeneities of step t to those of step t-1. For t=1,2,...,n-1:
$v(K_{i_1^*}^{t-1}\cup K_{i_2^*}^{t-1},K_i^t)=\alpha_1v(K_{i_1^*}^{t-1},K_i^t)+\alpha_2v(K_{i_2^*}^{t-1},K_i^t)+\alpha_3\left|v(K_{i_1^*}^{t-1},K_i^t)-v(K_{i_2^*}^{t-1},K_i^t)\right|$
where the weights $\alpha_k$ are chosen according to the measure of heterogeneity used:

                    alpha_1                                   alpha_2                                   alpha_3
single linkage      0.5                                       0.5                                       -0.5
complete linkage    0.5                                       0.5                                       0.5
average linkage     |K_{i1*}^{t-1}| / (|K_{i1*}^{t-1}| + |K_{i2*}^{t-1}|)   |K_{i2*}^{t-1}| / (|K_{i1*}^{t-1}| + |K_{i2*}^{t-1}|)   0

The resulting 2n-1 different classes can be represented as a dendrogram. Sometimes one also uses
$g(K^t)=\min_{K_{i_1}^{t-1},K_{i_2}^{t-1}\in K^{t-1},\,i_1\ne i_2}v(K_{i_1}^{t-1},K_{i_2}^{t-1})$
i.e. the heterogeneity of the merged classes, as an indicator for the quality of the partition on one level of the hierarchy.
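The agglomerative procedure with the recursion for the heterogeneities can be sketched as follows. This is a minimal sketch under assumptions: the function name `agglomerate` and the use of running class ids instead of the re-indexing rule for $K_i^t$ are choices made for the illustration, and ties between equally heterogeneous pairs are broken arbitrarily.

```python
def agglomerate(dist, linkage="single"):
    """Merge classes step by step; return [(class_a, class_b, heterogeneity), ...]."""
    clusters = {i: [i] for i in range(len(dist))}
    # v[(a, b)] with a < b holds the current heterogeneity between classes a and b
    v = {(i, j): dist[i][j] for i in clusters for j in clusters if i < j}
    history = []
    next_id = len(dist)
    while len(clusters) > 1:
        (a, b), height = min(v.items(), key=lambda item: item[1])
        na, nb = len(clusters[a]), len(clusters[b])
        if linkage == "single":
            w = (0.5, 0.5, -0.5)
        elif linkage == "complete":
            w = (0.5, 0.5, 0.5)
        else:  # average linkage
            w = (na / (na + nb), nb / (na + nb), 0.0)
        updated = {}
        for c in clusters:
            if c in (a, b):
                continue
            va = v[(min(a, c), max(a, c))]
            vb = v[(min(b, c), max(b, c))]
            # recursion: heterogeneity of the merged class to every other class
            updated[c] = w[0] * va + w[1] * vb + w[2] * abs(va - vb)
        history.append((sorted(clusters[a]), sorted(clusters[b]), height))
        merged = clusters.pop(a) + clusters.pop(b)
        v = {key: val for key, val in v.items() if a not in key and b not in key}
        clusters[next_id] = merged
        for c, val in updated.items():
            v[(c, next_id)] = val  # next_id is larger than every existing id
        next_id += 1
    return history


# Four objects on a line at positions 0, 1, 3, 7 (made-up example).
pos = [0.0, 1.0, 3.0, 7.0]
D = [[abs(p - q) for q in pos] for p in pos]
single = agglomerate(D, "single")
complete = agglomerate(D, "complete")
average = agglomerate(D, "average")
```

The third entry of each history tuple is the heterogeneity at which the two classes were merged, which is the height one would draw in a dendrogram.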

5.2 Discriminant Analysis

5.2.1 DA for Partitions

Discriminant Analysis constitutes the allocation of objects to classes.

If n objects are allocated to m disjoint classes $K_1,\dots,K_m$, which therefore form a partition of the n objects, then the objects of every class can be viewed as a learning sample from a population. Discrimination of the (n+1)st object means deciding which of the m populations it belongs to. Distinguish two data situations:
1. There are observations of p jointly normally distributed characteristics for each object
2. There exists only a distance matrix for the objects

Normally distributed population

Let there be a learning sample of n objects where $n_i$ objects come from population $i=1,\dots,m$, such that $n_1+\dots+n_m=n$. For each population i observe p jointly normally distributed characteristics with mean vector $\mu^{(i)}$ and identical covariance matrix $\Sigma$, both of which are unknown. Denote the p-dimensional observation vector of the kth object of the ith population by:
$y_{ik}=(y_{ik1},\dots,y_{ikp})'$ for $i=1,\dots,m$; $k=1,\dots,n_i$

Identification of objects


Assume m=2; then there are two populations with $N(\mu^{(1)},\Sigma)$ and $N(\mu^{(2)},\Sigma)$. The densities are given by:
$f_i(y)=\frac{1}{\sqrt{(2\pi)^p\det\Sigma}}\exp\left[-\frac12(y-\mu^{(i)})'\Sigma^{-1}(y-\mu^{(i)})\right]$ for $i=1,2$
Take the logarithm of the density quotient $f_1(y)/f_2(y)$ and find:
$\tilde h_{12}(y)=\ln\frac{\exp[-\frac12(y-\mu^{(1)})'\Sigma^{-1}(y-\mu^{(1)})]}{\exp[-\frac12(y-\mu^{(2)})'\Sigma^{-1}(y-\mu^{(2)})]}$
$=-\frac12\left[(y-\mu^{(1)})'\Sigma^{-1}(y-\mu^{(1)})-(y-\mu^{(2)})'\Sigma^{-1}(y-\mu^{(2)})\right]$
$=(\mu^{(1)}-\mu^{(2)})'\Sigma^{-1}y-\frac12(\mu^{(1)}-\mu^{(2)})'\Sigma^{-1}(\mu^{(1)}+\mu^{(2)})$
The quadratic terms in y cancel, so the result is linear in y. $\tilde h_{12}(y)$ is Fisher's linear discriminant function, which allocates an object with vector y to population i=1 if $\tilde h_{12}(y)>0$, else to i=2. If the mean vectors and the covariance matrix are unknown, substitute the empirical estimators in the above formula. In the univariate case the formula reduces to:
$h_{12}(y)=\frac{(\bar y_1-\bar y_2)\,y}{s^2}-\frac12\frac{(\bar y_1-\bar y_2)(\bar y_1+\bar y_2)}{s^2}=\frac{\bar y_1-\bar y_2}{s^2}\left[y-\frac12(\bar y_1+\bar y_2)\right]$
Then the cutoff point is midway between the population means. This classification by Euclidean distance from the mean vector does not hold in the multivariate case unless $\Sigma=\sigma^2I$. Note that $h_{ij}(y)=-h_{ji}(y)$.

Quality of discrimination

To find a simple measure of the quality of the discriminant function, classify every object on the basis of the discriminant function and take the ratio of correctly classified to total objects. The measure of separation also indicates the quality of discrimination:
$T^2(Y_1,\dots,Y_p)=tr(S_hS_e^{-1})$
where $S_e=\sum_{i=1}^{m}\sum_{k=1}^{n_i}(y_{ik}-\bar y_i)(y_{ik}-\bar y_i)'$ is the error matrix and $S_h=\sum_{i=1}^{m}n_i(\bar y_i-\bar y)(\bar y_i-\bar y)'$ the hypothesis matrix.
A larger value of $T^2$ indicates better discrimination between the m classes by the p characteristics. To test whether discriminating between the populations is significantly possible (H0: it is not possible to discriminate) use the statistic $T^2(Y_1,\dots,Y_p)$. Reject the null hypothesis if:
$T^2>c_{HL;1-\alpha}(p,n-m,m-1)$
where:
$c_{HL;1-\alpha}(p,n-m,m-1)\approx\frac{\theta^2(2u+\theta+1)}{2(\theta v+1)}F_{\theta(2u+\theta+1),\,2(\theta v+1);\,1-\alpha}$
and: $\theta=\min(p,m-1)$, $u=\frac12(|p-m+1|-1)$, $v=\frac12(n-m-p-1)$
For m=2:
$T^2(Y_1,\dots,Y_p)=tr(S_hS_e^{-1})=\frac{n_1n_2}{n(n-2)}(\bar y_1-\bar y_2)'S^{-1}(\bar y_1-\bar y_2)$
where $S=S_e/(n-2)$ is the estimator of the covariance matrix. In this case the quantiles of the Hotelling-Lawley statistic are exact:
$T^2(Y_1,\dots,Y_p)\,\frac{n-p-1}{p}\sim F_{p,n-p-1}$
The characteristics discriminate significantly at level $\alpha$ if
$T^2(Y_1,\dots,Y_p)=\frac{n_1n_2}{n(n-2)}(\bar y_1-\bar y_2)'S^{-1}(\bar y_1-\bar y_2)>\frac{p}{n-p-1}F_{p,n-p-1;1-\alpha}$

Reduction of characteristics

To reduce the number of characteristics to be observed to q<p, discard those that are not essential to the discrimination. First estimate the covariance matrix
$S=\frac{1}{n-m}\sum_{i=1}^{m}\sum_{k=1}^{n_i}(y_{ik}-\bar y_i)(y_{ik}-\bar y_i)'=\frac{1}{n-m}S_e$
and its inverse $S^{-1}=(s^{jl})_{j,l=1,\dots,p}$. Compute the matrices
$A=(\bar y_1-\bar y,\ \bar y_2-\bar y,\dots,\bar y_m-\bar y)$
$B=S^{-1}A=(b_{ji})_{j=1,\dots,p;\ i=1,\dots,m}$
The indispensability of the jth characteristic is given by:
$U_j=\frac{1}{(n-m)\,s^{jj}}\sum_{i=1}^{m}n_ib_{ji}^2$ for $j=1,\dots,p$
Eliminate the characteristic l with the lowest $U_l$ and compute the new measure of separation
$T^2(Y_1,\dots,Y_{l-1},Y_{l+1},\dots,Y_p)=T^2(Y_1,\dots,Y_p)-U_l$
To obtain the estimator for the covariance matrix of the remaining characteristics cross out the lth row and column: $S_{(1,\dots,l-1,l+1,\dots,p)}$. Compute the inverse $S_{(1,\dots,l-1,l+1,\dots,p)}^{-1}$, eliminate the lth row from A to obtain $A_{(1,\dots,l-1,l+1,\dots,p)}$, and use
$B_{(1,\dots,l-1,l+1,\dots,p)}=S_{(1,\dots,l-1,l+1,\dots,p)}^{-1}A_{(1,\dots,l-1,l+1,\dots,p)}$

Arbitrary population

This approach is suited if the p characteristics of the n objects in m classes are not jointly normally distributed, or are not known at all and only the distances are known. To allocate a new object j to one of the classes $K_1,\dots,K_m$, information about its distance to all of the objects in the learning sample is required. For a given measure of homogeneity $h(K_i)$ the new object is allocated to the class $K_{i^*}$ whose homogeneity index increases least:
$h(K_{i^*}\cup\{j\})-h(K_{i^*})=\min_{i\in\{1,\dots,m\}}\left(h(K_i\cup\{j\})-h(K_i)\right)$
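A minimal sketch of this distance-based allocation follows, assuming the normalized sum of distances from Section 5.1 as homogeneity index. The function names `normalized_sum` and `allocate` are hypothetical, and treating a single-object class as having index 0 is an assumption made for the illustration.

```python
def normalized_sum(members, dist):
    """h(K) = (1/c) * sum over pairs j<k of d(j,k), with c = |K|(|K|-1)."""
    size = len(members)
    if size < 2:
        return 0.0  # assumption: a one-object class is perfectly homogeneous
    total = sum(dist[j][k] for a, j in enumerate(members) for k in members[a + 1:])
    return total / (size * (size - 1))


def allocate(new_obj, classes, dist, h=normalized_sum):
    """Index of the class whose homogeneity index increases least when new_obj joins."""
    increases = [h(c + [new_obj], dist) - h(c, dist) for c in classes]
    return min(range(len(classes)), key=increases.__getitem__)


# Objects 0..3 form the learning sample, object 4 is new (made-up positions on a line).
pos = [0.0, 1.0, 10.0, 11.0, 2.0]
D = [[abs(p - q) for q in pos] for p in pos]
classes = [[0, 1], [2, 3]]
chosen = allocate(4, classes, D)
```

Any other homogeneity index with the same signature can be passed as `h`.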

5.2.2 Discriminant analysis for hierarchies

Assume that the hierarchy was developed on the basis of single linkage heterogeneity. In order to preserve the structure of the hierarchy the new object j must be assigned to a class in every partition $K^0,K^1,\dots,K^{n-1}$ of the hierarchy. It is therefore certainly assigned to the one class $K_1^{n-1}$ of the coarsest partition $K^{n-1}$, which separates into two classes on level $K^{n-2}$. There the object should be assigned to the class $K_{i^*}^{n-2}$ for which
$v(K_{i^*}^{n-2},\{j\})=\min_{i=1,2}v(K_i^{n-2},\{j\})$
If this class is split on some finer partition level, the object is assigned accordingly.

6 KE 6 Multivariate Linear Model

6.1 Introduction
Objective: to analyse the influence of qualitative and quantitative input variables on p quantitative response variables.
Data matrix Y: contains the observation data
Design matrix X: contains information about the input variables
Parameter matrix $\Psi$: contains information about the interaction of input and response variables
If all input variables are quantitative: multivariate regression. Here testing relies on the iid assumption for the objects (i.e. the rows of Y).
If all input variables are qualitative: multivariate analysis of variance
If both qualitative and quantitative input variables are present: multivariate analysis of covariance



6.2 General multivariate regression

To analyse the influence of m quantitative regressors (input variables) on p quantitative response variables consider the general multivariate model of the form:
$Y=X\Psi+E$
where
Y is an $n\times p$ random matrix, n>p, with $E(Y)=X\Psi$ and $Cov\,Y=I_n\otimes\Sigma$
X is the $n\times m$ design matrix, n>m
$\Psi$ is the unknown $m\times p$ parameter matrix
E is the random $n\times p$ error matrix with $E(E)=0$ and $Cov\,E=Cov\,Y=I_n\otimes\Sigma$
$\Sigma$ is the unknown $p\times p$ covariance matrix of the response variables
and the Kronecker product of a $p\times q$ matrix A with a matrix B is defined as the block matrix $A\otimes B=(a_{ij}B)$:
$A\otimes B=\begin{pmatrix}a_{11}B&\cdots&a_{1q}B\\ \vdots&&\vdots\\ a_{p1}B&\cdots&a_{pq}B\end{pmatrix}$
The parameter matrix $\Psi$ gives the relationship between the m input and the p response variables for the n objects.

If the ith row of Y is given by $y_i'=(y_{i1},\dots,y_{ip})$, the ith row of the design matrix X by $x_i'=(x_{i1},\dots,x_{im})$, and $\psi_{(j)}$ denotes the jth column of the parameter matrix $\Psi$, then the model works as follows: for some value $x_i$ of the input variables observe the corresponding response variables $y_i$; the parameter vector $\psi_{(j)}$ is supposed to explain the interaction between the input variables and the jth response variable as closely as possible. In the following assume that rg X = m, i.e. X has full column rank so that $X'X$ is regular.
Estimators:
parameter matrix: $\hat\Psi=(X'X)^{-1}X'Y$
covariance matrix: $\hat\Sigma=\frac{1}{n-m}S_e$, where $S_e=Y'Y-Y'X(X'X)^{-1}X'Y$ is the error matrix
To test hypotheses of the form
$H_0:K\Psi=0$ vs. $H_1:K\Psi\ne0$
where K is a $w\times m$ test matrix with full row rank, compute the hypothesis matrix:
$S_h=Y'X(X'X)^{-1}K'\left(K(X'X)^{-1}K'\right)^{-1}K(X'X)^{-1}X'Y$
The test statistics are functions of the eigenvalues $\xi_1\ge\xi_2\ge\dots\ge\xi_p\ge0$ of $S_hS_e^{-1}$. The critical values depend on the number p of response variables as well as on the degrees of freedom of the error, $n_e$, and of the hypothesis, $n_h$. For the general multivariate regression model we have:
$n_e=n-rg\,X=n-m$, $n_h=rg\,K=w$

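Assume a significance level $\alpha$ and test $H_0:K\Psi=0$ vs. $H_1:K\Psi\ne0$.

Wilks test
Use the statistic: $\Lambda_W=\prod_{i=1}^{p}\frac{1}{1+\xi_i}$
Reject the null if: $\Lambda_W<c_{W;\alpha}(p,n_e,n_h)$
Approximation for $c_W$, reject the null if: $-\delta\ln\Lambda_W>\chi^2_{p\,n_h;\,1-\alpha}$
where: $\delta=n_e+n_h-\frac12(p+n_h+1)$

Hotelling-Lawley test
Use the statistic: $\Lambda_{HL}=\sum_{i=1}^{p}\xi_i$
Reject the null if: $\Lambda_{HL}>c_{HL;1-\alpha}(p,n_e,n_h)$
Compute: $c_{HL;1-\alpha}(p,n_e,n_h)\approx\frac{\theta^2(2u+\theta+1)}{2(\theta v+1)}F_{\theta(2u+\theta+1),\,2(\theta v+1);\,1-\alpha}$
where: $\theta=\min(p,n_h)$, $u=\frac12(|p-n_h|-1)$, $v=\frac12(n_e-p-1)$

Pillai-Bartlett test
Use the statistic: $\Lambda_{PB}=\sum_{i=1}^{p}\frac{\xi_i}{1+\xi_i}$
Reject the null if: $\Lambda_{PB}>c_{PB;1-\alpha}(p,n_e,n_h)$
Approximation, reject the null if: $\frac{\Lambda_{PB}}{\theta-\Lambda_{PB}}>\frac{2u+\theta+1}{2v+\theta+1}F_{\theta(2u+\theta+1),\,\theta(2v+\theta+1);\,1-\alpha}$
where $\theta$, u and v are as above.

Roy test
The statistic is based on the largest eigenvalue $\xi_1$; the critical values $c_R$ are tabulated.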


For p=1 all tests reduce to a regular F-test, and for $n_h=1$ an exact F-test can be given, for example in the case of Wilks:
$c_{W;\alpha}(p,n_e,1)=\frac{1}{1+\frac{p}{n_e-p+1}F_{p,\,n_e-p+1;\,1-\alpha}}$
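Given the eigenvalues of $S_hS_e^{-1}$, the three eigenvalue-based statistics and the chi-square approximation for the Wilks test are simple to evaluate. The sketch below assumes that the eigenvalues have already been obtained elsewhere; the function names are hypothetical, and the critical values (chi-square or F quantiles) still have to be looked up separately.

```python
import math


def manova_statistics(eigenvalues):
    """Wilks, Hotelling-Lawley and Pillai-Bartlett statistics from the eigenvalues xi_i."""
    wilks = math.prod(1.0 / (1.0 + xi) for xi in eigenvalues)
    hotelling_lawley = sum(eigenvalues)
    pillai_bartlett = sum(xi / (1.0 + xi) for xi in eigenvalues)
    return wilks, hotelling_lawley, pillai_bartlett


def wilks_chi2(wilks, p, n_e, n_h):
    """Return (-delta * ln(Lambda_W), degrees of freedom p * n_h)."""
    delta = n_e + n_h - 0.5 * (p + n_h + 1)
    return -delta * math.log(wilks), p * n_h


# Made-up eigenvalues for p = 2 response variables.
lam_w, lam_hl, lam_pb = manova_statistics([1.0, 0.0])
stat, df = wilks_chi2(lam_w, p=2, n_e=10, n_h=1)
```

The null hypothesis would be rejected if `stat` exceeds the $1-\alpha$ quantile of the chi-square distribution with `df` degrees of freedom.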

6.3 Multivariate Analysis of Variance

Objective: to analyse the influence of a number of qualitative variables (factors) with a finite number of possible values on p characteristics of the objects of the population. Just like multivariate regression, MANOVA can be formulated as:
$Y=X\Psi+E$
The variables are defined as above. Again the first column of the design matrix X is all ones. Otherwise the matrix contains ones where the respective factor level (column) is relevant. The rows of the parameter matrix $\Psi$ contain the overall mean and the effects of the factor levels on the p characteristics. To ensure uniqueness one imposes restrictions (reparametrization restrictions) such that:
$Z\Psi=0$ where Z is a $z\times m$ restriction matrix

6.3.1 One-way MANOVA

Investigates the impact of a qualitative factor A with r levels on p quantitative characteristics.
$y_{ij}=\mu+\alpha_i+e_{ij}$ for $i=1,\dots,r>1$; $j=1,\dots,s>1$; $rs=n$
$y_{ij}$ is the p-dimensional observation vector for the jth object on the ith level of the factor A. $\mu$ is the mean vector and $\alpha_i$ is the vector giving the effects of the ith level of A. $e_{ij}$ is the error for the observation $y_{ij}$ with expected value 0 and covariance matrix $\Sigma$. In tests the $e_{ij}$ are assumed to be iid normally distributed. Impose the reparametrisation constraint:
$\sum_{i=1}^{r}\alpha_i=0$
Then the estimator for the mean is given by:
$\hat\mu=\bar y_{..}=\frac1n\sum_{i=1}^{r}\sum_{j=1}^{s}y_{ij}$
The effects vector is estimated by:
$\hat\alpha_i=\bar y_{i.}-\bar y_{..}=\bar y_{i.}-\hat\mu$ where $\bar y_{i.}=\frac1s\sum_{j=1}^{s}y_{ij}$
To test the hypothesis $H_0:\alpha_1=\dots=\alpha_r=0$, i.e. factor A has no influence on the p characteristics, use the tests introduced for multivariate regression. The hypothesis and error matrices are given by
$S_h=s\sum_{i=1}^{r}(\bar y_{i.}-\bar y_{..})(\bar y_{i.}-\bar y_{..})'=s\sum_{i=1}^{r}\hat\alpha_i\hat\alpha_i'$
$S_e=\sum_{i=1}^{r}\sum_{j=1}^{s}(y_{ij}-\bar y_{i.})(y_{ij}-\bar y_{i.})'$

Accordingly the degrees of freedom are: $n_h=r-1$, $n_e=r(s-1)$
The covariance matrix is estimated just as in the multivariate regression:
$\hat\Sigma=\frac{1}{n_e}S_e=\frac{1}{r(s-1)}S_e$
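For a small balanced data set the hypothesis and error matrices of the one-way model can be computed directly from their definitions. The following is only a sketch with hypothetical helper names (`outer`, `add`, `mean_vector`, `one_way_manova`) and made-up data.

```python
def outer(u, v):
    """Outer product u v' as a nested list."""
    return [[a * b for b in v] for a in u]


def add(m1, m2):
    """Element-wise sum of two matrices."""
    return [[a + b for a, b in zip(r1, r2)] for r1, r2 in zip(m1, m2)]


def mean_vector(vectors):
    """Component-wise mean of a list of equally long vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def one_way_manova(groups):
    """groups[i][j] is the p-vector y_ij; balanced design with s objects per level."""
    p = len(groups[0][0])
    s = len(groups[0])
    grand = mean_vector([y for g in groups for y in g])   # mu_hat
    s_h = [[0.0] * p for _ in range(p)]
    s_e = [[0.0] * p for _ in range(p)]
    for g in groups:
        level_mean = mean_vector(g)
        alpha_hat = [a - b for a, b in zip(level_mean, grand)]
        s_h = add(s_h, [[s * x for x in row] for row in outer(alpha_hat, alpha_hat)])
        for y in g:
            resid = [a - b for a, b in zip(y, level_mean)]
            s_e = add(s_e, outer(resid, resid))
    return s_h, s_e


# r = 2 levels, s = 2 objects per level, p = 2 characteristics (made-up numbers).
groups = [[[1.0, 2.0], [3.0, 4.0]],
          [[5.0, 6.0], [7.0, 10.0]]]
S_h, S_e = one_way_manova(groups)
```

A useful check is that $S_h+S_e$ equals the total sum of squares and products about the overall mean.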

6.3.2 Two way ANOVA with interaction terms

Analyses the influence of two qualitative characteristics A and B with r and s levels respectively on p quantitative characteristics. If each level of one factor is combined with every level of the other exactly t times and an interaction between the factors is assumed, one obtains the multivariate ANOVA with double cross classification and interaction terms:
$y_{ijk}=\mu+\alpha_i+\beta_j+(\alpha\beta)_{ij}+e_{ijk}$ for $i=1,\dots,r>1$; $j=1,\dots,s>1$; $k=1,\dots,t>1$; $n=rst$
$y_{ijk}$ is the random vector (the observation vector) of the p characteristics for the kth object on level i of factor A and level j of factor B. $\mu$ is again the mean, $\alpha_i$ the vector of the effects of the ith level of factor A, $\beta_j$ the vector of the effects of the jth level of factor B and $(\alpha\beta)_{ij}$ the vector of the interaction effects of level i of factor A and level j of factor B. $e_{ijk}$ is the error vector with expected value 0 and covariance matrix $\Sigma$. To do tests, one must assume the $e_{ijk}$ to be iid multivariate normal. The reparametrisation conditions are:
$\sum_{i=1}^{r}\alpha_i=0$, $\sum_{j=1}^{s}\beta_j=0$, $\sum_{i=1}^{r}(\alpha\beta)_{ij}=\sum_{j=1}^{s}(\alpha\beta)_{ij}=0$
The estimators are:
mean: $\hat\mu=\bar y_{...}$ where $\bar y_{...}=\frac1n\sum_{i=1}^{r}\sum_{j=1}^{s}\sum_{k=1}^{t}y_{ijk}$
effect of the ith level of factor A: $\hat\alpha_i=\bar y_{i..}-\hat\mu$ where $\bar y_{i..}=\frac{1}{st}\sum_{j=1}^{s}\sum_{k=1}^{t}y_{ijk}$
effect of the jth level of factor B: $\hat\beta_j=\bar y_{.j.}-\hat\mu$ where $\bar y_{.j.}=\frac{1}{rt}\sum_{i=1}^{r}\sum_{k=1}^{t}y_{ijk}$
interaction effect of the ith level of A and the jth level of B: $\widehat{(\alpha\beta)}_{ij}=\bar y_{ij.}-\bar y_{i..}-\bar y_{.j.}+\bar y_{...}=\bar y_{ij.}-\hat\alpha_i-\hat\beta_j-\hat\mu$ where $\bar y_{ij.}=\frac1t\sum_{k=1}^{t}y_{ijk}$
Three different hypotheses are tested, using the previously introduced tests:
$H_0^A:\alpha_1=\dots=\alpha_r=0$
$H_0^B:\beta_1=\dots=\beta_s=0$
$H_0^{AB}:(\alpha\beta)_{11}=\dots=(\alpha\beta)_{rs}=0$
Hypothesis matrices and degrees of freedom:
$H_0^A$: $S_h^A=st\sum_{i=1}^{r}(\bar y_{i..}-\bar y_{...})(\bar y_{i..}-\bar y_{...})'=st\sum_{i=1}^{r}\hat\alpha_i\hat\alpha_i'$, $n_h^A=r-1$
$H_0^B$: $S_h^B=rt\sum_{j=1}^{s}(\bar y_{.j.}-\bar y_{...})(\bar y_{.j.}-\bar y_{...})'=rt\sum_{j=1}^{s}\hat\beta_j\hat\beta_j'$, $n_h^B=s-1$
$H_0^{AB}$: $S_h^{AB}=t\sum_{i=1}^{r}\sum_{j=1}^{s}(\bar y_{ij.}-\bar y_{i..}-\bar y_{.j.}+\bar y_{...})(\bar y_{ij.}-\bar y_{i..}-\bar y_{.j.}+\bar y_{...})'=t\sum_{i=1}^{r}\sum_{j=1}^{s}\widehat{(\alpha\beta)}_{ij}\widehat{(\alpha\beta)}_{ij}'$, $n_h^{AB}=(r-1)(s-1)$
The error matrix in all cases is
$S_e=\sum_{i=1}^{r}\sum_{j=1}^{s}\sum_{k=1}^{t}(y_{ijk}-\bar y_{ij.})(y_{ijk}-\bar y_{ij.})'$, $n_e=rs(t-1)$
and the estimator of the covariance matrix is $\hat\Sigma=\frac{1}{n_e}S_e$.

6.4 Profile Analysis

Analysing time series behavior dependent on one or more variables. The p columns are treated as observations of one characteristic at p time steps for n objects.

The simple profile analysis looks at the influence of one qualitative factor A with r levels on s objects per level through time. This model is identical to the one-way classification:
$y_{ij}=\mu+\alpha_i+e_{ij}$ for $i=1,\dots,r$; $j=1,\dots,s$; $n=rs$; $y_{ij}=(y_{ij}(t_1),y_{ij}(t_2),\dots,y_{ij}(t_p))'$
$\mu$ is the mean vector, $\alpha_i$ is the effect of level i of the factor A. The estimators are identical to those of the one-way classification:
$\hat\mu=\bar y_{..}=\frac1n\sum_{i=1}^{r}\sum_{j=1}^{s}y_{ij}$
$\hat\alpha_i=\bar y_{i.}-\bar y_{..}=\bar y_{i.}-\hat\mu$ where $\bar y_{i.}=\frac1s\sum_{j=1}^{s}y_{ij}$
Thus for the time step t the estimates are:
$\hat\mu(t)=\bar y_{..}(t)=\frac1n\sum_{i=1}^{r}\sum_{j=1}^{s}y_{ij}(t)$
$\hat\alpha_i(t)=\bar y_{i.}(t)-\bar y_{..}(t)=\bar y_{i.}(t)-\hat\mu(t)$ where $\bar y_{i.}(t)=\frac1s\sum_{j=1}^{s}y_{ij}(t)$
One commonly tests the hypothesis $H_0^A:\alpha_1=\dots=\alpha_r=0$. To be able to use the known tests, compute the hypothesis and error matrices:
$S_h^A=s\sum_{i=1}^{r}(\bar y_{i.}-\bar y_{..})(\bar y_{i.}-\bar y_{..})'$
$S_e=\sum_{i=1}^{r}\sum_{j=1}^{s}(y_{ij}-\bar y_{i.})(y_{ij}-\bar y_{i.})'$
degrees of freedom: $n_h^A=r-1$, $n_e=r(s-1)$

A different common hypothesis is parallelity:
$H_0^1$: there exist $k_1,\dots,k_r$ such that $\alpha_1+k_11_p=\dots=\alpha_r+k_r1_p$ where $1_p=(1,\dots,1)'$
One tests whether the impact of the r levels of factor A is identical up to an additive constant. Hypothesis and error matrix:
$S_h^{(1)}=(I_{p-1}|-1_{p-1})\,S_h^A\,(I_{p-1}|-1_{p-1})'$
$S_e^{(1)}=(I_{p-1}|-1_{p-1})\,S_e\,(I_{p-1}|-1_{p-1})'$
degrees of freedom: $n_h^{(1)}=n_h^A=r-1$; $n_e^{(1)}=n_e=r(s-1)$; $p^*=p-1$

Identity of time means:
$H_0^2:\bar\alpha_1=\dots=\bar\alpha_r$, where $\bar\alpha_i=\frac1p\sum_{t=1}^{p}\alpha_i(t)$ for $i=1,\dots,r$
One tests whether the effects of the r levels of factor A, averaged over the p time steps, are identical. Hypothesis and error matrix are scalars:
$S_h^{(2)}=1_p'S_h^A1_p$
$S_e^{(2)}=1_p'S_e1_p$
degrees of freedom: $n_h^{(2)}=n_h^A=r-1$; $n_e^{(2)}=n_e=r(s-1)$; $p^*=1$

7 KE7 Discrete Regression

Models of discrete regression analysis observe n objects and their characteristics. These models then estimate the probability that an object with a certain set of regressor values belongs to one of a finite set of classes. If the regressand has only two possible values (two classes) this is a binary response model. In this case the regression predicts the probability that, given a combination of regressors $X=(X_1,\dots,X_{h-1})'$, the level Y=1 of the regressand will occur (i.e. that the object belongs to class 1). The probability P is given as a function of X:
$P=G(\beta_0+\beta'X)$
where $\beta_0$ and $\beta=(\beta_1,\dots,\beta_{h-1})'$ are parameters to be estimated and $G^{-1}$ exists. If G is a linear function of the regressors one obtains a linear probability model. For G the distribution function of the standard normal distribution one obtains the probit (normit) model, and for the logistic distribution function $F_{lgt}(z)=1/(1+e^{-z})$ the logit model.



7.1 General approach

Assume there is only one regressor X. Assume that the random variable Y that represents the response variable (regressand) is binary (0,1). Discrete regression then is a prediction of the probability that the response variable takes the value y=1 or the value y=0 for a given value x of X. One can now assume a regression relationship between the realizations y, x of the regressand Y and the regressor X:
$y=\beta_0+\beta_1x$
Least squares estimation gives:
$(\hat\beta_0,\hat\beta_1)'=(X'X)^{-1}X'Y$ where $X'=\begin{pmatrix}1&1&\cdots&1\\x_1&x_2&\cdots&x_n\end{pmatrix}$ and $Y=(y_1,\dots,y_n)'$
where the $x_k$ are the observations of the regressor and the $y_k$ the corresponding values of the regressand.

One can also group the observations $y_1,\dots,y_n$ by m measurement points $x_1,\dots,x_m$, or it is possible that at the point $x_i$ (i=1,...,m) the value y has been observed $n_i$ times. If the event y=1 took place $n_i(1)$ times and the event y=0 took place $n_i(0)$ times ($n_i=n_i(1)+n_i(0)$; $n_1+\dots+n_m=n$), then the probability $p_i=P(y=1|x=x_i)$ of observing y=1 at position $x_i$ can be estimated by:
$\hat p_i=n_i(1)/n_i$

The regression relationship $p=\beta_0+\beta_1x$ linearly explains the probability of observing y=1 given the value x of the regressor X. Therefore this is called the linear model of discrete regression. However, the predicted probabilities can fall outside the interval [0,1]. Therefore one uses transformations which generate estimates that represent a genuine probability. The probit (normit) model uses the standard normal distribution function as a transformation:
$p=\Phi(\beta_0+\beta_1x)$ where $\Phi(x)=\int_{-\infty}^{x}\varphi(t)\,dt=\frac{1}{\sqrt{2\pi}}\int_{-\infty}^{x}e^{-t^2/2}\,dt$
First calculate the probits:
$g_i^{prob}=\Phi^{-1}(\hat p_i)=u_{\hat p_i}$
where $u_{\hat p_i}$ is the $\hat p_i$ quantile of the standard normal distribution. Then estimate by least squares the parameters of the regression relation:
$g^{prob}=\Phi^{-1}(p)=\beta_0+\beta_1x$
From this one obtains the estimated regression relation:
$\hat p=\Phi(\hat g^{prob})=\Phi(\hat\beta_0+\hat\beta_1x)$
Since $\Phi$ is a distribution function one always obtains values between 0 and 1. A different frequently used transformation function is the logistic distribution function:
$F_{lgt}(z)=1/(1+e^{-z})$
In the logit model one computes the logits:
$g_i^{lgt}=F_{lgt}^{-1}(\hat p_i)=\ln\frac{\hat p_i}{1-\hat p_i}$
Then again estimate by least squares the parameters of the regression relation:
$g^{lgt}=F_{lgt}^{-1}(p)=\ln\frac{p}{1-p}=\beta_0+\beta_1x$
This yields the estimated regression relation:
$\hat p=F_{lgt}(\hat g^{lgt})=F_{lgt}(\hat\beta_0+\hat\beta_1x)=\frac{1}{1+\exp(-(\hat\beta_0+\hat\beta_1x))}$

Simple canonical summary:
1. Find relative frequencies as estimators for the probabilities at the observation points.
2. Compute ids/probits/logits by applying the inverse transform $G^{-1}$.
3. Regress the transformed values on the regressors.
4. Insert the estimated weights into G to obtain probability estimates for new values of the regressors.
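The four steps can be illustrated for one regressor. This is a minimal sketch, not part of the original notes: the grouped data `xs`, `n1`, `n` are hypothetical, `fit_line` is an illustrative helper for unweighted least squares, and `statistics.NormalDist` from the Python standard library supplies $\Phi$ and $\Phi^{-1}$.

```python
import math
from statistics import NormalDist

def fit_line(xs, gs):
    """Ordinary least squares for g = b0 + b1 * x (one regressor)."""
    n = len(xs)
    x_bar = sum(xs) / n
    g_bar = sum(gs) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxg = sum((x - x_bar) * (g - g_bar) for x, g in zip(xs, gs))
    b1 = sxg / sxx
    return g_bar - b1 * x_bar, b1

def logit(p):
    """Inverse of the logistic distribution function."""
    return math.log(p / (1.0 - p))

def logistic(z):
    """Logistic distribution function F_lgt(z)."""
    return 1.0 / (1.0 + math.exp(-z))

std_normal = NormalDist()

# Hypothetical grouped data: points x_i, counts n_i(1) of y=1, totals n_i
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
n1 = [4, 9, 15, 22, 26]
n = [30, 30, 30, 30, 30]

# Step 1: relative frequencies
p_hat = [a / b for a, b in zip(n1, n)]
# Step 2: inverse transforms (logits and probits)
logits = [logit(p) for p in p_hat]
probits = [std_normal.inv_cdf(p) for p in p_hat]
# Step 3: least squares regression of the transformed values on x
b0_l, b1_l = fit_line(xs, logits)
b0_p, b1_p = fit_line(xs, probits)

# Step 4: insert the weights into G to predict probabilities
def predict_logit(x):
    return logistic(b0_l + b1_l * x)

def predict_probit(x):
    return std_normal.cdf(b0_p + b1_p * x)
```

Both predictors stay inside (0,1) by construction, which is the reason for preferring them to the linear model.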

7.2 Binary response model

Assume a binary regressand Y with realization y and several regressors $X_1,\dots,X_{h-1}$ with corresponding realizations $x_1,\dots,x_{h-1}$. We wish to estimate a relationship of the form:
$P = G(\beta_0 + \sum_{j=1}^{h-1}\beta_j x_j)$
where the $\beta$'s are the unknown model parameters to be estimated. We want to find weights $\hat\beta_0,\hat\beta_1,\dots,\hat\beta_{h-1}$ such that $\hat P$ is an estimator for the probability that y=1 will be observed given $x = (x_1,\dots,x_{h-1})$:
$P = P(y=1|x)$
If for each combination of regressor values $x_i = (x_{i1},\dots,x_{i,h-1})$ one observes the realization y=1 $n_i(1)$ times and y=0 $n_i(0)$ times, with $n_i(1)>0$ and $n_i(0)>0$, then the probability that y=1 will occur given $x = (x_1,\dots,x_{h-1})$ can be estimated by weighted least squares in two steps. This procedure is known as the Berkson-Theil method. In addition, the constraint $n_i = n_i(1)+n_i(0) > 5$ should be fulfilled.

7.2.1 Berkson-Theil-Method
The BTM uses weighted least squares to estimate the $\beta$'s. For each of the m measurement points $x_i = (x_{i1},\dots,x_{i,h-1})$, i=1,…,m, of the regressors $X_1,\dots,X_{h-1}$ one has $n_i$ observations of the binary regressand Y. If the event y=1 is observed $n_i(1)$ times, the probability $p_i = P(y=1|x=x_i)$ can be estimated by:
$\hat p_i = n_i(1)/n_i$
Since the individual events follow a Bernoulli distribution, we obtain for the expected value and the variance of $\hat p_i$:
$E(\hat p_i) = E(n_i(1)/n_i) = \frac{1}{n_i}E(n_i(1)) = \frac{1}{n_i} n_i p_i = p_i$
$Var(\hat p_i) = Var(n_i(1)/n_i) = \frac{1}{n_i^2}Var(n_i(1)) = \frac{1}{n_i^2} n_i p_i(1-p_i) = \frac{1}{n_i} p_i(1-p_i)$
For sufficiently large $n_i$ we have
$s_i^2 = \hat p_i(1-\hat p_i)/n_i$
as an estimator for the variance of $\hat p_i$. This estimator is used for the least squares estimation of $\beta_0,\beta_1,\dots,\beta_{h-1}$. To estimate a relationship of the type
$P = G(\beta_0 + \sum_{j=1}^{h-1}\beta_j x_j)$
where $G^{-1}$ exists, first compute $\hat g_i = G^{-1}(\hat p_i)$ and $s_{G,i}^2$ as an estimator for the variance of $\hat g_i$. Then regress $\tilde G = G^{-1}(P)$ on $X_1,\dots,X_{h-1}$ using the relationship:
$\tilde G = G^{-1}(P) = \beta_0 + \sum_{j=1}^{h-1}\beta_j x_j$
The regression model is:
$\hat g_i = G^{-1}(\hat p_i) = \beta_0 + \sum_{j=1}^{h-1}\beta_j x_{ij} + e_i$
where the $e_i$ are independent error terms with variances estimated by $s_{G,i}^2$. In matrix notation:
$X = \begin{pmatrix} 1 & x_1' \\ 1 & x_2' \\ \vdots & \vdots \\ 1 & x_m' \end{pmatrix}$, $\hat g = (\hat g_1,\dots,\hat g_m)'$, $\beta = (\beta_0,\beta_1,\dots,\beta_{h-1})'$, $\hat\Sigma_G = diag(s_{G,1}^2,\dots,s_{G,m}^2)$



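Here $\hat\Sigma_G$ is the estimator of the diagonal covariance matrix of $\hat g$ resp. $e = (e_1,\dots,e_m)'$. The weighted least squares estimator for the parameter vector $\beta$ is given by:
$\hat\beta = (\hat\beta_0,\hat\beta_1,\dots,\hat\beta_{h-1})' = (X'\hat\Sigma_G^{-1}X)^{-1}X'\hat\Sigma_G^{-1}\hat g$

Linear model: for G = id we have the relationship
$\tilde G = P = \beta_0 + \sum_{j=1}^{h-1}\beta_j x_j$ estimated by $\hat{\tilde G} = \hat P = \hat\beta_0^{linear} + \sum_{j=1}^{h-1}\hat\beta_j^{linear} x_j$
where $\hat\beta^{linear}$ is the weighted least squares estimator for $\beta = (\beta_0,\beta_1,\dots,\beta_{h-1})'$ computed with $\hat g_i = \hat p_i$ and $\hat\Sigma_{id} = diag(s_{id,1}^2,\dots,s_{id,m}^2) = diag(s_1^2,\dots,s_m^2)$.

Probit model: for $G = \Phi$ we have the relationship
$\tilde G = \Phi^{-1}(P) = \beta_0 + \sum_{j=1}^{h-1}\beta_j x_j$ estimated by
$\hat P = \Phi(\hat\beta_0^{probit} + \sum_{j=1}^{h-1}\hat\beta_j^{probit} x_j)$
$\hat{\tilde G} = \Phi^{-1}(\hat P) = \hat\beta_0^{probit} + \sum_{j=1}^{h-1}\hat\beta_j^{probit} x_j$
where $\hat\beta^{probit}$ is the weighted least squares estimator for $\beta$ computed with $\hat g_i = \hat g_i^{probit} = \Phi^{-1}(\hat p_i) = u_{\hat p_i}$, and
$\hat\Sigma_\Phi = diag(s_{\Phi,1}^2,\dots,s_{\Phi,m}^2) = diag(s_1^2/\varphi(\hat g_1)^2,\dots,s_m^2/\varphi(\hat g_m)^2)$
is the estimator of the covariance matrix. It is obtained as follows. $\hat p_i$ denotes the relative frequency of y=1 at $x_i$. Then:
$\hat p_i = p_i + u_i$, where $u_i$ is the unknown error with $E(u_i)=0$, $Var(u_i) = p_i(1-p_i)/n_i$
Now apply the inverse standard normal distribution function:
$\Phi^{-1}(\hat p_i) = \Phi^{-1}(p_i + u_i)$
The Taylor series expansion around $p_i$ gives:
$\Phi^{-1}(\hat p_i) = \Phi^{-1}(p_i) + u_i\,\partial\Phi^{-1}(p_i)/\partial p_i + R_i$
where $\partial\Phi^{-1}(p_i)/\partial p_i = 1/\varphi[\Phi^{-1}(p_i)]$ and $R_i$ is the remainder term, which converges in probability to zero as $n_i \to \infty$. Therefore, approximately,
$\hat g_i = \Phi^{-1}(\hat p_i) = \Phi^{-1}(p_i) + u_i/\varphi[\Phi^{-1}(p_i)] = \beta_0 + \sum_{j=1}^{h-1}\beta_j x_{ij} + e_i$
The error $e_i$ has expected value $E(e_i)=0$ and variance:
$Var(e_i) = Var[u_i/\varphi(\Phi^{-1}(p_i))] = Var[u_i/\varphi(g_i)] = p_i(1-p_i)/[n_i\varphi(g_i)^2]$, which is estimated by $s_i^2/\varphi(\hat g_i)^2$
Therefore the estimator of the covariance matrix turns out as:
$\hat\Sigma_\Phi = diag(s_{\Phi,1}^2,\dots,s_{\Phi,m}^2) = diag(s_1^2/\varphi(\hat g_1)^2,\dots,s_m^2/\varphi(\hat g_m)^2)$

Logit model: for $G = F_{lgt}$ we have the relationship
$P = F_{lgt}(\beta_0 + \sum_{j=1}^{h-1}\beta_j x_j) = 1/[1+\exp(-(\beta_0 + \sum_{j=1}^{h-1}\beta_j x_j))]$
$\tilde G = F_{lgt}^{-1}(P) = \ln[P/(1-P)] = \beta_0 + \sum_{j=1}^{h-1}\beta_j x_j$
where $\hat\beta^{logit}$ is the weighted least squares estimator for $\beta$ computed with $\hat g_i = \hat g_i^{logit} = F_{lgt}^{-1}(\hat p_i) = \ln[\hat p_i/(1-\hat p_i)]$, and
$\hat\Sigma_{F_{lgt}} = diag(s_{F_{lgt},1}^2,\dots,s_{F_{lgt},m}^2) = diag(1/(n_1^2 s_1^2),\dots,1/(n_m^2 s_m^2))$
is the estimator of the covariance matrix.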
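The covariance matrix of the logit model is obtained as follows. With $\hat p_i = p_i + u_i$ the odds are:
$\frac{\hat p_i}{1-\hat p_i} = \frac{p_i+u_i}{1-p_i-u_i} = \frac{p_i}{1-p_i}\cdot\frac{1+u_i/p_i}{1-u_i/(1-p_i)}$
So the log-odds become:
$\ln[\hat p_i/(1-\hat p_i)] = \ln[p_i/(1-p_i)] + \ln(1+u_i/p_i) - \ln(1-u_i/(1-p_i))$
Expand the last two terms as Taylor series in $u_i/p_i$ resp. $u_i/(1-p_i)$ around zero and drop higher order terms:
$\ln[\hat p_i/(1-\hat p_i)] \approx \ln[p_i/(1-p_i)] + u_i/p_i + u_i/(1-p_i) = \beta_0 + \sum_{j=1}^{h-1}\beta_j x_{ij} + u_i/[p_i(1-p_i)]$
Hence $\hat g_i = \beta_0 + \sum_{j=1}^{h-1}\beta_j x_{ij} + e_i$ with $\hat g_i = \ln[\hat p_i/(1-\hat p_i)]$ and $e_i = u_i/[p_i(1-p_i)]$. Then $e_i$ has expected value $E(e_i)=0$ and variance:
$Var(e_i) = Var[u_i/(p_i(1-p_i))] = \frac{1}{p_i^2(1-p_i)^2}Var(u_i) = \frac{1}{p_i^2(1-p_i)^2}\cdot\frac{p_i(1-p_i)}{n_i} = \frac{1}{n_i p_i(1-p_i)}$
which is estimated by $1/[n_i\hat p_i(1-\hat p_i)] = 1/(n_i^2 s_i^2)$. The estimator for the covariance matrix $\Sigma_{F_{lgt}}$ then becomes:
$\hat\Sigma_{F_{lgt}} = diag(s_{F_{lgt},1}^2,\dots,s_{F_{lgt},m}^2) = diag(1/(n_1^2 s_1^2),\dots,1/(n_m^2 s_m^2))$
The resulting estimators for P and $\tilde G$ are:
$\hat P = F_{lgt}(\hat\beta_0^{logit} + \sum_{j=1}^{h-1}\hat\beta_j^{logit} x_j) = 1/[1+\exp(-(\hat\beta_0^{logit} + \sum_{j=1}^{h-1}\hat\beta_j^{logit} x_j))]$
$\hat{\tilde G} = F_{lgt}^{-1}(\hat P) = \ln[\hat P/(1-\hat P)] = \hat\beta_0^{logit} + \sum_{j=1}^{h-1}\hat\beta_j^{logit} x_j$

The goodness of fit of the regression curves
$\hat{\tilde G} = \hat P = \hat\beta_0^{linear} + \sum_{j=1}^{h-1}\hat\beta_j^{linear} x_j$
$\hat{\tilde G} = \Phi^{-1}(\hat P) = \hat\beta_0^{probit} + \sum_{j=1}^{h-1}\hat\beta_j^{probit} x_j$
$\hat{\tilde G} = F_{lgt}^{-1}(\hat P) = \hat\beta_0^{logit} + \sum_{j=1}^{h-1}\hat\beta_j^{logit} x_j$
can be measured by the multiple coefficient of determination.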

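To directly compare the quality of the estimates $\hat P$ at the points $x_i$ one should use the $\chi^2$ values (the lower the value, the better the fit):
$G^2 = \sum_{i=1}^{m} n_i\,\frac{[\hat p_i - \hat p(x_i)]^2}{\hat p(x_i)(1-\hat p(x_i))}$
where $\hat p(x_i)$ denotes the probability estimated at $x_i$ by the respective model (linear, probit or logit).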
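The Berkson-Theil steps for the logit model can be sketched for a single regressor. This is an illustration under assumptions, not the notes' own implementation: the data `xs`, `n1`, `n` are hypothetical, and `wls_line`, `berkson_theil_logit` and `chi_square_fit` are illustrative helper names. The 2x2 normal equations of $(X'\hat\Sigma^{-1}X)\beta = X'\hat\Sigma^{-1}\hat g$ are solved in closed form.

```python
import math

def wls_line(xs, gs, variances):
    """Weighted least squares for g = b0 + b1 * x with weights 1/variance."""
    w = [1.0 / v for v in variances]
    sw = sum(w)
    swx = sum(wi * x for wi, x in zip(w, xs))
    swxx = sum(wi * x * x for wi, x in zip(w, xs))
    swg = sum(wi * g for wi, g in zip(w, gs))
    swxg = sum(wi * x * g for wi, x, g in zip(w, xs, gs))
    det = sw * swxx - swx ** 2
    b0 = (swxx * swg - swx * swxg) / det
    b1 = (sw * swxg - swx * swg) / det
    return b0, b1

def berkson_theil_logit(xs, n1, n):
    """Logits of relative frequencies regressed on x by weighted least squares."""
    p_hat = [a / b for a, b in zip(n1, n)]
    g_hat = [math.log(p / (1.0 - p)) for p in p_hat]
    # Var(e_i) is estimated by 1 / (n_i * p_i * (1 - p_i)) = 1 / (n_i^2 * s_i^2)
    var_g = [1.0 / (ni * p * (1.0 - p)) for ni, p in zip(n, p_hat)]
    return wls_line(xs, g_hat, var_g)

def chi_square_fit(xs, n1, n, b0, b1):
    """Chi-square value comparing relative frequencies with fitted probabilities."""
    total = 0.0
    for x, a, ni in zip(xs, n1, n):
        p_obs = a / ni
        p_fit = 1.0 / (1.0 + math.exp(-(b0 + b1 * x)))
        total += ni * (p_obs - p_fit) ** 2 / (p_fit * (1.0 - p_fit))
    return total

# Hypothetical grouped data with 0 < n_i(1) < n_i and n_i > 5
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
n1 = [4, 9, 15, 22, 26]
n = [30, 30, 30, 30, 30]

b0, b1 = berkson_theil_logit(xs, n1, n)
g2 = chi_square_fit(xs, n1, n, b0, b1)
```

For the probit or linear model only the transform and the variance estimate in `berkson_theil_logit` would change; the weighted regression step stays the same.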



7.2.2 ML estimation in the binary response model

For small numbers of observations $n_i$ at the individual combinations of values of the regressors $X_1,\dots,X_{h-1}$ it is not possible to estimate $p_i$ by the relative frequency. In this case use maximum likelihood to compute estimates of the $\beta$'s. Assume the model
$P = G(\beta_0 + \sum_{j=1}^{h-1}\beta_j x_j)$, $\tilde G = G^{-1}(P) = \beta_0 + \sum_{j=1}^{h-1}\beta_j x_j$
For the measurement points $x_i = (x_{i1},\dots,x_{i,h-1})$, i=1,…,m, one has $n_i \ge 1$ observations of the regressand y. Denote by $n_i(1) \ge 0$ the absolute frequency of the event y=1. The probability of this event is modelled as:
$p_i^G(\beta) = G(\beta_0 + \sum_{j=1}^{h-1}\beta_j x_{ij})$, where $\beta = (\beta_0,\beta_1,\dots,\beta_{h-1})'$
Then the likelihood function is given by
$L_G = L(\beta; n_i(1), n_i, x_i) = \prod_{i=1}^{m}\binom{n_i}{n_i(1)}\,[p_i^G(\beta)]^{n_i(1)}\,[1-p_i^G(\beta)]^{n_i(0)}$

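The log-likelihood is:
$\ln L_G = \sum_{i=1}^{m}\left[\ln\binom{n_i}{n_i(1)} + n_i(1)\ln p_i^G(\beta) + n_i(0)\ln(1-p_i^G(\beta))\right] = \sum_{i=1}^{m}\left[\ln\binom{n_i}{n_i(1)} + n_i(1)\ln\frac{p_i^G(\beta)}{1-p_i^G(\beta)} + n_i\ln(1-p_i^G(\beta))\right]$
which is maximized w.r.t. the components of the parameter vector $\beta$. To maximize, compute first and second derivatives. Below, $\eta_i = \sum_{j=0}^{h-1}\beta_j x_{ij}$ with $x_{i0} = 1$ abbreviates the linear predictor.

Linear model:
$p_i^{linear}(\beta) = \beta_0 + \sum_{j=1}^{h-1}\beta_j x_{ij} = \eta_i$
$\ln L_{id} = \sum_{i=1}^{m}\left[\ln\binom{n_i}{n_i(1)} + n_i(1)\ln\eta_i + n_i(0)\ln(1-\eta_i)\right]$
$\partial\ln L_{id}/\partial\beta_l = \sum_{i=1}^{m}\left[\frac{n_i(1)}{\eta_i} - \frac{n_i(0)}{1-\eta_i}\right]x_{il}$
$\partial^2\ln L_{id}/\partial\beta_l\partial\beta_v = -\sum_{i=1}^{m}\left[\frac{n_i(1)}{\eta_i^2} + \frac{n_i(0)}{(1-\eta_i)^2}\right]x_{il}x_{iv}$
Then the ML estimator is:
$\hat\beta = \hat\beta_{ML}^{linear} = (\hat\beta_{ML,0}^{linear},\dots,\hat\beta_{ML,h-1}^{linear})'$
$\hat P = \hat P_{ML}^{linear} = \hat{\tilde G}_{ML}^{linear} = \hat\beta_{ML,0}^{linear} + \sum_{j=1}^{h-1}\hat\beta_{ML,j}^{linear}x_j$

Probit model:
$p_i^{probit}(\beta) = \Phi(\beta_0 + \sum_{j=1}^{h-1}\beta_j x_{ij}) = \Phi(\eta_i)$
$\ln L_\Phi = \sum_{i=1}^{m}\left[\ln\binom{n_i}{n_i(1)} + n_i(1)\ln\Phi(\eta_i) + n_i(0)\ln(1-\Phi(\eta_i))\right]$
where $\Phi$ is the distribution function of the standard normal distribution and $\varphi$ its density function.
$\partial\ln L_\Phi/\partial\beta_l = \sum_{i=1}^{m}x_{il}\,\varphi(\eta_i)\,\frac{n_i(1) - n_i\Phi(\eta_i)}{\Phi(\eta_i)(1-\Phi(\eta_i))}$
$E[\partial^2\ln L_\Phi/\partial\beta_l\partial\beta_v] = -\sum_{i=1}^{m}n_i x_{il}x_{iv}\,\frac{\varphi(\eta_i)^2}{\Phi(\eta_i)(1-\Phi(\eta_i))}$
(the second derivative is given in expectation; the remaining terms of the observed second derivative have expectation zero). Then the ML estimator is:
$\hat\beta = \hat\beta_{ML}^{probit} = (\hat\beta_{ML,0}^{probit},\dots,\hat\beta_{ML,h-1}^{probit})'$
$\hat P = \hat P_{ML}^{probit} = \Phi(\hat\beta_{ML,0}^{probit} + \sum_{j=1}^{h-1}\hat\beta_{ML,j}^{probit}x_j)$, $\hat{\tilde G}_{ML}^{probit} = \Phi^{-1}(\hat P) = \hat\beta_{ML,0}^{probit} + \sum_{j=1}^{h-1}\hat\beta_{ML,j}^{probit}x_j$

Logit model:
$p_i^{logit}(\beta) = F_{lgt}(\beta_0 + \sum_{j=1}^{h-1}\beta_j x_{ij}) = F_{lgt}(\eta_i) = 1/[1+\exp(-\eta_i)]$
$\ln L_{F_{lgt}} = \sum_{i=1}^{m}\left[\ln\binom{n_i}{n_i(1)} + n_i(1)\ln\frac{F_{lgt}(\eta_i)}{1-F_{lgt}(\eta_i)} + n_i\ln(1-F_{lgt}(\eta_i))\right] = \sum_{i=1}^{m}\left[\ln\binom{n_i}{n_i(1)} + [n_i(1)-n_i]\eta_i - n_i\ln[1+\exp(-\eta_i)]\right]$
where $F_{lgt}$ is the logistic distribution function.
$\partial\ln L_{F_{lgt}}/\partial\beta_l = \sum_{i=1}^{m}x_{il}\left[n_i(1) - \frac{n_i}{1+\exp(-\eta_i)}\right]$
$\partial^2\ln L_{F_{lgt}}/\partial\beta_l\partial\beta_v = -\sum_{i=1}^{m}n_i x_{il}x_{iv}\,\frac{\exp(-\eta_i)}{[1+\exp(-\eta_i)]^2}$
Then the ML estimator is:
$\hat\beta = \hat\beta_{ML}^{logit} = (\hat\beta_{ML,0}^{logit},\dots,\hat\beta_{ML,h-1}^{logit})'$
$\hat P = \hat P_{ML}^{logit} = 1/[1+\exp(-(\hat\beta_{ML,0}^{logit} + \sum_{j=1}^{h-1}\hat\beta_{ML,j}^{logit}x_j))]$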
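The notes give the derivatives but do not name a maximization algorithm. One common choice is Newton-Raphson, sketched here for the logit model with a single regressor; the data and the function name `ml_logit` are hypothetical, and the fixed iteration count is an assumption that suffices for well-behaved data like these.

```python
import math

def logistic(z):
    """Logistic distribution function F_lgt(z)."""
    return 1.0 / (1.0 + math.exp(-z))

def ml_logit(xs, n1, n, iterations=25):
    """Newton-Raphson for the grouped logit log-likelihood, one regressor."""
    b0, b1 = 0.0, 0.0
    for _ in range(iterations):
        s0 = s1 = 0.0             # first derivatives (score)
        h00 = h01 = h11 = 0.0     # negative second derivatives
        for x, a, ni in zip(xs, n1, n):
            p = logistic(b0 + b1 * x)
            r = a - ni * p            # n_i(1) - n_i / (1 + exp(-eta_i))
            w = ni * p * (1.0 - p)    # n_i exp(-eta_i) / (1 + exp(-eta_i))^2
            s0 += r
            s1 += r * x
            h00 += w
            h01 += w * x
            h11 += w * x * x
        det = h00 * h11 - h01 ** 2
        # Newton step: beta <- beta + (negative Hessian)^-1 * score
        b0 += (h11 * s0 - h01 * s1) / det
        b1 += (h00 * s1 - h01 * s0) / det
    return b0, b1

# Hypothetical grouped data: points x_i, counts n_i(1), totals n_i
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
n1 = [4, 9, 15, 22, 26]
n = [30, 30, 30, 30, 30]

b0_ml, b1_ml = ml_logit(xs, n1, n)
```

At the returned values both first derivatives are numerically zero, which is the first-order condition for the maximum of the log-likelihood.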

7.3 Multi-Response Model

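The binary response model allows the regressand Y to take the value 0 or 1. On the basis of the regressors $X_1,\dots,X_{h-1}$ with realizations $x_1,\dots,x_{h-1}$ one calculates the probability that the realization of Y will be y=1 at the measurement point $x_i = (x_{i1},\dots,x_{i,h-1})$ as:
$p_i = P(y=1|x=x_i)$; $1-p_i = P(y=0|x=x_i)$ is then the probability that y=0.
The multi-response model allows the regressand Y to take q realizations $\kappa = 1,\dots,q$. Let $y = (y_1,\dots,y_q)'$ be a q-dimensional binary random vector with entry $y_\kappa = 1$ if Y takes the realization $y=\kappa$ and $y_\kappa = 0$ otherwise, so that $\sum_{\kappa=1}^{q}y_\kappa = 1$. Further denote the probability of observing $y=\kappa$ at the observation point $x_i = (x_{i1},\dots,x_{i,h-1})$ by $p_{i\kappa}$:
$p_{i\kappa} = P(y=\kappa|x=x_i) = P(y_\kappa=1|x=x_i)$, where $\sum_{\kappa=1}^{q}p_{i\kappa} = 1$
For $n_i$ measurements at the point $x_i$ one will observe $y=\kappa$ $n_i(\kappa)$ times. The sample result can then be written as
$n_1(1),\dots,n_1(q), n_2(1),\dots,n_2(q),\dots,n_m(1),\dots,n_m(q)$ with $\sum_{\kappa=1}^{q}n_i(\kappa) = n_i$
For independent observations the corresponding likelihood function is the product of multinomial probabilities:
$L = \prod_{i=1}^{m}\frac{n_i!}{n_i(1)!\cdots n_i(q)!}\prod_{\kappa=1}^{q}p_{i\kappa}^{\,n_i(\kappa)}$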

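The unknown probabilities $p_{i\kappa}$ are given as functions of the observation point $x_i$ that depend on unknown parameters. These parameters are estimated by maximizing the likelihood function with respect to them. The conditional logit model gives:
$p_{i1} = 1/[1+\sum_{k=2}^{q}\exp(\beta_{k0} + \sum_{j=1}^{h-1}\beta_{kj}x_{ij})] = 1/[1+\sum_{k=2}^{q}\exp((1,x_i')\beta_k)]$
$p_{i\kappa} = \exp(\beta_{\kappa 0} + \sum_{j=1}^{h-1}\beta_{\kappa j}x_{ij})/[1+\sum_{k=2}^{q}\exp(\beta_{k0} + \sum_{j=1}^{h-1}\beta_{kj}x_{ij})] = \exp((1,x_i')\beta_\kappa)/[1+\sum_{k=2}^{q}\exp((1,x_i')\beta_k)]$ for $\kappa = 2,\dots,q$
For every realization $\kappa = 2,\dots,q$ a different parameter vector $\beta_\kappa = (\beta_{\kappa 0},\dots,\beta_{\kappa,h-1})'$ is allowed, and L is maximized with respect to the parameter vectors $\beta_2,\dots,\beta_q$.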
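The probabilities of the multi-response logit model can be computed as follows. This is a small sketch assuming that category 1 is the reference category and that every other category has its own parameter vector, so that the q probabilities sum to one; the function name and the parameter values in the usage lines are hypothetical.

```python
import math

def multi_response_probs(x, betas):
    """Return [p_1, ..., p_q] at the point x = (x_1, ..., x_{h-1}).

    betas holds one vector (b_k0, b_k1, ..., b_k,h-1) for each category
    k = 2, ..., q; category 1 is the reference category.
    """
    scores = [b[0] + sum(bj * xj for bj, xj in zip(b[1:], x)) for b in betas]
    denom = 1.0 + sum(math.exp(s) for s in scores)
    return [1.0 / denom] + [math.exp(s) / denom for s in scores]

# Hypothetical parameters for q = 3 categories and two regressors
probs = multi_response_probs([1.0, 2.0], [[0.5, -1.0, 0.25], [-0.3, 0.2, 0.1]])
```

With all parameters equal to zero every category receives probability 1/q, which is a quick sanity check of the normalization.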

7.4 Relationship between discrete regression and discriminant analysis

Question: how can discrete regression be used to classify a new object, characterized by the observations $x^* = (x_1^*,\dots,x_{h-1}^*)$ of the regressors $(X_1,\dots,X_{h-1})$, into one of the classes of Y? If for the binary response model the probability of observing y=1 has been estimated as
$\hat P = G(\hat\beta_0 + \sum_{j=1}^{h-1}\hat\beta_j x_j)$
then one can use 0.5 as the cutoff point and estimate for a new object:
$\hat y = 1$ if $\hat p = G(\hat\beta_0 + \sum_{j=1}^{h-1}\hat\beta_j x_j^*) > 0.5$, $\hat y = 0$ otherwise
An alternative formulation can be given by:
$\hat y^{Fisher} = n(0)/n$ (i.e. $\hat y = 1$) if $\hat\beta_0 + \sum_{j=1}^{h-1}\hat\beta_j x_j^* > 0.5 - n(1)/n$, $\hat y^{Fisher} = -n(1)/n$ (i.e. $\hat y = 0$) otherwise
These results correspond exactly with the linear probability model with LS estimators. If there are no residuals then probit and logit models give identical results. One can relate this binary response estimation to Fisher's linear discriminant function:
$\bar x(0) = (\bar x_1(0),\dots,\bar x_{h-1}(0))'$, $\bar x(1) = (\bar x_1(1),\dots,\bar x_{h-1}(1))'$
$\bar x_j(0) = \frac{1}{n(0)}\sum_{i=1}^{m}n_i(0)x_{ij}$ for j=1,…,h-1
$\bar x_j(1) = \frac{1}{n(1)}\sum_{i=1}^{m}n_i(1)x_{ij}$ for j=1,…,h-1
$S(0) = \frac{1}{n(0)-1}\sum_{i=1}^{m}n_i(0)[x_i - \bar x(0)][x_i - \bar x(0)]'$
$S(1) = \frac{1}{n(1)-1}\sum_{i=1}^{m}n_i(1)[x_i - \bar x(1)][x_i - \bar x(1)]'$
$S = \frac{1}{n-2}\{[n(0)-1]S(0) + [n(1)-1]S(1)\}$
Then Fisher's discriminant function can be given as:
$h(x) = (\bar x(0) - \bar x(1))'S^{-1}x - \frac{1}{2}(\bar x(0) - \bar x(1))'S^{-1}(\bar x(0) + \bar x(1))$
Then the decision rule for $x = x^*$ is:
$\hat y = 0$ if $h(x^*) > 0$, $\hat y = 1$ otherwise
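Fisher's discriminant function can be sketched for two regressors, where the pooled covariance matrix is 2x2 and can be inverted by hand. The observations are hypothetical and ungrouped (every point counts once, i.e. all $n_i(0)$ and $n_i(1)$ are 0 or 1); the helper names are illustrative.

```python
def mean_vector(points):
    """Componentwise mean of a list of 2-dimensional points."""
    n = len(points)
    return [sum(p[k] for p in points) / n for k in range(2)]

def scatter(points, mean):
    """Sum of (x - mean)(x - mean)' as a 2x2 nested list."""
    s = [[0.0, 0.0], [0.0, 0.0]]
    for p in points:
        d = [p[0] - mean[0], p[1] - mean[1]]
        for a in range(2):
            for b in range(2):
                s[a][b] += d[a] * d[b]
    return s

def fisher_discriminant(class0, class1):
    """Return h(x) for two regressors; h(x) > 0 assigns class 0."""
    m0, m1 = mean_vector(class0), mean_vector(class1)
    s0, s1 = scatter(class0, m0), scatter(class1, m1)
    dof = len(class0) + len(class1) - 2
    # pooled covariance S = {[n(0)-1] S(0) + [n(1)-1] S(1)} / (n - 2)
    s = [[(s0[a][b] + s1[a][b]) / dof for b in range(2)] for a in range(2)]
    det = s[0][0] * s[1][1] - s[0][1] * s[1][0]
    inv = [[s[1][1] / det, -s[0][1] / det], [-s[1][0] / det, s[0][0] / det]]
    diff = [m0[0] - m1[0], m0[1] - m1[1]]
    # w = S^-1 (mean0 - mean1)
    w = [inv[0][0] * diff[0] + inv[0][1] * diff[1],
         inv[1][0] * diff[0] + inv[1][1] * diff[1]]
    const = 0.5 * (w[0] * (m0[0] + m1[0]) + w[1] * (m0[1] + m1[1]))
    return lambda x: w[0] * x[0] + w[1] * x[1] - const

# Hypothetical observations for the classes y=0 and y=1
class0 = [(1.0, 2.0), (2.0, 1.0), (2.0, 3.0), (3.0, 2.0)]
class1 = [(5.0, 6.0), (6.0, 5.0), (6.0, 7.0), (7.0, 6.0)]
h = fisher_discriminant(class0, class1)
y_hat = 0 if h((2.5, 2.5)) > 0 else 1
```

The function is zero at the midpoint between the two class means, positive on the side of class 0 and negative on the side of class 1.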

8 KE 8 Graphical approaches
A quantitative data matrix is needed; use scaling for qualitative data or MDS for distance data to obtain a quantitative data matrix.
1D and 2D:
o Stem-and-leaf plot
o Box plot
o Scatter plot
o Quantile-quantile (QQ) plots: normality, outliers
o Biplots
3D and nD:
o profiles, polygons, radii etc.
o faces
o Andrews plots
o trees, boxes, castles according to cluster and discriminant

8.1 Joint representation of objects and criteria

represent n objects w.r.t. their values of p criteria
represent p criteria w.r.t. the n observed objects
represent both objects and criteria together

8.1.1 Graphical representation of 1- and 2D data

Combine the representation of both dispersion and centrality of the objects through stem-and-leaf plots and box plots.
Assume one quantitative criterion X is observed on n objects so that object i is assigned the value $x_i$.
Sort the observations by size: $x_1 \le x_2 \le \dots \le x_n$
Subdivide the interval $[x_1, x_n]$ into equally sized bins and count the objects in each bin.
Subdivide the left vertical axis (top = max) into the bins and plot the respective number of data points within each bin.
The right vertical axis should run from minimum (top) to maximum criterion values; set the box between the 0.25 and 0.75 quantiles, mark the mean $\bar x$ by a cross on the axis and the median $\tilde x$ by a circle, and mark data outside $[\bar x - 3\sqrt{s^2}, \bar x + 3\sqrt{s^2}]$.
Compute the $\alpha$ quantiles $x_\alpha$ as follows:
o $x_\alpha = x_k$ if $n\alpha$ is not an integer, where k is the next integer following $n\alpha$
o $x_\alpha = \frac{1}{2}(x_k + x_{k+1})$ if $n\alpha$ is an integer, with $k = n\alpha$
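The quantile rule can be written as a short function. This is a sketch assuming $0 < \alpha < 1$ and the averaging factor 1/2 in the integer case; the function name `quantile` and the tolerance used to detect an integer $n\alpha$ are illustrative choices.

```python
import math

def quantile(values, alpha):
    """Empirical alpha quantile: x_k, or the mean of x_k and x_{k+1}."""
    if not 0.0 < alpha < 1.0:
        raise ValueError("alpha must lie strictly between 0 and 1")
    xs = sorted(values)
    na = len(xs) * alpha
    if abs(na - round(na)) < 1e-12:
        # n * alpha is an integer k: average x_k and x_{k+1} (1-based)
        k = int(round(na))
        return 0.5 * (xs[k - 1] + xs[k])
    # otherwise k is the next integer following n * alpha
    k = math.ceil(na)
    return xs[k - 1]

# Box limits and median for a small hypothetical sample
sample = [4.0, 1.0, 3.0, 2.0]
box = (quantile(sample, 0.25), quantile(sample, 0.5), quantile(sample, 0.75))
```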



8.1.2 Probability plotting

Used to compare the distributions of two sets of data. A QQ plot compares the inverse distribution functions, i.e. quantiles; a PP plot compares probabilities, i.e. values of the distribution functions.
From an n×p quantitative data matrix Y first compute the mean vector $\bar y$ and the inverse covariance matrix $S^{-1}$ as well as the squared Mahalanobis distances from the mean: $d_i = (y_i - \bar y)'S^{-1}(y_i - \bar y)$
Sort the distances: $d_1 \le d_2 \le \dots \le d_n$
Find the i/(n+1) quantiles $\chi^2_{p;\,i/(n+1)}$ of the $\chi^2$ distribution with p degrees of freedom.
Then the n points $(\chi^2_{p;\,i/(n+1)}, d_i)$ should lie approximately on the 45° line through the origin. The further the distribution is from multivariate normal, the greater the deviation from the 45° line.
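The construction of the plot points can be sketched for the bivariate case. The restriction to p = 2 is an assumption made to stay within the standard library: for two degrees of freedom the $\chi^2$ quantile has the closed form $-2\ln(1-q)$, and the 2x2 covariance matrix can be inverted by hand. The data are hypothetical.

```python
import math

def chi2_plot_points(data):
    """Sorted pairs (chi-square quantile, squared Mahalanobis distance), p = 2."""
    n = len(data)
    m = [sum(row[k] for row in data) / n for k in range(2)]
    dev = [(row[0] - m[0], row[1] - m[1]) for row in data]
    # sample covariance matrix with divisor n - 1
    s11 = sum(a * a for a, _ in dev) / (n - 1)
    s22 = sum(b * b for _, b in dev) / (n - 1)
    s12 = sum(a * b for a, b in dev) / (n - 1)
    det = s11 * s22 - s12 ** 2
    # d = (a, b) S^-1 (a, b)' with the 2x2 inverse written out
    d = sorted((s22 * a * a - 2.0 * s12 * a * b + s11 * b * b) / det
               for a, b in dev)
    # i/(n+1) quantiles of the chi-square distribution with 2 degrees of freedom
    q = [-2.0 * math.log(1.0 - i / (n + 1)) for i in range(1, n + 1)]
    return list(zip(q, d))

# Hypothetical bivariate observations
data = [(1.0, 2.0), (2.0, 1.5), (3.0, 3.5), (4.0, 3.0), (5.0, 5.5), (6.0, 4.0)]
points = chi2_plot_points(data)
```

Plotting `points` against the line through the origin with slope 1 gives the visual check described above; for general p a $\chi^2$ quantile function from a statistics library would be needed.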

8.1.3 Bi-Plot
A biplot visualizes the rows and columns of a data matrix Y, or of a modified matrix Y* with $y_{ij}^* = y_{ij} - \bar y_{.j}$, simultaneously. To visualize a matrix with rank > 2 in two dimensions use an approximation $Y_2$ for Y*. This can be obtained from the singular value decomposition: find the two largest eigenvalues $\lambda_1, \lambda_2$ of $Y^{*\prime}Y^*$ with corresponding eigenvectors $q_1, q_2$ and define the normalized vectors $p_k = \lambda_k^{-1/2}Y^*q_k$ (k=1,2). Then
$Y_2 = (p_1,p_2)\,diag(\sqrt{\lambda_1},\sqrt{\lambda_2})\,(q_1,q_2)'$
Now factor $Y_2 = HM'$, where $H = \sqrt{n-1}\,(p_1,p_2)$ is n×2 with orthogonal columns and, accordingly, $M = \frac{1}{\sqrt{n-1}}(\sqrt{\lambda_1}q_1, \sqrt{\lambda_2}q_2) = (M_1,\dots,M_p)'$ is p×2.
The ith row of H represents the ith object and the jth row of M the jth characteristic. Represent the rows of H as points and the rows of M as vectors from the origin. Then the Euclidean distance between the points i and i' approximates the Mahalanobis distance between the ith and i'th rows of Y, i.e. between the ith and i'th objects. The dot product $M_j'M_{j'}$ approximates the covariance of the characteristics $y_j$ and $y_{j'}$. The length of the jth vector approximates the standard deviation of $y_j$, and the cosine of the angle between the jth and j'th vectors approximates the correlation between the jth and j'th characteristics.
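The covariance and correlation interpretation of the rows of M can be checked with a short calculation. The lines below are a sketch under the assumptions that Y* is column-centered, that $\lambda_k$ denotes the kth largest eigenvalue of $Y^{*\prime}Y^*$ with orthonormal eigenvector $q_k$, and that $S = Y^{*\prime}Y^*/(n-1)$ is the sample covariance matrix with entries $s_{jj'}$.

```latex
\begin{align*}
MM' &= \frac{1}{n-1}\left(\lambda_1 q_1 q_1' + \lambda_2 q_2 q_2'\right)
     \approx \frac{1}{n-1}\sum_{k=1}^{p}\lambda_k q_k q_k'
     = \frac{1}{n-1}\,Y^{*\prime}Y^{*} = S \\
M_j'M_{j'} &\approx s_{jj'}, \qquad \|M_j\| \approx \sqrt{s_{jj}} \\
\cos\angle(M_j, M_{j'}) &= \frac{M_j'M_{j'}}{\|M_j\|\,\|M_{j'}\|}
     \approx \frac{s_{jj'}}{\sqrt{s_{jj}\,s_{j'j'}}} = r_{jj'}
\end{align*}
```

The approximation is exact when Y* has rank 2 and improves as the share of $\lambda_1 + \lambda_2$ in the sum of all eigenvalues grows.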

8.1.4 Further graphical techniques

MDS in two or three dimensions allows a scatter plot
Hierarchies and quasi-hierarchies are represented by dendrograms
Factor analysis
o display the factors in the space spanned by the characteristics
o estimate factor values and plot these for the n objects


8.2 Individual representation of objects or characteristics

Examples focus on objects but transposing the data matrix enables us to represent characteristics in the same way.

8.2.1 Simple representations of objects

bars or profiles: for every characteristic and object plot a bar the height of which corresponds (is proportional) to the entry in the data matrix

polygonal lines: same as bars, only the data points are plotted above the characteristics and connected with lines

polygons in polar coordinates: these work in the same way if all values are positive. The results are stars; the angle between adjacent rays is $2\pi/p$ and the radius corresponds to the actual values.

Same for suns only here circle and center are not needed:

Glyphs: subdivide the top 1/6 of a circle into p partitions and plot the distances on lines through subdivision point and center, pointing outside the circle. Two plotting alternatives:
o lower 1/3 of the values: length 0 (on radius); middle 1/3: length c; top 1/3 of the values: length 2c
o lengths proportional to values

8.2.2 Representation by faces

represent n objects with p characteristics by n faces
Flury-Riedwyl faces allow the representation of 36 parameters, since for each half face one can vary: eye size, pupil size, pupil position, horizontal position of eyes, vertical position of eyes, inclination of eyes, curvature of eyebrow, thickness of eyebrow, position of eyebrow (vert. and horz.), upper hairline, lower hairline, hair darkness, hair, inclination faceline, nose, size of mouth, curvature of mouth
Standardize the data in the data matrix as follows: $\tilde y_{ij} = (y_{ij} - y_j^{min})/(y_j^{max} - y_j^{min})$, where $y_j^{min}$ and $y_j^{max}$ are the smallest and largest observed values of characteristic j

for fewer than 36 parameters keep some features constant; for fewer than 18 choose symmetrical faces or vary only one half face
extremely subjective due to the choice of the characteristic-feature matching
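The standardization used before drawing the faces can be sketched as a columnwise min-max scaling. The assumption here is that minimum and maximum are taken per characteristic (column) over all objects; the function name and the data are hypothetical.

```python
def standardize_columns(matrix):
    """Scale every column of a data matrix (list of rows) to the range [0, 1].

    Assumes no column is constant, since that would make the
    denominator max - min equal to zero.
    """
    cols = list(zip(*matrix))
    lows = [min(c) for c in cols]
    highs = [max(c) for c in cols]
    return [[(v - lo) / (hi - lo) for v, lo, hi in zip(row, lows, highs)]
            for row in matrix]

# Hypothetical data: 3 objects, 2 characteristics
y = [[1.0, 10.0], [2.0, 20.0], [3.0, 40.0]]
y_std = standardize_columns(y)
```

Each standardized value can then be mapped linearly onto the admissible range of the facial feature assigned to that characteristic.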

8.2.3 Andrews plots

Each object i is represented by a sum of sine and cosine functions:
$f_i(t) = c_{1i}/\sqrt{2} + c_{2i}\sin t + c_{3i}\cos t + c_{4i}\sin 2t + c_{5i}\cos 2t + \dots + c_{(p-1)i}\sin(\frac{p-1}{2}t) + c_{pi}\cos(\frac{p-1}{2}t)$, $t \in [-\pi,\pi]$

To represent Andrews plots in polar coordinates plot the function $\tilde f_i(t) = f_i(t) + c$, $t \in [-\pi,\pi]$, where $c \ge |\min_{t\in[-\pi,\pi]} f_i(t)|$ for all i=1,…,n
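An Andrews function and the shift for the polar representation can be sketched as follows. The sketch assumes the usual convention that the first coefficient is divided by $\sqrt{2}$; the function names are illustrative, and the minimum over $[-\pi,\pi]$ is only approximated on a grid.

```python
import math

def andrews(c, t):
    """Andrews function for coefficients c = (c_1, ..., c_p) at angle t."""
    total = c[0] / math.sqrt(2.0)
    for k, ck in enumerate(c[1:], start=2):
        freq = k // 2    # c_2, c_3 -> frequency 1; c_4, c_5 -> frequency 2; ...
        if k % 2 == 0:
            total += ck * math.sin(freq * t)
        else:
            total += ck * math.cos(freq * t)
    return total

def polar_offset(objects, steps=360):
    """Shift c = |min f_i(t)|, with the minimum taken on a grid over [-pi, pi]."""
    grid = [-math.pi + 2.0 * math.pi * s / steps for s in range(steps + 1)]
    lowest = min(andrews(c, t) for c in objects for t in grid)
    return abs(lowest)

# Hypothetical coefficient vectors for two objects with p = 3
objects = [[math.sqrt(2.0), 0.0, 1.0], [0.0, 1.0, 0.0]]
shift = polar_offset(objects)
```

Adding `shift` to every curve keeps the plotted radius nonnegative whenever the smallest function value is negative.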

8.2.4 Representation of objects and their similarities

Assume a data matrix with nonnegative values. Possible metrics for the similarity of characteristics are e.g. their Euclidean distance d(j,k) and their absolute correlation $|r_{jk}|$, computed columnwise (characteristics, not objects):
$d(j,k) = \left(\sum_{i=1}^{n}(y_{ij} - y_{ik})^2\right)^{1/2}$
$|r_{jk}| = \left|\frac{\sum_{i=1}^{n}(y_{ij} - \bar y_j)(y_{ik} - \bar y_k)}{\sqrt{\sum_{i=1}^{n}(y_{ij} - \bar y_j)^2\,\sum_{i=1}^{n}(y_{ik} - \bar y_k)^2}}\right|$
Now use the distances or the coefficients of indetermination ($1 - r_{jk}^2$) to perform a hierarchical cluster analysis with complete linkage as heterogeneity metric and draw a dendrogram. Use the distance metrics to determine the length of the branches in the dendrogram.
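The two columnwise similarity measures can be computed as follows. This is a small sketch with a hypothetical data matrix stored as a list of rows; the helper names are illustrative.

```python
import math

def column(matrix, j):
    """Values of characteristic j over all objects."""
    return [row[j] for row in matrix]

def euclidean_distance(matrix, j, k):
    """Euclidean distance d(j, k) between characteristics j and k."""
    return math.sqrt(sum((a - b) ** 2
                         for a, b in zip(column(matrix, j), column(matrix, k))))

def abs_correlation(matrix, j, k):
    """Absolute correlation |r_jk| between characteristics j and k."""
    yj, yk = column(matrix, j), column(matrix, k)
    mj, mk = sum(yj) / len(yj), sum(yk) / len(yk)
    num = sum((a - mj) * (b - mk) for a, b in zip(yj, yk))
    den = math.sqrt(sum((a - mj) ** 2 for a in yj) * sum((b - mk) ** 2 for b in yk))
    return abs(num / den)

# Hypothetical data matrix: 3 objects (rows), 3 characteristics (columns)
y = [[1.0, 2.0, 3.0], [2.0, 4.0, 1.0], [3.0, 6.0, 2.0]]
indetermination = 1.0 - abs_correlation(y, 0, 2) ** 2
```

Either `euclidean_distance` or the coefficient of indetermination can then serve as the dissimilarity passed to a complete-linkage clustering of the characteristics.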

For small numbers of characteristics one can also use rectangular blocks.
o cluster the characteristics into three groups (e.g. use the level of the hierarchy with three classes)
o assign each group a corresponding dimension of the block (height, width, length)
o draw each dimension proportional to the sum of the values of the distance metrics in the corresponding group
o finally mark the individual characteristic boundaries on the blocks; this makes them look like parcels

Another visualization can be achieved with Kleiner-Hartigan trees, which use the dendrogram of the hierarchy as their basis. Begin with the stem, i.e. the class which encompasses all objects, and draw thinner branches as the partitions become finer. To determine the angle between two branches proceed as follows:
o assign a maximum angle $\Theta_{max}$ (e.g. 80°) to the branching of the stem and a minimum angle $\Theta_{min}$ (e.g. 30°) to the two-element class with the minimum heterogeneity (or maximum homogeneity) of the two constituent one-element classes
o all other branchings are allocated values in between
o call A, B, C, … the classes of the hierarchy, where A denotes the final class with all p elements, and call the heterogeneities of the classes $g_A, g_B, g_C$ etc. Calculate the angle
$\Theta_X = \frac{\Theta_{min}[\ln(g_A+1) - \ln(g_X+1)] + \Theta_{max}[\ln(g_X+1) - \ln(g_{min}+1)]}{\ln(g_A+1) - \ln(g_{min}+1)}$ for X = B, C, …
where $g_{min}$ denotes the heterogeneity of the classes that are merged first.
An angle is divided along the vertical proportionally to the width of the branches.
When dividing the stem the direction is reversed with each branching.
When branches are subdivided the thicker one is chosen to go away from the trunk.
Choose the length of a branch proportional to the average value of the characteristics contained in it and the width proportional to the number of characteristics in the branch.
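The angle rule can be sketched as a function. The sketch assumes that the intended formula is a linear interpolation on the scale $\ln(g+1)$ between the minimum angle (at the smallest heterogeneity) and the maximum angle (at the stem), which matches the verbal description of the first two steps; the function and parameter names are hypothetical.

```python
import math

def branch_angle(g_x, g_all, g_min, angle_max=80.0, angle_min=30.0):
    """Angle for a class with heterogeneity g_x.

    g_all is the heterogeneity of the class containing all elements,
    g_min that of the class merged first; requires g_all > g_min.
    """
    span = math.log(g_all + 1.0) - math.log(g_min + 1.0)
    upper = math.log(g_all + 1.0) - math.log(g_x + 1.0)
    lower = math.log(g_x + 1.0) - math.log(g_min + 1.0)
    return (angle_min * upper + angle_max * lower) / span

# Hypothetical heterogeneities: stem 10.0, first merge 1.0, some class 4.0
theta = branch_angle(4.0, 10.0, 1.0)
```

By construction the stem receives the maximum angle and the class merged first receives the minimum angle; every other class falls in between.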



A similar approach is used for castles, also by Kleiner and Hartigan, except that all angles are 0°. The height of one storey (turret) above the ground is proportional to the minimum of the values of the characteristics contained in the turret minus some factor d times q, the smallest characteristic number contained in the class (e.g. if the class contains the characteristics 3, 2, 8 then q=2).
