
redressing method

Sergio Camiz

DIPARTIMENTO DI MATEMATICA

“GUIDO CASTELNUOVO”

The Guttman effect

Sergio Camiz

Dipartimento di Matematica Guido Castelnuovo - Università di Roma La Sapienza

e-mail: sergio.camiz@uniroma1.it

Abstract. The Guttman effect is introduced as a result of factor analyses of band data tables. The problems raised are discussed, in particular with reference to vegetation analysis. Two possible investigation pathways have emerged: one aiming at using different analysis techniques, based on the Gaussian model (Ihm and van Groenewoud, 1975; Johnson and Goodall, 1980), and the other based on the redressing of the scatter plots to remove this effect (Gauch and Hill, 1980; Delicado and Aluja, 2003). In this paper, the various methods are described and a new simple redressing method is introduced, based on the interpretation of the scatter plots, on their parabolic pattern and on some of their properties.

This effect was first identified by Guttman (1953). Given either a band data table, such as the one in Table 1, or one obtained from variables having a Gaussian distribution along a given factor, as in Table 2, if one submits the table to Principal Components Analysis (in the following, PCA), the resulting pattern of row points on the plane spanned by the first two principal components has a horse-shoe shape, like the one represented in Figure 1. The same table, submitted to Correspondence Analysis (in the following, CA), produces an arch pattern of row points on the plane spanned by the first two factors, as represented in Figure 2. In this latter case, the column-points pattern is very similar to the row-points one, whereas in the previous one [...]

Table 1 - A band data table.

   |  C  C  C  C  C  C  C  C  C  C  C  C  C  C  C
   |  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15
---+---------------------------------------------
U1 |  1  1  1
U2 |  1  1  1  1
U3 |  1  1  1  1  1
U4 |     1  1  1  1  1
U5 |        1  1  1  1  1
U6 |           1  1  1  1  1
U7 |              1  1  1  1  1
U8 |                 1  1  1  1  1
U9 |                    1  1  1  1  1
U10|                       1  1  1  1  1
U11|                          1  1  1  1  1
U12|                             1  1  1  1  1
U13|                                1  1  1  1  1
U14|                                   1  1  1  1
U15|                                      1  1  1

Table 2 - A data table of variables having a Gaussian distribution on a common factor.

   |  C  C  C  C  C  C  C  C  C  C  C  C  C  C  C
   |  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15
---+---------------------------------------------
U1 |  5  3  1
U2 |  3  5  3  1
U3 |  1  3  5  3  1
U4 |     1  3  5  3  1
U5 |        1  3  5  3  1
U6 |           1  3  5  3  1
U7 |              1  3  5  3  1
U8 |                 1  3  5  3  1
U9 |                    1  3  5  3  1
U10|                       1  3  5  3  1
U11|                          1  3  5  3  1
U12|                             1  3  5  3  1
U13|                                1  3  5  3  1
U14|                                   1  3  5  3
U15|                                      1  3  5

Figure 1 - Horse-shoe effect of rows and columns of Table 1 on the plane spanned by principal axes 1 and 2 of its Principal Components Analysis.

Figure 2 - Arch effect of rows and columns (superposed) of Table 1 on the plane spanned by factors 1 and 2 of Correspondence Analysis.

Since then, these particular shapes have been studied both to understand their theoretical rationale and to overcome them, in order to make the pattern linear. Indeed, in some cases a rectilinear pattern is considered a better representation of the data structure, in particular if one wishes to use the coordinates for other purposes, such as identifying the position of a point along the underlying factor. Limited as most of the studies were to simulated data sets, they aimed at linearising the arch, that is at making the distribution rectilinear, but they took into account neither the multidimensional pattern usually resulting from the analysis of real cases, nor its meaning. In this way the studies were incomplete. Only recently did Delicado and Aluja (2003) adequately consider the second dimension, in a very interesting way. In a contemporary study, Camiz and Polenta (2003) propose an analogous redressing method, much simpler in its rationale, based on the current interpretation of the points pattern on factor planes and on the parabolic shape of the scatter. In this paper, after a review of the main suggestions found in the literature aiming at solving both issues (section 2), the methods based on the Gaussian model are presented (section 3), followed by the Delicado and Aluja (2003) redressing (section 4). Then, a two-dimensional interpretation of the Guttman effect is given (section 5) and the new redressing method of Camiz and Polenta (2003) is proposed (section 6), together with its comparison with some other methods, based on an application to real data (section 7).
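The arch described for band tables can be reproduced numerically. The following is a minimal sketch, assuming NumPy and a band table built like Table 1 (15 rows and columns, band of five 1s, truncated at the corners); the function names `band_table` and `correspondence_analysis` are illustrative and not taken from the paper. CA is computed in the usual way, through the singular value decomposition of the standardised residuals.

```python
import numpy as np

def band_table(n=15, half_width=2):
    """Square 0/1 band table: row i has 1s in columns i-half_width .. i+half_width."""
    idx = np.arange(n)
    return (np.abs(idx[:, None] - idx[None, :]) <= half_width).astype(float)

def correspondence_analysis(table, n_axes=2):
    """Row and column principal coordinates of CA, via SVD of standardised residuals."""
    P = table / table.sum()
    r = P.sum(axis=1)                      # row masses
    c = P.sum(axis=0)                      # column masses
    S = (P - np.outer(r, c)) / np.sqrt(np.outer(r, c))
    U, sv, Vt = np.linalg.svd(S, full_matrices=False)
    rows = (U[:, :n_axes] * sv[:n_axes]) / np.sqrt(r)[:, None]
    cols = (Vt.T[:, :n_axes] * sv[:n_axes]) / np.sqrt(c)[:, None]
    return rows, cols, sv[:n_axes] ** 2

table = band_table()
rows, cols, eigenvalues = correspondence_analysis(table)

# Regress the second factor on a second-degree polynomial of the first one.
coef = np.polyfit(rows[:, 0], rows[:, 1], deg=2)
fitted = np.polyval(coef, rows[:, 0])
ss_res = np.sum((rows[:, 1] - fitted) ** 2)
ss_tot = np.sum((rows[:, 1] - rows[:, 1].mean()) ** 2)
r2 = 1.0 - ss_res / ss_tot

# The first factor should follow the row order of the band table.
order_corr = np.corrcoef(np.arange(len(rows)), rows[:, 0])[0, 1]
print(f"R^2 of quadratic fit: {r2:.3f}; |corr(order, factor 1)|: {abs(order_corr):.3f}")
```

A quadratic fit that explains most of the variance of the second factor is the numerical counterpart of the arch of Figure 2; plotting `rows[:, 0]` against `rows[:, 1]` shows it directly.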

Benzécri (1973-82) showed that the pattern of both row and column points of a band matrix on the plane spanned by the first and the n-th factor of CA is an n-th degree polynomial. Van Rijckevorsel (1987) studies the problem in the frame of homogeneity analysis. The Benzécri approach was recently considered by Baccini et al. (1994), who theoretically explain the Guttman effect. Given the probability distribution function of a random vector (X, Y) having a standardised bivariate normal distribution with correlation coefficient ρ, [...] (Watson, 1933), is the reconstruction formula given by CA, where n is the dimension of the solution, the eigenvalues [...]

The second issue, namely the redressing of the horseshoe / arch pattern, has long been treated in the frame of vegetation science, since the multidimensional structure of a vegetation data table is evident and very often depends on a major environmental factor (or a combination of several co-occurrent factors). In this case, the eigenanalysis-based methods, such as PCA and CA, show the Guttman effect, since the relation between factors and plant species is unimodal. The detection of the Guttman effect on the factor planes is thus evidence of the seriation of species and relevés along an environmental gradient, but vegetation scientists are not satisfied with this information and would like to estimate the position of both species and relevés along the detected gradient. This identification of indirect gradients through exploratory analysis tools was already criticised by Camiz (1991; 1994), who argues that this aim, typical of inferential statistics, is far beyond the limits of exploratory methods.

Nevertheless, there are cases in which the position of an item along the pattern would be used for classification purposes and for the reorganisation of the data table, even remaining in the frame of exploratory analysis. This occurs in the so-called table structuration, where both rows and columns are rearranged, so that similar items are grouped together and the order of groups and items reflects their order along the environmental factor. Both these issues are of major importance in vegetation science, in particular since a structured table is a basic tool for the definition of the syntaxa, i.e. the vegetation taxonomic units. In general, dealing with a linear structure is helpful, since the coordinates are then directly related to the identified factors, which is not true in the case of the Guttman effect.

[...] Benzécri (1973-82) proof was not known or they wished to appeal to intuition: as a consequence, the explanation of the Guttman effect was given as a multidimensional extension of the two-dimensional pattern of a sample taken from a population represented by two similar unimodal distributions, [...] (Figure 3a). If the sample is represented on a plane in which each coordinate is the frequency of either distribution, its pattern corresponds to the curve represented in Figure 3b (Wartenberg et al., 1987; see also Orlóci, 1978) and the Guttman effect results from a composition of several such patterns.

Figure 3 - (a) Example of two unimodal distributions having different mode along the same gradient; (b) the joint distribution of the two unimodal variables.

This explanation is not exact and it is surprising that it is still accepted nowadays. As a matter of fact, the practical / empirical interpretation can be based on the usual method of interpretation of PCA and CA graphics: the items at the ends of the distribution are negatively correlated, and those close to each other have similar behaviour. The point is that the Guttman effect originates from an eigenanalysis representation of distributions having their modes at different values of a common factor. So, studies were carried out in order to clarify empirically the relations among mode positions, distribution turnover, distribution ranges, length of the factor, etc. (among others, Gauch and Whittaker, 1972; Austin, 1976a; Kessel and Whittaker, 1976; Gauch et al., 1977; Prentice, 1980). In addition, other studies were carried out to compare different analysis techniques, including Kruskal's (1964a; 1964b) Non-Metric MultiDimensional Scaling (Austin, 1976b; Kenkel and Orlóci, 1986). The latter was considered the most effective in minimising the Guttman effect; nevertheless PCA and, even more, CA remain in the current practice of vegetation data analysis.

Figure 4 - Original and adjusted position of points on a plane according to the DECORANA adjustment based on moving average.

An important development, so to speak, was the discovery that CA is less heavily affected by the Guttman effect than PCA: in the latter, the folding of the pattern tails heavily biases the order of the items along the first factor. Should this folding not exist, as in CA, the coordinates of items along the first axis would be a sufficient approximation of the position of items along the factor. At least, it was argued that the order of the items along the factor would be sufficiently well represented, although this would not be the case for the reciprocal distances. So, CA became the common method for the analysis of vegetation tables thanks to its higher robustness to the Guttman effect, a wrong reason with respect to the true one, namely that a vegetation table is a transformation of a contingency table. As a drawback of this choice, the awareness that PCA factors represent groups of co-occurrent species, thus species associations (Noy-Meir, 1971), was eventually dropped. In all cases, attention was drawn to the position of items along the gradient, thus considering the problem only unidimensional. This certainly depended on a superficial interpretation of the Benzécri (1973-82) explanation of the polynomial relations among factors: once it is stated that the coordinates along the second factor are but a second degree polynomial transformation of those on the first, those on the third a third degree transformation, etc., the interest concerning the width of the arch was totally lost. Indeed, Gauch and Hill (1980) state that the effect is «simply a mathematical artifact, corresponding to non real data structure» (p. 48), so that it is not surprising that they proposed a detrending method, aimed at straightening the arch pattern resulting from CA. Their proposal, implemented both in DECORANA (Hill, 1979) and in ter Braak’s (1988) CANOCO, was to remove the arch from the second factor. To this aim they suggested two methods.

i) Divide the first axis range into intervals, compute the average of the second coordinates of the points falling in each interval, then subtract the interval average from the second coordinates of the points falling in the interval. It is evident that such a procedure, although it makes the arch pattern almost rectilinear, severely affects the reciprocal position of items, in particular as far as the second dimension is concerned (Figure 4), but this seemed unimportant to the detrending supporters, since «the preservation of the second dimension of the solution provides no additional information about the underlying order. Thus there is no reason to avoid detrending» (Peet et al., 1988: p. 926).

ii) A better technique that they propose is to fit the distribution along the axes following the first with a polynomial regression and to remove the fitted mean at each position along the first axis from the coordinate. This procedure causes a smaller distortion of the reciprocal position of points, but it does not bring any real advance towards the solution of the problem. In fact, it is questionable whether the adjusted distribution has any meaning as far as the following factors are concerned.

Presently, both DECORANA and CANOCO are widely used in the vegetation studies framework and detrending has become a current (wrong) practice.
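The two detrending schemes just described can be sketched as follows. This is only an illustration of the verbal description (interval means and polynomial residuals), written with NumPy on hypothetical synthetic data; it is not the DECORANA or CANOCO code, whose actual implementations include further steps not described here. The names `detrend_by_segments` and `detrend_by_polynomial` are assumptions.

```python
import numpy as np

def detrend_by_segments(x, y, n_segments=5):
    """Subtract from y the mean of y within each interval of the x range."""
    edges = np.linspace(x.min(), x.max(), n_segments + 1)
    # Digitising against the inner edges gives a segment index in 0 .. n_segments-1.
    seg = np.digitize(x, edges[1:-1])
    out = y.astype(float).copy()
    for k in range(n_segments):
        mask = seg == k
        if mask.any():
            out[mask] -= y[mask].mean()
    return out

def detrend_by_polynomial(x, y, degree=2):
    """Subtract from y its polynomial regression on x."""
    coef = np.polyfit(x, y, deg=degree)
    return y - np.polyval(coef, x)

# Hypothetical arch: a parabola plus a small genuine spread on the second axis.
rng = np.random.default_rng(0)
x = np.linspace(-1.0, 1.0, 60)
spread = rng.normal(scale=0.05, size=x.size)
y = 0.5 - x ** 2 + spread

y_seg = detrend_by_segments(x, y)
y_poly = detrend_by_polynomial(x, y)

# How much of the genuine second-dimension spread survives each adjustment.
corr_seg = np.corrcoef(y_seg, spread)[0, 1]
corr_poly = np.corrcoef(y_poly, spread)[0, 1]
print(f"segments: {corr_seg:.2f}, polynomial: {corr_poly:.2f}")
```

On data of this kind the interval-mean adjustment leaves a saw-tooth residue of the parabola inside each interval, which is one way to see why it distorts the second dimension more than the polynomial one.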

A different approach, theoretically more relevant, was to limit the attention to the uni-dimensional environmental factor, thus dropping the hope of getting a multidimensional representation, and to develop a method based on different assumptions. This led Ihm and Van Groenewoud (1975; 1984) to develop an analysis based on the Gaussian model. This model had already been investigated by Gauch and Chase (1974) and Gauch et al. (1974), who considered the Gaussian curve a reasonable approximation of the ecological niche, an issue criticised by Austin (1976a; 1980; 1987), who proposes other models too. In particular, he states that the symmetry of the distribution is far from evident, and that unimodality is acceptable, at least as long as the competition among species is not taken into account.

The Gaussian model is based on the idea that p variables have a symmetrical distribution on the real line given by

$$ y_j(x) = M_j \exp\left\{ -\frac{(x-\mu_j)^2}{2\sigma_j^2} \right\}, \qquad j = 1, \dots, p, \tag{1} $$

where $M_j$ is the maximum of the variable j, reached at the range mid-point $\mu_j$, the mean, and $\sigma_j$ is its standard deviation. With this model, the cover y of a plant species at the value x of an environmental gradient may be estimated.

The Ihm and Van Groenewoud (1975) method is based on the hypothesis of equal variance of the variables ($\sigma_j = \sigma$ for every j), which allows major simplifications and gives a very simple solution. Besides, $y_j(x)$ is supposed to be defined for every real value x. For every pair of variables (j, k), a dissimilarity index is given by the double natural logarithm of the integral of the product of the two distributions $y_j(x)$ and $y_k(x)$,

[...] (2)

Assuming the Gaussian distribution (1) for the ys, this gives the dissimilarities

[...]

thus a linear function of the squared differences among the means, with marginal values [...] and [...], where [...]. To transform the matrix into scalar products, the Gower (1966) transformation is applied [...]
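As a worked illustration of why the dissimilarities are linear in the squared differences among the means, the derivation below assumes that the index is $d_{jk} = -2\ln\int y_j(x)\,y_k(x)\,dx$ (the sign and the exact form of the original formula (2) are an assumption here), that each variable follows the Gaussian curve $y_j(x) = M_j \exp\{-(x-\mu_j)^2/(2\sigma^2)\}$ with a common $\sigma$, and that the Gower transformation is the usual double centring. The symbols $M_j$, $\mu_j$, $\sigma$, $d_{jk}$ and $b_{jk}$ are chosen for this sketch.

```latex
% Product of two equal-variance Gaussian curves, using
% (x-\mu_j)^2 + (x-\mu_k)^2 = 2\left(x - \tfrac{\mu_j+\mu_k}{2}\right)^2 + \tfrac{(\mu_j-\mu_k)^2}{2}
\int_{-\infty}^{+\infty} y_j(x)\, y_k(x)\, dx
  = M_j M_k\, \sigma\sqrt{\pi}\,
    \exp\left\{ -\frac{(\mu_j-\mu_k)^2}{4\sigma^2} \right\}

% Assumed dissimilarity: minus twice the natural logarithm of the integral
d_{jk} = \frac{(\mu_j-\mu_k)^2}{2\sigma^2}
         - 2\ln M_j - 2\ln M_k - \ln\left(\pi\sigma^2\right)

% Double centring removes every term depending on j alone or on k alone
b_{jk} = -\tfrac{1}{2}\left( d_{jk} - \bar d_{j\cdot} - \bar d_{\cdot k} + \bar d_{\cdot\cdot} \right)
       = \frac{(\mu_j-\bar\mu)(\mu_k-\bar\mu)}{2\sigma^2}
```

Under these assumptions the matrix of the $b_{jk}$ has rank one, so that its single non-null eigenvector is proportional to the centred means $\mu_j - \bar\mu$.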

In order to estimate the ys, the expected value of the j-th variable on the i-th observation [...] along the factor is estimated by the observed value $y_{ij}$: if it is error-free and the $x_i$ are chosen at regular distance, the sum

[...] (3)

is a numerical approximation of the integral (2). Although the $x_i$ are not always regularly distributed, one may consider them independent and uniformly distributed along the factor in a given interval [-A, A]. In this case (3) is a Monte-Carlo estimate of the integral (2) and the matrix [...] can be analysed, giving estimates of both the eigenvalue [...] and the eigenvector [...] of [...].
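The chain described above (sums of products as approximations of the integrals, logarithmic dissimilarities, Gower transformation, eigenanalysis) can be sketched numerically. The sketch below is not the Ihm and Van Groenewoud code: it assumes error-free equal-variance Gaussian variables observed at regularly spaced positions, a dissimilarity equal to minus twice the logarithm of the sum of products, and double centring as the Gower transformation. The function name `gaussian_ordination` and all the numerical values are hypothetical.

```python
import numpy as np

def gaussian_ordination(Y, step):
    """Centred and scaled modes of equal-variance Gaussian variables.

    Y has one row per observation (regularly spaced along the factor)
    and one column per variable; step is the spacing of the observations.
    """
    S = step * (Y.T @ Y)                  # sums of products approximating the integrals
    D = -2.0 * np.log(S)                  # assumed form of the dissimilarity index
    p = D.shape[0]
    J = np.eye(p) - np.ones((p, p)) / p   # centring matrix
    B = -0.5 * J @ D @ J                  # double centring (Gower transformation)
    eigval, eigvec = np.linalg.eigh(B)
    lead = np.argmax(eigval)
    return eigvec[:, lead] * np.sqrt(eigval[lead]), eigval[lead]

sigma = 1.0
mu = np.array([-2.0, -1.2, -0.5, 0.3, 1.0, 2.1])   # hypothetical modes
M = np.array([1.0, 0.8, 1.5, 0.6, 1.2, 0.9])       # hypothetical maxima
x = np.linspace(-10.0, 10.0, 401)                  # regular positions in [-A, A]
Y = M * np.exp(-(x[:, None] - mu) ** 2 / (2 * sigma ** 2))

coords, eigenvalue = gaussian_ordination(Y, step=x[1] - x[0])
expected = (mu - mu.mean()) / (sigma * np.sqrt(2))
print(np.round(coords, 3))
print(np.round(expected, 3))
```

Up to the arbitrary sign of the eigenvector, the recovered coordinates coincide with the centred modes divided by $\sigma\sqrt{2}$, whatever the maxima of the single variables.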

Following similar ideas, Johnson and Goodall (1980) propose a technique that they try to extend to the multidimensional case (Goodall and Johnson, 1982; 1987). The method is essentially devoted to the indirect identification of the ecological gradients underlying the structure of vegetation tables, so that they distinguish the case of species present in a relevé, where they try to fit a Gaussian model to the species cover, and the case of species absent, where they fit a parabola to the probability of absence of the species. For the estimation of parameters they use the maximum likelihood method.

Figure 5 - The Principal Curve as the curve whose points are the mean of those projected onto them (Delicado and Aluja, 2003).

Figure 6 - The Principal Curve of Oriented Points as a set of Principal Oriented Points (Delicado and Aluja, 2003).

Considering the Gaussian model (1) in the reduced form

[...]

they fit, for each species, a Gaussian curve via least squares, by minimizing the sum of squares

[...] (4)

where n is the number of relevés, $x_i$ is the position of the i-th relevé along the gradient, and $K_j$, $a_j$, $b_j$ are species-specific parameters of the model to be estimated. Note that in (4) only $y_{ij}$, the cover value of the j-th species in the i-th relevé, is known. In particular, since $x_i$ is unknown, standard regression techniques cannot be used to estimate $K_j$, $a_j$, $b_j$, so that they use an iterative approach, where the estimation of the $x_i$ is done through the maximum likelihood method. The likelihood function is built taking into account either the species cover, if the species is present, or its probability of absence, in this case using the fitted parabola. The procedure may thus be summarised as follows:

1. Start from a rough approximation of the xi values on the true gradient;

2. fit the bell-shaped response curves for each species j on the basis of the yij in each relevé i,

according to the position xi along the gradient;

3. obtain a better approximation of xi by using the maximum likelihood method applied to both

functions (Gaussian and parabola);

4. iterate the process from 2. until convergence.
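The alternating structure of the four steps above can be illustrated with a deliberately simplified sketch. It is not the Johnson and Goodall algorithm: the maximum likelihood step and the parabola for the probability of absence are replaced here by a plain least-squares search on a grid, and the curves are fitted by moment estimates with a single pooled variance. All names (`fit_species_curves`, `update_positions`, `alternate_fit`) and the synthetic data are assumptions.

```python
import numpy as np

def fit_species_curves(Y, x):
    """Step 2: moment estimates of the bell-shaped curves, with one pooled variance."""
    weights = Y.sum(axis=0)
    modes = (Y * x[:, None]).sum(axis=0) / weights
    pooled_var = (Y * (x[:, None] - modes) ** 2).sum() / weights.sum()
    heights = Y.max(axis=0)
    return heights, modes, pooled_var

def update_positions(Y, heights, modes, pooled_var, grid):
    """Step 3 (simplified): grid position minimising the squared error of each relevé."""
    # predicted[g, j] is the fitted cover of species j at grid position g
    predicted = heights * np.exp(-(grid[:, None] - modes) ** 2 / (2 * pooled_var))
    sq_err = ((Y[:, None, :] - predicted[None, :, :]) ** 2).sum(axis=2)
    return grid[np.argmin(sq_err, axis=1)]

def alternate_fit(Y, x_start, n_iter=10, grid_size=201):
    """Steps 1 to 4: start from rough positions and alternate the two fits."""
    x = x_start.astype(float).copy()
    for _ in range(n_iter):
        heights, modes, pooled_var = fit_species_curves(Y, x)
        grid = np.linspace(x.min(), x.max(), grid_size)
        x_new = update_positions(Y, heights, modes, pooled_var, grid)
        if np.allclose(x_new, x):
            break
        x = x_new
    return x

# Hypothetical error-free table: 40 relevés (rows) and 12 species (columns).
rng = np.random.default_rng(1)
true_x = np.linspace(0.0, 10.0, 40)
true_modes = np.linspace(0.0, 10.0, 12)
Y = np.exp(-(true_x[:, None] - true_modes) ** 2 / (2 * 1.5 ** 2))

x_start = true_x + rng.normal(scale=1.0, size=true_x.size)   # rough approximation
x_hat = alternate_fit(Y, x_start)
print(f"corr(true, estimated) = {np.corrcoef(true_x, x_hat)[0, 1]:.3f}")
```

Since nothing in this scheme fixes the origin and the unit of the gradient, the estimated positions are meaningful only up to a monotone rescaling, which is why the sketch reports a correlation rather than an error.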

The model was tested for robustness by Goodall and Johnson (1987). Note that, for a better convergence, the estimation of the species variances was replaced by a single pooled variance. The application of this method, albeit very interesting and very sophisticated, seems limited to vegetation analysis, since the proposed double fit, as well as some other adjustments not described here, is questionable for other kinds of data.

In general, dealing with the Gaussian model, it is evident that the equal variance hypothesis is far from being reasonable. It is sufficient to look at a typical vegetation table to acknowledge that there are species present in all relevés, even very far from each other along the first factor, and others with a very limited range.

In all these discussions, it is clear that the implicit assumption is that the dispersion of the points from the factor (curve) line is simply noise, or something otherwise caused by the arch effect, so that a uni-dimensional solution is acceptable. This is totally false: to see it, it is sufficient to show the pattern of items on factor planes when the matrix is a multiband one, such as the one in Table 3, which will be discussed later.

In most recent times, Delicado and Aluja (2003) propose, as a correction of the Guttman effect, the use of Principal Curves. Principal curves of a random variable X are one-dimensional parametrized curves α(t) having the self-consistency property that each point α(t) of the curve is the mean of all points whose projection on the curve (i.e. the orthogonal projection on the straight line tangent to the curve) is exactly α(t) (Figure 5; Hastie and Stuetzle, 1989). Starting from the concept of principal direction associated to a point x as the straight line through x that minimizes the variances of the projections of the other points on the line, Delicado (2001) defines a set of principal oriented points as a set of points coincident with the mean on their corresponding principal direction. In this way, he can define a principal curve of oriented points as a principal curve composed of principal oriented points (Figure 6). This allows Delicado and Huerta (2003) to propose an implementation of the method (for other algorithms, see Hastie and Stuetzle, 1989; Kégl et al., 2000).

So, given a plane scatter of points, Delicado and Aluja (2003) build a principal curve of the scatter (Figure 7a), then project each point onto it, thus obtaining two coordinates: one along the curve (e.g. the length of the curve segment between the curve mid-point and the projection) and one given by the distance of the point from its projection on the curve (Figure 7b). This kind of redressing is totally different from both detrendings proposed by Gauch and Hill (1980), in that it takes into account the scattering of points along a second dimension, which is highly biased in both Gauch and Hill methods, without any justification. In particular, in Figure 7c the same points are projected vertically on the curve, as in the parabolic regression, and the enormous difference in the results is clear.

Figure 7 - Once a principal curve is identified, points are projected onto it (a); for a linear representation, the coordinate along the curve and the length of the projection are plotted (b). The difference with the non-linear regression results (c) is evident.
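The two coordinates of this redressing, and their difference from vertical residuals, can be sketched as follows. The principal curve itself is not estimated here: a hypothetical parabola, densely sampled, stands in for it, and the orthogonal projection is approximated by the nearest sampled point. The function name `redress_on_curve` and the example points are assumptions.

```python
import numpy as np

def redress_on_curve(points, curve):
    """Approximate orthogonal projection of points onto a densely sampled curve.

    points is an (m, 2) array, curve a (k, 2) array of consecutive curve points.
    Returns the arc-length position of each projection, measured from the
    middle sample of the curve, and the distance of each point from the curve.
    """
    seg = np.linalg.norm(np.diff(curve, axis=0), axis=1)
    arc = np.concatenate([[0.0], np.cumsum(seg)])
    arc -= arc[len(curve) // 2]            # origin at the curve mid-point
    d = np.linalg.norm(points[:, None, :] - curve[None, :, :], axis=2)
    nearest = np.argmin(d, axis=1)
    return arc[nearest], d[np.arange(len(points)), nearest]

def arch(t):
    """Hypothetical parabolic arch standing in for the principal curve."""
    return 1.0 - t ** 2

t = np.linspace(-2.0, 2.0, 4001)
curve = np.column_stack([t, arch(t)])

points = np.array([[0.0, 1.5], [1.0, 0.0], [1.5, 0.0]])
along, dist = redress_on_curve(points, curve)
vertical = points[:, 1] - arch(points[:, 0])   # residuals of a parabolic regression

for p, a, d_, v in zip(points, along, dist, vertical):
    print(p, f"along={a:.3f} distance={d_:.3f} vertical residual={v:.3f}")
```

For the third point the vertical residual is much larger than the distance from the curve, which is the kind of discrepancy that Figure 7c illustrates.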

Instead of limiting the attention to a single-band matrix, as is usually done in the literature, let us now consider a multi-band matrix with bands having different widths, such as the one shown in Table 3, where the rows B and the columns P are supplemental ones. In both PCA and CA, supplemental elements are those that do not participate in the eigenanalysis, but are projected on the factors all the same. In this way, their position is interpreted like that of all other elements, but the construction of the factors is not influenced by them. Rather, they may be used for the interpretation of the factors, as an external suggestion. In this case, their use is instrumental: we do not want them to influence the factors, but we use them to set a limit to the scattering of all other elements. We limit here our comments to the pattern of points on the plane spanned by the first two factors.
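The projection of supplemental rows in CA can be sketched with the usual transition formula: a supplemental row is placed at the barycentre of the column points in standard coordinates, weighted by its profile. The sketch below assumes NumPy and uses, as active table, a hypothetical single-band table rather than Table 3; the function name `ca_with_supplementary_rows` is an assumption.

```python
import numpy as np

def ca_with_supplementary_rows(active, supplementary, n_axes=2):
    """CA of the active table, plus the projection of supplemental rows."""
    P = active / active.sum()
    r, c = P.sum(axis=1), P.sum(axis=0)
    S = (P - np.outer(r, c)) / np.sqrt(np.outer(r, c))
    U, sv, Vt = np.linalg.svd(S, full_matrices=False)
    rows = (U[:, :n_axes] * sv[:n_axes]) / np.sqrt(r)[:, None]   # principal coordinates
    cols_std = Vt.T[:, :n_axes] / np.sqrt(c)[:, None]            # standard coordinates
    profiles = supplementary / supplementary.sum(axis=1, keepdims=True)
    sup = profiles @ cols_std                                    # transition formula
    return rows, cols_std, sup

# Hypothetical active table: a 15 x 15 band of five 1s, truncated at the corners.
idx = np.arange(15)
active = (np.abs(idx[:, None] - idx[None, :]) <= 2).astype(float)

# Supplemental rows with only one 1, each in a different column.
unit_rows = np.eye(15)
rows, cols_std, sup = ca_with_supplementary_rows(active, unit_rows)

# Projecting the active rows as if they were supplemental gives back their coordinates.
_, _, reprojected = ca_with_supplementary_rows(active, active)
print(np.allclose(reprojected, rows))
```

Since every active row is a weighted mean of the rows with a single 1, the points of the latter enclose all the active row points, which is the property used in the text to set a limit to the scatter.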

The PCA and CA applied to this data table give the patterns shown in Figure 8 and Figure 9 and in Figure 10 and Figure 11, respectively. This allows a better interpretation of the Guttman effect than the one that can be obtained by considering only a single-band matrix. Actually, the band matrix represented in Table 1 is such that the two extreme columns are totally opposed, since no row has a 1 in both of them. Consequently, on the first axis the two columns are opposed to each other. The same occurs for the central columns, which have no row with 1s in common with the extremes: for this reason, on the first plane they are set as far as possible from both and, as a consequence, at the extreme of the second axis, whereas the two extremes are set opposite to them. Intermediate columns, having some rows with 1s in common with the central ones, are set in an intermediate position, thus giving the typical arch pattern. The same happens for the rows.

Now, looking at the pattern of both columns (Figure 8) and rows (Figure 9) on the first factor

| P | Q | R | S

| 11111111112| 111111111122222| 0123456789012345678| 012345678901234567890

|12345678901234567890|123456789012345678901234|1234567891111111111222222222|123456789111111111122222222223

----+--------------------+------------------------+----------------------------+------------------------------

B1 | | 1 | |

B2 | | 1 | |

B3 | | 1 | |

B4 | | 1 | |

B5 | | 1 | |

B6 | | 1 | |

B7 | | 1 | |

B8 | | 1 | |

B9 | | 1 | |

B10 | | 1 | |

B11 | | 1 | |

B12 | | 1 | |

B13 | | 1 | |

B14 | | 1 | |

B15 | | 1 | |

B16 | | 1 | |

B17 | | 1 | |

B18 | | 1 | |

B19 | | 1 | |

B20 | | 1 | |

----+--------------------+------------------------+----------------------------+------------------------------

C1 |1 |11111 |111111111 |

C2 | 1 | 11111 | 111111111 |

C3 | 1 | 11111 | 111111111 |

C4 | 1 | 11111 | 111111111 |

C5 | 1 | 11111 | 111111111 |

C6 | 1 | 11111 | 111111111 |

C7 | 1 | 11111 | 111111111 |

C8 | 1 | 11111 | 111111111 |

C9 | 1 | 11111 | 111111111 |

C10 | 1 | 11111 | 111111111 |

C11 | 1 | 11111 | 111111111 |

C12 | 1 | 11111 | 111111111 |

C13 | 1 | 11111 | 111111111 |

C14 | 1 | 11111 | 111111111 |

C15 | 1 | 11111 | 111111111 |

C16 | 1 | 11111 | 111111111 |

C17 | 1 | 11111 | 111111111 |

C18 | 1 | 11111 | 111111111 |

C19 | 1 | 11111 | 111111111 |

C20 | 1| 11111| 111111111|

----+--------------------+------------------------+----------------------------+------------------------------

D1 |1 |11111 |111111111 |11111111111

D2 | 1 | 11111 | 111111111 | 11111111111

D3 | 1 | 11111 | 111111111 | 11111111111

D4 | 1 | 11111 | 111111111 | 11111111111

D5 | 1 | 11111 | 111111111 | 11111111111

D6 | 1 | 11111 | 111111111 | 11111111111

D7 | 1 | 11111 | 111111111 | 11111111111

D8 | 1 | 11111 | 111111111 | 11111111111

D9 | 1 | 11111 | 111111111 | 11111111111

D10 | 1 | 11111 | 111111111 | 11111111111

D11 | 1 | 11111 | 111111111 | 11111111111

D12 | 1 | 11111 | 111111111 | 11111111111

D13 | 1 | 11111 | 111111111 | 11111111111

D14 | 1 | 11111 | 111111111 | 11111111111

D15 | 1 | 11111 | 111111111 | 11111111111

D16 | 1 | 11111 | 111111111 | 11111111111

D17 | 1 | 11111 | 111111111 | 11111111111

D18 | 1 | 11111 | 111111111 | 11111111111

D19 | 1 | 11111 | 111111111 | 11111111111

D20 | 1| 11111| 111111111| 11111111111

plane of PCA of Table 3, one can observe that the bands are represented along different curves: the narrower the band, the closer to the centroid. This means that the first principal plane returns two independent pieces of information, as it should: the position of a column or a row along a (curvilinear) factor and the width of the range of the column or row along the factor itself. This is a consequence of the correlation structure among adjacent rows and columns. In particular, the length of the column vectors depends on the correlation of the column with this factor plane, so that the folding of both the extreme columns and rows toward the centre originates from the lower correlation of these columns with the others. This is made clear in Figure 8, where the narrower bands are closer to the centre than the wider ones.

Figure 8 - The pattern of columns of Table 3 on the first plane of principal components analysis.

Figure 9 - The pattern of rows of Table 3 on the first plane of principal components analysis.

The same occurs in CA but in the opposite direction, due to the centroid property of the representation of items: the wider bands, which are tied to more items, are represented closer to the centre, and the supplemental elements, that is the band with only one 1, are set at the extreme of the graphics, forming a convex hull containing all the other points. This is evident for both columns (Figure 10) and rows (Figure 11). Likewise, in place of the folding, the extreme elements tend to level to the convex hull, due to the reduced relation structure towards the end of the band. Analogous comments can be made when considering the following factors.

Figure 10 - Representation of the four column bands of Table 3 on the first factor plane of Correspondence Analysis.

Figure 11 - Representation of the three row bands of Table 3 on the first factor plane of Correspondence Analysis.

Summarising, the band structure of a matrix is represented as an arch pattern and the different [...] convex hull of the band composed by single values depends on the width of the band. In this way, the two-dimensional representation is fully interpreted. This interpretation entails two facts: first, the Guttman effect has both a theoretical explanation (Benzécri, 1973-82; Baccini et al., 1994) and a practical interpretation: once it is detected, a band structure of the data table should be supposed; second, both dimensions have a meaning, in terms of position of the band mid-point and of band width. Neither can be ignored: rather, both should be taken into account in order to understand the data structure.

Figure 12 - Polar coordinates on the plane, adjusted to the distance of the convex hull to the origin.

Based on these facts, it becomes evident that detrending, as suggested by Gauch and Hill (1980), is senseless and that the Gaussian ordinations are but a partial representation of a more complex phenomenon, since they do not take into account the information tied to the variance. In the case of vegetation, where the band represents the turnover of species along an environmental gradient, even if the position of the mode may be estimated, roughly or more accurately, the information concerning the range of the gradient where a species may be found is totally lost.

Now, since we are in a position to interpret correctly the pattern of items once the Guttman effect is detected, and since the factor of interest is often the curvilinear one, we may wish to represent this factor as a straight line, together with the information concerning the species dispersion along it. This calls for a different way of redressing, namely the one depicted in Figure 7b. Actually, the proposal of Delicado and Aluja (2003) seems adequate, but some criticisms may justify another suggestion, which will be shown in the next section.

The Delicado and Aluja (2003) use of principal curves is quite interesting, since they actually solve the old problem of making linear the non-linear representation of a factor underlying a distribution, without introducing any arbitrary bias. Nevertheless, aiming at finding a curve passing through the centre of the distribution, they assume a kind of symmetry of the scatter along the perpendicular to the curve, which is far from being the case with the Guttman effect. In addition, they do not take into account the known information on the shape of the distribution and its meaning. Last, the proposed algorithm is quite cumbersome.

Our proposal (Camiz and Polenta, 2003) is to consider the existence of a limit to the scattering of elements on the first factor plane of CA, namely the convex envelope of the supplemental lines with only one 1. So, the idea is to consider as first coordinate the position along the curvilinear pattern and, as second coordinate, the relative distance to that envelope (or to the centroid). These two coordinates represent respectively the position and dispersion statistics of the distribution of the elements on the underlying factor.

Figure 13 - Linearised representation of the four column bands of Table 3 using angles and distances ratios as found in Figure 10.

Figure 14 - Linearised representation of the three row bands of Table 3 using angles and distances ratios as found in Figure 11.

In order to obtain such coordinates, we submit to CA a data table to which we add a set of virtual row-units having only one 1, each unit in a different column position, and a set of virtual column-variables having only one 1, each one in a different unit. These two sets are used as supplemental elements in CA and will provide, by linear interpolation, the convex hull of the distribution. Then we consider the system of polar coordinates on the first factor plane of CA, namely the distance of points to the origin and the angle between the straight line connecting a point to the origin and the vertical axis. In formulae, if x, y are the two coordinates of a point on the two factors, the considered transformation is

$$ r = \sqrt{x^2 + y^2}, \qquad \theta = \arctan\frac{x}{y}. \tag{5} $$

The set of points with the transformed coordinates $(\theta, r)$ can be plotted on orthogonal axes, giving a redressed pattern. Actually, the pattern would be linear if the curves were circles.

Dealing with parabolas, or in practice with some generic convex curve, one should rather scale the distance r by the distance from the centroid to the convex hull along the same direction. So, in (5) the relative distance to the centroid of the point A is the ratio $\overline{OA}/\overline{OB}$. In formulae, given a point A whose coordinates are $(x_A, y_A)$, if B is the intersection with the convex hull of the straight line connecting the origin O and A, with coordinates $(x_B, y_B)$, the formula (5) becomes

$$\theta = \arctan\frac{x_A}{y_A}, \qquad r = \frac{\overline{OA}}{\overline{OB}} = \sqrt{\frac{x_A^2 + y_A^2}{x_B^2 + y_B^2}}$$

This transformation, whose implementation is simple, applied to Figure 10 and Figure 11 gives a nearly linearised pattern, shown in Figure 13 and Figure 14 respectively. It is clear from the representation that the method is very approximate. Although each band is still represented by a curved line, the adjustment seems acceptable for exploratory purposes. It must be observed that, toward the end of the factor, the position of points is more confused, since the width of the pattern progressively shrinks. This seems a limit that cannot be overcome by any procedure.

Figure 15 - The pattern of relevés of Ellenberg's grassland data on the plane spanned by the first two factors of Correspondence Analysis.
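The relative distance of a point A to the centroid, that is the ratio of its distance from the origin O to the distance from O of the intersection B with the convex hull, can be computed as sketched below. This is a minimal sketch under stated assumptions, not the program used for the figures: the hull is assumed to be given as an ordered list of vertices (the supplemental points joined by straight segments) enclosing the origin, and the names `cross`, `relative_distance` and `redress` are hypothetical.

```python
import math

def cross(u, v):
    """Two-dimensional cross product of the vectors u and v."""
    return u[0] * v[1] - u[1] * v[0]

def relative_distance(a, hull):
    """Return OA / OB, where B is the point in which the ray from the
    origin O through A meets the closed polygon `hull`.

    Assumptions of this sketch: `hull` is a list of (x, y) vertices in
    order, and the origin lies strictly inside it.
    """
    for p, q in zip(hull, hull[1:] + hull[:1]):
        d = (q[0] - p[0], q[1] - p[1])
        denom = cross(a, d)
        if abs(denom) < 1e-12:      # ray parallel to this edge (or A at O)
            continue
        t = cross(p, d) / denom     # B = t * A along the ray
        s = cross(p, a) / denom     # position of B along the edge p -> q
        if t > 0 and -1e-12 <= s <= 1 + 1e-12:
            return 1.0 / t          # OB = t * OA, hence OA / OB = 1 / t
    raise ValueError("the ray from the origin does not meet the hull")

def redress(a, hull):
    """Return (angle from the vertical axis, relative distance) for A."""
    return math.atan2(a[0], a[1]), relative_distance(a, hull)

# Hypothetical square hull, used only to show the call.
square = [(1.0, 1.0), (-1.0, 1.0), (-1.0, -1.0), (1.0, -1.0)]
theta, rho = redress((0.5, 0.0), square)
```

With this convention a point lying on the hull gets relative distance 1 and a point close to the centroid gets a value close to 0, whatever the local shape of the envelope.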

7. An example of application

As an example of the application of the polar coordinates method, we consider the Ellenberg grassland vegetation data table, taken from Müller-Dombois and Ellenberg (1974) and also used by Camiz (1994), where the whole table is reported. The pattern of the relevés on the plane of the first two factors extracted by CA is shown in Figure 15. The Guttman effect is evident and corresponds to a major ecological gradient. Applying the DECORANA adjustments to the table, according either to the moving average of the vertical coordinates or to the residuals of a regression on a second-degree polynomial, the resulting patterns are represented in Figure 16 and Figure 17, respectively. In both cases the enormous bias introduced by the adjustment is clear. In particular, in both cases relevés 19 and 25 are set very far from each other, and most relations, in particular distances, are strongly modified. Figure 18 shows the pattern obtained using the proposed method: indeed, the reciprocal position of points is much closer to the original.
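To make the second adjustment concrete, the residuals of a regression of the second factor on a second-degree polynomial of the first one can be computed as in the sketch below. It only illustrates the idea of the adjustment and is not the DECORANA code: the function name `quadratic_residuals` is hypothetical, and the least-squares fit is obtained by solving the normal equations directly, which assumes at least three distinct abscissae.

```python
def quadratic_residuals(xs, ys):
    """Residuals of the least-squares fit y = a + b*x + c*x**2.

    xs are the coordinates on the first factor, ys those on the second.
    Assumption of this sketch: xs contains at least three distinct values,
    so that the 3x3 normal equations have a unique solution.
    """
    # Normal equations (X'X) beta = X'y, stored as an augmented 3x4 matrix.
    powers = [sum(x ** k for x in xs) for k in range(5)]
    rhs = [sum(y * x ** k for x, y in zip(xs, ys)) for k in range(3)]
    m = [[powers[i + j] for j in range(3)] + [rhs[i]] for i in range(3)]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(3):
        pivot = max(range(col, 3), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for row in range(3):
            if row != col:
                f = m[row][col] / m[col][col]
                m[row] = [u - f * v for u, v in zip(m[row], m[col])]
    a, b, c = (m[i][3] / m[i][i] for i in range(3))
    # The detrended second coordinate is what the parabola leaves unexplained.
    return [y - (a + b * x + c * x * x) for x, y in zip(xs, ys)]

# Hypothetical coordinates lying exactly on a parabola.
xs = [-2.0, -1.0, 0.0, 1.0, 2.0]
flat = quadratic_residuals(xs, [x * x for x in xs])
```

For points lying exactly on a parabola all residuals vanish, so the arch is flattened onto the first axis; this shows how such an adjustment can discard the dispersion information that the redressing by polar coordinates tries to keep.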

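Figure 16 - The pattern of relevés of Ellenberg's grassland data adjusted using the moving average.

Figure 17 - The pattern of relevés of Ellenberg's grassland data, adjusted through the second degree polynomial regression.

Figure 18 - The pattern of relevés of Ellenberg's grassland data adjusted using the Camiz and Polenta (2003) method.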

8. Conclusions

It was shown that the Guttman effect, far from being a distortion due to the algorithm of factor analyses, is rather a pattern that is very informative about a particular data table structure. Indeed, in two dimensions, it conveys two pieces of information: i) that the data are influenced by a major underlying factor that has a unimodal relation with the variables; and ii) the different range of the variables along the gradient. For this reason, removing it without a proper technique loses an important part of its meaning. Rather, its rationale (Benzécri, 1973-82; Baccini et al., 1994) can be helpful in building a redressing technique that preserves as much as possible of the information represented on the factor plane.

In comparison with the Gauch and Hill (1980) detrending methods, the proposed redressing by polar coordinates seems to give much better results, since both orders are kept nearly untouched. In comparison with the Gaussian method, our method gives a second dimension, missing in Ihm and van Groenewoud (1975); nevertheless, the order given by the Gaussian model, not reported here, is different from ours and the reason should be further investigated. A comparison with both the Johnson and Goodall (1980) and the Delicado and Aluja (2003) techniques should also be carried out. We expect that only the latter could give comparable, or maybe better, results. Actually, our method may be used provided that the introduction of supplemental elements with a 1 in only one crossing makes sense. A better adjustment, further reducing the curvilinear pattern, would be useful. Likewise, extending the technique so that it takes further dimensions into account could be an interesting development of this method.

Acknowledgements

The program CANOCO, which allowed us to run the Gauch and Hill (1980) detrending, was kindly provided by Cajo ter Braak; the Gaussian model program was kindly provided by Peter Ihm and reviewed by Vanda Tulli. The polar coordinates method was implemented by Giorgia Polenta. Their contributions are gratefully acknowledged.


References

Austin, M.P. (1976a), «On non-Linear Species Response Models in Ordination», Vegetatio, vol. 33 (1): pp.

33-41.

Austin, M.P. (1976b), «Performance of four Ordination Techniques Assuming three Different non-Linear

Response Models», Vegetatio, vol. 33 (1): pp. 43-49.

Austin, M.P. (1980), «Searching for a Model for Use in Vegetation Analysis», Vegetatio, vol. 42: pp. 11-21.

Austin, M.P. (1987), «Models for the Analysis of Species’ Response to Environmental Gradients»,

Vegetatio, vol. 69: pp. 35-45.

Baccini, A., H. Caussinus, and A. de Falguerolles (1994), Diabolic Horseshoes, 9th International Workshop

on Statistical Modelling, Exeter, 11-15 July 1994.

Baxter, M.J. (1994), Exploratory Multivariate Analysis in Archaeology, Edinburgh, Edinburgh University

Press.

Benzécri, J.P. (ed.) (1973-82), L'Analyse des Données, 2 tomes, Paris, Dunod.

Camiz, S. (1991), «Reflections on Spaces and Relationships in Ecological Data Analysis: Effects, Problems,

and Possible Solutions», Coenoses, vol. 6 (1): pp. 3-13.

Camiz, S. (1994), «A Procedure for Structuring Vegetation Tables», Abstracta Botanica, vol. 18 (2): pp.

57-70.

Camiz, S., and G. Polenta (2003), «Effet Guttman: son interprétation et une nouvelle méthode de

redressement», in: Y. Dodge, G. Melfi (eds.) Méthodes et Perspectives en Classification - Actes du Xème

Congrès de la Société Francophone de Classification. Presses Académiques Neuchâtel: pp. 91-94.

Delicado, P. (2001), «Another Look at Principal Curves and Surfaces», Journal of Multivariate Analysis,

vol. 77: pp. 84-116.

Delicado, P., and T. Aluja (2003), «Principal Curves for Correcting the Horseshoe Effect in

Correspondence Analysis», International Conference on Correspondence Analysis and Related

Methods, (CARME 2003), Abstracts, Barcelona, Universidad Pompeu Fabra: p. 23.

Delicado, P., and M. Huerta (2003), «Principal Curves of Oriented Points: Theoretical and Computational

Improvements», Computational Statistics, vol. 18: pp. 293-315.

Gauch, H.G. jr., and G.B. Chase (1974), «Fitting the Gaussian Curve to Ecological Data», Ecology, vol. 55:

pp. 1377-1381.

Gauch, H.G. jr., G.B. Chase, and R.H. Whittaker (1974), «Ordination of Vegetation Samples by Gaussian

Species Distribution», Ecology, vol. 55: pp. 1382-1390.

Gauch, H.G.jr., and M.O. Hill (1980), «Detrended Correspondence Analysis: an Improved Ordination

Technique», Vegetatio, vol. 42: pp. 47-58.

Gauch, H.G.jr., and R.H. Whittaker (1972), «Comparison of Ordination Techniques», Ecology, vol. 53 (5):

pp. 868-875.

Gauch, H.G.jr., R.H. Whittaker, and T.R. Wentworth (1977), «A Comparative Study of Reciprocal

Averaging and other Ordination Techniques», Journal of Ecology, vol. 65: pp. 157-174.

Goodall, D.W., and R.W. Johnson (1982), «Non Linear Ordination in Several Dimensions. A Maximum

Likelihood Approach», Vegetatio, vol. 48: pp. 197-208.

Goodall, D.W., and R.W. Johnson (1987), «Maximum Likelihood Ordination - Some Improvements and

Further Tests», Vegetatio, vol. 71: pp. 3-12.

Gower, J.C. (1966), «Some Distance Properties of Latent Root and Vector Methods used in Multivariate

Analysis», Biometrika, vol. 53: pp. 325-338.

Guttman, L. (1953), «A Note on Sir Cyril Burt's Factorial Analysis of Qualitative Data», British Journal

of Statistical Psychology, vol. 6: pp. 21-4.


Hastie, T., and W. Stuetzle (1989), «Principal Curves», Journal of the American Statistical Association,

vol. 84 (406): pp. 502-516.

Hill, M. O. (1979), DECORANA - a FORTRAN Program for Detrended Correspondence Analysis and

Reciprocal Averaging, Ithaca (N.Y.), Cornell University.

Ihm, P., and H. van Groenewoud (1975), «A Multivariate Ordering of Vegetation Data Based on Gaussian

Type Gradient Response Curves», Journal of Ecology, vol. 63: pp. 767-777.

Ihm, P., and H. van Groenewoud (1984), «Correspondence Analysis and Gaussian Ordination», in:

Chambers, J.M. et al. (eds.), Compstat Lectures: pp. 5-60.

Johnson, R.W. and D.W. Goodall (1980), «A Maximum Likelihood Approach to non-Linear Ordination»,

Vegetatio, vol. 41 (3): pp. 133-142.

Kégl, B., A. Krzyzak, T. Linder, and K. Zeger (2000), «Learning and Design of Principal Curves», IEEE

Transactions on Pattern Analysis and Machine Intelligence, vol. 22 (3): pp. 281-297.

Kenkel, N.C., and L. Orlóci (1986), «Applying Metric and Nonmetric Multidimensional Scaling to Ecological

Studies: some New Results», Ecology, vol. 67(4): pp. 919-928.

Kessel, S.R., and R.H. Whittaker (1976), «Comparison of three Ordination Techniques», Vegetatio, vol.

32 (1): pp. 21-29.

Kruskal, J.B. (1964a), «Multidimensional Scaling by Optimizing Goodness of Fit to a Nonmetric

Hypothesis», Psychometrika, vol. 29: pp. 1-27.

Kruskal, J.B. (1964b), «Nonmetric Multidimensional Scaling: a Numerical Method», Psychometrika, vol.

29: pp. 115-129.

Müller-Dombois, D., and H. Ellenberg (1974), Aims and Methods of Vegetation Ecology, New York, John

Wiley & Sons.

Noy-Meir, I. (1971), «Multivariate Analysis of the Semi-arid Vegetation in South-Eastern Australia: Nodal

Ordination by Component Analysis», Proceedings of the Ecological Society of Australia, vol. 6: pp.

159-193.

Orlóci, L. (1978), Multivariate Analysis in Vegetation Research, Den Haag, Junk.

Peet, R.K., R.G. Knox, J.S. Case, and R.B. Allen (1988), «Putting Things in Order: the Advantages of

Detrended Correspondence Analysis», American Naturalist, vol. 131 (6): pp. 924-937.

Prentice, I.C. (1980), «Vegetation Analysis and Order Invariant Gradient Models», Vegetatio, vol. 42: pp.

27-34.

ter Braak, C. J. F. (1988), CANOCO: a FORTRAN program for canonical community ordination by

[partial] [detrended] [canonical] correspondence analysis, principal component analysis and

redundancy analysis (version 2.1), Wageningen (Netherlands), GLW Ministerie van Landbouw en

Visserij.

Van Rijckevorsel, J. (1987), The Application of Fuzzy Coding and Horseshoes in Multiple Correspondence

Analysis, Leiden, DSWO Press.

Wartenberg, D., S. Ferson, and F.J. Rohlf (1987), «Putting Things in Order: a Critique of Detrended

Correspondence Analysis», American Naturalist, vol. 129 (3): pp. 435-448.

Watson, G.N. (1933), «Notes on Generating Functions of Polynomials: Hermite Polynomials», Journal of

the London Mathematical Society, vol. 8: pp. 194-199.
