
Point-biserial correlation coefficient

The point-biserial correlation coefficient (rpb) is a correlation coefficient used when one variable (e.g. Y) is dichotomous; Y can either be "naturally" dichotomous, like gender, or an artificially dichotomized variable. In most situations it is not advisable to artificially dichotomize variables: when you artificially dichotomize a variable, the new dichotomous variable may be conceptualized as having an underlying continuity, and if this is the case, a biserial correlation is the more appropriate calculation. The point-biserial correlation is mathematically equivalent to the Pearson (product-moment) correlation; that is, if we have one continuously measured variable X and a dichotomous variable Y, rXY = rpb. This can be shown by assigning two distinct numerical values to the dichotomous variable. To calculate rpb, assume that the dichotomous variable Y has the two values 0 and 1. If we divide the data set into group 1, which received the value "1" on Y, and group 2, which received the value "0" on Y, then the point-biserial correlation coefficient is calculated as follows:

rpb = [(M1 - M0) / sn] · sqrt(n1·n0 / n²)
where sn is the standard deviation used when you have data for every member of the population:

sn = sqrt( (1/n) · Σ (xi - x̄)² )
M1 being the mean value on the continuous variable X for all data points in group 1, and M0 the mean value on the continuous variable X for all data points in group 2. Further, n1 is the number of data points in group 1, n0 is the number of data points in group 2, and n is the total sample size. This formula is a computational formula that has been derived from the formula for rXY in order to reduce steps in the calculation; it is easier to compute than rXY. It is easy to show algebraically that there is an equivalent formula that uses sn-1:

rpb = [(M1 - M0) / sn-1] · sqrt(n1·n0 / (n·(n - 1)))
where sn-1 is the standard deviation used when you only have data for a sample of the population:

sn-1 = sqrt( (1/(n - 1)) · Σ (xi - x̄)² )
To clarify, the two standard deviations are related by:

sn-1 = sn · sqrt(n / (n - 1))
Glass and Hopkins' book Statistical Methods in Education and Psychology (3rd Edition)[1] contains a correct version of the point-biserial formula. The square of the point-biserial correlation coefficient can also be written:

rpb² = (M1 - M0)² · n1·n0 / (n² · sn²)
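The equivalence rXY = rpb is easy to check numerically. Below is a minimal sketch (made-up data, assuming NumPy is available) that computes rpb from the group means and compares it with the Pearson correlation on the 0/1 coding:

```python
# Point-biserial correlation vs. Pearson correlation on a 0/1-coded variable.
import numpy as np

x = np.array([12.0, 15.0, 9.0, 18.0, 14.0, 7.0, 16.0, 11.0])   # continuous X
y = np.array([1,    1,    0,   1,    1,    0,   1,    0])      # dichotomous Y

m1 = x[y == 1].mean()                 # mean of X in group 1
m0 = x[y == 0].mean()                 # mean of X in group 0
n1, n0 = (y == 1).sum(), (y == 0).sum()
n = n1 + n0
s_n = x.std(ddof=0)                   # population standard deviation sn

r_pb = (m1 - m0) / s_n * np.sqrt(n1 * n0 / n**2)
r_xy = np.corrcoef(x, y)[0, 1]        # Pearson r on the 0/1 coding

print(r_pb, r_xy)                     # the two values agree
```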
We can test the null hypothesis that the correlation is zero in the population. A little algebra shows that the usual formula for assessing the significance of a correlation coefficient, when applied to rpb, is the same as the formula for an unpaired t-test, and so

t = rpb · sqrt( (n1 + n0 - 2) / (1 - rpb²) )
follows Student's t-distribution with (n1 + n0 - 2) degrees of freedom when the null hypothesis is true. One disadvantage of the point-biserial coefficient is that the further the distribution of Y is from 50/50, the more constrained is the range of values which the coefficient can take. If X can be assumed to be normally distributed, a better descriptive index is given by the biserial coefficient

rb = [(M1 - M0) / sn] · n1·n0 / (u·n²)
where u is the ordinate of the normal distribution with zero mean and unit variance at the point which divides the distribution into proportions n0/n and n1/n. As you might imagine, this is not the easiest thing in the world to calculate, and the biserial coefficient is not widely used in practice. A specific case of biserial correlation occurs where X is the sum of a number of dichotomous variables of which Y is one. An example of this is where X is a person's total score on a test composed of n dichotomously scored items. A statistic of interest (which is a discrimination index) is the correlation between responses to a given item and the corresponding total test scores. There are three computations in wide use,[2] all called the point-biserial correlation: (i) the Pearson correlation between item scores and total test scores including the item scores, (ii) the Pearson correlation between item scores and total test scores excluding the item scores, and (iii) a correlation adjusted for the bias caused by the inclusion of item scores in the test scores. Correlation (iii) is

A slightly different version of the point-biserial coefficient is the rank-biserial, which occurs where the variable X consists of ranks while Y is dichotomous. We could calculate the coefficient in the same way as where X is continuous, but it would have the same disadvantage that the range of values it can take on becomes more constrained as the distribution of Y becomes more unequal. To get round this, we note that the coefficient will have its largest value where the smallest ranks are all opposite the 0s and the largest ranks are opposite the 1s, and its smallest value where the reverse is the case. These values are respectively plus and minus (n1 + n0)/2. We can therefore use the reciprocal of this value to rescale the difference between the observed mean ranks on to the interval from plus one to minus one. The result is

rrb = 2·(M1 - M0) / (n1 + n0)
where M1 and M0 are respectively the means of the ranks corresponding to the 1 and 0 scores of the dichotomous variable. This formula, which simplifies the calculation from the counting of agreements and inversions, is due to Gene V Glass (1966). It is possible to use this to test the null hypothesis of zero correlation in the population from which the sample was drawn. If rrb is calculated as above then the smaller of

n1·n0·(1 + rrb) / 2

and

n1·n0·(1 - rrb) / 2

is distributed as Mann-Whitney U with sample sizes n1 and n0 when the null hypothesis is true.
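The link between the rank-biserial coefficient and the Mann-Whitney U statistic can be illustrated with a short sketch. The data are made up, and a recent SciPy is assumed (its mannwhitneyu returns U for the first sample):

```python
# Rank-biserial correlation from mean ranks, and its relation to Mann-Whitney U.
import numpy as np
from scipy.stats import mannwhitneyu, rankdata

x = np.array([3.1, 4.5, 1.2, 6.7, 5.5, 2.2, 7.1, 4.9])  # scores to be ranked
y = np.array([1,   1,   0,   1,   1,   0,   1,   0])    # dichotomous grouping

ranks = rankdata(x)
m1 = ranks[y == 1].mean()
m0 = ranks[y == 0].mean()
n1, n0 = (y == 1).sum(), (y == 0).sum()

r_rb = 2 * (m1 - m0) / (n1 + n0)          # Glass's rank-biserial formula

u = mannwhitneyu(x[y == 1], x[y == 0], alternative="two-sided").statistic
# The common identity r_rb = 2*U/(n1*n0) - 1 relates the coefficient to U.
print(r_rb, 2 * u / (n1 * n0) - 1)        # both values match
```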

What is a point biserial correlation?

The point biserial correlation is a measure of association between a continuous variable and a binary variable. It is constrained to be between -1 and +1.

Calculation of the point biserial correlation

Assume that X is a continuous variable and Y is categorical with values 0 and 1. Compute the point biserial correlation using the formula

where

This is mathematically equivalent to the traditional correlation formula. The interpretation is similar: the point biserial correlation is positive when large values of X are associated with Y=1 and small values of X are associated with Y=0.

Examples

FB represents postural sway in the forward-backward direction and is continuous. SS represents postural sway in the side-side direction and is also continuous. AGE_GRP represents the age group (0=Young, 1=Elderly) and is binary. FB and SS show a strong positive correlation with each other and a moderate correlation with age group.

Postural sway correlations.

Source: http://lib.stat.cmu.edu/DASL/Datafiles/Balance.html

Comparison of the point biserial correlation to boxplots

This is a boxplot of FB sway for each age group.

This is a plot of SS sway for each age group. Notice for both this and the previous graph that the elderly age group tends to have higher sway scores than the young group. Even so, there is still a large amount of overlap between these groups, which is why the point biserial correlations are only moderately positive.

The next few pages will show some correlations using data from a breast feeding study I was involved with. In a study of breastfeeding, the point biserial correlation between exclusive breastfeeding at discharge and distance from the hospital is -0.06.

Notice that there is little or no association between distance and breast feeding. Exclusive breast feeders tended to live at a wide range of distances from the hospital, and so did the non breast feeders. The point biserial correlation between exclusive breastfeeding and mother's age is 0.37.

Notice that exclusive breast feeders were more likely to have older mothers and the non exclusive breast feeders were more likely to have younger mothers. There still remains a large overlap between the two groups, as is indicated by the moderately positive correlation. The point biserial correlation between exclusive breastfeeding at discharge and age at discharge is -0.27.

Notice that exclusive breast feeders were more likely to have shorter stays at the hospital (younger ages at discharge) and the non exclusive breast feeders were more likely to have longer stays. Again, the two groups still show a good degree of overlap, which is why the correlation is only weakly negative. This work is licensed under a Creative Commons Attribution 3.0 United States License; it was written by Steve Simon on 2005-08-18.

Phi coefficient
In statistics, the phi coefficient (also referred to as the "mean square contingency coefficient" and denoted by φ or rφ) is a measure of association for two binary variables introduced by Karl Pearson.[1] This measure is similar to the Pearson correlation coefficient in its interpretation. In fact, a Pearson correlation coefficient estimated for two binary variables will return the phi coefficient.[2] The square of the phi coefficient is related to the chi-squared statistic for a 2 × 2 contingency table (see Pearson's chi-squared test):[3]

φ² = χ² / n
where n is the total number of observations. Two binary variables are considered positively associated if most of the data falls along the diagonal cells. In contrast, two binary variables are considered negatively associated if most of the data falls off the diagonal. If we have a 2 × 2 table for two random variables x and y

         y = 1   y = 0   total
x = 1     n11     n10     n1•
x = 0     n01     n00     n0•
total     n•1     n•0      n

where n11, n10, n01, n00 are non-negative cell counts that sum to n, the total number of observations, then the phi coefficient that describes the association of x and y is

φ = (n11·n00 - n10·n01) / sqrt(n1•·n0•·n•1·n•0)
Phi is related to the point-biserial correlation coefficient and Cohen's d and estimates the extent of the relationship between two variables (2 × 2).[4]

The phi coefficient (φ)

The table below shows the first-time driving test results of a sample of 200 individuals classified by gender and success or failure in the examination. We wish to explore the association between the two variables, the null hypothesis being that there is no relationship between gender and success/failure in driving test results.

Gender and success or failure in first-time driving test results

GENDER    SUCCESS   FAILURE   Total
Male         70        28      (98)
Female       50        52     (102)
Total      (120)      (80)    (200)

When each of the variables is truly dichotomous, that is to say, can only take two values (male/female; pass/fail), then the phi coefficient (φ) is an appropriate test of association. Phi is given as:

φ = (ad - bc) / sqrt(k·l·m·n)

where the cells of a 2 × 2 contingency table and the marginal totals are lettered as follows:

 a    b   (k)
 c    d   (l)
(m)  (n)  (N)

Substituting our data from the table above:

φ = (70 × 52 - 28 × 50) / sqrt(98 × 102 × 120 × 80) = 2240 / 9796 = 0.229, so φ² ≈ .0524
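As a quick check of the worked example, the following sketch (standard library only) recomputes phi from the cell counts and then the chi-square value as N times phi squared:

```python
# Phi coefficient and chi-square for the driving-test table.
from math import sqrt

a, b = 70, 28                 # male:   success, failure
c, d = 50, 52                 # female: success, failure

r1, r2 = a + b, c + d         # row totals (98, 102)
c1, c2 = a + c, b + d         # column totals (120, 80)
N = r1 + r2                   # 200

phi = (a * d - b * c) / sqrt(r1 * r2 * c1 * c2)
chi_square = N * phi ** 2

print(round(phi, 3), round(phi ** 2, 4), round(chi_square, 2))
# about 0.229, 0.052 and 10.5, matching the worked example up to rounding
```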
The significance of phi can be tested from the following formula: χ² = Nφ², where degrees of freedom are given by df = (r - 1)(c - 1), with r = rows and c = columns in the contingency table. In our example (r - 1)(c - 1) = (2 - 1)(2 - 1) = 1 df, and χ² = 200(.0524) = 10.48. Looking up the chi-square (χ²) values in Appendix A, Table 10, we see that a value of 6.635 is significant at the 1% level. Our obtained value exceeds this. We therefore reject the null hypothesis and conclude that there is a statistically significant association between the gender of the driver and first-time driving success or failure.

The correlation coefficient tetrachoric r (rt)

In our discussion of the phi coefficient (φ) above, we used as our example the association between the truly dichotomous variables gender and driving test success. Suppose, however, that we wish to explore the association between two dichotomized variables, variables, that is, that are recorded in only two categories but which are, in reality, continuous and normally distributed. Why bother, the reader may well ask? Why not simply use the Pearson product-moment correlation coefficient (r)?

Regrettably it is often the case that the necessary data required to compute Pearson's r (say, for example, marks on a Mathematics course and scores on a basic computing course) are simply not available in their continuous data format, having been recorded as dichotomous (high level/low level; satisfactory/unsatisfactory). In this event, no matter how continuous and normal the underlying distributions may be, the researcher must turn to a measure of association appropriate to her data. Tetrachoric r (rt for short) is just such a measure of association for use with dichotomized variables. Because the computation of rt is laborious, an approximation is generally used. All we need to know is the value of AD/BC from the fourfold table below. We can then read off the estimation of tetrachoric r (rt) in Table 11 in Appendix A. Suppose that the achievement of 100 first-year university students on a basic computing course has been classified as satisfactory or unsatisfactory on the basis of their high level or low level mathematics ability on entry to the course. The data are set out thus:

Satisfactory/unsatisfactory ratings on a basic computing course

                            SATISFACTORY   UNSATISFACTORY
HIGH LEVEL MATHS ON ENTRY      A  40          B  10
LOW LEVEL MATHS ON ENTRY       C  20          D  30
                                                        [N = 100]

Substituting for A, B, C and D, we get

AD/BC = (40 × 30) / (10 × 20) = 1200 / 200 = 6

Entering Appendix A we see that for AD/BC = 5.81-6.3, the estimated rt is .61. Our result is subject to exactly the same type of interpretation that applies to Pearson's correlation coefficient r, since rt is an estimate of that correlation coefficient. However, the standard error of rt is considerably larger than for r. Moreover, the approximation obtained from using Appendix A works best when both variables have been dichotomized on the basis of a 50/50 split. Finally, if AD is found to be less than BC, simply use the ratio BC/AD in entering Table 12 in Appendix A. Remember, the larger of the two products is always placed in the numerator.
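The tabulated estimate can be approximated in closed form with the well-known "cosine-pi" formula (the same approximation quoted later in this collection). A small sketch, standard library only:

```python
# Cosine-pi approximation to tetrachoric r for the computing-course table.
from math import cos, pi, sqrt

A, B = 40, 10     # high-level maths: satisfactory, unsatisfactory
C, D = 20, 30     # low-level maths:  satisfactory, unsatisfactory

ratio = (A * D) / (B * C)                 # 1200 / 200 = 6
r_tet = cos(pi / (1 + sqrt(ratio)))       # cosine-pi approximation

print(round(ratio, 2), round(r_tet, 2))   # 6.0 and roughly 0.61
```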

The contingency coefficient C

Suppose a researcher wishes to explore the relationship between social class background and educational achievement, and obtains data on secondary school students' placement in different strata of the school's curriculum in a very large comprehensive school drawing students from across a very wide socio-economic range. She wishes to determine the association between curriculum placement and social class, her null hypothesis being that social class and school curriculum placement are unrelated. The contingency coefficient C is a measure of the association between two sets of attributes and is particularly useful when one or both of those attributes is/are at the nominal level of measurement, as is the case in the present example. The data are set out thus:

School curriculum placement and students' social class background

                         SOCIAL CLASS
CURRICULUM PLACEMENT   I and II   III    IV     V    Total
G.C.S.E.                  22       41    22    13     (98)
N.V.Q.                    11       36    40    19    (106)
No examination             1       12    48    62    (123)
Total                     34       89   110    94    (327)

(G.C.S.E. is the UK's General Certificate of Secondary Education, the examinations for which are taken by most students at age 16; N.V.Q. is the UK's National Vocational Qualification, the examinations for which are taken by students of later secondary school age and beyond.) Chi-square can be computed as shown in the main text: χ² = 85.59. Inserting the chi-square value into the formula for C:

C = sqrt(χ² / (χ² + N)) = sqrt(85.59 / (85.59 + 327)) = 0.46
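The chi-square value and C can be reproduced with a short sketch, assuming SciPy is available:

```python
# Chi-square and contingency coefficient C for the curriculum-placement table.
import numpy as np
from scipy.stats import chi2_contingency

table = np.array([
    [22, 41, 22, 13],   # G.C.S.E.
    [11, 36, 40, 19],   # N.V.Q.
    [ 1, 12, 48, 62],   # No examination
])

chi2, p, dof, expected = chi2_contingency(table)
N = table.sum()
C = np.sqrt(chi2 / (chi2 + N))

print(round(chi2, 2), dof, round(C, 2))   # about 85.6, 6 degrees of freedom, C close to 0.46
```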
The significance of C is determined by reference to the value of χ². Degrees of freedom are given by df = (r - 1)(c - 1), where r = rows and c = columns. In our example df = (3 - 1)(4 - 1) = 6. Consulting Appendix A at df = 6, we see that a critical value of 16.81 is significant at the 1% level. Our obtained value exceeds this. The researcher therefore rejects the null hypothesis and concludes that there is a significant association between social class and placement in the school curriculum. There are a number of limitations in connection with the use of C (see Siegel, 1956) but, on balance, these are outweighed by the wide applicability of C and its freedom from assumptions and requirements which make many other measures of association inapplicable.

Combining independent significance tests of partial relations

Following up a commonly-held belief among some teachers that one particular ethnic group of students achieves better than another, irrespective of socio-economic circumstances, a teacher/researcher decides to test the proposition in several samples of students drawn from two ethnic communities. The analysis of her data is set out in the table below. It seems to her that there may be a trend supporting the belief, but not at a conventionally-accepted level of statistical significance (.05 or .01). The actual levels of significance (the p values) are generated in the SPSS analysis.

Partial relation between two nominal scale variables (X1 = ETHNICITY, Y1 = ACADEMIC ACHIEVEMENT) with a third nominal scale variable (X2 = SOCIO-ECONOMIC LEVEL) held constant

                          X2 = SOCIO-ECONOMIC LEVEL
                School 1        School 2        School 3        School 4
Y1 ACADEMIC     ETHNICITY       ETHNICITY       ETHNICITY       ETHNICITY
ACHIEVEMENT      A      B        A      B        A      B        A      B
HIGH             5     11       11     21       19     29        5     10
LOW             22     16       33     26       49     41       20     15

                χ² = 2.22       χ² = 3.05       χ² = 2.20       χ² = 2.38
                df = 1          df = 1          df = 1          df = 1
                p = .14         p = .07         p = .14         p = .13

Fisher (1941: 97-8) shows that the product of several independent p's may be transformed into a function having a χ² distribution by the application of the following formula, based on natural logarithms and easily computed on a scientific calculator:

χ² = -2 loge[(p1)(p2)(p3) . . . (pk)]

NOTE: It follows that because the p's are less than 1.00, so too will be their product. Moreover, the logarithm of a number less than 1 is negative. Multiplying by -2, therefore, gives a positive value whose significance can be determined by reference to the chi-square Table 10 in Appendix A. Degrees of freedom are given by twice the number of independent tests combined. In our example df = 2 × 4 = 8.

χ² = -2 loge[(.14)(.07)(.14)(.13)] = -2 loge(.0001783) = -2(-8.6320) = 17.26 (df = 8), statistically significant at p < .05.

An alternative formula, based on common logarithms, is given by: χ² = -4.60517 log10[(p1)(p2)(p3) . . . (pk)] = -4.60517 log10[(.14)(.07)(.14)(.13)] = -4.60517(-3.7488) = 17.26 (df = 8), statistically significant at p < .05.
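A minimal sketch of Fisher's combination of the four p values, using the standard library only:

```python
# Fisher's method for combining independent p-values.
from math import log

p_values = [0.14, 0.07, 0.14, 0.13]

chi_square = -2 * sum(log(p) for p in p_values)   # equivalently -2 * log of the product
df = 2 * len(p_values)

print(round(chi_square, 2), df)   # about 17.26 with 8 degrees of freedom
```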

What is a phi coefficient?

The phi coefficient is a measure of the degree of association between two binary variables. This measure is similar to the correlation coefficient in its interpretation. Two binary variables are considered positively associated if most of the data falls along the diagonal cells (i.e., a and d are larger than b and c). In contrast, two binary variables are considered negatively associated if most of the data falls off the diagonal.

Formula for the phi coefficient

The formula for Phi is

phi = (a·d - b·c) / sqrt((a + b)(c + d)(a + c)(b + d))
Notice that Phi compares the product of the diagonal cells (a·d) to the product of the off-diagonal cells (b·c). The denominator is an adjustment that ensures that Phi is always between -1 and +1.

An example of computing Phi

The data in the table below shows breast feeding status at discharge (columns) and 3 days after discharge (rows). Notice that most of the data are on the diagonal. This makes sense: most mothers who were partial or no breast feeding at discharge would probably continue in that pattern three days later. The same holds true for exclusive breast feeding. This is shown by the Phi coefficient.

There is a strong association between breast feeding status at discharge and breast feeding status 3 days after discharge.

A second example

The following table is a similar measure of breast feeding, with the columns representing discharge and the rows representing 6 months after discharge. Notice that there is still a tendency for values to fall in the diagonal cells, but it is less strong than the previous example. The computation of Phi emphasizes this:

There is a weak association between breast feeding status at discharge and at 6 months after discharge.

Using SPSS to compute Phi

In SPSS, you create a two-by-two table by selecting ANALYZE | DESCRIPTIVE STATISTICS | CROSSTABS from the menu. In the dialog box, you can click on the STATISTICS button to get a second dialog box. In this dialog box, select the Phi and Cramer's V option. Note: Cramer's V is useful for tables larger than 2 by 2. We will not discuss it in this presentation, but you can find details in Conover WJ, Practical Nonparametric Statistics, 2nd Edition (1980), New York, NY: John Wiley and Sons, Inc., page 181.

Interpretation of the Phi coefficient

I have a general rule of thumb for correlation coefficients, and you can use the same rule for the Phi coefficient.

-1.0 to -0.7: strong negative association
-0.7 to -0.3: weak negative association
-0.3 to +0.3: little or no association
+0.3 to +0.7: weak positive association
+0.7 to +1.0: strong positive association

This page was written by Steve Simon while working at Children's Mercy Hospital. Although I do not hold the copyright for this material, I am reproducing it here as a service, as it is no longer available on the Children's Mercy Hospital website.

Contingency table
In statistics, a contingency table (also referred to as a cross tabulation or cross tab) is a type of table in a matrix format that displays the (multivariate) frequency distribution of the variables. It is often used to record and analyze the relation between two or more categorical variables. The term contingency table was first used by Karl Pearson in "On the Theory of Contingency and Its Relation to Association and Normal Correlation", part of the Drapers' Company Research Memoirs Biometric Series I, published in 1904. A crucial problem of multivariate statistics is finding the (direct) dependence structure underlying the variables contained in high-dimensional contingency tables. If some of the conditional independences are revealed, then even the storage of the data can be done in a smarter way (see Lauritzen (2002)). In order to do this one can use information theory concepts, which gain information only from the distribution of probability, which can be expressed easily from the contingency table by the relative frequencies.

Example

Suppose that we have two variables, sex (male or female) and handedness (right- or left-handed). Further suppose that 100 individuals are randomly sampled from a very large population as part of a study of sex differences in handedness. A contingency table can be created to display the numbers of individuals who are male and right-handed, male and left-handed, female and right-handed, and female and left-handed. Such a contingency table is shown below.

          Right-handed   Left-handed   Totals
Males          43             9           52
Females        44             4           48
Totals         87            13          100

The numbers of the males, females, and right- and left-handed individuals are called marginal totals. The grand total, i.e., the total number of individuals represented in the contingency table, is the number in the bottom right corner. The table allows us to see at a glance that the proportion of men who are right-handed is about the same as the proportion of women who are right-handed although the proportions are not identical. The significance of the difference between the two proportions can be assessed with a variety of statistical tests including Pearson's chi-squared test, the G-test, Fisher's exact test, and Barnard's test, provided the entries in the table represent individuals randomly sampled from the population about which we want to draw a conclusion. If the proportions of individuals in the different columns vary significantly between rows (or vice versa), we say that there is a contingency between the two variables. In other words, the two variables are not independent. If there is no contingency, we say that the two variables are independent. The example above is the simplest kind of contingency table, a table in which each variable has only two levels; this is called a 2 x 2 contingency table. In principle, any number of rows and columns may be used. There may also be more than two variables, but higher order contingency tables are difficult to represent on paper. The relation between ordinal variables, or between ordinal and categorical variables, may also be represented in contingency tables, although such a practice is rare.
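As an illustration, one of the tests mentioned above, Pearson's chi-squared test, can be applied to this table with a few lines of code (assuming SciPy):

```python
# Pearson's chi-squared test on the sex/handedness contingency table.
import numpy as np
from scipy.stats import chi2_contingency

table = np.array([
    [43, 9],   # males:   right-handed, left-handed
    [44, 4],   # females: right-handed, left-handed
])

chi2, p, dof, expected = chi2_contingency(table, correction=False)
print(round(chi2, 3), round(p, 3))   # small chi-square, large p: no significant contingency
```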

Measures of association

Main article: Phi coefficient
Main article: Cramér's V

The degree of association between the two variables can be assessed by a number of coefficients: the simplest is the phi coefficient, defined by

φ = sqrt(χ² / N)
where χ² is derived from Pearson's chi-squared test, and N is the grand total of observations. φ varies from 0 (corresponding to no association between the variables) to 1 or -1 (complete association or complete inverse association). This coefficient can only be calculated for frequency data represented in 2 × 2 tables. φ can reach a minimum value of -1.00 and a maximum value of 1.00 only when every marginal proportion is equal to .50 (and two diagonal cells are empty); otherwise, the phi coefficient cannot reach those minimal and maximal values.[1] Alternatives include the tetrachoric correlation coefficient (also only applicable to 2 × 2 tables), the contingency coefficient C, and Cramér's V. C suffers from the disadvantage that it does not reach a maximum of 1 or a minimum of -1; the highest it can reach in a 2 × 2 table is .707, and the maximum it can reach in a 4 × 4 table is 0.870. It can reach values closer to 1 in contingency tables with more categories; it should, therefore, not be used to compare associations among tables with different numbers of categories.[2] Moreover, it does not apply to asymmetrical tables (those where the numbers of rows and columns are not equal). The formulae for the C and V coefficients are:

C = sqrt(χ² / (N + χ²))

and

V = sqrt(χ² / (N·(k - 1)))

k being the number of rows or the number of columns, whichever is less. C can be adjusted so it reaches a maximum of 1 when there is complete association in a table of any number of rows and columns by dividing C by sqrt((k - 1)/k) (recall that C only applies to tables in which the number of rows is equal to the number of columns and therefore equal to k). The tetrachoric correlation coefficient assumes that the variable underlying each dichotomous measure is normally distributed.[3] The tetrachoric correlation coefficient provides "a convenient measure of [the Pearson product-moment] correlation when graduated measurements have been reduced to two categories."[4] The tetrachoric correlation should not be confused with the Pearson product-moment correlation coefficient computed by assigning, say, values 0 and 1 to represent the two levels of each variable (which is mathematically equivalent to the phi coefficient). An extension of the tetrachoric correlation to tables involving variables with more than two levels is the polychoric correlation coefficient. The Lambda coefficient is a measure of the strength of association of the cross tabulations when the variables are measured at the nominal level. Values range from 0 (no association) to 1 (the theoretical maximum possible association). Asymmetric lambda measures the percentage improvement in predicting the dependent variable. Symmetric lambda measures the percentage improvement when prediction is done in both directions. The uncertainty coefficient is another measure for variables at the nominal level. All of the following measures are used for variables at the ordinal level. The values range from -1 (100% negative association, or perfect inversion) to +1 (100% positive association, or perfect agreement). A value of zero indicates the absence of association.

Gamma test: no adjustment for either table size or ties.
Kendall tau: adjustment for ties.
  Tau b: for square tables.
  Tau c: for rectangular tables.

Correlation ratio
In statistics, the correlation ratio is a measure of the relationship between the statistical dispersion within individual categories and the dispersion across the whole population or sample. The measure is defined as the ratio of two standard deviations representing these types of variation. The context here is the same as that of the intraclass correlation coefficient, whose value is the square of the correlation ratio.


Definition

Suppose each observation is yxi, where x indicates the category that observation is in and i is the label of the particular observation. Let nx be the number of observations in category x, and

ȳx = (Σi yxi) / nx   and   ȳ = (Σx nx·ȳx) / (Σx nx),

where ȳx is the mean of category x and ȳ is the mean of the whole population. The correlation ratio η (eta) is defined so as to satisfy

η² = Σx nx·(ȳx - ȳ)² / Σx,i (yxi - ȳ)²,

i.e. the weighted variance of the category means divided by the variance of all samples. It is worth noting that if the relationship between values of ȳx and values of x is linear (which is certainly true when there are only two possibilities for x), this will give the same result as the square of the correlation coefficient; otherwise the correlation ratio will be larger in magnitude. It can therefore be used for judging non-linear relationships.

Range

The correlation ratio takes values between 0 and 1. The limit η = 0 represents the special case of no dispersion among the means of the different categories, while η = 1 refers to no dispersion within the respective categories. Note further that η is undefined when all data points of the complete population take the same value.

Example
Suppose there is a distribution of test scores in three topics (categories):

Algebra: 45, 70, 29, 15 and 21 (5 scores)
Geometry: 40, 20, 30 and 42 (4 scores)
Statistics: 65, 95, 80, 70, 85 and 73 (6 scores).

Then the subject averages are 36, 33 and 78, with an overall average of 52. The sums of squares of the differences from the subject averages are 1952 for Algebra, 308 for Geometry and 600 for Statistics, adding to 2860, while the overall sum of squares of the differences from the overall average is 9640. The difference of 6780 between these is also the weighted sum of the squares of the differences between the subject averages and the overall average:

5(36 - 52)² + 4(33 - 52)² + 6(78 - 52)² = 6780

This gives

η² = 6780 / 9640 = 0.7033

suggesting that most of the overall dispersion is a result of differences between topics, rather than within topics. Taking the square root gives

η = sqrt(6780 / 9640) = 0.8386

Observe that for η = 1 the overall sample dispersion is purely due to dispersion among the categories and not at all due to dispersion within the individual categories. For a quick comprehension simply imagine all Algebra, Geometry, and Statistics scores being the same respectively, e.g. 5 times 36, 4 times 33, 6 times 78. The limit η = 0 refers to the case without dispersion among the category means contributing to the overall dispersion. The trivial requirement for this extreme is that all category means are the same.
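The example can be reproduced with a short sketch (assuming NumPy), following the sums of squares described above:

```python
# Correlation ratio (eta) for the test-score example.
import numpy as np

scores = {
    "Algebra":    [45, 70, 29, 15, 21],
    "Geometry":   [40, 20, 30, 42],
    "Statistics": [65, 95, 80, 70, 85, 73],
}

all_scores = np.concatenate([np.array(v, dtype=float) for v in scores.values()])
overall_mean = all_scores.mean()                       # 52

ss_between = sum(len(v) * (np.mean(v) - overall_mean) ** 2 for v in scores.values())
ss_total = ((all_scores - overall_mean) ** 2).sum()    # 9640

eta_squared = ss_between / ss_total
print(round(ss_between), round(eta_squared, 4), round(np.sqrt(eta_squared), 4))
# 6780, about 0.7033 and 0.8386
```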

Pearson v. Fisher

The correlation ratio was introduced by Karl Pearson as part of analysis of variance. Ronald Fisher commented: "As a descriptive statistic the utility of the correlation ratio is extremely limited. It will be noticed that the number of degrees of freedom in the numerator of η² depends on the number of the arrays,"[1] to which Egon Pearson (Karl's son) responded by saying "Again, a long-established method such as the use of the correlation ratio [§45 The "Correlation Ratio"] is passed over in a few words without adequate description, which is perhaps hardly fair to the student who is given no opportunity of judging its scope for himself."[2]

Multiple correlation

In statistics, the coefficient of multiple correlation is a measure of how well a given variable can be predicted using a linear function of a set of other variables. It is measured by the coefficient of determination, but under the particular assumption that the best possible linear predictors are used, whereas the coefficient of determination is defined for more general cases. The coefficient of multiple determination takes values between zero and one; a higher value indicates a better predictability of the dependent variable from the independent variables, with a value of one indicating that the predictions are exact and a value of zero indicating that no linear combination of the independent variables is a better predictor than the simple mean of the target variable.

Definition

The coefficient of multiple determination R² (a scalar) can be computed using the vector c of cross-correlations between the predictor variables (independent variables) and the target variable (dependent variable), and the matrix Rxx of inter-correlations between predictor variables. It is given by

R² = c′ Rxx⁻¹ c,

where c′ is the transpose of c, and Rxx⁻¹ is the inverse of the matrix Rxx. If all the predictor variables are uncorrelated, the matrix Rxx is the identity matrix and R² simply equals c′c, the sum of the squared cross-correlations. If there is cross-correlation among the predictor variables, the inverse of the cross-correlation matrix accounts for this.
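A small numerical sketch (assuming NumPy, with a made-up correlation structure) of the matrix formula:

```python
# R^2 = c' Rxx^{-1} c for two predictors x1, x2 and a target y.
import numpy as np

Rxx = np.array([[1.0, 0.3],    # correlations among the predictors
                [0.3, 1.0]])
c = np.array([0.6, 0.5])       # correlations of each predictor with the target

R_squared = c @ np.linalg.inv(Rxx) @ c
print(round(R_squared, 4))

# With uncorrelated predictors (Rxx = identity) this reduces to c'c:
print(round(c @ c, 4))         # 0.61
```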

Properties

Unlike the coefficient of determination in a regression involving just two variables, the coefficient of multiple determination is not computationally commutative: a regression of y on x and z will in general have a different R² than will a regression of z on x and y. For example, suppose that in a particular sample the variable z is uncorrelated with both x and y, while x and y are linearly related to each other. Then a regression of z on y and x will yield an R² of zero, while a regression of y on x and z will yield a positive R².

Partial correlation
In probability theory and statistics, partial correlation measures the degree of association between two random variables, with the effect of a set of controlling random variables removed.


Formal definition

Formally, the partial correlation between X and Y given a set of n controlling variables Z = {Z1, Z2, ..., Zn}, written ρXY·Z, is the correlation between the residuals RX and RY resulting from the linear regression of X with Z and of Y with Z, respectively. In fact, the first-order partial correlation is nothing else than a difference between a correlation and the product of the removable correlations divided by the product of the coefficients of alienation of the removable correlations. The coefficient of alienation, and its relation with joint variance through correlation, are available in Guilford (1973, pp. 344-345).

Computation

Using linear regression

A simple way to compute the partial correlation for some data is to solve the two associated linear regression problems, get the residuals, and calculate the correlation between the residuals. If we write xi, yi and zi to denote i.i.d. samples of some joint probability distribution over X, Y and Z, solving the linear regression problem amounts to finding regression coefficient vectors for the least-squares fits of X on Z and of Y on Z,

with N being the number of samples and ⟨v, w⟩ the scalar product between the vectors v and w. Note that in some implementations the regression includes a constant term, so the design matrix would have an additional column of ones. The residuals are then

and the sample partial correlation is

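A minimal sketch of this residual-based computation (assuming NumPy; the data are simulated so that X and Y are related only through Z):

```python
# Partial correlation of X and Y given Z via residuals of linear regression.
import numpy as np

rng = np.random.default_rng(0)
N = 500
z = rng.normal(size=N)
x = 2.0 * z + rng.normal(size=N)          # X depends on Z
y = -1.0 * z + rng.normal(size=N)         # Y depends on Z

def residuals(target, z):
    # design matrix with a constant term, as noted in the text
    design = np.column_stack([np.ones(len(z)), z])
    coef, *_ = np.linalg.lstsq(design, target, rcond=None)
    return target - design @ coef

rx, ry = residuals(x, z), residuals(y, z)
partial_xy_given_z = np.corrcoef(rx, ry)[0, 1]

print(round(np.corrcoef(x, y)[0, 1], 2))   # strongly negative raw correlation
print(round(partial_xy_given_z, 2))        # near zero once Z is controlled for
```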
Using recursive formula

It can be computationally expensive to solve the linear regression problems. Actually, the nth-order partial correlation (i.e., with |Z| = n) can be easily computed from three (n - 1)th-order partial correlations. The zeroth-order partial correlation ρXY is defined to be the regular correlation coefficient ρXY. It holds, for any Z0 in Z, that

ρXY·Z = (ρXY·Z\{Z0} - ρXZ0·Z\{Z0} · ρZ0Y·Z\{Z0}) / sqrt[(1 - ρXZ0·Z\{Z0}²)(1 - ρZ0Y·Z\{Z0}²)]
Naïvely implementing this computation as a recursive algorithm yields an exponential time complexity. However, this computation has the overlapping subproblems property, such that using dynamic programming or simply caching the results of the recursive calls reduces the cost considerably.

Note that in the case where Z is a single variable, this reduces to:

ρXY·Z = (ρXY - ρXZ·ρZY) / sqrt[(1 - ρXZ²)(1 - ρZY²)]
Using matrix inversion

Another approach allows all partial correlations to be computed between any two variables Xi and Xj of a set V of cardinality n, given all others, i.e., V \ {Xi, Xj}, if the correlation matrix (or alternatively covariance matrix) Ω = (ωij), where ωij = ρXiXj, is invertible. If we define P = Ω⁻¹, we have:

ρXiXj·V\{Xi,Xj} = -pij / sqrt(pii·pjj)
Interpretation

Geometrical

Let three variables X, Y, Z [where X is the independent variable (IV), Y is the dependent variable (DV), and Z is the "control" or "extra variable"] be chosen from a joint probability distribution over n variables V. Further let vi, 1 ≤ i ≤ N, be N n-dimensional i.i.d. samples taken from the joint probability distribution over V. We then consider the N-dimensional vectors x (formed by the successive values of X over the samples), y (formed by the values of Y) and z (formed by the values of Z). It can be shown that the residuals RX coming from the linear regression of X using Z, if also considered as an N-dimensional vector rX, have a zero scalar product with the vector z generated by Z. This means that the residuals vector lies on a hyperplane Sz that is perpendicular to z.

The same also applies to the residuals RY, generating a vector rY. The desired partial correlation is then the cosine of the angle between the projections rX and rY of x and y, respectively, onto the hyperplane perpendicular to z.[1]
As conditional independence test

See also: Fisher transformation

With the assumption that all involved variables are multivariate Gaussian, the partial correlation ρXY·Z is zero if and only if X is conditionally independent from Y given Z.[2] This property does not hold in the general case. To test if a sample partial correlation vanishes, Fisher's z-transform of the partial correlation can be used:

z(ρ̂XY·Z) = (1/2) · ln[(1 + ρ̂XY·Z) / (1 - ρ̂XY·Z)]

The null hypothesis is H0: ρXY·Z = 0, to be tested against the two-tail alternative HA: ρXY·Z ≠ 0. We reject H0 with significance level α if:

sqrt(N - |Z| - 3) · |z(ρ̂XY·Z)| > Φ⁻¹(1 - α/2),

where Φ(·) is the cumulative distribution function of a Gaussian distribution with zero mean and unit standard deviation, and N is the sample size. Note that this z-transform is approximate and that the actual distribution of the sample (partial) correlation coefficient is not straightforward. However, an exact t-test based on a combination of the partial regression coefficient, the partial correlation coefficient and the partial variances is available.[3] The distribution of the sample partial correlation was described by Fisher.[4]

Semipartial correlation (part correlation)

The semipartial (or part) correlation statistic is similar to the partial correlation statistic. Both measure variance after certain factors are controlled for, but to calculate the semipartial correlation one holds the third variable constant for either X or Y, whereas for partial correlations one holds the third variable constant for both. The semipartial correlation measures unique and joint variance while the partial correlation measures unique variance. The semipartial (or part) correlation can be viewed as more practically relevant "because it is scaled to (i.e., relative to) the total variability in the dependent (response) variable."[5] Conversely, it is less theoretically useful because it is less precise about the unique contribution of the independent variable. Although it may seem paradoxical, the semipartial correlation of X with Y is always less than the partial correlation of X with Y.

Use in time series analysis

In time series analysis, the partial autocorrelation function (sometimes "partial correlation function") of a time series is defined, for lag h, as

φ(h) = ρ(Xt, Xt+h | Xt+1, ..., Xt+h-1),

the partial correlation of Xt and Xt+h given the intervening observations.
Partial Correlation
Partial correlation is a method used to describe the relationship between two variables whilst taking away the effects of another variable, or several other variables, on this relationship.

Partial correlation is best thought of in terms of multiple regression; StatsDirect shows the partial correlation coefficient r with its main results from multiple linear regression.

A different way to calculate partial correlation coefficients, which does not require a full multiple regression, is shown below for the sake of further explanation of the principles:

Consider a correlation matrix for variables A, B and C (note that the multiple linear regression function in StatsDirect will output correlation matrices for you as one of its options):

      A    B       C
A     *    r(AB)   r(AC)
B          *       r(BC)

The partial correlation of A and B adjusted for C is:

r(AB|C) = [r(AB) - r(AC)·r(BC)] / sqrt[(1 - r(AC)²)(1 - r(BC)²)]
The same can be done using Spearman's rank correlation co-efficient.

The hypothesis test for the partial correlation co-efficient is performed in the same way as for the usual correlation co-efficient but it is based upon n-3 degrees of freedom.

Please note that this sort of relationship between three or more variables is more usefully investigated using the multiple regression itself (Altman, 1991).

The general form of partial correlation from a multiple regression is as follows:

- where tk is the Student t statistic for the kth term in the linear model.

Spearman's rank correlation coefficient

In statistics, Spearman's rank correlation coefficient or Spearman's rho, named after Charles Spearman and often denoted by the Greek letter ρ (rho) or as rs, is a non-parametric measure of statistical dependence between two variables. It assesses how well the relationship between two variables can be described using a monotonic function. If there are no repeated data values, a perfect Spearman correlation of +1 or -1 occurs when each of the variables is a perfect monotone function of the other. Spearman's coefficient can be used when both the dependent (outcome; response) variable and the independent (predictor) variable are ordinal numeric, or when one variable is ordinal numeric and the other is a continuous variable. However, it can also be appropriate to use Spearman's correlation when both variables are continuous.[1]


Definition and calculation

The Spearman correlation coefficient is defined as the Pearson correlation coefficient between the ranked variables.[2] For a sample of size n, the n raw scores Xi, Yi are converted to ranks xi, yi, and ρ is computed from these:

ρ = Σi (xi - x̄)(yi - ȳ) / sqrt[ Σi (xi - x̄)² · Σi (yi - ȳ)² ]

Tied values are assigned a rank equal to the average of their positions in the ascending order of the values. In the table below, notice how the rank of values that are the same is the mean of what their ranks would otherwise be:

Variable   Position in the ascending order   Rank
0.8                      1                    1
1.2                      2                    2.5
1.2                      3                    2.5
2.3                      4                    4
18                       5                    5

In applications where ties are known to be absent, a simpler procedure can be used to calculate ρ.[2][3] Differences di = xi - yi between the ranks of each observation on the two variables are calculated, and ρ is given by:

ρ = 1 - 6·Σ di² / (n·(n² - 1))
Related quantities

Main article: Correlation and dependence

There are several other numerical measures that quantify the extent of statistical dependence between pairs of observations. The most common of these is the Pearson product-moment correlation coefficient, which is a similar correlation method to Spearman's rank, but measures the "linear" relationship between the raw numbers rather than between their ranks. An alternative name for the Spearman rank correlation is the "grade correlation";[4] in this, the "rank" of an observation is replaced by the "grade". In continuous distributions, the grade of an observation is, by convention, always one half less than the rank, and hence the grade and rank correlations are the same in this case. More generally, the "grade" of an observation is proportional to an estimate of the fraction of a population less than a given value, with the half-observation adjustment at observed values. Thus this corresponds to one possible treatment of tied ranks. While unusual, the term "grade correlation" is still in use.[5]

Interpretation

A positive Spearman correlation coefficient corresponds to an increasing monotonic trend between X and Y; a negative Spearman correlation coefficient corresponds to a decreasing monotonic trend between X and Y.

The sign of the Spearman correlation indicates the direction of association between X (the independent variable) and Y (the dependent variable). If Y tends to increase when X increases, the Spearman correlation coefficient is positive. If Y tends to decrease when X increases, the Spearman correlation coefficient is negative. A Spearman correlation of zero indicates that there is no tendency for Y to either increase or decrease when X increases. The Spearman correlation increases in magnitude as X and Y become closer to being perfect monotone functions of each other. When X and Y are perfectly monotonically related, the Spearman correlation coefficient becomes 1. A perfect monotone increasing relationship implies that for any two pairs of data values Xi, Yi and Xj, Yj, the differences Xi - Xj and Yi - Yj always have the same sign. A perfect monotone decreasing relationship implies that these differences always have opposite signs. The Spearman correlation coefficient is often described as being "non-parametric". This can have two meanings. First, the fact that a perfect Spearman correlation results when X and Y are related by any monotonic function can be contrasted with the Pearson correlation, which only gives a perfect value when X and Y are related by a linear function. The other sense in which the Spearman correlation is non-parametric is that its exact sampling distribution can be obtained without requiring knowledge (i.e., knowing the parameters) of the joint probability distribution of X and Y.

Example

In this example, we will use the raw data in the table below to calculate the correlation between the IQ of a person and the number of hours spent in front of the TV per week.

IQ:                     106  86  100  101  99  103  97  113  112  110
Hours of TV per week:     7   0   27   50  28   29  20   12    6   17

First, we must find the value of the term di². To do so we use the following steps, reflected in the table below.

1. Sort the data by the first column (IQ). Create a new column and assign it the ranked values 1, 2, 3, ..., n.
2. Next, sort the data by the second column (Hours). Create a fourth column and similarly assign it the ranked values 1, 2, 3, ..., n.
3. Create a fifth column di to hold the differences between the two rank columns.
4. Create one final column di² to hold the value of column di squared.

IQ    Hours of TV per week   Rank (IQ)   Rank (Hours)    di    di²
 86             0                1             1          0      0
 97            20                2             6         -4     16
 99            28                3             8         -5     25
100            27                4             7         -3      9
101            50                5            10         -5     25
103            29                6             9         -3      9
106             7                7             3          4     16
110            17                8             5          3      9
112             6                9             2          7     49
113            12               10             4          6     36

With the values of di² found, we can add them to obtain Σdi² = 194. The value of n is 10, so these values can now be substituted back into the equation

ρ = 1 - 6 × 194 / (10·(10² - 1))

which evaluates to ρ = -0.175757575... with a P-value = 0.6864058 (using the t distribution). This low value shows that the correlation between IQ and hours spent watching TV is very low. In the case of ties in the original values, this formula should not be used; instead, the Pearson correlation coefficient should be calculated on the ranks (where ties are given ranks, as described above).
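The worked example can be checked with a short sketch (assuming SciPy), both from the d-squared formula and with the library routine:

```python
# Spearman's rho for the IQ / television example.
import numpy as np
from scipy.stats import spearmanr, rankdata

iq    = np.array([106, 86, 100, 101, 99, 103, 97, 113, 112, 110])
hours = np.array([  7,  0,  27,  50, 28,  29, 20,  12,   6,  17])

d = rankdata(iq) - rankdata(hours)
n = len(iq)
rho_formula = 1 - 6 * np.sum(d ** 2) / (n * (n ** 2 - 1))

rho_scipy, p_value = spearmanr(iq, hours)
print(round(rho_formula, 4), round(rho_scipy, 4))   # both about -0.1758
```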

Determining significance

One approach to testing whether an observed value of ρ is significantly different from zero (r will always maintain -1 ≤ r ≤ 1) is to calculate the probability that it would be greater than or equal to the observed r, given the null hypothesis, by using a permutation test. An advantage of this approach is that it automatically takes into account the number of tied data values in the sample and the way they are treated in computing the rank correlation. Another approach parallels the use of the Fisher transformation in the case of the Pearson product-moment correlation coefficient. That is, confidence intervals and hypothesis tests relating to the population value ρ can be carried out using the Fisher transformation:

F(r) = (1/2) · ln[(1 + r) / (1 - r)]

If F(r) is the Fisher transformation of r, the sample Spearman rank correlation coefficient, and n is the sample size, then

z = sqrt((n - 3) / 1.06) · F(r)

is a z-score for r which approximately follows a standard normal distribution under the null hypothesis of statistical independence (ρ = 0).[6][7] One can also test for significance using

t = r · sqrt((n - 2) / (1 - r²)),

which is distributed approximately as Student's t distribution with n - 2 degrees of freedom under the null hypothesis.[8] A justification for this result relies on a permutation argument.[9] A generalization of the Spearman coefficient is useful in the situation where there are three or more conditions, a number of subjects are all observed in each of them, and it is predicted that the observations will have a particular order. For example, a number of subjects might each be given three trials at the same task, and it is predicted that performance will improve from trial to trial. A test of the significance of the trend between conditions in this situation was developed by E. B. Page[10] and is usually referred to as Page's trend test for ordered alternatives.

Correspondence analysis based on Spearman's rho

Classic correspondence analysis is a statistical method that gives a score to every value of two nominal variables. In this way the Pearson correlation coefficient between them is maximized. There exists an equivalent of this method, called grade correspondence analysis, which maximizes Spearman's rho or Kendall's tau.[11]

See also

Kendall tau rank correlation coefficient
Chebyshev's sum inequality, rearrangement inequality (These two articles may shed light on the mathematical properties of Spearman's ρ.)

Correlation coefficients

Some of the more popular rank correlation statistics include:

1. Spearman's ρ
2. Kendall's τ
3. Goodman and Kruskal's γ

An increasing rank correlation coefficient implies increasing agreement between rankings. The coefficient is inside the interval [-1, 1] and assumes the value:

-1 if the disagreement between the two rankings is perfect; one ranking is the reverse of the other.
0 if the rankings are completely independent.
1 if the agreement between the two rankings is perfect; the two rankings are the same.

Following Diaconis (1988), a ranking can be seen as a permutation of a set of objects. Thus we can look at observed rankings as data obtained when the sample space is (identified with) a symmetric group. We can then introduce a metric, making the symmetric group into a metric space. Different metrics will correspond to different rank correlations.

General Correlation Coefficient

Kendall (1944) showed that his tau and Spearman's rho are particular cases of a general correlation coefficient. Suppose we have a set of n objects, which are being considered in relation to two properties, represented by x and y, forming the sets of values {xi} and {yi}. To any pair of individuals, say the i-th and the j-th, we assign an x-score, denoted by aij, and a y-score, denoted by bij. The only requirement made of these functions is anti-symmetry, so aij = -aji and bij = -bji. Then the generalised correlation coefficient Γ is defined by

Γ = Σi,j aij·bij / sqrt( Σi,j aij² · Σi,j bij² )
Kendall's τ as a particular case

If ri is the rank of the i-th member according to the x-quality and si is its rank according to the y-quality, we can define

aij = sgn(rj - ri),   bij = sgn(sj - si).

The sum Σ aij·bij is then twice the difference between the number of concordant pairs and the number of discordant pairs (see Kendall tau rank correlation coefficient). The sum Σ aij² is just the number of terms aij, equal to n(n - 1), and so for Σ bij². It follows that Γ is equal to Kendall's τ coefficient.

Spearman's ρ as a particular case

If ri, si are the ranks of the i-th member according to the x-quality and the y-quality respectively, we can simply define

aij = rj - ri,   bij = sj - si.

The sums Σ aij² and Σ bij² are equal, since both ri and si range from 1 to n. Then we have:

Γ = Σi,j (rj - ri)(sj - si) / Σi,j (rj - ri)²

Now

Σi,j (rj - ri)(sj - si) = 2n·Σi ri·si - 2·Σi ri · Σj sj,

and Σi ri and Σi si are both equal to the sum of the first n natural numbers, namely n(n + 1)/2.

We also have

S = Σi (ri - si)² = 2·Σi ri² - 2·Σi ri·si,

and hence

Σi,j (rj - ri)(sj - si) = 2n·Σi ri² - n²(n + 1)²/2 - nS.

Σi ri² being the sum of squares of the first n naturals, namely n(n + 1)(2n + 1)/6, the last equation reduces to

Σi,j (rj - ri)(sj - si) = n²(n² - 1)/6 - nS.

Further,

Σi,j (rj - ri)² = 2n·Σi ri² - 2·(Σi ri)² = n²(n² - 1)/6,

and thus, substituting these results into the original formula, we get

Γ = 1 - 6·Σi di² / (n·(n² - 1)),

where di = ri - si, which is exactly Spearman's rank correlation coefficient ρ.
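A quick numerical sketch (assuming NumPy and SciPy) confirming that the general coefficient with these scores reproduces Spearman's rho on untied data:

```python
# General coefficient with a_ij = r_j - r_i, b_ij = s_j - s_i equals Spearman's rho.
import numpy as np
from scipy.stats import rankdata, spearmanr

rng = np.random.default_rng(1)
x, y = rng.normal(size=12), rng.normal(size=12)
r, s = rankdata(x), rankdata(y)

a = r[None, :] - r[:, None]        # a_ij = r_j - r_i
b = s[None, :] - s[:, None]        # b_ij = s_j - s_i

gamma = (a * b).sum() / np.sqrt((a ** 2).sum() * (b ** 2).sum())
rho, _ = spearmanr(x, y)
print(round(gamma, 6), round(rho, 6))   # identical values
```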

What is the definition of Spearman's rank-order correlation?

There are two methods to calculate Spearman's rank-order correlation, depending on whether: (1) your data does not have tied ranks or (2) your data has tied ranks. The formula for when there are no tied ranks is:

ρ = 1 - 6·Σ di² / (n·(n² - 1)),

where di = difference in paired ranks and n = number of cases. The formula to use when there are tied ranks is the Pearson correlation computed on the ranks:

ρ = Σi (xi - x̄)(yi - ȳ) / sqrt[ Σi (xi - x̄)² · Σi (yi - ȳ)² ],

where i = paired score.

More Correlation Coefficients

Lesson Overview

Why so many Correlation Coefficients
Point Biserial Coefficient
Phi Coefficient
Measures of Association: C, V, Lambda
Biserial Correlation Coefficient
Tetrachoric Correlation Coefficient
Rank-Biserial Correlation Coefficient
Coefficient of Nonlinear Relationship (eta)
Homework

Why so many Correlation Coefficients

We introduced in lesson 5 the Pearson product moment correlation coefficient and the Spearman rho correlation coefficient. There are more. Remember that the Pearson product moment correlation coefficient requires quantitative (interval or ratio) data for both x and y, whereas the Spearman rho correlation coefficient applies to ranked (ordinal) data for both x and y. You should review levels of measurement in lesson 1 before we continue. It is often the case that the data variables are not at the same level of measurement, or that the data, instead of being quantitative, might be categorical (nominal or ordinal). In addition to correlation coefficients based on the product moment, and thus related to the Pearson product moment correlation coefficient, there are coefficients which are instead measures of association; these are also in common use.

For the purposes of correlation coefficients we can generally lump the interval and ratio scales together as just quantitative. In addition, the regression of x on y is closely related to the regression of y on x, and the same coefficient applies. We list below in a table the common choices which we will then discuss in turn.
Variable Y\X     Quantitative X       Ordinal X                        Nominal X
Quantitative Y   Pearson r            Biserial rb                      Point Biserial rpb
Ordinal Y        Biserial rb          Spearman rho/Tetrachoric rtet    Rank Biserial rrb
Nominal Y        Point Biserial rpb   Rank Biserial rrb                Phi, L, C, Lambda

Before we go on we need to clarify different types of nominal data. Specifically, nominal data with two possible outcomes are called dichotomous.

Point-Biserial

The point-biserial correlation coefficient, referred to as rpb, is a special case of Pearson in which one variable is quantitative and the other variable is dichotomous and nominal. The calculations simplify since typically the values 1 (presence) and 0 (absence) are used for the dichotomous variable. This simplification is sometimes expressed as follows:

rpb = (Ȳ1 - Ȳ0)·sqrt(p·q) / σY,

where Ȳ0 and Ȳ1 are the Y score means for data pairs with an x score of 0 and 1, respectively, q = 1 - p and p are the proportions of data pairs with x scores of 0 and 1, respectively, and σY is the population standard deviation for the y data. An example usage might be to determine if one gender accomplished some task significantly better than the other gender.

Phi Coefficient

If both variables instead are nominal and dichotomous, the Pearson simplifies even further. First, perhaps, we need to introduce contingency tables. A contingency table is a two-dimensional table containing frequencies by category. For this situation it will be two by two, since each variable can only take on two values, but each dimension will exceed two when the associated variable is not dichotomous. In addition, column and row headings and totals are frequently appended, so that the contingency table ends up being n + 2 by m + 2, where n and m are the number of values each variable can take on. The label and total row and column typically are outside the gridded portion of the table, however.

As an example, consider the following data organized by gender and employee classification (faculty/staff).

Class.\Gender   Female (0)   Male (1)   Totals
Staff               10           5        15
Faculty              5          10        15
Totals:             15          15        30

Contingency tables are often coded as below to simplify calculation of the Phi coefficient.

Y\X        0       1      Totals
0          A       B      A + B
1          C       D      C + D
Totals:  A + C   B + D       n

With this coding: phi = (BC - AD) / sqrt((A+B)(C+D)(A+C)(B+D)). For this example we obtain:

phi = (25 - 100) / sqrt(15·15·15·15) = -75/225 = -0.33,

indicating a slight correlation. Please note that this is the Pearson correlation coefficient, just calculated in a simplified manner. However, the extreme values of |r| = 1 can only be realized when the two row totals are equal and the two column totals are equal. There are thus ways of computing the maximal values, if desired.
Measures of Association: C, V, Lambda

As product moment correlation coefficients, the point biserial, phi, and Spearman rho are all special cases of the Pearson. However, there are correlation coefficients which are not. Many of these are more properly called measures of association, although they are usually termed coefficients as well. Three of these are similar to Phi in that they are for nominal against nominal data, but these do not require the data to be dichotomous.

One is called Pearson's contingency coefficient and is termed C whereas the second is called Cramer's V coefficient. Both utilize the chi-square statistic so will be deferred into the next lesson. However, the Goodman and Kruskal lambda coefficient does not, but is another commonly used association measure. There are two flavors, one called symmetric when the researcher does not specify which variable is the dependent variable and one called asymmetric which is used when such a designation is made. We leave the details to any good statistics book.
Biserial Correlation Coefficient

Another measure of association, the biserial correlation coefficient, termed rb, is similar to the point biserial, but pits quantitative data against ordinal data, but ordinal data with an underlying continuity, measured discretely as two values (dichotomous). An example might be test performance vs. anxiety, where anxiety is designated as either high or low. Presumably, anxiety can take on any value in between, perhaps beyond, but it may be difficult to measure. We further assume that anxiety is normally distributed. The formula is very similar to the point-biserial but yet different:

rb = (Ȳ1 - Ȳ0)·(p·q / h) / σY,

where Ȳ0 and Ȳ1 are the Y score means for data pairs with an x score of 0 and 1, respectively, q = 1 - p and p are the proportions of data pairs with x scores of 0 and 1, respectively, σY is the population standard deviation for the y data, and h is the height of the standardized normal distribution at the point z, where P(z' < z) = q and P(z' > z) = p. Since the factor involving p, q, and the height is always greater than 1, the biserial is always greater than the point-biserial.

Tetrachoric Correlation Coefficient

The tetrachoric correlation coefficient, rtet, is used when both variables are dichotomous, like the phi, but we need also to be able to assume both variables really are continuous and normally distributed.

Thus it is applied to ordinal vs. ordinal data which has this characteristic. Ranks are discrete, so in this manner it differs from the Spearman. The formula involves a trigonometric function called cosine. The cosine function, in its simplest form, is the ratio of two side lengths in a right triangle, specifically, the side adjacent to the reference angle divided by the length of the hypotenuse. The formula is:

rtet = cos(180° / (1 + sqrt(BC/AD))).

Rank-Biserial Correlation Coefficient

The rank-biserial correlation coefficient, rrb, is used for dichotomous nominal data vs. rankings (ordinal). The formula is usually expressed as

rrb = 2·(Ȳ1 - Ȳ0) / n,

where n is the number of data pairs, and Ȳ0 and Ȳ1, again, are the Y score means for data pairs with an x score of 0 and 1, respectively. These Y scores are ranks. This formula assumes no tied ranks are present. This may be the same as Somers' D statistic, for which an online calculator is available.

Coefficient of Nonlinear Relationship (eta)

It is often useful to measure a relationship irrespective of whether it is linear or not. The eta correlation ratio or eta coefficient gives us that ability. This statistic is interpreted similarly to the Pearson, but can never be negative. It utilizes equal-width intervals and always exceeds |r|. However, even though r is the same whether we regress y on x or x on y, two possible values for eta can be obtained.

Again, the calculation goes beyond what can be presented here at the moment.

Biserial Correlation Coefficient



Description
The Biserial Correlation Coefficient is used where there are two sets of scores for the same people or for two matched groups. It is calculated as:

Rb = ((M1 - M2) · p1·p2) / (Z·Sy)

where M1 and M2 are the means of the two groups, p1 and p2 are the proportions of the two groups out of the total, Sy is the standard deviation of the continuous variable as a whole, and Z is the Z-score from a table of the normal distribution for p1 or p2, whichever is smaller.
