11 views

Uploaded by AdãodaLuz

Fallas-Mantenimiento

- b104
- A Comparative Study of Women Employees on Job Satisfaction Reflections From Banks and FMCG Companies
- 19. Moha Asri-final
- HSC COMMERCE MATHS II ANSWER SHEET .pdf
- Corelation and Regression
- Worksheet
- debra lesson plan feb 3
- Chapter 3 Research Methodology Al
- Some Mental Abilities Related to the Discus Achievement
- paper_ed6_33-1
- PS-11 CT Calcs-Ver 2-6 Misc.xls
- A Seminal Case Study on Application of Last Planner System with Cash Flow Data for Improvement in Construction Management Practices
- Information Behaviour and Socio-Economic Empowerment of Textile Market Women in Southern Nigeria
- 24-79-2-PB
- Computational Methods in Linguistics
- math 1040 final project
- Comparacion AADEE_AVL Version Ingles New
- Quantitative Analysis Forecasting
- 05 Describing Relationship Between Variables
- Pre Assessment

You are on page 1of 8

THIS IS THE MSWORD DOCUMENT FROM WHICH THE JOURNAL OF THE AMERICAN MATHEMATICAL

ASSOCIATION OF TWO YEAR COLLEGES CREATED THE PUBLISHED MAY 2017 VERSION.

THE 1ST PAGE OF THE AS-PUBLISHED VERSION CAN BE FOUND AT THE END OF THIS DOCUMENT;

THE FULL AS-PUBLISHED VERSION, PAGES 4-7, CAN BE FOUND STARTING AT:

https://amatyc.site-ym.com/page/EducatorMay2017

----------------------------------------------------------------

SUMMARY

The best equation for introducing students to the meaning of the correlation coefficient is not

Karl Pearson's "product-moment" formula. Better alternatives are presented and discussed here.

The relationship between the correlation and regression coefficients is reviewed. Some historical

discussion is included.

1. Introduction

Almost a century ago, Symonds (1926) published a paper that contained 52 different formulas

for calculating the "product-moment (Pearson) coefficient of correlation" (a.k.a., correlation

coefficient, index of co-relation, or simply "r"); decades before computers, he focused on

formulas useful for "ease of computation". Rodgers and Nicewander (1988) included 13 such

formulas, where "each formula suggests a different way of thinking about" the correlation

coefficient; the authors focused on "interesting" formulas (p. 59); some of their 13 formulas are

not found among Symond's 52. This present paper focuses on 4 formulas useful for teaching the

correlation coefficient's meaning to novice students; one of the 4 is not found in either of those

previous papers.

Which characteristics of a formula are useful for explaining a difficult concept? At the top of

such a list would certainly be that it is both simple and instructive. For example, the concept of

"momentum" as taught in introductory physics classes is explained using:

Contrast that with the following formula for the correlation coefficient that is found most

prominently in many introductory statistics textbooks published in the past 100 years:

()(

)

Standard Formula: = ) (

)

(

Having taught the correlation coefficient in classrooms and seminars for many years, this author

concludes that the Standard Formula does not belong in introductory statistics textbooks, since it

is far from simple and provides virtually no instructive value to the average novice student, who

v1.01, June 29 2017, Four Formulas for Correlation Coefficient, J. Zorich, www.johnzorich.com p. 1

can be more confused than educated by it. Instead, introductory textbooks should include one or

more of the following four formulas.

An unnecessary obstacle to simplifying the formula for the correlation coefficient is the

commonly held belief that the formula must output the coefficient's "sign", which is said to

indicate whether the correlation is "positive" or "negative". It can be argued that the sign is

irrelevant and that correlation is neither positive nor negative, as shown by the following

formula and discussion (the formula is from Rogers and Nicewander, p. 62):

Formula 1: =

where

b = slope of the least-squares linear regression line, using Y on the vertical axis and

X on the horizontal; the linear regression equation is typically given as Y = a +

bX

Sy = standard deviation of the plotted vertical-axis sample-data values

Sx = standard deviation of the plotted horizontal-axis sample-data values.

As seen in Formula 1, the sign of the correlation coefficient is always the same as that of the

slope, which is typically referred to as the "regression coefficient". The sign of the regression

coefficient (i.e., the "b" in Y = a + bX) tells us whether the relationship between X and Y is

negative or positive (i.e., whether the plotted linear regression line is an increasing or decreasing

function); the sign of the correlation coefficient provides no additional information. If the world

had never heard of the correlation coefficient's sign, nothing would have been lost, and the gain

would have been many fewer confused students.

The next formula is one of Symond's. However, it was printed incorrectly (the square root sign

is missing). The correct formula (from Rodgers & Nicewander, p. 62) is:

Formula 2: =

where

bX = slope of the least-squares linear regression line when Y is plotted on the vertical

axis, and X is plotted on the horizontal

bY = slope of the least-squares linear regression line when X is plotted on the vertical

axis, and Y is plotted on the horizontal

r = takes the same sign as bX and bY, which always have the same sign.

Formula 2 says that the correlation coefficient is the geometric average of the slopes of the two

possible ways to plot the data, i.e., Y vs. X and X vs. Y.

Francis Galton discovered the correlation coefficient in 1888. He had been frustrated by the

problematic fact that a plot of Y vs. X yields not only a different slope than a plot of X vs. Y but

v1.01, June 29 2017, Four Formulas for Correlation Coefficient, J. Zorich, www.johnzorich.com p. 2

also a different relationship. To explain that problem, let the equation for Y vs. X be represented

as Y = a + bXX, and the equation for X vs. Y as X = c + bYY. If the latter equation is rearranged

to yield Y = (X c)/bY, then amazingly (X c)/bY does not equal a + bXX. Referring to a paper

Galton published a few years before, he wrote that the "stature of the father is correlated to that

of the adult son, and the stature of the adult son to that of the father...; ...what I there called

'regression,' is different in the different cases" (Galton 1888, p. 143). Twenty years later, he

reminisced that "I could not [yet] see my way to express the results of [such regression analyses]

in a single formula" (Galton 1908, p. 302). Galton found his "single formula" when he

discovered the correlation coefficient.

To make that discovery, Galton changed how he plotted his X,Y data. Instead of scaling the

horizontal and vertical plot axes in the same units as the raw data, he scaled them in units of

"probable error" (which is a 19th century term that is arithmetically related to the 20th century's

"standard deviation"). In other words, he plotted values that had been "transmuted into terms of

a new scale" (Galton 1888, p. 136). In effect, he divided each of the individual X and Y values by

their respective probable error. These transmuted values are denoted by XT and YT, where the

subscript "T" indicates transmuted raw data. Plots of YT vs. XT and of XT vs. YT exhibited the

same "problematic fact" that was mentioned above (that is, (XT c)/bY did not equal a + bXXT).

Amazingly the two plots had the identical linear regression slope. It is interesting to note that in

his "haste to prepare a paper" on this discovery before anyone else (Galton 1890, p. 421), he did

not clearly explain what meaning to assign to this identical slope that he there called "r", other

than what he gave in the very last words of the paper: "r measures the closeness of correlation"

(Galton 1888, p.145). He did not improve upon that explanation in his subsequent abbreviated

version of the publication (Galton 1889).

Formula 3 is the most direct calculation of the slope of the line on Galton's plot of "transmuted"

X,Y values (i.e., the slope of YT vs. XT). It is a classic Cartesian-coordinate calculation of "rise

over run", where the rise is and the run is , and where the rise and run values are

standardized by dividing by their respective standard deviations.

)

(

Formula 3: = )

(

where

= arithmetic mean of the plotted vertical-axis sample-data values (i.e., the Yi's)

= arithmetic mean of the plotted horizontal-axis sample-data values (i.e., the

Xi's)

Sy = standard deviation of the plotted vertical-axis sample-data values

Sx = standard deviation of the plotted horizontal-axis sample-data values

Xi = any arbitrarily chosen sample-data value plotted on the horizontal axis

Yei = a + bXi = the output of the least-squares linear regression equation derived

from the "Y on X" plot of the sample-data; this Yei value must be calculated

with the Xi value used in Formula 3's denominator.

v1.01, June 29 2017, Four Formulas for Correlation Coefficient, J. Zorich, www.johnzorich.com p. 3

The simplest and possibly the most instructive formula for the magnitude of the correlation

coefficient is this last one, which is a ratio of two standard deviations (the formula is from

Rogers and Nicewander, p. 62):

()

Formula 4 (magnitude only): =

where S(Yei) is the standard deviation of all the Yei values derived from all the plotted Xi

sample-data values, and Sy is the standard deviation of all the plotted Yi sample-data values.

regression slope (i.e., "b") of a given value, the smallest possible standard deviation for the

vertical-axis sample-data Yi values is when the plotted X,Y points occur exactly on the linear

regression line. The S(Yei) in Formula 4 is that smallest standard deviation. When any of the

plotted X,Y points deviate from that line (in such a way that "b" remains that same "given

value"), the result is a larger standard deviation, which is the Sy in Formula 4.

Formula 4 says that the correlation coefficient equals the fraction of the total observed variation

in Y (as measured by Sy) that can be explained by a perfectly linear relationship between X and Y

(as measured by S(Yei)). The remaining variation is caused by other factors (e.g., measurement

error or confounding factors). For a more mathematically rigorous explanation, it is necessary to

square both sides of the formula, thereby converting the standard deviations into variances and

the correlation coefficient into what is known as the Coefficient of Determination; the correct

statement then is that the coefficient of determination equals the fraction of the observed

variance in Y that can be explained by a perfectly linear relationship between X and Y. However,

since the correlation coefficient value is the same, no matter whether the linear regression plot is

Y vs. X or X vs. Y, the best explanatory statement for use with novice students may be this: The

correlation coefficient represents the fraction of the observed co-variation between the X and Y

variables that can be explained by a perfectly linear relationship between them.

v1.01, June 29 2017, Four Formulas for Correlation Coefficient, J. Zorich, www.johnzorich.com p. 4

Figure 1. Identical Very High Correlation Coefficients But Very Different Slopes

Slopes = 0.0001 (lower line), 10.0000 (upper line)

100

90

80

70

60

50

40

30

20

10

0

0 1 2 3 4 5 6 7 8

Slopes = 0.0001 (lower line), 10.0000 (upper line)

100

90

80

70

60

50

40

30

20

10

0

0 1 2 3 4 5 6 7 8

v1.01, June 29 2017, Four Formulas for Correlation Coefficient, J. Zorich, www.johnzorich.com p. 5

Figure 3. Identical Slopes But Different Correlation Coefficients

r = 0.9999 (lower line), 0.9000 (upper line)

100

90

80

70

60

50

40

30

20

10

0

0 1 2 3 4 5 6 7 8

r = 0.9999 (lower line), 0.00003 (upper line)

100

90

80

70

60

50

40

30

20

10

0

0 1 2 3 4 5 6 7 8

v1.01, June 29 2017, Four Formulas for Correlation Coefficient, J. Zorich, www.johnzorich.com p. 6

3. Concluding Remarks

In Galton's words, two variables "are said to be co-related when the variation of the one is

accompanied on the average by more or less variation of the other". He also said that the

correlation coefficient is a "simple number" that expresses..."the nearness with which they vary

together" (Galton 1888, pp. 135-136, emphases added). Looking only at Figures 3 and 4, it is

tempting to agree with Galton's definition that the correlation coefficient is a "measure of the

closeness of the co-relation" (Galton 1888, p. 140; emphases added), because in each of those

figures the X,Y data-set that displays the larger correlation coefficient (i.e., the lower line on

each chart) also displays much more "closeness" and "nearness" to its line than does the other

data-set. However, that definition cannot be used to explain why the correlation coefficients in

Figure 2 are identical even though the upper line on the chart was produced with a data-set that

displays much less "closeness" and "nearness" than does the other data-set.

Therefore, it might have been better for Galton to have defined the correlation coefficient as:

A measure of the closeness of the co-relation when comparing linear regression plots of

X,Y data-sets that have the same regression coefficient (i.e., same slope).

That definition helps explain Figures 3 and 4. In order to also help explain Figures 1 and 2, it

might have been even better to define the correlation coefficient as:

The fraction of the observed co-variation between the X,Y variables that can be explained

by a perfectly linear relationship between them.

To help explain any of those figures, the Standard Formula can be used. However, these four

other formulas dissect out the correlation coefficient's meaning so that it can be grasped more

easily by novice students.

REFERENCES:

1. Galton, F. (1888), "Co-relations and their Measurement, chiefly from Anthropometric Data",

Proceedings of the Royal Society, 45, 135-145.

2. Galton, F. (1889), "Correlations and their Measurement, chiefly from Anthropometric Data",

Nature, 39, 238.

3. Galton, F. (1890), "Kinship and Correlation", North American Review, 150, 419-431

5. Rogers, J. L., and Nicewander, W. (1988), "Thirteen Ways to Look at the Correlation

Coefficient", The American Statistician, 42:1, 59-66.

Correlation, Journal of Educational Psychology, XVII, 458-469.

v1.01, June 29 2017, Four Formulas for Correlation Coefficient, J. Zorich, www.johnzorich.com p. 7

Four Formulas for Teaching the Meaning

of the Correlation Coefficient

John Zorich

Ohlone Community College

Almost a century ago, Symonds (1926) published a paper that Four Simple and Instructive Formulas

contained 52 different formulas for calculating the product- An unnecessary obstacle to simplifying the formula for the

moment (Pearson) coefficient of correlation (a.k.a., correla- correlation coefficient is the commonly held belief that the

tion coefficient, index of co-relation, or simply r); decades formula must output the coefficients sign, which is said to in-

before computers, he focused on formulas useful for ease dicate whether the correlation is positive or negative. It can be

of computation. Rodgers and Nicewander (1988) included argued that the sign is irrelevant and that correlation is neither

13 such formulas, where each formula suggests a different positive nor negative, as shown by the following formula and

way of thinking about the correlation coefficient; the authors discussion (Rogers & Nicewander, 1988, p. 62):

focused on interesting formulas (p. 59), and some of their

Sx

13 formulas are not found among Symonds 52. This present r =b , (1)

Sy

article focuses on four formulas useful for teaching the corre-

lation coefficients meaning to novice students; one of the four where

is not found in either of those previous articles.

Which characteristics of a formula are useful for explain- b = slope of the least-squares linear regression

ing a difficult concept? At the top of such a list would certainly line, using Y on the vertical axis and X on the

be that it is both simple and instructive. For example, the con- horizontal; the linear regression equation is

cept of momentum, as taught in introductory physics classes, typically given as Y= a + bX .

is explained using S y = standard deviation of the plotted vertical-axis

sample-data values.

Momentum

= Mass Velocity.

S x = standard deviation of the plotted horizontal-

Contrast that with the following standard formula for the cor- axis sample-data values.

relation coefficient that is found most prominently in many in- As seen in formula 1, the sign of the correlation coeffi-

troductory statistics textbooks published in the past 100 years: cient is always the same as that of the slope, which is typically

r=

( X i )(

X Yi Y ) .

referred to as the regression coefficient. The sign of the regres-

sion coefficient (i.e., the b in Y= a + bX ) tells us whether

( X ) (Y Y )

2 2

i X i the relationship between X and Y is negative or positive (i.e.,

whether the plotted linear regression line is an increasing or

Having taught the correlation coefficient in classrooms decreasing function); the sign of the correlation coefficient

and seminars for many years, this author concludes that the provides no additional information. If the world had never

standard formula does not belong in introductory statistics heard of the correlation coefficients sign, nothing would have

textbooks, since it is far from simple and provides virtually been lost, and the gain would have been many fewer con-

no instructive value to the average novice student, who can fused students.

be more confused than educated by it. Instead, introduc- The next formula is one of Symonds (1926). However, it

tory textbooks should include one or more of the following was printed incorrectly, with the square root sign missing. The

four formulas. correct formula (Rodgers & Nicewander, 1988, p. 62) is:

r = bX bY , (2)

- b104Uploaded bySuperb Heart
- A Comparative Study of Women Employees on Job Satisfaction Reflections From Banks and FMCG CompaniesUploaded byarcherselevators
- 19. Moha Asri-finalUploaded byAhmed Mounir Ghoneim
- HSC COMMERCE MATHS II ANSWER SHEET .pdfUploaded byNamdeo Jadhav
- Corelation and RegressionUploaded byJaydev Chakraborty
- WorksheetUploaded byfieqa
- debra lesson plan feb 3Uploaded byapi-286186447
- Chapter 3 Research Methodology AlUploaded byMar Ordanza
- Some Mental Abilities Related to the Discus AchievementUploaded byThe Swedish Journal of Scientific Research (SJSR) ISSN: 2001-9211
- paper_ed6_33-1Uploaded bybhemalee
- PS-11 CT Calcs-Ver 2-6 Misc.xlsUploaded byMaria Saucedo Sanchez
- A Seminal Case Study on Application of Last Planner System with Cash Flow Data for Improvement in Construction Management PracticesUploaded bySEP-Publisher
- Information Behaviour and Socio-Economic Empowerment of Textile Market Women in Southern NigeriaUploaded byIOSRjournal
- 24-79-2-PBUploaded byEno Eno Csid
- Computational Methods in LinguisticsUploaded bylarsen.mar
- math 1040 final projectUploaded byapi-242645250
- Comparacion AADEE_AVL Version Ingles NewUploaded byShereen Salah
- Quantitative Analysis ForecastingUploaded byaerogem
- 05 Describing Relationship Between VariablesUploaded byJarevic Cayabyab
- Pre AssessmentUploaded byaboulazhar
- anecdotalrecord-160128192256Uploaded byNadiya Rashid
- MTH302 MCQsUploaded byTaimoor Sultan
- Proiect econometrieUploaded byDenisa Andreea
- Introduction to SPSS part1Uploaded byIdhaniv
- Dexter et al 2009Uploaded bySedPaleo
- Portfolio Risk & ReturnUploaded byAnonymous
- Chapter 4 Empirical Findings Introduction All the Data for 170Uploaded byfranklib
- employee wwelfare projectUploaded byVijay Kishan
- Zh 32265269Uploaded byAJER JOURNAL
- The_Role_of_Project_Management_Information_Systems_towards_The_Success_of_a_Project_The_Case_of_Construction_Projects_in_Nairobi_Kenya.pdfUploaded byNahidul Islam

- Tirage Au Sort de LotsUploaded byAli Mhamadi AliHamada Chahidi
- ADMN-P-04 Vehicle Maintenance ProcessUploaded byAdãodaLuz
- Rolling MTBF Calculation ExampleUploaded byAdãodaLuz
- 137736819 Modelo de Relatorio de ManutencaoUploaded byAdãodaLuz
- Self-Made Sampling Plans v2Uploaded byAdãodaLuz
- CALCULO MTBF (2)Uploaded byAdãodaLuz
- sam-htasUploaded byAdãodaLuz
- Lean demoUploaded byapi-3835934
- Fab Time Characteristic Curve GeneratorUploaded byAdãodaLuz
- an5426-baseband-calculator.xlsUploaded byAdãodaLuz
- Maintenance ROI CalculatorUploaded byAdãodaLuz
- Organogram Maintenance DeptUploaded byAdãodaLuz
- Catalogo Dashboards HighUploaded byAdãodaLuz
- PLANTILA INFORME DE MANTENIMIENTO CATENARIA.xlsUploaded byAdãodaLuz
- Tirage Au Sort de LotsUploaded byAdãodaLuz
- Cont Man ManutençãoUploaded byAdãodaLuz
- Gestão Da Manutenção - Ordens de ServiçosUploaded byAdãodaLuz
- PythagoreUploaded byAdãodaLuz
- Six Sigma Template KitUploaded byAdãodaLuz
- SamUploaded byAdãodaLuz
- Plantila Informe de Mantenimiento CatenariaUploaded byAdãodaLuz
- Mapa Geral de Preventiva 2007 - 240507_N3Uploaded byAdãodaLuz
- Copy of 1581080803 IndicatorsUploaded byAdãodaLuz
- MTTF Trend ExampleUploaded byAdãodaLuz
- MTTF Control 1Uploaded byAdãodaLuz
- Total Cost Per Product CPP ExampleUploaded byAdãodaLuz
- Calcul Intervalle DatesUploaded byAdãodaLuz
- Cost Weighted OEE TemplateUploaded byAdãodaLuz
- Skm Model Target Availability Model Hvdc InterconnectorsUploaded byAdãodaLuz

- can co2 emissions affect how long you live regression kikerUploaded byapi-300088087
- Corelatii Ale Job Satisfaction....Uploaded byRoxana Grigoras
- Bray 24 May 2002 Arias Intensity AttenuationUploaded byAkhilamanne
- investigaciones zincUploaded byguisely
- Bielak, 2006Uploaded byDiego Maximiliano Herrera
- Inflation and the PoorUploaded byAndi Alimuddin Rauf
- 89518303 Public Finance and Public Policy Solutions ManualUploaded byJustin Maillis
- Ada 247020Uploaded byashofon
- 7 Step Problem SolvingUploaded byandy_wang1972
- valuation of reactive powerUploaded byapi-3697505
- Leverage and its Impact on Earnings and Share PriceUploaded byeditor_ijtel
- Writing Test ItemsUploaded byDia Pasere
- CorrelationUploaded byDaasshhvviinnaa
- Regression and Multivariate analysisUploaded byKesav Kumar
- NURSING DATA MANAGEMENT.pdfUploaded byMarvin Calma
- beverenUploaded byLennard Pang
- 13 Yi Ching TsaiUploaded byKhushi Awan
- hookworm article (journal)Uploaded byBeverly Flair Gerodias
- Ratfish InfoUploaded bysmartinuf
- Speed DelayUploaded bySandy Sandeep
- Hershcovis_and_Barling_2010_JOB.pdfUploaded byMuhammad Salim Ullah Khan
- Economic Development and the Impacts of Natural Disasters. Economics Letters 94.Uploaded bymichael17ph2003
- 2014 - Eight- No Nine - Problems With Big DataUploaded byAnuraagKashyap
- Quality matters.pdfUploaded byAnonymous BQcnG1
- Schonemann. On models and muddles of heritability. Article.pdfUploaded byMartínez Horacio
- Analyzing neural responses to natural signals.pdfUploaded bybillpetrrie
- M.phil Thesis ManagementUploaded bySankar Covai
- International Journal of Adolescence and YouthUploaded bysorinaiulia
- Chapter 1Uploaded byduniahusna
- SSGB Question Paper 21 June 15 DelhiUploaded bykapindra