Statistics, part 2: Regression Analysis and SPSS

Correlation
(syllabus chapter 2)
Bjorn Winkens
Methodology and Statistics, University of Maastricht
Bjorn.Winkens@stat.unimaas.nl
11 April 2008

Content
• Covariance and correlation
• Pearson correlation coefficient
• Tests and confidence interval for correlations
• Spearman correlation
• Pitfalls

Association
Study goal = examine the association between two variables
Some questions arise:
• What measure of association should we use?
• Is there a positive or negative association?
• Is there a linear association?
• Is there a significant association?

Covariance (1)
= measure of how much two random variables vary together
• Difference with variance?
• Formula:

  \mathrm{cov}(X, Y) = \frac{\sum_i (x_i - \bar x)(y_i - \bar y)}{n - 1}

• Note: cov(X,X) = var(X)
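A quick numerical illustration of this formula, as a sketch assuming Python with NumPy (the height/weight values below are made up):

```python
import numpy as np

def covariance(x, y):
    """Sample covariance: sum of (x_i - x_bar)(y_i - y_bar), divided by n - 1."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    return np.sum((x - x.mean()) * (y - y.mean())) / (len(x) - 1)

height = [160, 172, 181, 190, 202]   # cm (made-up values)
weight = [55, 68, 77, 84, 98]        # kg (made-up values)

print(covariance(height, weight))            # hand-rolled version
print(np.cov(height, weight, ddof=1)[0, 1])  # NumPy's built-in, same result
print(covariance(height, height))            # cov(X, X) ...
print(np.var(height, ddof=1))                # ... equals var(X)
```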

Covariance (2)
Example:
• X = height (cm), Y = weight (kg)
• Positive or negative covariance?
• x̄ = 181 cm, ȳ = 76.5 kg
• Cov(X, Y) = 35.0
  – Positive association
  – Strong or weak?
• X* = height in metres:
  – Cov(X*, Y) = 0.35
[Scatterplot: weight (kg) versus height (cm), with the quadrants around (x̄, ȳ) marked + and −]

Correlation (1)
= measure of linear association between two random variables
• Notation:
  – population: ρ (rho)
  – sample: r
• Can take any value from -1 to 1
• Closer to -1: stronger negative association
• Closer to +1: stronger positive association

Correlation (2)
• Pearson's correlation coefficient:

  r = \frac{\mathrm{cov}(X, Y)}{s_X s_Y} = \frac{\sum_i (x_i - \bar x)(y_i - \bar y)}{\sqrt{\sum_i (x_i - \bar x)^2 \, \sum_i (y_i - \bar y)^2}}

• No dimension
• Invariant under linear transformations
Example (X = height (cm), X* = height (m), Y = weight (kg)):
• Corr(X, Y) = r = 0.38
• Corr(X*, Y) = r = 0.38
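The same made-up data can be used to check the two properties above, namely that r is dimensionless and does not change under a linear transformation such as converting cm to m (again a sketch assuming NumPy):

```python
import numpy as np

def pearson_r(x, y):
    """Pearson correlation: cov(X, Y) divided by the product of the standard deviations."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    dx, dy = x - x.mean(), y - y.mean()
    return np.sum(dx * dy) / np.sqrt(np.sum(dx**2) * np.sum(dy**2))

height_cm = [160, 172, 181, 190, 202]     # made-up values
weight_kg = [55, 68, 77, 84, 98]
height_m = [h / 100 for h in height_cm]   # linear transformation of X

print(pearson_r(height_cm, weight_kg))    # some value between -1 and 1
print(pearson_r(height_m, weight_kg))     # identical: r is scale-free
```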

Practical examples (1)
[Scatterplot: FEV (l) versus height (in); strong positive correlation, r = 0.9]
[Scatterplot: serum cholesterol (mg/dL) versus dietary intake of cholesterol; weak positive correlation, r = 0.3]

Practical examples (2)
[Scatterplot: FEV (l) versus number of cigarettes per day; weak negative correlation, r = -0.2]
[Scatterplot: Difficulty Numerical Task (DNT) versus caffeine; correlation r = 0.0, no association?]

Practical examples (3)
• r = 0.6
• Straight line appropriate?
[Scatterplot: Y versus X]
Always check linearity with a (scatter)plot!

Size does not matter, shape is important

Test for correlation coefficient(s)
1. One-sample t-test: H0: ρ = 0
2. One-sample z-test: H0: ρ = ρ0
   • Fisher's z-transformation
   • Confidence interval for ρ
3. Two-sample z-test: H0: ρ1 = ρ2 (independent samples)

One-sample t-test: H0: ρ = 0
Example: Is there a correlation between serum-cholesterol levels in spouses?
• X = serum-cholesterol of the husband (normally distributed)
• Y = serum-cholesterol of the wife (normally distributed)
• H0: ρ = 0, H1: ρ ≠ 0
• t-test:

  t = r \sqrt{\frac{n - 2}{1 - r^2}}

  which has a t-distribution with df = n - 2 when H0: ρ = 0 is true

Example: serum-cholesterol (1)
• n = 100 spouse pairs
• Pearson's correlation coefficient r = 0.25
• Is this correlation large enough to reject H0: ρ = 0?
• t-test:

  t = 0.25 \sqrt{\frac{100 - 2}{1 - 0.25^2}} = 2.56

• Conclusion?

Example: serum-cholesterol (2)
• Two-sided p-value: p = 2 × 0.006 = 0.012
• Conclusion?
[t98 distribution: P(t98 ≤ -2.56) = 0.006 and P(t98 ≥ 2.56) = 0.006]
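A sketch of this t-test in Python (SciPy assumed; the function name is my own). With the values above, r = 0.25 and n = 100, it reproduces t ≈ 2.56 and a two-sided p-value of about 0.012:

```python
from math import sqrt
from scipy import stats

def corr_t_test(r, n):
    """One-sample t-test of H0: rho = 0 for a Pearson correlation r based on n pairs."""
    t = r * sqrt((n - 2) / (1 - r**2))
    p = 2 * stats.t.sf(abs(t), df=n - 2)   # two-sided p-value
    return t, p

t, p = corr_t_test(r=0.25, n=100)
print(t, p)   # roughly 2.56 and 0.012
```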

Be aware!
• Significance depends on sample size:

  n                                10     20     50     100    200
  Significant (α = 0.05) if r ≥    0.63   0.44   0.28   0.20   0.14
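These cut-offs follow from inverting the t-statistic above: r is significant at level α (two-sided) when r ≥ t_{1-α/2, n-2} / √(t²_{1-α/2, n-2} + n − 2). A short check, again assuming SciPy:

```python
from math import sqrt
from scipy import stats

def critical_r(n, alpha=0.05):
    """Smallest |r| that is significant in the two-sided t-test of H0: rho = 0."""
    t_crit = stats.t.ppf(1 - alpha / 2, df=n - 2)
    return t_crit / sqrt(t_crit**2 + n - 2)

for n in (10, 20, 50, 100, 200):
    print(n, round(critical_r(n), 2))   # 0.63, 0.44, 0.28, 0.20, 0.14
```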

Example: Estriol – SPSS (1)
Example: Is there an association between estriol level and birthweight?
Sample: n = 31
[Scatterplot: birthweight (g) versus estriol (mg/24 hr)]

Example: Estriol – SPSS (2)
SPSS output:

  Correlations
                                        Estriol    Birthweight
  Estriol       Pearson Correlation     1          .610**
                Sig. (2-tailed)                    .000
                N                       31         31
  Birthweight   Pearson Correlation     .610**     1
                Sig. (2-tailed)         .000
                N                       31         31
  **. Correlation is significant at the 0.01 level (2-tailed).

• Conclusion?
• H0: ρ = 0.1? H0: ρ = 0.3?
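Outside SPSS, the same quantities (r, two-tailed p, N) can be obtained with SciPy's pearsonr. The data below are made-up stand-ins, since the raw estriol/birthweight values are not listed on the slide; only the call pattern is the point:

```python
from scipy import stats

# Made-up example pairs (NOT the real estriol data behind the SPSS output above).
estriol = [7, 9, 12, 14, 16, 17, 19, 21, 24, 25]            # mg/24 hr
birthweight = [2500, 2700, 2600, 3000, 3100, 3000,
               3400, 3300, 3600, 3900]                      # g

r, p = stats.pearsonr(estriol, birthweight)   # Pearson r and two-sided p-value
print(f"r = {r:.3f}, p = {p:.3f}, N = {len(estriol)}")
```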

Test for correlation coefficient(s)
1. One-sample t-test: H0: ρ = 0
2. One-sample z-test: H0: ρ = ρ0
   • Fisher's z-transformation
   • Confidence interval for ρ
3. Two-sample z-test: H0: ρ1 = ρ2 (independent samples)

One-sample z-test: H0: ρ = ρ0 (1)
• If ρ0 ≠ 0, r has a skewed distribution
  – e.g. H0: ρ = 0.5: more "room" for deviation below 0.5 than above 0.5
  – the previous t-test for correlations is then invalid!
• Solution: Fisher's z-transformation of r

  z = \frac{1}{2} \ln\!\left(\frac{1 + r}{1 - r}\right)

• ln = natural logarithm (base e = 2.718)
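The transformation itself is one line of code; numerically it equals the inverse hyperbolic tangent, so r = 0.8 gives z ≈ 1.1, while for small r the z-score is close to r (a minimal sketch in Python):

```python
from math import atanh, log

def fisher_z(r):
    """Fisher's z-transformation: 0.5 * ln((1 + r) / (1 - r))."""
    return 0.5 * log((1 + r) / (1 - r))

print(fisher_z(0.8), atanh(0.8))   # both about 1.10: the transform is just atanh(r)
print(fisher_z(0.1))               # about 0.10: for small r, z is close to r
```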

Fisher's z-transformation
[Plot of z against r; e.g. r = 0.8 corresponds to z = 1.1, while for small r (e.g. r = 0.05 or 0.1) z ≈ r]

One-sample z-test: H0: ρ = ρ0 (2)
• z is approximately normally distributed under H0 with mean

  z_0 = \frac{1}{2} \ln\!\left(\frac{1 + \rho_0}{1 - \rho_0}\right)

  and variance 1/(n - 3)
• Equivalently, λ = (z − z0)√(n − 3) ~ N(0, 1)

One-sample z-test: H0: ρ = ρ0 (3)
In conclusion:
• H0: ρ = ρ0 (≠ 0), H1: ρ ≠ ρ0
• Compute the sample correlation coefficient r
• Transform r and ρ0 to z and z0, respectively, using Fisher's z-transformation
• Compute the test statistic λ = (z − z0)√(n − 3)
• Compute the p-value (λ ~ N(0, 1))
• Not (yet) available in SPSS!!!

Example: Body weight (1)
Research question: Is the association between the body weights of fathers and sons different for biological than for non-biological fathers?
Previous research: a correlation of 0.1 is expected, based on previous research with sons and non-biological fathers
Sample:
• n = 100 biological fathers and sons
• Pearson's correlation coefficient r = 0.38

Example: Body weight (2)
• H0: ρ = ρ0 = 0.10, H1: ρ ≠ 0.10
• r = 0.38: z = 0.5 ln(1.38/0.62) = 0.40
• ρ0 = 0.10: z0 = 0.5 ln(1.10/0.90) = 0.10
• λ = (0.40 − 0.10) √(100 − 3) = 2.955
• p-value = 0.0031
• Conclusion?
• Confidence interval?
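A sketch of the whole procedure in Python (SciPy assumed; as noted above, the test is not available in SPSS). With the values from this example, r = 0.38, ρ0 = 0.10 and n = 100, it reproduces λ ≈ 2.95 and p ≈ 0.003:

```python
from math import log, sqrt
from scipy import stats

def fisher_z(r):
    return 0.5 * log((1 + r) / (1 - r))

def corr_z_test(r, rho0, n):
    """One-sample z-test of H0: rho = rho0 via Fisher's z-transformation."""
    lam = (fisher_z(r) - fisher_z(rho0)) * sqrt(n - 3)   # approximately N(0, 1) under H0
    p = 2 * stats.norm.sf(abs(lam))                      # two-sided p-value
    return lam, p

lam, p = corr_z_test(r=0.38, rho0=0.10, n=100)
print(lam, p)   # roughly 2.95 and 0.003
```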

Confidence interval for ρ
Step 1: compute the sample correlation r
Step 2: transform r to a Fisher z-score (z)
Step 3: compute a 100% × (1 − α) CI for zρ:

  z_1 = z - \frac{z_{1-\alpha/2}}{\sqrt{n - 3}}, \quad z_2 = z + \frac{z_{1-\alpha/2}}{\sqrt{n - 3}}

Step 4: transform this CI back to a CI for ρ:

  \rho_1 = \frac{e^{2 z_1} - 1}{e^{2 z_1} + 1}, \quad \rho_2 = \frac{e^{2 z_2} - 1}{e^{2 z_2} + 1}

Example: Body weight (3)
95% confidence interval for ρ:
Step 1: sample correlation r = 0.38 (n = 100)
Step 2: z = 0.5 ln(1.38/0.62) = 0.40
Step 3: z1 = 0.40 − 1.96/√97 = 0.20, z2 = 0.40 + 1.96/√97 = 0.60
Step 4: ρ1 = (e^{2·0.20} − 1)/(e^{2·0.20} + 1) = 0.20, ρ2 = (e^{2·0.60} − 1)/(e^{2·0.60} + 1) = 0.54
Conclusion?

Example: Body weight (4)
95% CI for ρ:
1. Compute r: r = 0.38
2. Transform to a z-score: z = 0.40
3. Compute the CI for zρ: (0.20, 0.60)
4. Transform the CI for zρ back to a CI for ρ: (0.20, 0.54)
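The four steps translate directly into code (a sketch, SciPy assumed). The back-transformation in step 4 is simply tanh(z), since (e^{2z} − 1)/(e^{2z} + 1) = tanh(z); for r = 0.38 and n = 100 the function returns approximately (0.20, 0.54):

```python
from math import log, sqrt, tanh
from scipy import stats

def corr_confidence_interval(r, n, alpha=0.05):
    """CI for rho: Fisher-transform r, build a normal CI on the z-scale, transform back."""
    z = 0.5 * log((1 + r) / (1 - r))                          # step 2
    half_width = stats.norm.ppf(1 - alpha / 2) / sqrt(n - 3)
    z1, z2 = z - half_width, z + half_width                   # step 3
    return tanh(z1), tanh(z2)                                 # step 4

print(corr_confidence_interval(r=0.38, n=100))                # roughly (0.20, 0.54)
```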

Test for correlation coefficient(s)
1. One-sample t-test: H0: ρ = 0
2. One-sample z-test: H0: ρ = ρ0
   • Fisher's z-transformation
   • Confidence interval for ρ
3. Two-sample z-test: H0: ρ1 = ρ2 (independent samples)

Example: Body-weight (1) – different design –
Research question: Is the association between the body weights of fathers and sons different for biological than for non-biological fathers?
No previous research.
Two samples:
• First group (biological): n1 = 100, r1 = 0.38
• Second group (non-biological): n2 = 50, r2 = 0.10

Two-sample z-test: H0: ρ1 = ρ2
• Samples:
  – group 1: sample size n1, correlation r1, Fisher z-score z1
  – group 2: sample size n2, correlation r2, Fisher z-score z2
• Test statistic:

  \lambda = \frac{z_1 - z_2}{\sqrt{\frac{1}{n_1 - 3} + \frac{1}{n_2 - 3}}}

• λ is approximately N(0, 1)-distributed under H0
• Compare with the one-sample z-test above

Example: Body-weight (2) – different design –
• Samples:
  – Group 1 (biological): n1 = 100, r1 = 0.38
  – Group 2 (non-biological): n2 = 50, r2 = 0.10
• Fisher's transformation:
  – z1 = 0.5 ln(1.38/0.62) = 0.40
  – z2 = 0.5 ln(1.10/0.90) = 0.10
• Test statistic:

  \lambda = \frac{0.40 - 0.10}{\sqrt{\frac{1}{97} + \frac{1}{47}}} = 1.69

• p-value = 0.091
• Conclusion?
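The two-sample version as a sketch (SciPy assumed); with n1 = 100, r1 = 0.38, n2 = 50 and r2 = 0.10 it reproduces λ ≈ 1.69 and p ≈ 0.09:

```python
from math import log, sqrt
from scipy import stats

def fisher_z(r):
    return 0.5 * log((1 + r) / (1 - r))

def two_sample_corr_z_test(r1, n1, r2, n2):
    """Two-sample z-test of H0: rho1 = rho2 for two independent correlations."""
    lam = (fisher_z(r1) - fisher_z(r2)) / sqrt(1 / (n1 - 3) + 1 / (n2 - 3))
    p = 2 * stats.norm.sf(abs(lam))   # two-sided p-value
    return lam, p

print(two_sample_corr_z_test(r1=0.38, n1=100, r2=0.10, n2=50))   # roughly 1.69 and 0.09
```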

Rank correlation (1)
• So far it was assumed that X and Y are normally distributed
• If X and/or Y are either ordinal or have a distribution far from normal (due to outliers), significance tests based on the Pearson correlation coefficient are no longer valid
• A non-parametric alternative should then be used, for example a test based on the Spearman rank correlation coefficient

Rank correlation (2)
Spearman's rank correlation coefficient:
= Pearson's correlation coefficient based on the ranks of X and Y
• Less sensitive to outliers; measures a more general association (not specifically linear)
• n ≥ 10 (or 30): similar tests and CIs as for the Pearson correlation
• n < 10 (or 30): exact significance levels can be found in a table
• Many ties (same value): use Kendall's tau
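SciPy provides both coefficients directly; a short sketch with made-up paired scores (for small n, as noted above, the reported p-values are only approximate):

```python
from scipy import stats

# Made-up paired scores, just to show the calls.
x = [2, 4, 5, 5, 6, 7, 8, 9, 9, 10]
y = [1, 3, 4, 6, 5, 7, 7, 8, 10, 9]

rho, p_rho = stats.spearmanr(x, y)    # Spearman: Pearson r computed on the ranks
tau, p_tau = stats.kendalltau(x, y)   # Kendall's tau: preferable with many ties
print(rho, p_rho)
print(tau, p_tau)
```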

Normality check (1)
• Use P-P plots and histograms to check normality (symmetry)
• Problem with (significance) tests for normality:
  – Small sample size: no or little power to detect a discrepancy from normality
  – Medium or large sample size: deviations from normality have no or only a small impact (central limit theorem), yet tests will often flag them
• Data skewed (outliers) & small sample size: consider a data transformation

Normality check (2)
Be aware: significance depends on sample size!
[Two histograms of the same outcome variable, based on a small and a larger sample; Shapiro-Wilk: p = 0.961 versus p = 0.039]
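The sample-size effect is easy to reproduce with the Shapiro-Wilk test in SciPy on simulated, mildly skewed data (a sketch; the exact p-values will differ from those on the slide):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
outcome = rng.gamma(shape=4.0, scale=1.0, size=500)   # mildly skewed "outcome"

w_small, p_small = stats.shapiro(outcome[:15])   # small sample: little power, usually not significant
w_large, p_large = stats.shapiro(outcome)        # larger sample: same shape, usually significant
print(p_small, p_large)
```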

Example: Apgar scores
• Apgar score (physical condition) at 1 and 5 minutes for 24 newborns
• Minimal score = 0, maximal score = 10
• Spearman rank correlation = 0.593 (Pearson's correlation = 0.845)
• t-test for the Spearman rank correlation: t = 3.45, df = 24 − 2 = 22, p-value < 0.01
• Conclusion?
• Remarks?

Pitfalls
• Spurious correlations
• Correlation is not a measure of agreement
• Change scores (Y − X) are always related to the baseline X ("regression to the mean")
• Dependent pairs of observations (xi, yi)
• …
Note:
• No mathematical problem
• But the interpretation is incorrect

Dependent pairs of observations
• Association between study duration and grade
• Plot 1: dependency ignored: negative association
• Plot 2: dependency taken into account (data from the same subject connected): positive association
[Two scatterplots of grade versus study duration]
Students were measured twice!!!

Relation between two variables
Three main purposes:
• Association
  – Pearson or Spearman correlation coefficient
• Agreement (same quantity: X = Y)
  – Method of Bland and Altman (Lancet, 1986)
• Prediction
  – Regression analysis

QUESTIONS?