Correlation Analysis
• the degree to which two variables are associated
• strength of the relationship (correlation)
between two variables
• may be either positive or negative.
• Its magnitude does not depend on the units of
measurement.
• Assumes the data are from a bivariate normal
population.
• Does not necessarily imply causation
Correlation Analysis
Data: Quantitative (numerical) or Qualitative (categorical)
Scatter Plots
• A scatter plot is a graph of the ordered
pairs (x, y) of numbers consisting of the
independent variable, x, and the
dependent variable, y.
Scatter Plots - Example
• Construct a scatter plot for the data obtained in a
study of age and systolic blood pressure of six
randomly selected subjects.
• The data is given on the next slide.
Scatter Plots - Example
Subject Age, x Pressure, y
A 43 128
B 48 120
C 56 135
D 61 143
E 67 141
F 70 152
Scatter Plots - Example
Positive Relationship
[Scatter plot: Pressure (vertical axis, 120 to 150) against Age (horizontal axis, 40 to 70)]
Scatter Plots – Other Example
Negative Relationship
[Scatter plot: Final grade (vertical axis, 40 to 90) against Number of absences (horizontal axis, 5 to 15)]
Scatter Plots – Other Example
No Relationship
[Scatter plot: y (vertical axis, 0 to 10) against x (horizontal axis, 0 to 70)]
Scatter Plot
[Three scatter-plot panels, each plotting y2 (vertical axis) against y1 (horizontal axis, -2 to 3)]
Correlation Coefficient Range
Coefficient Range    Strength of Relationship
0.00 - 0.20          Very Low
0.20 - 0.40          Low
0.40 - 0.60          Moderate
0.60 - 0.80          High Moderate
0.80 - 1.00          Very High
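Correlation Coefficient
Pearson:
$$r_{xy} = \frac{\sum xy - \frac{\sum x \sum y}{n}}{\sqrt{\left(\sum x^2 - \frac{(\sum x)^2}{n}\right)\left(\sum y^2 - \frac{(\sum y)^2}{n}\right)}}$$
Spearman (rank):
$$r = 1 - \frac{6 \sum d^2}{n(n^2 - 1)}$$
“d” is the difference between the ranks of the two variables.
The correlation coefficient is a measure of the intensity of the linear
association between variables.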
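As a minimal sketch, both coefficients can be computed for the age and systolic blood pressure data from the scatter plot example. The function names are illustrative choices, and the ranking helper assumes there are no tied values (true for this data set).

```python
from math import sqrt

# Age (x) and systolic blood pressure (y) for subjects A to F
ages = [43, 48, 56, 61, 67, 70]
pressures = [128, 120, 135, 143, 141, 152]

def pearson_r(x, y):
    """Pearson correlation using the sum-based formula (sums divided by n)."""
    n = len(x)
    sum_x, sum_y = sum(x), sum(y)
    s_xy = sum(a * b for a, b in zip(x, y)) - sum_x * sum_y / n
    s_xx = sum(a * a for a in x) - sum_x ** 2 / n
    s_yy = sum(b * b for b in y) - sum_y ** 2 / n
    return s_xy / sqrt(s_xx * s_yy)

def ranks(values):
    """Rank values from 1 (smallest) to n; assumes no ties."""
    ordered = sorted(values)
    return [ordered.index(v) + 1 for v in values]

def spearman_r(x, y):
    """Spearman rank correlation: 1 - 6 * sum(d^2) / (n * (n^2 - 1))."""
    n = len(x)
    d_squared = sum((rx - ry) ** 2 for rx, ry in zip(ranks(x), ranks(y)))
    return 1 - 6 * d_squared / (n * (n ** 2 - 1))

r = pearson_r(ages, pressures)
rs = spearman_r(ages, pressures)
print(f"Pearson r  = {r:.3f}")
print(f"Spearman r = {rs:.3f}")
```

Both values come out close to 0.9, which matches the positive relationship visible in the scatter plot.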
Notes of Caution
1. An observed relationship between two
variables does not imply that there is some
causal link between the two variables.
[Figure: scatter plot of IQ against shoe size]
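2. A relationship between two variables can be
influenced by confounding variables.
[Figure: number of haircuts in a year against height, with men and women marked separately]
Pooled over both genders, the number of haircuts and height
appear to be associated. Within each gender, however, there does not
appear to be an association.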
Any study, especially an observational study, has
the potential to be wrongly interpreted because
of confounding variables.
Interpreting Correlation
• Correlation is a mathematical concept which
describes a pattern of numbers
• As such it may NOT help to understand what is
going on in the real world
• Does NOT tell why one variable is correlated with
another
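Computational Formula for
Correlation
• By substituting and rearranging, you obtain a
substantial (and not very transparent) formula
for $r_{xy}$:
$$r_{xy} = \frac{N \sum XY - \sum X \sum Y}{\sqrt{\left[N \sum X^2 - \left(\sum X\right)^2\right]\left[N \sum Y^2 - \left(\sum Y\right)^2\right]}}$$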
Computing a correlation
Cigarettes (X)   X²     XY     Y²     Lung Capacity (Y)
0                0      0      2025   45
5                25     210    1764   42
10               100    330    1089   33
15               225    465    961    31
20               400    580    841    29
Sums: ΣX = 50, ΣX² = 750, ΣXY = 1585, ΣY² = 6680, ΣY = 180
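Computing a Correlation
$$r_{xy} = \frac{(5)(1585) - (50)(180)}{\sqrt{\left[(5)(750) - 50^2\right]\left[(5)(6680) - 180^2\right]}}$$
$$= \frac{7925 - 9000}{\sqrt{(3750 - 2500)(33400 - 32400)}}$$
$$= \frac{-1075}{\sqrt{(1250)(1000)}} = -.9615$$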
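The same arithmetic can be checked with a short script. This is a sketch that applies the computational formula to the cigarettes and lung capacity data above; the function and variable names are illustrative.

```python
from math import sqrt

cigarettes = [0, 5, 10, 15, 20]         # X
lung_capacity = [45, 42, 33, 31, 29]    # Y

def pearson_computational(x, y):
    """Pearson r from raw sums: N, sum X, sum Y, sum XY, sum X^2, sum Y^2."""
    n = len(x)
    sum_x, sum_y = sum(x), sum(y)
    sum_xy = sum(a * b for a, b in zip(x, y))
    sum_x2 = sum(a * a for a in x)
    sum_y2 = sum(b * b for b in y)
    numerator = n * sum_xy - sum_x * sum_y
    denominator = sqrt((n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2))
    return numerator / denominator

r_xy = pearson_computational(cigarettes, lung_capacity)
print(f"r_xy = {r_xy:.4f}")
```

The negative sign reflects that lung capacity falls as the number of cigarettes rises.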
Coefficient of Determination
• Allows us to be precise when we interpret a
correlation coefficient
• Tells us how much of the variance in one variable is
accounted for by the variance in the other
variable
• Expressed as a percentage (%)
Selected Values of r and r²
r     r²     % accounted for     % not accounted for
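A table of this kind can be generated directly, since r² is simply the square of r. In the sketch below the listed r values are arbitrary illustrative choices, not necessarily those on the original slide; the last one is the cigarette example's correlation.

```python
def determination_row(r):
    """Return (r, r squared, % accounted for, % not accounted for)."""
    r_squared = r ** 2
    accounted = 100 * r_squared
    return r, r_squared, accounted, 100 - accounted

# Illustrative r values (assumed), plus r from the cigarette example
selected = (0.1, 0.3, 0.5, 0.7, 0.9, -0.9615)

print(f"{'r':>8} {'r^2':>8} {'% acc.':>8} {'% not':>8}")
for r in selected:
    _, r2, acc, not_acc = determination_row(r)
    print(f"{r:8.4f} {r2:8.4f} {acc:8.2f} {not_acc:8.2f}")
```

Note how quickly the explained share drops: a correlation of 0.5 accounts for only 25% of the variance.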
Regression Analysis
• It includes any techniques for modeling and
analyzing several variables, when the focus is on
the relationship between a dependent variable
and one or more independent variables.
Linear Regression Analysis
Purpose –
To determine the regression equation. It is used to predict
the value of one variable (Y, called the dependent variable)
based on another variable (X, called the independent variable).
Procedure:
• Select a sample from the population, and list the paired data
(X, Y) for each observation.
• Draw a scatter diagram to give a visual portrayal of the
relationship.
• Determine the regression equation Y = a + bX.
Regression Analysis
• Y is the average predicted value of Y for any X.
• a is the Y-intercept, or the estimated Y value when
X = 0.
• b is called the slope of the line. It is the average
change in Y for each change of one unit in X.
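• The least squares principle is used to obtain a and
b, which are given by:
$$b_{yx} = r_{xy}\,\frac{\sigma_y}{\sigma_x}, \qquad b_{xy} = r_{xy}\,\frac{\sigma_x}{\sigma_y}$$
$$a = \bar{y} - b_{yx}\,\bar{x}$$
Here $b_{yx}$ is the slope for predicting y from x, $b_{xy}$ is the slope for
predicting x from y, and $\sigma_x$, $\sigma_y$ are the standard deviations of x and y.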
Regression Example
Calculations
Temp (°C)   DO (mg/l)
5.42 15.52
6.16 15.39
6.34 15.63
6.42 15.75
6.76 15.65
6.86 14.95
6.91 15.49
7.30 14.54
7.68 14.46
7.78 14.15
8.30 14.76
8.57 14.48
8.72 13.17
Calculations
The values we need to calculate are:
1. Mean of X
2. Mean of Y
3. Sum of X
4. Sum of Y
5. Sum of X times Y (∑XY)
6. Sum of X² (∑X²)
7. Sum of Y² (∑Y²)
Note that there is a difference between the sum of X² and the sum of
X, squared:
$$\sum X^2 \neq \left(\sum X\right)^2$$
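The quantities listed above can be computed for the temperature and dissolved oxygen (DO) data as follows. This is a sketch: the slope uses the standard raw-sum form of the least-squares estimate, which is algebraically the same as r times the ratio of standard deviations, and the script checks that equivalence. Variable names are illustrative.

```python
from math import sqrt

temp = [5.42, 6.16, 6.34, 6.42, 6.76, 6.86, 6.91,
        7.30, 7.68, 7.78, 8.30, 8.57, 8.72]            # X, degrees C
do = [15.52, 15.39, 15.63, 15.75, 15.65, 14.95, 15.49,
      14.54, 14.46, 14.15, 14.76, 14.48, 13.17]        # Y, mg/l

n = len(temp)
sum_x, sum_y = sum(temp), sum(do)
sum_xy = sum(x * y for x, y in zip(temp, do))
sum_x2 = sum(x * x for x in temp)   # sum of X squared
sum_y2 = sum(y * y for y in do)     # sum of Y squared
mean_x, mean_y = sum_x / n, sum_y / n

# Least-squares slope and intercept from the raw sums
b = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
a = mean_y - b * mean_x

# Cross-check: b should equal r * (s_y / s_x)
r = (n * sum_xy - sum_x * sum_y) / sqrt(
    (n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2))
s_x = sqrt((sum_x2 - sum_x ** 2 / n) / (n - 1))
s_y = sqrt((sum_y2 - sum_y ** 2 / n) / (n - 1))
b_check = r * s_y / s_x

print(f"sum of X^2 = {sum_x2:.4f}, (sum of X)^2 = {sum_x ** 2:.4f}")
print(f"slope b = {b:.4f}, intercept a = {a:.4f}, r = {r:.4f}")
```

The slope is negative, so the fitted line predicts lower dissolved oxygen at higher temperatures.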
• Machine calculation of a and b (intercept and
slope)
EXAMPLE 1
• Develop a regression equation for the
information given in the EXAMPLE that can
be used to estimate the selling price based on
the number of pages.
• b = 0.01714, a = 16.00175.
• Y = 16.00175 + 0.01714X .
• What is the estimated selling price of a 650-
page book?
• Y = 16.00175 + 0.01714(650) = $27.14.
EXAMPLE 2
• A computer-repair technician recorded data
on the number of computers serviced and the
amount of time to complete the service for 11
randomly selected service visits. The number
of computers serviced, x, ranged from 1 to 7,
while the time to complete the service, y,
ranged from about 30 minutes to nearly 3.5
hours. The scatter plot of the data shows a
strong, positive linear relationship between
these two variables.
• Obtain the estimated linear regression
equation y = a + bx.
y = 10.19 + 24.83x
where y = service time (minutes)
x = number of computers serviced
• Predict the number of minutes required for
service when it is reported that five
computers are down.
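Assuming the fitted equation above, with y in minutes, the prediction for x = 5 works out as:

$$\hat{y} = 10.19 + 24.83(5) = 10.19 + 124.15 = 134.34 \text{ minutes}$$

That is roughly 2 hours and 14 minutes, and x = 5 lies inside the observed range of 1 to 7 computers, so the prediction is an interpolation rather than an extrapolation.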
Correlation is calculated whenever:
* both X and Y are measured in each subject and the aim is to
quantify how strongly they are linearly associated.
* Pearson's product-moment correlation coefficient is used
when the assumption that both X and Y are sampled from
normally distributed populations is satisfied,
* or Spearman's rank-order correlation
coefficient is used if the assumption of normality is
not satisfied.
* Correlation is not used when the variables are
manipulated, for example, in experiments.
With correlation you don't have to think about
cause and effect. You simply quantify how well
two variables relate to each other. With
regression, you do have to think about cause
and effect as the regression line is determined
as the best way to predict Y from X.
Line of Best Fit
• Definition - A Line of Best Fit is a straight line on a
scatter plot that comes closest to all of the dots on
the graph.
• A Line of Best Fit does not touch all of the dots.
• A Line of Best Fit is useful because it allows us to:
– Understand the type and strength of the
relationship between two sets of data
– Predict missing Y values for given X values, or
missing X values for given Y values
Age        Height
(months)   (inches)
18         76.1
19         77
20         78.1
21         ?
22         78.8
23         79.7
24         79.9
25         81.1
26         81.2
27         82.8
28         ?
29         83.5
Work with your group to make a prediction for the height at:
• 21 months
• 28 months
• 20 years
Equation For Line of Best Fit
y = 0.6618x + 64.399
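As a sketch, the equation above can be evaluated at the three requested ages. It assumes x is age in months and y is height in inches, as in the table; the function name is illustrative.

```python
SLOPE = 0.6618       # inches per month, from the line of best fit
INTERCEPT = 64.399   # inches

def predict_height(age_months):
    """Height in inches predicted by y = 0.6618x + 64.399."""
    return SLOPE * age_months + INTERCEPT

for label, months in (("21 months", 21), ("28 months", 28), ("20 years", 20 * 12)):
    print(f"{label:>9}: {predict_height(months):.1f} inches")
```

The first two predictions fall between neighbouring observations. The 20-year prediction exceeds 220 inches (more than 18 feet), which shows why a line of best fit should not be extrapolated far beyond the range of the data.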
Corr = .794
Now, let’s plot Level_t vs. Level_{t-2}
Corr = .531
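A lagged correlation of this kind pairs each value of a series with the value a fixed number of periods earlier. The Level series itself is not reproduced here, so the sketch below uses the quarterly sales figures from the moving-average example as stand-in data; the function names are illustrative.

```python
from math import sqrt

# Stand-in series: quarterly sales for 1996 to 1998 from the moving-average example
sales = [189, 244, 365, 262, 190, 266, 359, 250, 201, 259, 401, 265]

def pearson_r(x, y):
    """Pearson correlation from deviations about the means."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    s_xy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    s_xx = sum((a - mean_x) ** 2 for a in x)
    s_yy = sum((b - mean_y) ** 2 for b in y)
    return s_xy / sqrt(s_xx * s_yy)

def lag_correlation(series, lag):
    """Correlate each value with the value `lag` periods earlier."""
    return pearson_r(series[lag:], series[:-lag])

for lag in (1, 2, 4):
    print(f"lag {lag}: corr = {lag_correlation(sales, lag):.3f}")
```

For this seasonal series the lag-4 correlation is strongly positive (same quarter, one year apart), while the lag-2 correlation is negative.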
Moving Averages
• A moving average is an average that is updated
or recomputed for every new time period being
considered.
• The most recent information is utilized in each
new moving average.
• Shown here are shipments (in millions of dollars)
for electric lighting and wiring equipment over a
12-month period. Use these data to compute a 4-
month moving average for all available months.
Moving Average
• A moving average is a form of average which has been adjusted to
allow for seasonal or cyclical components of a
time series.
• When a variable, like the number of unemployed,
or the cost of strawberries, is graphed against
time, there are likely to be considerable seasonal
or cyclical components in the variation.
Moving Average
The n day simple moving average for day d is computed by:
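A standard way to write the simple moving average, with symbol names chosen here for illustration ($A_d$ for the average on day d, $M_j$ for the observed value on day j), is:

$$A_d = \frac{1}{n} \sum_{i=0}^{n-1} M_{d-i} = \frac{M_d + M_{d-1} + \dots + M_{d-n+1}}{n}$$

Each new day drops the oldest of the n values and adds the newest one.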
Year      1996                 1997                 1998
Quarter   1    2    3    4     1    2    3    4     1    2    3    4
Sales     189  244  365  262   190  266  359  250   201  259  401  265
First four 4-quarter moving averages: 265, 265.25, 270.75, 269.25
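The remaining moving averages follow the same pattern. A minimal sketch, using the quarterly sales above and an illustrative function name:

```python
# Quarterly sales, 1996 Q1 through 1998 Q4
sales = [189, 244, 365, 262, 190, 266, 359, 250, 201, 259, 401, 265]

def moving_average(values, window):
    """Simple moving average over every full window of consecutive values."""
    return [sum(values[i:i + window]) / window
            for i in range(len(values) - window + 1)]

averages = moving_average(sales, 4)
for start, avg in enumerate(averages, start=1):
    print(f"quarters {start} to {start + 3}: {avg}")
```

Twelve quarters give nine 4-quarter averages. The first, (189 + 244 + 365 + 262) / 4 = 265, matches the value shown under the table, and averaging over a full year smooths out the third-quarter peaks.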