
Business Statistics -2

Gayatri V Singh, PhD,


AMITY University
Today’s Highlights
• Correlation Analysis
• Correlation Type
• Regression Analysis
• Difference Between Correlation and
Regression
• Time Series Analysis

Correlation Analysis
• the degree to which two variables are associated
• strength of the relationship (correlation)
between two variables
• may be either positive or negative.
• Its magnitude does not depend on the units of
measurement.
• Assumes the data are from a bivariate normal
population.
• Does not necessarily imply causation
Correlation Analysis
Data
• Quantitative (numerical) data: Karl Pearson correlation
• Qualitative (categorical) data: Spearman’s rank correlation
Correlation Coefficient
• The value of r can range between -1 and
+1.
• If r = 0, then there is no correlation
between the two variables.
• If r = 1 (or -1), then there is a perfect
positive (or negative) relationship
between the two variables.

Scatter Plots
• A scatter plot is a graph of the ordered
pairs (x, y) of numbers consisting of the
independent variable, x, and the
dependent variable, y.
Scatter Plots - Example
• Construct a scatter plot for the data obtained in a
study of age and systolic blood pressure of six
randomly selected subjects.
• The data is given on the next slide.
Scatter Plots - Example
Subject Age, x Pressure, y
A 43 128
B 48 120
C 56 135
D 61 143
E 67 141
F 70 152
Scatter Plots - Example
Positive Relationship
[Figure: scatter plot of Pressure (vertical axis, 120 to 150) against Age (horizontal axis, 40 to 70).]
Scatter Plots – Other Example
Negative Relationship
[Figure: scatter plot of Final grade (vertical axis, 40 to 90) against Number of absences (horizontal axis, 5 to 15).]
Scatter Plots – Other Example
No Relationship
[Figure: scatter plot of y (vertical axis, 0 to 10) against x (horizontal axis, 0 to 70).]
Scatter Plot
[Figure: three scatter plots of y2 against y1, with panels r = +1, r = -1, and r = 0.]
Correlation Coefficient Range
Coefficient Range   Strength of Relationship
0.00 - 0.20         Very Low
0.20 - 0.40         Low
0.40 - 0.60         Moderate
0.60 - 0.80         High Moderate
0.80 - 1.00         Very High
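Correlation Coefficient

Karl Pearson:
$$ r_{xy} = \frac{\sum xy - \frac{\sum x \sum y}{n}}{\sqrt{\left(\sum x^2 - \frac{(\sum x)^2}{n}\right)\left(\sum y^2 - \frac{(\sum y)^2}{n}\right)}} $$

Spearman:
$$ r_s = 1 - \frac{6 \sum d^2}{n(n^2 - 1)} $$
“d” is the difference between
ranks of two variables

a measure of the intensity of the linear
association between variables.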
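As a rough illustration of Karl Pearson's formula, the sketch below applies it to the age and systolic blood pressure data from the scatter plot example. The function name `pearson_r` is an illustrative choice and does not come from the slides.

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson correlation using the sums-of-products form with /n corrections."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    sum_x, sum_y = sum(x), sum(y)
    # Numerator: sum(xy) - sum(x) * sum(y) / n
    sxy = sum(a * b for a, b in zip(x, y)) - sum_x * sum_y / n
    # Denominator terms: sum(x^2) - (sum x)^2 / n, and the same for y
    sxx = sum(a * a for a in x) - sum_x ** 2 / n
    syy = sum(b * b for b in y) - sum_y ** 2 / n
    return sxy / sqrt(sxx * syy)

# Data from the scatter plot example (subjects A to F)
age = [43, 48, 56, 61, 67, 70]
pressure = [128, 120, 135, 143, 141, 152]

r = pearson_r(age, pressure)
print(round(r, 3))  # about 0.897, a strong positive relationship
```

The result falls in the "Very High" band of the strength table, which agrees with the upward pattern in the scatter plot.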
Notes of Caution
1. An observed relationship between two
variables does not imply that there is some
causal link between the two variables.
[Figure: scatter plot of IQ against Shoe size.]
2. A relationship between two variables can be
influenced by confounding variables.
[Figure: scatter plot of Number of haircuts done in a year against Height, with points labelled men and women.]
However, for each gender, there does not
appear to be an association.
Any study, especially an observational study, has
the potential to be wrongly interpreted because
of confounding variables.
Interpreting Correlation
• Correlation is a mathematical concept which
describes a pattern of numbers
• As such it may NOT help to understand what is
going on in the real world
• Does NOT tell why one variable is correlated with
another
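Computational Formula for
Correlation
• By substituting and rearranging, you obtain a
substantial (and not very transparent) formula
for $r_{xy}$:
$$ r_{xy} = \frac{N \sum XY - \sum X \sum Y}{\sqrt{\left[N \sum X^2 - (\sum X)^2\right]\left[N \sum Y^2 - (\sum Y)^2\right]}} $$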
Computing a correlation
Cigarettes (X)   X²     XY     Y²     Lung Capacity (Y)
0                0      0      2025   45
5                25     210    1764   42
10               100    330    1089   33
15               225    465    961    31
20               400    580    841    29
Sum: 50          750    1585   6680   180
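Computing a Correlation
$$ r_{xy} = \frac{(5)(1585) - (50)(180)}{\sqrt{\left[(5)(750) - 50^2\right]\left[(5)(6680) - 180^2\right]}} = \frac{7925 - 9000}{\sqrt{(3750 - 2500)(33400 - 32400)}} = \frac{-1075}{\sqrt{(1250)(1000)}} = -0.9615 $$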
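The hand calculation can be checked with a short script. This is a minimal sketch that rebuilds the column totals from the cigarettes and lung capacity table and plugs them into the computational formula; the variable names are illustrative.

```python
from math import sqrt

cigarettes = [0, 5, 10, 15, 20]        # X
lung_capacity = [45, 42, 33, 31, 29]   # Y

n = len(cigarettes)
sum_x = sum(cigarettes)
sum_y = sum(lung_capacity)
sum_x2 = sum(x * x for x in cigarettes)
sum_y2 = sum(y * y for y in lung_capacity)
sum_xy = sum(x * y for x, y in zip(cigarettes, lung_capacity))

# Computational formula: N*sum(XY) - sum(X)*sum(Y) over the square root of
# [N*sum(X^2) - (sum X)^2] * [N*sum(Y^2) - (sum Y)^2]
numerator = n * sum_xy - sum_x * sum_y
denominator = sqrt((n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2))
r_xy = numerator / denominator

print(sum_x, sum_x2, sum_xy, sum_y2, sum_y)  # 50 750 1585 6680 180
print(round(r_xy, 4))                        # -0.9615
```

The negative sign indicates that lung capacity falls as the number of cigarettes rises.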
Coefficient of Determination
• Allows us to be precise when we interpret a
correlation coefficient
• Tells us how much variance on one variable is
accounted for by the variance on the other
variable
• Expressed as a percentage (%)
Selected Values of r and r²
r      r²     % accounted for   % not accounted for
.10    .01    1%                99%
.20    .04    4%                96%
.30    .09    9%                91%
.40    .16    16%               84%
.50    .25    25%               75%
.60    .36    36%               64%
.70    .49    49%               51%
.80    .64    64%               36%
.90    .81    81%               19%
1.00   1.00   100%              0%
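The table rows follow directly from squaring r. The sketch below regenerates them; `percent_accounted_for` is a hypothetical helper name chosen for this example.

```python
def percent_accounted_for(r):
    """Return (r squared, % of variance accounted for, % not accounted for)."""
    if not -1.0 <= r <= 1.0:
        raise ValueError("r must lie between -1 and +1")
    r_squared = r * r
    accounted = 100.0 * r_squared
    return r_squared, accounted, 100.0 - accounted

for r in [0.10, 0.20, 0.30, 0.40, 0.50, 0.60, 0.70, 0.80, 0.90, 1.00]:
    r2, acc, rest = percent_accounted_for(r)
    print(f"{r:.2f}  {r2:.2f}  {acc:.0f}%  {rest:.0f}%")
```

Because r is squared, the sign is irrelevant: for the cigarettes example, r = -0.9615 gives r² of about 0.92, so roughly 92% of the variance is accounted for.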
Spearman rank correlation
• This can be applied when you have two
observations per item, and you want to test
whether the observations are related.
• Computing the sample correlation gives an
indication.
• Non parametric method:
– Less power but more robust.
– Does not assume normal distribution
Spearman’s rank correlation coefficient
• Step 1: Prepare data:
– Same as Pearson.
– Order the values of the probes by increasing
hybridization values.
– Construct the rank vectors.
• Step 2: Compute coefficient between probe sets of
interest:
$$ r_s = 1 - \frac{6 \sum d^2}{n(n^2 - 1)} $$
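The two steps can be sketched in a few lines. This example assumes there are no tied values (the simple formula is exact only in that case) and reuses the age and blood pressure data from the scatter plot example instead of probe data; `ranks` and `spearman_rs` are illustrative names.

```python
def ranks(values):
    """Rank values from 1 (smallest) upward; assumes no tied values."""
    if len(set(values)) != len(values):
        raise ValueError("this simple formula assumes no ties")
    ordered = sorted(values)
    return [ordered.index(v) + 1 for v in values]

def spearman_rs(x, y):
    """Spearman rank correlation: 1 - 6 * sum(d^2) / (n * (n^2 - 1))."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    # d is the difference between the ranks of each pair
    sum_d2 = sum((rx - ry) ** 2 for rx, ry in zip(ranks(x), ranks(y)))
    return 1 - 6 * sum_d2 / (n * (n ** 2 - 1))

age = [43, 48, 56, 61, 67, 70]
pressure = [128, 120, 135, 143, 141, 152]

print(ranks(pressure))                       # [2, 1, 3, 5, 4, 6]
rs = spearman_rs(age, pressure)
print(round(rs, 4))                          # 0.8857
```

Here the sum of squared rank differences is 4, so the coefficient is 1 - 24/210.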
Regression Analysis
A statistical procedure used to find
relationships among a set of variables
Regression Analysis
• It includes any techniques for modeling and
analyzing several variables, when the focus is on
the relationship between a dependent variable
and one or more independent variables.

• More specifically, regression analysis helps us to
understand how the typical value of the dependent
variable changes when any one of the independent
variables is varied, while the other independent
variables are held fixed.
Regression coefficient
• The regression coefficient is the slope of the
regression line and tells you what the nature
of the relationship between the variables is.
• How much change in the independent
variables is associated with how much change
in the dependent variable.
• The larger the regression coefficient the more
change.
Regression: 3 Main Purposes

• To describe (or model)

• To predict (or estimate)

• To control (or administer)


Types of Regression
• Simple Linear Regression
• Non-Linear Regression
• Multiple Regression
• Logistic Regression (GLM)
• Conditional Logistic Regression
Linear Regression Analysis
Purpose –
To determine the regression equation. It is used to predict
the value of one variable (Y, called the dependent variable)
based on another variable (X, called the independent variable).

Procedure:
• Select a sample from the population, and list the paired data
(X, Y) for each observation.
• Draw a scatter diagram to give a visual portrayal of the
relationship.
• Determine the regression equation Y = a + bX.
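Regression Analysis
• Y is the average predicted value of Y for any X.
• a is the Y-intercept, or the estimated Y value when
X = 0.
• b is called the slope of the line. It is the average
change in Y for each change of one unit in X.
• The least squares principle is used to obtain a and
b, which are given by:
$$ b_{yx} = r_{xy} \frac{\sigma_y}{\sigma_x}, \qquad b_{xy} = r_{xy} \frac{\sigma_x}{\sigma_y}, \qquad a = \bar{y} - b_{yx} \bar{x} $$
Here $b_{yx}$ is the slope for predicting y from x (the b in Y = a + bX), and
$b_{xy}$ is the slope for predicting x from y.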
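A minimal sketch of these least squares formulas, applied to the age and blood pressure data from the scatter plot example. It uses population standard deviations for the two sigma terms (their ratio is the same with sample standard deviations), and `regression_y_on_x` is an illustrative name.

```python
from statistics import mean, pstdev

def regression_y_on_x(x, y):
    """Return (a, b_yx) for the line Y = a + b_yx * X."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    mean_x, mean_y = mean(x), mean(y)
    sigma_x, sigma_y = pstdev(x), pstdev(y)
    # Correlation as covariance divided by the product of standard deviations
    cov_xy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / n
    r_xy = cov_xy / (sigma_x * sigma_y)
    b_yx = r_xy * sigma_y / sigma_x      # slope
    a = mean_y - b_yx * mean_x           # intercept
    return a, b_yx

age = [43, 48, 56, 61, 67, 70]
pressure = [128, 120, 135, 143, 141, 152]

a, b = regression_y_on_x(age, pressure)
print(round(a, 3), round(b, 3))  # about 81.048 and 0.964
```

Under these assumptions the fitted line is roughly Pressure = 81.05 + 0.964 × Age, so each extra year of age is associated with about one more unit of systolic pressure in this sample.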
Regression Example

• Dissolved Oxygen and Temperature


• Our hypothesis is that there is a significant
linear relationship between DO and
temperature.
• Which is the dependent variable?
• Which is the independent variable?

Calculations
Temp (°C)   DO (mg/l)
5.42        15.52
6.16        15.39
6.34        15.63
6.42        15.75
6.76        15.65
6.86        14.95
6.91        15.49
7.30        14.54
7.68        14.46
7.78        14.15
8.30        14.76
8.57        14.48
8.72        13.17
Calculations
The values we need to calculate are:
1. Mean of X
2. Mean of Y
3. Sum of X
4. Sum of Y
5. Sum of X times Y (∑XY)
6. Sum of X²
7. Sum of Y²
Note that there is a difference between the sum of X² and the sum of
X, squared:
$$ \sum X^2 \neq \left(\sum X\right)^2 $$
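The seven quantities can be computed directly from the temperature and dissolved oxygen table. This sketch assumes temperature is the independent variable X and DO is the dependent variable Y, which is one reading of the example; the slope and intercept use the standard least squares expressions.

```python
temp = [5.42, 6.16, 6.34, 6.42, 6.76, 6.86, 6.91,
        7.30, 7.68, 7.78, 8.30, 8.57, 8.72]           # X, degrees C
do = [15.52, 15.39, 15.63, 15.75, 15.65, 14.95, 15.49,
      14.54, 14.46, 14.15, 14.76, 14.48, 13.17]       # Y, mg/l

n = len(temp)
sum_x, sum_y = sum(temp), sum(do)
mean_x, mean_y = sum_x / n, sum_y / n
sum_xy = sum(x * y for x, y in zip(temp, do))
sum_x2 = sum(x * x for x in temp)
sum_y2 = sum(y * y for y in do)

# The sum of squares is not the same as the square of the sum
square_of_sum_x = sum_x ** 2

# Least squares slope and intercept for Y = a + bX
b = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
a = mean_y - b * mean_x

print(f"sum X = {sum_x:.2f}, sum Y = {sum_y:.2f}")
print(f"sum X^2 = {sum_x2:.4f}, (sum X)^2 = {square_of_sum_x:.4f}")
print(f"a = {a:.4f}, b = {b:.4f}")
```

The slope comes out negative, which matches the pattern in the table: the warmer readings tend to have lower dissolved oxygen.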
• Machine calculation of a and b (intercept and
slope)

EXAMPLE 1
• Develop a regression equation for the
information given in the EXAMPLE that can
be used to estimate the selling price based on
the number of pages.
• b = 0.01714, a = 16.00175.
• Y = 16.00175 + 0.01714X .
• What is the estimated selling price of a 650-
page book?
• Y = 16.00175 + 0.01714(650) = $27.14.
EXAMPLE 2
• A computer-repair technician recorded data
on the number of computers serviced and the
amount of time to complete the service for 11
randomly selected service visits. The number
of computers serviced, x, ranged from 1 to 7,
while the time to complete the service, y,
ranged from about 30 minutes to nearly 3.5
hours. The scatter plot of the data shows a
strong, positive linear relationship between
these two variables.
• Obtain the estimated linear regression
equation y = a + bx.
y = 10.19 + 24.83x
where y = service time (minutes)
x = no. of computers serviced
• Predict the number of minutes required for
service when it is reported that five
computers are down.
y = 10.19 + 24.83(5) = 134.3 minutes
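Once a and b are known, prediction is a single substitution. The sketch below repeats the arithmetic of Example 1 and Example 2 with the coefficients given on the slides; `predict` is a hypothetical helper name.

```python
def predict(a, b, x):
    """Evaluate the fitted line y = a + b * x at a given x."""
    return a + b * x

# Example 1: selling price of a 650-page book
price = predict(16.00175, 0.01714, 650)

# Example 2: service time in minutes for five computers
minutes = predict(10.19, 24.83, 5)

print(round(price, 2))    # 27.14
print(round(minutes, 1))  # 134.3
```

Both inputs lie inside or close to the range of the observed data (for Example 2, x ranged from 1 to 7), which is where such predictions are most trustworthy.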


Difference between Correlation and
Regression
• Correlation measures the STRENGTH of
linear association between paired variables,
say X and Y. Regression, on the other hand,
tells us the FORM of linear association that
best predicts Y from the values of X.
Correlation is calculated whenever:
* both X and Y are measured in each subject and we want to
quantify how much they are linearly associated.
* in particular, Pearson's product-moment
correlation coefficient is used when the assumption
that both X and Y are sampled from normally
distributed populations is satisfied
* or Spearman's rank-order correlation
coefficient is used if the assumption of normality is
not satisfied.
* correlation is not used when the variables are
manipulated, for example, in experiments.
With correlation you don't have to think about
cause and effect. You simply quantify how well
two variables relate to each other. With
regression, you do have to think about cause
and effect as the regression line is determined
as the best way to predict Y from X.

Line of Best Fit
• Definition - A Line of Best Fit is a straight line on a
scatter plot that comes closest to all of the dots on
the graph.
• A Line of Best Fit does not touch all of the dots.
• A Line of Best Fit is useful because it allows us to:
– Understand the type and strength of the
relationship between two sets of data
– Predict missing Y values for given X values, or
missing X values for given Y values
Age (months)   Height (inches)
18             76.1
19             77
20             78.1
21             (not given)
22             78.8
23             79.7
24             79.9
25             81.1
26             81.2
27             82.8
28             (not given)
29             83.5
Work with your group to make a prediction for the height at:
• 21 months
• 28 months
• 20 years
Equation For Line of Best Fit
y = 0.6618x + 64.399

X (months)   Formula                  Y (inches)
21           0.6618(21) + 64.399      78.3
28           0.6618(28) + 64.399      82.9
240          0.6618(240) + 64.399     223.2
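The equation of the line can be recovered from the ten observed (age, height) pairs. This is a minimal sketch using ordinary least squares; `least_squares_line` is an illustrative name, and the two ages without a height are simply left out of the fit.

```python
def least_squares_line(x, y):
    """Return (slope, intercept) of the least squares line through (x, y)."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

# Observed pairs only (21 and 28 months have no recorded height)
months = [18, 19, 20, 22, 23, 24, 25, 26, 27, 29]
heights = [76.1, 77, 78.1, 78.8, 79.7, 79.9, 81.1, 81.2, 82.8, 83.5]

slope, intercept = least_squares_line(months, heights)
print(round(slope, 4), round(intercept, 3))  # about 0.6618 and 64.399

for x in (21, 28, 240):
    print(x, round(intercept + slope * x, 1))
```

The predictions for 21 and 28 months fall inside the observed range of 18 to 29 months. The value for 240 months (20 years) is an extrapolation far beyond the data, and a height of more than 220 inches shows why a line fitted to toddlers should not be extended to adults.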


Time Series Analysis
A time series is a sequence of observations
which are ordered in time (or space).
If observations are made on some
phenomenon throughout time, it is most
sensible to display the data in the order in
which they arose, particularly since successive
observations will probably be dependent.
Time series are best displayed in a scatter plot.
Time Series Analysis
• Understand time-series forecasting techniques
• Understand four possible components
• Understand how to use regression models for
trend analysis
• Understand nature of autocorrelation
Types of time series data
• There are two kinds of time series data:
• Continuous, where we have an observation at every
instant of time, e.g. lie detectors, electrocardiograms.
• We denote this using observation X at time t, X(t).
• Discrete, where we have an observation at (usually
regularly) spaced intervals. We denote this as Xt.
• Examples
Economics - weekly share prices, monthly profits
Meteorology - daily rainfall, wind speed, temperature
Sociology - crime figures (number of arrests, etc),
employment figures
Time series data are composed of four elements
Trend Cyclicality Seasonality irregularity
• Trend : Long term general direction of data
• Cycles: highs and lows through which data move over
time periods usually of more than a year
• Seasonal: shorter cycles, which usually occur in time
periods of less than one year
• Irregularity: rapid changes in the data, which occur in
even shorter time frames than seasonal effects
Stationary: data that contain no trend, cyclical or
seasonal effects
Linear Regression Trend Analysis
• The response variable ‘Y’ is being forecast
• The independent variable ‘X’ is the time period
• Linear model
$$ Y_i = \beta_0 + \beta_1 X_{ti} + \varepsilon_i $$
$Y_i$ = Data value for period i
$X_{ti}$ = ith time period
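As a sketch of trend fitting, the code below estimates β0 and β1 by least squares for the twelve quarterly sales figures (1996 to 1998) used in this deck's moving average example, with the time periods numbered 1 to 12. The numbering of periods and the name `fit_trend` are assumptions made for illustration.

```python
def fit_trend(values):
    """Fit Y = b0 + b1 * t by least squares, with t = 1, 2, ..., len(values)."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two observations")
    periods = range(1, n + 1)
    mean_t = sum(periods) / n
    mean_y = sum(values) / n
    sty = sum((t - mean_t) * (y - mean_y) for t, y in zip(periods, values))
    stt = sum((t - mean_t) ** 2 for t in periods)
    b1 = sty / stt
    return mean_y - b1 * mean_t, b1

sales = [189, 244, 365, 262, 190, 266, 359, 250, 201, 259, 401, 265]

b0, b1 = fit_trend(sales)
print(round(b0, 2), round(b1, 3))  # about 237.39 and 5.157

# Trend-only forecast for the next quarter (period 13)
forecast_13 = b0 + b1 * 13
print(round(forecast_13, 2))
```

A straight trend line ignores the quarterly pattern that is visible in these figures (the third quarter is always the highest), so the forecast describes only the long-term direction.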
Checking for dependence
We will NOT assume that $Y_{t-1}$ is independent of $Y_t$
Example: Is tomorrow’s temperature independent of today’s?
Suppose $y_1, \ldots, y_T$ are the temperatures measured daily for
several years. Which of the following two predictors would
work better:
i. the average of the temperatures from the
previous year
ii. the temperature on the previous day?
If the readings are iid $N(\mu, \sigma^2)$, what would be your prediction
for $Y_{T+1}$?
This example demonstrates that we should handle dependent
time series quite differently from independent series.
Checking for Independence
Independence:
Knowing $Y_t$ does not help you in predicting $Y_{t+1}$
It is not always easy just to look at the data and decide
whether a time series is independent.
So how can we tell?
Plot $Y_t$ vs. $Y_{t-1}$ to check for a relationship
or
Plot $Y_t$ vs. $Y_{t-s}$ for s = 1, 2, …
C1   C2     C3       C4
t    Y(t)   Y(t-1)   Y(t-2)
1    5      *        *
2    8      5        *
3    1      8        5
4    3      1        8
5    9      3        1
6    4      9        3
C2 is Y, C3 is Y lagged once, and C4 is Y lagged twice.
Now each row has Y at time t, Y one period ago, and Y two periods
ago.
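The lagged columns can be built by shifting the series, and the correlation between a series and its lag gives a number to go with the lag plot. This is a minimal sketch: `lagged` and `lag_correlation` are illustrative names, and missing values (the * entries) are represented by `None`.

```python
from math import sqrt

def lagged(series, s):
    """Shift a series by s periods; the first s entries have no earlier value."""
    if s < 0 or s > len(series):
        raise ValueError("lag must be between 0 and the series length")
    return [None] * s + list(series[:len(series) - s])

def lag_correlation(series, s):
    """Pearson correlation between Y(t) and Y(t-s), using complete pairs only."""
    pairs = [(y, prev) for y, prev in zip(series, lagged(series, s)) if prev is not None]
    n = len(pairs)
    mean_y = sum(p[0] for p in pairs) / n
    mean_l = sum(p[1] for p in pairs) / n
    syl = sum((y - mean_y) * (l - mean_l) for y, l in pairs)
    syy = sum((y - mean_y) ** 2 for y, _ in pairs)
    sll = sum((l - mean_l) ** 2 for _, l in pairs)
    return syl / sqrt(syy * sll)

y = [5, 8, 1, 3, 9, 4]
print(lagged(y, 1))  # [None, 5, 8, 1, 3, 9]
print(lagged(y, 2))  # [None, None, 5, 8, 1, 3]
print(lag_correlation(y, 1))
```

With only six observations the lag correlation is very noisy; the technique becomes informative on longer series such as yearly levels.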


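Each point is a pair of adjacent
years, e.g. (Level(1929), Level(1930)).
First, let’s plot Level(t) vs. Level(t-1)
[Figure: scatter plot of Level(t) against Level(t-1).]
Corr = .794
Now, let’s plot Level(t) vs. Level(t-2)
[Figure: scatter plot of Level(t) against Level(t-2).]
Corr = .531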
Moving Averages
• A moving average is an average that is updated
or recomputed for every new time period being
considered.
• The most recent information is utilized in each
new moving average.
• Shown here are shipments (in millions of dollars)
for electric lighting and wiring equipment over a
12-month period. Use these data to compute a 4-
month moving average for all available months.
Moving Average
• a form of average which has been adjusted to
allow for seasonal or cyclical components of a
time series.
• When a variable, like the number of unemployed,
or the cost of strawberries, is graphed against
time, there are likely to be considerable seasonal
or cyclical components in the variation

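Moving Average
The n day simple moving average for day d is computed by:
$$ A_d = \frac{1}{n} \sum_{i=1}^{n} M_{(d-i)+1}, \qquad n \le d $$
If we have ten measurements, M1 through M10, and we wish to
compute a four day moving average, the moving averages for
successive days are:
$$ A_4 = \frac{M_4 + M_3 + M_2 + M_1}{4}, \quad A_5 = \frac{M_5 + M_4 + M_3 + M_2}{4}, \quad \ldots, \quad A_{10} = \frac{M_{10} + M_9 + M_8 + M_7}{4} $$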
Year      1996                  1997                  1998
Quarter   1    2    3    4      1    2    3    4      1    2    3    4
Sales     189  244  365  262    190  266  359  250    201  259  401  265

4 Period Moving Average (first four values): 265, 265.25, 270.75, 269.25
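The 4 period moving average for the quarterly sales can be reproduced as follows. This is a minimal sketch of a trailing simple moving average, where each value averages the current period and the three before it; `moving_average` is an illustrative name.

```python
def moving_average(values, window):
    """Trailing simple moving average; one value per complete window."""
    if window < 1 or window > len(values):
        raise ValueError("window must be between 1 and len(values)")
    return [
        sum(values[end - window + 1 : end + 1]) / window
        for end in range(window - 1, len(values))
    ]

sales = [189, 244, 365, 262, 190, 266, 359, 250, 201, 259, 401, 265]

ma4 = moving_average(sales, 4)
print(ma4)
# [265.0, 265.25, 270.75, 269.25, 266.25, 269.0, 267.25, 277.75, 281.5]
```

Twelve quarters give nine complete windows. Because each window covers exactly one year, the strong quarterly swings in the raw figures are averaged out and the smoothed series moves much less.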
