Certification Course on

Quality Assurance and Statistical Quality Techniques


Course Level A Regression and Correlation Code 1.08: Central Tendency
Issue No.: 01
Effective Date: 15-04-2014

REGRESSION & CORRELATION THEORY


In statistics, regression means a functional relationship between variables.

Areas of application:

Correlation and regression find many applications in industrial quality control because they provide
a means to predict and control product or process behavior by studying relationships among
variables. Some key uses are:

1. Producing a consistent quality output

---By determining the important factors responsible for variability in the output quality.

---By determining to what extent variation in a factor causes variation in output
quality.

2. Replacing costly destructive tests with less costly non-destructive tests.

3. Industrial research

In correlation and regression studies, the engineer takes data as they are found, rather than
under controlled laboratory conditions, and discovers the relationship from them.

Bivariate distribution

Regression theory is built upon the concept of a bivariate function. The joint distribution of two variables is
known as a bivariate distribution. A chart known as a scatter diagram can show the relation between two
variables.
A line or curve that shows how the mean of the values of one variable changes with the values of the other
variable is called a line or curve of regression.
Regression of Y on X
--It is the relationship between the average value of Y for a given X and the value of X.
Example: strength of cotton yarn (Y) and fibre length (X)

Bivariate Chart between strength & Fibre Length


Each point in the figure represents a pair of measurements. The area of the bivariate chart is
subdivided into cells, and the chart is converted into a bivariate frequency distribution (see the next
figure). That figure is the bivariate frequency distribution of 183 pieces of cotton yarn with
respect to their skein strength and fibre length.

2 1 2

124.5 2 1 1 1 1

1 3

114.5 2 1 2

4 2 2

104.5 1 1 1 9 7 1 1

4 8 9 11 3 1

94.5 1 12 10 5 7 7 1

1 2 4 12 5 3 1

84.5 1 4 5 2 1

2 3 2 1 1

74.5 1 1 1 2
0.545 0.645 0.745 0.845 0.945

Bivariate frequency distribution

DISTRIBUTION OF Y FOR GIVEN VALUES OF X


It can be seen that for any given value of X there is no single value of Y, but a distribution of Y
values. For instance, when fibre length lies between 0.695 and 0.745 inch, the cotton strength is
distributed over 7 cells, ranging from a strength of 74.5 lbs. to 109.5 lbs. This tendency for cotton
strength to be distributed over a range of values for a given value of fibre length is characteristic
not only of the sample but also of the universe itself.
THE QUESTIONS ARISE:
1. Why do we not get a single value of Y for any given value of X?
2. Why do we have a distribution of Y values for each value of X?
3. What physical phenomena produce such a result?

THE REASONS ARE:
-- The dependent variable Y is affected by variables other than X.
-- The dependent variable Y is affected by variable X and also by many other variables.
-- There is a host of chance forces that cause errors of measurement.

VARIATION IN MEAN VALUE OF Y WITH X

Although there is no single value of Y for a given value of X, there is a tendency for Y values
to be higher when X is higher and lower when X is lower. The mean value of Y increases steadily
with X. It is this locus of mean values that is called the “REGRESSION OF Y ON X.”

If it is a straight line, the regression is said to be “linear”.

There is also a regression of X on Y. This would be the locus of mean values of X for a given
value of Y. A line of regression may have either a positive or negative slope indicating the type
of relationship.
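
The idea of a regression as a locus of mean values can be sketched in a few lines of Python. This is only an illustrative sketch: the function name `mean_y_by_x_cell`, the cell width and the data are hypothetical and are not taken from the yarn study. It groups the pairs into cells of X and reports the mean of Y in each cell.

```python
import math
from collections import defaultdict

def mean_y_by_x_cell(pairs, cell_width):
    """Return {cell_index: mean of Y} for (x, y) pairs grouped into X cells.

    Cell index k covers cell_width * k <= x < cell_width * (k + 1).
    """
    cells = defaultdict(list)
    for x, y in pairs:
        cells[math.floor(x / cell_width)].append(y)
    return {k: sum(ys) / len(ys) for k, ys in sorted(cells.items())}

# Hypothetical (x, y) measurements, invented for illustration only.
pairs = [(1.0, 10.0), (1.4, 12.0), (2.2, 15.0), (2.8, 17.0), (3.1, 20.0)]
locus = mean_y_by_x_cell(pairs, cell_width=1.0)
print(locus)  # {1: 11.0, 2: 16.0, 3: 20.0}
```

Plotting these cell means against the cell positions would trace the regression of Y on X; swapping the roles of x and y would give the regression of X on Y.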

COEFFICIENT OF CORRELATION

With every linear regression there is associated a coefficient of correlation, denoted by r, which measures
the degree of association between the two variables.

$$r = \frac{n\sum xy - \sum x \sum y}{\sqrt{\left[n\sum x^2 - \left(\sum x\right)^2\right]\left[n\sum y^2 - \left(\sum y\right)^2\right]}}$$

If r is positive, the slope of the line of regression is positive; if r is negative, the slope is also negative.
When all the points are on the line, the deviation of the y values from the line of regression is 0 and r
becomes ±1, indicating a perfect linear relation.
The nearer |r| is to 1, the closer the points are to the line of regression. Thus the magnitude of r may be taken
as a measure of the degree to which the association between the variables approaches a linear functional
relationship.

When r is zero, the variables are described as linearly uncorrelated.


The scatter graphs show that strong relationships between two variables have a value of r
tending towards +1 or -1, while weaker relationships have values of r closer to zero,
as the graph at the bottom right shows.

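Following is an example of calculating r from a set of x and y values, with n = 5:

$$\sum x = 15, \quad \sum y = 1, \quad \sum xy = -9, \quad \sum x^2 = 55, \quad \sum y^2 = 15$$

$$r = \frac{n\sum xy - \sum x \sum y}{\sqrt{\left[n\sum x^2 - \left(\sum x\right)^2\right]\left[n\sum y^2 - \left(\sum y\right)^2\right]}} = \frac{5(-9) - 15 \times 1}{\sqrt{\left[5(55) - 15^2\right]\left[5(15) - 1^2\right]}} = -0.986$$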
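
The arithmetic of this example can be checked with a short Python sketch. The function name `correlation_from_sums` is hypothetical. The sums are the ones quoted in the example, with one assumption: the sum of xy is taken as -9, since that reading reproduces the stated magnitude 0.986 (an all-positive reading would give about 0.49).

```python
import math

def correlation_from_sums(n, sum_x, sum_y, sum_xy, sum_x2, sum_y2):
    """Coefficient of correlation r computed from the summary sums."""
    numerator = n * sum_xy - sum_x * sum_y
    denominator = math.sqrt(
        (n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2)
    )
    return numerator / denominator

# Sums from the worked example; the negative sign of sum_xy is an assumption.
r = correlation_from_sums(n=5, sum_x=15, sum_y=1, sum_xy=-9, sum_x2=55, sum_y2=15)
print(round(r, 3))  # -0.986
```

These sums are consistent with, for instance, x = 1, 2, 3, 4, 5 and y = 3, 1, 0, -1, -2, although the original data are not given.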
STANDARD ERROR OF ESTIMATE

It is a measure of the reliability of an estimate made from the line of regression. It is given by the
standard deviation (SD) of the distribution of Y values for a given value of X, and it helps in
determining a confidence interval for Y. The SD of Y for a given X is commonly called the
“STANDARD ERROR OF ESTIMATE”, since it measures the error involved in using the regression value
to estimate Y. The universe quantity is denoted by σest and the sample estimate by Sest.

$$S_{est} = \sqrt{\frac{\sum (y - \hat{y})^2}{n - 2}}$$
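
A minimal Python sketch of this calculation follows. It assumes that the line of regression is fitted by ordinary least squares, which the text does not state explicitly, and it uses a small hypothetical data set; the function names `fit_line` and `standard_error_of_estimate` are illustrative only.

```python
import math

def fit_line(xs, ys):
    """Least-squares intercept and slope of the regression of Y on X."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

def standard_error_of_estimate(xs, ys):
    """Sest: root of the sum of squared residuals divided by n - 2."""
    intercept, slope = fit_line(xs, ys)
    sse = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    return math.sqrt(sse / (len(xs) - 2))

# Hypothetical data, invented for illustration only.
xs = [1, 2, 3, 4, 5]
ys = [3, 1, 0, -1, -2]
s_est = standard_error_of_estimate(xs, ys)
print(round(s_est, 4))  # 0.3651
```

For these data the fitted line is y = 3.8 - 1.2x, the squared residuals sum to 0.40, and Sest is the square root of 0.40 / 3.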

CONFIDENCE AND PREDICTION INTERVALS FOR FORECAST VALUES

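The 95% confidence interval for the forecast value ŷ at a given x is

$$\hat{y} \pm t_{crit} \cdot s.e.$$

where

$$s.e. = S_{est}\sqrt{\frac{1}{n} + \frac{(x - \bar{x})^2}{\sum (x - \bar{x})^2}}$$

and t_crit is the two-tailed 5% critical value of the t distribution with n - 2 degrees of freedom.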

This means that there is a 95% probability that the true linear regression line of the population
will lie within the confidence interval of the regression line calculated from the sample data.

Confidence vs. prediction intervals

In the graph on the left of Figure 1, a linear regression line is calculated to fit the sample data
points. The confidence interval consists of the space between the two curves (dotted lines).
Thus there is a 95% probability that the true best-fit line for the population lies within the
confidence interval (e.g. any of the lines in the figure on the right above).

There is also a concept called prediction interval. Here we look at any specific value of x, x0, and
find an interval around the predicted value ŷ0 for x0 such that there is a 95% probability that
the real value of y (in the population) corresponding to x0 is within this interval (see the graph
on the right side of Figure 1).

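The 95% prediction interval of the forecasted value ŷ0 for x0 is

$$\hat{y}_0 \pm t_{crit} \cdot s.e.$$

where the standard error of the prediction is

$$s.e. = S_{est}\sqrt{1 + \frac{1}{n} + \frac{(x_0 - \bar{x})^2}{\sum (x - \bar{x})^2}}$$

with t_crit taken from the t distribution with n - 2 degrees of freedom.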
For any specific value x0 the prediction interval is more meaningful than the confidence
interval.
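
The difference between the two intervals can be illustrated with a short Python sketch. It assumes the usual textbook forms for simple linear regression: a half-width of t × Sest × sqrt(1/n + (x0 - x̄)² / Σ(x - x̄)²) for the confidence interval, and the same expression with 1 added under the square root for the prediction interval. The function name `forecast_intervals` and the data are hypothetical. The critical value is passed in as a parameter; 3.182 is the tabulated two-tailed 5% value of t for 3 degrees of freedom.

```python
import math

def forecast_intervals(xs, ys, x0, t_crit):
    """Return (y0_hat, ci_half_width, pi_half_width) at x0.

    Assumes a least-squares line and the textbook interval formulas.
    """
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # Standard error of estimate, with n - 2 degrees of freedom.
    sse = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    s_est = math.sqrt(sse / (n - 2))
    # Term shared by both standard errors.
    spread = 1 / n + (x0 - mean_x) ** 2 / sxx
    y0_hat = intercept + slope * x0
    ci_half = t_crit * s_est * math.sqrt(spread)
    pi_half = t_crit * s_est * math.sqrt(1 + spread)
    return y0_hat, ci_half, pi_half

# Hypothetical data, invented for illustration only.
xs = [1, 2, 3, 4, 5]
ys = [3, 1, 0, -1, -2]
T_CRIT_DF3 = 3.182  # two-tailed 5% point of t with n - 2 = 3 degrees of freedom
y0_hat, ci_half, pi_half = forecast_intervals(xs, ys, x0=3, t_crit=T_CRIT_DF3)
print(round(y0_hat, 3), round(ci_half, 3), round(pi_half, 3))
```

The prediction interval is always the wider of the two, because it allows for the scatter of an individual Y value about the line as well as the uncertainty in the line itself.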