

Final Project

Raiha, Maheen, Fabiha Mahnoor, Zara


Q1. Differentiate between Correlation and Regression Analysis.


Correlation is a statistical measure of the relationship between two variables. Possible correlations range from +1 to -1.

A zero correlation indicates that there is no linear relationship between the variables. A correlation of -1 indicates a perfect negative correlation, meaning that as one variable goes up, the other goes down. A correlation of +1 indicates a perfect positive correlation, meaning that both variables move in the same direction together. In other words, correlation computes the value of the Pearson correlation coefficient, r, whose value ranges from -1 to +1. The correlation answers the STRENGTH of linear association between paired variables, say X and Y.

Correlation is calculated whenever:
* both X and Y are measured in each subject and we want to quantify how much they are linearly associated.
* in particular, Pearson's product-moment correlation coefficient is used when the assumption that both X and Y are sampled from normally distributed populations is satisfied,
* or Spearman's rank-order correlation coefficient is used if the assumption of normality is not satisfied.
* correlation is not used when the variables are manipulated, for example, in experiments.

If you interchange the variables X and Y in the calculation of the correlation coefficient, you will get the same value.

REGRESSION:

Regression is a technique for fitting a simple equation to real data points. The most typical type of regression is linear regression (meaning you use the equation for a straight line, rather than some other type of curve), constructed using the least-squares method (the line you choose is the one that minimizes the sum of the squares of the vertical distances between the line and the data points). It's customary to use "a" or "alpha" for the intercept of the line, and "b" or "beta" for the slope, so linear regression gives you a formula of the form:

y = bx + a

Linear regression quantifies goodness of fit with r^2, sometimes shown in uppercase as R^2.
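As an illustration of the least-squares method described above, here is a minimal sketch that fits y = bx + a to a small made-up data set. The data values and the function name `fit_line` are assumptions for illustration only; they are not the project's data.

```python
def fit_line(xs, ys):
    """Least-squares fit of y = b*x + a. Returns (a, b)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of squared deviations of x, and sum of cross deviations of x and y
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx            # slope ("b" or "beta")
    a = mean_y - b * mean_x  # intercept ("a" or "alpha")
    return a, b

# Hypothetical data, for illustration only
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
a, b = fit_line(xs, ys)
print(f"y = {b:.3f}x + {a:.3f}")
```

The slope is the ratio of the cross deviations to the squared deviations of x, and the intercept is chosen so that the line passes through the point of means.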

The "best" linear regression model is obtained by selecting the variables (X's) that have at least a strong correlation with Y, i.e. r >= 0.80 or r <= -0.80. The same underlying distribution is assumed for all variables in linear regression. Thus, linear regression will underestimate the correlation between the independent and dependent variables when they (the X's and Y) come from different underlying distributions.

The regression tells us the FORM of linear association that best predicts Y from the values of X.

Linear regression is used whenever:
* at least one of the independent variables (Xi's) is used to predict the dependent variable Y. Some of the Xi's may be dummy variables, i.e. Xi = 0 or 1, which are used to code nominal variables.
* one manipulates the X variable, e.g. in an experiment.

Linear regression is not symmetric in X and Y. That is, interchanging X and Y gives a different regression model (X in terms of Y) from the original one (Y in terms of X).
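The contrast between the two techniques can be sketched in a few lines: swapping X and Y leaves the correlation coefficient unchanged but changes the regression slope. The data and the helper names (`sums`, `pearson_r`, `slope`) are hypothetical and chosen only for illustration.

```python
import math

def sums(xs, ys):
    """Return (Sxx, Syy, Sxy): sums of squared and cross deviations from the means."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxx, syy, sxy

def pearson_r(xs, ys):
    """Pearson correlation coefficient between xs and ys."""
    sxx, syy, sxy = sums(xs, ys)
    return sxy / math.sqrt(sxx * syy)

def slope(xs, ys):
    """Least-squares slope for predicting ys from xs."""
    sxx, _, sxy = sums(xs, ys)
    return sxy / sxx

# Hypothetical data, for illustration only
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

r_xy = pearson_r(xs, ys)
r_yx = pearson_r(ys, xs)   # same value: correlation is symmetric
b_y_on_x = slope(xs, ys)   # slope of Y in terms of X
b_x_on_y = slope(ys, xs)   # slope of X in terms of Y: a different model

print(r_xy, r_yx)
print(b_y_on_x, b_x_on_y)
```

For a single predictor, the product of the two slopes equals r^2, which is one way to see how the two regressions relate to the single correlation value.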

Q2. Give any example of Spurious Correlation between any two REAL WORLD variables and highlight the hidden factor.

"Spurious Relation (or Correlation): A situation in which measures of two or more variables are statistically related (they covary) but are not in fact causally linked, usually because the statistical relation is caused by a third variable. When the effects of the third variable are removed, they are said to have been partialed out. A spurious correlation, as defined in definition a, is sometimes called an "illusory correlation."

Lurking Variable: A third variable that causes a correlation between two others; sometimes, like the troll under the bridge, an unpleasant surprise when discovered. A lurking variable is a source of a spurious correlation. For example, if researchers found a correlation between individuals' college grades and their income later in life, they might wonder whether doing well in school increased income. It might; but good grades and high income could both be caused by a third (lurking or hidden) variable, such as a tendency to work hard."

"For example, if the students in a psychology class who had long hair got higher scores on the midterm than those who had short hair, there would be a correlation between hair length and test scores. Not many people, however, would believe that there was a causal link and that, for example, students who wished to improve their grades should let their hair grow. The real cause might be gender: that is, women (who usually have longer hair) did better on the test. Or that might be a spurious relationship too. The real cause might be class rank: Seniors did better on the test than sophomores and juniors, and, in this class, the women (who also had longer hair) were mostly seniors, whereas the men (with shorter hair) were mostly sophomores and juniors."

Here, hair length is one variable and grades are the other. The hidden factor behind the higher grades might be gender (or class rank), but the comparison stresses long hair and grades, which only indicates that in this class the women achieved higher grades. This shows the FALLACY of inferring a causal link from the correlation between two variables.
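The idea of "partialing out" a lurking variable can be sketched with simulated data. In this hypothetical setup, a hidden "effort" score drives both grades and income, and neither affects the other. The variable names, the noise level, and the sample size are assumptions made for illustration; the first-order partial correlation formula is the standard one.

```python
import math
import random

def pearson_r(xs, ys):
    """Pearson correlation coefficient between xs and ys."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / math.sqrt(sxx * syy)

def partial_r(r_xy, r_xz, r_yz):
    """Correlation between x and y after the effect of z is partialed out."""
    return (r_xy - r_xz * r_yz) / math.sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))

random.seed(0)  # fixed seed so the simulation is repeatable
n = 2000

# Hypothetical lurking variable: tendency to work hard
effort = [random.gauss(0, 1) for _ in range(n)]
# Grades and income each depend on effort plus independent noise,
# but not on each other
grades = [z + 0.5 * random.gauss(0, 1) for z in effort]
income = [z + 0.5 * random.gauss(0, 1) for z in effort]

r_grades_income = pearson_r(grades, income)
r_after_partialing = partial_r(
    r_grades_income,
    pearson_r(grades, effort),
    pearson_r(income, effort),
)

print(f"raw correlation:      {r_grades_income:.3f}")
print(f"effort partialed out: {r_after_partialing:.3f}")
```

The raw correlation between grades and income is strong, while the partial correlation is close to zero, which is the signature of a spurious relationship caused by a third variable.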

Q3. Write down a paragraph on why we do regression analysis.

Regression Analysis: We do regression analysis for the following reasons:

a. Explaining the relationship between the Y and X variables with a model
b. Estimating and testing the strength of their relationship
c. Predicting the value of y for a given, fixed value of x

Applications of regression analysis exist in almost every field. In economics, the dependent variable might be a family's consumption expenditure and the independent variables might be the family's income, number of children in the family, and other factors that would affect the family's consumption patterns. In political science, the dependent variable might be a state's level of welfare spending and the independent variables might be measures of public opinion and institutional variables that would cause the state to have higher or lower levels of welfare spending. In sociology, the dependent variable might be a measure of the social status of various occupations and the independent variables might be characteristics of the occupations (pay, qualifications, etc.). In psychology, the dependent variable might be an individual's racial tolerance as measured on a standard scale, with indicators of social background as the independent variables. In education, the dependent variable might be a student's score on an achievement test and the independent variables might be characteristics of the student's family, teachers, or school.

b1 = 0.976
b0 = 250.553
Y^ = b0 + b1 X1

Y^ = 395.5817809 + 0.976 X1
Therefore, for every 1-unit change in X1, Y^ changes by 0.976, as 0.976 is the gradient of the function. The Y-intercept of the function is 395.5817809.

SSR = Sum (Y^ - ȳ)^2 = 1.23

SST = Sum (Y - ȳ)^2 = 1.35

where ȳ is the mean of Y.

R^2 = SSR / SST = 1.23 / 1.347 = 0.912927

The value of R^2 is quite high (91.29%), meaning there is a strong linear relationship and dependence between X and Y.
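The ratio SSR / SST can be computed directly from fitted values. The sketch below uses a small made-up data set (not the project's data) and a hypothetical helper `sums_of_squares`; it also shows that SSR and the error sum of squares SSE add up to SST for a least-squares line with an intercept.

```python
def sums_of_squares(xs, ys):
    """Fit Y^ = b0 + b1*X by least squares and return (SSR, SSE, SST)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b1 = sxy / sxx
    b0 = mean_y - b1 * mean_x
    fitted = [b0 + b1 * x for x in xs]
    ssr = sum((f - mean_y) ** 2 for f in fitted)         # explained variation
    sse = sum((y - f) ** 2 for y, f in zip(ys, fitted))  # unexplained variation
    sst = sum((y - mean_y) ** 2 for y in ys)             # total variation
    return ssr, sse, sst

# Hypothetical data, for illustration only
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
ssr, sse, sst = sums_of_squares(xs, ys)
r_squared = ssr / sst
print(f"SSR = {ssr:.2f}, SSE = {sse:.2f}, SST = {sst:.2f}, R^2 = {r_squared:.2f}")
```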

Adjusted R^2 = 1 - [(1 - R^2)(n - 1) / (n - k - 1)]

= 0.910439

After adjusting for the number of explanatory variables and the sample size, the adjusted R^2 is also very high (91.04%), which means there is a strong dependence between X and Y.
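The adjustment can be checked with a short calculation. This sketch plugs in the R^2 value reported above together with n = 37 observations and k = 1 explanatory variable, the values used elsewhere in this report; the function name `adjusted_r_squared` is only an illustrative choice.

```python
def adjusted_r_squared(r_squared, n, k):
    """Adjusted R^2 for n observations and k explanatory variables."""
    return 1 - (1 - r_squared) * (n - 1) / (n - k - 1)

adj = adjusted_r_squared(0.912927, n=37, k=1)
print(round(adj, 6))
```

Because the penalty factor (n - 1) / (n - k - 1) is only 36 / 35 here, the adjusted value stays very close to the unadjusted one.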



[Figure: scatter plot of Series1 with a 2-period moving average line (2 per. Mov. Avg.); vertical axis 0 to 20000, horizontal axis 0 to 40]

The scatter plot also supports our analysis: on average, both variables show the same movement and trend.

Q4. Give Descriptive Statistics and interpret the data.

DESCRIPTIVE STATISTICS: Descriptive statistics are used to describe the basic features of the data in a study. They provide simple summaries about the sample and the measures. Together with simple graphics analysis, they form the basis of virtually every quantitative analysis of data. There are three major characteristics of a single variable that we tend to look at:

* the distribution
* the central tendency
* the dispersion

The Distribution. The distribution is a summary of the frequency of individual values or ranges of values for a variable. The simplest distribution would list every value of a variable and the number of persons who had each value.

Central Tendency. The central tendency of a distribution is an estimate of the "center" of a distribution of values. There are three major types of estimates of central tendency:

* Mean
* Median
* Mode

Dispersion. Dispersion refers to the spread of the values around the central tendency. There are three common measures of dispersion: the range, the standard deviation, and the variance.
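The three characteristics above can be computed with Python's standard library. The data set below is made up for illustration; `statistics.variance` and `statistics.stdev` return the sample versions (dividing by n - 1).

```python
import statistics
from collections import Counter

# Hypothetical sample, for illustration only
data = [2, 4, 4, 5, 7, 9, 4]

# Distribution: how often each value occurs
distribution = Counter(data)

# Central tendency
mean = statistics.mean(data)
median = statistics.median(data)
mode = statistics.mode(data)

# Dispersion
value_range = max(data) - min(data)
variance = statistics.variance(data)  # sample variance (n - 1 in the denominator)
std_dev = statistics.stdev(data)      # square root of the sample variance

print("distribution:", sorted(distribution.items()))
print("mean:", mean, "median:", median, "mode:", mode)
print("range:", value_range, "variance:", variance, "std dev:", std_dev)
```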



H0: B1 = B2 = 0
Ha: at least one B is not equal to 0
It is a one-tailed test at the 5% significance level.

SSR = Sum (Y^ - ȳ)^2

d.f. for SSR = k = 1

MSR = SSR / k

= 7.83525

SSE = Sum (Y - Y^)^2

= 1346829407

d.f. for SSE = n - k - 1 = 35

MSE = SSE / (n - k - 1)

= 37411927.98

F = MSR / MSE

= 2.03614


F c.v =

2.09432 > 0.323985026
Since F > F c.v, reject H0: B1 = B2 = 0. Therefore, Ha holds: at least one B is not equal to 0.
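The chain SSR, MSR, SSE, MSE, F can be sketched for a regression with a single explanatory variable (k = 1). The data set and the function name `regression_f` are assumptions for illustration only; the critical value would still have to be looked up in an F table for (k, n - k - 1) degrees of freedom.

```python
def regression_f(xs, ys):
    """F statistic (MSR / MSE) for a least-squares line with one X variable."""
    n = len(xs)
    k = 1  # number of explanatory variables in this sketch
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b1 = sxy / sxx
    b0 = mean_y - b1 * mean_x
    fitted = [b0 + b1 * x for x in xs]
    ssr = sum((f - mean_y) ** 2 for f in fitted)
    sse = sum((y - f) ** 2 for y, f in zip(ys, fitted))
    msr = ssr / k            # d.f. for SSR = k
    mse = sse / (n - k - 1)  # d.f. for SSE = n - k - 1
    return msr / mse

# Hypothetical data, for illustration only
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
f_stat = regression_f(xs, ys)
print(f"F = {f_stat:.3f}")
```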

               Coefficients   Standard Error   t Stat     P-value    Lower 95%   Upper 95%   Lower 95.0%   Upper 95.0%
Intercept      498.3621       440.8294         1.13051    0.265949   -396.569    1393.293    -396.569      1393.293
X Variable 1   0.978816       0.051096         19.15623   3.92E-20   0.875085    1.082547    0.875085      1.082547
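The columns of this output are related: each t Stat is the coefficient divided by its standard error, and each 95% interval is the coefficient plus or minus a critical t value times the standard error. The sketch below reproduces those columns from the reported coefficients and standard errors. The critical value 2.0301 is an assumption taken from a standard t table (two-tailed, 5% level, 35 degrees of freedom); small differences from the reported figures come from rounding.

```python
# Assumed two-tailed 5% critical value of Student's t with 35 d.f. (from a t table)
T_CRIT_35 = 2.0301

# (coefficient, standard error) as reported in the regression output
rows = {
    "Intercept": (498.3621, 440.8294),
    "X Variable 1": (0.978816, 0.051096),
}

def summarize(coef, se, t_crit=T_CRIT_35):
    """Return (t statistic, lower 95% bound, upper 95% bound)."""
    t_stat = coef / se
    return t_stat, coef - t_crit * se, coef + t_crit * se

results = {name: summarize(coef, se) for name, (coef, se) in rows.items()}
for name, (t_stat, lower, upper) in results.items():
    print(f"{name}: t = {t_stat:.5f}, 95% CI = ({lower:.6f}, {upper:.6f})")
```

The interval for the intercept contains 0 while the interval for X Variable 1 does not, which matches the two P-values in the output.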

HYPOTHESIS
Ho: B1 = 0
Ha: B1 is not equal to 0

It is a two-tailed test with d.f. = 35 at the 0.05 significance level.

t = ( b1 - B1) / S(b1)

d.f = n-2 = 37 -2 = 35

Standard error of model = [{Sum (Y^ - Y)^2} / 35]^0.5 = 3364957
S(b1) = standard error of model / [Sum (X - x̄)^2]^0.5 = 3364957 / 1285049376
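The steps from the standard error of the model to the t statistic for the slope can be sketched as follows. The data set and the function name `slope_t_statistic` are hypothetical; the formulas are the ones stated above, with n - 2 degrees of freedom for a single X variable.

```python
import math

def slope_t_statistic(xs, ys, beta1_null=0.0):
    """Return (standard error of model, S(b1), t) for the slope of Y on X."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b1 = sxy / sxx
    b0 = mean_y - b1 * mean_x
    # Sum of squared residuals around the fitted line
    sse = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))
    s_model = math.sqrt(sse / (n - 2))  # standard error of the model
    s_b1 = s_model / math.sqrt(sxx)     # standard error of the slope
    t = (b1 - beta1_null) / s_b1
    return s_model, s_b1, t

# Hypothetical data, for illustration only
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
s_model, s_b1, t = slope_t_statistic(xs, ys)
print(f"s = {s_model:.4f}, S(b1) = {s_b1:.4f}, t = {t:.4f}")
```

With a single explanatory variable, the square of this t statistic equals the F statistic of the regression, so the two tests lead to the same decision.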

t = 19.15623 (the t Stat reported for X Variable 1)

t c.v = 0.324174

t > t c.v

19.15623 > 0.324174

So reject Ho: B1 = 0. Therefore, as the gradient is not equal to 0, there is a linear relationship between Y and X, and they are dependent on each other.


calculated p-value = 3.92E-20 (the P-value reported for X Variable 1)
alpha = 0.05

As the calculated p-value (3.92E-20) < alpha (0.05),

reject Ho: B1 = 0.

Mean of X=6350.59



Mean of Y=6549.84



Grand mean =(37*6350.59)+(36*6549.84) =6429.39

ANOVA
Source    SS                         Df   F
Between   536810.8654                1    0.0071
Within    5378744671                 71
Total     5378744671 + 536810.8654   72


Is calculated F > critical F? 0.007 > 4.0012 is false, so accept Ho.
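The between/within decomposition behind this ANOVA table can be sketched for two groups of unequal size. The group values and the function name `one_way_anova` are made up for illustration; note that the grand mean is the size-weighted average of the group means, i.e. the total of all values divided by the total count.

```python
def one_way_anova(groups):
    """Return (grand mean, SS between, SS within, F) for a list of groups."""
    all_values = [v for g in groups for v in g]
    grand_mean = sum(all_values) / len(all_values)
    group_means = [sum(g) / len(g) for g in groups]
    # Between: group size times squared distance of group mean from grand mean
    ss_between = sum(
        len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, group_means)
    )
    # Within: squared distance of each value from its own group mean
    ss_within = sum(
        (v - m) ** 2 for g, m in zip(groups, group_means) for v in g
    )
    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return grand_mean, ss_between, ss_within, f_stat

# Hypothetical groups, for illustration only
group_x = [1, 2, 3]
group_y = [4, 5, 6, 7]
grand_mean, ss_between, ss_within, f_stat = one_way_anova([group_x, group_y])
print(grand_mean, ss_between, ss_within, f_stat)
```

The null hypothesis of equal means is rejected only when the calculated F exceeds the critical F for (df between, df within) degrees of freedom.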

[Figure: residual plot of Series1; vertical axis -5000 to 6000, horizontal axis 0 to 20000]

As the residuals are evenly scattered around their mean of 0, the errors appear to be independent and random, so they can be treated as IID.
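A residual check of this kind can be sketched as follows, on made-up data with a hypothetical helper `residuals_of_fit`. For a least-squares line with an intercept, the residuals always average to zero, so it is the absence of a pattern in the plot, rather than the zero mean alone, that supports treating the errors as independent and random.

```python
def residuals_of_fit(xs, ys):
    """Fit Y^ = b0 + b1*X by least squares and return the residuals Y - Y^."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b1 = sxy / sxx
    b0 = mean_y - b1 * mean_x
    return [y - (b0 + b1 * x) for x, y in zip(xs, ys)]

# Hypothetical data, for illustration only
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
residuals = residuals_of_fit(xs, ys)
mean_residual = sum(residuals) / len(residuals)

print("residuals:", [round(r, 2) for r in residuals])
print("mean residual:", round(mean_residual, 10))
```

Plotting these residuals against X and looking for trends, curves, or funnels is the usual visual counterpart of this calculation.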