
Advanced Statistics Notes:-

1. Examining Relationships
 The graphical technique used to examine the relationship between two variables is the scatterplot.
 The numerical summary of the relationship between two numerical variables is the correlation.
 The technique that explains how one variable affects another is regression analysis.
 The study of relationships between two variables looks at possible cause & effect relationships.
 Hence the variables are divided into two types: the response variable, which measures the outcome of the study (the effect), and the explanatory variable, which explains or influences changes in the response variable.
 Scatterplots show the relationship between two quantitative variables.
 In a scatterplot we look for overall pattern, form, direction, strength and outliers.
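 One way to draw such a scatterplot in Python, assuming matplotlib is available (the data below is purely illustrative):
```python
import matplotlib.pyplot as plt

x = [1, 2, 3, 4, 5]  # explanatory variable (illustrative data)
y = [2, 4, 5, 4, 6]  # response variable
plt.scatter(x, y)    # one point per (x, y) pair
plt.xlabel("x (explanatory)")
plt.ylabel("y (response)")
plt.show()
```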
 Associations can be of two types- positive and negative
 Positive – as x increases, y also increases, and vice versa
 Negative – as x increases, y decreases, and vice versa
 A scatterplot tells us a lot about the form, direction and strength of a relationship between two
variables.
 Lines are one way to describe the form of a relationship, and a linear relationship is one of the most widely used forms.
 A linear relationship is strong if the points on the scatterplot lie close to the line, and weak if most of the points lie far from it.
 Correlation is a measure of how closely the points cluster around the line that describes the relationship. It measures the direction and strength of the linear relationship between two quantitative variables. It is denoted by r.
 Correlation r between two variables x and y is given by r = (1/(n−1)) * Σ ((xi − xmean)/sx) * ((yi − ymean)/sy), where xmean, ymean, sx and sy are the means and standard deviations of x and y respectively.
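 A minimal sketch of this correlation formula in plain Python (the function name and data are illustrative, not from the notes):
```python
def correlation(x, y):
    n = len(x)
    x_mean, y_mean = sum(x) / n, sum(y) / n
    # sample standard deviations (divide by n - 1)
    sx = (sum((xi - x_mean) ** 2 for xi in x) / (n - 1)) ** 0.5
    sy = (sum((yi - y_mean) ** 2 for yi in y) / (n - 1)) ** 0.5
    # r = (1/(n-1)) * sum of the standardized products
    return sum(((xi - x_mean) / sx) * ((yi - y_mean) / sy)
               for xi, yi in zip(x, y)) / (n - 1)

print(correlation([1, 2, 3, 4], [2, 4, 5, 7]))  # about 0.99: strong positive
```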
 Correlation does not make any distinction between explanatory and response variables, i.e. the r between x and y is the same as the r between y and x.
 It can only be calculated for quantitative variables.
 It does not change with changes in units of measurement, and it has no units itself; it is just a number between −1 and 1.
 Positive r (0 < r < 1) indicates a positive association between x and y. Negative r (−1 < r < 0) indicates a negative association between x and y.
 Values of r close to 0 indicate a weak linear relationship. A linear relationship is strong if its r value is close to 1 or −1; if r is exactly 1 or −1, the relationship is perfectly linear.
 It only measures linear association; it does not capture curved (e.g. quadratic) relationships.
 Like the mean and SD, it is affected by outliers.
 A regression line summarizes the relationship between the two variables in a specific setting
where you have an explanatory variable & a response variable. It shows how a response variable
y changes when explanatory variable x changes.
 The line that is closest to all points in terms of average distance is also called the Least Squares
Regression Line.
 The Least Squares Regression Line is the one that makes the sum of the squares of the vertical distances of the data points from the line as small as possible.
 The equation of the least squares regression line is yhat = a + bx, where b is the slope of the line and a is the intercept.
 The slope of the line is b = r * (sy/sx) and the intercept is a = ymean − b * xmean
 Slope is the rate of change of y; the amount of change in y when x increases by 1 unit.
 Intercept is value of yhat when x = 0.
 Prediction is value of yhat for given values of x.
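 A minimal sketch of fitting the least squares line from these formulas, b = r*(sy/sx) and a = ymean − b*xmean (the data is illustrative):
```python
x = [1, 2, 3, 4, 5]
y = [2.1, 3.9, 6.2, 7.8, 9.9]
n = len(x)
x_mean, y_mean = sum(x) / n, sum(y) / n
sx = (sum((xi - x_mean) ** 2 for xi in x) / (n - 1)) ** 0.5
sy = (sum((yi - y_mean) ** 2 for yi in y) / (n - 1)) ** 0.5
r = sum(((xi - x_mean) / sx) * ((yi - y_mean) / sy)
        for xi, yi in zip(x, y)) / (n - 1)
b = r * (sy / sx)        # slope: change in yhat per unit change in x
a = y_mean - b * x_mean  # intercept: yhat when x = 0
print(a + b * 6)         # prediction yhat for x = 6
```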
 Regression treats the explanatory and response variables differently; if we switch them in our data, we get a different regression line.
 Correlation and slope are related: a change of 1 SD in x corresponds to a change of r SDs in y.
 The least squares regression line always passes through the point (xmean, ymean).
 The square of the correlation, r², is the proportion of the variation in y that is explained by the variation in x (checked numerically in the residuals sketch below).
 Residuals are a good way to assess the fit of the least squares regression line.
 Residual = observed − predicted = y − yhat
 The mean of residuals is 0.
 A residual plot is a scatterplot of the residuals against the explanatory variable, i.e. a plot of y − yhat vs. x.
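 Continuing the example from the regression sketch above (x, y, y_mean, a, b as defined there), the residuals, their zero mean, and r² can be checked like this:
```python
residuals = [yi - (a + b * xi) for xi, yi in zip(x, y)]
print(sum(residuals) / len(residuals))  # ~0: residuals always average to zero

# r^2 = fraction of the variation in y explained by the line
ss_total = sum((yi - y_mean) ** 2 for yi in y)
ss_resid = sum(res ** 2 for res in residuals)
print(1 - ss_resid / ss_total)  # equals r**2
```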
 In a residual plot, curved patterns indicate that the relationship is not linear.
 If the spread of the residuals increases with x, predictions of yhat will be less accurate for large x, and vice versa.
 Individual points with large residuals are outliers.
 Individual points that are extreme in the x direction have a strong influence on the regression line.
 Limitations of correlation and regression – they describe only linear relationships
 They are easily affected by outliers
 Extrapolation is the act of using the regression line to predict values for x far outside the range of values on which the line is based.
 A lurking variable is a variable that has an important effect on the relationship between the variables under study but is not included among the variables studied.
 Association does not imply causation.
2. One Way ANOVA
 One way analysis of Variance (ANOVA) compares several means.
 The method is often used in scientific or medical experiments where treatments, processes, materials or products are being compared.
 In ANOVA, we compare the variability between the groups (how far apart their means are) with the variability within the groups (how much variation there is within each group itself). That is why we call it analysis of variance.
 It is based on two assumptions: one, the observations are random samples from normal distributions; two, the populations have the same variance, σ². Hence, before we perform ANOVA, we need to check whether these conditions are met by the data.
 ANOVA procedures are not very sensitive to unequal variances, so a rule of thumb is used: if the largest SD (not the variance) is less than twice the smallest SD, we can use ANOVA and the results will be valid.
 To check normality: with very small samples it is difficult to determine whether the samples come from a normal distribution. In such cases we can do a rough check by i) comparing the group means to the medians, since for symmetric distributions these will be equal, and ii) looking at boxplots or histograms to see how the data is distributed.
 To check for equal variance, the group SDs can be compared, as in the sketch below.
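 A rough check of the rule of thumb above, with illustrative group SDs:
```python
group_sds = [1.1, 0.8, 1.5]  # sample SDs of the groups (illustrative)
if max(group_sds) < 2 * min(group_sds):
    print("SDs are similar enough; ANOVA results should be valid")
else:
    print("SDs are too unequal; ANOVA may not be appropriate")
```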
 The ANOVA Model – Notation
 If we sample n observations from each of k populations, the total number of observations is N = k*n
 yij represents the jth observation in group i
 ymeani represents the mean of group i
 ymean represents the mean of all observations
 Sum of Squares – the total variation of all the observations about the overall mean is measured by what is called the Total Sum of Squares (SSt)
 SSt = ΣΣ(yij − ymean)²
 The variation can be split into two components: one, the variation of the group means about the overall mean, i.e. the between groups variation; two, the variation of the individual observations about their group means, i.e. the within groups variation.
 SSt = SSb + SSw
 SSb = n * Σ(ymeani − ymean)²
 SSw = ΣΣ(yij − ymeani)²
 Total Sum of Squares = Between groups sum of squares + Within groups sum of squares
 Each sum of squares has a certain number of degrees of freedom
 SSt compares N observations to the overall mean, so has N-1 degrees of freedom.
 SSb compares k means to the overall mean, so has k-1 degrees of freedom.
 SSw compares N observations to k sample means, so has N-k degrees of freedom.
 The degrees of freedom are related in the same way as the sums of squares : dft = dfb +dfw
 The degrees of freedom indicate how many values are free to vary. For variances and sums of squares, because the sum of the deviations is always 0, the last deviation can be found if we know all the others; so if we have n deviations, only n−1 are free to vary.
 The Mean Sum of Squares for each source of variation is defined as its sum of squares divided by its degrees of freedom.
 MSb = SSb / (k-1)
 MSw = SSw / (N-k)
 The F test in ANOVA uses the statistic F = MSb / MSw. If the null hypothesis is true, i.e. there are no differences between the unknown population means, then MSb and MSw will be very similar and F will be close to 1; large values of F are evidence against the null hypothesis.
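 A minimal one-way ANOVA sketch following the notation above (k groups of n observations each; the data is illustrative):
```python
groups = [
    [4.2, 4.8, 5.1, 4.5],  # group 1
    [5.9, 6.3, 6.0, 6.4],  # group 2
    [5.0, 5.2, 4.9, 5.5],  # group 3
]
k = len(groups)
n = len(groups[0])
N = k * n
grand_mean = sum(y for g in groups for y in g) / N
group_means = [sum(g) / n for g in groups]

ss_b = n * sum((m - grand_mean) ** 2 for m in group_means)  # between groups
ss_w = sum((y - m) ** 2                                     # within groups
           for g, m in zip(groups, group_means) for y in g)
ms_b = ss_b / (k - 1)  # mean square between, df = k - 1
ms_w = ss_w / (N - k)  # mean square within, df = N - k
print(ms_b / ms_w)     # F statistic: near 1 if the null hypothesis is true
```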
