
SCI1020: Introduction to Statistical Reasoning


EXPLORING DATA: Relationship between Two Quantitative Variables

Student's Name: Tutorial Day/Time:

PRELIMINARY READING: D S Moore et al, “Basic Practice of Statistics”, Chs 4-5.

On completion of this workshop you should be able to:

1. Produce a scatterplot of quantitative data with appropriate explanatory and response axes;

2. Recognise a linear pattern and the general formula for a straight line;

3. Calculate a predicted value given the equation of the linear regression line;

4. Add a linear line of best fit to data using MS EXCEL, and describe the regression line (equation and correlation);

5. Assess the closeness of fit using the least-squares criterion as reflected in the correlation coefficient;

6. Obtain residual values and interpret their size and distribution about the line in the form of a residual plot;

7. Find or calculate and interpret the squared correlation, r^2.


These problems are to help you engage with the lecture material, and also to make sure that everyone is up-to-speed before the workshop starts. Please make sure you do them before class each week!

Q.1 State in your own words what is meant by each of the terms listed below. Be specific.

Term: Definition

Explanatory variable: The independent variable, plotted on the x-axis, that explains variation in the response variable.

Response variable: The variable, plotted on the vertical (y) axis, that measures the outcome of a study.

Association: The direction of the trend between two variables, which can be positive, negative, or none.

Correlation: A statistic that measures the strength and direction of a linear relationship.

Regression line: A straight line that describes how the response variable (Y) changes as the explanatory variable (X) changes.

Residual: The difference between the observed value of the response variable and the value predicted by the regression line (Residual = Observed Y - Predicted Y).

Q.2 What is the general equation of a straight line? Define all the terms in the equation.

Y = mX + c

m = gradient/slope

c = intercept (Value of Y, when X=0)

Y = response variable

X = explanatory variable

Week 2 Copyright 2021: Monash University Page | 1

Q.3 Do Q5.2 from Moore et al text, p.130.

What is the regression line equation based on the description of the trend in this example?

Q5.2 You use the same bar of soap to shower each morning. The bar weighs 80 g when new, and its weight goes down by 5 g per day on average. What is the equation of the regression line for predicting weight from days of use?


R^2 = 1

X = Days, Y = Weight
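
The trend described in Q5.2 can be written as a line. The sketch below assumes the 80 g starting weight is the intercept and the average loss of 5 g per day is the slope. The function name `predicted_weight` is illustrative only and does not come from the text.

```python
def predicted_weight(days):
    """Predicted soap weight in grams after `days` days of use."""
    intercept = 80.0   # weight of a new bar (g), from the question
    slope = -5.0       # average change in weight per day (g/day), from the question
    return intercept + slope * days

# Predicted weight when new, after one day, and after one week.
for d in (0, 1, 7):
    print(d, predicted_weight(d))
```

Because the weight is described as falling by the same amount each day on average, every predicted point lies exactly on this line.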


Q.4 Demonstration of correlation and least squares regression.

a) Go to the website (Note

that spaces in the URL are underscores_ ).

Create a scatterplot with a linear trend (similar to plot #1 below). Observe the size of the correlation coefficient for different scatter patterns. Use “Draw your own line” to draw a line of best fit. Change the intercept and slope, trying to minimise the sum of the squares of the residuals as shown by the “relative SS” value. Compare your line with the “Show least-squares line” option, which places the line by calculation. No written answers are required here; just observe the values.
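
The comparison in part a) can also be sketched in code. The example below uses made-up data and a made-up hand-drawn guess (neither comes from the website or the textbook), and compares the guess's sum of squared residuals with that of the least-squares line.

```python
# Hypothetical (x, y) data, made up for illustration only.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

def sum_sq_residuals(slope, intercept):
    """Sum of squared vertical distances from the points to the line."""
    return sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))

def least_squares_line():
    """Slope and intercept that minimise the sum of squared residuals."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# A hand-drawn guess (assumed values) versus the calculated line.
guess_ss = sum_sq_residuals(slope=1.8, intercept=0.5)
best_slope, best_intercept = least_squares_line()
best_ss = sum_sq_residuals(best_slope, best_intercept)

print("guess SS:", round(guess_ss, 3))
print("least-squares SS:", round(best_ss, 3))
```

No other straight line gives a smaller sum of squares than the least-squares line, so the second printed value should never exceed the first.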

b) Describe the relationship in the x-y data plotted below:

[Figure: three scatterplots]

Plot 1. Quiz score vs chocolate consumption. x-axis: Daily chocolate consumption (g); y-axis: Quiz score (%).

Plot 2. Change in pulse rate with exercise. x-axis: Pulse rate before exercise (beats per minute); y-axis: Pulse rate after exercise (beats per minute).

Plot 3. Measured radioactive decay. x-axis: Time (mins); y-axis: Counts per minute.

Identify the association (positive/negative/none) and correlation (strong/moderate/weak/none) present.

PLOT                       1       2          3

Association                None    Positive   Negative

Correlation                None    Moderate   None

Estimate r (if approp.)    0       0.8-0.9    N/A


Also see “Introduction to Excel” Section 2.7 pp.13-15.

Using Excel to produce a scatterplot of the data and add a LINEAR line of best fit:

• Plot Response variable (y-axis) against explanatory variable (x-axis). Excel: left hand column = x

• Select the chart layout that has the line and fx so that you obtain the equation of the line of best fit.

• Note the correlation coefficient, r, and its square, R^2;

• Note the coefficients which are the intercept and slope for the equation;

• Obtain the full regression analysis including a residual plot by using Data Analysis/Regression. Note

that it asks for the y-data column first, and you need the data in columns, not rows;

• Check the appropriateness of linearity by interpreting the residual plot. Describe the scatter or pattern

in the residual plot: A random scatter of residuals, plus and minus, along the added line of best fit

indicates that linear IS appropriate. This is important. Data can often look linear, but a closer check

often reveals that a different trend is present!
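
For comparison with the Excel output, the same quantities can be computed by hand. The sketch below uses made-up data (not from the textbook or the Moodle data set) and calculates the slope, intercept, r, R^2 and the residuals that a residual plot would show.

```python
import math

# Hypothetical data, made up for illustration; x is explanatory, y is response.
x = [10, 20, 30, 40, 50, 60]
y = [12, 19, 33, 38, 52, 58]

n = len(x)
mean_x = sum(x) / n
mean_y = sum(y) / n
sxx = sum((xi - mean_x) ** 2 for xi in x)
syy = sum((yi - mean_y) ** 2 for yi in y)
sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))

slope = sxy / sxx                      # least-squares slope
intercept = mean_y - slope * mean_x    # least-squares intercept
r = sxy / math.sqrt(sxx * syy)         # correlation coefficient
r_squared = r ** 2

# Residual = observed y - predicted y, one per data point.
residuals = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]

print(f"y = {slope:.3f}x + {intercept:.3f}, r = {r:.3f}, R^2 = {r_squared:.3f}")
for xi, res in zip(x, residuals):
    print(xi, round(res, 2))   # look for a random mix of + and - values
```

Least-squares residuals always sum to zero (up to rounding), so the check for linearity is whether their signs and sizes look random along x rather than curved.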


Q.5 Do Q4.29 (7th ed: Q4.28) from Moore et al, p.121 and with the same data, Q5.39, p.155.

Download the data set “Sparrowhawk” from the Moodle page/Part 1: Exploring Data.

Produce the scatterplot of the relationship in the (x,y) data.

Describe the association between New Adults arriving and the percentage of returning birds:

Negative association: as the percentage of returning birds (x) increases, the number of new adults arriving (y) tends to decrease.

What is the general strength of the association?


EXTRA: What is the R-squared value?

What does this R-squared value specifically tell us about this association?

The regression line explains 56% of the variation in the response variable (Y).
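
As a rough check on the strength of the association, r can be recovered from R^2. The sketch below assumes R^2 = 0.56 (implied by the 56% figure) and takes the sign of r from the negative association described earlier.

```python
import math

r_squared = 0.56   # assumed from the "56% of variation" statement
sign = -1          # the association was described as negative

# r has the same sign as the slope and magnitude sqrt(R^2).
r = sign * math.sqrt(r_squared)
print(round(r, 3))
```

A value of r around -0.75 would usually be described as a moderately strong negative linear association.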

ALSO: Apply linear regression analysis using Excel. Include the residual plot.

TUTOR CHECK OF PLOTS: Scatterplot and Residual plot

(Not done in class? You must attach your plots printed out)

ALSO: Describe the residual plot. Is there any trend: is there a curve of data about the line of best

fit OR are the data points randomly scattered either side along the linear trend line?

What does this tell you about fitting a linear model to these data?

The residuals are randomly scattered on either side of the line, which suggests that a linear model fits these data.

Moore Q5.39 a) What is the equation of the linear model for this relationship?

Do not use x and y designations but replace them with a descriptive notation for the variables.

(number of new adults) = -0.304 × (% of adults from the previous year that return) + 31.934

Moore 5.39 b) For this sparrowhawk data:

Value of the Slope = -0.304

What does the slope actually indicate (use the actual value to explain size of influence of x on y)?

The slope shows that for each 1 percentage point increase in the percentage of adults returning from the previous year (x), the predicted number of new adults (y) decreases by about 0.304.

Moore 5.39 c) Use the model to predict the new adult number if 45% of adults from the previous
year return:
Predicted value x=45

Y = -0.304x + 31.934

Y = -0.304(45) + 31.934

Y = 18.254

Value predicted = 18.254
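
The prediction can be checked with a short calculation. This sketch uses the slope and intercept quoted in the working; the function name `predict_new_adults` is illustrative only.

```python
SLOPE = -0.304       # slope of the fitted line, from the working
INTERCEPT = 31.934   # intercept of the fitted line, from the working

def predict_new_adults(percent_return):
    """Predicted number of new adults for a given percentage of returning adults."""
    return SLOPE * percent_return + INTERCEPT

print(round(predict_new_adults(45), 3))
```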

EXTRA: Verify the value of the residual for the datum point where x = 45. Show the full
calculation of a residual.
Residual at each x = (data y value - line-predicted ŷ value)
Residual = (Data Y) - (Predicted Y)

= 17 - 18.254

= -1.254

Check your value against the Excel output for x = 45.
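
The residual calculation can be reproduced as a quick check before comparing with Excel. The sketch assumes the observed value of 17 at x = 45 and the line coefficients used in the working.

```python
observed = 17                       # observed y at x = 45, from the working
predicted = -0.304 * 45 + 31.934    # value on the fitted line at x = 45
residual = observed - predicted     # residual = observed y - predicted y

print(round(residual, 3))
```

A negative residual means the observed point lies below the regression line at that x value.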

Q.6 Appropriateness? Causation? Application? ….

Explain at least two cautions that you should make when making interpretations of an x-y
relationship using linear regression analysis.
Consider Moore et al, Chapter 5 summary, 8th ed, pp.152-153 (7th ed: pp 151-152).

1. Pay close attention to outliers and influential observations. These can shift the correlation and the regression line significantly.

2. A strong association between variables does not necessarily indicate a cause-and-effect relationship between them.

From this exercise, you should make sure you are able to:
• draw a scatterplot
• obtain a line of best fit for linear data and identify its equation
• obtain a residual plot
• obtain the correlation coefficient
• state what each of the items above tells you about the data.

MARK : /10

