

Two Variable Statistics


a) A numerical indicator of the direction and intensity of a relationship between two

variables is the correlation coefficient. A strong positive link is indicated, for example, if

the correlation coefficient between exam scores and study time is 0.9.

b) Linear regression: This statistical method predicts the value of a dependent variable
from the value of an independent variable. For instance, using historical advertising
spending data, a company may use linear regression to forecast future sales.

c) Small sample bias: This type of statistical bias happens when the sample used to
estimate a population parameter is too small, which might produce unreliable results.
For example, a political survey carried out in a city with just 50 participants would
be prone to this bias.

d) When two variables seem to be related, but in reality, a third variable drives their

relationship, this is known as a common cause relationship. For instance, there is a real

cause for the apparent correlation between shark attacks and ice cream sales: hot weather.

2. In order to investigate the statistical correlation between the quantity of trees and

Starbucks locations throughout various neighbourhoods, I would utilize a variety of


Conduct a Comprehensive Survey: In order to get particular data for communities

where accessible data might not be sufficient, survey design would be essential. The

survey would be sent out via a number of platforms, including social media, email, and

door-to-door surveys, to both local companies and households.

Utilize Publicly Available Data: Many cities and towns publish open data with details on
tree counts and business locations, including Starbucks stores. To ensure the data is
accurate and complete, it would be essential to gather information from several
reliable sources.


Perform Visual Inspections: This phase entails physically going to each

neighbourhood to count the number of trees and determine whether Starbucks is

present. Visual inspections function as a process of verification to authenticate and

certify the correctness of information gleaned from external sources.

Biases and Mitigation:

Selection Bias: If the study's sample is not representative of the total population, bias

may result. The use of stratified sampling would reduce selection bias. This entails

segmenting the city according to specific parameters, such as income distribution or

urban development, into several tiers or subgroups. To guarantee that the sample

accurately reflects the diversity of the city's communities, proportionate sampling

from each stratum is used.

Measurement Bias: Inaccurate or inconsistent data collection leads to measurement

bias. Adhering to a consistent process for data collection is essential to minimize

measurement bias. This strategy would provide precise guidelines for surveyors on

how to recognize and count trees and collect data on Starbucks locations in a uniform


3. 𝑦 = 13.695𝑥 − 30.04


Using Microsoft Excel, I made two columns of data, one for Years and the other for
Height. After selecting the entire data range, I used the Insert Chart option and chose
Scatter. I obtained the equation of the line of best fit by choosing Chart/Add Trendline
and verifying that Linear was the default setting. Then, after selecting the chart's
straight line, I chose Format Trendline/Options and checked the box to display the
equation on the chart. I also displayed 𝑟².

4. Correlation coefficient (𝑟) = 0.9032

Coefficient of determination (𝑟²) = 0.8158

The correlation coefficient is 0.9032. Since it is positive and very close to 1, we can
conclude that there is a strong positive linear correlation between the height and
years variables.

The coefficient of determination, 𝑟², indicates that 81.58% of the variation in
height can be explained by the variation in years.


[Figure: residuals plot]

The residual plot shows no pattern and the residual points are randomly dispersed, so
we can conclude that the line of best fit is an appropriate model for predicting the
height. Also, based on the actual data, the residuals (the differences between the
actual and predicted y-values) range from -30 to 35.


Source of data: Climate at a Glance Global Time Series (Climate at a Glance Global

Time Series, n.d.)



Authority and Reputation: Established and well-known organizations, government

agencies, academic institutions, and reputable research centres are generally considered

reliable sources.

Timeliness and Updates: The website has data going back to the 1990s and is updated
continuously to reflect the latest changes.

7. Data from year 2000 to 2012 for temperature anomaly for the month of January.

x (years)   temperature anomaly
1           0.41
2           0.55
3           0.57
4           0.62
5           0.62
6           0.56
7           0.69
8           0.68
9           0.60
10          0.58
11          0.67
12          0.71
13          0.62

[Figure: scatter plot of temperature anomaly against x (years), with trend line y = 0.0133x + 0.5131 and R² = 0.4366]

The equation of the line of best fit is 𝑦 = 0.0133𝑥 + 0.5131, where y = temperature
anomaly and x = years (x = 1 is 2000).

The correlation coefficient r = 0.6608

The coefficient of determination r² = 0.4366

This implies that there is a moderate positive correlation between temperature
anomalies and time. Approximately 43.66% of the variability in the temperature anomaly
is explained by the model.
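The fit can be cross-checked outside Excel. The following is a minimal sketch, assuming the thirteen January anomalies from the table in question 7 and the usual least-squares formulas; the names `fit_line`, `x_values`, and `anomalies` are illustrative and not part of the original answer.

```python
from math import sqrt

# January temperature anomalies from the table in question 7 (x = 1 is 2000).
x_values = list(range(1, 14))
anomalies = [0.41, 0.55, 0.57, 0.62, 0.62, 0.56, 0.69,
             0.68, 0.60, 0.58, 0.67, 0.71, 0.62]


def fit_line(xs, ys):
    """Return (slope, intercept, r) for the least-squares line through the points."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r = sxy / sqrt(sxx * syy)
    return slope, intercept, r


slope, intercept, r = fit_line(x_values, anomalies)
print(f"y = {slope:.4f}x + {intercept:.4f}")
print(f"r^2 = {r * r:.4f}")
```

Rounded to four decimal places, the slope, intercept, and r² agree with the trend line and R² value shown on the Excel chart.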

8. Approximating:

10 years before the data range (1990), x = −9. Substituting into the model:

𝑦 = 0.0133 × (−9) + 0.5131 = 0.3934 degrees

10 years after the data range (2022), x = 23. Substituting into the model:

𝑦 = 0.0133 × (23) + 0.5131 = 0.8190 degrees
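The substitution can be checked with a short script. This is a sketch only, assuming the rounded model 𝑦 = 0.0133𝑥 + 0.5131 and the convention that x = 1 corresponds to 2000; `year_to_x` and `predict_anomaly` are hypothetical helper names.

```python
def year_to_x(year, base_year=2000):
    """Convert a calendar year to the x value used in the model (2000 -> 1)."""
    return year - base_year + 1


def predict_anomaly(x, slope=0.0133, intercept=0.5131):
    """Evaluate the line of best fit y = slope * x + intercept."""
    return slope * x + intercept


for year in (1990, 2022):
    x = year_to_x(year)
    print(f"{year}: x = {x}, predicted anomaly = {predict_anomaly(x):.4f}")
```

Both years lie outside the 2000 to 2012 data range, so these values are extrapolations and should be treated with caution.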

9. Residuals plot

[Figure: residuals plotted against x (years)]

From the residuals plot, there seems to be a linear relationship between the residuals,
which implies that the linear model is not the best fit for the data, as there should
not be any correlation among the residuals.
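The residuals behind such a plot can be tabulated directly, which makes any pattern easier to inspect. A minimal sketch, assuming the anomalies from the table in question 7 and the rounded model 𝑦 = 0.0133𝑥 + 0.5131; the variable names are illustrative.

```python
# January temperature anomalies from the table in question 7 (x = 1 is 2000).
x_values = list(range(1, 14))
anomalies = [0.41, 0.55, 0.57, 0.62, 0.62, 0.56, 0.69,
             0.68, 0.60, 0.58, 0.67, 0.71, 0.62]

SLOPE = 0.0133       # rounded slope of the line of best fit
INTERCEPT = 0.5131   # rounded intercept of the line of best fit

# Residual = observed value minus the value predicted by the line.
residuals = [y - (SLOPE * x + INTERCEPT) for x, y in zip(x_values, anomalies)]

for x, res in zip(x_values, residuals):
    print(f"x = {x:2d}  residual = {res:+.4f}")
```

Plotting `residuals` against `x_values` gives the residuals plot; the values sum to approximately zero, as expected for a least-squares line.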

10. Data Collection Location Bias: The temperature data may not adequately reflect

worldwide trends if it is primarily collected from a limited number of regions. I might

look into it by finding out where the data collection sites are located geographically.

Sampling Frequency Bias: Bias may be introduced by uneven or irregular data

collection intervals. I will try to find out how frequently and consistently data is


11. Scatter plot


[Figure: scatter plot with trend line y = 23.328x − 59.831 and R² = 0.9131]


x          1       2      3       4        5        6        7       8      9
residuals  29.303  6.975  -7.753  -16.481  -21.709  -20.137  -6.865  6.707  29.979

Residuals = y values – estimated y values from line of best fit.

[Figure: residuals plot for the linear model]



The residual plot follows a quadratic pattern, hence the data does not follow a linear



x            1     2     3     4     5     6     7     8      9
estimated y  -6.2  -5.7  1.4   15.1  35.4  62.3  95.8  135.9  182.6
residuals    -1    -0.5  1     1.9   -0.3  -2.3  0.8   -2.4   -2.5

The y estimates were obtained by substituting the x values into the new model
𝑦 = 3.3𝑥² − 9.4𝑥 − 0.1. Subtracting the estimates from the original y values then
gave the residuals. Using the Excel chart function/scatter, we obtained the following
plot of the residuals against the x values:



[Figure: residuals plot for the quadratic model]

From the residuals plot, the data points are scattered with no identifiable pattern;
hence, the model 𝑦 = 3.3𝑥² − 9.4𝑥 − 0.1 is valid for this dataset.
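The two sets of residuals can be reproduced with a few lines of code. This is a sketch under stated assumptions: the tabulated y row is read as the quadratic model's estimates, and the observed y values, which are not listed in the answer, are reconstructed as estimate plus residual. The function and variable names are illustrative.

```python
# x values and quadratic-model residuals as tabulated in the answer.
x_values = list(range(1, 10))
quad_residuals = [-1, -0.5, 1, 1.9, -0.3, -2.3, 0.8, -2.4, -2.5]


def quadratic_model(x):
    """Quadratic model from the answer: y = 3.3x^2 - 9.4x - 0.1."""
    return 3.3 * x ** 2 - 9.4 * x - 0.1


def linear_model(x):
    """Trend line from question 11: y = 23.328x - 59.831."""
    return 23.328 * x - 59.831


# Estimates from the quadratic model, rounded to one decimal place as in the table.
quad_estimates = [round(quadratic_model(x), 1) for x in x_values]

# Assumption: residual = observed - estimate, so observed = estimate + residual.
observed_y = [round(est + res, 1) for est, res in zip(quad_estimates, quad_residuals)]

# Residuals of the same observations against the linear trend line.
linear_residuals = [round(y - linear_model(x), 3) for x, y in zip(x_values, observed_y)]

for x, y, lin, quad in zip(x_values, observed_y, linear_residuals, quad_residuals):
    print(f"x = {x}  y = {y:6.1f}  linear residual = {lin:8.3f}  quadratic residual = {quad:5.1f}")
```

The linear residuals run from large positive through negative and back to large positive, while the quadratic residuals stay within about 2.5 of zero, which is consistent with preferring the quadratic model.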

13. Cause-Effect:

CAUSE: shorter travel time

EFFECT: lower test scores

The majority of students who commute less, perhaps as a result of living closer to

school, may tend to wake up later than their peers whose homes are located farther

away. They might therefore occasionally turn to procrastination, which could result in

subpar academic achievement.

14. Cause-Effect:

CAUSE: Children with more books at home

EFFECT: Earn a PhD when they grow up


A child's interest in learning may grow if they have books at home. A home library is
also common in the homes of highly educated individuals, such as PhD holders, so their
children may acquire the same passion for learning.

15. Accidental

This is purely coincidental, as individuals did not consciously select their phone

number and birth date is not directly associated with it.

16. Cause-Effect

CAUSE: Drinking coffee in the morning

EFFECT: Have insomnia at night

Coffee consumption typically results in trouble sleeping.



ET AL, C. (2002). Mathematics of Data Management. McGraw-Hill Ryerson School.

Climate at a Glance Global Time Series. (n.d.). National Centres for Environmental


