Two Variable Statistics

1. a) Correlation coefficient: A numerical indicator of the direction and strength of the
linear relationship between two variables. For example, a correlation coefficient of 0.9
between exam scores and study time indicates a strong positive relationship.
b) Linear regression: This statistical method predicts the value of a dependent variable (one
that depends on the value of an independent variable) by fitting a straight line to the data.
For instance, using historical advertising spending data, a company may use linear
regression to forecast future sales.
c) Small sample bias: This type of statistical bias occurs when the sample used to estimate a
population parameter is too small, which can produce unreliable results. For example, a
political survey carried out in a city with just 50 participants is unlikely to represent the
views of the whole city.
d) Common cause relationship: When two variables seem to be related but a third variable
actually drives both of them, this is known as a common cause relationship. For instance,
the apparent correlation between shark attacks and ice cream sales has a common cause:
hot weather.
2. To investigate the statistical correlation between the number of trees and the number of
Starbucks locations across various neighbourhoods, I would use several techniques:
Conduct a Comprehensive Survey: A well-designed survey would be essential for
gathering data on communities where the available data is insufficient. The survey
would be distributed to both local businesses and households through several
channels, including social media, email, and door-to-door visits.
Utilize Publicly Available Data: Many cities and towns publish open data on tree
counts and business locations, including Starbucks locations. To ensure the data is
accurate and complete, it would be important to gather information from several
reliable sources.

Perform Visual Inspections: This phase entails physically visiting each
neighbourhood to count the number of trees and determine whether a Starbucks is
present. Visual inspections serve to verify the information obtained from external
sources.
Biases and Mitigation:
Selection Bias: If the study's sample is not representative of the whole population,
bias may result. Stratified sampling would reduce selection bias. This entails dividing
the city into strata (subgroups) according to specific characteristics, such as income
distribution or urban development. Sampling proportionately from each stratum helps
ensure that the sample reflects the diversity of the city's communities.
Measurement Bias: Inaccurate or inconsistent data collection leads to measurement
bias. Following a consistent data collection protocol is essential to minimize it. The
protocol would give surveyors precise guidelines on how to identify and count trees
and how to record Starbucks locations in a uniform way.
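The proportionate stratified sampling described under Selection Bias can be sketched in a few lines of Python. This is only an illustration: the stratum names, the neighbourhood IDs and the sample size of 20 are hypothetical and do not come from any real city data.

```python
import random

def proportionate_stratified_sample(strata, total_sample_size, seed=0):
    """Sample from each stratum in proportion to its share of the population."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    population_size = sum(len(units) for units in strata.values())
    sample = {}
    for name, units in strata.items():
        # Allocate slots by stratum share; rounding may need adjusting
        # when the shares do not divide evenly.
        k = round(total_sample_size * len(units) / population_size)
        sample[name] = rng.sample(units, k)
    return sample

# Hypothetical neighbourhood IDs grouped by income tier (illustrative only).
strata = {
    "low_income": [f"L{i}" for i in range(50)],
    "middle_income": [f"M{i}" for i in range(30)],
    "high_income": [f"H{i}" for i in range(20)],
}

sample = proportionate_stratified_sample(strata, total_sample_size=20)
sizes = {name: len(chosen) for name, chosen in sample.items()}
print(sizes)
```

With 50, 30 and 20 neighbourhoods in the three hypothetical tiers, a sample of 20 is split 10, 6 and 4, so each tier keeps its share of the population.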
3. y = 13.695x − 30.04
Explanation
Using Microsoft Excel, I made two columns of data, one for Years and the other for
Height. I selected the entire data range, used the Insert Chart option, and chose
Scatter. I obtained the equation of the line of best fit by choosing Chart/Add Trendline
and verifying that Linear was selected as the trend type. I then selected the trend line
on the chart, chose Format Trendline/Options, and checked the box to display the
equation on the chart. I also displayed r².
4. Correlation coefficient (r) = 0.9032
Coefficient of determination (r²) = 0.8158
The correlation coefficient is 0.9032. Since it is positive and very close to 1, we can
conclude that there is a strong positive linear correlation between height and years.
The coefficient of determination, r², indicates that 81.58% of the variation in
height can be explained by the variation in years.
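As a quick consistency check, the coefficient of determination for a simple linear fit is the square of the correlation coefficient. The sketch below only squares the reported value of r; it does not recompute anything from the height data, which is not listed here.

```python
r = 0.9032                    # correlation coefficient reported above
r_squared = r ** 2            # coefficient of determination for a linear fit
percent_explained = 100 * r_squared

print(round(r_squared, 4), round(percent_explained, 2))
```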
5.
[Figure: residuals plot. Horizontal axis: years (0 to 140); vertical axis: residuals (-40 to 40).]
Because the residual plot shows no pattern and the residuals are randomly dispersed, we
can conclude that the line of best fit is an appropriate model for predicting height. Based
on the actual data, the residuals (the differences between the actual and predicted
y-values) range from about -30 to 35.
6. Source of data: Climate at a Glance Global Time Series (Climate at a Glance Global
Time Series, n.d.)

Reliability
Authority and Reputation: Established and well-known organizations, government
agencies, academic institutions, and reputable research centres are generally considered
reliable sources. This data is published by the National Centers for Environmental
Information, a government agency.
Timeliness and Updates: The website has data going back to the 90s and is continuously
updated to stay current.
7. January temperature anomaly data for the years 2000 to 2012 (year 1 = 2000).
Year   Temperature anomaly
1      0.41
2      0.55
3      0.57
4      0.62
5      0.62
6      0.56
7      0.69
8      0.68
9      0.6
10     0.58
11     0.67
12     0.71
13     0.62
[Figure: scatter plot of temperature anomaly against years, with trend line y = 0.0133x + 0.5131 and R² = 0.4366.]
The equation of the line of best fit is y = 0.0133x + 0.5131, where y = temperature
anomaly and x = years.
The correlation coefficient r = 0.6607
The coefficient of determination r² = 0.4366
This implies that there is a moderate positive correlation between temperature anomalies
and time. Approximately 43.66% of the variability in the temperature anomaly is explained
by the model.
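The line of best fit and the correlation can also be recomputed directly from the January table with the standard least-squares formulas. The sketch below is a cross-check written in plain Python rather than the Excel trend line used above; the variable names are illustrative.

```python
from math import sqrt

# Data from the January table: x = 1 is 2000, x = 13 is 2012.
years = list(range(1, 14))
anomaly = [0.41, 0.55, 0.57, 0.62, 0.62, 0.56, 0.69,
           0.68, 0.6, 0.58, 0.67, 0.71, 0.62]

n = len(years)
mean_x = sum(years) / n
mean_y = sum(anomaly) / n

# Sums of squares and cross-products about the means.
s_xx = sum((x - mean_x) ** 2 for x in years)
s_yy = sum((y - mean_y) ** 2 for y in anomaly)
s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(years, anomaly))

slope = s_xy / s_xx                    # least-squares slope
intercept = mean_y - slope * mean_x    # least-squares intercept
r = s_xy / sqrt(s_xx * s_yy)           # Pearson correlation coefficient
r_squared = r ** 2                     # coefficient of determination

print(f"y = {slope:.4f}x + {intercept:.4f}")
print(f"r = {r:.4f}, r^2 = {r_squared:.4f}")
```

Rounded to four decimal places, the slope, intercept and r² should agree with the values shown on the scatter plot.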
8. Approximating:
Ten years before the start of the data (1990), x = -9. Substituting into the model:
y = 0.0133 * (−9) + 0.5131
= 0.3934 degrees
Ten years after the end of the data (2022), x = 23. Substituting into the model:
y = 0.0133 * (23) + 0.5131
= 0.8190 degrees.
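The two substitutions can be checked with a short Python sketch. The function name `predict` is illustrative, and the slope and intercept are the rounded values from the line of best fit. Both x values lie outside the 2000 to 2012 data range, so the results are extrapolations and should be read with caution.

```python
def predict(x, slope=0.0133, intercept=0.5131):
    """Evaluate the fitted line y = slope * x + intercept."""
    return slope * x + intercept

estimate_1990 = predict(-9)   # x = -9 corresponds to 1990
estimate_2022 = predict(23)   # x = 23 corresponds to 2022

print(round(estimate_1990, 4), round(estimate_2022, 4))
```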
9. Residuals plot
[Figure: residuals plot. Horizontal axis: x (years), 0 to 14; vertical axis: residuals, -1.8 to 0.]
From the residuals plot, the residuals appear to follow a linear trend, which implies that
the linear model is not the best fit for the data, since the residuals of a well-fitting
model should show no pattern.
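As a cross-check on the plotted values, the residuals can be recomputed from the January table and the fitted line y = 0.0133x + 0.5131. Residuals from a least-squares line with an intercept sum to approximately zero, so they would be expected to fall on both sides of the horizontal axis. The sketch below uses the rounded coefficients, so the sum is close to zero rather than exactly zero. If these values differ from the plotted ones, the plotted residuals may need rechecking.

```python
# Data from the January table: x = 1 is 2000, x = 13 is 2012.
years = list(range(1, 14))
anomaly = [0.41, 0.55, 0.57, 0.62, 0.62, 0.56, 0.69,
           0.68, 0.6, 0.58, 0.67, 0.71, 0.62]

# Residual = actual value minus the value predicted by the fitted line.
residuals = [y - (0.0133 * x + 0.5131) for x, y in zip(years, anomaly)]

residual_sum = sum(residuals)
smallest = min(residuals)
largest = max(residuals)

print([round(e, 3) for e in residuals])
print(round(residual_sum, 3), round(smallest, 3), round(largest, 3))
```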
10. Data Collection Location Bias: The temperature data may not adequately reflect
worldwide trends if it is primarily collected from a limited number of regions. I could
investigate this by finding out where the data collection sites are located geographically.
Sampling Frequency Bias: Bias may be introduced by uneven or irregular data
collection intervals. I could investigate this by finding out how frequently and
consistently the data is collected.
11. Scatter plot
[Figure: scatter plot of y against x, with trend line y = 23.328x - 59.831 and R² = 0.9131.]
x           1        2        3        4         5         6         7        8        9
residuals   29.303   6.975    -7.753   -16.481   -21.709   -20.137   -6.865   6.707    29.979
Residuals = y values − estimated y values from the line of best fit.
Residuals plot
[Figure: residuals plot. Horizontal axis: x (0 to 10); vertical axis: residuals (-30 to 40).]
The residual plot follows a quadratic pattern, hence the data does not follow a linear
relationship.
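The curved pattern can also be seen without the plot by looking at the signs of the residuals in the table. The sketch below is an informal heuristic, not a formal test: it counts how often the sign changes from one x value to the next. A run of positive, then negative, then positive residuals (two sign changes) is what a U-shaped pattern looks like.

```python
# Residuals from the linear fit, copied from the table for x = 1..9.
residuals = [29.303, 6.975, -7.753, -16.481, -21.709,
             -20.137, -6.865, 6.707, 29.979]

# +1 for a positive residual, -1 for a negative one.
signs = [1 if e > 0 else -1 for e in residuals]

# Count the places where consecutive residuals have different signs.
sign_changes = sum(1 for a, b in zip(signs, signs[1:]) if a != b)

print(signs)
print(sign_changes)
```

Randomly scattered residuals would usually change sign more often than this; long runs of the same sign suggest the model is missing some curvature.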
12.
x             1      2      3     4      5      6      7      8       9
y estimates   -6.2   -5.7   1.4   15.1   35.4   62.3   95.8   135.9   182.6
residuals     -1     -0.5   1     1.9    -0.3   -2.3   0.8    -2.4    -2.5
The y estimates were obtained by substituting the x values into the new model
y = 3.3x² − 9.4x − 0.1. Subtracting the estimates from the original y values gave the
residuals. Using Excel's scatter chart function, we obtained the following plot of the
residuals against the x values:

[Figure: residuals plot. Horizontal axis: x (0 to 10); vertical axis: residuals (-3 to 2.5).]
From the residuals plot, the data points are scattered with no identifiable pattern;
hence, the model y = 3.3x² − 9.4x − 0.1 is appropriate for this dataset.
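The y estimates can be reproduced by evaluating the quadratic model at each x value, and the two sets of residuals can be compared by their largest absolute value. This is a sketch in Python rather than the Excel steps described above; the function name `quadratic_model` is illustrative, and both residual lists are copied from the tables in questions 11 and 12.

```python
def quadratic_model(x):
    """Evaluate the model y = 3.3x^2 - 9.4x - 0.1."""
    return 3.3 * x ** 2 - 9.4 * x - 0.1

xs = range(1, 10)
estimates = [round(quadratic_model(x), 1) for x in xs]

# Residuals copied from the tables for the linear and quadratic models.
linear_residuals = [29.303, 6.975, -7.753, -16.481, -21.709,
                    -20.137, -6.865, 6.707, 29.979]
quadratic_residuals = [-1, -0.5, 1, 1.9, -0.3, -2.3, 0.8, -2.4, -2.5]

largest_linear = max(abs(e) for e in linear_residuals)
largest_quadratic = max(abs(e) for e in quadratic_residuals)

print(estimates)
print(largest_linear, largest_quadratic)
```

The largest residual falls from about 30 under the linear model to 2.5 under the quadratic model, which supports the conclusion that the quadratic model describes this dataset better.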
13. Cause-Effect:
CAUSE: Shorter travel time
EFFECT: Lower test scores
Students with shorter commutes, perhaps because they live closer to school, may tend
to wake up later than peers who live farther away. They may therefore be more prone
to procrastination, which could result in lower test scores.
14. Cause-Effect:
CAUSE: Children having more books at home
EFFECT: Earning a PhD when they grow up

A child's interest in learning may grow if they have books at home. Highly educated
individuals, such as PhD holders, also commonly keep a home library, so their children
may acquire the same passion for learning.
15. Accidental
This is purely coincidental, as individuals do not consciously select their phone
number, and their birth date is not directly associated with it.
16. Cause-Effect
CAUSE: Drinking coffee in the morning
EFFECT: Have insomnia at night
Coffee consumption typically results in trouble sleeping.

References
ET AL, C. (2002). Mathematics of Data Management. McGraw-Hill Ryerson School.
Climate at a Glance Global Time Series. (n.d.). National Centers for Environmental
Information. https://www.ncei.noaa.gov/access/monitoring/climate-at-a-glance/global/time-series/globe/land_ocean/12/1/2000-2012
