variables is the correlation coefficient. A strong positive link is indicated, for example, if
the correlation coefficient between exam scores and study time is 0.9.
b) This statistical method predicts the value of a dependent variable (one that depends on the
c) Small sample bias: This type of statistical bias occurs when the sample used to
estimate a population parameter is too small, which can produce unreliable results.
For example, a political survey carried out in a city with just 50 participants is
unlikely to represent the views of the whole city.
d) When two variables seem to be related, but in reality, a third variable drives their
relationship, this is known as a common cause relationship. For instance, there is a real
cause for the apparent correlation between shark attacks and ice cream sales: hot weather.
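The correlation coefficient described in (a) can be computed by hand. The sketch below is a minimal illustration in Python; the study-time and exam-score figures are hypothetical values made up for the example, not data from this assignment.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

# Hypothetical data: hours studied and exam scores for five students.
study_hours = [1, 2, 3, 4, 5]
exam_scores = [52, 60, 63, 71, 80]

r = pearson_r(study_hours, exam_scores)
print(f"r = {r:.3f}")  # a value close to +1 indicates a strong positive link
```

A high value of r only shows that the two variables move together. As (d) points out, it does not rule out a third variable driving both.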
2. In order to investigate the statistical correlation between the quantity of trees and
techniques:
where accessible data might not be sufficient, survey design would be essential. The
survey would be sent out via a number of platforms, including social media, email, and
Utilize Publicly Available Data: Numerous cities and towns publish open data with
details on tree counts and business locations, including Starbucks stores. To
Selection Bias: If the study's sample is not representative of the total population, bias
may result. The use of stratified sampling would reduce selection bias. This entails
urban development, into several tiers or subgroups. To guarantee that the sample
measurement bias. This strategy would provide precise guidelines for surveyors on
how to recognize and count trees and collect data on Starbucks locations in a uniform
way.
3. 𝑦 = 13.695𝑥 − 30.04
Explanation
I made two columns of data in Microsoft Excel, one for Years and the other for
Height. After selecting the entire data range, I used the Insert Chart option and
chose the scatter chart. Choosing Chart/Add Trendline gave me the equation for the
line of best fit, after verifying that Linear was the default trendline type. I then
selected the chart's straight line, opened Format Trendline/Options, and checked the
box to display the equation on the chart. I also displayed 𝑟².
4.
From the results above, the correlation coefficient is 0.9032. Since the correlation
coefficient is positive and very close to 1, we can conclude that there is a strong
positive linear correlation between the height and years variables.
5.
[Figure: residuals plot of residuals against years; x-axis from 0 to 140, y-axis from -40 to 40]
Based on the residual plot, we cannot see any pattern, and the residual data points are
randomly dispersed, so we can conclude that the line of best fit is an appropriate model
to predict the height. Also, based on the actual data, the error/difference in the actual and
6.
Source of data: Climate at a Glance Global Time Series (Climate at a Glance Global
Reliability
agencies, academic institutions, and reputable research centres are generally considered
reliable sources.
Timeliness and Updates: the website has data going back to the 90s and is continuously
7. Data from year 2000 to 2012 for temperature anomaly for the month of January.
years   temperature anomaly
1       0.41
2       0.55
3       0.57
4       0.62
5       0.62
6       0.56
7       0.69
8       0.68
9       0.6
10      0.58
11      0.67
12      0.71
13      0.62
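The tabulated anomalies can be fitted with a least-squares line outside Excel as well. The following is a minimal sketch in plain Python, assuming x = 1 corresponds to 2000 and x = 13 to 2012 as in the table; the function name fit_line is chosen here for illustration.

```python
# January temperature anomalies from the table above (x = 1 is 2000, x = 13 is 2012).
years = list(range(1, 14))
anomaly = [0.41, 0.55, 0.57, 0.62, 0.62, 0.56, 0.69,
           0.68, 0.60, 0.58, 0.67, 0.71, 0.62]

def fit_line(xs, ys):
    """Return (slope, intercept, r_squared) of the least-squares line."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r_squared = sxy ** 2 / (sxx * syy)
    return slope, intercept, r_squared

slope, intercept, r_squared = fit_line(years, anomaly)
print(f"y = {slope:.4f}x + {intercept:.4f}, R^2 = {r_squared:.4f}")
```

Rounded to four decimal places, the slope, intercept, and R² should agree with the trend line that Excel displays on the scatter plot.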
[Figure: scatter plot of temp anomaly against years, with trend line y = 0.0133x + 0.5131 and R² = 0.4366]
The equation of the line of best fit is 𝑦 = 0.0133𝑥 + 0.5131, with y = temperature
anomaly and x = years.
This implies that there is a weak positive correlation between temperature anomalies and
8. Approximating:
= -0.6839 degrees
Ten years later (2022), x = 23, and substituting into the model gives
𝑦 = 0.0133(23) + 0.5131 = 0.8190 degrees.
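The substitution for 2022 can be checked with a short sketch. It assumes the fitted line 𝑦 = 0.0133𝑥 + 0.5131 and the year numbering used in question 7 (x = 1 for 2000); the function name predict_anomaly is hypothetical.

```python
def predict_anomaly(x):
    """Temperature anomaly predicted by the fitted line y = 0.0133x + 0.5131."""
    return 0.0133 * x + 0.5131

# x counts years starting from x = 1 for 2000, so 2022 corresponds to x = 23.
x_2022 = 2022 - 2000 + 1
print(f"x = {x_2022}, predicted anomaly = {predict_anomaly(x_2022):.4f}")
```

Because x = 23 lies well outside the fitted range of 1 to 13, the result is an extrapolation and should be treated with caution.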
9. Residuals plot
[Figure: residuals plot of residuals against x years; x-axis from 0 to 14, y-axis from 0 to -1.8]
From the residuals plot, the residuals appear to follow a linear trend, which implies
that the linear model is not the best fit for the data, as there should not
10. Data Collection Location Bias: The temperature data may not adequately reflect
look into it by finding out where the data collection sites are located geographically.
collecting intervals. I will try to find out how frequently and consistently data is
collected.
[Figure: scatter plot of y against x, with trend line y = 23.328x - 59.831 and R² = 0.9131]
x          1       2      3       4        5        6        7       8      9
residuals  29.303  6.975  -7.753  -16.481  -21.709  -20.137  -6.865  6.707  29.979
[Figure: residuals plot of residuals against x; x-axis from 0 to 10, y-axis from -30 to 40]
The residual plot follows a quadratic pattern, hence the data does not follow a linear
relationship.
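The curved pattern can also be seen directly in the tabulated residuals. The sketch below is only a rough heuristic, not a formal test: it counts how often consecutive residuals change sign. The helper name sign_changes is hypothetical.

```python
# Residuals of the linear model, copied from the table above.
residuals = [29.303, 6.975, -7.753, -16.481, -21.709,
             -20.137, -6.865, 6.707, 29.979]

def sign_changes(values):
    """Count how many times consecutive values switch sign."""
    signs = [v > 0 for v in values]
    return sum(1 for a, b in zip(signs, signs[1:]) if a != b)

# Least-squares residuals should sum to roughly zero (up to rounding).
print("sum of residuals:", round(sum(residuals), 3))
print("sign changes:", sign_changes(residuals))
```

The residuals are positive at both ends and negative in the middle, so the sign switches only twice. Randomly scattered residuals would usually switch sign more often, which supports reading the plot as a U-shaped (quadratic) pattern.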
12.
estimates
x 1 2 3 4 5 6 7 8 9
The y estimates were obtained by substituting the x values into the new model
𝑦 = 3.3𝑥² − 9.4𝑥 − 0.1; subtracting the estimates from the original y values then
gave the residuals. Using the Excel chart function/scatter, we obtained the following
residuals plot.
[Figure: residuals plot of residuals against x for the quadratic model; x-axis from 0 to 10, y-axis from -3 to 2.5]
From the residuals plot, the data points are scattered with no identifiable pattern;
hence, the model 𝑦 = 3.3𝑥² − 9.4𝑥 − 0.1 is valid for this dataset.
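The residuals of the quadratic model can be recomputed as a check. The original y values are not listed here, so this sketch assumes they can be recovered by adding the linear-model residuals to the linear predictions 𝑦 = 23.328𝑥 − 59.831; that reconstruction is an assumption, accurate only up to rounding.

```python
xs = list(range(1, 10))

# Residuals of the linear model, as tabulated in the previous answer.
linear_residuals = [29.303, 6.975, -7.753, -16.481, -21.709,
                    -20.137, -6.865, 6.707, 29.979]

# Assumption: original y = linear prediction + linear residual.
ys = [23.328 * x - 59.831 + r for x, r in zip(xs, linear_residuals)]

def quadratic(x):
    """Estimate from the quadratic model y = 3.3x^2 - 9.4x - 0.1."""
    return 3.3 * x ** 2 - 9.4 * x - 0.1

# Residual = original y minus the quadratic estimate.
quad_residuals = [y - quadratic(x) for x, y in zip(xs, ys)]

for x, res in zip(xs, quad_residuals):
    print(x, round(res, 1))
```

Under this assumption, every quadratic residual is smaller than 3 in absolute value, compared with residuals of up to about 30 for the linear model.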
13. Cause-Effect:
The majority of students who commute less, perhaps as a result of living closer to
school, may tend to wake up later than their peers whose homes are located farther
away. They might therefore occasionally turn to procrastination, which could result in
14. Cause-Effect:
A child's interest in learning may grow if they have books at home. A book library is
also common in the homes of highly educated individuals, such as PhD holders. so
15. Accidental
This is purely coincidental, as individuals did not consciously select their phone
16. Cause-Effect
References
Climate at a Glance Global Time Series. (n.d.). National Centers for Environmental
Information.
https://www.ncei.noaa.gov/access/monitoring/climate-at-a-glance/global/time-series/globe/land_ocean/12/1/2000-2012