ggplot() creates a plot from the dataset passed to it. Here we passed the data read from the
roadrace.csv file. Additional layers and settings are added to the plot with the + sign.
geom_bar() specifies that we want a bar plot and sets the color of the bars.
labs() sets the labels of the plot.
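The three calls described above can be combined as in the following minimal sketch. The data frame `runners` and its column name `Maine` are assumptions made for illustration, since the actual column names of roadrace.csv are not shown in this report; only the counts 4458 and 1417 come from the report itself.

```r
library(ggplot2)

# Hypothetical stand-in for the data read from roadrace.csv.
# The column name `Maine` is an assumption, not taken from the real file.
runners <- data.frame(
  Maine = c(rep("Maine", 4458), rep("Away", 1417))
)

# ggplot() sets up the plot, geom_bar() adds the bars and their fill color,
# and labs() adds the title and axis labels. Each part is joined with +.
p <- ggplot(runners, aes(x = Maine)) +
  geom_bar(fill = "steelblue") +
  labs(title = "Runners by origin", x = "Origin", y = "Number of runners")

# print(p) draws the chart on the active graphics device.
```

Calling print(p), or typing p at the console, displays the bar chart.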
From the bar graph we can see that the majority of the runners are from Maine: 4458 of the
5875 runners are from Maine, while 1417 are from other places. That is, around 76% of the
participants were from Maine and around 24% were from somewhere else.
              Maine    Away
Total         4458     1417
Percentage    76%      24%
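The percentages in the table can be checked with a few lines of base R. This is a small sketch that uses only the two counts reported above; the variable names are chosen for illustration.

```r
# Counts taken from the table above
counts <- c(Maine = 4458, Away = 1417)

# Total number of runners
total <- sum(counts)

# Share of each group, as a percentage rounded to the nearest whole number
percent <- round(100 * counts / total)
print(percent)
```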
summary(maine_times)
IQR(maine_times)
summary(away_times)
IQR(away_times)
First, the histograms show that the two distributions are symmetric, with taller bins
(larger frequencies) for Maine because it had more participants. We can speculate that
overall performance was better for the Away runners, as indicated by the minimum, maximum,
median, mean, and 1st quartile.
The boxplot() function draws a box-and-whisker plot of the datasets passed to it. Here we
passed maine_times and away_times.
This figure shows side-by-side (parallel) boxplots of the Maine runners’ times and the Away
runners’ times. The 1st quartile, median, 3rd quartile, and interquartile range (IQR) are almost
the same for both groups, so the two boxes look alike. The whiskers are also of similar length,
because the minimum and maximum values are close to each other. There are a good number of
outliers in both groups, but only a few of them are very far from the rest. Maine has more
outliers, most probably because it has more participants.
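The outliers mentioned above are points that fall beyond the whiskers. A minimal sketch of the usual 1.5 × IQR rule is given below, using a small made-up vector of times rather than the real race data. Note that boxplot() computes its box from hinges, which can differ slightly from the quartiles returned by quantile(), so the fences here are an approximation of what the plot uses.

```r
# Hypothetical finishing times in minutes (not the real race data)
times <- c(25, 48, 52, 55, 57, 58, 60, 61, 63, 66, 70, 110)

# First and third quartiles, and the interquartile range
q <- quantile(times, c(0.25, 0.75))
iqr <- IQR(times)

# Fences: points more than 1.5 * IQR outside the quartiles are flagged
lower <- q[[1]] - 1.5 * iqr
upper <- q[[2]] + 1.5 * iqr
flagged <- times[times < lower | times > upper]
print(flagged)
```

A larger group will usually produce more flagged points simply because it has more observations in the tails, which fits the explanation given for Maine.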
summary(male_age)
IQR(male_age)
summary(female_age)
IQR(female_age)
This figure also shows two boxplots, one for the male runners’ ages and the other for the female
runners’ ages. From the graph we can see that the female runners were comparatively younger
than the male runners, as Q1, the median, and Q3 are all smaller for this group. There was also
a larger variation in age among the male runners. Moreover, the distribution for male runners
is slightly left-skewed, while the distribution for female runners is right-skewed.
Finally, in terms of outliers, there is only one in the male category, whereas there are a few
among the female runners.
Q2
Consider the dataset motorcycle.csv posted on eLearning. It
contains the number of fatal motorcycle accidents that occurred
in each county of South Carolina during 2009. Create a boxplot
of data and provide relevant summary statistics. Discuss the
features of the data distribution. Identify which counties may be
considered outliers. Why might these counties have the highest
numbers of motorcycle fatalities in South Carolina?
This boxplot shows the number of fatal motorcycle accidents in each county of South Carolina
during 2009. The box spans roughly 6 to 23 accidents, so the middle 50% of counties fall in that
range. The distribution is right-skewed. Greenville and Horry are potential outlier counties,
with 51 and 60 fatal motorcycle accidents, respectively. Some potential reasons for these
counties having the highest numbers of fatalities are poor road design, unfit vehicles on the
road, a lack of rules and regulations, and severe weather.
Section 2: R Code
Q1
# Install the ggplot2 package
install.packages("ggplot2")
# Create a boxplot comparing the times of Maine runners and Away runners
boxplot(maine_times, away_times,
        names = c("Maine", "Away"),
        main = "Runners' Times by Origin",
        xlab = "Origin", ylab = "Time (minutes)",
        col = c("skyblue", "lightgreen"))
Q2
# Read the data from the CSV file "motorcycle.csv" into a data frame named data2
data2 <- read.csv("motorcycle.csv")
# Subset the data frame to extract the counties whose number of fatal
# motorcycle accidents matches one of the outliers. %in% is used instead of ==
# because outliers may hold more than one value.
counties <- subset(data2$County, data2$Fatal.Motorcycle.Accidents %in% outliers)
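The definition of `outliers` is not shown in this section. One possible way to obtain it, sketched here as an assumption rather than as the code actually used, is boxplot.stats(), which returns the points that boxplot() draws beyond the whiskers. The data frame below is a made-up stand-in for motorcycle.csv: only the column names and the Greenville and Horry counts come from this report, and the other rows are hypothetical.

```r
# Hypothetical stand-in for read.csv("motorcycle.csv"); only the last two
# rows reflect values mentioned in the report.
data2 <- data.frame(
  County = c(paste("COUNTY", LETTERS[1:10]), "GREENVILLE", "HORRY"),
  Fatal.Motorcycle.Accidents = c(2, 5, 6, 8, 10, 13, 14, 17, 20, 23, 51, 60),
  stringsAsFactors = FALSE
)

# Values that boxplot() would draw as individual points beyond the whiskers
outliers <- boxplot.stats(data2$Fatal.Motorcycle.Accidents)$out

# Counties whose count is one of those outlying values
counties <- subset(data2$County, data2$Fatal.Motorcycle.Accidents %in% outliers)
print(counties)
```

Using %in% matters here because `outliers` can contain several values; with ==, R would recycle the shorter vector and compare element by element, which can miss matches.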