
CS 6313: Mini Project #2

Name : Group Trio 14

Names of group members

Anusha Gupta (axg230026)


Avtans Kumar (axk220317)
Md Shahir Zaoad (mxz230002)

Contribution of each group member


All group members contributed equally. We arrived at the solutions through group
discussion, and the coding and documentation were also done in close collaboration.

Section 1: Answers and explanations


Q1
1. Consider the dataset roadrace.csv posted on eLearning. It
contains observations on 5875 runners who finished the 2010
Beach to Beacon 10K Road Race in Cape Elizabeth, Maine. You
can read the dataset in R using read.csv function.
a. Create a bar graph of the variable Maine, which identifies whether
a runner is from Maine or from somewhere else (stated using Maine
and Away). You can use barplot function for this. What can we
conclude from the plot? Back up your conclusions with relevant
summary statistics.

ggplot is used to plot the dataset passed to it; here we pass the data read from the
roadrace.csv file. Additional layers and options are added to the plot with the + sign.
geom_bar() specifies that we want a bar plot and sets the fill and outline colors of the bars.
labs() sets the title and axis labels of the plot.

ggplot(data, aes(x = Maine)) + geom_bar(fill = "skyblue", color = "black") +
  labs(title = "Runners Origin (Maine vs Away)", x = "Origin", y = "Count")
total_maine_runners <- sum(data$Maine == "Maine", na.rm = TRUE)
total_other_state_runners <- sum(data$Maine == "Away", na.rm = TRUE)
cat(total_maine_runners, total_other_state_runners)

From the bar graph we can see that the majority of the runners are from Maine: 4458 of the
5875 runners are from Maine, while 1417 are from other places. Around 76% of the
participants were from Maine, and around 24% were from somewhere else.

             Maine    Away
Total         4458    1417
Percentage     76%     24%
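
The counts and percentages in the table can also be produced in one step with table() and prop.table(). The following is a minimal sketch, not the code used for this report: because roadrace.csv is not available here, the origin vector is rebuilt from the counts reported above, and the variable names are illustrative assumptions.

```r
# Rebuild a stand-in for data$Maine from the counts reported above
# (assumption: the real column contains only "Maine" and "Away").
origin <- c(rep("Maine", 4458), rep("Away", 1417))

# Frequency table of runners by origin
counts <- table(origin)

# Share of each group in percent, rounded to one decimal place
percentages <- round(100 * prop.table(counts), 1)

print(counts)
print(percentages)

# With the real data, barplot(table(data$Maine)) would draw the same bar
# graph using the base R barplot function mentioned in the question.
```

Rounding to one decimal place shows how close the shares are to the 76% and 24% quoted above.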

b. Create two histograms of the runners’ times (given in minutes):
one for the Maine group and the second for the Away group. Make
sure that the histograms are on the same scale. What can we conclude
about the two distributions? Back up your conclusions with
relevant summary statistics, including mean, standard deviation,
range, median, and interquartile range.
The hist function is used to draw the histograms. The xlim and ylim parameters set the range
of the x and y axes over which each graph is plotted.
The summary function produces summary statistics of the data (minimum, quartiles, median,
mean, and maximum).
The IQR function calculates the interquartile range of a dataset.

# Create a histogram of Maine runners' times
hist(maine_times, main = "Maine Runners' Times", xlab = "Time (minutes)",
     ylab = "Frequency", col = "skyblue", border = "black", xlim = c(20,160))

# Create a histogram of Away runners' times
hist(away_times, main = "Away Runners' Times", xlab = "Time (minutes)",
     ylab = "Frequency", col = "skyblue", border = "black", ylim = c(0,1500),
     xlim = c(20,160))
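
One way to guarantee that both histograms share the same scale is to fix the bin edges and derive the y-axis limit from the data instead of hard-coding it. The sketch below is only an illustration under stated assumptions: the two vectors are simulated stand-ins for maine_times and away_times, and the 10-minute bin width is an arbitrary choice.

```r
# Synthetic stand-ins for maine_times and away_times (assumption: the real
# vectors come from roadrace.csv; these values are simulated for illustration).
set.seed(1)
maine_demo <- rnorm(400, mean = 58, sd = 12)
away_demo  <- rnorm(150, mean = 58, sd = 12)

# Shared bin edges (10-minute bins) that cover both groups
all_times <- c(maine_demo, away_demo)
breaks <- seq(floor(min(all_times) / 10) * 10,
              ceiling(max(all_times) / 10) * 10, by = 10)

# Bin the data without drawing, to find the tallest bar across both groups
h_maine <- hist(maine_demo, breaks = breaks, plot = FALSE)
h_away  <- hist(away_demo,  breaks = breaks, plot = FALSE)
y_max <- max(h_maine$counts, h_away$counts)

# Draw both histograms with identical bins, x limits, and y limits
par(mfrow = c(1, 2))
hist(maine_demo, breaks = breaks, xlim = range(breaks), ylim = c(0, y_max),
     main = "Maine (simulated)", xlab = "Time (minutes)", col = "skyblue")
hist(away_demo, breaks = breaks, xlim = range(breaks), ylim = c(0, y_max),
     main = "Away (simulated)", xlab = "Time (minutes)", col = "skyblue")
```

Because the limits are computed from the binned counts, neither histogram is clipped and the bars of the two groups can be compared directly.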

summary(maine_times)
IQR(maine_times)
summary(away_times)
IQR(away_times)
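First, the histograms show that the two distributions have a very similar shape. In both
groups the middle half of the times falls between roughly 50 and 65 minutes, the mean is
slightly above the median, and the maximum lies far above the 3rd Quartile, which points to
a longer right tail rather than perfect symmetry. The bars are taller (larger frequencies)
for Maine because it had more participants. The Away runners appear to have performed
marginally better overall, as their Min, 1st Quartile, Median, Mean, and Max are all slightly
lower, although the differences are small (about 0.4 minutes in the mean).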

             Min     1st Qu.   Median   Mean    3rd Qu.   Max      IQR
Maine Time   30.57   50.00     57.03    58.20   64.24     152.17   14.24775
Away Time    27.78   49.15     56.92    57.82   64.83     133.71   15.674
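
The question also asks for the standard deviation and the range, which summary() and IQR() do not report. A small helper along the following lines could collect all requested statistics in one call. This is a sketch only: the name describe and the sample values are made up for illustration and are not taken from roadrace.csv.

```r
# Helper that collects the statistics requested in the question.
# The name `describe` is illustrative, not a base R function.
describe <- function(x) {
  x <- x[!is.na(x)]
  c(mean   = mean(x),
    sd     = sd(x),
    min    = min(x),
    max    = max(x),
    range  = max(x) - min(x),
    median = median(x),
    IQR    = IQR(x))
}

# Small made-up sample of finishing times in minutes (not from roadrace.csv)
demo_times <- c(50, 55, 60, 65, 70)
demo_stats <- describe(demo_times)
print(round(demo_stats, 2))
```

With the real data, describe(maine_times) and describe(away_times) would return the same set of statistics for each group.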

c. Repeat (b) but with side-by-side boxplots.

The boxplot function draws a box-and-whisker plot for each dataset passed to it. Here we
pass maine_times and away_times.

boxplot(maine_times, away_times, names = c("Maine", "Away"),
        main = "Runners' Times by Origin", xlab = "Origin",
        ylab = "Time (minutes)", col = c("skyblue", "lightgreen"))

This figure shows side-by-side (parallel) boxplots of the Maine and Away runners’ times.
Because the 1st Quartile, Median, 3rd Quartile, and Interquartile Range (IQR) are nearly the
same for both groups, the two boxes look almost identical. The whiskers are also of similar
length, since the bulk of the times in both groups covers a similar range. Both groups have
a good number of outliers, only a few of which lie very far from the box. Maine has more
outliers, most probably because it has more participants.

d. Create side-by-side boxplots for the runners’ ages (given in
years) for male and female runners. What can we conclude about
the two distributions? Back up your conclusions with relevant
summary statistics, including mean, standard deviation, range,
median, and interquartile range.

boxplot(male_age, female_age, names = c("Male", "Female"),
        main = "Runners' Age by Sex", xlab = "Sex", ylab = "Age",
        col = c("skyblue", "lightgreen"))

summary(male_age)
IQR(male_age)

summary(female_age)
IQR(female_age)
This figure shows two boxplots, one for the male runners’ ages and one for the female
runners’ ages. The female runners were comparatively younger than the male runners, as Q1,
the median, and Q3 are all smaller for that group. The ages of the male runners also vary
more (IQR of 21 versus 18). Moreover, the distribution for male runners is slightly
left-skewed (mean 40.45 below median 41), while the distribution for female runners is
right-skewed (mean 37.24 above median 36). Finally, in terms of outliers, there is only one
in the male category, while there are a few among the female runners.

             Min    1st Qu.   Median   Mean    3rd Qu.   Max     IQR
Male Age     9.00   30.00     41.00    40.45   51.00     83.00   21
Female Age   7.00   28.00     36.00    37.24   46.00     86.00   18
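
A quick way to sanity-check the direction of skew is to compare the mean with the median. The sketch below applies that heuristic to the values reported in the table above; it is only a rule of thumb, not a formal skewness measure, and the column names are illustrative.

```r
# Mean and median ages as reported in the table above
age_stats <- data.frame(
  group  = c("Male", "Female"),
  mean   = c(40.45, 37.24),
  median = c(41.00, 36.00)
)

# Heuristic: mean above median suggests a right skew, mean below a left skew
age_stats$diff <- age_stats$mean - age_stats$median
age_stats$skew_hint <- ifelse(age_stats$diff > 0, "right",
                              ifelse(age_stats$diff < 0, "left", "none"))
print(age_stats)
```

For the male runners the mean is 0.55 years below the median, and for the female runners it is 1.24 years above it.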

Q2
Consider the dataset motorcycle.csv posted on eLearning. It
contains the number of fatal motorcycle accidents that occurred
in each county of South Carolina during 2009. Create a boxplot
of data and provide relevant summary statistics. Discuss the
features of the data distribution. Identify which counties may be
considered outliers. Why might these counties have the highest
numbers of motorcycle fatalities in South Carolina?
This boxplot shows the fatal motorcycle accidents in the counties of South Carolina during
2009. The middle 50% of counties recorded between 6 (1st Quartile) and 23 (3rd Quartile)
accidents, and the distribution is right-skewed (mean 17.02 above median 13.50). GREENVILLE
and HORRY are potential outlier counties, with 51 and 60 fatal accidents respectively. Some
potential reasons for the highest numbers of fatalities could be poor road design, unfit
vehicles on the road, lack of rules and regulations, severe weather, etc.

            Min    1st Qu.   Median   Mean    3rd Qu.   Max    IQR
Accidents   0.00   6.00      13.50    17.02   23.00     60.0   17
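
The outlier status of the two counties can be checked against the usual 1.5 * IQR rule using the quartiles in the table above. This is a sketch of the arithmetic only; boxplot() itself works from hinges, which may differ slightly from the quartiles that summary() reports, so the fences here are an approximation.

```r
# Quartiles as reported in the summary table above
q1 <- 6
q3 <- 23
iqr <- q3 - q1                      # 17

# Fences of the 1.5 * IQR rule
lower_fence <- q1 - 1.5 * iqr       # -19.5
upper_fence <- q3 + 1.5 * iqr       # 48.5

# Accident counts reported for the two flagged counties
flagged <- c(GREENVILLE = 51, HORRY = 60)
print(flagged > upper_fence)
```

Both counts lie above the upper fence of 48.5, which is consistent with the two points drawn beyond the whisker in the boxplot.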

Section 2: R Code
Q1
# Install the ggplot2 package if it is not already installed
if (!requireNamespace("ggplot2", quietly = TRUE)) install.packages("ggplot2")

# Load the ggplot2 library for data visualization
library(ggplot2)

# Read the data from the "roadrace.csv" file
data <- read.csv("roadrace.csv")

# Create a bar chart to show the distribution of runners by origin
# (Maine vs Away)
ggplot(data, aes(x = Maine)) + geom_bar(fill = "skyblue", color = "black") +
  labs(title = "Runners Origin (Maine vs Away)", x = "Origin", y = "Count")

# Count the total number of Maine and other state runners
total_maine_runners <- sum(data$Maine == "Maine", na.rm = TRUE)
total_other_state_runners <- sum(data$Maine == "Away", na.rm = TRUE)
cat(total_maine_runners, total_other_state_runners)

# Extract times for Maine and Away runners
maine_times <- data$Time..minutes.[data$Maine == "Maine"]
away_times <- data$Time..minutes.[data$Maine == "Away"]

# Create a histogram of Maine runners' times
hist(maine_times, main = "Maine Runners' Times", xlab = "Time (minutes)",
     ylab = "Frequency", col = "skyblue", border = "black", xlim = c(20,160))

# Create a histogram of Away runners' times
hist(away_times, main = "Away Runners' Times", xlab = "Time (minutes)",
     ylab = "Frequency", col = "skyblue", border = "black", ylim = c(0,1500),
     xlim = c(20,160))

# Summary statistics and Interquartile Range (IQR) calculation for Maine
# runners' times
summary(maine_times)
IQR(maine_times)

# Summary statistics and Interquartile Range (IQR) calculation for Away
# runners' times
summary(away_times)
IQR(away_times)

# Create a boxplot comparing the times of Maine runners and Away runners
boxplot(maine_times, away_times, names = c("Maine", "Away"),
        main = "Runners' Times by Origin", xlab = "Origin",
        ylab = "Time (minutes)", col = c("skyblue", "lightgreen"))

# Extract ages for male and female runners
male_age <- data$Age[data$Sex == "M"]
female_age <- data$Age[data$Sex == "F"]

# Create a boxplot to compare age distribution between sexes
boxplot(male_age, female_age, names = c("Male", "Female"),
        main = "Runners' Age by Sex", xlab = "Sex", ylab = "Age",
        col = c("skyblue", "lightgreen"))

# Summary statistics and Interquartile Range (IQR) calculation for male
# runners' ages
summary(male_age)
IQR(male_age)

# Summary statistics and Interquartile Range (IQR) calculation for female
# runners' ages
summary(female_age)
IQR(female_age)

Q2
# Read the data from the CSV file "motorcycle.csv" into a data frame named
# data2
data2 <- read.csv("motorcycle.csv")

# Create a boxplot of the number of fatal motorcycle accidents, labeling the
# y-axis as "No of accidents" and coloring the box sky blue
boxplot(data2$Fatal.Motorcycle.Accidents, names = c("Accidents"),
        ylab = "No of accidents", col = c("skyblue"))

# Store the boxplot object without plotting it
bp <- boxplot(data2$Fatal.Motorcycle.Accidents, plot = FALSE)

# Extract the outliers from the boxplot
outliers <- bp$out

# Print the values of the outliers
print(outliers)

# Extract the counties whose number of fatal motorcycle accidents matches
# any of the outlier values (%in% tests membership; == would recycle the
# outlier vector and could miss matches)
counties <- subset(data2$County,
                   data2$Fatal.Motorcycle.Accidents %in% outliers)

# Print the counties corresponding to the outlier values
print(counties)

# Summary statistics and Interquartile Range (IQR) calculation for fatal
# motorcycle accidents
accidents <- data2$Fatal.Motorcycle.Accidents
summary(accidents)
IQR(accidents)
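
When a column is matched against several outlier values at once, == and %in% behave differently: == compares position by position and recycles the shorter vector, while %in% tests membership. The toy example below illustrates the difference; the county labels and counts are hypothetical and are not taken from motorcycle.csv.

```r
# Toy data frame with hypothetical county labels and counts
toy <- data.frame(County = c("A", "B", "C", "D"),
                  Accidents = c(60, 5, 51, 7))
outlier_values <- c(51, 60)

# `==` recycles outlier_values as 51, 60, 51, 60 and compares position by
# position, so the 60 in the first row is compared with 51 and is missed
by_equals <- toy$County[toy$Accidents == outlier_values]

# `%in%` tests each count for membership in the set of outlier values
by_in <- toy$County[toy$Accidents %in% outlier_values]

print(by_equals)
print(by_in)
```

Here the == version returns only county C, while the %in% version returns both A and C, so membership testing is the safer choice whenever there is more than one outlier value.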
