You are on page 1of 7

ST1131

Tutorial 2 Solution

1. Forced Expiratory Volume (FEV) is an index of pulmonary function that measures the volume of air
expelled after 1 second of constant effort. The dataset FEV.csv on LumiNUS contains measurements
for 654 children aged 3 to 19 years of age. The purpose of the data collection was to study how FEV
is affected by certain other variables. The variables that we shall work with are
Age: Age in years.
FEV: FEV measurement.
Hgt: Height in inches.
height: Height in meters
Sex: 0 = female, 1 = male.
Smoking status: 0 = current non-smoker, 1 = current smoker.

(a) What is the response variable in this study?


(b) Create a histogram of FEV and comment on it.
(c) Create a boxplot of FEV and identify how many outliers there are. Investigate your data and
comment on these outliers.
(d) Create separate histograms for male and female FEV, then obtain separate numerical summaries
for males and female FEV. Comment on what you observe.
(e) Create a scatterplot with height (in metres) on the x-axis and FEV on the y-axis.
Hint: In R, use command plot(x,y) to create a scatter plot of y vs x, where vector x is the
vector of values of the explanatory and y is the vector of values of the response.
(f) Compute the correlation between FEV and height and comment on your results.
Hint: In R, use command cor(x,y) to calculate the correlation value of the two vectors x and y.

Solution: Import data into R:

> fev = read.csv("C:/Data/FEV.csv")


> names(fev)

[1] "ID" "Age" "FEV" "Hgt" "Sex" "Smoke" "height"

> attach(fev)

(a) FEV is the response variable.


(b) Create a histogram:
> # hist(FEV, col = 10, freq= FALSE)
The histogram is shown in Figure 1. The sample is unimodal. Compared to the overlaid bell-shape
curve, the distribution looks slightly right-skewed. Most of the observations are within a range
of 0.5 to 6 – there are no observations that are separated from the rest. However, this does not
mean there are no outliers.
(c) Box plot of FEV is shown in Figure 2.
> # boxplot(FEV, col = 10, ylab = "FEV", main = "Boxplot of FEV")
>
> # outlier values
> out = boxplot(FEV, col = 10, ylab = "FEV", main = "Boxplot of FEV")$out
> # get the index of the outliers:
> index = which(FEV %in% c(out))
> length(index)

1
ST1131

Figure 1: Histogram of FEV

[1] 9
> #information of all the outliers:
> fev[c(index),]
ID Age FEV Hgt Sex Smoke height
321 2142 14 4.84 72 1 0 1.83
452 33041 12 5.22 70 1 0 1.78
464 37241 13 4.88 73 1 0 1.85
517 49541 13 5.08 74 1 0 1.88
609 6144 19 5.10 72 1 0 1.83
624 25941 15 5.79 69 1 0 1.75
632 37441 17 5.63 73 1 0 1.85
648 71141 17 5.64 70 1 0 1.78
649 71142 16 4.87 72 1 1 1.83
There are 9 outliers. Check the information of these 9 outliers, we would see that all these outliers
are males, most (8/9) are non-smokers, and they are rather tall.
(d) Separate histograms for male and female FEV are given in Figure 3.
> female = FEV[which(Sex==0)] #values of FEV of females
> male = FEV[which(Sex==1)] #values of FEV of males
>
> # opar <- par(mfrow=c(1,2)) #arrange a figure which has 1 row, 2 columns
>
> # hist(female, col = 2, freq= FALSE, main = "Histogram of Female FEV", ylim = c(0,0.52))
>
> # hist(male, col = 4, freq= FALSE, main = "Histogram of Male FEV", ylim = c(0,0.52))
>
> # par(opar)
>
The summary could be:
The shapes of the two histograms are quite different. Both are unimodal, but for females, it is
almost symmetrical but for males it is quite right-skewed.

2
ST1131

Figure 2: Boxplot of FEV

The median FEV for females is much lower than that for males (2.49 compared to 2.605).
In addition, the variabilty in the male group is higher than the variability in the female group.
The respective IQRs are 1.54 and 1.05.

Figure 3: Histograms of FEV classified by gender

(e) The scatter plot of FEV vs height (m) where it classifes the data points by gender (more advanced)
is given in Figure 4.
> # plot(height, FEV)
> # to answer the question, this is enough. It gives a simple scatter plot
> cor(FEV, height)
[1] 0.8675619
The computed correlation is quite high, and it is clear from the plot that there is a strong positive
linear association between the two variables overall. The range of FEV for males appears larger
than the range for females, as does the range of heights. The variability of FEV at lower heights
does seem to be slightly less than the variability of FEV at greater heights.

3
ST1131

Figure 4: Scatter plot of FEV classified by gender

2. Question of interest: Do male college students follow their school’s teams more closely than female?
The following data was collected in class on Monday morning, after a particularly exciting and impor-
tant basketball game. The question asked was: Did you watch the game on TV last night?

Whole Game Part of the Game None of It Total


Male 10 12 4 26
Female 21 24 30 75
Total 31 36 34 101

(a) What is the response variable and explanatory variable in this study?
(b) For each gender, find the percentage of watching game (all, part and none).
(c) State the pairs of percentages for comparison. Plot a bar plot which helps to compare the per-
centages found.
(d) Is it fair to say that males were more likely to watch the game than females?
(e) What population that your conclusion above can apply to?

Solution:
> male = c(10, 12, 4)
> female = c(21, 24, 30)
> #create a matrix with 2 rows for males and females:
> x = matrix(c(male, female),nrow = 2, byrow = T); x

[,1] [,2] [,3]


[1,] 10 12 4
[2,] 21 24 30

(a) Response variable: Game watching (categorical) with three categories: whole, part and none.
Explanatory variable: gender (categorical) with two categories: male, female.

4
ST1131

(b) > tab = prop.table(x, margin = 1)*100; tab


[,1] [,2] [,3]
[1,] 38.46154 46.15385 15.38462
[2,] 28.00000 32.00000 40.00000
Consider the group of 26 males, 38.46% of them watching the whole game, 46.15% of them
watching part of the game and 15.39% of them did not watch the game.
Consider the group of 75 females: 28% of them watch the whole game, 32% of them watch part
of the game and 40% of them did not watch the game.
(c) The three pairs of percentages for comparing males vs females are: (38.46, 28); (46.15, 32), (15.39,
40).
> # colname = c("Whole", "Part", "None")
> # colnames(tab) = colname
> # barplot(tab, beside = TRUE, xlab = "",
> # main="",col=c(2, 3), legend = c("male", "female"))

Figure 5: Bar plots of percentage watching game for each gender

The bar plot is as given in Figure5


(d) It is fair to say that males were more likely to watch the game than females. Even though the
numberof females who watched the whole game or part of it was larger than the number of men,
the percentages were higher for males.
(e) Since the sample were collected on students attending the class on Monday morning, we cannot
can apply the conclusion above to the population of all students. The reason is: there could
be students that might stay up late watching the game and celebrating and missed the class on
Monday morning.

3. (Question for Group Work) The researchers investigated if the life expectancy of animals depend on
the length of their gestational period. They considered a dataset, Animals.csv, which has observations
on the average longevity (in years) and average gestational period (in days) for 21 animals.

(a) What is the response variable in this study?


(b) Derive the numerical summaries for each of the two variables (gestation and longevity).

5
ST1131

(c) Derive the correlation between the two variables.


(d) Create a scatter plot for the two variables and give your comment on the relationship between
them.
(e) Is there any outlier in this dataset (a point that is outlier for both variables)? If yes, find the
correlation of the two variables without the outlier(s). Give your comment on the change of the
correlation value.

Solution:

> data = read.csv("C:/Data/Animals.csv")


> names(data)

[1] "animal" "gestation" "longevity"

> # gestation: average gestation period (days)


> # longevity: average longevity of animal (years)
> attach(data)

(a) Response variable: longevity.


(b) Summaries of each variable:
> summary(gestation)
Min. 1st Qu. Median Mean 3rd Qu. Max.
22.0 57.0 115.0 164.3 240.0 624.0
> spread_ges = c(var(gestation),sd(gestation)); spread_ges
[1] 22398.9143 149.6627
> summary(longevity)
Min. 1st Qu. Median Mean 3rd Qu. Max.
5.00 9.00 11.00 13.67 14.00 35.00
> spread_long = c(var(longevity),sd(longevity)); spread_long
[1] 61.533333 7.844319
(c) Correlation between the two variables is 0.856
> cor(gestation, longevity)
[1] 0.8563872
(d) > # plot(gestation, longevity, pch = 20, col = "red")
The scatter plot is as given in Figure6. It shows a positive relationship and possibly linear betwene
the gestational period and longevity. It looks like the spread/variability of longevity is larger when
the gestational period increases. There is a possible outlier (at the top right corner).
(e) Plotting a boxplot for gestational period, there is one outlier (the 8th data point — elephant)
with 624 days of gestation.
Plotting a boxplot for longevity, there are 4 outliers (2nd, 8th, 12th and 13th points).
Hence, the 8th data point is the outlier for both variables.
> cor(gestation[-8], longevity[-8])
[1] 0.7520062
The new correlation is 0.752, decreased about 12%. Thus, this could be an influential point in
the data.

6
ST1131

Figure 6: Scatter plot of Gentational period and Longevity

You might also like