Report on
Faithful
Submitted by:
Rahul Gupta
18021141087
Section-B
Old Faithful is a cone geyser located in Yellowstone National Park in Wyoming,
United States. It was named in 1870 during the Washburn-Langford-Doane
Expedition and was the first geyser in the park to receive a name. It is a highly
predictable geothermal feature, and has erupted every 44 to 125 minutes since
2000. The geyser and the nearby Old Faithful Inn are part of the Old Faithful Historic
District.
Description
Waiting time between eruptions and the duration of the eruption for the Old Faithful
geyser in Yellowstone National Park, Wyoming, USA.
Usage
faithful
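Format
A data frame with 272 observations on 2 variables: eruptions (eruption time in
minutes) and waiting (waiting time to the next eruption, in minutes).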
Details
There are many versions of this dataset around: Azzalini and Bowman (1990) use a
more complete version.
Analysis
> nrow(faithful)  # number of rows in the data set
[1] 272
> ncol(faithful)  # number of columns in the data set
[1] 2
> str(faithful)  # structure: number of observations and variables, with types
'data.frame': 272 obs. of 2 variables:
$ eruptions: num 3.6 1.8 3.33 2.28 4.53 ...
$ waiting: num 79 54 74 62 85 55 88 85 51 85 ...
The str function summarises the structure of the data set: it reports the number of
observations and variables, together with the type and first few values of each
variable.
> head(faithful)  # first 6 rows of the data set
eruptions waiting
1 3.600 79
2 1.800 54
3 3.333 74
4 2.283 62
5 4.533 85
6 2.883 55
> tail(faithful)  # last 6 rows of the data set
eruptions waiting
267 4.750 75
268 4.117 81
269 2.150 46
270 4.417 90
271 1.817 46
272 4.467 74
> summary(faithful)  # summary statistics for each variable
eruptions waiting
Min. :1.600 Min. :43.0
1st Qu.:2.163 1st Qu.:58.0
Median :4.000 Median :76.0
Mean :3.488 Mean :70.9
3rd Qu.:4.454 3rd Qu.:82.0
Max. :5.100 Max. :96.0
> range(faithful$eruptions)
[1] 1.6 5.1
> range(faithful$waiting)
[1] 43 96
Analysis: The summary function gives a compact overview of the dataset. Eruption
durations range from 1.6 to 5.1 and waiting times from 43 to 96. For both variables
the mean lies noticeably below the median (3.488 against 4.000 for eruptions, 70.9
against 76.0 for waiting), so the distributions are not symmetric: a group of low
values pulls the mean down.
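The comparison of mean and median can be made explicit with a short sketch. It uses only base R functions on the built-in faithful data frame; the name centre is introduced here for illustration and is not part of the original analysis.

```r
# Compare mean and median for each column of the built-in faithful data set.
centre <- sapply(faithful, function(x) c(mean = mean(x), median = median(x)))
print(round(centre, 3))

# TRUE for a column means its mean lies below its median,
# i.e. the lower values pull the mean down.
print(centre["mean", ] < centre["median", ])
```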
> # sort by decreasing frequency; [1] keeps only the most frequent value
> mode_eruptions <- names(sort(-table(faithful$eruptions)))[1]
> mode_eruptions
[1] "1.867"
This means that 1.867 is the most frequently occurring eruption duration in the
dataset.
> mode_waiting <- names(sort(-table(faithful$waiting)))[1]
> mode_waiting
[1] "78"
This means that the most common waiting time in the faithful dataset is 78 minutes.
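Base R has no built-in function for the statistical mode, so the table-based approach above can be wrapped in a small helper. The sketch below is one possible version; stat_mode is a hypothetical name, and unlike the one-liner it returns every value that shares the highest frequency, as numbers rather than strings.

```r
# Hypothetical helper: return every value that attains the highest frequency.
stat_mode <- function(x) {
  counts <- table(x)                                  # frequency of each distinct value
  as.numeric(names(counts)[counts == max(counts)])    # keep all values tied for the maximum
}

modes_eruptions <- stat_mode(faithful$eruptions)
modes_waiting <- stat_mode(faithful$waiting)
print(modes_eruptions)
print(modes_waiting)
```

Returning all tied values avoids silently reporting only the first of several equally frequent observations.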
> cor(faithful$eruptions,faithful$waiting)
[1] 0.9008112
Analysis: cor gives the correlation coefficient between two variables. It measures
the strength and direction of their linear relationship and ranges from -1 to 1;
values close to -1 or 1 indicate a strong linear relationship and values close to 0
a weak one. In the given dataset the correlation of about 0.9 shows a strong
positive linear relationship between eruptions and waiting.
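To check whether a correlation of this size could plausibly arise by chance, the coefficient can be accompanied by a significance test. The following is a minimal sketch using cor.test from base R's stats package; the variable name ct is an assumption made for this example.

```r
# Pearson correlation with a significance test and a 95% confidence interval.
ct <- cor.test(faithful$eruptions, faithful$waiting)

print(ct$estimate)   # sample correlation coefficient
print(ct$conf.int)   # 95% confidence interval for the correlation
print(ct$p.value)    # p-value for the null hypothesis of zero correlation
```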
> sd(faithful$eruptions)
[1] 1.141371
> sd(faithful$waiting)
[1] 13.59497
Analysis: sd is the standard deviation. It measures how far, on average, the values
are spread around the mean. Relative to the means, the standard deviations are about
33% for eruptions (1.14 against 3.49) and about 19% for waiting (13.6 against 70.9),
so the values are not tightly concentrated around their respective means.
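Because the two variables are on different scales, one way to compare their spread is the coefficient of variation, the standard deviation divided by the mean. The sketch below computes it for both columns; the name cv is introduced here for illustration.

```r
# Coefficient of variation (sd / mean) for each column of faithful.
cv <- sapply(faithful, function(x) sd(x) / mean(x))
print(round(cv, 3))
```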
> cov(faithful$eruptions,faithful$waiting)
[1] 13.97781
Analysis: cov is the covariance, a measure of how much two variables vary together.
Its magnitude depends on the units of both variables, so 13.98 cannot be judged as
large or small on its own; the correlation is the unit-free version of the same
quantity. The positive sign means that smaller values of one variable tend to occur
with smaller values of the other, and larger values with larger values. The sign of
the covariance therefore shows the direction of the linear relationship between the
variables.
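The link between covariance and correlation can be verified numerically: the covariance equals the correlation multiplied by the two standard deviations. A minimal sketch, with x, y, cov_direct and cov_from_cor as names chosen for this example:

```r
x <- faithful$eruptions
y <- faithful$waiting

cov_direct <- cov(x, y)                      # covariance computed directly
cov_from_cor <- cor(x, y) * sd(x) * sd(y)    # rebuilt from correlation and sds

print(c(direct = cov_direct, from_cor = cov_from_cor))
print(all.equal(cov_direct, cov_from_cor))   # equal up to floating-point error
```

With the values reported above, 0.9008112 * 1.141371 * 13.59497 is approximately 13.978, which agrees with the covariance.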
> library(ggplot2)  # load the ggplot2 package
> ggplot(faithful, aes(eruptions)) +
+   geom_histogram(binwidth = 1) + xlab("eruptions") +
+   ylab("count")
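Analysis: The histogram shows the distribution of eruption durations only. Because
it plots a single variable, it cannot by itself show a relationship between
eruptions and waiting time; that relationship is read from a scatter plot of the two
variables.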
Analysis: In a scatter plot of eruptions against waiting, the points fall into two
clusters, which suggests two different types of eruption: short ones and long ones.
There is, however, a very clear pattern that affects both clusters. An imagined
straight line could be added fairly easily to the diagram, so there is a positive
correlation between the x and y co-ordinates. In other words, longer eruptions go
together with longer waiting times. A physical argument makes this plausible: the
dataset documentation records waiting as the time to the next eruption, and a longer
eruption may empty more of the reservoir, which then takes longer to recharge.
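The scatter plot described here can be reproduced with a few lines. The sketch below uses base R graphics rather than ggplot2 so that it runs without extra packages; the axis labels, the title and the name fit are choices made for this example, not taken from the original report.

```r
# Scatter plot of eruption duration against waiting time (base R graphics).
plot(faithful$waiting, faithful$eruptions,
     xlab = "waiting (minutes)", ylab = "eruptions (minutes)",
     main = "Old Faithful: eruption duration vs waiting time")

# Overlay the least-squares line; `fit` is a name introduced for this sketch.
fit <- lm(eruptions ~ waiting, data = faithful)
abline(fit, col = "red")
```

The red line plays the role of the imagined straight line: it rises from left to right, and the two clusters sit at its lower and upper ends.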
> lm1 <- lm(eruptions ~ waiting, data=faithful)
> summary(lm1)
Call:
lm(formula = eruptions ~ waiting, data = faithful)
Residuals:
     Min       1Q   Median       3Q      Max
-1.29917 -0.37689  0.03508  0.34909  1.19329
Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept) -1.874016   0.160143  -11.70   <2e-16 ***
waiting      0.075628   0.002219   34.09   <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
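Analysis: The usual criterion is to reject the null hypothesis that a coefficient is
zero when its p-value is at most 0.05. In our regression the p-values for both the
intercept and the waiting coefficient are below 2*10^-16, far smaller than this
threshold. We therefore reject the null hypothesis and conclude that waiting time is
a statistically significant linear predictor of eruption duration, with the fitted
line eruptions = -1.874016 + 0.075628 * waiting.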
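As an illustration of how the fitted model can be used, the sketch below predicts the eruption duration for a hypothetical waiting time of 80 minutes and extracts the R-squared value. The value 80 and the names new_obs, pred and r_squared are assumptions made for this example. From the coefficients above, the point prediction is -1.874016 + 0.075628 * 80, which is about 4.18 minutes.

```r
# Refit the model from the report so this block is self-contained.
lm1 <- lm(eruptions ~ waiting, data = faithful)

# Hypothetical new observation: a waiting time of 80 minutes.
new_obs <- data.frame(waiting = 80)

# Point prediction with a 95% confidence interval for the mean response.
pred <- predict(lm1, newdata = new_obs, interval = "confidence")
print(pred)

# Proportion of variance in eruptions explained by waiting.
r_squared <- summary(lm1)$r.squared
print(r_squared)
```

For a simple linear regression with one predictor, R-squared equals the square of the correlation coefficient between the two variables.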