TOPICS COVERED:
• DESCRIPTION OF DATASET
– DEPENDENT & INDEPENDENT VARIABLES
• DESCRIPTIVE STATISTICS AND ANALYSIS
– SUMMARY
• MEAN
• MEDIAN
• MAXIMUM
• MINIMUM
• QUARTILES
• VARIANCES
– BOXPLOT
– HISTOGRAM
• CONFIDENCE INTERVAL ESTIMATION
• HYPOTHESIS TESTING
• CORRELATION ANALYSIS
– SCATTER PLOT
– CORRELATION COEFFICIENT
• REGRESSION ANALYSIS
1)DESCRIPTION OF DATASET
• The Campus dataset contains 23 variables and 118 observations.
• The main variables of interest are year, apps, top25, ver500, mth500 and stufac.
• Other variables such as school, lapps_1 and bball are not used for now; this report refers to them as “Dummy Variables”.
• Dependent variables: top25, stufac
• Independent variables: year, apps
• top25 and the student-faculty ratio are treated as dependent because they are assumed to depend on the year and the number of applications.
2)DESCRIPTIVE STATISTICS AND ANALYSIS
2.1) SUMMARY
X1 = summary(athlet1$year)
X1 # Summary (min, quartiles, median, mean, max) of year, which is 1992 or 1993
X2 = summary(athlet1$apps)
X2 # Summary of the number of applications
X3 = summary(athlet1$top25)
X3 # Summary of the percent of the freshman class in the top 25% of their high school class
X4 = summary(athlet1$ver500)
X4 # Summary of the percent of freshmen scoring >= 500 on the verbal SAT
X5 = summary(athlet1$mth500)
X5 # Summary of the percent of freshmen scoring >= 500 on the math SAT
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 39.00 62.00 81.00 77.56 93.00 99.00
X6 = summary(athlet1$stufac)
X6 # Summary of the student-faculty ratio
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 7.00 12.00 16.00 15.05 18.00 24.00
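To get the full summary of the variables, including variances, we use the following code to arrange the values in matrix form:
V = c("MEASURES","MINIMUM VALUE","1st QUARTILE VALUE","MEDIAN","MEAN VALUE","3rd QUARTILE VALUE","MAXIMUM VALUE","VARIANCE")
A = matrix(c(V, "Year", X1, var(athlet1$year), "Applications", X2, var(athlet1$apps), "TOP 25", X3, var(athlet1$top25), "STUDENT-FACULTY RATIO\n", X4, var(athlet1$stufac)), nrow = 8, ncol = 5)
A
[,1] [,2]
[1,] "MEASURES" "Year"
[2,] "MINIMUM VALUE" "1992"
[3,] "1st QUARTILE VALUE" "1992"
[4,] "MEDIAN" "1992.5"
[5,] "MEAN VALUE" "1992.5"
[6,] "3rd QUARTILE VALUE" "1993"
[7,] "MAXIMUM VALUE" "1993"
[8,] "VARIANCE" "0.252136752136752"
[,3] [,4]
[1,] "Applications" "TOP 25"
[2,] "3303" "36"
[3,] "6965.75" "55"
[4,] "8684" "64"
[5,] "10552.2542372881" "68.2795698924731"
[6,] "13546.5" "85"
[7,] "23342" "97"
[8,] "24638713.2681443" "25"
[,5]
[1,] NA
[2,] "STUDENT-FACULTY RATIO\n"
[3,] "20"
[4,] "36"
[5,] "48"
[6,] "54"
[7,] "70.5"
[8,] "94"
Note: the last two columns are misaligned. top25 has 25 missing values, so X3 carries a seventh entry (the NA count, shown as "25" in the VARIANCE row) and var(athlet1$top25) returns NA, which lands in row 1 of column 5. Everything after it is pushed down by one row, and the tail is cut off by the 8 x 5 shape. The values under "STUDENT-FACULTY RATIO" also come from X4 (the ver500 summary); the stufac summary is X6. Using only the first six entries of X3, calling var() with na.rm = TRUE, and passing X6 in place of X4 would keep every column at eight rows.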
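The same kind of summary table can be built so that missing values cannot shift the columns. The following is only a minimal sketch, not the code used for the output above: `demo` is a small made-up data frame (hypothetical values, not the Campus data) and `seven_stats` is a hypothetical helper that returns the same seven measures for every variable.

```r
# Hypothetical stand-in data; NOT the Campus dataset.
demo <- data.frame(
  apps   = c(3000, 5000, 8000, 12000, 20000),
  top25  = c(40, NA, 60, 80, NA),
  stufac = c(8, 12, 15, 18, 22)
)

# Hypothetical helper: seven fixed measures per variable, ignoring NAs,
# so every variable contributes a vector of the same length.
seven_stats <- function(x) {
  q <- quantile(x, probs = c(0, 0.25, 0.5, 0.75, 1), na.rm = TRUE)
  c(MINIMUM  = q[[1]],
    Q1       = q[[2]],
    MEDIAN   = q[[3]],
    MEAN     = mean(x, na.rm = TRUE),
    Q3       = q[[4]],
    MAXIMUM  = q[[5]],
    VARIANCE = var(x, na.rm = TRUE))
}

# sapply() over the columns gives a numeric matrix:
# one row per measure, one column per variable.
tab <- sapply(demo, seven_stats)
print(round(tab, 2))
```

Keeping the table numeric, with the measure names as row names, also avoids converting every value to a character string.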
2.2)BOXPLOTS
This is the code to get the box plots of our variables: Applications, Top25 and Student-Faculty Ratio. We do not make a box plot for Year because it takes only two values.
# BOXPLOT FOR Applications
boxplot(athlet1$apps, xlab = "TOTAL APPLICATIONS", ylab = "VALUES", col = "light blue")
Interpretation:
The box ranges from Q1 (6966) to Q3 (13546), with the median (8684) shown by the bold line inside the box. The horizontal lines at the ends of the whiskers mark the minimum value (3303) and the maximum value (23342).
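By default, boxplot() extends each whisker only to the most extreme point within 1.5 times the interquartile range of the box, so the whisker ends equal the minimum and maximum only when no point lies beyond those limits. The sketch below checks this with the quartiles reported above. The variable names are illustrative, and boxplot() itself works with hinges, which can differ slightly from the quartiles that summary() reports.

```r
# Values taken from the summary of applications reported above.
q1    <- 6965.75
q3    <- 13546.5
x_min <- 3303
x_max <- 23342

iqr         <- q3 - q1            # interquartile range
lower_fence <- q1 - 1.5 * iqr     # whiskers cannot extend below this
upper_fence <- q3 + 1.5 * iqr     # whiskers cannot extend above this

# TRUE means no point is drawn as an outlier, so the whiskers
# end exactly at the minimum and the maximum.
no_outliers <- (x_min >= lower_fence) && (x_max <= upper_fence)
print(c(lower_fence = lower_fence, upper_fence = upper_fence))
print(no_outliers)
```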
2.3)HISTOGRAMS
This is the code to get the histograms of our data.
hist(athlet1$apps, col = "light pink", xlab = "TOTAL APPLICATIONS", ylab = "FREQUENCY", main = "HISTOGRAM OF TOTAL APPLICATIONS")
Interpretation:
This is the histogram of total applications. More than 25 observations fall in the 5000-10000 range. Very few observations have more than 15000 applications.
hist(athlet1$year, col = "light blue", xlab = "YEAR", ylab = "FREQUENCY", main = "HISTOGRAM OF YEAR")
Interpretation:
This is the histogram of the year. The counts for 1992 and 1993 are equal: there are 59 observations for each year.
3)CONFIDENCE INTERVAL ESTIMATION
~ Finding 95% CI for population mean from sample data “apps”
t.test(athlet1$apps, conf.level = 0.95)
One Sample t-test
data: athlet1$apps
t = 23.093, df = 117, p-value < 2.2e-16
alternative hypothesis: true mean is not equal to 0
95 percent confidence interval:
9647.29 11457.22
sample estimates:
mean of x
10552.25
Result:
The 95% CI for population mean (mu) is (9647.29, 11457.22)
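The interval reported by t.test() is the sample mean plus or minus a t critical value times the standard error. As a rough check, the sketch below rebuilds it from the numbers printed above. The standard error is not shown in the output, so it is recovered here from the t statistic, and the result agrees with the printed interval only up to rounding.

```r
x_bar  <- 10552.25   # sample mean from the output above
t_stat <- 23.093     # t statistic for H0: mu = 0
df     <- 117        # degrees of freedom (n - 1)

# t = (x_bar - 0) / se, so the standard error can be recovered from t.
se <- x_bar / t_stat

# Two-sided 95% interval: 2.5% in each tail.
t_crit <- qt(0.975, df)
ci <- c(lower = x_bar - t_crit * se, upper = x_bar + t_crit * se)
print(round(ci, 2))
```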
~ Finding 99% CI for population proportion from sample data “year”
# Assuming success to be an observation from the year 1992.
S <- athlet1$year
s <- sum(S == 1992) # Number of successes
s
## [1] 59
n <- length(S) # Sample size n
n
## [1] 118
binom.test(s, n, conf.level = 0.99)
Exact binomial test
data: s and n
number of successes = 59, number of trials = 118, p-value = 1
alternative hypothesis: true probability of success is not equal to 0.5
99 percent confidence interval:
0.3792609 0.6207391
sample estimates:
probability of success
0.5
Result:
The 99% CI for population proportion (p) is (0.3792609, 0.6207391)
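binom.test() reports an exact (Clopper-Pearson) interval, which can be written in terms of beta quantiles. The sketch below recomputes the limits from s and n with qbeta(); the variable names are illustrative, and the result should agree with the interval above up to rounding.

```r
s <- 59              # number of successes (observations from 1992)
n <- 118             # sample size
conf_level <- 0.99
alpha <- 1 - conf_level

# Clopper-Pearson limits expressed through beta quantiles.
lower <- qbeta(alpha / 2, s, n - s + 1)
upper <- qbeta(1 - alpha / 2, s + 1, n - s)
print(c(lower = lower, upper = upper))
```

Because s is exactly half of n, the two limits are symmetric around 0.5.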
4)HYPOTHESIS TESTING
~ Testing at the 5% significance level whether the average stufac exceeds 18.
# Our hypotheses are:
# H0: mu = 18
# H1: mu > 18
t.test(athlet1$stufac, mu = 18, conf.level=0.95, alternative= "greater")
One Sample t-test
data: athlet1$stufac
t = -8.1774, df = 117, p-value = 1
alternative hypothesis: true mean is greater than 18
95 percent confidence interval:
14.4529 Inf
sample estimates:
mean of x
15.05085
Conclusion:
We have insufficient evidence to reject the null hypothesis because
1) the p-value (approximately 1) is greater than the level of significance (0.05);
2) 18 lies inside the one-sided 95% CI (14.4529, Inf).
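The two criteria in the conclusion can be recomputed from the printed output. The sketch below is only a check under the assumption that the printed mean and t statistic are accurate to the digits shown. The standard error is not printed, so it is recovered from the t statistic.

```r
x_bar  <- 15.05085   # sample mean of stufac from the output above
mu0    <- 18         # hypothesised mean under H0
t_stat <- -8.1774    # t statistic from the output above
df     <- 117

# t = (x_bar - mu0) / se, so se follows from the printed values.
se <- (x_bar - mu0) / t_stat

# Upper-tail p-value, because H1 is mu > 18.
p_value <- pt(t_stat, df, lower.tail = FALSE)

# One-sided 95% interval: (lower_bound, Inf).
lower_bound <- x_bar - qt(0.95, df) * se

reject_h0 <- p_value < 0.05
print(c(p_value = p_value, lower_bound = lower_bound))
print(reject_h0)
```

A sample mean below 18 can never support H1: mu > 18, which is why the p-value is so close to 1.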
5)CORRELATION ANALYSIS
5.1)SCATTER PLOT
Y1 <- athlet1$year
Y2 <- athlet1$apps
Y3 <- athlet1$top25
Y4 <- athlet1$cstufac
Y5 <- athlet1$lapps
{A} Scatter Plot b/w Year and Applications
plot(Y1,Y2,xlab="Year",ylab="Applications",main="SCATTER PLOT" , pch=18)
Conclusion:
The plot shows two vertical columns of points, one for the applications received in 1992 and one for 1993, because year takes only two values.
{B} Scatter Plot b/w Applications & Top 25
plot(Y2,Y3,xlab="APPLICATIONS",ylab="TOP 25",main="SCATTER PLOT" , pch=18)
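Conclusion:
The points in the above plot are crowded toward the low end of the applications axis, because applications are right-skewed (a few schools receive far more applications than the rest). We therefore redraw the plot using the log of applications, which reduces the skewness and shows a more linear relation.
Scatter Plot using log of Applications
plot(Y5,Y3,xlab="LOG(APPLICATIONS)",ylab="TOP 25",main="SCATTER PLOT", pch=18)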
{C} Scatter Plot b/w Applications & CHANGE IN STUDENT FACULTY RATIO
plot(Y2,Y4,xlab="APPLICATIONS",ylab="CHANGE IN STUDENT FACULTY RATIO",main="SCATTER PLOT", pch=18)
5.2)CORRELATION COEFFICIENT & ANALYSIS
The cor() command gives the correlation coefficient of each variable with every other variable. A correlation coefficient lies in the range [-1, 1].
ANALYSIS TABLE
1               PERFECT POSITIVE CORRELATION
0.6 to 0.9      HIGH POSITIVE CORRELATION
0.1 to 0.5      LOW POSITIVE CORRELATION
0               NO CORRELATION
-0.1 to -0.5    LOW NEGATIVE CORRELATION
-0.6 to -0.9    HIGH NEGATIVE CORRELATION
-1              PERFECT NEGATIVE CORRELATION
cor(athlet1[1:10])
finfour lapps
year 0.03587480 0.006668464
apps 0.16612714 0.970345677
top25 NA NA
ver500 NA NA
mth500 NA NA
stufac -0.07684385 -0.202483414
bowl 0.12937146 0.175858905
btitle 0.24068486 0.032863701
finfour 1.00000000 0.172073111
lapps 0.17207311 1.000000000
Analysis:
We can see that:
• bowl and apps have a low positive correlation (0.16113704).
• bowl and finfour have a low positive correlation (0.12937146), which is lower than the correlation between bowl and apps.
• btitle and finfour have a low positive correlation (0.24068486), which is higher than the correlation between bowl and apps or between bowl and finfour.
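The NA entries for top25, ver500 and mth500 in the cor() output arise because, by default, cor() returns NA for any pair in which a variable has missing values. One common way to get a number for those pairs is the `use` argument of cor(). The sketch below shows the difference on a small made-up data frame (hypothetical values, not the Campus data).

```r
# Hypothetical stand-in data; NOT the Campus dataset.
demo <- data.frame(
  apps   = c(3000, 5000, 8000, 12000, 20000),
  top25  = c(40, NA, 60, 80, NA),
  stufac = c(20, 18, 15, 12, 8)
)

# Default behaviour: pairs involving a column with NAs come back as NA.
cor_default <- cor(demo)

# Each pair is computed from the rows where both variables are present.
cor_pairwise <- cor(demo, use = "pairwise.complete.obs")

print(round(cor_default, 3))
print(round(cor_pairwise, 3))
```

With pairwise deletion, different entries of the matrix can be based on different numbers of observations, which should be kept in mind when comparing them.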
6)REGRESSION ANALYSIS
X = athlet1$apps
Y = athlet1$bowl
model1 = lm(Y ~ X)
model1
Call:
lm(formula = Y ~ X)
Coefficients:
(Intercept) X
2.862e-01 1.624e-05
summary(model1)
Residuals:
Min 1Q Median 3Q Max
-0.6462 -0.4177 -0.3622 0.5143 0.6513
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 2.862e-01 1.076e-01 2.660 0.00893 **
X 1.624e-05 9.236e-06 1.758 0.08130 .
---
Signif. codes:
0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
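The coefficient table can be read with a short calculation. The sketch below rests on two assumptions that are not stated in the output: that bowl is a 0/1 indicator (so the fitted value can be read as an estimated probability), and that all 118 observations enter the fit (so the residual degrees of freedom are 116). The variable names are illustrative.

```r
b0    <- 2.862e-01   # intercept estimate from the output above
b1    <- 1.624e-05   # slope estimate for X (apps)
se_b1 <- 9.236e-06   # standard error of the slope

# Fitted value at the sample mean of applications (10552.25).
apps_mean      <- 10552.25
fitted_at_mean <- b0 + b1 * apps_mean

# t value and two-sided p-value for the slope.
t_b1     <- b1 / se_b1
df_resid <- 118 - 2          # assumption: no observations dropped
p_b1     <- 2 * pt(abs(t_b1), df_resid, lower.tail = FALSE)

print(c(fitted_at_mean = fitted_at_mean, t_b1 = t_b1, p_b1 = p_b1))
```

Under these assumptions the slope is significant at the 10% level but not at the 5% level, which matches the "." code printed next to X.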