
SME ASSIGNMENT (SEM-3)

TOPICS COVERED:
• DESCRIPTION OF DATASET
– DEPENDENT & INDEPENDENT VARIABLES
• DESCRIPTIVE STATISTICS AND ANALYSIS
– SUMMARY
• MEAN
• MEDIAN
• MAXIMUM
• MINIMUM
• QUARTILES
• VARIANCES
– BOXPLOT
– HISTOGRAM
• CONFIDENCE INTERVAL ESTIMATION
• HYPOTHESIS TESTING
• CORRELATION ANALYSIS
– SCATTER PLOT
– CORRELATION COEFFICIENT
• REGRESSION ANALYSIS

Loading Dataset into RStudio


library(readxl)
athlet1 <- read_excel("D:\\Downloads\\athlet1.xlsx")
View(athlet1)
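Before any analysis it helps to confirm the size of the data and to see which columns contain missing values, since these affect var() and cor() later on. The sketch below uses a small hypothetical data frame named demo (not the real athlet1 file) with NAs planted in top25.

```r
# Hypothetical stand-in for athlet1: three columns, with NAs planted in top25.
set.seed(1)
demo <- data.frame(
  year  = rep(c(1992, 1993), each = 5),
  apps  = round(runif(10, 3000, 23000)),
  top25 = c(55, NA, 64, 85, NA, 36, 97, NA, 70, 60)
)

dim(demo)                          # number of observations and variables
str(demo)                          # type of each column
na_counts <- colSums(is.na(demo))  # missing values per column
na_counts
```

The same three calls can be run on athlet1 directly once it has been loaded.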

1)DESCRIPTION OF DATASET
• The athlet1 dataset contains 23 variables & 118 observations.
• Main variables of interest are: year, apps, top25, ver500, mth500 & stufac.
• Other variables such as school, lapps_1 and bball are of no use for now and are left out
of the analysis.
• Dependent Variables: top25, stufac
• Independent Variables: year, apps
• The working assumption is that top25 and the student-faculty ratio depend on the year and
on the number of applications.
2)DESCRIPTIVE STATISTICS AND ANALYSIS
2.1) SUMMARY
X1 = summary(athlet1$year)
X1 # Minimum, quartiles, median, mean and maximum of year (1992 or 1993)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1992 1992 1992 1992 1993 1993

X2 = summary(athlet1$apps)
X2 # Minimum, quartiles, median, mean and maximum of applications for admission
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3303 6966 8684 10552 13546 23342

X3 = summary(athlet1$top25)
X3 # Same summary for the percentage of the freshman class in the top 25% of their high school
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 36.00 55.00 64.00 68.28 85.00 97.00

X4 = summary(athlet1$ver500)
X4 # Same summary for the percentage of freshmen scoring >= 500 on the verbal SAT
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 20.0 36.0 48.0 54.0 70.5 94.00

X5 = summary(athlet1$mth500)
X5 # Same summary for the percentage of freshmen scoring >= 500 on the math SAT
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 39.00 62.00 81.00 77.56 93.00 99.00

X6 = summary(athlet1$stufac)
X6 # Same summary for the student-faculty ratio
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 7.00 12.00 16.00 15.05 18.00 24.00

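To collect the summaries and the variances in a single matrix, we use the following code:
V = c("MEASURES","MINIMUM VALUE","1ST QUARTILE VALUE","MEDIAN","MEAN VALUE","3RD QUARTILE VALUE","MAXIMUM VALUE","VARIANCE")
A = matrix(c(V,"Year",X1,var(athlet1$year),"Applications",X2,var(athlet1$apps),"TOP 25",X3,var(athlet1$top25),"STUDENT-FACULTY RATIO\n",X4,var(athlet1$stufac)), nrow = 8, ncol = 5)
A
Reading the output: top25 contains 25 missing values, so X3 carries a seventh entry (the NA
count) and var(athlet1$top25) returns NA. The "25" in the VARIANCE row of the TOP 25 column
is that NA count, the NA that follows is the variance, and the last column is pushed down by
one row. The last column is filled from X4 (the ver500 summary) rather than X6, so the values
under STUDENT-FACULTY RATIO are those of ver500, not stufac.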

[,1] [,2]
[1,] "MEASURES" "Year"
[2,] "MINIMUM VALUE" "1992"
[3,] "1st QUARTILE VALUE" "1992"
[4,] "MEDIAN" "1992.5"
[5,] "MEAN VALUE" "1992.5"
[6,] "3rd QUARTILE VALUE" "1993"
[7,] "MAXIMUM VALUE" "1993"
[8,] "VARIANCE" "0.252136752136752"
[,3] [,4]
[1,] "Applications" "TOP 25"
[2,] "3303" "36"
[3,] "6965.75" "55"
[4,] "8684" "64"
[5,] "10552.2542372881" "68.2795698924731"
[6,] "13546.5" "85"
[7,] "23342" "97"
[8,] "24638713.2681443" "25"
[,5]
[1,] NA
[2,] "STUDENT-FACULTY RATIO\n"
[3,] "20"
[4,] "36"
[5,] "48"
[6,] "54"
[7,] "70.5"
[8,] "94"
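A summary table of this kind can also be built as a numeric matrix with one column per variable, which keeps every column the same length even when some values are missing. The sketch below is one possible approach, not the code used in this report; the data frame demo and the helper describe() are hypothetical names introduced for illustration.

```r
# Hypothetical stand-in data; top25 has missing values, as in athlet1.
demo <- data.frame(
  apps   = c(3303, 6966, 8684, 13546, 23342, 9000),
  top25  = c(36, 55, NA, 85, 97, NA),
  stufac = c(7, 12, 16, 18, 24, 15)
)

# Seven statistics per variable; na.rm = TRUE drops missing values first,
# so no NA count or NA variance can shift the columns.
describe <- function(x) {
  q <- quantile(x, probs = c(0.25, 0.5, 0.75), na.rm = TRUE, names = FALSE)
  c(Minimum  = min(x, na.rm = TRUE),
    Q1       = q[1],
    Median   = q[2],
    Mean     = mean(x, na.rm = TRUE),
    Q3       = q[3],
    Maximum  = max(x, na.rm = TRUE),
    Variance = var(x, na.rm = TRUE))
}

tab <- sapply(demo, describe)  # 7 x 3 numeric matrix with row and column names
round(tab, 2)
```

Because the matrix stays numeric, its entries can be rounded or reused in later calculations, which is not possible once the values are mixed with labels in a character matrix.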

2.2)BOXPLOTS
The code below draws box plots for our variables Applications, Top25 and Student-Faculty
Ratio. No box plot is drawn for Year because it takes only two values.
# BOXPLOT FOR Applications
boxplot(athlet1$apps, xlab = "TOTAL APPLICATIONS" , ylab= "VALUES" , col = "light blue")
Interpretation:
The box spans Q1 (6966) to Q3 (13546), and the bold line inside it marks the median (8684).
The whiskers outside the box end at the minimum (3303) and the maximum (23342).

# BOXPLOT FOR Top25
boxplot(athlet1$top25, xlab = "TOP 25" , ylab= "VALUES" , col = "light green")
Interpretation:
The box spans Q1 (55) to Q3 (85), and the bold line inside it marks the median (64). The
whiskers outside the box end at the minimum (36) and the maximum (97).

# BOXPLOT FOR Stufac
boxplot(athlet1$stufac, xlab = "Student Faculty Ratio" , ylab= "VALUES" , col = "medium purple 1")
Interpretation:
The box spans Q1 (12) to Q3 (18), and the bold line inside it marks the median (16). The
whiskers outside the box end at the minimum (7) and the maximum (24).
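The whiskers of a box plot reach the minimum and maximum only when no observation lies more than 1.5 times the box height beyond the box; otherwise the extreme points are drawn individually. A minimal sketch of how to check this numerically with boxplot.stats(), using a hypothetical vector x rather than the athlet1 columns:

```r
# Hypothetical data with one unusually large value (45).
x <- c(7, 9, 12, 12, 14, 16, 16, 17, 18, 18, 24, 45)

b <- boxplot.stats(x)  # the statistics that boxplot() draws
b$stats                # lower whisker, lower hinge, median, upper hinge, upper whisker
b$out                  # points beyond the whiskers, drawn individually

# The whisker ends equal the minimum and maximum only if nothing is flagged.
whiskers_are_extremes <- length(b$out) == 0
whiskers_are_extremes
```

Running boxplot.stats() on athlet1$apps, athlet1$top25 and athlet1$stufac in the same way shows whether the interpretations above hold exactly for each variable.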

2.3)HISTOGRAMS
The code below draws the histograms of our data.
hist(athlet1$apps, col="light pink", xlab= "TOTAL APPLICATIONS" , ylab= "FREQUENCY", main= "HISTOGRAM OF TOTAL APPLICATIONS")
Interpretation:
This is the histogram of total applications. Bars in the 5000-10000 range have frequencies
above 25. Very few schools received more than 15000 applications.
hist(athlet1$year, col="light blue",xlab= "YEAR" , ylab= "FREQUENCY", main= "HISTOGRAM OF YEAR")
Interpretation:
This is the histogram of year. The counts for 1992 and 1993 are equal, with 59 observations
in each year.

hist(athlet1$ctop25, col="light green",xlab= "CHANGE IN TOP 25" , ylab= "FREQUENCY", main= "HISTOGRAM OF CHANGE IN TOP25")
Interpretation:
This is the histogram of the change in top25. The distribution looks roughly normal
(bell-shaped) and is centred near 0.

hist(athlet1$cstufac, col="medium purple 1", xlab= "CHANGE IN STUDENT FACULTY RATIO" , ylab= "FREQUENCY", main= "HISTOGRAM OF CHANGE IN STU-FAC RATIO")
Interpretation:
This is the histogram of the change in student-faculty ratio. The distribution looks roughly
normal (bell-shaped) and is centred near -1.
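Statements read off a histogram, such as bin frequencies or an approximately normal shape, can be backed by numbers. The sketch below is an illustration on simulated data; the vector change is hypothetical and stands in for a column such as ctop25.

```r
set.seed(42)
# Hypothetical year-to-year changes, centred near 0.
change <- rnorm(100, mean = 0, sd = 5)

h <- hist(change, plot = FALSE)  # compute the bins without drawing
bins <- data.frame(from = head(h$breaks, -1), to = h$breaks[-1], count = h$counts)
bins

# For a symmetric, bell-shaped distribution the mean and median are close.
c(mean = mean(change), median = median(change))

# Shapiro-Wilk test: a small p-value is evidence against normality.
shapiro.test(change)$p.value
```

The bins table gives the exact frequency of each bar, so a claim such as "more than 25 observations between 5000 and 10000" can be checked directly.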

3)CONFIDENCE INTERVAL (CI) ESTIMATION


~ Finding 95% CI for population mean from sample data “apps”
t.test(athlet1$apps, conf.level=0.95)
One Sample t-test

data: athlet1$apps
t = 23.093, df = 117, p-value < 2.2e-16
alternative hypothesis: true mean is not equal to 0
95 percent confidence interval:
9647.29 11457.22
sample estimates:
mean of x
10552.25

Result:
The 95% CI for population mean (mu) is (9647.29, 11457.22)
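The interval reported by t.test() is the sample mean plus or minus a t critical value times the standard error. The sketch below first rebuilds the interval by hand on a hypothetical simulated sample x, then repeats the arithmetic with the figures printed above (mean 10552.25, t = 23.093, df = 117); because those figures are rounded, the second result matches the reported limits only approximately.

```r
set.seed(7)
x <- rnorm(118, mean = 10000, sd = 5000)  # hypothetical sample, not athlet1$apps

n     <- length(x)
se    <- sd(x) / sqrt(n)                  # standard error of the mean
tcrit <- qt(0.975, df = n - 1)            # two-sided 95% critical value

manual_ci <- mean(x) + c(-1, 1) * tcrit * se
manual_ci
as.numeric(t.test(x, conf.level = 0.95)$conf.int)  # should agree with manual_ci

# Same arithmetic with the figures printed above.
# When testing against 0, t = mean / se, so se = mean / t.
se_reported <- 10552.25 / 23.093
reported_ci <- 10552.25 + c(-1, 1) * qt(0.975, df = 117) * se_reported
reported_ci
```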
~ Finding 99% CI for population proportion from sample data “year”
# Taking "success" to be an observation from the year 1992.
S <- athlet1$year
s <- sum(S == 1992) # number of successes
s
## [1] 59
n <- length(S) # sample size n
n
## [1] 118

binom.test(s,n,conf.level = 0.99)

Exact binomial test

data: s and n
number of successes = 59, number of trials = 118, p-value = 1
alternative hypothesis: true probability of success is not equal to 0.5
99 percent confidence interval:
0.3792609 0.6207391
sample estimates:
probability of success
0.5

Result:
The 99% CI for population proportion (p) is (0.3792609, 0.6207391)
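binom.test() reports the Clopper-Pearson ("exact") interval, whose limits can be written as quantiles of beta distributions. A short sketch with the counts from above (59 successes in 118 trials) that computes the limits directly and compares them with binom.test():

```r
s <- 59; n <- 118; conf <- 0.99  # counts and confidence level used above
alpha <- 1 - conf

# Clopper-Pearson limits expressed through beta quantiles.
lower <- qbeta(alpha / 2, s, n - s + 1)
upper <- qbeta(1 - alpha / 2, s + 1, n - s)
manual_ci <- c(lower, upper)
manual_ci

# The interval from binom.test() should agree with manual_ci.
test_ci <- as.numeric(binom.test(s, n, conf.level = conf)$conf.int)
test_ci
```

Because s is exactly half of n, the interval is symmetric around 0.5.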

4)HYPOTHESIS TESTING
~ Testing at the 5% significance level (95% confidence) whether the average stufac exceeds 18.
# Intuitively, our hypothesis will be:
# H0: mu = 18
# H1: mu > 18
t.test(athlet1$stufac, mu = 18, conf.level=0.95, alternative= "greater")
One Sample t-test

data: athlet1$stufac
t = -8.1774, df = 117, p-value = 1
alternative hypothesis: true mean is greater than 18
95 percent confidence interval:
14.4529 Inf
sample estimates:
mean of x
15.05085

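Conclusion:
We do not have sufficient evidence to reject our null hypothesis because
1) the p-value (1) is greater than the level of significance (0.05);
2) 18 lies inside the one-sided 95% CI (14.4529, Inf).
The sample mean (15.05) is in fact below 18, so the data give no support to H1: mu > 18.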
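The p-value and the one-sided confidence bound can be recovered from the printed test statistic alone. The sketch below uses only the figures reported above (mean 15.05085, t = -8.1774, df = 117); the variable names are illustrative, and the results match the printed output only up to rounding.

```r
xbar <- 15.05085; mu0 <- 18; t_stat <- -8.1774; df <- 117  # values printed above

se <- (xbar - mu0) / t_stat                       # from t = (xbar - mu0) / se
p_greater   <- pt(t_stat, df, lower.tail = FALSE) # p-value for H1: mu > 18
p_less      <- pt(t_stat, df)                     # p-value had H1 been mu < 18
lower_bound <- xbar - qt(0.95, df) * se           # one-sided 95% lower bound

c(se = se, p_greater = p_greater, p_less = p_less, lower_bound = lower_bound)
```

A p-value close to 1 for the "greater" alternative goes together with a very small p-value for the "less" alternative: the data point strongly in the opposite direction to H1.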

5)CORRELATION ANALYSIS
5.1)SCATTER PLOT
Y1 <- athlet1$year
Y2 <- athlet1$apps
Y3 <- athlet1$top25
Y4 <- athlet1$cstufac
Y5 <- athlet1$lapps
{A} Scatter Plot b/w Year and Applications
plot(Y1,Y2,xlab="Year",ylab="Applications",main="SCATTER PLOT" , pch=18)

Conclusion:
The plot shows two vertical columns of points, one for the applications received in 1992 and
one for 1993, because year takes only two values.

Scatter Plot b/w CHANGE IN STUDENT FACULTY RATIO and TOP 25
plot(Y4,Y3,xlab="CHANGE IN STUDENT FACULTY RATIO",ylab="TOP 25",main="SCATTER PLOT" , pch=18)
{B} Scatter Plot b/w Applications and TOP 25
plot(Y2,Y3,xlab="APPLICATIONS",ylab="TOP 25",main="SCATTER PLOT" , pch=18)

Conclusion:
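We can infer that applications are right-skewed: most points bunch at low values, with a long
tail to the right. We therefore redraw the plot using the log of applications (lapps), which
reduces the skewness & shows a more linear relation.

Scatter Plot b/w log of Applications and TOP 25
plot(Y5,Y3,xlab="LOG OF APPLICATIONS",ylab="TOP 25",main="SCATTER PLOT" , pch=18)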
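The effect of the log transformation on skewness can be measured instead of judged by eye. The sketch below is an illustration on simulated data: apps_demo is a hypothetical right-skewed variable and skewness() is a small helper defined here, not a base R function.

```r
set.seed(3)
apps_demo <- rlnorm(200, meanlog = 9, sdlog = 0.5)  # hypothetical right-skewed counts

# Moment-based sample skewness: positive values indicate a long right tail.
skewness <- function(x) {
  z <- (x - mean(x)) / sd(x)
  mean(z^3)
}

skew_raw <- skewness(apps_demo)
skew_log <- skewness(log(apps_demo))
c(raw = skew_raw, logged = skew_log)
```

Applying skewness() to athlet1$apps and athlet1$lapps would show how much of the skew the log removes in the real data.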

{C} Scatter Plot b/w Applications & CHANGE IN STUDENT FACULTY RATIO
plot(Y2,Y4,xlab="APPLICATIONS",ylab="CHANGE IN STUDENT FACULTY RATIO",main="SCATTER PLOT", pch=18)
5.2)CORRELATION COEFFICIENT & ANALYSIS
The cor() command gives the correlation coefficient of each variable with every other
variable. A correlation coefficient always lies between -1 and 1 (inclusive).
ANALYSIS TABLE
1               PERFECT POSITIVE CORRELATION
0.6 to 0.9      HIGH POSITIVE CORRELATION
0.1 to 0.5      LOW POSITIVE CORRELATION
0               NO CORRELATION
-0.1 to -0.5    LOW NEGATIVE CORRELATION
-0.6 to -0.9    HIGH NEGATIVE CORRELATION
-1              PERFECT NEGATIVE CORRELATION

cor(athlet1[1:10])
# top25, ver500 and mth500 contain missing values, so their correlations print as NA.

year apps top25 ver500
year 1.000000000 0.003250838 NA NA
apps 0.003250838 1.000000000 NA NA
top25 NA NA 1 NA
ver500 NA NA NA 1
mth500 NA NA NA NA
stufac -0.017379317 -0.134807263 NA NA
bowl 0.000000000 0.161137045 NA NA
btitle -0.052414242 0.057727689 NA NA
finfour 0.035874800 0.166127136 NA NA
lapps 0.006668464 0.970345677 NA NA
d93 1.000000000 0.003250838 NA NA
avg500 NA NA NA NA
cfinfour NA NA NA NA
clapps NA NA NA NA

mth500 stufac bowl btitle
year NA -0.01737932 0.00000000 -0.05241424
apps NA -0.13480726 0.16113704 0.05772769
top25 NA NA NA NA
ver500 NA NA NA NA
mth500 1 NA NA NA
stufac NA 1.00000000 -0.04685707 0.02208990
bowl NA -0.04685707 1.00000000 -0.02139802
btitle NA 0.02208990 -0.02139802 1.00000000
finfour NA -0.07684385 0.12937146 0.24068486
lapps NA -0.20248341 0.17585890 0.03286370
d93 NA -0.01737932 0.00000000 -0.05241424
avg500 NA NA NA NA
cfinfour NA NA NA NA
clapps NA NA NA NA

finfour lapps
year 0.03587480 0.006668464
apps 0.16612714 0.970345677
top25 NA NA
ver500 NA NA
mth500 NA NA
stufac -0.07684385 -0.202483414
bowl 0.12937146 0.175858905
btitle 0.24068486 0.032863701
finfour 1.00000000 0.172073111
lapps 0.17207311 1.000000000

Analysis:
We can see that:
• The values of bowl and apps have a low positive correlation ( 0.16113704 )

• The values of bowl and finfour have a low positive correlation ( 0.12937146 ) which is lower
than the positive correlation between bowl and apps.

• The values of btitle and finfour have a low positive correlation ( 0.24068486 ) which is higher
than the positive correlation between bowl and apps or bowl and finfour.
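By default cor() returns NA for every pair that involves a column with missing values. Its use argument controls how such rows are handled. The sketch below shows the three common settings on a small hypothetical data frame demo (not athlet1).

```r
# Hypothetical stand-in: top25 has missing values, apps and bowl do not.
demo <- data.frame(
  apps  = c(3303, 6966, 8684, 10552, 13546, 23342),
  bowl  = c(0, 0, 1, 0, 1, 1),
  top25 = c(36, NA, 64, 68, NA, 97)
)

r_default  <- cor(demo)                                # pairs with top25 are NA
r_pairwise <- cor(demo, use = "pairwise.complete.obs") # each pair uses rows where both are present
r_complete <- cor(demo, use = "complete.obs")          # only rows with no NA in any column

r_default
r_pairwise
r_complete
```

With "pairwise.complete.obs" different entries of the matrix may be based on different numbers of rows, whereas "complete.obs" uses one common set of rows for all pairs.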

6)REGRESSION ANALYSIS
X= athlet1$apps
Y= athlet1$bowl

plot(X,Y,xlab="Applications Received",ylab="BOWL GAMES",main="APPLICATIONS AND BOWL GAMES PLAYED")

model1 <- lm(Y~X)
# bowl (Y) is modelled as dependent on applications received (X).

model1

Call:
lm(formula = Y ~ X)

Coefficients:
(Intercept) X
2.862e-01 1.624e-05

# Adding regression line to Scatter Plot
abline(model1, col= "red",lty = 3, lwd= 3 )

# Regression equation becomes:
# Y = 2.862e-01 + (1.624e-05)X
# For any value of x, this model gives the predicted/fitted value of y.

# Now, adding fitted points on regression line (plotted against X, the applications)
points(X, model1$fitted.values, col= "blue")
summary(model1)

Call:
lm(formula = Y ~ X)

Residuals:
Min 1Q Median 3Q Max
-0.6462 -0.4177 -0.3622 0.5143 0.6513

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 2.862e-01 1.076e-01 2.660 0.00893 **
X 1.624e-05 9.236e-06 1.758 0.08130 .
---
Signif. codes:
0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

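Residual standard error: 0.4959 on 116 degrees of freedom
Multiple R-squared: 0.02597, Adjusted R-squared: 0.01757
F-statistic: 3.092 on 1 and 116 DF, p-value: 0.0813

Conclusion:
The slope is positive but not significant at the 5% level (p-value 0.0813), and the model
explains only about 2.6% of the variation in bowl (R-squared 0.02597).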
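In a regression with a single predictor, R-squared equals the squared correlation between X and Y, and the F-statistic equals the squared t value of the slope. The sketch below checks both identities against the figures printed in this report, then repeats the R-squared check on simulated data. The data frame d, the function predict_bowl() and the simulated values are hypothetical and only illustrate the relationships.

```r
# Figures printed in this report.
r_apps_bowl <- 0.16113704         # cor(apps, bowl) from section 5.2
b0 <- 2.862e-01; b1 <- 1.624e-05  # fitted intercept and slope
t_slope <- 1.758                  # t value of the slope

r_apps_bowl^2  # compare with Multiple R-squared (0.02597)
t_slope^2      # compare with the F-statistic (3.092), up to rounding

# Fitted value for a given number of applications, from the regression equation.
predict_bowl <- function(apps) b0 + b1 * apps
predict_bowl(c(5000, 10000, 20000))

# The same R-squared identity on simulated data (hypothetical, not athlet1).
set.seed(11)
d <- data.frame(apps = runif(50, 3000, 23000))
d$bowl <- rbinom(50, 1, 0.4)
fit <- lm(bowl ~ apps, data = d)
r2_check <- all.equal(summary(fit)$r.squared, cor(d$apps, d$bowl)^2)
r2_check
```

Passing a data frame through the data argument, as in lm(bowl ~ apps, data = d), keeps the variable names in the output, which makes the coefficient table easier to read than one labelled X and Y.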
