Example of Assignment - Data Analysis

SME ASSIGNMENT (SEM-3)
TOPICS COVERED:
• DESCRIPTION OF DATASET
– DEPENDENT & INDEPENDENT VARIABLES
• DESCRIPTIVE STATISTICS AND ANALYSIS
– SUMMARY
• MEAN
• MEDIAN
• MAXIMUM
• MINIMUM
• QUARTILES
• VARIANCES
– BOXPLOT
– HISTOGRAM
• CONFIDENCE INTERVAL ESTIMATION
• HYPOTHESIS TESTING
• CORRELATION ANALYSIS
– SCATTER PLOT
– CORRELATION COEFFICIENT
• REGRESSION ANALYSIS
Loading Dataset into RStudio

library(readxl)
athlet1 <- read_excel("D:\\Downloads\\athlet1.xlsx")
View(athlet1)
1)DESCRIPTION OF DATASET
• The athlet1 dataset contains 23 variables & 118 observations.
• Main variables of interest are: year, apps, top25, ver500, mth500 & stufac.
• Other variables such as school, lapps_1 and bball are not used for now and are treated as
placeholder (“dummy”) variables in this assignment.
• Dependent Variables: top25, stufac
• Independent Variables: year, apps
• This is because top25 and the student-faculty ratio are assumed to depend on the year and
the number of applications.
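Before analysing, it can help to confirm the dimensions and to see which columns have missing values. The sketch below uses a small made-up data frame called demo (a hypothetical stand-in, not the athlet1 data) so that it runs without the Excel file; the same calls could be applied to athlet1.

```r
# Hypothetical stand-in for athlet1; values are made up for illustration only
demo <- data.frame(
  year  = c(1992, 1993, 1992, 1993),
  apps  = c(3303, 8684, 13546, 23342),
  top25 = c(36, NA, 64, 97)
)

dims       <- dim(demo)             # c(number of rows, number of columns)
na_per_col <- colSums(is.na(demo))  # count of missing values in each column

print(dims)
print(na_per_col)
```

Columns with a non-zero count need na.rm = TRUE (or similar handling) in later calls such as var() and cor().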
2)DESCRIPTIVE STATISTICS AND ANALYSIS
2.1) SUMMARY
X1 = summary(athlet1$year)
X1 # Six-value summary (five-number summary plus mean) of year, 1992 or 1993
# Printed values are rounded to 4 significant digits, so 1992.5 shows as 1992
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1992 1992 1992 1992 1993 1993
X2 = summary(athlet1$apps)
X2 # Six-value summary of applications for admission
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3303 6966 8684 10552 13546 23342
X3 = summary(athlet1$top25)
X3 # Six-value summary of percent of freshman class in top 25 percent of high school class
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 36.00 55.00 64.00 68.28 85.00 97.00
X4 = summary(athlet1$ver500)
X4 # Six-value summary of percent of freshmen scoring >= 500 on verbal SAT
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 20.0 36.0 48.0 54.0 70.5 94.00
X5 = summary(athlet1$mth500)
X5 # Six-value summary of percent of freshmen scoring >= 500 on math SAT
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 39.00 62.00 81.00 77.56 93.00 99.00
X6 = summary(athlet1$stufac)
X6 # Six-value summary of the student-faculty ratio
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 7.00 12.00 16.00 15.05 18.00 24.00
To get the entire summary of the variables, including variances, we use the following
code to represent it in matrix form:
V = c("MEASURES", "MINIMUM VALUE", "1ST QUARTILE VALUE", "MEDIAN", "MEAN VALUE",
      "3RD QUARTILE VALUE", "MAXIMUM VALUE", "VARIANCE")
A = matrix(c(V,
             "Year", X1, var(athlet1$year),
             "Applications", X2, var(athlet1$apps),
             "TOP 25", X3, var(athlet1$top25),
             "STUDENT-FACULTY RATIO", X4, var(athlet1$stufac)),
           nrow = 8, ncol = 5)
A
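     [,1]                 [,2]                [,3]               [,4]               [,5]
[1,] "MEASURES"           "Year"              "Applications"     "TOP 25"           NA
[2,] "MINIMUM VALUE"      "1992"              "3303"             "36"               "STUDENT-FACULTY RATIO"
[3,] "1ST QUARTILE VALUE" "1992"              "6965.75"          "55"               "20"
[4,] "MEDIAN"             "1992.5"            "8684"             "64"               "36"
[5,] "MEAN VALUE"         "1992.5"            "10552.2542372881" "68.2795698924731" "48"
[6,] "3RD QUARTILE VALUE" "1993"              "13546.5"          "85"               "54"
[7,] "MAXIMUM VALUE"      "1993"              "23342"            "97"               "70.5"
[8,] "VARIANCE"           "0.252136752136752" "24638713.2681443" "25"               "94"
Note on reading this output: columns 2 and 3 are correct, but columns 4 and 5 are not.
top25 has 25 missing values, so X3 carries a seventh element (the count of NA's) and
var(athlet1$top25) returns NA. The "25" in the VARIANCE row of column 4 is therefore the
number of missing values, not a variance, and column 5 is pushed down by one row. Column 5
also holds the ver500 summary, because the call passes X4 instead of X6 (the stufac summary).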
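One way to build such a table while keeping the values numeric and skipping missing values is sketched below. It is only an illustration: demo is a made-up data frame (not the athlet1 data) and describe() is a hypothetical helper introduced here.

```r
# Hypothetical stand-in for athlet1; made-up values, with one missing top25
demo <- data.frame(
  year   = c(1992, 1993, 1992, 1993, 1992, 1993),
  apps   = c(3303, 6966, 8684, 10552, 13546, 23342),
  top25  = c(36, 55, NA, 64, 85, 97),
  stufac = c(7, 12, 16, 15, 18, 24)
)

# Hypothetical helper: six summary values plus the variance, ignoring NAs
describe <- function(x) {
  q <- quantile(x, probs = c(0, 0.25, 0.5, 0.75, 1), na.rm = TRUE, names = FALSE)
  c(Min = q[1], Q1 = q[2], Median = q[3], Mean = mean(x, na.rm = TRUE),
    Q3 = q[4], Max = q[5], Variance = var(x, na.rm = TRUE))
}

# One column per variable, one row per measure; the result stays numeric
tab <- sapply(demo, describe)
print(round(tab, 2))
```

Because each column is produced by the same function, every variable gets exactly seven rows and no entries can shift between columns.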
2.2)BOXPLOTS
This is the code to get the box plots of our variables: Applications, Top25 and Student
Faculty Ratio. We will not make a box plot for the variable Year because it only has two
possible values.
# BOXPLOT FOR Applications
boxplot(athlet1$apps, xlab = "TOTAL APPLICATIONS", ylab = "VALUES", col = "light blue")
Interpretation:
The box ranges from Q1 (6966) to Q3 (13546), with the median (8684) shown by the bold line
inside the box. The whiskers (horizontal lines outside the box) mark the minimum value
(3303) & the maximum value (23342).
# BOXPLOT FOR Top25
boxplot(athlet1$top25, xlab = "TOP 25", ylab = "VALUES", col = "light green")
Interpretation:
The box ranges from Q1 (55) to Q3 (85), with the median (64) shown by the bold line inside
the box. The whiskers (horizontal lines outside the box) mark the minimum value (36) & the
maximum value (97).
# BOXPLOT FOR Stufac
boxplot(athlet1$stufac, xlab = "Student Faculty Ratio", ylab = "VALUES", col = "medium purple 1")
Interpretation:
The box ranges from Q1 (12) to Q3 (18), with the median (16) shown by the bold line inside
the box. The whiskers (horizontal lines outside the box) mark the minimum value (7) & the
maximum value (24).
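By default, the whiskers drawn by boxplot() reach only as far as the most extreme point within 1.5 times the box length, so they equal the minimum and maximum only when there are no outliers. A minimal sketch of how this could be checked with boxplot.stats(), using a made-up vector rather than the athlet1 columns:

```r
# Made-up values; 50 is an obvious outlier
x <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 50)

bs <- boxplot.stats(x)  # the statistics that boxplot() uses for drawing

print(bs$stats)  # lower whisker, lower hinge, median, upper hinge, upper whisker
print(bs$out)    # points plotted individually beyond the whiskers
```

If bs$out is empty for a variable, reading the whisker ends as the minimum and maximum is safe.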
2.3)HISTOGRAMS
This is the code to get the histograms of our data.
hist(athlet1$apps, col="light pink", xlab= "TOTAL APPLICATIONS", ylab= "FREQUENCY",
main= "HISTOGRAM OF TOTAL APPLICATIONS")
Interpretation:
This is the histogram of total applications. The bins in the range 5000-10000 have
frequencies above 25. Very few observations have more than 15000 applications.
hist(athlet1$year, col="light blue", xlab= "YEAR", ylab= "FREQUENCY",
main= "HISTOGRAM OF YEAR")
Interpretation:
This is the histogram of the year. The frequency of 1992 is equal to that of 1993: there
are 59 data points each for 1992 & 1993.
hist(athlet1$ctop25, col="light green", xlab= "CHANGE IN TOP 25", ylab= "FREQUENCY",
main= "HISTOGRAM OF CHANGE IN TOP25")
Interpretation:
This is the histogram of change in Top 25. The changes look roughly normally distributed
(bell-shaped) and centred around 0.
hist(athlet1$cstufac, col="medium purple 1", xlab= "CHANGE IN STUDENT FACULTY RATIO",
ylab= "FREQUENCY", main= "HISTOGRAM OF CHANGE IN STU-FAC RATIO")
Interpretation:
This is the histogram of change in student faculty ratio. The changes look roughly
normally distributed and centred around -1.
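The bin counts behind a histogram can be read directly instead of being estimated from the picture. The sketch below assumes made-up normal data standing in for ctop25 (the real column is not used here) and asks hist() for the bins without drawing anything.

```r
set.seed(1)
x <- rnorm(118, mean = 0, sd = 5)  # made-up values standing in for ctop25

h <- hist(x, plot = FALSE)         # compute the bins without drawing the plot

# One row per bin: lower edge, upper edge and number of observations
bins <- data.frame(from  = head(h$breaks, -1),
                   to    = h$breaks[-1],
                   count = h$counts)
print(bins)
```

A table like this makes statements such as "frequencies above 25 in the range 5000-10000" easy to verify.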
3)CONFIDENCE INTERVAL (CI) ESTIMATION

~ Finding 95% CI for population mean from sample data “apps”
t.test(athlet1$apps, conf.level=0.95)
One Sample t-test
data: athlet1$apps
t = 23.093, df = 117, p-value < 2.2e-16
alternative hypothesis: true mean is not equal to 0
95 percent confidence interval:
9647.29 11457.22
sample estimates:
mean of x
10552.25
Result:
The 95% CI for population mean (mu) is (9647.29, 11457.22)
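The interval can be cross-checked by hand from the figures printed by t.test: since the test is against 0, the standard error is the mean divided by t, and the interval is the mean plus or minus the t critical value times that standard error. A small sketch using only the reported numbers (no data file needed):

```r
xbar  <- 10552.25  # sample mean reported by t.test
tstat <- 23.093    # reported t statistic for H0: mu = 0
df    <- 117       # reported degrees of freedom (n - 1)

se     <- xbar / tstat        # because t = (xbar - 0) / se
margin <- qt(0.975, df) * se  # two-sided 95% critical value times se
ci     <- c(xbar - margin, xbar + margin)

print(round(ci, 2))
```

The result should agree with the reported interval up to rounding of the inputs.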
~ Finding 99% CI for population proportion from sample data “year”
# Assuming success to be an observation from the year 1992.
S <- athlet1$year
s <- length(S[S==1992]) # number of successes
s
## [1] 59
n <- length(S) # sample size n
n
## [1] 118
binom.test(s,n,conf.level = 0.99)
Exact binomial test
data: s and n
number of successes = 59, number of trials = 118, p-value = 1
alternative hypothesis: true probability of success is not equal to 0.5
99 percent confidence interval:
0.3792609 0.6207391
sample estimates:
probability of success
0.5
Result:
The 99% CI for population proportion (p) is (0.3792609, 0.6207391)
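The interval reported by binom.test is the exact (Clopper-Pearson) interval, which can be written in terms of beta quantiles. The sketch below recomputes it from s = 59 and n = 118 and compares it with the built-in result; the variable names manual and builtin are introduced here for illustration.

```r
s    <- 59    # number of successes (observations from 1992)
n    <- 118   # sample size
conf <- 0.99
alpha <- 1 - conf

# Clopper-Pearson limits expressed as beta quantiles
lower <- qbeta(alpha / 2, s, n - s + 1)
upper <- qbeta(1 - alpha / 2, s + 1, n - s)

manual  <- c(lower, upper)
builtin <- as.numeric(binom.test(s, n, conf.level = conf)$conf.int)

print(manual)
print(builtin)
```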
4)HYPOTHESIS TESTING
~ Testing at the 5% level of significance (95% confidence) whether the average stufac exceeds 18 or not.
# Our hypotheses will be:
# H0: mu = 18
# H1: mu > 18
t.test(athlet1$stufac, mu = 18, conf.level=0.95, alternative= "greater")
One Sample t-test
data: athlet1$stufac
t = -8.1774, df = 117, p-value = 1
alternative hypothesis: true mean is greater than 18
95 percent confidence interval:
14.4529 Inf
sample estimates:
mean of x
15.05085
Conclusion:
We have insufficient evidence to reject our null hypothesis in favour of mu > 18 because
1) the p-value (i.e. 1) is greater than the level of significance (i.e. 0.05);
2) 18 lies in the one-sided CI (14.4529, Inf).
The sample mean (15.05) is in fact below 18, so the data give no support to H1.
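The one-sided p-value and the lower confidence bound can be reproduced from the reported statistics alone. A minimal sketch, assuming only the numbers printed by t.test:

```r
xbar  <- 15.05085  # reported sample mean of stufac
mu0   <- 18        # hypothesised mean under H0
tstat <- -8.1774   # reported t statistic
df    <- 117       # reported degrees of freedom

se <- (xbar - mu0) / tstat                       # since t = (xbar - mu0) / se
p_greater <- pt(tstat, df, lower.tail = FALSE)   # P(T >= t) for H1: mu > 18
lower     <- xbar - qt(0.95, df) * se            # one-sided 95% lower bound

print(p_greater)
print(lower)
```

A strongly negative t statistic puts almost all of the probability to its right, which is why the p-value for the "greater" alternative is essentially 1.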
5)CORRELATION ANALYSIS
5.1)SCATTER PLOT
Y1 <- athlet1$year
Y2 <- athlet1$apps
Y3 <- athlet1$top25
Y4 <- athlet1$cstufac
Y5 <- athlet1$lapps
{A} Scatter Plot b/w Year and Applications
plot(Y1,Y2,xlab="Year",ylab="Applications",main="SCATTER PLOT", pch=18)
Conclusion:
We can infer that the above plot shows two vertical columns of points, one for the
applications received in 1992 and one for 1993.
Scatter Plot b/w CHANGE IN STUDENT FACULTY RATIO and TOP 25
plot(Y4,Y3,xlab="CHANGE IN STUDENT FACULTY RATIO",ylab="TOP 25",main="SCATTER PLOT", pch=18)
{B} Scatter Plot b/w Applications and TOP 25
plot(Y2,Y3,xlab="APPLICATIONS",ylab="TOP 25",main="SCATTER PLOT", pch=18)
Conclusion:
We can infer that applications in the above plot are right-skewed: most points cluster at
low application counts, with a long tail towards high counts. Thus, we re-plot using the
log values (lapps), which reduces the skewness & shows a more linear relation.
Scatter Plot b/w log(Applications) and TOP 25
plot(Y5,Y3,xlab="LOG(APPLICATIONS)",ylab="TOP 25",main="SCATTER PLOT", pch=18)
{C} Scatter Plot b/w Applications & CHANGE IN STUDENT FACULTY RATIO
plot(Y2,Y4,xlab="APPLICATIONS",ylab="CHANGE IN STUDENT FACULTY RATIO",main="SCATTER PLOT", pch=18)
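The effect of the log transform on skewness can be quantified rather than judged by eye. The sketch below assumes made-up, log-normally distributed counts in place of apps and uses a hypothetical helper skew() (the moment-based sample skewness), neither of which comes from the assignment.

```r
set.seed(42)
apps_demo <- rlnorm(500, meanlog = 9, sdlog = 0.5)  # made-up right-skewed counts

# Hypothetical helper: moment-based sample skewness (0 for a symmetric sample)
skew <- function(x) mean((x - mean(x))^3) / sd(x)^3

skew_raw <- skew(apps_demo)       # clearly positive: long right tail
skew_log <- skew(log(apps_demo))  # close to 0 after taking logs

print(c(raw = skew_raw, log = skew_log))
```

A positive value indicates a long right tail; a value near zero after the transform is what makes the scatter plot against lapps look more evenly spread.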
5.2)CORRELATION COEFFICIENT & ANALYSIS
The cor() command gives the correlation coefficient of each variable with every other
variable. A correlation coefficient lies in the range [-1, 1].
ANALYSIS TABLE
1 PERFECT POSITIVE CORRELATION
0.6 to 0.9 HIGH POSITIVE CORRELATION
0.1 to 0.5 LOW POSITIVE CORRELATION
0 NO CORRELATION
-0.1 to -0.5 LOW NEGATIVE CORRELATION
-0.6 to -0.9 HIGH NEGATIVE CORRELATION
-1 PERFECT NEGATIVE CORRELATION
cor(athlet1[1:10])
year apps top25 ver500
year 1.000000000 0.003250838 NA NA
apps 0.003250838 1.000000000 NA NA
top25 NA NA 1 NA
ver500 NA NA NA 1
mth500 NA NA NA NA
stufac -0.017379317 -0.134807263 NA NA
bowl 0.000000000 0.161137045 NA NA
btitle -0.052414242 0.057727689 NA NA
finfour 0.035874800 0.166127136 NA NA
lapps 0.006668464 0.970345677 NA NA
d93 1.000000000 0.003250838 NA NA
avg500 NA NA NA NA
cfinfour NA NA NA NA
clapps NA NA NA NA
mth500 stufac bowl btitle
year NA -0.01737932 0.00000000 -0.05241424
apps NA -0.13480726 0.16113704 0.05772769
top25 NA NA NA NA
ver500 NA NA NA NA
mth500 1 NA NA NA
stufac NA 1.00000000 -0.04685707 0.02208990
bowl NA -0.04685707 1.00000000 -0.02139802
btitle NA 0.02208990 -0.02139802 1.00000000
finfour NA -0.07684385 0.12937146 0.24068486
lapps NA -0.20248341 0.17585890 0.03286370
d93 NA -0.01737932 0.00000000 -0.05241424
avg500 NA NA NA NA
cfinfour NA NA NA NA
clapps NA NA NA NA
finfour lapps
year 0.03587480 0.006668464
apps 0.16612714 0.970345677
top25 NA NA
ver500 NA NA
mth500 NA NA
stufac -0.07684385 -0.202483414
bowl 0.12937146 0.175858905
btitle 0.24068486 0.032863701
finfour 1.00000000 0.172073111
lapps 0.17207311 1.000000000
Analysis:
We can see that:
• The values of bowl and apps have a low positive correlation (0.16113704).
• The values of bowl and finfour have a low positive correlation (0.12937146), which is lower
than the positive correlation between bowl and apps.
• The values of btitle and finfour have a low positive correlation (0.24068486), which is higher
than the positive correlation between bowl and apps or bowl and finfour.
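The NA entries for top25, ver500 and mth500 arise because cor() returns NA by default for any pair that involves a missing value. A minimal sketch of how the use argument changes this, on a made-up data frame (not the athlet1 data):

```r
# Hypothetical data with one missing top25 value
demo <- data.frame(
  apps  = c(3303, 6966, 8684, 10552, 13546, 23342),
  top25 = c(36, NA, 64, 55, 85, 97)
)

cor_default  <- cor(demo)                                # NA for pairs with missing data
cor_pairwise <- cor(demo, use = "pairwise.complete.obs") # uses rows complete for each pair

print(cor_default)
print(cor_pairwise)
```

With pairwise deletion each coefficient may be based on a different number of rows, which is worth stating when such values are reported.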
6)REGRESSION ANALYSIS
X= athlet1$apps
Y= athlet1$bowl
plot(X,Y,xlab="Applications Received",ylab="BOWL GAMES",
main="APPLICATIONS AND BOWL GAMES PLAYED")
model1 <- lm(Y~X)
# Bowl games (Y) is taken as dependent on applications received (X); the two
# show a low positive correlation (0.161).
model1
Call:
lm(formula = Y ~ X)
Coefficients:
(Intercept) X
2.862e-01 1.624e-05
#Adding regression line to Scatter Plot

abline(model1, col= "red",lty = 3, lwd= 3 )
# Regression equation becomes:

# Y = 2.862e-01 + (1.624e-05)X
# For any value of x, this model gives the predicted/fitted value of y.
# Now, adding fitted points on the regression line
# (plotted against X, the predictor used in the model)
points(X, model1$fitted.values, col= "blue")
summary(model1)
Call:
lm(formula = Y ~ X)

Residuals:
    Min      1Q  Median      3Q     Max
-0.6462 -0.4177 -0.3622  0.5143  0.6513

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept) 2.862e-01  1.076e-01   2.660  0.00893 **
X           1.624e-05  9.236e-06   1.758  0.08130 .
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.4959 on 116 degrees of freedom
Multiple R-squared: 0.02597, Adjusted R-squared: 0.01757
F-statistic: 3.092 on 1 and 116 DF, p-value: 0.0813
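Two quick checks tie this output back to the earlier sections. In a regression with a single predictor, R-squared equals the squared correlation between X and Y, and the fitted equation can be evaluated for any number of applications. The sketch below uses only reported figures; predict_bowl() is a hypothetical helper and 10000 is an arbitrary illustrative input.

```r
r  <- 0.16113704  # cor(bowl, apps) from the correlation table
b0 <- 2.862e-01   # reported intercept
b1 <- 1.624e-05   # reported slope on applications

r_squared <- r^2  # compare with the reported Multiple R-squared

# Hypothetical helper: fitted value from the reported regression equation
predict_bowl <- function(apps) b0 + b1 * apps

print(round(r_squared, 5))
print(predict_bowl(10000))  # fitted value at 10000 applications
```

The slope's p-value (0.0813) is above 0.05, so at the 5% level the relationship between applications and bowl games is not statistically significant, which fits the very small R-squared.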
