
NATIONAL INSTITUTE OF FASHION TECHNOLOGY

DATA ANALYTICS AND R

ASSIGNMENT II

SUBMITTED TO:
MR. SUBHANKAR MISHRA
ASST. PROF
NIFT BHUBANESWAR

SUBMITTED BY:
VAISISTHA BAL
BFT/17/470
DEPT. OF FASHION TECHNOLOGY


ABOUT THE DATASET


This project is based on the Boston housing price dataset. The data to be analyzed were
collected by Harrison and Rubinfeld in 1978.
VARIABLES OF THE DATA SET

• CRIM: per capita crime rate by town


• ZN: proportion of residential land zoned for lots over 25,000 sq.ft.
• INDUS: proportion of non-retail business acres per town
• CHAS: Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
• NOX: nitric oxides concentration (parts per 10 million)
• RM: average number of rooms per dwelling
• AGE: proportion of owner-occupied units built prior to 1940
• DIS: weighted distances to five Boston employment centres
• RAD: index of accessibility to radial highways
• TAX: full-value property-tax rate per $10,000
• PTRATIO: pupil-teacher ratio by town
• B: 1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town
• LSTAT: % lower status of the population
• MEDV : Median value of owner-occupied homes in $1000's
STEP 1: STATISTICAL CALCULATIONS
DATASET AND PACKAGES WE NEED
library(tidyverse)
library(data.table)
housing <- fread(file = "housing.csv")
h <- housing$MEDV[1:100] #selecting the first 100 values of column 14 (MEDV) to work on
h
• To calculate Mean:
mean(h, trim = 1/10)

Result: 21.76625
Interpretation: The trim argument removes the smallest 10% and the largest 10% of the
observations before the mean is computed, which makes the result less sensitive to outliers.
The trimmed mean for the column "MEDV" over the selected range "h" is 21.76625, the average
of the median home values.
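A small made-up example shows what trimming does (the vector x below is hypothetical):

```r
# With trim = 1/10 and 10 values, one value is dropped from each end
# before averaging.
x <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 100)  # 100 is an outlier
mean(x)                # ordinary mean, pulled up by the outlier: 14.5
mean(x, trim = 1/10)   # drops 1 and 100, then averages 2..9: 5.5
```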

• To calculate standard deviation:


sd(h)
Result: 5.948352
Interpretation: The standard deviation measures the typical amount by which the data
points deviate from the mean of the values.

• To sort:


sort(h)
Result: [1] 12.7 13.1 13.2 13.5 13.6 13.9 14.4 14.5 14.5 14.8 15.0 15.2 15.6 16.0
[15] 16.5 16.6 16.6 17.4 17.5 18.2 18.2 18.4 18.7 18.9 18.9 18.9 18.9 19.3
[29] 19.4 19.4 19.6 19.6 19.7 19.9 20.0 20.0 20.0 20.2 20.3 20.4 20.5 20.6
[43] 20.8 20.9 21.0 21.0 21.2 21.2 21.4 21.4 21.6 21.7 21.7 22.0 22.0 22.2
[57] 22.2 22.5 22.6 22.8 22.9 22.9 22.9 23.1 23.3 23.4 23.4 23.5 23.6 23.9
[71] 23.9 24.1 24.2 24.7 24.7 24.7 24.8 25.0 25.0 25.0 25.3 26.6 26.6 27.1
[85] 27.5 28.0 28.4 28.7 28.7 30.8 31.6 33.0 33.2 33.4 34.7 34.9 35.4 36.2
[99] 38.7 43.8
Interpretation: sorts the values in the selected range in ascending order.
• To calculate the interquartile range:
IQR(h)
Result: 5.8
Interpretation: IQR() calculates the interquartile range of the data stored in h. An IQR of
5.8 means that, after discarding the lowest 25% and the highest 25% of the sorted values,
the middle 50% of the data spans 5.8 units. It is a robust measure of how spread out the
middle of the data is, and it suggests where a new data point is likely to fall.
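The same quantity can be computed by hand from the quartiles; a toy example on the integers 1 to 9:

```r
x <- 1:9
quantile(x, 0.25)   # first quartile: 3
quantile(x, 0.75)   # third quartile: 7
IQR(x)              # 7 - 3 = 4
```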

• To display a stem-and-leaf plot:
stem(h)


Result:
1 | 3334444
1 | 55555667777888899999999
2 | 0000000000111111111122222223333333333444444
2 | 5555555577788899
3 | 12333
3 | 55569
4 | 4
Interpretation: A stem-and-leaf plot is a visualization technique used to understand the
distribution of the data.

• To calculate quantile:
quantile(h)

Result: 0% 25% 50% 75% 100%
12.7 18.9 21.5 24.7 43.8


Interpretation: quantile() reports the spread of the data from lowest to highest. 0% is
the minimum, 12.7, and 100% is the maximum, 43.8. 50% is the median, 21.5. 25% is the
first quartile (the median of the lower half), 18.9, and 75% is the third quartile (the
median of the upper half), 24.7.
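A toy example on the integers 1 to 9 makes the five cut points easy to verify:

```r
x <- 1:9
quantile(x)        # 0% 25% 50% 75% 100%  ->  1 3 5 7 9
quantile(x, 0.5)   # the 50% quantile equals the median
```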

• To calculate median:
median(h)
Result: 21.5
Interpretation: Middle value of the data range is 21.5.

• To calculate MAD:
mad(h)
Result: 4.07715
Interpretation: mad() computes the median absolute deviation: the median of the absolute
differences between each value and the median of the data, scaled by a constant (1.4826
by default). It is a robust measure of spread.
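The scaling is easy to verify on a toy vector:

```r
x <- c(1, 2, 3, 4, 5)
abs(x - median(x))            # deviations from the median: 2 1 0 1 2
median(abs(x - median(x)))    # raw MAD = 1
mad(x, constant = 1)          # unscaled MAD, also 1
mad(x)                        # default scales by 1.4826 -> 1.4826
```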
• To find out the variance
var(h)
Result: 35.38289
Interpretation: Variance measures the average squared deviation from the mean; it equals
the square of the standard deviation (5.948352² ≈ 35.38).
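The relationship with the standard deviation can be checked directly (the five values below are sample MEDV values taken from the quantile output above):

```r
x <- c(12.7, 18.9, 21.5, 24.7, 43.8)
var(x)    # the variance
sd(x)^2   # identical: variance is the squared standard deviation
```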
• To find out the maximum
max(h)
Result: 43.8
Interpretation: In the specified data range, max shows the highest value, i.e.
43.8.
• To find out the minimum
min(h)
Result: 12.7
Interpretation: In the specified range, min function shows the lowest value, i.e. 12.7

BASIC GRAPHS
• barplot(table(h))

Interpretation: The highest counts of median home values lie in the range 17.5 to 19.7


• hist(h)

Interpretation: Shows the distribution of median home values; for the specified data range,
the tallest bars fall between 20 and 25

• rug(h)

Interpretation: rug() adds the individual data points as tick marks along the axis of an
existing plot, showing where values are concentrated. With this extra detail we can see
that most values fall between 20 and 25


• Rainbow chart
v <- 200
pie(rep(1, v), labels = "", col = rainbow(v), border = NA,
    main = "pie(*, labels=\"\", col=rainbow(v), border=NA, ..)")

Interpretation: Draws a pie chart of v = 200 equal slices coloured with the rainbow()
palette; this demonstrates pie() itself rather than the housing data.

• pie(table(h))


STEP 2: GG PLOT GRAPHS


PACKAGES WE MAY NEED
- library(readr)
- library(ggplot2)
- library(corrplot)
- library(mlbench)
- library(Amelia)
- library(plotly)
- library(reshape2)
- library(caret)
- library(caTools)
- library(dplyr)
- library(cowplot)
housing <- read.csv("housing.csv")
housing <- housing[complete.cases(housing),]

str(housing) #compact version of housing dataset

head(housing) #top 6 samples or first n rows

summary(housing) # data summary


• CORRELATION PLOT
corrplot(cor(select(housing, -CHAS)), method = 'number')

NOTE: In statistics, dependence or association is any statistical relationship, whether
causal or not, between two random variables or two sets of data.
1. `MEDV` decreases while `RAD` and `TAX` increase with an increase in `CRIM`
2. `TAX` increases almost perfectly linearly with `RAD`
3. `INDUS` and `NOX` are both inversely proportional to `DIS`

4. The percentage of lower-status population (`LSTAT`) in the neighborhood affects house
prices: as the share of low-income households in an area increases, the median price of
the houses decreases exponentially.


• DENSITY PLOT OF MEDV


ggplot(housing, aes(MEDV)) +
stat_density()

NOTE: The plot reveals that the peak density of MEDV lies between 15 and 30.
• SCATTER PLOTS
Scatter plots of different variables w.r.t. `MEDV` with geom smoothing, emphasizing
correlation patterns
melt(select(housing,
c(CRIM, RM, AGE, RAD,
TAX, LSTAT, MEDV, INDUS,
NOX, PTRATIO, ZN)),
id.vars = "MEDV") %>%
ggplot(aes(x=value, y=MEDV, color=variable)) +
geom_point(alpha=0.7) + stat_smooth(aes(color='black')) +
facet_wrap(~variable, scales='free', ncol = 2) +
labs(x="Variable Value", y="Median House Price($1000)") + theme_minimal()


NOTE:

1. With a lower crime rate (CRIM), the `MEDV` value is high

2. There is a slight decrease in the median price (MEDV) of the houses with increasing
`AGE` of the house

3. With an increase in the number of rooms (RM), the median price (MEDV) also increases
almost linearly; for 4-6 rooms housing prices rise slowly, whereas for 6-9 rooms they
rise at a higher rate

• BOXPLOT
par(mfrow=c(1, 4))
boxplot(housing$CRIM, main='CRIM', col='aquamarine')
boxplot(housing$ZN, main='ZN', col='aquamarine')
boxplot(housing$RM, main='RM', col='aquamarine')
boxplot(housing$B, main='B', col='aquamarine')

NOTE: It can be observed that the variables 'CRIM' and 'B' take a wide range of values.
Variables 'CRIM', 'ZN', 'RM' and 'B' have a large difference between their median and mean,
which indicates many outliers in those variables.
CRIM: Most houses in Boston regions have low crime index rates
• HISTOGRAM PLOT
ggh_nox = ggplot(housing, aes(x=NOX)) +
geom_histogram(aes(y=..density..), bins=50,
color='darkblue',fill="lightblue", position="dodge") +
geom_density(alpha=.2, fill='antiquewhite3') + theme_minimal()

ggh_lstat = ggplot(housing, aes(x=LSTAT)) +
geom_histogram(aes(y=..density..), bins=150, color='orange',
fill="yellow", position="dodge") +
geom_density(alpha=.2, fill='antiquewhite3') + theme_minimal()

ggh_zn = ggplot(housing, aes(x=ZN)) +
geom_histogram(aes(y=..density..), bins=50, color='darkred',
fill="red", position="dodge") +
geom_density(alpha=.2, fill='antiquewhite3') + theme_minimal()

ggh_tax = ggplot(housing, aes(x=TAX)) +
geom_histogram(aes(y=..density..), bins=90, color='darkgreen',
fill="green", position="dodge") +
geom_density(alpha=.2, fill='antiquewhite3') + theme_minimal()

plot_grid(ggh_tax, ggh_zn, ggh_lstat, ggh_nox,
          ncol = 2, nrow = 2)

NOTE: TAX: Tax rates for different houses are NOT UNIFORM, owing to differing socio-economic
attributes.
LSTAT: The percentage of lower-status communities is fairly uniform in the 3-19 percent
range, from observation.
ZN: A high percentage of houses in Boston have no residential land zoned in lots over
25,000 sq. ft. We can conclude that there is little scope for new residential development
around the existing houses.
NOX: The concentration of nitrogen oxide levels varies NON-UNIFORMLY across the Boston
area. We can also observe that all the houses lie under a NOX level of 1.0 pptm (parts per
ten million), which is recommended as a healthy NO2 level by the WHO.

STEP 3: TIDY THE DATA SET


boston <- fread(file = "housing.csv")
boston <- boston %>% drop_na() #tidy


The drop_na() function removes all rows with missing values from the dataset. This helps
with the transformation and linear modelling that follow, since we don't want missing
values to affect our modelling.
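A minimal base-R sketch of what drop_na() does (the three-row frame below is made up): complete.cases() flags rows with no missing values, which is the same filtering drop_na() performs.

```r
# Hypothetical 3-row frame; rows 2 and 3 each contain an NA
df <- data.frame(CRIM = c(0.02, NA, 0.27), MEDV = c(24.0, 21.6, NA))
clean <- df[complete.cases(df), ]   # base-R equivalent of drop_na()
nrow(clean)                          # 1: only the fully observed row survives
```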
STEP 4: TRANSFORMATION
PACKAGES WE MAY NEED:
- library(tidyverse)
- library(dplyr)
BASIC FUNCTIONS OF DPLYR:

• filter(): pick observations by their values
• arrange(): reorder the rows
• select(): pick variables by their names
• mutate(): create new variables as functions of existing variables
• summarise(): collapse many values down to a single summary
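The five verbs can be sketched on a small made-up frame (the frame toy and its column values below are hypothetical, standing in for the boston dataset):

```r
library(dplyr)

toy <- data.frame(town = c("A", "B", "C", "D"),
                  CRIM = c(0.5, 2.1, 8.9, 1.3),
                  MEDV = c(24.0, 21.6, 13.8, 19.9))

out <- toy %>%
  filter(CRIM >= 1) %>%               # pick observations by value
  select(town, CRIM, MEDV) %>%        # pick variables by name
  mutate(high_crime = CRIM > 5) %>%   # create a derived variable
  arrange(desc(CRIM))                 # reorder rows, highest CRIM first

summarise(toy, avg_medv = mean(MEDV)) # collapse to one summary row
```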
USE OF TRANSFORMATION FUNCTIONS ON THE DATASET
Note: The original dataset "housing.csv" is stored in "boston", so as not to overwrite the
original values.
library(tidyverse)
library(data.table)
boston <- fread(file = "housing.csv") #mentioned above

• Find out crime rate greater than 1


cm <- boston %>%
select(CRIM) %>%
filter(CRIM>=1)
view(cm)

NOTE: boston is the dataset we are working on. Since we want information related to the
crime rate, it is easier to work on that specific column, so we use select() to pick CRIM,
the crime rate column, out of the dataset. Then we use filter() to choose all observations
satisfying the condition CRIM >= 1. The result is stored in "cm".


RESULT: Type "cm" to view the result in the R console. It is not feasible to show the full
list of results, so it is more convenient to attach the screenshot.

• Sort MEDV in descending order


desc <- boston %>%
select(MEDV) %>%
arrange(desc(MEDV))

view(desc)
NOTE: We're working on the "boston" dataset and store the result of the transformation in
"desc". To work on MEDV, we choose that specific column using select(). To sort in
descending order we use arrange(); however, by default arrange() sorts in ascending order,
so we specify arrange(desc(MEDV))
RESULT: To view the full result in a new file, type “view(desc)”.

Screenshot of the result


• Find out highest crime rates in the town in descending order, along with their
respective median values
m <- mean(boston$CRIM)
hcrim <- boston %>%
select(CRIM, MEDV) %>%
filter(CRIM >= m) %>%
arrange(desc(CRIM))
view(hcrim)

NOTE: To find the highest crime rates, we first compute the mean crime rate: rates above
the mean form the high range and rates below it the low range. The mean crime rate is
stored in "m". Since we have to display two variables, the highest crime rates along with
their corresponding median values, we choose CRIM and MEDV using select(). To pick crime
rates greater than the mean, we use filter() with the condition CRIM >= m. Finally, we sort
the crime rates in descending order using arrange(desc(CRIM))

RESULT: To view the result in a separate file, type "view(hcrim)". It isn't feasible to
show the full result, so it is convenient to attach the screenshot

Screenshot of the result


• Find the range between minimum and average age


a <- min(boston$AGE)
a #minimum age
ma<- mean(boston$AGE)
ma #mean age
rage <- boston %>%
select(AGE) %>%
filter(AGE >= a & AGE <= ma)
view(rage) #range of age

NOTE: We're working on the boston dataset, and we have to perform some statistical
calculations before proceeding with the transformation. To find the minimum age, use min()
and store the result in "a". For the average age, use mean(), stored in "ma". Now, to find
the range of ages, we create "rage": choose the AGE column using select(), then filter the
required observations with filter() and the condition AGE >= a & AGE <= ma.
RESULT: Now, you can view the full result in a new file using view(rage). It isn't feasible
to show the full result, so the screenshot is attached.

Screenshot of the result

P.S.: I tried to sort the list using arrange(); however, there seemed to be an issue with
the library packages and the cloud server, which is why it failed to run.


• Find out the proportion of African Americans in the town


According to the information in the dataset, we have B, but we don't know the proportion
value.
B = 1000(Bk - 0.63)^2, where Bk is the proportion of blacks by town
MATHEMATICAL STEPS TO RECOVER BK:
black = boston$B/1000
sq = sqrt(black)
BK = sq + 0.63
CODE:
#step 1
People <- boston %>%
select(B)
# step 2
People %>%
mutate(AA = sqrt(B/1000) + 0.63)
NOTE: Please note that Step 1 chooses the required variable from the boston dataset, using
select() on "B". Step 2 performs the mathematical calculation above in a single formula and
stores it in a new column "AA", i.e. African American: sqrt() takes the square root of
B/1000, and 0.63 is added back to recover Bk. Since we want a new column, we use mutate()
RESULT: We can view the full result in the R console by running the formula. It isn’t
feasible to show the full result, so, it is appropriate to attach the screenshot.
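The inversion can be checked on a made-up proportion. Note that squaring discards the sign, so Bk is only recoverable this way when Bk >= 0.63:

```r
Bk <- 0.9                          # hypothetical proportion (>= 0.63)
B  <- 1000 * (Bk - 0.63)^2         # forward transform used in the dataset
recovered <- sqrt(B / 1000) + 0.63 # undo the transform
recovered                          # 0.9
```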

Screenshot of the result


HYPOTHESIS:
H0: An increase in the crime rate leads to a decrease in the median value of house prices.
Ha: An increase in the crime rate does not lead to a decrease in the median value of house prices.

STEP 5: LINEAR MODELLING


PACKAGES WE MAY NEED
library(tidyverse)
library(data.table)
library(tidyr)

library(dplyr)
CODE: The step-by-step process, with interpretations and notes on each step, is given
below.
boston <- fread(file = "housing.csv") #mentioned above
FIND CORRELATION
cor(boston$CRIM, boston$MEDV)
RESULT: -0.2863118
NOTE: There is a negative correlation between the two variables CRIM and MEDV. We will be
working on these two variables, considering the hypothesis we want to test.
Rules of correlation: if cor > 0, the variables are positively related and directly proportional;
if cor < 0, negatively related and inversely proportional;
if cor = 0, the variables are uncorrelated (no linear relationship).
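These rules can be checked on toy vectors:

```r
x <- 1:10
cor(x, 2 * x)    # +1: perfectly positively related
cor(x, -3 * x)   # -1: perfectly negatively related
```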

PLOT THE GRAPH


Now, to visualize the relation between the two variables, it is better to plot a graph. We
take the house prices to depend on the crime rates, and not vice versa; so the crime rate
is the independent variable and the house price the dependent variable. Hence we plot CRIM
on the X-axis and MEDV on the Y-axis
ggplot(boston) +
geom_point(mapping = aes(CRIM,MEDV))


NOTE: As we can see, there seems to be a linear relationship between CRIM and MEDV;
however, the data points are concentrated on one side.

FIND THE SLOPE AND THE INTERCEPT


Linear modelling helps us predict the value of Y from X. For that, we must know the slope
and the intercept. We can use lm() to find the slope and intercept with MEDV as the
response.
linearMod <- lm(data = boston, MEDV ~ CRIM)
print(linearMod)

RESULT:
Call:
lm(formula = MEDV ~ CRIM, data = boston)

Coefficients:
(Intercept) CRIM
25.189 -1.011
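As a quick sanity check, the fitted line can be evaluated by hand with the coefficients reported above (the crime rate of 2 is an assumed example):

```r
intercept <- 25.189   # from the lm() output above
slope     <- -1.011
crim      <- 2        # an assumed crime rate, for illustration
medv_hat  <- intercept + slope * crim
medv_hat              # 23.167: predicted median value in $1000's
```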

PLOT THE GRAPH


Now, to check whether the slope and the intercept capture the linear relationship with
MEDV, we can plot a graph using ggplot(), geom_point() and geom_abline(), putting the
values of the slope and intercept from the previous step into the code below.
ggplot(boston, mapping = aes(CRIM,MEDV)) +
geom_point() +
geom_abline(slope = -1.011, intercept = 25.189)


RESULT: We can clearly see a linear graph here. However, it is an inverse linear relationship,
with a few outliers.
FIND OUT THE MODEL SUMMARY
Use summary() to calculate the summary of the linearMod model. It is stored in
"modelSummary" and displayed with print()

modelSummary <- summary(linearMod)
print(modelSummary)
RESULT
Call:
lm(formula = CRIM ~ MEDV, data = boston)

Residuals:
Min 1Q Median 3Q Max
-2.5990 -1.4706 -1.0395 0.3764 9.9371

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 3.35000 0.32443 10.326 < 2e-16 ***
MEDV -0.08110 0.01281 -6.332 5.88e-10 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 2.396 on 449 degrees of freedom


Multiple R-squared: 0.08197, Adjusted R-squared: 0.07993
F-statistic: 40.09 on 1 and 449 DF, p-value: 5.881e-10


PERFORM TRAINING AND TESTING


boston <- select(boston, CRIM, MEDV) # keep only the two variables we model
set.seed(220)

sample <- sample(c(TRUE, FALSE), nrow(boston), replace = TRUE, prob = c(0.7, 0.3))
sample
train <- boston[sample, ]
test <- boston[!sample, ]
NOTE: set.seed() fixes the random number generator so that the split is reproducible, and
sample() assigns each row to training or testing. The probabilities 0.7 and 0.3 determine
the expected sizes of the training and testing sets respectively.
RESULT: About 70% of the data goes into training and 30% into testing. The code displays
the sample assignment; it isn't feasible to display the full result, so the screenshot is
attached below.
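A minimal sketch of the same split on a made-up row count of 100:

```r
set.seed(220)   # fix the random generator so the split is reproducible
n <- 100        # a made-up number of rows
keep <- sample(c(TRUE, FALSE), n, replace = TRUE, prob = c(0.7, 0.3))
sum(keep)       # roughly 70 rows marked for training
sum(!keep)      # the remainder marked for testing
```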

Sample for Training and Testing

TRAINING DATA MODEL AND SUMMARY


bmodel <- lm(data = train, MEDV ~ CRIM)
print(bmodel)


NOTE: After the data has been split, we can use lm() to find the intercept and slope for
the training data.
RESULT:
Call:
lm(formula = MEDV ~ CRIM, data = train)

Coefficients:
(Intercept) CRIM
25.320 -0.876
This result is important when we need to plot the graph to visualize the difference between
actual data and training data.

summary(bmodel)
RESULT:
Call:
lm(formula = MEDV ~ CRIM, data = train)

Residuals:
Min 1Q Median 3Q Max
-17.038 -5.479 -2.389 3.038 32.768

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 25.3197 0.5505 45.994 < 2e-16 ***
CRIM -0.8760 0.1847 -4.742 3.16e-06 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 8.725 on 329 degrees of freedom


Multiple R-squared: 0.06397, Adjusted R-squared: 0.06112
F-statistic: 22.48 on 1 and 329 DF, p-value: 3.162e-06
PLOT THE GRAPH BETWEEN ACTUAL AND TRAINING DATA
ggplot(train, mapping = aes(CRIM, MEDV)) +
geom_point() +
geom_abline(slope = -0.876, intercept = 25.320, color = "blue") +
geom_abline(slope = -1.011, intercept = 25.189, color = "red")
NOTE: Blue represents the training data and red represents the original data


RESULT

TESTING AND PREDICTION

prediction <- predict(bmodel, test)
avsp <- data.frame(cbind(actuals = test$MEDV, predicted = prediction))
NOTE: The actual values and the predicted values are stored side by side in avsp

CREATE NEW COLUMN- PREDERROR

NOTE: Now, to know the difference between the actual and predicted values, it is better to
create a new column, predError, using mutate()

avsp <- avsp %>%
mutate(predError = actuals - predicted)
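The same calculation on made-up actual and predicted values:

```r
# Hypothetical actual vs. predicted values, for illustration
actuals   <- c(24.0, 21.6, 13.8)
predicted <- c(23.1, 22.0, 15.2)
predError <- actuals - predicted
predError   # 0.9 -0.4 -1.4
```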
CREATE A NEW TIBBLE
NOTE: We can now store the variables actuals, predicted and predError in a new tibble
avspT <- as_tibble(avsp)

view(avspT)
RESULT: The full result can be seen using the view(). It isn’t feasible to display the full
result, so it is appropriate to attach the screenshot


GRAPH OF ACTUAL VS PREDERROR


ggplot(data = avspT) +
geom_point(mapping = aes(x = actuals, y = predError))
RESULT:


FINAL INFERENCE
Since the p-value from the lm() summary is far below the conventional 0.05 significance
level and the estimated CRIM coefficient is negative, the data support the hypothesis H0 as
stated above. We can establish that there is an inversely proportional linear relationship
between CRIM and MEDV.
