
BUSINESS ANALYTICS

Assignment

E. Raviteja
Roll no: 10
Day 1_2 Assignment:
Q1) Identify the Data type for the Following:

Activity Data Type


Number of beatings from Wife Discrete
Results of rolling a dice Discrete
Weight of a person Ratio
Weight of Gold Ratio
Distance between two places Ratio
Length of a leaf Continuous
Dog's weight Ratio
Blue Color Nominal
Number of kids Discrete
Number of tickets in Indian railways Discrete
Number of times married Discrete
Gender (Male or Female) Nominal data type

Q2) Identify the Data types, which were among the following
Nominal, Ordinal, Interval, Ratio.
Data Data Type
Gender Nominal data type
High School Class Ranking Ordinal
Celsius Temperature Interval
Weight Ratio
Hair Color Nominal
Socioeconomic Status Ordinal
Fahrenheit Temperature Interval
Height Ratio
Type of living accommodation Nominal
Level of Agreement Ordinal
IQ(Intelligence Scale) Interval
Sales Figures Ratio
Blood Group Nominal
Time Of Day Interval
Time on a Clock with Hands Interval
Number of Children Ratio
Religious Preference Nominal
Barometer Pressure Interval
SAT Scores Interval
Years of Education Ratio
Q3) Three Coins are tossed, find the probability that two heads and one tail are obtained?
Ans) The sample space of three coin tosses has 2^3 = 8 equally likely outcomes, of which {HHT, HTH, THH} give two heads and one tail, so the probability of getting 2 heads and 1 tail is 3/8.
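This can be verified in R by enumerating the sample space (a quick sketch; the H/T labels are arbitrary):

```r
# All 2^3 = 8 equally likely outcomes of tossing three coins
coins <- expand.grid(c("H", "T"), c("H", "T"), c("H", "T"))
# Count outcomes with exactly two heads (and therefore one tail)
heads <- rowSums(coins == "H")
sum(heads == 2) / nrow(coins)  # 3/8 = 0.375
```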
Q4) Two Dice are rolled, find the probability that sum is
a) Equal to 1 = 0/36 = 0
b) Less than or equal to 4 = 6/36 = 1/6
c) Sum is divisible by 2 and 3 = 6/36 = 1/6
Answer:
The sample space S of two dice is shown below.
S = { (1,1),(1,2),(1,3),(1,4),(1,5),(1,6)
(2,1),(2,2),(2,3),(2,4),(2,5),(2,6)
(3,1),(3,2),(3,3),(3,4),(3,5),(3,6)
(4,1),(4,2),(4,3),(4,4),(4,5),(4,6)
(5,1),(5,2),(5,3),(5,4),(5,5),(5,6)
(6,1),(6,2),(6,3),(6,4),(6,5),(6,6) }
a) Let E be the event "sum equal to 1". There are no outcomes which correspond to a
sum equal to 1, hence
P (E) = n (E) / n(S) = 0 / 36 = 0
b) Sum less than or equal to 4
Six outcomes give a sum less than or equal to 4: E = {(1,1), (1,2), (2,1), (1,3), (2,2), (3,1)}, hence
P(E) = n(E) / n(S) = 6 / 36 = 1 / 6
c) Sum divisible by both 2 and 3
A sum divisible by both 2 and 3 must be divisible by 6, i.e. equal to 6 or 12:
E = {(1,5), (2,4), (3,3), (4,2), (5,1), (6,6)}, so the number of possible outcomes = 6
Probability = 6/36 = 1/6
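All three dice probabilities can be checked in R by enumerating the 36-outcome sample space (a sketch):

```r
# All 36 equally likely outcomes of rolling two dice
dice <- expand.grid(d1 = 1:6, d2 = 1:6)
s <- dice$d1 + dice$d2
mean(s == 1)                      # a) 0
mean(s <= 4)                      # b) 6/36 = 0.1667
mean(s %% 2 == 0 & s %% 3 == 0)   # c) divisible by both 2 and 3, i.e. by 6: 6/36 = 0.1667
```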
Q5) A bag contains 2 red, 3 green and 2 blue balls. Two balls are drawn at random. What is the
probability that none of the balls drawn is blue?
Ans) number of ways of drawing 2 balls out of 7 is 21
Number of ways of drawing 2 balls out of 2red and 3green balls is 10
The probability that none of the balls drawn is blue is 10/21
Answer:
Total number of ways to draw 2 balls at random from 7 coloured balls:
n(S) = 7C2 = 21
Let E be the event of drawing 2 balls, neither of which is blue.
Number of ways to draw two non-blue balls (from the 2 red and 3 green):
n(E) = 2C2 + 3C2 + 2C1·3C1 = 1 + 3 + 6 = 10
The probability that none of the balls drawn is blue is
P(E) = n(E)/n(S) = 10/21
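The same counting can be done directly with choose() in R:

```r
# 2 red, 3 green, 2 blue: draw 2 of 7 balls
n_total   <- choose(7, 2)   # 21 ways to draw any 2 balls
n_nonblue <- choose(5, 2)   # 10 ways to draw 2 balls from the 5 non-blue balls
n_nonblue / n_total         # 10/21 = 0.4762
```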

Q6) Calculate the Expected number of candies for a randomly selected child
Below are the probabilities of count of candies for children (ignoring the nature of the child-
Generalized view)
CHILD Candies count Probability
A 1 0.015
B 4 0.20
C 3 0.65
D 5 0.005
E 6 0.01
F 2 0.120
Child A – probability of having 1 candy = 0.015.
Child B – probability of having 4 candies = 0.20
Answer:
E(X) = Σ x·P(x) = 1(0.015) + 4(0.20) + 3(0.65) + 5(0.005) + 6(0.01) + 2(0.120)
= 0.015 + 0.80 + 1.95 + 0.025 + 0.06 + 0.24 = 3.09
Expected number of candies for a randomly selected child = 3.09
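In R the expected value is the probability-weighted sum:

```r
candies <- c(1, 4, 3, 5, 6, 2)
prob    <- c(0.015, 0.20, 0.65, 0.005, 0.01, 0.120)
sum(prob)            # probabilities sum to 1
sum(candies * prob)  # expected number of candies = 3.09
```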
Q7) Calculate Mean, Median, Mode, Variance, Standard Deviation, Range & comment about
the values / draw inferences, for the given dataset
- For the columns Points, Score, Weigh:
Find Mean, Median, Mode, Variance, Standard Deviation, and Range and also
Comment about the values/ Draw some inferences.
Answer: Mean of Points: 3.5965
Median of the points: 3.695
Mode of the points: 3.92
Standard Deviation of the points: 0.3352
Range: Max – min = 2
Mean for the scores: 3.216
Mode for the scores: 3.44
Median: 3.325
Standard deviation: 0.317
Range: Max – Min =3.911
Q8) Calculate Expected Value for the problem below
a) The weights (X) of patients at a clinic (in pounds), are
108, 110, 123, 134, 135, 145, 167, 187, 199
Assume one of the patients is chosen at random. What is the Expected Value of the
Weight of that patient?
Since each of the nine patients is equally likely, the expected value is the mean weight:
E(X) = (108 + 110 + 123 + 134 + 135 + 145 + 167 + 187 + 199) / 9 = 1308 / 9 = 145.33 pounds
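A quick check in R:

```r
weights <- c(108, 110, 123, 134, 135, 145, 167, 187, 199)
mean(weights)  # 145.3333: each patient is equally likely, so E(X) is the sample mean
```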
Q9) Calculate Skewness, Kurtosis & draw inferences on the following data
Cars speed and distance
Skewness: 2.05755
Kurtosis:
SP and Weight(WT)
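A sketch of how the skewness and kurtosis could be computed in R, assuming the moments package is installed (the built-in cars dataset has the speed and dist columns):

```r
library(moments)      # assumed installed; provides skewness() and kurtosis()
skewness(cars$speed)  # near 0: speeds are roughly symmetric
skewness(cars$dist)   # positive: distances are right-skewed
kurtosis(cars$dist)   # values above 3 indicate heavier tails than the normal
```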
Q10) Draw inferences about the following boxplot & histogram

Answer:
Histograms are the most common way to plot a vector of numeric data. To create a histogram
we use the hist() function. The main argument to hist() is x, a vector of numeric data. If
you want to specify how the histogram bins are created, you can use the breaks argument. To
change the color of the border or background of the bins, use col and border.
Let’s create a histogram of the weights in the ChickWeight dataset:
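For example (ChickWeight ships with base R):

```r
# Histogram of chick weights with custom breaks and colours
hist(ChickWeight$weight,
     breaks = 20,
     col = "lightblue",
     border = "white",
     main = "Chick Weights",
     xlab = "Weight (gm)")
```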
Box plot values:
Q11) Suppose we want to estimate the average weight of an adult male in Mexico. We draw a
random sample of 2,000 men from a population of 3,000,000 men and weigh them. We find that
the average person in our sample weighs 200 pounds, and the standard deviation of the sample is
30 pounds. Calculate 94%,98%,96% confidence interval?
With standard error = 30/√2000 ≈ 0.671 and sample mean 200:
94% confidence interval: z = 1.88, 200 ± 1.88 × 0.671 = (198.74, 201.26)
96% confidence interval: z = 2.05, 200 ± 2.05 × 0.671 = (198.62, 201.38)
98% confidence interval: z = 2.33, 200 ± 2.33 × 0.671 = (198.44, 201.56)
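The intervals can be computed in R with qnorm() (the sample is large, so the z-interval applies):

```r
xbar <- 200; s <- 30; n <- 2000
se <- s / sqrt(n)                  # standard error of the mean
for (conf in c(0.94, 0.96, 0.98)) {
  z <- qnorm(1 - (1 - conf) / 2)   # two-sided critical value
  cat(conf, ":", xbar - z * se, "to", xbar + z * se, "\n")
}
```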
Q12) Below are the scores obtained by a student in tests
34,36,36,38,38,39,39,40,40,41,41,41,41,42,42,45,49,56
1) Find mean, median, variance, standard deviation.
2) What can we say about the student marks?
Answer: 34,36,36,38,38,39,39,40,40,41,41,41,41,42,42,45,49,56
Total sum=738
Mean=738/18 =41
Median=40.5
Deviations from the mean: -7,-5,-5,-3,-3,-2,-2,-1,-1,0,0,0,0,1,1,4,8,15
Squared deviations: 49,25,25,9,9,4,4,1,1,0,0,0,0,1,1,16,64,225; sum = 434
Variance = 434/18 = 24.11
Standard deviation = √24.11 = 4.91
2) The student’s marks are fairly consistent around the average: the mean is 41, the median is 40.5, and the standard deviation is only 4.91, so most scores cluster in the high 30s and low 40s, with 56 as a high outlier.
Q13) What is the nature of skewness when mean, median of data are equal?
Answer: When the values of mean, median and mode are equal, the distribution is symmetric and there is no skewness.
Q14) What is the nature of skewness when mean > median ?
Answer: If mean > median, then the skewness is positive (the distribution is right-skewed).
Q15) What is the nature of skewness when median > mean?
Answer: If median > mean, then the skewness is negative (the distribution is left-skewed).
Q16) What does positive kurtosis value indicates for a data ?
Answer: A positive kurtosis value indicates that the distribution has heavier
tails and a sharper peak than the normal distribution. For example, data that follow a t
distribution have a positive kurtosis value.

Q17) What does negative kurtosis value indicates for a data?


Answer: A negative kurtosis value indicates that the distribution has lighter
tails and a flatter peak than the normal distribution. For example, data that follow a beta
distribution with both shape parameters equal to 2 have a negative kurtosis value.

Q18) Answer the below questions using the below boxplot visualization.

What can we say about the distribution of the data?


Answer: The box plot (a.k.a. box-and-whisker diagram) is a standardized way of displaying the
distribution of data based on the five-number summary: minimum, first quartile, median, third
quartile, and maximum.
Q19) Comment on the below Boxplot visualizations?

Draw an Inference from the distribution of data for Boxplot 1 with respect Boxplot 2.
Q 20) Calculate probability from the given dataset for the below cases
Data set: Cars.csv
Calculate the probability of MPG of Cars for the below cases.
MPG <- Cars$MPG
a. P(MPG>38)
b. P(MPG<40)
c. P (20<MPG<50)
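Assuming MPG is approximately normal, these probabilities can be sketched in R with pnorm() using the sample mean and SD (column name as given in Cars.csv):

```r
# Cars <- read.csv("Cars.csv")   # assumed to contain an MPG column
MPG <- Cars$MPG
m <- mean(MPG); s <- sd(MPG)
1 - pnorm(38, m, s)               # a) P(MPG > 38)
pnorm(40, m, s)                   # b) P(MPG < 40)
pnorm(50, m, s) - pnorm(20, m, s) # c) P(20 < MPG < 50)
```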
Q 21) Check whether the data follows normal distribution
a) Check whether the MPG of Cars follows Normal Distribution
Dataset: Cars.csv
b) Check Whether the Adipose Tissue (AT) and Waist Circumference(Waist) from wc-
at data set follows Normal Distribution
Dataset: wc-at.csv
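A sketch of the usual normality checks, a Q-Q plot plus the Shapiro-Wilk test (file and column names as given in the question):

```r
# Cars <- read.csv("Cars.csv"); wc_at <- read.csv("wc-at.csv")  # assumed paths
qqnorm(Cars$MPG); qqline(Cars$MPG)  # points close to the line suggest normality
shapiro.test(Cars$MPG)              # p > 0.05: no evidence against normality
qqnorm(wc_at$AT); qqline(wc_at$AT)
shapiro.test(wc_at$Waist)
```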
Q 22) Calculate the Z scores of 90% confidence interval,94% confidence interval, 60%
confidence interval.
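The z scores follow from qnorm() with the two-sided tail split:

```r
qnorm(1 - (1 - 0.90) / 2)  # 90% -> 1.645
qnorm(1 - (1 - 0.94) / 2)  # 94% -> 1.881
qnorm(1 - (1 - 0.60) / 2)  # 60% -> 0.842
```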
Q 23) Calculate the t scores of 95% confidence interval, 96% confidence interval, 99%
confidence interval for sample size of 25
Answer: 95% = 2.064, 96% = 2.172, 99% = 2.797
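These values come from qt() with n - 1 = 24 degrees of freedom:

```r
qt(1 - (1 - 0.95) / 2, df = 24)  # 2.064
qt(1 - (1 - 0.96) / 2, df = 24)  # 2.172
qt(1 - (1 - 0.99) / 2, df = 24)  # 2.797
```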
Q 24) A Government company claims that an average light bulb lasts 270 days. A researcher
randomly selects 18 bulbs for testing. The sampled bulbs last an average of 260 days, with a
standard deviation of 90 days. If the CEO's claim were true, what is the probability that 18
randomly selected bulbs would have an average life of no more than 260 days?
Hint:
R code: pt(tscore, DF)
DF = degrees of freedom
Answer: t = (260 - 270) / (90 / sqrt(18)) = -10 / 21.2132 = -0.4714
Degrees of freedom = 18 - 1 = 17
Probability = pt(-0.4714, 17) = 0.3218
Day 5 Assignment:
Multi Linear Regression

1. Prepare a prediction model for profit of 50_startups data. Do transformations for


getting better predictions of profit, and make a table containing the R^2 value for each
prepared model.
Answer:
## R&D Spend Administration Marketing Spend State
## Min. : 0 Min. : 51283 Min. : 0 Length: 50
## 1st Qu.: 39936 1st Qu.:103731 1st Qu.:129300 Class : character
## Median : 73051 Median :122700 Median :212716 Mode : character
## Mean : 73722 Mean :121345 Mean :211025
## 3rd Qu.:101603 3rd Qu.:144842 3rd Qu.:299469
## Max. :165349 Max. :182646 Max. :471784
## Profit
## Min. : 14681
## 1st Qu.: 90139
## Median :107978
## Mean :112013
## 3rd Qu.:139766
## Max. :192262
# Variance
var(startup_50$`R&D Spend`)
## 2107017150
var(startup_50$Administration)
## 784997271
var(startup_50$`Marketing Spend`)
## 14954920097
var(startup_50$Profit)
## 1624588173
# Standard Deviation
sd(startup_50$`R&D Spend`)
## 45902.26
sd(startup_50$Administration)
## 28017.8
sd(startup_50$`Marketing Spend`)
## 122290.3
sd(startup_50$Profit)
## 40306.18
# Checking which states appear in the State column
unique(startup_50$State)
## "New York" "California" "Florida"
# Creating 3 dummy variables for State
startup_50 <- cbind(startup_50, ifelse(startup_50$State=="New York",1,0),
ifelse(startup_50$State=="California",1,0), ifelse(startup_50$State=="Florida",1,0))
# Renaming the columns
setnames(startup_50, 'V2', 'New York')
setnames(startup_50, 'V3', 'California')
setnames(startup_50, 'V4', 'Florida')
# Plotting the data on a scatter plot
# plot(startup_50) # this line gives an error because State holds textual values
plot(startup_50[, -c('State')]) # the dummy variables show no clear relationship in this plot
plot(startup_50[, -c('State', 'New York', 'California', 'Florida')]) # after removing State and
the dummy columns

summary(Profit_Model)
## Call:
## lm(formula = Profit ~ `R&D Spend` + Administration + `Marketing Spend`,
## data = startup_50)
## Residuals:
## Min 1Q Median 3Q Max
## -33534 -4795 63 6606 17275
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.012e+04 6.572e+03 7.626 1.06e-09 ***
## `R&D Spend` 8.057e-01 4.515e-02 17.846 < 2e-16 ***
## Administration -2.682e-02 5.103e-02 -0.526 0.602
## `Marketing Spend` 2.723e-02 1.645e-02 1.655 0.105
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## Residual standard error: 9232 on 46 degrees of freedom
## Multiple R-squared: 0.9507, Adjusted R-squared: 0.9475
## F-statistic: 296 on 3 and 46 DF, p-value: < 2.2e-16
The p-values for Administration and Marketing Spend are greater than 0.05, so we now check for influential records:
summary(Profit_Model_Inf)
## Call:
## lm(formula = Profit ~ `R&D Spend` + Administration + `Marketing Spend`,
## data = startup_50[-c(50, 49), ])
## Residuals:
## Min 1Q Median 3Q Max
## -16252 -4983 -2042 6019 13631
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.910e+04 5.917e+03 9.988 6.92e-13 ***
## `R&D Spend` 7.895e-01 3.635e-02 21.718 < 2e-16 ***
## Administration -6.335e-02 4.392e-02 -1.442 0.156
## `Marketing Spend` 1.689e-02 1.353e-02 1.249 0.218
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## Residual standard error: 7349 on 44 degrees of freedom
## Multiple R-squared: 0.9627, Adjusted R-squared: 0.9601
## F-statistic: 378.3 on 3 and 44 DF, p-value: < 2.2e-16
# Variance Inflation Factor to check collinearity between variables
Profit_Model <- lm(Profit ~ `R&D Spend` + Administration + `Marketing Spend`, data =
startup_50)
class(startup_50$`Marketing Spend`)
## [1] "numeric"
vif(Profit_Model)
## `R&D Spend` Administration `Marketing Spend`
## 2.468903 1.175091 2.326773
summary(Profit_Model)
## Call:
## lm(formula = Profit ~ `R&D Spend` + Administration + `Marketing Spend`,
## data = startup_50)
## Residuals:
## Min 1Q Median 3Q Max
## -33534 -4795 63 6606 17275
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.012e+04 6.572e+03 7.626 1.06e-09 ***
## `R&D Spend` 8.057e-01 4.515e-02 17.846 < 2e-16 ***
## Administration -2.682e-02 5.103e-02 -0.526 0.602
## `Marketing Spend` 2.723e-02 1.645e-02 1.655 0.105
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## Residual standard error: 9232 on 46 degrees of freedom
## Multiple R-squared: 0.9507, Adjusted R-squared: 0.9475
## F-statistic: 296 on 3 and 46 DF, p-value: < 2.2e-16
## If VIF > 10 there exists collinearity among the variables; here all VIFs are well below 10
## Added-variable plots to check correlation between predictors and the output variable
avPlots(Profit_Model)

summary(Profit_Model_Final)
## Call:
## lm(formula = Profit ~ `R&D Spend` + `Marketing Spend`, data = startup_50)
## Residuals:
## Min 1Q Median 3Q Max
## -33645 -4632 -414 6484 17097
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.698e+04 2.690e+03 17.464 <2e-16 ***
## `R&D Spend` 7.966e-01 4.135e-02 19.266 <2e-16 ***
## `Marketing Spend` 2.991e-02 1.552e-02 1.927 0.060 .
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## Residual standard error: 9161 on 47 degrees of freedom
## Multiple R-squared: 0.9505, Adjusted R-squared: 0.9483
## F-statistic: 450.8 on 2 and 47 DF, p-value: < 2.2e-16
plot(Profit_Model_Final)
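The question also asks for a table of R^2 values per prepared model. A sketch of how that table could be assembled from the model objects fitted above (names as used in this section):

```r
# Collect the R-squared of each candidate model in one table
models <- list(
  all_vars            = Profit_Model,      # R&D + Administration + Marketing
  influential_removed = Profit_Model_Inf,  # same formula, rows 49-50 dropped
  final               = Profit_Model_Final # R&D + Marketing only
)
r2 <- sapply(models, function(m) summary(m)$r.squared)
data.frame(model = names(r2), R.squared = round(r2, 4))
```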

2. Predict sales of the computer


# Read data from file
library(data.table)
colnames(Computer_Data)
## "V1" "price" "speed" "hd" "ram" "screen" "cd"
## "multi" "premium" "ads" "trend"
str(Computer_Data)
## Classes 'data.table' and 'data.frame': 6259 obs. of 11 variables:
## $ V1 : int 1 2 3 4 5 6 7 8 9 10 ...
## $ price : int 1499 1795 1595 1849 3295 3695 1720 1995 2225 2575 ...
## $ speed : int 25 33 25 25 33 66 25 50 50 50 ...
## $ hd : int 80 85 170 170 340 340 170 85 210 210 ...
## $ ram : int 4 2 4 8 16 16 4 2 8 4 ...
## $ screen : int 14 14 15 14 14 14 14 14 14 15 ...
## $ cd : chr "no" "no" "no" "no" ...
## $ multi : chr "no" "no" "no" "no" ...
## $ premium: chr "yes" "yes" "yes" "no" ...
## $ ads : int 94 94 94 94 94 94 94 94 94 94 ...
## $ trend : int 1 1 1 1 1 1 1 1 1 1 ...
## - attr(*, ".internal.selfref") = <externalptr>
# Creating dummy variables
summary(comp_data)
## price speed hd ram
## Min. : 949 Min. : 25.00 Min. : 80.0 Min. : 2.000
## 1st Qu.:1794 1st Qu.: 33.00 1st Qu.: 214.0 1st Qu.: 4.000
## Median :2144 Median : 50.00 Median : 340.0 Median : 8.000
## Mean :2220 Mean : 52.01 Mean : 416.6 Mean : 8.287
## 3rd Qu.:2595 3rd Qu.: 66.00 3rd Qu.: 528.0 3rd Qu.: 8.000
## Max. :5399 Max. :100.00 Max. :2100.0 Max. :32.000
## screen ads trend cd_dummy1
## Min. : 14.00 Min. : 39.0 Min. : 1.00 Min. : 0.0000
## 1st Qu.:14.00 1st Qu.:162.5 1st Qu.:10.00 1st Qu.:0.0000
## Median: 14.00 Median: 246.0 Median: 16.00 Median: 0.0000
## Mean : 14.61 Mean : 221.3 Mean : 15.93 Mean : 0.4646
## 3rd Qu.:15.00 3rd Qu.:275.0 3rd Qu.:21.50 3rd Qu.:1.0000
## Max. : 17.00 Max. : 339.0 Max. : 35.00 Max. : 1.0000
## multi_dummy1 premium_dummy1 cd_dummy2 multi_dummy2
## Min. : 0.0000 Min. : 0.0000 Min. : 0.0000 Min. : 0.0000
## 1st Qu.:0.0000 1st Qu.:1.0000 1st Qu.:0.0000 1st Qu.:1.0000
## Median: 0.0000 Median: 1.0000 Median: 1.0000 Median: 1.0000
## Mean : 0.1395 Mean : 0.9022 Mean : 0.5354 Mean : 0.8605
## 3rd Qu.:0.0000 3rd Qu.:1.0000 3rd Qu.:1.0000 3rd Qu.:1.0000
## Max. : 1.0000 Max. : 1.0000 Max. : 1.0000 Max. : 1.0000
## premium_dummy2
## Min. : 0.00000
## 1st Qu.:0.00000
## Median: 0.00000
## Mean : 0.09778
## 3rd Qu.:0.00000
## Max. : 1.00000
Creating the first model
summary(comp_model)
## Call:
## lm(formula = price ~ speed + hd + ram + screen + ads + trend +
## cd_dummy1 + multi_dummy1 + premium_dummy1, data = comp_data)
## Residuals:
## Min 1Q Median 3Q Max
## -1093.77 -174.24 -11.49 146.49 2001.05
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 307.98798 60.35341 5.103 3.44e-07 ***
## speed 9.32028 0.18506 50.364 < 2e-16 ***
## hd 0.78178 0.02761 28.311 < 2e-16 ***
## ram 48.25596 1.06608 45.265 < 2e-16 ***
## screen 123.08904 3.99950 30.776 < 2e-16 ***
## ads 0.65729 0.05132 12.809 < 2e-16 ***
## trend -51.84958 0.62871 -82.470 < 2e-16 ***
## cd_dummy1 60.91671 9.51559 6.402 1.65e-10 ***
## multi_dummy1 104.32382 11.41268 9.141 < 2e-16 ***
## premium_dummy1 -509.22473 12.34225 -41.259 < 2e-16 ***
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## Residual standard error: 275.3 on 6249 degrees of freedom
## Multiple R-squared: 0.7756, Adjusted R-squared: 0.7752
## F-statistic: 2399 on 9 and 6249 DF, p-value: < 2.2e-16
summary(comp_model_final)
## Call:
## lm(formula = price ~ speed + hd + ram + screen + ads + trend +
## cd_dummy1 + multi_dummy1 + premium_dummy1, data = comp_data)
## Residuals:
## Min 1Q Median 3Q Max
## -1093.77 -174.24 -11.49 146.49 2001.05
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 307.98798 60.35341 5.103 3.44e-07 ***
## speed 9.32028 0.18506 50.364 < 2e-16 ***
## hd 0.78178 0.02761 28.311 < 2e-16 ***
## ram 48.25596 1.06608 45.265 < 2e-16 ***
## screen 123.08904 3.99950 30.776 < 2e-16 ***
## ads 0.65729 0.05132 12.809 < 2e-16 ***
## trend -51.84958 0.62871 -82.470 < 2e-16 ***
## cd_dummy1 60.91671 9.51559 6.402 1.65e-10 ***
## multi_dummy1 104.32382 11.41268 9.141 < 2e-16 ***
## premium_dummy1 -509.22473 12.34225 -41.259 < 2e-16 ***
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## Residual standard error: 275.3 on 6249 degrees of freedom
## Multiple R-squared: 0.7756, Adjusted R-squared: 0.7752
## F-statistic: 2399 on 9 and 6249 DF, p-value: < 2.2e-16
3. Prepare a prediction model for predicting Price.
summary(Corolla_Model)
## Call:
## lm(formula = Price ~ Age_08_04 + KM + HP + cc + Doors + Gears +
## Quarterly_Tax + Weight, data = Corolla)
## Residuals:
## Min 1Q Median 3Q Max
## -9366.4 -793.3 -21.3 799.7 6444.0
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -5.573e+03 1.411e+03 -3.949 8.24e-05 ***
## Age_08_04 -1.217e+02 2.616e+00 -46.512 < 2e-16 ***
## KM -2.082e-02 1.252e-03 -16.622 < 2e-16 ***
## HP 3.168e+01 2.818e+00 11.241 < 2e-16 ***
## cc -1.211e-01 9.009e-02 -1.344 0.17909
## Doors -1.617e+00 4.001e+01 -0.040 0.96777
## Gears 5.943e+02 1.971e+02 3.016 0.00261 **
## Quarterly_Tax 3.949e+00 1.310e+00 3.015 0.00262 **
## Weight 1.696e+01 1.068e+00 15.880 < 2e-16 ***
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## Residual standard error: 1342 on 1427 degrees of freedom
## Multiple R-squared: 0.8638, Adjusted R-squared: 0.863
## F-statistic: 1131 on 8 and 1427 DF, p-value: < 2.2e-16
vif(Corolla_Model)
## Age_08_04 KM HP cc Doors
## 1.884620 1.756905 1.419422 1.163894 1.156575
## Gears Quarterly_Tax Weight
## 1.098723 2.311431 2.516420
avPlots(Corolla_Model)

stepAIC(Corolla_Model)


## Start: AIC=20693.89
## Price ~ Age_08_04 + KM + HP + cc + Doors + Gears + Quarterly_Tax +
## Weight
## Df Sum of Sq RSS AIC
## - Doors 1 2943 2571786477 20692
## - cc 1 3256511 2575040045 20694
## <none> 2571783534 20694
## - Quarterly_Tax 1 16377633 2588161166 20701
## - Gears 1 16393629 2588177163 20701
## - HP 1 227730786 2799514319 20814
## - Weight 1 454465243 3026248777 20926
## - KM 1 497917334 3069700867 20946
## - Age_08_04 1 3898860600 6470644134 22017
## Step: AIC=20691.89
## Price ~ Age_08_04 + KM + HP + cc + Gears + Quarterly_Tax + Weight
## Df Sum of Sq RSS AIC
## - cc 1 3254209 2575040686 20692
## <none> 2571786477 20692
## - Quarterly_Tax 1 16503849 2588290326 20699
## - Gears 1 17093855 2588880332 20699
## - HP 1 228761929 2800548406 20812
## - Weight 1 484447009 3056233485 20938
## - KM 1 498427860 3070214337 20944
## - Age_08_04 1 3898877516 6470663993 22015
## Step: AIC=20691.7
## Price ~ Age_08_04 + KM + HP + Gears + Quarterly_Tax + Weight
## Df Sum of Sq RSS AIC
## <none> 2575040686 20692
## - Quarterly_Tax 1 14976762 2590017448 20698
## - Gears 1 17276597 2592317283 20699
## - HP 1 225684613 2800725299 20810
## - Weight 1 484245502 3059286188 20937
## - KM 1 506728527 3081769213 20948
## - Age_08_04 1 3902107988 6477148674 22014
## Call:
## lm(formula = Price ~ Age_08_04 + KM + HP + Gears + Quarterly_Tax +
## Weight, data = Corolla)
## Coefficients:
## (Intercept) Age_08_04 KM HP Gears
## -5.478e+03 -1.217e+02 -2.094e-02 3.133e+01 5.990e+02
## Quarterly_Tax Weight
## 3.737e+00 1.673e+01
summary(Corolla_Model_final)
## Call:
## lm(formula = Price ~ Age_08_04 + KM + HP + log(cc) + Gears +
## Quarterly_Tax + Weight, data = Corolla)
## Residuals:
## Min 1Q Median 3Q Max
## -10498.6 -763.2 -30.4 759.7 6611.2
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 8.288e+03 2.662e+03 3.114 0.00188 **
## Age_08_04 -1.211e+02 2.585e+00 -46.868 < 2e-16 ***
## KM -1.928e-02 1.263e-03 -15.262 < 2e-16 ***
## HP 3.677e+01 2.907e+00 12.649 < 2e-16 ***
## log(cc) -2.261e+03 3.726e+02 -6.067 1.67e-09 ***
## Gears 5.582e+02 1.912e+02 2.920 0.00356 **
## Quarterly_Tax 6.545e+00 1.361e+00 4.808 1.69e-06 ***
## Weight 1.870e+01 1.059e+00 17.658 < 2e-16 ***
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## Residual standard error: 1326 on 1428 degrees of freedom
## Multiple R-squared: 0.867, Adjusted R-squared: 0.8664
## F-statistic: 1330 on 7 and 1428 DF, p-value: < 2.2e-16
Simple linear Regression
1) Calories_consumed-> predict weight gained using calories consumed
2) Delivery_time -> Predict delivery time using sorting time
3) Emp_data -> Build a prediction model for Churn_out_rate
4) Salary_hike -> Build a prediction model for Salary_hike
Do the necessary transformations for input variables for getting better R^2 value for the model
prepared.
1 - Calories_consumed-> predict weight gained using calories consumed
summary(Calories_consumed)
## Weight.gained..grams. Calories.Consumed
## Min. : 62.0 Min. :1400
## 1st Qu.: 114.5 1st Qu.:1728
## Median : 200.0 Median :2250
## Mean : 357.7 Mean :2341
## 3rd Qu.: 537.5 3rd Qu.:2775
## Max. :1100.0 Max. :3900
# Variance and Standard deviation of Calories.Consumed column
var(Calories_consumed$Calories.Consumed)
## [1] 565668.7
sd(Calories_consumed$Calories.Consumed)
## [1] 752.1095
# Variance and Standard deviation of Weight.gained..grams. column
var(Calories_consumed$Weight.gained..grams.)
## [1] 111350.7
sd(Calories_consumed$Weight.gained..grams.)
## [1] 333.6925
Creating Linear Model for weight gain
WeightGainModel <- lm(Weight.gained..grams. ~ Calories.Consumed, data =
Calories_consumed)
summary(WeightGainModel)
## Call:
## lm(formula = Weight.gained..grams. ~ Calories.Consumed, data = Calories_consumed)
## Residuals:
## Min 1Q Median 3Q Max
## -158.67 -107.56 36.70 81.68 165.53
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -625.75236 100.82293 -6.206 4.54e-05 ***
## Calories.Consumed 0.42016 0.04115 10.211 2.86e-07 ***
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## Residual standard error: 111.6 on 12 degrees of freedom
## Multiple R-squared: 0.8968, Adjusted R-squared: 0.8882
## F-statistic: 104.3 on 1 and 12 DF, p-value: 2.856e-07
plot(Calories_consumed)

Since the p-value is less than 0.05, the predictor Calories.Consumed is significant, and the Multiple
R-squared is 0.8968, i.e. the model explains about 89.68% of the variance in weight gained.
2 - Delivery_time -> Predict delivery time using sorting time
summary(delivery_time)
## Delivery.Time Sorting.Time
## Min. : 8.00 Min. : 2.00
## 1st Qu.:13.50 1st Qu.: 4.00
## Median :17.83 Median : 6.00
## Mean :16.79 Mean : 6.19
## 3rd Qu.:19.75 3rd Qu.: 8.00
## Max. :29.00 Max. :10.00
# Variance and Standard deviation of Delivery.Time column
var(delivery_time$Delivery.Time)
## [1] 25.75462
sd(delivery_time$Delivery.Time)
## [1] 5.074901
# Variance and Standard deviation of Sorting.Time column
var(delivery_time$Sorting.Time)
## [1] 6.461905
sd(delivery_time$Sorting.Time)
## [1] 2.542028
Creating Linear Model for delivery time
deliverTimeModel <- lm(Delivery.Time ~ Sorting.Time, data = delivery_time)
summary(deliverTimeModel)
## Call:
## lm(formula = Delivery.Time ~ Sorting.Time, data = delivery_time)
## Residuals:
## Min 1Q Median 3Q Max
## -5.1729 -2.0298 -0.0298 0.8741 6.6722
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.5827 1.7217 3.823 0.00115 **
## Sorting.Time 1.6490 0.2582 6.387 3.98e-06 ***
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## Residual standard error: 2.935 on 19 degrees of freedom
## Multiple R-squared: 0.6823, Adjusted R-squared: 0.6655
## F-statistic: 40.8 on 1 and 19 DF, p-value: 3.983e-06
plot(deliverTimeModel)
Since the p-value is less than 0.05, the predictor Sorting.Time is significant, and the Multiple
R-squared is 0.6823, i.e. the model explains about 68.23% of the variance in delivery time.
3 - Emp_data -> Build a prediction model for Churn_out_rate
summary(Emp_data)
## Salary_hike Churn_out_rate
## Min. :1580 Min. :60.00
## 1st Qu.:1618 1st Qu.:65.75
## Median :1675 Median :71.00
## Mean :1689 Mean :72.90
## 3rd Qu.:1724 3rd Qu.:78.75
## Max. :1870 Max. :92.00
# Variance and Standard deviation of Salary_hike column
var(Emp_data$Salary_hike)
## [1] 8481.822
sd(Emp_data$Salary_hike)
## [1] 92.09681
# Variance and Standard deviation of Churn_out_rate column
var(Emp_data$Churn_out_rate)
## [1] 105.2111
sd(Emp_data$Churn_out_rate)
## [1] 10.25725
Creating Linear Model for Churn_out_rate
Churn_out_rate_Model <- lm(Churn_out_rate ~ Salary_hike, data = Emp_data)
summary(Churn_out_rate_Model)
## Call:
## lm(formula = Churn_out_rate ~ Salary_hike, data = Emp_data)
## Residuals:
## Min 1Q Median 3Q Max
## -3.804 -3.059 -1.819 2.430 8.072
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 244.36491 27.35194 8.934 1.96e-05 ***
## Salary_hike -0.10154 0.01618 -6.277 0.000239 ***
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## Residual standard error: 4.469 on 8 degrees of freedom
## Multiple R-squared: 0.8312, Adjusted R-squared: 0.8101
## F-statistic: 39.4 on 1 and 8 DF, p-value: 0.0002386
plot(Churn_out_rate_Model)
Since the p-value is less than 0.05, the predictor Salary_hike is significant, and the Multiple
R-squared is 0.8312, i.e. the model explains about 83.12% of the variance in the churn-out rate.
4 - Salary_hike -> Build a prediction model for Salary_hike
summary(Salary_hike)
## YearsExperience Salary
## Min. : 1.100 Min. : 37731
## 1st Qu.: 3.200 1st Qu.: 56721
## Median : 4.700 Median : 65237
## Mean : 5.313 Mean : 76003
## 3rd Qu.: 7.700 3rd Qu.:100545
## Max. :10.500 Max. :122391
# Variance and Standard deviation of Salary_hike column
var(Salary_hike$YearsExperience)
## [1] 8.053609
sd(Salary_hike$YearsExperience)
## [1] 2.837888
# Variance and Standard deviation of Churn_out_rate column
var(Salary_hike$Salary)
## [1] 751550960
sd(Salary_hike$Salary)
## [1] 27414.43
Creating Linear Model for Salary_hike
Salary_hike_Model <- lm(Salary ~ YearsExperience, data = Salary_hike)
summary(Salary_hike_Model)
## Call:
## lm(formula = Salary ~ YearsExperience, data = Salary_hike)
## Residuals:
## Min 1Q Median 3Q Max
## -7958.0 -4088.5 -459.9 3372.6 11448.0
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 25792.2 2273.1 11.35 5.51e-12 ***
## YearsExperience 9450.0 378.8 24.95 < 2e-16 ***
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## Residual standard error: 5788 on 28 degrees of freedom
## Multiple R-squared: 0.957, Adjusted R-squared: 0.9554
## F-statistic: 622.5 on 1 and 28 DF, p-value: < 2.2e-16
plot(Salary_hike_Model)
Since the p-value is less than 0.05, the predictor YearsExperience is significant, and the Multiple
R-squared is 0.957, i.e. the model explains about 95.7% of the variance in salary.

Day 6 Logistic Regression


Output variable -> y
y -> Whether the client has subscribed a term deposit or not
Binomial ("yes" or "no")

## age default balance housing


## Min. :18.00 Min. :0.00000 Min. : -8019 Min. :0.0000
## 1st Qu.:33.00 1st Qu.:0.00000 1st Qu.: 72 1st Qu.:0.0000
## Median :39.00 Median :0.00000 Median : 448 Median :1.0000
## Mean :40.94 Mean :0.01803 Mean : 1362 Mean :0.5558
## 3rd Qu.:48.00 3rd Qu.:0.00000 3rd Qu.: 1428 3rd Qu.:1.0000
## Max. :95.00 Max. :1.00000 Max. :102127 Max. :1.0000
## loan duration campaign pdays
## Min. :0.0000 Min. : 0.0 Min. : 1.000 Min. : -1.0
## 1st Qu.:0.0000 1st Qu.: 103.0 1st Qu.: 1.000 1st Qu.: -1.0
## Median :0.0000 Median : 180.0 Median : 2.000 Median : -1.0
## Mean :0.1602 Mean : 258.2 Mean : 2.764 Mean : 40.2
## 3rd Qu.:0.0000 3rd Qu.: 319.0 3rd Qu.: 3.000 3rd Qu.: -1.0
## Max. :1.0000 Max. :4918.0 Max. :63.000 Max. :871.0
## previous poutfailure poutother poutsuccess
## Min. : 0.0000 Min. :0.0000 Min. :0.0000 Min. :0.00000
## 1st Qu.: 0.0000 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:0.00000
## Median : 0.0000 Median :0.0000 Median :0.0000 Median :0.00000
## Mean : 0.5803 Mean :0.1084 Mean :0.0407 Mean :0.03342
## 3rd Qu.: 0.0000 3rd Qu.:0.0000 3rd Qu.:0.0000 3rd Qu.:0.00000
## Max. :275.0000 Max. :1.0000 Max. :1.0000 Max. :1.00000
## poutunknown con_cellular con_telephone con_unknown
## Min. :0.0000 Min. :0.0000 Min. :0.00000 Min. :0.000
## 1st Qu.:1.0000 1st Qu.:0.0000 1st Qu.:0.00000 1st Qu.:0.000
## Median :1.0000 Median :1.0000 Median :0.00000 Median :0.000
## Mean :0.8175 Mean :0.6477 Mean :0.06428 Mean :0.288
## 3rd Qu.:1.0000 3rd Qu.:1.0000 3rd Qu.:0.00000 3rd Qu.:1.000
## Max. :1.0000 Max. :1.0000 Max. :1.00000 Max. :1.000
## divorced married single joadmin.
## Min. :0.0000 Min. :0.0000 Min. :0.0000 Min. :0.0000
## 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:0.0000
## Median :0.0000 Median :1.0000 Median :0.0000 Median :0.0000
## Mean :0.1152 Mean :0.6019 Mean :0.2829 Mean :0.1144
## 3rd Qu.:0.0000 3rd Qu.:1.0000 3rd Qu.:1.0000 3rd Qu.:0.0000
## Max. :1.0000 Max. :1.0000 Max. :1.0000 Max. :1.0000
## joblue.collar joentrepreneur johousemaid jomanagement
## Min. :0.0000 Min. :0.00000 Min. :0.00000 Min. :0.0000
## 1st Qu.:0.0000 1st Qu.:0.00000 1st Qu.:0.00000 1st Qu.:0.0000
## Median :0.0000 Median :0.00000 Median :0.00000 Median :0.0000
## Mean :0.2153 Mean :0.03289 Mean :0.02743 Mean :0.2092
## 3rd Qu.:0.0000 3rd Qu.:0.00000 3rd Qu.:0.00000 3rd Qu.:0.0000
## Max. :1.0000 Max. :1.00000 Max. :1.00000 Max. :1.0000
## joretired joself.employed joservices jostudent
## Min. :0.00000 Min. :0.00000 Min. :0.00000 Min. :0.00000
## 1st Qu.:0.00000 1st Qu.:0.00000 1st Qu.:0.00000 1st Qu.:0.00000
## Median :0.00000 Median :0.00000 Median :0.00000 Median :0.00000
## Mean :0.05008 Mean :0.03493 Mean :0.09188 Mean :0.02075
## 3rd Qu.:0.00000 3rd Qu.:0.00000 3rd Qu.:0.00000 3rd Qu.:0.00000
## Max. :1.00000 Max. :1.00000 Max. :1.00000 Max. :1.00000
## jotechnician jounemployed jounknown y
## Min. :0.000 Min. :0.00000 Min. :0.00000 Min. :0.000
## 1st Qu.:0.000 1st Qu.:0.00000 1st Qu.:0.00000 1st Qu.:0.000
## Median :0.000 Median :0.00000 Median :0.00000 Median :0.000
## Mean :0.168 Mean :0.02882 Mean :0.00637 Mean :0.117
## 3rd Qu.:0.000 3rd Qu.:0.00000 3rd Qu.:0.00000 3rd Qu.:0.000
## Max. :1.000 Max. :1.00000 Max. :1.00000 Max. :1.000
## Classes 'data.table' and 'data.frame': 45211 obs. of 32 variables:
## $ age : int 58 44 33 47 33 35 28 42 58 43 ...
## $ default : int 0 0 0 0 0 0 0 1 0 0 ...
## $ balance : int 2143 29 2 1506 1 231 447 2 121 593 ...
## $ housing : int 1 1 1 1 0 1 1 1 1 1 ...
## $ loan : int 0 0 1 0 0 0 1 0 0 0 ...
y: whether the client has subscribed to a term deposit (1 = yes, 0 = no).
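The summary above shows the categorical fields (marital status, job) expanded into 0/1 dummy columns. A minimal sketch of how such an encoding can be produced with model.matrix(); the file name and raw column names here are assumptions, not taken from the original:

# Sketch: one-hot encode marital status and job from the raw bank data,
# mirroring the divorced/married/single and jo* indicator columns above.
library(data.table)
bank <- fread("bank-full.csv", sep = ";")            # assumed file name
dummies <- model.matrix(~ marital + job - 1, bank)   # 0/1 indicator columns
bank_encoded <- cbind(bank[, .(age, balance)], dummies)
str(bank_encoded)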

I have a dataset containing family information for married couples, with about 10
variables and 600+ observations. The independent variables include gender, age, years married,
children, religiousness, etc. The response variable is the number of extramarital affairs.
I want to know which factors influence the likelihood of an extramarital affair.
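The model output below can be reproduced along these lines. This is a sketch: the Affairs data ships with the AER package, and gender_F / Children_F are assumed to be 0/1 recodes of the gender and children factors:

library(AER)
data(Affairs)
# Assumed recodes of the factor columns into 0/1 indicators
Affairs$gender_F   <- as.integer(Affairs$gender == "female")
Affairs$Children_F <- as.integer(Affairs$children == "yes")

fit <- glm(affairs ~ age + yearsmarried + religiousness + education +
             occupation + rating + factor(gender_F) + factor(Children_F),
           data = Affairs)   # gaussian family by default
summary(fit)
# Backward elimination by AIC
step(fit)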
##
## Call:
## glm(formula = affairs ~ age + yearsmarried + religiousness +
## education + occupation + rating + factor(gender_F) + factor(Children_F),
## data = Affairs)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -5.0503 -1.7226 -0.7947 0.2101 12.7036
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.87201 1.13750 5.162 3.34e-07 ***
## age -0.05098 0.02262 -2.254 0.0246 *
## yearsmarried 0.16947 0.04122 4.111 4.50e-05 ***
## religiousness -0.47761 0.11173 -4.275 2.23e-05 ***
## education -0.01375 0.06414 -0.214 0.8303
## occupation 0.10492 0.08888 1.180 0.2383
## rating -0.71188 0.12001 -5.932 5.09e-09 ***
## factor(gender_F)1 0.05409 0.30049 0.180 0.8572
## factor(Children_F)1 -0.14262 0.35020 -0.407 0.6840
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for gaussian family taken to be 9.575934)
##
## Null deviance: 6529.1 on 600 degrees of freedom
## Residual deviance: 5669.0 on 592 degrees of freedom
## AIC: 3074.3
##
## Number of Fisher Scoring iterations: 2

## Start: AIC=3074.31
## affairs ~ age + yearsmarried + religiousness + education + occupation +
## rating + factor(gender_F) + factor(Children_F)
##
## Df Deviance AIC
## - factor(gender_F) 1 5669.3 3072.3
## - education 1 5669.4 3072.3
## - factor(Children_F) 1 5670.5 3072.5
## - occupation 1 5682.3 3073.7
## <none> 5669.0 3074.3
## - age 1 5717.6 3077.4
## - yearsmarried 1 5830.8 3089.2
## - religiousness 1 5843.9 3090.6
## - rating 1 6005.9 3107.0
##
## Step: AIC=3072.34
## affairs ~ age + yearsmarried + religiousness + education + occupation +
## rating + factor(Children_F)
##
## Df Deviance AIC
## - education 1 5669.6 3070.4
## - factor(Children_F) 1 5670.7 3070.5
## - occupation 1 5685.7 3072.1
## <none> 5669.3 3072.3
## - age 1 5718.2 3075.5
## - yearsmarried 1 5834.6 3087.6
## - religiousness 1 5844.0 3088.6
## - rating 1 6007.1 3105.1
##
## Step: AIC=3070.37
## affairs ~ age + yearsmarried + religiousness + occupation + rating +
## factor(Children_F)
##
## Df Deviance AIC
## - factor(Children_F) 1 5671.1 3068.5
## <none> 5669.6 3070.4
## - occupation 1 5688.9 3070.4
## - age 1 5719.3 3073.6
## - yearsmarried 1 5835.7 3085.7
## - religiousness 1 5844.0 3086.6
## - rating 1 6016.6 3104.1
##
## Step: AIC=3068.53
## affairs ~ age + yearsmarried + religiousness + occupation + rating
##
## Df Deviance AIC
## <none> 5671.1 3068.5
## - occupation 1 5692.3 3068.8
## - age 1 5720.5 3071.8
## - religiousness 1 5845.6 3084.8
## - yearsmarried 1 5854.5 3085.7
## - rating 1 6016.6 3102.1
##
## Call: glm(formula = affairs ~ age + yearsmarried + religiousness +
## occupation + rating, data = Affairs)
##
## Coefficients:
## (Intercept) age yearsmarried religiousness occupation
## 5.60816 -0.05035 0.16185 -0.47632 0.10601
## rating
## -0.71224
##
## Degrees of Freedom: 600 Total (i.e. Null); 595 Residual
## Null Deviance: 6529
## Residual Deviance: 5671 AIC: 3069
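The final model retains age, yearsmarried, religiousness, occupation, and rating. As a sketch, it can be refit and used to score a hypothetical respondent (the Affairs data is assumed to come from the AER package, and the new observation is invented for illustration):

library(AER)
data(Affairs)
# Refit the final model from the stepwise search above
final <- glm(affairs ~ age + yearsmarried + religiousness + occupation + rating,
             data = Affairs)
# Predict expected affairs for a hypothetical respondent
new_obs <- data.frame(age = 35, yearsmarried = 10, religiousness = 3,
                      occupation = 5, rating = 4)
predict(final, newdata = new_obs)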

Day 09 KNN Assignments:


1. Prepare a model for glass classification using KNN
K Nearest Neighbors (KNN)
The Glass Identification Database from UCI contains 10 attributes, including an Id column.
The response is glass type, which takes 7 discrete values.
Attributes
Id: 1 to 214 (removed from CSV file)
RI: refractive index
Na: Sodium (unit measurement: weight percent in corresponding oxide, as are attributes 4-10)
Mg: Magnesium
Al: Aluminum
Si: Silicon
K: Potassium
Ca: Calcium
Ba: Barium
Fe: Iron
Type of glass: (Class Attribute)
1 - building_windows_float_processed
2 - building_windows_non_float_processed
3 - vehicle_windows_float_processed
4 - vehicle_windows_non_float_processed (none in this database)
5 - containers
6 - tableware
7 - headlamps
library(class)      # knn()
library(caTools)    # sample.split()
library(caret)      # confusionMatrix()
library(corrplot)   # corrplot()
library(ggplot2)    # plotting the error curve
glass <- read.csv("../input/glass.csv")
Standardize the Data
It's ideal to standardize the features, especially for a distance-based algorithm like KNN.
Here we use scale() to standardize the feature columns of glass and assign the result to a
new variable, excluding the target column Type.
standard.features <- scale(glass[,1:9])
#Join the standardized data with the target column
data <- cbind(standard.features,glass[10])
#Check if there are any missing values to impute.
anyNA(data)
## [1] FALSE
# Looks like the data is free from NA's
head(data)
## RI Na Mg Al Si K
## 1 0.8708258 0.2842867 1.2517037 -0.6908222 -1.12444556 -0.67013422
## 2 -0.2487502 0.5904328 0.6346799 -0.1700615 0.10207972 -0.02615193
## 3 -0.7196308 0.1495824 0.6000157 0.1904651 0.43776033 -0.16414813
## 4 -0.2322859 -0.2422846 0.6970756 -0.3102663 -0.05284979 0.11184428
## 5 -0.3113148 -0.1688095 0.6485456 -0.4104126 0.55395746 0.08117845
## 6 -0.7920739 -0.7566101 0.6416128 0.3506992 0.41193874 0.21917466
## Ca Ba Fe Type
## 1 -0.1454254 -0.3520514 -0.5850791 1
## 2 -0.7918771 -0.3520514 -0.5850791 1
## 3 -0.8270103 -0.3520514 -0.5850791 1
## 4 -0.5178378 -0.3520514 -0.5850791 1
## 5 -0.6232375 -0.3520514 -0.5850791 1
## 6 -0.6232375 -0.3520514 2.0832652 1
Data Visualization
The plot below shows the correlations between the features of the glass dataset.
corrplot(cor(data))
Test and Train Data Split
We use sample.split() from the caTools package to split the data into train and test sets with SplitRatio = 0.70.
set.seed(100)
sample <- sample.split(data$Type,SplitRatio = 0.70)
train <- subset(data,sample==TRUE)
test <- subset(data,sample==FALSE)
KNN Model
We use knn() from the class package to predict the target variable Type on the test set with k = 1.
predicted.type <- knn(train[1:9],test[1:9],train$Type,k=1)
#Error in prediction
error <- mean(predicted.type!=test$Type)
#Confusion Matrix
confusionMatrix(predicted.type,test$Type)
## Confusion Matrix and Statistics
## Reference
## Prediction 1 2 3 5 6 7
## 1 17 3 3 0 0 1
## 2 1 17 0 1 0 1
## 3 3 2 2 0 0 0
## 5 0 1 0 3 0 0
## 6 0 0 0 0 3 1
## 7 0 0 0 0 0 6
## Overall Statistics
## Accuracy : 0.7385
## 95% CI : (0.6146, 0.8397)
## No Information Rate : 0.3538
## P-Value [Acc > NIR] : 3.019e-10
## Kappa : 0.6485
## Mcnemar's Test P-Value : NA
## Statistics by Class:
## Class: 1 Class: 2 Class: 3 Class: 5 Class: 6 Class: 7
## Sensitivity 0.8095 0.7391 0.40000 0.75000 1.00000 0.66667
## Specificity 0.8409 0.9286 0.91667 0.98361 0.98387 1.00000
## Pos Pred Value 0.7083 0.8500 0.28571 0.75000 0.75000 1.00000
## Neg Pred Value 0.9024 0.8667 0.94828 0.98361 1.00000 0.94915
## Prevalence 0.3231 0.3538 0.07692 0.06154 0.04615 0.13846
## Detection Rate 0.2615 0.2615 0.03077 0.04615 0.04615 0.09231
## Detection Prevalence 0.3692 0.3077 0.10769 0.06154 0.06154 0.09231
## Balanced Accuracy 0.8252 0.8339 0.65833 0.86680 0.99194 0.83333
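Each per-class sensitivity in the table is the count of correct predictions for that class divided by the class's true total (its column sum in the confusion matrix). As a worked check for Class 1 from the k = 1 matrix above:

# Sketch: sensitivity for Class 1 = correct predictions (17) over the
# class's true total in the reference column (17 + 1 + 3 = 21).
17 / (17 + 1 + 3)   # 0.8095, matching the table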
The results above show that the model achieved an accuracy of about 73.85%. Let's try
different values of k and reassess the model.
predicted.type <- NULL
error.rate <- NULL
for (i in 1:10) {
predicted.type <- knn(train[1:9],test[1:9],train$Type,k=i)
error.rate[i] <- mean(predicted.type!=test$Type)
}
knn.error <- as.data.frame(cbind(k=1:10,error.type =error.rate))
Choosing K Value by Visualization
Let's plot error.type against k using ggplot.
ggplot(knn.error,aes(k,error.type))+
geom_point()+
geom_line() +
scale_x_continuous(breaks=1:10)+
theme_bw() +
xlab("Value of K") +
ylab('Error')

The plot above shows that the error is lowest at k = 3 and rises again for larger k,
indicating that k = 3 is the optimal value. Now let's rebuild the model with k = 3 and assess it.
Result
predicted.type <- knn(train[1:9],test[1:9],train$Type,k=3)
#Error in prediction
error <- mean(predicted.type!=test$Type)
#Confusion Matrix
confusionMatrix(predicted.type,test$Type)
## Confusion Matrix and Statistics
## Reference
## Prediction 1 2 3 5 6 7
## 1 18 3 3 0 0 0
## 2 2 17 0 2 0 1
## 3 1 1 2 0 0 0
## 5 0 2 0 2 0 0
## 6 0 0 0 0 3 0
## 7 0 0 0 0 0 8
## Overall Statistics
## Accuracy : 0.7692
## 95% CI : (0.6481, 0.8647)
## No Information Rate : 0.3538
## P-Value [Acc > NIR] : 9.716e-12
## Kappa : 0.6853
## Mcnemar's Test P-Value : NA
## Statistics by Class:
## Class: 1 Class: 2 Class: 3 Class: 5 Class: 6 Class: 7
## Sensitivity 0.8571 0.7391 0.40000 0.50000 1.00000 0.8889
## Specificity 0.8636 0.8810 0.96667 0.96721 1.00000 1.0000
## Pos Pred Value 0.7500 0.7727 0.50000 0.50000 1.00000 1.0000
## Neg Pred Value 0.9268 0.8605 0.95082 0.96721 1.00000 0.9825
## Prevalence 0.3231 0.3538 0.07692 0.06154 0.04615 0.1385
## Detection Rate 0.2769 0.2615 0.03077 0.03077 0.04615 0.1231
## Detection Prevalence 0.3692 0.3385 0.06154 0.06154 0.04615 0.1231
## Balanced Accuracy 0.8604 0.8100 0.68333 0.73361 1.00000 0.9444
The model above gave us an accuracy of about 76.92%.
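The quoted accuracy comes straight from the confusion matrix: the diagonal (correct predictions) divided by the total. As a sketch, reproducing the figure from the k = 3 counts shown above:

# Overall accuracy from the k = 3 confusion matrix (rows = prediction,
# columns = reference, as printed above).
cm <- matrix(c(18,  3, 3, 0, 0, 0,
                2, 17, 0, 2, 0, 1,
                1,  1, 2, 0, 0, 0,
                0,  2, 0, 2, 0, 0,
                0,  0, 0, 0, 3, 0,
                0,  0, 0, 0, 0, 8), nrow = 6, byrow = TRUE)
sum(diag(cm)) / sum(cm)   # 50 / 65 = 0.7692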
2. Implement a KNN model to classify the animals into categories:
Load Data
set.seed(1)
library(class)
d = read.table("zoo.DATA", sep=",", header = FALSE)
d = data.frame(d)
Data Conditioning
Phylogenetic traits used for classification:
names(d) <- c("animal", "hair", "feathers", "eggs", "milk", "airborne",
"aquatic", "predator", "toothed", "backbone", "breathes", "venomous",
"fins", "legs", "tail", "domestic", "size", "type")
types <- table(d$type)
d_target <- d[, 18]
d_key <- d[, 1]
d$animal <- NULL
names(types) <- c("mammal", "bird", "reptile", "fish", "amphibian", "insect", "crustacean")
types
## mammal bird reptile fish amphibian insect
## 41 20 5 13 4 8
## crustacean
## 10
summary(d)
## hair feathers eggs milk
## Min. :0.0000 Min. :0.000 Min. :0.0000 Min. :0.0000
## 1st Qu.:0.0000 1st Qu.:0.000 1st Qu.:0.0000 1st Qu.:0.0000
## Median :0.0000 Median :0.000 Median :1.0000 Median :0.0000
## Mean :0.4257 Mean :0.198 Mean :0.5842 Mean :0.4059
## 3rd Qu.:1.0000 3rd Qu.:0.000 3rd Qu.:1.0000 3rd Qu.:1.0000
## Max. :1.0000 Max. :1.000 Max. :1.0000 Max. :1.0000
## airborne aquatic predator toothed
## Min. :0.0000 Min. :0.0000 Min. :0.0000 Min. :0.000
## 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:0.000
## Median :0.0000 Median :0.0000 Median :1.0000 Median :1.000
## Mean :0.2376 Mean :0.3564 Mean :0.5545 Mean :0.604
## 3rd Qu.:0.0000 3rd Qu.:1.0000 3rd Qu.:1.0000 3rd Qu.:1.000
## Max. :1.0000 Max. :1.0000 Max. :1.0000 Max. :1.000
## backbone breathes venomous fins
## Min. :0.0000 Min. :0.0000 Min. :0.00000 Min. :0.0000
## 1st Qu.:1.0000 1st Qu.:1.0000 1st Qu.:0.00000 1st Qu.:0.0000
## Median :1.0000 Median :1.0000 Median :0.00000 Median :0.0000
## Mean :0.8218 Mean :0.7921 Mean :0.07921 Mean :0.1683
## 3rd Qu.:1.0000 3rd Qu.:1.0000 3rd Qu.:0.00000 3rd Qu.:0.0000
## Max. :1.0000 Max. :1.0000 Max. :1.00000 Max. :1.0000
## legs tail domestic size
## Min. :0.000 Min. :0.0000 Min. :0.0000 Min. :0.0000
## 1st Qu.:2.000 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:0.0000
## Median :4.000 Median :1.0000 Median :0.0000 Median :0.0000
## Mean :2.842 Mean :0.7426 Mean :0.1287 Mean :0.4356
## 3rd Qu.:4.000 3rd Qu.:1.0000 3rd Qu.:0.0000 3rd Qu.:1.0000
## Max. :8.000 Max. :1.0000 Max. :1.0000 Max. :1.0000
## type
## Min. :1.000
## 1st Qu.:1.000
## Median :2.000
## Mean :2.832
## 3rd Qu.:4.000
## Max. :7.000
str(d)
## 'data.frame': 101 obs. of 17 variables:
## $ hair : int 1 1 0 1 1 1 1 0 0 1 ...
## $ feathers: int 0 0 0 0 0 0 0 0 0 0 ...
## $ eggs : int 0 0 1 0 0 0 0 1 1 0 ...
## $ milk : int 1 1 0 1 1 1 1 0 0 1 ...
## $ airborne: int 0 0 0 0 0 0 0 0 0 0 ...
## $ aquatic : int 0 0 1 0 0 0 0 1 1 0 ...
## $ predator: int 1 0 1 1 1 0 0 0 1 0 ...
## $ toothed : int 1 1 1 1 1 1 1 1 1 1 ...
## $ backbone: int 1 1 1 1 1 1 1 1 1 1 ...
## $ breathes: int 1 1 0 1 1 1 1 0 0 1 ...
## $ venomous: int 0 0 0 0 0 0 0 0 0 0 ...
## $ fins : int 0 0 1 0 0 0 0 1 1 0 ...
## $ legs : int 4 4 0 4 4 4 4 0 0 4 ...
## $ tail : int 0 1 1 0 1 1 1 1 1 0 ...
## $ domestic: int 0 0 0 0 0 0 1 1 0 1 ...
## $ size : int 1 1 0 1 1 1 1 0 0 0 ...
## $ type : int 1 1 4 1 1 1 1 4 4 1 ...
# Heuristic choice of k: square root of the number of features (~5)
k <- sqrt(17) + 1
# Leave-one-out cross-validated KNN. Note that d still contains the type
# column, so the label leaks into the features; drop it (d[, -17]) for a
# stricter evaluation.
m1 <- knn.cv(d, d_target, k, prob = TRUE)
prediction <- m1
cmat <- table(d_target,prediction)
acc <- (sum(diag(cmat)) / length(d_target)) * 100
print(acc)
## [1] 90.09901
Confusion Matrix
data.frame(types)
## Var1 Freq
## 1 mammal 41
## 2 bird 20
## 3 reptile 5
## 4 fish 13
## 5 amphibian 4
## 6 insect 8
## 7 crustacean 10
cmat
## prediction
## d_target 1 2 3 4 5 6 7
## 1 41 0 0 0 0 0 0
## 2 0 20 0 0 0 0 0
## 3 0 1 0 3 1 0 0
## 4 0 0 0 13 0 0 0
## 5 0 0 0 0 4 0 0
## 6 0 0 0 0 0 8 0
## 7 0 0 0 2 1 2 5
Accuracy (%)
acc
## [1] 90.09901
