Assignment
E. Raviteja
Roll no: 10
Day 1_2 Assignment:
Q1) Identify the Data type for the Following:
Q2) Identify the data type, from among Nominal, Ordinal, Interval, and Ratio, for each of the following:
Data: Data Type
Gender: Nominal
High School Class Ranking: Ordinal
Celsius Temperature: Interval
Weight: Ratio
Hair Color: Nominal
Socioeconomic Status: Ordinal
Fahrenheit Temperature: Interval
Height: Ratio
Type of living accommodation: Nominal
Level of Agreement: Ordinal
IQ (Intelligence Scale): Interval
Sales Figures: Ratio
Blood Group: Nominal
Time of Day: Interval
Time on a Clock with Hands: Interval
Number of Children: Ratio
Religious Preference: Nominal
Barometer Pressure: Ratio
SAT Scores: Interval
Years of Education: Ratio
Q3) Three Coins are tossed, find the probability that two heads and one tail are obtained?
Ans) The probability of getting 2 heads and 1 tail is 3/8
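This can be checked by enumerating the full sample space in R:

```r
# All 2^3 = 8 equally likely outcomes of tossing three fair coins
outcomes <- expand.grid(coin1 = c("H", "T"),
                        coin2 = c("H", "T"),
                        coin3 = c("H", "T"))
heads <- rowSums(outcomes == "H")   # number of heads in each outcome
p <- mean(heads == 2)               # exactly two heads (and therefore one tail)
p                                   # 0.375 = 3/8
```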
Q4) Two Dice are rolled, find the probability that sum is
a) Equal to 1 = 0/36 = 0
b) Less than or equal to 4 = 6/36 = 1/6
c) Sum is divisible by 2 and 3 = 6/36 = 1/6
Answer:
The sample space S of two dice is shown below.
S = { (1,1),(1,2),(1,3),(1,4),(1,5),(1,6)
(2,1),(2,2),(2,3),(2,4),(2,5),(2,6)
(3,1),(3,2),(3,3),(3,4),(3,5),(3,6)
(4,1),(4,2),(4,3),(4,4),(4,5),(4,6)
(5,1),(5,2),(5,3),(5,4),(5,5),(5,6)
(6,1),(6,2),(6,3),(6,4),(6,5),(6,6) }
a) Let E be the event "sum equal to 1". There are no outcomes which correspond to a
sum equal to 1, hence
P (E) = n (E) / n(S) = 0 / 36 = 0
b) Sum less than or equal to 4
Six outcomes give a sum of 2, 3 or 4: E = {(1,1), (1,2), (2,1), (1,3), (2,2), (3,1)}, hence
P(E) = n(E) / n(S) = 6 / 36 = 1 / 6
c) Sum is divisible by 2 and 3
A sum divisible by both 2 and 3 is divisible by 6, i.e. equals 6 or 12: E = {(1,5), (2,4), (3,3), (4,2), (5,1), (6,6)}
Number of possible outcomes = 6
Probability = 6/36 = 1/6
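Enumerating all 36 outcomes in R confirms these answers (reading part c as divisible by both 2 and 3, i.e. by 6):

```r
# The 36 equally likely outcomes of rolling two dice
s <- expand.grid(die1 = 1:6, die2 = 1:6)
sums <- s$die1 + s$die2
p_a <- mean(sums == 1)       # no pair of dice sums to 1, so 0
p_b <- mean(sums <= 4)       # 6/36 = 1/6
p_c <- mean(sums %% 6 == 0)  # divisible by 2 and 3 <=> divisible by 6: 6/36 = 1/6
c(p_a, p_b, p_c)
```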
Q5) A bag contains 2 red, 3 green and 2 blue balls. Two balls are drawn at random. What is the
probability that none of the balls drawn is blue?
Ans) The number of ways of drawing 2 balls out of 7 is 21.
The number of ways of drawing 2 balls from the 2 red and 3 green balls is 10.
The probability that none of the balls drawn is blue is 10/21.
Answer:
Total number of ways to draw 2 balls at random from the 7 coloured balls:
n(S) = 7C2 = 21
Let E be the event of drawing 2 balls, neither of which is blue (i.e. 2 of the 5 red or green balls):
n(E) = 5C2 = 2C2 + 3C2 + 2C1·3C1 = 1 + 3 + 6 = 10
The probability that none of the balls drawn is blue is
P(E) = n(E) / n(S) = 10/21
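The same counts follow directly from choose() in R:

```r
n_S <- choose(7, 2)  # ways to draw 2 balls from all 7: 21
n_E <- choose(5, 2)  # 2 balls from the 5 non-blue (2 red + 3 green): 10
n_E / n_S            # 10/21, about 0.476
```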
Q6) Calculate the Expected number of candies for a randomly selected child
Below are the probabilities of the candy counts for children (ignoring the nature of the child; a generalized view):
CHILD Candies count Probability
A 1 0.015
B 4 0.20
C 3 0.65
D 5 0.005
E 6 0.01
F 2 0.120
Child A – probability of having 1 candy = 0.015.
Child B – probability of having 4 candies = 0.20
Answer:
The expected value is the probability-weighted sum of the counts, not their plain average:
E[X] = Σ x·P(x) = 1(0.015) + 4(0.20) + 3(0.65) + 5(0.005) + 6(0.01) + 2(0.120) = 3.09
Expected number of candies for a randomly selected child = 3.09
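The expectation is the probability-weighted sum of the candy counts, not the plain average of the counts; in R:

```r
candies <- c(1, 4, 3, 5, 6, 2)                       # counts for children A-F
prob    <- c(0.015, 0.20, 0.65, 0.005, 0.01, 0.120)  # their probabilities
sum(prob)                # sanity check: probabilities sum to 1
ev <- sum(candies * prob)
ev                       # E[X] = 3.09
```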
Q7) Calculate Mean, Median, Mode, Variance, Standard Deviation, Range & comment about
the values / draw inferences, for the given dataset
- For the columns Points, Score, Weigh:
Find Mean, Median, Mode, Variance, Standard Deviation, and Range and also
Comment about the values/ Draw some inferences.
Answer:
Points: Mean 3.5966, Median 3.695, Mode 3.92, Variance 0.2859, Standard deviation 0.5347, Range = Max - Min = 4.93 - 2.76 = 2.17
Score: Mean 3.2172, Median 3.325, Mode 3.44, Variance 0.9574, Standard deviation 0.9785, Range = Max - Min = 5.424 - 1.513 = 3.911
The Points values are tightly clustered around their mean, while the Score values are more spread out (larger standard deviation and range).
Q8) Calculate Expected Value for the problem below
a) The weights (X) of patients at a clinic (in pounds), are
108, 110, 123, 134, 135, 145, 167, 187, 199
Assume one of the patients is chosen at random. What is the Expected Value of the
Weight of that patient?
Expected value of one of the patients chosen at random=145.333
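Since each patient is equally likely (probability 1/9), the expected value is just the sample mean; in R:

```r
weights <- c(108, 110, 123, 134, 135, 145, 167, 187, 199)
mean(weights)  # 1308 / 9 = 145.3333
```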
Q9) Calculate Skewness, Kurtosis & draw inferences on the following data
Cars speed and distance
Skewness: 2.05755
Kurtosis:
SP and Weight(WT)
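The assignment's Cars.csv and wc-at files are not attached here, so as a sketch the moment formulas can be written out by hand (equivalent to moments::skewness and moments::kurtosis) and applied to R's built-in cars data, which also records speed and stopping distance:

```r
# Moment-based skewness and kurtosis, defined directly from the data
skew <- function(x) { m <- mean(x); mean((x - m)^3) / mean((x - m)^2)^1.5 }
kurt <- function(x) { m <- mean(x); mean((x - m)^4) / mean((x - m)^2)^2 }

# Built-in 'cars' data stands in for the assignment's Cars.csv
skew(cars$dist)  # positive: stopping distances are right-skewed
kurt(cars$dist)  # values above 3 indicate heavier tails than the normal
```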
Q10) Draw inferences about the following boxplot & histogram
Answer:
Histograms are the most common way to plot a vector of numeric data. To create a histogram we use the hist() function. The main argument to hist() is x, a vector of numeric data. To control how the histogram bins are created, use the breaks argument; to change the colour of the background or border of the bins, use col and border.
Let's create a histogram of the weights in the ChickWeight dataset:
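For example, using R's built-in ChickWeight data:

```r
# Histogram of chick weights, with custom bins and colours
hist(ChickWeight$weight,
     breaks = 20,
     col    = "skyblue",
     border = "white",
     main   = "Chick weights",
     xlab   = "Weight (gm)")

# A boxplot of the same variable, split by diet, for comparison
boxplot(weight ~ Diet, data = ChickWeight,
        main = "Chick weight by diet", xlab = "Diet", ylab = "Weight (gm)")
```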
Box plot values:
Q11) Suppose we want to estimate the average weight of an adult male in Mexico. We draw a
random sample of 2,000 men from a population of 3,000,000 men and weigh them. We find that
the average person in our sample weighs 200 pounds, and the standard deviation of the sample is
30 pounds. Calculate 94%,98%,96% confidence interval?
With x̄ = 200, s = 30, n = 2000, the standard error is 30/√2000 ≈ 0.671, so:
94% confidence interval: z = 1.881, margin ≈ 1.26, interval ≈ (198.74, 201.26)
96% confidence interval: z = 2.054, margin ≈ 1.38, interval ≈ (198.62, 201.38)
98% confidence interval: z = 2.326, margin ≈ 1.56, interval ≈ (198.44, 201.56)
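The intervals follow from qnorm() and the standard error s/sqrt(n):

```r
xbar <- 200; s <- 30; n <- 2000
se <- s / sqrt(n)                    # standard error, about 0.671
for (conf in c(0.94, 0.96, 0.98)) {
  z  <- qnorm(1 - (1 - conf) / 2)    # two-sided critical z value
  me <- z * se                       # margin of error
  cat(sprintf("%.0f%% CI: (%.2f, %.2f), z = %.3f\n",
              conf * 100, xbar - me, xbar + me, z))
}
```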
Q12) Below are the scores obtained by a student in tests
34,36,36,38,38,39,39,40,40,41,41,41,41,42,42,45,49,56
1) Find mean, median, variance, standard deviation.
2) What can we say about the student marks?
Answer: 34,36,36,38,38,39,39,40,40,41,41,41,41,42,42,45,49,56
Total sum=738
Mean=738/18 =41
Median=40.5
Deviations from the mean: -7,-5,-5,-3,-3,-2,-2,-1,-1,0,0,0,0,1,1,4,8,15
Squared deviations: 49,25,25,9,9,4,4,1,1,0,0,0,0,1,1,16,64,225; sum = 434
Variance = 434/18 = 24.11
Standard deviation = square root of 24.11 = 4.91
2) The student's marks are fairly consistent around the average: the median of 40.5 means about half the tests scored 40 or above, and the standard deviation of 4.91 shows most scores fall within roughly 5 marks of the mean of 41; the score of 56 is a high outlier.
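These figures can be reproduced in R (note that R's var() and sd() divide by n - 1, while the hand calculation above divides by n):

```r
scores <- c(34,36,36,38,38,39,39,40,40,41,41,41,41,42,42,45,49,56)
mean(scores)      # 41
median(scores)    # 40.5
n <- length(scores)
pop_var <- sum((scores - mean(scores))^2) / n  # population variance: 434/18
sqrt(pop_var)                                  # population SD, about 4.91
var(scores)       # sample variance, divides by n - 1: 434/17
```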
Q13) What is the nature of skewness when mean, median of data are equal?
Answer: When the mean and median of the data are equal, the distribution is symmetric and there is no skewness (skewness = 0).
Q14) What is the nature of skewness when mean > median ?
Answer: If mean > median, then the skewness is positive (the distribution is right-skewed).
Q15) What is the nature of skewness when median > mean?
Answer: If median > mean, then the skewness is negative (the distribution is left-skewed).
Q16) What does a positive kurtosis value indicate for a data set?
Answer: A distribution with a positive kurtosis value has heavier tails and a sharper peak than the normal distribution. For example, data that follow a t-distribution have a positive kurtosis value.
Q18) Answer the below questions using the below boxplot visualization.
Draw an inference from the distribution of data for Boxplot 1 with respect to Boxplot 2.
Q 20) Calculate probability from the given dataset for the below cases
Data _set: Cars.csv
Calculate the probability of MPG of Cars for the below cases.
MPG <- Cars$MPG
a. P(MPG>38)
b. P(MPG<40)
c. P (20<MPG<50)
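Assuming MPG is roughly normal, the probabilities come from pnorm() with the sample mean and standard deviation. Cars.csv is not included here, so the sketch below uses a hypothetical simulated sample; for the real data, replace the simulation with MPG <- read.csv("Cars.csv")$MPG.

```r
set.seed(1)
MPG <- rnorm(81, mean = 34, sd = 9)  # hypothetical stand-in for Cars$MPG
m <- mean(MPG); s <- sd(MPG)
p_a <- 1 - pnorm(38, m, s)                # P(MPG > 38)
p_b <- pnorm(40, m, s)                    # P(MPG < 40)
p_c <- pnorm(50, m, s) - pnorm(20, m, s)  # P(20 < MPG < 50)
c(p_a, p_b, p_c)
```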
Q 21) Check whether the data follows normal distribution
a) Check whether the MPG of Cars follows Normal Distribution
Dataset: Cars.csv
b) Check Whether the Adipose Tissue (AT) and Waist Circumference(Waist) from wc-
at data set follows Normal Distribution
Dataset: wc-at.csv
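A quick way to check normality is a Shapiro-Wilk test plus a Q-Q plot. The CSVs are not included here, so a simulated vector stands in for Cars$MPG, wc_at$AT, or wc_at$Waist:

```r
set.seed(1)
x <- rnorm(100)       # stand-in for the column being checked
shapiro.test(x)       # p > 0.05: no evidence against normality
qqnorm(x); qqline(x)  # points close to the line suggest normality
```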
Q 22) Calculate the Z scores of 90% confidence interval,94% confidence interval, 60%
confidence interval.
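The two-sided z critical values come from qnorm():

```r
conf <- c(0.90, 0.94, 0.60)
z <- qnorm(1 - (1 - conf) / 2)  # two-sided critical values
round(z, 3)                     # 1.645, 1.881, 0.842
```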
Q 23) Calculate the t scores of 95% confidence interval, 96% confidence interval, 99%
confidence interval for sample size of 25
Answer: with df = 25 - 1 = 24: 95% = 2.064, 96% = 2.172, 99% = 2.797
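The t critical values use qt() with n - 1 = 24 degrees of freedom:

```r
conf <- c(0.95, 0.96, 0.99)
t_crit <- qt(1 - (1 - conf) / 2, df = 24)  # two-sided critical t values
round(t_crit, 3)                           # 2.064, 2.172, 2.797
```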
Q 24) A Government company claims that an average light bulb lasts 270 days. A researcher
randomly selects 18 bulbs for testing. The sampled bulbs last an average of 260 days, with a
standard deviation of 90 days. If the CEO's claim were true, what is the probability that 18
randomly selected bulbs would have an average life of no more than 260 days?
Hint:
R code: pt(t_score, df), where df is the degrees of freedom
Answer: t = (260 - 270) / (90 / sqrt(18)) = -10 / 21.2132 = -0.4714
Degrees of freedom = 18 - 1 = 17
Probability = pt(-0.4714, 17) ≈ 0.322
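The same calculation in R:

```r
t_score <- (260 - 270) / (90 / sqrt(18))  # about -0.4714
p <- pt(t_score, df = 17)                 # P(sample mean life <= 260 days)
c(t_score, p)
```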
Day 5 Assignment:
Multi Linear Regression
summary(Profit_Model_Final)
## Call:
## lm(formula = Profit ~ `R&D Spend` + `Marketing Spend`, data = startup_50)
## Residuals:
## Min 1Q Median 3Q Max
## -33645 -4632 -414 6484 17097
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.698e+04 2.690e+03 17.464 <2e-16 ***
## `R&D Spend` 7.966e-01 4.135e-02 19.266 <2e-16 ***
## `Marketing Spend` 2.991e-02 1.552e-02 1.927 0.06 .
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## Residual standard error: 9161 on 47 degrees of freedom
## Multiple R-squared: 0.9505, Adjusted R-squared: 0.9483
## F-statistic: 450.8 on 2 and 47 DF, p-value: < 2.2e-16
plot(Profit_Model_Final)
The p-values for the intercept and `R&D Spend` are well below 0.05, so those predictors are significant; `Marketing Spend` is only marginally significant (p ≈ 0.06). The Multiple R-squared is 0.9505, meaning the model explains about 95.05% of the variance in Profit.
2 - Delivery_time -> Predict delivery time using sorting time
summary(delivery_time)
## Delivery.Time Sorting.Time
## Min. : 8.00 Min. : 2.00
## 1st Qu.:13.50 1st Qu.: 4.00
## Median :17.83 Median : 6.00
## Mean :16.79 Mean : 6.19
## 3rd Qu.:19.75 3rd Qu.: 8.00
## Max. :29.00 Max. :10.00
# Variance and Standard deviation of Delivery.Time column
var(delivery_time$Delivery.Time)
## [1] 25.75462
sd(delivery_time$Delivery.Time)
## [1] 5.074901
# Variance and Standard deviation of Sorting.Time column
var(delivery_time$Sorting.Time)
## [1] 6.461905
sd(delivery_time$Sorting.Time)
## [1] 2.542028
Creating Linear Model for delivery time
deliverTimeModel <- lm(Delivery.Time ~ Sorting.Time, data = delivery_time)
summary(deliverTimeModel)
## Call:
## lm(formula = Delivery.Time ~ Sorting.Time, data = delivery_time)
## Residuals:
## Min 1Q Median 3Q Max
## -5.1729 -2.0298 -0.0298 0.8741 6.6722
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.5827 1.7217 3.823 0.00115 **
## Sorting.Time 1.6490 0.2582 6.387 3.98e-06 ***
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## Residual standard error: 2.935 on 19 degrees of freedom
## Multiple R-squared: 0.6823, Adjusted R-squared: 0.6655
## F-statistic: 40.8 on 1 and 19 DF, p-value: 3.983e-06
plot(deliverTimeModel)
The p-value for Sorting.Time is well below 0.05, so the predictor is significant. The Multiple R-squared is 0.6823, meaning the model explains about 68.23% of the variance in Delivery.Time.
3 - Emp_data -> Build a prediction model for Churn_out_rate
summary(Emp_data)
## Salary_hike Churn_out_rate
## Min. :1580 Min. :60.00
## 1st Qu.:1618 1st Qu.:65.75
## Median :1675 Median :71.00
## Mean :1689 Mean :72.90
## 3rd Qu.:1724 3rd Qu.:78.75
## Max. :1870 Max. :92.00
# Variance and Standard deviation of Salary_hike column
var(Emp_data$Salary_hike)
## [1] 8481.822
sd(Emp_data$Salary_hike)
## [1] 92.09681
# Variance and Standard deviation of Churn_out_rate column
var(Emp_data$Churn_out_rate)
## [1] 105.2111
sd(Emp_data$Churn_out_rate)
## [1] 10.25725
Creating Linear Model for Churn_out_rate
Churn_out_rate_Model <- lm(Churn_out_rate ~ Salary_hike, data = Emp_data)
summary(Churn_out_rate_Model)
## Call:
## lm(formula = Churn_out_rate ~ Salary_hike, data = Emp_data)
## Residuals:
## Min 1Q Median 3Q Max
## -3.804 -3.059 -1.819 2.430 8.072
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 244.36491 27.35194 8.934 1.96e-05 ***
## Salary_hike -0.10154 0.01618 -6.277 0.000239 ***
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## Residual standard error: 4.469 on 8 degrees of freedom
## Multiple R-squared: 0.8312, Adjusted R-squared: 0.8101
## F-statistic: 39.4 on 1 and 8 DF, p-value: 0.0002386
plot(Churn_out_rate_Model)
The p-value for Salary_hike is well below 0.05, so the predictor is significant, and its negative coefficient means the churn-out rate falls as the salary hike rises. The Multiple R-squared is 0.8312, meaning the model explains about 83.12% of the variance in Churn_out_rate.
4 - Salary_hike -> Build a prediction model for Salary_hike
summary(Salary_hike)
## YearsExperience Salary
## Min. : 1.100 Min. : 37731
## 1st Qu.: 3.200 1st Qu.: 56721
## Median : 4.700 Median : 65237
## Mean : 5.313 Mean : 76003
## 3rd Qu.: 7.700 3rd Qu.:100545
## Max. :10.500 Max. :122391
# Variance and Standard deviation of Salary_hike column
var(Salary_hike$YearsExperience)
## [1] 8.053609
sd(Salary_hike$YearsExperience)
## [1] 2.837888
# Variance and Standard deviation of Churn_out_rate column
var(Salary_hike$Salary)
## [1] 751550960
sd(Salary_hike$Salary)
## [1] 27414.43
Creating Linear Model for Salary_hike
Salary_hike_Model <- lm(Salary ~ YearsExperience, data = Salary_hike)
summary(Salary_hike_Model)
## Call:
## lm(formula = Salary ~ YearsExperience, data = Salary_hike)
## Residuals:
## Min 1Q Median 3Q Max
## -7958.0 -4088.5 -459.9 3372.6 11448.0
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 25792.2 2273.1 11.35 5.51e-12 ***
## YearsExperience 9450.0 378.8 24.95 < 2e-16 ***
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## Residual standard error: 5788 on 28 degrees of freedom
## Multiple R-squared: 0.957, Adjusted R-squared: 0.9554
## F-statistic: 622.5 on 1 and 28 DF, p-value: < 2.2e-16
plot(Salary_hike_Model)
The p-value for YearsExperience is well below 0.05, so the predictor is significant. The Multiple R-squared is 0.957, meaning the model explains about 95.7% of the variance in Salary.
I have a dataset containing family information for married couples, with around 10 variables and 600+ observations. The independent variables are gender, age, years married, children, religiousness, etc. The response variable is the number of extramarital affairs. I want to know which factors influence the chances of an extramarital affair.
##
## Call:
## glm(formula = affairs ~ age + yearsmarried + religiousness +
## education + occupation + rating + factor(gender_F) + factor(Children_F),
## data = Affairs)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -5.0503 -1.7226 -0.7947 0.2101 12.7036
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.87201 1.13750 5.162 3.34e-07 ***
## age -0.05098 0.02262 -2.254 0.0246 *
## yearsmarried 0.16947 0.04122 4.111 4.50e-05 ***
## religiousness -0.47761 0.11173 -4.275 2.23e-05 ***
## education -0.01375 0.06414 -0.214 0.8303
## occupation 0.10492 0.08888 1.180 0.2383
## rating -0.71188 0.12001 -5.932 5.09e-09 ***
## factor(gender_F)1 0.05409 0.30049 0.180 0.8572
## factor(Children_F)1 -0.14262 0.35020 -0.407 0.6840
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for gaussian family taken to be 9.575934)
##
## Null deviance: 6529.1 on 600 degrees of freedom
## Residual deviance: 5669.0 on 592 degrees of freedom
## AIC: 3074.3
##
## Number of Fisher Scoring iterations: 2
## Start: AIC=3074.31
## affairs ~ age + yearsmarried + religiousness + education + occupation +
## rating + factor(gender_F) + factor(Children_F)
##
## Df Deviance AIC
## - factor(gender_F) 1 5669.3 3072.3
## - education 1 5669.4 3072.3
## - factor(Children_F) 1 5670.5 3072.5
## - occupation 1 5682.3 3073.7
## <none> 5669.0 3074.3
## - age 1 5717.6 3077.4
## - yearsmarried 1 5830.8 3089.2
## - religiousness 1 5843.9 3090.6
## - rating 1 6005.9 3107.0
##
## Step: AIC=3072.34
## affairs ~ age + yearsmarried + religiousness + education + occupation +
## rating + factor(Children_F)
##
## Df Deviance AIC
## - education 1 5669.6 3070.4
## - factor(Children_F) 1 5670.7 3070.5
## - occupation 1 5685.7 3072.1
## <none> 5669.3 3072.3
## - age 1 5718.2 3075.5
## - yearsmarried 1 5834.6 3087.6
## - religiousness 1 5844.0 3088.6
## - rating 1 6007.1 3105.1
##
## Step: AIC=3070.37
## affairs ~ age + yearsmarried + religiousness + occupation + rating +
## factor(Children_F)
##
## Df Deviance AIC
## - factor(Children_F) 1 5671.1 3068.5
## <none> 5669.6 3070.4
## - occupation 1 5688.9 3070.4
## - age 1 5719.3 3073.6
## - yearsmarried 1 5835.7 3085.7
## - religiousness 1 5844.0 3086.6
## - rating 1 6016.6 3104.1
##
## Step: AIC=3068.53
## affairs ~ age + yearsmarried + religiousness + occupation + rating
##
## Df Deviance AIC
## <none> 5671.1 3068.5
## - occupation 1 5692.3 3068.8
## - age 1 5720.5 3071.8
## - religiousness 1 5845.6 3084.8
## - yearsmarried 1 5854.5 3085.7
## - rating 1 6016.6 3102.1
##
## Call: glm(formula = affairs ~ age + yearsmarried + religiousness +
## occupation + rating, data = Affairs)
##
## Coefficients:
## (Intercept) age yearsmarried religiousness occupation
## 5.60816 -0.05035 0.16185 -0.47632 0.10601
## rating
## -0.71224
##
## Degrees of Freedom: 600 Total (i.e. Null); 595 Residual
## Null Deviance: 6529
## Residual Deviance: 5671 AIC: 3069
The above plot reveals that the error is lowest at k = 3 and jumps back up for larger k, indicating that k = 3 is the optimum value. Now let's build our model using k = 3 and assess it.
Result
predicted.type <- knn(train[1:9],test[1:9],train$Type,k=3)
#Error in prediction
error <- mean(predicted.type!=test$Type)
#Confusion Matrix
confusionMatrix(predicted.type,test$Type)
## Confusion Matrix and Statistics
## Reference
## Prediction 1 2 3 5 6 7
## 1 18 3 3 0 0 0
## 2 2 17 0 2 0 1
## 3 1 1 2 0 0 0
## 5 0 2 0 2 0 0
## 6 0 0 0 0 3 0
## 7 0 0 0 0 0 8
## Overall Statistics
## Accuracy : 0.7692
## 95% CI : (0.6481, 0.8647)
## No Information Rate : 0.3538
## P-Value [Acc > NIR] : 9.716e-12
## Kappa : 0.6853
## Mcnemar's Test P-Value : NA
## Statistics by Class:
## Class: 1 Class: 2 Class: 3 Class: 5 Class: 6 Class: 7
## Sensitivity 0.8571 0.7391 0.40000 0.50000 1.00000 0.8889
## Specificity 0.8636 0.8810 0.96667 0.96721 1.00000 1.0000
## Pos Pred Value 0.7500 0.7727 0.50000 0.50000 1.00000 1.0000
## Neg Pred Value 0.9268 0.8605 0.95082 0.96721 1.00000 0.9825
## Prevalence 0.3231 0.3538 0.07692 0.06154 0.04615 0.1385
## Detection Rate 0.2769 0.2615 0.03077 0.03077 0.04615 0.1231
## Detection Prevalence 0.3692 0.3385 0.06154 0.06154 0.04615 0.1231
## Balanced Accuracy 0.8604 0.8100 0.68333 0.73361 1.00000 0.9444
The above model gave us an accuracy of 76.92%.
2. Implement a KNN model to classify the animals into categories:
Load Data
set.seed(1)
library(class)
d = read.table("zoo.DATA", sep=",", header = FALSE)
d = data.frame(d)
Data Conditioning
Phylogenic traits used for classification:
names(d) <- c("animal", "hair", "feathers", "eggs", "milk", "airborne",
"aquatic", "predator", "toothed", "backbone", "breathes", "venomous",
"fins", "legs", "tail", "domestic", "size", "type")
types <- table(d$type)
d_target <- d[, 18]
d_key <- d[, 1]
d$animal <- NULL
names(types) <- c("mammal", "bird", "reptile", "fish", "amphibian", "insect", "crustacean")
types
## mammal bird reptile fish amphibian insect
## 41 20 5 13 4 8
## crustacean
## 10
summary(d)
## hair feathers eggs milk
## Min. :0.0000 Min. :0.000 Min. :0.0000 Min. :0.0000
## 1st Qu.:0.0000 1st Qu.:0.000 1st Qu.:0.0000 1st Qu.:0.0000
## Median :0.0000 Median :0.000 Median :1.0000 Median :0.0000
## Mean :0.4257 Mean :0.198 Mean :0.5842 Mean :0.4059
## 3rd Qu.:1.0000 3rd Qu.:0.000 3rd Qu.:1.0000 3rd Qu.:1.0000
## Max. :1.0000 Max. :1.000 Max. :1.0000 Max. :1.0000
## airborne aquatic predator toothed
## Min. :0.0000 Min. :0.0000 Min. :0.0000 Min. :0.000
## 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:0.000
## Median :0.0000 Median :0.0000 Median :1.0000 Median :1.000
## Mean :0.2376 Mean :0.3564 Mean :0.5545 Mean :0.604
## 3rd Qu.:0.0000 3rd Qu.:1.0000 3rd Qu.:1.0000 3rd Qu.:1.000
## Max. :1.0000 Max. :1.0000 Max. :1.0000 Max. :1.000
## backbone breathes venomous fins
## Min. :0.0000 Min. :0.0000 Min. :0.00000 Min. :0.0000
## 1st Qu.:1.0000 1st Qu.:1.0000 1st Qu.:0.00000 1st Qu.:0.0000
## Median :1.0000 Median :1.0000 Median :0.00000 Median :0.0000
## Mean :0.8218 Mean :0.7921 Mean :0.07921 Mean :0.1683
## 3rd Qu.:1.0000 3rd Qu.:1.0000 3rd Qu.:0.00000 3rd Qu.:0.0000
## Max. :1.0000 Max. :1.0000 Max. :1.00000 Max. :1.0000
## legs tail domestic size
## Min. :0.000 Min. :0.0000 Min. :0.0000 Min. :0.0000
## 1st Qu.:2.000 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:0.0000
## Median :4.000 Median :1.0000 Median :0.0000 Median :0.0000
## Mean :2.842 Mean :0.7426 Mean :0.1287 Mean :0.4356
## 3rd Qu.:4.000 3rd Qu.:1.0000 3rd Qu.:0.0000 3rd Qu.:1.0000
## Max. :8.000 Max. :1.0000 Max. :1.0000 Max. :1.0000
## type
## Min. :1.000
## 1st Qu.:1.000
## Median :2.000
## Mean :2.832
## 3rd Qu.:4.000
## Max. :7.000
str(d)
## 'data.frame': 101 obs. of 17 variables:
## $ hair : int 1 1 0 1 1 1 1 0 0 1 ...
## $ feathers: int 0 0 0 0 0 0 0 0 0 0 ...
## $ eggs : int 0 0 1 0 0 0 0 1 1 0 ...
## $ milk : int 1 1 0 1 1 1 1 0 0 1 ...
## $ airborne: int 0 0 0 0 0 0 0 0 0 0 ...
## $ aquatic : int 0 0 1 0 0 0 0 1 1 0 ...
## $ predator: int 1 0 1 1 1 0 0 0 1 0 ...
## $ toothed : int 1 1 1 1 1 1 1 1 1 1 ...
## $ backbone: int 1 1 1 1 1 1 1 1 1 1 ...
## $ breathes: int 1 1 0 1 1 1 1 0 0 1 ...
## $ venomous: int 0 0 0 0 0 0 0 0 0 0 ...
## $ fins : int 0 0 1 0 0 0 0 1 1 0 ...
## $ legs : int 4 4 0 4 4 4 4 0 0 4 ...
## $ tail : int 0 1 1 0 1 1 1 1 1 0 ...
## $ domestic: int 0 0 0 0 0 0 1 1 0 1 ...
## $ size : int 1 1 0 1 1 1 1 0 0 0 ...
## $ type : int 1 1 4 1 1 1 1 4 4 1 ...
k <- floor(sqrt(17)) + 1  # rule of thumb: k near sqrt(number of features); here k = 5
m1 <- knn.cv(d, d_target, k, prob = TRUE)
prediction <- m1
cmat <- table(d_target,prediction)
acc <- (sum(diag(cmat)) / length(d_target)) * 100
print(acc)
## [1] 90.09901
Confusion Matrix
data.frame(types)
## Var1 Freq
## 1 mammal 41
## 2 bird 20
## 3 reptile 5
## 4 fish 13
## 5 amphibian 4
## 6 insect 8
## 7 crustacean 10
cmat
## prediction
## d_target 1 2 3 4 5 6 7
## 1 41 0 0 0 0 0 0
## 2 0 20 0 0 0 0 0
## 3 0 1 0 3 1 0 0
## 4 0 0 0 13 0 0 0
## 5 0 0 0 0 4 0 0
## 6 0 0 0 0 0 8 0
## 7 0 0 0 2 1 2 5
Accuracy (%)
acc
## [1] 90.09901