Submitted to:
Prof. Irfan
Submitted by
Ayesha Abbas
{2019-CS-668}
Contents
1. Datasets Description
1) Raisin Dataset
2) HTRU2 Dataset
3) Parking Birmingham Dataset
2. Bivariate Analysis
• Raisin Dataset
• HTRU2 Dataset
• Parking Birmingham Dataset
Classification
Confusion Matrices of “Raisin Dataset” when Number of features are increased one by one
Confusion Matrices of “HTRU2 Dataset” when Number of features are increased one by one
Comparing The Performances of Each Classification Model When the Number of Input Features Are Increased
3. Boxplot of the Dataset
4. Scatter Plot Before Applying Clustering
1. Datasets Description
Because my registration number (2019-CS-668) is even, I have chosen the following datasets for
my project.
1) Raisin Dataset
This dataset is about classifying raisins into two classes, Kecimen and Besni.
Features:
• Area (the number of pixels within the boundaries of the raisin)
• Perimeter (the length of the raisin grain's boundary, computed from the distances
between the boundary and the pixels around it)
• MajorAxisLength (the length of the main axis, i.e. the longest line that can be drawn
on the raisin)
• MinorAxisLength (the length of the small axis, i.e. the shortest line that can be drawn
on the raisin)
• Eccentricity (the eccentricity of the ellipse that has the same moments as the raisin)
• ConvexArea (the number of pixels of the smallest convex shell of the region formed
by the raisin grain)
• Extent (the ratio of the region formed by the raisin grain to the total pixels in the
bounding box)
2) HTRU2 Dataset
This dataset describes pulsar candidates collected during the High Time Resolution
Universe (HTRU) survey. Each candidate is described by 8 continuous variables and a
single class variable. The first four are simple statistics obtained from the integrated
pulse profile (folded profile). The remaining four variables are similarly obtained from
the DM-SNR curve.
Features:
• Mean of the integrated profile.
• Standard deviation of the integrated profile.
• Excess kurtosis of the integrated profile.
• Skewness of the integrated profile.
• Mean of the DM-SNR curve.
• Standard deviation of the DM-SNR curve.
• Excess kurtosis of the DM-SNR curve.
• Skewness of the DM-SNR curve.
• Class
3) Parking Birmingham Dataset
Features:
▪ Car park ID (identifier of each car park in the city)
▪ Occupancy (how many spaces are occupied in the specific time slot)
▪ Capacity (the total number of spaces in the car park)
▪ Last Updated (date and time when the data was captured)
2. Bivariate Analysis
Bivariate analysis is done on each dataset using the two best features selected through a
feature selection method.
• Raisin Dataset
▪ Feature Selection
▪ Result:
Hence the “Major Axis Length” and “Perimeter” features are selected for
bivariate analysis of Raisins.
▪ Joint Plot
The major axis length is approximately 200 or greater, and its
maximum value approaches 997.292. The mean major axis length is
around 430.93 and its standard deviation is 116.035.
• HTRU2 Dataset
Feature Selection:
Result:
Hence the “Excess kurtosis” and “Mean” features are selected for
bivariate analysis of HTRU2.
Joint Plot
The joint plot conveys that most of the pulsar candidates have an excess kurtosis within
the range of 0 to 2 and a mean within the range of 100 to 125.
• Parking Birmingham Dataset
Feature Selection
For the Parking Birmingham Dataset, I use Occupancy Rate and Day of the
week as the two variables for bivariate analysis.
1. Occupancy Rate (%) = (Occupancy / Capacity) * 100
2. Days of the week are obtained by converting the “LastUpdated” column
into its corresponding days.
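The two derived variables can be sketched in a few lines of Python. This is only an illustration: the row below is an example in the shape of the dataset's columns, and the timestamp format string is an assumption about how “LastUpdated” is stored.

```python
from datetime import datetime

# Python's weekday() numbers the days starting from Monday = 0.
DAY_NAMES = ["Monday", "Tuesday", "Wednesday", "Thursday",
             "Friday", "Saturday", "Sunday"]

def occupancy_rate_percent(occupancy, capacity):
    """Occupancy Rate (%) = (Occupancy / Capacity) * 100."""
    return occupancy / capacity * 100

def day_of_week(last_updated, fmt="%Y-%m-%d %H:%M:%S"):
    """Return the weekday name for a LastUpdated timestamp string.

    The format string is an assumption; adjust it to the actual column.
    """
    return DAY_NAMES[datetime.strptime(last_updated, fmt).weekday()]

# Illustrative row only.
row = {"Occupancy": 61, "Capacity": 577, "LastUpdated": "2016-10-04 07:59:42"}
rate = occupancy_rate_percent(row["Occupancy"], row["Capacity"])
day = day_of_week(row["LastUpdated"])
print(f"{rate:.2f}% on {day}")
```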
The plot shows the occupancy rates of the car parks spread over the days of the
week. It illustrates how the occupancy of a car park on a given day can be
estimated, which will later be helpful for prediction in an IPS (Intelligent
Parking System).
The plot shows that the occupancy rate on Tuesday, Wednesday and Thursday is
almost equal, at around 3500 across all the car park IDs. It increases slightly
on Friday, from 3500 to approximately 3700.
On Saturday the occupancy rate is higher than on any other day, at above 4000.
Saturday is therefore the busiest day for car parking, and a driver will not
easily find a suitable parking place.
On Sunday the occupancy rate differs only slightly from Friday's, by
approximately 200. On Monday the occupancy rate falls from above 4000 (Saturday)
to around the 4000 mark, but it remains high compared with the other days of
the week.
The occupancy rates of the given car park IDs can be estimated from their
spread over the days of the week and then used to classify busy and
non-busy car parks.
Classification
To apply Classification for both datasets, Raisins and HTRU2, I have randomly divided
the data into 80:20 Ratio. Which means that the 80% data will be used for training
purpose and the remaining 20% data will be used for testing purpose. Classification is
done by using the following two algorithms.
• k-Nearest Neighbor (KNN)
• Naïve Bayes (Gaussian NB)
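The split and the two classifiers can be illustrated with a dependency-free sketch. It is not the code used for this report: the data is synthetic, the helper names are made up for the example, and k = 3 is an arbitrary choice for KNN.

```python
import math
import random
from collections import Counter

def train_test_split(rows, labels, test_ratio=0.2, seed=0):
    """Shuffle the row indices and hold out test_ratio of them for testing."""
    idx = list(range(len(rows)))
    random.Random(seed).shuffle(idx)
    n_test = round(len(rows) * test_ratio)
    test, train = idx[:n_test], idx[n_test:]
    return ([rows[i] for i in train], [labels[i] for i in train],
            [rows[i] for i in test], [labels[i] for i in test])

def knn_predict(train_x, train_y, point, k=3):
    """Majority vote among the k training points closest to `point`."""
    nearest = sorted(zip(train_x, train_y),
                     key=lambda p: math.dist(p[0], point))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

def gaussian_nb_fit(train_x, train_y):
    """Estimate the prior and the per-feature mean and variance of each class."""
    model = {}
    for label in sorted(set(train_y)):
        cols = list(zip(*[x for x, y in zip(train_x, train_y) if y == label]))
        n = len(cols[0])
        means = [sum(c) / n for c in cols]
        # A tiny constant keeps the variance strictly positive.
        variances = [sum((v - m) ** 2 for v in c) / n + 1e-9
                     for c, m in zip(cols, means)]
        model[label] = (n / len(train_y), means, variances)
    return model

def gaussian_nb_predict(model, point):
    """Pick the class with the largest log posterior (independent Gaussians)."""
    def log_posterior(label):
        prior, means, variances = model[label]
        return math.log(prior) + sum(
            -0.5 * math.log(2 * math.pi * var) - (v - m) ** 2 / (2 * var)
            for v, m, var in zip(point, means, variances))
    return max(model, key=log_posterior)

def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# Synthetic two-feature data: two well-separated blobs stand in for the classes.
rng = random.Random(42)
rows = [(rng.uniform(-1, 1), rng.uniform(-1, 1)) for _ in range(50)] + \
       [(rng.uniform(9, 11), rng.uniform(9, 11)) for _ in range(50)]
labels = ["Kecimen"] * 50 + ["Besni"] * 50

train_x, train_y, test_x, test_y = train_test_split(rows, labels)
knn_acc = accuracy(test_y, [knn_predict(train_x, train_y, p) for p in test_x])
model = gaussian_nb_fit(train_x, train_y)
nb_acc = accuracy(test_y, [gaussian_nb_predict(model, p) for p in test_x])
print(len(train_x), len(test_x), knn_acc, nb_acc)
```

To repeat the experiment with a growing number of features, the same functions can be called on rows truncated to the first n selected features.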
Confusion Matrices of “Raisin Dataset” when Number of features
are increased one by one:
1. Using KNN Classification:
a. “2” Features Selected:
Confusion Matrix Report:
Confusion Matrix:
b. “3” Features Selected:
Confusion Matrix:
c. “4” Features Selected:
Confusion Matrix:
d. “5” Features Selected:
Confusion Matrix Report:
Confusion Matrix:
e. “6” Features Selected:
Confusion Matrix Report:
Confusion Matrix:
f. “7” Features Selected:
Confusion Matrix Report:
Confusion Matrix:
2. Using Naïve Bayes Classification:
a. Using “2” Features:
Confusion Matrix:
b. Using “3” Features:
Confusion Matrix Report:
Confusion Matrix:
c. Using “4” Features:
Confusion Matrix Report:
Confusion Matrix:
d. Using “5” Features:
Confusion Matrix Report:
Confusion Matrix:
e. Using “6” Features:
Confusion Matrix:
f. Using “7” Features:
Confusion Matrix:
Confusion Matrices of “HTRU2 Dataset” when Number of features
are increased one by one:
1. Using KNN Classification:
a. “2” Features Selected:
Confusion Matrix:
b. “3” Features Selected:
Confusion Matrix Report:
Confusion Matrix:
c. “4” Features Selected:
Confusion Matrix Report:
Confusion Matrix:
d. “5” Features Selected:
Confusion Matrix Report:
Confusion Matrix:
e. “6” Features Selected:
Confusion Matrix Report:
Confusion Matrix:
f. “7” Features Selected:
Confusion Matrix Report:
Confusion Matrix:
g. All Features Selected:
Confusion Matrix Report:
Confusion Matrix:
2. Using Naïve Bayes Classification:
a. “2” Features Selected:
Confusion Matrix:
b. “3” Features Selected:
Confusion Matrix:
c. “4” Features Selected:
Confusion Matrix:
d. “5” Features Selected:
Confusion Matrix:
e. “6” Features Selected:
Confusion Matrix Report:
Confusion Matrix:
f. “7” Features Selected:
Confusion Matrix:
g. All Features Selected:
Confusion Matrix:
Comparing The Performances of Each Classification Model
When the Number of Input Features Are Increased.
Results:
From the above table, the accuracy of the Naïve Bayes classifier on the Raisin dataset can be
compared across feature counts. The accuracy of the classifier decreases as more features are
added to the model.
Results:
From the above table, the accuracy of the KNN classifier on the HTRU2 dataset can be compared
across feature counts. The accuracy of the classifier improves as more features are added to
the model.
6 3146 106 56 272 95%
7 3127 125 53 275 95%
8 3117 135 50 278 94%
Results:
From the above table, the accuracy of the Naïve Bayes classifier on the HTRU2 dataset can be
compared across feature counts. The accuracy of the classifier improves as more features are
added to the model.
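The accuracy column can be recomputed from the four counts in each surviving row. The sketch below assumes the counts are listed in the order true negatives, false positives, false negatives, true positives; the column headers did not survive, so this ordering is an assumption (accuracy only needs the first and fourth counts to be the correct predictions).

```python
# (features, tn, fp, fn, tp) copied from the rows for 6, 7 and 8 features.
# The meaning of the four count columns is assumed, not stated in the table.
rows = [
    (6, 3146, 106, 56, 272),
    (7, 3127, 125, 53, 275),
    (8, 3117, 135, 50, 278),
]

def accuracy(tn, fp, fn, tp):
    """Share of correctly classified test samples."""
    return (tn + tp) / (tn + fp + fn + tp)

results = {n: accuracy(tn, fp, fn, tp) for n, tn, fp, fn, tp in rows}
for n, acc in results.items():
    print(n, f"{acc:.4f}")
```

Each row sums to 3580 samples, which matches a 20% test split of the dataset.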
3. Clustering
Feature Extraction:
To convert the data into useful features, I added some more columns
to the dataset. All these operations were done using Microsoft Excel’s
advanced features.
▪ Occupancy Rate = Occupancy / Capacity
▪ Occupancy Rate (in percentage) = Occupancy Rate * 100
▪ Days: the dates in the “LastUpdated” column converted into their
corresponding days of the week.
Data Cleaning:
The data provided is not fully accurate because the sensors are sometimes
faulty, and the whole dataset may even go without an update for a whole day.
To remove these cases from my dataset, I implemented a filtering stage
before applying the clustering algorithm, as follows:
▪ Percentage values outside the range 0-100% are removed from
the dataset using the filtering method in Excel.
▪ Data for a car park whose occupancy rate stays below 5% for an entire day
is also discarded.
▪ Car parks without data are excluded from the study.
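The filtering stage was carried out in Excel; the same rules can be sketched in Python for illustration. The record layout (`park`, `date`, `rate`) and the reading of the 5% rule as "every reading of that car park on that day is below 5%" are assumptions made for this example.

```python
from collections import defaultdict

def clean(records, low_threshold=5.0):
    """Apply the filtering rules to a list of readings.

    Each record is a dict with the hypothetical keys 'park', 'date' and
    'rate' (occupancy rate in percent).
    """
    # Rule 1: drop readings whose percentage lies outside 0-100%.
    valid = [r for r in records if 0 <= r["rate"] <= 100]

    # Rule 2: drop a car park's whole day if every reading is below 5%.
    by_day = defaultdict(list)
    for r in valid:
        by_day[(r["park"], r["date"])].append(r)
    kept = []
    for day_records in by_day.values():
        if all(r["rate"] < low_threshold for r in day_records):
            continue
        kept.extend(day_records)

    # Rule 3: car parks with no remaining data simply do not appear in `kept`.
    return kept

# Made-up readings for two car parks.
records = [
    {"park": "A", "date": "d1", "rate": 50.0},
    {"park": "A", "date": "d1", "rate": 120.0},  # outside 0-100%
    {"park": "A", "date": "d2", "rate": 2.0},    # whole day below 5%
    {"park": "A", "date": "d2", "rate": 3.0},
    {"park": "B", "date": "d1", "rate": 4.0},    # kept: the day also has 30%
    {"park": "B", "date": "d1", "rate": 30.0},
]
cleaned = clean(records)
print(len(cleaned))
```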
• Summary of Dataset:
Viewing the Statistical summary of the Features:
3. Boxplot of the Dataset
4. Scatter Plot Before Applying Clustering
5. Elbow Method
Below is a plot of the sum of squared distances for “k” in a range from 1 to 11. The value of
WCSS (within-cluster sum of squares) is largest when k = 1 and then starts decreasing.
As the graph shows, “the Elbow” (the point of inflection on the curve), and hence the best
value of k, is 2.
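The quantity plotted in the elbow method can be sketched without any library. The example below runs a basic Lloyd's k-means on eight made-up points and reports the WCSS for several values of k. The deterministic farthest-point initialisation is a choice made here for reproducibility, not necessarily the initialisation used for the plot.

```python
import math

def kmeans_wcss(points, k, max_iter=100):
    """Run a basic k-means and return the within-cluster sum of squares."""
    # Deterministic farthest-point initialisation (illustrative choice).
    centroids = [points[0]]
    while len(centroids) < k:
        centroids.append(
            max(points, key=lambda p: min(math.dist(p, c) for c in centroids)))

    def assign(cents):
        # Index of the nearest centroid for every point.
        return [min(range(k), key=lambda j: math.dist(p, cents[j]))
                for p in points]

    for _ in range(max_iter):
        labels = assign(centroids)
        updated = []
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            # Move the centroid to the mean of its members; keep it if empty.
            updated.append(tuple(sum(c) / len(members) for c in zip(*members))
                           if members else centroids[j])
        if updated == centroids:
            break
        centroids = updated

    labels = assign(centroids)
    return sum(math.dist(p, centroids[lab]) ** 2
               for p, lab in zip(points, labels))

# Two made-up groups of four points each.
points = [(0, 0), (0, 1), (1, 0), (1, 1),
          (10, 10), (10, 11), (11, 10), (11, 11)]
wcss = {k: kmeans_wcss(points, k) for k in range(1, 6)}
for k, value in wcss.items():
    print(k, round(value, 3))
```

On this toy data the WCSS drops sharply from k = 1 to k = 2 and only slightly afterwards, which is the shape the elbow method looks for.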
7. Scatter Plot of the Data After Applying k-Means:
Note:
In the above plot, the values from 0 to 6 represent the days of the week, starting from
“Tuesday” (0) and ending at “Monday” (6).
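The day encoding described in the note can be reproduced as follows. This is a small sketch: the function name is made up, and it relies on Python's `date.weekday()`, which numbers the days from Monday = 0.

```python
from datetime import date

# Day labels in the order used by the plot: 0 = Tuesday ... 6 = Monday.
DAY_NAMES = ["Tuesday", "Wednesday", "Thursday", "Friday",
             "Saturday", "Sunday", "Monday"]

def plot_day_index(d):
    """Map a date to 0..6 with Tuesday = 0 and Monday = 6."""
    # weekday() uses Monday = 0, so shift by one day and wrap around.
    return (d.weekday() - 1) % 7

tuesday_index = plot_day_index(date(2016, 10, 4))   # a Tuesday
monday_index = plot_day_index(date(2016, 10, 10))   # a Monday
print(tuesday_index, DAY_NAMES[tuesday_index])
print(monday_index, DAY_NAMES[monday_index])
```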
Results:
By applying the k-means clustering algorithm to the car parking dataset, we have assigned
labels to groups of data according to their similarity. With further enhancement and more
detail added to the algorithm, k-means clustering could serve as a basis for predicting car
park occupancy in the future.