
Introduction To Data Sciences

SEMESTER PROJECT REPORT

Submitted to:
Prof. Irfan

Submitted by

Ayesha Abbas
{2019-CS-668}

Department of Computer Science,


University of Engineering and Technology, Lahore, KSK-Campus.

[10th April 2022]

Contents
1. Datasets Description
   1) Raisin Dataset
   2) HTRU2 Dataset
   3) Parking Birmingham Dataset
2. Bivariate Analysis
   • Raisin Dataset
   • HTRU2 Dataset
   • Parking Birmingham Dataset
Classification
   Confusion Matrices of “Raisin Dataset” when the Number of Features Is Increased One by One
   Confusion Matrices of “HTRU2 Dataset” when the Number of Features Is Increased One by One
   Comparing the Performance of Each Classification Model When the Number of Input Features Is Increased
3. Clustering
4. Boxplot of the Dataset
5. Scatter Plot Before Applying Clustering
6. Elbow Method
7. Statistical Description of the k-Means Variables
8. Scatter Plot of the Data After Applying k-Means
9. Joint Plot After Applying k-Means

1. Datasets Description
Based on my even registration number (2019-CS-668), I have chosen the following datasets for my
project.

1) Raisin Dataset
This dataset concerns the categorization of raisins into two classes, Kecimen and
Besni.

Features:
• Area (The number of pixels within the boundaries of the raisin.)
• Perimeter (The perimeter of the raisin grain, measured as the distance around its boundary in pixels.)
• MajorAxisLength (The length of the major axis, i.e. the longest line that can be drawn on the raisin.)
• MinorAxisLength (The length of the minor axis, i.e. the shortest such line.)
• Eccentricity (A measure of the eccentricity of the ellipse fitted to the raisin.)
• ConvexArea (The number of pixels of the smallest convex hull of the region formed
by the raisin grain.)
• Extent (The ratio of the region formed by the raisin grain to the total pixels in the
bounding box.)
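These features can be inspected with pandas once the dataset is loaded. The following is a minimal sketch on a hand-made stand-in frame (the numeric values are illustrative placeholders, not real measurements; the actual dataset holds 900 rows, 450 per variety):

```python
import pandas as pd

# Miniature stand-in for the Raisin dataset; column names follow the report.
df = pd.DataFrame({
    "Area":            [87524, 75166, 90856, 45928],
    "Perimeter":       [1184.0, 1121.8, 1208.6, 844.3],
    "MajorAxisLength": [442.2, 406.7, 442.3, 286.5],
    "MinorAxisLength": [253.3, 243.0, 266.3, 208.3],
    "Eccentricity":    [0.820, 0.802, 0.798, 0.687],
    "ConvexArea":      [90546, 78789, 93717, 47336],
    "Extent":          [0.759, 0.684, 0.638, 0.700],
    "Class":           ["Kecimen", "Kecimen", "Besni", "Besni"],
})

# Class balance check: the study used an equal number of both varieties.
print(df["Class"].value_counts().to_dict())
```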

2) HTRU2 Dataset
This dataset comes from a survey for pulsars. Each candidate is described by eight
continuous variables and a single class variable. The first four are simple statistics
obtained from the integrated pulse profile (folded profile); the remaining four variables
are obtained in the same way from the DM-SNR curve.

Features:
• Mean of the integrated profile.
• Standard deviation of the integrated profile.
• Excess kurtosis of the integrated profile.
• Skewness of the integrated profile.
• Mean of the DM-SNR curve.
• Standard deviation of the DM-SNR curve.
• Excess kurtosis of the DM-SNR curve.
• Skewness of the DM-SNR curve.
• Class
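The four integrated-profile statistics listed above can be computed with numpy and scipy; a sketch on a synthetic folded profile (the array is randomly generated here, not real telescope data):

```python
import numpy as np
from scipy.stats import kurtosis, skew

# Synthetic stand-in for one integrated (folded) pulse profile;
# real HTRU2 profiles are summarised with the same four statistics.
rng = np.random.default_rng(0)
profile = rng.normal(loc=100.0, scale=15.0, size=256)

stats = {
    "mean": float(np.mean(profile)),
    "std": float(np.std(profile)),
    # scipy's kurtosis uses Fisher's definition by default, i.e.
    # excess kurtosis (0 for a normal distribution).
    "excess_kurtosis": float(kurtosis(profile)),
    "skewness": float(skew(profile)),
}
print(stats)
```

The same four calls applied to the DM-SNR curve yield the remaining four features.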

3) Parking Birmingham Dataset


This dataset contains parking occupancy data published by the Birmingham
city council, which is further used for intelligent parking reservation (IPR) and the
prediction of car parking availability.

Features:
▪ Car parking ID (identifies each car park in the city)
▪ Occupancy (how many spots are occupied in the given time slot)
▪ Capacity (the total capacity of the car park)
▪ Last Updated (date and time when the data was captured)

2. Bivariate Analysis
Bivariate analysis is performed on each dataset using the best features selected through a
feature-selection method.

• Raisin Dataset

▪ Feature Selection

▪ Result:
Hence, the “Major Axis Length” and “Perimeter” features are selected for the
bivariate analysis of the Raisin dataset.
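The report does not show which feature-selection method was used, so the following is only a sketch assuming a univariate scorer (scikit-learn's SelectKBest with the ANOVA F-statistic) on illustrative data; the column names mirror three of the Raisin features, with Extent made uninformative by design:

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif

# Illustrative feature matrix: three columns standing in for
# MajorAxisLength, Perimeter and Extent; y is the variety label.
rng = np.random.default_rng(42)
y = np.repeat([0, 1], 50)
major_axis = np.where(y == 0, 380.0, 510.0) + rng.normal(0, 30, 100)
perimeter = np.where(y == 0, 1050.0, 1350.0) + rng.normal(0, 80, 100)
extent = rng.normal(0.7, 0.05, 100)          # carries no class signal
X = np.column_stack([major_axis, perimeter, extent])
names = ["MajorAxisLength", "Perimeter", "Extent"]

# Keep the two features with the highest ANOVA F-score.
selector = SelectKBest(score_func=f_classif, k=2).fit(X, y)
selected = [n for n, keep in zip(names, selector.get_support()) if keep]
print(selected)
```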

▪ Joint Plot

▪ Explanation for the plot:

The above plot refers to the Raisin dataset, which was obtained
while building a machine vision system to distinguish between two
different varieties of raisins (Kecimen and Besni) grown in Turkey.
A total of 900 raisin grains were obtained, with an equal number of
each variety, to conduct the experiment.

The plot under consideration explores two core features of our
dataset, the “Major Axis Length” and “Perimeter” of the raisins.
Major Axis Length denotes the longest line that can be drawn over a
raisin, and Perimeter measures the distance between the boundaries
of the raisin grain and the pixels around it.

Chiefly, the plot shows a linear relationship between the two
variables: the perimeter of a raisin clearly grows as its Major
Axis Length increases.

The length of the major axis is roughly 200 or greater, with a
maximum value of approximately 997.292. The mean of the major axis
is around 430.93 and the standard deviation of the data is 116.035.

From the bins of the joint plot it can be inferred that most of the
data lies between 800 and 1400 on the X-axis and between 340 and
400 on the Y-axis, i.e. the majority of the sample raisins fall
within these ranges.

Some outliers can also be seen in this visual representation,
suggesting a very small number of exceptional raisins with a Major
Axis Length greater than 800 and a perimeter above 2000.

• HTRU2 Dataset

Feature Selection:

Result:
Hence, the “Mean of the integrated profile” and “Excess kurtosis of the
integrated profile” features are selected for the bivariate analysis of the HTRU2 dataset.

Joint Plot

Explanation of the HTRU-2 Joint Plot:


The above plot is a joint plot of our HTRU2 dataset, which comes
from a survey for pulsars. Pulsars are a rare type of neutron star that produce
radio emission detectable here on Earth. They are of considerable scientific
interest as probes of space-time, the interstellar medium, and states of matter.
Machine learning tools are now being used to automatically label pulsar
candidates to facilitate rapid analysis. In particular, classification systems
are widely adopted, which treat the candidate data sets as binary classification
problems.
For the bivariate plot, the two most important features of the dataset were
selected by a feature-selection method, which ranked the features by their
effect on the classification. The first is the Mean of the integrated pulse
profile and the second is the Excess kurtosis of the integrated profile, where
excess kurtosis measures whether the data is heavy-tailed or light-tailed
relative to a normal distribution.
The relationship in the plot appears to be inverse: as the excess kurtosis of
a pulsar decreases, the mean of the integrated profile increases. More
precisely, the excess kurtosis is inversely related to the mean of the profile.
From the bins of the joint plot it is clear that the data is concentrated in
the range 100–125 on the X-axis and 0–2 on the Y-axis, which conveys that most
of the pulsars have an excess kurtosis within the range 0 to 2 and a mean
within the range 100 to 125.

• Parking Birmingham Dataset

Feature Selection
For the Parking Birmingham dataset, I will use Occupancy Rate and day of
week as the two variables for bivariate analysis:
1. Occupancy Rate (%) = (Occupancy / Capacity) * 100
2. Days of the week are obtained by converting the “LastUpdated” column
into its corresponding day names.
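These two derived variables can be computed with pandas; a minimal sketch, assuming the raw column names `Occupancy`, `Capacity` and `LastUpdated` (and `SystemCodeNumber` for the car park ID) from the original CSV, on a hand-made stand-in frame:

```python
import pandas as pd

# Miniature stand-in for the Birmingham parking CSV.
df = pd.DataFrame({
    "SystemCodeNumber": ["BHMBCCMKT01"] * 3,
    "Capacity": [577, 577, 577],
    "Occupancy": [61, 170, 289],
    "LastUpdated": ["2016-10-04 07:59:42",
                    "2016-10-05 09:32:46",
                    "2016-10-08 11:05:11"],
})

# 1. Occupancy Rate (%) = Occupancy / Capacity * 100
df["OccupancyRate"] = df["Occupancy"] / df["Capacity"] * 100

# 2. Day of week derived from the LastUpdated timestamp.
df["Day"] = pd.to_datetime(df["LastUpdated"]).dt.day_name()

print(df[["OccupancyRate", "Day"]])
```

A seaborn jointplot of these two columns then produces the kind of figure discussed in this section.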

Joint Plot For Dataset.csv:

Explanation for the Dataset.csv plot:


The above plot is based on parking occupancy data published by the Birmingham
city council with the aim of testing several prediction strategies and
analyzing their results. It was used to construct a model that predicts future
occupancy at several parking locations in order to propose alternatives to
drivers.

The plot shows the occupancy rates of the car parks spread over the days of
the week, which supports estimating the occupancy of a car park on a given day
and will later be very helpful for prediction in an IPS (Intelligent Parking
System).
As can be clearly observed from the plot, the occupancy rate on Tuesday,
Wednesday and Thursday is almost equal, lying around 3500 for all the parking
IDs, and increases by a small amount on Friday, from 3500 to approximately 3700.
The visual representation also shows that the occupancy rate on Saturday is
comparatively higher than on all other days, rising above 4,000. Saturday can
therefore be considered the busiest day for car parking, and it will not be
easy for a driver to find a suitable parking place with little effort.
Moving on to Sunday, a minor change in the occupancy rate of approximately 200
is noted relative to Friday. When the Monday slot is compared to Saturday, the
rate can be seen falling from above 4000 to around the 4000 threshold, but
compared with the other days of the week it is still high.
The occupancy rates of the given car park IDs can thus be estimated through
their spread over the days of the week and then further used to classify busy
and non-busy spots.

Classification
To apply classification to both datasets, Raisin and HTRU2, I have randomly divided
the data in an 80:20 ratio: 80% of the data is used for training and the
remaining 20% for testing. Classification is done using the following two
algorithms:
• k-Nearest Neighbor (KNN)
• Naïve Bayes (Gaussian NB)
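A minimal sketch of this pipeline with scikit-learn, on synthetic two-class data (the report's exact preprocessing and KNN's k value are not shown, so defaults are assumed):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import confusion_matrix, classification_report

# Synthetic two-class data standing in for the real features.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 1, (100, 4)), rng.normal(2, 1, (100, 4))])
y = np.repeat([0, 1], 100)

# Random 80:20 split, as in the report.
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, random_state=0)

for model in (KNeighborsClassifier(), GaussianNB()):
    model.fit(X_tr, y_tr)
    pred = model.predict(X_te)
    print(type(model).__name__)
    print(confusion_matrix(y_te, pred))       # the "Confusion Matrix"
    print(classification_report(y_te, pred))  # the "Report"
```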

Confusion Matrices of “Raisin Dataset” when the Number of Features
Is Increased One by One:

1. Using KNN Classification:

a. “2” Features Selected:
Confusion Matrix Report:

Confusion Matrix:

b. “3” Features Selected:
Confusion Matrix Report:

Confusion Matrix:

c. “4” Features Selected:
Confusion Matrix Report:

Confusion Matrix:

d. “5” Features Selected:
Confusion Matrix Report:

Confusion Matrix:

e. “6” Features Selected:
Confusion Matrix Report:

Confusion Matrix:

f. “7” Features Selected:
Confusion Matrix Report:

Confusion Matrix:

2. Using Naïve Bayes Classification:

a. Using “2” Features:
Confusion Matrix Report:

Confusion Matrix:

b. Using “3” Features:
Confusion Matrix Report:

Confusion Matrix:

c. Using “4” Features:
Confusion Matrix Report:

Confusion Matrix:

d. Using “5” Features:
Confusion Matrix Report:

Confusion Matrix:

e. Using “6” Features:
Confusion Matrix Report:

Confusion Matrix:

f. Using All Features:
Confusion Matrix Report:

Confusion Matrix:
Confusion Matrices of “HTRU2 Dataset” when the Number of Features
Is Increased One by One:

1. Using KNN Classification:

a. “2” Features Selected:
Confusion Matrix Report:

Confusion Matrix:

b. “3” Features Selected:
Confusion Matrix Report:

Confusion Matrix:

c. “4” Features Selected:
Confusion Matrix Report:

Confusion Matrix:

d. “5” Features Selected:
Confusion Matrix Report:

Confusion Matrix:

e. “6” Features Selected:
Confusion Matrix Report:

Confusion Matrix:

f. “7” Features Selected:
Confusion Matrix Report:

Confusion Matrix:

g. All Features Selected:
Confusion Matrix Report:

Confusion Matrix:

2. Using Naïve Bayes Classification:

a. “2” Features Selected:
Confusion Matrix Report:

Confusion Matrix:

b. “3” Features Selected:
Confusion Matrix Report:

Confusion Matrix:

c. “4” Features Selected:
Confusion Matrix Report:

Confusion Matrix:

d. “5” Features Selected:
Confusion Matrix Report:

Confusion Matrix:

e. “6” Features Selected:
Confusion Matrix Report:

Confusion Matrix:

f. “7” Features Selected:
Confusion Matrix Report:

Confusion Matrix:

g. All Features Selected:
Confusion Matrix Report:

Confusion Matrix:
Comparing the Performance of Each Classification Model
When the Number of Input Features Is Increased

KNN For Raisin Dataset


| No. of features | True Positive | True Negative | False Positive | False Negative | Accuracy |
|-----------------|---------------|---------------|----------------|----------------|----------|
| 2               | 66            | 24            | 19             | 71             | 76%      |
| 3               | 66            | 24            | 19             | 71             | 76%      |
| 4               | 66            | 24            | 19             | 71             | 76%      |
| 5               | 68            | 22            | 7              | 83             | 83%      |
| 6               | 68            | 22            | 7              | 83             | 83%      |
| 7               | 68            | 22            | 7              | 83             | 83%      |
Results:
From the above table the accuracy of KNN on the Raisin dataset can be compared easily.
We notice that as more features are added to the model, the accuracy of the classifier
improves (from 76% with two features to 83% with five or more).
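The feature-by-feature comparison behind these tables can be generated with a loop; a sketch on synthetic data where the first columns are the most informative (in the report, the feature ordering came from its feature-selection step):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

# Synthetic data: 6 features, later columns progressively less informative.
rng = np.random.default_rng(7)
y = np.repeat([0, 1], 150)
signal = np.outer(y, [2.0, 1.5, 1.0, 0.5, 0.2, 0.0])
X = signal + rng.normal(0, 1, (300, 6))

results = {}
for k in range(2, 7):
    # Use only the first k (most informative) features.
    X_tr, X_te, y_tr, y_te = train_test_split(
        X[:, :k], y, test_size=0.2, random_state=0)
    model = KNeighborsClassifier().fit(X_tr, y_tr)
    results[k] = accuracy_score(y_te, model.predict(X_te))

for k, acc in results.items():
    print(f"{k} features: {acc:.0%}")
```

Swapping `KNeighborsClassifier` for `GaussianNB` produces the Naïve Bayes rows of the comparison.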

Naïve Bayes for Raisin Dataset


| No. of features | True Positive | True Negative | False Positive | False Negative | Accuracy |
|-----------------|---------------|---------------|----------------|----------------|----------|
| 2               | 65            | 25            | 5              | 85             | 83%      |
| 3               | 62            | 28            | 7              | 83             | 80%      |
| 4               | 62            | 28            | 7              | 83             | 80%      |
| 5               | 61            | 29            | 6              | 84             | 80%      |
| 6               | 61            | 29            | 6              | 84             | 80%      |
| 7               | 61            | 29            | 5              | 85             | 81%      |

Results:
From the above table the accuracy of the Naïve Bayes classifier on the Raisin dataset can
be compared easily. We notice that as more features are added to the model, the accuracy
of the classifier decreases slightly (from 83% with two features to around 80-81%).

KNN For HTRU2 Dataset


| No. of features | True Positive | True Negative | False Positive | False Negative | Accuracy |
|-----------------|---------------|---------------|----------------|----------------|----------|
| 2               | 3208          | 44            | 88             | 240            | 96%      |
| 3               | 3227          | 25            | 78             | 250            | 97%      |
| 4               | 3229          | 23            | 65             | 263            | 97%      |
| 5               | 3229          | 23            | 64             | 264            | 97%      |
| 6               | 3226          | 26            | 60             | 268            | 97%      |
| 7               | 3224          | 28            | 61             | 267            | 97%      |
| 8               | 3228          | 24            | 60             | 268            | 97%      |

Results:
From the above table the accuracy of KNN on the HTRU2 dataset can be compared easily.
We notice that as more features are added to the model, the accuracy of the classifier
improves (from 96% with two features to 97% with three or more).

Naïve Bayes for HTRU2 Dataset


| No. of features | True Positive | True Negative | False Positive | False Negative | Accuracy |
|-----------------|---------------|---------------|----------------|----------------|----------|
| 2               | 3187          | 65            | 91             | 237            | 95%      |
| 3               | 3198          | 54            | 54             | 274            | 96%      |
| 4               | 3195          | 57            | 57             | 271            | 96%      |
| 5               | 3162          | 90            | 56             | 272            | 95%      |
| 6               | 3146          | 106           | 56             | 272            | 95%      |
| 7               | 3127          | 125           | 53             | 275            | 95%      |
| 8               | 3117          | 135           | 50             | 278            | 94%      |

Results:
From the above table the accuracy of the Naïve Bayes classifier on the HTRU2 dataset can
be compared easily. We notice that the accuracy stays roughly constant and declines
slightly as more features are added (from 95-96% down to 94% with all eight features).

3. Clustering
Feature Extraction:
To convert the data into useful features, I have added some more columns
to the dataset. All these operations were done using Microsoft Excel's
advanced features.
▪ Occupancy Rate = Occupancy / Capacity
▪ Occupancy Rate (in percentage) = Occupancy Rate * 100
▪ Days: the dates in the “LastUpdated” column were converted into their
corresponding days of the week.

Data Cleaning:
The data provided is not always accurate and precise, because the sensors are
sometimes faulty, or the whole dataset may not be updated for an entire day. To
remove such records from my dataset, I implemented a filtering stage before
applying the clustering algorithm, as follows:
▪ Percentage values beyond the range 0-100% are removed from the
dataset using Excel's filtering.
▪ Data for a car park that stays below 5% for an entire day is also discarded.
▪ Car parks without data are excluded from the study.
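The same filtering rules can be reproduced in pandas instead of Excel; a sketch on a miniature stand-in frame (the column names `SystemCodeNumber`, `Date` and `OccupancyRate` are assumptions based on the feature-extraction step above):

```python
import pandas as pd

# Miniature stand-in for the parking data after feature extraction.
df = pd.DataFrame({
    "SystemCodeNumber": ["A", "A", "B", "B", "C"],
    "Date": ["2016-10-04"] * 5,
    "OccupancyRate": [45.0, 110.0, 2.0, 3.5, -5.0],
})

# Rule 1: drop percentage values outside the 0-100% range.
df = df[df["OccupancyRate"].between(0, 100)]

# Rule 2: discard a car park's whole day if it stayed below 5% all day.
below_5_all_day = (
    df.groupby(["SystemCodeNumber", "Date"])["OccupancyRate"]
      .transform("max") < 5
)
df = df[~below_5_all_day]

print(df["SystemCodeNumber"].tolist())
```

The third rule needs no explicit code: car parks without data simply contribute no rows.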

• No. of Rows and Column in the Dataset before cleansing stage:

• No. of Rows and Column in the Dataset After cleansing stage:

• Summary of Dataset:

• Checking for the missing values in the Dataset

• Viewing the Statistical Summary of the Features:

• Exploring the Length of Unique Values of the features:

4. Boxplot of the Dataset

5. Scatter Plot Before Applying Clustering

6. Elbow Method
Below is a plot of the sum of squared distances for k in the range 1 to 11. The WCSS
value is largest when k = 1 and then starts to decrease.

As we can see from the above graph, “the elbow” (the point of inflection on the
curve), i.e. the best value of k, is 2.
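The WCSS curve described above can be computed from KMeans' `inertia_` attribute; a sketch on synthetic two-cluster data, so that the elbow lands at k = 2 as in the report:

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic data with two clear clusters.
rng = np.random.default_rng(3)
X = np.vstack([rng.normal(0, 1, (100, 2)), rng.normal(8, 1, (100, 2))])

# Within-cluster sum of squares (WCSS) for k = 1 .. 10.
wcss = []
for k in range(1, 11):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    wcss.append(km.inertia_)

# WCSS is largest at k = 1 and drops sharply until the elbow.
print([round(w) for w in wcss])
```

Plotting `wcss` against `range(1, 11)` (e.g. with matplotlib) reproduces the elbow graph.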

7. Statistical Description of the k-Means Variables:

8. Scatter Plot of the Data After Applying k-Means:

9. Joint Plot After Applying k-Means:

Note:
In the above plot, the values from 0 to 6 represent the days of the week, starting
from “Tuesday” and ending at “Monday”.

Results:
By applying the k-means clustering algorithm to the car parking dataset, we have assigned
labels to groups of data according to their similarity. With further enhancements and
more detail added to the algorithm, the k-means model could serve as a future predictor
for car parking.
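The labelling step can be sketched as follows, fitting KMeans with the elbow value k = 2 on stand-in data shaped like the parking features (a day index of 0-6, starting Tuesday, and an occupancy rate; the values here are synthetic):

```python
import numpy as np
from sklearn.cluster import KMeans

# Stand-in for the cleaned parking features: day index and occupancy rate.
rng = np.random.default_rng(5)
day = rng.integers(0, 7, 200).astype(float)
rate = np.concatenate([rng.normal(20, 5, 100), rng.normal(70, 5, 100)])
X = np.column_stack([day, rate])

# k = 2 as chosen by the elbow method.
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
labels = km.labels_

# Two groups of similar occupancy behaviour (e.g. busy vs non-busy).
print(np.bincount(labels))
```

Colouring the scatter and joint plots by `labels` yields the clustered figures shown in this section.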
