Dr. Eick
COSC 4335 Data Mining Fall 2015
Draft Project1: (Exploratory) Data Analysis
Group Project (Groups of 2 or 3)
Due: Monday, February 23, 11p (electronic Submission)
Last Updated: January 26, 2015; 5p
Download the Statlog (Vehicle Silhouettes) Data Set from
http://archive.ics.uci.edu/ml/datasets/Statlog+(Vehicle+Silhouettes), limiting your
analysis to the following subset of the dataset involving just 5 attributes; use all
examples to create the subset:
COMPACTNESS (average perim)**2/area (1st attribute)
CIRCULARITY (average radius)**2/area (2nd attribute)
SCATTER RATIO (inertia about minor axis)/(inertia about major axis) (7th attribute)
ELONGATEDNESS area/(shrink width)**2 (8th attribute)
HOLLOWS RATIO (area of hollows)/(area of bounding polygon) (18th attribute)
Class: OPEL, SAAB, BUS, VAN
Apply the following exploratory data analysis techniques using R to your dataset:
1. Compute the mean value and standard deviation of the 5 numerical attributes[1]. 1
point

Attribute        Mean        Standard Deviation
COMPACTNESS      93.67849    8.234474
CIRCULARITY      44.8617     6.169866
SCATTER RATIO    168.8392    33.24498
ELONGATEDNESS    40.93381    7.81156
HOLLOWS RATIO    195.6324    7.438797
2. Compute the covariance matrix for the five numerical attributes you are analyzing;
also compute the correlation for each of the three pairs of attributes. Interpret the
statistical findings! 2 points
Covariance matrix
                COMPACTNESS   CIRCULARITY   SCATTER_RATIO   ELONGATEDNESS   HOLLOWS_RATIO
COMPACTNESS        67.80657     35.201637       222.56364       -50.72900       22.391727
CIRCULARITY        35.20164     38.067242       176.47597       -39.94289        1.775135
SCATTER_RATIO     222.56364    176.475966      1105.22856      -252.78343       29.663911
ELONGATEDNESS     -50.72900    -39.942893      -252.78343        61.02047      -12.593593
HOLLOWS_RATIO      22.39173      1.775135        29.66391       -12.59359       55.335707

[1] This is more a verification that you have the correct dataset!

Compactness, Circularity, Scatter_Ratio, and Hollows_Ratio have positive linear
relationships with all other attributes except Elongatedness.
This means that vehicles with a high value of one of Compactness, Circularity,
Scatter_Ratio, and Hollows_Ratio will usually also have high values of the other
three. In contrast, vehicles with a high value of Elongatedness will usually have low
values of Compactness, Circularity, Scatter_Ratio, and Hollows_Ratio.

Correlation matrix
                COMPACTNESS   CIRCULARITY   SCATTER RATIO   ELONGATEDNESS   HOLLOWS RATIO
COMPACTNESS       1             0.6928692     0.8130033      -0.788647        0.3655518
CIRCULARITY       0.6928692     1             0.8603671      -0.8287548       0.03867702
SCATTER RATIO     0.8130033     0.8603671     1              -0.9733853       0.1199498
ELONGATEDNESS    -0.788647     -0.8287548    -0.9733853       1              -0.2167251
HOLLOWS RATIO     0.3655518     0.03867702    0.1199498      -0.2167251       1

The positive linear relationship between Compactness and Hollows_Ratio is fairly
weak (correlation about 0.37), whereas the positive relationships between Compactness
and Circularity (about 0.69) and between Compactness and Scatter_Ratio (about 0.81)
are strong. The negative relationship between Compactness and Elongatedness (about
-0.79) is also strong.
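
The link between the two matrices can be sketched in a few lines. The assignment itself asks for R; the following is only an illustrative Python sketch, and the function name correlation_from_covariance is a hypothetical helper, not part of the submitted solution. It uses the covariance values reported above and the identity corr(i, j) = cov(i, j) / (sd_i * sd_j), where sd_i is the square root of the i-th diagonal entry.

```python
import math

names = ["COMPACTNESS", "CIRCULARITY", "SCATTER_RATIO",
         "ELONGATEDNESS", "HOLLOWS_RATIO"]

# Covariance matrix as reported in the answer to question 2.
cov = [
    [67.80657, 35.201637, 222.56364, -50.72900, 22.391727],
    [35.20164, 38.067242, 176.47597, -39.94289, 1.775135],
    [222.56364, 176.475966, 1105.22856, -252.78343, 29.663911],
    [-50.72900, -39.942893, -252.78343, 61.02047, -12.593593],
    [22.39173, 1.775135, 29.66391, -12.59359, 55.335707],
]


def correlation_from_covariance(cov):
    """Return (standard deviations, correlation matrix) for a covariance matrix."""
    n = len(cov)
    # The diagonal holds the variances, so their square roots are the
    # standard deviations asked for in question 1.
    sd = [math.sqrt(cov[i][i]) for i in range(n)]
    corr = [[cov[i][j] / (sd[i] * sd[j]) for j in range(n)] for i in range(n)]
    return sd, corr


sd, corr = correlation_from_covariance(cov)
for name, s, row in zip(names, sd, corr):
    print(f"{name:<14} sd={s:9.5f}  " + " ".join(f"{v:8.4f}" for v in row))
```

Because the standard deviations fall out of the diagonal, the same run also cross-checks the table from question 1 against the covariance matrix.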
3. Create a scatter plot for the last two numerical attributes of your dataset. Interpret the
scatter plot! 2 points

The linear relationship appears to be positive for Elongatedness in the range [25, 35]
and negative for Elongatedness in the range (35, 45]. There appears to be no linear
relationship between the two attributes for Elongatedness > 45.

Overall, the scatter plot suggests that the linear relationship between the two
attributes is very weak, which is consistent with their correlation of about -0.22.
4. Create histograms for each of the 5 numerical attributes. Then create a histogram for
the ELONGATEDNESS attribute for instances of OPEL, instances of SAAB,
instances of BUS, and instances of VAN (4 Histograms); interpret the 5 histograms
you generated for the ELONGATEDNESS attribute. 6 points

Elongatedness appears to have two modes. The distribution is not symmetric; the
histogram appears somewhat right-skewed.

Elongatedness for the Opel class appears to have only one mode, and the histogram is
clearly right-skewed.

Elongatedness for Saab appears to have only one mode, and the histogram is also
right-skewed.

Elongatedness for Bus seems to have two modes. If the leftmost bar is not
considered, the histogram appears to be left-skewed.

Elongatedness for Van fluctuates considerably, which suggests that it is multi-modal.
Overall, however, the histogram looks right-skewed.
5. Create box plots for the COMPACTNESS attribute for the instances of each class and
a fifth box plot for all instances in the dataset. Do the same for the HOLLOWS
RATIO attribute. Interpret and compare the 5 box plots for each attribute! 5 points

For Compactness, the box plots for opel, saab, van, and all instances have the 50th
percentile (median) roughly midway between the 25th and 75th percentiles, whereas the
median of the box plot for bus seems to be closer to the 25th percentile. The data for
van are quite compact, lying between values 87 and 93, while the data in the other box
plots have a wider spread.

For Hollows Ratio, the medians of opel, saab, and van appear to be midway between the
25th and 75th percentiles, whereas the median for bus looks closer to the 25th
percentile and the median for all instances looks closer to the 75th percentile.
Moreover, the data for opel, saab, van, and all instances seem to be concentrated
between values 192 and 202, while the data for bus are spread more widely.
6. Create supervised scatter plots/supervised density plots for 4 pairs[2] of the 5 attributes
(for each pair of attributes visualize it using a traditional scatter plot and a density
plot) and the class variable; use different colors for the class variable. Interpret the
scatter plots! 5 points

[2] In general, there are 10 pairs, but you only need to visualize 4 of them!

+saab: it appears that circularity and compactness have a positive linear relationship.
+van: it appears that circularity and compactness do not have a strong linear
relationship, because the compactness seems to stay unchanged as the circularity
changes.
+bus: it appears that circularity and compactness have a negative linear relationship
on the interval [35, 42] on the x-axis, and a positive linear relationship on the interval
[42, 60] on the x-axis.
+opel: it appears that circularity and compactness have a positive linear relationship.

+saab: it appears that scatter ratio and compactness have a fairly strong positive
linear relationship.
+van: it appears that scatter ratio and compactness do not have a strong linear
relationship, because the data points look concentrated in a rectangular shape.
+bus: it appears that scatter ratio and compactness do not have a linear
relationship around the value 150 on the x-axis. However, from 150 to 250 on the
x-axis, a positive linear relationship seems to exist.
+opel: it appears that scatter ratio and compactness have a fairly strong positive
linear relationship.

+It appears that there are positive linear relationships between elongatedness and
compactness for the classes bus, opel, and saab. However, there seems to be no
linear relationship for the van class.

+saab: there appears to be a fairly weak linear relationship between Hollows_Ratio
and Compactness.
+van: there appears to be no linear relationship between Hollows_Ratio and
Compactness.
+bus: there appears to be a negative linear relationship between Hollows_Ratio
and Compactness on the interval (95, 100] on the y-axis, but a positive linear
relationship on the interval [80, 95] on the y-axis.

7. Create a Star plot for the first 10 instances of class BUS and the first 20 instances of
SAAB (based on the order in the file); interpret the 20 star plots; star plots should be
constructed for the 4 numerical attributes! 3 points

Among the 20 instances of saab, the dominant shapes appear to be those of instances
3, 19, 25, 28, 39, 45, 91, and 93. The shapes of instances 10, 25, 44, 50, 52, 57, 77,
and 78 form a second dominant group. In general, it seems that these 20 instances
could be divided into two major groups (similar shapes grouped together).

8. Fit a linear model that predicts the class attribute (treat it as a numerical attribute that
takes values 0, 1, 2 and 3 with OPEL=0, SAAB=1, BUS=2, and VAN=3) using the 5
attributes as the independent variables. Report the R^2 of the linear model and the
coefficients of each attribute in the obtained regression function. Do the coefficients
tell you anything about the importance of the attribute in predicting the class variable;
if yes, what? Repeat the experiment using OPEL=0, SAAB=0, BUS=2, and VAN=0,
and answer the same questions! 6 points
OPEL=0, SAAB=1, BUS=2, and VAN=3

Model                  Coefficient   y-intercept
class~compactness      -0.02352       3.70878
class~circularity      -0.01936       2.40081
class~scatter_ratio    -0.005386      2.461936
class~elongatedness     0.02157       0.67257
class~hollow_ratio     -0.05095      11.40752

Interpretation
class~compactness: The coefficient indicates that there is a weak negative
relationship between class and compactness. It means it is not appropriate to use
compactness to predict class.
class~circularity: The coefficient indicates that there is a weak negative
relationship between class and circularity. It means it is not appropriate to use
circularity to predict class.
class~scatter_ratio: The coefficient indicates that there is a very weak negative
relationship between class and scatter_ratio. It means it is not appropriate to use
scatter_ratio to predict class.
class~elongatedness: The coefficient indicates that there is a weak positive
relationship between class and elongatedness. It means it is not appropriate to use
elongatedness to predict class.
class~hollow_ratio: The coefficient indicates that there is a weak negative
relationship between class and hollow_ratio. It means it is not appropriate to use
hollow_ratio to predict class.

OPEL=0, SAAB=0, BUS=2, and VAN=0

Model                  Coefficient   y-intercept
class~compactness      -0.01588       2.0029
class~circularity       0.002807      0.389432
class~scatter_ratio     0.0005526     0.4220647
class~elongatedness    -0.006926      0.798889
class~hollow_ratio     -0.04016       8.37151

Interpretation
class~compactness: The negative linear relationship is even weaker.
class~circularity: The linear relationship now is positive, but very weak.
class~scatter_ratio: The linear relationship now is positive, but very weak.
class~elongatedness: The linear relationship now is negative, but is very weak.
class~hollow_ratio: The negative linear relationship is weaker.
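
One way to sanity-check the values reported for the second coding (OPEL=0, SAAB=0, BUS=2, VAN=0) is to use a general property of least squares: a fitted line with an intercept passes through the point of means, so intercept + coefficient * mean(attribute) should give the same number, the mean of the recoded class, for every single-attribute model. The Python sketch below is illustrative only (the assignment uses R); the names means, models, and predicted_at_mean are hypothetical, and the numbers are the ones reported in questions 1 and 8.

```python
# Attribute means as reported in the answer to question 1.
means = {
    "compactness": 93.67849,
    "circularity": 44.8617,
    "scatter_ratio": 168.8392,
    "elongatedness": 40.93381,
    "hollow_ratio": 195.6324,
}

# (coefficient, y-intercept) pairs reported for the coding
# OPEL=0, SAAB=0, BUS=2, VAN=0 in question 8.
models = {
    "compactness": (-0.01588, 2.0029),
    "circularity": (0.002807, 0.389432),
    "scatter_ratio": (0.0005526, 0.4220647),
    "elongatedness": (-0.006926, 0.798889),
    "hollow_ratio": (-0.04016, 8.37151),
}


def predicted_at_mean(slope, intercept, x_mean):
    """Value of the fitted line at the attribute mean."""
    return intercept + slope * x_mean


at_mean = {name: predicted_at_mean(*models[name], means[name]) for name in models}
for name, value in at_mean.items():
    print(f"class~{name}: {value:.4f}")

# All five values should agree up to the rounding of the reported numbers.
spread = max(at_mean.values()) - min(at_mean.values())
print(f"spread: {spread:.5f}")
```

With this coding the mean of the class variable is twice the share of BUS instances, so the common value also indicates how large that share is.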

9. Create 3 decision tree models with 20 or fewer nodes for the dataset (leaf nodes count;
do not submit models with more than 20 nodes!). Explain how the 3 decision tree
models were obtained. Report the training accuracy and the testing accuracy of each
decision tree; interpret the learnt decision trees. What do they tell you about the
importance of the 5 attributes for the classification problem? 6 points

10. Write a conclusion (at most 18 sentences!) that assesses the difficulty of predicting
the class attribute using the selected 5 attributes and assesses which of the 5
attributes is more important/less important for the classification task based on the
findings you obtained answering questions 1-9! Moreover, if you discovered
something else interesting about the dataset, also mention it in your conclusion. 7
points total (and possibly up to 5 extra points)
Based on the questions above, we can assess the difficulty of predicting the class
attribute using the five selected attributes. In particular, in question 6 we chose pairs
of the 5 attributes and analyzed their relationship for the classes saab, van, bus, and
opel. Each pair of attributes shows a different degree of linear relationship for each
class. However, fitting a linear model that predicts the class attribute with the five
attributes as independent variables, as in question 8, is not an appropriate approach.
The results of question 8 show only a weak relationship between the class and each of
the five attributes taken individually. Thus, using the attributes as independent
variables in a linear model to predict the class is inappropriate. Finally, the individual
attributes do not play equal roles, and they depend on each other.

11) Similarity Assessment (7 points!)


Design a distance function to assess the similarity of customers of a supermarket; each
customer in a supermarket is characterized by the following attributes[3]:
a) Ssn
b) Items_Bought (The set of items the customer bought last month; this is a set)
c) Age (is an ordinal attribute having values: old, medium, young, and teenager)
d) Amount_spend (Average amount spent per purchase in dollars and cents; it has a
mean of 50.00 and a standard deviation of 25, the minimum is 0.02 and the maximum
is 398)

[3] E.g. (111234232, {Coke, 2%-milk, apple}, old, 33.39) is an example of a customer description.

Assume that Items_Bought and Amount_Spend are of major importance and Age is of
minor importance when assessing the similarity of the customers. Assess the distance
between the following 3 customers:
a1 = (111111111, {A,B,C}, old, 12.20)
a2 = (22222222, {B,C,D,E,F,G}, medium, 50.20)
a3 = (333333333, {C,D,E,F,H}, young, 28.00)
We assume that the values of attribute age are converted to numbers as old = 3,
medium = 2, young = 1, teenager = 0.
Let a, b be 2 customers:
D_Items_Bought(a,b) = 1*(1 - |a.Items_Bought ∩ b.Items_Bought| / |a.Items_Bought ∪ b.Items_Bought|)
D_Age(a,b) = 0.2*|(a.age - b.age)/3|
D_Amount_spend(a,b) = 1*|(a.Amount_spend - 50)/25 - (b.Amount_spend - 50)/25|
The distance of a and b is the sum of the three terms divided by the sum of the
weights (1 + 0.2 + 1 = 2.2):
D(a,b) = (D_Items_Bought(a,b) + D_Age(a,b) + D_Amount_spend(a,b))/2.2
D(a1,a2) = ((1 - 2/7) + 0.2*(1/3) + |-37.8/25 - 0.2/25|)/2.2 = (5/7 + 0.2/3 + 38/25)/2.2 = 1.046
D(a1,a3) = ((1 - 1/7) + 0.2*(2/3) + |-37.8/25 + 22/25|)/2.2 = (6/7 + 0.4/3 + 15.8/25)/2.2 = 0.737
D(a2,a3) = ((1 - 4/7) + 0.2*(1/3) + |0.2/25 + 22/25|)/2.2 = (3/7 + 0.2/3 + 22.2/25)/2.2 = 0.629
For D(a1,a3), the two item sets share only C, and their union {A,B,C,D,E,F,H} has 7
items, so the item term is 1 - 1/7.
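
The distance function defined above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the only possible design: the tuple layout (ssn, items, age, amount_spend), the names AGE_CODE and distance, and the decision to ignore Ssn are choices made for this sketch, and both item sets are assumed to be non-empty so that the union is never empty.

```python
# Numeric coding of the ordinal Age attribute, as stated in the text.
AGE_CODE = {"old": 3, "medium": 2, "young": 1, "teenager": 0}

# Mean and standard deviation of Amount_spend, as given in the problem.
MEAN_SPEND, SD_SPEND = 50.0, 25.0

# Weights: Items_Bought and Amount_spend are major, Age is minor.
W_ITEMS, W_AGE, W_SPEND = 1.0, 0.2, 1.0


def distance(a, b):
    """Weighted distance between two customers (ssn, items, age, amount_spend)."""
    _, items_a, age_a, spend_a = a  # Ssn is an identifier and is ignored
    _, items_b, age_b, spend_b = b
    # Jaccard distance between the two item sets.
    d_items = 1 - len(items_a & items_b) / len(items_a | items_b)
    # Age difference scaled by the largest possible difference (3).
    d_age = abs(AGE_CODE[age_a] - AGE_CODE[age_b]) / 3
    # Absolute difference of the z-scores of the amounts spent.
    d_spend = abs((spend_a - MEAN_SPEND) / SD_SPEND
                  - (spend_b - MEAN_SPEND) / SD_SPEND)
    total = W_ITEMS * d_items + W_AGE * d_age + W_SPEND * d_spend
    return total / (W_ITEMS + W_AGE + W_SPEND)


a1 = (111111111, {"A", "B", "C"}, "old", 12.20)
a2 = (22222222, {"B", "C", "D", "E", "F", "G"}, "medium", 50.20)
a3 = (333333333, {"C", "D", "E", "F", "H"}, "young", 28.00)

for label, (x, y) in {"D(a1,a2)": (a1, a2),
                      "D(a1,a3)": (a1, a3),
                      "D(a2,a3)": (a2, a3)}.items():
    print(f"{label} = {distance(x, y):.3f}")
```

Note that the amount term is a difference of z-scores and is therefore not bounded by 1, so the overall distance can exceed 1 for customers with very different spending. Also note that {A,B,C} and {C,D,E,F,H} together contain seven distinct items, so the item term for a1 and a3 is 1 - 1/7.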
12) Data Analysis (4 POINTS)
a) What is the role and purpose of exploratory data analysis in a data mining project?
b) Interpret the following 2 histograms and analyze their relationships which describe the
male and female age distribution in the US, based on Census Data.

The role and purpose of exploratory data analysis in a data mining project:

- Getting the necessary background information for the task
- Providing knowledge to help in tool selection
- Assessing the difficulty of the task to be solved
- Validating data
- Forming hypotheses
- Finding potential issues, errors, and patterns in the data

Interpreting the 2 histograms and their relationship:

The two histograms above are continuous, and there are no gaps in them. They are
bimodal (2 peaks, at ages 5-9 and 35-39). The values in both histograms start to go
down significantly beyond ages 55-59 (skewed distribution). In general, both histograms
are similar up to ages 55-59. After this point, the male curve drops significantly more
steeply than the female curve. From this we can conclude that females live longer than
males.
