Uploaded by ducnguyen88882012

Eick

COSC 4335 Data Mining Fall 2015

Draft Project1: (Exploratory) Data Analysis

Group Project (Groups of 2 or 3)

Due: Monday, February 23, 11p (electronic Submission)

Last Updated: January 26, 2015; 5p

Download the Statlog (Vehicle Silhouettes) data set from

http://archive.ics.uci.edu/ml/datasets/Statlog+(Vehicle+Silhouettes)

and limit your analysis to the following subset of the dataset, which involves just 5 attributes; use all examples to create the subset:

COMPACTNESS (average perim)**2/area (1st attribute)

CIRCULARITY (average radius)**2/area (2nd attribute)

SCATTER RATIO (inertia about minor axis)/(inertia about major axis) (7th attribute)

ELONGATEDNESS area/(shrink width)**2 (8th attribute)

HOLLOWS RATIO (area of hollows)/(area of bounding polygon) (18th attribute)

Class: OPEL, SAAB, BUS, VAN

Apply the following exploratory data analysis techniques using R to your dataset:

1. Compute the mean value and standard deviation of the 5 numerical attributes. 1 point

                 Mean        Standard Deviation
COMPACTNESS      93.67849    8.234474
CIRCULARITY      44.8617     6.169866
SCATTER RATIO    168.8392    33.24498
ELONGATEDNESS    40.93381    7.81156
HOLLOWS RATIO    195.6324    7.438797
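The assignment asks for these statistics in R; as a language-neutral illustration, the following minimal Python sketch shows the same computation. The helper name `summarize` and the sample values are hypothetical and are not taken from the dataset. It uses the sample standard deviation (denominator n - 1), which is also what R's `sd()` computes.

```python
import statistics


def summarize(values):
    """Return (mean, sample standard deviation) of a list of numbers."""
    # statistics.stdev divides by n - 1 (sample standard deviation).
    return statistics.mean(values), statistics.stdev(values)


# Hypothetical compactness-like readings, NOT taken from the dataset.
sample = [89, 91, 104, 93, 85]
mean, sd = summarize(sample)
print(f"mean={mean:.3f} sd={sd:.3f}")
```

Applying such a helper to each of the 5 attribute columns would give one row of the table per attribute.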

2. Compute the covariance matrix for the five numerical attributes you are analyzing;

also compute the correlation for each of the three pairs of attributes. Interpret the

statistical findings! 2 points

Covariance matrix

               COMPACTNESS  CIRCULARITY  SCATTER_RATIO  ELONGATEDNESS  HOLLOWS_RATIO
COMPACTNESS       67.80657    35.201637      222.56364      -50.72900      22.391727
CIRCULARITY       35.20164    38.067242      176.47597      -39.94289       1.775135
SCATTER_RATIO    222.56364   176.475966     1105.22856     -252.78343      29.663911
ELONGATEDNESS    -50.72900   -39.94289      -252.78343       61.02047     -12.593593
HOLLOWS_RATIO     22.39173     1.775135       29.66391      -12.59359      55.335707

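Compactness, Circularity, Scatter_Ratio and Hollows_Ratio all have positive linear relationships with one another, but each has a negative relationship with Elongatedness. This means that vehicles with a high value of one of Compactness, Circularity, Scatter_Ratio or Hollows_Ratio will usually have high values of the other three. In contrast, vehicles with a high value of Elongatedness will usually have low values of Compactness, Circularity, Scatter_Ratio and Hollows_Ratio. The covariance alone does not show how strong each relationship is; the correlations are needed for that.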

Correlation matrix

               COMPACTNESS  CIRCULARITY  SCATTER RATIO  ELONGATEDNESS  HOLLOWS RATIO
COMPACTNESS      1            0.6928692     0.8130033     -0.788647       0.3655518
CIRCULARITY      0.6928692    1             0.8603671     -0.8287548      0.03867702
SCATTER RATIO    0.8130033    0.8603671     1             -0.9733853      0.1199498
ELONGATEDNESS   -0.788647    -0.8287548    -0.9733853      1             -0.2167251
HOLLOWS RATIO    0.3655518    0.03867702    0.1199498     -0.2167251      1

The positive linear relationship between Compactness and Hollows_Ratio is fairly weak (0.37), whereas the positive relationships between Compactness and Circularity (0.69) and between Compactness and Scatter_Ratio (0.81) are strong. The negative relationship between Compactness and Elongatedness (-0.79) is also strong.
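The link between the two matrices can be checked by hand: a correlation is the covariance divided by the product of the two standard deviations. The sketch below, written in Python purely for illustration (the assignment itself uses R), applies this to the Compactness/Circularity entry using the covariance and variances reported above. The function name `correlation` is a hypothetical helper.

```python
import math


def correlation(cov_xy, var_x, var_y):
    """Pearson correlation from a covariance and the two variances."""
    return cov_xy / math.sqrt(var_x * var_y)


# Values copied from the covariance matrix reported in this section.
cov_comp_circ = 35.201637   # cov(COMPACTNESS, CIRCULARITY)
var_comp = 67.80657         # var(COMPACTNESS)
var_circ = 38.067242        # var(CIRCULARITY)

r = correlation(cov_comp_circ, var_comp, var_circ)
print(f"r = {r:.4f}")
```

The result agrees with the reported correlation of about 0.69 for this pair, which is a quick consistency check on both matrices.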

3. Create a scatter plot for the last two numerical attributes of your dataset. Interpret the

scatter plot! 2 points

... and negative for Elongatedness in the range (35, 45]. There appears to be no linear relationship between the two attributes for Elongatedness > 45.

In general, the scatter plot suggests that the linear relationship between the two attributes is very weak.

4. Create histograms for each of the 5 numerical attributes. Then create a histogram for

the ELONGATEDNESS attribute for instances of OPEL, instances of SAAB,

instances of BUS, and instances of VAN (4 Histograms); interpret the 5 histograms

you generated for the ELONGATEDNESS attribute. 6 points

For all instances, Elongatedness appears to have two modes. The distribution is not symmetric; it is somewhat right-skewed.

For the Opel class, Elongatedness has only one mode, and the histogram is clearly right-skewed.

Elongatedness for Saab appears to have only one mode. The histogram is also right-skewed.

Elongatedness for Bus seems to have two modes. If the leftmost bar is not considered, the histogram appears to be left-skewed.

The data for Elongatedness for Van fluctuates considerably, which suggests that it is multi-modal. Overall, however, the histogram looks right-skewed.

5. Create box plots for the COMPACTNESS attribute for the instances of each class and

a fifth box plot for all instances in the dataset. Do the same for the HOLLOWS

RATIO attribute. Interpret and compare the 5 box plots for each attribute! 5 points

For Compactness, the box plots for opel, saab, van and all instances appear to have the 50th percentile in the middle of the 25th and 75th percentiles, whereas the 50th percentile of the box plot for bus seems to be closer to the 25th. While the data for van is quite compact, between values 87 and 93, the data in the other box plots seems to have a wider spread.

For Hollows Ratio, the box plots show that the 50th percentile of opel, saab, and van appears to be in the middle of the 25th and 75th percentiles, whereas the 50th percentile of the box plot for bus looks closer to the 25th percentile, and the 50th percentile of the box plot for all instances looks closer to the 75th percentile. Moreover, the data for opel, saab, van, and all instances seems to be concentrated between values 192 and 202, while the data for bus seems to spread wider.
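Statements such as "the 50th percentile is closer to the 25th" can be made measurable by computing where the median sits inside the box. The following Python sketch is only an illustration under stated assumptions: `box_summary` is a hypothetical helper, the sample is made up, and `statistics.quantiles` uses a quartile convention that may differ slightly from the hinges R draws in a box plot.

```python
import statistics


def box_summary(values):
    """Return (q1, median, q3, position) for a list of numbers.

    position says where the median sits inside the box:
    0.0 = at the 25th percentile, 0.5 = exactly in the middle,
    1.0 = at the 75th percentile. It is None if the box has zero height.
    """
    q1, median, q3 = statistics.quantiles(values, n=4)
    spread = q3 - q1
    position = (median - q1) / spread if spread else None
    return q1, median, q3, position


# Hypothetical, perfectly symmetric sample (not from the dataset).
sample = [1, 2, 3, 4, 5, 6, 7, 8, 9]
q1, median, q3, position = box_summary(sample)
print(q1, median, q3, position)
```

A position clearly below 0.5 would correspond to the situation described for bus, and a position clearly above 0.5 to the one described for all instances of Hollows Ratio.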

6. Create supervised scatter plots/supervised density plots for 4 pairs² of the 5 attributes (for each pair of attributes visualize it using a traditional scatter plot and a density plot) and the class variable; use different colors for the class variable. Interpret the scatter plots! 5 points

² In general, there are 10 pairs, but you only need to visualize 4 of them!

+saab: it appears that circularity and compactness have a positive linear relationship.

+van: it appears that circularity and compactness do not have a strong linear relationship, because compactness seems to stay unchanged as circularity changes.

+bus: it appears that circularity and compactness have a negative linear relationship on the interval [35, 42] on the x-axis, and a positive linear relationship on the interval [42, 60] on the x-axis.

+opel: it appears that circularity and compactness have a positive linear relationship.

+saab: it appears that scatter ratio and compactness have a fairly strong positive linear relationship.

+van: it appears that scatter ratio and compactness do not have a strong linear relationship, because the data points are concentrated in a rectangular shape.

+bus: it appears that scatter ratio and compactness do not have a linear relationship around the value 150 on the x-axis. However, from 150 to 250 on the x-axis, a positive linear relationship seems to exist.

+opel: it appears that scatter ratio and compactness have a fairly strong positive linear relationship.

+It appears that there are positive linear relationships between elongatedness and compactness for the bus, opel, and saab classes. However, there seems to be no linear relationship for the van class.

+saab: it looks like there is a fairly weak linear relationship between Hollows_Ratio and Compactness.

+van: it seems that there is no linear relationship between Hollows_Ratio and Compactness.

+bus: it seems that there is a negative linear relationship between Hollows_Ratio and Compactness on the interval (95, 100] on the y-axis, but a positive linear relationship on the interval [80, 95] on the y-axis.

7. Create a star plot for the first 10 instances of class BUS and the first 20 instances of SAAB (based on the order in the file); interpret the 20 star plots. Star plots should be constructed for the 4 numerical attributes! 3 points

It looks like the dominant shapes among the 20 instances of Saab are those of instances number 3, 19, 25, 28, 39, 45, 91, and 93. The shapes of the instances numbered 10, 25, 44, 50, 52, 57, 77, and 78 are also dominant. In general, it seems that these 20 instances could be divided into two major groups (similar shapes grouped together).

8. Fit a linear model that predicts the class attribute (treat it as a numerical attribute that

takes values 0, 1, 2 and 3 with OPEL=0, SAAB=1, BUS=2, and VAN=3) using the 5

attributes as the independent variables. Report the R² of the linear model and the

coefficients of each attribute in the obtained regression function. Do the coefficients

tell you anything about the importance of the attribute in predicting the class variable;

if yes, what? Repeat the experiment using OPEL=0, SAAB=0, BUS=2, and VAN=0,

and answer the same questions! 6 points

OPEL=0, SAAB=1, BUS=2, and VAN=3

Model                  Coefficient   y-intercept
class~compactness      -0.02352       3.70878
class~circularity      -0.01936       2.40081
class~scatter_ratio    -0.005386      2.461936
class~elongatedness     0.02157       0.67257
class~hollow_ratio     -0.05095      11.40752

Interpretation

class~compactness: The coefficient indicates that there is a weak negative relationship between class and compactness. It means it is not appropriate to use compactness to predict class.

class~circularity: The coefficient indicates that there is a weak negative relationship between class and circularity. It means it is not appropriate to use circularity to predict class.

class~scatter_ratio: The coefficient indicates that there is a very weak negative relationship between class and scatter_ratio. It means it is not appropriate to use scatter_ratio to predict class.

class~elongatedness: The coefficient indicates that there is a weak positive relationship between class and elongatedness. It means it is not appropriate to use elongatedness to predict class.

class~hollow_ratio: The coefficient indicates that there is a weak negative relationship between class and hollow_ratio. It means it is not appropriate to use hollow_ratio to predict class.

OPEL=0, SAAB=0, BUS=2, and VAN=0

Model                  Coefficient   y-intercept
class~compactness      -0.01588       2.0029
class~circularity       0.002807      0.389432
class~scatter_ratio     0.0005526     0.4220647
class~elongatedness    -0.006926      0.798889
class~hollow_ratio     -0.04016       8.37151

Interpretation

class~compactness: The negative linear relationship is weaker.

class~circularity: The linear relationship now is positive, but very weak.

class~scatter_ratio: The linear relationship now is positive, but very weak.

class~elongatedness: The linear relationship now is negative, but is very weak.

class~hollow_ratio: The negative linear relationship is weaker.
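Each row above pairs the class with a single attribute, so each is a simple least-squares line. The Python sketch below is an illustration only, not the R code used for the report; the function name `fit_line` and the toy data are hypothetical. It shows how the coefficient, the y-intercept and R² come out of one such fit.

```python
def fit_line(x, y):
    """Least-squares fit of y = slope * x + intercept, plus R^2."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # For a one-predictor fit, R^2 equals the squared correlation.
    r_squared = (sxy * sxy) / (sxx * syy)
    return slope, intercept, r_squared


# Hypothetical toy data lying exactly on y = 2x + 1 (not from the dataset).
x = [1, 2, 3, 4]
y = [3, 5, 7, 9]
slope, intercept, r_squared = fit_line(x, y)
print(slope, intercept, r_squared)
```

Because the size of a coefficient depends on the scale of its attribute, R² is usually the safer gauge of how well an attribute predicts the class than the raw coefficient.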

9. Create 3 decision tree models with 20 or fewer nodes for the dataset (leaf nodes count; do not submit models with more than 20 nodes!). Explain how the 3 decision tree models were obtained. Report the training accuracy and the testing accuracy of this decision tree; interpret the learnt decision tree. What does it tell you about the importance of the 5 attributes for the classification problem? 6 points

10. Write a conclusion (at most 18 sentences!) that assesses the difficulty of predicting

the class attribute using the selected 5 attributes and assesses which of the 5

attributes is more important/less important for the classification task based on the

findings you obtained answering questions 1-9! Moreover, if you discovered

something else interesting about the dataset, also mention it in your conclusion. 7

points total (and possibly up to 5 extra points)

Based on the answers to the questions above, we can assess the difficulty of predicting the class attribute using the selected five attributes. In particular, in question 6 we chose pairs of the 5 attributes and analyzed their relationship for the classes Saab, Van, Bus, and Opel. Each pair of attributes shows a different degree of linear relationship depending on the class. However, fitting a linear model that predicts the class attribute from the five attributes, as in question 8, is not an appropriate approach. The results of question 8 show only a weak relationship between the class and each individual attribute. Thus, using the attributes as independent variables in a linear model to predict the class is inappropriate. Moreover, the attributes do not contribute equally, and they depend on each other.

Design a distance function to assess the similarity of customers of a supermarket; each customer in a supermarket is characterized by the following attributes³:

a) Ssn

b) Items_Bought (the set of items the customer bought last month; this is a set)

c) Age (an ordinal attribute having values: old, medium, young, and teenager)

d) Amount_spend (average amount spent per purchase in dollars and cents; it has a mean of 50.00 and a standard deviation of 25, the minimum is 0.02 and the maximum is 398)

³ E.g. (111234232, {Coke, 2%-milk, apple}, old, 33.39) is an example of a customer description.

Assume that Items_Bought and Amount_Spend are of major importance and Age is of a

minor importance when assessing the similarity of the customers. Assess the distance

between the following 3 customers:

a1= (111111111, {A,B,C}, old, 12.20)

a2= (22222222, {B,C,D,E,F, G}, medium, 50.20)

a3=(333333333, {C,D,E,F,H}, young, 28.00).

We assume that the values of the attribute Age are converted to numbers as old = 3, medium = 2, young = 1, teenager = 0.

Let a, b be 2 customers:

D_Items_Bought(a,b) = 1 * (1 - |a.Items_Bought ∩ b.Items_Bought| / |a.Items_Bought ∪ b.Items_Bought|)

D_Age(a,b) = 0.2 * |a.Age - b.Age| / 3

D_Amount_spend(a,b) = 1 * |(a.Amount_spend - 50)/25 - (b.Amount_spend - 50)/25|

The distance between a and b, normalized by the sum of the weights (1 + 0.2 + 1 = 2.2):

D(a,b) = (D_Items_Bought(a,b) + D_Age(a,b) + D_Amount_spend(a,b)) / 2.2

D(a1,a2) = ((1 - 2/7) + 0.2*(1/3) + |-37.8/25 - 0.2/25|) / 2.2 = (5/7 + 0.2/3 + 38/25) / 2.2 = 1.046

D(a1,a3) = ((1 - 1/7) + 0.2*(2/3) + |-37.8/25 + 22/25|) / 2.2 = (6/7 + 0.4/3 + 15.8/25) / 2.2 = 0.737

D(a2,a3) = ((1 - 4/7) + 0.2*(1/3) + |0.2/25 + 22/25|) / 2.2 = (3/7 + 0.2/3 + 22.2/25) / 2.2 = 0.629
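The distance function defined above can be sketched in a few lines of Python, which makes it easy to recheck the hand calculations. This is an illustrative sketch only: the names `AGE_RANK` and `distance` are hypothetical, and the Ssn is left out of each tuple because it is an identifier that plays no role in the distance.

```python
# Numeric encoding of the ordinal Age attribute, as assumed in the text.
AGE_RANK = {"teenager": 0, "young": 1, "medium": 2, "old": 3}


def distance(a, b):
    """Weighted distance between two customers (items, age, amount)."""
    items_a, age_a, amount_a = a
    items_b, age_b, amount_b = b
    # Jaccard distance on the item sets, weight 1.
    d_items = 1 - len(items_a & items_b) / len(items_a | items_b)
    # Age difference scaled to [0, 1], weight 0.2.
    d_age = 0.2 * abs(AGE_RANK[age_a] - AGE_RANK[age_b]) / 3
    # Difference of z-scores (mean 50, standard deviation 25), weight 1.
    d_amount = abs((amount_a - 50) / 25 - (amount_b - 50) / 25)
    # Normalize by the sum of the weights: 1 + 0.2 + 1 = 2.2.
    return (d_items + d_age + d_amount) / 2.2


a1 = ({"A", "B", "C"}, "old", 12.20)
a2 = ({"B", "C", "D", "E", "F", "G"}, "medium", 50.20)
a3 = ({"C", "D", "E", "F", "H"}, "young", 28.00)

for name, pair in (("a1,a2", (a1, a2)), ("a1,a3", (a1, a3)), ("a2,a3", (a2, a3))):
    print(name, round(distance(*pair), 3))
```

Using Python sets for Items_Bought lets the `&` and `|` operators compute the intersection and union sizes directly, which is where hand counting is most error-prone.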

12) Data Analysis (4 POINTS)

a) What is the role and purpose of exploratory data analysis in a data mining project?

b) Interpret the following 2 histograms and analyze their relationships which describe the

male and female age distribution in the US, based on Census Data.

Providing knowledge to help in tool selection

Assessing the difficulty of the task to be solved

Validating data

Forming hypotheses

Finding potential issues, errors, and patterns in the data

The two histograms are continuous and contain no gaps. They are bimodal (2 peaks, at ages 5-9 and 35-39). The values in both histograms decrease significantly beyond ages 55-59 (skewed distribution). In general, both histograms are similar up to ages 55-59. After this point, the male curve drops much more steeply than the female curve. From this we can conclude that females tend to live longer than males.
