
Eick

COSC 4335 Data Mining Fall 2015

Draft Project1: (Exploratory) Data Analysis

Group Project (Groups of 2 or 3)

Due: Monday, February 23, 11p (electronic Submission)

Last Updated: January 26, 2015; 5p

Download the Statlog (Vehicle Silhouettes) data set from

http://archive.ics.uci.edu/ml/datasets/Statlog+(Vehicle+Silhouettes)

and limit your analysis to the following subset of the dataset involving just 5 attributes; use all examples to create the subset:

COMPACTNESS (average perim)**2/area (1st attribute)

CIRCULARITY (average radius)**2/area (2nd attribute)

SCATTER RATIO (inertia about minor axis)/(inertia about major axis) (7th attribute)

ELONGATEDNESS area/(shrink width)**2 (8th attribute)

HOLLOWS RATIO (area of hollows)/(area of bounding polygon) (18th attribute)

Class: OPEL, SAAB, BUS, VAN

Apply the following exploratory data analysis techniques using R to your dataset:

1. Compute the mean value and standard deviation of the 5 numerical attributes. 1 point

Attribute        Mean       Standard Deviation
COMPACTNESS      93.67849   8.234474
CIRCULARITY      44.8617    6.169866
SCATTER RATIO    168.8392   33.24498
ELONGATEDNESS    40.93381   7.81156
HOLLOWS RATIO    195.6324   7.438797
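As a rough illustration of how the two statistics in question 1 are obtained, the following Python sketch computes the mean and the sample standard deviation (n - 1 denominator, the convention R's sd() uses). The assignment itself asks for R; the function name and the toy values are assumptions made only for this example and are not taken from the dataset.

```python
import math

def mean_and_sd(values):
    """Return (mean, sample standard deviation) of a list of numbers."""
    n = len(values)
    mean = sum(values) / n
    # Sample variance divides by n - 1, not n.
    variance = sum((v - mean) ** 2 for v in values) / (n - 1)
    return mean, math.sqrt(variance)

# Hypothetical toy values, not rows from the Statlog data.
toy_values = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
toy_mean, toy_sd = mean_and_sd(toy_values)
print(toy_mean, round(toy_sd, 4))
```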

2. Compute the covariance matrix for the five numerical attributes you are analyzing;

also compute the correlation for each of the three pairs of attributes. Interpret the

statistical findings! 2 points

Covariance matrix

               COMPACTNESS  CIRCULARITY  SCATTER_RATIO  ELONGATEDNESS  HOLLOWS_RATIO
COMPACTNESS    67.80657     35.20164     222.56364      -50.72900      22.39173
CIRCULARITY    35.20164     38.067242    176.47597      -39.94289      1.775135
SCATTER_RATIO  222.56364    176.47597    1105.22856     -252.78343     29.66391
ELONGATEDNESS  -50.72900    -39.94289    -252.78343     61.02047       -12.59359
HOLLOWS_RATIO  22.39173     1.775135     29.66391       -12.59359      55.335707

The covariances show positive linear relationships between every pair of attributes, except for pairs involving Elongatedness. This means that vehicles with a high value for one of Compactness, Circularity, Scatter_Ratio, or Hollows_Ratio will usually also have high values for the other three. In contrast, vehicles with a high value of Elongatedness will usually have low values of Compactness, Circularity, Scatter_Ratio, and Hollows_Ratio.

Correlation matrix

               COMPACTNESS  CIRCULARITY  SCATTER_RATIO  ELONGATEDNESS  HOLLOWS_RATIO
COMPACTNESS    1            0.6928692    0.8130033      -0.788647      0.3655518
CIRCULARITY    0.6928692    1            0.8603671      -0.8287548     0.03867702
SCATTER_RATIO  0.8130033    0.8603671    1              -0.9733853     0.1199498
ELONGATEDNESS  -0.788647    -0.8287548   -0.9733853     1              -0.2167251
HOLLOWS_RATIO  0.3655518    0.03867702   0.1199498      -0.2167251     1

The positive linear relationship between Compactness and Hollows_Ratio is fairly weak (r of about 0.37), whereas the positive relationships between Compactness and both Circularity (about 0.69) and Scatter_Ratio (about 0.81) are strong. The negative relationship between Compactness and Elongatedness is also strong (about -0.79).
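The correlations can be cross-checked against the covariance matrix, since each correlation is the covariance divided by the product of the two standard deviations. The Python sketch below is not part of the original R analysis; the constant and function names are assumptions for illustration, and the matrix entries are the values reported in the answer to question 2.

```python
import math

NAMES = ["COMPACTNESS", "CIRCULARITY", "SCATTER_RATIO",
         "ELONGATEDNESS", "HOLLOWS_RATIO"]

# Covariance matrix as reported in the answer to question 2.
COV = [
    [67.80657, 35.20164, 222.56364, -50.72900, 22.39173],
    [35.20164, 38.067242, 176.47597, -39.94289, 1.775135],
    [222.56364, 176.47597, 1105.22856, -252.78343, 29.66391],
    [-50.72900, -39.94289, -252.78343, 61.02047, -12.59359],
    [22.39173, 1.775135, 29.66391, -12.59359, 55.335707],
]

def correlation_from_covariance(cov):
    """Convert a covariance matrix into a correlation matrix."""
    # Standard deviations are the square roots of the diagonal entries.
    sd = [math.sqrt(cov[i][i]) for i in range(len(cov))]
    return [[cov[i][j] / (sd[i] * sd[j]) for j in range(len(cov))]
            for i in range(len(cov))]

CORR = correlation_from_covariance(COV)
for name, row in zip(NAMES, CORR):
    print(f"{name:14s}", " ".join(f"{r:8.4f}" for r in row))
```

Values recomputed this way should agree with the reported correlation matrix up to rounding.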

3. Create a scatter plot for the last two numerical attributes of your dataset. Interpret the

scatter plot! 2 points

and negative for Elongatedness in the range (35, 45]. There appears to be no linear relationship between the two attributes for Elongatedness > 45.

In general, the linear relationship between the two attributes is very weak based on the scatter plot, which is consistent with their correlation of about -0.22.

4. Create histograms for each of the 5 numerical attributes. Then create a histogram for

the ELONGATEDNESS attribute for instances of OPEL, instances of SAAB,

instances of BUS, and instances of VAN (4 Histograms); interpret the 5 histograms

you generated for the ELONGATEDNESS attribute. 6 points

Over all instances, Elongatedness appears to have two modes. The distribution is not symmetric; the histogram is somewhat right-skewed.

For the Opel class, Elongatedness appears to have only one mode, and the histogram is clearly right-skewed.

For Saab, Elongatedness appears to have only one mode, and the histogram is also right-skewed.

For Bus, Elongatedness appears to have two modes. If the leftmost bar is not considered, the histogram appears to be left-skewed.

For Van, the Elongatedness counts fluctuate considerably from bin to bin, which suggests that the distribution is multi-modal. Overall, however, the histogram looks right-skewed.
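Skewness judged by eye from a histogram can be backed up with a number. The following Python sketch computes the moment-based sample skewness (third central moment divided by the 1.5 power of the second); a positive value indicates a longer right tail and a negative value a longer left tail. The function name and the toy lists are assumptions for illustration and are not values from the dataset.

```python
def sample_skewness(values):
    """Moment-based skewness: m3 / m2**1.5 (population moments)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n  # second central moment
    m3 = sum((v - mean) ** 3 for v in values) / n  # third central moment
    return m3 / m2 ** 1.5

# Hypothetical toy data: one large value creates a right tail.
right_tailed = [1, 2, 2, 3, 3, 3, 4, 10]
# Mirroring the data flips the tail to the left.
left_tailed = [-v for v in right_tailed]

print(sample_skewness(right_tailed) > 0)  # right-skewed
print(sample_skewness(left_tailed) < 0)   # left-skewed
```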

5. Create box plots for the COMPACTNESS attribute for the instances of each class and

a fifth box plot for all instances in the dataset. Do the same for the HOLLOWS

RATIO attribute. Interpret and compare the 5 box plots for each attribute! 5 points

For Compactness, the box plots for opel, saab, van, and all instances have the 50th percentile in the middle of the 25th and 75th percentiles, whereas the 50th percentile of the box plot for bus is closer to the 25th percentile. The data for van is quite compact, between values 87 and 93, while the data in the other box plots has a wider spread.

For Hollows Ratio, the box plots show that the 50th percentile for opel, saab, and van lies in the middle of the 25th and 75th percentiles, whereas the 50th percentile for bus is closer to the 25th percentile and the 50th percentile for all instances is closer to the 75th percentile. Moreover, the data for opel, saab, van, and all instances is concentrated between values 192 and 202, while the data for bus is spread more widely.
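The position of the median inside the box can be quantified rather than read off the plot. The Python sketch below computes the three quartiles and reports where the median sits between the 25th and 75th percentiles (0.5 means exactly in the middle). The function name and toy data are assumptions for illustration; quartile conventions differ between tools, so the numbers may not match the hinges of an R box plot exactly.

```python
import statistics

def median_position(values):
    """Return (q1, median, q3, position).

    position is the median's location inside the box:
    0.0 at the 25th percentile, 1.0 at the 75th percentile.
    """
    q1, q2, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return q1, q2, q3, (q2 - q1) / (q3 - q1)

# Hypothetical toy data, not values from the dataset.
symmetric = [1, 2, 3, 4, 5, 6, 7, 8, 9]
median_low = [1, 2, 2, 2, 3, 6, 8, 9, 10]

print(median_position(symmetric))   # median in the middle of the box
print(median_position(median_low))  # median closer to the 25th percentile
```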

6. Create supervised scatter plots/supervised density plots for 4 pairs of the 5 attributes (for each pair of attributes visualize it using a traditional scatter plot and a density plot) and the class variable; use different colors for the class variable. Interpret the scatter plots! 5 points

(In general, there are 10 pairs, but you only need to visualize 4 of them!)
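The count of 10 pairs follows from choosing 2 of the 5 attributes. A small Python sketch (the assignment itself uses R; the list name is an assumption for illustration) enumerates them:

```python
from itertools import combinations

ATTRIBUTES = ["COMPACTNESS", "CIRCULARITY", "SCATTER_RATIO",
              "ELONGATEDNESS", "HOLLOWS_RATIO"]

# All unordered pairs of distinct attributes.
pairs = list(combinations(ATTRIBUTES, 2))
for x, y in pairs:
    print(f"{x} vs {y}")
print(len(pairs), "pairs in total")
```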


+saab: circularity and compactness appear to have a positive linear relationship.

+van: circularity and compactness do not appear to have a strong linear relationship, because compactness stays roughly unchanged as circularity changes.

+bus: circularity and compactness appear to have a negative linear relationship on the interval [35, 42] of the x-axis and a positive linear relationship on the interval [42, 60].

+opel: circularity and compactness appear to have a positive linear relationship.

+saab: scatter ratio and compactness appear to have a fairly strong positive linear relationship.

+van: scatter ratio and compactness do not appear to have a strong linear relationship, because the data points are concentrated in a rectangular shape.

+bus: scatter ratio and compactness do not appear to have a linear relationship around the value of 150 on the x-axis. However, from 150 to 250 on the x-axis, a positive linear relationship seems to exist.

+opel: scatter ratio and compactness appear to have a fairly strong positive linear relationship.

+It appears that there are positive linear relationships between elongatedness and compactness for the bus, opel, and saab classes. However, there seems to be no linear relationship for the van class.

+saab: there appears to be a fairly weak linear relationship between Hollows_Ratio and Compactness.

+van: there appears to be no linear relationship between Hollows_Ratio and Compactness.

+bus: there appears to be a negative linear relationship between Hollows_Ratio and Compactness on the interval (95, 100] of the y-axis, but a positive linear relationship on the interval [80, 95].

7. Create a star plot for the first 10 instances of class BUS and the first 20 instances of SAAB (based on the order in the file); interpret the 20 star plots (star plots should be constructed for the 4 numerical attributes)! 3 points

Among the 20 saab instances, the dominant shapes appear to be those of instances number 3, 19, 25, 28, 39, 45, 91, and 93. The shapes of instances number 10, 25, 44, 50, 52, 57, 77, and 78 are also dominant. In general, these 20 instances could be divided into two major groups, with similar shapes grouped together.


8. Fit a linear model that predicts the class attribute (treat it as a numerical attribute that

takes values 0, 1, 2 and 3 with OPEL=0, SAAB=1, BUS=2, and VAN=3) using the 5

attributes as the independent variables. Report the R² of the linear model and the

coefficients of each attribute in the obtained regression function. Do the coefficients

tell you anything about the importance of the attribute in predicting the class variable;

if yes, what? Repeat the experiment using OPEL=0, SAAB=0, BUS=2, and VAN=0,

and answer the same questions! 6 points

OPEL=0, SAAB=1, BUS=2, and VAN=3

Model                 Coefficient   y-intercept
class~compactness     -0.02352      3.70878
class~circularity     -0.01936      2.40081
class~scatter_ratio   -0.005386     2.461936
class~elongatedness   0.02157       0.67257
class~hollow_ratio    -0.05095      11.40752

Interpretation:

class~compactness: The coefficient indicates that there is a weak negative relationship between class and compactness. It means it is not appropriate to use compactness to predict class.

class~circularity: The coefficient indicates that there is a weak negative relationship between class and circularity. It means it is not appropriate to use circularity to predict class.

class~scatter_ratio: The coefficient indicates that there is a very weak negative relationship between class and scatter_ratio. It means it is not appropriate to use scatter_ratio to predict class.

class~elongatedness: The coefficient indicates that there is a weak positive relationship between class and elongatedness. It means it is not appropriate to use elongatedness to predict class.

class~hollow_ratio: The coefficient indicates that there is a weak negative relationship between class and hollow_ratio. It means it is not appropriate to use hollow_ratio to predict class.

OPEL=0, SAAB=0, BUS=2, and VAN=0

Model                 Coefficient   y-intercept
class~compactness     -0.01588      2.0029
class~circularity     0.002807      0.389432
class~scatter_ratio   0.0005526     0.4220647
class~elongatedness   -0.006926     0.798889
class~hollow_ratio    -0.04016      8.37151

Interpretation:

class~compactness: The negative linear relationship is weaker.

class~circularity: The linear relationship now is positive, but very weak.

class~scatter_ratio: The linear relationship now is positive, but very weak.

class~elongatedness: The linear relationship now is negative, but is very weak.

class~hollow_ratio: The negative linear relationship is weaker.
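A least-squares line always passes through the point (mean of x, mean of y), so each single-attribute model evaluated at its attribute's mean should return the mean of the encoded class. The Python sketch below is not part of the original R analysis, and its variable names are assumptions for illustration. It applies this check to the second encoding (BUS=2, all other classes 0), using the coefficients and intercepts from the table for that encoding and the attribute means as read from the table in question 1.

```python
# (coefficient, y-intercept) per single-attribute model, second encoding.
MODELS = {
    "compactness":   (-0.01588, 2.0029),
    "circularity":   (0.002807, 0.389432),
    "scatter_ratio": (0.0005526, 0.4220647),
    "elongatedness": (-0.006926, 0.798889),
    "hollow_ratio":  (-0.04016, 8.37151),
}

# Attribute means as read from the question 1 table.
MEANS = {
    "compactness": 93.67849,
    "circularity": 44.8617,
    "scatter_ratio": 168.8392,
    "elongatedness": 40.93381,
    "hollow_ratio": 195.6324,
}

# Prediction of each model at the mean of its own attribute.
predicted_at_mean = {
    name: intercept + coef * MEANS[name]
    for name, (coef, intercept) in MODELS.items()
}
for name, value in predicted_at_mean.items():
    print(f"{name:14s} {value:.4f}")
```

All five values come out near 0.515. With BUS coded as 2 and every other class coded as 0, that would correspond to roughly a quarter of the instances being BUS.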

9. Create 3 decision tree models with 20 or fewer nodes for the dataset (leaf nodes count; do not submit models with more than 20 nodes!). Explain how the 3 decision tree models were obtained. Report the training accuracy and the testing accuracy of this decision tree; interpret the learnt decision tree. What does it tell you about the importance of the 5 attributes for the classification problem? 6 points

10. Write a conclusion (at most 18 sentences!) that assesses the difficulty of predicting

the class attribute using the selected 5 attributes and assesses which of the 5

attributes is more important/less important for the classification task based on the

findings you obtained answering questions 1-9! Moreover, if you discovered

something else interesting about the dataset, also mention it in your conclusion. 7

points total (and possibly up to 5 extra points)

Based on the answers to the questions above, we can assess the difficulty of predicting the class attribute using the selected five attributes. In question 6, we chose pairs of the 5 attributes and analyzed their relationship for the classes saab, van, bus, and opel. Each pair of attributes shows a different degree of linear relationship depending on the class. However, fitting a linear model that predicts the class attribute from the five attributes, as in question 8, is not an appropriate approach: the results of question 8 show only a weak relationship between the class and each individual attribute. Thus, using the attributes as independent variables in a linear model to predict the class is inappropriate. Moreover, the attributes do not play equal roles, and they depend on each other.

Design a distance function to assess the similarity of customers of a supermarket; each customer in a supermarket is characterized by the following attributes:

a) Ssn

b) Items_Bought (The set of items they bought last month; this is a set)

c) Age (is an ordinal attribute having values: old, medium, young, and teenager)

d) Amount_spend (Average amount spent per purchase in dollars and cents; it has a mean of 50.00, a standard deviation of 25, the minimum is 0.02 and the maximum is 398)

E.g. (111234232, {Coke, 2%-milk, apple}, old, 33.39) is an example of a customer description.

Assume that Items_Bought and Amount_Spend are of major importance and Age is of a

minor importance when assessing the similarity of the customers. Assess the distance

between the following 3 customers:

a1= (111111111, {A,B,C}, old, 12.20)

a2= (22222222, {B,C,D,E,F, G}, medium, 50.20)

a3=(333333333, {C,D,E,F,H}, young, 28.00).

We assume that the values of attribute age are converted to numbers as old =3,

medium = 2, young =1, teenager =0.

Let a, b be 2 customers:

D_Items_Bought(a,b) = 1*(1 - |a.Items_Bought ∩ b.Items_Bought| / |a.Items_Bought ∪ b.Items_Bought|)

D_Age(a,b) = 0.2*|(a.age - b.age)/3|

D_Amount_spend(a,b) = 1*|(a.Amount_spend-50)/25 - (b.Amount_spend-50)/25|

The distance of a and b:

D(a,b) = (D_Items_Bought(a,b) + D_Age(a,b) + D_Amount_spend(a,b))/2.2

D(a1,a2) = ((1-2/7) + 0.2*(1/3) + |(-37.8/25 - 0.2/25)|)/2.2 = (5/7 + 0.2/3 + 38/25)/2.2 = 1.046

D(a1,a3) = ((1-1/7) + 0.2*(2/3) + |(-37.8/25 + 22/25)|)/2.2 = (6/7 + 0.4/3 + 15.8/25)/2.2 = 0.737
(a1 and a3 share only item C, and their union {A,B,C,D,E,F,H} has 7 items.)

D(a2,a3) = ((1-4/7) + 0.2*(1/3) + |(0.2/25 + 22/25)|)/2.2 = (3/7 + 0.2/3 + 22.2/25)/2.2 = 0.629
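The distance function can be written out as a short Python sketch. The weights (1, 0.2, 1), the age encoding, and the standardization by mean 50 and standard deviation 25 are taken from the answer; the constant and function names are assumptions for illustration. The Ssn is ignored because it is only an identifier, and the sketch assumes that at least one of the two item sets is non-empty.

```python
AGE_RANK = {"teenager": 0, "young": 1, "medium": 2, "old": 3}
W_ITEMS, W_AGE, W_AMOUNT = 1.0, 0.2, 1.0
AMOUNT_MEAN, AMOUNT_SD = 50.0, 25.0

def distance(a, b):
    """Weighted distance between customers given as (ssn, items, age, amount)."""
    _, items_a, age_a, amount_a = a
    _, items_b, age_b, amount_b = b
    # Jaccard distance on the item sets (assumes the union is non-empty).
    d_items = 1 - len(items_a & items_b) / len(items_a | items_b)
    # Ordinal age distance, scaled to [0, 1] by the largest rank difference.
    d_age = abs(AGE_RANK[age_a] - AGE_RANK[age_b]) / 3
    # Difference of the standardized (z-score) amounts.
    d_amount = abs((amount_a - AMOUNT_MEAN) / AMOUNT_SD
                   - (amount_b - AMOUNT_MEAN) / AMOUNT_SD)
    total = W_ITEMS * d_items + W_AGE * d_age + W_AMOUNT * d_amount
    return total / (W_ITEMS + W_AGE + W_AMOUNT)

a1 = (111111111, {"A", "B", "C"}, "old", 12.20)
a2 = (22222222, {"B", "C", "D", "E", "F", "G"}, "medium", 50.20)
a3 = (333333333, {"C", "D", "E", "F", "H"}, "young", 28.00)

for label, (x, y) in {"D(a1,a2)": (a1, a2), "D(a1,a3)": (a1, a3),
                      "D(a2,a3)": (a2, a3)}.items():
    print(label, round(distance(x, y), 3))
```

For the pair a1 and a3, the two item sets together contain 7 distinct items, so the Items_Bought term is 1 - 1/7.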

12) Data Analysis (4 POINTS)

a) What is the role and purpose of exploratory data analysis in a data mining project?

b) Interpret the following 2 histograms and analyze their relationships which describe the

male and female age distribution in the US, based on Census Data.


a)

Providing knowledge to help in tool selection

Assessing the difficulty of the task to be solved

Validating data

Forming hypotheses

Finding potential issues, errors, and patterns in the data

b)

The two histograms are continuous, with no gaps between bins. They are bimodal (two peaks, at ages 5-9 and 35-39). The values in both histograms start dropping significantly beyond ages 55-59 (skewed distribution). In general, the two histograms are similar up to ages 55-59. After this point, the male curve falls significantly more steeply than the female curve. From this, we can conclude that females live longer than males.
