
**What is a Data Set?**

Attributes (describe objects)

Variable, field, characteristic, feature or observation

Objects (have attributes)

Record, point, case, sample, entity or item

Data Set

Collection of objects

A1 A2 … An C

Type of an Attribute

Depends on the following properties:

Distinctness: =, ≠

Order: <, >, =

Addition: +, -

Multiplication: *, /

Types of Attributes

| Attribute Type | Description | Examples | Operations |
| --- | --- | --- | --- |
| Nominal | Each value represents a label. (Typical comparisons between two values are limited to “equal” or “not equal”.) | Flower color, gender, zip code | Mode, entropy, contingency correlation, χ² test |
| Ordinal | The values can be ordered. (Typical comparisons between two values are “equal”, “greater” or “less”.) | Hardness of minerals, (good, better, best), grades, street numbers, rank, age | Median, percentiles, rank correlation, run tests, sign tests |
| Interval | The differences between values are meaningful, i.e., a unit of measurement exists. (+, -) | Calendar dates, temperature in Celsius or Fahrenheit | Mean, standard deviation, Pearson’s correlation, t and F tests |
| Ratio | Differences and ratios are meaningful. (*, /) | Monetary quantities, count, age, mass, length, electrical current | Geometric mean, harmonic mean, percent variation |

Attribute level
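A minimal sketch of how the attribute level limits the summary statistics that make sense. The sample values and variable names are made up for illustration; only the Python standard library is used.

```python
from statistics import geometric_mean, mean, median, mode

# Hypothetical sample values, one list per attribute level.
colors = ["red", "blue", "red", "white"]   # nominal: labels only
grades = [1, 3, 2, 3, 2]                   # ordinal: ranks (1 = good, 2 = better, 3 = best)
temps_c = [20.0, 22.0, 24.0]               # interval: differences are meaningful
masses_kg = [1.0, 2.0, 4.0]                # ratio: differences and ratios are meaningful

nominal_summary = mode(colors)                # most frequent label
ordinal_summary = median(grades)              # middle rank
interval_summary = mean(temps_c)              # arithmetic mean
ratio_summary = geometric_mean(masses_kg)     # needs a true zero point

print(nominal_summary, ordinal_summary, interval_summary, ratio_summary)
```

Each level also supports the operations of the levels above it in the table, so a median of `masses_kg` is fine, while a mean of `colors` is not.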

Discrete and Continuous Attributes

Discrete attributes

A discrete attribute has only a finite or countably infinite set of values, e.g., zip codes, counts or the set of words in a document.

Discrete attributes are often represented as integer variables.

Binary attributes are a special case of discrete attributes and assume only two values, e.g., yes/no, true/false, male/female.

Binary attributes are often represented as Boolean variables, or as integer variables that take on the values 0 or 1.

Continuous attributes

A continuous attribute has real-number values, e.g., temperature, height or weight. (In practice, real values can only be measured and represented to a finite number of digits.)

Continuous attributes are typically represented as floating-point variables.

Structured Data Sets

Common types

Record

Graph

Ordered

Three important characteristics

Dimensionality

Sparsity

Resolution

Record Data

Most existing data mining work focuses on data sets that consist of a collection of records (data objects), each of which consists of a fixed set of data fields (attributes).

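| Name | Gender | Height | Output |
| --- | --- | --- | --- |
| Kristina | F | 1.6 m | Medium |
| Jim | M | 2 m | Medium |
| Maggie | F | 1.9 m | Tall |
| Martha | F | 1.88 m | Tall |
| Stephanie | F | 1.7 m | Medium |
| Bob | M | 1.85 m | Medium |
| Kathy | F | 1.6 m | Medium |
| Dave | M | 1.7 m | Medium |
| Worth | M | 2.2 m | Tall |
| Steven | M | 2.1 m | Tall |
| Debbie | F | 1.8 m | Medium |
| Todd | M | 1.95 m | Medium |
| Kim | F | 1.9 m | Tall |
| Amy | F | 1.8 m | Medium |
| Lynette | F | 1.75 m | Medium |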
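Record data maps naturally onto a list of fixed-field records. A small sketch using three rows from the height data; the `Person` type and its field names are hypothetical choices for illustration.

```python
from typing import NamedTuple

class Person(NamedTuple):
    """One record (data object) with a fixed set of fields (attributes)."""
    name: str
    gender: str
    height_m: float
    output: str

# A data set is a collection of such records.
records = [
    Person("Kristina", "F", 1.6, "Medium"),
    Person("Jim", "M", 2.0, "Medium"),
    Person("Maggie", "F", 1.9, "Tall"),
]

# Every record has the same fields, so attribute access is uniform.
tall_names = [r.name for r in records if r.output == "Tall"]
print(tall_names)
```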

Data Matrix

If all objects in a data set have the same set of numeric attributes, then each object represents a point (vector) in multi-dimensional space.

Each attribute of the object corresponds to a dimension.

| Projection of X load | Projection of Y load | Distance | Load | Thickness |
| --- | --- | --- | --- | --- |
| 10.23 | 5.27 | 15.22 | 2.7 | 1.2 |
| 12.65 | 6.25 | 16.22 | 2.2 | 1.1 |
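Treating each row as a point makes geometric operations available. A sketch, assuming the two rows above and using Euclidean distance as one possible measure (the slides do not prescribe one); the column names are illustrative.

```python
import math

# Column names are illustrative labels for the five attributes.
columns = ["x_load_projection", "y_load_projection", "distance", "load", "thickness"]

# Each row is one object, i.e. one point in 5-dimensional space.
matrix = [
    [10.23, 5.27, 15.22, 2.7, 1.2],
    [12.65, 6.25, 16.22, 2.2, 1.1],
]

n_objects, n_dimensions = len(matrix), len(matrix[0])

# Euclidean distance between the two points (math.dist needs Python 3.8+).
euclidean = math.dist(matrix[0], matrix[1])
print(n_objects, n_dimensions, round(euclidean, 3))
```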

Document Data

Each document becomes a ‘term’ vector, where each term is a component (attribute) of the vector, and where the value of each component of the vector is the number of times the corresponding term occurs in the document.

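| | team | coach | play | ball | score | game | win | lost | timeout | season |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| document 1 | 3 | 0 | 5 | 0 | 2 | 6 | 0 | 2 | 0 | 2 |
| document 2 | 0 | 7 | 0 | 2 | 1 | 0 | 0 | 3 | 0 | 0 |
| document 3 | 0 | 1 | 0 | 0 | 1 | 2 | 2 | 0 | 3 | 0 |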
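A sketch of how a term vector can be built. The helper `term_vector`, the vocabulary and the sample sentence are assumptions for illustration; tokenisation is a plain whitespace split, which real text would need to refine.

```python
from collections import Counter

def term_vector(text, vocabulary):
    """Count how often each vocabulary term occurs in the text."""
    counts = Counter(text.lower().split())
    # Counter returns 0 for terms that never occur.
    return [counts[term] for term in vocabulary]

vocabulary = ["team", "coach", "play", "ball", "score", "game"]
doc = "Coach said the team will play the game and the team will score"

vector = term_vector(doc, vocabulary)
print(dict(zip(vocabulary, vector)))
```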

Transaction Data

Transaction data is a special type of record data, where each record (transaction) involves a set of items.

For example, consider a grocery store. The set of products purchased by a customer during one shopping trip constitutes a transaction, while the individual products that were purchased are the items.

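Figure: items and transaction database

Items: Butter, cheese, Bread, Milk, coke (labelled a to e)

Database of six transactions, each with a transaction ID (tid) and the items bought: a b c e (tid 4), a b c d e (tid 5), b c d (tid 6); the other three transactions are b c e, a b d e and a b d e.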
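A transaction can be sketched as a set of items keyed by its transaction ID. The three transactions below are made up for illustration, and `count_containing` is a hypothetical helper that counts how many transactions contain a given set of items.

```python
# Hypothetical grocery transactions: tid -> set of items bought.
transactions = {
    1: {"bread", "butter", "milk"},
    2: {"butter", "milk", "coke"},
    3: {"bread", "butter", "cheese"},
}

def count_containing(itemset, transactions):
    """Number of transactions that include every item in itemset."""
    return sum(1 for items in transactions.values() if itemset <= items)

butter_and_milk = count_containing({"butter", "milk"}, transactions)
bread_only = count_containing({"bread"}, transactions)
print(butter_and_milk, bread_only)
```

Sets fit here because a transaction records which items were bought, not their order.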

Graph Data

Data with relationships among objects

Similar to OO modeling, where a node represents an object and a link represents a relationship, e.g., linked web pages.

Data with objects that are graphs

An object contains other objects, known as subobjects, e.g., the structure of chemical compounds, where the nodes are atoms and the links between nodes are chemical bonds.
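Linked web pages can be sketched as an adjacency list, one common way to store a graph. The page names and the `inbound` helper are hypothetical.

```python
# Hypothetical site: each page (node) maps to the pages it links to.
links = {
    "home": ["about", "blog"],
    "about": ["home"],
    "blog": ["home", "about"],
}

def inbound(page, links):
    """Pages that link to the given page, in alphabetical order."""
    return sorted(src for src, targets in links.items() if page in targets)

pages_linking_to_about = inbound("about", links)
print(pages_linking_to_about)
```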

Ordered Data:

The attributes have relationships that involve order in time or space.

Fig. 2.4 variations of ordered data

Sequential data

Sequence data

Time series data

Spatial data

Data Quality

The following are some well-known issues:

Noise and outliers

Missing values

Duplicate data

Inconsistent values

Noise and Outliers

Typographical errors lead to incorrect values

Nominal attribute is misspelled

Extra possible value for that attribute

Different names for the same thing (e.g., Pepsi / Pepsi Cola)

Noise – modification of original value

Random error

Non-random error

Noise can be

Temporal

Spatial


Signal processing can reduce (generally not eliminate) noise

Outliers: a small number of points with characteristics different from the rest of the data

Missing Values: Eliminate Data Objects with Missing Values

Out-of-range entries

Negative number in a field that is normally only positive

Zero in a field that can never normally be zero
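The two out-of-range checks above can be sketched as a small validation pass. The function name, the `allow_zero` flag and the sample ages are assumptions for illustration.

```python
def out_of_range(values, allow_zero=True):
    """Indices of entries that are negative, or zero when zero is not allowed."""
    return [
        i for i, v in enumerate(values)
        if v < 0 or (v == 0 and not allow_zero)
    ]

# Hypothetical ages: -1 and 0 are suspicious in a field that should be positive.
ages = [34, -1, 0, 52]

suspect_strict = out_of_range(ages, allow_zero=False)
suspect_lenient = out_of_range(ages)
print(suspect_strict, suspect_lenient)
```

Flagged entries can then be treated as missing values rather than as real measurements.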

Reasons:

Malfunctioning measurement equipment

Changes in experimental design during data collection

Collection of several similar but not identical datasets

Respondents in a survey may refuse to answer certain questions, such as age or income


A simple and effective strategy is to eliminate those records which have missing values. A related strategy is to eliminate attributes that have missing values.

Drawback: you may end up removing a large number of objects
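Both elimination strategies can be sketched on a toy data set. The records are made up, and `None` is assumed to mark a missing value.

```python
# Hypothetical survey records; None marks a missing value.
records = [
    {"name": "A", "age": 31, "income": 40000},
    {"name": "B", "age": None, "income": 52000},
    {"name": "C", "age": 45, "income": None},
]

# Strategy 1: eliminate records that have any missing value.
complete = [r for r in records if all(v is not None for v in r.values())]

# Strategy 2: eliminate attributes that have any missing value.
attrs_with_missing = {k for r in records for k, v in r.items() if v is None}
reduced = [
    {k: v for k, v in r.items() if k not in attrs_with_missing}
    for r in records
]

print(len(complete), reduced)
```

Here two of three records are lost under the first strategy, and two of three attributes under the second, which illustrates the drawback.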

Missing Values: Estimating Them

The price of IBM stock changes in a reasonably smooth fashion, so missing values can be estimated by interpolation.
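A sketch of interpolation on a price series, assuming linear interpolation between the nearest known values on each side (other schemes are possible). The function name and the prices are hypothetical.

```python
def interpolate_missing(series):
    """Fill None gaps linearly between the nearest known neighbours.

    Gaps at the very start or end have only one neighbour and stay None.
    """
    filled = list(series)
    known = [i for i, v in enumerate(series) if v is not None]
    for left, right in zip(known, known[1:]):
        step = (series[right] - series[left]) / (right - left)
        for i in range(left + 1, right):
            filled[i] = series[left] + step * (i - left)
    return filled

# Hypothetical daily closing prices with gaps.
prices = [100.0, None, 104.0, None, None, 110.0]
estimated = interpolate_missing(prices)
print(estimated)
```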

For a data set that has many similar data points, a nearest neighbor approach can be used to estimate the missing values. If the attribute is continuous, the average attribute value of the nearest neighbors can be used. If the attribute is categorical, the most commonly occurring attribute value can be taken.
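A sketch of nearest neighbor estimation, under several assumptions: neighbors are found by Euclidean distance on the numeric attributes the target has, attributes are not rescaled, and the records, the `knn_impute` helper and the choice of `k` are hypothetical.

```python
import math
from collections import Counter

def knn_impute(target, complete, attr, k=3):
    """Estimate target[attr] from the k complete records nearest to target."""
    # Compare on the target's known numeric attributes, excluding attr itself.
    features = [
        a for a, v in target.items()
        if a != attr and isinstance(v, (int, float))
    ]

    def distance(row):
        return math.dist([target[a] for a in features], [row[a] for a in features])

    neighbours = sorted(complete, key=distance)[:k]
    values = [row[attr] for row in neighbours]
    if all(isinstance(v, (int, float)) for v in values):
        return sum(values) / len(values)          # continuous: average
    return Counter(values).most_common(1)[0][0]   # categorical: most common value

complete = [
    {"height": 1.60, "weight": 55.0, "size": "Medium"},
    {"height": 1.70, "weight": 65.0, "size": "Medium"},
    {"height": 1.90, "weight": 85.0, "size": "Tall"},
    {"height": 2.00, "weight": 95.0, "size": "Tall"},
]

weight_estimate = knn_impute({"height": 1.65, "weight": None, "size": "Medium"},
                             complete, "weight", k=2)
size_estimate = knn_impute({"height": 1.95, "weight": 90.0, "size": None},
                           complete, "size", k=3)
print(weight_estimate, size_estimate)
```

Without rescaling, attributes with large numeric ranges (here weight) dominate the distance, so a real pipeline would normally normalise first.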

Missing values: Using the missing value as another value

Many data mining approaches can be modified to operate by ignoring missing values.

E.g., clustering: the similarity between pairs of data objects needs to be calculated. If one or both objects of a pair have missing values for some attributes, then the similarity can be calculated by using only the other attributes.
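A sketch of a pairwise measure that skips attributes missing in either object. It uses Euclidean distance as a stand-in for similarity (smaller means more similar); the function name and the vectors are hypothetical, and `None` is assumed to mark a missing value.

```python
import math

def distance_ignoring_missing(x, y):
    """Euclidean distance over the attributes present in both objects.

    Returns None when the two objects share no known attribute.
    """
    pairs = [(a, b) for a, b in zip(x, y) if a is not None and b is not None]
    if not pairs:
        return None
    return math.sqrt(sum((a - b) ** 2 for a, b in pairs))

d_full = distance_ignoring_missing([0.0, 0.0, None], [3.0, 4.0, 1.0])
d_single = distance_ignoring_missing([1.0, None, 3.0], [4.0, 2.0, None])
d_none = distance_ignoring_missing([None, 1.0], [2.0, None])
print(d_full, d_single, d_none)
```

Distances computed over different numbers of attributes are not directly comparable; dividing by the number of shared attributes is one possible adjustment.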

Inconsistent/ Duplicate Data

Most machine learning tools will produce different results if data files are duplicated: repetition influences the result.

Errors made at data entry, e.g., spelling errors

Can be corrected manually

Data integration

Different names for an attribute in different databases

Can also introduce duplication
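Exact duplicates can be removed with a single pass. A sketch, assuming records are flat dictionaries with hashable values; the helper name and records are hypothetical, and near-duplicates such as misspellings would need separate handling.

```python
def drop_duplicates(records):
    """Keep the first occurrence of each record, preserving order."""
    seen = set()
    unique = []
    for record in records:
        # Sort the items so that key order does not affect the comparison.
        key = tuple(sorted(record.items()))
        if key not in seen:
            seen.add(key)
            unique.append(record)
    return unique

records = [
    {"name": "Jim", "height_m": 2.0},
    {"height_m": 2.0, "name": "Jim"},    # same record, different key order
    {"name": "Maggie", "height_m": 1.9},
]

deduplicated = drop_duplicates(records)
print(deduplicated)
```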
