0 Up votes0 Down votes

2 views21 pagesCh2-DataIssues

Aug 19, 2014

© © All Rights Reserved

PPT, PDF, TXT or read online from Scribd

Ch2-DataIssues

© All Rights Reserved

2 views

Ch2-DataIssues

© All Rights Reserved

- Past exam q and answers compiled notes
- Artikel Udara Pak Sam Grostlog 2018
- Lesson Plan : Statistics
- 33851
- Probability Distribution
- Ava i Perf Simple
- BUAd 801
- Chapter 4_statistic Mean Mode n Median_add
- 07A4BS01-PROBABILITYANDSTATISTICS
- GridDataReport-Ld.awal Rahman Tumada_410014048
- CCP303
- Handout_Measures of Central Tendencyt_Grouped
- Clase 11 Data Exploration
- 9ABS304 Probability and Statisitcs
- set4 qc
- S1_2005_Jan OCR MEI
- 12 AAAJ v5no1 1992 21-28 Public Sector Effectiveness Auditing
- 126 Finals Reviewer
- Dong Mime2011
- IMEKO TC3 2005 001u Machine Tensile

You are on page 1of 21

Attributes (describe objects)

Variable, field, characteristic, feature or

observation

Objects (have attributes)

Record, point, case, sample, entity or item

Data Set

Collection of objects

A1 A2 An C

Type of an Attributes

Depends on the following properties:

Distinctness: =,

Order: <, >, =

Addition: +, -

Multiplication: *, /

Types of Attributes

Attribute Type

Description

Examples

Operations

Nominal

Each value represents a

label. (Typical

comparisons between

two values are limited to

equal or not equal)

Flower color, gender,

sip code

Mode, entropy,

contingency correlation,

2

test

Ordinal

The values can be

ordered. (Typical

comparisons between

two values are equal

or greater or less

Hardness of minerals,

(good, better, best),

grades, street numbers,

rank, age

Median, percentiles,

rank correlation, run

tests, sign tests

Interval

The differences between

values are meaningful

i.e., a unit of

measurement exist. (+, -

)

Calendar dates,

temperature in Celsius

or Fahrenheit

Mean, standard

deviation, Pearsons

correlation, t anf F test

Ration

Differences and ratios

are meaningful. (*, /)

Monetary quantities,

count, age, mass, length,

electrical current

Geometric mean,

harmonic mean, percent

variation

Attribute level

Discrete and Continuous Attributes

Discrete attributes

A discrete attributes has

only a finite or countably

infinite set of values. Eg.

Zip codes, counts or the

set of words in a

document

Discrete attributes are

often represented as

integer variables

Binary attributes are a special

case of discrete attributes and

assume only two values

Eg. Yes/no, true/false,

male/female

Binary attributes are often

represented as Boolean

variables, or as integer

variables that take on the

values 0 or 1

Continuous attributes

A continuous attribute

has real number values.

Eg. Temperature, height

or weight (Practically real

values can only be

measured and

represented to finite

number of digits)

Continuous attributes are

typically represented as

floating point variables

Structured Data Sets

Common types

Record

Graph

Ordered

Three important characteristics

Dimensionality

Sparsity

resolution

Record Data

Most of the existing

data mining work is

focused around data

sets that consist of a

collection of records

(data objects), each

of which consists of

fixed set of data

fields (attributes)

Name

Gender

Height

Output

Kristina

F

1.6 m

Medium

Jim

M

2 m

Medium

Maggie

F

1.9 m

Tall

Martha

F

1.88 m

Tall

Stephanie

F

1.7 m

Medium

Bob

M

1.85 m

Medium

Kathy

F

1.6 m

Medium

Dave

M

1.7 m

Medium

Worth

M

2.2 m

Tall

Steven

M

2.1 m

Tall

Debbie

F

1.8 m

Medium

Todd

M

1.95 m

Medium

Kim

F

1.9 m

Tall

Amy

F

1.8 m

Medium

Lynette

F

1.75 m

Medium

Data Matrix

If all objects in data set have the same set of numeric

attributes, then each object represents a point

(vector) in multi-dimensional space.

Each attribute of the object corresponds to a

dimension

Projection of X

load

Projection of Y

load

Distance

Load

Thickness

10.23

5.27

15.22

2.7

1.2

12.65

6.25

16.22

2.2

1.1

Document Data

Each document becomes a term vector, where each

term is a component (attribute) of the vector, and

where the value of each component of the vector is

the number of times the corresponding term occurs in

the document

s

e

a

s

o

n

t

i

m

e

o

u

t

l

o

s

t

w

i

n

g

a

m

e

s

c

o

r

e

b

a

l

l

p

l

a

y

c

o

a

c

h

t

e

a

m

document 1

document 2

document 3

3 0 5 0 2 6 0 2 0 2

0 7 0 2 1 0 0 3 0 0

0 1 0 0 1 2 2 0 3 0

Transaction Data

Transaction data is a

special type of record

data, where each

record (transaction)

involves a set of items.

For example, consider a

grocery store. The set

of product purchased by

a customer during one

shopping trip constitute

a transaction, while the

individual products that

were purchased are

items

Items

Butter

cheese

b

d

Bread

Milk

coke

a

c

e

Database

1

4 a b c e

b c e

Transaction ID (tid)

2

3

Items bought

a b d e

a b d e

a b c d e 5

6 b c d

Graph Data

Data with relationships among objects

Likes OO modeling where node represents

object and the link representing relationship.

Eg. Linked web pages

Data with object that are graphs

Object contains another objects known as

subobjects, e.g., the structure of chemical

compounds, where the nodes are atoms and the

links between nodes are chemical bonds.

Ordered Data:

The attributes have relationships that involve order in

time or space.

Fig. 2.4 variations of ordered data

Sequential data

Sequence data

Time series data

Spatial data

Data Quality

The following are some well known

issues

Noise and outliers

Missing values

Duplicate data

Inconsistent values

Noise and Outliers

Typographical errors lead to incorrect values

Nominal attribute is misspelled

Extra possible value for that attribute

Different names for the same thing (eg: Pepsi

/ Pepsi Cola

Noise modification of original value

Random error

Non-random error

Noise can be

Temporal

Spatial

Contd

Signal processing can reduce (generally

not eliminate) noise

Outliers small number of points with

characteristics different from rest of the

data

Missing Values: Eliminate Data

Objects with Missing Values

Out-of-range entries

Negative number in a field that is normally

only positive

Zero in a field that can never normally be

zero

The reason:

Malfunctional measurement equipment

Changes in experimental design during data collection

Collection of several similar but not identical datasets

Respondents in a survey may refuse to answer certain

questions as age or income

Contd

A simple and effective strategy is to

eliminate those records which have

missing values. A related strategy is to

eliminate attribute that have missing

values

Drawback: you may end up removing a

large number of objects

Missing values:

Estimating them

Price of the IBM stock changes in a

reasonably smooth fashion. The missing

values can be estimated by interpolation

For a data set that has many similar data

points, a nearest neighbor approach can be

used to estimate the missing values. If the

attribute is continuous, then the average

attribute value of the nearest neighbors can

be used. While if the attribute is categorical,

then the most commonly occurring attribute

value can be taken

Missing values: Using the

missing value as another value

Many data mining approaches can be

modified to operate by ignoring missing

values.

E.g. Clustering similarity between pairs of

data objects need to be calculated. If one or

both objects of a pair have missing values for

some attributes, then the similarity can be

calculated by using only other attributes.

Inconsistent/ Duplicate Data

Most machine learning tools will produce

different results if data files are duplicated

repetition influence on the result.

Errors made at data entry spelling errors

Can corrected manually

Data integration

Different names of an attribute in different

databases

Also having duplication

- Past exam q and answers compiled notesUploaded byvicki_hood_2
- Artikel Udara Pak Sam Grostlog 2018Uploaded bysujatmiko
- Lesson Plan : StatisticsUploaded byNoritah Madina
- 33851Uploaded byAamir Ch
- Probability DistributionUploaded byHayati Aini Ahmad
- Ava i Perf SimpleUploaded byNabil Mesbahi
- BUAd 801Uploaded byAdetunji Taiwo
- Chapter 4_statistic Mean Mode n Median_addUploaded byyattie17607
- 07A4BS01-PROBABILITYANDSTATISTICSUploaded byWasim Basha
- GridDataReport-Ld.awal Rahman Tumada_410014048Uploaded byLaodeAwalRahmanTumada
- CCP303Uploaded byapi-3849444
- Handout_Measures of Central Tendencyt_GroupedUploaded bySyElfredG
- Clase 11 Data ExplorationUploaded byfabianR12345
- 9ABS304 Probability and StatisitcsUploaded bysivabharathamurthy
- set4 qcUploaded byIskandar Ibrahim
- S1_2005_Jan OCR MEIUploaded byoddishy
- 12 AAAJ v5no1 1992 21-28 Public Sector Effectiveness AuditingUploaded bytomo2787
- 126 Finals ReviewerUploaded byWilliam James
- Dong Mime2011Uploaded byChensong Dong
- IMEKO TC3 2005 001u Machine TensileUploaded byWim Andre
- Lecture 5Uploaded byTJ
- chap9dUploaded byIshfaq Hussain
- Riboflavin 2Uploaded byÄniiz Cäno
- SACRED HEART SIBU 2013 M2(A)Uploaded bySTPMmaths
- 5 ForecastingUploaded byBrian Akumu
- ATQ_1.docxUploaded byDanielle Marie Gevaña
- ATQ_1.docxUploaded byDanielle Marie Gevaña
- Covariance and CorrelationUploaded byAbhay Kanodia
- Earnings ManagementUploaded byMathilda Chan
- descriptivestatistics-170330121728.pdfUploaded byDr-Abdul Khaliq

- Camba Zog LuUploaded byBama Raja Segaran
- ForsythUploaded byBama Raja Segaran
- Di Mascio Bozen 22Uploaded byBama Raja Segaran
- MM2004_HoiUploaded byBama Raja Segaran
- Types of PL - SajilalUploaded byBama Raja Segaran
- Payment SlipUploaded byBama Raja Segaran
- MMI 05 SlidesUploaded byBama Raja Segaran
- Lecture 8Uploaded byBama Raja Segaran
- Cluster Search ResultUploaded byBama Raja Segaran
- Vadrevu Info Extrac WebUploaded byBama Raja Segaran
- Kapoor 03 ExperimentalUploaded byBama Raja Segaran
- Lecture 3Uploaded byਰਜਨੀ ਖਜੂਰਿਆ
- c03 SyntaxUploaded byBama Raja Segaran
- Ch5 Names TypeUploaded byBama Raja Segaran
- Chapter 9 SQA PlanningUploaded byBama Raja Segaran
- Chapter 2 SQ FactorsUploaded byBama Raja Segaran
- Ch5 SequentialUploaded byBama Raja Segaran
- Ch0 ContentUploaded byBama Raja Segaran
- 105 Weeks EnUploaded byBama Raja Segaran
- Ch7 ClusteringUploaded byBama Raja Segaran
- Ch1 IntroductionUploaded byBama Raja Segaran
- masterC-semOfferedNEWUploaded byBama Raja Segaran
- Senat539_Masters Programme _without Thesis_ Senate 15.07Uploaded byBama Raja Segaran
- TTNOV0708-2(1)Uploaded byBama Raja Segaran
- TTMEI0809Uploaded byBama Raja Segaran

- BrochureUploaded bygappu002
- Proposal SurveyUploaded byHazim Zulkiifli
- Chapter 8Uploaded byAlisha Khan
- k-6 technology skills checklistUploaded byapi-52797528
- 2.5 Scatter Plots and Line of Best FitUploaded bykcarvey
- Cole Chandler's UX Resume March 2017Uploaded byAnonymous 6YWyh31
- Session 05 - SEP 29 - Competing on Analytics (Ok)Uploaded bymhamwi74
- Introduction to VB NETUploaded byVishal Rane
- Wiring in Floor PlanUploaded byDeepak Singh
- Android Activity Lifecycle and IntentsUploaded byParas Pareek
- Installation GuideUploaded byjopehi
- FortiOS-6.2.0-Cookbook.pdfUploaded byneo thomas
- MCS-044Uploaded byAbhishek Veerkar
- Bangalore Software CompaniesUploaded byvinod3511
- Accenture Multi Resolution Asset Data ExchangeUploaded byZM1967
- Digital Signals FAQUploaded byAndy Cobb
- Kumbharana Ck Thesis CsUploaded byYadu Krishnan
- IMPORTANT DATES IN Information Technology.docxUploaded byKrish_666
- Totonac-Tepehua - Creating a Comparative Dictionary of T-TUploaded byraubn111
- DHL Introduces New SmartSensor TechnologyUploaded bySahila Mahajan
- Industrial Revolution UKSW - Arimbi YogasaraUploaded byJessy Ismoyo
- KADS1M Debit GuidelinesUploaded byfogegt
- bsnlUploaded byvkyaju1
- AIS Romney 2006 Slides 06 Control and AIS Part 1Uploaded bysharingnotes123
- Hypervisor ComparisonUploaded byEhtesham Opel
- Action Plan for Managing Difficult & Weak Students at UniversityUploaded byridzuansagir
- resume jainharshitUploaded byapi-286457563
- 4.PDFUploaded byKingsuk Saha
- Panasonic Kx-tg6711!12!21pd Sm FinalUploaded byscribd
- 9962543636_September_2019_eBill-Vodafone.pdfUploaded byMohan Joseph

## Much more than documents.

Discover everything Scribd has to offer, including books and audiobooks from major publishers.

Cancel anytime.