Uploaded by Muhammad Fadzreen

Topic 2

Understanding your Data

(shuzlina@fskm.uitm.edu.my)

Faculty of Computer and Mathematical Sciences, UiTM

Source: Adapted from Jiawei Han and Micheline Kamber (2012); Tan et al. (2012)

Objectives

Data

Types of Data Sets

Record
  Relational records
  Data matrix, e.g., numerical matrix, crosstabs
  Document data: text documents: term-frequency vector
  Transaction data

Graph and network
  World Wide Web
  Social or information networks
  Molecular Structures

Ordered
  Video data: sequence of images
  Temporal data: time-series
  Sequential Data: transaction sequences
  Genetic sequence data

Spatial, image and multimedia:

Record Data

Data that consists of a collection of records, each of which consists of a fixed set of attributes

Tid  Refund  Marital Status  Taxable Income  Cheat
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes

Data Matrix

If data objects have the same fixed set of numeric attributes, then the data objects can be thought of as points in a multi-dimensional space, where each dimension represents a distinct attribute

Such a data set can be represented by an m-by-n matrix, where there are m rows, one for each object, and n columns, one for each attribute

Projection of x Load  Projection of y Load  Distance  Load  Thickness
10.23                 5.27                  15.22     2.7   1.2
12.65                 6.25                  16.22     2.2   1.1

Document Data

Each document becomes a term vector:

  each term is a component (attribute) of the vector

  the value of each component is the number of times the corresponding term occurs in the document.
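A minimal sketch of building such a term-frequency vector. The vocabulary, the two short documents, and the helper name `term_frequency_vector` are assumptions made up for illustration; tokenization here is plain whitespace splitting, which real text usually needs to improve on.

```python
from collections import Counter


def term_frequency_vector(document, vocabulary):
    """Return, for each vocabulary term, how many times it occurs in the document."""
    counts = Counter(document.lower().split())
    return [counts[term] for term in vocabulary]


# Hypothetical vocabulary and documents, chosen only for illustration.
vocabulary = ["team", "coach", "play", "ball", "score"]
doc1 = "team play ball team score"
doc2 = "coach play play"

vec1 = term_frequency_vector(doc1, vocabulary)
vec2 = term_frequency_vector(doc2, vocabulary)

print(vec1)  # [2, 0, 1, 1, 1]
print(vec2)  # [0, 1, 2, 0, 0]
```

Each document ends up as a point in a space with one dimension per term, which is the same view as the data matrix.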

Transaction Data

A special type of record data, where each record (transaction) involves a set of items.

For example, consider a grocery store. The set of products purchased by a customer during one shopping trip constitutes a transaction, while the individual products that were purchased are the items.

TID  Items
1    Bread, Coke, Milk
2    Beer, Bread
3    Beer, Coke, Diaper, Milk
4    Beer, Bread, Diaper, Milk
5    Coke, Diaper, Milk
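One possible way to hold the five transactions in code is a mapping from TID to a set of items. This is only a sketch; the variable names are illustrative, and counting items is just one simple thing the representation makes easy.

```python
from collections import Counter

# The five grocery transactions listed above, keyed by TID.
transactions = {
    1: {"Bread", "Coke", "Milk"},
    2: {"Beer", "Bread"},
    3: {"Beer", "Coke", "Diaper", "Milk"},
    4: {"Beer", "Bread", "Diaper", "Milk"},
    5: {"Coke", "Diaper", "Milk"},
}

# Number of transactions that contain each item.
item_counts = Counter(item for items in transactions.values() for item in items)

# TIDs of transactions that contain both Diaper and Milk.
diaper_and_milk = sorted(
    tid for tid, items in transactions.items() if {"Diaper", "Milk"} <= items
)

print(item_counts["Milk"])  # 4
print(diaper_and_milk)      # [3, 4, 5]
```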

Graph Data

Examples: Generic graph and HTML Links

[Figure: generic graph with numeric labels]

<a href="papers/papers.html#bbbb">Data Mining </a>
<li>
<a href="papers/papers.html#aaaa">Graph Partitioning </a>
<li>
<a href="papers/papers.html#aaaa">Parallel Solution of Sparse Linear System of Equations </a>
<li>
<a href="papers/papers.html#ffff">N-Body Computation and Dense Linear System Solvers

Other Types of Data

Ordered Data

Sequences of transactions

[Figure: a sequence of transactions; each element of the sequence is a set of items/events]

Genomic sequence data

GGTTCCGCCTTCAGCCCCGCGCC
CGCAGGGCCCGCCCCGCGCCGTC
GAGAAGGGCCCGCCTGGCGGGCG
GGGGGAGGCGGGGCCGCCCGAGC
CCAACCGAGTCCGACCAGGTGCC
CCCTCTGCTCGGCCTAGACCTGA
GCTCATTAGGCGGCAGCGGACAG
GCCAAGTAGAACACGCGAAGCGC
TGGGCTGCCTGCTGCGACCAGGG

Data Objects

A data object represents an entity.

Examples:
  sales database: customers, store items, sales
  medical database: patients, treatments
  university database: students, professors, courses

Also called samples, examples, instances, data points, objects, tuples.

Data objects are described by attributes.

Database rows -> data objects; columns -> attributes.

Attributes

Attribute (or dimension, feature, variable): a data field, representing a characteristic or feature of a data object.

E.g., customer_ID, name, address

Types:
  Nominal
  Binary
  Ordinal
  Numeric: quantitative
    Interval-scaled
    Ratio-scaled

Attribute Types

Nominal: categories, states, or "names of things"

Hair_color = {black, brown, blond, red, auburn, grey, white}

We can assign a code of 0 for black, 1 for brown, and so on

Other examples: marital status, occupation, ID numbers, zip codes

Nominal values do not have any meaningful order about them and are not quantitative

It makes no sense to find the mean (average) value or median (middle) value for such an attribute, given a set of objects.

The exception is the attribute's most commonly occurring value (the mode).
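A short sketch of why the mode is the only one of these summaries that makes sense for a nominal attribute. The observations are made up, and the codes for blond and red are assumptions extending the 0 = black, 1 = brown coding.

```python
from collections import Counter

# Hypothetical hair-colour observations for seven people.
observations = ["brown", "black", "brown", "red", "blond", "brown", "black"]

# Codes are labels only; the values for blond and red are assumed here.
codes = {"black": 0, "brown": 1, "blond": 2, "red": 3}

# The mode is well defined: it is simply the most frequent label.
mode_value, mode_count = Counter(observations).most_common(1)[0]
print(mode_value, mode_count)  # brown 3

# The mean of the codes can be computed, but it changes if the labels are
# renumbered, so it says nothing about hair colour.
coded = [codes[v] for v in observations]
mean_of_codes = sum(coded) / len(coded)
```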

Attribute Types

Binary

  Nominal attribute with only 2 states (0 and 1)

  Symmetric binary: both outcomes equally important, e.g., gender

  Asymmetric binary: outcomes not equally important, e.g., medical test (positive vs. negative)

  Convention: assign 1 to most important outcome (e.g., HIV positive)

Attribute Types

Ordinal

  Values have a meaningful order (ranking) but magnitude between successive values is not known.

  Size = {small, medium, large}, grades, army rankings

  Grade (e.g., A+, A, A-, B+, B, B-, C+, C, C-, D+, D, E, F)

Nominal, binary, and ordinal attributes are qualitative.

They describe a feature of an object, without giving an actual size or quantity.

Numeric Attribute Types

Quantity (represented as an integer or real value)

Interval

  Measured on a scale of equal-sized units

  Values have order

  E.g., temperature in °C or °F, calendar dates

  No true zero-point: neither 0 °C nor 0 °F indicates no temperature.

  We can compute their mean value, in addition to the median and mode measures of central tendency.

Numeric Attribute Types

Ratio

  Inherent zero-point

  We can speak of values as being an order of magnitude larger than the unit of measurement (10 K is twice as high as 5 K).

  e.g., temperature in Kelvin, length, counts, monetary quantities (e.g., you are 100 times richer with $100 than with $1).
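A small worked example of the difference between interval and ratio scales, using two illustrative temperatures. On the Celsius scale the difference is meaningful but the ratio is not; converting to Kelvin (adding 273.15) gives the physically meaningful ratio.

```python
def celsius_to_kelvin(celsius):
    """Convert a temperature from degrees Celsius to Kelvin."""
    return celsius + 273.15


t1_c, t2_c = 5.0, 10.0  # illustrative temperatures in degrees Celsius

# Differences are meaningful on an interval scale.
difference = t2_c - t1_c  # 5.0 degrees

# The Celsius ratio looks like "twice as hot", but Celsius has no true zero.
naive_ratio = t2_c / t1_c  # 2.0

# On the Kelvin (ratio) scale the two temperatures differ by under 2 percent.
kelvin_ratio = celsius_to_kelvin(t2_c) / celsius_to_kelvin(t1_c)

print(naive_ratio)             # 2.0
print(round(kelvin_ratio, 3))  # 1.018
```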

Discrete vs. Continuous Attributes

Discrete Attribute

  Has only a finite or countably infinite set of values

  E.g., zip codes, profession, or the set of words in a collection of documents

  Sometimes, represented as integer variables

  Note: Binary attributes are a special case of discrete attributes

  Note that discrete attributes may have numeric values, such as 0 and 1 for binary attributes, or the values 0 to 110 for the attribute Age.

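Discrete vs. Continuous Attributes

Continuous Attribute

  Has real numbers as attribute values

  Practically, real values can only be measured and represented using a finite number of digits

  Continuous values are real numbers, whereas numeric values can be either integers or real numbers

  Continuous attributes are typically represented as floating-point variables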

Properties of Attribute Values

The type of an attribute depends on which of the following properties it possesses:

  Distinctness: = ≠
  Order: < >
  Addition: + -
  Multiplication: * /

Nominal attribute: distinctness

Ordinal attribute: distinctness & order

Interval attribute: distinctness, order & addition

Ratio attribute: all 4 properties

Attribute Type: Nominal
  Description: The values of a nominal attribute are just different names, i.e., nominal attributes provide only enough information to distinguish one object from another. (=, ≠)
  Examples: zip codes, employee ID numbers, eye color, sex: {male, female}
  Operations: mode, entropy, contingency correlation, χ² test

Attribute Type: Ordinal
  Description: The values of an ordinal attribute provide enough information to order objects. (<, >)
  Examples: hardness of minerals, {good, better, best}, grades, street numbers
  Operations: median, percentiles, rank correlation, run tests, sign tests

Attribute Type: Interval
  Description: For interval attributes, the differences between values are meaningful, i.e., a unit of measurement exists. (+, -)
  Examples: calendar dates, temperature in Celsius or Fahrenheit
  Operations: mean, standard deviation, Pearson's correlation, t and F tests

Attribute Type: Ratio
  Description: For ratio variables, both differences and ratios are meaningful. (*, /)
  Examples: temperature in Kelvin, monetary quantities, counts, age, mass, length, electrical current
  Operations: geometric mean, harmonic mean, percent variation

Examples

Source: http://www.perceptualedge.com/articles/dmreview/quant_vs_cat_data.pdf

BASIC STATISTICAL DESCRIPTIONS OF DATA

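Measuring the Central Tendency

Mean (algebraic measure) (sample vs. population):

Note: n is sample size and N is population size

$$\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i \qquad \mu = \frac{\sum x}{N}$$

Weighted arithmetic mean:

$$\bar{x} = \frac{\sum_{i=1}^{n} w_i x_i}{\sum_{i=1}^{n} w_i}$$

Median: A holistic measure

Middle value if odd number of values, or average of the middle two values otherwise

Estimated by interpolation (for grouped data):

$$\text{median} = L_1 + \left(\frac{n/2 - (\sum f)_l}{f_{\text{median}}}\right) c$$

where $L_1$ is the lower boundary of the median interval, $(\sum f)_l$ is the sum of the frequencies of all intervals below the median interval, $f_{\text{median}}$ is the frequency of the median interval, and $c$ is its width

Mode

Value that occurs most frequently in the data

Unimodal, bimodal, trimodal

Empirical formula for unimodal frequency curves that are moderately skewed:

$$\text{mean} - \text{mode} \approx 3 \times (\text{mean} - \text{median})$$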
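The measures above can be sketched with Python's standard library. The data values, the weights, and the grouped intervals are illustrative assumptions, and `grouped_median` is a hypothetical helper that applies the interpolation rule median = L1 + ((n/2 - cumulative frequency below) / f_median) * c, where L1 is the lower boundary of the median interval and c is its width.

```python
import statistics

values = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]  # illustrative data

mean = sum(values) / len(values)      # arithmetic mean
median = statistics.median(values)    # average of the two middle values (n is even)
modes = statistics.multimode(values)  # every value with the highest frequency

# Weighted arithmetic mean: sum(w_i * x_i) / sum(w_i), with assumed weights.
scores, weights = [80, 90], [1, 3]
weighted_mean = sum(w * x for w, x in zip(weights, scores)) / sum(weights)


def grouped_median(intervals):
    """Interpolated median for grouped data.

    intervals: list of (lower_boundary, width, frequency), sorted by boundary.
    """
    n = sum(freq for _, _, freq in intervals)
    cumulative = 0  # sum of frequencies of intervals below the current one
    for lower, width, freq in intervals:
        if cumulative + freq >= n / 2:  # this is the median interval
            return lower + (n / 2 - cumulative) / freq * width
        cumulative += freq
    raise ValueError("no intervals given")


# Assumed grouped data: intervals 0-10, 10-20, 20-30, 30-40.
grouped = [(0, 10, 5), (10, 10, 8), (20, 10, 12), (30, 10, 5)]

print(mean, median, modes)                 # 58.0 54.0 [52, 70]
print(weighted_mean)                       # 87.5
print(round(grouped_median(grouped), 2))   # 21.67
```

The sample data is bimodal (52 and 70 both occur twice), which is why `multimode` is used rather than a function that returns a single mode.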

Symmetric vs. Skewed Data

Median, mean and mode of symmetric, positively and negatively skewed data

[Figure: three distribution curves: symmetric, positively skewed, negatively skewed]

Measuring the Dispersion of Data

Quartiles: Q1 (25th percentile), Q3 (75th percentile)

Inter-quartile range: IQR = Q3 - Q1

Five number summary: min, Q1, M, Q3, max

Boxplot: ends of the box are the quartiles, median is marked, whiskers, and plot outliers individually

Outlier: usually, a value more than 1.5 x IQR above Q3 or below Q1

Variance: (algebraic, scalable computation)

Sample variance:

$$s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2 = \frac{1}{n-1}\left[\sum_{i=1}^{n} x_i^2 - \frac{1}{n}\left(\sum_{i=1}^{n} x_i\right)^2\right]$$

Population variance:

$$\sigma^2 = \frac{1}{N}\sum_{i=1}^{N}(x_i - \mu)^2 = \frac{1}{N}\sum_{i=1}^{N} x_i^2 - \mu^2$$
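A sketch of these dispersion measures on illustrative data. Quartile conventions differ between tools; the "inclusive" method of `statistics.quantiles` is used here as one reasonable choice, not as the method the slides prescribe. The sample variance is computed twice, once from the definition and once from the one-pass form (sum of squares minus squared sum over n), to show that they agree.

```python
import statistics

data = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]  # illustrative data

# Quartiles and inter-quartile range.
q1, median, q3 = statistics.quantiles(data, n=4, method="inclusive")
iqr = q3 - q1

# Values more than 1.5 x IQR below Q1 or above Q3 are flagged as outliers.
low_fence, high_fence = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = [x for x in data if x < low_fence or x > high_fence]

five_number_summary = (min(data), q1, median, q3, max(data))

# Sample variance from the definition and from the one-pass form.
n = len(data)
mean = sum(data) / n
s2_definition = sum((x - mean) ** 2 for x in data) / (n - 1)
s2_one_pass = (sum(x * x for x in data) - sum(data) ** 2 / n) / (n - 1)

print(five_number_summary)      # (30, 49.25, 54.0, 64.75, 110)
print(outliers)                 # [110]
print(round(s2_definition, 2))  # 413.64
```

The one-pass form only needs the running totals of x and x squared, which is what makes the computation scalable, although it can lose precision in floating point when the mean is large relative to the spread.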

Properties of Normal Distribution Curve

From μ-σ to μ+σ: contains about 68% of the measurements (μ: mean, σ: standard deviation)

From μ-2σ to μ+2σ: contains about 95% of it

From μ-3σ to μ+3σ: contains about 99.7% of it
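These percentages can be checked numerically. For a normal distribution, the probability of falling within k standard deviations of the mean is erf(k / sqrt(2)); the helper name `prob_within` is just an illustrative choice.

```python
import math


def prob_within(k):
    """Probability that a normal variable lies within k standard deviations of its mean."""
    return math.erf(k / math.sqrt(2))


for k in (1, 2, 3):
    print(k, round(prob_within(k), 4))
# 1 0.6827
# 2 0.9545
# 3 0.9973
```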

Boxplot Analysis

Five-number summary of a distribution: Minimum, Q1, M, Q3, Maximum

Boxplot

  Data is represented with a box

  The ends of the box are at the first and third quartiles, i.e., the height of the box is IQR

  The median is marked by a line within the box

  Whiskers: two lines outside the box extend to Minimum and Maximum

Visualization of Data Dispersion: Boxplot Analysis

Histogram Analysis

Histogram: graph display of tabulated frequencies, shown as bars

It shows what proportion of cases fall into each of several categories

Differs from a bar chart in that it is the area of the bar that denotes the value, not the height as in bar charts, a crucial distinction when the categories are not of uniform width

The categories are usually specified as non-overlapping intervals of some variable. The categories (bars) must be adjacent
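To illustrate the area-versus-height point, the sketch below bins illustrative values into intervals of unequal width and computes a bar height (density) so that each bar's area equals the proportion of cases in its bin. The function name, data, and bin edges are assumptions for the example.

```python
def histogram_density(values, edges):
    """Return a (count, density) pair per bin.

    Bins are [edges[i], edges[i+1]); the last bin also includes its right edge.
    Density is count / (n * width), so bar area equals the bin's proportion.
    """
    n = len(values)
    result = []
    for i in range(len(edges) - 1):
        lo, hi = edges[i], edges[i + 1]
        is_last = i == len(edges) - 2
        count = sum(1 for v in values if lo <= v < hi or (is_last and v == hi))
        result.append((count, count / (n * (hi - lo))))
    return result


values = [1, 2, 2, 3, 4, 6, 7, 12, 15, 18]  # illustrative data
edges = [0, 5, 10, 20]                      # the last bin is twice as wide

bins = histogram_density(values, edges)
counts = [count for count, _ in bins]
areas = [density * (edges[i + 1] - edges[i]) for i, (_, density) in enumerate(bins)]

print(counts)  # [5, 2, 3]
```

The third bin holds more cases than the second, yet its bar is lower because it is twice as wide; only the areas (which sum to 1) compare the bins fairly.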

Histograms Often Tell More than Boxplots

The two histograms shown on the left may have the same boxplot representation

  The same values for: min, Q1, median, Q3, max

But they have rather different data distributions

Quantile Plot

Displays all of the data (allowing the user to assess both the overall behavior and unusual occurrences)

Plots quantile information

For data x_i sorted in increasing order, f_i indicates that approximately 100 f_i% of the data are below or equal to the value x_i
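A sketch of how the plotted pairs might be computed. The slide does not say how f_i is obtained; the formula f_i = (i - 0.5) / n used here is a common convention and is an assumption, as are the function name and data.

```python
def quantile_plot_points(data):
    """Return (f_i, x_i) pairs for a quantile plot, with f_i = (i - 0.5) / n (assumed)."""
    xs = sorted(data)
    n = len(xs)
    return [((i - 0.5) / n, x) for i, x in enumerate(xs, start=1)]


points = quantile_plot_points([7, 3, 9, 5])  # illustrative data
print(points)  # [(0.125, 3), (0.375, 5), (0.625, 7), (0.875, 9)]
```

Plotting x_i against f_i shows every observation, so gaps and unusual values remain visible in a way a five-number summary hides.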

Quantile-Quantile (Q-Q) Plot

Graphs the quantiles of one univariate distribution against the corresponding quantiles of another

View: Is there a shift in going from one distribution to another?

Example shows unit price of items sold at Branch 1 vs. Branch 2 for each quantile. Unit prices of items sold at Branch 1 tend to be lower than those at Branch 2.
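A minimal sketch of the pairs behind a Q-Q plot. The two price lists are invented for illustration (they are not the data behind the slide's figure), and only the three quartiles are paired to keep the example short.

```python
import statistics

# Hypothetical unit prices at two branches.
branch1 = [40, 43, 47, 52, 56, 60, 64, 70, 78, 80]
branch2 = [45, 50, 52, 58, 62, 66, 72, 78, 84, 90]

# Matching quantiles of each distribution (here the quartiles).
quantiles1 = statistics.quantiles(branch1, n=4, method="inclusive")
quantiles2 = statistics.quantiles(branch2, n=4, method="inclusive")

# Each pair is one point of the Q-Q plot: (Branch 1 quantile, Branch 2 quantile).
qq_points = list(zip(quantiles1, quantiles2))

# Points above the line y = x mean Branch 2 prices are higher at that quantile.
shifted_up = all(q2 > q1 for q1, q2 in qq_points)
print(shifted_up)  # True
```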

Scatter plot

Provides a first look at bivariate data to see clusters of points, outliers, etc.

Each pair of values is treated as a pair of coordinates and plotted as points in the plane

Positively and Negatively Correlated Data

If the pattern of plotted points slopes from lower left to upper right, this means that the values of X increase as the values of Y increase, which suggests a positive correlation.

If the pattern of plotted points slopes from upper left to lower right, then the values of X increase as the values of Y decrease, suggesting a negative correlation.

[Figure: scatter plots; the right half is negatively correlated]
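The direction of the slope can also be summarized numerically. The sketch below uses Pearson's correlation coefficient as one way to do that (an assumption; the slides only describe the visual pattern), on made-up data where one series rises with x and the other falls.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    covariation = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return covariation / (spread_x * spread_y)


x = [1, 2, 3, 4, 5]
rising = [2, 4, 5, 4, 6]   # slopes from lower left to upper right
falling = [9, 7, 6, 4, 1]  # slopes from upper left to lower right

r_rising = pearson_r(x, rising)
r_falling = pearson_r(x, falling)

print(r_rising > 0, r_falling < 0)  # True True
```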

Uncorrelated Data

Exercise 1

value?

What are the lower and upper values?

What are the outlier values?

Interpret the box-and-whisker plot.

SEVERAL TYPES OF DATA VISUALIZATION

Data Visualization

Gain insight into an information space by mapping data onto graphical primitives

Provide qualitative overview of large data sets

Search for patterns, trends, structure, irregularities, relationships among data

Help find interesting regions and suitable parameters for further quantitative analysis

Provide a visual proof of computer representations derived

Categorization of visualization methods:
  Geometric techniques
  Icon-based techniques
  Hierarchical techniques

Geometric Techniques

Visualization of geometric transformations and projections of the data

Methods
  Direct data visualization
  Scatterplot matrices
  Landscapes
  Projection pursuit technique: finding meaningful projections of multidimensional data
  Prosection views
  Hyperslice
  Parallel coordinates

Scatterplot Matrices

Matrix of scatterplots (x-y diagrams) of the k-dimensional data [a total of (k² - k)/2 distinct scatterplots]
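The number of distinct scatterplots in such a matrix is the number of unordered attribute pairs, k(k - 1)/2, which equals (k² - k)/2. A quick check with placeholder attribute names (the names are assumptions for illustration):

```python
from itertools import combinations

attributes = ["a1", "a2", "a3", "a4"]  # hypothetical attribute names, k = 4
k = len(attributes)

# Every unordered pair of attributes gives one distinct scatterplot.
pairs = list(combinations(attributes, 2))

print(len(pairs))        # 6
print(k * (k - 1) // 2)  # 6
```

A full k-by-k grid shows each pair twice (mirrored across the diagonal), so it contains k² - k off-diagonal panels but only half as many distinct plots.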

Landscapes

[Figure: news articles visualized as a landscape. Used by permission of B. Wright, Visible Decisions Inc.]

The data needs to be transformed into a (possibly artificial) 2D spatial representation which preserves the characteristics of the data

Icon-based Techniques

Visualization of the data values as features of icons

Typical visualization methods:
  Chernoff Faces
  Stick Figures

General techniques
  Shape Coding: Use shape to represent certain information encoding
  Color Icons: Using color icons to encode more information
  TileBars: The use of small icons representing the relevant feature vectors in document retrieval

Chernoff Faces

A way to display variables on a two-dimensional surface, e.g., let x be eyebrow slant, y be eye size, z be nose length, etc.

The figure shows faces produced using 10 characteristics (head eccentricity, eye size, eye spacing, eye eccentricity, pupil size, eyebrow slant, nose size, mouth shape, mouth size, and mouth opening), each assigned one of 10 possible values, generated using Mathematica (S. Dickson)

The Cartoon Guide to Statistics. New York: Harper Perennial, p. 212, 1993

Weisstein, Eric W. "Chernoff Face." From MathWorld--A Wolfram Web Resource. mathworld.wolfram.com/ChernoffFace.html

Hierarchical Techniques

Visualization of the data using a hierarchical partitioning into subspaces.

Methods
  Dimensional Stacking
  Worlds-within-Worlds
  Tree-Map
  Cone Trees
  InfoCube

Dimensional Stacking

[Figure: nested grid with axes labeled attribute 1, attribute 2, attribute 3 and attribute 4]

Partitioning of the n-dimensional attribute space in 2-D subspaces, which are stacked into each other

Partitioning of the attribute value ranges into classes. The important attributes should be used on the outer levels.

Adequate for data with ordinal attributes of low cardinality

But, difficult to display more than nine dimensions

Important to map dimensions appropriately

Dimensional Stacking

Visualization of oil mining data with longitude and latitude mapped to the outer x-, y-axes and ore grade and depth mapped to the inner x-, y-axes

Tree-Map

Screen-filling method which uses a hierarchical partitioning of the screen into regions depending on the attribute values

The x- and y-dimension of the screen are partitioned alternately according to the attribute values (classes)
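The alternating partitioning can be sketched as a small recursive layout, often called slice-and-dice. This is an illustrative sketch under assumed conventions, not the exact procedure of any particular tool: a node is a `(name, size)` pair for a leaf or a `(name, [children])` pair for an inner node, and the tree and screen size are made up.

```python
def weight(node):
    """Total size of a node: its own size for a leaf, the sum over children otherwise."""
    _, payload = node
    if isinstance(payload, list):
        return sum(weight(child) for child in payload)
    return payload


def slice_and_dice(node, x, y, w, h, split_x=True, out=None):
    """Assign a rectangle (x, y, width, height) to every leaf of the tree.

    The split direction alternates between the x- and y-dimension per level.
    """
    if out is None:
        out = {}
    name, payload = node
    if not isinstance(payload, list):  # leaf: it fills the rectangle it was given
        out[name] = (x, y, w, h)
        return out
    total = weight(node)
    offset = 0.0  # fraction of the rectangle already handed out
    for child in payload:
        share = weight(child) / total
        if split_x:
            slice_and_dice(child, x + offset * w, y, share * w, h, False, out)
        else:
            slice_and_dice(child, x, y + offset * h, w, share * h, True, out)
        offset += share
    return out


# Hypothetical hierarchy: leaf "a" (size 4) and group "b" with leaves of size 3 and 1.
tree = ("root", [("a", 4), ("b", [("b1", 3), ("b2", 1)])])
layout = slice_and_dice(tree, 0, 0, 100, 40)

print(layout["a"])   # (0.0, 0, 50.0, 40)
print(layout["b1"])  # (50.0, 0.0, 50.0, 30.0)
print(layout["b2"])  # (50.0, 30.0, 50.0, 10.0)
```

Each leaf's area is proportional to its size (2000, 1500 and 500 screen units for sizes 4, 3 and 1), and together the leaves fill the whole 100 x 40 screen.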

Tree-Map of a File System (Shneiderman)

Three-D Cone Trees

3D cone tree visualization technique works well for up to a thousand nodes or so

First build a 2D circle tree that arranges its nodes in concentric circles centered on the root node

Cannot avoid overlaps when projected to 2D

G. Robertson, J. Mackinlay, S. Card. Cone Trees: Animated 3D Visualizations of Hierarchical Information, ACM SIGCHI'91

Graph from Nadeau Software Consulting website: Visualize a social network data set that models the way an infection spreads from one person to the next

InfoCube

A 3-D visualization technique where hierarchical information is displayed as nested semi-transparent cubes

The outermost cubes correspond to the top level data, while the subnodes or the lower level data are represented as smaller cubes inside the outermost cubes, and so on

Source of Public Datasets

UCI Machine Learning Repository
http://archive.ics.uci.edu/ml/

Datasets for Data Mining, The University of Edinburgh
http://www.inf.ed.ac.uk/teaching/courses/dme/html/datasets0405.html

Google Public Data
http://www.google.com/publicdata/directory

Kent Ridge Bio-medical Dataset
http://datam.i2r.a-star.edu.sg/datasets/krbd/

Frequent Itemset Mining Dataset Repository
http://fimi.ua.ac.be/data/

Bioinformatics Datasets
http://www.kent.ac.uk/library/subjects/biosciences/bioinformatics.html?tab=genomes

Summary

Data attribute types: nominal, binary, ordinal, interval-scaled, ratio-scaled

Many types of data sets, e.g., numerical, text, graph, Web, image.

Gain insight into the data by:
  Basic statistical data description: central tendency, dispersion, graphical displays
  Data visualization: map data onto graphical primitives
  Measure data similarity

Above steps are the beginning of data preprocessing.
