
Concept and Applications of Data Mining
Week 1

Topics

Introduction
Syllabus
Data Mining Concepts
Team Organization

Introduction Session

Your name and major
The definition of data mining
Your expectation from this course

Course Syllabus

Syllabus

Data Mining Applications

Classes of Data-Mining Applications in 2003

Source: www.kdnuggets.com

Data Mining Applications          Percentage
Banking                           13
Bioinformatics/biotech            10
Direct marketing/fundraising      10
Fraud detection
Scientific data
Insurance
Telecommunication
Medical/pharmaceuticals
Retail
eCommerce/Web
Other
Investment/stocks
Manufacturing
Security
Supply chain analysis
Travel
Entertainment

Newsweek, May 22, 2006

Market Basket Analysis

Chemistry Informatics

Figure 9.14 A chemical database.

What is Data Mining?

Source: Cover page of Advances in Knowledge Discovery and Data Mining,
edited by U. Fayyad, G. Piatetsky-Shapiro, P. Smyth and R. Uthurusamy, MIT Press

How Much Information in 2003

http://www.sims.berkeley.edu/research/projects/how-much-info-2003/

What is Data Mining?


Misnomer??
Gold Mining vs. Sand (Rock) Mining
Knowledge Discovery from Data (KDD)
Knowledge extraction
Data/pattern analysis
Data archaeology
Data dredging

Data Mining is an Interdisciplinary and Multidisciplinary Field

Diagram: DATA MINING, drawing on DATABASE TECHNOLOGY, STATISTICS & MATH, MACHINE LEARNING, INFORMATION RETRIEVAL, INFORMATION THEORY, and OTHER DISCIPLINES

Figure 1.1 The evolution of database system technology

Data Mining is a Process of Knowledge Discovery

Figure 1.4 Data mining as a step in the process of knowledge discovery

Architecture of a Data Mining System

Graphical User Interface
Pattern/Model Evaluation
Data Mining Engine
Knowledge Base
Database or Data Warehouse Server
data cleaning, integration, and selection
Database, Data Warehouse, World-Wide Web, Other Info Repositories

Figure 1.5 Architecture of a typical data mining system

Data Mining and Stakeholders

Layers, from top (highest potential to support business decisions) to bottom:

Making Decisions (End User)
Data Presentation: Visualization Techniques (Business Analyst)
Data Mining: Knowledge Discovery (Data Analyst)
Data Exploration: Statistical Analysis, Querying and Reporting (Data Analyst)
Data Warehouses / Data Marts: OLAP (DBA)
Data Sources: Paper, Files, Information Providers, Database Systems, OLTP (DBA)

Data Types - Perspective on Structure

Structured
Semistructured
Unstructured

Structured Data (1)

Data is organized in semantic entities
Similar entities are grouped together (relations or classes)
Entities in the same group have the same descriptions (attributes, features)

Structured Data (2)


Descriptions for all entities in a group (schema)
Attributes:
Have the same defined formats
Have predefined lengths
Follow the same order

Semi-structured Data (1)

Semistructured data are organized in semantic entities
Similar entities are grouped together
Entities in the same group may not have the same attributes

Semi-structured Data (2)

Attributes:
Order of attributes is not necessarily important
Not all attributes may be required
Size of the same attributes in a group may differ
Type of the same attributes in a group may differ

XML
<bank1>
  <customer>
    <customer_name>Hayes</customer_name>
    <customer_street>Main</customer_street>
    <customer_city>Harrison</customer_city>
    <account>
      <account_number>A102</account_number>
      <branch_name>Perryridge</branch_name>
      <balance>400</balance>
    </account>
    <account>

    </account>
  </customer>
  .
  .
</bank1>
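
A minimal Python sketch of how a program might read a semi-structured fragment like the one above. It assumes a well-formed copy of the slide's XML with the elided parts left out; the helper name customer_accounts is hypothetical. The second, empty account has no account_number or balance, so those fields come back as None. This is the "not all attributes may be required" property in practice.

```python
import xml.etree.ElementTree as ET

# Well-formed copy of the slide's fragment; the elided parts ("." lines) are omitted.
BANK_XML = """
<bank1>
  <customer>
    <customer_name>Hayes</customer_name>
    <customer_street>Main</customer_street>
    <customer_city>Harrison</customer_city>
    <account>
      <account_number>A102</account_number>
      <branch_name>Perryridge</branch_name>
      <balance>400</balance>
    </account>
    <account></account>
  </customer>
</bank1>
"""


def customer_accounts(xml_text):
    """Return one dict per <account>, tolerating missing attributes."""
    root = ET.fromstring(xml_text)
    rows = []
    for customer in root.iter("customer"):
        name = customer.findtext("customer_name")
        for account in customer.iter("account"):
            rows.append({
                "customer": name,
                # findtext returns None when the child element is absent
                "account_number": account.findtext("account_number"),
                "balance": account.findtext("balance"),
            })
    return rows


accounts = customer_accounts(BANK_XML)
for row in accounts:
    print(row)
```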

Unstructured Data (1)

Masses of computerized data
which do not have a data structure
which is easily readable by a machine

Unstructured Data (2)

Merrill Lynch estimates that more than 85 percent of
all business information exists as unstructured data,
commonly appearing in emails, memos, notes from
call centers and support operations, news, user
groups, chats, reports, letters, surveys, white papers,
marketing material, research, presentations and Web
pages. (DM Review Magazine, February 2003 Issue)

Data Types - Perspective on Representation

Numeric and categorical
Quantitative and qualitative
Nominal and ordinal
Static and dynamic (temporal)

Numeric and Categorical Data (1)

Numeric data
Real number data, integer number data
Properties:
Order relation (2 < 5)
Distance relation (d(2.3, 4.2) = 1.9)
Equality relation (2 = 2)
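
The three properties of numeric data can be checked directly in Python. This is a small sketch that assumes d is the absolute difference; the slide does not define d, but that choice reproduces its value of 1.9. Floating-point subtraction is not exact, so the result is compared with a tolerance.

```python
import math


def d(x, y):
    """Distance between two numbers (assumed here: absolute difference)."""
    return abs(x - y)


order_holds = 2 < 5          # order relation
distance = d(2.3, 4.2)       # distance relation
equality_holds = (2 == 2)    # equality relation

print(order_holds, round(distance, 1), equality_holds)
```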

Numeric and Categorical Data (2)

Categorical (symbolic) values
Equality relation only
Blue = Blue or Red <> Blue

Categorical values can be converted to numeric values
Gender (male, female) -> (0, 1)
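
A short Python sketch of the conversion above, using the slide's coding male -> 0 and female -> 1. The sample list and the helper name encode are hypothetical. The codes only label the categories: an order or distance between 0 and 1 carries no meaning here.

```python
# Coding taken from the slide: Gender (male, female) -> (0, 1)
GENDER_CODES = {"male": 0, "female": 1}


def encode(values, codes):
    """Replace each categorical value with its numeric code."""
    return [codes[value] for value in values]


sample = ["female", "male", "male", "female"]  # hypothetical data
encoded = encode(sample, GENDER_CODES)
print(encoded)

# Categorical values support only the equality relation
same = ("Blue" == "Blue")
different = ("Red" != "Blue")
```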

Quantitative and Qualitative Data


Quantitative data
Numeric values are quantitative values
Height, weight, salary

Qualitative data
Nominal
Ordinal

Nominal Data
Utility customer type (residential, commercial,
industrial, governmental)
Use different symbols, characters, and
numbers
These values can be coded alphabetically as A,
B, and C, or numerically as 1, 2, and 3
Orderless

Ordinal Data
The rank of the student in a class
An ordinal variable is a categorical variable for
which an order relation is defined but not a
distance relation
The ordered scale need not be linear:
the difference between the 4th and 5th students
may differ from that between the 14th and 15th students
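
A small Python sketch of why ranks carry order but not distance. The student names and scores are hypothetical. Adjacent ranks always differ by one, while the score gaps behind them can be very different.

```python
# Hypothetical exam scores
scores = {"Ann": 98, "Bob": 91, "Cy": 90, "Dee": 75}

# Rank students from highest to lowest score (rank 1 is best)
ranked = sorted(scores, key=scores.get, reverse=True)
rank = {name: position + 1 for position, name in enumerate(ranked)}

# Score gap between each pair of adjacent ranks
gaps = [scores[ranked[i]] - scores[ranked[i + 1]]
        for i in range(len(ranked) - 1)]

print(rank)   # the order relation is meaningful
print(gaps)   # the gaps behind equal rank steps are unequal
```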

Static and Dynamic Data

Static data
Attribute values do not change with time

Dynamic data
Attribute values change with time

Data Repositories
Transactional database
Relational database
Data warehouse
Advanced database
Data stream
The World Wide Web

Transactional Database
TID     List of item_IDs
T100    I1, I2, I5
T200    I2, I4
T300    I2, I3
T400    I1, I2, I4
T500    I1, I3
T600    I2, I3
T700    I1, I3
T800    I1, I2, I3, I5
T900    I1, I2, I3

Table 5.1 Transactional data for an AllElectronics branch
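
A minimal Python sketch of the counting step behind market basket analysis, applied to the nine transactions of Table 5.1. It counts how many transactions contain each item and each pair of items. The variable names are illustrative, and this is only the counting step, not a full frequent-itemset algorithm.

```python
from collections import Counter
from itertools import combinations

# Transactions from Table 5.1
transactions = {
    "T100": ["I1", "I2", "I5"],
    "T200": ["I2", "I4"],
    "T300": ["I2", "I3"],
    "T400": ["I1", "I2", "I4"],
    "T500": ["I1", "I3"],
    "T600": ["I2", "I3"],
    "T700": ["I1", "I3"],
    "T800": ["I1", "I2", "I3", "I5"],
    "T900": ["I1", "I2", "I3"],
}

item_counts = Counter()
pair_counts = Counter()
for items in transactions.values():
    unique_items = sorted(set(items))
    item_counts.update(unique_items)                   # single-item support counts
    pair_counts.update(combinations(unique_items, 2))  # two-item support counts

for item, count in sorted(item_counts.items()):
    print(item, count)
for pair, count in pair_counts.most_common(3):
    print(pair, count)
```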



Figure 1.6. Fragments of Relations From a Relational Database for AllElectronics

Data Warehouse (Mart)

Figure 1.7 Typical framework of a data warehouse for AllElectronics



Table 3.1 Comparison between OLTP and OLAP systems


Star Schema of a Data Warehouse for Sales

Figure 3.4 Star schema of a data warehouse for sales


Table 3.3 A 3-D view of sales data for AllElectronics, according to the
dimensions time, item, and location. The measure displayed is dollar_sold (in
thousands).

Data Cube for Sales

Figure 3.1 A 3-D data cube representation of the data in Table 3.3,
according to the dimensions time, item, and location. The measure
displayed is dollar_sold (in thousands).


Figure 3.10. Examples of Typical OLAP operations on multidimensional data cube,
commonly used for data warehousing

Advanced Databases

Object-relational databases
Temporal databases
Sequence databases
Time-series databases
Spatial databases
Spatiotemporal databases
Text databases
Heterogeneous databases

Data Streams

The features of a data stream: huge or possibly
infinite volume, dynamically changing, flowing
in and out in a fixed order, allowing only one
or a small number of scans, and demanding
fast (often real-time) response time
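
A minimal Python sketch of what "only one scan" means in practice: a running mean that reads each value once, keeps constant memory, and can report an answer at any point. The generator sensor_readings and its values are hypothetical stand-ins for a real stream.

```python
def running_mean(stream):
    """Yield the mean of all values seen so far, using a single pass."""
    count = 0
    mean = 0.0
    for value in stream:
        count += 1
        mean += (value - mean) / count  # incremental update, no stored history
        yield mean


def sensor_readings():
    """Hypothetical stream; a real one could be unbounded."""
    for value in [4.0, 6.0, 5.0, 9.0]:
        yield value


means = list(running_mean(sensor_readings()))
print(means)
```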

The World Wide Web (1)

The WWW serves as a huge, distributed, global
information service center for news,
advertisements, consumer information,
financial management, education,
government, e-commerce, and many other
information services

The WWW (2)

The challenges for KD
Size
Complexity
Dynamic
Diversity
Relevance

Lab Activities
Introduction to R
Organize your team
Each team consists of three (four) students
Email your team information (names and email addresses) to
the instructor by the end of today's lab session

Read chapter 2 of the lecture textbook and do team
homework assignment #1
Read chapters 1, 2 and 3 of the lab textbook
Brainstorm on the topic of your group project

(Team) Homework Assignment #1


Do Example 2.1, 2.6, 2.7, and Exercise 2.18.
Note that you need to use R for 2.18 (b).
Prepare for the results of the homework
assignment
Due date:
beginning of the lecture on Friday, February 4th.

Next Week Topics

Data types and data repositories (Section 1.3)
Data preprocessing (Ch. 2)