Course Objectives :

1.  To identify the scope and essentiality of Data Warehousing and Mining.
2.  To analyze data, choose relevant models and algorithms for respective applications.
3.  To study spatial and web data mining.
4.  To develop research interest towards advances in data mining.

Course Outcomes : On successful completion of course learner will be able to :

1.  Understand Data Warehouse fundamentals, Data Mining Principles.
2.  Design data warehouse with dimensional modelling and apply OLAP operations.
3.  Identify appropriate data mining algorithms to solve real world problems.
4.  Compare and evaluate different data mining techniques like classification, prediction,
    clustering and association rule mining.
5.  Describe complex data types with respect to spatial and web mining.
6.  Benefit the user experiences towards research and innovation.

Prerequisite : Basic database concepts, Concepts of algorithm design and analysis
Module No. | Topics | Hrs.

1.0 | Introduction to Data Warehouse and Dimensional Modelling : Introduction to Strategic Information, Need for Strategic Information, Features of Data Warehouse, Data warehouses versus Data Marts, Top-down versus Bottom-up approach, Data warehouse architecture, metadata, E-R modelling versus Dimensional Modelling, Information Package Diagram, STAR schema, STAR schema keys, Snowflake Schema, Fact Constellation Schema, Factless Fact tables, Update to the dimension tables, Aggregate fact tables. (Refer chapter 1) | 06

2.0 | ETL Process and OLAP : Major steps in ETL process, Data extraction : Techniques, Data transformation : Basic tasks, Major transformation types, Data Loading : Applying Data, OLTP Vs OLAP, OLAP definition, Dimensional Analysis, Hypercubes, OLAP operations : Drill down, Roll up, Slice, Dice and Rotation, OLAP models : MOLAP, ROLAP. (Refer chapter 2) | 08

3.0 | Introduction to Data Mining, Data Exploration and Preprocessing : Data Mining Task Primitives, Architecture, Techniques, KDD process, Issues in Data Mining, Applications of Data Mining, Data Exploration : Types of Attributes, Statistical Description of Data, Data Visualization, Data Preprocessing : Cleaning, Integration, Reduction : Attribute subset selection, Histograms, Clustering and Sampling, Data Transformation and Data Discretization : Normalization, Binning, Concept hierarchy generation, Concept Description : Attribute oriented Induction for Data Characterization. (Refer chapter 3) | 10

4.0 | Classification, Prediction and Clustering : Basic Concepts, Decision Tree using Information Gain, Induction : Attribute Selection Measures, Tree pruning, Bayesian Classification : Naive Bayes Classifier, Rule-Based Classification : Using IF-THEN Rules for classification, Prediction : Simple linear regression, Multiple linear regression, Model Evaluation and Selection : Accuracy and Error measures, Holdout, Random Sampling, Cross Validation, Bootstrap, Clustering : Distance Measures, Partitioning Methods (k-Means, k-Medoids), Hierarchical Methods (Agglomerative, Divisive). (Refer chapter 4) | 12

5.0 | Mining Frequent Patterns and Association Rules : Market Basket Analysis, Frequent Item sets, Closed Item sets and Association Rule, Frequent Pattern Mining, Efficient and Scalable Frequent Item set Mining Methods : Apriori Algorithm, Association Rule Generation, Improving the Efficiency of Apriori, FP growth, Mining frequent Itemsets using Vertical Data Format, Introduction to Mining Multilevel Association Rules and Multidimensional Association Rules. (Refer chapter 5) | --

6.0 | Spatial and Web Mining : Spatial Data, Spatial Vs. Classical Data Mining, Spatial Data Structures, Mining Spatial Association and Co-location Patterns, Spatial Clustering Techniques : CLARANS Extension, Web Mining : Web Content Mining, Web Structure Mining, Web Usage mining, Applications of Web Mining. (Refer chapter 6) | --

Total | 52

(Book Code : ME40A)
Table of Contents

Module 1

Chapter 1 :  Introduction to Data Warehouse and Dimensional Modelling    1-1 to 1-69

Syllabus : Introduction to Strategic Information, Need for Strategic Information, Features of Data
Warehouse, Data warehouses versus Data Marts, Top-down versus Bottom-up approach, Data
warehouse architecture, metadata, E-R modelling versus Dimensional Modelling, Information
Package Diagram, STAR schema, STAR schema keys, Snowflake Schema, Fact Constellation
Schema, Factless Fact tables, Update to the dimension tables, Aggregate fact tables.

1.1     Introduction to Strategic Information
1.2     Need for Strategic Information
        1.2.1   Desired Characteristics of Strategic Information
        1.2.2   Why are Operational Systems not Suitable for Providing Strategic Information?
        1.2.3   Operational v/s Decisional Support System (May 2016)
1.3     Introduction of Data Warehouse (May 2010, Dec. 2010, May 2011, May 2012, Dec. 2013)
        1.3.1   Definition of Data Warehouse
        1.3.2   Benefits of Data Warehousing
        1.3.3   Features of a Data Warehouse (Dec. 2010, Dec. 2014)
1.4     Data Warehouses versus Data Marts (May 2013, May 2016)
1.5     Top-Down versus Bottom-Up Approach
        1.5.1   The Top-Down Approach : The Dependent Data Mart Structure
                (May 2010, Dec. 2010, May 2011, May 2012)
        1.5.2   The Bottom-Up Approach : The Data Warehouse Bus Structure
                (May 2010, Dec. 2010, May 2011, May 2012)
        1.5.3   Hybrid Approach
        1.5.4   Federated Approach
        1.5.5   A Practical Approach
1.6     Data Warehouse Architecture
        1.6.1   The Information Flow Mechanism
        1.6.2   Data Warehouse Architecture (May 2010, Dec. 2010, May 2011, Dec. 2011,
                May 2012, May 2013, May 2014, Dec. 2014, May 2016)
        1.6.3   Three-Tier / Multi-tier Data Warehouse Architecture (Dec. 2013)
1.7     Metadata (Dec. 2010, Dec. 2012, May 2013, May 2014, Dec. 2014)
        1.7.1   Definition (May 2010, Dec. 2011, May 2012, Dec. 2013)
        1.7.2   Data Warehouse Metadata (May 2010, May 2012, Dec. 2013)
        1.7.3   Classification of Metadata or Types of Metadata in Data Warehouse
                (May 2010, Dec. 2011, May 2012, Dec. 2013)
1.8     E-R Modelling versus Dimensional Modelling
        1.8.1   What is Dimensional Modelling? (Dec. 2011, May 2012)
        1.8.2   Difference between Data Warehouse Modelling and Operational Database Modelling
        1.8.3   Comparison of Database and Data Warehouse Database
        1.8.4   Comparison between Dimensional Model and ER Model
1.9     Information Package Diagram
1.10    The Star Schema (Dec. 2010, May 2012, Dec. 2014)
1.11    STAR Schema Keys
1.12    The Snowflake Schema (May 2011, May 2012, Dec. 2012, May 2013, May 2014, Dec. 2014)
        1.12.1  Differentiate between Star Schema and Snowflake Schema
        1.12.2  Steps of Designing a Dimensional Model
1.13    Fact Constellation Schema or Families of Star (May 2012, May 2014)
1.14    Factless Fact Tables
        1.14.1  Fact Tables and Dimension Tables
        1.14.2  Factless Fact Table (Dec. 2012, May 2013)
1.15    Update to the Dimension Tables (May 2016)
        1.15.1  Slowly Changing Dimensions
        1.15.2  Large Dimension Tables
        1.15.3  Rapidly Changing or Large Slowly Changing Dimensions
        1.15.4  Junk Dimensions
1.16    Aggregate Fact Tables (May 2014)
1.17    Examples on Star Schema and Snowflake Schema
1.18    University Questions and Answers
        Chapter Ends

Module 2

Chapter 2 :  ETL Process and OLAP    2-1 to 2-35

Syllabus : Major steps in ETL process, Data extraction : Techniques, Data transformation : Basic
tasks, Major transformation types, Data Loading : Applying Data, OLTP Vs OLAP, OLAP definition,
Dimensional Analysis, Hypercubes, OLAP operations : Drill down, Roll up, Slice, Dice and Rotation,
OLAP models : MOLAP, ROLAP.

2.1     An Introduction to ETL Process (May 2012)
        2.1.1   What is an ETL Tool?
        2.1.2   Desired Features
2.2     Major Steps in ETL Process (May 2012)
2.3     Data Extraction (May 2016)
2.4     Identification of Data Sources
2.5     Data in Operational Systems (May 2010, Dec. 2010, May 2011, Dec. 2011, May 2012,
        Dec. 2012, May 2013, Dec. 2013, May 2014, Dec. 2014, May 2016)
        2.5.1   Immediate Data Extraction
        2.5.2   Deferred Data Extraction
2.6     Data Transformation : Tasks Involved in Data Transformation (May 2010, Dec. 2010,
        May 2011, Dec. 2011, May 2012, Dec. 2012, May 2013, Dec. 2013, May 2014, Dec. 2014,
        May 2016)
        2.6.1   The Set of Basic Tasks
2.7     Data Integration and Consolidation (May 2010, Dec. 2010, May 2011, Dec. 2011,
        May 2012, Dec. 2012, May 2013, Dec. 2013, May 2014, Dec. 2014, May 2016)
2.8     Data Loading : Techniques of Data Loading (May 2010, Dec. 2010, May 2011, Dec. 2011,
        May 2012, Dec. 2012, May 2013, Dec. 2013, May 2014, Dec. 2014, May 2016)
        2.8.1   Loading the Dimension Tables
2.9     Data Cleansing
        2.9.1   Reasons for "Dirty" Data
        2.9.2   Data Cleansing
2.10    Sample ETL Tools
2.11    Need for Online Analytical Processing
2.12    OLTP v/s OLAP (May 2011)
2.13    OLAP and Multidimensional Analysis
2.14    Hypercubes
2.15    OLAP Operations : Drill down, Roll up, Slice, Dice and Rotation
2.16    OLAP Models : MOLAP, ROLAP
        2.16.4  DOLAP
2.17    Examples of OLAP
2.18    University Questions and Answers
        Chapter Ends

Module 3

Chapter 3 :  Introduction to Data Mining, Data Exploration and Preprocessing    3-1 to 3-64

Syllabus : Data Mining Task Primitives, Architecture, Techniques, KDD process, Issues in Data
Mining, Applications of Data Mining, Data Exploration : Types of Attributes, Statistical Description of
Data, Data Visualization, Data Preprocessing : Cleaning, Integration, Reduction : Attribute subset
selection, Histograms, Clustering and Sampling, Data Transformation and Data Discretization :
Normalization, Binning, Concept hierarchy generation, Concept Description : Attribute oriented
Induction for Data Characterization.

3.1     What is Data Mining? (Dec. 2011)
3.2     Data Mining Task Primitives
3.3     Architecture of a Typical Data Mining System (May 2010, Dec. 2010, Dec. 2011,
        Dec. 2013, May 2016)
3.4     Data Mining Techniques (Dec. 2011)
        3.4.1   Statistics
        3.4.2   Machine Learning
        3.4.3   Information Retrieval (IR)
        3.4.4   Database Systems and Data Warehouses
        3.4.5   Decision Support Systems
3.5     Knowledge Discovery in Databases (KDD) (May 2010, Dec. 2010, May 2011, May 2012,
        Dec. 2012, Dec. 2013, May 2016)
3.6     Major Issues in Data Mining
3.7     Applications of Data Mining (Dec. 2011)
3.8     Types of Attributes
3.9     Statistical Description of Data
        3.9.1   Central Tendency
        3.9.2   Dispersion of Data
        3.9.3   Graphic Displays of Basic Statistical Descriptions of Data
3.10    Data Visualization (Dec. 2012)
3.11    Data Preprocessing
        3.11.1  Forms of Data Pre-processing
3.12    Data Cleaning (May 2016)
        3.12.1  Reasons for "Dirty" Data
        3.12.2  Steps in Data Cleansing
        3.12.3  Missing Values
        3.12.4  Noisy Data
        3.12.5  Inconsistent Data
3.13    Data Integration (May 2016)
        3.13.1  Entity Identification Problem
        3.13.2  Redundancy and Correlation Analysis
        3.13.3  Tuple Duplication
        3.13.4  Data Value Conflict Detection and Resolution
3.14    Data Reduction
        3.14.1  Need for Data Reduction
        3.14.2  Data Reduction Techniques
        3.14.2(A)  Data Cube Aggregation
        3.14.2(B)  Dimensionality Reduction
        3.14.2(C)  Data Compression
        3.14.2(D)  Numerosity Reduction
3.15    Data Transformation and Data Discretization
        3.15.1  Data Transformation
        3.15.2  Data Discretization
        3.15.3  Data Transformation by Normalization
        3.15.4  Discretization by Binning
        3.15.5  Discretization by Histogram Analysis
3.16    Concept Hierarchy Generation
3.18    Data Generalization and Summarization-based Characterization
        3.18.1  Data Generalization
        3.18.2  How Attribute-Oriented Induction is Performed?
        3.18.2(A)  Data Generalization
        3.18.2(B)  Attribute Generalization Control
        3.18.2(C)  Example of Attribute Oriented Induction
3.19    University Questions and Answers
        Chapter Ends
Module 4

Chapter 4 :  Classification, Prediction and Clustering

Syllabus : Basic Concepts, Decision Tree using Information Gain, Induction : Attribute Selection
Measures, Tree pruning, Bayesian Classification : Naive Bayes Classifier, Rule-Based Classification :
Using IF-THEN Rules for classification, Prediction : Simple linear regression, Multiple linear
regression, Model Evaluation and Selection : Accuracy and Error measures, Holdout, Random
Sampling, Cross Validation, Bootstrap, Clustering : Distance Measures, Partitioning Methods
(k-Means, k-Medoids), Hierarchical Methods (Agglomerative, Divisive).

4.1     Basic Concept : Classification
        4.1.1   Classification Problem
        4.1.2   Classification Example
        4.1.3   Classification is a Two Step Process
        4.1.4   Difference between Classification and Prediction
4.2     Classification Methods
        4.2.1   Decision Tree Induction
        4.2.1(A)  Appropriate Problems for Decision Tree Learning
        4.2.1(B)  Decision Tree Representation
        4.2.1(C)  Attribute Selection Measure
        4.2.1(D)  Algorithm for Inducing a Decision Tree
        4.2.1(E)  Tree Pruning
        4.2.1(F)  Examples of ID3
        4.2.2   Bayesian Classification : Naive Bayes Classifier
        4.2.2(A)  Bayes Theorem
        4.2.2(B)  Basics of Bayesian Classification
        4.2.2(C)  Naive Bayes Classifier : Example
        4.2.3   Rule Based Classification
        4.2.4   Other Classification Methods
4.3     Prediction
4.4     Model Evaluation and Selection
        4.4.1   Accuracy and Error Measures
        4.4.2   Holdout
        4.4.3   Random Subsampling
        4.4.4   Cross-Validation (CV)
        4.4.5   Bootstrapping
4.5     What is Clustering?
        4.5.1   What is Clustering? (Dec. 2010, May 2012, Dec. 2012, Dec. 2013)
        4.5.2   Categories of Clustering Methods (May 2010, May 2012)
        4.5.3   Difference between Classification and Clustering
4.6     Types of Data
        4.6.1   Interval-Scaled Variables
        4.6.2   Binary Variables
        4.6.3   Nominal, Ordinal, and Ratio Variables
        4.6.4   Variables of Mixed Type
4.7     Distance Measures (Dec. 2011)
4.8     Partitioning Methods (k-Means, k-Medoids) (Dec. 2010)
        4.8.1   k-Means Clustering (Centroid Based Technique) (May 2010, Dec. 2010, May 2013,
                Dec. 2013, May 2014, Dec. 2014, May 2016)
        4.8.2   k-Medoids (Representative Object-based Technique)
        4.8.3   Sampling Based Method
4.9     Hierarchical Clustering (May 2014, Dec. 2014)
        4.9.1   Agglomerative Hierarchical Clustering (May 2010)
        4.9.2   Divisive Hierarchical Clustering
        4.9.3   BIRCH (Balanced Iterative Reducing and Clustering using Hierarchies)
        4.9.4   Advantages and Disadvantages of Hierarchical Clustering
4.10    University Questions and Answers
        Chapter Ends

Module 5

Chapter 5 :  Mining Frequent Patterns and Association Rules    5-1 to 5-57

Syllabus : Market Basket Analysis, Frequent Item sets, Closed Item sets and Association Rule,
Frequent Pattern Mining, Efficient and Scalable Frequent Item set Mining Methods : Apriori
Algorithm, Association Rule Generation, Improving the Efficiency of Apriori, FP growth, Mining
frequent Itemsets using Vertical Data Format, Introduction to Mining Multilevel Association Rules
and Multidimensional Association Rules.

5.1     Market Basket Analysis (May 2012, Dec. 2013)
        5.1.1   What is Market Basket Analysis?
        5.1.2   How is it Used?
        5.1.3   Applications of Market Basket Analysis
5.2     Frequent Item Sets, Closed Item Sets and Association Rule
        5.2.1   Frequent Itemsets
        5.2.2   Closed Itemsets
        5.2.3   Association Rules (Dec. 2011, May 2012, Dec. 2013)
5.3     Frequent Pattern Mining
5.4     Efficient and Scalable Frequent Itemset Mining Methods
        5.4.1   Apriori Algorithm (Dec. 2011)
        5.4.2   Advantages and Disadvantages of Apriori Algorithm
        5.4.3   Solved Examples on Apriori Algorithm
5.5     Association Rule Generation
5.6     Improving the Efficiency of Apriori
5.7     FP Growth (May 2016)
        5.7.1   Definition
        5.7.2   FP-Tree Algorithm
        5.7.3   FP-Tree Size
        5.7.4   Example
        5.7.5   Mining Frequent Patterns from FP Tree
        5.7.6   Benefits of the FP-Tree Structure
5.8     Mining Frequent Itemsets using Vertical Data Format
5.9     Introduction to Mining Multilevel Association Rules (May 2012, May 2013, Dec. 2013,
        Dec. 2014, May 2016)
5.10    Mining Multidimensional (MD) Association Rules (May 2013, Dec. 2014, May 2016)
5.11    University Questions and Answers
        Chapter Ends

Chapter 6 :  Spatial and Web Mining    6-1 to 6-31

Syllabus : Spatial Data, Spatial Vs. Classical Data Mining, Spatial Data Structures, Mining Spatial
Association and Co-location Patterns, Spatial Clustering Techniques : CLARANS Extension, Web
Mining : Web Content Mining, Web Structure Mining, Web Usage mining, Applications of Web Mining.

6.1     Spatial Data
        6.1.1   Spatial Pattern
        6.1.2   What is NOT Spatial Data Mining?
        6.1.3   Why Learn about Spatial Data Mining?
        6.1.4   Spatial Data Mining : Actors
        6.1.5   Characteristics of Spatial Data Mining
6.2     Spatial Vs. Classical Data Mining
6.3     Spatial Data Structures
        6.3.1   R-tree
        6.3.2   R+-tree
        6.3.3   More Spatial Data Structures
6.4     Mining Spatial Association and Co-location Patterns
        6.4.1   Mining Spatial Association
        6.4.2   Mining Co-location Patterns
6.5     Spatial Clustering Techniques : CLARANS
6.6     Web Mining
        6.6.1   How Web Mining is Different from Classical Data Mining?
        6.6.2   Benefits of Web Mining
6.7     Web Content Mining
        6.7.1   Introduction to Web Content Mining
        6.7.2   Text Mining
6.8     Web Usage Mining
        6.8.1   What is Web Usage Mining?
        6.8.2   Purpose of Web Usage Mining
        6.8.3   Web Usage Mining Activities
        6.8.4   Web Server Log
        6.8.4(A)  Structure of Web Log
        6.8.4(B)  Web Server Log - An Example
6.9     Web Structure Mining
        6.9.1   Introduction to Web Structure Mining
        6.9.2   Techniques of Web Structure Mining
        6.9.2(A)  Page Rank Technique
        6.9.2(B)  CLEVER Technique
6.10    Web Crawlers
6.11    Applications of Web Mining
        Chapter Ends

Appendix A :  Solved University Question Paper of May 2019    A-1 to A-16

Module 1

Introduction to Data Warehouse and Dimensional Modelling

Syllabus :
Introduction to Strategic Information, Need for Strategic Information, Features of Data
Warehouse, Data warehouses versus Data Marts, Top-down versus Bottom-up
approach. Data warehouse architecture, metadata, E-R modelling versus Dimensional
Modelling, Information Package Diagram, STAR schema, STAR schema keys,
Snowflake Schema, Fact Constellation Schema, Factless Fact tables, Update to the
dimension tables, Aggregate fact tables.

Syllabus Topic : Introduction to Strategic Information

1.1   Introduction to Strategic Information

- For the past few years, data warehousing solutions have become a huge trend in medium
  and large-sized companies.
- The data warehouse is no longer just a theorized study in universities, because many
  businesses and industries have started to embrace this mainstream phenomenon.
- Although it is not applied in every company yet, we can see stable growth in the demand
  for it.
- With the help of data warehousing technology, every industry, right from retail to
  financial institutions, manufacturing enterprises, government departments and airline
  companies, is changing the way it performs business analysis and strategic decision
  making.
- The fundamental reason for the inability to provide strategic information is that strategic
  information has so far been extracted from the existing operational systems.
- If we need the strategic information, the information must be collected from altogether
  different types of systems.

- Only specially designed decision support systems or informational systems can provide
  strategic information.
- The result of this is the creation of a new computing environment for the purpose of
  providing the strategic information for every enterprise.

Syllabus Topic : Need for Strategic Information

1.2   Need for Strategic Information
- Strategic information is required for an enterprise to decide the business strategies and
  establish the goals for business. The enterprise needs to monitor the results by setting
  some objectives.
- For example :
   o  Top 3 products sold in 2015
   o  Improve the quality of top products
   o  Increase sales by 10 % in the south region
   o  Retain the present customers
   o  Increase the productivity of particular items due to demand

So to take decisions about these types of objectives, business people need information.

1.2.1  Desired Characteristics of Strategic Information

- INTEGRATED : There must be one integrated, enterprise-wide view of information.
- DATA INTEGRITY : Information must be accurate and must conform to business rules.
- ACCESSIBLE : Information must be easily accessible for analysis.
- CREDIBLE : Every business factor must have one and only one value.
- TIMELY : Information must be obtainable within the predetermined time frame.

1.2.2  Why are Operational Systems not Suitable for Providing Strategic Information?

- The fundamental reason for the inability to provide strategic information is that strategic
  information has so far been extracted from the existing operational systems.
- These operational systems, such as university record systems, inventory management,
  claims processing, outpatient billing and so on, are not designed in a way to provide
  strategic information.
a
.
¥ Data w _1-3 Introduction to Data Warehouse
—_ aos io ser 8-Comp.)
nit
- If we need the strategictin |information, the information must be collected from altogether
en e
igned decision support systems or
nformati ypes of systems. Only specially des
ategic information.
informational systems can provide str
4.2.3 Operational v/s Decisional Support System
. > (MU - May 2016)
Operational systems are tuned for known transactions and workloads, while workload is
not known a priori in a data warehouse.
Special data organization, access methods and implementation methods are needed to
support data warehouse queries (typically multidimensional queries).
during the
Example, average amount spent on phone calls between QAM-5PM in Pune
month of December.
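To make such a multidimensional query concrete, here is a minimal Python sketch (added for illustration, not part of the original text). It builds a tiny call_fact table in SQLite and answers the question stated above; the table and column names are assumptions invented for this example.

import sqlite3

# Illustrative only: a tiny fact table of phone calls (city, date, hour, amount).
conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE call_fact (
                    city       TEXT,
                    call_date  TEXT,     -- 'YYYY-MM-DD'
                    call_hour  INTEGER,
                    amount     REAL)""")
rows = [
    ("Pune",   "2015-12-03", 10, 42.0),
    ("Pune",   "2015-12-10", 16, 55.0),
    ("Pune",   "2015-12-12", 20, 12.0),   # outside 9 AM - 5 PM, excluded
    ("Mumbai", "2015-12-05", 11, 80.0),   # different city, excluded
]
conn.executemany("INSERT INTO call_fact VALUES (?, ?, ?, ?)", rows)

# Average amount spent on calls between 9 AM and 5 PM in Pune during December.
avg_amount = conn.execute(
    """SELECT AVG(amount)
         FROM call_fact
        WHERE city = 'Pune'
          AND call_hour BETWEEN 9 AND 17
          AND strftime('%m', call_date) = '12'"""
).fetchone()[0]
print(avg_amount)   # 48.5

The query slices the data along three dimensions at once (location, time of day, month), which is exactly the kind of workload a data warehouse is organized for.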
Operational System | Data Warehouse (DSS)
Application oriented | Subject oriented
Used to run business | Used to analyze business
Detailed data | Summarized and refined
Current, up to date | Snapshot data
Isolated data | Integrated data
Repetitive access | Ad-hoc access
Clerical user | Knowledge user (manager)
Performance sensitive | Performance relaxed
Few records accessed at a time (tens) | Large volumes accessed at a time (millions)
Read / update access | Mostly read (batch update)
No data redundancy | Redundancy present
Database size 100 MB - 100 GB | Database size 100 GB - few terabytes

Syllabus Topic : Features of Data Warehouse

1.3   Introduction of Data Warehouse
> (MU - May 2010, Dec. 2010, May 2011, May 2012, Dec. 2013)

1.3.1  Definition of Data Warehouse

The term Data Warehouse was defined by Bill Inmon in 1990, in the following way : "A data
warehouse is a subject-oriented, integrated, time-variant and non-volatile collection of data in
support of management's decision making process". He defined the terms in the sentence as
follows :

Subject-oriented : Data that gives information about a particular subject instead of about a
company's ongoing operations.

Integrated : Data that is gathered into the data warehouse from a variety of sources and merged
into a coherent whole.

Time-variant : All data in the data warehouse is identified with a particular time period.

Non-volatile : Data is stable in a data warehouse. More data is added but data is never removed.
This enables management to gain a consistent picture of the business.

Ralph Kimball provided a much simpler definition of a data warehouse, i.e., "a data warehouse is
a copy of transaction data specifically structured for query and analysis". This is a functional view
of a data warehouse. Kimball did not address how the data warehouse is built like Inmon did;
rather, he focused on the functionality of a data warehouse.

1.3.2  Benefits of Data Warehousing

- Potential high returns on investment and enhanced business intelligence : Implementation of a
  data warehouse requires a huge investment, in lakhs of rupees. But it helps the organization to
  take strategic decisions based on past historical data, and the organization can improve the
  results of various processes like marketing segmentation, inventory management and sales.
- Competitive advantage : As previously unknown and unavailable data is available in the data
  warehouse, decision makers can access that data to take decisions to gain the competitive
  advantage.
- Saves time : As the data from multiple sources is available in integrated form, business users
  can access data from one place. There is no need to retrieve the data from multiple sources.
- Better enterprise intelligence : It improves the customer service and productivity.
- High quality data : Data in the data warehouse is cleaned and transformed into the desired
  format, so data quality is high.

1.3.3  Characteristics / Features of a Data Warehouse
> (MU - Dec. 2010, Dec. 2014)

A common way of introducing data warehousing is to refer to the characteristics of a data
warehouse :
1.  Subject Oriented
2.  Integrated
3.  Nonvolatile
4.  Time Variant

1.  Subject Oriented
    - Data in the warehouse is organized around major subjects, such as loans, rather than
      around the ongoing operations, and can answer questions like "Which customer has taken
      the maximum loan amount for last year?" This ability to define a data warehouse by
      subject matter, loan in this case, makes the data warehouse subject oriented.

Fig. 1.3.1 : Data Warehouse is Subject Oriented (operational applications versus data warehouse
subjects)

2.  Integrated
    - A data warehouse is constructed by integrating multiple, heterogeneous data sources
      like relational databases, flat files and on-line transaction records. The data collected is
      cleaned and then data integration techniques are applied, which ensures consistency in
      naming conventions, encoding structures, attribute measures etc. among different data
      sources.

Fig. 1.3.2 : Data Warehouse is Integrated (data about a subject such as Student is gathered from
multiple sources into the data warehouse)

3.  Non-volatile
    - Nonvolatile means that, once data is entered into the warehouse, it cannot be removed
      or changed, because the purpose of a warehouse is to analyze the data.
4. Time Variant
    A data warehouse maintains historical data. For example, a customer record has details of
    his job; a data warehouse would maintain all his previous jobs (historical information), as
    compared to a transactional system, which only maintains the current job, due to which it
    is not possible to retrieve older records.
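As a small illustration of the time-variant and non-volatile properties (a sketch added here, not from the book), the Python snippet below contrasts an operational record, which keeps only the current job, with the history a warehouse dimension would preserve. All field names and dates are invented.

# Operational system: only the current state is kept (older values are overwritten).
operational_customer = {"cust_id": 101, "name": "Asha", "job": "Analyst"}

# Warehouse dimension: every change is kept with its validity period; rows are never removed.
warehouse_customer_dim = [
    {"cust_id": 101, "job": "Clerk",   "valid_from": "2010-01-01", "valid_to": "2013-06-30"},
    {"cust_id": 101, "job": "Officer", "valid_from": "2013-07-01", "valid_to": "2016-03-31"},
    {"cust_id": 101, "job": "Analyst", "valid_from": "2016-04-01", "valid_to": "9999-12-31"},
]

def job_on(date):
    """Answer a historical question the operational system cannot."""
    for row in warehouse_customer_dim:
        if row["valid_from"] <= date <= row["valid_to"]:
            return row["job"]

print(job_on("2014-05-20"))   # Officer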

Syllabus Topic : Data Warehouses Versus Data Marts

1.4   Data Warehouses versus Data Marts
> (MU - May 2013, May 2016)

Data Mart defined
- A data mart is oriented to a specific purpose or major data subject that may be distributed
  to support business needs. It is a subset of the data resource.
- A data mart is a repository of a business organization's data implemented to answer very
  specific questions for a specific group of data consumers such as organizational divisions
  of marketing, sales, operations, collections and others.
- A data mart is typically established as one dimensional model or star schema, which is
  composed of a fact table and multiple dimension tables (a small sketch of this structure
  follows below).
- A data mart is a small warehouse which is designed for the department level.
- It is often a way to gain entry and provide an opportunity to learn.
- Major problem : If they differ from department to department, they can be difficult to
  integrate enterprise-wide.
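The following minimal Python / SQLite sketch (added for illustration, not taken from the book) shows the star-schema shape mentioned above : one fact table carrying the measures and foreign keys, surrounded by dimension tables, queried with a star join. Every table and column name here is invented for the example.

import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_date    (date_key INTEGER PRIMARY KEY, day TEXT, month TEXT, year INTEGER);
CREATE TABLE dim_product (product_key INTEGER PRIMARY KEY, name TEXT, category TEXT);
CREATE TABLE dim_store   (store_key INTEGER PRIMARY KEY, city TEXT, region TEXT);

-- The fact table holds the measures plus a foreign key to every dimension.
CREATE TABLE fact_sales (
    date_key     INTEGER REFERENCES dim_date(date_key),
    product_key  INTEGER REFERENCES dim_product(product_key),
    store_key    INTEGER REFERENCES dim_store(store_key),
    units_sold   INTEGER,
    sales_amount REAL
);

INSERT INTO dim_store   VALUES (1, 'Pune', 'West'), (2, 'Nagpur', 'East');
INSERT INTO dim_product VALUES (1, 'Soap', 'FMCG');
INSERT INTO dim_date    VALUES (20151201, '01', '12', 2015);
INSERT INTO fact_sales  VALUES (20151201, 1, 1, 10, 250.0), (20151201, 1, 2, 4, 100.0);
""")

# A typical data-mart question, answered with a star join: sales amount by region.
query = """
SELECT s.region, SUM(f.sales_amount) AS total_sales
  FROM fact_sales f
  JOIN dim_store s ON f.store_key = s.store_key
 GROUP BY s.region
"""
for row in conn.execute(query):
    print(row)      # ('East', 100.0) and ('West', 250.0)

A departmental data mart would usually contain only one such fact table and its dimensions, which is why it stays small and quick to build.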
Table 1.4.1 : Differences between Data Warehouse and Data Mart

Data Warehouse | Data Mart
A data warehouse is application independent. | A data mart is dependent on a specific DSS application.
It is centralized, and enterprise wide. | It is decentralized by user area.
It is well planned. | It is possibly not planned.
The data is historical, detailed and summarized. | The data consists of some history, detailed and summarized.
It consists of multiple subjects. | It consists of a single subject of concern to the user.
It is highly flexible. | It is restrictive.
Implementation takes months to years. | Implementation is done usually in months.
Generally size is from 100 GB to 1 TB. | Generally size is less than 100 GB.

Syllabus Topic : Top-Down versus Bottom-Up Approach

1.5   Top-Down versus Bottom-Up Approach

Data Warehousing Design Strategies or Approaches for Building a Data Warehouse :

1.5.1  The Top-Down Approach : The Dependent Data Mart Structure
> (MU - May 2010, Dec. 2010, May 2011, May 2012)
- The data flow in the top-down OLAP environment begins with data extraction from the
  operational data sources.

Fig. 1.5.1 : Top Down Approach (data from operational databases is extracted, transformed and
loaded into the data warehouse and onward to the data marts)

- This data is loaded into the staging area and validated and consolidated for ensuring a
  level of accuracy, and then transferred to the Operational Data Store (ODS).
   o The ODS stage is sometimes skipped if it is a replication of the operational databases.
- Data is also loaded into the data warehouse in a parallel process to avoid extracting it
  from the ODS.
o Detailed data is regularly extracted from the ODS and temporarily hosted in the
staging area for aggregation, summarization and then extracted and loaded into the
Data warehouse.
o The need to have an ODS is determined by the needs of the business.
o If there is a need for detailed data in the Data warehouse then, the existence of an
_ ODS is considered justified.
   o Else organizations may do away with the ODS altogether.
- Once the Data warehouse aggregation and summarization processes are complete, the
  data mart refresh cycles will extract the data from the Data warehouse into the staging
  area and perform a new set of transformations on them.
   o This will help organize the data in particular structures required by data marts.
- Then the data marts can be loaded with the data and the OLAP environment becomes
available to the users.
Advantages of top down Approach
- It is not just a union of disparate data marts but it is inherently architected.
- The data about the content is centrally stored and the rules and control are also centralized.
- The results are obtained quickly if it is implemented with iterations.

Disadvantages of top down Approach
- Time consuming process with an iterative method.
- The failure risk is very high.
- As it is integrated, a high level of cross-functional skills is required.
1.5.2 The Bottom-Up Approach : The Data Warehouse Bus Structure

> (MU - May 2010, Dec. 2010, May 2011, May 2012)
- This architecture makes the data warehouse more of a virtual reality than a physical
reality. All data marts could be located in one server or could be located on different
servers across the enterprise while the data warehouse would be a virtual
entity being
nothing more than a sum total of all the data marts.
- In this context even the cubes constructed by using OLAP tools could be considered as
data marts. In both cases the shared dimensions can be used for the conformed
dimensions.
- The bottom-up approach reverses the positions of the Data warehouse and the Data marts.

- Data marts are directly loaded with the data from the operational systems through the
  staging area.
- The ODS may or may not exist depending on the business requirements. However, this
  approach increases the complexity of process coordination.
- The data flow in the bottom-up approach starts with extraction of data from operational
  databases into the staging area, where it is processed and consolidated and then loaded
  into the ODS.
- The data in the ODS is appended to or replaced by the fresh data being loaded.
- After the ODS is refreshed, the current data is once again extracted into the staging area
  and processed to fit into the Data mart structure.
- The data from the Data mart is then extracted to the staging area, aggregated, summarized
  and so on, and loaded into the Data Warehouse and made available to the end user for
  analysis.

Fig. 1.5.2 : Bottom Up Approach (data is extracted, transformed and loaded from operational
databases into the data marts and then into the data warehouse)


Advantages of bottom up Approach
- This model strikes a good balance between centralized and localized flexibility.
- Data marts can be delivered more quickly, and shared data structures along the bus
  eliminate the repeated effort expended when building multiple data marts in a
  non-architected structure.
- The standard procedure, where data marts are refreshed from the ODS and not from
  operational databases, ensures data consolidation and hence it is a generally recommended
  approach.
- Manageable pieces are faster and are easily implemented.
- Risk of failure is low.
- Allows one to create the important data mart first.

Disadvantages of bottom up Approach
- Allows redundancy of data in every data mart.
- Preserves inconsistent and incompatible data.
- Grows unmanageable interfaces.

1.5.3 Hybrid Approach

- The Hybrid approach aims to harness the speed and user orientation of the Bottom-up
  approach to the integration of the Top-down approach.
- The Hybrid approach begins with an Entity Relationship diagram of the data marts and
  gradual extension of the data marts to extend the enterprise model in a consistent, linear
  fashion.
- These data marts are developed using the star schema or dimensional models.
- The Extract, Transform and Load (ETL) tool is deployed to extract data from the source
  into a non-persistent staging area and then into dimensional data marts that contain both
  atomic and summary data.
- The data from the various data marts is then transferred to the data warehouse, and query
  tools are reprogrammed to request summary data from the marts and atomic data from
  the Data Warehouse.

Advantages of Hybrid Approach
- Provides rapid development within an enterprise architecture framework.
- Avoids creation of renegade "independent" data marts.
- Instantiates enterprise model and architecture only when needed and once data marts
  deliver real value.
- Synchronizes meta data and database models between enterprise and local definitions.
- Back-filled DW eliminates redundant extracts.
redundant extracts.

Scanned by CamScanner
°

YF Data Warehousing & Mining (MU-Sem. 6-Comp.) 1-11 Introduction to Data Warehouse

Disadvantages of Hybrid Approach


- Requires organizations to enforce standard use of entities and rules.
- Back filling a DW is disruptive, requiring corporate commitment, funding, and application
rewrites.
Few query tools can dynamically query atomic and summary data in different databases.

1.5.4 Federated Approach

This is a hub-and-spoke architecture often described as the “architecture of architectures”.


It recommends an integration of heterogeneous data warehouses, data marts and packaged
applications that already exist in the enterprise.
- The goal is to integrate existing analytic structures wherever possible and to define the
“highest value” metrics, dimensions and measures and share and reuse them within
existing analytic structures.
- This may result in the creation of a common staging area to eliminate redundant data feeds
or building of a data warehouse that sources data from multiple data marts, data
warehouses or analytic applications.
- Hackney, a vocal proponent of this architecture, claims that it is not an elegant architecture
  but it is an architecture that is in keeping with the political and implementation reality of
  the enterprise.
Advantages of Federated Approach
- Provides a rationale for “band aid” approaches that solve real business problems.
- Alleviates the guilt and stress data warehousing managers might experience by not
adhering to formalized architectures.
- Provides pragmatic way to share data and resources.
Disadvantages of Federated Approach
- The approach is not fully articulated.
- With no predefined end-state or architecture in mind, it may give way to unfettered chaos.
- It might encourage, rather than control, independent development and perpetuate the
  disintegration of standards and controls.
1.5.5  A Practical Approach

The steps in the Practical Approach are :
1. The first step is to do planning and defining requirements at the overall corporate level.
2. An architecture is created for a complete warehouse.
3. The data content is conformed and standardized.
4. Consider the series of supermarts one at a time and implement the data warehouse as per
   the architecture.
5. In this practical approach, first the organization's needs are determined. The key to this
   approach is that planning is done first at the enterprise level. The requirements are
   gathered at the overall level.
6. The architecture is established for the complete warehouse. Then the data content for each
   supermart is determined. Supermarts are carefully architected data marts. Supermarts are
   implemented one at a time.
7. Before implementation, check the data types, field lengths etc. from the various supermarts,
   which helps to avoid spreading of different data across several data marts.
8. Finally a data warehouse is created which is a union of all data marts. Each data mart
   belongs to a business process in the enterprise, and the collection of all the data marts
   forms an enterprise data warehouse.

Syllabus Topic : Data Warehouse Architecture

1.6   Data Warehouse Architecture

1.6.1  The Information Flow Mechanism

- The building blocks of a data warehouse, or the information flow mechanism, are shown
  in Fig. 1.6.1.

Fig. 1.6.1 : The Building Blocks of a Data Warehouse (source data systems feed a data staging
area, where data is cleaned, reconciled, derived, matched, combined, de-duplicated, standardized
and conformed, and exported to data marts; a data and metadata storage area holds the warehouse
data; end-user presentation tools include ad-hoc query tools, report writers, end-user applications,
modelling / mining tools and visualization tools)

- In Fig. 1.6.1, the source data component is shown on the left; it shows various internal and
  external source data systems.
- The next building block is the data staging area, where the data is extracted from the
  source data systems and then transformation and loading of data is done.
- In the middle, the data storage component is placed, which manages the data warehouse
  data.
- A metadata repository is used to keep a track of the data.
- On the rightmost side there is the information delivery component. This component is
  responsible to deliver the information in different ways to the end users.
1. Source Data Component
   - The source data component provides the necessary data for the data warehouse.
   - Typically, the source data for the warehouse comes from the operational applications.
   - This operational data and processing is completely separated from data warehouse
     processing.
   - The information repository is surrounded by a number of key components designed to
     make the entire environment functional, manageable and accessible, both by the
     operational systems that source data into the warehouse and by end-user query and
     analysis tools.

2. Data Staging Component
   - As soon as the data arrives into the data staging area, it is cleaned up and, using
     transformation functions, it is converted into an integrated structure and format.
   - There are different types of transformation functions that may be applied over the
     data; they are conversion, summarization, filtering and condensation of data.
   - As the warehouse maintains historical data, the amount of data is very huge. The
     warehouse must be able to hold and maintain such a large volume as well as different
     data structures for the same.
   - The functionality includes :
      o Removing unwanted data from operational databases.
      o Converting to common data names and definitions.
      o Establishing defaults for missing data.
      o Accommodating source data definition changes.
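A minimal Python sketch (added for illustration, not from the book) of the basic staging tasks just listed : dropping unwanted operational fields, converting to common data names, and establishing defaults for missing data. The record layout and rules are invented.

# Illustrative staging-area transformation; field names and rules are made up.
RENAME   = {"cust_nm": "customer_name", "st": "state"}    # common data names
DEFAULTS = {"state": "UNKNOWN", "balance": 0.0}           # defaults for missing data
UNWANTED = {"internal_flag"}                               # operational-only fields to drop

def stage_record(raw):
    # drop unwanted fields and rename the rest to the conformed names
    rec = {RENAME.get(k, k): v for k, v in raw.items() if k not in UNWANTED}
    # establish defaults wherever a value is missing or empty
    for field, default in DEFAULTS.items():
        if rec.get(field) in (None, ""):
            rec[field] = default
    return rec

raw_row = {"cust_nm": "Asha", "st": "", "balance": None, "internal_flag": "X"}
print(stage_record(raw_row))
# {'customer_name': 'Asha', 'state': 'UNKNOWN', 'balance': 0.0}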
3. Data Storage Component
   - The central data warehouse database is the cornerstone of the data warehousing
     environment.
   - For implementation, mostly a Relational Database Management System (RDBMS)
     technology is used.
   - However, this kind of implementation is often constrained by the fact that traditional
     RDBMS products are optimized for transactional database processing.
   - Some of the data warehouse features like very large database size, ad hoc query
     processing and the need for flexible user view creation, including aggregates, table
     joins and drill-downs, have become drivers for different technological approaches to
     the data warehouse database.
me drivers for different technologi.,Maly,
approaches to the data warehouse database. = The data i:
as from ot!
4. Information Delivery Component
   - The main purpose of the data warehouse is to provide strategic information to business
     users for decision-making. These users interact with the data warehouse using different
     tools.
   - The information delivery component is used for providing information to one or more
     destinations according to some user-specified scheduling algorithm.
   - Delivery of information may be based on time of day or on the completion of an
     external event.

5. Metadata Component
   - Metadata is data about data that describes the data warehouse. It is used for building,
     maintaining, managing and using the data warehouse.
   - Metadata can be classified into different types, for example metadata used by warehouse
     designers and administrators for warehouse development and management tasks.
warehouse development ners and administra
and management tasks. tors for — Atypic

6 Management and Co
ntrol Component

— Man

Scanned by CamScanner
*

wy F i ‘
_Data Warehousing & Mining (MU-Sem, 6-Comp.) 1-15 Introduction to Data Warehouse
1.6.2 Data Warehouse Architecture
>(MU - May 2010, Dec. 2010, May
2011, Dec. 2011, May 2012, May
2013,
- The data in a data wareho May 2014, Dec. 2014, May 2016)
use comes fro m
Operational s ystems of the
as from other external
sources. These are c oll Organization as well
ectively referred to as source
— The data extracted ‘from systems.
source systems is stored
where the data is cleaned, in an area call ed data staging area,
fransformed,
combined, and duplicated
’ the data warehouse, to prepare the data in
: a
The data staging area is gen
erally a collection of machin
sorting and sequential Proces es where simple activities
sing takes place. The data stag like
query or presentation services, ing area does not prov ide any
:
OU | = As soon as a system Provides query or presentation services, it is cate
presentation server. gorized as a
.
‘A presentation server is the target-
machine on which the data is load
ed from the data
applications.
The three different kinds of systems that are
required for a data warehouse are:
(i) Source Systems
(ii) Data Staging Area
(iii) Presentation servers
9 — ' The data travels from.source systems to prese
ntation servers via the data staging area. The
entire proce ss is popularly known as ETL (extract, transform,
and load) or ETT (extract,
transform, .and transfer). Oracle’s ETL tool is
called Oracle Warehouse Builder (OWB)
and MS SQL Server’s ETL tool is called Data Tran
sformation Services (DTS).
A typical architecture of a data warehouse is shown in Fig.1.
6.2 :

<> Highly go>


Meta. |Summarized
— oS da Data
Query Manager
YOO)

9
c 0
ao
ares * e x

2 8 —$—p - End user


3 Lightly
53 = Summarized
access tools
a
o
o
Data
'
45
Detailed data

| Warehouse manager *

Archive / Backup

Fig. 1.6.2 : Data Warehouse Architecture

Scanned by CamScanner
a]
Ww Data 1 Warehousing & Mining Introduction to Data Wareh,
ttn,

Each component and the tasks performed by them arc explained below ;
1. Operational Source
The sources of data for the data warchouse are supplied from :
- The data from the mainframe systems in the traditional network and hierarchic,
format.
- Datacan also come from the relational DBMS like Oracle, Informix.
* = Inaddition to these internal data, operational data also includes external data Obtain,
from commercial databases and databases associated with supplier and customers.
2. Load Manager
- The load manager performs all the operations associated with extraction and Joading
data into the data warehouse.
- These operations include simple transformations of the data-to prepare the data fy,
entry into the warehouse.
- The size and complexity of this component will vary between data warehouses ang
may be constructed using a combination of vendor data loading tools and custom bujj
programs.
3. Warehouse Manager |
— The warehouse manager performs all the operations associated with the management
of data in the warehouse.
- This component is built using vendor data management tools and custom built
programs.
— The operations performed by warehouse manager include.
— Analysis of data to ensure consistency. .
- Transformation and merging the source data from temporary storage into data
warehouse tables.
- Create indexes and views on the base table.
- Denormalization.
— Generation of aggregation.
- Backing up and archiving of data.
- In certain situations, the warehouse manager also generates query profiles 10
determine which indexes and aggregations are appropriate.

Scanned by CamScanner
*

Data Wi i iat Introduction to Data Warehouse


¥ arehousing & Mining (MU-Sem. 6-Comp.) _ 1-17
4. Query Manager
— The query manager performs all operations associated with management of user
queries.
— This samaponut is usually constructed using vendor end-user access tools, data
warehousing monitoring tools, database facilities and custom built programs.
- The complexity of a query manager is determined by facilities provided by the end-
user access tools and database.
5. Detailed Data
- This area of the warehouse stores all the detailed data in the database schema.
- In the majority of cases detailed data is not stored online but aggregated to the next
level of details.
— The detailed data is added regularly to the warehouse to supplement the aggregated

6. Lightly and Highly Summarized Data


(aggregated) data
- This stores all the predefined lightly and highly summarized
.
generated by the warehouse manager.
to change on an ongoing
This area of the warehouse is transient as it will be subject
basis in order to respond to the changing query profiles.
speed up the query performance.
The main goal of the summarized information is to
d data. is updated
As the new data is loaded into the warehouse, the summarize
continuously.
7. Archive and Back up Data
ed for the purpose of archiving and back
The detailed and summarized data are stor
. Up s.
h as magnetic tapes or optical disk
_ ‘The data is transferred to storage archives suc
8. Meta Data
use also stor es all the Meta dat a (data about data) definitions used by
The data wareho
.
all processes in the warehouse.
g :
— Itis used for variety of purpose includin rces to a
rac tio n and loa din g pro ces s - Meta data is us ed to map data sou
° The ext
on within the warehouse.
common view of informati the
data is used to ‘automate
' ‘The warehouse management process - Meta
.
production of summary tables

Scanned by CamScanner
' : 1-18 Introduction to Data jw. ai hong ;
)
&: Data Warehousing & Mining (MU-Sem. 6-Comp.
direct 4 query ty the
2
data isis us used to SF Data Warehousing & |
Dake - oO As part of Query Management process - Mcta
Univ hecauee the «.
—eee

nic year most appropriate data source. - Gateways like


di ini each pro! cess, ‘ e Purpose ‘
3ce Bas — The structure of Meta data willil differ embedding fo

|
different. underlying DE
End-User Access Tools Middle Tier (OL)
— The main purpose of a data warehouse is to provide information to the busine, OLAP Engine i
managers for strategic decision-making. Processing) or M
3. Top Tier (Fron
— These users interact with the warehouse using end user access tools.
— Some of the examples of end user access tools can be : = This tier is:
data mining
eo Reporting and Query Tools
- From the A
o Application Development Tools
J. Enterprise W:
© Executive Information Systems Tools-
The informati
o Online Analytical Processing Tools
enterprise war
o Data Mining Tools Il. Data Mart
1.6.3 Three Tier/ Multi-tier Data Warehouse Architecture — aAsubset

> (MU - Dec. 2013) - Itcanbe

' Ti. Virtual wai


Monitor 1 OLAP
& ! server — Aseto
Integrator} | - Onlys
!

Analysis
fe ¥ query ——SSSS
Serve reports
data mining 17 Meta

4.7.4 Detir

Data marts ; Data w


purpose met
Data Sources Data Storage OLAP Engine Front-End Tools (a) Inform:
Fig. 1.6.3 : Multi-tier Data warehouse
Architecture (b) Inform
1. Bottom Tier (Data Sources and the rel
Data Storage)
~ Itis a warehouse database serv recon
er, that is generally a RDBM
S (c) Infor
Using Application Program int
erfaces mode
operational and external sources. (called as gateways), data is extracted from
the W

Scanned by CamScanner
BF Data Warehousing & Mining (MU-Sem.
6-Comp.) 1-19 Introduction to Data Warehouse
nae like, ODBC(Open Database: conn
ection), OLE-DB (Open linking
embedding for database), JDBC and
(Java Database Connection) is
underlying DBMS, supp orted by
, 2. Middle Tier (OLAP Engine)
OLAP Engine is either implemen
ted using ROLAP (Relational
Processing) or MOLAP(Multidimension online Analytical
al OLAP).
3. Top Tier (Front End Tools)
This tier is a client which contains query
and reporting tools, Analysis tools, and /or
data mining tools.
From the Architecture ‘Point of view there are three data warehouse
Models :
I. Enterprise Warehouse
The information of the entire organization is collecte
d related to various pubiocts in
enterprise warehouse.
II. Data Mart

A subset of Warehouse that is useful to a specific group of users.


ec, 2013)
It can be categorized as Independent vs. dependent data mart.
II. Virtual warehouse
A set of views over operational databases.
Only some of the possible summary views may be materialized.

Syllabus Topic : Metadata ~

1.7 Metadata
> (MU - Dec. 2010, Dec. 2012, May 2013, May 2014, Dec. 2014)
1.7.1 Definition
= (MU - May 2010, Dec. 2011, May 2012, Dec. 2013)
Data warehouse metadata are piec&s of information stored in one or more special-
purpose metadata repositories that include. :
(a) Information on the contents of the data warehouse, their location and their structure,
b) Information on the processes that take place in the data warehouse back-stage, concerning
the refreshment of the warehouse with clean, up-to-date, semaganeehy. and a structurally
reconciled data,
(c) Information on the implicit semantics of data (with respect to a common enterprise
model), along with any other kind of data that aids the end-user exploit the information of
yn
the warehouse,

Scanned by CamScanner
~
~_¥& ata Warehousing & Mining (MU-Sem. 6-Comp.) _1-20 antroduction to Data Wath.
(d) Information on the infrastructure and physical characteristics of components and ‘
sources of the data warehouse and, -
(e) information including security, authentication, and usage statistics that aigs te
_ administrator, tunes the operation of the data warehouse as appropriate,
1.7.2 Describe Metadata of a Book Store

Metadata can also be considered as an equivalent of Amazon book store, If we consid


each data element as a book, the meta-data will contain.
— Name of the book,
- Summary of the book,
- Assessments about the book,
- The date of publication,
- High level description of what it contains,
~ Who are the publishers,
- How-you can find the book,
"
- Author of the book, |
— Whether the book is available OR not.
:
- This information helps you to:
© Search for the book
© Access the book
© Understand about the book
before you access OR buy
it.
1.7.3 Data Warehouse Meta
data

> (MU - May 2010, May 2012, Dec.


Data Warehousing has Specif 2013)
tables typically incl ic metadata requirements.
udes : Metadata that describes
~ Physical Name
- Logical Name
~ Type : Fact: Dimension,
Bridge
~ Role: Legacy, OLTP
, Stage,
- DBMS: DB2,
Informix, MS
~
SQL Server, Oracle, Sybase
. Location
~ Definition
— Notes

Scanned by CamScanner
ww Data Warehousing & Mining (MU-Sem, 6Comp.) 1-21 Introduction to Data Warehouse
<== 2a —

Metadata describes columns within tables

Physical Name Logical Name


Order in Table Data type

Length Decimal Positions

Nullable / Required | Default Value

Edit Rules Definition

Notes

- Example of Meta data for Customer sales data warehouse


— Entity Name : Customer
Alias Names : Account, Client
Definition : A person or an organization that purchases goods or services from the
company.
Remarks : Customer entity includes regular, current, and past customers.
Source Systems : Finished Goods Orders, Maintenance Contracts, Online Sales.
— Create Date : June 15, 2007
- Last Update Date
: June 21, 2009
- Update Cycle : Weekly
Last Full Refresh Date : December 29, 2008
— Full Refresh Cycle : Every six months
Data Quality Reviewed : July 25, 2009
— Last Deduplication : June 05, 2009
Jec. 2013) Planned Archival : Every six months
— Responsible User : Pallavi H.
describé
1.7.4 Classification of Metadata or Types of Metadata In Data Warehouse
> (MU - May 2010, Dec. 2011, May 2012, Dec. 2013)
Operational Meta data
In an Enterprise, data for the data warehouse comes from various operational systems.
Different source systems contain different data structures having different field lengths
and data types.
So the information of operational data source is given by Operational Meta Data.

Scanned by CamScanner
=F pata Warehousing & Mining (MU-Sem. se
omp.) 1-22 Introciucton to Data Ware, 4 |
‘Per the p m
SF Data Warehousing & Minit
uf Univ ormation Meta data
Extraction and Transf ————

of data from heter ogeneo Us. 80) ro


lemic yea = Thisis
ins
conte ins
conta the i nformalali ion about th /e extre
t ction
1 4.8.3 Comparison Dat:
lolce Bas 1. Used for Online Trans:
on in
i data slag
sta, ing area i
abol t data transformati i
informatio n abou as Data Warehousing.'
vk
It also :contains

End User Meta data 2, The tables and joins a


reduce redundant data
This particular category provides the end user the flexibility of looking for informatio, iy
3. Entity - Relational m«
their own way.
4,.. Optimized for write c
l Modelling. = =™
Syllabus Topic : E-R Modelling versus Dimensiona 5. Performance is low f
Data Warehouse
1.8 _E-R Modelling versus Dimensional Modellin: g ——__
1. Used for Online A
1.8.1. What is Dimensional Modeling ? ; Users for business ¢

> (MU - Dec. 2011, May 2017 2. The Tables-and joi


response time for a
— Itisa logical design technique used for data warehouses.
3. Data- Modeling te
- Dimensional model is the underlying data model used by many of the commercial OLAp 4. Optimized for rear
products available today in the market.
5. High performance
- Dimensional model uses the relational model with some important restrictions. 6. Is usually a Datat
~ It is one of the most feasible technique for delivering data to the end users in a dau
1.8.4 Compariso
warehouse. ,
Every dimensional model is composed of at least one table with a multipart key called
the
fact table and a set of smaller tables called dimension tables. -
1.8.2 Difference between Data Warehouse Modeling and Operationa
l Database
Modeling x

fe Operational Database Modeling


Data Warehouse Modeling
Current Values are the Data Conten
t. Data is Archived, Derived, Summarized.
Data structure is Optimized for Simplify t
tndrsaie Hee Data st Tucture isi Optimi
imi zed for complex can rotate
queries, views of t
Access frequency is High.
=
Access frequency is Medium to low. It is asym
a a Access type is Read, Update,
Delete. Data access type is only Read.
| Usage is Predictable, Rep
etitive. Usage is Ad hoc, Random, Heuristi Permit re
{Revo c.
lime is in Sub - seconds. It is exte
Response time is in Several sec
onds to
minutes. new d
| Large Number of users, decisior
Relatively small number
of users.
asl

Scanned by CamScanner
9 Data Warehousing & Mining (MU-Sem. 6-Comp.) 1-28 _Introduction to Data Warehouse
es

ee
EEE —— ————————————

|.8.3 Comparison Database and Data Warehouse Database

ks Used for Online Transactional Processing (OLTP) but can be used for other purposes such
as Data Warehousing. This records the data from the user for history.
The tables and joins are complex since they are normalized (for RDMS). This is done to
reduce redundant data and to save storage space.
b Entity - Relational modeling techniques are used for RDMS database design.
L.. Optimized for write operation,
Performance is low for analysis queries.
Jata Warehouse

Vs Used for Online Analytical Processing (OLAP). This reads the historical data for the
Users for business decisions.
The Tables and joins are simple since they are de-normalized. This is done to reduce the
response time for analytical queries.
Data - Modeling techniques are used for the Data Warehouse design.
Optimized for read operations.
High performance for analytical queries.
}. Is usually
a Database.

|.8.4 Comparison between Dimensional Model and ER model


Sr. Dimensional Model 56) 2X. 0 EER Medel,
No. bak Sy Bae anl sou eae CBOER e
1. | Support’ ad-hoc’ querying for business | Support for OLTP and ODS(Operational
analyst and complex analyzes (data | Data store)
warehouse and multidimensional database)
2. | Entities are linked through a series of joins. | Entities are linked through a series ‘of
joins.
3. | Simplify the view of the data model. You | The data model has only one dimension.
can rotate the data cube to see different
views of the data.
4. | It is asymmetric. "4 It is symmetric. All tables look the
same.

5. | Permit redundancy. _ | Remove the redundancy in data.


6. | Itis extensible to accommodate unexpected | If the model is modified, the
new data elements and new design | applications are modified.
decisions. The application is not changed.

Scanned by CamScanner
o

Ww Data Warehousing & Mining (MU-Sem. 6-Comp.) 1-24 Introduction to Data Warety
An, %
tly as per th Sr. Dimensional Model ERModdl wz Data Warehousing & Mining (Mi
mbai Un No,
SEeeaaeameaeaeaaeaaooooooooooooeeuem

Facts
academicy
7. | It is robust. The dimensional model design | It is variable in structure ang Yen (a) Occupied Rooms
er Choice £
can be done independent of expected query | vulnerable to changes in the use, (d) No of occupants
patterns. .| querying habits.
8. | The model is easy and understandable. The mode! for enterprise is very hard fy
people to visualize and keep in thes
heads. 1.10 The Star Schema
|
9. | The model really models a business. It is a | The model does not really model ,
body of standard approaches for handling | business. It models the micyy Date
common modeling situations in the | relationships among data elements. dimen

business world.

Syllabus Topic : Information Package Diagram


=
Stor
dim
1.9 Information Package Diagram

Information package diagram is the approach to determine the requirement. of data


— Star Schema is the m«
warchouse.
— Dimensions are store
It gives the metrics which specifies the business units and business dimensions.
Every Dimension tak
The information package diagram defines the relationship between the subject or
— A\l the unique ide
dimension matter and key performance measures (facts).
composite key in th
The information package diagram shows the details that users want so its effective for
- The fact table also
communication between the user and technical staff.
product_id giving |
Table 1.9.1 : Information Package for Hotel Occupancy
— Foreign keys for
Hotel Room Type Time- | Room Status product id and sto
Hotel Id Room id Time id Status id” — Inadimensional |

Branch Name room type - Thesize of the fe


Year Status Description
— The Facts in the
Branch Code room size Quarter
(i) Fully-addi
Region number of beds | Month dimension
Address type of bed Date (ii) Semi-add
cily/stat/zip max occupants | day of week dimensio!

construction year | Suite Example


day of month, °
current t
renovation year holiday flag 1 current|

Scanned by CamScanner
¥ pata Warehousing & Mining (MU-Sem, 6-Comp.)
De yita\\ Fate
1-25 Introduction to Data Warehouse
cf
&
iy)
é
(a) Occupied Rooms (b) Vacant Rooms
7 yi
(c) Unavailable Rooms
aN (d) No of occupants (ce) Revenue
‘Dp in a Syllabus Tople : Star Schema
= 1.10 The Star Schema
Ie “dey
ni! > (M
:
-UDec. 2010, May 2012, Dec. 2014)
ents f % ie "
dimension
ae

~
> “ Salestac; t f |__[Pro auct.
dimension} .
TE ee

5, . dimension| - :

of day Fig. 1.10.1: Sales Star Schema

— Star Schemaiis the most popular schema design for a data wareho
use.
’ Dimensions are stored in a Dimension table and every entry has its own
unique identifier.
ct — Every Dimension table is related to one or more fact tables.
- All the unique identifiers (primary Bal from the dimension tables make up for a
the composite key in the fact table.
- The fact table also contains facts. For example-a combination of store_id, date_key and
product_id giving the amount of a certain product sold on a given day ata given Store.
- Foreign keys for the dimension tables are contained in-a fact: table. For eg.(date key,
product id and store id) are all three foreign keys.
- Inadimensional modeling fact tables are normalised, whereas dimension tables are not.
- The size of the fact tables is large as compared to the dimension tables.
- The Facts in the star schema can be classified into three types :
(i) Fully-additive : Additive facts are facts that can be summed up through all of the
dimensions in the fact table.
| (ii) Semi-additive : Semi-additive facts are facts that can be summed up for some of the
dimensions in the fact table, but not the others. -
| Example : Bank Balances : You can take a bank account as Semi- Additive since a
. current balance.for the account can't be summed as time period; but if you want see
current balance of a bank you can sum all accounts current balance.

Scanned by CamScanner
Sem, 6-Comp.) 1-26
¥ Data Warehousing & Mining (MU- Introduction to
Data Wa,
eh ty
»N
( iii) Non-additive
on- -: Non-additive
, facts are facts that cannot be sum med UD for an . 2

VV dimensions present in the fact table. 27 Data Warehousing &


y of _—_—_—_—_——___!
a ——_—_—_—_—_————_
ai E.g. : Ratios, Averages and Variance.
SS
Advantages of Star Schema
1.11 STAR sche
- A star schema describes aspects of a business. It is made y P Of
multiple dime
and one fact table. For e.g. if you have a book selling business, s Om
nsig
e of the qj Tay, (i) Primary Keys :
tables would be customer, book, catalog and year. The fact
table woulg Mengi,, which identifies
information about the books that are ordered from each catalo Cony, Example :
by each Cust
particular year. omer
uring, In Professor_Dir
~ Reduced Joins, Faster Query Operation. Course_S
Cc
- Itis fully denormalized schema,
Se
- Simplest DW schema. Cou

- Easy to understand,
Easy to Navigate between the tables
due to less number of joins.
Most suitable for Query processing,
Example ;

limo_koy
day
day_of_tho_weak
mo
quanth ms an
rter me, Sales fact table
year
mh
item_key:
Item_name.
time_key w| brand
ua (ii) Surrogate ]
| type
Itam_key [* Supplier_key
- These k
branch_key - Theyd
Iccation_kay - All dat
branch_key oo
branch name , _ | location - Onem
branch_type units_sold me
| location_key - A four
dollers_sold Street .
city
(iii) Foreign k
|

State_or j
| Se cou ntry ee Every Dit
| Measures | ee
key of eac

Fig. 1.10.1; Sales


Star Schema

Scanned by CamScanner
a
yy Data Warehousing & Mining
(MU-Sem. 6-Comp.) 1-27
Introduction to Data Warehouse
Syllabus Topic : STAR Sche
ma Keys
Me, —s—s«11 = STAR schema Keys
Sig,
Re dig, q y (i)CsPrimary Keys : The ‘4
Primary key of the dim
uly She | which identifies each row ension table is one of
in a dimension table unique the attribute value
Me, ak Example : ly,

In Professor_Dimdimension,
Prof dis primary key of dimens
Course_Section_Dim ion Professor_Dim.
i
Course_ids-
i
Section_no. Period_Dim
i Course_name _, ~ Semester_id -
Course_grades Year
I .
“Course_id =
i Room_capacity™
{ _Semester_i
| | Student_id:
| Prof_id
| "grade
Student_Dim
! Professor_Dim Student_id
| )Student_name ©
‘ | Major

Department_id ee
Department_name:

Fig. 1.11.1
(ii) Surrogate Keys

- These keys are system generated


sequence numbers.
- They do not have any built in
meanings,
en

- All data warehouse keys must be mea


ningless surrogate keys,
|: ~— One must not use the original produc
tion keys.
| - A four byte integer makes a good surrog
ate key.
“Gi Foreign Keys

Every Dimension table has one-to-many rela


tionship with the fact table. So the primary
key of each dimension table is a foreign key in the fact table
.

;
}
Scanned by CamScanner
p.)i 1-2 8 Introduction to Da
w
5
j g (MU-Sem. 6-Com
Data Warehousing & Minin ta Warep,
SSN
*

Syllabus Topic : Snowflake Schema =F Data Warehousir


a

1.12 The Snowflake Schema


time_key =.
d =
> (MU- May 2011, Dec. 2011, May 2012, Dec. 2012, May 2013, May 2014, Dec,20 day_of_the_wee
- A snowflake schema is used to remove the low cardinality i.e attributes havin month
B ie quarter
distinct values, textual attributes from a dimension table and placing themj ina year
Secon
dimension table.
For e.g. in Sales Schema, the product category in the product dimension table cay
removed and placed in a secondary dimension table by normalizing the produ
ct dimensi,,
table. This process is carried out on large dimension tables.. branch |

It is a normalization process carried out to manage the size


of the dimension tables. Ba branch_key «
this may affect its performance:as joins needs to be performed. Branch_nami
branch_type
In a star schema, if all the dimension tables are normalised
then this schema js called x
snowflake schema, and if only few of the dimensions in a star sch ema
are normalised the
it is called as star flake schema.
Star Flake Schema

It is a hybrid structure (i.e. star schema + snowflake


schema).
Every fact points to one tuple in each of the dimensio 1.12.1 Differe
ns and has additional attributes.
Does not capture hierarchies directly.
Straightforward means of capturing a multiple
dimension data model using relations.
;
Ti :
Sales_Person_Dim
a
Sales_Person_key : ke =
Month :
= 2
| Quarter
Depariment.jd Pact Table
Sales_Person_key
eer
Time_key
weeLocation_Dim Location_key
Location_key ;
Cilics
ol Product_key Product_Dim
Sales_amount Product_key
Zones NO_of_prod_sold ” Prod_name
Region
Prod_type
Prod_supplier — 1.12.2 Ste
Fig. 1.12.1 Contd. Step1 : Sel
The fir
an understa

Scanned by CamScanner
Item [suppiier|
Sales fact table itom_key meee
> item_namea Supplier_key
time_key. _a| brand _ suppliar_type
Q wet
dy Ie | type wee
item_key [° supplier_key --T”
Ct di
>

branch_key
On ta rae location_key
bly ; ‘ branch_key a"
- location
: branch_name
a ij units_sold ‘~}-————
branch_type se
f ) location_key ©
bis : Street
mal ise, —— eres Es
‘olty_key. | [elty
ia
‘ avg_sales” nh,
=
ead
4 _State_or_province
country £

Fig. 1.12.1 : Sales Snowflake Schema


Utes, _ 1.12.1 Differentiate between Star Schema and.SSri
giatake Schema
[ Sr. No. | Scher | Snowflake Schema _ i
Ons, = schema contains the dimension A : Showitla
ee schema contains in-depth
tables mapped around one or more joins because
the tables are iia in to many
fact tables. pieces.
} 2 | It is a de-normalized model. It is the normalised form of Star schema.
No need to use complicated joins
j . | Have to use complicated joins, since it has

E
more tables.
Queries results fast. There will be some delay in processing the
Query.
5.. | Star Schemas are usually not in In Snowflake schema, dimension tables are
BCNF form. All the primary keys in 3NF, so there are more dimension tables
of the dimension tables are in the which are linked by primary — foreign key
fact table. relation.

1.12.2 Steps of Designing a Dimensional Model

Step1 : Select the Business Process


The first step in the design is to decide what business process(es) to model by comibining
an understanding of the business requirements with an understanding of the available data.

Scanned by CamScanner
=

3 Data Warehousing & Mining (MU-Sem. 6-Comp,) _ 1-30 iisiadibeva to Da


= Baten | @ vata Warehousing & Mining (MU
Da
Step 2 : Declare the Grain \ Syllabus
— Once the business process has been identified, the data warchouse team faces q seri iy,
«
Fact Constellation :
decision about the granularity. What level of detail must be made available jy
te \ 1.13 5
dimensional model?
— The grain of a fact table represents the level of information detail in a fact table, Declarig
the grain means specifying exactly what an indiviqual fact table record represents. \ Fact Constellation
sl
implies, itis
by a business proces, = As its name
— Itis recommended that the most atomic information captured ‘ : schema is more com|
= ‘ ‘
Atomic4 data is the most detailed information collected. The more detailed and atomic the This multiple fact tz
data better. Sa dimension ta
fact measurements are, the more we know and we can analyze the - is a
— Inthe star schema discussed above, the most detailed data would be transaction line item A schema of this type”
detail in the sale receipt. ‘ a sophistication.

(date, time, product code, product_name, price/unit, number_of_units, amount) For each star schema ©
hema.
= (18-SEP-2002, 11.02, pl, dettol: soap, 15, 2, 30) “That solution is very fl
But in the above dimensional model we provide sales data rolled up by product(all records The main disadvantag
corresponding to the same product are cmsatiries) in a store on a day. A typical fact table because many variants
record would look like this:
In a fact constellati

18-SEP-2002, Productl, Storel, 150, 600 -


dimension s, which
This may be useful j
. th and other facts with
This record tells us that on 18" Sept. 150 units of Product! was sold for Rs, 600 from | -
Storel. — Use of that model s
details down to the
Step 3: Choose the Dimensions which is calculated
dimensions
- Once the grain of the fact table has been choeen, the date, product, and store — In that case using
are readily identified. through a fact con
where these
It is often possible to add more dimensions to the basic grain of the fact table, Family of stars
additional dimensions naturally take on only one value under each combination of the
primary dimensions.
If the additional dimension violates the grain by causing additional ‘fact rows to be
generated, then the grain must be revised to accommodate this dimension.
Step 4 : Identify the Facts
The first step in identifying fact tables is where we examine the business, and identify the
transaction that may be of interest.
In our example the Electronic Point of Sale (EPOS) fansactons give us two facts,
quantity sold and sale amount.

Scanned by CamScanner
Ww Data Warehousing & Mining (MU-Sem. 6-Comp,) _ 1-31 Introduction to Data Warehouse
nes Syllabus Topic : Fact Constellation Schema
Se,

4.13 Fact Constellation Schema or Families of Star

Fact Constellation
| > (MU - May 2012, May 2014)
_ Asits name implies, it is shaped like a constellation of stars (i.e., star schem as).
— This schema is more complex than star or snowflake varicties, which is due to the fact that
jt contains multiple fact tables.
- Thisallows dimension tables to be shared amongst the fact tables.
- A schema of this type should only be used for applications that need a high level of
~ sophistication. ;
- For each star schema or snowflake schema it is possible to construct a fact constellation
schema.
- That solution is very flexible, however it may be hard to manage and support.
- The main disadvantage of the fact constellation schema is a more complicated design
because many variants of aggregation must be considered.
- In a fact constellation schema, different fact tables are explicitly assigned to the
dimensions, which-are for given facts relevant.
- This may be useful in cases when some facts are associated with a given dimension level
and other facts with a deeper dimension level.
— Use of that model should be reasonable when for example, there is a sales fact table (with
details down to the exact date and invoice header id) and a fact table with sales forecast
which is calculated based on month, client id and product id.
- In that case using two different fact tables on a different level of grouping is realized
through a fact constellation model.
Family of stars | Dimension: “Dimension:
table eo table.

Ne =
:=
ra
table. |
“Dimension pension,
‘table.
Sa | : eae

we peed
Dimension: NS “Dimension.
. table — stable

Fig. 1.13.1 : Family of stars

Scanned by CamScanner
p.) 1-32
eouig & Mining (MU-Sem. 6-Com
w Data wara—— —
——————————

Example : wy Data Warehousing & Mining (MU-Ser

— Likewise in a sales fact table we 1


time_key Dimension tables are companion
a ay item -

‘item_key = Each dimension table is defined


gil” Sales fact table
item_name
integrity with any given fact ta’
quarter :
year time_key 7 brand textual information.
of ee type
| Suppliot_type - To understand the concepts of
item_key
following scenario :
branch_key — Imagine standing in the marke
.
location_key —_
: down the quantity sold and the
—_ “
branch_key
- location| +” L- - Note that a measurement need
a units_sold ee
eee : Mian
7 focation_key “2 product, and store).
ranch_type .
‘ a; : street
dollers_sold: city : — The information gathered can |
:
province_or: state
avg_sales country +

Shipper_key’ ;
shipper_name
~ location_key ”
Fig. 1.13.2 : Sales Fact Consolidation
‘shipper: type

Syllabus Topic : Factless Fact Tables

1.14 Factless Fact Tables

1.14.1 Fact Tables and _- The facts are Sales_Unit,


Dimension Tables
additive), which depend o
- Adimensional model consists
of Fact tables and dimension dimensions are stored in dis
tables.
Each dimensional model has
a primary table Which is
the business measurements a fact table that is meant to contain | 1142 Factless Fact Tab
,
Numeric and additive are
the most useful facts.
A factct table has many to j - Factless table means only
man y relationships; it contains a
that join to a dimension tabié, set of two or more foreign keys | Used only to put relation |
~
> A fact in the fact table depend 4 -
Are useful to describe |
s on man y facts. For eg. In sales schem
depends on Product, Location a, sales_amount fact | something has/has not hay
and time . These factors are called
as dimensions.
Dimensions are factors on whi Often used to represent rr
ch a given fa ct depends. Th
thought of as a function of three variab e sales_amount fact can ~
les. also be The only thing they cont
event which is identifie
Sales_amount = (product,
location, time) tables.

Scanned by CamScanner
ike sal es_unit and cost.
pieemrcc:y in a sales fact table we may include other facts like
Likewise
- Dimension tables are companion tables to a fact table i nastar schema. j at Fags
for fr /eferen cain:
‘ Sart
- Each dimension table is defined by its primary key th at serves as the basis
tables con sista 3
integrity with any given fact table to which it is join ed. Most dimension Ave
textual information. sider the yet?
— To understand the concepts of facts, dimension, and star schema, let us con feea iond
following scenario : we
and writing
sold am
- Imagine standing in the marketplace and watching the products being
down the quantity sold and the sales amount each day for each product in each Stl
- Note that a measurement needs to be taken at every intersection of all dimensions (day,
product, and‘store), , :
Table 1.14.1 -
— The information gathered can be stored in the fact table as shown in
Table 1.14.1 : Fact Table

Date Key
Product Key
Store Key
Sales_Unit
Sales_Amount
. Cost
ee ,

_— The facts are Sales_Unit, Sales_Amount, and Cost (note that all are numeric and
additive), which depend on dimensions Date, Product, and Store. The details of the-
dimensions are stored in dimension tables.

leant to conte
1.14.2 Factless Fact Table
‘ > (MU - Dec. 2012, May 2013)
- Factless table means only the key available in the Fact there is no measures available.
Used only to put relation between the elements of various dimensions.
Are useful to describe events and coverage, i.e. the tables contain information that
something has/has not happened.
Often used to represent many-to-many relationships.
The only thing they contain is a concatenated key, they do still however represent a focal
sion
event which is identified by the combination of conditions referenced in the dimen
tables.

Scanned by CamScanner
of ae

‘ ity ‘ i ) 1-34 Introducti


¥F pata Warehousing & Mining (MU-Sem. 6-COMP a0 Data Ware, 4h,
Fig. 1.14.1.
An Example of Factless fact table can be scen in the

dimension Date key [Account dimension


—— Employee kay
ni aykay
Customer
Employee dimension eA

Customer dimension Loan key ___J-——~ Loan dimension


Fig, 1.14.1: A Factless Fact Table

_ There are two main types of factless fact tables :


4. Event tracking tables
— Use a factless fact table to track events of interest to the organization, For eXanpl,
attendance at a cultural event can be tracked by creating a fact table Containing th
following foreign keys (i. links to dimension tables) : event identifi,
speaker/entertainer identifier, participant identifier, event type; date. This table cx
then be queried to find out information, such as which cultural events or EVENt type
are the most popular. ,
- Following example shows factless fact table which records every time a Studey
attends a course or Which class has the maximum attendance? Or What is the average
number of attendance of a given course? » §
- All the queries are based on the COUNT() with the GROUP BY queries, So weca
first count and then apply other aggregate functions such as AVERAGE, MAX, MIN.

Attendance fact
\

Fig. 1.14.2 : Example of Event Tracking Tabl


es
2. _ Coverage Tables
The other type of factless fact table is called Coverage table by Ralph.
It is used to suppot
negative analysis report. For example a Store that did not
sell a product for a given period
To produce such report, you need to have
a fact tabl to capture all the possible
combinations. You can then figure out
what is Missing. © .
Common examples of factless fact
table ;
‘0 Ex-Visitors to the office,
© List of people for the web click.
© Tracking student attendance or registration events. -

Scanned by CamScanner
Se PIC : Update to the Dimension tapies
1.15 Update to the Di
mension Tables —

> (MU - May 2016)


Every
ate day more and more sales take place, so more and more rows are added to th e fact

Updation due to ieee


in fact table happens very
rarely.
Dimension tables are more stable
as compared tothe fact tables.
Dimension table changes
0f
hange in attributes themselves but not because
due to the ¢
inc rease in number of rows.
1.15.1 Slowly Changing Dimens
ions
Dimensions are generally constant over time but
if not constant then changes slowly.
The customer ID of the record remain same but the marital status or locati
on of customer
may change over time.
In OLTP system, whenever such change in attribute value mtagpen, the old values
replace
the new value by overwriting old ones.
But in data warehouse, overwriting of attributes is not the solution as historical data for
analysis is required always.
_So making such changes in attributes have 3 differesit Spee:
(i) Type 1 Changes,
(ii) Type 2 Changes,
(iii) Type 3 Changes

(i) Type 1 Changes


| ‘
Principles of Type 1 Changes
— Type 1 changes are related to the correction of errors in source systems.
- Example, In customer dimension if name changes from Anand Prasad - >Anandraj
|
Prasad
ritten.
— Sometimes changes has no importance, so old values can be overw
-— Changes need not be preserved in DW.
the Data Warehouse:
—~ Howto apply Type 1 Changes to
the old value with alue by by. overwriting the attribute value in the
new value
Replace
dimension table row.
old value.
No need to preserve the

Scanned by CamScanner
evised Syllab
ty Us (Rev - 2016) of
3-2019
‘edit and Grad
ing System)

TT B A = =

| SF ata warshousing & Mining (MU-Sem, -COmP) TSB ¥ Data Warehousing & Mining (ML
i table
di ension
dim s inin row.
Noneed to do other change
————— oe ]

— table. - This type of change part


affected for dimension
- The key values are not ustomer
Cust - Ifany change for same ;
i lement.
es are simple and easy to imp
— Type I chang [ CustomerKey _) How to apply Type 2 Char
———
Customer Namo - Add anew tuple in dir
_——$—$___|
Preduct Order fact Customer Coda - Include date field on w

Product Product Key


Material Status - Original row should n:
\ = Address — The new row is insert
Product Name Time Key $<
Example:
Product Code Gustomer Key ae
Brand Salesperson Key _ | 2p Customer Code:
ci23
Martal Status
Time Order Amount | ge Sal esperson (21/2/2008):
Time Ki Cost Amount z ene Mamed
ed - Salesperson Key Address (13/10/2014)
Das Margin Amount
Salesperson Name
Month SalesUnit
Teritory Name
uaner Region Name
Year
Fig. 1.15.1 ; Star schema for Order Tracking

— From Fig. 1.15.1 star Schema, If Customer Name of Customer Dimension or Prodiict
Name of Product dimension changes, then it comes under Type 1 change as there is
no need to preserve the old values. It can be overwritten with new names without
changing any Key Value.
(ili) Type 3 Changes
(ii) Type 2 Changes
- Check the attribute Marital Status of Customer Dimension. — Type 3 changes
— This attribute value can be changed from Single to Married over time. _— Complex queri
~ If the Marital status of Anand Prasad Changes from Single to Married, then old value o Hard to im

can not be overwritten as there may be a need to track orders by marital Status. o ‘Time-con
- If Anand Prasad Changes his state from Maharashtra to Gujarat, then also old value of o Hardton
state has to be preserved . : - InType2 ch
This type of changes are called Type 2 Changes where historical Values are preserved that cut-off
instead of overwriting .
“Married” al
General principles for type 2 change _ groups.
- Type 2 changes are related to the true changes in source systems. — Incase, if tl

- History of data must be preserved in the data warehouse, be handled


attribute fo!

Scanned by CamScanner
, ww Data Warehousing & Mini
g ning
(MU-Som.6-Comp.)
m 1-37 Introduction to Data Warehouse
r- Thist - of change partiiti tions the history in the data warch
W-5 — =

ouse,
'y change for same attribute, then
it must be preserved.
How to apply Type 2 Changes to
the Data Warehouse ?
- Add anew tuple in dimension table with
new value of the attribute.
- Include date field on which the change occurred in dimens
ion table strictly 25 >
- Origina
= l row shouldns not be changed and the key: value of that row is; not affected Mumba
. wef, acad
- The new row is inserted with a new surrogate key.
Cc
Example: / (spe

Customer Code:
C123
Martial Statu ‘ i
(21 78/2008): ‘ ‘ Key Restructuring
Married eid C123->39154112
Addiress (13/10/ee2014) i
Baie: ce 2
5278934

i Before “*-| |: After 21/9/2008 se | > After 13/10/2014


CustomerKey: | 33154112 ~ | |. 51141284 | |) 5a7egaa2
Customer Name: | |: Anandraj Prasad - Anandraj Prasad Anandraj Prasad
Ir Prodty
_ Custom Gode :*», _|
MartialerStatus: |.
if ge
ingle — |
“| | fa.
C128
Married
oad f C123
Married
S there} Address: |]. ; Mune Me ; : ee oe ce ¢ pain
Withog
Fig. 1.15.2
(iii) Type 3 Changes
— Type 3 changes are not common at all
_- Complex queries on type 2 changes may be :
-yaloe o Hardto implement
o ‘Time-consuming
ped o Hard to maintain
track the orders before or after
— In Type 2 changes, date is a cut- off point. If someone and
ntl
that cut-off date , then customer either fall in “single” group prior to that date
ted in both
any period of time, it cannot be coun
“Married” after that date. But for
groups. . .
In case, if the requirement is to count the orders on or after cut-off date then it cannot
value of
is a need to track both old and new
be nanled as a type 2 change. There
attribute for a certain period. This type of change is called Type 3 shane:

Scanned by CamScanner
fF
1

General Principles of Type 3 Change


-
go pa!
. ig feasiibleble iiff the:
Type 3 change usually consists of “soft” or tentative changes in the source system,
7 .
- The change in the attribute value needs to be preserved, so a history needs manageable.
maintained to keep a track of the old and new values. 1 be
- They can be used for performance comparison across the transition.
dimension changes Fl
— With type 3 change its possible to track forward and backward,
_ ‘ ion table * is:
Jf dimension
How to apply Type 3 Changes to the Data Warehouse
or more smaller dime
- No new dimension row is needed, ;
- Theexisting queries will seamlessly switch to the current value.
- Any queries that need to use the old value must be revised accordingly.
~ The technique works best for one soft change at a time.
~ If there is a succession of changes, more sophisticated techniques must be advised,

a
Fast Changing _—
1.15.2 Large Dimension Tables ‘attributes

- Large dimension tables are very deep and wide.


- Deep means it has a very large number of rows and wide means
it may have many
attributes or columns. : , — Move the rapidly
- To handle large dimensions, one can take out some mini dimensions from dimension table
a large
dimension as per the interest. These mini dimensions-can be represented in the _ — Example, Consid
form of
star schema. Income, Rating,
- For example, the above mentioned order analysis star schema is one of the — From Customer_
mini dimension
of manufacturing company in which the marketing department of as shown in Fig.
the company is ©
interested. ;
~ Customer and Product are generally large dimensions,
- Large dimensions are generally slow and inefficient
due to its size.
- They tend to have multiple hierarchies to perform variou
s OLAP operations like drill
down or roll up.
1.15.3 Rapidly Changing or Large Slowly
Changing Dimensions
- In type 2 changes, new row is created
with the new value of changed attribute. ; Fast Changing
preserves the history or old values of attributes This
, : Now slipt into a
:
~ If again there is a change in same dimens'
attribute, then again a new dimension
created with new value. table row is

Scanned by CamScanner
is oo is fee asible if the
: dimension
: ; changes infrequent! li ‘
q y like once or twice a
omrek Product dimension which has rows in thousands changes melt a =a
manageable.
But in case of customer dimensions, where number of rows are millions and
= infrequently. changes
then type 2 changes are feasible and not much difficult. If customer )
dimension changes rapidly then Type 2 changes are problematic and difficult.
If dimension table is rapidly changing and large then
break that dimension table into one
or more smaller dimension tables. x"
, cw
Customer_DIM Ratail_ Sales
CustomerID ;
Customer_Name
luct_I
. Customer_Address Paine
Customer_City sat ‘ | Employee_ID
ne Date_ID
eames Rating. Sales_Rev
as: j sis, “f- A
attributes Credit History Score. : oe
:
unt Status 5 :

Fig. 1.15.3
Ego
Move the rapidly changing attribute in. another dimensi
“ 4
on table and leave the original
at
ges
dimension table with slowly changing attributes.
Example, Consider Customer dimension, then following attributes changes rapidly Age,
Income, Rating, Credit history score, Customer account status, Weight.
From Customer_Dim, take out the fast changing attribute and create new mini dimension
as shown in Fig. 1.15.4.

Customer_D!
-Customer_ID -
Customer_Name ~
Customer_Address*:
Customer_City
DateID,
‘More... Sales_Rev
Avg_Sales
er eyes cs 5 . E % oe be is as : Z ¥ oe |

-| Cust_Mini_Dim eit tapicaece aha es


Fast Changing attributes_——?| Aga
now sliptintoanewmini . | Income.
dimension ‘Rating ‘fag
Credit History Score: -
“Account Status :
Weight,
Fig. 1.15.4

Scanned by CamScanner
4
Ww Data Warehousing & Mining (MU-Sem. 6-Comp.) _
1-40 Introduction to Data Werth,
housing§
1.15.4 Junk Dimensions F vata oe
Wate
= Fi Tom
source i.
legacy systems,
tems, some
some extual
te? data Or flags
= cannot- be the SI gn Itc;
fi ant fi I | soln. =
major
a] dimensions . But at the same

which can be used to consider such fields.


time, , It it cann oO { be di “scarded, = The Te e are Some
Ption, | (a) Star Schema
Tirn
| FTime_koy
~ Don’t consider such texts and flags. Discard
all while creating major dimensions, Buthy
doing this you may be losing some useful information. a
- Keep sul texts and flags as it is in fact [Month __
table. But due to this size of fact table
increase, . —_
without any advantage.
~ Create a separate dimension table for each
flag and text. With this, number of dime
table increases greatly. nsio, f [nate
~ Create a single “Junk” dimension and keep | Item_na
all meaningful texts and flags into it.
~ Such junk dimensions are useful to fire Ean
the queries based on flag or text valu
es.
Syllabus Topic : Aggregate
Fact Tables
-
1.16 Aggregate Fact Tables (b) Snowflake Sct
Time
‘Time_key
.
~ A fact table contains measurements > (MU - May 2014) :j Day
[ Day_ot_the_w
or so called as facts.
~ Fore.g Ina Sales data warehouse,
i 1 Month
a sales fact table contains - Quarter
one of the fact. units sold for a transaction
as Year
- Addition of some level
of aggregation to the
fact table, For €.g.
the Sales fact table by we could have aggregated
month. The number of units sold
Particular month, to a particular customer in a
Item
— Addition of summary data to the
toe

fact table. .
rr
ritam_key
ee

Thi is process would help


; liem_name_
P in retriievj
eving results for . &§ \ | Brand
required, queri i —
queries where aggregation would be | Type

Scanned by CamScanner
—_
soln. :
ma
(a) Star Sche
Time Branch
Time_key Branch_key
Day Brarich_name
week
Day_of_the_ Branch_typs
Month Sales
Quarter Time_key
year % Item_key
Blanch _key Location
Item : palin Key |_| Location _key
ltem_key : Unitsr_ss_oslodld| Street |
Stem names] -Dolla City
street
Brand Province_or_
Type

Fig, P. 1.17.1 : Sales Star Schema ‘

(b) Snowflake Schema


‘ Time Branch
Time_key © =. -Branch_key —
Day ae : Branch_name
Day_of_tho_week Sales :Branch_type
Month. wath Time_key. 5
Quarter item_key

Year ‘Branch_key
‘Location_key |
‘Units_sold |
‘Dollars_sold * Location
Item :Location_key
ltem_key Street ws
City_key ae
ltem_name
| Province_or_street®
Brand :

Be _ City
Supplier_type Supply Clty _key
\ Supplier_key City ;
Supplier_type Province_or_street-

Fig. P. 1.17.1(a) : Sales Snowflake Schema

Scanned by CamScanner
s i ining (MU-Sem. 6-Comp.) 1-42 Introduction t -
eee eee ie

R
eo) , 10 Data ware, vy Data Warehousing & Mining (MU
(c) Fact Constellation
Branch
Time (c) Can you con
key answer and c
Bra _name
Day

Ni
S Soln. +
Month (a) Star Schema
Course_Section_Dim
year.
Bra
key =
nch_key= |

Dollars_sold =~
les ==

Fact table

|
Shipping
|

|
key =
From_location

|
key ss
To_location © name:

. Shipper_|

Fig. P. 1.17.1(b) : Fact constellation for


sales
Ex. 1.17.2: The Mumbai university wants you to help desig (b) Total Courses Con
n a star schema to record grades
for course completed by students. There are
four dimensional tabies namely | Each Course has ave
course_section, professor, student, period
with attributes as follows :
Course_section Attributes: Course_|
Id, Section_number, Course_name, Units,
University stores dat
Room_id, Roomcapacity. During a Total Student in Uni
gi ven semester the college offers an average
of 500 course sections
Time Dimension =‘
Professor Attributes: Prof_id, Prof_Name, Title, Department_id, | ‘Now, Number of t
department_name.
5 semesters)
Student Attributes: Student_id, Student_name
, Major. Each Course section has (c) Snowflake Sche
an average of 60 students ,
Period Attributes: Semeste —
rid, Year. The database will contain Data for 30 Yes, the abo
months periods. The only fact
that is to be recorded in the fact table following as:
Grade is course
- Courses are
Answer the following Questions
normalized
(a) Design the star schema for this
problem, ~ Professor bi
(b) Estimate the number of rows in the
fact table, using the assumptions schema, so
Stated above and also estimate
the total size of the fact table (in
assuming that eac bytes) — Similarly
h field has an average of 5 bytes,
shown in t

Scanned by CamScanner
Data Warehousing & Mining (MU-Sem. 6-Comp.) Da | Warehouse
Introduction to Data
1-43 __

(ce) Can you convert this star schema to a snowflake schema ? Justify your
answer and design a snowllake schema if it is possible.
Soln. :
(a) Star Schema
, Course_Section_Dim
Course_id Period_Dim
Section_no Semestor_id
Course_name Year
Units Course_grades
Room_id Course_id
Room_capacity , Semester_id

Student_id
Prot_id
grade Student_Dim
Student_id
Professor_Dim Student_name
Prof_id
‘Major
Prof_name |
Title= °
Department_id.
Deparment_namg

Fig. P. 1.17.2: University Star Schema


(b) Total Courses Conducted by university = 500

Each Course has average students = 60


University stores data for 30 months
Total Student in University for all courses in 30 months = 500*60 = 30000
Time Dimension= 30 months= 5 Semesters (Assume 1 semester= 6 months)
Now, Number of rows of fact table= 30000*5= 150000 (one student has 5
grades for
3 semesters)

(c) Snowflake Schema

Yes, the above star schema can be converted to a snowflake schema, considering the
following assumptions.
- Courses are conducted in different rooms, so course dimension: can be further
normalized to rooms dimension
as shown in the Fig. P. 1.17.2(a).
— Professor belongs to a department, and department dimension is not added in the star
schema, so professor dimension can be further normalized to department dimension.
— Similarly students can have different major mnpeotts so it can also be normalized -
shown in the Fig. P. 1.17.2(a).-

Scanned by CamScanner
EW Revis
] aa d Syllabus rp...
-
YF Data Warehousing & Mining (MU-Sem. 6-Comp.) 1-44 Introduction to
Data Wa,
Tobe, a
Reom_Dim Course_Dim ty
Te : Period _Dim
‘ WF pata Warehousing & Mining (MU-Sem,
6
Time period < 5
ear years
Stora + 300 stores
Teporting
Course, Product : 40,000 Produc
ts |
rs. Promotion: a sold ite
m me
Semester.
ti
udent_id Soin. :
G (a) Star schema:

Department_Dim Professor_Dim Student_Dim Major_Dim


Prof_ r
t Prof_name
t

Fig. P. 1.17.2(a) : University Snowflake Schema

Ex. 1.17.3:. Package Typs


Draw star schema for “Hotel Occupancy” considering dimensions. like Time,
Hotel etc. MU - May 2012, Dec. 2013, 10 Marks Unit of Measure
Soln.: Draw the Star Schema Units per case
Shelf level
Rooms_Dim™' :.. Shell width
Time_Dim Room id
Year [Room Type > Shell depth
Quarter ‘Room Name. -
Month “os Types of Bed ©
ne oR Suite e
jayof weak* Max Occupants:
Day of Month NO Fact_Table B

Holiday Flag Hotel Occupancy my Month


Branch Code Month Number
“Room Id
Status id
To Date Holiday Flag
From Date
No of Occupants ¥i
Hotels_Dim
Hotel Id ‘Room_Status_Dim
Branch Code Status Id (b) Time period = 5 years:
Branch Name
* [Region
Status Desc
There are 300 stores,
Address Each stores daily sale
Fig. P. 1.17.3 : Hotel Occupancy Star Schema Promotion = 1
|

For a Supermarket Chain consider the following dimensions, namely Product, Maximum-number 0
Ex, 1.17.4:
store, time , promotion. The schema contains a central fact tables sales facts Ex.1.17.5: Drawa Si
with three measures unit_sales, dollars_sales and dollar_cost.
Design star schema and calculate the maximum number of base fact table
records for the values given below:

Scanned by CamScanner
WF Data Warehousing & Mining (MU-Sem, 6-Comp.) 1-45 Introduction to Data Warehouse

Time period : 5 years


Store : 300 stores reporting daily sales
Product : 40,000 products in each store(about 4000 sell in each store daily)
Promotion : a sold item may be in only one promotion in a store on a given day
MU - May 2013, May 2016, 10 Marks, Dec. 2014, 5 Marks
Soln. :
(a) Star schema :
Product Key
SKU Number -. Product Store Store Key
40,000 product 300 stores Stora Name
Product Description
(only 4,000 sell in Store ID
. Brand Name each store daily) Aaidress
Product Sub-Category
Product Category . City
Department. State
Package size Zip
Package Type District

Weight Sales facts table Manager


Unit of Measure Product Key - Floor Plan: *
Units per case Time Key . Services Type -
Shelf level Store Key
Shelf width Promotion Kay’
Shelf depth Unit Sales / _ ‘Promotion Key.
Doller sales: » Promotion Name
Time Key . Doller Cost. > Promotion Typs
Date 2 billion fact table item _ Display Type
Day of week ' ene In only one peubon Type
: Media Type
Week Number promotion, 5
per stone, _ >. Promotion Cost
Month
Month Number 5 years or per day |___ Start Date_|
1,825 days PROMOTION | . . End Date
Quarter
Year
TIME ‘Responsible Manager
Holiday Flag

Fig. P. 1.17.4 : Sales Promotion Star Schema

(b) Time period = 5 years x 365 days = 1825


There are 300 stores,
SS

Each stores daily sale = 4000


Promotion = |
Maximum-number of fact table records: 1825 x 300 x 4000 x 1= 2 billion

Ex. 1.17.5: Drawa Star Schema for Student academic fact database.
et

Scanned by CamScanner
W Revised Sviliar.

WF pata Warehousing& MnO ,


School_Dim °
Soln. : Studant_Dim School_key , WW pata Warehousing & |
——

Student_Key School
Patient_Dim
Student Address
Address City
Patient_key
Name
City : State Gender
State Zip
Deoctor_Dir
saad Academlc_Fact_Tabl siype. Doctor_Ker
Age ay Student_key:. Status_Dim Narre
Sex _ 2 Project_key. eigtds-hey Aifiation
Status_key. Fa ;
School_key

Project_Dim
-Project_! Keys =)
- Descriptio
Type
: Category.

Fig. P. 1.17.5 : Student Academic Star Schema

Ex. 1.17.6: List the dimensions and facts for the Clinical Information System and Desz
Star and Snow Flake Schema. -
Soln. :
Dimensions
1. Patient  2. Doctor  3. Procedure  4. Diagnosis  5. Date of Service  6. Location  7. Provider
Facts
1. Adjustment  2. Charge  3. Age
The star schema : Patient_Dim (Patient_key, Name, Gender), Doctor_Dim (Doctor_Key, Name, Affiliation), Procedure_Dim (Procedure_key, Descript, Subgroup, Group), Diagnosis_Dim (Diagnos_key, Descript, Subgroup, Group), Date_of_Service_Dim (Dateser_key, Date, Month, Quarter, Year), Location_Dim (Location_key, Name, Owner) and Provider_Dim (Provider_key, Name, Type) surround the Fact Table for Procedure & Billing History (Patient_key, Dateser_key, Doctor_key, Location_key, Provider_key, Diagnos_key, Procedure_key, Adjustment, Charge, Age).

Fig. P. 1.17.6 : Clinical Information Star Schema

In the snowflake schema the Doctor dimension is further normalized into Specialization_Dim (Specialization_key, Major, Department) and the Location dimension into City_Dim (City_key, City_name, State); the remaining dimension tables and the fact table stay the same as in the star schema.

Fig. P. 1.17.6(a) : Clinical Information Snowflake Schema
Ex. 1.17.7 : Draw a Star Schema for Library.
Soln. : Dimension tables Time (Time_id, Date, Day, Month, Year, Holiday), User, Librarian, Supplier (S_id, Supplier_name, City, Working_time) and Book (Book_id, Book_name, Author, Publication, Edition, Total_copies, Availability, No_of_issued_copies) surround the Book_Purchase_Fact table (Time_id, User_id, S_id, Librarian_id, Book_id, Current_purchase).
Ex. 1.17.8 : Draw a star schema for sales analysis.
Soln. : Dimension tables Sales_Person_Dim (Sales_Person_key, Department_id), Time_Dim (Time_key, Quarter, Year), Location_Dim (Location_key, Zone, Region) and Product_Dim (Product_key) surround the Fact_Table (Sales_Person_key, Time_key, Location_key, Product_key, No_of_prod_sold).
Ex. 1.17.9 : A bank wants to develop a data warehouse for effective decision-making about its loan schemes. Loans are provided for purposes like house building loan, car loan, educational loan, personal loan, etc., and the warehouse should record an entry for each disbursement of loan to a customer.
With respect to the above business scenario :
(i) Design an information package diagram. Clearly explain all aspects of the diagram.
(ii) Draw a star schema for the data warehouse, clearly identifying the fact tables, dimension tables, their attributes and measures.
(MU - May 2014, 10 Marks)
Soln. :
(i) Information Package Diagram : dimensions Time (Time_key : Day, Day_of_week, Month, Quarter, Year, Holiday_flag), Customer (Customer_key : Account_number, Account_type, Loan_type), Branch (Branch_key : Branch_area, Branch_home) and Location (Location_key : Region, State, City, Street).
(ii) Star Schema for a Bank : the fact table (Time_key, Customer_key, Branch_key, Location_key, Loan_amount, Rate_of_interest, Payment_amount) is surrounded by the Time, Customer, Branch and Location dimension tables listed above.

Fig. P. 1.17.9 : Star schema for Bank

Ex. 1.17.10 : Draw star schema for Video Rental.
Soln. : Dimension tables Rental Item (Item_Key, Item No, Rating, New Release Flag, Genre, Classification, Sub-Classification, Director, Lead Male), Customer (Customer_Key, Customer Name, Customer Code, Address, State, Rental Plan Type), Time (Time_Key, Date, Day of Month, Week End Flag, Month, Quarter, Year), Store (Store_Key, Store Name, Store Code, Zip Code, Manager) and Promotion (Promotion_Key, Promotion Code, Promotion Type, Discount Rate) surround the Rental Facts table (Item_Key, Time_Key, Customer_Key, Store_Key, Promotion_Key, Rental Fee, Late Charge).
Ex. 1.17.11 : Star schema for retail chain.
Soln. : Dimension tables Time, Customer (Customer key, Name, Age, Income, Gender, Status), Store, Product (Product key, Name, Brand, Category, Colour, Price) and Mode (Mode key, Payment mode, Interest rate) surround the fact table (Time key, Product key, Customer key, Store key, Mode key, Actual sales, Forecast sales, Price, Discount).
Ex. 1.17.12 : Design Star schema for autosales analysis of a company.
Soln. : Dimension tables Time (Week, Month), Branch, Item (Item_key, Item_name, Brand) and Location (Location_key, Street, Province_or_state) surround the sales fact table.
Ex. 1.17.13 : Consider the following database for a chain of bookstores.
BOOKS (Booknum, Primary_Author, Topic, Total_Stock, Price)
BOOKSTORE (Storenum, City, State, Zip, Inventory_Value)
STOCK (Storenum, Booknum, Qty)
With respect to the above business scenario, answer the following questions. Clearly state any reasonable assumptions you make.
(a) Design an information package diagram.
(b) Design a star schema for the data warehouse clearly identifying the fact table(s), dimension table(s), their attributes and measures.
Soln. :
(a) Information Package Diagram : dimensions BOOKS (Booknum, Primary_author, Topic) and BOOKSTORE (Storenum, City, State, Zip), with Qty.
Facts : Total_Stock, Price, Inventory_value
(b) Star Schema : Dim_Books (Booknum, Primary_author, Topic) and Dim_Bookstore (Storenum, City, State, Zip) surround the fact table with measures Total_Stock, Price, Inventory_value and Qty.
Ex. 1.17.14 : One of India's largest retail departmental chains, with annual revenues touching the $2.5 billion mark and over 3,600 employees working at diverse locations, was keenly interested in a business intelligence solution that can bring clear insights on operations and performance of departmental stores across the chain. The company needed to support a data warehouse that loads daily sales data from Point of Sale (POS) systems across all locations, with 80 million rows and 71 columns.
(a) List the dimensions and facts for the above application.
(b) Design star schema and snowflake schema for the above application.
(c) Design a BI application which will provide the retail chain company with features and performance that meet their objectives, using any data mining technique.
Soln. :
(a) Dimensions : Product, Store, Time, Location
Facts : Unit Sales, Dollar Sales, Dollar Cost
(b) Star Schema and Snowflake Schema
PRODUCT_DIM (Product Key, Supplier Type, Product Description, Brand Name, Product Sub-Category, Product Category, Department, Package Size, Package Type, Weight, Unit of Measure, Units per Case, Shelf Level, Shelf Width, Shelf Depth), STORE_DIM (Store Key, Store Name, Store ID, Address, City, State, Zip, District, Manager, Floor Plan, Services Type), TIME_DIM (Time Key, Date, Day of Week, Week Number, Month, Month Number, Quarter, Year, Holiday Flag) and LOCATION_DIM (Location_key, Street, City, Province_or_state) surround the Sales facts table (Product Key, Time Key, Store Key, Location Key, Unit Sales, Dollar Sales, Dollar Cost).

Fig. P. 1.17.14 : Star Schema
In the snowflake schema, PRODUCT_DIM is further normalized by moving the supplier attributes into a separate Supplier dimension (Supplier_key, Supplier_type); STORE_DIM, TIME_DIM, LOCATION_DIM and the Sales facts table remain as in the star schema.

Fig. P. 1.17.14(a) : Snowflake Schema

(c) Need for BI Solution for Retail Business
Whether customers are shopping for daily needs or for special occasions, decisions taken at all points of the retail operation have maximum impact on your bottom line. To cope with these challenges, many retailers are building unified repositories of data known as data warehouses.
- Linked to the right tools, and tied to the right business processes, data warehouses can drive faster, better decision making, deeper and more revealing understandings of customer behaviour, and precise - even prescient - control over every aspect of the business.
- Data warehousing can serve as the foundation for building strategic and sustainable business advantages from the point-of-sale all the way up to the executive office.

Operational systems → Central database → Multidimensional analytics → Business reports

Fig. P. 1.17.14(b) : Structure of BI solution for Retail Business
Advantages of BI Solution for Retail Company
- The Retail Business Intelligence Solution has the scope and flexibility to enable better, more timely decision making at all levels of an enterprise.
- It promotes unmatched customer-centricity, and more effective and profitable cross-channel selling.
- The Solution helps you follow through on customer-centric strategies by integrating product and customer information to yield deeper, actionable insights into not just what is selling, but to whom it is selling and why.
- Given customer, operational and competitive data as input, the Solution can provide real-time monitoring, localized analyses and automated exceptions management to respond to fast-changing trends and business conditions.
1.18 University Questions and Answers

May 2010
Q.1 Define Data Warehouse. Explain the architecture of data warehouse with suitable block diagram. (Ans. : Refer sections 1.3 and 1.6.2) (10 Marks)
Q.2 Define Metadata. What are the different types of metadata stored in a data warehouse ? Illustrate with a simple customer sales data warehouse. (Ans. : Refer sections 1.7.1, 1.7.3 and 1.7.4) (10 Marks)
Q.3 How do the top-down and bottom-up approaches for building a data warehouse differ ? Discuss the merits and limitations of each approach. (Ans. : Refer sections 1.5.1 and 1.5.2) (10 Marks)
Dec. 2010
Q.4 Explain the characteristics of the data present in the data warehouse. (Ans. : Refer section 1.3.3)
Q.5 Define Data Warehouse. Explain the architecture of data warehouse with a block diagram. (Ans. : Refer sections 1.3 and 1.6.2)
Q.6 Distinguish between Top-Down and Bottom-Up Approach. (Ans. : Refer sections 1.5.1 and 1.5.2)
Q.7 Write short notes on Metadata. (Ans. : Refer section 1.7)
Q.8 A dimension table is wide; the fact table is deep. Explain. What is STAR schema ? Explain its advantages. (Ans. : Refer section 1.10)
Ans. :
Dimension Table
A dimension table holds all the detail information of its respective entity; e.g., the customer dimension table contains all the related information about the customer, whereas the fact table contains the main data : the surrogate keys of every dimension along with the measures.
A dimension table contains higher granular information and so has a smaller number of rows (it is wide), while the fact table is kept at the lowest grain of the subject area; this lower grain causes a very large number of rows in the fact table (it is deep).
May 2011
Q.9 Define a data warehouse. Explain the architecture of data warehouse with suitable block diagram. (Ans. : Refer sections 1.3 and 1.6.2) (10 Marks)
Q.10 Write short notes on Top-down and Bottom-up approaches in data warehousing. (Ans. : Refer sections 1.5.1 and 1.5.2) (5 Marks)
Q.11 Write short note on Snowflake schema. (Ans. : Refer section 1.12) (5 Marks)
Dec. 2011
Q.12 Explain Architecture of data warehouse. (Ans. : Refer section 1.6.2) (10 Marks)
Q.13 Explain in detail metadata and its various types. (Ans. : Refer sections 1.7.1 and 1.7.4) (10 Marks)
Q.14 Explain Snowflake Schema with example. (Ans. : Refer section 1.12) (10 Marks)
Q.15 What is dimensional Modeling ? Explain in detail. (Ans. : Refer section 1.8.1) (10 Marks)
Q.16 Design Star Schema for autosales analysis of the company. (Ans. : Refer Ex. 1.17.12) (10 Marks)
May 2012
Q.17 Define a data warehouse and hence explain its architecture. Explain what is the need for developing a data warehouse. (Ans. : Refer sections 1.3 and 1.6.2) (10 Marks)
Q.18 Differentiate between top-down and bottom-up approaches for building a data warehouse. Explain the advantages and disadvantages of each of them. (Ans. : Refer sections 1.5.1 and 1.5.2) (10 Marks)
Q.19 What is meant by meta data ? Explain with an example. Explain the different types of meta data stored in a data warehouse. (Ans. : Refer sections 1.7.1, 1.7.3 and 1.7.4) (10 Marks)
Q.20 State and explain the various schemas in detail with examples of each of them. (Ans. : Refer sections 1.10, 1.11 and 1.12) (10 Marks)
Q.21 Define what is meant by an information package diagram. For recording the information requirements for "Hotel Occupancy" having dimensions like time, hotel etc., give the information package diagram for the same; also draw the star schema and snowflake schema. (Ans. : Refer Ex. 1.17.3) (10 Marks)
Q.22 Explain dimension Modeling in detail. (Ans. : Refer section 1.8.1) (10 Marks)
Dec. 2012
Q.23 What is the role of Meta data in a data warehouse ? Illustrate with examples. (Ans. : Refer section 1.7) (10 Marks)
Q.24 Consider the following database for a chain of bookstores.
BOOKS (Booknum, Primary_author, Topic, Total_Stock, Price)
BOOKSTORE (Storenum, City, State, Zip, Inventory_value)
STOCK (Storenum, Booknum, Qty)
With respect to the above business scenario, answer the following questions. Clearly state any reasonable assumptions you make.
(a) Design an information package diagram.
(b) Design a star schema for the data warehouse clearly identifying the Fact table(s), Dimension table(s), their attributes and measures.
(Ans. : Refer Ex. 1.17.13) (10 Marks)
Q.25 Define the following terms by giving examples :
(a) Factless Fact tables (Ans. : Refer section 1.14.2)
(b) Snowflake Schema (Ans. : Refer section 1.12)
May 2013
Q.26 What are the differences between Data Warehouse and Data Mart ? (Ans. : Refer section 1.4)
Q.27 Explain the role of Meta data in a data warehouse. (Ans. : Refer section 1.7) (10 Marks)
Q.28 Write detailed notes on Data Warehouse Architecture. (Ans. : Refer section 1.6.2) (10 Marks)
Q.29 For a Supermarket Chain consider the following dimensions, namely product, store, time, promotion. The schema contains a central fact table, sales facts with three measures unit_sales, dollars_sales and dollar_cost. Design star schema for this application.
Calculate the maximum number of base fact table records for the warehouse with the following values given below :
Time period : 5 years
Store : 300 stores reporting daily sales
Product : 40,000 products in each store (about 4000 sell in each store daily)
(Ans. : Refer Ex. 1.17.4) (10 Marks)
Q.30 Define the following terms by giving examples :
(a) Factless fact tables (Ans. : Refer section 1.14.2) (5 Marks)
(b) Snowflake Schema (Ans. : Refer section 1.12) (5 Marks)
Dec. 2013
Q.31 What is a Data warehouse ? Explain the three tier architecture of a Data Warehouse with a block diagram. (Ans. : Refer sections 1.3 and 1.6.3) (10 Marks)
Q.32 What is meant by meta data ? Explain with example. Explain the different types of meta data stored in a data warehouse. Illustrate with examples. (Ans. : Refer sections 1.7.1, 1.7.3 and 1.7.4) (10 Marks)
Q.33 Define what is meant by an information package diagram. For recording the information requirements for "Hotel Occupancy" having dimensions like time, hotel etc., give the information package diagram for the same; also draw the star schema and snowflake schema. (Ans. : Refer Ex. 1.17.3) (10 Marks)
May 2014
Q.34 Explain clearly the role of Meta data in a data warehouse. (Ans. : Refer section 1.7) (10 Marks)
Q.35 Write detailed notes on Data Warehouse Architecture. (Ans. : Refer section 1.6.2) (10 Marks)
Q. 36 A bank wants to develop a data warehouse for effective decision-making about their
. loan schemes. The bank provides loans to customers for various purposes like house
building loan, car loan, educational loan, personal loan, etc. The whole country is
categorized into a number of regions, namely, north, south, east and west. Each
region consists of a set of states. Loan is disbursed to customers at interest rates that
change from time to time. Also, at any given point of time, the different types of loans
have different rates. The data warehouse should record an entry for each
disbursement of loan to customer.
(a) Design an information package diagram for the application.
(b) Design a star schema for the data warehouse clearly identifying the fact tables
(s), dimensional table (s), their attributes and measures.
(Ans. : Refer Ex. 1.17.9) a (10 Marks)
Q.37 Define the following terms by giving examples :
(a) Fact constellation (Ans. : Refer section 1.13)
(b) Snowflake schema (Ans. : Refer section 1.12)
(c) Aggregate fact tables (Ans. : Refer section 1.16) _ (15 Marks)
Dec. 2014

Q.38 What are the different characteristics of a Data Warehouse ?


(Ans. : Refer section 1.3.3) (5 Marks)
Q.39 Explain the role of meta data in'a data warehouse.
(Ans. : Refer section 1.7) (10 Marks)
Q.40 Write detailed notes on Data Warehouse Architecture. ,
(Ans. : Refer section 1.6.2) (10 Marks)
Q.41 For a supermarket chain consider the following dimensions, namely product, store,
time, promotion. The schema contains a central fact table, sales facts with three
measures unit_sales, dollars_sales and doller_cost. Design star schema for this
; application. (Ans. : Refer Ex. 1.17.4) (5 Marks)
Q.42 Define the following terms : (10 Marks)
(a) Dimension tables (Ans. : Refer section 1.10)
(b) Snowflake schema (Ans. : Refer section 1.12)
May 2016

Q.43 Illustrate the architecture of a typical DW system. Differentiate DW and data mart. (Ans. : Refer sections 1.6.2 and 1.4) (10 Marks)
Q.44 Write short note on: Operational Vs. decision Support poten.
(Ans. : Refer section 1.2.3) (5 Marks)
i

Scanned by CamScanner
Q.45 For a supermarket chain, consider the following dimensions, namely product, store, time and promotion. The schema contains a central fact table for sales.
i. Design star schema for the above application.
ii. Calculate the maximum number of base fact table records for the warehouse with the following values given below :
Time period - 5 years
Store - 300 stores reporting daily sales
Product - 40,000 products in each store (about 4000 sell in each store daily)
(Ans. : Refer Ex. 1.17.4) (10 Marks)
Q.46 Write short note on : Updates to dimension tables. (Ans. : Refer section 1.15) (5 Marks)
Chapter Ends
Module 2

ETL Process and OLAP

Syllabus :

Major steps in ETL process, Data extraction : Techniques, Data transformation : Basic
tasks, Major transformation types, Data Loading : Applying Data, OLTP Vs OLAP,
OLAP definition, Dimensional Analysis, Hypercubes, OLAP operations : Drill down, Roll
up, Slice, Dice and Rotation, OLAP models : MOLAP, ROLAP.

2.1 An Introduction to ETL Process

=> (MU - May 2012)

2.1.1 What is ETL Tool?

- An ETL tool is a tool that reads data from one or more sources, transforms the data so that
it is compatible with a destination and loads the data to the destination. ~
- ETL functions reshape the relevant data from the source systems into useful information
to be stored in the data warehouse.
2.1.2 Desired Features

- Automated data movement across data stores and the analytical area in support of a data warehouse, data mart or ODS effort.
- Extensible, robust, scalable infrastructure.
- Standardization of ETL processes across the enterprise.
- Reusability of custom and pre-built functions.

- Better utilization of existing hardware resources.
- Faster change-control and management.
- Integrated meta-data management.
- Complete development environment, "work as you think" design metaphor.
Syllabus Topic : Major Steps in ETL Process
2.2 Major Steps in ETL Process

> (MU - May 2012)

Following are the major steps in the ETL process; these steps give a good insight into the ETL process.
- Find out all the target data needed in the data warehouse.
- Find out all the data sources (internal and external).
- Prepare data mapping for target data elements from sources.
- Establish comprehensive data extraction rules.
- Determine data transformation and cleansing rules.
- Plan for aggregate tables.
- Organise data staging area and test tools.
- Write procedures for all data loads.
- ETL for dimension tables.
- ETL for fact tables.
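The list above is a planning checklist rather than an algorithm, but the flow it leads to can be sketched in a few lines of Python (a minimal sketch; the record layout and helper names are illustrative, not part of any real tool) :

def transform_record(row):
    # Basic transformation rule: standardize text fields (trim spaces, upper case).
    return {k: v.strip().upper() if isinstance(v, str) else v for k, v in row.items()}

def run_etl(source_rows, dimension_table, fact_table):
    """Minimal extract-transform-load flow over in-memory structures."""
    transformed = [transform_record(r) for r in source_rows]   # transform extracted rows
    for r in transformed:                                      # load dimension table first,
        dimension_table.setdefault(r["product"], len(dimension_table) + 1)   # assigning surrogate keys
    for r in transformed:                                      # then load facts referencing those keys
        fact_table.append({"product_key": dimension_table[r["product"]], "units": r["units"]})

dims, facts = {}, []
run_etl([{"product": " trousers ", "units": 3}, {"product": "shirts", "units": 5}], dims, facts)
print(dims, facts)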

Syllabus Topic : Data Extraction Techniques

2.3 Data Extraction

> (MU - May 2016)


- Data extraction and conversions differ in a data warehouse as compared to operational systems. Two major factors differentiate them; they are :
1. Data is extracted from many disparate sources for a data warehouse.
2. Data needs to be extracted as ongoing incremental loads and as a one-time initial full load.

Scanned by CamScanner
Wi Data Warehousing & Mining (MU-:-Som. 6-Comp,) 2-9 0 ETL Process

~ These two factors increase the complexity of the data extraction process in a data
warehouse. This data extraction can be carried out using a third party tool along with in
house program or scripts.

List of Data extraction issues

— Source identification : The Source Application and the source structures are identified.
— Method of extraction : The extraction process can be cither manual or tool based for
each of the source systems.
- Extraction frequency : Data extraction frequency needs to be defined whether it should
be carried out on a daily basis, weekly or quarterly. %

— ‘Time window: Time window needs to be decided for each data source.

- Job Sequencing : find out whether the beginning of one job has to wait until the previous
job has finished completely.
Exception Handling : The input records that cannot be eral have to be handled upon

2.4 Identification of Data Sources

. => (MU - May 2016)


- List each data item of metrics or facts needed for analysis in fact tables.
- List each dimension attribute from all dimensions.
- For each target data item, find the source system and source data item.
- If there are multiple sources for one data element, choose the preferred source.
- Identify multiple source fields for a single target field and form consolidation rules.
- Identify a single source field for multiple target fields and establish splitting rules.
- Ascertain default values.
- Inspect source data for missing values.

2.5 Data in Operational Systems

> (MU - May 2010, Dec. 2010, May 2011, Dec. 2011, May 2012, Dec. 2012, May 2013, Dec. 2013, May 2014, Dec. 2014, May 2016)
Source systems usually store data in two ways, current value and periodic status. The type of data extraction technique to be used has to be decided based on the category into which the data falls.

Current Value
- The value of an attribute represents the value at this moment of time.
- Values are transient or transitory.
- The change of value in this category is dependent on the business transaction.
- Change in value, or how long the value will stay, cannot be predicted.
- Some examples are customer name, address, account balance and outstanding amount on an individual order.

Periodic Status
- In periodic status, every time a change occurs the value of the attribute needs to be preserved.

Storing current value
Attribute : Customer's state of residence
6/1/2000   Value : OH
9/15/2000  Changed to CA
1/22/2001  Changed to NY
3/1/2001   Changed to NJ

Storing periodic status : every change is stored together with its effective date (6/1/2000, 9/15/2000, 1/22/2001, 3/1/2001), so the full history of values is preserved.

Fig. 2.5.1 : Example of Current Value and Periodic Status
2.5.1 Immediate Data Extraction
- In this technique data extraction happens in real time, as the transactions occur in the source systems. The three options are : capture through transaction logs, capture through database triggers, and capture in source applications.

Fig. 2.5.2 : Immediate data extraction

1. Capture through transaction logs
- In this option, the transaction log maintained by the DBMS is used for data extraction.
- All the committed transactions are selected from the transaction log.
- All the transactions have to be extracted before the log file is refreshed.
- This option is not feasible if the data source is other than a database application, e.g. indexed or flat files.
- Data replication is a method for creating copies in a distributed environment.
Fig. 2.5.3 : Data extraction using replication technology
2. Capture through database triggers
- This option is also applicable if the source system is a database application.
- Database triggers are stored procedures which are invoked when certain predefined events occur.
- Trigger programs may be created for all events for which data needs to be captured.
- The output of the trigger program is written to a separate file which will be used to extract data for the data warehouse.
- Building trigger programs adds an extra burden on the development effort.
3. Capture in source applications
- This option is also known as application-assisted data capture.
- Relevant application programs need to be modified so that they write to source files and capture the changes.
- This option can be used for any type of source system, e.g. a database application, indexed file or flat file.
2.5.2 Deferred Data Extraction
- This option does not capture the changes in real time.
- There are two options : (i) capture based on date and time stamp, and (ii) capture by comparing files.
1. Capture based on date and time stamp
- The time stamp is considered as a basis for selecting the records for data extraction.
- A time stamp is marked on the source record whenever it is created or updated.
- This technique works well if the number of revised records is small.
- This technique works for any type of source file, provided the source file contains a date and time stamp.
2. Capture by comparing files
- If all of the above techniques are not feasible for the source file, then this option is considered.
- This technique is also called the snapshot differential technique, as it compares two snapshots of the source data.
- It captures the changes by comparing the two copies of data.
- Although this technique is simple, comparing full rows in a large file can be very inefficient.
- However, this technique is feasible when the legacy source systems do not have transaction logs or time stamps on source records.
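To make the timestamp-based option concrete, here is a minimal sketch of incremental extraction (the table, column names and sample rows are illustrative only; an in-memory database stands in for the source system) :

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customer (id INTEGER, name TEXT, last_modified TEXT)")
conn.executemany("INSERT INTO customer VALUES (?, ?, ?)",
                 [(1, "Asha", "2024-01-05 10:00:00"),
                  (2, "Ravi", "2023-12-20 09:30:00")])

def extract_incremental(connection, last_extract_time):
    # Select only rows created or updated after the previous load's cut-off time.
    cur = connection.execute(
        "SELECT id, name, last_modified FROM customer WHERE last_modified > ?",
        (last_extract_time,))
    return cur.fetchall()

print(extract_incremental(conn, "2024-01-01 00:00:00"))   # only the row changed after the cut-off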
Fig. 2.5.4 : Deferred data extraction : options

Syllabus Topic : Data Transformation - Basic Tasks, Major Transformation Types

2.6 Data Transformation : Tasks Involved in Data Transformation
> (MU - May 2010, Dec. 2010, May 2011, Dec. 2011, May 2012, Dec. 2012, May 2013, Dec. 2013, May 2014, Dec. 2014, May 2016)
- The operational databases developed can be based on any set of priorities, which keep changing with the requirements.
- Therefore those who develop a data warehouse based on these databases are typically faced with inconsistency among their data sources.
- The transformation process deals with rectifying any such inconsistency.
- One of the most common transformations is "Attribute Naming Inconsistency". It is common for a given data element to be referred to by different names in different databases.
Example : Employee Name may be EMP_NAME in one database and ENAME in the
other.

Thus one set of Data Names are picked and used consistently in the data warehouse.
Once all the data elements have right names, they must be converted to common formats.
The conversion may encompass the following :
o Characters must be converted ASCII to EBCDIC or vice versa.
o Mixed Text may be converted to all uppercase for consistency.

o Numerical data must be converted in to a common format.


o Data Format has to be standardized.

o Measurement units may have to be converted (Rs / $).

o Coded data (Male/ Female, M/F) must be converted into a common format.

All these transformation .activities are automated and many commercial products are
available to perform the tasks. DataMAPPER from Applied Database Technologies is one
such comprehensive tool.

Data transformation can have the following activities

Smoothing : It removes noise from the data.


Aggregation : It uses summarization, data cube construction.

Generalization : It uses concept hierarchy to replace the data by higher level concepts.
Normalization : Attributes are scaled to fall within a small, specified range.
Scaling attribute values to fall within a specified range.
Example : To transform V in [Min, Max] to V′ in [0, 1], apply
V′ = (V − Min) / (Max − Min)
Scaling by using mean and standard deviation (useful when min and max are unknown or when there are outliers) :
V′ = (V − Mean) / Std. Dev.


ones and
Attribute/feature construction :.New attributes are constructed from the given
added for data mining process.
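The two scaling rules above are easy to apply directly; the following short sketch (with made-up sample values) normalizes a list of measurements both ways :

from statistics import mean, stdev

values = [340, 270, 310, 340, 330, 260]   # illustrative sample data

# Min-max normalization: rescales each value into [0, 1].
v_min, v_max = min(values), max(values)
min_max = [(v - v_min) / (v_max - v_min) for v in values]

# Z-score scaling: centres on the mean and divides by the standard deviation.
mu, sigma = mean(values), stdev(values)
z_scores = [(v - mu) / sigma for v in values]

print(min_max)    # each value now lies between 0 and 1
print(z_scores)   # each value expressed in standard deviations from the mean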

2.6.1 The Set of Basic Tasks

J 1. Selection of data _— It can be consid


Grading Select the desired data from the source system; it can be cither whole Tecords % applied.

y several records. Pir, = Standardisation


‘ the same data is
2. Splittingjoining
‘ ————————————

— data records.
on selected
Itis the data manipulation Bee

- It can be either a Split or a join operation of the selected parts


durin, Data Lo
al E transformation. . BG 244 rrr
al 3. Conversion .
- This step includes conversions of single fields to standardize the data coming from different source systems and make it understandable to the users.
4. Summarization
- Sometimes it is not feasible to keep data at the lowest level of detail in the data warehouse for analysis or querying, so the data transformation function includes summarization.
- For example, for a credit card company to analyze sales patterns it may not be necessary to store every single transaction in the data warehouse; instead of storing the most granular data by individual transaction, the daily transactions for each credit card may be summarized.
5. Enrichment
- It is the rearranging and simplifying of individual fields to make them more useful in the data warehouse environment.

2.7 Data Integration and Consolidation

> (MU - May 2010, Dec. 2010, May 2011, Dec. 2011, May 2012, Dec. 2012, May 2013, Dec. 2013, May 2014, Dec. 2014, May 2016)
- Integration of data is combining all operational data into coherent data, made ready to be loaded into a data warehouse.
- It can be considered as a pre-process before major transformation techniques are applied.

Syllabus Topic : Data Loading - Applying Data
2.8 Data Loading : Techniques of Data Loading

> (MU - May 2010, Dec. 2010, May 2011, Dec. 2011, May 2012, Dec. 2012, May 2013, Dec. 2013, May 2014, Dec. 2014, May 2016)
- The loading process is the physical movement of the data from the computer systems storing the source database(s) to that which will store the data warehouse database.
- The whole process of moving data into the data warehouse repository is referred to in the following ways :
(1) Initial Load : For the very first time, loading all the data warehouse tables.
(2) Incremental Load : Periodically applying ongoing changes as per the requirement.
(3) Full Refresh : Deleting the contents of a table and reloading it with fresh data.

Data Refresh versus Update
After the initial load, the data warehouse needs to be maintained and updated, and this can be done by the following two methods :
- Update : application of incremental changes in the data sources.
- Refresh : complete reload at specified intervals.
Data Loading
- Data are physically moved to the data warehouse.
- The loading takes place within a "load window".
- The trend is towards near real-time updates of the data warehouse, as the warehouse is increasingly used for operational applications.
¢q 2.8.1 Loadin g the
tables0022 functions, int
includesingtwo basis
the dime nsio n ongo .
} _ Procedure for main maintain ing ving the changes
taining
app
the tables and thereafter

Scanned by CamScanner
- System generated keys are used in a data warehouse; the records in the source system have their own keys.
- Therefore, before an initial load or an ongoing load, the production keys must be converted to system generated (surrogate) keys in the data warehouse.
- Another issue is related to the application of Type 1, Type 2 and Type 3 dimension changes to the data warehouse. Fig. 2.8.1 shows how to handle it : the incoming record is transformed and prepared for loading in the staging area, and the changed values are matched with the existing dimension values; depending on the type of change, the old value is overwritten (Type 1), a new dimension record is created (Type 2), or the changed value is written to an "old" attribute field (Type 3).

Fig. 2.8.1 : Loading changes to dimension tables
2.8.2 Loading the Fact Tables : History and Incremental Loads
- The key in the fact table is the concatenation of keys from the dimension tables.
- Therefore, for this reason, dimension records are loaded first.
- A concatenated key is created from the keys of the corresponding dimension tables.
2.9 Data Quality : Issues in Data Cleansing
Data cleaning, also called data cleansing or scrubbing, deals with detecting and removing errors and inconsistencies from data in order to improve its quality. Data quality problems are present in single data collections, such as files and databases, e.g. due to misspellings during data entry, missing information or other invalid data.

2.9.1 Reasons for “Dirty” Data


- Dummy Values,
- Absence of Data,
- Multipurpose Fields,
- Cryptic Data,
- Contradicting Data,
— Inappropriate Use of Address Lines,
_ Violation of Business Rules,

— Reused Primary Keys,


- Non-Unique Identifiers,
_ Data Integration Problems.

2.9.2 Data Cleansing
- Source systems contain "dirty data" that must be cleansed.
- Specialized data cleansing software is often used; it is important for customer name and address correction and householding functions.
- Leading data cleansing vendors include Vality (Integrity), Harte-Hanks (Trillium), and Firstlogic.
Steps in Data Cleansing

1. Parsing
- The individual data elements are located and identified in the source files and then these elements are isolated in the target files.
- For e.g. Name can be parsed as the first, middle and last name; as another example, Address can be parsed as street number and street name, city and state.
2. Correcting
- Sophisticated data algorithms and secondary data sources are used for correction of individual data components.
- For e.g. replacing a vanity address and addition of a zip code.
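A toy illustration of the parsing step is given below (the splitting rule is deliberately simplistic and only meant to show the idea of isolating name components) :

def parse_name(full_name):
    """Split a raw name string into first, middle and last components."""
    parts = full_name.split()
    first = parts[0]
    last = parts[-1] if len(parts) > 1 else ""
    middle = " ".join(parts[1:-1])      # everything between the first and last tokens
    return {"first": first, "middle": middle, "last": last}

print(parse_name("Asha K. Patil"))      # {'first': 'Asha', 'middle': 'K.', 'last': 'Patil'}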
3. Standardizing
- Standardizing applies conversion routines to transform data into its preferred (consistent) format using both standard and custom business rules.
- Examples include adding a pre-name, replacing a nickname, and using a preferred street name.
4. Matching
- Searching and matching records within and across the parsed, corrected and standardized data, based on predefined business rules, to eliminate duplication.
- Examples include identifying similar names and addresses.
5. Consolidating
- The matched records are analysed and identified for relationships, and then consolidated and merged into one representation.
6. Data cleansing must deal with many types of possible errors :
- These include missing data and incorrect data at one source.
- Inconsistent data and conflicting data when two or more sources are involved.
7. Data Staging
- Often used as an interim step between data extraction and later steps.
- Accumulates data from asynchronous sources using native interfaces, flat files, FTP sessions, or other processes.
- At a predefined cut-off time, data in the staging file is transformed and loaded to the warehouse.
2.10 Sample ETL Tools

The ETL process is shown in Fig. 2.10.1.
- Teradata Warehouse Builder from Teradata.
- DataStage from Ascential Software.
- SAS System from SAS Institute.
- PowerMart / PowerCenter from Informatica.
- Sagent Solution from Sagent Software.
- Hummingbird Genio Suite from Hummingbird Communications.

(Fig. 2.10.1 shows the ETL process : creating source and target repositories, mapping source and target, and creation and execution of workflows to load data from source to target.)

Fig. 2.10.1 : ETL Process
2.11 Need for On-Line Analytical Processing (OLAP)
- OLAP or On-Line Analytical Processing supports the multidimensional view of data.
- OLAP provides fast, steady and proficient access to the various views of information.
- Complex queries can be processed; it is easy to analyze information by processing complex queries on multidimensional views of data.
- A data warehouse is generally used to analyse information where a huge amount of historical data is stored.
- Information in a data warehouse is related to more than one dimension like sales, market trends, buying patterns, supplier, etc.
Definition
Definition given by the OLAP council (www.olapcouncil.org) : On-Line Analytical Processing (OLAP) is a category of software technology that enables analysts, managers and executives to gain insight into data through fast, consistent, interactive access in a wide variety of possible views of information that has been transformed from raw data to reflect the real dimensionality of the enterprise as understood by the user.
i
Syllabus Topic : OLTP Vs OLAP

2.12 OLTP Vs OLAP

> (MU - Dec. 2010, May 2011, May 2012, Dec. 2013)

Application Differences
OLTP | OLAP
Transaction oriented | Subject oriented
High Create/Read/Update/Delete (CRUD) activity | High Read activity
Many users | Few users
Continuous updates - many sources | Batch updates - single source
Real-time information | Historical information
Tactical decision-making | Strategic planning
Controlled, customized delivery | "Uncontrolled", generalized delivery
RDBMS | RDBMS and/or MDBMS
Operational database | Informational database

Design Differences
OLTP | OLAP

High transaction volumes using few records at a time | Low transaction volumes using many records at a time
Balancing needs of online vs scheduled batch processing | Design for on-demand online processing
Highly volatile data | Non-volatile data
Data redundancy - BAD | Data redundancy - GOOD
Few levels of granularity | Multiple levels of granularity
Complex database designs used by IT personnel | Simpler database designs with business-friendly constructs
Model Differences
OLTP | OLAP
Single purpose model - supports Operational System | Multiple models - support Informational Systems
Full set of Enterprise data | Subset of Enterprise data
Eliminate redundancy | Plan for redundancy
Natural or surrogate keys | Surrogate keys
Validate model against business function analysis | Validate model against reporting requirements
Technical metadata depends on business requirements | Technical metadata depends on data mapping results
This moment in time is important | Many moments in time are essential elements
2.13 OLAP and Multidimensional Analysis

~ Multidimensional analysis is done in data warehouse environment.


~ Multidimensional analysis is identical with online analytical processing (OLAP).

Scanned by CamScanner
- To perform multidimensional analysis, OLAP is required, which performs the calculations and displays the desired information.
- OLAP gives multiple views of data to perform multidimensional analysis.
Syllabus Topic : Hypercubes

2.14 Hypercube

Let us assume that there are two business dimensions, Product and Time. Business users wish to analyse metrics such as fixed cost, variable cost, indirect sales, direct sales and profit margin. For analysis purposes the data can be displayed on a spreadsheet which has the metrics as columns and time as rows, one product per page, as shown in Table 2.14.1.

Table 2.14.1 : Display of spreadsheet for product Trousers with Time as rows and Metrics as columns
(the table lists, for each month from Jan. to Dec., the values of Fixed Cost, Variable Cost, Indirect Sales, Direct Sales and Profit)

Such a spreadsheet handles only one product per page. All the products, together with the Time and Metrics dimensions, can be shown in a single structure as in Fig. 2.14.1; this type of representation is called a Multidimensional Domain Structure (MDS).

Fig. 2.14.1 : MDS for 3 dimensions (Time : Jan-Dec; Product : Hats, Coats, Jackets, Dresses, Shirts, Slacks; Metrics : Fixed cost, Variable cost, Indirect sales, Direct sales, Profit, Margin)

A structure of three dimensions is called a hypercube, which represents multidimensional data. One more dimension, Store, can be added along with Time and Product; the MDS for 4 dimensions is given in Fig. 2.14.2.

Fig. 2.14.2 : MDS for 4 dimensions (Store : New York, San Jose, Dallas, Cleveland, Boston; Time; Product; Metrics)
Representation of these 4 dimensions on a spreadsheet page is also possible by combining multiple logical dimensions within the same display group. In Fig. 2.14.3, product and metrics (Sales and Cost) are combined and displayed as columns; one page represents the sales for the store New York.

New York Store
        Hats : Sales | Hats : Cost | Coats : Sales | Coats : Cost | Jackets : Sales | Jackets : Cost
Jan.  | 450 | 350 | 550 | 450 | 500 | 400
Feb.  | 380 | 280 | 460 | 360 | 400 | 320
Mar.  | 400 | 310 | 480 | 410 | 450 | 400

Fig. 2.14.3 : MDS and page display for 4 dimensions
P3
So hypercube is used for representation of more than 3 dimen
sions. Similarly one can add |
more dimensions and represent it using MDS.
ree Silat rn e
Syllabus Topic : OLAP Operations - Drill down, Roll up, Slice, Dice and Rotation

2.15 OLAP Operations in Multidimensional Data Model

> (MU - Dec. 2010, May 2012, Dec. 2012, May 2013, Dec. 2014, May 2016)

OLAP Operations or OLAP Techniques
- OLAP techniques are used to retrieve the information from the data warehouse into multidimensional databases, so that the information can be retrieved using front-end OLAP tools.
Scanned by CamScanner
Fig. 2.15.1 : OLAP Implementation (corporate databases / data warehouses → multidimensional database → client machines)

Multidimensional models are used to store data in multi-dimensional matrices like Data Cubes or Hypercubes. A standard spreadsheet, signifying a conventional database, is a two-dimensional matrix. One example would be a spreadsheet of regional sales by product for a particular time period. Products sold with respect to region can be shown in a 2-dimensional matrix, but as one more dimension like time is added it produces a 3-dimensional matrix, as shown in Fig. 2.15.2.

Fig. 2.15.2 : Pictorial view of a data cube and a 2-D database

A multidimensional model has two types of tables :
1. Dimension tables : contain attributes of dimensions
2. Fact tables : contain facts or measures
The following operations are used in OLAP.
Example : Let us consider a company of Electronic Products. The data cube of the company consists of 3 dimensions : Location (aggregated with respect to city), Time (aggregated with respect to quarters) and Item (aggregated with respect to item types).
Scanned by CamScanner
1. Consolidation or Roll Up
- Multi-dimensional databases generally have hierarchies with respect to dimensions.
- Consolidation is rolling up or adding data relationships with respect to one or more dimensions; for example, adding up all product sales to get total city data.
- For example, Fig. 2.15.3 shows the result of a roll-up operation performed on the central cube by climbing up the concept hierarchy for location. This hierarchy was defined as the total order street < city < province_or_state < country.
- The roll-up operation shown aggregates the data from the city level up to the country level of the location hierarchy.

Fig. 2.15.3 : Roll-up or drill-up
Scanned by CamScanner
q ww Data Warehousing & Mining (Mu-s
. ; eM. 6-Com
2. Drill-down
- Drill down is defined as changing the view of data to a greater level of detail.
- For example, Fig. 2.15.4 shows the drill-down operation performed on the central cube by stepping down the concept hierarchy for time, defined as day < month < quarter < year.

Fig. 2.15.4 : Drill Down
Scanned by CamScanner
¥Y aS pe
E r the New Revis
ed Syllabus (Rev
tbai Un iversity - 2016) of
icademic year 2018-2019 °
’ Choice Based Credit and Gradin @®
g System)
pc
= == —
id
. ; -Sem.6-Comp.) 2-24 ETL Process
Ww Data Warehousing & Mini (MU-Sem. and9 {

Lap 7 ata Warehousing & Mi


3. Slicing and dicing
- Slicing and dicing refers to the ability to look at a database from various viewpoints.
- The slice operation carries out a selection with respect to one dimension of the given cube.
- For example, Fig. 2.15.5 shows the slice operation where the sales data are selected from the left cube for the dimension time using the criterion time = "Q1".

Fig. 2.15.5 : Slice operation

6. Other OLAP |
— Drill acre
=> Vancouve more thar
d
- — Dril. l thr
Time (Quarters)
bottom 1¢
—_—_———__

———__—

2.16 i
OLAP I ee

Entertainment
Enteterta
rtaii nment j 1
Item type Approaches tc
= ieic, 2156.6 = Di *
>-6 Dice operation pe ste 2pa Fag
Multid imension
Tefets to techno

Scanned by CamScanner
-

% ta Warehousing & Mini


SE Sa asin & i
Ming (MU-Som, 8-Comp,) 2-25
- Dice operation carry 9
ut Selection ETL Process and OL
' the 8iven CUby given cube and Produc
€S a sub Cu
with AP
be , °SPect to two OF more dimensio
Ms - For example, the di ns of the
ce
o Crati
dimension as Location .
i. are se]€cteg fe see a On the Ieft cube bas
( location=
0 “To2rrentg
a 9”” ;,. « couver”) and
or Van ed on three
a8 sho
(timwne in a5Fig,: 22 15.6 where the
entertainment” or com
puter"), iteriaia isi
criter
or “Q2”) and (item=
5. Plvot/ Rotate “home

- Pivot technique is useq f


give another presentati °F Visualization of data. Th; .
On of the data. +. This operation rotates the data axis. to
- Forexample Fig.
a2-D slice are rota2.15 , 7 shows i
t ed. the pivot Operatio* n where the .
Location :
item and location axis in
liem typa

Item type
we Fig. 2.15.7 :Pivot operat
ion
6 Other OLAP operatio
ns
~ Drill across : This techni que
is used when there is need to
more than one fact table. execute a query involving

Drill through : This technique uses


relational SQL facilities to drill through
bottom level of the data cube, the
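Before moving on to OLAP models, the operations above can be imitated on a tiny in-memory cube to make the terminology concrete (the cell values below are made up, and a real OLAP server would of course operate on the warehouse rather than on a Python dictionary) :

# Cells of a small cube keyed by (quarter, item, city) -> sales units.
cube = {("Q1", "home entertainment", "Toronto"): 605,
        ("Q1", "computer", "Toronto"): 825,
        ("Q1", "home entertainment", "Vancouver"): 1087,
        ("Q2", "computer", "Vancouver"): 978}

# Slice: fix one dimension (time = "Q1").
slice_q1 = {k: v for k, v in cube.items() if k[0] == "Q1"}

# Dice: select on two or more dimensions (quarter Q1/Q2, city Toronto/Vancouver, item = computer).
dice = {k: v for k, v in cube.items()
        if k[0] in ("Q1", "Q2") and k[2] in ("Toronto", "Vancouver") and k[1] == "computer"}

# Roll-up: aggregate away the city dimension (city level -> country level).
rollup = {}
for (quarter, item, _city), units in cube.items():
    rollup[(quarter, item)] = rollup.get((quarter, item), 0) + units

print(slice_q1, dice, rollup, sep="\n")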
mes z |
in Syllabus Topic : OLAP Models - MOLAP, ROLAP
|

216 OLAP Models : MOLAP, ROLAP, HOLAP DOLAP °


> (MU - May 2016)
Approaches to OLAP Servers
Ih the OLAP world, there are mainly two different ae oe
Multidimensional OLAP (MOLAP) and Relational OLAP (ROLAP). Hybrid
servers:
OLAP (HOLAP)
Refers to technologies that combine MOLAP
and ROLAP.

———— a

Scanned by CamScanner
2.16.1 MOLAP
This is the more traditional way of OLAP analysis. In MOLAP, data is stored in a multidimensional cube. The storage is not in the relational database, but in proprietary formats.
Advantages of MOLAP
1. Excellent performance : MOLAP cubes are built for fast data retrieval, and are optimal for slicing and dicing operations.
2. Can perform complex calculations : All calculations have been pre-generated when the cube is created.
Disadvantages of MOLAP
1. Limited in the amount of data it can handle : Because all calculations are performed when the cube is built, it is not possible to include a large amount of data in the cube itself. This is not to say that the data in the cube cannot be derived from a large amount of data; indeed, this is possible. But in this case, only summary-level information will be included in the cube itself.
2. Requires additional investment : Cube technologies are often proprietary and do not already exist in the organization. Therefore, to adopt MOLAP technology, chances are that additional investments in human and capital resources are needed.
methodology relics on manipulating the data stored in the relationa database to B1Vive
aby ee eee traditional OLAP's slicing and dicing functionality. In cach action
Sm
hy ON siieing and dicing is equivalent to adding a “WHERE” clause in the SQL statem
Advantages of ROLAP
1 Wh

of data : The data size limitation of ROLAP technology 1s


«
iy 1+ Can handle large amounts
the limitation on data size of the underlying relational database. In other words, ROLAP
itself places no limitatio
n on amount of data.
HY,
2. Can leverage functionalities inherent in the relational database : Often, relational
*

: "4 MY database already comes with a host of functionalities. ROLAP technologies, since they sit
on top of the relational database, can therefore leverage
these functionalities.
Disadvantages of ROLAP

1. Performance can be slow : Because each ROLAP report is essentially a SQL query (or multiple SQL queries) in the relational database, the query time can be long if the underlying data size is large.

2. Limited by SQL functionalities : Because ROLAP technology mainly relies on generating SQL statements to query the relational database, and SQL statements do not fit all needs (for example, it is difficult to perform complex calculations using SQL), ROLAP technologies are therefore traditionally limited by what SQL can do. ROLAP vendors have mitigated this risk by building into the tool out-of-the-box complex functions, as well as the ability to allow users to define their own functions.
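To make the WHERE-clause idea concrete, the following Python sketch (purely illustrative; the table name sales_fact and the column names doctor_id, patient_id and time_id are assumed, not taken from the text) shows the kind of SQL a ROLAP layer might generate for a slice versus a dice.

    # Illustrative sketch: slice/dice requests translated into SQL WHERE clauses.
    # Table and column names are assumed examples only.
    def rolap_query(measure, filters):
        """Build a SQL string; every slice/dice filter becomes a WHERE condition."""
        sql = f"SELECT {measure} FROM sales_fact"
        if filters:
            conditions = " AND ".join(f"{col} = {val!r}" for col, val in filters.items())
            sql += f" WHERE {conditions}"
        return sql

    # Slice: one predicate on a single dimension
    print(rolap_query("SUM(charge)", {"doctor_id": 1}))
    # Dice: predicates on several dimensions at once
    print(rolap_query("SUM(charge)", {"doctor_id": 1, "patient_id": 3, "time_id": 2}))

Each additional dimension restricted by the user simply adds one more condition to the generated WHERE clause, which is why ROLAP performance tracks the performance of the underlying relational engine.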

2.16.3 HOLAP

- HOLAP technologies attempt to combine the advantages of MOLAP and ROLAP.
- For summary-type information, HOLAP leverages cube technology for faster performance. When detail information is needed, HOLAP can "drill through" from the cube into the underlying relational data.
- For example, a HOLAP server may allow large volumes of detail data to be stored in a relational database, while aggregations are kept in a separate MOLAP store.
- Microsoft SQL Server 7.0 OLAP Services supports a hybrid OLAP server.
2.16.4 DOLAP

It is Desktop Online Analytical Processing, a variation of ROLAP. It offers portability to the users of OLAP. For DOLAP, only the DOLAP software needs to be present on the user's machine. Through this software, multidimensional datasets are formed and transferred to the destination machine.
2.17 Examples of OLAP

Ex. 2.17.1 : Consider a data warehouse for a hospital where there are three dimensions :
(a) Doctor (b) Patient (c) Time
and two measures : (i) count (ii) charge, where charge is the fee that the doctor charges a patient for a visit.
Using the above example describe the following OLAP operations :
1) Slice 2) Dice 3) Rollup 4) Drill down 5) Pivot

Soln. :
There are four tables, out of which 3 are dimension tables and 1 is a fact table.

Dimension tables

1. Doctor (DID, name, phone, location, pin, specialisation)
2. Patient (PID, name, phone, state, city, location, pin)
3. Time (TID, day, month, quarter, year)
Fact table

Fact_table (DID, PID, TID, count, charge)

(Figure : a 3-D data cube of charge values over the Doctor, Patient and Time dimensions)

Operations

1. Slice : A slice is taken on the fact table with a single predicate on one dimension, e.g. a slice with a particular DID, which gives a 2-D table of charge values over Patient and Time.

2. Dice : It is a sub cube of the main cube. Thus it cuts the cube with more than one predicate, like a dice on the cube with DID = 01, 02 and PID = 01, 03 and TID = 02, 03.

3. Roll up : It gives a summary based on concept hierarchies. Assuming there exists a concept hierarchy in the patient table as state -> city -> location, then roll up will summarise the charges or count in terms of city, or a further roll up will give charges for a particular state, etc.

(Table : charge values rolled up to the city level)
4. Drill down : It is the opposite of roll up, which means if the cube is currently summarised with respect to city, then drill down will show the summarisation with respect to location.

(Table : charge values drilled down to the location level)
5. Pivot : It rotates the cube, sub cube or rolled-up or drilled-down cube, thus changing the orientation of the data view.

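As a rough illustration of these five operations (not part of the original solution), the following pandas sketch treats a small hospital fact table as the cube; the column names follow the schema above, but the numbers are invented for demonstration.

    import pandas as pd

    # Toy fact table: one row per (doctor, patient, time) with a charge measure.
    fact = pd.DataFrame({
        "DID":    [1, 1, 2, 2, 3, 3],
        "PID":    [1, 2, 1, 3, 2, 3],
        "TID":    [1, 2, 1, 2, 1, 2],
        "charge": [300, 250, 200, 100, 180, 280],
    })

    slice_d = fact[fact["DID"] == 1]                              # Slice: fix one dimension
    dice    = fact[fact["DID"].isin([1, 2]) & fact["TID"].eq(2)]  # Dice: sub-cube via several predicates
    rollup  = fact.groupby("DID")["charge"].sum()                 # Roll up: summarise to a coarser level
    drill   = fact.groupby(["DID", "TID"])["charge"].sum()        # Drill down: bring a finer level back
    pivot   = fact.pivot_table(index="DID", columns="TID",
                               values="charge", aggfunc="sum")    # Pivot: rotate the view
    print(pivot)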
Ex. 2.17.2 : All Electronics Company have a sales department; consider three dimensions :
(i) Time (ii) Product (iii) Store
The schema contains a central fact table sales with two measures :
(i) dollars-cost and (ii) units-sold
Using the above example describe the following OLAP operations :
(i) Slice (ii) Dice (iii) Roll-up (iv) Drill Down                 (MU - Dec. 2013)
Soln. :

- Product dimension : Product Key, SKU Number, Product Description, Brand Name, Product Sub-Category, Product Category, Department, Package Size, Package Type, Weight, Unit of Measure, Units Per Case, Shelf Width, Shelf Level
- Store dimension : Store Key, Store Name, Store ID, Address, City, State, Zip
- Time dimension : Time Key, Date, Day of Week, Week Number, Month, Month Number, Quarter, Year, Holiday Flag
- StoreSales fact table : Product Key, Store Key, Time Key, Dollar Cost, Units Sold

Fig. P. 2.17.2 : Star Schema for Electronics Company sales department

There are four tables, out of these 3 dimension tables and 1 fact table.
For OLAP operations refer Example 2.17.1.

Ex. 2.17.3 : The College wants to record the grades for the courses completed by students.
There are four dimensions :
(i) Course (ii) Professor (iii) Student (iv) Period
The only fact that is to be recorded in the table is course-grade.
(i) Design star schema.
(ii) Write DMQL for the above star schema.
(iii) Using the above example describe the following OLAP operations :
Slice, Dice, Roll-up, Drill-down, Pivot.
Soln. :

- Course_Dim : Course_id, Section_no, Course_name, Units, Room_id, Room_capacity
- Period_Dim : Semester_id, Year
- Student_Dim : Student_id, Student_name, Major
- Professor_Dim : Prof_id, Prof_name, Title, Department_id, Department_name
- Course_Grade_Fact : Course_id, Semester_id, Student_id, Prof_id, Course_grade

Fig. P. 2.17.3 : Star Schema for college

DMQL for the above star schema

- Define cube Course_Grade_Fact (Course_Dim, Period_Dim, Professor_Dim, Student_Dim) : course-grade
- Define dimension Course_Dim as (Course_id, Section_id, Course_name, Units, Room_id, Room_capacity)
- Define dimension Period_Dim as (Semester_id, Year)
- Define dimension Professor_Dim as (Prof_id, Prof_name, Title, Department_id, Department_name)
- Define dimension Student_Dim as (Student_id, Student_name, Major)

For OLAP operations refer Example 2.17.1.
2.18 University Questions and Answers

May 2010

Q.1 Explain ETL of data warehousing in detail.
(Ans. : Refer sections 2.5, 2.6, 2.7 and 2.8) (10 Marks)

Dec. 2010

Q.2 Explain ETL of data warehousing in detail.
(Ans. : Refer sections 2.5, 2.6, 2.7 and 2.8) (10 Marks)
May 2011

Q.5 Explain the major steps in the ETL process.
(Ans. : Refer sections 2.5, 2.6, 2.7 and 2.8)

Q.6 Compare between OLTP and OLAP. (Ans. : Refer section 2.12) (5 Marks)
Q.7 The college wants to record the grades for the courses completed by students. There are four dimensions :
(i) Course (ii) Professor (iii) Student (iv) Period
(a) The only fact that is to be recorded in the table is course-grade.
(i) Design star schema. (5 Marks)
(ii) Write DMQL for the above star schema. (5 Marks)
(b) Using the above example describe the following OLAP operations :
Slice, Dice, Roll-up, Drill-down, Pivot. (Ans. : Refer Ex. 2.17.3) (10 Marks)
Dec. 2011

Q.8 Explain ETL of data warehousing in detail.
(Ans. : Refer sections 2.5, 2.6, 2.7 and 2.8) (10 Marks)

May 2012

Q.9 What is meant by ETL ? Explain the ETL process in detail.
(Ans. : Refer sections 2.1, 2.2, 2.5, 2.6, 2.7 and 2.8) (10 Marks)
Q.10 Compare OLTP and OLAP systems. (Ans. : Refer section 2.12) (5 Marks)
Q.11 Write short notes on OLAP operations. (Ans. : Refer section 2.15) (10 Marks)

Dec. 2012

Q.12 Explain the ETL cycle for a data warehouse in detail.
(Ans. : Refer sections 2.5, 2.6, 2.7 and 2.8) (10 Marks)


Q.13 Consider a data warehouse storing the goods sold and the sales. Using this example describe the working of various operations :
(1) Slice (2) Dice (3) Rollup (4) Drill down
(Ans. : Refer section 2.15) (10 Marks)
May 2013

Q.14 Describe the steps of the ETL (Extract - Transform - Load) cycle.
(Ans. : Refer sections 2.5, 2.6, 2.7 and 2.8) (10 Marks)

Q.15 Consider a data warehouse for a hospital, where there are three dimensions : (1) Doctor (2) Patient (3) Time; and two measures : (1) count (2) charge. For this example create an OLAP cube and describe the following OLAP operations :
(1) Slice (2) Dice (3) Rollup (4) Drill Down (5) Pivot (Ans. : Refer Ex. 2.17.1)
Dec. 2013

Q.16 Explain ETL (Extract Transform Load) cycle in a Data Warehouse in detail.
(Ans. : Refer sections 2.5, 2.6, 2.7 and 2.8) (10 Marks)

Q.17 All Electronic company have a sales department; consider three dimensions namely :
(i) Time (ii) Product (iii) Store
The schema contains a central fact table sales with two measures :
(i) Dollars-cost and (ii) Units-Sold
Using the above example, describe the following OLAP operations :
(i) Dice (ii) Slice (iii) Roll-Up (iv) Drill-Down (Ans. : Refer Ex. 2.17.2) (10 Marks)

Q.18 Compare between OLAP and OLTP. (Ans. : Refer section 2.12) (10 Marks)
May 2014

Q.19 Describe clearly the different steps of the ETL (Extract - Transform - Load) cycle for data warehousing. (Ans. : Refer sections 2.5, 2.6, 2.7 and 2.8) (10 Marks)

Q.20 Consider a data warehouse for a hospital, where there are three dimensions : (1) Doctor (2) Patient (3) Time; and two measures : (1) count (2) charge. For this example create an OLAP cube and describe the following OLAP operations : (1) Slice (2) Dice (3) Rollup (4) Drill Down (5) Pivot. (Ans. : Refer Ex. 2.17.1)
Dec. 2014

Q.21 Explain the ETL (Extract, Transform, Load) cycle in detail.
(Ans. : Refer sections 2.5, 2.6, 2.7 and 2.8) (10 Marks)

Q.22 Describe the following OLAP operations using an example :
(1) Slice (2) Dice (3) Roll up (4) Drill down (5) Pivot (Ans. : Refer section 2.15) (10 Marks)

May 2016

Q.23 Describe the steps of ETL process.
(Ans. : Refer sections 2.5, 2.6, 2.7 and 2.8) (10 Marks)

Q.24 We would like to view sales data of a company with respect to three dimensions namely location, item and time. Represent the sales data in the form of a 3-D data cube for the above and perform roll up, drill down, slice and dice OLAP operations on the above data cube and illustrate. (Ans. : Refer section 2.15) (10 Marks)

Q.25 Discuss various OLAP models. (Ans. : Refer section 2.16) (10 Marks)


Chapter Ends
Introduction to Data Mining, Data Exploration and Preprocessing

Syllabus :
Introduction to Data Mining, Data Exploration and Preprocessing : Data Mining Task Primitives, Architecture, Techniques, KDD process, Issues in Data Mining, Applications of Data Mining, Data Exploration : Types of Attributes, Statistical Description of Data, Data Visualization, Data Preprocessing : Cleaning, Integration, Reduction : Attribute subset selection, Histograms, Clustering and Sampling, Data Transformation and Data Discretization : Normalization, Binning, Concept hierarchy generation, Concept Description : Attribute oriented Induction for Data Characterization.

3.1 What is Data Mining ?

> (MU- Dec. 2011)


- Data Mining is a new technology, which helps organizations to process data through algorithms to uncover meaningful patterns and correlations from large databases that otherwise may not be possible with standard analysis and reporting.
- Data mining tools can help to understand the business better and also improve future performance through predictive analytics, make organizations proactive and allow knowledge driven decisions.
- To address issues related to information extraction from large databases, the data mining field brings together methods from several domains like Machine Learning, Statistics, Pattern Recognition, Databases and Visualization.
- Data mining finds its application in market analysis and management, e.g. customer relationship management, cross selling, market segmentation. It can also be used in risk analysis and management for forecasting, underwriting, quality control, customer retention, improved competitive analysis and credit scoring.

Definition

- Data mining is the process of analysing data stored in a data warehouse for useful information, using technologies (such as statistics, artificial intelligence, machine learning, neural networks and advanced statistical analysis) to reveal trends, patterns and relationships which might otherwise stay undiscovered.
- Data Mining is a non-trivial process of identifying patterns in data that are :
  o Valid,
  o Novel,
  o Potentially useful, and
  o Ultimately understandable.
i Minj Min,

I. Task Relevant Data


(Mu -Dec a 2. Kinds of knowledge to be mined
ess +201
dh i 3. Background knowledge
od
S data
ds a 4. Interestingness measure

5. Presentation and visualization of


discovered patterns
improve futur
ow knowledst : 1. Task relevant data.

Specify the data on which the data mining function to be


performed.
r field brings ~ Using relational query, a set of task relevant data can be collected.
tics, Pattem ~ Before data mining analysis, data can be
cleaned or transformed.
~ Minable view is created i.e. the set of task relevant data for data mining.
il Like,
an also®

2. The kind of knowledge to be mined

- Specify the knowledge to be mined.
- Kinds of knowledge include concept description, association, classification, prediction and clustering.
- User can also provide pattern templates, also called metapatterns or metarules or metaqueries.

3. Background knowledge

- It is the information about the domain to be mined.
- Concept hierarchies are a form of background knowledge which helps to discover knowledge at multiple levels of abstraction.
a knowledge at multiple levels of abstraction. meaner _P .
| low_profit_margin (Z) <
i 15 distinct values . fit gin |
| medium_profit_mar
65 distinct values high_profit_margin (Z)

4. Interestingness |
3567 distinct values
. — Itis used to cc
674,339 distinct values ~ Based on the
Fig. 3.2.1 :. Concept hierarchy for the dimension location -
. “—
— Patterns not:
Four major types of concept hierarchies
a) Schema hierarchies : It is the total or partial order among attributes in the database schema.
Example : Location hierarchy as street < city < province/state < country

b) Set-grouping hierarchies : It organizes values into sets or groups of constants.
Example : For attribute salary, a set-grouping hierarchy can be specified in terms of ranges as in the following :
{low, avg, high} ⊂ all (salary)
{1000....5000} ⊂ low
{5001....10000} ⊂ avg
{10001....15000} ⊂ high
c) Operation-derived hierarchies : It is based on operations specified, which may include decoding of information-encoded strings, information extraction from complex data objects, data clustering.
Example : URL or email address, e.g. xyz@cs.mu.in

d) Rule-based hierarchies : The whole or a portion of a concept hierarchy is defined as a set of rules and is evaluated based on the rule definition.
Example : Following rules are used to categorize items as low_profit_margin, medium_profit_margin and high_profit_margin :
low_profit_margin (Z) <= price (Z, A1) ^ cost (Z, A2) ^ ((A1 - A2) < 80)
medium_profit_margin (Z) <= price (Z, A1) ^ cost (Z, A2) ^ ((A1 - A2) >= 80) ^ ((A1 - A2) < 350)
high_profit_margin (Z) <= price (Z, A1) ^ cost (Z, A2) ^ ((A1 - A2) > 350)
4. Interestingness measures

- It is used to confine the number of uninteresting patterns returned by the process.
- Based on the structure of patterns and the statistics underlying them.
- Each measure is associated with a threshold which can be controlled by the user.
- Patterns not meeting the threshold are not presented to the user.
- Objective measures of pattern interestingness :
  o Simplicity : A pattern's interestingness is based on its overall simplicity for human comprehension.
    Example : Rule length is a simplicity measure.
  o Certainty (confidence) : Assesses the validity or trustworthiness of a pattern. Confidence is a certainty measure :
    Confidence (A => B) = (Number of tuples containing both A and B) / (Number of tuples containing A)
  o Utility (support) : It is the usefulness of a pattern :
    Support (A => B) = (Number of tuples containing both A and B) / (Total number of tuples)
  o Novelty : Patterns contributing new information to the given pattern set are called novel patterns (Example : Data exception).
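The two formulas above can be checked on a toy transaction set; the following Python sketch (an illustration with invented items, not from the text) counts tuples to compute the support and confidence of A => B.

    # Toy transactions; each set lists the items bought together.
    transactions = [
        {"milk", "bread"},
        {"milk", "bread", "butter"},
        {"bread"},
        {"milk", "butter"},
    ]

    def support(a, b):
        both = sum(1 for t in transactions if a in t and b in t)
        return both / len(transactions)

    def confidence(a, b):
        both = sum(1 for t in transactions if a in t and b in t)
        has_a = sum(1 for t in transactions if a in t)
        return both / has_a

    print(support("milk", "bread"))     # 2/4 = 0.5
    print(confidence("milk", "bread"))  # 2/3, approximately 0.67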

5. Presentation and visualization of discovered patterns

- Data mining systems should be able to display the discovered patterns in various forms, such as rules, tables, crosstabs (cross-tabulations), pie or bar charts, trees, cubes, or other visual representations.
- User must be able to specify the presentation forms to be used for displaying the discovered patterns.

Syllabus Topic : Architecture
3.3 Architecture of a Typical Data Mining System

> (MU - May 2010, Dec. 2010, Dec. 2011, Dec. 2013, May 2...)

Architecture of a typical data mining system may have the following major components as shown in Fig. 3.3.1 :

(Fig. 3.3.1 : Architecture of a typical data mining system - graphical user interface, pattern evaluation module, data mining engine, knowledge base, database or data warehouse server, data cleaning and data integration, and the underlying data repositories)

1. Database, data warehouse, or other information repository

These are information repositories. Data cleaning and data integration techniques may be performed on the data.
2. Database or data warehouse server

It fetches the data as per the user's requirement which is needed for the data mining task.

3. Knowledge base

This is used to guide the search, and gives interesting and hidden patterns from data.

4. Data mining engine

It performs the data mining tasks such as characterization, association, classification, cluster analysis etc.

5. Pattern evaluation module

It is integrated with the mining module and it helps in searching only the interesting patterns.

6. Graphical user interface

This module is used to communicate between the user and the data mining system, and allows users to browse database or data warehouse schemas.

Syllabus Topic : Techniques

3.4 Data Mining Technique

> (MU - Dec. 2011)


- There are many techniques that strongly influence the development of data mining methods.
- Some of such technologies are given below :

3.4.1 Statistics

- Statistics is a discipline of science, which uses mathematical analysis to quantify representations, model and summarize empirical data or real world observations.
- Statistical analysis involves a collection of methods and applying them on large amounts of data to draw conclusions and report the trend.

3.4.2 Machine Learning

- Machine learning gives computers the ability to learn from data and improve their performance with experience, without being explicitly programmed.

- For example, Facebook's News Feed changes according to the user's personal interactions with other users.

1. Supervised learning : One standard formulation of the supervised learning task is the classification problem. In classification, the training dataset has labelled examples. Based on the training dataset, a classification model is constructed which will be used to give a label to unseen data. Neural networks and decision tree techniques are highly dependent on the previous information of classification, where the classes are pre-determined.

2. Unsupervised learning : One standard formulation of the unsupervised learning task is the clustering problem, where the examples of the training dataset are not labelled. Based on similarity measures, clusters are formed.

3. Semi-supervised learning : This learning technique combines both labelled and unlabeled examples to generate an appropriate function or classifier. This method generally avoids the need for a large number of labelled examples.

4. Active learning : The user plays an important role in active learning. The algorithm itself decides which examples the user should label. Active learning is a powerful approach for analysing data effectively.
3.4.3 Information Retrieval (IR)

Information Retrieval deals with uncertainty and vagueness in information


systems:
— Uncertain representations of the semantics of objects (text, images,...).
— Vague specifications of information needs (iterative querying).
— Example: Find documents relevant to an information need from a large document set.

3.4.4 Database Systems and Data Warehouses

Database

1. Databases are used to record the data and also be used for data warehousing. Online
Transactional Processing (OLTP) uses databases for day to day transaction purpose.
2. To remove the redundant data and save the storage space, data is normalized and stored in
the form of tables.
3. Entity-Relationship modeling techniques are used for Relational Database Management System design.

Data warehouse

2. Data is de-normalized so that tables are not complex, which reduces the response time for analytical queries.
3. Data-modeling techniques like the Star Schema are used for the Data Warehouse design.
4. Read operations on the data warehouse are performed frequently for analysis purposes, so performance is high for analytical queries.

3.4.5 Decision Support System

- Decision Support System is a category of information systems, which helps in decision making for business and organizations.
- These are software based systems which help decision makers to extract useful information from raw data and documents and take appropriate decisions, modelling business by identifying problems and solving them.
- Information that is gathered and presented by a DSS includes :
  o A list of information assets like legacy and relational data sources, cubes, data warehouses, data marts.
  o Comparative statements of sales.
  o Projected revenue based on assumptions of new product sales.

Syllabus Topic : KDD Process

3.5 Knowledge Discovery in Database (KDD)

> (MU - May 2010, Dec. 2010, May 2011, May 2012, Dec. 2012, Dec. 2013, May 2016)

- The process of discovering knowledge in data and the application of data mining methods refers to the term Knowledge Discovery in Databases (KDD).
- It includes a wide variety of application domains, which include Artificial Intelligence, Pattern Recognition, Machine Learning, Statistics and Data Visualization.

- The main goal includes extracting knowledge from large databases; the goal is achieved by using various data mining algorithms to identify useful patterns according to some predefined measures and thresholds.

Outline steps of the KDD process

(Fig. 3.5.1 : KDD Process - selection, pre-processing, transformation, data mining and interpretation / evaluation of patterns)

The overall process of finding and interpreting patterns from data involves the repeated application of the following steps :

1. Developing an understanding of :
(a) The application domain
(b) The relevant prior knowledge
(c) The goals of the end-user.

2. Creating a target data set :
Selecting a data set, or focusing on a subset of variables, or data samples, on which discovery is to be performed.

3. Data cleaning and pre-processing :
(a) Noise or outliers are removed.
(b) Essential information is collected for modelling or accounting for noise.
(c) Missing data fields are handled by using appropriate strategies.
(d) Time sequence information and changes are maintained.

4. Data reduction and projection :
(a) Based on the goal of the task, useful features are found to represent the data.
(b) The number of variables may be effectively reduced using dimensionality reduction or transformation methods. Invariant representations for the data may also be found out.

5. Choosing the data mining task :
Selecting the appropriate data mining task like classification, clustering or regression, based on the goal of the KDD process.

6. Choosing the data mining algorithm(s) :
(a) Pattern search is done using the appropriate data mining method(s).
(b) A decision is taken on which models and parameters may be appropriate.
(c) Considering the overall criteria of the KDD process, a match for the particular data mining method is done.

7. Data mining :
Using a representational form or other representations like classification rules or trees, regression or clustering for searching patterns of interest.

8. Interpreting mined patterns

9. Consolidating discovered knowledge

The terms knowledge discovery and data mining are distinct :

- KDD : KDD is a field of computer science, which helps humans in extracting useful, previously undiscovered knowledge from data.
- Data Mining : Data mining is one of the steps in the KDD process; it applies the appropriate algorithm based on the goal of the KDD process and the theories for the same.
3.6 Major Issues in Data Mining

Mining methodology and user interaction issues

- Mining different kinds of knowledge in databases.
- Interactive mining of knowledge at multiple levels of abstraction.
- Incorporation of background knowledge.
- Data mining query languages and ad hoc data mining.
- Presentation and visualization of data mining results.
- Handling noisy or incomplete data.
- Pattern evaluation.

Performance issues

- Efficiency and scalability of data mining algorithms.
- Parallel, distributed and incremental mining algorithms.

Issues related to the diversity of database types

- Handling of relational and complex types of data.
- Mining information from heterogeneous databases and global information systems.

Syllabus Topic : Applications of Data Mining

3.7 Applications of Data Mining

- Data Mining is used in private as well as public sectors.
& ng
Mining (MU..

g e
af
ective ness of e
‘ RN So oon
a medicine or Certain
> ng data mining.

pata mining can be used‘ in Pharma ceutical


E: he diseases, by analysing Itms as
chemical cor a .
: * Buid
"pounds ang Ecne tice to reg
jarge amount Of data in retail in eae on new treatments
: tals.

tapas (Red = 2018


Tel communication industry also
uses da
: of them
the customer data which fa minin g, for ¢ ad Grading SYS
are likely to remai n eae acs do analysis, based on
shift to competitors.
nm and which one will

g Types of Attributes
30

pata Objects
- A data object is a logical cluster of all
the same entity. It also represents tables in the data set Which contains data relat
an object view of the same, ed to
- Example : In a product manufacturing
company, product, customer are objects.
store, employee, customer, items and sales In a retail
>. are objects.
- Every data object is described by its Proper
ties called as attributes and it is stored
es database'in the in the
form of a row or tuple. The columns of this data tuple
=——— attributes, are known to be vers
Ow
a Attributes Types
ac. 2071) - An attribute is a property or characteristic of a data object
. For e.g. Gender is a
well as characteristic of a data object person.
~The attributes may have values like :
surance -
() Nominal attributes
fraud (ii) Binary attributes
(ili) Ordinal attributes
1 the (iv) Numeric attributes
) Discrete versus continuous attributes

(i) Nominal attributes

- Nominal attributes are also called categorical attributes and allow for qualitative classification.
- Every individual item belongs to certain distinct categories, but quantification or ordering of the categories is not possible.
- The nominal attribute categories can be numbered arbitrarily.
- Arithmetic and logical operations on the nominal data cannot be performed.
- Typical examples of such attributes are :
  1. Yes  2. No
  1. Unemployed  2. Employed

(ii) Binary attributes

- A nominal attribute which takes only two states, 0 or 1, is called a binary attribute, where 0 means that the attribute is absent and 1 means that it is present.

(iii) Ordinal attributes

- A discrete ordinal attribute is a nominal attribute which has a meaningful order or rank for its different states.
- The interval between different states is uneven, due to which arithmetic operations are not possible; however, logical operations are possible.

(Iv) Numeric attributes

Numeric Attributes are quantifiable. It can be measured in terms of a quantity, which can
either have an integer or real value. They can be of two types :
(a) Interval scaled attributes
(b) Ratio scaled attributes

(a) Interval-scaled attributes


Interval-scaled attributes are continuous measurement on a linear scale.

Example : weight, height and weather temperature. These attributes allow for
ordering, comparing and quantifying the difference between the values. An interval-scaled attribute has values whose differences are interpretable.
(b) Ratio-scaled attributes

Ratio scaled attributes are continuous positive measurements on a non linear scale.
They are also interval scaled data but are not measured on a linear scale.

Operations like addition, subtraction can be performed but multiplication and division
are not possible.
For example : For instance, if a liquid is at 40 degrees and we add 10 degrees, it will
l

be 50 degrees. However, a liquid at 40 degrees does not have twice the temperature of
a liquid at 20 degrees because 0 degrees does not represent “no temperature”

There are three different ways to handle the ratio-scaled variables :

o As interval scaled variables. The drawback of handling them as interval scaled is


that it can distort the results.

o Ascontinuous ordinal scale.

o Transforming the data (for example, logarithmic transformation) and then


treating the results as interval scaled variables.

(v) Discrete versus continuous attributes


" oo
ied values then it ig
Mean is given by x (
If an attribute can take any value between two specif
gu
continuous else it is discrete. An attribute will be continuous on one Seale and
on another.

- For example : If we try to measure the amount of water consumed by counting individual water molecules then it will be discrete, else it will be continuous.
- Examples of continuous attributes include time spent waiting, amount of water consumed, etc.
- Examples of discrete attributes include voltage output of a digital device, a person's age in years.

Syllabus Topic : Statistical Description of Data

3.9 Statistical Description of Data

3.9.1 Measuring the Central Tendency

(i) Mean
- Mean is given by x̄ (pronounced as x bar) :

  x̄ = (x1 + x2 + ... + xn) / n = (Σ xi) / n

- For example, the mean for the following data is :
  55 | 89 | 66 | 25 | 14 | 96 | 78 | 87 | 45 | 92
  x̄ = 647 / 10 = 64.7
- Mean is the only measure of central tendency in which the sum of the deviations of each value from the mean is always zero.
- If weights wi are associated with the values xi, then the weighted arithmetic mean or the weighted average is :

  x̄ = (Σ wi xi) / (Σ wi)

- Mean has one disadvantage; it is highly susceptible to the influence of outliers. Under such types of situations median would be a better measure of central tendency.

(ii) Median
ure of central tendency.
il) Median

- This measure is suitable when the data is skewed (frequency distribution of data is
skewed). As the data becomes skewed, it loses its ability to provide a central position
under such a situation median can be used as it is not influenced by the skewed
values.

- Median is the middle score for a data set, which has been arranged in the order of
magnitude (smallest first).

- if we have the following data set :


For example,

{55 89 |66 | 25 | 14 | 96 | 78 | 87 [45 92


[ 85

Scanned by CamScanner
a
¥ Data Warehousing & Mining (MU-Sern Comp.) 17 : resent b
he
on

_
Rearransing the above dzta in the order ofmpine:
ec : TN

14 {25 45 |35 |66 [78 | 85 | 87 | 89 22 [96 | |


The median for the abovdata e is 78. It is the middle position element as there
.

scores before it end five scores after it. ah,


The above example is for odd number of elements in the data
set, Ut in case
have even number of elements for example :
ty
14 [25 | 45 |55 | 66 [78 {as | 87 | g9 2 | 96 | 10 |
In the above case an average value of two valu
es i.e. 78 and 85 is CONsidery
median of the abov
dataeset. Le. 81.5

The median is a useful number in cases where the
distribution has very large exten
values, which would otherwise skew the data.

Median = L,+ (wer.

Where, L, = Lower class boundary


n = Nuofmvalu
b es e
in the
rdata
.
(Lf i = The sum of the frequencies of all of
the classes that are lower than the
median class
fesiea = Frequency
of median class
© = Size of the median class interval
iii) Mode

If we consider a histogram then it is


the highest bar in the chart. Mostly mod
e is ws
for cat
egorical data, where the most common category is to be
known.
Data set with one mode is called uni
modal.
Data sets with two modes are called
bimodal.
Data sets with three modes are called
trimodal.
The empirical relation is :

mean —mode = 3 x (mea


~ medin
an)

Scanned by CamScanner
(iv) Midrange

- The midrange is the average of the largest and smallest values in a data set.
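A short sketch (using Python's statistics module) that reproduces the measures above on the data set from the median example; the mode line uses a separate tiny list because a mode needs repeated values.

    import statistics

    data = [85, 55, 89, 66, 25, 14, 96, 78, 87, 45, 92]

    mean     = statistics.mean(data)            # approximately 66.5 for this 11-value set
    median   = statistics.median(data)          # 78
    mode     = statistics.mode([1, 2, 2, 3])    # 2
    midrange = (max(data) + min(data)) / 2      # (96 + 14) / 2 = 55.0

    print(mean, median, mode, midrange)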
3.9.2 Dispersion of Data

The degree to which numeric data tend to spread is called the dispersion or variance of the data. The measures of data dispersion are :

(i) Quartiles : Q1 (25th percentile), Q3 (75th percentile).

(ii) Inter-quartile range (IQR) : The distance between the first and third quartiles is a simple measure of spread that gives the range covered by the middle half of the data :
  IQR = Q3 - Q1

(iii) Five-number summary : min, Q1, M, Q3, max
  Where, M = Median
  Q1, Q3 = Quartiles
  Min = Smallest individual observation
  Max = Largest individual observation
(iv) Box-plot
.
— The visual representation of a distribution is the box-plot.
- Box plots are used for assessing location and variation information in data sets. They also detect and show the location and variation changes between dissimilar groups of data.
- Box plots have a vertical axis which gives the response variable and a horizontal axis which gives the factor of interest.
- To create a box plot, order the data in ascending order.
- Example : 7, 8, 3, 1, 5, 6, 10, 13, 5
  1 | 3 | 5 | 5 | 6 | 7 | 8 | 10 | 13

- Calculate the median of the data, which divides the data into two halves :
  1  3  5  5  [6]  7  8  10  13

- The median is Q2 = 6.
- Again find the median of those two halves, which divides the data into quarters :
  1 | 3 | 5 | 5        7 | 8 | 10 | 13
- Calculate the median of both the halves; as there are 4 values in each half, the median is the average of the two middle values :
  Q1 = (3 + 5) / 2 = 4
  Similarly for the second half,
  Q3 = (8 + 10) / 2 = 9
- Now there are three points, Q1 (the middle point of the first half), Q2, and Q3 (the middle point of the second half). These three points divide the whole data set into quarters, which are called "quartiles".
- The minimum value is 1 and the maximum value is 13, so we have :
  Min : 1, Q1 : 4, Q2 : 6, Q3 : 9, Max : 13
- Then the box plot looks like this :

(Fig. 3.9.1 : Box plot for the dataset, drawn on a scale from about 0 to 15, with the box extending from Q1 = 4 to Q3 = 9, the median line at 6, and whiskers reaching 1 and 13)


(v) Outlier : Usually, a value higher/lower than 1.5 × IQR (Inter-Quartile Range).

(vi) Variance : The variance of n observations x1, x2, ..., xn is

  s² = (1/(n-1)) Σ (xi - x̄)²  =  (1/(n-1)) [ Σ xi² - (1/n) (Σ xi)² ]   (sums taken over i = 1 to n)

(vii) Standard deviation : The standard deviation s is the square root of the variance s².
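These dispersion measures can be computed directly; the sketch below uses the box-plot example data and the standard library. Note that statistics.quantiles may give quartiles that differ slightly from the hand method above, depending on the interpolation rule.

    import statistics

    data = [7, 8, 3, 1, 5, 6, 10, 13, 5]

    q1, q2, q3 = statistics.quantiles(data, n=4)   # quartiles
    iqr = q3 - q1                                  # inter-quartile range
    five_number = (min(data), q1, q2, q3, max(data))

    variance = statistics.variance(data)           # sample variance s^2 (divides by n-1)
    stdev = statistics.stdev(data)                 # s, the square root of the variance

    print(five_number, iqr, variance, stdev)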
3.9.3 Graphic Displays of Basic Statistical Descriptions

Bar charts, pie charts and line graphs are the common basic displays; in addition, the following graphic displays are used :
(i) Box-plots
(ii) Histograms
(iii) Quantile plots
(iv) Quantile-Quantile plots (Q-Q plots)
(v) Scatter plots
(vi) Loess curves

(i) Box-plots

The information about box-plots is already given in Section 3.9.2.

(ii) Histograms
- Provide a first look at a univariate data distribution, i.e., a distribution of values of a single attribute.
- Consists of a set of rectangles that reflect the counts or frequencies of the values present in the given data.
- Example : Distribution of salaries, with salary ranges 0-10, 11-21, 22-32, 33-43, 44-54, 55-65, 66-76, 77-87 and 88+ (in thousands) on the x-axis.

Fig. 3.9.2 : Histogram for Salaries
A vertical bar graph and a histogram differ in these ways :

o In a histogram, frequency is measured by the area of the column.
o In a vertical bar graph, frequency is measured by the height of the bar.
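A small sketch of drawing such a histogram with matplotlib; the salary values are invented, and the bin edges simply mirror the ranges used in the salary example.

    import matplotlib.pyplot as plt

    salaries = [8, 12, 15, 22, 25, 30, 35, 38, 44, 47, 52, 58, 61, 66, 72, 80, 90]

    # One bar per salary range; the bar area reflects the frequency of that range.
    plt.hist(salaries, bins=[0, 10, 21, 32, 43, 54, 65, 76, 87, 98])
    plt.xlabel("Salary (in thousands)")
    plt.ylabel("Frequency")
    plt.show()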

(iii) Quantile plots

- The normal quantile plot is used to compare the data values with the values that would be predicted for a standard normal distribution.
- To draw a normal quantile plot, compute two additional numbers for each value of the variable :
  1. Sort the data values from lowest to highest, and in the column labeled x(i) list the data values in ascending order.
  2. The next column shows the quantile for each data value.
  3. Now calculate the value of the standard normal distribution that lies at the quantile just computed in the previous column.
  4. A normal quantile plot is formed by plotting each data value against the value of the standard normal distribution.
Example :

(Table : sorted data values x(i) with their quantile positions and the corresponding standard normal z-values)

The z-value corresponding to each data value is obtained as z = (x(i) - μ) / σ.

Fig. 3.9.3 : Quantile plot of z-scores
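The theoretical quantiles used on the x-axis can be produced as shown below. This is a sketch assuming SciPy is available; norm.ppf is the inverse of the standard normal cumulative distribution function, and the marks are random placeholders.

    import numpy as np
    from scipy.stats import norm

    marks = np.sort(np.random.randint(30, 100, size=30))   # sorted sample values (y-axis)
    quantile_points = (np.arange(1, 31) - 0.5) / 30         # plotting positions in (0, 1)
    z_values = norm.ppf(quantile_points)                    # standard normal quantiles (x-axis)

    # Plotting marks against z_values gives the normal quantile plot;
    # an approximately straight line suggests the marks are normally distributed.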
(iv) Quantile-Quantile plots (Q-Q plots)

- Quantile-quantile plots are useful to compare the quantiles of two sets of numbers.
- As compared to means and medians, the comparison using Q-Q plots is more detailed, but it needs more samples for comparison.
- One of the quantiles is your sample observations placed in ascending order.
- For example, consider the marks of 30 students. Order the marks in ascending order from smallest to largest; that will be your y-axis values.
- If the marks are normally distributed, then the x-axis values would be the quantiles of a standard Normal distribution.
- As we have marks of 30 students, we will need 30 Normal quantiles.
- Finally you plot the marks versus the z-values.
- So a Q-Q plot is used to determine how well a theoretical distribution models a set of measurements.

(v) Scatter plot

- A scatter plot displays pairs of values of two numeric attributes as points, which helps to observe the relationship between the two attributes, for example water consumption (ounces) plotted against temperature (°F).

Fig. 3.9.5 : Scatter plot for Temperature and Water Consumption
(vi) Loess curve

- A loess curve is used for fitting a smooth curve between two variables.
- The procedure originated as LOWESS (Locally Weighted Scatter-plot Smoother). It is also called local regression.
- To get a better perception of the pattern, a smooth curve is added to the scatter plot as shown in Fig. 3.9.6.

Fig. 3.9.6 : Scatter plot showing loess smoothing (local regression smoothing)

Syllabus Topic : Data Visualization

3.10 Data Visualization

- Data visualisation is presenting the data in a graphical or pictorial form. Data visualisation techniques help people to analyse things which are otherwise not possible when the data is very large. Patterns in the data can be marked very easily using data visualisation.
- Some of the data visualisation techniques are as follows :
1. Pixel-oriented visualization techniques
' r

- In pixel based visualisation techniques, there is a separate sub window for the values of each attribute, and each value is represented by one colored pixel.
- It maximises the amount of information represented at one time without any overlap. A tuple with m variables has m different colored pixels to represent each variable, and each variable has its own sub window.
- Based on data characteristics and the visualization task, the color mapping of the pixels is decided.

Fig. 3.10.1 : Pixel visualisation of a data record with four variables

n with four variables
2. Geometric projection visualizatio
n techniques
Geometric transformations and projections of multidimensional data sets can be found using the following techniques :
(i) Scatterplot matrices
(ii) Hyper slice
(iii) Parallel coordinates

4)
Introduction to Data Mining 4
by

.
Sati the
}
LS hte
aes S| Temp |
“XSi, nt Rf
ager
f
ve
tay Be] aes) 5
aa
y

= : Se “|. 5 po. TempJan Pd 3 ss, ¥ n°


y
= — .
= . : a Fu
. ,

Ws for
the
T Wie Z =
ry Sj ;

oespe
Tee
ee tog | RHJuly
ee
3
:
t Y Ove
Fig. 3.10.2 : An example of Scatter ploti
Variah le (ii) an
Hyperslice : : [t 1S an
extension to scatter plotl i
miatrices. They represent a
dimensional function as a matrix of
orthogonal two dimensional slices
the Di . (iii)
iii) a
Parallel co-ordi
aruinetes - : The parallel vertical
ical lines
li Separated define the axes. A point
Cartesian coordinates corresponds to a polyline in parallel coordinates.
é 45
z

3.5

| 3

2.5

= found j

2
length

: Sepal width Sepallength —_—Petal width


rsicolor == Virginica
hig
i -—— Setosa —Ve
Parallel coordinates
Fig. 3.10.3 : An example of

3. Icon-based visualization techniques


nic display techniques.
Icon b: visualisa tion techniques also known as ico
—~ Icon based visualisa

Scanned by CamScanner
_3-27 -
¥ Data Warehousing & Mining (MU-Sem. 6-Comp.)

is mapped to an icon,
Fach multidimensional data item
ounts of data,
This technique allows visualisation of large am

Two most commonly used icon based techniques are :


(i) Chernoff faces
(ii) Stick figures

(i) Chernoff faces


— Ilsstration of trends in multidimensional data can be done by using Chernof
This concept was introduced by Herman Chernoff in the year 1973,
— The faces in Chernoff faces are related to facial expressions or features of hy
being. So to distinguish between them is easy.
- Different data dimensions were mapped to different facial features, for example
face width, the length or curvature of the mouth, the length of the nose etc.

- An example of Cheroff faces is shown below; they use facial features to repres
trends in the values of the data, not the specific values themselves.
- They display multidimensional data of upto 18 variables or dimen
sions.
~ InFig. 3.10.4, each face represents an n-dimensional data
points (n<=18).

®8@e@ Fig. 3.10.4: An ex


ample of Chernoff
faces

Scanned by CamScanner
| WF —pat_a Warehousing g
Mining (MU.g

cas
§
‘ = LIMU-Sem, 6-
q
C omp.) 3-25
Introduction ta
| - Pickett an
Data Minniing
d Grinste
I i inaim
nt,
rod
_- T
ahie eFig. e
uc i“
3.s
10.5

PR
a Su alisation
v
© e
originae techniqu
l rSs

Sa5 esiia
lick_. fig
ure wiwjith fi
ve stick- an
d family of
to display
te data with UP
to five varj ial
n helps to diffecrreenntiate € tdeixstur by using a 7two stick ico

»
t; ple ay bivariate MRI n which

ee
KIL S¥ 45
@

Fd
-

dimen OYA
#
;
A
r
4

orienta tien ") A stick figure icon family


tatio : with a body
and four limb
Fig. 3.10.5 : Example
of stick figure
4. Hierarchical visualizat
ion techniques
- The visualisation technique
s discus. sed above display mult
simultaneously. However for a | arge iple dimensions
data set having large number of dimensions
above techniques may not be us eful. the

- Hierarchical visualisation techniques partition all dimen


. - a - = * ,

sions in to. subset (subspaces).


|

These subspaces are visualised in a hierarchical manner.


Some of the visualisation techniques are
. @) Dimensional stacking
EEE

ii) Mosaic Plot


(iti) Worlds-within-worlds
' (iv) Tree-map
relations
(v) Visualizing complex data and

Scanned by CamScanner
3-29
housing «

(i) Dimensional stacking Ee


Gi) worlds
within w

In dimension stacking, partition the n-dimensional attribute space in 94


subspaces. Worlds with
_ Innermost w
- Attribute values are partitioned into various cla
sses. _ Remaining f
— Each element is a two dimensional space is a xy plo
t. Through thi
— Mark the important attributes and are used on the outer levels,
including ro
— Using quer

iy
Attribute 1 -
Fig. 3.10.6 : Data in dimension sta
cking
j (ii) Mosaic plot

~ Mosaic plots give a graphical illustration of


j the successive decompositions. . :
~ Rectangles are used to represent the ’ :
count of categorical data and at 3 eve
rectangles are split parall ry star
el.
od
i — To draw a mosaic plot, a contin
gency table of data and chosen- ord
with the response variable is re ering of variatk
quired. :
~ Example: In titanic example , out
and 33% died which is coded as 0. ofSo allthewowomemen n, 67bar% survived which is coded as!
men, only 17% survived, so this shows as 67/33 split. Amog} @¥) "Tree-maps
bar shows a 17/83 split.
. - Tree m
hierarck
— The vis
accordi
— Theley
— Each s
expres

Fig. 3.10.7 : Mosaic Pl


ot for Titanic

Scanned by CamScanner
=
“a SS
Oe
=~Sa
On, scalj & (inner)
Mic fa glove
- Using queries Static intera and Stereo disp
lays,
ction is also p (inner/outer)
Os

2"

Fig. 3.10.8 : Worlds Within


worlds visualization
(iv) Tree-maps

~ Tree maps’ visualization techni


ques are well suited for displaying lar
hierarchical struct ge amounts of
ured data.
~ The visualization space is divided into multiple rect
angles that are sized and ordered
according to a quantitative variable.
- The levels in the hierarchy are seen rectangles containing other rectangles.

- Each set of rectangles on the same level in the hierarchy represents a column or an
expression in a data set.

Scanned by CamScanner
Fig. 3.10.9 : Web traffic by location Tree-map

~ Each individual rectangle on a level in the hierarchy repres


ents a category in;
column.
For example, in the Fig. 3.10.9, a rectangle repr
esenting global below which there x:
rectangles representing continents which
contain several rectangles represennny
countries in that continent.
~ Each rectangle representing a coun
try may in turn. contain rectangles repre
States in thesecountries.
senting

(¥) Visualizing complex data and


relations
This technique is useful to vis
ualize non “numeric data such as
entries and product reviews, text, pictures, blog
-
A tag cloud is a visualization method
which helps to understand the info
user generated taps, rmation

~ Arrange the tags alphabetically or with the user 3.11.4 Form of


and colors.
preferences with different font sit
- Tag clouds are used in two ways
that with the size of tag, we can
Why Pre-proce
many times that tag is applie find out that how
d on that item lL. Real world
applied to how Many by different us
items, t tag has bee
, mers
~
— Incom

Scanned by CamScanner

W_Data Warehousing & Minin
9 (MU-Som, &-Comp,) 2-30
Introduction to Data Mining

Sas design ta

ont User.= USE; rs.


interface Steam naive =r cSume >

Cerone
ee Fodiet =gpter
Ior eyeuSabil
es tert ity =v
= =apbicaion = Used Yet eagert
vewasee as= also

Fig. 3.10, 10: Social


data Visualisation

Syllabus Topic
: Data Preprocess
ing
Se
3.11_Data Preprocessing
= Process that involv
es transformation of data into
merging, recording, retrieving, information through classifying, sorting,
transmitting, OF Teporting is called data proce
processin g can be manual or computer based. ssing. Data :

In Business related world, data Processing


refers to data Processing so as to enable
effective functioning of the organisations
and businesses.
- Computer data processing refers to a Process that takes
the data input via a program and
summarizes, analyse the same or convert it to useful information.
- The processing of data may also be automated.
- Data processing systems are also known as information systems.

- When data processing does not involve any data manipulation andanly converts the data
type it may be called as data conversion.

3.11.1 Form of Data Pre-processing

Why Pre-processing is Required ?


1. Real world data are generally

- incomplete when certain attributes or attributes


Incomplete : The data is said to be data
is available.
values are missing or only aggregate

Scanned by CamScanner
ww Data Warehousing & Mining (MU-Sem. 6-Comp.) 3-33 Introduction Q
S00 NSS lo Daty ‘4
$f pata Warehousing AM
— Noisy : When the data contains errors or some outliers it is Considereg lO be |
data. ~ 4 Reasons for “I
— Inconsistent : When the data contains differences in codes or names it jgincon,
3.12
pummy values
4 data. =
1
_ Absence of data
i 2 Tasks in data pre-processing
Multipurpose fields
— Data cleaning : This process consists of filling of missing values, SMoothening E ' Cryptic data
data, identifying and removing any outliers present and resolving inconsistencie, ~ Contradicting date
, ntradi
~ Data integration : This refers to integrating data from multiple sources ei ieee
databases, data cubes, or files. te - Inape .
Big ¢ Violation of busi
— Data transformation : Normalization and aggregation. a
- Data reduction : In data reduction the amount of data is reduced but same analyticg _ Reused primary
= "ce di
results are produced. _ -Non-unique ider
~ Data discretization : Part of data reduction, replacing numerical attributes yg|_. — ‘Dat intesration
nominal ones. Why data cleaning
Different Forms of Data Pre-processing
— Source System
1. Data cleaning — Specialised tor
2. Data integration and transformation
— Some of the
3. Data reduction
(isin) ane
4. Data discretization and Concept hierarchy generation
3.12.2 Steps in

Scanned by CamScanner
- Dummy values
- Absence of data

- Multipurpose fields
-— Cryptic data

- Contradicting data
Inappropriat
e use of ad
dress lines
- Violation of busi
ness Tules
Maly, - Reused primary keys
- Non-unique identifiers
S Wig - Data integration Problems

- Some of the Leading data cle


a ansin: er lude Val
& vendors inc es i
(Trillium) and Firstlogic. | 5
ty (Integrity), Harte-Hanks
3.12.2 Steps in Data
Cleansing
1. Parsing

Parsing is a process in which


individual data elements are loc
Source systems and then these ated and identified in the
elemen ts are isolated in the target files.
- For example, parsing of a name into First name, Middle name and Last name, or parsing the address into street name, city, state and country.
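A tiny sketch of this parsing step; the record layout and the names in it are assumed purely for illustration.

    # Locate and isolate the individual elements of a raw comma-separated record.
    record = "Anil Kumar Sharma, 12 MG Road, Mumbai, Maharashtra, India"

    name_part, street, city, state, country = [p.strip() for p in record.split(",")]
    first_name, middle_name, last_name = name_part.split()

    parsed = {"first": first_name, "middle": middle_name, "last": last_name,
              "street": street, "city": city, "state": state, "country": country}
    print(parsed)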
2. Correcting

This is the next phase after parsing, in which individual data elements are corrected
using data algorithm and secondary data sources.
~ For example, in the address attribute replacing a vanity address and adding a zip code.

Scanned by CamScanner
| 2
__¥¥ Data Warehousing & Mining (MU-Sem. 6-Comp.) _
_3-35 Int
troductionA tg Oat
|
3. Standardizing
- !In standardizing
iz process
— conversion
i routine
iness aresSiem
usedecome
to transf, orm i
consistent format using both standard and custom busine: S.
a
For example, addition of a prename, replacing a nickname and
using a Preferred ty
name,

4. Matching

~ Matching process involves eliminating duplications by searching and Machi


records with parsed, corrected and standardised
data using some Standard busine,
rules.
For example, identification of similar names and addresses.
5. Consolidating
%

Consolidation involves merging the


records into one representation by
identifying relationship analysing ay
between matched records.
6. Data cleansing must deal with
many types of possible errors
— Data can have many errors like
missing data, or incorrect data
at one source,
~ When more than one source is
involved there is a possibility
conflicting data, of inconsistency and
7. Data Staging

~ Data staging is an interim


step between data extraction
and Temaining steps.
— Using different Processes
like native interfaces, fla
accumulated from asyn t files, FTP sessions, dat
chronous sources. a is
- | .
After a certain predefin
ed interval, data is lo
transformation proces aded into the warehouse after the
s,
— -Noend user acce
ss is available to
the Staging file.
— For data staging,
Operational data St
ore may be used,
12.3 Missing Values

Issing data values

This involves Se
arching for empt
y fields where V
alues should oc
cur,
Scanned by CamScanner
—- Data Preprocessing
is one
incomplete, Noisy or ine of the
o . “ Mport
filling out the Missing "Sistent, thi ties :
ns Stages in data Mining.
van ICs, Real world data is
SMoothenin Orre
“ted in data Preprocessing proc
niques for cal ess by
be dependent on Pro nd correcting inconsistencies
blems domain nd NE With misc; ,
the Boal ort un data, choosing ;
- Following are the differen; Way one of them would
s for h Mining process
1. Ignore the data
row
- Incase of cl
assification
could be ig SUppose a cl
nored, or as S label is Mi
Many attrj ssing for a Tow,
row could be
ign ored. If butes WwW within a TOW such a data row
performance. the Perc are miss
ing even in this ca
entage of such
rows js high it se data
will result in poor
- For example,
Suppose we
college. For th have to build
is Purpose a a Mode] for Pred
Student’ s datab icting student succ
address, etc. an ase ess in
d colu ha vi ng information ab
and “HIGH”. In this MN Cl as si fying the; ou t age , score,
na “LOW”, “MEDIUM”

When missing
values are difficult
to be Predicted, a global
constant value like
“unknown
”, “N/A”
or “min us infinity” can
be used to fill all the missin
g values.
For example, consider
the students data , if the
Some students it does not address attribute is missin
g for
makes Sense in filling up the
Constant can be used. se’ val ues: rather a global

4. Use attribute mean

For missing values, mean or median of


its discrete values may be used as a
replacement.
~ - For example, in a databa
se of family incomes, mis
sing values may be replac
ed with
the average income.

Scanned by CamScanner
™~
the same class
5. Use attribute mean for all samples belonging to

Instead of replacing the missing values by mean or median of all the OWS jn
database, rather we could consider class wise data for missing values to be t,
Iep|
Atyy
by its mean or median to make it more relevant.
For example, consider a car pricing database with classes like “luxury” and «
budget” and missing values need to filled in, replacing missing cost of a luxuy %y
with average cost of all luxury car makes the data more accurate.
6. Use a data-mining algorithm to predict the most probable value

-  Missing values may also be filled up by using techniques like regression, inference-based tools using Bayesian formalism, decision trees or clustering algorithms.
-  For example, a clustering method may be used to form clusters and then the mean or median of that cluster may be used for the missing value. A decision tree may be used to predict the most probable value based on the other attributes.
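The replacement strategies above are easy to express with a dataframe library. The following is a minimal sketch using pandas; the column names ("class", "price") and values are hypothetical and only illustrate the idea.

```python
import pandas as pd

df = pd.DataFrame({
    "class": ["luxury", "luxury", "budget", "budget"],
    "price": [90000, None, 15000, None],
})

# Global constant: fill every missing price with the same sentinel value.
filled_const = df["price"].fillna(float("-inf"))

# Attribute mean: replace missing prices with the overall mean price.
filled_mean = df["price"].fillna(df["price"].mean())

# Class-wise mean: replace missing prices with the mean of their own class.
filled_class_mean = df.groupby("class")["price"].transform(
    lambda s: s.fillna(s.mean())
)
print(filled_class_mean)
```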

3.12.4 Noisy Data

-  A random error or variance in a measured variable is known as noise.
-  Noise in the data may be introduced due to :
   o  Fault in data collection instruments.
   o  Error introduced at data entry by a human or a computer.
   o  Data transmission errors.
-  Different types of noise in data :
   o  Unknown encoding : Gender : E
   o  Out of range values : Temperature : 1004, Age : 125
   o  Inconsistent entries : DoB : 10-Feb-2003; Age : 30
   o  Inconsistent formats : DoB : 11-Feb-1984; DoJ : 2/11/2007
How to handle noisy data ?

Different data smoothing techniques are given below :

1. Binning

-  Considering the neighbourhood of the sorted data, smoothening can be applied.
-  The sorted data is placed into bins or buckets.
-  Smoothing by bin means.
-  Smoothing by bin medians.
-  Smoothing by bin boundaries.

Different approaches of binning

(a) Equal-width (distance) partitioning

-  Divides the range into N intervals of equal size (uniform grid).
   bin width = (max value - min value) / N
-  Example : Consider a set of observed values in the range from 0 to 100. The data could be placed into 5 bins as follows :
   width = (100 - 0) / 5 = 20
   Bins formed are : [0-20], (20-40], (40-60], (60-80], (80-100]
   The first and the last bin are extended to allow values outside the range :
   (-infinity-20], (20-40], (40-60], (60-80], (80-infinity)

Disadvantages

1. Outliers in the data may be a problem.
2. Skewed data cannot be handled well with this method.
(b) Equal-depth (frequency) partitioning or equal-height binning

-  The entire range is divided into N intervals, each containing approximately the same number of samples.
-  This results in good data scaling.
-  Handling categorical attributes may be a problem.

Example :

-  Let us consider sorted data, e.g. price in INR :
   4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34
-  Partition into (equal-depth) bins (N = 3) :
   Bin 1 : 4, 8, 9, 15
   Bin 2 : 21, 21, 24, 25
   Bin 3 : 26, 28, 29, 34

-  Smoothing by bin means : Replace each value of a bin with its mean value.
   Bin 1 : 9, 9, 9, 9
   Bin 2 : 23, 23, 23, 23
   Bin 3 : 29, 29, 29, 29

-  Smoothing by bin boundaries : In this method the minimum and maximum values of the bin are the bin boundaries, and each value is replaced with its nearest boundary value, either minimum or maximum.
   Bin 1 : 4, 4, 4, 15
   Bin 2 : 21, 21, 25, 25
   Bin 3 : 26, 26, 26, 34
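A short sketch of equal-depth binning with smoothing by bin means and by bin boundaries, reproducing the worked example above (equal bin sizes are assumed, as in the example):

```python
def equal_depth_bins(values, n_bins):
    """Split sorted values into n_bins bins of (roughly) equal size."""
    data = sorted(values)
    size = len(data) // n_bins
    return [data[i * size:(i + 1) * size] for i in range(n_bins)]

def smooth_by_means(bins):
    return [[round(sum(b) / len(b))] * len(b) for b in bins]

def smooth_by_boundaries(bins):
    out = []
    for b in bins:
        lo, hi = b[0], b[-1]
        out.append([lo if v - lo <= hi - v else hi for v in b])
    return out

prices = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]
bins = equal_depth_bins(prices, 3)
print(smooth_by_means(bins))       # [[9, 9, 9, 9], [23, 23, 23, 23], [29, 29, 29, 29]]
print(smooth_by_boundaries(bins))  # [[4, 4, 4, 15], [21, 21, 25, 25], [26, 26, 26, 34]]
```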
2. Outlier analysis by clustering

-  Partition the data set into clusters; one can then store the cluster representation only, i.e. replace all values of the cluster by the one value representing the cluster.
-  Outliers can be detected by using clustering techniques, where related values are organized into groups or clusters.

Fig. 3.12.1 : Graphical example of clustering (price on the X-axis, with records A, B and C falling outside the clusters)

-  Perform clustering on attribute values and replace all values in the cluster by a cluster representative.
3. Regression

-  Regression is a statistical measure used to determine the strength of the relationship between one dependent variable, denoted by Y, and a series of independent (changing) variables.
-  Smooth by fitting the data into regression functions.
-  Use regression analysis on values of attributes to fill missing values.
-  The two basic types of regression are linear regression and multiple regression.
-  The difference between linear and multiple regression is that the former uses one independent variable to predict the outcome, while the latter uses two or more independent variables to predict the outcome.
-  The general form of each type of regression is :

   Linear regression : Y = a + bX + u

   Multiple regression : Y = a + b1X1 + b2X2 + b3X3 + ... + bnXn + u


Where, '

Y = The variable that we are trying to predict

X = The variable that we are using to predict Y

a = The intercept

; b = Theslope .

u = The regression residual.

- Inmultiple regressions each variable is differentiated with subscripted numbers.


.
- Regression uses a group of random variables for prediction and finds a mathematical
straight line
relationship between them. This relationship is depicted in the form of a
way.
(Linear regression) that approximates all the points in the best
interest rates, the
- Regression may be used to determine for ¢.g. price of a commodity,
or sectors.
price movement of an asset influenced by industries
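A minimal sketch of fitting the simple linear form Y = a + bX by least squares; the data values below are made up purely for illustration.

```python
def fit_simple_linear(xs, ys):
    """Return intercept a and slope b minimizing squared error for y = a + b*x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    b = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / \
        sum((x - mean_x) ** 2 for x in xs)
    a = mean_y - b * mean_x
    return a, b

# Hypothetical commodity prices against time.
xs = [1, 2, 3, 4, 5]
ys = [10.2, 12.1, 13.9, 16.2, 17.8]
a, b = fit_simple_linear(xs, ys)
print(a, b)   # fitted intercept and slope
# A missing or noisy y value can then be replaced by the fitted a + b*x.
```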

Log linear model

-  A log linear model is based on the assumption that a linear relationship exists between the log of the dependent variable and the independent variables, giving a best fit between the data points.
-  Log linear models are models that postulate a linear relationship between the independent variables and the logarithm of the dependent variable. For example,

   log(y) = a0 + a1x1 + a2x2 + ... + anxn

   where y is the dependent variable, xi (i = 1, ..., n) are the independent variables of the model, and {ai, i = 0, ..., n} are parameters (coefficients).
-  For example, log linear models are widely used to analyze categorical data represented as a contingency table. In this case, the main reason to transform frequencies (counts) or probabilities to their log-values is that, provided the independent variables are not correlated with each other, the relationship between the new transformed dependent variable and the independent variables is a linear (additive) one.

Fig. 3.12.2 : Regression example
3.12.5 Inconsistent Data

-  The state in which the data quality of the existing data is understood and the desired quality of the data is known refers to consistent data quality.
-  It is a state in which the existing data quality is modified to meet the current and future business demands.
Syllabus Topic : Data Integration

3.13 Data Integration   (MU - May 2016)

-  Data integration : A coherent data store (e.g. a data warehouse) is created by combining data from multiple sources like multiple databases, data cubes or flat files.

Issues in data integration

1. Schema integration

-  Integrate metadata from different sources.
-  Entity identification problem : identify real world entities from multiple data sources, e.g. A.cust-id = B.cust-#.

2. Detecting and resolving data value conflicts

-  As the data is collected from multiple sources, attribute values are different for the same real world entity.
-  Possible reasons include different representations, different scales, e.g. metric vs. British units.

3. Redundant data occur due to integration of multiple databases

-  Attributes may be represented with different names in different sources of data.
-  An attribute may be a derived attribute in another table, e.g. yearly income.
-  With the help of correlational analysis, detection of redundant data is possible.
-  The redundancies or inconsistencies may be reduced by careful integration of the data from multiple sources, which will help in improving mining speed and quality.

3.13.1 Entity Identification Problem

-  Schema integration is an issue, as integrating metadata from different sources is a difficult task.
-  Identifying real world entities from multiple data sources and matching them is the entity identification problem. For example, roll number in one database and enrolment number in another database refer to the same attribute.
-  Such conflicts may create problems for schema integration.
-  Detecting and resolving data value conflicts : for the same real world entity, attribute values from different sources are different.
3.13.2 Redundancy and Correlation Analysis

-  Data redundancy occurs when data from multiple sources is considered for integration.
-  Attribute naming may be a problem, as the same attributes may have different names in multiple databases.
-  An attribute may be a derived attribute in another table, e.g. "yearly income".
-  Redundancy can be detected using correlation analysis.
-  To reduce or avoid redundancies and inconsistencies, data integration must be carried out carefully. This will also improve mining algorithm speed and quality.
-  The chi-square test can be carried out on nominal data to test how strongly the attributes are related. It measures the variation between the observed and expected frequencies of the attributes :

   X^2 = SUM over all cells of (O - E)^2 / E
   Where,
   X^2 = Chi-square
   E   = Frequency expected, i.e. the count expected to be found in each category based on known information
   O   = Frequency observed, i.e. the count actually found to be present in each category

-  Degrees of freedom : The degrees of freedom (DF) is

   DF = (r - 1) * (c - 1)

   where r is the number of levels for one categorical variable and c is the number of levels for the other categorical variable.

-  Expected frequencies : The expected frequency is computed for each combination of levels of the categorical attributes. The formula for the expected frequency is

   E(r, c) = (n_r * n_c) / n

   o  n_r is the sum of sample observations at level r of attribute X
   o  n_c is the sum of sample observations at level c of attribute Y
   o  n is the total size of the sample data.

Ex. 3.13.1 : The following contingency table summarizes the observed buying preferences of male and female customers. Using the chi-square test, determine whether gender and buying preference are related. The level of significance is 0.05. Interpret the result.

                   Young age   Middle age   Old age   Row total
   Male               200         150          50        400
   Female             250         300          50        600
   Column total       450         450         100       1000

Soln. : State the hypotheses

   H0 : Null hypothesis : Gender and buying preferences are independent.
   H1 : Alternative hypothesis : Gender and buying preferences are not independent.
Analyze sample data

-  The degrees of freedom DF is

   DF = (r - 1) * (c - 1) = (2 - 1) * (3 - 1) = 2

   r = 2 (the number of levels for the row variable, i.e. male and female)
   c = 3 (the number of levels for the column variable, i.e. young age, middle age and old age)

-  Using the given contingency table, calculate the expected frequency E(r, c) = (n_r * n_c) / n for each cell :

                        Young age                Middle age               Old age                 Total
   Male (O)             200                      150                      50                      400
   Male (E)             (400*450)/1000 = 180     (400*450)/1000 = 180     (400*100)/1000 = 40     400
   Male (O-E)           20                       -30                      10
   Male (O-E)^2         400                      900                      100
   Male (O-E)^2/E       2.22                     5.00                     2.50
   Female (O)           250                      300                      50                      600
   Female (E)           (600*450)/1000 = 270     (600*450)/1000 = 270     (600*100)/1000 = 60     600
   Female (O-E)         -20                      30                       -10
   Female (O-E)^2       400                      900                      100
   Female (O-E)^2/E     1.48                     3.33                     1.67

-  Using the formula X^2 = SUM (O - E)^2 / E :

   X^2 = 2.22 + 5.00 + 2.50 + 1.48 + 3.33 + 1.67 = 16.2

-  The P-value is the probability that a chi-square statistic having 2 degrees of freedom is more extreme than 16.2.
-  Using the chi-square distribution table, P(X^2 > 16.2) = 0.0003.
Results interpretation

-  The significance level given in the problem statement is 0.05 and the P-value (0.0003) is less than the significance level (0.05), so the null hypothesis cannot be accepted and we can interpret that there is a relationship between gender and buying preference.
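A small sketch that reproduces the chi-square computation of Ex. 3.13.1 from the observed contingency table (the p-value step is omitted; it would normally come from a chi-square distribution table or a statistics library):

```python
def chi_square(observed):
    """observed: list of rows of a contingency table. Returns (chi2, df)."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    n = sum(row_totals)
    chi2 = 0.0
    for r, row in enumerate(observed):
        for c, o in enumerate(row):
            e = row_totals[r] * col_totals[c] / n   # expected frequency
            chi2 += (o - e) ** 2 / e
    df = (len(observed) - 1) * (len(observed[0]) - 1)
    return chi2, df

observed = [[200, 150, 50],    # Male
            [250, 300, 50]]    # Female
print(chi_square(observed))    # roughly (16.2, 2)
```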
The correlation coefficient and covariance

-  The correlation coefficient is given by

   r(p, q) = SUM (p - mean_p)(q - mean_q) / (n * sigma_p * sigma_q)

   where n is the number of tuples, mean_p and mean_q are the respective means, and sigma_p and sigma_q are the respective standard deviations of p and q. If r > 0, p and q are positively correlated; the higher the value, the stronger the correlation.
-  Correlation measures the linear relationship between objects. First standardize the data objects x and y and then calculate the correlation between them by taking the dot product :

   x'_k = (x_k - mean(x)) / std(x)
   y'_k = (y_k - mean(y)) / std(y)
   correlation(x, y) = x' . y'

-  Covariance is given by

   Cov(X, Y) = E((X - mean_X)(Y - mean_Y)) = E(X . Y) - mean_X * mean_Y
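A brief sketch of these two measures on plain Python lists; the sample vectors are arbitrary illustrations.

```python
import math

def mean(v):
    return sum(v) / len(v)

def std(v):
    m = mean(v)
    return math.sqrt(sum((x - m) ** 2 for x in v) / len(v))

def correlation(x, y):
    mx, my, sx, sy = mean(x), mean(y), std(x), std(y)
    return sum((a - mx) * (b - my) for a, b in zip(x, y)) / (len(x) * sx * sy)

def covariance(x, y):
    mx, my = mean(x), mean(y)
    return sum((a - mx) * (b - my) for a, b in zip(x, y)) / len(x)

x = [2, 4, 6, 8]
y = [1, 3, 5, 9]
print(correlation(x, y), covariance(x, y))
```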


3.13.3 Tuple Duplication

-  In data integration, tuple duplication should be checked. This type of duplication may occur due to incorrect data entry or updation of data.
-  Detection of tuple duplication :
   o  Specify the relevant attributes.
   o  Compare tuples pair-wise using a similarity measure.
   o  Objects with similarity above a given threshold are considered as duplicates.
   o  The closures of duplicates are computed, in which each is assigned one <sourceID>.
3.13.4 Data Value Conflict Detection and Resolution

-  Data conflicts can arise because of incomplete data, invalid data and out-of-date data.
-  It is thus critical for data integration systems to resolve conflicts from various sources and identify true values from false ones.
-  Two kinds of data conflicts :
   1. Uncertainty
   2. Contradiction

Uncertainty :
   o  Uncertainty is an attribute level data conflict. If there is missing information or null values, then uncertainty occurs.
   o  If information for the attribute value is not available, then uncertainty is introduced.
   o  If the set of values includes a special NULL value and one other non-NULL value, then uncertainty is present.
   o  For the age attribute, if the set of values for customer1 is {30, Null, 30} = {30, Null}, then uncertainty is present.

Contradiction :
   o  Contradiction is an attribute level data conflict. If there is contradicting information, i.e. different attribute values, then contradiction occurs.
   o  When data is collected from different sources, then contradiction is introduced for the same properties of the same objects.
   o  If at least two different non-null values appear in the set of values, then contradiction is present.
   o  For the age attribute, if the set of values for customer1 is {30, 35, 30} = {30, 35}, then contradiction is there, as there are two different values for age.

Syllabus Topic : Data Reduction - Attribute Subset Selection, Histograms, Clustering and Sampling

3.14 Data Reduction

3.14.1 Need for Data Reduction   (MU - May 2018)

1. Reducing the number of attributes
-  Removing irrelevant attributes : In this, attribute subset selection is done; it involves methods like filtering, and wrapper methods may be used.
-  Principle component analysis (numeric attributes only) : This involves representing the data in a compact form by using a lower dimensional space, i.e. a smaller attribute space.

2. Reducing the number of attribute values

-  Binning (histograms) : This involves representing attribute values in bins; this will result in a lesser number of attribute values.
-  Clustering : Grouping attribute values into groups called clusters.
-  Aggregation.

3. Reducing the number of tuples

-  To reduce the number of tuples, sampling may be used.

3.14.2 Data Reduction Techniques

Following are the data reduction techniques :

(A) Data cube aggregation
(B) Dimensionality reduction
(C) Data compression
(D) Numerosity reduction

3.14.2(A) Data Cube Aggregation

-  It reduces the data to the concept level needed in the analysis and uses the smallest (most detailed) level necessary to solve the problem.
-  Queries regarding aggregated information should be answered using a data cube when possible.

Example

Total annual sales of TV in USA are aggregated quarterly, as shown in Fig. 3.14.1.
Fig. 3.14.1 : Example of data cube

3.14.2(B) Dimensionality Reduction

-  In the mining task during analysis, the data sets may contain a large number of attributes that may be irrelevant or redundant.
-  Dimensionality reduction makes data visualization an easier task.
-  It also involves deleting inappropriate features or reducing the noisy data.

Attribute subset selection

How to find a good subset of the original attributes ? The attributes are selected in such a way that their distribution represents the same distribution as the original data considering all the attributes.
1. Stepwise forward selection

-  Start with an empty set of attributes.
-  Determine the best of the original attributes and add it to the set.
-  At each step, find the best of the remaining original attributes and add it to the set.

2. Stepwise backward elimination

-  Starts with the full set of attributes.
-  At each step, the worst attribute remaining in the set is removed.

4. Decision tree induction

-  ID3 and C4.5, originally intended for classification, can be used.
-  Construct a flow-chart-like structure.
-  A decision tree is a tree in which :
   o  Each internal node tests an attribute.
   o  Each branch corresponds to an attribute value.
   o  Each leaf node assigns a classification.

3.14.2(C) Data Compression

-  Data compression is the process of reducing the number of bits needed to store or transmit the data. This data can be text, graphics, video, audio, etc. Compression can be done with the help of encoding techniques.
-  Data compression techniques can be classified into either lossless or lossy; in lossless compression there is no loss of information.
Lossless compression

-  Lossless compression consists of those techniques guaranteed to generate an exact duplicate of the input dataset after a compress/decompress cycle.
-  Lossless compression is essentially a coding technique. There are many different coding algorithms, such as Huffman coding, run-length coding and arithmetic coding.

Lossy compression

-  In lossy compression techniques, at the cost of data quality one can achieve a higher compression ratio.
-  These types of techniques are useful in applications where data loss is affordable. They are mostly applied to digitized representations of analog phenomena.
-  Two methods of lossy data compression :
   1. Wavelet transforms
   2. Principle component analysis

1. Wavelet transforms

-  The wavelet transform can also be used as a clustering approach which applies the wavelet transform to the feature space.
-  The orthogonal wavelet transform, when applied over a signal, results in a time-scale decomposition through its multi resolution aspect.
-  It clusters the functional data into homogeneous groups.
-  It is both grid-based and density-based.

Input parameters

-  Number of grid cells for each dimension.
-  The wavelet and the number of applications of the wavelet transform.

Clustering approach using wavelet transform

-  Impose a multidimensional grid-like structure on to the data for summarisation.
-  Use an n-dimensional feature space for representing spatial data objects.
-  Dense regions may be identified by applying the wavelet transform over the feature space.
|
-  Clusters are identified by locating dense regions and their boundaries in the transformed space.

Major features

-  It also results in effective removal of outliers.
-  The technique is cost efficient, with complexity O(N).
-  Arbitrary shaped clusters at different scales are detected.
-  The method is not sensitive to noise or input order.
-  It is applicable only to low dimensional data.

2. Principle components analysis

-  Principle component analysis (PCA) projects the data onto a small set of orthogonal vectors (principal components). PCA is often described in terms of eigenvectors of covariance matrices, but for efficient computation related to PCA, it is the Singular Value Decomposition (SVD) of the data matrix that is used.
-  A few scores of the PCA and the corresponding loading vectors can be used to estimate the contents of a large data matrix.
-  The idea behind this is that by reducing the number of eigenvectors used to reconstruct the original data matrix, the amount of required storage space is reduced.
ired storage space is reduced.
3.14.2(D) Numerosity Reduction

-  The numerosity reduction technique refers to reducing the volume of data by choosing smaller forms for data representation.
-  Different techniques used for numerosity reduction are :

1. Histograms

-  It replaces the data with an alternative, smaller data representation.
-  Approximate data distributions.
-  Divide data into buckets and store the average (sum) for each bucket.
-  A bucket represents an attribute-value/frequency pair.
-  Can be constructed optimally in one dimension using dynamic programming.
-  Related to quantization problems.

Different types of histogram

i.   Equal-width histograms : It divides the range into N intervals of equal size.
ii.  Equal-depth (frequency) partitioning : It divides the range into N intervals, each containing approximately the same number of samples.
iii. V-optimal : Different histogram types for a given number of buckets are considered and the one with the least variance is chosen.
iv.  MaxDiff : After a sorting process is applied to the data, borders of the buckets are defined where the adjacent values have the maximum difference.

Example

1, 1, 5, 5, 5, 5, 5, 8, 8, 10, 10, 10, 10, 12, 14, 14, 14, 15, 15, 15, 15, 15, 15, 18, 18, 18, 18, 18, 18, 18, 18, 20, 20, 20, 20, 20, 20, 20, 21, 21, 21, 21, 25, 25, 25, 25, 25, 28, 28, 30, 30, 30

The histogram of the above data sample is shown in Fig. 3.14.2.

Fig. 3.14.2 : Example of histogram
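A quick equal-width histogram of the sample above (a bucket width of 5 is assumed purely for illustration):

```python
from collections import Counter

prices = [1, 1, 5, 5, 5, 5, 5, 8, 8, 10, 10, 10, 10, 12, 14, 14, 14,
          15, 15, 15, 15, 15, 15, 18, 18, 18, 18, 18, 18, 18, 18,
          20, 20, 20, 20, 20, 20, 20, 21, 21, 21, 21,
          25, 25, 25, 25, 25, 28, 28, 30, 30, 30]

width = 5
# Map each value to the lower edge of its equal-width bucket and count.
buckets = Counter((p // width) * width for p in prices)
for lo in sorted(buckets):
    print(f"[{lo}-{lo + width}): {buckets[lo]}")
```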
2. Clustering

-  The goal of undirected data mining is to discover structure in the data; no variable is predicted, and no distinction is made between independent and dependent variables.
-  A categorization of clusters used in different clustering techniques is given below :
   o  An example belonging to a single cluster would be termed as an exclusive cluster.
   o  Any example may belong to many clusters; in such a case the clusters are said to be overlapping.
   o  If an example belongs to clusters with a certain probability, then the clustering is said to be probabilistic.
   o  A hierarchical representation may be used, in which clusters at the highest level of the hierarchy are subsequently refined into subclusters at lower levels.

3. Sampling

-  Sampling is used in preliminary investigation as well as in the final analysis of data.
-  Sampling is important in data mining, as processing the entire data set is expensive.

Types of sampling

i.   Simple random sampling : There is an equal probability of selecting any particular item.
ii.  Sampling without replacement : As each item is selected, it is removed from the population.
iii. Sampling with replacement : The objects selected for the sample are not removed from the population. In this technique the same object may be selected multiple times.
iv.  Stratified sampling : The data is split into partitions and samples are drawn from each partition randomly.
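A compact sketch of these sampling schemes using Python's standard library; the stratum key and sample sizes are arbitrary illustrations.

```python
import random

data = [("luxury", 90000), ("luxury", 85000), ("budget", 15000),
        ("budget", 12000), ("budget", 18000), ("mid", 40000)]

# Simple random sampling without replacement: items are not repeated.
without_repl = random.sample(data, k=3)

# Sampling with replacement: the same item may appear more than once.
with_repl = random.choices(data, k=3)

# Stratified sampling: draw one record from each class (the stratum key
# is the first field of the tuple here).
strata = {}
for record in data:
    strata.setdefault(record[0], []).append(record)
stratified = [random.choice(records) for records in strata.values()]

print(without_repl, with_repl, stratified, sep="\n")
```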

Syllabus Topic : Data Transformation and Data Discretization - Normalization, Binning

3.15 Data Transformation and Data Discretization

3.15.1 Data Transformation

Q. Explain the data transformation in detail.


-  Operational databases keep changing with the requirements; a data warehouse integrating data from these multiple sources typically faces the problem of inconsistency.
-  To deal with these inconsistent data, a transformation process may be employed.
-  The most commonly encountered problem is attribute naming inconsistency, as it is very common to use different names for the same attribute in different sources of data.
-  E.g. Manager Name may be MGM_NAME in one database, MNAME in the other.
-  In this case one set of data names is chosen and used consistently in the data warehouse.
-  Once the naming consistency is done, they must be converted to a common format.
-  The conversion process involves the following :
   (i)   ASCII to EBCDIC (or vice versa) conversion may be used for characters.
   (ii)  To ensure consistency, uppercase representation may be used for mixed case text.
   (iii) A common format may be adopted for numerical data.
   (iv)  Standardisation must be applied for the data format.
   (v)   A common representation may be used for measurement units, e.g. (Rs/$).
   (vi)  A common format must be used for coded data (e.g. Male/Female, M/F).
-  The above conversions are automated and many tools are available for the transformation, e.g. DataMapper.

Data transformation can have the following activities :

-  Smoothing : It involves removal of noise from the data.
-  Aggregation : It involves summarisation and data cube construction.
-  Generalization : In generalization, data is replaced by higher level concepts using a concept hierarchy.
-  Normalization : In normalization, attribute data is scaled so as to fall within a small specified range.
   Example : To transform values of an attribute to the range [0.0, 1.0], apply min-max scaling, v' = (v - min) / (max - min), or scaling by using the mean and standard deviation, v' = (v - mean) / std_dev (useful when min and max are unknown or when there are outliers).
-  Attribute/feature construction : In this process new attributes may be constructed and used for the data mining process.

3.15.2 Data Discretization

-  The range of a continuous attribute is divided into intervals, producing categorical intervals; by discretization the data size is reduced.
-  Dividing the range of attributes into intervals reduces the number of values for a given continuous attribute.
-  Actual data values may be replaced by interval labels.
-  The discretization process may be applied recursively on an attribute.
3.15.3 Data Transformation by Normalization

-  Data transformation by normalization or standardization is the process of making an entire set of values have a particular property.
-  The following methods may be used for normalization :
   1. Min-Max
   2. Z-score
   3. Decimal scaling

1. Min-Max normalization : Min-max normalization results in a linear alteration of the original data, and the values fall within a given range. The following formula may be used to map a value v of an attribute A from the range [minA, maxA] to a new range [new_minA, new_maxA].
   v' = ((v - minA) / (maxA - minA)) * (new_maxA - new_minA) + new_minA

   Example : For an income attribute with range [12000, 98000] mapped to the new range [0, 1], the value v = 73600 maps to

   v' = (73600 - 12000) / (98000 - 12000) = 0.716

2. Z-score : In Z-score normalization, data is normalized based on the mean and standard deviation. Z-score is also known as zero-mean normalization.

   v' = (v - meanA) / std_devA

   Where, meanA = mean of all the attribute values of A
          std_devA = standard deviation of all values of A

   Example : If the sample data is {10, 20, 30}, then mean = 20 and std_dev = 10, so v' = (-1, 0, 1).
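A small sketch of these normalization methods, including the decimal scaling method described next; it reproduces the numbers from the worked examples (the z-score uses the sample standard deviation, matching the {10, 20, 30} example).

```python
import math

def min_max(v, min_a, max_a, new_min=0.0, new_max=1.0):
    return (v - min_a) / (max_a - min_a) * (new_max - new_min) + new_min

def z_score(values):
    m = sum(values) / len(values)
    sd = math.sqrt(sum((x - m) ** 2 for x in values) / (len(values) - 1))
    return [(x - m) / sd for x in values]

def decimal_scaling(values):
    k = len(str(int(max(abs(v) for v in values))))   # digits of the largest magnitude
    return [v / 10 ** k for v in values]

print(min_max(73600, 12000, 98000))      # about 0.716
print(z_score([10, 20, 30]))             # [-1.0, 0.0, 1.0]
print(decimal_scaling([-991, 99]))       # [-0.991, 0.099]
```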
3. Decimal scaling : Based on the maximum absolute value of the attribute, the decimal point is moved. This process is called decimal scale normalization.

   v'(i) = v(i) / 10^k   for the smallest k such that max(|v'(i)|) < 1

   Example : For the range between -991 and 99, 10^k is 1000 (k = 3, as we have a maximum 3-digit number in the range), so v'(-991) = -0.991 and v'(99) = 0.099.

3.15.4 Discretization by Binning

-  This is a data smoothing technique.
-  Discretization by binning has two approaches :
   (a) Equal-width (distance) partitioning
   (b) Equal-depth (frequency) partitioning or equal-height binning
-  Both the binning approaches are given in section 3.12.4.
Discretization by histogram analysis

-  In discretization by histogram analysis, the data is divided into buckets and the average (sum) for each bucket is stored.
-  Different types of histogram :
   1. Equal-width histograms
   2. Equal-depth (frequency) partitioning
   3. V-optimal
   4. MaxDiff

Discretization by concept hierarchy replaces low-level values (such as numeric values of an age attribute) by higher level concepts (e.g. Young, Middle aged or Senior).

Concept hierarchy generation for categorical data

-  Users or experts may perform a partial/total ordering of attributes explicitly at the schema level :
   E.g. street < city < state < country
-  Specification of a hierarchy for a set of values by explicit data grouping :
   E.g. {Acton, Canberra, ACT} < Australia
-  Ordering of only a partial set of attributes :
   E.g. only street < city, not others
-  By analysing the number of distinct values, the hierarchies or attribute levels may be generated automatically.
   E.g. for a set of attributes : {street, city, state, country}
   E.g. weekday, month, quarter, year
Fig. 3.16.1 : Concept hierarchy example

Syllabus Topic : Concept Description - Attribute Oriented Induction for Data Characterization

3.17 Concept Description : Attribute Oriented Induction for Data Characterization

-  Data mining can be classified into two categories :
   1. Descriptive data mining
   2. Predictive data mining

3.18.1 Data Generalization

-  Data generalization is a process which abstracts a large set of task-relevant data in a database from a low conceptual level to higher conceptual levels.
-  Generalization of large data sets can be carried out according to two approaches :

(i) Data cube approach (or OLAP approach)

-  The data cube approach can be considered as a data warehouse-based, precomputation-oriented, materialized-view approach.
-  Computations and results are stored in data cubes.
-  It performs off-line aggregation before an OLAP or data mining query is submitted for processing.
-  Strengths :
   o  An efficient implementation of data generalization.
   o  Computation of various kinds of measures, e.g. count( ) or sum( ).
   o  Roll-up and drill-down.
-  Limitations :
   o  It handles only dimensions of simple non-numeric data and measures of simple aggregated numeric values.
   o  Lack of intelligent analysis.

(ii) Attribute-Oriented Induction (AOI)

-  The attribute-oriented induction approach, at least in its initial proposal, is a relational database, query-oriented, generalization-based, on-line technique.
-  Collect the task-relevant data using a relational database query.
-  Data generalization is performed in two ways :
   o  Attribute removal
   o  Attribute generalization
-  Data aggregation is performed by merging identical, generalized tuples and accumulating their respective counts.
-  Presentation of the generalized relation uses forms such as charts or rules.
~ These two tec
3.18.2 How Attribute-Oriented Induction is Performed?
. ‘
control mae
control to furt
3.18.2(A)
18. Data Generalization 418 g(c) Example
It can be performed in two ways : attribute removal and attribute generalization,
This example sh

(i) Attribute Removal is based on the following rule : Jation of T


working re!
~ If there is no generalization operator on the attribute then remove that attribute,
~ _ Its higher level concepts are expressed in terms of other attributes in the relation tl
remove the attribute from the initial working relation.

(ii) Attribute Generalization is based on the following rule


If there is a large set of distinct values for an attribute in the initial working relation, te;
use generalization operator on that attribute.

3.18.2(B) Attribute Generalization Control

It is subjective to how high the attribute is generalized. While attribute oriestil


generalization a balance should be maintained so that it should not be “too-high” or “too-lv'|_
generalization. The control of this process is called attribute generalization control.
For each att
Two common approaches of this process are : — Name: Thi
(i) Attribute generalization threshold control ; _ - Gender :'
Ses cge . ext) gender.
- Set one generalization threshold for all of the attributes, or set one threshold for
attribute. — Major : It
- The number of distinct values in an attribute should be less or equal to attribut _ ~ Birth_pla
threshold, If not then further the attribute removal or attribute generalization shoul ; If the nu
be performed.
threshold
~ Typical threshold value range is from 2 to 8. ~ Birth_a:

Scanned by CamScanner
ev Data Warehohousing & Mini
—= using & N ing
S(M
SU-Son, S
I. 8-Co,
Generalized relation threshot g co==.)3
a)
_ Seta thresh Ntroy Introduction to Data
Mining
old for 8ene
‘is ralizeg
Tlation,
Ube, ~ eth satis toe
the thres. * I Not then fu‘U ples in the Beneralj
rther &e
_ Typical neralj al
threshol
g value ZAtion izSed relation sho uld
hould be be less or equal to
range is Performe
from 10 d
to© 39 30,

: first apply the


attribute threshol
d
» and then apply relati
on threshold
re lation,
3.18.2(C) Example of Attribute Oriente
Induction
DC ctinThgis.ebeei
xaen
mpleot Ta
sholk
ws 3.1
th8.1
at how attribute-_5
One5 nted ind
. uct Sz
ion js performed
on the iis
initial
Table 3.18, ;
Initial Relation

jeBirth date | Residence


Vancouver, | eto76 | 9544 Phone# | GPA
Main St, | 687-4598 | 3.67

Montreal, 28-7-75 345 ist Ave, | 253-9106 | 3.70


Que, Richmond
Canada
Physics Seattle, 25-8-70 125 Austin Ave., | 420-5232 | 3.83
6 ‘| WA,USA |. . Bumety sa m
v" | [Removed | Retained | SciEng,Bus | County | Age range | Cityi Removed | Exc NG
For each attribute of the relation, the generalization proceeds as follows :

~ Name : This attribute is removed as there are large number of distinct values for name

~ Gender : This attribute is retained as there are only two distinct values (M and F) for
gender.

.| ~ can be generalized to the values {arts, science, engincering, business}


Major or : : ItIt can
. . f distinct values.
can be generali ized as. it has a large number o
a : This attribute e
ibute generalization
ie aie of cing values for
country 1s ieee a
dreds Oo 7 en birth_pl ace should be generaliz d to birth_co ‘arco
_

~ neraliized to age OF ab°. _ range.


Birth_date : It can be ge

Scanned by CamScanner
¥ Data Warehousing & Mining (MU-Sem. 6-Comp.) 3-63 Introduction ts- :
at
- Residence: As number of values for city in residence attribute is Jess than
value, it can be generalizes to city. Sy
- Phone#: This attribute has large distinct values so it should be removedj in genera
ine,
- GPA: Grade Point Average (GPA) has distinct numerical values but it can be ge “
into numerical intervals like (3.75-4.0, 3.5-3.75, ...... ), which in turn can be ony
descriptive values such as {Excellent (Excl), Very Good(VG),... ++}. The ati,
therefore be generalized. &
A generalized relation obtained by attribute-oriented induction on the
data ,
Table 3.18.1 is as given in Table 3.18.2. —
Table 3.18.2 : Generalized Relation

Gender | Major | Birth region’ Age range ‘Residence ‘GPA? EN Coung


M Science Canada 20-25 | Richmond | Very-good| 16 | |
F Science | Foreign 25-30 Burnaby | Excellent | 22

3.19 University Questions and Answers

May 2010

Q.1 Explain data mining as a step in KDD. Give the architecture of typical DM system.
(Ans. : Refer sections 3.5 and 3.3) _ (10 Marks)
Dec. 2010 ’
Q.2 Explain data mining as a step in KDD. Give the architecture of typical DM system.
(Ans. : Refer sections 3.5 and 3.3) | (10 Marks)
May 2011

Q.3 Describe the steps in the KDD process with a suitable bock diagram.
(Ans. : Refer section 3.3) (5 Marks)
Dec. 2011

Q.4 What is data mining ? What are techniques and applications of data mining ? Explain
the architecture of typical data mining system.
(Ans. : Refer sections 3.1, 3.4, 3.7 and 3.3) (10 Marks

al
Scanned by CamScanner
_.,. Introduction to Data Mining

(5 Marks)

izati ;
'0n techniques for Data en a
P mining. (Ans. : Refer section 3,10)
13
oes ee
g Explain Data sections
(Ans. : Refer mining as3.5a step
and
in KDD. Explain the architecture of a typical DM system.
3.3) (10 Marks)

wy 2010 ,

i. The steps in KDD process (Ans. : Refer section 3.5)

i, of a typical DM system (Ans. : Refer section 3.3)


The architecture (10 Marks)

Discuss different steps involved in data preprocessing.


, 10 (10 Marks)
; (Ans. : Refer sections 3.12, 3.13 and 3.14)
o00

Chapter Ends

Scanned by CamScanner
Warehousing
| Module 40 | a—___
! 9}. CHAPTER
various classifi
- * 5 Regression
|:

:
4 Classification, Prediction
8 . . .- |: \
§— i
Decision jsion tre
and Clustering . {
o
medi
u

—_ AAA Classifica

i Suppose a da
CH{Cy---Cn}
‘ Syllabus : oo ™s which tuple 0
Classification, Prediction and Clustering : Basic Concepts,
Decision Tree Usky
“7 Information Gain, Induction : Attribute Selection _ Actually divi
; Measures, Tree pruning, hi diction is
Classification : Naive Bayes, Classifier Rule - Based Classification : Using
: for classification, Prediction : Simple linear regress IFTHEN Ply 7 Je corti
ion, Multi ple linear regression Mods)
Evaluation n && Selection : Accura cy and Error models
measures, Holdout, Random Samping|| | 412.
Cross Validation, Bootstrap. Classifi
Clustering : Distance . Measures, Partitioning
Hierarchical Methods(Agglomerative, Divisive).
Methods (K-Means, _k-Medoits, How teache:
_ Ifx>=90t

: — Tf 80<=x<9
Syllabus Topic | Basle Conse = Wf 70<exe!
CO
P - reex
4.1__ Basic Concept : Classification
— Ifx<60 th
~ Classification constructs the classification model base
‘j
d on training data set and usinthg
model classifies the new data.
- It predicts the value of classifying attri
bute or class label.
- Typical applications :
41.3 Clas
o Classify credit approval based on customer
data.
1, Model |
o Target marketing of product.
- Bye
o Medical diagnosis based on symptoms
of patient.
fl
~ Th
o Treatment effectiveness analysis of patient based on the
treatment given. ~ Th
de

Scanned by CamScanner
. : Classiti i n & Clusteri‘ng
various classification techniques Hication, Predictio
Regression

Decision trees

Rules

Neural networks

ad Classification Problem
suppose a database
D is given as D=
Ca[Cy--sCm)> th
e Classification th, lays ast} and a set of desi
red classes are
which tuple of databa Pr ob ie ™ is to define the mapping
se D belongs to whi m in such a way that
ch class of ¢.
Actually divides D into equi
valence Classes,
Prediction is simi
lar, but may be
viewed
;
models continuous-valued functions, ie. as having infinite number of f classes. Pri ediction
» Predicts unknown or missing val
ues,
41.2 Classification Example

How teacher gives grades to students based on their marks


obtsined «
_ Ifx>
then=9
grade0
= A. x
<90/ \ >=90
- If80<=x<90 then grade = B.
i.
- If70<=x<80 then grade = C. <80/ \ >=80
- If60<=x<70 then grade = D. %. 5
. <70
of\ 70
- Ifx<60 then grade = E. fo Ae
<60
o/\~t0

E D
Fig. 4.1.1 : Classification of grading

41.3 Classification is a Two Step


Process
1. Model construction

Every sample tuple or object has assigned a predefined class label


~ Those set of sample tuples or subset data set is known as training data set.
~ The constructed model based on training data set is represented as classification rules,
decision trees or mathematical formulae.

4
Scanned by CamScanner
2. Model usage

- Fo r classifying unknown
objects or new tuple usc the constructed mode},
with the resultant class label.
Compare the class label of test sample
g the percentage of test set samp,
_ Estimate accuracy of the model by calculatin tal
constructed.
are correctly classified by the model
_ Test sample data and training data samples are always different, oy
over-fitting will occur.

Example

Classification process : (1) Model construction

Classification.

gu
__ Algorithms.-

a
1 -
“Training — —

_ Classifier
~ (Model) —

IF rank="professor ~ |
ORyear>6
THEN tenured ='yes’
Fig. 4.1.2 : Learning : Training data are analyzed by a classification algorithm

Classification process : (2) Model usage (Use the model in prediction)

41.4 Differer

at

no
Tenured? {I
no
Yes
yes

Fig. 4.1.3: Classification : Test data are used to estimate the accuracy of the classification™*

Scanned by CamScanner
8-Com
<3 Class ification, Prediction & Clusterine
‘ f example + How
to perform Classifj Cation N task
re d ase for
ClassiT
fir
cation of medical pati.ents by
"Seen patients

<a Attrib2 _Attriba Clagg”


H—Tyes |Lame 125K |No |
1 No | Medium }100K No
? No Small | 70K No
: yes |Medium /120K [No
‘ No Large 95K Yes
5 {No Medium |60K No

g:|No | Small |85K yes \


g |No {Medium |75K No’ alll
10 |No._| Small [90K + [Yes= oat |
Training Set ms
, * Model

Tid -Attrib1 Altrib2 Amribg : Class |


~ Rules telling which
; ay Lat ge eee ion
42 |Yes -| Medium | 80K ? Patient indicatesa
-. disease
19|Yes |Large {110K |?
14|No- |Small {95K |?

Test Set
‘ Shenae! =

La unseen 5
-Unseen patient ® fe Patient S

Fig. 4.1.4 : Example of classification model

41.4 Difference between Classifi


cation and Prediction
St. No. ‘Classification BE Prediction
L Classification is a major type of Prediction can be viewed ‘as the construction
prediction problem where and use of a model to assess the class of an
Classification is used to predict unlabeled sample.
discrete or nominal values.

Scanned by CamScanner
ss the val ues_
[i is used to asse Or
:
a given aly,
of an attribute that
ta A2
is Sample
assification Jabels. Ig ji
ict class have.
predic
- tion to Pre !
te Model ; pel!
Eg. if a classification
their
r o u p pa ti en ts based on predict the treatment outcom, ey |
E.g. G data and
would be a ilen,
know n me di ca l
is a
oa patient, then it ICtio
then
treatment outcome
classification.

ods
42 Classification Meth

en below :
Classifications methods are giv
ion measures, tree pruning
|. Decision Tree Induction : Attribute select
- Bayesian Classification : Naive Bayes classifier

4.21 Decision Tree Induction

- Training dataseett should be


class-labelled for learning of dec
induction. i sion trees in
€ci
decisi
dec; onp,
- e
A m decision tr represents
ee rules and it
is very
at
Rules are 4 popular tool for classification
database Y to understand and can b © directly used
in SQL,
|
— mToetr
ecognnijze g nd appr ve the records fnz
ove the di
scovereg kn Owled
m
Theree
are aman Y alg -m decision model is ve}
orithms to build dec
erative Dich
igi
otomise; —
3
C4.5 (Succe — |
ssor of ID
, 3) :
o CA RT (Cla
ssif ication
and
oC HAID
(CHi-squareg
Auuttomatee
i C In: -
Ction
"| D3 Yetector)

Scanned by CamScanner
g pata Warehousing & Minin; (MU-Som, 65,
= Mp.)
7 ., Appropriate Prob ; 4.
yo) PP lems for Decision y,Dea tt
=cification, Prediction & Clustering
Fc
pecision tree learning is Appropriate tei ea Learning
F the Problems
pelo . t
having the characteristics given
Instances are represented by a fixed
“ ale, female) describ“das attribute-vatue
(e.g. male, Set of attripy
tates, (Cg. gender) and their values
if the attribute has small number of disjoint
there are on. cl ses (e.g true, f possi
ly two possible Clas Ossityble y alues (e.g. high,
medium, low) or
ven ; False
Extension to decision tree al gorithm also biitlee veal
) then decision tree learni
ng is easy.
¥ alue attributes (c.g.
Decision tree gives a class label to each instance of P= salary).
jsion tree methods itaset.
can be used even
tones (e.g. humidi whe training
ty is known for
. * .

only a fraction ae \ unk - :


examplples).
: m
int

Learned function s are either represented b


a ne
Y a decisi
:
jf-then rules to improve readability. 10n tree or re-represented as sets of

42.1(B) Decision Tree Representation


—.. Decision tree classifier has tree type struc
ture which has leaf nodes and decision nodes

- A leaf node is the last node of each branch and indicates class label or value of target
attribute
ision tree . : .
_- A decision node is the node of tree which has leaf node.or sub-tree. Some test to be
carried on the each value of decision node to get the decision of class label or to get next
tion and sub-tree. s :

Example : Decision tree representation for play tennis.


rds fro outlook
Attribute
| is very \, Sunny Overcast pan
Value + — / Wind |
_ Humidity© :
Strong Weak
High Normal

son of decision tree


Classification pig, 4.2.1: Representation

Scanned by CamScanner
ion for play tennis
bel iow,
y ¢ en ni s = Yes is given
Pla
— Logical expression for ¥ (outlook = Overcast) v
= normal)
= sunny “ humidity
(Outlook
= rain A wind = weak)

- I[fthen miles: For continu


THEN play tennis = Yes
o = [Feoutlook = sunny A humidity = normal gxample using
= No
[IF outlook = sunny a humidity =
high THEN play tennis
o Consider th
is = Yes
[Foutlook = overcast THEN play tenn
©
is = Yes
© IF outlook =rain a wind = weak THEN play tenn
© IFoutlook =rain a wind = strong THEN play tennis = No

Syllabus Topic : Induction - Attribute Selection I Measure. -


4.2.1(C) Attribute Selection Measure

(1) Gini index (IBM Intelligent Miner)


:
Suppose all attributes are continuou
s-valued.
Assum € that cach value
of attrib
i ute has many possible split.
It can be adapted for catego
rical attributes,
af
"

After Splittin
g T into two
s
ined as,

Scanned by CamScanner
Oy, For each attrib
“Oop . saiveil attribute,ute, €ach o
theattributef the °SSible binary ‘
N Bini (7,
—s
For continuous-valued attrjp, 78 SMallog: rts 18 consi dered
ple using gini ii index; Sach POssibje iia
*Plit-point mis cho en
— ee discrete-
node,
Consider the following dataset D, i

Outlook | Ty Ty

else | No -
High True i
High te ee

|
-———__| False Yes

BO
et
Rormal =|
Fase
p
“Pxes
a
p——_| False | Yes

hee
[Normal True | No

ee eh
| Normal Truc | Yes
|High [False | No
. 1

et

Normal . || Fajs
na

ee © | Yes

se
Normal | False

AS
Yes
Sunny

AIS
Mild Normal | True | Yes
overcast | Mild

wa
ner High True | Yes
| overcast | Hot Normal | False | Yes Q
rainy Mild High True | No
- The total for node D is : P
gini(D) = 1-sum (pl’, p2’, wa. pm’) (4.2.1)
Where, p1....n are the frequency ratios of class 1....n in D.
Inthe above example, there are only two classes : 1) Yes 2) No

Therefore
n =2

Scanned by CamScanner
¥ Data Warehousing & Mining (MU-Sem, 6-Comp.)
oo”
_ 4-9 Classification, Pei

So the gini index for the entire set : Ry

= 1 - ([9/14]^2 + [5/14]^2) = 1 - (0.413 + 0.128)


= 0.459
~ To find the splitting criterion for the
tuples in D, compute the ginj indey
attribute. 5
~ "
The gini value of a split of D into subsets D,,
D,,........D, is :
Split (D) = Ny/N gini (D,) +NYN gini (D,) +..
-u0ct NN gini D,)..¢
Where, N = Size of dataset D

Ni N)y......4.N, = Size of each


subset.
~ Consider First attribute as outlook from
dataset D
E.g. Outlook Splits into 5 tuples
for sunny, 4 tuples for overcast,
5 tuples for rainy :
So N, =5,N,=4 and N, =5
|
Split= 5/14 gini (sunny) +.4/14 gini
(overcast) + 5/14 gini (rainy) _.(42 3)
~ Now consider the dataset D, wit
h only tuples having outlook=
“sunny”
Outlook Temperature ‘Hu
midity ‘Windy Play?
Sunny Hot High False | No
Sunny Hot High True No
Sunny Mild High False | No
Sunny Cool Normal | False | Yes
Sunny . Mild Normal | True | Yes
Using Equation (4.2.1),

Gini(sunny) = 1~ sum ([2/5)°, [3/5]


= 10.376 = 0.624
Similarly calculate for “overcast” and “rai
ny”
Overcast = 1 sum (4/4, 0/4?) =
0.0
Rainy = 1~sum ([3/5]", [2/5]?) = 0.624

Scanned by CamScanner
tion 4 Cluaterin
2 Waesification, Predic

patron (4.2.3)
sai
pro
5/ 14 gi ni (s un ny) + 4/14 gini (a
te as t)
vere
+ 5/ 14 gi ni (rainy)
*
split
4) +0 +* (5/* 1
gpPlit = (5/14 * 0.62 0.6244) *
.

|
1 =

Split = 0.446
;
at ge ne ra te s th e smallest gini spl ch os en to split the node on
.
~aante th gini sp li t va lu e is
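The gini computations above can be scripted as a quick check. A minimal sketch follows; note that with the standard 1 - SUM(p_j^2) definition each five-tuple outlook branch works out to 0.48 rather than 0.624, so the weighted split value comes out near 0.34, but the attribute with the smallest split value is still the one chosen.

```python
def gini(labels):
    """Gini index of a set of class labels: 1 - sum over classes of p_j^2."""
    total = len(labels)
    counts = {}
    for label in labels:
        counts[label] = counts.get(label, 0) + 1
    return 1 - sum((n / total) ** 2 for n in counts.values())

def gini_split(partitions):
    """gini_split = sum over partitions of (|D_i| / |D|) * gini(D_i)."""
    total = sum(len(p) for p in partitions)
    return sum(len(p) / total * gini(p) for p in partitions)

# The 14-tuple class column of the weather data: 9 "yes" and 5 "no".
full = ["yes"] * 9 + ["no"] * 5
print(round(gini(full), 3))   # 0.459, the gini index of the entire set

# Class labels grouped by outlook (sunny, overcast, rainy).
sunny = ["no", "no", "no", "yes", "yes"]
overcast = ["yes"] * 4
rainy = ["yes", "yes", "no", "yes", "no"]
print(round(gini_split([sunny, overcast, rainy]), 3))
```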

rion gain (1D3/C4.5)


ne attrib?
.

tegorical.
believed to be ca
0) [n

All attributes are

ap te d fo r co nt in uo us -v al ued attributes
spoan be ad is se le ct ed fo r split.
gh es t in fo rm at io n ga in
_ The attribute which has thsee s,hi P and N.
as
assume there are two cl to cl as s P and n samples
mp le s be lo ng s
mples, out of these p sa
_ Suppose we have S sa
ass N.
pelongs t0 cl e in $ be longs to P or
io n, ne ed ed to de ci de if random exampl
rmat
_ The amount of info
Nis defined as
_p 2+) ,
I(p,n) == —-p—4=n pen p n log p+n
(P.0)
wi ll be pa rt it io ne d in to se ts (5), Sp, -- Sy)
ibute A, a set S
_ assume that using attr the expected
. ples of N, the entropy, OF
es of P and 1; exam
«ag S, contains P; exampl cts in all subtrees 5; 1s
informatio n eded to classify obje
ne
vo.+n:
BAS i=Ll ento
ly
| Entropy (E) : ed to as si gn a class to a random
need
am ou nt of in fo rmation (in bits) :
- Expected ma l, shortest-length code
r th e op ti ed
drawn object in S unde es re du ct io n in entropy achiev
Measur
fo rm at io n ga in i.e. gain (A) : du ction (maximizes
GAIN)
- Calculate in achi ev es mo st re
us e of the spl it. Ch oo se the split that
beca
Gain (A) = T(p,
0)-E(A) ,

_ () Gain ratio favouritism on


that reduces its
information gain
alteration of the
- Jt is an
.
high-branch attributes
Scanned by CamScanner
e

¥s Data
: Warehousing & Mining Eh (MU-Sem. 6-Comp.)
cthachaiias Reotahbaanenlt _4-11 Classification, Prediction &o
RSS
{

Gain ratio should be big when data is evenly spread and small when al} data
one branch. So it considers number of branches and size of branches when Mg,
attribute to split. It seh, |

- Intrinsic information is the entropy of distribution of instances into branches


ISil IS!
Intrinsic Info(S, A)= -Lier
is log, Tg)
ISI

- Gain ratio normalizes info gain by using following formula:


Gain(S,A)
Gain Ratio (S.A)= Tririnsic In
fo(S,A)
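The entropy, gain and gain-ratio formulas above can be checked with a short script. In the sketch below the branch class counts correspond to the age attribute of the worked example in section 4.2.1(F), where the gain works out to about 0.246.

```python
import math

def info(counts):
    """I(p, n, ...) = - sum over classes of (c/total) * log2(c/total)."""
    total = sum(counts)
    return -sum(c / total * math.log2(c / total) for c in counts if c)

def gain(branches):
    """Information gain = I(whole set) - weighted sum of branch entropies."""
    totals = [sum(b) for b in branches]
    n = sum(totals)
    whole = [sum(cls) for cls in zip(*branches)]
    expected = sum(t / n * info(b) for t, b in zip(totals, branches))
    return info(whole) - expected

def gain_ratio(branches):
    totals = [sum(b) for b in branches]
    return gain(branches) / info(totals)

# (p_i, n_i) counts for the three branches produced by splitting on age.
age_branches = [(2, 3), (4, 0), (3, 2)]
print(round(gain(age_branches), 3))        # about 0.246
print(round(gain_ratio(age_branches), 3))
```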
i
4.2.1(D) Algorithm for Inducing a Decision Tree |

The Basic ideas behind ID3 :

— (4.5 is an extension of ID3. ;


— (4.5 accounts for unavailable values, continuous attribute value ranges, Pruning ¢f 6
decision trees, rule derivation and so on.

- (4,5 is a designed by Quinlan to address the following issues not given by ID3 : |
© It avoids over fitting the data.
o Itdetermines the depth of decision tree and reduces the error pruning. 7.
o It also handles continuous value attributes e.g. Salary or temperature. ;
o It works for missing value attribute and handles suitable attribute selection measure. 8.
o It gives better efficiency of computation. The algorithm to generate decision teeis 9.
given by Jiawei Han et al. as below : ‘

Algorithm : Generate_decision_tree : Generate a decision tree from the training tuples of ii


data partition, D.

Input: ©

Data partition, D, which is a set of training tuples and their associated class labels;

Attribute_list, the set of candidate attributes;


Attribute_selection_method, a procedure to determine the splitting criterion that “pest
partitions the data-tuples into individual classes. This criterion consists a j
splitting_attribute and, possibly, cither a split-point or splitting subset.

Scanned by CamScanner
ta Warehousing & Minj
by, ¥ Da 2109 (MU-Sem, 6.
Comp)
“Ni ne galt = Classification, Prediction& Clustering
NChes A decision tree.
Method :

|, create a node N;
2. if tuples in D are all of the same class, C,
thes
| | =e N asa leaf node labeled with the class C:
3 if attribute_list is empty then. ‘
| -_ return N as a leaf node labeled with the majority
class in D; // majority

4. Apply Gee attribute_list) to find the “best” splitting criterion:


;, Pruning \ S 15. ae " a
6. if splitting_attribute is discrete-valued and
3 : Multiway splits allowed then // not restricted to binary trees
Attribute_list << .attribute_list — splitting attribute; if remove
splitting attribute

7. foreach outcome j of splitting criterion

// partition the tuples and grow subtrees for aa partition


measure. | 8. - let D, be the set of data tuples in D satisfying outcomej; // a partition
sion treeis [ 9. ifD,is empty then |
attach a leaf labeled with the majority class in D to node N;
tuples of 10. else attach the node returned by Generate_decision_tree (Dy, attribute_list) to node Ny
| endfor. ea isccae ale
ft Nee aa paneson Tree Methods eae
of Decisi
| Strengths and Weakness
on tree methods are :
~ The strengths of decisi |
are abl e to ge ne ra te un derstandable rules.
© Decision trees n.
ca ti o n wi th ou t req uiring much computatio
rform classifi
© Decision trees pe l variables.
contin uous and categorica
to hianandle both
© Decision trees are able

Scanned by CamScanner
¥ Data Warehousing & Mining (MU-Sem. 6-Comp.

© Decision trees provide a clear indication of which ficlds are most ins
prediction or classification. Porta,

— The weaknesses of decision tree methods :


0 Notsuitable for prediction of continuous attribute.
© Perform poorly with many class and small data.
o Computationally expensive to train.

Syllabus Topic : Tree Pruning a

4.2,.1(E) Tree Pruning

— Because of noise or outliers, the generated tree may over fit due to many branches,
- To avoid over fitting, prune the tree so that it is not too specific.
Prepruning Class P: buys_¢

© Start pruning in the beginning while building the tree itself. ClassN: buys
o Stop the tree construction in early stage. Total number of
© Avoid splitting a node by checking the threshold with the goodness measure fli Count the numbe
below a threshold. : So number of ret

o Selection of correct threshold is difficult in prepruning. | So Information


Postpruning
vane m a

co Build the full tree then start pruning, remove the branches. Vem
o Use different set of data than training data set to get the best pruned tree.

4.2.1(F) Examples of ID3


: Step 1: Comput

Ex. 4.2.1: Apply ID3 on the following training dataset from all electronics custo For age < 3(
database and extract the classification rule from the tree = with “ ‘yt
R= with
Table P. 4.2.1 : Training data of customer

[Age | income | Student | Credit_rating | Class : buys computer :


ae ee 7 = . m a = Sie a fete p eat Therefore ,

| <30 | High | No Fair No


|<30 |High |No | Excellent | No
[31...40| High [No | Fair Yes

Scanned by CamScanner
yg
: Ge Student Credit_rating | Class: buys. con uter |

P| a ee lt Yes ee
|
| 7 5 Low Yes Fair Yas =
a Low Yes Excellent No i
all || Low _ Yes Excellent Yes
a | Medium No Fair No
| 730 | Low _ Yes Fair Yes |
| “10 | Medium Yes Fair Yes
| ea [Medium Yes Excellent Yes |
Sal | | Medium No Excellent Yes
i.20| [High _ Yes | Fair Yes
40] | Medium No Excellent No

| gol :

class P : puys_computer = “yes
classN: buys computer= “no”
gotal number of records 14.
e n u m b e r of re co rd s w i t h “y es ” cl as s and “no” class.
t
Coun th
5 = 9 and “no” class= 5

. 3qlee P <r ogeben


+nDB

: .'2) =~ p+
so Information gain = 1 (p

1(p, 0) 1(9, 5)
(5/14)
il

log,
= — (9/14) log, (9/14) - (5/14)
= 0.940
y for age
Step 1: Compute the entrop
For age $30,
2 an d p; = with “no” class = 3
p,= with “yes” cl as s =
= 0-971-
Therefore , I(p, .0,) = 12,3)

Scanned by CamScanner
Similarly for different age ranges I(p,.n,) is
calculated as given below :

Table le P. 4.2.1(a): Se
2:
pe He j0n
ne
| age | [a] tom yore 09 Credit satin
| <30 | 2/3] ase" <
o971 consider ABS
[31...40 4,0 0

| >40 3/2 0.971


So, the expected information needed to classify a given sample if the
partitioned according to age is, ° Pes
Calculate entropy using the values from the Table
P. 4.2.1(a) and the formula .:
below:

Vv

E@)= ¥ 23160) .
i= ptn
=]

BOB) = Bet74 12,3) ay +7qyilh 14,0) + 2 16,2) <0464


Hence

Gain(age) = I (p, n) ~E (age)


= 0.~ 9
0.694
4=0.
0246
Similarly,

Gain (income) = 0.029


Gain (student) = 0.151
Gain (credit_rating) = 0,048 .
attribNo
utew andthe cre
ageatehasthethenedhighest information On
gaienn; among all the attrib
: utes, so select age as test
© as age and show all possible
values of age for further splittin
Since Age has three Possible g.
values, the root node has thr
ee branches
($30, 31...40, > 40).

AGE

<=30 31.40 >40

Fig.P. 4.2.1

a.
Scanned by CamScanner
Step 2 :

The next question is : which attribute should be tested at the age <=30 branch? We have
already used age at the root, so we now decide among the remaining attributes income,
student and credit_rating.

Consider the age <=30 branch and count the number of tuples from the original given
training set :

S<=30 = 5 (tuples with age <=30), given in Table P. 4.2.1(b)

| Age  | Income | Student | Credit_rating | Buys_computer |
| <=30 | High   | No      | Fair          | No  |
| <=30 | High   | No      | Excellent     | No  |
| <=30 | Medium | No      | Fair          | No  |
| <=30 | Low    | Yes     | Fair          | Yes |
| <=30 | Medium | Yes     | Excellent     | Yes |

Total number of "yes" tuples = 2 and "no" tuples = 3, so
I(p, n) = I(2, 3) = - (2/5) log2 (2/5) - (3/5) log2 (3/5) = 0.971

(i) Compute the entropy for income : (High, medium, low)

For income = High,
pi = with "yes" class = 0 and ni = with "no" class = 2
Therefore, I(pi, ni) = I(0, 2) = - (0/2) log2 (0/2) - (2/2) log2 (2/2) = 0.

Similarly, for the other income values, I(pi, ni) is calculated in Table P. 4.2.1(c) :

| Income | pi | ni | I(pi, ni) |
| High   | 0  | 2  | 0 |
| Medium | 1  | 1  | 1 |
| Low    | 1  | 0  | 0 |

E(income) = (2/5) I(0, 2) + (2/5) I(1, 1) + (1/5) I(1, 0) = 0.4
[Note : S<=30 is the training set for this branch.]

Hence
Gain(S<=30, income) = I(p, n) - E(income) = 0.971 - 0.4 = 0.571

(ii) Compute the entropy for student : (No, yes)

For student = No, pi = 0 and ni = 3, so I(0, 3) = 0; for student = Yes, I(2, 0) = 0.

E(student) = (3/5) I(0, 3) + (2/5) I(2, 0) = 0
Gain(S<=30, student) = 0.971 - 0 = 0.971

(iii) Compute the entropy for credit_rating : (Fair, excellent)

For credit_rating = Fair, pi = 1 and ni = 2, so I(1, 2) = 0.918;
for credit_rating = Excellent, pi = 1 and ni = 1, so I(1, 1) = 1.

E(credit_rating) = (3/5) I(1, 2) + (2/5) I(1, 1) = 0.951
Gain(S<=30, credit_rating) = 0.971 - 0.951 = 0.02

Therefore,
Gain(S<=30, student)       = 0.971
Gain(S<=30, income)        = 0.571
Gain(S<=30, credit_rating) = 0.02

Student has the highest gain; therefore, it is placed below age = "<=30"
(Fig. P. 4.2.1(a)).
Step 3 :

Consider now only income and credit_rating for age = 31...40 and count the number of
tuples from the original given training set :

S31...40 = 4 (tuples with age 31...40), given in Table P. 4.2.1(f)

| Age     | Income | Student | Credit_rating | Buys_computer |
| 31...40 | High   | No      | Fair          | Yes |
| 31...40 | Low    | Yes     | Excellent     | Yes |
| 31...40 | Medium | No      | Excellent     | Yes |
| 31...40 | High   | Yes     | Fair          | Yes |

Since for the attributes income and credit_rating every tuple has buys_computer = yes,
assign the class "yes" directly to the branch age = 31...40 (Fig. P. 4.2.1(b)).

Step 4 :

Consider income and credit_rating for age > 40 and count the number of tuples from the
original given training set :

S>40 = 5 (tuples with age > 40), given in Table P. 4.2.1(g)

| Age | Income | Student | Credit_rating | Buys_computer |
| >40 | Medium | No      | Fair          | Yes |
| >40 | Low    | Yes     | Fair          | Yes |
| >40 | Low    | Yes     | Excellent     | No  |
| >40 | Medium | Yes     | Fair          | Yes |
| >40 | Medium | No      | Excellent     | No  |

Consider Table P. 4.2.1(g) as the new training set and calculate the gain for income and
credit_rating.
Class P : buys_computer = "yes"
Class N : buys_computer = "no"
Total number of records = 5.
Count the number of records with "yes" class and "no" class :
number of records with "yes" class = 3 and "no" class = 2.

I(p, n) = I(3, 2) = - (3/5) log2 (3/5) - (2/5) log2 (2/5) = 0.970

(i) Compute the entropy for credit_rating : (Fair, excellent)

For credit_rating = Fair,
pi = with "yes" class = 3 and ni = with "no" class = 0
Therefore, I(pi, ni) = I(3, 0) = 0.

For credit_rating = Excellent,
pi = with "yes" class = 0 and ni = with "no" class = 2
Therefore, I(pi, ni) = I(0, 2) = 0.

Table P. 4.2.1(h)

| Credit_rating | pi | ni | I(pi, ni) |
| Fair          | 3  | 0  | 0 |
| Excellent     | 0  | 2  | 0 |

Calculate the entropy using the values from Table P. 4.2.1(h) and the formula given below :

E(A) = sum over all values v of ((pi + ni) / (p + n)) I(pi, ni)

E(credit_rating) = (3/5) I(3, 0) + (2/5) I(0, 2) = 0

Hence
Gain(S>40, credit_rating) = I(p, n) - E(credit_rating) = 0.970 - 0 = 0.970
(ii) Compute the entropy for income : (High, medium, low)

For income = High, pi = 0 and ni = 0, so I(0, 0) = 0.
For income = Medium, pi = 2 and ni = 1, so I(2, 1) = 0.918.
For income = Low, pi = 1 and ni = 1, so I(1, 1) = 1.

E(income) = (0/5) I(0, 0) + (3/5) I(2, 1) + (2/5) I(1, 1) = 0.951

Gain(S>40, income) = I(p, n) - E(income) = 0.970 - 0.951 = 0.019

Therefore,
Gain(S>40, income)        = 0.019
Gain(S>40, credit_rating) = 0.970

Credit_rating has the highest gain; therefore, it is placed below age = "> 40"
(Fig. P. 4.2.1(c)).

Output : A decision tree for "buys_computer" (Fig. P. 4.2.1(d)) :
age is at the root; below age = "<=30" the node student splits into no -> No and
yes -> Yes; age = "31...40" gives Yes directly; below age = "> 40" the node
credit_rating splits into excellent -> No and fair -> Yes.

The classification rules extracted from the tree are :

IF age = "<=30" AND student = "no" THEN buys_computer = "no"
IF age = "<=30" AND student = "yes" THEN buys_computer = "yes"
IF age = "31...40" THEN buys_computer = "yes"
IF age = "> 40" AND credit_rating = "excellent" THEN buys_computer = "no"
IF age = "> 40" AND credit_rating = "fair" THEN buys_computer = "yes"
Ex. 4.2.2 : Build a decision tree using ID3 for the following "play ball" weather data set.

The weather attributes are outlook, temperature, humidity and wind. They can have the
following values :

Outlook     = {sunny, overcast, rain}
Temperature = {hot, mild, cool}
Humidity    = {high, normal}
Wind        = {weak, strong}

The sample data set S is :

| Day | Outlook  | Temperature | Humidity | Wind   | Play ball |
| 1   | Sunny    | Hot  | High   | Weak   | No  |
| 2   | Sunny    | Hot  | High   | Strong | No  |
| 3   | Overcast | Hot  | High   | Weak   | Yes |
| 4   | Rain     | Mild | High   | Weak   | Yes |
| 5   | Rain     | Cool | Normal | Weak   | Yes |
| 6   | Rain     | Cool | Normal | Strong | No  |
| 7   | Overcast | Cool | Normal | Strong | Yes |
| 8   | Sunny    | Mild | High   | Weak   | No  |
| 9   | Sunny    | Cool | Normal | Weak   | Yes |
| 10  | Rain     | Mild | Normal | Weak   | Yes |
| 11  | Sunny    | Mild | Normal | Strong | Yes |
| 12  | Overcast | Mild | High   | Strong | Yes |
| 13  | Overcast | Hot  | Normal | Weak   | Yes |
| 14  | Rain     | Mild | High   | Strong | No  |

Soln. :

We need to find which attribute will be the root node in our decision tree. The gain is
calculated for all four attributes using the formula for Gain(A).

Class P : Playball = "yes"
Class N : Playball = "no"
Total number of records = 14.
Count the number of records with "yes" class and "no" class :
number of records with "yes" class = 9 and "no" class = 5.

So Information gain = I(p, n) = - (p/(p+n)) log2 (p/(p+n)) - (n/(p+n)) log2 (n/(p+n))

I(p, n) = I(9, 5) = - (9/14) log2 (9/14) - (5/14) log2 (5/14)
        = 0.409 + 0.530 = 0.940

Step 1 : Compute the entropy for outlook : (Sunny, overcast, rain)

For outlook = sunny,
pi = with "yes" class = 2 and ni = with "no" class = 3
Therefore, I(pi, ni) = I(2, 3) = - (2/5) log2 (2/5) - (3/5) log2 (3/5) = 0.971.

Similarly, for the other outlook values, I(pi, ni) is calculated in Table P. 4.2.2(a) :

| Outlook  | pi | ni | I(pi, ni) |
| Sunny    | 2  | 3  | 0.971 |
| Overcast | 4  | 0  | 0     |
| Rain     | 3  | 2  | 0.971 |
Gain(T, Outlook)     = 0.246
Gain(T, Temperature) = 0.029
Gain(T, Humidity)    = 0.151
Gain(T, Wind)        = 0.048

Outlook attains the highest gain, so it is used as the decision attribute in the root
node. As outlook has only the values "sunny", "overcast" and "rain", the root node has
three branches (Fig. P. 4.2.2).

Step 2 :

As attribute outlook is at the root, we have to decide on the remaining three attributes
for the sunny branch node.

Consider outlook = sunny and count the number of tuples from the original given training
set :

Ssunny = {1, 2, 8, 9, 11} = 5 (from Table P. 4.2.2, outlook = sunny)

Table P. 4.2.2(b)

| Day | Outlook | Temperature | Humidity | Wind   | Play ball |
| 1   | Sunny   | Hot  | High   | Weak   | No  |
| 2   | Sunny   | Hot  | High   | Strong | No  |
| 8   | Sunny   | Mild | High   | Weak   | No  |
| 9   | Sunny   | Cool | Normal | Weak   | Yes |
| 11  | Sunny   | Mild | Normal | Strong | Yes |

Total number of "yes" tuples = 2 and total number of "no" tuples = 3 :
I(p, n) = I(2, 3) = - (2/5) log2 (2/5) - (3/5) log2 (3/5) = 0.971

(i) Compute the entropy for temperature : (Hot, mild, cool)

For temperature = Hot,
pi = with "yes" class = 0 and ni = with "no" class = 2
Therefore, I(pi, ni) = I(0, 2) = - (0/2) log2 (0/2) - (2/2) log2 (2/2) = 0.
Similarly, for the other temperature values, I(pi, ni) is calculated in Table P. 4.2.2(c) :

| Temperature | pi | ni | I(pi, ni) |
| Hot  | 0 | 2 | 0 |
| Mild | 1 | 1 | 1 |
| Cool | 1 | 0 | 0 |

E(temperature) = (2/5) I(0, 2) + (2/5) I(1, 1) + (1/5) I(1, 0) = 0.4
[Note : Ssunny is the training set for this branch.]

Hence
Gain(Tsunny, temperature) = I(p, n) - E(temperature) = 0.971 - 0.4 = 0.571

(ii) Compute the entropy for humidity : (High, normal)

For humidity = High,
pi = with "yes" class = 0 and ni = with "no" class = 3
Therefore, I(pi, ni) = I(0, 3) = - (0/3) log2 (0/3) - (3/3) log2 (3/3) = 0.

Table P. 4.2.2(d)

| Humidity | pi | ni | I(pi, ni) |
| High   | 0 | 3 | 0 |
| Normal | 2 | 0 | 0 |

E(humidity) = (3/5) I(0, 3) + (2/5) I(2, 0) = 0
Gain(Tsunny, humidity) = 0.971 - 0 = 0.971

(iii) Compute the entropy for wind : (Weak, strong)

For wind = weak,
pi = with "yes" class = 1 and ni = with "no" class = 2
Therefore, I(pi, ni) = I(1, 2) = - (1/3) log2 (1/3) - (2/3) log2 (2/3) = 0.918.

Similarly, for the other wind values, I(pi, ni) is calculated in Table P. 4.2.2(e) :

| Wind   | pi | ni | I(pi, ni) |
| Weak   | 1 | 2 | 0.918 |
| Strong | 1 | 1 | 1     |

E(wind) = (3/5) I(1, 2) + (2/5) I(1, 1) = 0.951
Hence
Gain(Tsunny, wind) = I(p, n) - E(wind) = 0.971 - 0.951 = 0.02
Therefore,
Gain(Tsunny, temperature) = 0.571
Gain(Tsunny, humidity)    = 0.971
Gain(Tsunny, wind)        = 0.02

Humidity has the highest gain; therefore, it is placed below outlook = "sunny", with the
branches high and normal (Fig. P. 4.2.2(a)).

Step 3 :

Consider now only temperature and wind for outlook = overcast and count the number of
tuples from the original given training set :

Tovercast = {3, 7, 12, 13} = 4 (from Table P. 4.2.2, outlook = overcast)

Table P. 4.2.2(f)

| Day | Outlook  | Temperature | Humidity | Wind   | Play ball |
| 3   | Overcast | Hot  | High   | Weak   | Yes |
| 7   | Overcast | Cool | Normal | Strong | Yes |
| 12  | Overcast | Mild | High   | Strong | Yes |
| 13  | Overcast | Hot  | Normal | Weak   | Yes |

Since for the attributes temperature and wind, playball = yes for every tuple, assign the
class "yes" directly to the branch outlook = overcast (Fig. P. 4.2.2(b)).
Step 4 :

Consider temperature and wind for outlook = rain and count the number of tuples from the
original given training set : Train = {4, 5, 6, 10, 14} = 5.

Class P : Playball = "yes"
Class N : Playball = "no"
Total number of records = 5.
Count the number of records with "yes" class and "no" class :
number of records with "yes" class = 3 and "no" class = 2.

I(p, n) = I(3, 2) = - (3/5) log2 (3/5) - (2/5) log2 (2/5) = 0.970

(i) Compute the entropy for wind

For wind = weak,
pi = with "yes" class = 3 and ni = with "no" class = 0
Therefore, I(pi, ni) = I(3, 0) = 0.

For wind = strong,
pi = with "yes" class = 0 and ni = with "no" class = 2
Therefore, I(pi, ni) = I(0, 2) = 0.

Table P. 4.2.2(h)

| Wind   | pi | ni | I(pi, ni) |
| Weak   | 3 | 0 | 0 |
| Strong | 0 | 2 | 0 |

E(wind) = (3/5) I(3, 0) + (2/5) I(0, 2) = 0
Hence Gain(Train, wind) = I(p, n) - E(wind) = 0.970 - 0 = 0.970

(ii) Compute the entropy for temperature : (Hot, mild, cool)

For temperature = Hot, pi = 0 and ni = 0, so I(0, 0) = 0.

Table P. 4.2.2(i)

| Temperature | pi | ni | I(pi, ni) |
| Hot  | 0 | 0 | 0     |
| Mild | 2 | 1 | 0.918 |
| Cool | 1 | 1 | 1     |

E(temperature) = (0/5) I(0, 0) + (3/5) I(2, 1) + (2/5) I(1, 1) = 0.951
Gain(Train, temperature) = 0.970 - 0.951 = 0.019
Gain(Train, wind)        = 0.970

Wind has the highest gain; therefore, it is placed below outlook = "rain", with the
branches weak and strong (Fig. P. 4.2.2(c)).

Therefore the final decision tree (Fig. P. 4.2.2(d)) is :
outlook at the root; below sunny the node humidity (high -> No, normal -> Yes);
overcast -> Yes; below rain the node wind (strong -> No, weak -> Yes).

The decision tree can also be expressed in rule format :

IF outlook = sunny AND humidity = high THEN playball = no
IF outlook = sunny AND humidity = normal THEN playball = yes
IF outlook = overcast THEN playball = yes
IF outlook = rain AND wind = strong THEN playball = no
IF outlook = rain AND wind = weak THEN playball = yes

Ex. 4.2.3 : A training data set of a stock market is given below. The class attribute
            Profit is based on age, competition and type.

| Age | Competition | Type | Profit |
| Old | Yes | Swr | Down |
| Old | No  | Swr | Down |
| Old | No  | Hwr | Down |
| Mid | Yes | Swr | Down |
| Mid | Yes | Hwr | Down |
| Mid | No  | Hwr | Up   |
| Mid | No  | Swr | Up   |
| New | Yes | Swr | Up   |
| New | No  | Hwr | Up   |
| New | No  | Swr | Up   |

Soln. :

In the stock market case the decision tree is : age is selected at the root
(age = old -> Down, age = new -> Up); for age = mid the attribute competition is tested
(competition = yes -> Down, competition = no -> Up).
Ex. 4.2.4 : Using ID3, build a decision tree for a training set of 12 records in which
            the class attribute Own_house (values "yes" / "rented") depends on the
            attributes income (very high, high, medium, low) and age (young, medium, old).

Soln. :

Class P : Own_house = "yes"
Class N : Own_house = "rented"
Total number of records = 12.
Count the number of records with "yes" class and "rented" class :
number of records with "yes" class = 7 and "rented" class = 5.

So Information gain = I(p, n) = - (p/(p+n)) log2 (p/(p+n)) - (n/(p+n)) log2 (n/(p+n))

I(p, n) = I(7, 5) = - (7/12) log2 (7/12) - (5/12) log2 (5/12) = 0.979

Step 1 : Compute the entropy for income : (Very high, high, medium, low)

For income = Very high,
pi = with "yes" class = 2 and ni = with "rented" class = 0
Therefore, I(pi, ni) = I(2, 0) = 0.

Similarly, for the other income values, I(pi, ni) is calculated in Table P. 4.2.4(a) :

| Income    | pi | ni | I(pi, ni) |
| Very high | 2 | 0 | 0     |
| High      | 4 | 0 | 0     |
| Medium    | 1 | 2 | 0.918 |
| Low       | 0 | 3 | 0     |

Calculate the entropy using the values from Table P. 4.2.4(a) and the formula given below :

E(A) = sum over all values v of ((pi + ni) / (p + n)) I(pi, ni)

E(income) = (2/12) I(2, 0) + (4/12) I(4, 0) + (3/12) I(1, 2) + (3/12) I(0, 3) = 0.229
[Note : S is the total training set.]

Hence
Gain(S, income) = I(p, n) - E(income) = 0.979 - 0.229 = 0.75

Step 2 : Compute the entropy for age : (Young, medium, old)

Similarly, for the different age values, I(pi, ni) is calculated in Table P. 4.2.4(b) :

| Age    | pi | ni | I(pi, ni) |
| Young  | 3 | 1 | 0.811 |
| Medium | 3 | 2 | 0.971 |
| Old    | 1 | 2 | 0.918 |
Calculate the entropy using the values from Table P. 4.2.4(b) :

E(age) = (4/12) I(3, 1) + (5/12) I(3, 2) + (3/12) I(1, 2) = 0.904
[Note : S is the total training set.]

Gain(S, age) = I(p, n) - E(age) = 0.979 - 0.904 = 0.075

The income attribute has the highest gain, therefore it is used as the decision attribute
in the root node. Since income has four possible values, the root node has four branches
(very high, high, medium, low) - Fig. P. 4.2.4.

Step 3 :

Since we have used income at the root, now we have to decide on the age attribute.

Consider income = "very high" and count the number of tuples from the original given
training set : Svery high = 2.

Since both the tuples have the class label "yes", directly give "yes" as the class label
below "very high".

Similarly, check the tuples for income = "high" and income = "low" : they have the class
labels "yes" and "rented" respectively, so those branches also become leaf nodes.

Now check income = "medium", where the number of tuples having the "yes" class label is 1
and the number of tuples having the "rented" class label is 2. So put the age node below
income = "medium".

So the final decision tree (Fig. P. 4.2.4(a)) is : income at the root;
very high -> yes, high -> yes, low -> rented, and medium -> age subtree.

Ex. 4.2.5 : A set of classified objects is given below. Apply ID3 to generate the
            decision tree.
Data set :

| Sr. | Color  | Outline | Dot | Shape |
| 1   | Green  | Dashed | No  | Triangle |
| 2   | Green  | Dashed | Yes | Triangle |
| 3   | Yellow | Dashed | No  | Square   |
| 4   | Red    | Dashed | No  | Square   |
| 5   | Red    | Solid  | No  | Square   |
| 6   | Red    | Solid  | Yes | Triangle |
| 7   | Green  | Solid  | No  | Square   |
| 8   | Green  | Dashed | No  | Triangle |
| 9   | Yellow | Solid  | Yes | Square   |
| 10  | Red    | Solid  | No  | Square   |
| 11  | Green  | Solid  | Yes | Square   |
| 12  | Yellow | Dashed | Yes | Square   |
| 13  | Yellow | Solid  | No  | Square   |
| 14  | Red    | Dashed | Yes | Triangle |

Soln. :

Class P : Shape = "Square"
Class N : Shape = "Triangle"
Total number of records = 14.
Count the number of records with "square" class and "triangle" class :
number of records with "square" class = 9 and "triangle" class = 5.

P(square)   = 9/14
P(triangle) = 5/14

So Information gain = I(p, n) = - (p/(p+n)) log2 (p/(p+n)) - (n/(p+n)) log2 (n/(p+n))

I(p, n) = I(9, 5) = - (9/14) log2 (9/14) - (5/14) log2 (5/14) = 0.940

Step 1 : Compute the entropy for color : (Red, green, yellow)

For color = Red,
pi = with "square" class = 3 and ni = with "triangle" class = 2
Therefore, I(pi, ni) = I(3, 2) = 0.971.

Similarly, for the other color values, I(pi, ni) is calculated in Table P. 4.2.5(a) :

| Color  | pi | ni | I(pi, ni) |
| Red    | 3 | 2 | 0.971 |
| Green  | 2 | 3 | 0.971 |
| Yellow | 4 | 0 | 0     |

E(color) = (5/14) I(3, 2) + (5/14) I(2, 3) + (4/14) I(4, 0) = 0.694
[Note : S is the total training set.]

Hence Gain(S, color) = I(p, n) - E(color) = 0.940 - 0.694 = 0.246

Step 2 : Compute the entropy for outline : (Dashed, solid)

Similarly, for the different outline values, I(pi, ni) is calculated in Table P. 4.2.5(b) :

| Outline | pi | ni | I(pi, ni) |
| Dashed  | 3 | 4 | 0.985 |
| Solid   | 6 | 1 | 0.621 |

E(outline) = (7/14) I(3, 4) + (7/14) I(6, 1) = 0.803
Hence Gain(S, outline) = 0.940 - 0.803 = 0.137

Step 3 : Compute the entropy for dot : (No, yes)

Similarly, for the different dot values, I(pi, ni) is calculated in Table P. 4.2.5(c) :

| Dot | pi | ni | I(pi, ni) |
| No  | 6 | 2 | 0.811 |
| Yes | 3 | 3 | 1     |

E(dot) = (8/14) I(6, 2) + (6/14) I(3, 3) = 0.892
Hence Gain(S, dot) = 0.940 - 0.892 = 0.048
Gain(S, color)   = 0.246
Gain(S, outline) = 0.137
Gain(S, dot)     = 0.048

As color has the highest gain, it should be the root node.

Step 4 : As attribute color is at the root, we have to decide on the remaining two
attributes for the red branch node.

Consider color = red and count the number of tuples from the original given training set.
Table P. 4.2.5(d) : Sred = 5

| Sr. | Color | Outline | Dot | Shape |
| 1   | Red   | Dashed | No  | Square   |
| 2   | Red   | Solid  | No  | Square   |
| 3   | Red   | Solid  | Yes | Triangle |
| 4   | Red   | Solid  | No  | Square   |
| 5   | Red   | Dashed | Yes | Triangle |

I(p, n) = I(3, 2) = - (3/5) log2 (3/5) - (2/5) log2 (2/5) = 0.971

(i) Compute the entropy for outline : (Dashed, solid)

Similarly, for the different outline values, I(pi, ni) is calculated in Table P. 4.2.5(e) :

| Outline | pi | ni | I(pi, ni) |
| Dashed  | 1 | 1 | 1     |
| Solid   | 2 | 1 | 0.918 |

E(outline) = (2/5) I(1, 1) + (3/5) I(2, 1) = 0.951
Hence Gain(Sred, outline) = 0.971 - 0.951 = 0.02

(ii) Compute the entropy for dot : (No, yes)

Similarly, for the different dot values, I(pi, ni) is calculated in Table P. 4.2.5(f) :

| Dot | pi | ni | I(pi, ni) |
| No  | 3 | 0 | 0 |
| Yes | 0 | 2 | 0 |

E(dot) = (3/5) I(3, 0) + (2/5) I(0, 2) = 0
Hence Gain(Sred, dot) = 0.971 - 0 = 0.971

Dot has the highest gain for the branch color = "red".
Check the tuples with dot = "yes" from sample Sred : they have the class triangle.
Check the tuples with dot = "no" from sample Sred : they have the class square.

So the partial tree for the red colour branch is : color = red -> dot
(dot = yes -> Triangle, dot = no -> Square).
Step 5 : Consider color = Yellow and count the number of tuples from the original given
training set.

Table P. 4.2.5(g) : Syellow = 4

| Sr. | Color  | Outline | Dot | Shape |
| 1   | Yellow | Dashed | No  | Square |
| 2   | Yellow | Solid  | Yes | Square |
| 3   | Yellow | Dashed | Yes | Square |
| 4   | Yellow | Solid  | No  | Square |

Since all the tuples have the class "square", directly assign the class "Square" to the
branch color = "yellow".

Step 6 : Consider color = Green and count the number of tuples from the original given
training set (Sgreen = 5).

Check the tuples with outline = "dashed" from sample Sgreen : they have the class triangle.
Check the tuples with outline = "solid" from sample Sgreen : they have the class square.

Therefore the final decision tree is : color at the root; red -> dot
(yes -> Triangle, no -> Square); yellow -> Square; green -> outline
(dashed -> Triangle, solid -> Square).

Ex. 4.2.6 : Apply the statistical-based (naive Bayesian) algorithm to classify the tuple
            t = {Adam, M, 1.95 m} using the following training data for height
            classification.

| Person ID | Name     | Gender | Height | Class  |
| 1 | Kristina | Female | 1.6 m  | Short  |
| 2 | Jim      | Male   | 2 m    | Tall   |
| 3 | Maggie   | Female | 1.9 m  | Medium |
| 4 | Martha   | Female | 1.85 m | Medium |
| 5 | John     | Male   | 2.8 m  | Tall   |
| 6 | Bob      | Male   | 1.7 m  | Short  |
| 7 | Clinton  | Male   | 1.8 m  | Medium |
| 8 | Nyssa    | Female | 1.6 m  | Short  |
| 9 | Katty    | Female | 1.6 m  | Short  |
Soln. :

P(Short)  = 4/9
P(Medium) = 3/9
P(Tall)   = 2/9

Divide the height attribute into six ranges as given below :

[0, 1.6], [1.6, 1.7], [1.7, 1.8], [1.8, 1.9], [1.9, 2.0], [2.0, infinity]

The gender attribute has only two values, Male and Female.
Total number of short persons = 4, medium = 3, tall = 2.

Prepare the table of class-conditional probabilities P(gender | class) and
P(height range | class) from the counts within each class, and use these values to
classify the new tuple t = {Adam, M, 1.95 m} :

P(t | Short)  = P(M | Short)  x P([1.9, 2.0] | Short)  = (1/4) x 0 = 0
P(t | Medium) = P(M | Medium) x P([1.9, 2.0] | Medium) = (1/3) x 0 = 0
P(t | Tall)   = P(M | Tall)   x P([1.9, 2.0] | Tall)   = 1 x (1/2) = 0.5

Therefore the likelihood of being short  = P(t | Short) x P(Short)   = 0 x 4/9   = 0
          the likelihood of being medium = P(t | Medium) x P(Medium) = 0 x 3/9   = 0
          the likelihood of being tall   = P(t | Tall) x P(Tall)     = 0.5 x 2/9 = 0.11

Then estimate P(t) by adding the individual likelihood values, since t can be either
short, medium or tall :

P(t) = 0 + 0 + 0.11 = 0.11

P(Short | t) = 0, P(Medium | t) = 0 and P(Tall | t) = 0.11 / 0.11 = 1, so the new tuple
is classified as Tall, as it has the highest probability.

Ex. 4.2.7 : A training data set of 10 records describes the transportation mode chosen by
            customers in a city, based on gender, car ownership, travel cost ($)/km and
            income level. The question is : what transportation mode would Alex, Buddy
            and Cherry use?

Soln. :

Class P : Transportation mode = "Bus"
Class Q : Transportation mode = "Train"
Class N : Transportation mode = "Car"

Total number of records = 10.
Number of records with "Bus" class   = 4
Number of records with "Train" class = 3
Number of records with "Car" class   = 3

So, for three classes,

Information gain = I(p, q, n) = - (p/(p+q+n)) log2 (p/(p+q+n))
                                - (q/(p+q+n)) log2 (q/(p+q+n))
                                - (n/(p+q+n)) log2 (n/(p+q+n))

I(p, q, n) = I(4, 3, 3) = - (0.4)(-1.322) - (0.3)(-1.737) - (0.3)(-1.737)
           = 0.529 + 0.521 + 0.521 = 1.571

Step 1 : Compute the entropy of gender : (Male, female)

For gender = Male,
I(pi, qi, ni) = I(3, 1, 1) = - (3/5) log2 (3/5) - (1/5) log2 (1/5) - (1/5) log2 (1/5)
              = 1.371

Proceeding in the same way for the remaining attributes gives :

Gain(S, Car ownership)      = 0.535
Gain(S, Travel cost ($)/km) = 1.216
Gain(S, Income level)       = 0.696

The travel cost ($)/km attribute has the highest gain, therefore it is used as the
decision attribute in the root node.

Since travel cost ($)/km has three possible values (cheap, standard, expensive), the root
node has three branches.

Since all the tuples with travel cost ($)/km = expensive have class "Car", assign the
class "Car" to the expensive branch; since all the tuples with travel cost ($)/km =
standard have class "Train", assign the class "Train" to the standard branch
(Fig. P. 4.2.7).

Consider travel cost ($)/km = cheap and count the number of tuples from the original
given training set :

Scheap = 5
Table P. 4.2.7(a) : Tuples with travel cost ($)/km = cheap

| Gender | Car ownership | Travel cost ($)/km | Income level | Transportation mode |
| Male   | 0 | Cheap | Low    | Bus   |
| Male   | 1 | Cheap | Medium | Bus   |
| Female | 1 | Cheap | Medium | Train |
| Female | 0 | Cheap | Low    | Bus   |
| Male   | 1 | Cheap | Medium | Bus   |

For this subset p = 4 (Bus), q = 1 (Train), n = 0 (Car), so I(p, q, n) = I(4, 1, 0) = 0.722.

(i) Compute the entropy for gender : (Male, female)

For gender = Male,
pi = with "Bus" class = 3, qi = with "Train" class = 0 and ni = with "Car" class = 0
Therefore, I(pi, qi, ni) = I(3, 0, 0) = 0.
For gender = Female, I(pi, qi, ni) = I(1, 1, 0) = 1.

E(gender) = (3/5) I(3, 0, 0) + (2/5) I(1, 1, 0) = 0.4
Hence Gain(Scheap, gender) = I(p, q, n) - E(gender) = 0.722 - 0.4 = 0.322

(ii) Compute the entropy for car ownership : (0, 1)

For car ownership = 0, both tuples have class "Bus", so I(2, 0, 0) = 0;
for car ownership = 1, the three tuples give I(2, 1, 0) = 0.918.

E(car ownership) = (2/5) I(2, 0, 0) + (3/5) I(2, 1, 0) = 0.551
Gain(Scheap, car ownership) = 0.722 - 0.551 = 0.171

(iii) Compute the entropy for income level : (Low, medium, high)

For income level = Low,
pi = with "Bus" class = 2, qi = with "Train" class = 0 and ni = with "Car" class = 0
Therefore, I(pi, qi, ni) = I(2, 0, 0) = 0; for income level = Medium, I(2, 1, 0) = 0.918.

E(income level) = (2/5) I(2, 0, 0) + (3/5) I(2, 1, 0) = 0.551
Gain(Scheap, income level) = 0.722 - 0.551 = 0.171

Therefore, since gender has the highest gain, it comes below the cheap branch.
For gender = Male every tuple has transportation mode = Bus, so the male branch becomes a
"Bus" leaf (Fig. P. 4.2.7(a) and Fig. P. 4.2.7(b)).
Syllabus Topic : Bayesian Classification - Naive Bayes

4.2.2 Bayesian Classification : Naive Bayes Classifier

4.2.2(A) Bayes Theorem

— It is also known as Bayes Rule.
— Bayes theorem is used to find conditional probabilities.
— The conditional probability of an event is a likelihood obtained with the additional
  information that some other event has previously occurred.
— P(X|Y) is the conditional probability of event X occurring, given that event Y has
  already occurred :

  P(X|Y) = P(X and Y) / P(Y)

— An initial probability is called an apriori probability, which we get before any
  additional information is obtained.
— A posterior probability is the probability value which we get or revise after
  additional information is obtained.

4.2.2(B) Basics of Bayesian Classification

— Probabilistic learning : Explicit probabilities are calculated for hypotheses.
— Incremental : The probability that a hypothesis is correct can be incrementally
  increased or decreased by each training example.
— Probabilistic prediction : Multiple hypotheses can be predicted, weighted by their
  probabilities.
— Meta-classification : The outputs of several classifiers can be combined, e.g. by
  multiplying the probabilities that all classifiers predict for a given class.
— Standard : Even where computationally intractable, Bayesian methods provide a standard
  of optimal decision making against which other methods can be measured.

— Given training data D, the posteriori probability of a hypothesis h, P(h|D), follows
  the Bayes theorem :

  P(h|D) = P(D|h) P(h) / P(D)

  P(h)   : Independent probability of h : prior probability
  P(D)   : Independent probability of D
  P(D|h) : Conditional probability of D given h
  P(h|D) : Conditional probability of h given D

Practical difficulties

— Requires initial knowledge of many probabilities.
— Significant computational cost.

4.2.2(C) Naive Bayes Classifier : Example
In this example, the following "play tennis" training set is used :

| Outlook  | Temperature | Humidity | Windy | Play |
| Sunny    | hot  | High   | false | No  |
| Sunny    | hot  | High   | true  | No  |
| Overcast | hot  | High   | false | Yes |
| Rain     | mild | High   | false | Yes |
| Rain     | cool | Normal | false | Yes |
| Rain     | cool | Normal | true  | No  |
| Overcast | cool | Normal | true  | Yes |
| Sunny    | mild | High   | false | No  |
| Sunny    | cool | Normal | false | Yes |
| Rain     | mild | Normal | false | Yes |
| Sunny    | mild | Normal | true  | Yes |
| Overcast | mild | High   | true  | Yes |
| Overcast | hot  | Normal | false | Yes |
| Rain     | mild | High   | true  | No  |

— Given a training set, we can compute the probabilities as follows.
— The classification problem may be formalized using a posteriori probabilities :
  P(C|Y) is the probability that the sample tuple Y = <y1, ..., yk> is of class C.
— Assign to sample Y the class label C such that P(C|Y) is maximal.

From the above given sample data, calculate the probabilities for play tennis (Yes) and
do not play tennis (No) for all attributes :

Outlook     : P(sunny|Yes) = 2/9      P(sunny|No) = 3/5      P(Yes) = 9/14
              P(overcast|Yes) = 4/9   P(overcast|No) = 0     P(No)  = 5/14
              P(rain|Yes) = 3/9       P(rain|No) = 2/5
Temperature : P(hot|Yes) = 2/9        P(hot|No) = 2/5
              P(mild|Yes) = 4/9       P(mild|No) = 2/5
              P(cool|Yes) = 3/9       P(cool|No) = 1/5
Humidity    : P(high|Yes) = 3/9       P(high|No) = 4/5
              P(normal|Yes) = 6/9     P(normal|No) = 1/5
Windy       : P(true|Yes) = 3/9       P(true|No) = 3/5
              P(false|Yes) = 6/9      P(false|No) = 2/5

— An unseen sample Y = <rain, hot, high, false> :

P(Y|Yes) . P(Yes) = P(rain|Yes) . P(hot|Yes) . P(high|Yes) . P(false|Yes) . P(Yes)
                  = 3/9 . 2/9 . 3/9 . 6/9 . 9/14 = 0.010582

P(Y|No) . P(No)   = P(rain|No) . P(hot|No) . P(high|No) . P(false|No) . P(No)
                  = 2/5 . 2/5 . 4/5 . 2/5 . 5/14 = 0.018286

— Choose the class that maximizes this probability. This means that the new instance will
  be classified as No (don't play).
— Sample Y is classified in class No (i.e. don't play).

— Another unseen sample Y = <sunny, cool, high, true> :

P(Y|Yes) . P(Yes) = P(sunny|Yes) . P(cool|Yes) . P(high|Yes) . P(true|Yes) . P(Yes)
                  = 2/9 . 3/9 . 3/9 . 3/9 . 9/14 = 0.0053

P(Y|No) . P(No)   = P(sunny|No) . P(cool|No) . P(high|No) . P(true|No) . P(No)
                  = 3/5 . 1/5 . 4/5 . 3/5 . 5/14 = 0.0206

— Choose the class that maximizes this probability, so this instance is also classified
  as No (don't play).
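The same scoring can be written as a short program. This is only a minimal sketch (Python, and the variable names priors / cond are assumptions, not from the text); it reproduces the two products for the unseen sample <rain, hot, high, false> and picks the class with the larger value.

```python
# Naive Bayes scoring for the play-tennis example: P(Y|C) * P(C) for each class C.
priors = {"Yes": 9/14, "No": 5/14}
cond = {   # P(attribute value | class), taken from the tables above
    "Yes": {"rain": 3/9, "hot": 2/9, "high": 3/9, "false": 6/9},
    "No":  {"rain": 2/5, "hot": 2/5, "high": 4/5, "false": 2/5},
}

sample = ["rain", "hot", "high", "false"]

scores = {}
for cls in priors:
    score = priors[cls]
    for value in sample:
        score *= cond[cls][value]          # naive (conditional independence) assumption
    scores[cls] = score

print(scores)                               # {'Yes': 0.0105..., 'No': 0.0182...}
print(max(scores, key=scores.get))          # 'No'  -> don't play
```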
Ex. 4.2.9 : A data set of 10 cars describes the colour (Red / Yellow), type (SUV / Sports)
            and origin (Domestic / Imported) of each car, together with the class
            attribute telling whether the car was stolen (Yes / No); each class has 5
            records. We want to classify the example <Red, Domestic, SUV> using a naive
            Bayes classifier.

Soln. :

From the training set, the class-conditional probabilities are :

P(Red|Yes)      = 3/5    P(Red|No)      = 2/5
P(Yellow|Yes)   = 2/5    P(Yellow|No)   = 3/5
P(SUV|Yes)      = 1/5    P(SUV|No)      = 3/5
P(Sports|Yes)   = 4/5    P(Sports|No)   = 2/5
P(Domestic|Yes) = 2/5    P(Domestic|No) = 3/5
P(Imported|Yes) = 3/5    P(Imported|No) = 2/5

An unseen sample X = <Red, Domestic, SUV> :

P(X|Yes) . P(Yes) = P(Red|Yes) . P(Domestic|Yes) . P(SUV|Yes) . P(Yes)
                  = 3/5 x 2/5 x 1/5 x 5/10 = 0.024

P(X|No) . P(No)   = P(Red|No) . P(Domestic|No) . P(SUV|No) . P(No)
                  = 2/5 x 3/5 x 3/5 x 5/10 = 0.072

Since 0.072 > 0.024, our example gets classified as 'NO' (not stolen).

Ex. 4.2.10 : Consider the following data set S, which contains 8 observations of sunburn
             (attributes : hair colour, height, weight, lotion used) :

| Name  | Hair   | Height  | Weight  | Lotion | Class     |
| Sarah | Blonde | -       | Light   | No     | Sunburned |
| -     | Blonde | Tall    | Average | Yes    | None      |
| Alex  | Brown  | Short   | Average | Yes    | None      |
| Annie | Blonde | Short   | Average | No     | Sunburned |
| -     | -      | Average | Heavy   | No     | Sunburned |
| -     | -      | Tall    | Heavy   | No     | None      |
| -     | -      | Average | Heavy   | No     | None      |
| -     | -      | Short   | Light   | Yes    | None      |

Predict the result for the tuple X = <Brown, Tall, Average, No> using naive Bayesian
classification.
Soln. :

From the training set :

P(Sunburned) = 3/8          P(None) = 5/8

P(Brown|Sunburned)   = 0    P(Blonde|None) = 2/5
P(Average|Sunburned) = 2/3  P(Tall|None)   = 2/5
P(Tall|Sunburned)    = 0    P(Short|None)  = 2/5
P(Short|Sunburned)   = 1/3  P(No|None)     = 2/5
P(Light|Sunburned)   = 1/3  P(Yes|None)    = 3/5
P(No|Sunburned)      = 3/3
P(Yes|Sunburned)     = 0

An unseen sample X = <Brown, Tall, Average, No> :

P(X|Sunburned) . P(Sunburned) = P(Brown|Sunburned) . P(Tall|Sunburned)
                                . P(Average|Sunburned) . P(No|Sunburned) . P(Sunburned)
                              = 0

P(X|None) . P(None) = P(Brown|None) . P(Tall|None) . P(Average|None) . P(No|None) . P(None)
                    = 0.032

Since 0.032 > 0, our example gets classified as 'NONE'.
Ex. 4.2.11 : Predict the class label of an unknown sample using naive Bayesian
             classification on the following training data set from the all electronics
             customer database.

| Age     | Income | Student | Credit_rating | Class : buys_computer |
| <=30    | High   | No  | Fair      | No  |
| <=30    | High   | No  | Excellent | No  |
| 31...40 | High   | No  | Fair      | Yes |
| >40     | Medium | No  | Fair      | Yes |
| >40     | Low    | Yes | Fair      | Yes |
| >40     | Low    | Yes | Excellent | No  |
| 31...40 | Low    | Yes | Excellent | Yes |
| <=30    | Medium | No  | Fair      | No  |
| <=30    | Low    | Yes | Fair      | Yes |
| >40     | Medium | Yes | Fair      | Yes |
| <=30    | Medium | Yes | Excellent | Yes |
| 31...40 | Medium | No  | Excellent | Yes |
| 31...40 | High   | Yes | Fair      | Yes |
| >40     | Medium | No  | Excellent | No  |

Soln. :

The unknown sample is X = {age = "<=30", income = "Medium", student = "Yes",
credit_rating = "Fair"}.

Age    : P(<=30|Yes) = 2/9       P(<=30|No) = 3/5
         P(31...40|Yes) = 4/9    P(31...40|No) = 0
         P(>40|Yes) = 3/9        P(>40|No) = 2/5
Income : P(High|Yes) = 2/9       P(High|No) = 2/5
         P(Medium|Yes) = 4/9     P(Medium|No) = 2/5
         P(Low|Yes) = 3/9        P(Low|No) = 1/5
Student       : P(Yes|Yes) = 6/9        P(Yes|No) = 1/5
                P(No|Yes) = 3/9         P(No|No) = 4/5
Credit_rating : P(Fair|Yes) = 6/9       P(Fair|No) = 2/5
                P(Excellent|Yes) = 3/9  P(Excellent|No) = 3/5

The unseen sample X = <age = "<=30", income = "Medium", student = "Yes",
credit_rating = "Fair"> :

P(X|Yes) . P(Yes) = P(age = "<=30"|Yes) . P(income = Medium|Yes)
                    . P(student = "yes"|Yes) . P(credit_rating = "fair"|Yes) . P(Yes)
                  = 2/9 . 4/9 . 6/9 . 6/9 . 9/14 = 0.028

P(X|No) . P(No)   = P(age = "<=30"|No) . P(income = Medium|No)
                    . P(student = "yes"|No) . P(credit_rating = "fair"|No) . P(No)
                  = 3/5 . 2/5 . 1/5 . 2/5 . 5/14 = 0.007

Since 0.028 > 0.007, the naive Bayesian classifier predicts buys_computer = "Yes" for
sample X.

Ex. 4.2.12 : Using naive Bayesian classification on the following training set, classify
             the unseen tuple (Refund = No, Marital status = Married, Taxable income = 120K).

| Tid | Refund | Marital status | Taxable income | Evade |
| 1 | Yes | Single   | 125K | No  |
| 2 | No  | Married  | 100K | No  |
| 3 | No  | Single   | 120K | No  |
| 4 | Yes | Married  | 70K  | No  |
| 5 | No  | Divorced | 95K  | Yes |
| 6 | No  | Married  | 60K  | No  |
| 7 | Yes | Divorced | 220K | No  |
| 8  | No | Single  | 85K | Yes |
| 9  | No | Married | 75K | No  |
| 10 | No | Single  | 90K | Yes |

P(No) = 7/10, P(Yes) = 3/10.

Soln. :

Given the test record X = (Refund = No, Married, Income = 120K) :

P(X|Class = No)  = P(Refund = No|Class = No) x P(Married|Class = No)
                   x P(Income = 120K|Class = No)
                 = 4/7 x 4/7 x 1/7 = 0.0466

P(X|Class = Yes) = P(Refund = No|Class = Yes) x P(Married|Class = Yes)
                   x P(Income = 120K|Class = Yes)
                 = 3/3 x 0 x 0 = 0

Since P(X|No) P(No) > P(X|Yes) P(Yes), i.e. 0.0466 x 7/10 > 0 x 3/10,
P(No|X) > P(Yes|X), so the tuple is assigned Class = No.

Ex. 4.2.13 : Using naive Bayesian classification, classify the unseen tuple
             A = (Give birth = No, Can fly = No, Live in water = Yes, Have legs = No)
             as "mammal" (M) or "non-mammal" (N), given a training set of 20 animals
             (7 mammals and 13 non-mammals) described by these four attributes.
Soln. :

P(A|M) P(M) = 0.05 x 7/20   = 0.0175
P(A|N) P(N) = 0.008 x 13/20 = 0.0052

Since P(A|M) P(M) > P(A|N) P(N), the unseen record belongs to the class mammals.

Ex. 4.2.14 : Naive Bayesian classification is to be applied to the following data set,
             which helps to predict the RISK of a loan application based on the
             applicant's CREDIT HISTORY, DEBT and INCOME. Which RISK class will be
             returned by naive Bayes?
| CREDIT HISTORY | DEBT | INCOME   | RISK     |
| -       | Low  | 0 to 15  | High     |
| -       | High | 15 to 35 | High     |
| -       | Low  | 0 to 15  | High     |
| Unknown | High | 15 to 35 | High     |
| Unknown | High | 0 to 15  | High     |
| Good    | High | 0 to 15  | High     |
| Bad     | Low  | Over 35  | Moderate |
| Unknown | Low  | 15 to 35 | Moderate |
| Good    | High | 15 to 35 | Moderate |
| Unknown | Low  | Over 35  | Low      |
| Unknown | Low  | Over 35  | Low      |
| Good    | Low  | Over 35  | Low      |
| -       | High | Over 35  | Low      |
| -       | High | Over 35  | Low      |

Soln. : Compute the class priors P(RISK = High), P(RISK = Moderate) and P(RISK = Low) and
the class-conditional probabilities of each attribute value from the table, exactly as in
the previous examples; the RISK class with the largest product P(X|RISK) P(RISK) for the
given applicant is the class returned by naive Bayes.

Ex. 4.2.15 : Using the given table, create a classification model using any algorithm and
             classify the tuple X = <Income = medium, Credit = good>.

| Sr. | Income    | Credit    | Decision    |
| 1  | Very High | Excellent | AUTHORIZE   |
| 2  | High      | Good      | AUTHORIZE   |
| 3  | Medium    | Excellent | AUTHORIZE   |
| 4  | High      | Good      | AUTHORIZE   |
| 5  | Very High | Good      | AUTHORIZE   |
| 6  | Medium    | Excellent | AUTHORIZE   |
| 7  | High      | Bad       | REQUEST ID  |
| 8  | Medium    | Bad       | REQUEST ID  |
| 9  | High      | Bad       | REJECT      |
| 10 | Low       | Bad       | CALL POLICE |
Soln. :

P(Very High|AUTHORIZE) = 2/6     P(Very High|REQUEST ID) = 0
P(High|AUTHORIZE)      = 2/6     P(High|REQUEST ID)      = 1/2
P(Medium|AUTHORIZE)    = 2/6     P(Medium|REQUEST ID)    = 1/2
P(Low|AUTHORIZE)       = 0       P(Low|REQUEST ID)       = 0
P(Excellent|AUTHORIZE) = 3/6     P(Bad|REQUEST ID)       = 1
P(Good|AUTHORIZE)      = 3/6     P(Good|REQUEST ID)      = 0

P(AUTHORIZE)   = 6/10
P(REQUEST ID)  = 2/10
P(REJECT)      = 1/10
P(CALL POLICE) = 1/10

For the tuple X = <Income = medium, Credit = good> :

P(X|AUTHORIZE) x P(AUTHORIZE)     = 2/6 x 3/6 x 6/10 = 0.1
P(X|REQUEST ID) x P(REQUEST ID)   = 0
P(X|REJECT) x P(REJECT)           = 0
P(X|CALL POLICE) x P(CALL POLICE) = 0

Therefore the naive Bayesian classifier predicts decision = "AUTHORIZE" for the tuple X.

Ex. 4.2.16 : Given the training data for height classification, classify the tuple
             t = <Rohit, M, 1.95 m> using Bayesian classification.

| Name    | Gender | Height | Output |
| -       | F      | 1.6 m  | Short  |
| Jatin   | M      | 2 m    | Tall   |
| Madhuri | F      | 1.09 m | Medium |
| Manisha | F      | 1.88 m | Medium |
| Shilpa  | F      | 1.7 m  | Short  |
| Bobby     | M | 1.85 m | Medium |
| Kavita    | F | 1.6 m  | Short  |
| Dinesh    | M | 1.7 m  | Short  |
| Rahul     | M | 2.2 m  | Tall   |
| Shree     | M | 2.1 m  | Tall   |
| Divya     | F | 1.8 m  | Medium |
| Tushar    | M | 1.95 m | Medium |
| Kim       | F | 1.9 m  | Medium |
| Aarti     | F | 1.8 m  | Medium |
| Rajashree | F | 1.75 m | Medium |

Soln. : Divide the height attribute into six ranges as given below :

[0, 1.6], [1.61, 1.7], [1.71, 1.8], [1.81, 1.9], [1.91, 2.0], [2.1, infinity]

The gender attribute has only two values, Male and Female.
Total number of short persons = 4, medium = 8, tall = 3.

Prepare the probability table as given below :

P(F|Short) = 3/4    P(F|Medium) = 6/8    P(F|Tall) = 0
P(M|Short) = 1/4    P(M|Medium) = 2/8    P(M|Tall) = 3/3

P([0, 1.6]|Short)       = 2/4   P([0, 1.6]|Medium)       = 1/8   P([0, 1.6]|Tall)       = 0
P([1.61, 1.7]|Short)    = 2/4   P([1.61, 1.7]|Medium)    = 0     P([1.61, 1.7]|Tall)    = 0
P([1.71, 1.8]|Short)    = 0     P([1.71, 1.8]|Medium)    = 3/8   P([1.71, 1.8]|Tall)    = 0
P([1.81, 1.9]|Short)    = 0     P([1.81, 1.9]|Medium)    = 3/8   P([1.81, 1.9]|Tall)    = 0
P([1.91, 2.0]|Short)    = 0     P([1.91, 2.0]|Medium)    = 1/8   P([1.91, 2.0]|Tall)    = 1/3
P([2.1, infinity]|Short)= 0     P([2.1, infinity]|Medium)= 0     P([2.1, infinity]|Tall)= 2/3
The unseen tuple is X = {Name = Rohit, Gender = M, Height = 1.95 m}.

P(X|Short) . P(Short)   = P(M|Short) x P([1.91, 2.0]|Short) x P(Short)
                        = 1/4 x 0 x 4/15 = 0

P(X|Medium) . P(Medium) = P(M|Medium) x P([1.91, 2.0]|Medium) x P(Medium)
                        = 2/8 x 1/8 x 8/15 = 0.016

P(X|Tall) . P(Tall)     = P(M|Tall) x P([1.91, 2.0]|Tall) x P(Tall)
                        = 3/3 x 1/3 x 3/15 = 0.067

Since 0.067 > 0.016, the naive Bayesian classifier predicts that Rohit is Tall.
4.2.3 Rule based Classification

— Rule-based classification is characterised by building rules based on object attributes.
— Rule-based classification is a powerful tool for feature extraction, often performing
  better than supervised classification for many feature types.
— The learned model is represented as a set of IF-THEN rules.
— An IF-THEN rule is expressed in the form

  IF condition THEN conclusion.

— The LHS or "IF" part of the rule is called the "rule antecedent" or "precondition".
— The RHS or "THEN" part is called the "rule consequent".
— Example : IF age = young AND salary = high THEN loan = yes
— This can also be written as :

  (age = young) ^ (salary = high) => loan = yes

— A rule R can be assessed by its coverage and accuracy :

  coverage(R) = ncovers / |D|
  accuracy(R) = ncorrect / ncovers

  where ncovers is the number of tuples covered by R, ncorrect is the number of tuples
  correctly classified by R, and |D| is the number of tuples in the data set.
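A small sketch of how the coverage and accuracy of a rule could be computed is given below (Python; the rule representation and the four example tuples are illustrative assumptions, not from the text).

```python
# Coverage and accuracy of the rule: IF age = young AND salary = high THEN loan = yes.
data = [
    {"age": "young", "salary": "high", "loan": "yes"},
    {"age": "young", "salary": "high", "loan": "no"},
    {"age": "young", "salary": "low",  "loan": "no"},
    {"age": "old",   "salary": "high", "loan": "yes"},
]

antecedent = lambda t: t["age"] == "young" and t["salary"] == "high"
consequent = lambda t: t["loan"] == "yes"

covered   = [t for t in data if antecedent(t)]           # tuples the rule covers
n_covers  = len(covered)
n_correct = sum(1 for t in covered if consequent(t))     # covered and correctly classified

coverage = n_covers / len(data)      # ncovers / |D|       -> 0.5
accuracy = n_correct / n_covers      # ncorrect / ncovers  -> 0.5
print(coverage, accuracy)
```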
Conflict resolution

— If more than one rule is triggered for a tuple, a conflict-resolution strategy is
  needed to decide which rule to apply.
— Size ordering : based on size, give the highest priority to the triggered rule that has
  the maximum number of attribute tests.
— Rule ordering : make a decision list by ordering the rules in advance, based on some
  measure of rule quality or with the help of experts.

Rule extraction from a decision tree

— Extract the rules from the decision tree, as it is easy to predict using rules.
— Once the decision tree is created, the rules extracted from it are easier to understand
  than the complex tree.
— For every path of the tree from the root node to a leaf node, create one rule.
— The last node or leaf node gives the class label.

Fig. 4.2.2 : Decision tree for "buys_computer"
Rule extraction from the above buys_computer decision tree :

1. IF age = "<=30" AND student = "no" THEN buys_computer = "no"
2. IF age = "<=30" AND student = "yes" THEN buys_computer = "yes"
3. IF age = "31...40" THEN buys_computer = "yes"
4. IF age = "> 40" AND credit_rating = "excellent" THEN buys_computer = "no"
5. IF age = "> 40" AND credit_rating = "fair" THEN buys_computer = "yes"
4.2.4 Other Classification Methods

1. k-nearest neighbour classifier
2. Case-based reasoning
3. Genetic algorithms
4. Rough set approach
5. Fuzzy set approaches

Syllabus Topic : Prediction - Linear Regression, Multiple Linear Regression

4.3 Prediction

— Prediction is the task of modelling continuous-valued functions.
— The most commonly used method for prediction is regression.
4.3.1 Structure of a Regression Model

— A regression model represents reality by using a system of equations.
— A regression model explains the relationship between variables and also enables
  quantification of these relationships.
— It determines the strength of the relationship between one dependent variable and the
  other independent variables using some statistical measure.
— The dependent variable is usually denoted by Y.
— The two basic types of regression are :

  1. Linear regression
  2. Multiple regression

— The general form of regression is :

  Linear regression   : Y = m + b X + u
  Multiple regression : Y = m + b1 X1 + b2 X2 + ... + bn Xn + u

  Where :
  Y = The dependent variable which we are trying to predict
  X = The independent variable that we are using to predict Y
  m = The intercept
  b = The slope
  u = The regression residual

— In multiple regression, each independent variable is differentiated with a subscripted
  number.
— Regression uses a group of random variables for prediction and finds a mathematical
  relationship between them. This relationship is depicted in the form of a line (linear
  regression) that approximates all the points in the best way.
— Regression may be used, for example, to determine the price of a commodity, interest
  rates, or the price movement of an asset influenced by industries or sectors.
4.3.2 Linear Regression

— Regression tries to find the mathematical relationship between variables; if it is a
  straight line then it is a linear model, and if it gives a curved line then it is a
  non-linear model.

Simple linear regression

— The relationship between the dependent and the independent variable is described by a
  straight line, and it has only one independent variable :

  Y = α + β X

— The two parameters, α and β, specify the Y-intercept and slope of the line and are to
  be estimated by using the data at hand.
— The value of Y increases or decreases in a linear manner as the value of X changes.
— Draw a line relating Y and X which is well fitted to the given data set.
— The ideal situation is that the line is well fitted for all the data points, with no
  error for prediction.
— If there is random variation of the data points, which are not fitted in a line, then a
  probabilistic model related to X and Y can be constructed : the simple linear
  regression model assumes that the data points deviate about the line, as shown in
  Fig. 4.3.1.

Fig. 4.3.1 : Linear regression
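A minimal sketch of estimating the two parameters of a simple linear regression by least squares is shown below (Python with NumPy is an assumption, and the data values are made up for illustration).

```python
import numpy as np

# Made-up observations of one independent variable X and the response Y.
X = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
Y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

# Least-squares estimates of slope (beta) and intercept (alpha) for Y = alpha + beta * X.
beta  = np.sum((X - X.mean()) * (Y - Y.mean())) / np.sum((X - X.mean()) ** 2)
alpha = Y.mean() - beta * X.mean()

print(alpha, beta)              # fitted intercept and slope
print(alpha + beta * 6.0)       # prediction for a new value X = 6
```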
4.3.3 Multiple Linear Regression

— Multiple linear regression is an extension of simple linear regression analysis.
— It uses two or more independent variables to predict the outcome, with a single
  continuous dependent variable :

  Y = a0 + a1 X1 + a2 X2 + ... + ak Xk + e

  Where, Y is the dependent variable or response variable,
  X1, X2, ..., Xk are the independent variables or predictors, and
  e is the random error.
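The same least-squares idea extends to several predictors. The sketch below is an illustrative assumption (NumPy's lstsq solver and made-up data, not part of the text); it estimates a0, a1 and a2 for two independent variables.

```python
import numpy as np

# Two predictors X1, X2 and a response Y (illustrative values only).
X1 = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
X2 = np.array([2.0, 1.0, 4.0, 3.0, 5.0])
Y  = np.array([5.2, 6.1, 10.3, 10.9, 14.2])

# Design matrix with a leading column of ones for the intercept a0.
A = np.column_stack([np.ones_like(X1), X1, X2])

coeffs, *_ = np.linalg.lstsq(A, Y, rcond=None)   # [a0, a1, a2]
print(coeffs)
print(A @ coeffs)                                 # fitted values for the training points
```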

Ghats 43.4 OthOt er Regressioi n Model

~ Inlog linear regression a best fit between the data and a log linear model is found.
~ Major assumption : A linear relationship exists between the log of the dependent and
o error independent variables.

~ Log linear models are models that postulate a linear relationship between the independent —
* fF + 7 4 t

mee Variables and the logarithm of the dependent variable, for example :

log(y) = a + 41 Xy + A Xp «+ + Ay Xn
variables and
Where y is the dependent variable; x» i=l, -N are independent

(8, 0,...N } are parameters (coefficients) ofthe

Scanned by CamScanner
— Log-linear models are widely used to analyze categorical data represented in a
  contingency table. In this case, the main reason to transform frequencies or
  probabilities to their log-values is that, provided the dependent variables are not
  correlated with each other, the relationship between the transformed dependent variable
  and the independent variables is a linear (additive) one.

4.4 Model Evaluation and Selection

— Validation test data is very useful to estimate the accuracy of a model.
— Various methods for estimating a classifier's accuracy are given below. All of them are
  based on randomly sampled partitions of the data :

  o Holdout method
  o Random subsampling
  o Cross-validation
  o Bootstrap

— If we want to compare classifiers to select the best one, then the following methods
  are used :

  o Confidence intervals
  o Cost-benefit analysis and Receiver Operating Characteristic (ROC) curves

Syllabus Topic : Accuracy and Error Measures

4.4.1 Accuracy and Error Measures

— Accuracy of a classifier M, acc(M), is the percentage of test set tuples that are
  correctly classified by the model M.

Basic concepts

1. Partition the data randomly into three sets : training set, validation set and test
   set.

— The training set is the subset of the data used to train / build the model.
— The test set is a set of instances on which the model's performance is evaluated; it
  estimates the probability of success on unknown data and is not used in the training
  process.
— The validation data is used for parameter tuning; it can be a subset of the training
  data.

2. Generalization error : the model error on the test data.

3. An instance (record) class is either predicted correctly or predicted incorrectly.

4. Confusion matrix : it is a useful tool for analyzing how well your classifier can
   recognize tuples of different classes.

— If we have only a two-way classification, then only four classification outcomes are
  possible, which are given below in the form of a confusion matrix :

|                 | Predicted class C1   | Predicted class C2   | Total |
| Actual class C1 | True Positives (TP)  | False Negatives (FN) | P     |
| Actual class C2 | False Positives (FP) | True Negatives (TN)  | N     |
| Total           | P'                   | N'                   | All   |

— TP : Class members which are classified as class members.
— TN : Class non-members which are classified as non-members.
— FP : Class non-members which are classified as class members.
— FN : Class members which are classified as class non-members.
— P : Number of positive tuples.
— N : Number of negative tuples.
— P' : Number of tuples that were labeled as positive.
— N' : Number of tuples that were labeled as negative.
— All : Total number of tuples, i.e. TP + FN + FP + TN, or P + N, or P' + N'.
5. Sensitivity : true positive recognition rate, which is the proportion of positive
   tuples that are correctly identified.

   Sensitivity = TP / P

6. Specificity : true negative recognition rate, which is the proportion of negative
   tuples that are correctly identified.

   Specificity = TN / N

7. Accuracy or recognition rate : percentage of test set tuples that are correctly
   classified.

   Accuracy = (TP + TN) / All    or    Accuracy = (TP + TN) / (P + N)

   Accuracy is also a function of sensitivity and specificity :

   Accuracy = Sensitivity x (P / (P + N)) + Specificity x (N / (P + N))

8. Error rate : the percentage of errors made over the whole set of instances (records)
   used for testing.

   Error rate = 1 - accuracy    or    Error rate = (FP + FN) / All

9. Precision : percentage of tuples classified as positive which are actually positive.

   Precision = |TP| / (|TP| + |FP|)

10. Recall : percentage of positive tuples which the classifier labelled as positive.
    It is a measure of completeness.

    Recall = |TP| / (|TP| + |FN|)
11. F-measure : the harmonic mean of precision and recall; it gives equal weight to
    precision and recall.

    F = (2 x Precision x Recall) / (Precision + Recall)

12. F-beta : a weighted measure of precision and recall.

    F_beta = ((1 + beta^2) x Precision x Recall) / (beta^2 x Precision + Recall)

    where beta is a non-negative real number.

— Classifiers can also be compared with respect to :

  o Speed
  o Robustness
  o Scalability
  o Interpretability

— Re-substitution error rate

  o The re-substitution error rate is a performance measure and is equivalent to the
    error on the training data.
  o It is difficult to get a 0% error rate, but it can be minimized, so a low error rate
    is always preferable.

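The measures above follow directly from the four confusion-matrix counts. A minimal sketch is given below (Python; the counts are made-up numbers for illustration only).

```python
# Accuracy, error rate, sensitivity, specificity, precision, recall and F-measure
# computed from illustrative confusion-matrix counts.
TP, FN, FP, TN = 90, 10, 20, 80
P, N, All = TP + FN, FP + TN, TP + FN + FP + TN

sensitivity = TP / P                      # recall / true positive rate
specificity = TN / N
accuracy    = (TP + TN) / All
error_rate  = 1 - accuracy                # = (FP + FN) / All
precision   = TP / (TP + FP)
recall      = TP / (TP + FN)
f_measure   = 2 * precision * recall / (precision + recall)

print(accuracy, error_rate, sensitivity, specificity, precision, recall, f_measure)
```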
Syllabus Topic : Holdout

4.4.2 Holdout

— In the holdout method, the data is divided into a training data set and a testing data
  set (usually 1/3 for testing, 2/3 for training), as shown in Fig. 4.4.1.

Fig. 4.4.1 : Holdout partitioning of the total number of examples into a training set and
             a test set

— To train the classifier, the training data set is used, and once the classifier is
  constructed, the test data set is used to estimate the error rate of the classifier.
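A sketch of the 2/3 versus 1/3 holdout split is shown below (Python; the shuffling and the toy examples are illustrative assumptions, not from the text).

```python
import random

# Illustrative labelled examples: (features, label).
examples = [([i, i % 3], i % 2) for i in range(30)]

random.seed(0)
random.shuffle(examples)                       # random partition of the data

split = int(len(examples) * 2 / 3)             # 2/3 training, 1/3 testing
train_set, test_set = examples[:split], examples[split:]

print(len(train_set), len(test_set))           # 20 training examples, 10 test examples
# A classifier would be built on train_set and its error rate estimated on test_set.
```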
— If the training set is larger than the test set, a better model is constructed, and the
  larger the test set, the more accurate the error estimate.
— Problem : the samples might not be representative. For example, some classes might be
  represented with very few instances, or even with no instances at all.
— Solution : stratification is the method which ensures that both the training and the
  testing set have an equal proportion of samples of each class.

Syllabus Topic : Random Sampling

4.4.3 Random Subsampling

— It is a variation of the holdout method.
— The holdout method is repeated k times.
— Each split randomly selects a fixed number of examples without replacement (Fig. 4.4.2).
— For each data split we retrain the classifier from scratch with the training examples
  and estimate the error Ei with the test examples.
— The overall error estimate is calculated by taking the average of the errors obtained
  from the k splits :

  E = (1/k) * sum of Ei, for i = 1 to k

Syllabus Topic : Cross Validation

4.4.4 Cross-Validation (CV)

— Cross-validation avoids overlapping test sets.
k-fold cross-validation

— First step : the data is split into k subsets of equal size (usually by random
  sampling), as shown in Fig. 4.4.3.
— Second step : each subset in turn is used for testing and the remainder for training.
— The advantage is that all the samples are used for both training and testing.
— The error estimates are averaged to yield an overall error estimate :

  E = (1/k) * sum of Ei, for i = 1 to k

Leave-one-out cross-validation

o If the dataset has N examples, then N experiments are performed for leave-one-out
  cross-validation.
o For every experiment, training uses N - 1 examples and the remaining example is used
  for testing.
o The average error rate on the test examples gives the true error :

  E = (1/N) * sum of Ei, for i = 1 to N

— Stratified cross-validation : the subsets are stratified before the cross-validation is
  performed.

— Stratified ten-fold cross-validation

  o This gives an accurate estimate of evaluation.
  o The estimate's variance gets reduced due to stratification.
  o Ten-fold cross-validation is repeated ten times and finally the results are averaged
    based on the previous 10 results.
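A sketch of k-fold cross-validation is shown below (Python; the hand-rolled index bookkeeping and the placeholder error values are illustrative assumptions rather than a library call).

```python
# k-fold cross-validation: each fold is the test set once, the rest form the training set.
def k_fold_indices(n_examples, k):
    indices = list(range(n_examples))
    fold_size = n_examples // k
    folds = [indices[i * fold_size:(i + 1) * fold_size] for i in range(k)]
    for i in range(k):
        test_idx = folds[i]
        train_idx = [j for f in folds[:i] + folds[i + 1:] for j in f]
        yield train_idx, test_idx

errors = []
for train_idx, test_idx in k_fold_indices(n_examples=20, k=5):
    # train a classifier on train_idx and measure its error E_i on test_idx (omitted here)
    errors.append(len(test_idx) / 20)        # placeholder value standing in for E_i
print(sum(errors) / len(errors))              # E = (1/k) * sum(E_i)
```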
Fig. 4.4.4 : Stratified ten-fold cross-validation over the total number of examples

4.4.5 Bootstrapping

— Cross-validation uses sampling of the data set without replacement : once a tuple or
  instance is selected, it cannot be selected again for the training or test data.
— The bootstrap uses sampling with replacement to get the training set.
— Training set : a dataset of k instances is sampled with replacement k times to form a
  training set of k instances.
— Test set : this is a separate dataset, made of the instances of the original dataset
  which are not part of the training set.
— Bootstrapping is the best error estimator for small datasets.
- Bootstrapping is the best error estimator for
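A sketch of one bootstrap round is given below (Python; the data are illustrative). The training set is drawn with replacement, and the instances that were never drawn form the test set.

```python
import random

random.seed(1)
data = list(range(10))                                    # 10 illustrative instances

k = len(data)
train = [random.choice(data) for _ in range(k)]           # sampling WITH replacement
test  = [x for x in data if x not in train]               # instances never sampled

print(train)    # may contain duplicates
print(test)     # out-of-bag instances used for error estimation
```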

stesteriring
opice : : CluClu
r antifi
be ide
~ Syllyllabusabus Topi
_ - Biology :
4 What i . NS aa asseeibs
5 hat is Clustering ?
- Librari
Farle
4.5.1 What is' Clustering ? er) ordering
' a ae
> (MU - Dec. 2010, May 2012, Dec. 2012, De®
identitic
- Clustering is an unsupervised learning problem.
City-p)
in *,
que used to place data elemess
Clusterini g isi a data minvosing (machine learning) techni
.
_—
can be
related groups without advance knowledge of the group definitions.
are called asclustes
- It isa process of partitioning data objects into sub classes, which

Scanned by CamScanner
— A cluster contains data objects which have high intra-cluster similarity and low
  inter-cluster similarity.
— We can show this with a simple graphical example : points that lie close together form
  one cluster, while points in other dense regions form different clusters.
— The other kind of clustering is known as conceptual clustering, in which the data
  objects in a cluster are a part of it based on a common concept. In other words, they
  fit together based on some descriptive concept and not on a similarity measure.

Applications

Clustering algorithms can be applied in many disciplines :

— Marketing : Clustering can be used for targeted marketing. For e.g. given a customer
  database containing properties and past buying records, similar groups of customers can
  be identified and grouped into one cluster.
— Biology : Clustering can also be used in classifying plants and animals into different
  classes based on their features.
— Libraries : Based on different details about books, clustering can be used for book
  ordering.
— Insurance : With the help of clustering, different groups of policy holders can be
  identified, for e.g. policy holders with a high average claim cost, or identifying
  frauds.
— City-planning : Using details like house type, value and geographical location, groups
  of houses can be identified using clustering.
Data Warehousing & MI
can also be used
to identify dangerony, ,
Cl us te ri ng ‘3 ‘ONey
udies 1
Earthquake st
es.
earthquake epicentr groups of similar access Patter,
find
: Cl us te ri ng can be used [0
WWW cuments, : Usin,

be us ed for cl assification of do
data, It can als o ah

Requirements
y are :
eme nts tha t a clu ste ring algorithm should satisf
The main requir
Scalability.
s of attributes.
Dealing with different type
itrary shape.
Discovering clusters with arb
s,
wledge to determine input parameter
Minimal requirements for domain kno
Ability to deal with noise and outliers.
Insensitivity to order of input records.
High dimensionality.

Interpretability and usability.


Problems

- Because of time complexity, it creates problem to deal with large amount of data ‘a

- For distance based clustering, the effectiveness-of method depends on distance definitns

4.5.2 Categories of Clustering Methods

> (MU - May 2010, May 201


A good clustering method will produce high quality clusters with :

o High intra-class similarity
o Low inter-class similarity

Major clustering methods can be classified into the following categories :

1. Partitioning methods : In the partitioning based approach, various partitions are created and then they are evaluated based on certain criteria.
2. Hierarchical methods : The set of data objects is decomposed hierarchically based on certain criteria.

3. Density-based methods : This approach is based on the notion of density (e.g. density-connected points).
4. Grid-based methods : This approach is based on a multiple-level grid structure.

Han and Kamber have given an overview of the above methods in Table 4.5.1.

Density-based methods :
- Can find arbitrarily shaped clusters.
- Clusters are dense regions of objects in space that are separated by low-density regions.
- Cluster density : Each point must have a minimum number of points within its "neighborhood".
- May filter out outliers.

Grid-based methods :
- Use a multi-resolution grid data structure.
- Fast processing time (typically independent of the number of data objects, yet dependent on grid size).
Difference between Classification and Clustering

1. Classification : In classification, a training set containing data that have been previously categorized is given; based on this training set, the algorithm finds the category that new data points belong to.
   Clustering : In clustering, the similarity of data is not known in advance, so using statistical concepts we split the dataset into sub-datasets such that the sub-datasets have similar data. These sub-datasets are called clusters.

2. Classification : Classification is supervised learning. You are given an unseen tuple and you are supposed to assign a label or a class to that tuple.
   Clustering : Clustering is unsupervised learning. You are given a set of tuples that gives details about who bought what.

3. Classification : For example, a company wants to classify their customers. When the company launches a product, they want to classify which of their customers will buy the product and which ones will not.
   Clustering : By using clustering techniques you can tell the segmentation of your customers.

4. Classification : A common approach for classifiers is to use decision trees to partition and segment records.
   Clustering : There are a variety of algorithms used for clustering, but they all share the property of iteratively assigning records to a cluster until the algorithm indicates convergence to stable segments.
4.6 Types of Data

There are mainly two types of data structures for main memory-based clustering :

1. Data matrix (object-by-variable structure) : an n x p matrix in which row i stores the attribute values (x_i1, ..., x_if, ..., x_ip) of object i.

2. Dissimilarity matrix (object-by-object structure) : an n x n matrix that stores the pairwise distances d(i, j) :

       0
       d(2,1)   0
       d(3,1)   d(3,2)   0
       ...
       d(n,1)   d(n,2)   ...   0

Dissimilarity / Similarity metric :

- Similarity is expressed in terms of a distance function, which is typically a metric d(i, j).
- There is a separate "quality" function that measures the "goodness" of a cluster.
- The definitions of distance functions are usually very different for interval-scaled, boolean, categorical, ordinal and ratio variables.
- Weights should be associated with different variables based on applications and data semantics.
- It is hard to define "similar enough" or "good enough"; the answer is typically highly subjective.

Types of data in clustering analysis :

4.6.1 Interval-Scaled Variables

- Interval-scaled variables are continuous measurements such as weight, height and temperature. Attributes whose differences between values are interpretable are interval-scaled.
- Consider the measurement for a variable f; standardization can be performed as follows :

  o Standardize the data.

  o Calculate the mean absolute deviation :

        s_f = (1/n) ( |x_1f - m_f| + |x_2f - m_f| + ... + |x_nf - m_f| )

    where m_f is the mean value of f :

        m_f = (1/n) ( x_1f + x_2f + ... + x_nf )

  o Calculate the standardized measurement (z-score) :

        z_if = (x_if - m_f) / s_f

- Using mean absolute deviation is more robust than using standard deviation.
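A short sketch of this standardization (illustrative only; the function name is an assumption):

```python
def standardize(values):
    # z-score using the mean absolute deviation s_f instead of the standard deviation
    n = len(values)
    m_f = sum(values) / n                               # mean of the variable f
    s_f = sum(abs(x - m_f) for x in values) / n         # mean absolute deviation
    return [(x - m_f) / s_f for x in values]

print(standardize([10.0, 12.0, 14.0, 20.0]))   # [-1.33, -0.67, 0.0, 2.0]
```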

o Distances are normally used to measure the similarity or dissimilarity between two data objects.

o Some popular ones include the Minkowski distance :

      d(i, j) = ( |x_i1 - x_j1|^q + |x_i2 - x_j2|^q + ... + |x_ip - x_jp|^q )^(1/q)

  where i = (x_i1, x_i2, ..., x_ip) and j = (x_j1, x_j2, ..., x_jp) are two p-dimensional data objects, and q is a positive integer.

o If q = 1, d is the Manhattan distance :

      d(i, j) = |x_i1 - x_j1| + |x_i2 - x_j2| + ... + |x_ip - x_jp|

o If q = 2, d is the Euclidean distance :

      d(i, j) = sqrt( |x_i1 - x_j1|^2 + |x_i2 - x_j2|^2 + ... + |x_ip - x_jp|^2 )

o Both the Euclidean distance and the Manhattan distance satisfy the following mathematical requirements of a distance function :

      d(i, j) >= 0
      d(i, i) = 0
      d(i, j) = d(j, i)
      d(i, j) <= d(i, k) + d(k, j)

o Also, one can use a weighted, parametric distance.
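The three distances above are one formula with different values of q, which a small sketch makes explicit (illustrative only; the names are assumptions):

```python
def minkowski(x, y, q=2):
    # q = 1 gives the Manhattan distance, q = 2 the Euclidean distance
    return sum(abs(a - b) ** q for a, b in zip(x, y)) ** (1.0 / q)

i = (1.0, 2.0, 3.0)
j = (4.0, 6.0, 3.0)
print(minkowski(i, j, q=1))   # Manhattan : 3 + 4 + 0 = 7.0
print(minkowski(i, j, q=2))   # Euclidean : sqrt(9 + 16) = 5.0
```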
4.6.2 Binary Variable

- A binary variable has only two states : 0 or 1.
- Asymmetric binary variable : the two states are not equally important. An example of such a variable is a test outcome such as "handicapped or not", i.e. the presence or absence of a relatively rare attribute. The most important outcome is usually coded as 1 (present) and the other is coded as 0 (absent).
- Contingency table for binary data : for two objects m and n,
  o a is the number of variables which are present (1) in both object m and object n,
  o b is the number of variables found (1) in object m but not (0) in object n,
  o c is the opposite of b, and
  o d is the number of variables that are not found (0) in either object.
- Simple matching coefficient (invariant, if the binary variable is symmetric) :

      d(i, j) = (b + c) / (a + b + c + d)                ...(4.6.1)

- Jaccard coefficient (non-invariant if the binary variable is asymmetric) :

      d(i, j) = (b + c) / (a + b + c)                    ...(4.6.2)

Example :
Table 4.6.1 : A Relational table containing mostly binary values
Name | Gender | Fever | Cough | Test-1 | Test-2 | Test-3 | Test-4
Jack |   M    |   Y   |   N   |   P    |   N    |   N    |   N
Mary |   F    |   Y   |   N   |   P    |   N    |   P    |   N
Jim  |   M    |   Y   |   P   |   N    |   N    |   N    |   N

- Gender is a symmetric attribute; the remaining attributes are asymmetric binary.
- Let the values Y and P be set to 1, and the value N be set to 0, as shown in Table 4.6.2.
- Use the formula (Equation 4.6.2) for asymmetric variables.

Table 4.6.2 : Table 4.6.1 with Y and P coded as 1 and N coded as 0

Name | Gender | Fever | Cough | Test-1 | Test-2 | Test-3 | Test-4
Jack |   M    |   1   |   0   |   1    |   0    |   0    |   0
Mary |   F    |   1   |   0   |   1    |   0    |   1    |   0
Jim  |   M    |   1   |   1   |   0    |   0    |   0    |   0
The distance between Jack and Mary, i.e. d(Jack, Mary), is calculated using Equation (4.6.2) and the contingency table given above.

Consider the attributes : Fever, Cough, Test-1, Test-2, Test-3, Test-4.

Consider Jack as object i and Mary as object j :

    a = number of attributes that are 1 in Jack and also 1 in Mary = 2
    b = number of attributes that are 1 in Jack but 0 in Mary = 0
    c = number of attributes that are 0 in Jack but 1 in Mary = 1

    d(i, j) = (b + c) / (a + b + c)

    d(Jack, Mary) = (0 + 1) / (2 + 0 + 1) = 0.33

Similarly, calculate the distance for the other combinations :

    d(Jack, Jim) = (1 + 1) / (1 + 1 + 1) = 0.67

    d(Jim, Mary) = (1 + 2) / (1 + 1 + 2) = 0.75

Jack and Mary are the most likely to have a similar disease, as they have the lowest dissimilarity value among the three pairs, but Jim and Mary are unlikely to have a similar disease, as they have the highest dissimilarity value.
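The same computation can be written as a small sketch (illustrative only; the vectors follow the coding of Table 4.6.2):

```python
def asymmetric_binary_dissimilarity(x, y):
    # Equation (4.6.2): d = (b + c) / (a + b + c), ignoring the 0/0 matches
    a = sum(1 for p, q in zip(x, y) if p == 1 and q == 1)
    b = sum(1 for p, q in zip(x, y) if p == 1 and q == 0)
    c = sum(1 for p, q in zip(x, y) if p == 0 and q == 1)
    return (b + c) / (a + b + c)

# Fever, Cough, Test-1, Test-2, Test-3, Test-4
jack = [1, 0, 1, 0, 0, 0]
mary = [1, 0, 1, 0, 1, 0]
jim  = [1, 1, 0, 0, 0, 0]
print(round(asymmetric_binary_dissimilarity(jack, mary), 2))  # 0.33
print(round(asymmetric_binary_dissimilarity(jack, jim), 2))   # 0.67
print(round(asymmetric_binary_dissimilarity(jim, mary), 2))   # 0.75
```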
4.6.3 Nominal, Ordinal, and Ratio Variables

(1) Nominal Variables

- A nominal variable can take more than two states, where each individual item belongs to a certain category; a ranking of the categories is not possible.
- Arithmetic operations on the categories are not possible, i.e. the categories can only be distinguished from one another.
- An example of such an attribute is : car owner.

(2) Ordinal Variables

- A discrete ordinal variable is a nominal variable whose different states have a meaningful order or rank.
- The interval between different states is uneven, due to which arithmetic operations are not possible; however, logical operations may be applied.
- For example, considering Age as an ordinal attribute, it can have three different states based on an uneven range of age values. Similarly, Income can also be considered as an ordinal attribute, categorised as low, medium and high based on the income value.

      Age :    1. Teenage   2. Young    3. Old
      Income : 1. Low       2. Medium   3. High

(3) Ratio Variables

- Ratio-scaled variables are continuous positive measurements on a non-linear scale, for example exponential data; they are also interval data but are not measured on a linear scale.
- Operations like addition and subtraction can be performed, but multiplication and division are not possible.
- For example : if a liquid is at 40 degrees and we add 10 degrees, it will be 50 degrees. However, a liquid at 40 degrees does not have twice the temperature of a liquid at 20 degrees, because 0 degrees does not represent "no temperature".
- There are three different ways to handle ratio-scaled variables :
  o As interval-scaled variables. The drawback of handling them as interval-scaled is that it can distort the results.
  o As a continuous ordinal scale.
  o Transforming the data (for example, a logarithmic transformation) and then treating the results as interval-scaled variables.
treating the results as interval scaled variables. te i

4.6.4 Variable of Mixed Types


_ Enclid stated
. : ; ; is
:
= oo
Vari af mixed type tay be either interval-scaled, symmetric bin ; 00) isoe
unary, nominal, ordinal or ratio scaled. ‘sy Mmetic since it 1S ds
~ Objects are many times described i ” : Minkowski «
edd Grn Aelia as a mixture of different variable types and it is usedin rectilinear di
formulae (1)
applications, 7 to a . 2,
~ theInreal . 4
preferable waa compu
iis ie dissimilarity
Tees
between Objects
.
of mixed variable types, infinite, the
0 : iL
cluster analysis, .- rbeess all Variable types together, performing a singk men,
It brin gs Chebyshev.
all of the meaningful .
! variab le onto a common scale of the interval [0.0, 1.0]. a ‘
These dista
. with each ¢

Calories th
Way of co

Euclidean
: we hac

CtWeen ,

- The choice of distance/similarity measure depends on the measurement type and the representation of the objects. Many measures for similarity and distance exist; the Minkowski family is given in Table 4.7.1.

Table 4.7.1 : Minkowski family of distance measures

1. Euclidean (L2) :   d_Euc  = ( sum_{i=1..d} |P_i - Q_i|^2 )^(1/2)
2. City block (L1) :  d_CB   = sum_{i=1..d} |P_i - Q_i|
3. Minkowski (Lp) :   d_Mk   = ( sum_{i=1..d} |P_i - Q_i|^p )^(1/p)
4. Chebyshev (Linf) : d_Cheb = max_i |P_i - Q_i|

- Equation (1) is predominantly known as the Euclidean distance. It was often called the Pythagorean metric, since it is derived from the Pythagorean Theorem. In the late 19th century, Hermann Minkowski considered the city block distance. Other names for equation (2) include rectilinear distance, taxicab norm, and Manhattan distance. Hermann also generalized formulae (1) and (2) to equation (3), which is coined after Minkowski. When p goes to infinity, equation (4) can be derived; it is called the chessboard distance in 2-D, the minimax approximation, or the Chebyshev distance, named after Pafnuty Lvovich Chebyshev.

- These distances (similarities) can be based on a single dimension or on multiple dimensions, with each dimension representing a rule or condition for grouping objects.

- For example, if we were to cluster fast foods, we could take into account the number of calories they contain, their price, subjective ratings of taste, etc. The most straightforward way of computing distances between objects in a multi-dimensional space is the Euclidean distance.

- If we had a two- or three-dimensional space, this measure is the actual geometric distance between objects in the space (i.e., as if measured with a ruler).

Syllabus Topic : Partitioning Methods (k-means, k-medoids)

4.8 Partitioning Methods

> (MU ...)

- Partitioning methods construct a partition of a database D of n objects into a set of K clusters.

Different partitioning methods :

1. Global optimal method : Exhaustively enumerate all partitions.
2. Heuristic methods : K-means and K-medoids algorithms.
   - K-means : Each cluster is represented by the centre of the cluster.
   - K-medoids or PAM (Partitioning Around Medoids) : Each cluster is represented by one of the objects in the cluster.

4.8.1 K-means Clustering (Centroid Based Technique)

> (MU - May 2010, Dec. 2012, May 2013, Dec. 2013, May 2014, Dec. 2014, May 2015)

- In 1967, J. MacQueen and then in 1975 J. A. Hartigan and M. A. Wong developed the K-means clustering algorithm.
- In the K-means approach the data objects are classified based on their attributes or features into k clusters. The number of clusters, i.e. K, is an input given by the user.
- K-means is one of the simplest unsupervised learning algorithms.
- Define K centroids for the K clusters, which are generally far away from each other.
- Then group the elements into clusters which are nearer to the centroid of that cluster.
- Again calculate the new centroid for each cluster based on the elements of that cluster.
- Follow the same method and group the elements based on the new centroids. In every step, the centroids change.

- This algorithm aims at minimizing an objective function, given by

      J = sum_{j=1..k} sum_{i=1..n} || x_i^(j) - c_j ||^2

  where
      x_i^(j) = a data point,
      c_j     = the cluster centre,
      n       = number of data points,
      k       = number of clusters,
      || x_i^(j) - c_j ||^2 = distance measure between data point x_i^(j) and the cluster centre c_j.

K-means algorithm (for k clusters and sample feature vectors x_1, x_2, ..., x_n, where m_j denotes the mean of the vectors in cluster j; assume k < n) :

    o Make initial guesses for the means m_1, m_2, ..., m_k.
    o Until there are no changes in any mean :
        o Use the estimated means to classify the samples into clusters.
        o for i = 1 to k
              Replace m_i with the mean of all of the samples for cluster i.
          end_for
    o end_until
- The following three steps are repeated until convergence (iterate till no object moves to a different group) :
  1. Find the centroid coordinates.
  2. Find the distance of each object to the centroids.
  3. Based on minimum distance, group the objects.
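A minimal sketch of these three steps for one-dimensional data (illustrative only; the function name, the explicit initial means and the 1-D restriction are assumptions made for the example):

```python
def kmeans(points, k, initial_means):
    # Assign each point to the nearest mean, recompute the means,
    # and repeat until no mean changes (i.e. no object moves).
    means = list(initial_means)
    while True:
        clusters = [[] for _ in range(k)]
        for p in points:                                   # step 2 + 3: nearest centroid
            idx = min(range(k), key=lambda j: abs(p - means[j]))
            clusters[idx].append(p)
        new_means = [sum(c) / len(c) if c else means[j]    # step 1: new centroids
                     for j, c in enumerate(clusters)]
        if new_means == means:                             # converged
            return clusters, means
        means = new_means
```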

Fig. 4.8.1 : K-means graphical example (arbitrarily choose K objects as initial cluster centres, assign the objects, update the cluster means, and repeat until the groups do not change)

Fig. 4.8.2 : Basic steps for K-means clustering (number of clusters K, distance of objects to centroids, grouping based on minimum distance)

- Given a cluster K_i = {t_i1, ..., t_im}, the cluster mean is

      m_i = (1/m)(t_i1 + ... + t_im)

Ex. : Apply the K-means algorithm to create two clusters. Given : {2, 4, 10, 12, 3, 20, 30, 11, 25}, number of clusters K = 2.

Soln. :

Method 1

1. Randomly assign means : m_1 = 3, m_2 = 4.

2. Numbers which are close to mean m_1 = 3 are grouped into cluster K_1, and numbers which are close to mean m_2 = 4 are grouped into cluster K_2; then the new mean for each new cluster is calculated.

3. K_1 = {2, 3},                 K_2 = {4, 10, 12, 20, 30, 11, 25},   m_1 = 2.5,  m_2 = 16
4. K_1 = {2, 3, 4},              K_2 = {10, 12, 20, 30, 11, 25},      m_1 = 3,    m_2 = 18
5. K_1 = {2, 3, 4, 10},          K_2 = {12, 20, 30, 11, 25},          m_1 = 4.75, m_2 = 19.6
6. K_1 = {2, 3, 4, 10, 11, 12},  K_2 = {20, 30, 25},                  m_1 = 7,    m_2 = 25
7. K_1 = {2, 3, 4, 10, 11, 12},  K_2 = {20, 30, 25}

8. With these means the clusters in the last two steps are the same, so the groups are identical.

So the final answer is K_1 = {2, 3, 4, 10, 11, 12}, K_2 = {20, 30, 25}.
Method 2

Given : {2, 4, 10, 12, 3, 20, 30, 11, 25}. Randomly assign alternative values to each cluster.

1. Number of clusters = 2, therefore
       K_1 = {2, 10, 3, 30, 25},       Mean = 14
       K_2 = {4, 12, 20, 11},          Mean = 11.75

2. Re-assign :
       K_1 = {20, 30, 25},             Mean = 25
       K_2 = {2, 4, 10, 12, 3, 11},    Mean = 7

3. Re-assign :
       K_1 = {20, 30, 25},             Mean = 25
       K_2 = {2, 4, 10, 12, 3, 11},    Mean = 7

So the final answer is K_1 = {2, 3, 4, 10, 11, 12}, K_2 = {20, 30, 25}.
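The result of this worked example can be checked with the illustrative kmeans sketch given earlier in this section (assuming that sketch is available; it is not part of the original text):

```python
data = [2, 4, 10, 12, 3, 20, 30, 11, 25]
clusters, means = kmeans(data, k=2, initial_means=[3, 4])
print(clusters)   # groups converge to {2, 3, 4, 10, 11, 12} and {20, 30, 25}
print(means)      # [7.0, 25.0]
```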


Ex. 4.9.2 : Use the K-means algorithm to create 3 clusters for the given set of values :
            {2, 3, 6, 8, 9, 12, 15, 18, 22}

Soln. :

1. Break into 3 clusters (randomly assign the data to three clusters) and calculate the mean value :
       K_1 = {2, 8, 15}    - mean = 8.3
       K_2 = {3, 9, 18}    - mean = 10
       K_3 = {6, 12, 22}   - mean = 13.3

2. Re-assign :
       K_1 = {2, 3, 6, 8, 9}    - mean = 5.6
       K_3 = {12, 15, 18, 22}   - mean = 16.75

3. Re-assign :
       K_1 = {3, 6, 8, 9}       - mean = 6.5
       K_2 = {2}                - mean = 2
       K_3 = {12, 15, 18, 22}   - mean = 16.75

4. Re-assign :
       K_1 = {6, 8, 9}          - mean = 7.6
       K_2 = {2, 3}             - mean = 2.5
       K_3 = {12, 15, 18, 22}   - mean = 16.75

5. Re-assign :
       K_1 = {6, 8, 9}          - mean = 7.6
       K_2 = {2, 3}             - mean = 2.5
       K_3 = {12, 15, 18, 22}   - mean = 16.75

6. The last two groupings are the same, so finally we get 3 clusters :
       Cluster 1 = {6, 8, 9}, Cluster 2 = {2, 3}, Cluster 3 = {12, 15, 18, 22}

1. Randomly assign alternative values to each cluster.

2. Number of clusters = 2, therefore
       K1 = {10, 2, 3, 30, 25},        Mean = 14
       K2 = {4, 12, 20, 11, 31},       Mean = 15.6

3. Re-assign :
       K1 = {2, 3, 4, 10, 11, 12},     Mean = 7
       K2 = {20, 25, 30, 31},          Mean = 26.5

4. Re-assign :
       K1 = {2, 3, 4, 10, 11, 12},     Mean = 7
       K2 = {20, 25, 30, 31},          Mean = 26.5

Consider four objects yw;
Ith two a
grouped together into two Cluster
values.

:Objects ars ty as,


With Sir attribute

ob]pject represents one point wi th two


in an attribute space as attributes (X,
nate in shown belo
aw w,

Attribute Y

8.4 : Graphical
Fig. P. 4.8.4: Gra representation of Data Points

ject B as the first centroi ids, then


! Take initial centroids : Cons ider object A and objec
“(1,1) and C= (2,1)

Scanned by CamScanner
7 Attribute X ‘N
_
—s
‘z
t—
——

‘ ' ' it
4.5 ; eas
“a seateaeee § sri
1 :

secede 1.

jal eeodduangend sedeu


\ '
. ag eeey i ’
wasneaengenenen ooonee
a5
soawee ! eenee eee
aeeenees ' avancée »
Qo daesneeae ; esaneeuee ‘

7
er 7 aeoeneeee
eenans 1 eesuac
a5 cosnseeed ana

4 ‘ oandee basces
wsc ase feenesena
'
dewceeane aseeecere . eac
esne

ee: ee
D

Attrtiate 4,5 Joneansee qeneeee


wawens
{ hyn wannne Livsuntagsh wneenene Lae

ens
OS dee Be saeiits baeeseensbeneerens fosseen feveeo
0 +—_}+____ +++ i
0 1 2 3 4 5 8

Fig. P. 4.8.4(a) : Randomly selected centroids C, and C; for two clusters

Objects-centroids distance : Using Euclidean distance, calculate the distance bey


cluster centroid to each object. For Iteration we have, - ‘
p’.(° 1 3.61 | C,=(1,1) group
-1
1 0 2.83 4.24) C,=(2,1) group -2

A BCD
The above distance matrix shows the dist
ance of each o bject wi respect to the cen
ject with
of cluster Cy and C>,
For example, distance from object D
= (5,4) to the fi i ss $f
cond cnttid =, )ied2Aet NO CI= Ch Dis Saint

G° =| lean group - |

grou-p2

; : Based . |
centroid. As group-1 has only one object, a belong to group, calculate the ™
roid of group-1 is C, =(
1, 1).

Scanned by CamScanner
Fig. P.4 -8.4(b) : Clus
ter formation
after First Iterat
ion
: Calculate objects-centro
ids distances : Comput
e the distance of
wnew centroids. So new distance matrix wi Ib all objects with Tespec
t
Centrojd e D'is given below
ni [ 0 1 3.61 C,; =(1 1) group - 1
an 11 8
d to the “13.14 2.36 0.47 1.89] Cc, = G ; -
" oo
A B Cc D
imum { Make the new clusters of objects : Foll
ow the step 3 given above. Based on minimum
Toup- distance assign the group to each object. Now an objec
t A and B belongs to group-1 and
object
s C and D belongs to group-2 as given below:

g=[2
“LOO
0 0) group- 1
1 1) group-2

ABCD oa
ew
Again determine centroids : Repeat the step 4 to calculate new cen

Le? Jt -(15 7
% -(, T° 2

Scanned by CamScanner
4+5 34) (45 1
c, = (EE AE) = (49.39$)
Attribute X

Iteration 2

Fig. P. 4.8.4(c) : Two clusters with


centroids
8. Co
Comp
mpan
utie
e s the objects-centroid
s distances *: R ‘Nepeat
iteration 2 is obtained step no 2, a new dis
as shown below : i
;3 =
v-[, 05 05 3.20 4.61 C=(13,1)
4.30 3.54 0.71 0.71 group - 1

A Boc D
9. Make the Clusters
of objects
calculated using : Assi eac h object based on the minimu
Euclidean distan
ce, = m dius
o*=[) 1 0 "| Stroup - |
0014 group - 2

Scanned by CamScanner
ee
a

Enel bt he
rey au Sa
3
i
15 0.5
Rs
> Matrix for for simplicity we can find the adjacency matrix y Which gives distances
hich gives di of all
) object from

Using Euclidean Distance we have


DiGi, j) = VWika—xil +ly2.—yil’ -
DG@,j) = D(AB) = Ve =a +(2- y=1,

Similarly we can compute for the rest.


mene ——
pe
1.58

2.12

0 2 0.71 |

ition of 2-71[| 08 [us0s


fp| 1.58 | e2.12A] an0d
Ln
st s
C as the firr
oiids are
centroids. So the cen tro
value of “tata— ‘Assum
l, hhitial :
value of centroids
(2,2) and C,(1,1)

Scanned by CamScanner
WF pata warehousing & Mini . ; Classification, Prag
i warehousing é
2. Objects-centroids distance : Using Euclidean distance formula
agi » the qj Zz
object with respect to centroid C, and C, is given below : at ‘t Z For example, di
D°= [eats
os — 2) + (0.7
tteration-1, objec
6. minimum distance

Gi=
eb. ge
[a |B
*
The object is. represented by
5
column in the distance matrix. The first ro ; d
—~
distance of each object to the first centroid and the second
Tow to thesecond TeDTeseny os
Cen * .

3. Objects clustering * Each object is assigned based on the minimum distance “= By comparing mi
si.glance D is assigned
and matrix to group 1 and C and E to group 2. A value of 1;
if an object belongs to that group,
Thu ogg to-a ci
move any more
S aSSignedj te So final clusters a
G’= .

Ci(2,2)~Group 1
C,(1,1) — Group 2
gning the objects to their appropriate gro,
has three member thus the
centroid C, iste
‘ian Similarly Group 2 now has i
are ale among the two members Ex. 4.8.6: Suppo:
CG = £7o2t3 2424 1
re
3° 3) = (2.67, 1.67) pres‘
c, = (41S 140 | Met
5 2° 2 3) = (1.25, 0.75) The di
. eee objects-centroids distances :
objects to the new centroids, Similar to ’ : The next Step is to compute the distance ofl ie , ¢
show
D! = tep a we have distance matrix at iteration Lis (a) q

(b) -

Scanned by CamScanner
a
(Se

as example, distance fro, ;


67-2) +(1.67-2)° = 9
Acne
5 ‘iyi 2
(1.25 - a 145, sj ila us Ce tg ro
. ration-1, objects clustering : Simi y Caley) oe
‘ isi distance. The Group Métis; a to Step Pints R De 25,0.75) is
: X is Shown be} » We Assign ok,
G' z oy Cach

ania
By comparing the above Tesults we obs Ly
ea more to a different gtoup. Thus K erve that Ge G 1

SC
Means ¢] r » this shows that .
$0 final clusters are Group 1= ustering has reacheded itsite oc.4 ot b OO Hot
{ A, 21D } and Group 2= (6,5 stabi;
Stability,

PRTC
x
A
Es
q

4486: Suppose that the data mining


task is to cluster the following points with
representing locations) x,
into 3 clusters.
mn Gon
A1(2,10), A2(2,5), A3(8,4), B1 (5,8), B2(7,5), B3(6,4), C1(1,2), C2(4,9
)
The distance function is Euclidean distance. Suppose initially we assign A1, B1
and C1 as the centre of each cluster respectively, use the K-means algorithm to
show only
(a) The three cluster centers after the first round execution.
(b) The final three clusters.

Scanned by CamScanner
at dad C2(4,9)
| ° 9 168) ae
|
sterati
' | © A2(2,5) o— 7)
; B3(8,4)
ject
PIGAy ay
enna ail
0 + ¥ < i
=
0 2 4 6 8 10 F

I. Initial value of centroids : In this we use


Al, Bland Cl as the first
and X2, X3 denote the coordinate of the
centrojoids, then X1 =
X3 = C1(1,2) yet ids,
ae X2= Bi
2. ape
Objects-
kacentroids di stance : We calculate the
distance between cluster q
us use Euclidean distan ce, then we i ance matr 4
6. Tterat
Z have dist att +
ix at iteration ini
8 —_e
G=

7. Iterati

8. Iterati
D? =

9
Tteratj
The G,

Scanned by CamScanner
, Mining (MU-Sem. 6-Comp.) 4-100 Classification, Prediction & Clustering

et2 542) (1.5,3.5)


x3 =
ray

-octs-centroids distances : The next step is to compute the distance of all


eo a ew centroids. Similar to step 2, we have distance matrix at iteration | is
“ts50
f 3
, aan | Ad | BL | B2 | B3 | Ci | C2
* Cs
5 | 848 | 3.61 | 7.07 | 7.21 | 8.06 | 2.24 | x102,10)
qin | 2.83 |224| 141 | 2 | 640 | 3.61 | x2666)
0 .

38 | 6.52 | 5.70 | 5.70 | 4.52 | 1.58 | 6.04 | x3(1.5.3.5)


6 .

6.52
the
clu ste rin :
g Sim ila r to step 3, we assign each object based on
0pjects
wn below.
'
yest" . tance. The Group matrix is sho
Jy m §

pinot

° Tal [2 fe lao le| l,1 | X1@,10)


0 |
awl 1 ss

] 0 0 | X2(6,6)
0
0 o | 1 | 0 | X3(1.5,3.5)
centroids
iteration-2, determine
95)
x1 = (2+4)/2, (10+9)/2)= 3,
)/4)=(65, 5.25)
= ((8+5+7 +614, (44+84+544
3.5)
x3 = ((2+1)/2,(5+2)/2)=(15,
s
\ Iteration-2, objects-centroids distance
De

1.12 | X1G.9.5)
235e17.43 | 2.5 | 6.02 | 6.26 7:76 |
112/a
s
,5.25)
X2(6.25 PO
6541 451 | 195 | 3.13 | 0.56 | 1.35 | 6.38 | 7.68 | X20
5,3.5 )
| 6.52 | 1.58 | 6.52 | 5.70 | 5-70 J
4.52 | 1.58 | 6.04 | X3(.5
| et

1 in distance.
un
lleration-2, objects clustering : We assign each object based on the minim
Group matrix is shown below.

a
Scanned by CamScanner
0
fofi1{[o o] oj] 0 1

10. Iteration-3, determine centroids

XI = (2+5+4)/3, (10+9
+ 8)/3) = (3.67, 9)
X2 = (8+7+6)/3, (44+5+4)/3)= (7, 4.33) strenofgth
K-ms
X3. = (2+ 1)/2, (5 +2)/2)=(1.5, 3.5) Relatively
ef
11, Iteration-3, objects-centroids distances where n is nu
D?= ae
Normally, k,
: K-means ofte
0.33 | X1(3.67,9) oe
| 6.01 | 5.04 | 1.05 | 4.17 | 0.67 | 1.05 |
6.44 | 5.55 X2(7 ,4.33) global optim
| 6.52 | 1.58 | 6.52 | 5.70 | 5.70 | 4.52 | 1.58 | 6.04 | X3(1.5,3.5)
_ Weakness of K.
12. _Iteration-3,
, 0 objects clusterining g :: We assign
‘on each :
object based on the minimum distance
Group matrix is shown below. Applicable c
Need to spec
G=
Unable to h:
Not suitable

0} 0 |] 1 |x1@.67,9)
1} 0 | O | x2(7,4.33)
O11] 0 | x3(1.5,3.5)

So final clusters are Grou


Pl={ Al, BI i
= (A2, Cl}. » C2} and Group 2 = { A3, B2 , B3) and Gro”?

Scanned by CamScanner
Strengths of K-means clustering

- Relatively efficient : O(ikn), where n is the number of objects, k is the number of clusters and i is the number of iterations; normally k, i << n.
- K-means often terminates at a local optimum.
- Techniques like deterministic annealing and genetic algorithms are used to get the global optimum solution.

Weakness of K-means clustering

- Applicable only when the mean is defined.
- Need to specify k, the number of clusters, in advance.
- Unable to handle noisy data and outliers (outlier : objects with extremely large values).
- Not suitable to discover clusters with non-convex shapes.
~ Not suitable to discover clusters with non-convex shapes.

' Given: (1,2,6,7,8,10 ,15,17,20}


=6andC2=
Consider initial two centroids for two.clusters Cl=6an
215 =
|
therefore
Ly, Umber of cluster = 2,
n = 5.
0}, cl =Mea
1= (1,2,6,7,8:1

Scanned by CamScanner
K2 = {15,17,20}Cl
, =Mean= 17 3
3. Re-assign oe

K2 = {15,17,20}

As no element is moving from cluster , So the final answer js


K1={1,2,6,7,8,10) and K2 = {15,17,20}
4.8.2 K-Medoids (Representative Object-based Technique)

- Instead of taking the mean value of the objects in a cluster as a reference point, the K-medoids method uses an actual, most centrally located object in the cluster.
- Also called Partitioning Around Medoids (PAM).
- Handles outliers well.
- Ordering of input does not impact results.
- Does not scale well.
- Each cluster is represented by one item, called the medoid.
- The initial set of K medoids is randomly chosen.
- It results in a single partition of the data into K clusters, where each cluster has a representative point that is a centrally located point of the cluster based on some distance measure. These representative points are called medoids.

Basic K-medoid algorithm

1. Select K points as the initial medoids.
2. Assign all points to the closest medoid.
3. See if any other point is a "better" medoid.
4. Repeat steps 2 and 3 until the medoids don't change.

- In each step of the algorithm, medoids are changed if doing so improves the quality of the clustering. The cost change for an item t_j associated with swapping medoid t_i with a non-medoid t_h is denoted C_jih.

PAM algorithm (as given by Margaret H. Dunham)

Input :
    D = {t_1, t_2, ..., t_n}     // set of elements
    A                            // adjacency matrix showing distance between elements
    k                            // number of desired clusters
Output :
    K                            // set of k clusters

Algorithm :
    Arbitrarily select k medoids from D;
    repeat
        for each t_h not a medoid and each medoid t_i do
            calculate TC_ih;
        find i, h where TC_ih is smallest;
        if TC_ih < 0 then
            replace medoid t_i with t_h;
        assign every non-medoid t_j to the cluster of its closest medoid (representative point);
    until TC_ih >= 0;

Calculation of swapping cost :

      TC_ih = sum_{j=1..n} C_jih

Advantages of PAM (Partitioning Around Medoids)

- PAM works effectively for small data sets, but does not scale well to handle large data sets.
- Complexity is O(k (n - k)^2) for each iteration, where n is the number of data items and k is the number of clusters.
- PAM is more robust than k-means in the presence of noise and outliers.
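A minimal PAM-style sketch of the swap idea (illustrative only; it evaluates total cost directly rather than the incremental TC_ih, and the function names, demo distance and demo data are assumptions):

```python
def pam(points, k, dist, max_iter=100):
    medoids = list(points[:k])                       # arbitrary initial medoids

    def total_cost(meds):
        return sum(min(dist(p, m) for m in meds) for p in points)

    for _ in range(max_iter):
        best_cost, best_swap = total_cost(medoids), None
        for i in range(k):
            for h in points:
                if h in medoids:
                    continue
                candidate = medoids[:i] + [h] + medoids[i + 1:]
                cost = total_cost(candidate)
                if cost < best_cost:                 # swapping cost < 0 for this swap
                    best_cost, best_swap = cost, candidate
        if best_swap is None:                        # medoids no longer change
            break
        medoids = best_swap

    clusters = [[p for p in points
                 if min(range(k), key=lambda j: dist(p, medoids[j])) == c]
                for c in range(k)]
    return medoids, clusters

d1 = lambda a, b: abs(a - b)                         # simple 1-D distance for the demo
print(pam([1, 2, 6, 7, 8, 10, 15, 17, 20], k=2, dist=d1))
```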

w. Apply K-
ects are given belo
Coordinates of Obj
clusters = 2.
Table P. 4.8.8

1.0
5.0
5.0
5.0
10,0
25,0
25.0
25.0
25.0
29.0
Soln. :
y

10

8
6

4p e4 e4

2 03
e2
|
1 5 10 20
|
15 25
Fig.P. 4.8.8 : Two-dimensional example 30 " *| |
with 10 objects
4 (
2t
Objects 1 and 5 are the selected
oo
representative objects initially.
(Random selection)
men Stance of every object with :
Tespect to selected object
oo e clos st representative 1 and 5
object with Tespect to the selected
culate the average object.
value of minimal dissimilarity
Cost = Average value of
minimal dissimilarity =9
= 4
ase
Objects

Scanned by CamScanner
10 15 20 25

Fig, P. 4.8.8(a)

J Select two objects randomly.

* Objects 4 and 8 are the selected representative objects.

Scanned by CamScanner
gem. 6-COmp. 4-107 Classification, Pr

(MUSE = warenousicg &


F pata Warehousin g. Mining pata

stepI.valu a geif ,wapPine con


™“ the
Repeat = 2.30. “ ;
average of minimal dissimilarity

7 ees Je P. 4.8.8(b): Assignment of objects to two representative objec, replace


, new
es
medoids al
panei Dasaimiin rity Minimal

4: Repeat steP 7
Pei from object 4 |_ from object8_|__ dissimilarity
, 100 24.19 4.00 sampling Be
2 3.00 20.88 3.00
3 2.00 20.62 2.00 CLARA
0.00 20.22 0.00 Clustering le
4
5 5.00 15.30 _ Built in stat
6 20.00 3.00 3.00 Draw multi
7 20.10 1.00 1.00 Performsb
8 . 20.22 0.00
0.00 _ Efficiency
9 20.40 1.00 _ 1.00 LARANS
10 24.19 4.00 4.00 e °
Average 2.30 - CLARAN
search.
A
y — ‘CLARA

— Search is

ol - Itdraws

4 - It has tv
and nun
6 a q
L —_—_—_—_—_—————
4 = Sy!

49 Hierar
2r

12345 10 a 20
Fig. P. 4.8.8(b) : Clustering corresponding to the selections described in Table P. 48.70) wt
ari Sus hierar:
Step p 3; Single-lin
~ Calculation of swapping cost.
Complete
- Swapping cost = New cost - Old cost = 2.30 - 9.32 = -7.02

Step 4 : If the swapping cost < 0, replace the old medoid with the newly selected object. The new medoids are object 4 and object 8. Repeat steps 2 and 3 until the swapping cost is no longer negative.

4.8.3 Sampling Based Method

CLARA

- Clustering LARge Applications.
- Built into statistical analysis packages, such as S+.
- Draws multiple samples of the data set and applies PAM on each sample.
- Performs better than PAM on larger data sets.
- Efficiency depends on the sample size.

CLARANS

- CLARANS is applicable to large applications; it is based upon randomized search.
- CLARANS is more efficient than the PAM and CLARA clustering algorithms.
- Search is over a sample of the neighbours of a node.
- It draws a sample of neighbours in each search step.
- It has two main parameters for clustering : the maximum number of neighbours and the number of local minima obtained.
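CLARA's "PAM on samples" idea can be sketched on top of the illustrative pam() helper shown earlier (all names, parameters and the sampling scheme are assumptions, not the library or textbook implementation):

```python
import random

def clara(points, k, dist, n_samples=5, sample_size=40, seed=0):
    # Run PAM on several random samples and keep the medoids that give
    # the lowest total cost over the WHOLE data set.
    rng = random.Random(seed)
    best_medoids, best_cost = None, float("inf")
    for _ in range(n_samples):
        sample = rng.sample(points, min(sample_size, len(points)))
        medoids, _ = pam(sample, k, dist)
        cost = sum(min(dist(p, m) for m in medoids) for p in points)
        if cost < best_cost:
            best_medoids, best_cost = medoids, cost
    return best_medoids, best_cost
```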

Syllabus Topic : Hierarchical Methods - Agglomerative, Divisive

4.9 Hierarchical Clustering

> (MU - May 2014, Dec. 2014)

Various hierarchical clustering algorithms are :

- Single-linkage clustering, nearest-neighbour.

- Complete-linkage, furthest neighbour.

- Average-linkage, Unweighted Pair-Group Method Average (UPGMA).

- Weighted-pair group average, UPGMA weighted by cluster sizes,


- Within-groups clustering.

- Ward's method.
Hierarchical clustering technique (Basic algorithm)

1. Compute the proximity matrix (i.e. distance matrix).
2. Let each data point be a cluster.
3. Repeat :
4.     Merge the two closest clusters.
5.     Update the proximity matrix.
6. Until only a single cluster remains.

Different approaches to defining the distance between clusters distinguish the different algorithms, i.e. single-linkage, complete-linkage and average-linkage clustering.
Single-linkage Clustering

- Single linkage clustering is also called the minimum method; the minimum distance from any object of one cluster to any object of another cluster is considered.
- In the single linkage method, D(A,B) is computed as

      D(A,B) = min { d(i,j) : where object i is in cluster A and object j is in cluster B }

- This measure of inter-group distance is illustrated in Fig. 4.9.1.

Fig. 4.9.1 : Single-linkage clustering
Complete-linkage Clustering

- In the complete linkage method, D(A,B) is computed as

      D(A,B) = max { d(i,j) : where object i is in cluster A and object j is in cluster B }

Fig. 4.9.2 : Complete-linkage clustering

Average-linkage Clustering

- In average-linkage clustering, we consider the distance between any two clusters A and B to be the average of all distances between pairs of objects, one taken from each cluster.
- In the average linkage method, D(A,B) is computed as

      D(A,B) = mean { d(i,j) : where object i is in cluster A and object j is in cluster B }

      proximity(Cluster_i, Cluster_j) = ( sum of proximity(p_i, p_j) over p_i in Cluster_i and p_j in Cluster_j ) / ( |Cluster_i| * |Cluster_j| )

Fig. 4.9.3 : Average-linkage clustering

Fig. 4.9.3 illustrates average linkage clustering.
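The three linkage criteria differ only in how the inter-cluster distance is computed; a short sketch of the basic merge loop makes this explicit (illustrative only; the function names and the arbitrary 2-D demo points are assumptions):

```python
def agglomerative(points, dist, linkage="single", num_clusters=1):
    # Start with every point in its own cluster and repeatedly merge the two
    # closest clusters; D(A,B) is the min / max / mean of pairwise distances.
    link = {"single": min, "complete": max,
            "average": lambda ds: sum(ds) / len(ds)}[linkage]
    clusters = [[p] for p in points]
    merges = []                                   # log of merges (a textual dendrogram)
    while len(clusters) > num_clusters:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = link([dist(a, b) for a in clusters[i] for b in clusters[j]])
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        merges.append((d, clusters[i], clusters[j]))
        clusters = [c for idx, c in enumerate(clusters) if idx not in (i, j)]
        clusters.append(merges[-1][1] + merges[-1][2])
    return clusters, merges

euclid = lambda a, b: ((a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2) ** 0.5
pts = [(1, 1), (2, 1), (4, 3), (5, 4), (1, 2), (5, 5)]    # arbitrary demo points
print(agglomerative(pts, euclid, linkage="single", num_clusters=2)[0])
```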
4.9.1 Agglomerative Hierarchical Clustering

> (MU - May 2010)

- In hierarchical clustering, algorithms work either top-down or bottom-up.
- Agglomerative clustering is the bottom-up approach : every object is initially considered to be a cluster, and in subsequent steps clusters are merged until the required clustering is obtained. Therefore it is also called Hierarchical Agglomerative Clustering (HAC).

- An HAC clustering is typically visualized as a dendrogram, as shown in Fig. 4.9.4; each merge is represented by a horizontal line.

What is a Dendrogram ?

- Dendrogram : A tree data structure which illustrates hierarchical clustering.
- Each level shows the clusters for that level.
  o Leaf : individual clusters
  o Root : one cluster
- A cluster at level i is the union of its children clusters at level i+1.
- A clustering of the data objects is obtained by cutting the dendrogram at the desired level; then each connected component forms a cluster.
then each connected component forms a cluster. Bram at the desirey
soll.
1: Plot the «
the numl
attributes

r p6 in 2-d
| 5 [: Step 2: Calculat
all other
mee Fig. 4.9.4 : Dendrogram ‘ and plac
- ow: chart of agglomerative hieerarchici
} aera al clustering
i ‘
algorithm is. shown in Fig. 495 | sae

where x; is
on, aS Many attrit
;
In our case,
and p2, which h
d(pl, r

Scanned by CamScanner
ae
oe? —

Assume that the q


technique to fingClusters» D
in

1; Plot the objects in n-dimension al spac


the number of attributes). }, our ¢ (Where nj, 9

o<
attributes x and y, so we plot the inn Wwe have 9 ss
p6 in 2-dimensional space : iad ae
op?! Calculate the distance from cach gs 04 "3 98
i . . ject (poj to
all other points, using Euclidean Pili,
easure,
and p lace the num bers ini a distance
° matrix. 0 01 0203 ; 04 os on™
The formula for Euclidean
distance between two points
j andj jis
vn in Fig. 4.9.5 d(i,
Gi ))j) = af Init —Xy + xg— xi + Fxg oe
where x;1 is the value of attribute 1 for i and x); is the value of
attribute1 for j and so
} «,asmany attributes we have ... shown upto p Le. Xp in the
formula.
In our case, we only have 2 attributes. So, the Euclidean distance between our points pl
ad p2, which have attributes x and y would be calculate
as follows:
d

APL De), = Ixp, — xp, +1 yp — yp


= [10.40 —0.22F'+10.53-0.38"
= 4/1018 P+ 10.15"
= 0.0324 + 0.0225
e the
ini ints and we will receiv
pom
the2 distance to the remaining
alce ae
An al og ic al ly , we ca ul at e
blowing va lues .

Scanned by CamScanner
Distance matrix , —

p2 0.24 | 0

p3 | 0.22 0.15 | 0

p4 | 0.37 0.20 | 0.15 | 0

p5 | 0.34 | 0.14 0.28 | 0.29 | 0

p6 | 0.23 | 0.25 0.11 | 0.22 | 0.39 | 0


O57 (
pl p2 p3 p4 pS p6
Step 3: In the above matrix, p6 and p3 are two clusters with shortest distance 9.11, 5 o.4

p6 and p3 and make single cluster (p3,p6). Now re-compute the distance 7»,
Dendrogram

Disa
0.30

0.25

0.20

Oo 01 0203 04 05 06
Distance matrix
” The «
pl 0 . dist(
p2 0.24 | 0
(P3, p6) | 0.22 | 0.15 | o
p4 0.37 | 0.20}015 . | 9
ps 0.34 | 0.14 | 0.28 0.29 | 0
; pl p2
To calculate the distance (p3, p6) p+ pd
of pl from (p3,p6) :

dist( (p3, p6), pl) =


MIN ( dist(p3, p1), dist(p6, pl) )
MIN (0.22, 0.23) //from original matt
0.22

Scanned by CamScanner
Classification, 7
& mini

s
sone single cluster is formed i. merge al the chuter ,
*
.
nll

: have the smallest distance from above matrnixn, .. oo merge pr wnel


| Now ie ajuster(P>P5). —_
a 3 40 &
a
O25

4 9 oh

F015

+014
|
7Olt
| |

6 2 5

0
0
0.22 | 0.15
{015 [0
037/020
pl | @2P5) (p3, p6) | p4
ca lc ul at ed as gi ven below:
and (p2, p5) is
between (p3 p6) (p 6, p2). dist(p3, P:
st(pe3, p2 ) » <i st di; 5),

p5) ) = MIN ( di
5)
i

a t i

ae ); (p 2, ’
,
6
dist( (p: 3, p ;

//from origigininal al
matrix
25, 0.28, 0.39)
MIN (0.15 0.

= 0.15
and again
distance i.¢.0.15
(p 3, p6 ) as h a v1 ing miinniimum
YMerge (p2, p5) and
e e - Pp.

matrix.
erge

compute distance

Scanned by CamScanner
4-115 Classification, p .
* Sem, 6-Comp.)
=F Data Warehousing & Mini MU-Set <Siction warehousing &M

Space Dlstane, J since we ha

so. looking
r and pl hav
* two ina si

oF
— there are nl

rit}
O14 space

3 62. §
o 010203 04 05 0.6

Distance matrix

pl 0

(p2, p5, p3, p6) | 0.22 | 0


0.37 | 0.15 0
p4
pl | (p2, p5, p3, p6) | p4
c. Since we have more clusters to merge, we continue to repeat Step 3,

So, looking at the last distance matrix above, we see that (p2, p5, p3, p6) aad To stop fuer
have the smallest distance from all i.e. 0.15. So, we merge those two in a sing bas t0 make ests
cluster, and re-compute the distance matrix. merging of amar |
stop clustering at th:
Space dendrogram _ 7% Complete link :
Dendrogram Distanca —
Step land step 2:
0.20 .
Refer the sing]
an Distance matrix |
ofl

0 01 02 03 04 05 06 $ 62 5 4

Scanned by CamScanner
d. since we Dave more ClustersSto
so, looking at the last dixtn
and p! have the smallest die Matrix
0.29 two in a single cluster, There nee _
Ove,

Oy §
there are NO More clusters 4, tncrge,'S no
lo floes One lefty
14

O44

yA
9 0102030405 06 x

5 4,

he he
ndition :
gorring
indicated that “each object is placed ;

* WA“ N
‘ ma Separate cluste;
ir of clusters, 9 until certai,

aA
ee
gecosest Par 0.
ain termination conditions an * and at eachSED We merge
€ Satisfied”,

Any
.
Jo stopclustering either user has to specify the number of cise
é =:

ct
at which level. Through dendrges nn Mem
3, P6) and pd
,sn ukeof clusters stop clustering
decisionat tovarious distances. If merging of clusters is at high a WE can notice the

Tk
Wo in a Single
rid
igh distance then we can

WS A
epcustering at that level.

ip tomplete link :

«<
ALiUnoas
P
Sep land step 23

% Refer the single link solution (same)

;i Dstance matrix :

4 pl |0
p2 | 0.24 | 0
p3 | 0.22} 0.1 1 ——
|5; 0
| 0.15
p4 | 0.37 | 0.20
0.28 |
p5 | 0.34 | 0.|14
25 | 0.11 | 92
p6 | 0.23 | 0.25 |00

Scanned by CamScanner
Step 3: Identify the two clusters with the shortest distance in the Matrix,
together. Re-compute the distance matrix, as those two clusters Mey
cluster. Me Roy ny ‘i new 2
Dendrogram 0
Distancg pl 0
p2
0
(p3s p6) 5

pt 0
ps :

o 0102030405 06 . pl

By looking at the distance matrix above, we see that p3 3 _ 2 |


distance from all and
fist i.e, 0.11 So, we merge those two in p6 have the My (p3, p6)
: . a Sing le clus
, ter and Te-¢o
MPU fy p4
aisH((P3. p6), pl) = MAX (dist(p3pl), , dist(p6, pl)
) ps
= MAX (0.22, 0.23)
= 0.23 as oneina _ So, looking at
“A (3, P6), p2) = MAX (dist(p3, p2), dist(p6, from all — 0.14. So
p2)) using the followin;
= MAX (0.15, 0.25)
//from original matiy
= 0.25
dist (p3, p6), p4) = MAX( dist
(p3, p4), dist(p6, p4) )
= MAX (0.15, 0.22) | /from original matt
= 0.22
dist('(p3, p6), p5) = MAX ( dist(p3,
pS), dist(p6, p5))
= MAX (0.28, 0,39)
/from original
aa
= 0.39

Scanned by CamScanner
se

.
' at the above distance
So, looking matrix ‘meta Pe 4 p5
— at p2 and .
fom all - 0.14. So, we merge those two ina Single cluster, and = seit Smallest distance
wing the following calculations. “ , Compute the distance matrix
Space : Davida

sinal matrix Distance


0.30

0.20

nal matrix
| | 0.10

x 3 6 2 5
0 01 02 03 0.4 05 06
] matrix:
|
dis (p2, p5), pl) = MAX (dist(p2, pl) dist(o5,P!))
original a
= MAX (0.24, 0-34) //from
= 0.34

Scanned by CamScanner
& Data Warehousi &M

(dist(p2, P3) dist(p2, p6), dist(p5, p3), dist(ys


dist( (p2, p5), (p3.p6)) = MAX
= MAX (0.15. 0.25, 0.28, 0.39) Mcom a
= 0.39 ‘
dist{ (p2, pS), p4) = MAX (dist(p2, pA), dist(P9. P4))
= MAX (0.20, 0.29) “tomop
= 0.29
Therefore new distance matrix is :
pl o
(p2,p5) | 0.34 9
(93, p6) | 0.23 ai :
p4 0.37 0.29 0.22 0

pl —s (2p) (3, p6)~—Sspsd


Step 5: Consider the following matrix

pl 0 |

(p2, p5) 0.34 0


(p3, p6) «| 0.23 0.39 0
p4 0.37 0.29 0.22 0

pl (p2, p5) (p3, p6) p4


The minimum distance is 0.22, so merge (p3, p6) and p4
Dendrogram Distance

0.22

1 0.14
0.11

0 01 0.2 03 04 05 086 3 6 4

Scanned by CamScanner
Therefore new distance matrix

(p2, pS)’
(P3, p6, p4)
I =!
sep 6: Now consider the follow PY ©2,P5)
ing distance matrix
(p3,p6, py
pl 0 ry
-—-—.,

(p2, p5) .=-—_—----—_——__]

0.34 | 0

(p3, p6, p4) 0.37 | 0.39 0

pl = (p2, p5)_ (p3, p6, P4)


Since the minimum distance is 0.34, merge (p2, p5) with p1.
Distance
+
Space Dendrogram

Ya

0.6 5

0.5

0.44

0 0102 03 04 05 06

Scanned by CamScanner
dist ((p2, p5, pl), (p3, p6, p4) == 0.39
03
Therefore new distance matrix

ason fo|
carsrp fos fo |
(p2, p5,p!) — (p3, p6, P4)
Finally, merge the cluster (p2, p5,p1) and (p3,p6,p4)
Space Dendrogram

rT
o 0.1
0 01 02 i 040508
3 ¢& = me
Average link: 5 dist( (p3 p6), pl.
Step 1 and Step 2 : “dist( (p3 p6), p2

Refer the single link so dist( (p3 p6), p4


lution
Distance matrix
tai(
l 3 PO), BS
Distance matrix
pl 0

p2 [0.24 0
mec
P3 | 0.22 | 0.15 0
P4037 | 0.20 | 0.15 | 9
PS | 0.34 | 0.14 | 0.28
| 9.29 0
0.231 0.25/ 0.11 | 0,
Pe [025] 025] 011 [aaa 039] | 0

Scanned by CamScanner
0 0102 03 04 05 og x
4 6

s{(6376), PL) = 12 (ist(p+3,disp1


ypspyy
) « 05x (0224929
.
ga( (93 pO), p2) = 1/2 (dis: t(p3,p2) ; .
+ dist(p6,p2)) = 05x (0.15+0.25)
, ~ 2
= 02
Gs((p8p6), p4) = 1/2 (dist(p3,p4) + dist(p6,p4)) = 05
x (0.15 +0.22)=049
fst(3 p6), p>) = 1/2 (dist(p3,p5) + dist(p6,p5)) = 0.5x (0.28
+ 0.39) = 0.34
stance matrix :

pl 0

(p3, p6) | 0.23 | 0.2 | 0 =


oa foa7fo20/o
[0 19
|
p5 | 0.34 0.14 | 0.34 29 |
_—
p

_
Scanned by CamScanner
4-123 Classifica ing & Mini
Ww Data Warehousing & Mini mot, PreK, ~ warenoes
02
w the closest
Step 4: 3:
o ‘ tthe ma
So, looking at the above distance matrix above, we see that p2 and 5 hay ” jookin® ie
distance
§ from all - 0.14, So, we merge those two in a single cluster, ang ,, on"®t i ”
distance matrix. Dendrogram ;
Space
T O25 Y
06
TO25

y
0.6 wi
o4 [Os
TO414 o4

0.4
o4 TO05

dist( (p3, P6. P4 )-


x
0 010203 04 05 06 3. 686 a
* dist( (p2, p5), pl ) = 1/2 (dist(p2,p1) + dist(p5,p1)) bit BA
= 0.5 x (0.24 + 0.34) = 0.29 ae
dist( (p2, p5), (p3,p6)) = 1/4 (dist(p2,p3) + dist(p2,p6) + dist(p5,p3)
+ dist(p5,p6)) Distance matrix :
= 1/4x (0.15 + 0.25 +0.28 + 0,39) =0.27
dist( (p2, p5), p4.) = 1/2 (dist(p2,p4) + dist(p5,p4))
= 0.5 x (0.14+0.29) = 0.22
Distance matrix :
Step 6 : So merge t!
pl 0
°
(p2, p5) | 0.29 | 0 dist( (p3 p6 p4

(p3, p6) | 0.22 | 0.27 0


p4 037/022 |o1s lo
Pl (p2, pS) (p3, p6)_p4
Since, we have merged (p2, p5) together in a cluster, we now have one et
(p2, p5) in the table, and no longer have p2 or p5 separate
ly.

Scanned by CamScanner
oO
w the closest clusters
CreRed,

ae th Sma gokingSpaatce Ee maces distance hety,


en,
OMY leg,

Gas

0.29

O15
0.14

9.19
os 0 01 02 03 0405 o¢ x

) dist( (P3+ p6, p4 ), (P2, p5)) = 1s s (dist(p3,p2) + dist(p3,ps) 4 dist(p6,p2)


+ dist(p 6,p5) + dist(p4.p2) + dist(p4,p5))

= 6X (0.15 +028 4025


4 0.39 +0.20 + 0.29) =0.26
dist( (p3, PO, P4), Pp!) = 1/3 x (dist(p3.p1) + dist(p6,p1) + dist(p4.p1))
= 1/3 x (0.22 + 0.23 +037) = 0.27

pl e
).27 ||
(p2, p5) 0.24 | 0
|0 |
(p3, p6, p4) [0.27 [0.26
P4)
pl _(p2, p5) (p3, p6,
and p 1,
Sep6: So merge the cluster (p2,p5) (p3,p!)
= 1/9x (dist(p3
p2)+ dist(p3,p5) + dist
dist( (p3 p6p4 ), (p2,p5.P1)) 6 p5) + dist(p6,pl)
+ dist(p6 2) + dist(p
we p5) + dist(p4+P))
039 +023

sntry £0"

Scanned by CamScanner
o 01 0203 04 05 06
Distance matrix : 5

(p2, p5,p1) 0 dyoo = min (4,

(p3, p6, p4) | 0.26 0 dyz4 = min {d

(P2, p5,p1) (p3, p6, P4) dans


‘We need to re-compute the distance from
= min {d
all other points
/ clusters tO our te,
cluster - (p2, p5,p1) to (p3,p6,p4) and
enter the maximum di Stance in
original distance matrix). the above Matrix (yg
vs . 23 4
Finally merge the cluster (p2
, p5,p1) and (p3,p6,p4).
Space
Dendrogram
t 9
3 0

109 7 ¢
5] 9 8 5 .

dio;
da:

Scanned by CamScanner
ayo =i {di3,do3} = min (6, 3) =3
dua = in {dj,4424} = min { 10, 9} =9
e
dys = min {d,,5,d, 5} = min
{9, 8} =8 ©
ve Matrix (use 2 . ry 8
123 45 (1,2) 3-4 ‘

0 (1 2) 0
| (i, ’ (1, 2,3) 4 5

25 3 | 3.0 4 <6
g6 3 0 > 4) 9 70 > 51 5 49
° 09 7 0 51 8 540
i985 4 0 |

di2s4 = min {dards} = min (9, 7}=7 5


Grays = min {dy 25,435} = min (8,5} =5

Scanned by CamScanner
¥‘ Data Warehousing & Mining (MU-Sem. 6-Com .) 4-127 Classifica, ‘on, Predictign ‘

Step3:

123465 (i,2)
3 4 5 (1,2,3) r

a,2) | 0 (1,2,3)] 9 an | c
0 : 3. 0 4 _
3} 6 3 0 => 4 /9 7 0 ~ * Le | se?
109 7 0 5 8 3 6
4 0
5} 9 8540 1
| 12 °
{
3} 6 3 |
(1,23) 46 4) 10 9
(1,2,3) | 9 5) 9 8
(45) | 5 4 da
dazaas) = min {di12.3),4,dq,2.35}= min{ apes

Step 3: ;

s | 1
.
1] O

2 0
(ll) Complete link
3} 6 3
Step 1:
4110 9
P2449 (1,2) 3.4 5 5 9 g
1/0 ‘
1,2, 3),(4,5

4/10 9 7 9 s1 9 540
519 854 0

rr

Scanned by CamScanner
5 4 0
danas = MAX {daa 4dao5}= max (10, 9} hg
dyc4s) = Max {d54,d,5} = max {7,5} =7

step 3 *
1, 2 (1,2) 3 4 5 (1,2) 3 (4,5)
if 0 (1,2) | 0 (2) | 0
sta 0. | 3| 6 0 3 | 6 0
> 4 | 10 70 => (45){ 10 7 0
116 3.0
i|10 9 7 0 5 | 9 a4%
s}9 8 5 4.0
dha, 3) ¢4,5) = max {daa aspdsa.s} =. mt {

Scanned by CamScanner
(iii) Average link
Step 1: /a4as (2) 3 4 5 2

tl 0 (2); 0
2/2 0 a) ee 0 4| 9
a 95 7 0 ; 2 oO
3} 6 3 0 =
4/10 9 7 0 5 85 5 4 0 Je 28
5/9 8 5 4 0 Z 10 9 (7 0

1 6+3 5 4
dang = 7 dist hs) ="9 =F 5L? °= 61
1 1049 dga.sntt-® Gia +

daz = 5 dis + dy 4) = 2 = 9.5

1 9+8 @
days = 9 dist zs)=—9 =8.5 pa)
Step 2: ;
12345 (1,2)
3 4 5
1; O (1, 2) 0 (1,2) —————
2)2 2-0 "sige —x.4.9.3: Uset
3 45 0 3 link a
316 30 > 4/95 70 => (45)
4,10 9 7 0 5 85 5 4 0
5} 9 85 40 |
a2
das) = | (det dys + dag + dys}
1/4 { 10 +94 948}
I

ond 9 .
Soin. :

1
dys) = 7 {daat dss) =6

Scanned by CamScanner
(1,2,3) 0 2
4,

(4,5) 8 ‘

Scanned by CamScanner
¥+ pata Warehousing & Mining (MU-Sem. 6-Comp.).) 4-131 Classi
assification, Prog "
; °

For simplicity we can find the adjacency matrix which gives distances of al
each other. Using Euclidean distance we have Sbise, .
Di(i,j) = ix, — x1 +ly,-y,P 4

D(A.B) = V2-3) +(2-2)' =1,


Similarly we can compute for the rest.
| [a |B [ce [pv Je
fa] 0
[B] 1 | 0
fo] 141[ 224] 0 ay complete
[p{14i] 1 | 2 | 0 : Closes
[E | 1.58/ 2.12] 0.71 | 1.58] 0 on
() Singlelink
a
Step 1: Since C, Eis minimum we can combine
clusters C,E

Step2: Now
twoc

Step 3: Clus
|
valu

Step4s in
(Cc)

Scanned by CamScanner
complete link
a
1! Closest clusters are Merged y,
. . Te the
the maximum dis tance between és e dia:
Stance

Step2: Now A and B is hav


ing minimy
two clusters,

’ clusters,

Sep3: Cluster (A,B) and D can be merged together as they are having minimum distance
value. .

1um distance

(C.E).

Scanned by CamScanner
5 nousing & Mining (Mt
Classification rel
¥ pata Warehousing & Mining (MU-Sem. eaten Stodiction 4 i¢ oi pe Dene

Final dendrogram

2.24

1.44
1
0.71
c
1Cc
A BOD
matrix
a, OF _
following data and pita doc
Ex. 4.9.4: Discuss the agglomerative ¢ algorithm using = gn &
contains
using single link approach. ‘The following figure Sa (EA) B) = MD
indicating the distance between ane Seren Sai
ae xe ee Me ~
if eid fj,
one Paty A] he].

Step 2 3 Consider the ¢


Soln.:
Given : Since B,C dis

Distance matrix
EACBOD
E/0
A/1/0
C}2/2)0

B/2/5/1
D/3;3 16/3 /0

Step 1: From above given distance matrix, E and A clusters are having minimum distance ° Distance matrix
merge them together to form cluster (E, A).

Scanned by CamScanner
te at 134 Classification, Prediction & Clustarini¢
ing & mining
Dendrogram
Distance

2
MIN (dist(E.C), dist(A.0) = MIN (22) =
- MIN (dist(E.B), dist(A,B)) = MIN (2,5)
ma 2
‘ _

a
:

) =3
jE? »P MIN (dist(E.D), dist(A,D)) = MIN (3,3
| BA cB D
gc E> P

e A ,
ae .
7
) » d i s t ( C B), dist(C A))
ix : gist(B A
Distance matr
9, 2) =*
* MIN (2.5
acca
Scanned by CamScanner
& Mining (MU-Sem. 6-Comp. Classification, Pre
¥F pata Warehousing
tion &
dist( (BC), D) = MIN (dist(B,D), dist(C D))
= MIN(G,6) =3
EA B,C D
[r,A} 0
[ B,C 2
jp | 3 | 3 Jo
Step 3: Consider the distance matrix obtained in step 2 (given above)

Since (E,A) and (B,C) distance is minimum, we combine them


Dendrogram
D stance ‘
sol a ; elena
step 1*
2 |—__—

[1
E A
TI
Bo oc! | =
dist((E A), (BC)) = MIN (dist(E,B), dist(E C), dist(A,B), dist(A
C))
= MIN (2, 2,2, 5,2) = 2 | Hom dhe: ak
E,A,B,C D distance-is betwe
E, A, B,C 0 and is called the
:
Step 2: Calcul:
D
2/0
St ep 4: : FinFi ally combini e D i Therefore,
with (EABC)
: Final dendrogram Déndrogram dis(C37,4) :
:
Similarly, ¢
Distance
fe

Scanned by CamScanner
Following table .,
clustering ang oo0V8S
on

gol

adis called the C37. i ingle point (cluster)

(14.25
= min4) 14.25
) , 18.1= 9)
tos
dis(C37,4 ) == minmi ( dis2 (3,4), dis(7,

iara [ete
ar.
ys ae the.other distances to get the distance matrix.

91.18 | 39.00 | 45.46 | 5.08


6.77e| 5.03 35.5 4
15.52 | P ee

Q | 14.25 | 20.28 | 10 |

Scanned by CamScanner
In the above matrix distance
between paints 2
cluster as C25,

Step 3 > Calculate the new distance


matrix with C25 using single
link age “ls
tering
| Cluster number | cl | C25 | C37
Cl | 0 | 40.6
| 21.218
Fr C25 | | o [145.52
L C37 !

az. L_ |
In the above matrix dist.
ance between CI and
formed new cluster is C6 in minimum Which js 5.08,
C16.
sea 9 ney,
Step4: Calculate the new distan
ce matrix with cluster
C16,
Cluster number | C16 | C25_
cies | o [3554 —qa6: Sup
C25 r hays

The minimum distan


ce is 6.77, So comb
ine clu
Step 5: Calculate new distan
ce matrix.
| Cluster number [C16|
C254 C37.

Scanned by CamScanner
Clasaification, Pradintlan 4 Clyatarire

16.11 16

4.257 44

12
+10

671
t 6
28ST
4.01t 4

6 4 2 6&6 3 7
= 7

6 objects (with name A, B, C, D, E and F) and each object


pose we have X1 : an and X2 ).
Sup features(X1
406 ;; have two meas ured

, x1 x2 5
a(t. 45 1) 4

- 15
c J§ s
3 4 2
D /
eE | 4 4
F 3 35
0
0 2 4 6
and draw Dendrogram.
Apply Single linkage clustering
rix of size 6 by 6.
win.: We have given an input distance mat

A B c D E F
Dist
“| 0.00| 66:| 3.61 | 4.24 | 3.20
0.71 | 0.00 | 4.95 | 2.92 | 3.54 | 2.50.
BP

5.66 | 4.95 | 0.00 | 2.24 | 1.41 | 2.50


3.61 | 2.92 | 2.24 | 0.00 | 1.00 | 0.50
seo

4.24 | 3.54 | 1.41 1,00 | 0.00 | 1.12


3.20 | 2.50 } 2.50 | 0.50} 1.12 | 0.00

| ‘
| ae
Scanned by CamScanner
We summarized the results of computation as follow :
1. «In the beginning we have 6 clusters: A, B, C, D, E and F, 50 in
5
2. We merge cluster D and F into cluster (D, F) at distance 0,50.
4
3. We merge clusterA and cluster B into (A, B) at distance 0.71,
3
4. We merge cluster E and (D, F) into ((D, F), E) at distance 1.00,
i
5 We merge cluster ((D, F), E) and C into (((D, F), E), C) at distance 1.41,
4
6. We merge cluster (((D, F), E), C) and (A, B)
6
into (((D, F), E), C), (A, B)) at 2.5
distance 2.50. 20 acy Mat rix
7. The last cluster contain all the objects, thus ts pdjace’
. conclude the computation. ,
Using this information, we can now draw the Te
final results of a dendrogram.
05

Using Euclidean D
x1 x2 .
Aled 94 Dii,j)
B 1.5 15
Sue 3
D
84s D(A,B)
Ee |4 ay Distance matrix:
F 3 35

0 2 4 8 Step1: Identify 'th


— — = the matrix

f ~ © technique
{ala set given. Create
to cluster the adjacency
the given matrix.
data, Draw the Use singe bt
dendrogram.
distance 1
dititn- dis
So ele a

Scanned by CamScanner
Gh as Dend
D(i,j) Ix2—xlI +ly2-yil rogram neve

D(A,B) = 2, similarly we can compute for the rest,


Distance matrix :

Step 1: Identify the two clusters with the shortest distance in ‘


the matrix, and merge them together. Re-compute the
are now ma
distance matrix, as those two clusters |
er exist by themselves).
single cluster, (no long
the smallest dismte
that C and B have
pute the distance matrix.
By looking at the distance matrix above, W°cluster, am d re-com
merge those two ina single
fom all - 1 So, we
OI Stance matrix un Gin 1) dist(CA))= 141

dist( (B.C), A) =

Scanned by CamScanner
¥F bata Warehousing & Mining (MU-Sem. 6-Comp.) 4-141 __Classification, Prediction :

Similarly calculate for other objects.


A |BC;
D JE
A 10
B,C | 1.41] 0
D | 3.60 | 2.24] 0
BE |5 13.60} 1.41/0
Step 2: Consider the distance matrix obtained in step 1 (given above)
Since E,D distance is minimum, we combine E and D
Dendrogram Distance

. 1.44

BC E oD
New Distance matrix :

| AT BC | DE:
A_|0
B,C | 1.41 | 0
D,E | 3.60 | 2.24 | 0
Step 3: Consider the distance matrix obtained in step
2
Since (B,C) and (A) distance is minimum, we combine them
Dendrogram Distance.

rif]
B CAE
D |:
or
_u
Scanned by CamScanner
MU-Sem. 6-Comp.) _ 4-142 ___ Classification, Prediction & Clustering

A,B,C | DE

A,B,C | 0

D,E 2.24 0

ne A,B,C and DE
| combi

rogram Dendrogram Distance

2.24

1.41

B
| CA E OD
|
(based on distance formula)
ari son of the three method: s
comp

\ single-link :

“Loose” clusters.
al shapes.
strength : Can handle non-elliptic NH
a * aes of

- : Pa OT
ye eM Tee
ws, « sf. et
wn!
Pe:
a eats : we, * be
oud A
ee.LS "4"
tt ‘es
ae ke
ae Ee e =28 * * .* g s wre *.
we He e sae ot ef es
os "AE Eo ost 5 Te
tye
+ -_— o* -
aeeet? ff
seat

Orginal points Two clusters

Limitations : Sensitive to noise and outliers.


. -
ata
2
@
. . 4
ry .
* . i.e s

.? .* ee sates * Tee ’ nee "4 :


ons m, ee eee “ene gt
€ “pes
"2h es Se Same
eg -Te Se » Poy Bee
ne 8
oS. ow Me Pats fe ae, ‘ “eat Pew a
oe ‘ie alhe _ eer’ 2
"4? et so. 4s 17s ow eet
= oPuyt oe sm 8 ye * te
sath « -_, wv
a rT) “te
i “es eae 5 ee
e * * of . Se
ees
* ”
.
*
:.
=
7
= wea?

a ee> ° *, Me *5 ea
° * . = ae

Original points Two clusters

a i
Scanned by CamScanner
I OO >a
a

Data Warehousing & Min g (MU-Sem.6-Comp.)_4-143.


oe

Classification, Pradiction A
2. Complete-link :

“Tight” clusters.

te 4
'
< art
Le "te .
2 “a efe we Leas?
sha, ates, fe i oats Bas
oe a. see Meee3; ‘
ha 70k toe ‘ha "y as fe mt f
, a * “, ". "y eee "
af
re en a , ooh fas
oe abe oo * & 6 te soi ‘
‘ reheys
oe fot geet
ee ee ae . ‘ oe 1 a a. 4
* ee o? at's “erewe
meet, * 6 tte rie . ee, 6 .
get
#eFt me . oP it
* tat “eo a,
. . #. ‘ fant “Pent
. ate : * ots .

Original points » Two clusters

Limitations:

- Tends to break large clusters.


- Biased towards globular clusters.
. oe * C * ataia
«Poe
. weet “Ff
es, Pet out et Tate
. tae ee
at"s ‘ oh i oe
ot Pe) . oa eye hee rons, "¢ 3: ..
; _ . * tet 8 = t : "nih * Pst
ct 7, * 2 o * * we re ate
. ‘ - . aah ae fi * phy * . “e+ uo « ah i
* fe ot.
air Seton tk
ents * a abe
ne vite
fer
ret ee te
sae “ a we
* " oe wd Seg "he

Original points Two clusters


3. Average-link :

- “In between”.
- Compromise between Single and complete link.
~ Strengths : Less susceptible to noise and outliers.
Limitations ; Biased towards globular clusters.
Note: Which one is the best? Depends on what you need! ine
ele we
Agglomerative algorithm given by Margaret H. Dunham:

Input :

D= Ustaewitd
i set af elements
Aff Adjacency matrix showing distance between elements

Scanned by CamScanner
_aa
pl gj) Dendrowrar® TePtosented ag g gai 2
om ative algorithm;

p24

en,

52 <d, k, K
pb > i Initially dendrp

pepe
Oldk = k;
d=dt1;

for each pair of KK, EK do


ave = average distan
ce hetwee
ifave <= q, then Mall ey
vk Nandy kan

B= K~ {KD
1K y {KiuK}
K = oldk-~]

DE=DEU<q
*K, K> 1 New set
Uilk=2o f
MSs edo dnd
492 Divisive Hie
rarchical Clusterin
-

- Th

, Divisive Methods are not


generally available, and
G NES (AGglome rarely have been applied.
ra tive NESting) an
d DIANA (Divisive
ANAlysis)
Agslomerative and divi
sive hierarchical Clusterin
g on data objects {1,2,3,4,5}.

Scanned by CamScanner
te,,
n _ Prediction & Clus
catio
cl assifi

Divisive
(DIANA)

i c a o q ] c clustering
hiera r c h
ve
ive v/s pivisi
9.6 a
6 * A g glomerat
g. 4-9-
iFi”
rmative.
c t u r e th at is more info
q stru
Advantages hi erarchy;

n t s th e o utput as 4
Is simp] e and represe

It does not
split, it
o u p of © bj ects is merged or
Disadvantages @ gr
t po in ts is cr itical as once nd o wh at wa s done previously.
rge or sp li t u
Selection of me ne ra te d cl us ters and will no
will operate on
the new ly ge
le ad to lo w -quality clusters.
not sen may
well ¢ ho
me rg e or sp Jit decisions if
_ Thus
ve
n ag gl om er at ive and divisi
Differ ence betw ee
—<—$——]

Sr. Agglomerative
No.
l-inclusive cluster.
l clusters. Start with one, al
Start with the points as individua ch
cluster until ea
l.
At eac h ste p, spl it a
of
2, At each step, merge the closest pair nt (or there a
clusters until only one cluster (or k cluster contains a poi
k clusters).
clusters) left.

4.9.3 BIRCH (Balanced Iterative Reducing and Clustering using Hierarchies)

- It is a scalable clustering method.
- The BIRCH technique places significant emphasis on the scalability of clustering for very large data sets.
- The technique is based on the notion of a Clustering Feature (CF) and a CF tree, which is a height-balanced tree.
- A clustering feature summarizes a sub-cluster of data points (vectors) as a triple

  CF = (N, LS, SS)

  where
  N  = Number of data points in the cluster,
  LS = Linear sum of the N points,
  SS = Sum of the squares of the N points.

- A CF tree is a height-balanced tree whose non-leaf nodes store the sums of the clustering features of their descendants, i.e. they summarize the clustering information of their children.
- The CF tree structure is given as below :
  o Each non-leaf node has at most B entries.
  o Each leaf node has at most L CF entries, each of which satisfies a maximum diameter (or radius) threshold.
  o P (page size in bytes) is the maximum size of a node.
  o Compact : Each leaf node is a sub-cluster, not a data point.
Detailed algorithm as a flow-graph :

Data
  Phase 1 : Load the data into memory by building a CF tree.                          (Initial CF tree)
  Phase 2 (optional) : Condense into a desirable range by building a smaller CF tree. (Smaller CF tree)
  Phase 3 : Global clustering.                                                         (Good clusters)
  Phase 4 (optional and off-line) : Clustering refinement.                             (Better clusters)

Fig. 4.9.8 : Flow chart of BIRCH algorithm

- After Phase 1 no more I/O is needed ; the result is accurate (outliers are separated) and less order sensitive.
- Phase 4 (cluster refining) makes additional passes over the data set, using the centroids produced in Phase 3 as seeds and redistributing each data point to its closest seed. This fixes the problem with CF trees where copies of the same data point can end up in different leaf entries ; the refinement always converges to a minimum and allows more outliers to be discarded.
Example :

CF = (N, LS, SS), where
N  : Number of data points,
LS = Σ Xi (i = 1 .. N), the linear sum of the points,
SS = Σ Xi² (i = 1 .. N), the sum of the squares of the points.

For the five points (3, 4), (2, 6), (4, 5), (4, 7), (3, 8) :

CF = (5, (16, 30), (54, 190))
N  = 5
LS = (16, 30), i.e. 3 + 2 + 4 + 4 + 3 = 16 and 4 + 6 + 5 + 7 + 8 = 30
SS = (54, 190), i.e. 3² + 2² + 4² + 4² + 3² = 54 and 4² + 6² + 5² + 7² + 8² = 190
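As a quick check of this arithmetic, the short Python sketch below (my own illustration, not from the text) computes the CF triple for the five points and also shows the additivity property that BIRCH relies on when CF entries are merged in the tree.

import numpy as np

def clustering_feature(points):
    # BIRCH clustering feature CF = (N, LS, SS) of a set of points
    pts = np.asarray(points, dtype=float)
    n = len(pts)                  # N  : number of points
    ls = pts.sum(axis=0)          # LS : linear sum, per dimension
    ss = (pts ** 2).sum(axis=0)   # SS : sum of squares, per dimension
    return n, ls, ss

print(clustering_feature([(3, 4), (2, 6), (4, 5), (4, 7), (3, 8)]))
# (5, array([16., 30.]), array([ 54., 190.]))

# Additivity : the CF of two merged sub-clusters is the element-wise sum of their CFs
n1, ls1, ss1 = clustering_feature([(3, 4), (2, 6)])
n2, ls2, ss2 = clustering_feature([(4, 5), (4, 7), (3, 8)])
print(n1 + n2, ls1 + ls2, ss1 + ss2)   # same triple as above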
Advantage : Finds a good clustering with a single scan and improves the quality with a few additional scans.

Weakness : Handles only numeric data, and is sensitive to the order of the data records.

Complexity : Algorithm complexity is O(n), where n is the number of objects to be clustered.

Practical use of BIRCH :

1. Pixel classification in images :
   - From top to bottom : BIRCH classification, visible wavelength band, near-infrared band.
2. Image compression using vector quantization :
   - Generate a codebook for frequently occurring patterns.
   - BIRCH performs faster than CLARANS or LBG, while getting nearly as good quality.
3. BIRCH works with very large data sets.
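For readers who want to try BIRCH on real data, scikit-learn ships an implementation. The snippet below is only a minimal usage sketch with illustrative data and parameter values, not something prescribed by the text.

import numpy as np
from sklearn.cluster import Birch

X = np.array([[3, 4], [2, 6], [4, 5], [4, 7], [3, 8],
              [20, 21], [22, 19], [21, 20]], dtype=float)

# threshold        ~ maximum radius of a leaf sub-cluster (the CF-tree threshold)
# branching_factor ~ maximum number of CF entries per node (B / L above)
model = Birch(threshold=3.0, branching_factor=50, n_clusters=2)
print(model.fit_predict(X))        # cluster index for each point
print(model.subcluster_centers_)   # centroids of the leaf sub-clusters (CF entries)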

Limitations :

- Clusters once combined cannot be split later ; since merge decisions are never undone, a big cluster cannot subsequently be broken into smaller ones.
- Because BIRCH makes its clustering decisions in a single pass, the result still depends to some extent on the order of the input data.

University Questions and Answers

Q. 1 What is K-means Clustering ? Consider the k-means algorithm with two clusters and apply it to the following data set : {4, 0, 42, 14, 13, 20, 3, 90, 14, 25, 31}.
(Ans. : Refer section 4.8 and Ex. 4.8.3) (10 Marks)
Q. 2 What is Clustering Technique ? Discuss the Agglomerative algorithm with the following sample data items indicating the distance between them, and plot a Dendrogram using single link.
(Ans. : Refer sections 4.5.2, 4.9.1 and Ex. 4.9.4) (10 Marks)
Dec. 2010

Q. 3 What is Clustering ? Explain the K-means clustering algorithm. Suppose the data for clustering is {2, 4, 10, 12, 3, 20, 30, 11, 25} ; consider two clusters and cluster the data using the above algorithm.
(Ans. : Refer section 4.5.1 and Ex. 4.8.1) (10 Marks)

Q. 4 Explain the Partitioning Method for Clustering. (Ans. : Refer section 4.8) (10 Marks)
May 2011

Q. 5 Consider the data set given as an adjacency matrix. Use the single link agglomerative technique to cluster the given data. Draw the dendrogram.
(Ans. : Refer Ex. 4.9.4) (10 Marks)

Dec. 2011
Q. 6 Write short notes on : Similarity and distance measures in data mining.
(Ans. : Refer section 4.7) (10 Marks)

May 2012

Q. 7 Explain what is meant by clustering. State the various types with an example for each and explain.
(Ans. : Refer sections 4.5.1 and 4.5.2) (10 Marks)
Q. 8 Give examples of applications that use clustering. Describe any one clustering algorithm with the help of an example.
(Ans. : Refer sections 4.5.1 and 4.8.1) (10 Marks)

May 2013

Q. 9 Illustrate how a supermarket can use clustering methods to improve sales. (5 Marks)
Ans. :

- Clustering a store's shoppers gives a laser focus that cuts down on wasted advertising and selling costs, and helps in allocating shelf space and promotions to the lines that matter to those shoppers.
- Knowing the demographic profile of a store's shoppers means that targeted product offerings and promotions can be planned for the retailer's key shopper groups and their different life stages, encouraging them to shop more often or to spend more.
- Shopping-mission profiles reveal details about how customers use the store : quick top-up trips, lunch-time shopping or weekend shopping expeditions.
- An example of the way this information can be used : if many of a store's shoppers are shopping after work, convenience foods and ready meals should be located near the entrance, with related products placed next to them.
- If shoppers are treating the store as a convenience store instead of a supermarket, placing bread and milk with related impulse lines in the surrounding area will also help increase the basket and the margin on these shopper missions.
Q. 10 Apply Agglomerative Hierarchical Clustering and draw the single link and complete link dendrograms for the following distance matrix :

      A   B   C   D   E
  A   0   2   6  10   9
  B   2   0   3   9   8
  C   6   3   0   7   5
  D  10   9   7   0   4
  E   9   8   5   4   0

(Ans. : Refer Ex. 4.9.2) (10 Marks)

Q. 11 Write detailed notes on K-Means Clustering. (Ans. : Refer section 4.8.1) (10 Marks)

Dec. 2013

Q. 12 Give five examples of applications that can use clustering. Describe any one clustering algorithm with an example.
(Ans. : Refer sections 4.5.1 and 4.8.1) (10 Marks)

May 2014

Q. 13 Describe the working of the K-Means clustering algorithm with the help of a dataset.
(Ans. : Refer section 4.8.1 and Ex. 4.8.1) (10 Marks)

Q. 14 Write detailed notes on Hierarchical clustering methods.
(Ans. : Refer section 4.9) (10 Marks)

Dec. 2014

Q. 15 (Ans. : Refer Q. 10 of May 2013) (10 Marks)

Q. 16 Explain hierarchical clustering methods.
(Ans. : Refer section 4.9) (5 Marks)

May 2016

Q. 17 Write detailed notes on K-Means clustering.
(Ans. : Refer section 4.8.1) (10 Marks)

Chapter Ends

Module 5

Mining Frequent Patterns and Association Rules

Syllabus :

Market Basket Analysis, Frequent Item sets, Closed Item sets and Association Rule, Frequent Pattern Mining, Efficient and Scalable Frequent Item set Mining Methods : Apriori Algorithm, Association Rule Generation, Improving the Efficiency of Apriori, FP growth, Mining frequent Itemsets using Vertical Data Format, Introduction to Mining Multilevel Association Rules and Multidimensional Association Rules.

Syllabus Topic : Market Basket Analysis

5.1 Market Basket Analysis
→ (MU - May 2012, Dec. 2013)

5.1.1 What is Market Basket Analysis ?

- Market basket analysis is a modelling technique, also called affinity analysis ; it helps in identifying which items are likely to be purchased together.
- The market-basket problem assumes we have some large number of items, e.g. “bread”, “milk”, etc. Customers buy a subset of the items as per their need, and the marketer learns which things customers have taken together. The marketers use this information to decide the placement of items in the store.
- For example : someone who buys a packet of milk also tends to buy bread at the same time :

  Milk => Bread


- Market basket analysis algorithms are straightforward ; difficulties arise mainly in dealing with large amounts of transactional data, and after applying the algorithm it may give rise to a large number of rules which may be trivial in nature.

5.1.2 How is it Used ?

- Market basket analysis is used in deciding the location of items inside a store. For e.g., if a customer buys a packet of bread he is likely to buy a packet of butter too ; keeping the bread and butter next to each other in a store would result in customers getting tempted to buy one item with the other.
- The problem of a large volume of trivial results can be overcome with the help of differential market basket analysis, which retains the interesting results and eliminates the large volume.
- Using differential analysis it is possible to compare results between various stores, and between customers in various demographic groups.
- Some special observations among the rules can be valuable : for e.g., if there is a rule which holds in one store but not in any other (or vice versa), then it may be really interesting to note that there is something special about the way that store has organized its items, may be in a more lucrative way. These types of insights will improve company sales.
- Identification of sets of items purchased or events occurring in a sequence, something that may be of interest to direct marketers, criminologists and many others ; this approach may be termed Predictive market basket analysis.

5.1.3 Applications of Market Basket Analysis

Credit card transactions done by a customer may be analysed.


Phone calling patterns may be analysed.
Fraudulent Medical insurance claims can be identified.
For a financial services company :

o Analysis of credit and debit card purchases.


o Analysis of cheque payments made.
o Analysis of services/products taken e.g. a customer who has taken executive credit
card is also likely to take personal loan of $5,000 or less. .
For a telecom operator :
o Analysis of telephone calling patterns.

o Analysis of value-added services taken.

- For a retail store, market basket analysis can be applied in various ways, considering the items bought together at a point in time or over a period of, let us say, six months :
  o Special combo offers may be offered on the sets of products bought together.
  o Items may be placed nearby inside the store so that customers get tempted to buy one product with the other.
  o The layout of the catalogue of an e-commerce site may be designed accordingly.
  o Inventory may be managed based on the purchase patterns identified.

Syllabus Topic : Frequent Item sets, Closed Item sets and Association Rule

5.2 Frequent Item sets, Closed Item sets and Association Rule

5.2.1 Frequent Itemsets

- Frequent itemsets are those items that appear in at least a fraction s of the baskets, where s is a chosen constant, typically with a value of 0.01 or 1 %.
- To find frequent itemsets one can use the monotonicity principle or a-priori trick, which is given as :
  If a set of items say S is frequent, then all its subsets are also frequent.
- The procedure to find frequent itemsets :
  o A level-wise search may be conducted to find the frequent 1-itemsets (sets of size 1), then proceed to find frequent 2-itemsets and so on.
  o Next, search for all maximal frequent itemsets.
5.2.2 Closed Itemsets

- An itemset is closed if none of its immediate supersets has the same support as the itemset.

- Consider two itemsets X and Y such that X ⊂ Y, i.e. Y is a proper super-itemset of X : every item of X is in Y, but there is at least one item of Y which is not in X. If such a Y has the same support as X, then X is not closed.
- A closed frequent itemset is an itemset that is both closed and frequent.
- An itemset is maximal frequent if none of its immediate supersets is frequent.
- In other words, an itemset X is maximal frequent if there exists no super-itemset Y such that X ⊂ Y and Y is frequent.

Example :
Fig. 5.2.1 : Lattice diagram for maximal, closed and frequent itemsets

Let us consider minimum support = 2.

- The itemsets that are circled with thick lines are the frequent itemsets, as they satisfy the minimum support. In Fig. 5.2.1, the frequent itemsets are {p, q, r, s, pq, ps, qs, rs, pqs}.
- The itemsets that are circled with double lines are closed frequent itemsets. In Fig. 5.2.1, the closed frequent itemsets are {p, r, rs, pqs}. For example, {rs} is a closed frequent itemset as all of its supersets {prs, qrs} have support less than 2.
- The itemsets that are circled with double lines and shaded are maximal frequent itemsets. In Fig. 5.2.1, the maximal frequent itemsets are {rs, pqs}. For example, {rs} is a maximal frequent itemset as none of its immediate supersets like {prs, qrs} is frequent.
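A small Python sketch (my own illustration, not from the text) that applies these definitions : given the support counts of all frequent itemsets, it picks out which of them are closed and which are maximal. The support values below are assumptions chosen to be consistent with the sets read off Fig. 5.2.1.

freq = {                                      # {frequent itemset : support count}
    frozenset("p"): 3, frozenset("q"): 2, frozenset("r"): 3, frozenset("s"): 2,
    frozenset("pq"): 2, frozenset("ps"): 2, frozenset("qs"): 2, frozenset("rs"): 2,
    frozenset("pqs"): 2,
}

def is_closed(itemset, freq):
    # Closed : no frequent proper superset has the same support
    return all(not (itemset < other and freq[other] == freq[itemset]) for other in freq)

def is_maximal(itemset, freq):
    # Maximal : no frequent proper superset exists at all
    return all(not itemset < other for other in freq)

print([set(i) for i in freq if is_closed(i, freq)])    # {p}, {r}, {r, s}, {p, q, s}
print([set(i) for i in freq if is_maximal(i, freq)])   # {r, s}, {p, q, s}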

5.2.3 Association Rules
→ (MU - May 2013)

- Repositories of data or objects such as relational databases, transactional data and other information repositories are considered.
- We search them for interesting relationships : frequent patterns, associations or correlations among items.
- This knowledge can be used in advertising and in deciding which items are commonly purchased together and where to place them in stores.
- Association rules have the general form X ⇒ Y, where X and Y are sets of items.
- The rule should be read as : “Given that someone has purchased the items in the set X, they are likely to also buy the items in the set Y.”

Large Itemsets

- If some items often occur together, they can form an association rule.

Support

- The support of an itemset is the count of that itemset in the total number of transactions ; in other words, it is the percentage of the transactions in which the items appear.
- For a rule A ⇒ B :

  Support (A ⇒ B) = (# tuples containing both A and B) / (total # of tuples)

- The support (s) for an association rule X ⇒ Y is the percentage of transactions in the database that contain X ∪ Y, i.e. X and Y together.
- An itemset is considered to be a large itemset if its support is above some threshold.

Confidence

- The confidence or strength for an association rule A ⇒ B is the ratio of the number of transactions that contain A ∪ B to the number of transactions that contain A.
- Consider a rule A ⇒ B ; it is measured as the ratio of the number of tuples containing both A and B to the number of tuples containing A :

  Confidence (A ⇒ B) = (# tuples containing both A and B) / (# tuples containing A)
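Both measures are easy to compute directly from a transaction list. The following short Python sketch is an illustration (not from the text) ; the transactions and the rule milk ⇒ bread are assumed sample data.

transactions = [
    {"milk", "bread", "butter"},
    {"milk", "bread"},
    {"bread", "jelly"},
    {"milk", "coke"},
]

def support(itemset, transactions):
    # fraction of transactions that contain every item of the itemset
    return sum(1 for t in transactions if itemset <= t) / len(transactions)

def confidence(A, B, transactions):
    # support(A ∪ B) / support(A) for the rule A => B
    return support(A | B, transactions) / support(A, transactions)

A, B = {"milk"}, {"bread"}
print(support(A | B, transactions))    # 0.5   : support of the rule milk => bread
print(confidence(A, B, transactions))  # 0.66… : confidence of the rule milk => bread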
Finding the large itemsets

1. The Brute Force approach
- Find all the possible association rules.
- Calculate the support and confidence for each rule generated in the above step.
- The rules that fail the minsup and minconf thresholds are pruned from the above list.
- The above steps would be a time consuming process ; we can have a better approach, as given below.

2. A better approach
- The Apriori Algorithm.

Syllabus Topic : Frequent Pattern Mining

5.3 Frequent Pattern Mining

Frequent pattern mining is classified in various ways, based on the following criteria :
1. Completeness of the patterns to be mined : Here we can mine the complete set of frequent itemsets, closed frequent itemsets, or constrained frequent itemsets.
2. Levels of abstraction involved in the rule set : Here we use multilevel association rules based on the levels of abstraction of the data.
3. Number of data dimensions involved in the rule : Here we use single-dimensional or multidimensional association rules, depending on whether the rule refers to one or more dimensions.
4. Types of the values handled in the rule : Here we use Boolean and quantitative association rules.
5. Kinds of the rules to be mined : Here we use association rules and correlation rules, based on the kinds of the rules to be mined.
6. Kinds of patterns to be mined : Here we use frequent itemset mining, sequential pattern mining and structured pattern mining.

Syllabus Topic : Efficient and Scalable Frequent Itemset Mining Methods - Apriori Algorithm

5.4 Apriori Algorithm : Finding Frequent Itemsets Using Candidate Generation
→ (MU - Dec. 2011)

- The Apriori Algorithm solves the frequent itemset generation problem.
- The algorithm analyzes a data set to determine which combinations of items occur together frequently.
- The Apriori algorithm is at the core of various algorithms for data mining ; the best known problem it addresses is finding the association rules that hold in a data set.

5.4.1 Basic Idea

- An itemset can only be a large itemset if all its subsets are large itemsets.
- Frequent itemsets : the sets of items that have minimum support.
- All the subsets of a frequent itemset must be frequent ; e.g. if {P, Q} is a frequent itemset, {P} and {Q} must also be frequent.
- Find frequent itemsets iteratively, with cardinality 1 up to k (k-itemsets).
- Generate association rules from the frequent itemsets.
ciation rule if there Generate association rules from frequent
wei Han et al.
Aprlorl Algorithm given by Jia
2 and quantitative
Input

transactions;
D: a database of
1 correlation mules mu ! m support count th
reshold. |
; th e mi ni
min_sup
ts in D.
:L
ut : fr equent itemse
sequential pattem Ou tp

t_l-item sets(P)s
Sey find_frequen

Scanned by CamScanner
(2) for (k = 2; Lk-1 ≠ ∅; k++) {
(3)     Ck = apriori_gen(Lk-1);
(4)     for each transaction t ∈ D {       // scan D for counts
(5)         Ct = subset(Ck, t);            // get the subsets of t that are candidates
(6)         for each candidate c ∈ Ct
(7)             c.count++;
(8)     }
(9)     Lk = {c ∈ Ck | c.count ≥ min_sup}
(10) }
(11) return L = ∪k Lk;

procedure apriori_gen(Lk-1 : frequent (k-1)-itemsets)
(1) for each itemset l1 ∈ Lk-1
(2)     for each itemset l2 ∈ Lk-1
(3)         if (l1[1] = l2[1]) ∧ ... ∧ (l1[k-2] = l2[k-2]) ∧ (l1[k-1] < l2[k-1]) then {
(4)             c = l1 ⋈ l2;               // join step : generate candidates
(5)             if has_infrequent_subset(c, Lk-1) then
(6)                 delete c;              // prune step : remove unfruitful candidate
(7)             else add c to Ck;
(8)         }
(9) return Ck;

Scanned by CamScanner
Y. Paterna & Asse. Patlerr

“ff vi Kes use of large itemset property,

i Cepit} ne easily parallelized,

from implementation point of view,

a - algorithm is easy to implement it needs many database scans which reduces


ue pmance-
ya
yav per
pase 5 ans, the "
algorithm assumes transaction database is memory resident.
{0 pata
é
ples on Apriori Algorithm
‘ goed Exam
3
given the following data, apply the Apriori algorithm. Min support = 50 %
D.
ahi" {: pata pase
_TID| Items
100 1134

200 |235

300 |1235
400 |25

whe . “| ae
find the
walt Scan D for count of each candidate. The candidate list is {1, 2, 3,4, 5} and
support.

C=

Scanned by CamScanner
a

{12}
pA
{13}
{15}
{23}
{2 5}
{3 5}

Step 4: Scan D for coun! t of each candidate in C, and find the support

Step 8:

23} | 2
Step 9.
25} | 3
sri

Scanned by CamScanner
)_ Mining Freq. Pattems a ”
So, Patte,,
n support count (i.e, 50%)

oo a candidate C; from

j
i

sip? Scan
D for count of each ike
} G =

PhS
tats
pov;
és

'
¢
j| ds Sodata contain .
th ef enti
temset(2,3,5)
/ teTherefore
gu the ass
Ociation rule that can be generated from L, are as shown below
with
Pport and confidence.

Scanned by CamScanner
——_Data Warehousing & Mining (MU-Som, 6-Comp,) +12 __ Mining Freq. Pattems AS Pay,
n
Association Rule | Support | Confidence | Confidence %
243=55 2 2/2=1 100% i
3*$a>2 2 2/2=1 100% |
2'5=>3 2 2/3=0,66 66%
2=>345 2 2/3=0.66 66% _|
3=>245 2 2/3=0.66 66%
S=>243 2 2/3=0.66 66%
If the minimum confidence threshold is 70% (Given), then
only the first and Second ry
above are output, since these are the only ones generated Tules
that are strong.
Final rules are :

Rule 1: 243=>5 and Rule 2 23%


5=52
Ex.5.4.2: Find the frequent item sets in the following
database of nine transactions , wins
minimum support 50% and confidence 50%
Transaction‘ID- ‘Items Bought.
2000 A.B,C
1000—, A,C
4000 A,D
5000 B,E,F
Soln. :

Step 1: Scan D for count of each candid


ate. The candidate list is {AB,
support
CD.EF and find the

C=

{A} | 3 |
{B} | 2
{C} | 2
{D} 1
{E} |} 1
{F} |} 1

Scanned by CamScanner
Freq. Patterns gAsso ss
atte
Th
idence %

i Generate candidate C, from L


1
‘d GF
‘rst and second ryje,

7
iaifiesci.
ctions , » wi
With a :
yt scan D for count of each ;
: candidate in C, and and f;
a Cz Ind the Support

|F} and find the 43: Compare candidate (C, ) Support count with the minimum support count

{A,C} 2
96: So data contain the frequent item 1(A,C)

Therefore the association rule that can be generated from L are as shown below with
the support and confidence
| Support Confidence Confidence % »
_Association Rule
A->C 2 2/3 = 0.66 66%
C->A 2 aaah im

Scanned by CamScanner
) 5-14 MiningSFreq.
—— Patte
S & Asso, Patter
ms n
0! & Mining (MU
-Som. 6-Comp.
then both the rules are output as the
va Warehousing
WY Data
is 50% (Given),
Minimum confidenc
¢ threshold
50 %.
confidence is above

So final rules are:


Ru1l ->C
:Ae
thm y,
Rule2:C->A
giv Use Apriori algori
en below. se ait
baserate the. association rule
gIV' s alon g with it,
Ex. 5.4.3 : Consi e i: tran ort sact t 2.dataGene
counion
eee ae 9016, 10 Marks, May 2011 , 8 Marks
minimum supp’ a
_ confidence.
Ds.
‘TID | List of item_I
T100 | 11, 12, 15
720|012, 14
+30|012, 13
T400 | 11, 12, 14
7500 | 11, 13
T600 | 12, 13
T700 | 11,13
TB00 | 11, 12, 13, I5
T900 | 11, 12,13
Solin. :

Step 1: Scan the transaction Database D and find the count for item-1 set which is the
candidate. The candidate list is {I1, I2, 13, 14, I5} and find each candidates support.

its
t Step 4: Com
L-ltemsets | Sup-coun item
ll 6
L,=
EB 7
5 6
14 5
H 2 _
————-_

Scanned by CamScanner
Ng Freq. Pattorng 4 Aden. Pattern

nether ea least two transactions (Aa


: pnt give mel
F ott
yo oo? °
bt 1-Itemsets | Sup-count
: I 6
2 7
3 6
4 2
3 2
didate C, from L, and find the support of 2-itemsets.
je
General a
#
C7 2-Itemsets | Sup-count
12 4
1,3 4
1,4 1
1,5 2
2,3 4
2,4 2
2,5 2
3,4 0
3,5 1
4,5 0.
4: Compare candidate (C,) generated in step 3 with the support count, and prune those
itemsets which do not satisfy the minimum support count.
L,=

‘Frequent | Sup-count
2-Itemsets | 5
1,2
vilmlalroleala

1,3
15
2,3
2,4
2,5

ae : |
Scanned by CamScanner
=

WF Data Warehousing & Mining (MU-Sem. 6-Comp.) 5-16 Mining Freq. Patterns RA -
Step 5: Generate candidate C, atm
from L,
GQ=

Frequent 3-Itemset
1,2,3
1,2,5
1,2,4
Step 6: Scan D for count of each candidate in C, and find their support count,

C=

| Freq uent 3-Itemset | Sup-count


| 1,2,3 2
1,2,5 2
L 1,2,4 1
Step 7: Compare candidate (C3) support count with the minimum supp
ort count ang Prune
those itemsets which do not satisfy the minimum support
count.
L;=

| Frequent 3-Iiemset | Sup-count soln.=


123 | 2 gep1: Scan the tr
candidate. ‘I
L 1,2,5 2
Step 8: Frequent itemsets are {I1,12,13} and {I1,12,15}

Let us consider the frequent itemsets={I1, 12, 15}.


Following are the Association
tules that can be generated shown below with the
Support and confidence.
| Association Rule | Support | Confidence Confidence %
fu AI2=>15 2 2/4 50%
1 A15 =>12 2 2/2 100%
he “15 => 11 2 2/2 100%
I] => 12415 2 2/6 33%
fesnas | 9 ann 29%
[1s =>I] 412 2 2/2 100%

Scanned by CamScanner
se
jf the minimum Confidence 7
as output, as they are Sttong 1, oth
5

count,

* Count and pring


Apply the Apriori with minimus
and find large. item SetL. 5

1; Scan the transaction Database


p and fj
candidate. The candidate list is {1,2,3,4 nd the count for item. I set which is the
,5 6,7, 8,9,10) and find the
Support.
C=

‘| Itemset | Sup-count
€ Association
I 3
ea
120 3
3 3
4 *, J@
5 3
6 1

yr. | ob
| 8 | ——
po
|
fo
|1

Scanned by CamScanner
° 3 Mining Freq. Patterns g
¥ Data Warehousing & Mining (MU-Sem. 6-Comp.) _5-18 2880, Patter

Step 2: Find out whether each candidate item is present in at least 297 of transaction,
support count given is 30%).
L,=
Itemset | Sup-count
1 3
g 3

3 3

4 2

5. 3
Step 3: Generate candidate C, from L, and find the support of 2-itemsets.
C,=
Ttemset |Sup-count
12 1
13 2
14 2
1,5 r
2,3 2
2,4 0
23 3
3,4 1
3 2
Step 4: Compare candidate (C,) generated in step 3 with the
support count, and prune those
itemsets which do not satisfy the minimum support count
.
L, =

Ttemset | Sup-count
1,3 2
1,4
a 2
2,5 3

Er —

Scanned by CamScanner
¢ pata Warehousing & Mining (MU-Sem. 6-Comp.) 5-19 Mining Freq. Patterns & Asso, Pattern

5; Generate candidate C, from L, and find the support.


steP
C;=
Itemset | Sup-count
123 1
23,5 2
1,3,4 1

step 6: Compare candidate (C,) support count with min support.


tee

Itemset | Sup-count
| 2,3,5 Z
Therefore the database contains the frequent itemset{2,3,5}.

Following are the association rules that can be generated from L3 are as shown below
with the support and confidence.
Association Rule ‘Support | Confidence ‘Confidence %
243=>5 2 2/2=1 100%
345=>2 2 2/2=1 100%
245=>3 2 2/3=0.66 66%
2=>345 2 2/3=0.66 66%
3=>245 2 2/3=0.66 66%
§=>243 2 2/3=0.66 66%

Given minimum confidence threshold is 75%, so only the first and second rules above are
output, since these are the only ones generated that are strong.

Final Rules are: Rule 1: 243=>5 and Rule 2 : 345=>2

Ex.5.4.5: Adatabase has four locimltbas Let min sup=60% and min conf= 80%.
“Tip.|’ ‘Date. |.“ Items-bought
T100 10/1 5/99 | {K, A, D, B}
T200 | 10/15/99 | {D, A, C, E, B}
T300 | 10/19/99 | {C,A,B,E}
T400 | 10/22/99 | {B, A, D}
Find all frequent itemsets using apriori algorithm
List strong association rules( with supports S and confidence C).

aa
Scanned by CamScanner
Soln. :
| a Data Warehous
i
Step 1: Scan D for cou nt of each candidate. The candidate list
list isis { {A, BODE, 7 asses]

support.
fing step 5: Compare
L,=

SteP 6: Generate o,
C=

SEP 7: Scan
D for g
C=

Step 8: Compare oa
Gz
L,=

Step 9; So data Suit

Therefore the ass


below with the sui
or

Scanned by CamScanner
¢ pata Warehousing & Mining (MU-Sem,
6-Comp.) 5-24 Mining Fre . Patter
& ns
Asso. Partarn
op Compare candidate (C,) support count with the min
5 imum support count.
L,=

Itemset Sup-count
A,B 4
A,D 3
B,D 3
step 6: Generate candidate C, from L,.

C=

‘Itemset

A,B,D
step 7: Scan D for count of each candidate in C,.

C=

itemise t | Sup
A,B,D 3
Step 8: Compare candidate (C,) support count with the mini
mum support count.
L,=

‘temset | Sup
A,B,D “13
Step 9: So data contain the frequent itemset(A,B,D).

Therefore the association rule that can be generated from


frequent itemsets are as shown
below with the support and confidence.

Association Rule |Support | Confidenee | Confidence %


A*B=>D 3 3/4=0.75 15%
AAD=>B 3 3/3=1 100%
BAD=>A 3 3/3=1 100%
A=>B‘D 3 3/4=0.75 15%
B=>A‘D 3 3/4=0.75 75%
D=>A*B 3 3/3=1 100%

aa
Scanned by CamScanner
“Eocene sen et) th A ta
If the minimum confidence threshold is 80% (Given), then only the SECOND
>
AND LAST rules above are output, since these are the only ones generat
ed that arestent
ng
Ex. 5.4.6: Apply the Apriori alg orithm on the following data with Minimum suppoq——_
TID | List of tem_IDs |. =2
T100 11,12,14
T200 11,12,15
T300 11,13,15
T400 12,14
T500 12,13
T600 11,12,13,15
T700 14,13
Ta00 11,12,13
T900 12,13
T1000 13,15
Soln. :

Step 1: Scan Dfor count of each candidate. The candidate list


is {11
support.

C=
Blolalala

I2
3
14
I5
Step 2: Compare candidate support coun
t with minimum support count (i.e. 2).
L,= Step 5 Genera

re

Scanned by CamScanner
4
4
1,5 3
2,3 4
2,4 2 |
2,5 2
3,5 3
Step 5: Generate candidate C, from L,.

C,

1,2,3
123
1,2,4
135
235


XN

Scanned by CamScanner
| rrequent 3-Itemset | Sup-count
[ 1.2.3 2
*
a

| 0

1.3.5 2

23:5 0
Step 7: Compare candidate (C,) support count with the minimum support
count,
L,=

Frequent 3-Itemset | Sup-count


Soln.:
1,2,3 2
1;2;5 2 Step 1:

13,5 2
Step 8: So data contain the frequent itemsets are
{11,12,13 } and {11,12 »15} and {11
13.15},
Let us assume that the data contains the frequent
itemset = { 11,1215) the then
association rules that can be generated from
frequent itemset are as shown below
withthe support and confidence,

| Association Rule Support Contidence | Confidence %


[ 11sI2=515 2 2/4 50%
[ 1as=>12 2 2/2 100%
| I2AIS=511 2 2/2 Step 2
100%
[ 1=>12"15 2 2/6 33%
[ 2=>1145 2 a7 29%
| 15=>11412 2 2? 100%
If the minimum confidence threshol
AND LAST rules above are output, sinc
d is
70% (Given), then only the SECOND, THI D
e thes
Care the only ones generated that are strong:
Similarly do for frequent itemset {11,
12,13} and {11,1315 }. |

Scanned by CamScanner
Step 2: Compare candidate support count with minimum support count (i.e. 50%).

L,;=

1 6

2 5

3 1

5 6

Scanned by CamScanner
WF Data Warehousing & Mining (MU-Sem. 6-Comp.) _ 5-26 Mining Freq, Patt
ems
&A S89
Step 3; Generate candidate C, from L,and find the support.

soln
Itemset | Sup-count step hee
1: -
1,2 3 one

1,3 a
1,5 3
2,3 4
2,5 4
3,5 4
Step 4: Compare candidate (C,) support count with the minimum support count. ied |

” 7 - =

temset | Sup-count.
1,3 5)
So data contain the frequent itemset= {1, J}.

Therefore the association rule that can Oe cence from L, are as shown below With te Step3:
support and confi denice
~

[ -[=>3 5 5/6=0.83 83%


| Ra] 5 5/7=0.71 71%
Given minimum confidence threshold is 50% , so both the rules are strong
.
Final rules are :

Rule 1: 1=>3 and Rule 2: 3=>1

ni é, determina tl
ge rules eee the; a ioe.‘asain
“Transaction | items
T1 Bread, Jelly, Butter
T2 Bread, Butter
T3 Bread, Milk, Butter
T4 Coke, Bread —
T5 Coke, Milk

Scanned by CamScanner
¢ pata warehousing & Mining (MU-Sem. 6-Comp.) 5-27 _ Mining Freq. Patterns & Asso. Pattem

7
In 1:* Sean D for Count of each candidate.

st e ” candidate list
a ‘
is { Bread, Jelly, Butter, Milk, Coke }

C=
I-Itemlist Sup-Count
Bread 4
Jelly I
Butter 3
Milk 2
Coke 2

step 2: Compare candidate support count with minimum support count (i.e. 2)
-. ]-Itemlist Sup-Count
Bread 4
Butter 3
Milk 2
Coke 2
Step 3: Generate C2 from LI and find the support

G ~ - : ™ oe a

- LlItemlist |; Sup Count


{Bread, Butter} 3
{Bread, Milk} 1
{ Bread, Coke} 1
{ Butter, Milk} l
{Butter, Coke} 0
{ Milk, Coke} 1

Step 4: Compare candidate (C2) support count with the minimum support count
L=
‘Frequent 2 -Itemset | Sup — Count’
{Bread, Butter} 3

Step 5: So data contain the frequent itemset is {Bread, Butter}


“Association Rule | Support | Confidence | Confidence
% -
Bread——>Butter 3 3/4 75%
Butter—— Bread 1 3/3 100%

| aa
Scanned by CamScanner
Mini
‘himum confidence threshold
is 80% (Given)
Final rule is
Butter ~ Bread
=
Ex, 5.4.9;

Scanned by CamScanner
gteP 2: Generate 2 item sey :
Item | §
AB i = Hem [Stipa
AC |4 eet
AE
AD [4
3
fc{7-—
/DE tq
AF |] /DF- >
AH |2 DH een

AL | 2 DL [2
BC {5 EF [2
BD |6 [BH [2
BE 4 EK a
BF {1 H (3-. |
BH !|0 FH |92
BK |3 FK |0
BL |2 FL |2
cp |5 HK |1
[CE |2 HL [2
[cr [1 KL |1
Step 3: Generate3 item set:
Item sets of 3 items
“Item
set | Suppor
ABC 3
ABD 4
ABE 2
. ABK 1
ACD 3
ACE 1
ADE 2
AEK 1
AEL 1
BCD | 5.
BCE 2
BCK |
BDE 3
BDK
|PVA 2 4

Scanned by CamScanner
BEL
Fro’
CDE
DEK
final rul
DEL
Item set above 30 % support

—e——

Step 4: Generate 4 item set

23
Associat
section 5
ABDE.
BCDE ——.
Therefore ABCD 1s:
the large item set with ———
minimum support
Following 30%.
Rules generated —

The
efficien:

- Ha
For
buc
|
she

j - Tr

i ~~
na
Pa
|
ite

!
din

a
ear

en
of

Scanned by CamScanner
Boa tereresg
§ on ny 5,
4,
Data Warehousin

From the aboye Rul =


C8 ge
nal rules. So final Rules are Nerated, on]
+

Syllabus Topic : Imp


roving the Efficienc
y of Apriori

5.6 Improving the Efficiency


of Apriori

There are many variations of Apriori


algorithm that have been proposed to impro
efficiency, few of them are given as : ve the

~ Hash-based itemset counting : The itemsets can be hashe


d into corresponding buckets,
For a particular iteration a k-itemset can be generated and hashed into
their respective
bucket and increase the bucket count, the bucket with a count lesser than the support
should not be considered as a candidate set. |
Transaction reduction : A transaction that does not contain k-frequent itemset will never te
have k+1 frequent itemset, such a transaction should be reduced from future scans.
- Partitioning : In this technique only two database scans are needed to mine the frequent itemsets. The algorithm has two phases. In the first phase, the transaction database is divided into n non-overlapping partitions ; the minimum support count of a partition is min_support × number of transactions in that partition. Local frequent itemsets are found within each partition. A local frequent itemset may or may not be frequent with respect to the entire database ; however, any itemset that is frequent in the database has to be frequent in at least one of the partitions.

Scanned by CamScanner
=

_¥ Data Warehousing
& Mining (MU-Sem. sing 8
6-Comp.) 5-32 Mining Freq. Patterns
~All the frequen, & 4,
itemsets with resp 89. Patten,
itemsets, In ect to cach Partit
the second ph ion forms the global
Support of €a ase 0 f the algo
rith cand ns
ch item is fou m, a second scan Of database for a idate ’ pat?
nd, the Se are global frequent siZ@ onthe
Sampling : Rather itemsets. c tual e aa
than finding the frequent :
transactions
are Picked up and itemsets in . i proach is v
minimum Support searched for frequent the entire database D,a Subset This :
is considered as itemsets, A lower thy - compres sion
frequent itemse this reduces the poss faa, oF
ibility of MISs ay
t du eto a higher
support count sing the Old of It isa frequent

Dynamic ite ~ 8Ctuay


mset counti ng
b it adopts a div
the itemset is adde OCkS and ig my
&enerate longer
d to the frequent it
emset collection wh Minimum
ich can be further
sy tt, qhe database of
informe
candidate itemse tion of iten
t, Used 4,
-
mine ea ch su
e ThenFP
-Tree
Algc¢
5.
onstruction
pp-tree ©
FP. Growth:
allow:

Once the FP tree i


-

Algorithm : FP ,
An FP- tree is a tre
e Structure which co gowth.
nsists of :
One root labeled
as " null", Input :
A set of item Prefix - D,a transaction d:
oi: sub-trees wit; h each
node-link. node formed by thr
ee fie‘ lds - it
.
em-name, count, - mi
. ee , the mini
A frequent- item’
header table with two
node-link, fields for each Output :The complete
entry : item-name, head
| of Method -
It contains the co
if mplete informat
. ion fo r frequent
H Pattern mining
Th e size of the FP-tre
e is bounded by the
| AFP tree is co
size of the database ns
| sharing, the si , but due to frequent ;
ze of the tree items @ s
is usually mu ch smaller th cn the tran
~ High compaction is an its original
achieved by Placing database
more frequently item
thus more likely to
be shared), s closer to the root (b . eo‘ms, oc
eing
(b) Create th
e ro
the followin
g

Scanned by CamScanner
i re
weme size of the FP-tre, is < 3

e . Can i

sibility Gp NE tego
a qhis approach 1S Very effic; ent g Gidate
oe oO > Sub, .

ye . t Ue to 7
Compression of large aig :
ing © ao, of Ase; /
u It is a frequent Pattern _ ints Me
It adopts a divide-ang.,

nquer Str
he database of frequen; ite
“formation of items is Preserved is
hen mine each such data Se ge

(2 FP-Tree Algorithm

epatree construction algorithm given by,


Fp-Growth: th: all allows frequent itemse
; t diseoy
SE Hen F
Without can
_ Once the FP tree is generated, it jg besyCallingOUt Candide Tete teecon
Fp Bowtie
Algorithm : FP growth, Mine frequen t ions
heer),
gowth.
oe Pee y Pate fageny
Input:

- D,a transaction database.

fields : item-name, count, - min_sup, the minimum support count threshold,

Qutput :The complete set of frequent patterns,


y : item-name, head of lethod :

| AFP tree is constructed in the following steps


(@) Scan the transaction database D once, Collec oa vera, wes fe
equates
; their

s
it due to frequent item
support counts. Sort F by support count in a
| database. items. ssation TesD €0

bel i1t ag “aul” For each
oser to the root (peike (0) Create the root of an FP tree, and la
the following :
ea6
» for mining freq
uent.

Scanned by CamScanner
Select and so
rt the fre quent items in
re
=P “F
wnt
fiFrequent Tr an s ac co rd in
item list in Tran g to th e order of L,
Temaining list. Cal} s be [p | P], wh 2
insert_tree ([p | P]-T ere p is the first element and “M-eg FP:
child N such ), which is perfor o
that N_item-nam med as follows, tp a,
create a new no
de
e = Prilem-n
am e, then
the FP
N, and let its co increment Vs
unt be I, its pare
liink to the
nodes . nt link be linked Sount by a /
with the same . ; to T. i i She ue
nonempty, call in item-name
via the node-lin i
And its ,
sert_tree (P, N) re k Structur e If ore: Tre
| 2. The FP-tree is mine cursively, dl
d by calling FP growth. FP ri
follows : tree, null/, which is imPl Fi ail
ementey 7A Ex:
Procedure FP —fro
wth (Tree, q) : 7 -
(1) if Tree contains a single
path P then Vee a
(2) for each combination (de
noted as B) of the nodes in the
path P
(3) generate pattern BUo. with support_count = minimum
(4) else for each a; in the
header of Tree {
d
(5) generate pattern B= :
a,U awith support_coun
t = a;,Support_count In. =
;
(6) construct B’s condit =
ional Pattern base and then Step 1: F
B’s conditional FP_tree
(7) if TreeB4 TreeB;
then .
(8) call FP_growth (Tree, B):)
Analysis

second constructs the


FP-tree.
} —
The cost of inserting a tra
nsaction Trans into the FP-
i number of frequent items tree is O(ITransl), where ITra
in Trans. - nst is the
| 5.7.3 FP-Tree Size

- Many transactions share ite


ms
ms due to which the size of the FP-Tree can have
size compared to uncompressed a smaller
data.
- Best case scenario : All
transactions have the same set
of items whic. h results inin aa sinsing
g!le
path in the FP Tree.

Scanned by CamScanner
ge
¢ case Scenario : Every transaction has a distinct set of items, i.c. no common items.
wors
: pp-tree size is as large as the original data,
0
FP-Tree storage is also higher, it needs to store the pointers between the nodes and
the counter.

ing
p-Tree size is dependent on the order of the items. Ordering of items by decreas
suF port will
not always result in a smaller FP -Tree size (it’s heuristic).
i example of FP Tree
61.

-7qai__ Transactions consist of a set of items| = {a, b, ¢, ...} , min support = 3.


ex. 5! "TID |. Items Bought.
f, a,c, d, g, i,m, P
om] &)oa}rm)—

a, b,c, f,1,m,0
b, f, h, j, Oo fF

b,c, k, S, P i
a,f,c,e,1,p,m,n A

soln.
1; Find the minimum support of each item. | |
step ‘ a Ps
3 id
yet
|
|
GEERT EEE
th ceed

Scanned by CamScanner
Scanned by CamScanner
wpe insertion of Second transagi f
18+” ggction TH POMUNE 10
“gon T is pointi the root yyy ap ,
f a” . Header table
os

| this step we get f:2, finished adding f in the above tree.


After
the above transaction .ie. c.
Now consider the second item in
, : (a, b, m)
| Header table
~

Scanned by CamScanner
(iv) Similarly cons
ider the next item a.

Header table
tenes

(v) Since we do not hav


e a node b, we
maintain the path).
; fi
Header table ;

Second transaction is complete.


This ig

Scanned by CamScanner
jns
itt of third transaction (f, )

f° jp Header table

_ aster
f”

Tiss
the final FP-Tree.

Scanned by CamScanner
a

¥ Data Warehousing & Mining (MU-Som. 6-Comp. 5-40 Mining Freq. Patterns& Assn, Patt
fr

5.7.5 Mining Frequent Patterns from FP Tree


— General idea (divide-and-conquer)
© Use the FP Tree and recursively grow frequent pattern path.
- Method

o For each item, conditional pattern-base is constructed, and then jt’s Condit,
FP-tree.
© Oneach newly created conditional FP-tree, repeat the process.
© The process is repeated until the resulting FP-tree is empty, or it has Only a sin
path(All the combinations of sub paths will be generated through that single path
each of which is a frequent pattern).
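Before the hand-worked example below, note that this whole procedure is also available ready-made : the mlxtend library ships an FP-growth implementation. The sketch below is only an illustration of its use (mlxtend, TransactionEncoder and fpgrowth are the library's names, not part of this text) on the five transactions of Ex. 5.7.1 with minimum support 3 out of 5.

import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import fpgrowth

transactions = [
    ["f", "a", "c", "d", "g", "i", "m", "p"],
    ["a", "b", "c", "f", "l", "m", "o"],
    ["b", "f", "h", "j", "o"],
    ["b", "c", "k", "s", "p"],
    ["a", "f", "c", "e", "l", "p", "m", "n"],
]

te = TransactionEncoder()
onehot = pd.DataFrame(te.fit(transactions).transform(transactions), columns=te.columns_)
print(fpgrowth(onehot, min_support=3 / 5, use_colnames=True))
# itemsets such as {c, p} and {f, c, a, m} come out with support >= 0.6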

Example : Finding all the patterns with ‘p’ in the FP tree given below

Header table

- Starting from the bottom of the header table.

' ass Generate (p : 3)

‘p’ exists in paths :

(f:4,¢:3,a:9
m:2,p: 2) and
(c:1,b:1, p:1)
process these
further

Scanned by CamScanner
2
4, 6:38 at3, m2, Pp pi2) DCT
and (4 4. pty
bel,
wot actions containing 'p’ have p.count
oo awe nave (£:2, c: 12, a:2, m:2, p:2) and (e:1,
0
be
vee pl
qe spar of these we can remove ‘py’

se pattern Base (CPB)


at tio” noving p we get: : (£:2,
(f:2, ©:2, a2,0:2, m:2)m:9) and an, (cs, bt)
c:2,
rer in the CPB
0 , frequen es and add°p’ to them, this will give us ai
pind? contain ing ‘P frequen
Q patter 6s done by constructing a new FP-Tree for the CPp.

js can

ioe ain filter away all items < minimum support threshold (ie, 3)
again
0 we 9, a:2, m: 2), (c:1, b:1) => (¢:3)
og,
ovwe erate (cp:3) paterns containingitem po we
: we are finding frequentthreshold
(Noteitem
5 r toc as ¢ is only that has min support .) —
P
apP endevil is taken from the sub-tree
. «

thus far: (p:3, p:3)


SupP

requent ee . band 7
Freq

prample -

Find ‘m’ from the header table 29)


Header table Generate (m
pen Pail
G--.4. ‘mt existsin paths

- 4,6! 3a: 3
Bi
besae ee : 2) and
(f:4,0:3.8%
ae b:4,m:1)
», ap

Conditional Pattern Base:


—_- (£:2, c:2, a2)
© Path 1: (f:4, c:3, a:3, m:2, p:2)

Scanned by CamScanner
AA Data Warehousing & Mining (MU-Sem. G-Comp
-
.) _5-42 Mining Froq. Patterns 2, Aany p
2 try,
© In the above transaction we need to consider
m:2, based on
this we get f:2 and so on. Exclude p as we
don’t want p GD Ty
i.e. given in example.

O Path 2: (f:4, ¢:3, a:3, b:1, m:1) — (fl, c:1, ail, bi)
~ Build FP wee using (f:2, ¢:2, a:2) and (f:1, c:1, a:1, b:1)
Ca:3)
— Now we got ( £:3, ¢:3,
a:3 ,b:1)
- Initial Filtering removes b:1 (We again filter away all items < minimum a
threshold).

~ Mining Frequent Patterns by Creating Conditional Pattern-Bases.


| Item Conditional pattern-base | Conditional FP-tree
[P {(feam:2), (cb:1)} {(c:3)}Ip
|M {(fca:2), (feab:1)} {(f:3, c:3, a:3)}Im
[B {(fea:1), (f:1), (c:1)} Empty
j [A {(fe:3)} {(£:3, c:3)}la
Pi [ce |reay - | 1€3))c
[f Empty Empty
Ex. 5.7.2: — Transaction item list is given below
. Draw FP tree.
Ti=b,e T2=a,b,c,e
T3=b,c,e T4=a,c T5=a
; i Given : minimum support
=2
Soln. :
Root

f
ee abi2 ee *asa
i ae ..o8hies / ota
; i b+ ne ‘s
! ou ei “ort bi4 e:4

‘, | “ Z
A cl 1

est

Scanned by CamScanner
vy a

¢ pata Warehousing & Mining (MU-Som,


aa

a 5.7.3: Transaction database is

sneer |

7200 | 12, 14
| T300| 12, 1g
—___| I, 12, I4

T500 | 11,13
| 7600| 12, 13
| 7700 | 11, 13
——__| l1, l2, 13, 5

T900 | 1 , 12, 13
Min support = 2 a a
Soln. :

| Support Count
7
12
6
i
I3 6

‘Ths, ;
15
2

* Conditional pattern | base yr ditional FP-treeee : ‘Frequent patterns generated


{211 :1),21113:1)} | (12:2, 11:2} 1215:2,115: 15:2
120 2,

{(211:1),02:1)} | {12:2} 12 14:2


2), (1:2)} | @2:4,01:2), 1:2) | 2B:4,
{211 : 2),(12: 0,5 :2,.20B:2
{(2:4)} 2:4,
{(12 : 4)}
Ex. 5.7.4:
7A:
dataset of frequent itemsets. All are sorted according to
ekrmonndl ee rnin the FP-Tree and find Conditional Pattern base

for D.

h eT = malian
ou ee
———-—£°
Scanned by CamScanner
Scanned by CamScanner
rehousing & Mining (MU-Sem, g C »O* omp.)

for D
jaittonal Pattern base
(0
P = {(A:1,B:1,C:1),
(A:1,B:1),
(A:1,C:1),
(A:1),
— @B:1,C:1))
We have the following paths with ‘D’
(AA.:11,,BD.:11)),
p = {((A:1,B:1,C’ :1),AW (A:1,C:1 :

),(A:1)} and. ((B:1,C:1)}


Support count of D =I,
Conditional Pattern Base (CPB)
o find all frequent patterns in the
o ini ng
To find all frequent patterns contataini 'D' we needto fi
.

CPB and add 'D' to them


o Wecan do this by constructing a new FP-Tree for the CPB

_ Finding all patterns with ‘D’


nimum support threshold
o Again filter away all items < mi
1)
(i.e.1 as Support of D=

o Consider First Branch


4,B:2,C:2)}
BD, (A:1,C:1), (A‘D)} => {(A:
((AsL,B:L,C:1)(A:L
th D
So append ABC wi
We generate ABCD:1 1,C:1)} => (@B:1,C:)))

ra nc h of the tree {(B:


b
° Similarly for other

so append BC wale
e BCD: 1
We generat oo
FP-growth ACD, BCD. which are
apply , BD
6 Re cursively w i t h s up > Ip AD
u n d (
Ttemsets fo
-, Frequent
.
n c o n d i t i onal node D
o
oO
from CPB
°g°en erated

Scanned by CamScanner
| name and TID- transactional
set is the set data is repres
of transactions ented as item
- TID-set, Item
Oo Eg. of Vert co nt aining that it is the item
ical Data Fo em,
rm at

Scanned by CamScanner
I TD_sets of the fj :
Sets, Th Tequent k-itemsets to
a
2
: found . ieee & Stopped when
i
: Introd
uc tion oO n i
eT Ctlon to Mining g Mult eve Asso
cia
5 9 Introduction to
on ules

Mining Multilev
e| Association
Rules
> (MU- May 2012,
May 2013, Dec. 2013, Dec
Items are always in the . 201 4, May 2016)
form of hierarchy,
Items which are at leaf nod
es are having lower support.
An item can be either gene
ralized or specialized as per the described hierarchy of
and its levels can be powerfully preset in that item
transactions,
Rules which combine associations with. hierarchy of concepts are called
Multilevel
Association Rules.
em Depa rtment Foodstuff.

| ; Frozen” Refrigerated] | = Fresh_}I ]_ S200


a

A : ! ‘| |. Dairy— - Ete...
fe Family _ Vegetable] | _ Fru

Banana Apple | [Orange a


Product _——

hy of concept
Fig.5 9.1: Hierarc

Scanned by CamScanner
a ata Warehousing& Mining (MU
-Sem.
Support and confidence of mul rely FLO. Patios 0 bay # pata Waret
tilevel association rules
20 ~ The Support and There are
Specialization value ofconattr
fidence
ibutes.
of an item is affected due to its Benetalizay:,
The support of &ner evel-by-
alized item is more than the support
of Specializeg item,
= iJ
Similarly the Suppor — Itsa
t of rules incre. ases from specialized to generaliz
eg itemsets,
If the Support is below the threshold w ‘The
value then that rule becomes invatig,
Confidence is not affected for general or specialized. node

Two approaches of multilevel association rule ao Level-cr


I. Using uniform mini
mum support for all lev The chil
els
Consider the same min
imum support for all ci) Level-c
levels of hierarchy,
‘a

As only one minimum sup


‘ port is set, so there is no = Fir
itemset whose ancestors do necessity to xXamine theiter
a
not have minimu t,m suppor
: of - Or
If very high support is considere
d then many low level asso ciatio
: n May get Misseq
. ;
~ If very low support is considered then many
(iv) Contr
high level aSS0Ciation rok
7 generated, - T
Level 1 a
min support = 5%

Level 2
min support = 5%

Fig. 5.9.2 : Example of


uniform minimum Sup
port for all levels 5.10 |
2. Using reduced minimum
support at lower level
1 Consider separate mini _ = Sin
mum support at each lev
el of hierarchy.
As every level is having ©Xz
its own minimum suppor
t, the support at lower
level reduces

Scanned by CamScanner
qhere are 4 se
arch Strategi
es i
evel-by-leve
l independen
t
_ Itsa full-bread
iy Search
The pae rent No
= Methog,
de
node is examin is 4 cc
e, Ked Whether it’
a Level-cross
filtering by
i sin
The children 2le item,
of Only freque
nt node.
ii) Level-cross fi
ltering by k. “i
‘S

temset
"XaMing
the
- Find the freq
; uent k ite
© item,S of - Only the
k itemse
t at nex:
May &et Miss (iv) Controlled
eg
level-crogg fi
Ciation
Tulesre
are

- So the item
s : Which do
threshold this pn Ot Satii sfy
is also Ca Minim
lled “Leyey Pag
“ SUpport are Ch
ecked for Minimu
m Support

Single-dimensi > (mu ~ May


onal Tules : Th 2013, Dec, 2014,
e ru le May 2016)
example the rule Co nt ains only one di
has Only one pred stinct predicate, In
vel Teduces, icate “buys”, the following
buys(X, “Butter”) = buys(X, “Mi
lk”)
-
Multi-dimensional Tul \
es : The rule contains
two or more dimensions
or predicates,
© Inter-dimension associ
ation rules : The rule doesn’
t have any repeated predicate
gender(X,“Male”) salary(X, “High”) > buys(X, “Computer”)
© Hybrid-dimension association Tules :
The rule have many occurrences of same
predicate i.e. buys
.
gender(X,“Male”) a. buys(X, “TV”)
> buys(X, “DVD”)

Scanned by CamScanner
las

. : f possible values an
- Categorical attributes ; ;
This have finite number of po s there 8 pp
ordering among values, Example : brand, color.
- Quantitative attributes : These are numeric values and there is implicit Ordering
wig
values, Example : age, income.

ceicttalinsiiieais
Techniques for Mining MD Associations

1. Using static discretization of quantitative attributes


- Using concept hierarchy, discretize the quantitative attributes.
— Convert the numeric values by ranges or categorical values.
- To get the all frequent k-predicate , k or k+1 table scans are required and Senerare
. ea
cuboid which is5 more suited
. ha
for mining a

- Mining is always more faster on data cube. _


- Predicate sets are determined by the cells of n- dimensional cuboid.
~ Inthe example below, the base cuboid contains the three pred
icates gender, Salary ang
buys. Each cuboid represents a different group-by.

0-D
Cuboid

(Gender)
pal
Cubold

(Genda fr, Salary) 2.5

Cuboid

(Gender, salary, buys) Cuboid


| Fig. 5 10.1 : Lattice of cu
boids making up a 3-D data cube
Quantitative Asso
ciation rules

Scanned by CamScanner
) 5-51 Mining Freq, Pattornis & Asso. Pattern

ains 2-D quantitative


saatiVe association rule, the left hand side of rule cont
attribute
"5 and right hand side of rule contains one categorical ‘a

rg A.quantitative! A A quantitat
ive? > Categorical
a
tative measures are
. Jf we are interested in association rule where two quanti
pond le salary
« and the type of the phone that customer buy.
ag «30-34”) a Salary (CUST,“24K - 48K”) => buys(CUST, “Sony Xperia Z”)
A ge(CUST
rn
us ed to find such rules is the Association Rule Clustering Syste
; approach -

° (ARCS):

invninolvg ed”* PartAR ES


’ . spin
siep ition the ranges of quantitative attributes into intervals. These
s are con sid ere d as bins . ARC S are equ iwidth binning method where each
interval
al size.
pin has same interv at e sets which
It fi nd s th e fr eq ue nt pr ed ic
ate set :
ip) Finding frequent umpredsuicpport and also minimum confidence. Rule Generation
satisfy the minim generate strong association rules.
algorithm is used to
from above
the ass oci ati on rul e : Str ong association rules obtained
(c) Clustering
das shown in Fig. 5.10.2 :
step is mapped to a 2-D gri
70-60K

60-70K

50-60K
Salary
40-50K

30-40K

20-30K |.

<20K
36 37 38
33 34 35
32
Age
2-D grid for tupl es rep res ent ing cus tomers who purchase Sony Xperia Z
Fig. 5.10.2 : A
d to the rules :
- Following four CUSTs correspon
40K") — buys (CUST,” Sony Xperia Z”)
© age(CUST,34) ASalary(CUST,"30—
Sony Xperia Z")
© age(CUST,35) ASalary (CUST,"30 — 40K”) —» buys(CUST,”
Sony Xperia Z”)
© age(CUST,34) ASalary (CUST,"40 — 50K”) > buys(CUST,”
” Sony Xperia Z”)
© age(CUST,35) ASalary (CUST."40 — 50K”) — buys(CUST,

Scanned by CamScanner
_W sing
Data Warehousing & Mining arehor
(MU-Som. 6-Comp.) _
As above rules are close to
5-52 Mining Frog. Patterns
& Aaso, Patty ye 2010
each other, they can be clustered fn
following rule: together 0 form the
asider the
© ase(CUST, “34 — 35") ASalary q -
Xperia Z”) (CUST,"30

- 50K")* — buys(Cust, Sony ort cour

Limitations of ARCS
: A P
: Refe
- It works only for quantitative att
ributes on LHS of rules.
— The limitation is 2D i.e. ‘wo
attributes on LHS only so doe
sn’t work for more dimension
3. Distance-based associ ,
ation rules
- This is a dynamic dis
cretization Process that
considers the distance be
points. tween dies
— This mining process has onl
y two steps :
o Perform clustering to find
the interval of attributes in
©
volved.
Obtain association rul
es by searching for gro
ups of clusters that occ
~ The resultant rules of thi ur together.
s method must Satisfy :
© Clusters in the rule antece
dent are strongly associ
consequent. ated with clusters of Tules
in the
' May 2011
© Clusters in the antecedent
occur together.
© Clusters in the consequent occur 0.3 Whatis.
together.
5.11 University Questions and all frequ:
Answers
May 2010

Q.1 Consider the following tra


nsactions :
TID Items
[on 1,3, 4,6
02 2,3,5,7
03 1, 2, 3,5, 8
04 2,5, 9,10
05 1,4
Apply the Apriori Algorithm with mi
nimum Support of 30% and minimu
7% and find the m confidence of
large item set L. (Ans. « Refer Ex.
5.4.4) (10 Marks) Mir
Mit

Scanned by CamScanner
ini 29 Freq. Patterns
atte, == & Asso,
4, i og

y.
clustereq B,
) 010
together t
© form
i =,
5
OK”)8
the ; consider the transaction d
_, buys(Cugy o. Atabass
Ony I
3 gupport count 2, Generatg

(Ans. * Refer Ex, 5.4.3)


the asso”

t Work for more qj :


i, I2, 5
NSion
Ss,
/ )et, 4

!
the distance between
da li, 12,14
11,13
ved.

"00 |
12,19
ers that occur togethe,’ 11,13
T,
|
i T9009
h clusters of
Tules in th
ie yoy 2011 Lts
Nh, 2

93 What is Association Rule Mining ? Give


the
all frequent itemsets from owing
the2 following table
table “ov
: algorithm. Apply AR Mining to find
———
Transaction -ID Items
100 1,2,5
200 2,4

300 2,3

400 1,2,4

500 PTT

600
PL
a

700

800

g00

n confidence of port Count =2 (8Marks)


(10 Marks) Minimum Sup

onfidence
2 70%. (Ans
Minimum — C

Scanned by CamScanner
¥ pataata W Ware housin
& Minin
g g (MU-Sem. 6-Comp.) 5:54 _ Mining Freq . Patterns & Asso,
Pattem |

Dec. 2011
EE

Q.4 Explain how Apriori algorithm is useful in identifying


ifyi frequent iitem s et?
(Ans. : Refer sections 5.2.3 and 5.4.1)
(10 Marks)
May 2012

Q.5 Explain what is meant by association rule mining. For


the table given below Perform
apriori algorithm. Also :
(i) Determine the k-item sets (frequent) obtained.
(ii) Justify the Strong association rule that has been
determined .€. Specify which is
the strongest rule obtained.
The table is as follows
:
TID Items
01 1,3,4,6
02 2,3,5,7
03 . 1,2,3,5,8
04 2,5,9, 10
05 1,4
Assume Minimum Supp
Ort of 30% and Mini
(Ans. : Refer Ex. 5.4. mum confidence of
4) 75%,
Q.6 What is meant by (10 Marks)
market-basket an
with formula the al ys is ? Explain with an ex
Meaning of the te ample. State and
rms : explain
(i) — Support
(ii) Confiden
ce
Hence explain ho
w to mine multi
level @ssociation
.
with example for ru le s
Gach, _ from transaction
(Ans, : Refer Sectio | databases,
ns 5.1, 5.2.3 ang 5.9)
Deec.
c, 2 2012
(10 Marks)
Q.7
om
2 e ae cierm
neine a Si
* Gel Hf
minimum Support is 30% and
Priori algorithm, (Ans. : Referthe frequent
Ex. 54.8)" ;item sets and £ minimum
@SSociation rules usi:ng the
a
(10 Marks)

Scanned by CamScanner
Va
7 garth following transection
| TID Items i
01 A,B,C, D,
o2 A,B,C,D,E,G
03 «Oo A,C,G,HK
04 B,C,D,E,k
(oe
06 «=A,B,C,D,L
oy) =—sa#BELLL EE, K,L
08 =A, B,D,E,K
09 OA, E,F,H,L
10 B, C, D, F
Appl
the Aprioni
y algorithm with minimum support of 30% and minimum confidence et
rules in theion
70%, and find all the associat data set.
(Ans, : Refer Ex. 5.4.9) (10 Marks)
49 Define multidimensional and multilevel association mining.
(Ans, : Refer sections 5.9 and 5.10) (10 Marks)

213
Seteal

example Stato and


Li) What
is meant by market, basket analysis? Explain wi sn
explain with formula the meaning of following terms

Hence explain how to mine multi 57,523and59)


with examples. (Ans. ; Refer sections 5:

a
_—,
Scanned by CamScanner
¥ Data Warehousing & Mining (MU-Som. 6-Comp.) 5-56 Mining Fr _Pattoms & Asso, Patton

Q.11 Consider the following transactions :


TID | Items
01 | 1,3,4,6

02 | 2,3,5,7

03 | 1,2,3,5,8
04 | 2,5, 9, 10
05 | 1,4
Apply the Apriori Algorithm with minimum support of 30 % and minimum
of 75 and find the large item set L. (Ans. : Refer Ex. 5.4.4) (10 Marks)
May 2014
——

Q.12 Consider the following transaction database :


TID Items
01 A,B, C, D
02 A, B, C, D, E, G
03 A, C, G, H, K
04 B, C, D, E, K
05 D, E, F,H,L
06 A; 8,0; By 1.
07 B, 1, E,K,L
08 A,B, D, E, K
09 A, E, F, H, L
10 B, C, D, F
Apply the Apriori algorithm with minimum support of 30% and minim
um confidence of
70% and find all the association rules in the data set.
(Ans. : Refer Ex. 5.4.9)
(10 Marks)
Dec. 2014
ey

Q.13 Define multidimensional and multilevel association minin


g.
(Ans. : Refer sections 5.9 and 5.10)
(10 Marks)
May 2016

Q.14 Discuss association rule mining and apriori algorithm. Apply AR mining
to find all
frequent item sets and association rules for the following
dataset :
Minimum support count = 2
Minimum confidence = 70 % (Ans. : Refer Ex. 5.4.3) (10 Marks)

-_
Scanned by CamScanner
Multidimensional essocieton rule
(Ans. : Refer sections 5.9 and 5.10) —

goo

Chapter Ends
I
'
i

t
4
i

, - .%
Scanned by CamScanner
Module

Spatial and Web Mining

——__)

Syllabus :
Spatial Data, Spatial Vs, Cla it
Association and Co-loc
ssi cal Dat a Min ing , Spatial Data Structures,
ation Pattems, Spatia
l Clusterin: g Technique
: Minj
Extension, Web Min s -
N9 Spatiay
ing : Web Content Cc
mining, Applications Mining, Web Structure Mining, Web Usage
of Web Mining.

types of data objects


8eographical space or ho or elements that are
rizon - It supports the glo Present in a
devices anywhere in the bal finding and locating
world. of individuals or
— Spatial data is also kn
own as Seospatial dat
information, a, spatial information
or geographic
.
Spatial Data Mining
¥) Spatial data Mining is
the process of discoverin
Ef from large spatial datasets, g interesting, useful,
f
non-trivial patterns
i Non-trivial Search
Large (e.g. €xponential)
search space of Teasonab
le hypothesis.
~ Example : Asiatic choler
a causes : water, food, air, insects, ...; water delivery
mechanisms - numerous pumps,
rivers, ponds, wells, pipes,
...

Scanned by CamScanner
Useful in certain application %,
‘i ‘ Main,
Bxample : Shutting off identifieg
W:
ynexpected ae Pump = saved
human life.
edge,

| Data Structure S, Minj.


ring Techniques ap Spatia
Structure Mining, Wep yas
Sage
- Simple Querying of Spatial Data

o Find neighbors of Canada given


gl Dames and bound
aries of all countries,
i
-o Find shortest path from Mum
bai to Delhi ina freeway map
o Search space is not large (not expone
ntial),
|
- Testing a hypothesis via a primary data analysis
|nts that
are present o Ex. Female chimpanzee territories are smaller than male territories.
jn a
I locating of in
dividuals or o Search space is not large.

o SDM: Secondary data analysis to generate multiple probable hypotheses.

mation or Bovaraphig - Uninteresting or obvious patterns in spatial data


with heavy rainfall in St. Paul, given that
o Heavy rainfall in Minneapolis is correlated
Pt: es)
= 1A sejsauwes

| . the two cities are 10 miles apart.


similar rainfall.
places have
Wl, non-trivial pattems o Common knowledge: Nearby
Mining ?
2 isle

6.1.3 Why Learn about Spati ‘al Data


iti al ques! tions
for Critic
de rs ta nd in g of ge ographic processes
New W un
PUT Ph

- . —
o f pla net Earth?
: How iis the hea lth y
o Exam vironment and ecolog
ffe cts of hu man activity on en
ONY

ize €
water delivery © Example : Character
MALSAS

Scanned by CamScanner
0 Example : Predict eff
ect of weather, and economy.
Traditional approach
: Man Ui: ally generate and test
hypothesis but, spatial data js 8rOwin,
‘00 fast to analyze manually,
© Satellite imagery, GP
s tracks, sensors on hig
hways.
Number of Possible &cogra
phic hypothesis too large
to explore manually
‘ © Large number of Seographi
c features and locations.
© Number of interacting
subsets of features grow exp
onentially.
Oo. Ex. Fiind teleconne
ctions between wea
ther ¢ vents across
SDM may reduce ocean and land afeas,
the set of possible hypoth
esis
9 Identify hypothesis su
pported by the data.
© For further exploratio
n using traditional sta
tistical methods.
6.1.4 Spatial Data Mining
: Actors
Domain Expert

Identifies SDM
80als, spatial data
set.
Describe domain
knowledge, €.g.
well -known pa
Validation of tterns, e.g. correlat
new Patterns. es.
Data Mining Anal
yst
Helps identify pa
ttern families, SD
M techniques to
Explain be used,
the SDM outputs
to D ‘omain Expert,
Joint effort

| 6.1.5 Characte
Ms
i ristic s of Spatial D
4 ! ata Mining
- Auto Correl
ation.
i - Patterns usua
lly have to
;| complete attrib be defined in
ute space. the Spatial at
tribute Subspa
ce and not
in the

Scanned by CamScanner
Objects £0 poin
tS, but can alsg
Large, Continuous . : Tefer to 1)
. ‘Pace
oa Regional know] . “fined by spatial ay .

geography (F i hetero,
Spatial of Particular imp “ties
[= Beneity),

| Spat das ig
Implicit

| Horizontally : Spatia
l Indexing
isStatistical
Independence ofcont
ext eSpa
atial autoc
utput Set based osi
pattia
Spa i l bas ed —
Algorithm | Generic
i i de
: . divi and coinquer, MBR, i
Spatial i
index in ig, Plane
or dering, hierarchical
Sttucture sweeping, Computational gcometry

Syllabus Topic : Spatial Data Structures

6.3 Spatial Data Structures

Spatial data consists of spatial objects made up of points, lines, regions, rectangles,
surfaces, volumes, and even data of higher dimension which includes time. Examples of
Spatial data include cities, rivers, roads, counties, states, crop coverages, mountain ranges,

parts in a CAD system, etc.

Scanned by CamScanner
Rules for R- tee are similar to B-tree. @ gare for the 0
fis ‘aoe
Example given by Hanan 98
Samet is given below :
tree.
problem is tha
whole database

432 A+tree

- R+ttree and C
collection of ot

- Cell tree deals

- R+-trees is ext

- Try not to Over

If object is inn

Scanned by CamScanner
a ng
R1| R2
NN ab
R6|R6 } g h A He
R3| R4

Ete
di gjh ciilyyel ft R4 f
a b [s R5

g-tree for the collection of line segments (b) the spatial extents of the bounding rectangles
er. Fig, 6.3.2

R-tree is not unique, rectangles depend on how objects are inserted and deleted from the
tree.

_ Problem is that to find.some object you might have to go through several rectangles or
whole database.

6.3.2 R+tree

Rttree and Cell Trees used approach of discomposing space into cells and deals with
collection of objects bounded by rectangles.
Cell tree deals with collection of objects bounded by convex polyhedra.
R+-trees is extension of k-d-B-tree. | .
Try not to overlap the rectangles.’
If object is in multiple rectangles, it will appear multiple times.

Al) R2

R6

Fig. 63.3 : Decomposition of snare inta diciaint nals

Scanned by CamScanner

¥ Data Warehousing & Mining (MU-Sem. 6-Comp.) _6-7 Spatial te
patal and Web Minin
6.3.3 More Spatial Indexing

Uniform Grid

Ideal for uniformly distributed data.


More data-independence then R+-trees.
Space decomposed on blocks on uniform size.
Higher overhead.
Quadtree

Space is decomposed based on data points.


Sensitive to positioning of the object.
Width of the blocks is restricted to power of two.
Good for Set-theory type operations, like composition of data.

Syllabus Topic : Mining Spatial Association and Co-location Patterns

6.4 Mining Spatial Association and Co-location Patterns

Looking for interesting, useful, unexpected spatial


patterns of information embedded in
large databases. .
Spatial Data Mining is searching for spatial patte
rns :
0 Non-trivial search - as “automated” as possible
reduce human effort.
© Interesting, useful and unexpected spatial
pattern.
6.4.1 Mining Spatial Association

A spatial association rule is a ruleindica


ting certain association relationship amo
of spatial and possibly ng a set
some non-spatial predicates.
Example ; “Most big cities in Can
ada are close to the Canada U.S.
border”
A strong rule indicates that the patterns in the rule have
relatively frequent occurrences in
the database and strong implica
tion relationships.

Scanned by CamScanner
| where at least One of the predicates p

se le wl ln erates Ch ark + Qnivaspas


: t of prediica we ‘a
Ase mee ; robe Shag pee pm
f hierarchy
mipimum aresupport threshold
large at thej, otk faite
tee ® at level k ig UPPort ofrop ;pe tn
|, The confiidence of a rule “ P+ Qis” ig eve} TOM the
ia co! Ncept

ene
“ 735
tm confidence threshotq aE Nit confidence sno
ess than its
A epeete is strong if predicate «p >:
‘P> QS" is high.
Vis*TA¥Ge in set Sand the condense ot
| 64.2 Mining Co-location Patterns

ty occur in together.
- ForExample
Se ° Ecology : Symbiotic relationship
in animals orplants
on embedded jn o Public health : Environmental factors
and cancers
0 Public safety : Crime generators and crime
events
- Colocation pattern is a subset of spatial event types or inst
of an
these ce
event s
types
Jrequently occur together.
~ Consider following example :
Li id: -
Ti veprasents instance | with feature ype T ; ?-
una insta nces represents neighbor relafianships
among a set

C2
rrences in Al

/ B.1
i

Scanned by CamScanner
Spatia
land Web Minin,
~ From the above example,
© Spatial event types are A, B, C

© Spatial event instances are A.1, A.2, A.3, B.1, B.2 ... .
© Candidate Colocation are the combinations of event types (A, B), (B, C), ( AC)...
© Neighbor relationship represented by solid line between
event instances (A.1, B.] ),
(A.1,.C.2), (A.3, C.1)..5.3..

© Table instance of (A, B) i.e. direct relationship between events A and B from says
graph :
(Al, B.1)
(A.2, B.4)

(A.3, B.4)
.
Interestingness Measure are defined
by participation ratio (pr) and partic
ipation index (pj)
1. Participation ratio is defined
for the colocation pattern

C= (f,, £ ..., f)
| tf, Table.Instance (c)I
pr (c, f) =— eo! |
[Table.Instance (fl
2. Participation index for
given c can be calculat
ed as
| Pi(c) = min, {pr(c, £)}
~ Exampl
: Find
e pi
i
Table instance for each pattern -

Scanned by CamScanner
ysing & Mining (MU-Sem. 6-Comp.)
_ 6-10 Spatial and Web Mining

‘A, B, C) is as given below because only A.3, B.4, C.1 have direct

fe Table instance for ABC

A3 B.4 Cl

¢ the participation ratio using formula given above :


calcula!
pr (A,B,C). A) = 3
A )
of table instances of ABC / number of instances of
gumbet
pr (A,B, C).B) = 3
(i

1
pr (A,B,C),C) = 3
pi(c) = min(1/4, 1/5, 1/3) = 1/5
crocation Mining Algorithm
' starting with k = 1

alent pattern
_ Iterative until no prev
ns {c,}
Generate size k colocation patter
h c,
_ Generate table instance of eac
prevalent
= Compute each pi(c,), add to result if
k= k+l

k=1 A B Cc

[XD
AB AC BC

\LZ
k=2

k=3 ABC

Scanned by CamScanner
PI(T4) = og,
pPi(TS) = 0.5

&reater and eq ion patterns


ual to threasho are (A, C)
and (B
ld,
~ Algorith
m 8iven by H
uang Yan js
Colocation Mi 3
ning Algori
thm :Filter-B
aseq
Pah Pema
Sym DOL

all candi date


colocation of size k
me] all Prevalen
Starting with t Colocation
k=] Of size k
Iterative unti
l no Prevalen
t pa tter
n
Generate size
k Candidate pa
tterns c «fro

For each Candid m Prevalent pa
ate GEC, tterns Py
Check all subset Pa
tterns
If any subset
Pattern not Pr
evalent, prune
Out

Scanned by CamScanner
s i n g & M i n i n g (MU-Se
ou
of c a c h r e m aining c, € Cy
ar se t able instance
co
Ge gerale

ach pile
F Co mpute e o w threshold,
u t i o n b e l
08 coarse resol
(cy) pased
if p prune out C,
°
© CG
each remaining
ov e th re sh ol d, ad d c, to set Py
° each pili): if ab

- CLARANS
C l u s t e r i n g Techniques
6.5 Spatial
ba se d on R a n domized Search)
Cl u stering Algorithm
CLARANS (A ho mo ge ne ou s groups of objects
identi fy
de sc ri pt iv e task that seeks to , A. , Kriegel, H.-P. and
Clus te ri ng is 4 , Fr om me lt
es (Ester, M.
tribut spati
pased on the va
lues of their at pe rm it s a ge ne ra li zation of the
ial data sets, C Ju
stering icit
Sander, J, 19 98 ). In sp at
of sp at ia l ob je ct s which define impl
ion
plicit location and extens
component like ex
ighbourhood.
relations of spatial ne s.
A R A by us in g mu lt iple different sample
on CL
o CLARANS improves
C L A R A N S dr aw s a sample neighbour.
o Forevery step of search
thi s is no t co nf in in g a search to localized area.
© So
pa ra me te rs nu ml oc al and maxneighbor.
0 Ituses two additional
di ca te s the nu mb er of samples to be taken.
o Numlocal in h an y specific node can be
of a no de to wh ic
er of neighbors
o Maxneighbor is the numb
compared.
Jiawel Han
Algorithm CLARANS given by
mincost to a large
and maxneighbor. Initialize i to 1, and
1. Input parameters numlocal
number.

2. Set current to an arbitrary node in Gn;k.


3. Setjtol.

™~.
_—-
Scanned by CamScanner
¥¥ Data Warehousing & Mining (MU-Sem. 6-Comp.) _6-13 Spatial and Web yj
4, Consider a random neighbor $ of current, and based on 5, calculate the cost differentia, of
the two nodes.
5. If Shas a lower cost, set current to S, and go lo Step 3.
6. Otherwise, incrementj by 1. If j maxneighbor, go to Step 4.
7. Otherwise, whenj > maxneighbor, compare the cost of current with mincost. If the former
is less than mincost, set mincost to the cost of current and’set bestnode to current.
8. Incrementi by 1. If i> numlocal, output bestnode and halt. Otherwise, go to Step 2.
Steps 3 to 6 above search for nodes with progressively lower costs. But, if the curren,
node has already been compared with the maximum number of the neighbors of the Node —
(specified by maxneighbor) and is still of the lowest cost, the current node is declared to
be a “local” minimum. Then, in Step 7, the cost of this local minimum is compared With
the lowest cost obtained so far. The lower of the two costs above is stored in mincost,
Algorithm CLARANS then repeats to search for other local minima, until numlocal of
them have been found.
Syllabus Topic : Web Mining

6.6 Web Mining -

Introduction to Web Mining

— Web Mining refers to application of data mining techniques to web data.

Wob mining —

_ Web content ‘Web stricture _ Wob usago’


, mining mining _ Mining

s Uso intorconnoctiona | | Undamtand access’


epee between wob pages to} | pattoms and the
pages giva wolghttotho =} | fronds to Improve
pages -olructure
Distinguish personal
home pagos from
other wobpages: —
Bla

Fig. 6.6.1 : Web Mining ‘Taxonomy

etl

Scanned by CamScanner
|aad
ah

is...
See Balad Wey ue
=e.
-
on 5, calculate the i¥ Data Warehousing
cosy diff.erentin = & Mining (MU-Som,
Me - ey ithelps ini solving
i the Problem
robs no of how,
_ The Process involves mining
es_
Lips ing HSIN the we sig ==
p 4. : .
nen
urent With mincost If th _ It is the process
web data.
of qj i
re oF gs Meanin
seton
bestnode to curren covering the "seful and Previous
t. = forme eee
| ISe, &0 go 1, to Step 2
- Web data is - “
o ” = anal
Wer costs. Bur Web content — text, nom
if the im:
of the Neighbors
> current node of the
ec
minimum is compares
wok the ogret Sti
o Web usage
~ —
~ pt ei
ove is'wtorea 2
ais a
to nd 5s, 1 —
|
tp logs, 4pp server logs,
etc, |
6.6. Ww
Web Mining is Different
nini ima, untili AUMloca]
:a from Classical py
- Web Mining
ining isi Similar >
to data mining,
- Itdiffers in data
ae collection,
- Data Mining
: The collection
of data is already
~ Web Mining : Data done‘ant
Stored in a4 data
collection is done data warehouse.
; as by crawling through
a number of target
66.2 Benetits of web Pages.
Web Data Mining
:
Match your available
ta, resources to visitor int
erests,
Increase the value
of each visitor.
Improve the visitor’s experienc
e at the website.
Perform targeted resource man
agement.
Collect information in new
ways,
-
Test the relevance of content and web site
architecture.

re Syllabus Topic : Web Content Mining


eel ntactaaila dels Hy}
es)
§.7 Web Content Mining a
a
Fo

~Q
| 6.7.1 Introduction to Web Content Mining a
|
ee

rm
from atof
the content a webn
io page. 4°)
- p s to discover useful info
The proces Ly
A
*
4

oO .)
a Mi

Scanned by CamScanner
¥ Data Warehousing & Mining (MU-Sem. 6-Comp.) 6-15 _ Spatial and Web Minin

The type of the web content may consist of :


Le) Text
oO Image
oO Audio
Video

Web Content Mining is also known as Web Text Mining.


Web Content Mining uses the following techniques

oO Natural Language Processing(NLP)


Oo Information Retrieval (IR)

6.7.2 Text Mining

The process of deriving high quality information from text.


Text mining is an. interdisciplinary field which draws on information retrieval, data
mining, machine learning, statistics, and computational linguistics. As most information
(over 80%) is currently stored:as text, text mining is believed to have a high commercial
potential value.

Using Statistical Pattern learning, high quality of information is derived.


v

Text mining involves the process of :


oO Structuring the input text by Parsing.
oO Addition of some derived linguistic features and the removal of others.

oO Subsequent insertion into a database.


oO Deriving patterns within the structured data.
o Evaluation and interpretation of the output.
‘High quality’ in text mining refers to some combination of :

oO Relevance

°o Novelty

o Interestingness

Scanned by CamScanner
Typical text mining tasks
include :
o Text categorization

o Text clustering

o Conceptentity extraction

/ Entity relation modeling


(i¢. learning relations betw
een named entiti
entitesies)
;" |
Syllabus Topic : We
b Usage mining

6.8 Web Usage Mining

6.8.1 What is Web Usage Mining


2
nation retrieval, data
As most information

are likely to be Visi


> a high commerciay ted im near future based
on the active users

- The usage data records the uset’s


behaviour when the user browses or make
on the web site s transact
in order to better understand and serve
the needs of users or Web-basea
applications.

Automatic discovery of patterns from one or more Web servers.


Organizations often generate and collect large volumes of data; most of this information is
usually generated automatically by Web servers and collected in server log. Analyzing
such data can help these organizations to determine :
oO The value of particular customers.
oO Cross marketing strategies across products.

© The effectiveness of promotional campaigns, etc.


The first web analysis tools simply provided mechanisms to report user activity ag
le to determine such information
such tools, it was possib
recorded in the servers. Using
as:

o The number of accesses to the :


or time intervals of visits.
o Thetimes

o The domain names


and the URLs of users of the Web server:

> LP
Scanned by CamScanner
—*_Data Warehousing & Mining (MU-Som. 6-Comp.) _ 6417 PUA a We Mining
A very little or no analysis is provided by these tools of the data relationships among the
accessed files and directories within web space.
The tools for discovery and analysis of patterns may be classified into following two
categories:

o Pattern Discovery Tools


o Pattern Analysis Tools

Web servers, Web proxies, and client applications can quite easily capture Web Usage
data.

Web Server Log’: It is a file that is created by the server to record all the activities jt
performs. .
For example when a user enters URL into the browsers address bar or requests by clicking
ona link.
The page request sent to the web server maintains the following information in its log :
o Information about the URL.
o Whether the request was successful.
o The users IP address.
o Time and date about the page request completed.
o Thereferrer header. |
Web Mining systems use the web usage data and based on the analysis the system can
discover useful knowledge about a system’s usage characteristics and user’s interest
which finds its application in :
o Personalization and Collaboration in Web-based systems,
o Marketing,

o Web site design and evaluation,


o Decision support.

6.8.2 Purpose of Web Usage Mining

Web usage mining has been used for various purposes : _


o A knowledge discovery process for mining marketing intelligence information
from Web data.

———ill
Scanned by CamScanner
ry

WF Data Warehousing & Mining (MU-Sem, 6-Comp.) 6-18 Spatial and Web Mining
o In order to improve the performance
of the website, web usage logs can
to extract useful web be used
traffic patterns. |
Search engine transaction logs also provide valuable knowledge
about user behaviour on
Web searching.
Suc
ie h sinformation is ye,
TY useful for a better understanding of user
ormation seeking behaviour and can s’ Web searching and
improve the design of Web search sys
tems.
Web usage mining is useful in finding interest
important knowledge about
ing trends and patterns which can provide
the us ets of a system.
Sevev eral Machini e learni: ng and
data minaeing techniques for e.g. Associ
classification and clu ation rule mining,
stering can be applied.
Web usage data Provides a ext
remely useful way to learn user’s
interest.
Weveeb logs may y bebe used in web
usage minin & to help identify use
. Similar web pages. These patter rs who have accessed
ns may be us eful in web searching
and filtering.
For e.g. Amazon.com uses collab
orative filterin g for recommen
customers based on the data ding books to their
col lected from other customers sim
ilar interests or purchasing
history. .
|
6.8.3 Web Usage Mining Activities
Pre-processing Web log :
o Cleanse.

o Remove extraneous information.

o _Sessionize.

Session : All the requests that a single client makes to a web server.

Pattern Discovery :

o Uncovering the traversal patterns.


ted by user in a session.
© Traversal pattern is a set of pages visi

o Count patterns that occur in sessions.


references in session.
o Pattern is sequence of pages
Discovery
sc uses techniques
5 such as Se a —
ed eling. rules,
owes
clustering, classification, sequential pattern, meen i

Scanned by CamScanner
YF Data Warehousing & Mining
SSS eee
(MU-Som. 6-Comp.) 6-19 _
SSS SSS
Spatial and Web= Mining

- Pattern Analysis :

o Once the patterns have been identified, they must be analyzed to determine how
that information can be used.
o A process to gain Knowledge about how visitors use Website in order to Prevent.
(i) Disorientation and help designers to place important information/functions
exactly where the visitors look for and in the way users need it.
(ii) Build up adaptive Website server.

6.8.4 Web Server Log

6.8.4(A) Structure of Web Log

The Web log contains the following information :


(i) The user’s IP address,

(ii) The user’s authentication name, .

(iii) The date-time stamp of the access,


(iv) The HTTP request,

(v) © The response status,


(vi) The size of the requested resource, and optionally also,
(vii) The referrer URL (the page thesie “came from”),

(viii) The user’s browser identification.

6.8.4(B) Web Server Log - An Example

Web log fields

=
© 152.152.98.11

© IP address - can be converted to host name, such as xyz.example.com


- Name

o The name of the remote user (usually omitted and replaced by a dash ‘-”)

Scanned by CamScanner
Web Minin, ¢— & Mining (MU-Sem.6-Comp)a _. 20
Spatial and WelyMining
gin

° ain of the remote user (also usually omitte


TMine hoy, d and replaced
by adash “2)
7 pate/Time/TZ.
event
6 16/Nov/2005: 16:32:50 -0500
Vfunctions Request, Status code, Object size, Referrer, User agent.
. = mn

- - [16/Now/2005 : 50 - 0500)“ GET/ob


] 152,152.98 s 200 ...|
/ HTTP/1.1"

Fig. 6.8.1 : Example of a Web Server and Client Interaction

server logs for loganalyzer-net. All the


Ex. 6.8.1 : The following is a fragment from the
http://www. loganalyzer.net/.
relative URL's aref ‘or the base URL
gment of log file....
Soin.: First let’s look at a fra egorer PATE aTTP/I-1”
aia mene
200 1179 ~ ”
Pes si]

66:249.65. 107 = _ (08/0cv/2007:04:54:20-O


aeamemnetemh ertan areER MS

fur golem”
Peta) oe Googlebot/2.11; hut:
/HTTPI. J"
“CETae b ie. ee
a gs200pi10801
an - <n (08 0eu /20 07: £11:17:55 -0400] q=l &rls= mozillazen-
iL nini
a.
oe= u
SS
u
no
ie=
Wi
lyz erf
o:
ana

ae
*h
ee
h?q=1
ri or ” "M al la . (W ad vie 0 at
O07: 35 cetseertA
Gecko/20070914 Firefox/2. 55 0400] ae nt
h1n1,111,111 - - [o8/Ocv2007:11:17: i auaeen
=
atest
ena eigeesc oo
0/200 0014 Firefoxi2.0.0.7". for
retrieved and indexed them
ee 07. The pages are
show
Te above fine” "
1. A google spider, which ' 6.249.650qu ai l! 11 who gave 8 search for Nihuo
Web
ine.
their search eng
m Ip address °"
2. Another visitor is fro
Log analyzer homepage-

ttn sO,

Scanned by CamScanner
W Data Warehousing & Mining (MU-Sem. 6-Comp.) _ 6-21 Spatial and Web Mini
A few things to note :
1. A single hit is represented by each line in the file, for a file on web server.
2. A web page hit is a page view. For c.g. if a web page contains 4 images that a hit on that
page will generate 5 hits on the web server, one hit for the web page and 4 for the images
3. IP address or cookie is used to determine a unique visitor. A visit session is terminated if
the user falls inactive for more than 30 minutes. So if a unique visitor visits the same
website twice then it gets reported as two visits.

Description of fields

Let's look at one + line from the above ‘fragment. —

i. 11. lll. 11

[08/Ocl/2007:11:17:55-0400]).
“GET/HTTP/LI"

10801
“http://www. google.comsearch?q= Jog-+analyzer&ie=utf-B&ioe =utf-8&aq=txls=org.mozillazen-
UStofficialéeclient=firefox-a” ices ;

1. IP address : “111.111.111.111”

This is the IP address who visited the site Nihuo web log Analyzer:

2. Remote log name: “‘-”

This field will return a dash until Identity Check is set on the web server.

3. Authenticated user name : “~”

This information is available only when accessing content which is password protected by
the authenticate system of web server.

4. Timestamp : [08/Oct/2007:11:17:55 -0400]

This field represents the timestamp as seen by the web server.


Scanned by CamScanner
Spatial and Web Mining
Access request : “GET / HTTP/1.1”

qhis represents the HTTP method called


as GET and the version of the HTTP protoc
which is HTTP/1.1 ol
code : “200”
Result status
This field represents the status
of the requ est, cod
(eg: HTTP 404 “File Not Foun e 200 represents su Ccess. Oth
d” or H TTP 500 er code like
“Internal Server Error’).
7, Bytes transferred : “10801”

This field tells the amount of data transferred to the


or 10k is the size of the homepage, user. In this example its 10801 bytes
.
g, Referrer URL

“http/www .google.com/se
om. arch?q=log+analyzer&ie=u
8&aq=téerls=org.mozilla:en-U tf. 8&oe=utf-
S:official &client=firefox. a”

page the visitor was on when


this page. they clicked to come to

9, User Agent

“Mozilla/5.0 (Windows; U; Windows NT


Firefox/2.0.0,7”
5.2; en-US: v:1.8.1.7) Gecko/20070914
The “User Agent” identifier, The user
agent is whatever application the
access the site, it could include a visitor used to
brow ser, a web robot, a link checke
r an FIP client.
In this case “Mozilla/5.0” probably mea
ns visitor's brows er is Mozilla compatib
Windows NT 5.2” indicates Windows le,”
2003, “en-US” probably implies it's an Engl
version, “Firefox/2.0.0.7” means Firefox ish
2.0. In the first line, “Mozilla/5.0 (compati
Googlebo t/2.1; ble;
+http://www.google.com/bot.html)” means this hit is caused by
g0oglebot(spider).

Syllabus Topic : Web Structure Mining

6&9. Web Structure Mining

6.9.1 Introduction to Web Structure Mining

The structure of a typical Web graph consists of Web pages as nodes, and hyperlinks as
edges connecting between two related pages.

.
LPn~ a
Scanned by CamScanner
¥ bata Warehousing & Mining (MU-Sem. 6-Comp.) _6-23 Spatial and Web Minin
a Hyparink

Web
— document

Fig. 6.9.1 : Web Graph Structure

~The process of using the graph theory to analyze the node and connection structure of 4
web site. Web structure mining can be divided into two kinds :
o Extract pattems from hyperlinks in the web. A hyperlink is a structural
component that connects the web page to a different location.
© Mining the document structure. It is using the tree-like structure to analyze ang
describe the HTML or XML tags within the web page.
- Web structure mining has been largely influenced by research in :
o Social network analysis. .
© Citation analysis (bibliometrics).

© In-links : The hyperlinks pointing to a page..


o Out-links : The hyperlinks found in a page.
© Usually, the larger the number of in-links, the better a page is.
o By analyzi ng cont
the pages aini
a URL, ng
we can also obtain.
o Anchor text : How other Web page authors annotate a page and can be useful in
predicting the content of the target page.
~ Web Structure Mining : Discovers the structure information from the web
© This type of mining can be performed either at the (intra-page) document level.
or at
the (inter-page) hyperlink level.
© The research at the hyperlink level is also called Hyper
link Analysis.
— Motivation to study Hyperlink Structure :
O Hyperlinks serve two main purposes.
o Pure Navigation. /
© Point to pages with authority on the same topic of the page conta
ining the link.
o This can be used to retrieve useful information from the web.

Scanned by CamScanner
structure Terminology

I web-graph : Directed graph,


| Node: Each Web Page
Iaink : Each hyperlink on the Wey, MS an Ode of the Wey,
i
i sn-degree : The number of distin
| t-d :
p, ORS THE-PUMDEr Of distinct linge
L Directed Path : It
is a sequen ;
| toreach another node say, °! links

- Diameter :It is the maxim,


: of nodes
all pairs um of al]
p and q in the We ne ares ats
j

6.9.2(A) Page Rank Technique


Used by Google
Prioritize pages retumed from
search by looking at Web structur
e.
| be usefulin Importance of page is calculated bas
ed on number of Pages which point to it -
Backlinks.
Weighting is used to provide more importanc to bac
e klinks coming from important pages.
Page Rank of Page P is defineas
d PR(p)
evel or at
PR(p) = c (PR(IYN; +... +PR(QVN,)
PR(i) : Page Rank fora page i which points to target page p.

N; : number of links coming out of page i


1 and is used for normalization
c: the constant having value between 0 and
Example anew - *
wwers!
yriver ye 2009
Google’s : Page Rank vine web poges poiig tit ae credit
the rank 0 pages
Rank of a web page depends on i 4

Scanned by CamScanner
ees nenrttee eel
Po
a *.

OulDeg(P2)

1
OutDeg(P3)
. = a,
Oe ewe nent . =-

PR(P) = UN + (1a) (SPR (PI PR(P )_____PR(P


ERED, OutDeg 2(P 2)+ OutDeg (P3y))
The Page Rank Algo
rithm
Set PR In, rp, .
----In], where q is
some initial ra
Pages in the graph; ofn
pak
ge I , and N the number
of we,

Stee eee

PRuy << AT pp,


PRs << (1-@*PR,,, +¢¢D:

6.9.2(B) CLEVER Te
chnique

Identify authorit
ative and hub Page
s by creating Weig
(@ Authoritative Pages : hts.

Oo Highly important
Pages,
© Best source for Te
quested informat
ion,

Scanned by CamScanner
-
f 008 warehousing & Mining (MU-Som. 6-Comp.) _ 6-26
Spatial and Wob 1
) Hub Pages:
(ii
Oo
Contain links to highly important pages.

pubs and authorities are ‘fans’ and ‘centers’ in a bipartite core of a web graph.
-

A good hub page is one that points to many good authority pages.
A good authority page is one that is pointed to by many good hub pages.
Hubs

Fig. 6.9.2 : Bipartite Core

ic Search)
HITS Algorithm (Hyper! ink-induced top
kind of rank algorithm that
HITS (Hypertext Induced Topic Search) algorithm is a
analyzes Web resource based on local link.
query, and Page
The difference between Page Rank and HITS is that HITS is related to
each page a rank
Rank is a kind of query unrelated algorithm. Page Rank algorithm gives
each
value which is unique and unrelated to query keyword, but HITS algorithm gives
page two values, which are Authority value and Hub value.
Authority page and Hub page are two important concepts in HITS algorithm, which are
the concepts all related to query keyword.
Authority page refers to some page that is most related to query keyword and
combination. ,

Hub page is the page that includes multiple Authority pages. Hub page itself may not have
direct relation to query content, but through it, the Authority page with direct relation can
be linked.
Analysis of HITS Algorithm

The central idea of HITS algorithm is that :


this
- Firstly, use text-based retrieval algorithm to obtain a Web subset, and the pages in
subset all have relativity to user query.

LPB»
Scanned by CamScanner
_ Data Warehousing
& Mining (MU-Sem. 6-Comp
,) _ 6-27
Spatial anc Wab
Then, HITS Performs link
Hub pages related {0 ana lys is on this subset, and find out the Aut
query in the subset. hority Pages tig
The selection of subnet in HITS
This subnet is defined as
algorith m is acquired by means oif key
root set R, then use link worg Matching
S is the Page that includes analysis to acquire set § fro
Authority page and ulti m TOO set Re
Process from R to S is called mately meet the query requ
“Neighborhood Expand”, iren The
The algorithm procedure for computing S is
shown below:
Step 1: Use text keyword matching
to acquire root set R, whi
Or more; ch includes thousands of
URL
Step 2; Define $ to
R, that is § and R are
equal:
Step3: To each page p in R, pu
t the hyperlinks includ
referring to P into set ed by p into set S; PUt
S$; the pages
Step4: Sis the acquired expanded
neighbourhood set.

above, the
keyword
Analysis of HITS Link

~ The process of HITS


link analysis takes ad
are interacting to ide vantage of the attrib
ntity the. m from expa ute that Authority ang
nded set S. Hub
Assume the Pages in
expanded set S are
respectively 1,
B (i) represents the Page
set referring to page
i,
F (i) represents the
page set referred
by Page i,

Scanned by CamScanner
wae Mining (MU-Sem.6-Comp.) 6-28 Spatial and Web Mining
sl
wo
the Hub value of each page is the sum of the authority values of pages
teP -’ . That is:
jn * ing to it
0 fe ef
I: a= Z h
jeB(i)

oO h, = Y

jeB(i)
jo stePSs [and O, are based othe fact that one Authority page is always referred by

ro pub pages, and one Hub page includes many Authority pages.
‘ iteratively computes the two steps. I ; they converge. At last,
algorithm iteralive Steps, I and Q, till
aha the Authority and Hub value of page i. ia

on procedure is shown below :

o , Initialize ay, by; {


wis Iterate procedure I, O; Perform iteration I; Perform iteration O; Normalize the value

of aand h, let Ea, =1; Zh, =i


i i
step 3: Complete iteration

. Fig. 6.9.3 shows the application of HITS algorithm in a subgraph including 6 nodes.

h=0 h=0 . bed


a=0.408 a=0.816 a=0.408°

h=0.408
a-0

Fig. 6.9.3 : HITS algorithm


on six-nodes graph

~~
Scanned by CamScanner
~ As shown in Fig. 6.9.3, the authorit
y value of node 5 is equal
1, 3 which refers to it, after normalization, the value is 0.816. to the Hub values of Nodes
~ Assume A,,,.,, is the matrix of subgraph, then the
value of position i, j in matrix a is
to 1 (if page i refers to J), or 0. Set a to be the authority valu
e vector [a,, a, .__, a tbe
the hub value vector [h,, h,, ...,h,], then the iteration I, O can be CXpresseq ay
a= Ah, h = A™ o. After completing the iteration, the values of authority
respectively satisfy a = c,AA"a, h = c, A’Ah, which c, and and hub
c, are constant in order to
satisfy the normalization condition. Thus, vectora and vector b respectively become the
eigenvector of matrix AA™ and
matrix A'A. This feature is simi
algorithm, their convergence speeds lar to Page Rany
are decided by eigenvector.
6.10 Web Crawlers

~ A web crawler is an automated program that scans or crawls thr


ough the internet Pages to
create an index of the data.
— A web crawler is also known as Web spider, web robot,
bot, crawler and automatic
indexer.

~ Search engines makes use of web crawler to collect infor


mation about the data on public
web pages, their primary purpose is to collect data iontex
so that when a user enters a search term
on their site, they can be quickly be provided with relevant web sites. - ©
o CC
~ The search engine’s web crawler visits a web page,
it collects information like visible text. o R
hyperlinks and the various tags like keyword rich meta tag.
This information may be used N
by the search engine to determine what the site is
about and index the information. The °
website is then included in the search engines database
and its page ranking process. o L
— Web crawling is considered to be an important method
for collecting data and keeping up Approac
with the expanding internet.
~ A vast number of pages are added on a continuo - Cons
us basis and information keeps changing
continuously, a web crawler is a method for search
engines and others to ensure that their ~ Perf
database is upto date.
Different Types of Crawler

- Traditional crawler : Visits entire Web and replaces index.


- Periodic crawler : Visits portions of the Web and updates subse
t of index.
- Incremental crawler : Selectively searches the Web and incrementally modifies
index.
-— Focused crawler : Visits pages related to a particular subject.

Scanned by CamScanner
°o
Cr awler : Vi
si
Visits Pages to
based on crawle
| Context focuse r and distiller sc
ores| ,
d cr awler
| - Context Graph:
Oo Context graph crea
ted for each seed do
cument.
© Root is the seed doc
ument.
Oo Nodes at each level show doc
uments with links to documents at
next higher level.
oO Updated during craw itself.

Approach

Construct context graph and classifiers using seed documents as training data.

Perform crawling using classifiers


i and context graph cre ated.
Seed document

yer
Level 1 documents i i ;
were” oe
at ? a?
ts yer gre’
Level 2 documen Be
;
Level 3 documents

Scanned by CamScanner
: ata Warehousin, g &
Mining

a Syllabus Toplc : Applications of Web Mining


———
6.1—
— 1 _—
Ap—
plP
icp
atl
iot
nscof
of Wei
at We bb o
MiMin
nis
n in
ng
I. E-Learning.
2. Digital libraries,
3. E-Government, -
4. Electronic commerce.
5. E-Politics and E-Democracy.
6. Security and Crime Investiga
tion.
7. Electronic Business,

QoQ
Chapter Ends

ee a

Scanned by CamScanner
(Section 5 ? Outline
Ans. : their im
Spatial data mene
are data tha (5 Maarks)
a8 data ab t have spati
out Object al Or locati
implement s th € on Componen
ed with a Sp mselves ar
e located I ts Spatial da
ya Pattiti ecifi
c location at na p Ysical can be viewe
onin
g of the data tribute(s) s space. This g
base based o u ch Tess OF may be
Geograph n location, latitude/ lon
ic Informat gitude or
locations On ion Systems
the surface (GIS) are Us
Of the earth, ed to store i
nformationTelated to
Q.4 (b) What
is Metadata?
Wh Y do we
80 effective? need meta da
( S e c t ion 1.7) ta when Sea
Ans, : rch engines
like google s
eem
(5 Marks)
_

Word-based

Process : Meta data is


Most appropriate data used to direct a query to
source. the
The structure of Meta data
will differ in each Process, be
cause the purpose isis difdi ferent.

MLOh
Scanned by CamScanner
Ww Data Warehousing & Mining (MU-
Sem. 6-Comp.) _A-2 App
Sa endin
iy
Q. 4(c) In real-world data, tuples with : missing
values for some attri ibutes arear a common
occurrence. Describe various methods for handling this problem.
{Section 3.1 2.3)
(5 Marks)
Q. 1(d) With respect to web mining, is it possible to det
ect visual objects using Meta-objects>

Ans. : (5 Marks)
It is possible to detect visual objec
ts using meta-objects as visual
objec
ts are re
by small rectangle that completely contains enensetle
that object which is minimum boundi “Pre sented
(MBR). ng Tectangle
Q. 2(a) Suppose that a data wareh .
ouse j for DB - UniveUnivers
rsi ity consists of the foll
dimensions: student, course, semester, and instru Owing four
ctor, and two measur
avg grade. When at the lowest conceptual €S count and
level (e.g., for a given Student, coy
semester, and instructor combination), the avg-grade
course grade of the student. At stores the actual measure
higher conceptual levels, avg-gr
grade for the given combination, ade stores the average
(i) Drawa snowflake schema diagram for the
(10 Marks)
data warehouse.
(ii) Starting with the base cuboid [stud
ent,
- OLAP operations (e.g., roll-up from semecourse, semester, instructor ], what Specific
ster t 9 year) should one
per
rses for each Big University studen form in order
to list the average grade of CS cou
Ans. : t.
() A snowflake schema is as sho
wn
course
dimension table univ
student
fact table dimension table area
dimension table
student. id _student_id | |
course_name. area id
course_id _student_namé’. ee | mee,
semester _id _areaid
| city
‘ _Instructor_id
—- province
semester major * . country
dimension table count
status
~ avg-grade 2 ;
university a
ester
instructor
dimension table

_instructor_id i
— dept

Fig. 1-Q, 2(a)


(ii) Starting with the ba
se cuboid [stud
ent, course, Sem
operations (e.g, roll ester, instructor], what specif
-up from semester ic OLAP
average grade of CS fo year) should
one perform in order to list the
courses for eac h DB-
operations to be performed are : University student, The
specific OLAP
Roll-up on course from
Course id to department
.
*
=.a es Knowledge
Pobitraeions

Scanned by CamScanner
r Mining (Wile oetil. erie
i

from student id to university.


_vp on student
-~ Rolice on cours
e, student with department = “CS” and university = “DB-University”.

-_down on studen t from univer


sity to student name.
prill-do
- the relationship between data warehousing and data replication? Which form
‘aid) i
tisfication (synchronous or asynchronous) iis better suited rehousi
for data warehousing?
of reP iain with appropriate example. = (10 Marks)
why? EXP

we? _houses are carefully designed databases that hold integrated data for secondary
- pee . you only need data from one system, but can’t impact the performance of that
nee” then it is suggested to take a copy ie. a replicated data store unless the
sro itis and width of data is very large and the various ways to require access to it
eit high, a data warehouse would be overkill.
A replicated data store is a database that holds schemas from other systems, but doesn’t
truly integrate the data. This means it is typically in a format similar to what the source
systems had. The value in a replicated data store is that it provides a single source for
resources to go to inorder to access data from any system without negatively impacting the
performance of thatsystem.
_ Data replication is simply a method for creating copies of data in a distributed
environment.
_ Replication technology can be used to capture changes to source data.
o Synchronous replication : Synchronous replication.is used for creating replicas in
real time. In synchronous replication data is written to primary storage and the
replica is done simultaneously. Primary copy and the replica should always remain
synchronized.
o Asynchronous replication
: It is used for creating time delayed replicas. In
asynchronous replication data is written to the primary storage first and then copy
data to the replica.
Synchronous replication is best suited for data warehouse as it creates replicas in real
time. i

Q. 3(a) The following table consists of training data from an employee database. The data
‘have been generalized. For example, “31::: 35" for age represents the age range of 31
to 35. For a given row entry, count represents the number of data tuples having the
Values for department, status, age, and salary given in that row. (10 Marks)

Tech
Publications

Scanned by CamScanner
E_Data Worerousing & Mini

Department
\.
ng (MU-Sem.6-Comp.)

| Status | Ago
_A-4
tenet,
pata!
gain |
| Salary | Coun t
Sales
Senioior | 31..ZA.35/ 46K...50K| . bran c
30
Junior | 26... tos”
Ifa di
Junior | 31... -

Intrin
Splith

— Gain
Gain ]

Let status be the


class label attrib
ute
() How woul
d you modify
consideration the the basic decisi
count of €ach gene on tree algori
ra li thm to take int
(il) ~— Use your ze d da ta tuple (i.e., of ©a y
algorithm to construct a deci ch row entry?
Ans. sion tree from
the given data
(i) The basic decisi
on tree algorith
the count of cach m should be mo
generalized data dified as follow
tuple. The coun s to take into Co
the calculation t of each tupl nsideration
of the attribute e mu
Count into Consid se lection measur st be Integrated int
eration to dete e (such as in o
rm in e the most comm fo rm at io n Bain). Take the
(il) Age and Sa on Class amon
lary have been g the tuples,
attributes. When discretized in
to intervals, Yo
tying multi-spli u can Consider
example, if yo tting, you ca them like ordina
u have a three-wa n
merge values l
y sp li t of ge by their closenes
at one, and [36 40 , you can have s, For
Teasonable spli ] [41 45] [46 50] [2 6-30] at one bran
ts at one, It is 0k as ch , [31 35]
Iong as you try
Step 1; Choose a number of
the Toot node
It is an altera
tion of the in
high-branch attribut formation Bain
es, that reduces its favouritism
on
C
n
te —
Terre .
+‘
re
"Tee

Scanned by CamScanner
Gain ratio should be bis wa
dhs. da
3
branch. So, it considers mii Ata js -
to split. eroOF bran,

isa]
Where, p; is the Telative nen
oP Intrinsic information Split iy cy of Classj in A
SplitInfo (S, A)= - =e Bt tog Sl
8
- Gain ratio normalizes info
Gain Ratio (S,A)=

ke into

*ntry)?

Tation
1 into .
> the

CAIN = (1- (iesal -(B))


a,

inal
For
5]
165 (1-(ita) - G0) )-ies'-@)-@))
— 0.4001= 0.0315
~é(I- (a) -()) )-res(! i) }-())- 0.4316
of
31 14 iB)
SplitINFO = ~(7¢65!o8 110, 1653
10;,, 765+ 8 165° eG"
165 765 98165) = 09636 penise® s
the Ne™ ity
yainers ois" spp

GainRATIO = ogag7 02227 : oe moved


ne

Scanned by CamScanner
*
os

Gi

Theb

17 Senior ror DeP=


Kisdoreoe3a):
"'8-2-Q.
.
Simplitied decision tree for Age.

Gain RATIO = a;
GAIN = (i- ia) -~ Gy)

2-8 -@)
rat 6
~Tes(1--(3y-

B(1- -() -F))-o4316- ~0.2363

|= aaa "
SplitINFO - (2: 22,2
7
165 * 165 log 17)_
15
ca" Testes 163) =9.9513
CNRATIO = aoe = 0.2053 ° °

For Ay

GAIN = (\-@2)_( (zy) - (. esp (9)

“Te 2 -(3))- £1.


= 0.4316 ~0.2234
= 0.2082
a

Scanned by CamScanner
i SpliINFO = ~ (P41 4
mer P 165
3 ==
* Tee tog 63.
Gai
= nRA TIO = 9.2082
The bigge 0.8349 = 0.2494 los"
st GainR
ATIO is
obtained
ror Departme
nt

5) (3)
363

Ving So
= 0.0815 ~0.051
|
= 0.0305
| ‘ -_(80,
SplitINFO 8 4 4 1
= -(2 108 94-94 log 94 +54 = 0.5099

GainRATIO = asa = 0.0598


For Age
nn

2 |
| ener,

sriic net
yer?
s cr
jee ase
cei

Scanned by CamScanner
“am= (GF) (65g
HC-65-@)86-7 a
= 0.0815 — 0.07
73
y
= 0.0042
SnuentiG =. ‘80
-% log9480-94 4 log 52+
:
3a) = 0.8863
-
Gain RATIO =_ 0.0.88
0042
63 = 0.0047

The. biggest GainRATIO


Connection with Salary = [26 _ 45]is. obttai
ai ned with Depart
€nt, therefore Department j; the
Notes - For thi.

Q. 3(b)

Ans.:

Tree Pi

- Be
= Te
Prepru
- St
— St
— Ay
at
— Se
Postp

- B
~ U
Fig. 6-Q. 3(a) : Upda
ted decision tree
When we connect the node Age = ©
in Salary = [40 — 50] we already
all the age classes. Therefore, arrive in leaf nodes for SC
this is our final tree (Fig, 7 - Q. 3(a)) pI
WF Tech reminds
Pebl
ications —_

Scanned by CamScanner
} Shown ; of (i) converti
in ng the
(ii) pruning the deci decision
ne:
s;sion t
advantage does (i)
h VE Over er
ri(ii
ne
(i )?

Tree Pruning

- Because of noise or outlie


rs, the generated tree may
over fit due to many bra
j = To avoid over fitting, prune nches,
the tree so that it is not too specif
ic
Prepruning
Start pruning in the beginning while building
the tree itself.
Stop the tree construction in early stage.
Avoid splitting a node by checking the threshold with the goodness measure falling below
a threshold.
| ~ Selection of correct threshold is difficultin prepruning.
i Postpruning
| ~ Build the full tree then start pruning, remove
i
the branches.
.

mie
west
a s
than trainidatng et to get the best pruned tree.
Use different set of data | a
tne’ morsitl
a eeeco —
I
a

ca re
|
de ci si on tre e bui lti may overfit the training , data. The too many branches uni
ott
ite
ed cred
The
i may reflect ' ano malies le branches noice
:| / some of which oving the least reliab
of overfitting" the data by rem:
runing addresses this issue
a

ea
,

Scanned by CamScanner
described by the
example, ho
w tg compu (10 Marks)
following : te the dissim
ilarity betwee
n Objects
(i) Nominal Attr
ibutes (10 Marks)
(i) Asymmetric bi
nary attributes
(i) Nominal Attrib
utes
Nominal attributes
are also called
qualitative Classi as Categorical at
fication, tributes and allo
Every individu
w for only
al item has
the order of th a certain distinct ca
e catego ries tegories, but quanti
, is not Possible, fication or Tanking

Oper
ations on the no
Typical exampl minal cannot beperforme
es of sy ch at d.
tributes are:

Ent
e
Scanned by CamScanner
_ The dissimilari
cts 'Y computed between ob;VEC descrip,
the dissimilarity between tivo obi

priate sides.
d (i, j= (p- m)/ ects
tion j “md . Nom} Mal a

of attributes for which j and j are M is the number of


attributes describing the objects M the same : m
(i) Asymmetric binary attributes

- Ifthe outcomes of
the stg

“handicapped
OF not handi
Capped”.
(Present) and the other ig PP
Coded 28.0 (absent),

(10 Marks)

Scanned by CamScanner
¥ Data Warehousing & Mining (MU-Sem. 6-Comp.) _A-12 Appendix

Transaction ID Items
3 C, A, B, E

T4 B, A, D

TS D

T6 D, B

‘TT A, D,E

T8 B,C
Ans. :

Step 1: Find support for each item

tem | _— Support
A 5
B 6
C 3
D 6
E 4
Step 2 : Consider items with min. support = 30% = 3.
As all items have support > =3, consider all items and arrange the transaction items in
descending order i.e. B, D, A, E, C.

‘Transaction ID | items Boughts | Ordered


So ale ee e
Tl . E, A, D, B B, D, A, E

T2 D, A, C, E,B B,.D,
A, E, C

T3 C,A,B,E B, A, E, C

T4 B, A,D B,D, A
T5 D D

T6 | oa | Bp a
Scanned by CamScanner
Fig. 1-Q. 5(a) : FP—Tree

Q. 5(b) Differentiate between simple linkage, average linkage and complete linkage
algorithms. Use complete linkage algorithm to find the clusters from the following

24

12

Scanned by CamScanner
ered
Single linkage Minimum distance is consid

Complete linkage : Maximum distance is considered

Average linkage : Average distance is considered


distance formula
Calculate Distance Matrix using Euclidean
p,. | P, | Ps
P, 0

P, 4 0
P , | 11.704} 8.06 | 0
P, 20 16 | 9.848
Ps | 21.54 | 17.988 | 9.848
Distance matrix after merging P, and P,

(P1,P2) P3 P4 P5

(P1,P2)| 0
P3 8.06 0
P4 16 9.848
Ps | 17.888 9.848
Distance matrix after merging P, and Ps

(PyP,)| Ps (P,P,)
(P,,P,) 0-
FP 3 8.06 0

P,P.) 16 | 9.848
Distance matrix after merging (P,, P,) and P,

(P, P., P 3)
P,P.)
(P» P,P) 0.
Py Ps) 9.848

_ eee
Scanned by CamScanner
2 -warehousing & Mining (MU-Sem, 6-Comp.) —

i
ja

P1 P2 P3 pe ts

: . Fig. 1-0. 5(h)


oO 26a) Data quality can be assessed in terms of sev eral issues, including accuracy,
completeness, and consistency. For each of
the a
quality assessment can depend on the bove three issues, discuss how data
inten use of data, giving examples
Propose two other dimensions of data quality. ded
(10 Marks)
other* dimensions that— can be aused to assess the quality of data iinclude timeliness,
imeli
efevability, value added, interpretability and accessibility, described as follows:
'_ Timeliness : Data must be available within a time frame that allows it to be useful for
decision making.
|. Believability : Data values must be within the range of possible results in order to be
useful for decision making.
| Value added : Data must provide additional value in terms of information that offsets the
.| cost of collecting and accessing it.
x that the effort to understand the
& Interpretability : Data must not be so comple
analysis.
| information it provides exceeds the benefit of its the
: Data must be accessib le so that the effort to collect it does not exceed
| - Accessib ility
benefit from its use. ©
a
sesing iisn cruci onli
min
Pe
am pl e wh er e dat be performed 2
| Q.6(b) Present an ex thi s business need? Can they
fun cti ons do es
data mining ical analysis?
a qu er y pr oc es si ng oF sim ple statist
| by dat ‘ items or
| Ans.: ng eos
could be found from practically ay bi
u would require both cros
rofiling (what
types of customers hatbuykindwhatof
services. Such business
sales) and qeed pro predictions can be made on W _
| between producedt on the“id acquir ;
Ser
| ed
products). Bas :
i
: j marketing strategies WO
7 Tochtoonin ty, i
aceee®
Pobite ——
|

Scanned by CamScanner
Data Warehousing & Mining (MU-Sem. 6-Comp.) _A-16 Appendix

— In theory this knowledge can be acquired with data query processing or simple Statisticgs
analysis, but it would require a considerable amount of manual
work by expert Market
analysts, both in order to decide which queries to use or how to interpret the Statistics and
due to the huge amount of data.
A department store, for example, can use data mining
to assist with its target Marketing
mail campaign. Using data mining functions such as association, the store
can USE the
mined strong association rules to determine which produ
cts bought by one £Toup of
customers are likely to lead to the buying of certain other produ
cts. With this information
the store can then mail marketing materials only to those kinds of custo
high likelihood of purchasing
mers who exhibit
additional products.
Data query processing is used for data or information retri
eval and does not have the
means for finding association rules, Similarly,
simple statistical analysis cannot
large amounts of data such as those of customer han
records in a department store. *

Scanned by CamScanner

You might also like