Syllabus :

Data warehouse architecture, metadata, E-R modelling versus Dimensional Modelling, Information Package Diagram, STAR schema, STAR schema keys, Snowflake Schema, Fact Constellation Schema, Factless Fact tables, Update to the dimension tables, Aggregate fact tables. (Refer Chapter 1)

ETL Process and OLAP : Major steps in ETL, Data extraction : Techniques, Data transformation : Basic tasks, Major transformation types, Dimensional Analysis, Hypercubes. (Refer Chapter 2)
Data Warehousing & Mining (MU-Sem. 6-Comp.) — Table of Contents

Module 1

Chapter 1 : Introduction to Data Warehouse and Dimensional Modelling ... 1-1
1.6.2 Data Warehouse Architecture (May 2010, Dec. 2010, May 2011, Dec. 2011, May 2012, May 2013, May 2014, Dec. 2014, May 2016) ... 1-15
1.6.3 Three Tier / Multi-tier Data Warehouse Architecture (Dec. 2013) ... 1-18
✓ Syllabus Topic : Metadata ... 1-19
1.7 Metadata (Dec. 2010, Dec. 2012, May 2013, May 2014, Dec. 2014) ... 1-19
1.7.1 Definition (May 2010, Dec. 2011, May 2012, Dec. 2013) ... 1-19
1.7.2 Describe Metadata of a Book Store ... 1-20
1.7.3 Data Warehouse Metadata (May 2010, May 2012, Dec. 2013) ... 1-20
1.7.4 Classification of Metadata or Types of Metadata in Data Warehouse (May 2010, Dec. 2011, May 2012, Dec. 2013) ... 1-22
✓ Syllabus Topic : E-R Modelling versus Dimensional Modelling ... 1-22
1.8 E-R Modelling versus Dimensional Modelling ... 1-22
1.8.1 What is Dimensional Modeling ? (Dec. 2011, May 2012) ... 1-22
1.8.2 Difference between Data Warehouse Modeling and Operational Database Modeling ... 1-22
1.8.3 Comparison between Operational Database and Data Warehouse Database ... 1-23
1.8.4 Comparison between Dimensional Model and E-R Model ... 1-23
✓ Syllabus Topic : Information Package Diagram ... 1-24
1.9 Information Package Diagram ... 1-24
✓ Syllabus Topic : Star Schema ... 1-25
1.10 The Star Schema (Dec. 2010, May 2012, Dec. 2014) ... 1-25
✓ Syllabus Topic : STAR Schema Keys ... 1-27
1.11 STAR Schema Keys ... 1-27
✓ Syllabus Topic : Snowflake Schema ... 1-27
1.12 The Snowflake Schema (May 2011, Dec. 2011, May 2012, Dec. 2012, May 2013, May 2014, Dec. 2014) ... 1-27
1.12.1 Differentiate between Star Schema and Snowflake Schema ... 1-29
1.12.2 Steps of Designing a Dimensional Model ... 1-29
✓ Syllabus Topic : Fact Constellation Schema ... 1-31
1.13 Fact Constellation Schema or Families of Star (May 2012, May 2014) ... 1-31
✓ Syllabus Topic : Factless Fact Tables ...
1.14 Factless Fact Tables ...
1.14.1 Fact Tables and Dimension Tables ...
1.14.2 Factless Fact Table (Dec. 2012, May 2013) ...
✓ Syllabus Topic : Update to the Dimension Tables ...
1.15 Update to the Dimension Tables (May 2016) ...
1.15.1 Slowly Changing Dimensions ...
1.15.2 Large Dimension Tables ...
1.15.3 Rapidly Changing or Large Slowly Changing Dimensions ... 1-33
1.15.4 Junk Dimensions ... 1-33
✓ Syllabus Topic : Aggregate Fact Tables ...
1.16 Aggregate Fact Tables (May 2014) ...
1.17 Examples on Star Schema and Snowflake Schema ... 1-49
1.18 University Questions and Answers ... 1-55
Chapter Ends ... 1-69

Module 2
Table of Contents

2.9.2 Reasons for "Dirty" Data ...
2.10 Data Cleansing ...
Sample ETL Tools ...
2.11 Need for Online Analytical Processing ...
✓ Syllabus Topic : OLTP v/s OLAP ...
2.12 OLTP v/s OLAP (May 2011, ...) ...
2.13 OLAP and Multidimensional Analysis ...
✓ Syllabus Topic : Hypercubes ...
2.14 Hypercubes ...
✓ Syllabus Topic : OLAP Operations ...
2.15 OLAP Operations - Drill down, Roll up ...
2.16.4 DOLAP ...
2.17 Examples of OLAP ...
University Questions and Answers ...
Chapter Ends ...
Chapter 3 : Introduction to Data Mining, Data Exploration and Preprocessing ... 3-1 to 3-64

Syllabus : Data Mining Task Primitives, Architecture, Techniques, KDD process, Issues in Data Mining, Applications of Data Mining, Data Exploration : Types of Attributes, Statistical Description of Data, Data Visualization, Data Preprocessing : Cleaning, Integration, Reduction : Attribute subset selection, Histograms, Clustering and Sampling, Data Transformation and Data Discretization : Normalization, Binning, Concept hierarchy generation, Concept Description : Attribute oriented Induction for Data Characterization.

3.1 What is Data Mining ? (Dec. 2011) ...
✓ Syllabus Topic : Data Mining Task Primitives ...
3.2 Data Mining Task Primitives ... 3-2
3.3 Architecture of a Typical Data Mining System (May 2010, Dec. 2010, Dec. 2011, Dec. 2013, May 2016) ...
✓ Syllabus Topic : Techniques ...
3.4 Data Mining Techniques (Dec. 2011) ...
3.4.1 Statistics ...
3.4.2 Machine Learning ...
3.4.3 Information Retrieval (IR) ...
3.9.2 Dispersion of Data ...
3.9.3 Graphic Displays of Basic Statistical Descriptions of Data ... 3-20
✓ Syllabus Topic : Data Visualization ...
3.10 Data Visualization (Dec. 2012) ...
✓ Syllabus Topic : Data Preprocessing ...
✓ Syllabus Topic : Data Integration ...
3.13 Data Integration (May 2016) ...
3.13.1 Entity Identification Problem ... 3-42
3.13.2 Redundancy and Correlation Analysis ...
3.13.3 Tuple Duplication ...
3.13.4 Data Value Conflict Detection and Resolution ...
✓ Syllabus Topic : Data Reduction, Clustering and Sampling ...
3.14 Data Reduction ...
3.14.1 Need for Data Reduction ...
3.14.2 Data Reduction Techniques ...
3.14.2(A) Data Cube Aggregation ...
3.14.2(B) Dimensionality Reduction ...
3.14.2(C) Data Compression ...
3.14.2(D) Numerosity Reduction ...
✓ Syllabus Topic : Data Transformation and Data Discretization, Normalization, Binning ...
3.15 Data Transformation and Data Discretization ...
3.15.1 Data Transformation ...
3.15.2 Data Discretization ...
3.15.3 Data Transformation by Normalization ...
3.15.4 Discretization by Binning ...
3.15.5 Discretization by Histogram Analysis ...
✓ Syllabus Topic : Concept Hierarchy Generation ...
3.16 Concept Hierarchy Generation ...
3.17 ...
3.18 Data Generalization and Summarization-based Characterization ...
3.18.1 Data Generalization ...
3.18.2 How Attribute-Oriented Induction is Performed ? ...
3.18.2(A) Data Generalization ...
3.18.2(B) Attribute Generalization Control ...
3.18.2(C) Example of Attribute Oriented Induction ...
3.19 University Questions and Answers ...
Chapter Ends ...
Data Warehousing & Mining (MU-Sem. 6-Comp.) — Table of Contents

Module 4

Chapter 4 : Classification, Prediction and Clustering

Syllabus : Basic Concepts, Decision Tree using Information Gain, Induction : Attribute Selection Measures, Tree pruning, Bayesian Classification : Naive Bayes Classifier, Rule-Based Classification : IF-THEN Rules for classification, Prediction : Simple linear regression, Multiple linear regression, Model Evaluation and Selection : Accuracy and Error measures, Holdout, Random Sampling, Cross Validation, Bootstrap, Clustering : Distance Measures, Partitioning Methods (k-Means, k-Medoids), Hierarchical Methods (Agglomerative, Divisive).

✓ Syllabus Topic : Basic Concepts ...
4.1 Basic Concept : Classification ...
4.1.1 Classification Problem ...
4.1.2 Classification Example ...
4.2.1(A) Appropriate Problems for Decision Tree Learning ...
4.2.1(B) Decision Tree Representation ...
✓ Syllabus Topic : Induction - Attribute Selection Measure ...
4.2.1(C) Attribute Selection Measure ...
4.2.1(D) Algorithm for Inducing a Decision Tree ...
✓ Syllabus Topic : Tree Pruning ...
4.2.1(E) Tree Pruning ...
4.2.1(F) Examples of ID3 ...
✓ Syllabus Topic : Bayesian Classification - Naive Bayes ...
4.2.2 Bayesian Classification : Naive Bayes Classifier ...
4.2.2(A) Bayes Theorem ...
4.2.2(B) Basics of Bayesian Classification ...
4.2.2(C) Naive Bayes Classifier : Example ...
4.2.3 Rule based Classification ...
4.2.4 Other Classification Methods ... 4-66
✓ Syllabus Topic : Prediction ...
4.3 Prediction ...
4.3.2 Linear Regression ...
4.3.3 Multiple Linear Regression ...
4.3.4 Other Regression Methods ...
✓ Syllabus Topic : Model Evaluation ...
✓ Syllabus Topic : Accuracy and Error Measures ...
4.4.1 Accuracy and Error Measures ...
✓ Syllabus Topic : Holdout ...
4.4.2 Holdout ...
4.4.3 Random Subsampling ...
✓ Syllabus Topic : Cross-Validation ...
4.4.4 Cross-Validation (CV) ...
✓ Syllabus Topic : Bootstrap ...
4.4.5 Bootstrapping ...
✓ Syllabus Topic : Clustering ...
4.5 What is Clustering ? ...
4.5.1 What is Clustering ? (Dec. 2010, May 2012, Dec. 2012, Dec. 2013) ...
4.5.2 Categories of Clustering Methods (May 2010, May 2012) ...
4.5.3 Difference between Classification and Clustering ...
4.6 Types of Data ...
4.6.1 Interval-Scaled Variables ... 4-80
4.6.2 Binary Variables ...
4.6.3 Nominal, Ordinal, and Ratio Variables ... 4-83
4.6.4 Variables of Mixed Types ...
✓ Syllabus Topic : Distance Measures ...
4.7 Distance Measures (Dec. 2011) ...
4.8 Partitioning Methods ...
4.8.2 K-Medoids (Representative Object-based Technique) ...
4.8.3 ...
✓ Syllabus Topic : Improving the Efficiency of Apriori ...
5.6 Improving the Efficiency of Apriori ...
✓ Syllabus Topic : FP Growth ...
5.7 FP Growth (May 2016) ...
5.7.1 Definition of FP-Tree ...
5.7.2 FP-Tree Algorithm ...
5.7.3 FP-Tree Size ... 5-34
5.7.4 ... 5-35
5.7.5 Mining Frequent Patterns from FP Tree ...
5.7.6 Benefits of the FP-Tree Structure ... 5-45
✓ Syllabus Topic : Mining Frequent Itemsets using Vertical Data Formats ... 5-45
5.8 Mining Frequent Itemsets using Vertical Data Format ... 5-45
✓ Syllabus Topic : Introduction to Mining Multilevel Association Rules ...
5.9 Introduction to Mining Multilevel Association Rules (May 2012, May 2013, Dec. 2013, Dec. 2014, May 2016) ...
Chapter 6 : Spatial and Web Mining ... 6-1 to 6-31

Syllabus : Spatial Data, Spatial Vs. Classical Data Mining, Spatial Data Structures, Mining Spatial Association and Co-location Patterns, Spatial Clustering Techniques - CLARANS Extension, Web Mining : Web Content Mining, Web Structure Mining, Web Usage Mining, Applications of Web Mining.

✓ Syllabus Topic : Spatial Data ...
6.1 Spatial Data ...
6.1.1 Spatial Pattern ...
6.1.2 What is NOT Spatial Data Mining ? ...
6.1.3 Why Learn about Spatial Data Mining ? ...
6.1.4 Spatial Data Mining : Actors ...
6.1.5 Characteristics of Spatial Data Mining ...
✓ Syllabus Topic : Spatial Vs. Classical Data Mining ...
6.2 Spatial Vs. Classical Data Mining ...
✓ Syllabus Topic : Spatial Data Structures ...
6.3 Spatial Data Structures ...
6.3.1 R-tree ...
6.3.2 R+-tree ...
6.3.3 More Spatial Data Structures ...
✓ Syllabus Topic : Mining Spatial Association and Co-location Patterns ...
6.4 Mining Spatial Association and Co-location Patterns ...
6.4.1 Mining Spatial Association ...
6.4.2 Mining Co-location Patterns ...
✓ Syllabus Topic : Spatial Clustering Techniques - CLARANS Extension ...
6.5 Spatial Clustering Techniques : CLARANS ...
Module 1
Introduction to Data
Warehouse and
Dimensional Modelling
Syllabus :
Introduction to Strategic Information, Need for Strategic Information, Features of Data
Warehouse, Data warehouses versus Data Marts, Top-down versus Bottom-up
approach. Data warehouse architecture, metadata, E-R modelling versus Dimensional
Modelling, Information Package Diagram, STAR schema, STAR schema keys,
Snowflake Schema, Fact Constellation Schema, Factless Fact tables, Update to the
dimension tables, Aggregate fact tables.
- For the past few years, data warehousing solutions have been a growing trend in medium and large sized companies.
- The data warehouse is no longer just a theoretical study in universities; many businesses and industries have started to embrace this mainstream phenomenon.
- Although it is not applied in every company yet, we can see stable growth in demand for it.
- With the help of data warehousing technology, every industry, right from retail and financial institutions to manufacturing enterprises, government departments and airline companies, is changing the way it performs business analysis and strategic decision making.
- The fundamental reason for the inability to provide strategic information is that strategic information has been extracted from the existing operational systems.
- If we need strategic information, the information must be collected from altogether different types of systems.
- Only specially designed decision support systems or informational systems can provide strategic information.
1.2.3 Operational v/s Decisional Support System

> (MU - May 2016)

- Operational systems are tuned for known transactions and workloads, while the workload is not known a priori in a data warehouse.
- Special data organization, access methods and implementation methods are needed to support data warehouse queries (typically multidimensional queries), for example, the average amount spent on phone calls between 9AM-5PM in Pune during the month of December.

Operational System       | Data Warehouse (DSS)
Application oriented     | Subject oriented
Used to run business     | Used to analyze business
1.3 Introduction of Data Warehouse

> (MU - May 2010, Dec. 2010, May 2011, May 2012, Dec. 2013)

1.3.1 Definition of Data Warehouse

The term Data Warehouse was defined by Bill Inmon in 1990 in the following way : "A warehouse is a subject-oriented, integrated, time-variant and non-volatile collection of data in support of management's decision making process". He defined the terms in the sentence as follows :

- Subject-oriented : Data that gives information about a particular subject instead of about a company's ongoing operations.
- Integrated : ...
Features of a Data Warehouse

1. Subject Oriented

- Data is organized by business subject, e.g. customers, products, loans, etc., so that the warehouse can answer questions like "Which customer has taken the maximum loan amount for last year ?". This ability to define a data warehouse by subject matter, loan in this case, makes the data warehouse subject oriented.

[Figure : Operational applications vs. data warehouse subjects]
2. Integrated
- A data warehouse is constructed by integrating multiple, heterogeneous data sources
like, relational databases, flat files, on-line transaction records. The data collected is
cleaned and then data integration techniques are applied, which ensures consistency
in naming conventions, encoding structures, attribute measures etc. among different
data sources.

Example :

[Fig. 1.3.2 : Integrated — multiple operational source systems integrated into a single Student subject in the data warehouse]
3. Non-volatile

- Non-volatile means that, once data is entered into the warehouse, it cannot be removed or changed, because the purpose of a warehouse is to analyze the data.

4. Time Variant

- A data warehouse maintains historical data. For e.g. a customer record has details of his job; a data warehouse would maintain all his previous jobs (historical information), when compared to a transactional system, which maintains only the current job, due to which it is not possible to retrieve older records.
- A data mart is oriented to a specific purpose or major data subject that may be distributed to support business needs. It is a subset of the data resource.
- A data mart is a repository of a business organization's data implemented to answer very specific questions for a specific group of data consumers such as organizational divisions of marketing, sales, operations, collections and others.
- A data mart is typically established as one dimensional model or star schema, which is composed of a fact table and multiple dimension tables.
- A data mart is a small warehouse which is designed at the department level.
- It is often a way to gain entry and provide an opportunity to learn.
- Major problem : If they differ from department to department, they can be difficult to integrate enterprise-wide.

Table 1.4.1 : Differences between Data Warehouse and Data Mart
Data Warehouse                      | Data Mart
It consists of multiple subjects.   | It consists of a single subject of concern to the user.
✓ Syllabus Topic : Top-Down versus Bottom-Up Approach

1.5 Top-Down versus Bottom-Up Approach

Data Warehousing Design Strategies or Approaches for Building a Data Warehouse.

1.5.1 The Top Down Approach : The Dependent Data Mart Structure

> (MU - May 2010, Dec. 2010, May 2011, May 2012)

- The data flow in the top down OLAP environment begins with data extraction from the operational data sources.

[Figure : Top-down data flow — operational DBs → Extract, Transform, Load, Refresh → Data Warehouse → data marts]
- Data is also loaded into the Data warehouse in a parallel process to avoid extracting it from the ODS.
o Detailed data is regularly extracted from the ODS and temporarily hosted in the
staging area for aggregation, summarization and then extracted and loaded into the
Data warehouse.
o The need to have an ODS is determined by the needs of the business.
o If there is a need for detailed data in the Data warehouse, then the existence of an ODS is considered justified.
o Else organizations may do away with the ODS altogether.
- Once the Data warehouse aggregation and summarization processes are complete, the data mart refresh cycles will extract the data from the Data warehouse into the staging area and perform a new set of transformations on them.
o This will help organize the data in particular structures required by data marts.
- Then the data marts can be loaded with the data and the OLAP environment becomes
available to the users.
Advantages of top down Approaches
— It is not just a union of disparate data marts but it is inherently architected.
- The data about the content is centrally stored and the rules and control are also centralized.
- The results are obtained quickly if it is implemented with iterations.
Disadvantages of top down Approaches
> (MU - May 2010, Dec. 2010, May 2011, May 2012)
- This architecture makes the data warehouse more of a virtual reality than a physical
reality. All data marts could be located in one server or could be located on different
servers across the enterprise while the data warehouse would be a virtual
entity being
nothing more than a sum total of all the data marts.
- In this context even the cubes constructed by using OLAP tools could be considered as
data marts. In both cases the shared dimensions can be used for the conformed
dimensions.
1.5.2 The Bottom Up Approach

- The bottom-up approach reverses the positions of the Data warehouse and the Data marts.
- Data marts are directly loaded with the data from the operational systems through the staging area.
- The ODS may or may not exist depending on the business requirements. However, this approach increases the complexity of process coordination.
- The data flow in the bottom up approach starts with extraction of data from operational databases into the staging area, where it is processed and consolidated and then loaded into the ODS.
- The data in the ODS is appended to or replaced by the fresh data being loaded.
- After the ODS is refreshed, the current data is once again extracted into the staging area and processed to fit into the Data mart structure.
- The data from the Data mart is then extracted to the staging area, aggregated, summarized and so on, and loaded into the Data Warehouse and made available to the end user for analysis.
" Extredt
Transform
_ Load
012)
aah Data aie
ical _Warehouse
rent males
. ~ This model strikes a good balance between centralized and localized flexibility.
‘ — Data marts can be delivered more quickly and shared data structures along the bus
eliminate the repeated effort expended when building multiple data marts in a
non-architected structure.
- The standard procedure, where data marts are refreshed from the ODS and not from the operational databases, ensures data consolidation; hence it is the generally recommended approach.

Advantages of bottom up Approach

- Manageable pieces are faster and are easily implemented.
- Risk of failure is low.
- Allows one to create an important data mart first.

Disadvantages of bottom up Approach

- Allows redundancy of data in every data mart.
- Preserves inconsistent and incompatible data.
- Grows unmanageable interfaces.
4. Consider the series of supermarts one at a time and implement the data warehouse.
5. In this practical approach, first the organization's needs are determined. The key to this ...
1.6 Data Warehouse Architecture

1.6.1 The Information Flow Mechanism

- The building blocks of a data warehouse, or the information flow mechanism, are shown in Fig. 1.6.1.
[Fig. 1.6.1 : Building blocks of a data warehouse — source data systems feed a data staging area (clean, reconcile, derive, match, combine, remove duplicates, standardize, transform, conform dimensions, export to data marts), which loads a data and metadata storage area; end-user presentation tools (ad hoc query tools, report writers, end-user applications, modeling / data mining tools, visualization tools) sit on the right]
- In Fig. 1.6.1, the source data component is shown on the left, showing various internal and external source data systems.
- The next building block is the data staging area, where the data is extracted from source data systems and then transformation and loading of data is done.
- In the middle, the data storage component is placed, which manages the data warehouse data.
- A metadata repository is used to keep a track of the data.
- On the rightmost side there is the information delivery component. This component is responsible for delivering the information in different ways to the end users.
1. Source Data Component

- The source data component provides the necessary data for the data warehouse.
- Typically, the source data for the warehouse comes from the operational applications.
- This operational data and processing is completely separated from data warehouse processing.
- The information repository is surrounded by a number of key components designed to make the entire environment functional, manageable and accessible by both the operational systems that source data into the warehouse and by end-user query and analysis tools.
2. Data Staging Component

- As soon as the data arrives into the data staging area, it is cleaned up, and using transformation functions it is converted into an integrated structure and format.
- There are different types of transformation functions that may be applied over the data; they are conversion, summarization, filtering and condensation of data.
- As the warehouse maintains historical data, the amount of data is very huge. The warehouse must be able to hold and maintain such a large volume, as well as different data structures for the same.
- The functionality includes :
o Removing unwanted data from operational databases.
o Converting to common data names and definitions.
o Establishing defaults for missing data.
o Accommodating source data definition changes.
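The four tasks above can be pictured with a small sketch in Python. Everything here (field names, rename map, default values) is hypothetical and only illustrates the listed staging functions; it is not the book's tooling :

# A minimal sketch of staging-area transformations (all names hypothetical).
RENAME = {"cust_nm": "customer_name", "prod_cd": "product_code"}  # common names/definitions
DEFAULTS = {"country": "India"}             # defaults for missing data
UNWANTED = {"internal_audit_flag"}          # operational-only fields to drop

def stage(record):
    cleaned = {}
    for field, value in record.items():
        if field in UNWANTED:               # removing unwanted data
            continue
        cleaned[RENAME.get(field, field)] = value   # converting to common names
    for field, default in DEFAULTS.items():
        cleaned.setdefault(field, default)  # establishing defaults for missing data
    return cleaned

print(stage({"cust_nm": "Anand", "internal_audit_flag": "Y"}))
# -> {'customer_name': 'Anand', 'country': 'India'}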
3. Data Storage Component

- The central data warehouse database is the cornerstone of the data warehousing environment.
- For implementation, mostly a Relational database management system (RDBMS) technology is used.
- However, this kind of implementation is often constrained by the fact that traditional RDBMS products are optimized for transactional database processing.

6. Management and Control Component : ...
1.6.2 Data Warehouse Architecture

> (MU - May 2010, Dec. 2010, May 2011, Dec. 2011, May 2012, May 2013, May 2014, Dec. 2014, May 2016)

- The data in a data warehouse comes from operational systems of the organization as well as from other external sources. These are collectively referred to as source systems.
- The data extracted from source systems is stored in an area called the data staging area, where the data is cleaned, transformed, combined and de-duplicated, to prepare the data for the data warehouse.
- The data staging area is generally a collection of machines where simple activities like sorting and sequential processing take place. The data staging area does not provide any query or presentation services.
- As soon as a system provides query or presentation services, it is categorized as a presentation server.
- A presentation server is the target machine on which the data is loaded from the data staging area and made available to end-user applications.
The three different kinds of systems that are
required for a data warehouse are:
(i) Source Systems
(ii) Data Staging Area
(iii) Presentation servers
- The data travels from source systems to presentation servers via the data staging area. The entire process is popularly known as ETL (extract, transform, and load) or ETT (extract, transform, and transfer). Oracle's ETL tool is called Oracle Warehouse Builder (OWB) and MS SQL Server's ETL tool is called Data Transformation Services (DTS).
- A typical architecture of a data warehouse is shown in Fig. 1.6.2 :
[Fig. 1.6.2 : Typical data warehouse architecture — operational sources feed a load manager; a warehouse manager (with archive / backup) manages the stored data; a query manager serves the end-user access tools]
Each component and the tasks performed by them are explained below :

1. Operational Source

The sources of data for the data warehouse are supplied from :
- The data from the mainframe systems in the traditional network and hierarchical format.
- Data can also come from the relational DBMS like Oracle, Informix.
- In addition to these internal data, operational data also includes external data obtained from commercial databases and databases associated with suppliers and customers.

2. Load Manager

- The load manager performs all the operations associated with extraction and loading of data into the data warehouse.
- These operations include simple transformations of the data to prepare the data for entry into the warehouse.
- The size and complexity of this component will vary between data warehouses and may be constructed using a combination of vendor data loading tools and custom built programs.

3. Warehouse Manager

- The warehouse manager performs all the operations associated with the management of data in the warehouse.
- This component is built using vendor data management tools and custom built programs.
- The operations performed by the warehouse manager include :
- Analysis of data to ensure consistency.
- Transformation and merging of the source data from temporary storage into data warehouse tables.
- Creation of indexes and views on the base tables.
- Denormalization.
- Generation of aggregations.
- Backing up and archiving of data.
- In certain situations, the warehouse manager also generates query profiles to determine which indexes and aggregations are appropriate.
4. Query Manager

- As part of the query management process, metadata is used to direct a query to the most appropriate data source.

5. End-User Access Tools

- The main purpose of a data warehouse is to provide information to the business managers for strategic decision-making.
- These users interact with the warehouse using end user access tools.
- Some of the examples of end user access tools can be :
o Reporting and Query Tools
o Application Development Tools
o Executive Information Systems Tools
o Online Analytical Processing Tools
o Data Mining Tools

1.6.3 Three Tier / Multi-tier Data Warehouse Architecture

> (MU - Dec. 2013)

[Figure : Three-tier data warehouse architecture — warehouse server (bottom tier), OLAP engine (middle tier) and front-end query, report, analysis and data mining tools (top tier)]
1. Bottom Tier (Data Warehouse Server)

- APIs like ODBC (Open Database Connection), OLE-DB (Open Linking and Embedding for Database) and JDBC (Java Database Connection) are supported by the underlying DBMS.

2. Middle Tier (OLAP Engine)

- The OLAP Engine is either implemented using ROLAP (Relational Online Analytical Processing) or MOLAP (Multidimensional OLAP).

3. Top Tier (Front End Tools)

- This tier is a client which contains query and reporting tools, analysis tools, and/or data mining tools.

From the architecture point of view there are three data warehouse models :

I. Enterprise Warehouse

- The information of the entire organization is collected, related to various subjects, in the enterprise warehouse.

II. Data Mart

- A subset of the warehouse data oriented to a specific group of users.
1.7 Metadata

> (MU - Dec. 2010, Dec. 2012, May 2013, May 2014, Dec. 2014)

1.7.1 Definition

> (MU - May 2010, Dec. 2011, May 2012, Dec. 2013)

Data warehouse metadata are pieces of information stored in one or more special-purpose metadata repositories that include :
(a) Information on the contents of the data warehouse, their location and their structure,
(b) Information on the processes that take place in the data warehouse back-stage, concerning the refreshment of the warehouse with clean, up-to-date, semantically and structurally reconciled data,
(c) Information on the implicit semantics of data (with respect to a common enterprise model), along with any other kind of data that aids the end-user exploit the information of the warehouse,
(d) Information on the infrastructure and physical characteristics of components and sources of the data warehouse and,
(e) Information including security, authentication, and usage statistics that aids the administrator tune the operation of the data warehouse as appropriate.
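As an illustration only — the table and field names below are hypothetical, not from the book — one entry in such a metadata repository could be sketched as a record covering items (a) to (e) :

# Hypothetical metadata-repository entry for one warehouse table.
sales_fact_metadata = {
    "contents":       {"table": "sales_fact", "location": "dw_server_1",
                       "columns": ["date_key", "product_key", "amount"]},    # (a)
    "back_stage":     {"source": "orders_oltp", "refresh": "daily 02:00"},   # (b)
    "semantics":      {"amount": "net invoice value in INR"},                # (c)
    "infrastructure": {"dbms": "relational", "size_gb": 120},                # (d)
    "administration": {"owner": "dw_admin", "queries_last_month": 4521},     # (e)
}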
1.7.2 Describe Metadata of a Book Store
Extraction and Transformation Metadata : ...
Operational Database (E-R Model)

- Used for Online Transactional Processing (OLTP), but can be used for other purposes such as Data Warehousing. This records the data from the user for history.
- The tables and joins are complex since they are normalized (for RDBMS). This is done to reduce redundant data and to save storage space.
- Entity-Relational modeling techniques are used for RDBMS database design.
- Optimized for write operations.
- Performance is low for analysis queries.

Data Warehouse

- Used for Online Analytical Processing (OLAP). This reads the historical data for the users for business decisions.
- The tables and joins are simple since they are de-normalized. This is done to reduce the response time for analytical queries.
- Data-Modeling techniques are used for the Data Warehouse design.
- Optimized for read operations.
- High performance for analytical queries.
- Is usually a Database.
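The read/write contrast can be made concrete with a tiny sketch; the table and column names are invented, and the SQL is run through Python's sqlite3 purely for illustration :

import sqlite3

con = sqlite3.connect(":memory:")
# OLTP style: normalized table, optimized for single-row writes.
con.execute("CREATE TABLE orders(order_id INTEGER PRIMARY KEY, product TEXT, amount REAL)")
con.execute("INSERT INTO orders(product, amount) VALUES ('soap', 30.0)")
con.execute("INSERT INTO orders(product, amount) VALUES ('soap', 45.0)")
con.execute("INSERT INTO orders(product, amount) VALUES ('shampoo', 120.0)")

# OLAP style: an analytical read over the accumulated history.
rows = con.execute("SELECT product, SUM(amount) FROM orders "
                   "GROUP BY product ORDER BY product").fetchall()
print(rows)   # [('shampoo', 120.0), ('soap', 75.0)]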
Sr. No. | Dimensional Model | ER Model

7. It is robust. The dimensional model design can be done independent of expected query patterns. | It is variable in structure and vulnerable to changes in the user's querying habits.

8. The model is easy and understandable. | The model for an enterprise is very hard for people to visualize and keep in their heads.

9. The model really models a business. It is a body of standard approaches for handling common modeling situations in the business world. | The model does not really model a business. It models the micro relationships among data elements.
Facts : (a) Occupied Rooms (b) Vacant Rooms (c) Unavailable Rooms (d) No. of occupants (e) Revenue

✓ Syllabus Topic : Star Schema

1.10 The Star Schema

> (MU - Dec. 2010, May 2012, Dec. 2014)

[Figure : A star schema — a central sales fact table joined to product and other dimension tables]
- Star Schema is the most popular schema design for a data warehouse.
- Dimensions are stored in a Dimension table and every entry has its own unique identifier.
- Every Dimension table is related to one or more fact tables.
- All the unique identifiers (primary keys) from the dimension tables make up the composite key in the fact table.
- The fact table also contains facts. For example, a combination of store_id, date_key and product_id giving the amount of a certain product sold on a given day at a given store.
- Foreign keys for the dimension tables are contained in a fact table. For e.g. date_key, product_id and store_id are all three foreign keys.
- In dimensional modeling, fact tables are normalised, whereas dimension tables are not.
- The size of the fact tables is large as compared to the dimension tables.
- The Facts in the star schema can be classified into three types :
(i) Fully-additive : Additive facts are facts that can be summed up through all of the dimensions in the fact table.
(ii) Semi-additive : Semi-additive facts are facts that can be summed up for some of the dimensions in the fact table, but not the others.
Example : Bank Balances : A current balance for an account cannot be summed over time periods; but if you want to see the current balance of a bank, you can sum the current balances of all accounts.
(iii) Non-additive : Non-additive facts are facts that cannot be summed up for any of the dimensions in the fact table.

Advantages of Star Schema

- Easy to understand.
- Easy to navigate between the tables due to less number of joins.
- Most suitable for query processing.
Example :

[Figure : Star schema for sales — time dimension (time_key, day, day_of_the_week, month, quarter, year), item dimension (item_key, item_name, brand, type, supplier_key), branch dimension (branch_key, branch_name, branch_type) and location dimension (location_key, street, city, state_or_province, country) around a sales fact table (time_key, item_key, branch_key, location_key, units_sold, dollars_sold) with its measures]
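The star schema of the example above can be sketched as DDL. The statements (run here through Python's sqlite3) mirror the figure's tables, but the exact column types are assumptions :

import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
-- Dimension tables: every entry has its own unique identifier.
CREATE TABLE time_dim     (time_key     INTEGER PRIMARY KEY, day TEXT, month TEXT, quarter TEXT, year INTEGER);
CREATE TABLE item_dim     (item_key     INTEGER PRIMARY KEY, item_name TEXT, brand TEXT, type TEXT);
CREATE TABLE branch_dim   (branch_key   INTEGER PRIMARY KEY, branch_name TEXT, branch_type TEXT);
CREATE TABLE location_dim (location_key INTEGER PRIMARY KEY, street TEXT, city TEXT, country TEXT);

-- Fact table: the dimension primary keys together form the composite key,
-- and units_sold / dollars_sold are the (fully-additive) measures.
CREATE TABLE sales_fact (
    time_key     INTEGER REFERENCES time_dim,
    item_key     INTEGER REFERENCES item_dim,
    branch_key   INTEGER REFERENCES branch_dim,
    location_key INTEGER REFERENCES location_dim,
    units_sold   INTEGER,
    dollars_sold REAL,
    PRIMARY KEY (time_key, item_key, branch_key, location_key)
);
""")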
✓ Syllabus Topic : STAR Schema Keys

1.11 STAR Schema Keys

(i) Primary Keys : The primary key of the dimension table is one of the attribute values which identifies each row in a dimension table uniquely.

Example : In the Professor_Dim dimension, Prof_id is the primary key of dimension Professor_Dim.

[Fig. 1.11.1 : Course_Section_Dim (Course_id, Section_no, Course_name, Room_capacity), Period_Dim (Semester_id, Year), Student_Dim (Student_id, Student_name, Major) and Professor_Dim (Prof_id, Prof_name, Department_id, Department_name) around the Course_grades fact table (Course_id, Semester_id, Student_id, Prof_id, grade)]

(ii) Surrogate Keys ...

(iii) Foreign Keys : The fact table contains the primary key of each dimension table as a foreign key.
✓ Syllabus Topic : Snowflake Schema

1.12 The Snowflake Schema

[Figure : Snowflake schema for sales — the item dimension is normalized into item (item_key, item_name, brand, type, supplier_key) and supplier (supplier_key, supplier_type) tables, and the location dimension into location (location_key, street, city_key) and city (city_key, city, state_or_province, country) tables, around the sales fact table (time_key, item_key, branch_key, location_key, units_sold, avg_sales)]
Sr. No. | Star Schema | Snowflake Schema

4. Queries result fast. | There will be some delay in processing the query, because of more tables and joins.

5. Star schemas are usually not in BCNF form. All the primary keys of the dimension tables are in the fact table. | In Snowflake schema, dimension tables are in 3NF, so there are more dimension tables, which are linked by primary - foreign key relations.
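The normalization step that separates the two schemas can be sketched as below (again via sqlite3, with assumed column types) : the star keeps one wide location dimension, while the snowflake moves the city attributes into their own 3NF table reached by an extra join.

import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
-- Star form: one de-normalized dimension table.
CREATE TABLE location_star (location_key INTEGER PRIMARY KEY,
                            street TEXT, city TEXT, state_or_province TEXT, country TEXT);

-- Snowflake form: city attributes normalized into their own table (3NF),
-- linked back through a primary key - foreign key relation.
CREATE TABLE city_dim     (city_key INTEGER PRIMARY KEY,
                           city TEXT, state_or_province TEXT, country TEXT);
CREATE TABLE location_dim (location_key INTEGER PRIMARY KEY,
                           street TEXT,
                           city_key INTEGER REFERENCES city_dim);
""")
# A query on the snowflake needs one extra join (location_dim -> city_dim):
# storage space is saved, but the query is a little slower, as the table above notes.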
(date, time, product code, product_name, price/unit, number_of_units, amount)
= (18-SEP-2002, 11.02, p1, dettol soap, 15, 2, 30)

But in the above dimensional model we provide sales data rolled up by product (all records corresponding to the same product are summarized) in a store on a day. A typical fact table record would look like this :
✓ Syllabus Topic : Fact Constellation Schema

1.13 Fact Constellation Schema or Families of Star

> (MU - May 2012, May 2014)

- As its name implies, it is shaped like a constellation of stars (i.e., star schemas).
- This schema is more complex than the star or snowflake varieties, which is due to the fact that it contains multiple fact tables.
- This allows dimension tables to be shared amongst the fact tables.
- A schema of this type should only be used for applications that need a high level of sophistication.
- For each star schema or snowflake schema it is possible to construct a fact constellation schema.
- That solution is very flexible; however, it may be hard to manage and support.
- The main disadvantage of the fact constellation schema is a more complicated design, because many variants of aggregation must be considered.
- In a fact constellation schema, different fact tables are explicitly assigned to the dimensions which are relevant for the given facts.
- This may be useful in cases when some facts are associated with a given dimension level and other facts with a deeper dimension level.
- Use of this model is reasonable when, for example, there is a sales fact table (with details down to the exact date and invoice header id) and a fact table with sales forecast which is calculated based on month, client id and product id.
- In that case, using two different fact tables on a different level of grouping is realized through a fact constellation model.
[Fig. 1.13.1 : Family of stars — two fact tables sharing the same dimension tables]
[Fig. 1.13.2 : Sales fact constellation — a shipping fact table (with shipper_key, shipper_name, shipper_type and location_key) sharing dimensions with the sales fact table]
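A minimal sketch of the family of stars in Figs. 1.13.1 and 1.13.2 follows; the shipping columns echo the figure, while the column types and the sales-side measure are assumptions :

import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE time_dim     (time_key     INTEGER PRIMARY KEY, day TEXT, month TEXT);
CREATE TABLE item_dim     (item_key     INTEGER PRIMARY KEY, item_name TEXT);
CREATE TABLE location_dim (location_key INTEGER PRIMARY KEY, city TEXT);
CREATE TABLE shipper_dim  (shipper_key  INTEGER PRIMARY KEY, shipper_name TEXT, shipper_type TEXT);

-- Two fact tables at different grains share the same dimension tables.
CREATE TABLE sales_fact (
    time_key     INTEGER REFERENCES time_dim,
    item_key     INTEGER REFERENCES item_dim,
    location_key INTEGER REFERENCES location_dim,
    dollars_sold REAL
);
CREATE TABLE shipping_fact (
    time_key      INTEGER REFERENCES time_dim,
    item_key      INTEGER REFERENCES item_dim,
    shipper_key   INTEGER REFERENCES shipper_dim,
    from_location INTEGER REFERENCES location_dim,
    to_location   INTEGER REFERENCES location_dim,
    units_shipped INTEGER
);
""")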
- Likewise, in a sales fact table we may include other facts like sales_unit and cost.
- Dimension tables are companion tables to a fact table in a star schema.
- Each dimension table is defined by its primary key that serves as the basis for referential integrity with any given fact table to which it is joined. Most dimension tables consist of textual information.
- To understand the concepts of facts, dimension, and star schema, let us consider the following scenario :
- Imagine standing in the marketplace and watching the products being sold, and writing down the quantity sold and the sales amount each day for each product in each store.
- Note that a measurement needs to be taken at every intersection of all dimensions (day, product, and store).
- The information gathered can be stored in the fact table as shown in Table 1.14.1.

Table 1.14.1 : Fact Table

Date Key | Product Key | Store Key | Sales_Unit | Sales_Amount | Cost

- The facts are Sales_Unit, Sales_Amount, and Cost (note that all are numeric and additive), which depend on dimensions Date, Product, and Store. The details of the dimensions are stored in dimension tables.
1.14.2 Factless Fact Table

> (MU - Dec. 2012, May 2013)

- Factless table means only the keys are available in the fact table; there are no measures available.
- Used only to put relation between the elements of various dimensions.
- Are useful to describe events and coverage, i.e. the tables contain information that something has/has not happened.
- Often used to represent many-to-many relationships.
- The only thing they contain is a concatenated key; they do still, however, represent a focal event which is identified by the combination of conditions referenced in the dimension tables.
[Figure : Attendance factless fact table — student, course and date dimension keys with no measures]
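An attendance table like the one in the figure can be sketched as follows (key names assumed). Note there is no measure column at all; counting key combinations is the analysis :

import sqlite3

con = sqlite3.connect(":memory:")
# Factless fact table: nothing but the concatenated key.
con.execute("""CREATE TABLE attendance_fact (
    student_key INTEGER,
    course_key  INTEGER,
    date_key    INTEGER,
    PRIMARY KEY (student_key, course_key, date_key))""")
con.executemany("INSERT INTO attendance_fact VALUES (?, ?, ?)",
                [(1, 10, 100), (2, 10, 100), (1, 11, 100)])
# 'Has the event happened, and how often?' is answered by counting rows.
print(con.execute("SELECT COUNT(*) FROM attendance_fact "
                  "WHERE course_key = 10").fetchone()[0])   # 2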
✓ Syllabus Topic : Update to the Dimension Tables

1.15 Update to the Dimension Tables
- No need to do other changes in the dimension row.
- From the Fig. 1.15.1 star schema, if Customer Name of the Customer Dimension or Product Name of the Product dimension changes, then it comes under Type 1 change, as there is no need to preserve the old values. It can be overwritten with new names without changing any key value.

(ii) Type 2 Changes

- Check the attribute Marital Status of the Customer Dimension.
- This attribute value can be changed from Single to Married over time.
- If the Marital Status of Anand Prasad changes from Single to Married, then the old value cannot be overwritten, as there may be a need to track orders by marital status.
- If Anand Prasad changes his state from Maharashtra to Gujarat, then also the old value of state has to be preserved.
- These types of changes are called Type 2 Changes, where historical values are preserved instead of overwritten.

General principles for type 2 change

- Type 2 changes are related to the true changes in source systems.
- This type of change partitions the history in the data warehouse.
- Every change for the same attribute must be preserved.

How to apply Type 2 Changes to the Data Warehouse ?

- Add a new tuple in the dimension table with the new value of the attribute.
- Include a date field in the dimension table for the date on which the change occurred.
- The original row should not be changed, and the key value of that row is not affected.
- The new row is inserted with a new surrogate key.
Example :

Customer Code C123 : Marital Status changed (21/8/2008) to Married; Address changed (13/10/2014). Each change is stored as a new row with a new surrogate key (key restructuring, e.g. C123 → 39154112, then 5278934).
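The four steps above can be sketched in Python. The row layout and the surrogate-key generator are hypothetical; only the Type 2 mechanics matter — the old row stays untouched while a new row arrives with a fresh surrogate key and the change date :

import datetime

customer_dim = [
    {"surrogate_key": 1, "customer_code": "C123", "marital_status": "Single",
     "effective_date": datetime.date(2005, 1, 1)},
]

def apply_type2_change(dim, code, attribute, new_value, change_date):
    # Find the current row for this customer; it is preserved, not overwritten.
    current = max((r for r in dim if r["customer_code"] == code),
                  key=lambda r: r["effective_date"])
    new_row = dict(current)
    new_row[attribute] = new_value
    new_row["effective_date"] = change_date                              # date of change
    new_row["surrogate_key"] = max(r["surrogate_key"] for r in dim) + 1  # new surrogate key
    dim.append(new_row)                                                  # add a new tuple

apply_type2_change(customer_dim, "C123", "marital_status", "Married",
                   datetime.date(2008, 8, 21))
print(len(customer_dim))  # 2 -- history is partitioned; old orders still track 'Single'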
1.15.2 Large Dimension Tables
- Type 2 changes are feasible if the dimension changes infrequently, like once or twice a year.
- A product dimension which has rows in thousands and changes infrequently is manageable.
- But in the case of a customer dimension, where the number of rows is in millions, type 2 changes are feasible and not much difficult only if it changes infrequently. If the customer dimension changes rapidly, then Type 2 changes are problematic and difficult.
- If a dimension table is rapidly changing and large, then break that dimension table into one or more smaller dimension tables.

[Fig. 1.15.3 : Customer_DIM (Customer_ID, Customer_Name, Customer_Address, Customer_City, ..., Age, Income, Rating, Credit History Score, Account Status) joined to a Retail_Sales fact table (Customer_ID, Product_ID, Employee_ID, Date_ID, Sales_Rev)]

- Move the rapidly changing attributes into another dimension table and leave the original dimension table with the slowly changing attributes.
- For example, consider the Customer dimension; the following attributes change rapidly : Age, Income, Rating, Credit history score, Customer account status, Weight.
- From Customer_Dim, take out the fast changing attributes and create a new mini dimension as shown in Fig. 1.15.4.

[Fig. 1.15.4 : The fast changing attributes split out of Customer_DIM (Customer_ID, Customer_Name, Customer_Address, Customer_City, ...) into a separate mini-dimension joined to the sales fact table (Date_ID, Sales_Rev, Avg_Sales)]
1.15.4 Junk Dimensions

- From source legacy systems, some textual data or flags cannot be fitted into any of the significant major dimensions, but at the same time they are needed in the fact table.
Soln. :

(a) Star Schema

[Figure : Star schema — Time (Time_key, Day, Day_of_the_week, Month, Quarter, Year), Branch (Branch_key, Branch_name, Branch_type), Item (Item_key, Item_name, Brand, Type) and Location (Location_key, Street, City, Province_or_state, Country) dimensions around a Sales fact table (Time_key, Item_key, Branch_key, Location_key, Units_sold, Dollars_sold)]

(b) Snowflake Schema

[Figure : Snowflake schema — the Item dimension normalized into Item (Item_key, Item_name, Brand, Supplier_key) and Supplier (Supplier_key, Supplier_type) tables, and the Location dimension into Location (Location_key, Street, City_key) and City (City_key, City, Province_or_street) tables]
(c) Fact Constellation

[Figure : Fact constellation — the Sales fact table (Time_key, Item_key, Branch_key, Location_key, Dollars_sold) and a Shipping fact table (Item_key, Time_key, Shipper_key, From_location, To_location) sharing the Time, Item and Location dimensions; the Shipper dimension (Shipper_key, Shipper_name) attaches to the shipping fact table]
(c) Can you convert this star schema to a snowflake schema ? Justify your answer and design a snowflake schema if it is possible.

Soln. :
(a) Star Schema

[Figure : Student grades star schema — Course_Section_Dim (Course_id, Section_no, Course_name, Units, Room_id, Room_capacity), Period_Dim (Semester_id, Year), Student_Dim (Student_id, Student_name, Major) and Professor_Dim (Prof_id, Prof_name, Title, Department_id, Department_name) around the Course_grades fact table (Course_id, Semester_id, Student_id, Prof_id, grade)]
Yes, the above star schema can be converted to a snowflake schema, considering the following assumptions :

- Courses are conducted in different rooms, so the course dimension can be further normalized to a rooms dimension, as shown in the Fig. P. 1.17.2(a).
- Professor belongs to a department, and the department dimension is not added in the star schema, so the professor dimension can be further normalized to a department dimension.
- Similarly, students can have different major subjects, so it can also be normalized as shown in the Fig. P. 1.17.2(a).
[Fig. P. 1.17.2(a) : Snowflake schema — Room_Dim, Course_Dim and Period_Dim normalized out of the star schema]
Ex. 1.17.4 : For a Supermarket Chain consider the following dimensions, namely Product, Store, Time, Promotion. The schema contains a central fact table sales facts with three measures unit_sales, dollars_sales and dollar_cost. Design a star schema and calculate the maximum number of base fact table records for the values given below :

- Time period : 5 years
- Store : 300 stores reporting daily sales
- Product : 40,000 products
- Promotion : a sold item may be in only one promotion in a store on a given day

Soln. :

(a) Star schema :
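A worked version of the count, assuming (as this exercise is usually stated) that only about 4,000 of the 40,000 products actually sell in a store on a given day — that assumed figure, not the full product list, bounds the fact rows :

# Maximum number of base fact table records (assumptions noted above).
days       = 5 * 365    # time period : 5 years
stores     = 300        # stores reporting daily sales
products   = 4_000      # products sold per store per day (assumed)
promotions = 1          # a sold item is in at most one promotion per store per day

max_records = days * stores * products * promotions
print(f"{max_records:,}")   # 2,190,000,000 -- roughly 2 billion fact rows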
Ex. 1.17.5 : Draw a Star Schema for a Student academic fact database.

Soln. :
[Figure : Student academic fact star schema — Student dimension (Student_Key, Student Address, City, State, Age, Sex), School dimension (School_key, Address, City, State, Zip, Type), Status_Dim (Status_key) and Project_Dim (Project_Key, Description, Type, Category) around the Academic_Fact_Table (Student_key, Project_key, Status_key, School_key)]
Ex. 1.17.6 : List the dimensions and facts for the Clinical Information System and Design Star and Snow Flake Schema.

Soln. :

Dimensions :
1. Patient  2. Doctor  3. Procedure  4. Diagnose  5. Date of Service  6. Location  7. Provider

Facts :
1. Adjustment  2. Charge  3. Age
[Figure : Clinical Information star schema — Patient_Dim (Patient_key, Name, Gender), Doctor_Dim (Doctor_Key, Name, Affiliation), Procedure_Dim (Procedure_key, Descript, Subgroup, Group), Diagnos_Dim (Diagnos_key, Descript, Subgroup, Group), Date_of_Service_Dim (Dateser_key, Date, Month, Quarter), Location_Dim (Location_key, Name, Owner) and Provider_Dim around a fact table for procedure and billing history (Patient_key, Dateser_key, Location_key, Doctor_key, Provider_key, Diagnos_key, with facts Adjustment, Charge, Age)]

[Fig. P. 1.17.6(a) : Clinical Information snowflake schema — the Doctor dimension further normalized into Specialization_Dim (Specialization_key, Major, Department) and the Location dimension into City_Dim (City_key, City_name, State)]
Ex. 1.17.7 : Draw a Star Schema for a Library.

Soln. :

[Figure : Library star schema — Time (Time_id, Date, Day, Month, Year, Holiday), User, Supplier (S_id, Supplier_name, City, Working_time), Librarian (Librarian_id) and Book (Book_id, Book_name, Author, Publication, Edition, Total_copies, Availability, No_of_issued_copies) dimensions around a Book_Purchase_Fact (Time_id, User_id, S_id, Librarian_id, Book_id, Current_purchase)]

Ex. 1.17.8 : ...

Soln. :

[Figure : Star schema — Sales_Person_Dim (Sales_Person_key, Department_id), Time_Dim (Time_key, Quarter, Year) and Location_Dim dimensions around a Fact_Table (Sales_Person_key, Time_key, ...)]
Ex. 1.17.9 : A bank ... like House Building ... (MU - May 2014, 10 Marks)

Soln. :
fi
i 8S per
a the N a
Data Warehousing & Mining (MU-Sem. 6-Comp.) _ 1-50 Introduction to Data Wareh,
Ex. 1.17.10: Draw star schema for video Rental ‘ para Warehousing & Minit
Soin. : Rental Item
————_s_
est STOC
Ex, 1.17.11: Star schema for retail. chain.. oej 1
Soln. :
[Figure : Star schema for a retail chain - central fact table (Time key, Product key, Customer key, Store key, Mode key; measures Actual sales, Forecast sales, Price, Discount) with dimensions Customer (Customer key, Name, Age, Income, Gender, Status), Product (Product key, Name, Brand, Category, Colour), Payment Mode (Mode key, Payment mode, Interest rate) and Store.]
Ex. 1.17.12 :
Soln. :
[Figure : Star schema - dimensions Time (Week, Month), Branch, Item (Item_key, Item_name, Brand), Location (Location_key, Street, Province_or_state), BOOKS (Booknum, Primary_author, Topic) and BOOKSTORE (Storenum, City, State, Zip); fact table measure Qty.]
Facts : Total_Stock, Price, Inventory_value
(b) Star Schema
[Figure : star schema built on Dim_Books (Booknum, Primary_author) and store (Storenum) dimensions.]
[Figure : Star schema for a supermarket chain - Sales facts table (Product Key, Time Key, Store Key, Location Key; measures Unit Sales, Dollar Sales, Dollar Cost) with PRODUCT_DIM (Product Key, Product Description, Brand Name, Product Sub-Category, Department, Supplier_key, Supplier_type), STORE_DIM (Store Key, Store Name, Store ID, Address, State, Shelf depth), LOCATION_DIM (Location_key, Province_or_state) and TIME_DIM (Time Key, Date, Day of Week, Week Number, Month, Month Number, Quarter, Year, Holiday Flag).]
Fig. P. 1.17.14(a)
(c) Snowflake Schema
Need for BI Solution for Retail Business
- Whether shopping for daily needs or for larger purchases, customers expect retailers to provide real-time information and to respond to fast-changing demand.
- Retailers must make decisions, at all points, that have maximum impact on the bottom line. To cope with these challenges, many retailers are building unified data warehouses.
Advantages of BI Solution for the Retail Business
- Linked to the right data, retailers can drive faster, better decisions and understand customer behaviour across the business.
- More timely decisions.
- Promoting unmatched channel selling : the solution helps track not only which product and customer combinations are selling, but to whom.
1.18 University Questions and Answers
May 2010
Q. 1 Define Data Warehouse. Explain the architecture of data warehouse with a suitable block diagram. (Ans. : Refer sections 1.3 and 1.6.2) (10 Marks)
Q. 2 Define Metadata. What are the different types of metadata stored in a data warehouse ? Illustrate with a simple customer sales data warehouse. (Ans. : Refer sections 1.7.1, 1.7.3 and 1.7.4) (10 Marks)
Q. 3 How do the top-down and bottom-up approaches for building a data warehouse differ ? Discuss the merits and limitations of each approach. (Ans. : Refer sections 1.5.1 and 1.5.2) (10 Marks)
Dec. 2010
Q. 4 Explain the characteristics of the data present in the data warehouse.
Q. 7 Write short notes on Metadata. (Ans. : Refer section 1.7)
Q. 8 A dimension table is wide; the fact table is deep. Explain. What is STAR schema ? Explain its advantages. (Ans. : Refer section 1.10)
Q. 16 Differentiate between a database and a data warehouse.
Q. 16 Design Star Schema for business analysis of the Company. (Ans. : Refer Ex. 1.17.19)
May 2012
May 2013
Q. 36 A bank wants to develop a data warehouse for effective decision-making about their
loan schemes. The bank provides loans to customers for various purposes like house
building loan, car loan, educational loan, personal loan, etc. The whole country is
categorized into a number of regions, namely, north, south, east and west. Each
region consists of a set of states. Loan is disbursed to customers at interest rates that
change from time to time. Also, at any given point of time, the different types of loans
have different rates. The data warehouse should record an entry for each
disbursement of loan to customer.
(a) Design an information package diagram for the application.
(b) Design a star schema for the data warehouse, clearly identifying the fact table(s), dimension table(s), their attributes and measures.
(Ans. : Refer Ex. 1.17.9) (10 Marks)
Q.37 Define the following terms by giving examples :
(a) Fact constellation (Ans. : Refer section 1.13)
(b) Snowflake schema (Ans. : Refer section 1.12)
(c) Aggregate fact tables (Ans. : Refer section 1.16) (15 Marks)
Dec. 2014
Q. 45 For a supermarket chain, consider the following dimensions, namely Product, Store, Time and Promotion. The schema contains a central fact table for sales.
(i) Design star schema for the above application.
(ii) Calculate the maximum number of base fact table records for the warehouse with the following values given below :
Time period - 5 years
Store - 300 stores reporting daily sales
Product - 40,000 products in each store (about 4,000 sell in each store daily)
(Ans. : Refer Ex. 1.17.4) (10 Marks)
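As a quick check (assuming one fact row per product sold per store per day, and 365 days a year), the maximum number of base fact table records follows directly from the given values :
5 years x 365 days x 300 stores x 4,000 products sold daily = 1,825 x 300 x 4,000 = 2.19 x 10^9 records.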
Q. 46 Write short note on : Updates to dimension tables.
(Ans. : Refer section 1.15)
(5 Marks)
Chapter Ends
Module 2
Syllabus :
Major steps in ETL process, Data extraction : Techniques, Data transformation : Basic tasks, Major transformation types, Data Loading : Applying Data, OLTP Vs OLAP, OLAP definition, Dimensional Analysis, Hypercubes, OLAP operations : Drill down, Roll up, Slice, Dice and Rotation, OLAP models : MOLAP, ROLAP.
2.1 An Introduction
- An ETL tool is a tool that reads data from one or more sources, transforms the data so that it is compatible with a destination, and loads the data to the destination.
- ETL functions reshape the relevant data from the source systems into useful information
to be stored in the data warehouse.
2.1.2 Desired Features
- Automated data movement across data stores and the analytical area in support of a data warehouse, data mart or ODS effort.
- Better utilization of existing hardware resources.
- Faster change-control and management.
- Integrated metadata management.
- Complete development environment, "work as you think" design metaphor.
Syllabus Topic : Major Steps in ETL Process
- These two factors increase the complexity of the data extraction process in a data warehouse. This data extraction can be carried out using a third-party tool along with in-house programs or scripts.
- Source identification : The source application and the source structures are identified.
- Method of extraction : The extraction process can be either manual or tool-based for each of the source systems.
- Extraction frequency : Data extraction frequency needs to be defined, whether it should be carried out on a daily, weekly or quarterly basis.
- Time window : A time window needs to be decided for each data source.
- Job sequencing : Find out whether the beginning of one job has to wait until the previous job has finished completely.
- Exception handling : The input records that cannot be processed have to be handled.
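The flow these steps describe can be sketched as a minimal ETL driver. This is only an illustrative sketch; the source table, column names and rejection rule below are hypothetical, chosen to show the sequencing of extraction, transformation, loading and exception handling :

import sqlite3

def extract(src_conn, since):
    # Source identification + method of extraction : a query against one source system.
    cur = src_conn.execute(
        "SELECT order_id, customer, amount FROM orders WHERE order_date >= ?",
        (since,),
    )
    return cur.fetchall()

def transform(rows):
    # Basic transformation : standardize names, reject bad records (exception handling).
    clean, rejected = [], []
    for order_id, customer, amount in rows:
        if amount is None or amount < 0:
            rejected.append(order_id)   # route to an exception file, not the warehouse
            continue
        clean.append((order_id, customer.strip().upper(), round(amount, 2)))
    return clean, rejected

def load(dw_conn, rows):
    # Load within the agreed time window, after upstream jobs finish (job sequencing).
    dw_conn.executemany(
        "INSERT INTO sales_fact (order_id, customer, amount) VALUES (?, ?, ?)", rows
    )
    dw_conn.commit()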
> (MU - May 2010, Dec. 2010, May 2011, Dec. 2011, May 2012, Dec. 2012,
May 2013, Dec. 2013, May 2014, Dec. 2014, May 2016)
Source systems usually store data in two ways : current value and periodic status. The type of data extraction technique to be used has to be decided based on the category the data falls into.
Current Value
- The value of an attribute represents the value at this moment of time.
- Values are transient or transitory.
- The change of value in this category is dependent on the business transaction.
- Change in value, or how long the value will stay, cannot be predicted.
- Examples : customer name, address, account balance and the outstanding amount on an individual order.
Periodic Status
- In periodic status, every time a change occurs, the value of the attribute needs to be preserved along with the date of the change (e.g. a status attribute with entries on 6/1/2000, 9/15/2000, 1/22/2001 and 3/1/2001, such as "Changed to NJ").
2.5.1 Immediate Data Extraction
- Immediate data extraction captures the changes in real time.
OPTION 1 : Capture through transaction logs
- In this option, the transaction log maintained by the DBMS is used for data extraction.
- All the committed transactions are selected from the transaction log.
- All the transactions have to be extracted before the log file is refreshed.
- This option is not feasible if the data source is other than a database application, e.g. indexed or flat files.
- Data replication is a method for creating copies in a distributed environment.
OPTION 3 : Capture in source applications
- Triggers created on the source databases can also be used to capture the changes.
2.5.2 Deferred Data Extraction
- This option does not capture the changes in real time.
- There are two options : (i) capture based on date and time stamp, and (ii) capture by comparing files.
- If none of the above techniques is feasible for the source file, then this option is considered.
- This technique is also called the snapshot differential technique, as it compares two snapshots of the source data.
- This technique captures the changes by comparing the two copies of the data.
- Although this technique is simple, comparing full rows in a large file can be very inefficient.
- However, this technique is feasible when the legacy source systems do not have transaction logs or time stamps on source records.
Syllabus Topic : Data Transformation - Basic Tasks, Major Transformation Types
2.6 Data Transformation : Tasks Involved in Data Transformation
(MU - May 2010, Dec. 2010, May 2011, Dec. 2011, May 2012, Dec. 2012, May 2013, Dec. 2013, May 2014, Dec. 2014, May 2016)
- The operational databases developed can be based on any set of priorities, which keep changing with the requirements.
- Therefore, those who develop a data warehouse based on these databases are typically faced with inconsistency among their data sources.
- The transformation process deals with rectifying any inconsistency.
- One of the most common transformation issues is "Attribute Naming Inconsistency". It is common for a given data element to be referred to by different names in different databases.
- Example : Employee Name may be EMP_NAME in one database and ENAME in the other.
- Thus one set of data names is picked and used consistently in the data warehouse.
- Once all the data elements have the right names, they must be converted to common formats. The conversion may encompass the following :
o Characters must be converted from ASCII to EBCDIC or vice versa.
o Mixed text may be converted to all uppercase for consistency.
o Numerical data may need to be converted to a common representation.
o Date formats must be made consistent.
o Measurement units must be standardized.
o Coded data (Male/Female, M/F) must be converted into a common format.
- All these transformation activities are automated, and many commercial products are available to perform the tasks. DataMAPPER from Applied Database Technologies is one such comprehensive tool.
- Smoothing : removes noise from the data.
- Aggregation : summary or aggregation operations are applied to the data.
- Generalization : uses concept hierarchies to replace the data by higher-level concepts.
- Normalization : attributes are scaled to fall within a small, specified range.
- Scaling : attribute values are scaled to fall within a specified range.
Example : To transform V in [min, max] to V' in [0, 1], apply
V' = (V - min) / (max - min)
- Scaling by using mean and standard deviation (useful when min and max are unknown or when there are outliers) :
V' = (V - mean) / standard deviation
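Both scalings are easy to sketch in plain Python; the salary values below are made up purely for illustration :

def min_max_scale(values, new_min=0.0, new_max=1.0):
    # Min-max normalization : map [min, max] onto [new_min, new_max].
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) * (new_max - new_min) + new_min for v in values]

def z_score_scale(values):
    # Z-score normalization : centre on the mean, scale by the standard deviation.
    n = len(values)
    mean = sum(values) / n
    std = (sum((v - mean) ** 2 for v in values) / n) ** 0.5
    return [(v - mean) / std for v in values]

salaries = [1000, 2500, 4000, 8000, 15000]   # hypothetical attribute values
print(min_max_scale(salaries))   # all results now lie in [0, 1]
print(z_score_scale(salaries))   # results centred on 0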
2.6.1 The Set of Basic Tasks
- It is the data manipulation carried out on selected data records.
- It can be considered as a pre-process type before the major transformation techniques are applied.
Syllabus Topic : Data Loading - Applying Data
|
¢ 2.8 D ata Loading : Techniques of Data
Loadin
ge
trae ; > (MU - May 20 ;
y 2010, Mo rth Dec. 2011, May 2012, Dec. 2012,
to,
UP
, Dec. 2013, May 2014, Dec. 2014, May 2016)
'
atl oa
Loading process is the physical
movement of the data from the computer systems stort
4 the source database(s) to that which will
store the data warehouse deésheane "
The whole process of moving data into the data warehouse repository o%
is referred to in the ybe
eee
ata following ways :
|
ion a (1) Initial Load : For the very first time loading all the data warehouse tables. a y
Meh (2) Incremental Load : Periodically applying ongoing changes as per the requirement.
(3) Full Refresh : Deleting the contents of a table and reloading it with fresh data.
|
Data Refresh versus Update
- After the initial load, the data warehouse needs to be maintained and updated, and this can be done by the following two methods :
- Update : application of incremental changes in the data sources.
- Refresh : complete reload at specified intervals.
Data Loading
- Data are physically moved to the data warehouse.
- The loading takes place within a "load window".
- The trend is towards near-real-time updates of the data warehouse, as the warehouse is increasingly used for operational applications.
2.8.1 Loading the Dimension Tables
- The procedure for maintaining the dimension tables includes two basic functions : initially loading the tables, and thereafter applying the ongoing changes.
- System-generated keys are used in a data warehouse; the records in the source system have their own keys.
- Therefore, before an initial load or an ongoing load, the production keys must be converted to system-generated keys in the data warehouse.
- Another issue is related to the application of Type 1, Type 2 and Type 3 dimension changes to the data warehouse. Fig. 2.8.1 shows how to handle it.
[Figure : incoming dimension records are transformed and prepared for loading in the staging area; determine if a row with this surrogate key exists; if the changed values match the existing dimension values exactly, no action is taken; otherwise apply Type 1 (overwrite the old value), Type 2 (create a new dimension record) or Type 3 (write the changed value to an "old" attribute field).]
Fig. 2.8.1 : Loading changes to dimension tables
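A minimal sketch of the three change types from Fig. 2.8.1, using an in-memory dimension table; the surrogate keys and column names are hypothetical :

# Dimension rows keyed by surrogate key; columns are made up for illustration.
dim_customer = {
    101: {"customer_id": "C-7", "city": "Mumbai", "old_city": None},
}

def apply_type1(key, new_city):
    # Type 1 : overwrite the old value; history is lost.
    dim_customer[key]["city"] = new_city

def apply_type2(key, new_city, next_key):
    # Type 2 : create a new dimension record with a fresh surrogate key.
    row = dict(dim_customer[key])
    row["city"] = new_city
    dim_customer[next_key] = row

def apply_type3(key, new_city):
    # Type 3 : push the current value into an "old" attribute field, then overwrite.
    row = dim_customer[key]
    row["old_city"], row["city"] = row["city"], new_city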
2.8.2 Loading the Fact Tables : History and Incremental Loads
- The key in the fact table is the concatenation of keys from the dimension tables.
- Therefore, for this reason, dimension records are loaded first.
- A concatenated key is created from the keys of the corresponding dimension tables.
2.9 Data Quality : Issues in Data Cleansing
- Data cleaning, also called data cleansing or scrubbing, deals with detecting and removing errors and inconsistencies from data in order to improve its quality.
- Data quality problems are present in single data collections such as files and databases, e.g. due to misspellings during data entry, missing information or other invalid data.
2.9.1 Reasons for "Dirty" Data
- Dummy values
- Absence of data
- Multipurpose fields
- Cryptic data
- Contradicting data
- Inappropriate use of address lines
- Violation of business rules
- Reused primary keys
- Non-unique identifiers
2.9.2 Steps in Data Cleansing
1. Parsing
- The individual data elements are located and identified in the source files, and then these elements are isolated in the target files.
- For e.g., a name can be parsed as the first, middle and last name; as another example, an address can be parsed as street number and street name, city and state.
2. Correcting
- Sophisticated data algorithms and secondary data sources are used for correction of individual data components.
- For e.g., replacing a vanity address and addition of a zip code.
3. Standardizing
- Standardizing applies conversion routines to transform data into its preferred (and consistent) format using both standard and custom business rules.
- Examples include adding a prename, replacing a nickname, and using a preferred street name.
4. Matching
- Searching and matching records within and across the parsed, corrected and standardized data, based on predefined business rules, to eliminate duplication.
- Examples include identifying similar names and addresses.
5. Consolidating
- Often used as an interim step between data extraction and later steps.
- Accumulates data from asynchronous sources using native interfaces, flat files, FTP sessions, or other processes.
- At a predefined cut-off time, data in the staging file is transformed and loaded to the warehouse.
2.10 Sample ETL Tools
2.11 Need for OLAP
- OLAP, or On-Line Analytical Processing, supports the multidimensional view of data.
- OLAP provides fast, steady and proficient access to the various views of information.
- Complex queries can be processed easily; it is easy to analyze information by processing complex queries on multidimensional views of data.
- A data warehouse is generally used to analyse information where huge amounts of historical data are stored.
- Information in a data warehouse is related to more than one dimension, like sales, market trends, buying patterns, supplier, etc.
Definition
Definition given by the OLAP council (www.olapcouncil.org) : On-Line Analytical Processing (OLAP) is a category of software technology that enables analysts, managers and executives to gain insight into data through fast, consistent, interactive access in a wide variety of possible views of information that has been transformed from raw data to reflect the real dimensionality of the enterprise as understood by the user.
Syllabus Topic : OLTP V/s OLAP
2.12 OLTP V/s OLAP
(MU - Dec. 2010, May 2011, May 2012, Dec. 2013)
Application Differences

OLTP | OLAP
Transaction oriented | Subject oriented
High Create/Read/Update/Delete (CRUD) activity | High Read activity
Many users | Few users
Continuous updates - many sources | Batch updates - single source
Real-time information | Historical information
Tactical decision-making | Strategic planning
Controlled, customized delivery | "Uncontrolled", generalized delivery
RDBMS | RDBMS and/or MDBMS
Operational database | Informational database

Operational Differences

OLTP | OLAP
High transaction volumes using few records at a time | Low transaction volumes using many records at a time
Balancing needs of online v/s scheduled batch processing | Designed for on-demand online processing

Model Differences
2.14 Hypercube
- Let us assume that there are two business dimensions, Product and Time. Business users wish to analyse metrics such as fixed cost, variable cost, indirect sales, direct sales and profit margin. For analysis purposes, the data can be displayed on a spreadsheet with metrics as columns and time as rows, one product per page, as shown in Table 2.14.1.
Table 2.14.1 : Display of spreadsheet for product Trousers with Time as rows and Metrics as columns
[Table 2.14.1 : rows Jan. to Dec., columns Fixed Cost, Variable Cost, Indirect Sales, Direct Sales and Profit; the individual cell values are not legible in the scanned original.]
- Such a spreadsheet needs one page per product. Combining the Product, Time and Metrics dimensions gives the three-dimensional representation shown in Fig. 2.14.1.
[Fig. 2.14.1 : Three-dimensional display - Product (Coats, Jackets, Dresses, Shirts), Time (Jan.-Dec.) and Metrics (Fixed cost, Variable cost, Indirect sales, Direct sales, Profit).]
- A representation with three dimensions is called a hypercube, which presents multidimensional data. Adding one more dimension, Store (e.g. Dallas, Cleveland), to Time and Product gives the MDS for 4 dimensions shown in Fig. 2.14.2.
[Fig. 2.14.2 : MDS for four dimensions - Store, Product, Time and Metrics.]
- Representation of these 4 dimensions on a spreadsheet page is also possible by combining multiple logical dimensions within the same display group. In Fig. 2.14.3, each page represents a product; two metrics (Sales and Cost) are combined to display as columns, shown here for sales at store New York.
[Figure : page display with STORE and TIME as rows and paired columns Hats : Sales, Hats : Cost, Coats : Sales, Coats : Cost, Jackets : Sales, Jackets : Cost.]
Fig. 2.14.3 : MDS and page display for 4 dimensions
- So, a hypercube is used for representation of more than 3 dimensions. Similarly, one can add more dimensions and represent them using MDS.
Syllabus Topic : OLAP Operations - Drill down, Roll up, Slice, Dice and Rotation
2.15 OLAP Operations
- A data cube allows data to be viewed and analysed with respect to multiple dimensions.
[Fig. 2.15.2 : Pictorial view of a data cube - corporate databases (data warehouses etc.) feed a multidimensional database that is accessed from client machines.]
- A multidimensional model has two types of tables :
1. Dimension tables : contain attributes of dimensions.
2. Fact tables : contain facts or measures.
- The following techniques are used for OLAP implementation.
- Example : Let us consider a company selling electronic products. The data cube of the company consists of 3 dimensions : Location (aggregated with respect to city), Time (aggregated with respect to quarters) and Item (aggregated with respect to item types).
1. Consolidation or Roll Up
- Multidimensional databases generally have hierarchies with respect to dimensions.
- Consolidation is rolling up or adding data relationships with respect to one or more dimensions, for example, adding up all product sales to get total city data.
- Fig. 2.15.3 shows the result of a roll-up operation performed on the central cube by climbing up the concept hierarchy for location. This hierarchy was defined as the total order street < city < province_or_state < country.
- The roll-up operation shown aggregates the data from the city level up to the country level of the location hierarchy.
[Fig. 2.15.3 : Roll-up on the central cube - Location (countries, e.g. Canada), Time (quarters) and Item type.]
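In a relational setting, a roll-up is essentially a GROUP BY at a coarser level of the concept hierarchy, and drill-down is the reverse. A small sketch, assuming the pandas library is available and using made-up city/country sales data :

import pandas as pd

sales = pd.DataFrame({
    "city":    ["Mumbai", "Pune", "Toronto", "Vancouver"],
    "country": ["India", "India", "Canada", "Canada"],
    "quarter": ["Q1", "Q1", "Q1", "Q1"],
    "amount":  [120, 80, 150, 90],
})

# Roll up : aggregate city-level data to the country level of the hierarchy.
by_country = sales.groupby(["country", "quarter"])["amount"].sum()

# Drill down is the reverse : view the finer city-level data again.
by_city = sales.groupby(["city", "quarter"])["amount"].sum()
print(by_country, by_city, sep="\n")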
2. Drill-down
- Drill down is defined as changing the view of data to a greater level of detail.
- For example, Fig. 2.15.4 shows the result of drill-down operations performed on the central cube by stepping down the concept hierarchy for time, defined as day < month < quarter < year.
[Fig. 2.15.4 : Drill-down on the central cube along the time hierarchy.]
[Fig. 2.15.6 : Dice operation on the central cube - dimensions Time (quarters), Item type (e.g. Entertainment) and Location (e.g. Vancouver).]
[Fig. 2.15.7 : Pivot operation - rotating the data axes (e.g. Item type against Location).]
6. Other OLAP operations
- Drill across : This technique is used when there is a need to execute a query involving more than one fact table.
- Drill through : This operation drills through the bottom level of the cube down to its back-end relational tables.
2.16 OLAP Models
- Approaches to OLAP servers include MOLAP, ROLAP and hybrid approaches; multidimensional OLAP refers to technology that stores the data in multidimensional cubes.
2.16.1 MOLAP
Advantages of MOLAP
1. Excellent performance : MOLAP cubes are built for fast data retrieval.
2. Can perform complex calculations : All calculations have been pre-generated when the cube is created.
Disadvantages of MOLAP
1. Limited in the amount of data it can handle : Because all calculations are performed when the cube is built, it is not possible to include a large amount of data in the cube itself. This is not to say that the data in the cube cannot be derived from a large amount of data; indeed, this is possible. But in this case, only summary-level information will be included in the cube itself.
2. Requires additional investment : Cube technologies are often proprietary and do not already exist in the organization. Therefore, to adopt MOLAP technology, additional investments in human and capital resources are often needed.
2.16.2 ROLAP
- This methodology relies on manipulating the data stored in the relational database to give the appearance of traditional OLAP's slicing and dicing functionality. Each action of slicing and dicing is equivalent to adding a "WHERE" clause in the SQL statement.
Advantages of ROLAP
1. Can handle large amounts of data : The data size limitation of ROLAP is the limitation on data size of the underlying relational database; ROLAP itself places no limitation on the amount of data.
2. Can leverage functionalities inherent in the relational database : The database already comes with a host of functionalities. ROLAP technologies, since they sit on top of the relational database, can therefore leverage these functionalities.
Disadvantages of ROLAP
1. Performance can be slow : Because each ROLAP report is essentially a SQL query (or multiple SQL queries) in the relational database, the query time can be long if the underlying data size is large.
2. Limited by SQL functionalities : Because ROLAP technology mainly relies on generating SQL statements to query the relational database, and SQL statements do not fit all needs (for example, it is difficult to perform complex calculations using SQL), ROLAP technologies are traditionally limited by what SQL can do.
- ROLAP vendors have mitigated this risk by building into the tool out-of-the-box complex functions, as well as the ability to allow users to define their own functions.
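To make the "WHERE clause" point concrete, here is a sketch of how a slice or dice translates into SQL against a relational store. It uses Python's built-in sqlite3 module; the table and column names are hypothetical :

import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE sales_fact (product TEXT, quarter TEXT, city TEXT, amount REAL);
    INSERT INTO sales_fact VALUES
        ('Jackets', 'Q1', 'Toronto', 150), ('Jackets', 'Q2', 'Toronto', 90),
        ('Coats',   'Q1', 'Mumbai',  120), ('Coats',   'Q2', 'Mumbai',  80);
""")

# Slice : fix one dimension (quarter = 'Q1') -> a single WHERE condition.
slice_q1 = conn.execute(
    "SELECT product, city, amount FROM sales_fact WHERE quarter = ?", ("Q1",)
).fetchall()

# Dice : fix sets of values on several dimensions -> more WHERE conditions.
dice = conn.execute(
    "SELECT * FROM sales_fact WHERE quarter IN ('Q1','Q2') AND city = 'Toronto'"
).fetchall()
print(slice_q1, dice, sep="\n")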
2.16.3 HOLAP
- HOLAP technologies attempt to combine the advantages of MOLAP and ROLAP.
- For summary-type information, HOLAP leverages cube technology for faster performance.
- When detail information is needed, HOLAP can "drill through" from the cube into the underlying relational data.
- For example, a HOLAP server may allow large volumes of detail data to be stored in a relational database, while aggregations are kept in a separate MOLAP store.
- Microsoft SQL Server 7.0 OLAP Services supports a hybrid OLAP server.
2.16.4 DOLAP
- DOLAP is Desktop Online Analytical Processing, a variation of ROLAP. It offers portability to users of OLAP.
- For DOLAP, only the DOLAP software needs to be present on the machine. Through this software, multidimensional datasets are formed and transferred to the destination machine.
2.17 Examples of OLAP
Ex. 2.17.1 : Consider a data warehouse for a hospital where there are three dimensions : (a) Doctor (b) Patient (c) Time
and two measures : (i) count (ii) charge, where charge is the fee that the doctor charges a patient for a visit.
Using the above example, describe the following OLAP operations :
(1) Slice (2) Dice (3) Rollup (4) Drill down (5) Pivot
Soln. :
There are four tables : 3 dimension tables and 1 fact table.
Dimension tables
Scanned by CamScanner
by, *
‘S$
aie
.
Lp Story. 200 o es
Stop, &
) | 100 [Pca alt ow
= ah
|
| Operations
Ortar.
"ab i || L ;
Slice i on fact table
: Slice with DID =
0 de,
| : 300 50 280
, | wig 250 180 200
a . 300 180
| 100° | 125 oe
- 200 | . 300
3, Roll up : It gives summary based on concept hierarchies. Assuming there exists concept
hierarchy in patient table as state->city->location. Then roll up will summarise the charges
for a particular state etc.
or count in terms of city or further roll up will give charges
Scanned by CamScanner
Ex. 2.17.2 :
Soln. :
[Figure : Star schema - StoreSales fact table (Product Key, Store Key, Time Key; measures including Dollar Cost) with dimensions Product (Product Key, SKU Number, Product Description, Brand Name, Product Sub-Category, Product Category, Department, Package Size, Package Type, Weight, Unit of Measure, Units Per Case, Shelf width, Shelf level), Store (Store Key, Store Name, Store ID, Address, City, State, Zip) and Time (Time Key, Date, Day of Week, Week Number, Month, Month Number, Quarter, Year, Holiday Flag).]
Fig. P. 2.17.2 : Star Schema for Electronics Company sales department
There are four tables; of these, 3 are dimension tables and 1 is a fact table.
For OLAP operations refer Example 2.17.1.
Ex. 2.17.3 : The college wants to record the grades for the courses completed by students. There are four dimensions :
(i) Course (ii) Professor (iii) Student (iv) Period
The only fact that is to be recorded in the table is course-grade.
(i) Design star schema.
(ii) Write DMQL for the above star schema.
(iii) Using the above example, describe the following OLAP operations : Slice, Dice, Roll-up, Drill-down, Pivot.
Soln. :
[Figure : Star schema - fact table (Course_id, Student_id, Prof_id, Semester_id; measure Course_grade) with dimensions Course_Dim (Course_id, Section_no, Course_name, Room_id, Room_capacity), Student_Dim (Student_id, Student_name, Major), Professor_Dim (Prof_id, Prof_name, Title, Department_id, Department_name) and Semester_Dim.]
- Define dimension Student_Dim as (Student_id, Student_name, Major).
For OLAP operations refer Example 2.17.1.
2.18 University Questions and Answers
May 2010
Q. 1 Explain ETL of data warehousing in detail. (Ans. : Refer sections 2.5, 2.6, 2.7 and 2.8) (10 Marks)
Dec. 2010
Q. 2 Explain ETL of data warehousing in detail. (Ans. : Refer sections 2.5, 2.6, 2.7 and 2.8) (10 Marks)
May 2011
Q. 5 Explain the major steps in the ETL process. (Ans. : Refer sections 2.5, 2.6, 2.7 and 2.8) (10 Marks)
Q. 6 Compare between OLTP and OLAP. (Ans. : Refer section 2.12) (5 Marks)
Q. 7 The college wants to record the grades for the courses completed by students. There are four dimensions :
(i) Course (ii) Professor (iii) Student (iv) Period
(a) The only fact that is to be recorded in the table is course-grade :
(i) Design star schema. (5 Marks)
(ii) Write DMQL for the above star schema. (5 Marks)
(b) Using the above example, describe the following OLAP operations : Slice, Dice, Roll-up, Drill-down, Pivot.
(Ans. : Refer Ex. 2.17.3) (10 Marks)
Dec. 2011
Dec. 2014
Q. 21 Explain the ETL (Extract, Transform, Load) process. (Ans. : Refer sections 2.5, 2.6, 2.7 and 2.8) (10 Marks)
Q. 22 Describe the following OLAP operations using an example :
(1) Slice (2) Dice (3) Roll up (4) Drill down (5) Pivot
(Ans. : Refer section 2.15) (10 Marks)
May 2016
Chapter Ends
Module 3
Syllabus :
Introduction to Data Mining, Data Exploration and Preprocessing : Data Mining Task Primitives, Architecture, Techniques, KDD process, Issues in Data Mining, Applications of Data Mining, Data Exploration : Types of Attributes, Statistical Description of Data, Data Visualization, Data Preprocessing : Cleaning, Integration, Reduction : Attribute Subset selection, Histograms, Clustering and Sampling, Data Transformation and Data Discretization : Normalization, Binning, Concept hierarchy generation, Concept Description : Attribute oriented Induction for Data Characterization.
3.1 What is Data Mining ?
Definition
- Data mining is the process of discovering patterns in data that are :
o Valid,
o Novel,
o Potentially useful.
3.2 Data Mining Task Primitives
- A data mining query is defined in terms of data mining task primitives :
1. Task-relevant data : the portion of the database in which the user is interested.
2. The kind of knowledge to be mined
- Specify the knowledge to be mined.
- Kinds of knowledge include concept description, association, classification, prediction and clustering.
- The user can also provide pattern templates, also called metapatterns or metarules or metaqueries.
3. Background knowledge
- Concept hierarchies are a popular form of background knowledge.
[Fig. 3.2.1 : Concept hierarchy for the dimension location (e.g. 3567 distinct values at the city level, 674,339 distinct values at the street level).]
Four major types of concept hierarchies :
a) Schema hierarchies : the total or partial order among attributes in the database schema.
Example : Location hierarchy as street < city < province/state < country.
b) Set-grouping hierarchies : organize values into sets or groups of constants.
Example : For attribute salary, a set-grouping hierarchy can be specified in terms of ranges, as in the following :
{low, avg, high} ⊂ all (salary)
{1000....5000} ⊂ low
{5001....10000} ⊂ avg
{10001....15000} ⊂ high
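A set-grouping hierarchy like the one above is easy to express in code; a small sketch using the same ranges :

def salary_level(salary):
    # Map a raw salary onto the set-grouping hierarchy {low, avg, high}.
    if 1000 <= salary <= 5000:
        return "low"
    if 5001 <= salary <= 10000:
        return "avg"
    if 10001 <= salary <= 15000:
        return "high"
    return "out of range"   # values outside the defined groups

print([salary_level(s) for s in (1200, 7500, 14000)])   # ['low', 'avg', 'high']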
c) Operation-derived hierarchies : based on operations specified, which may include decoding of information-encoded strings, information extraction from complex data objects, or data clustering.
Example : A URL or email address such as xyz@cs.mu.in encodes the hierarchy login name < department < university < country.
d) Rule-based hierarchies : occur when a whole or a portion of a concept hierarchy is defined as a set of rules.
Example : The following rules are used to categorize items as low_profit_margin, medium_profit_margin and high_profit_margin :
low_profit_margin (Z) <= price (Z, A1) ∧ cost (Z, A2) ∧ ((A1 − A2) < 80)
medium_profit_margin (Z) <= price (Z, A1) ∧ cost (Z, A2) ∧ ((A1 − A2) ≥ 80) ∧ ((A1 − A2) ≤ 350)
high_profit_margin (Z) <= price (Z, A1) ∧ cost (Z, A2) ∧ ((A1 − A2) > 350)
4. Interestingness measures
- Novelty : Patterns contributing new information to the given pattern set are called novel patterns (Example : data exception).
5. Presentation and visualization of discovered patterns
- Data mining systems should be able to display the discovered patterns in multiple forms, such as rules, tables, crosstabs (cross-tabulations), pie or bar charts, decision trees, cubes, or other visual representations, for the presentation to be useful.
3.3 Architecture of a Typical Data Mining System
(MU - May 2010, Dec. 2010, Dec. 2011, Dec. 2013, May 2016)
- The architecture of a typical data mining system may have the following major components, as shown in Fig. 3.3.1 :
[Fig. 3.3.1 : Architecture of a typical data mining system - data cleansing and data integration feed the database/data warehouse layer, served to the data mining engine, pattern evaluation module and graphical user interface, guided by a knowledge base.]
1. Database, data warehouse, or other information repository : the data sources on which data cleansing and data integration are performed.
2. Databases or data warehouse server : fetches the relevant data as per the user's data mining request.
3. Knowledge base : domain knowledge (such as concept hierarchies and interestingness measures) integrated with the mining process to guide the search and evaluate the resulting patterns.
4. Data mining engine : the functional modules for tasks such as characterization, association, classification, prediction and clustering.
5. Pattern evaluation module : applies interestingness measures to focus the search towards interesting patterns.
6. Graphical user interface : this module communicates between users and the data mining system, allowing users to browse the data and the discovered patterns.
3.4 Data Mining Techniques
- Many techniques are used for data mining; some of them are discussed below.
3.4.1 Statistics
- Statistics is concerned with the collection, analysis, interpretation and representation of data.
- For example, Facebook's News Feed changes according to the user's personal interactions with other users.
1. Supervised learning : One standard formulation of the supervised learning task is the classification problem. In classification, the training dataset has labelled examples. Based on the training dataset, a classification model is constructed which will be used to assign labels to unseen data. Neural network and decision tree techniques are highly dependent on previous information, as the classes are pre-determined.
2. Unsupervised learning : One standard formulation of the unsupervised learning task is the clustering problem, where the examples of the training dataset are not labelled. Based on similarity measures, clusters are formed.
3. Semi-supervised learning : This learning technique combines both labelled and unlabelled examples to generate an appropriate function or classifier. This method generally avoids the need for a large number of labelled examples.
4. Active learning : The user plays an important role in active learning. The algorithm itself decides which examples the user should label. Active learning is a powerful approach for analysing data effectively.
Database
1. Databases are used to record the data and also be used for data warehousing. Online
Transactional Processing (OLTP) uses databases for day to day transaction purpose.
2. To remove the redundant data and save the storage space, data is normalized and stored in
the form of tables.
3. Entity-Relational modeling techniques are used for Relational Database Management
System design.
2. Data is de-normalized so that analytical queries are less complex, which reduces the response time.
Decision Support Systems
- A Decision Support System (DSS) is a category of information systems which helps in decision-making for businesses and organizations.
- DSS are software-based systems which help decision makers to extract useful information from raw data and documents, take appropriate decisions, and model business situations by identifying problems and solving them.
- Information that is gathered and presented by a DSS includes :
o A list of information assets like legacy and relational data sources, cubes, data warehouses and data marts.
o Comparative statements of sales.
o Projected revenue based on assumptions of new product sales.
- The main goal of KDD is extracting knowledge from large databases. The goal is achieved by using various data mining algorithms to identify useful patterns according to some predefined measures and thresholds.
[Fig. 3.5.1 : The KDD process.]
4. Data reduction and projection :
(a) Based on the goal of the task, useful features are found to represent the data.
(b) The number of variables may be effectively reduced using methods like dimensionality reduction or transformation. Invariant representations for the data may also be found out.
5. Choosing the data mining task :
Selecting the appropriate data mining task, like classification, clustering or regression, based on the goal of the KDD process.
6. Choosing the data mining algorithm(s) :
(a) Pattern search is done using the appropriate data mining method(s).
(b) A decision is taken on which models and parameters may be appropriate.
(c) Considering the overall criteria of the KDD process, a match for the particular data mining method is made.
7. Data mining :
The terms knowledge discovery and data mining are distinct : data mining is one step within the overall knowledge discovery process.
3.6 Major Issues in Data Mining
Mining methodology and user interaction issues
- The effectiveness of a medicine or a certain treatment can be analysed using data mining.
Syllabus Topic : Types of Attributes
Data Objects
- A data object is a logical cluster of all the data in the data set which contains data related to the same entity. It also represents an object view of tables.
- Example : In a product manufacturing company, product and customer are objects. In a retail store, employee, customer, items and sales are objects.
- Every data object is described by its properties, called attributes, and it is stored in the database in the form of a row or tuple. The columns of this data tuple are known as the attributes.
Attribute Types
- An attribute is a property or characteristic of a data object. For e.g., Gender is a characteristic of a data object Person.
- The attributes may be of the following types :
(i) Nominal attributes
(ii) Binary attributes
(iii) Ordinal attributes
(iv) Numeric attributes
(v) Discrete versus continuous attributes
(i) Nominal attributes
(iv) Numeric attributes
- Numeric attributes are quantifiable; they can be measured in terms of a quantity, which can have either an integer or a real value. They can be of two types :
(a) Interval-scaled attributes
(b) Ratio-scaled attributes
(a) Interval-scaled attributes
- Example : weight, height and weather temperature. These attributes allow for ordering, comparing and quantifying the difference between values. An interval-scaled attribute has values whose differences are interpretable, but its ratios are not : addition and subtraction can be performed, whereas multiplication and division are not meaningful. For instance, if a liquid is at 40 degrees and we add 10 degrees, it will be 50 degrees; however, a liquid at 40 degrees does not have twice the temperature of a liquid at 20 degrees, because 0 degrees does not represent "no temperature".
(b) Ratio-scaled attributes
- Ratio-scaled attributes are continuous positive measurements on a nonlinear scale. They are also interval-scaled data, but are not measured on a linear scale.
3.9 Statistical Description of Data
3.9.1 Measuring the Central Tendency
- Mean is given by x̄ (pronounced as "x bar") :
x̄ = (x1 + x2 + ... + xn) / n = (Σ xi) / n
where n is the number of values.
- For example, the mean for a data set of ten values (55, 39, 14, 96, 78, 87, 45, 92, and two values lost in the scan) is given as 64.7.
- Mean is the only measure of central tendency in which the sum of the deviations of each value from the mean is always zero.
- If weights wi are associated with the values xi, then the weighted arithmetic mean or the weighted average is :
x̄ = (Σ wi xi) / (Σ wi)
Median
- This measure is suitable when the data is skewed (the frequency distribution of the data is skewed). As the data becomes skewed, the mean loses its ability to indicate a central position; in such a situation the median can be used, as it is not influenced by the skewed values.
- Median is the middle score for a data set which has been arranged in order of magnitude (smallest first).
- Rearranging the above data in order of magnitude, the median is the middle value.
Midrange
- The midrange is the average of the largest and smallest values in a data set.
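A quick sketch of these central-tendency measures in plain Python (the statistics module provides mean and median directly; the weights are made up for illustration) :

import statistics

data = [55, 39, 14, 96, 78, 87, 45, 92]

mean = sum(data) / len(data)            # arithmetic mean
median = statistics.median(data)        # middle value after sorting
midrange = (min(data) + max(data)) / 2  # average of smallest and largest

# Weighted mean : each value x_i contributes in proportion to its weight w_i.
weights = [1, 2, 1, 1, 3, 1, 1, 2]
wmean = sum(w * x for w, x in zip(weights, data)) / sum(weights)

print(mean, median, midrange, wmean)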
3.9.2 Dispersion of Data
- The five-number summary of a distribution consists of :
min, Q1, M, Q3, max
where M = median.
- Example : 7, 8, 3, 1, 5, 6, 10, 13, 5
Sorted : 1 | 3 | 5 | 5 | 6 | 7 | 8 | 10 | 13
- Calculate the median of the data, which divides the data into two halves :
1 3 5 5 | 6 | 7 8 10 13  (median M = 6)
- Calculate the median of both halves. As there are 4 values in each half, the quartile is the average of the two middle values :
Q1 = (3 + 5)/2 = 4
Similarly, for the second half (7, 8, 10, 13),
Q3 = (8 + 10)/2 = 9
- Now there are three points : Q1 (the first middle point), M, and Q3 (the middle point of the upper half). Together with min = 1 and max = 13, these give the five-number summary.
- The variance of n observations is
s² = [1/(n − 1)] Σ (xi − x̄)², summing over i = 1 to n,
and the standard deviation s is the square root of the variance.
Graphic displays of basic statistical descriptions include :
(i) Box plots
(ii) Histograms
(iii) Quantile plots
(iv) Quantile-quantile plots
(v) Scatter plots
(vi) Loess curves
(i) Box plots
[Figure : box plot example on a scale from 1 to 15, marking min, Q1, median, Q3 and max.]
- Bar charts, pie charts and line graphs are other common displays.
(ii) Histograms
- A vertical bar graph and a histogram differ in these ways :
(iii) Quantile plots
- The normal quantile plot is used to compare the data values with the values that would be predicted for a standard normal distribution.
- To draw a normal quantile plot, compute two additional numbers for each value of the variable :
1. Sort the data values from lowest to highest, and in the column labelled x(i) list the data values in ascending order.
2. The next column shows the quantile for each data value.
3. Now calculate the value of the standard normal distribution that lies at the quantile just computed in the previous column.
4. A normal quantile plot is formed by plotting each data value against the value of the standard normal distribution.
Example :
Example : til i
— As compared t
‘ but it needs mo
a One of the qua
2 | -o4 0.15 ~0.727891156
:3 | =
-03 0.25 ee
* —0.659863946
; ee e values.
y-axis
~ 0591836735
0.45 — If the marks :
~ 0.455782313
6 0.1. L955 | _o3g77ss ‘Standard Nor
7
i02 7 ASwehaver
0.6
ee
a
———
8 —0.047619048
27 975 ~ Finally you
| 1380959381
Pe 9 |Pe
6
S| 131292517
~ SeQde
| =
|
095 | 1448979599
5
Scanned by CamScanner
- z = (x(i) − μ)/σ gives the y-axis values.
- If the marks are normally distributed, then the x-axis values would be the quantiles of a standard normal distribution.
- As we have marks of 30 students, we will need 30 normal quantiles.
(v) Scatter plots
- The corresponding values are plotted as in Fig. 3.9.5 :
[Figure : scatter plot of Water consumption (ounces) against Temperature (F), for temperatures of roughly 70 to 100.]
Fig. 3.9.5 : Scatter plot for Temperature and Water Consumption
(vi) Loess curve
Syllabus Topic : Data Visualization
3.10 Data Visualization
- Data visualisation is presenting the data in a graphical or pictorial form. Visualisation techniques help people to analyse things which are otherwise not possible when the data is very large. Patterns in the data can be marked very easily using data visualisation.
- Some of the data visualisation techniques are as follows :
1. Pixel-oriented visualization techniques
- It maximises the amount of information represented at one time without any overlap.
- A tuple with m variables has m differently coloured pixels to represent each variable, and each variable has a sub-window.
- Based on the data characteristics and the visualization task, the colour mapping of the pixels is decided.
2. Geometric projection visualization techniques
(i) Scatter-plot matrices
[Fig. 3.10.2 : An example of a scatter-plot matrix (variables such as Temp and RH-July).]
(ii) Hyperslice : an extension of scatter-plot matrices; it represents a multi-dimensional function as a matrix of orthogonal two-dimensional slices.
(iii) Parallel co-ordinates : equally separated parallel vertical lines define the axes. A point in Cartesian coordinates corresponds to a polyline in parallel coordinates.
[Figure : parallel-coordinates plot.]
3. Icon-based visualization techniques
- Each multidimensional data item is mapped to an icon.
- This technique allows visualisation of large amounts of data.
- An example of Chernoff faces is shown below; they use facial features to represent trends in the values of the data, not the specific values themselves.
- They display multidimensional data of up to 18 variables or dimensions.
- In Fig. 3.10.4, each face represents an n-dimensional data point (n <= 18).
[Fig. 3.10.4 : Chernoff faces.]
- Pickett and Grinstein's stick-figure visualisation technique maps the original data onto a five-stick figure; a family of such icons can display data with up to five variables. A two-stick icon helps to differentiate texture, e.g. to display bivariate MRI data.
[Fig. 3.10.5 : Stick-figure visualisation.]
4. Hierarchical visualization techniques
(i) Dimension stacking : one coordinate system is embedded within another; attributes are assigned to inner and outer dimensions.
[Fig. 3.10.6 : Data in dimension stacking (Attribute 1 on the outer axis).]
(ii) Mosaic plot
(iii) Tree-maps
- Each set of rectangles on the same level in the hierarchy represents a column or an expression in a data set.
[Fig. 3.10.9 : Web traffic by location tree-map.]
[Figure : tag cloud of terms such as design, usability and application.]
- When data processing does not involve any data manipulation and only converts the data type, it may be called data conversion.
- Noisy : When the data contains errors or some outliers, it is considered to be noisy data.
- Inconsistent : When the data contains discrepancies in codes or names, it is inconsistent data.
Tasks in data pre-processing
- Data cleaning : This process consists of filling in missing values, smoothening the noisy data, identifying and removing any outliers present, and resolving inconsistencies.
- Data integration : This refers to integrating data from multiple sources like databases, data cubes, or files.
- Data transformation : Normalization and aggregation.
- Data reduction : In data reduction the amount of data is reduced, but the same analytical results are produced.
- Data discretization : Part of data reduction; replacing numerical attributes with nominal ones.
Different forms of data pre-processing :
1. Data cleaning
2. Data integration and transformation
3. Data reduction
4. Data discretization and concept hierarchy generation
3.12.1 Reasons for "Dirty" Data
- Dummy values
- Absence of data
- Multipurpose fields
- Cryptic data
- Contradicting data
- Inappropriate use of address lines
- Violation of business rules
- Reused primary keys
- Non-unique identifiers
- Data integration problems
3.12.2 Steps in Data Cleansing
1. Parsing
- Individual data elements are located and identified in the source files, and these elements are then isolated in the target files.
2. Correcting
- This is the next phase after parsing, in which individual data elements are corrected using data algorithms and secondary data sources.
- For example, in the address attribute, replacing a vanity address and adding a zip code.
3. Standardizing
- In the standardizing process, conversion routines are used to transform data into its preferred and consistent format using both standard and custom business rules.
- For example, addition of a prename, replacing a nickname, and using a preferred street name.
4. Matching
- This involves searching for empty fields where values should occur.
- Data preprocessing is one of the most important stages in data mining. Real-world data is incomplete, noisy or inconsistent; this is corrected in the data preprocessing process by filling out the missing values, smoothening noisy data and correcting inconsistencies.
- Techniques for dealing with missing data can be dependent on the problem domain and the goal of the data mining process. Following are the different ways of handling missing values :
1. Ignore the data row
- In case of classification, suppose a class label is missing for a row : such a data row could be ignored. Likewise, if many attributes within a row are missing, the row could be ignored. If the percentage of such rows is high, even in this case the data will result in poor performance.
- For example, suppose we have to build a model for predicting student success in college. For this purpose we have a student database having information about age, score, address, etc., and a column classifying them as "LOW", "MEDIUM" and "HIGH". Rows with the class column missing would be ignored.
2. Use a global constant to fill in for missing values
- When missing values are difficult to predict, a global constant value like "unknown", "N/A" or "minus infinity" can be used to fill all the missing values.
- For example, consider the students data : if the address attribute is missing for some students, it does not make sense to fill in these values; rather, a global constant can be used.
5. Use the attribute mean for all samples belonging to the same class
- Instead of replacing the missing values by the mean or median of all the rows in the database, we could consider class-wise data for missing values to be replaced by its mean or median, to make it more relevant.
- For example, consider a car pricing database with classes like "luxury" and "budget" where missing values need to be filled in : replacing the missing cost of a luxury car with the average cost of all luxury cars makes the data more accurate.
6. Use a data-mining algorithm to predict the most probable value
- Missing values may also be filled by using techniques like regression, inference-based tools using Bayesian formalism, decision trees, and clustering algorithms.
- For example, a clustering method may be used to form clusters, and then the mean or median of that cluster may be used for the missing value. A decision tree may be used to predict the most probable value based on the other attributes.
- The entire range is divided into N intervals, each containing approximately the same number of samples (equal-frequency binning).
Example :
- Let us consider sorted data for price (in INR) :
4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34
- Partition into (equal-frequency) bins :
Bin 1 : 4, 8, 9, 15
Bin 2 : 21, 21, 24, 25
Bin 3 : 26, 28, 29, 34
- Smoothing by bin means : replace each value of a bin with its mean value.
Bin 1 : 9, 9, 9, 9
Bin 2 : 23, 23, 23, 23
Bin 3 : 29, 29, 29, 29 (the exact mean is 29.25)
2. Clustering
- Partition the data set into clusters; one can then store the cluster representatives.
[Fig. 3.12.1 : Graphical example of clustering (price on the x-axis).]
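A short sketch of equal-frequency binning and smoothing by bin means, using the price data above (here the means are kept exact rather than rounded) :

prices = sorted([4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34])
n_bins = 3
size = len(prices) // n_bins

bins = [prices[i * size:(i + 1) * size] for i in range(n_bins)]
# Smoothing by bin means : every value in a bin is replaced by the bin's mean.
smoothed = [[round(sum(b) / len(b), 2)] * len(b) for b in bins]
print(bins)      # [[4, 8, 9, 15], [21, 21, 24, 25], [26, 28, 29, 34]]
print(smoothed)  # [[9.0, ...], [22.75, ...], [29.25, ...]]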
- Perform clustering on attribute values and replace all the values in a cluster by a cluster representative.
3. Regression
- Linear regression models the relationship between two variables by fitting the line
y = a + bx
where a is the intercept and b is the slope.
341,)_
) mS4
6O-Co p IntrodoUt tay
W at a Wa
War re
e ho us & Min
in . 6-C
(MU-Semm.
g inging (MU-So
SF Dat
Log linear model
and a log linear model jg ia
na
anption
gE assum
sion a best fit between the data
: ast pata Inte
: A linear relationship exists between the log of the‘cia ee
Major
independent variables.
ee,
linear models are models that postulate a linear relationship betw
Log
exams
independent variables and the logarithm of the dependent variable. For
Jog(y) = ap + ay X; + a2 X2~-++ An Xn
where y is the dependent variable; x; i=1,....N are independent variables nat
of the model. i .
{a,, i=0,...,.N} are parameters (coefficients)
log linear models are widely used to analyze: categorical da
For example,
|
represented as a contingency table. In this case, the main reason to transfor,
frequencies (counts) or probabilities to their log-values is that, provided e.g. A.cust-id
ty
independent variables are not correlated with each other, the relationship between
new transformed dependent variable and the independent variables is a lines
(additive) one.
ij
yt"
~ With the]
- Theredu
from mu
Fig. 3.12.2 : Regression example
3.13.1 Entity Ie
3.12.5 Inconsistent Data
~ Schema inte
— The state
in which the data quality of the existing
data is understood
and the desi! task,
quality of the data is known refers to consistent data quality. 3 Identify re;
- It is a state in which the existent data quality is been modified to meet the current a identificati
Such conf
Scanned by CamScanner
[ Sy
ep, ag
= Introduction ta Data Minin
f — - ¥ labus Topic ' Integration = 2
| 3.19_Data Integration
tee,
Scanned by CamScanner
Ww Data Warehousing & M (MU-Sem. 6-Comp.)
Detecting and resolving data value conflicts for the same real world entity, Altibut.
from different sources are different.
“ty
3.13.2 Redundancy and Correlation Analysis
— Data redundancy occurs when data from multiple sources is considered fori
in Le Rration,
- Attribute naming may be a problem as same attributes may have diffi
erent Names ‘
multiple databases. =-
L) DF
- An attribute may be derived attribute in another table e.g. “yearly income”
where. ris the n
- Redundancy can be detected using correlation analysis.
for the other cat
— To reduce or avoid redundancies and inconsiste
ncies data integration must be freq
carefully. This will also improve mining algor CAITIed ty
ithm speed and quality. attribute. The fc
= x ( Chi quare
Fe
o nis the to
Ex.3.13.1: Byt:
data
(You
Tab
Le
Soin, : State th
: Ho: Null hy
H,: Alterna
Scanned by CamScanner
———
———
Eee
¥ Data Warehousing & Min sl
Lee ing (MU
(my-So
5m
Where —
~&Comp) 3-Se44
X? — Chi- § are Introduction to Data
Mining
~
E dL: f
= Fre,
quency Cx
Pect to fing iCc ted whi
nN en ich is the am
“Mount of sub ‘
O = Frequency Obs erved
“awi legory based on ki
own iformeEne ? a
found
ee
or mary to be ; Whic
be in ch Categohry is inthe
Degrees 8 presen teen
© degrees Of free pmecmey
SS
dom (DF) j
DF = ‘
-1)*@¢~1)
PAAaBOS
where, , Tis : the S ny nae
ohne of leve
for the other cate ls
gorical variable for One Cat .
. €g0rical Variable an
d c¢ j 8 the number of lev
Expected freque els
ncies > It is th
attribute. The form which is com
ula for cmmenae puted for each level
uency is of categorical
E. = (o;*n.)/n
Ex. 3.13.1:
,
Scanned by CamScanner
345 _!ntroduction to a wa rehousing & Mini
! Ww Data Warehousing & Mining (MU-Sem. 6-Comp.)
a ‘N
oe is terpretation ‘
Analyze sample data
objects, x and y a
| Mate(o-E 400 900 100 a
[Macocre|. oe |g RE Se Sena
Female(O) 250 300 50 600
Female(E) | (600° 450)/ 1000 = 270 | (600* 450) /1000=270 | (600*
100) /1000=60| 600
Female(0-E) ~20 30 -10 — Covariance is g
| Female(O-E)? 400 900 100 Cov YY)
[ Female | 148 853 Sater 2 3.13.3 Tuple Duy
- Using formula,
| - In data integr
occur due to ii
x= 2 |O-2| ~ Detection of t
© Specify t
X? = 2.22 +5.004+2.50 + 1.48 + 3.33 + 1.67 = 162
- The P-value is the probability that a chi-square statistic having 2 degrees of freedom8 9 Compan
more extreme than 16.2. © Objects
- We use the Chi-Square distribution table to find P(X? > 16.2) = 0.0003. 9 The clo
4 Scanned by CamScanner
Fr
¥ pata Warehousing & Mining (MU-Som, Comp
2 (p-p)
fn = EE AD(q- 2 ou) a59= nh ‘0,6,
Scanned by CamScanner
aT =“ Pay
Scanned by CamScanner
eS
- Removing irre]
—
wrapper smistiede, a Attributes ; In thi Introduction to Data Mini
AY be used i, 'S altribuie .
- Principle component ed, it alse involy, 5 ene
SS
Methods like filtering and
;
representing anal Ysis (numeric . Sivas the attribut
the data inva Sie ibute space,
- > SS
Reducing
2. 9 th the number of attribute
Pact form by
si
usi Ss only) This involves
Ng a lower dimensional space,
Ss —
ues
- Binning (histogram
. this will result inS)
bins, : ThisIS InVolv
j
“=
es representing
aS
hes
lesser number of attrib
a
a Clustering attributes into £roups called as
: Group 4
clusters 7
rd
7
r
th
Example
Total annual sales of TV in USA is aggregated quarecly as shoinwn
Fig. 3.14.1.
«
Scanned by CamScanner
———————
Data Warehousing & Mining (MU-Sem. 6-Comp.)
oo
3-49
—————
Date
Scanned by CamScanner
7
~~ he
+
hd
. Forward selection
PSS
Start with empty Set of att
ributes
- Determine the best Of the
6
o—3
Ti
- Ateach Step, fi Binal attributes an d add it to the set.
=.
nd the best 0
F the remaining Original attribut
2. Stepwise backward elimination es and add it to
the set.
- Starts with
the full set Of at
tribut es
— Ateach step, j
Scanned by CamScanner
¥ 9 & Mining (MU-Sem.
Lossless compression
2. 2. Principle c
Principle component analysis
— Princip
orthoge
1. The wavelet transform can al:
“toregns
i f
ae iden ti
may be identified r
by applying the wavelet transform
Stogr
‘ over the featu® “
—ai
Scanned by CamScanner
|
- Clusters are iden
| . tif;
in their bound;
, King, .
Major features _
- Italso results in
Effectiy,
© hi - The technique is Cost Oval of our Pi
liers,
Rhy,
fificient.
Complexity ON).
le.
Tey ~ Abeiitoront scales arbitrary shap
eq Clusters are detected
= “Theietho d ts not SCRS
. itive to Noise or Input
- Itis applicable only to ord ,
Order
low dimensionaj itn
2. Principle comp
onents analysis
Scanned by CamScanner
’ 6-Comp .) 3-53 Introductig Ki bey
¥ Data Warehousing & Mining (MU-Sem.
- de data into buckets and store average (sum) for each bucket.
Divideda
A bucket represents an attribute-value/frequency pair.
— Can be constructed optimally in one dimension using dynamic programming
i. Equal-width histograms
®'It divides the range into N intervals of equal size
ii, Equal-depth (frequency) partitioning : It divides the range into N in
tervals, CF
containing approximately same number of samples. .
iii. V-optimal : Different Histogram types for a gi
ven number of buckets are CON
and the one with least variance
is chosen.
iv. MaxDiff : After the sorting proces
s applied to the data, borders of the
"defined where the adjacent values hav buckets »
e maximum difference.
Example
1,1,5,5,5,5,5,8,8,10,10,10,10,12,
14,14,14,15,15,15,
15,1 15,18, 18,18,18, 18,18,
18,18,20,20,20,20,20,
0,20,21,21,21,21,25,25
,25,25,25,28,28,30,30,
30
Histogram of above data sample
is,
Count
Scanned by CamScanner
® Data Warehousing » Mi
MU-g6,
m.
- The goal of UNdirecteg a
Bn
;variable © j;
18 to Predi Ata a Mining
mini ist ahrction ata Mining
independent Sted, th M
and dependent 9 expt
.
Vari oad i 8 Ore structure in the data. . Mo AO t *g
ya
"
ss Categorization ofclusters 4
ables. no difference been mace tiikoran
baseg
o An n cluster
e cxample belonging toa _ "echniques is given below -
o Any exam ple m
ay belong ti“ingle chusy
overlapping. °F would is
3. Sampling r
y
- F
Sampling is used in reliminary
ling is: P j
IVestigation as well as final anal
- Samp 18 Important in ysis of data.
aiid ets bine z. data min; ng
mini as Processi:ng the entire data set
is expensive
Types of sampling
The objects selected for the sample is not removed from the population. In this
technique the same object may be selected multiple times.
The data is split into partitions and samples are drawn from each partition randomly.
Scanned by CamScanner
Ww Data Warehousing & Mini MU-Sem. 6-Comp.) _ 3-55 Introduction to Data
eee
¢ ormalization + In norn
Syllabus Topic : Data Transformation and Data Discretization, Normalization, Binge
e S + To transform
~
Scanned by CamScanner
E Normalization ; In normal; _
Example : To Tansforn y in
: _ l), apply
: Scaling by using mean and
standard ae. ( Min) 1 (Max-Miny
Wy when there are outliers):
= P
[
deviation ¢ Useful wh en min and max are oak:
"ear 8
| Vs cy “Mean)/
nown or
Se. ;
3. Decimal scaling
original data. The values are within a given range. ing a v value, of an attribute A from
erform mappin
Following formula may be anes a minA,new_maxA],
range [minA, maxA] to a new range [new_
Scanned by CamScanner
WF Data Warehousing & Mining (MU-Sem. 6-Comp.) _ 3-57
v= (¥v ~minA)(maxA-minA)
*
new_maxA-—new mj
v =
,
73600 in [12000, 98000] nA
v’ = 0.716 in [0,1] (new range)
2. Z-score : In Z-score normalization, data is normalized based
On the mean and
deviation. Z-score is also known as Zero mean normalization.
v’ = (v—meanA) /std_devA
Where, MeanA = sum of the all attribute value of A .
std_devA = Standard deviation of all values
of A
Example :
Sov’ = (-1, 0, ))
3. Deci
cal mal scaling : Based on the
m aximum absolute valu
pout is moved, This pr e of the attributes the dec
ocess is call ed as Deci ini
mal Scale Normalizat
ion
V@ = v(i/10* for the smallest
k such that max(Iv’(i)l)<1
Example : For the ra .
nge between — 991
and 99,
10‘ is 1000 (k=3 as we
have maximum 3 digi
t number in the rang
e)
v’(- 991) = ~0.9
91 ana v‘(99) = 0.09
3.15.4 Di scretization by 9
Binning
This is the data sm |
oothing technique.
~ Discretization by
binning has two
approaches :
(a) Equal-width
(distance) Partit
ioning
(b) Equal-depth
(frequency) Part
B g e itioning or Equ
oth th v e
is binning ap al
proaches are
Ziven in Sect
ion 3.12.4
Scanned by CamScanner
In Discretization by Histo
. era 'N divide
each bucket in smaller daty repres Dlation, th © data jint
buckets ard
pifferent types of histogram Store average
(sum) for
}. Equal-width histog
rams
2 Equal-depth (fre
quency) Partitioni
ng
s Veepeuaal
4. MaxDiff
Scanned by CamScanner
874,339 distinct va
lues
i
Fig. 3.16.1: Concept hierarchy
example '
‘
_ Syllabus Topic : Concept
Desorption -ANAE
Characterization
Scanned by CamScanner
418-1 Data Generalization
database from alow conceptual leve] higher ee 4 large set Of task-relevant data in a
Scanned by CamScanner
Data generalization is
©
p e r m e now
performed in two ways:
Gene
iy) get'a threshold f
Oo Attribute generalization.
— Data aggregation is performed by merging identical, Seneralized ty \ -
The threshol
re
numberd, ofIf
*
accumulating their. respecti.ve counts. iS
Typical threshc
— Presentation of the generalized relation such as charts or rules.
~ These two tec
3.18.2 How Attribute-Oriented Induction is Performed?
. ‘
control mae
control to furt
3.18.2(A)
18. Data Generalization 418 g(c) Example
It can be performed in two ways : attribute removal and attribute generalization,
This example sh
Scanned by CamScanner
ev Data Warehohousing & Mini
—= using & N ing
S(M
SU-Son, S
I. 8-Co,
Generalized relation threshot g co==.)3
a)
_ Seta thresh Ntroy Introduction to Data
Mining
old for 8ene
‘is ralizeg
Tlation,
Ube, ~ eth satis toe
the thres. * I Not then fu‘U ples in the Beneralj
rther &e
_ Typical neralj al
threshol
g value ZAtion izSed relation sho uld
hould be be less or equal to
range is Performe
from 10 d
to© 39 30,
~ Name : This attribute is removed as there are large number of distinct values for name
~ Gender : This attribute is retained as there are only two distinct values (M and F) for
gender.
Scanned by CamScanner
¥ Data Warehousing & Mining (MU-Sem. 6-Comp.) 3-63 Introduction ts- :
at
- Residence: As number of values for city in residence attribute is Jess than
value, it can be generalizes to city. Sy
- Phone#: This attribute has large distinct values so it should be removedj in genera
ine,
- GPA: Grade Point Average (GPA) has distinct numerical values but it can be ge “
into numerical intervals like (3.75-4.0, 3.5-3.75, ...... ), which in turn can be ony
descriptive values such as {Excellent (Excl), Very Good(VG),... ++}. The ati,
therefore be generalized. &
A generalized relation obtained by attribute-oriented induction on the
data ,
Table 3.18.1 is as given in Table 3.18.2. —
Table 3.18.2 : Generalized Relation
May 2010
Q.1 Explain data mining as a step in KDD. Give the architecture of typical DM system.
(Ans. : Refer sections 3.5 and 3.3) _ (10 Marks)
Dec. 2010 ’
Q.2 Explain data mining as a step in KDD. Give the architecture of typical DM system.
(Ans. : Refer sections 3.5 and 3.3) | (10 Marks)
May 2011
Q.3 Describe the steps in the KDD process with a suitable bock diagram.
(Ans. : Refer section 3.3) (5 Marks)
Dec. 2011
Q.4 What is data mining ? What are techniques and applications of data mining ? Explain
the architecture of typical data mining system.
(Ans. : Refer sections 3.1, 3.4, 3.7 and 3.3) (10 Marks
al
Scanned by CamScanner
_.,. Introduction to Data Mining
(5 Marks)
izati ;
'0n techniques for Data en a
P mining. (Ans. : Refer section 3,10)
13
oes ee
g Explain Data sections
(Ans. : Refer mining as3.5a step
and
in KDD. Explain the architecture of a typical DM system.
3.3) (10 Marks)
wy 2010 ,
Chapter Ends
Scanned by CamScanner
Warehousing
| Module 40 | a—___
! 9}. CHAPTER
various classifi
- * 5 Regression
|:
:
4 Classification, Prediction
8 . . .- |: \
§— i
Decision jsion tre
and Clustering . {
o
medi
u
—_ AAA Classifica
i Suppose a da
CH{Cy---Cn}
‘ Syllabus : oo ™s which tuple 0
Classification, Prediction and Clustering : Basic Concepts,
Decision Tree Usky
“7 Information Gain, Induction : Attribute Selection _ Actually divi
; Measures, Tree pruning, hi diction is
Classification : Naive Bayes, Classifier Rule - Based Classification : Using
: for classification, Prediction : Simple linear regress IFTHEN Ply 7 Je corti
ion, Multi ple linear regression Mods)
Evaluation n && Selection : Accura cy and Error models
measures, Holdout, Random Samping|| | 412.
Cross Validation, Bootstrap. Classifi
Clustering : Distance . Measures, Partitioning
Hierarchical Methods(Agglomerative, Divisive).
Methods (K-Means, _k-Medoits, How teache:
_ Ifx>=90t
: — Tf 80<=x<9
Syllabus Topic | Basle Conse = Wf 70<exe!
CO
P - reex
4.1__ Basic Concept : Classification
— Ifx<60 th
~ Classification constructs the classification model base
‘j
d on training data set and usinthg
model classifies the new data.
- It predicts the value of classifying attri
bute or class label.
- Typical applications :
41.3 Clas
o Classify credit approval based on customer
data.
1, Model |
o Target marketing of product.
- Bye
o Medical diagnosis based on symptoms
of patient.
fl
~ Th
o Treatment effectiveness analysis of patient based on the
treatment given. ~ Th
de
Scanned by CamScanner
. : Classiti i n & Clusteri‘ng
various classification techniques Hication, Predictio
Regression
Decision trees
Rules
Neural networks
ad Classification Problem
suppose a database
D is given as D=
Ca[Cy--sCm)> th
e Classification th, lays ast} and a set of desi
red classes are
which tuple of databa Pr ob ie ™ is to define the mapping
se D belongs to whi m in such a way that
ch class of ¢.
Actually divides D into equi
valence Classes,
Prediction is simi
lar, but may be
viewed
;
models continuous-valued functions, ie. as having infinite number of f classes. Pri ediction
» Predicts unknown or missing val
ues,
41.2 Classification Example
E D
Fig. 4.1.1 : Classification of grading
4
Scanned by CamScanner
2. Model usage
- Fo r classifying unknown
objects or new tuple usc the constructed mode},
with the resultant class label.
Compare the class label of test sample
g the percentage of test set samp,
_ Estimate accuracy of the model by calculatin tal
constructed.
are correctly classified by the model
_ Test sample data and training data samples are always different, oy
over-fitting will occur.
Example
Classification.
gu
__ Algorithms.-
a
1 -
“Training — —
_ Classifier
~ (Model) —
IF rank="professor ~ |
ORyear>6
THEN tenured ='yes’
Fig. 4.1.2 : Learning : Training data are analyzed by a classification algorithm
41.4 Differer
at
no
Tenured? {I
no
Yes
yes
Fig. 4.1.3: Classification : Test data are used to estimate the accuracy of the classification™*
Scanned by CamScanner
8-Com
<3 Class ification, Prediction & Clusterine
‘ f example + How
to perform Classifj Cation N task
re d ase for
ClassiT
fir
cation of medical pati.ents by
"Seen patients
Test Set
‘ Shenae! =
La unseen 5
-Unseen patient ® fe Patient S
Scanned by CamScanner
ss the val ues_
[i is used to asse Or
:
a given aly,
of an attribute that
ta A2
is Sample
assification Jabels. Ig ji
ict class have.
predic
- tion to Pre !
te Model ; pel!
Eg. if a classification
their
r o u p pa ti en ts based on predict the treatment outcom, ey |
E.g. G data and
would be a ilen,
know n me di ca l
is a
oa patient, then it ICtio
then
treatment outcome
classification.
ods
42 Classification Meth
en below :
Classifications methods are giv
ion measures, tree pruning
|. Decision Tree Induction : Attribute select
- Bayesian Classification : Naive Bayes classifier
Scanned by CamScanner
g pata Warehousing & Minin; (MU-Som, 65,
= Mp.)
7 ., Appropriate Prob ; 4.
yo) PP lems for Decision y,Dea tt
=cification, Prediction & Clustering
Fc
pecision tree learning is Appropriate tei ea Learning
F the Problems
pelo . t
having the characteristics given
Instances are represented by a fixed
“ ale, female) describ“das attribute-vatue
(e.g. male, Set of attripy
tates, (Cg. gender) and their values
if the attribute has small number of disjoint
there are on. cl ses (e.g true, f possi
ly two possible Clas Ossityble y alues (e.g. high,
medium, low) or
ven ; False
Extension to decision tree al gorithm also biitlee veal
) then decision tree learni
ng is easy.
¥ alue attributes (c.g.
Decision tree gives a class label to each instance of P= salary).
jsion tree methods itaset.
can be used even
tones (e.g. humidi whe training
ty is known for
. * .
- A leaf node is the last node of each branch and indicates class label or value of target
attribute
ision tree . : .
_- A decision node is the node of tree which has leaf node.or sub-tree. Some test to be
carried on the each value of decision node to get the decision of class label or to get next
tion and sub-tree. s :
Scanned by CamScanner
ion for play tennis
bel iow,
y ¢ en ni s = Yes is given
Pla
— Logical expression for ¥ (outlook = Overcast) v
= normal)
= sunny “ humidity
(Outlook
= rain A wind = weak)
—
4.2.1(C) Attribute Selection Measure
After Splittin
g T into two
s
ined as,
Scanned by CamScanner
Oy, For each attrib
“Oop . saiveil attribute,ute, €ach o
theattributef the °SSible binary ‘
N Bini (7,
—s
For continuous-valued attrjp, 78 SMallog: rts 18 consi dered
ple using gini ii index; Sach POssibje iia
*Plit-point mis cho en
— ee discrete-
node,
Consider the following dataset D, i
Outlook | Ty Ty
else | No -
High True i
High te ee
|
-———__| False Yes
BO
et
Rormal =|
Fase
p
“Pxes
a
p——_| False | Yes
hee
[Normal True | No
ee eh
| Normal Truc | Yes
|High [False | No
. 1
et
Normal . || Fajs
na
ee © | Yes
se
Normal | False
AS
Yes
Sunny
AIS
Mild Normal | True | Yes
overcast | Mild
wa
ner High True | Yes
| overcast | Hot Normal | False | Yes Q
rainy Mild High True | No
- The total for node D is : P
gini(D) = 1-sum (pl’, p2’, wa. pm’) (4.2.1)
Where, p1....n are the frequency ratios of class 1....n in D.
Inthe above example, there are only two classes : 1) Yes 2) No
Therefore
n =2
Scanned by CamScanner
¥ Data Warehousing & Mining (MU-Sem, 6-Comp.)
oo”
_ 4-9 Classification, Pei
Scanned by CamScanner
tion 4 Cluaterin
2 Waesification, Predic
patron (4.2.3)
sai
pro
5/ 14 gi ni (s un ny) + 4/14 gini (a
te as t)
vere
+ 5/ 14 gi ni (rainy)
*
split
4) +0 +* (5/* 1
gpPlit = (5/14 * 0.62 0.6244) *
.
|
1 =
Split = 0.446
;
at ge ne ra te s th e smallest gini spl ch os en to split the node on
.
~aante th gini sp li t va lu e is
tegorical.
believed to be ca
0) [n
ap te d fo r co nt in uo us -v al ued attributes
spoan be ad is se le ct ed fo r split.
gh es t in fo rm at io n ga in
_ The attribute which has thsee s,hi P and N.
as
assume there are two cl to cl as s P and n samples
mp le s be lo ng s
mples, out of these p sa
_ Suppose we have S sa
ass N.
pelongs t0 cl e in $ be longs to P or
io n, ne ed ed to de ci de if random exampl
rmat
_ The amount of info
Nis defined as
_p 2+) ,
I(p,n) == —-p—4=n pen p n log p+n
(P.0)
wi ll be pa rt it io ne d in to se ts (5), Sp, -- Sy)
ibute A, a set S
_ assume that using attr the expected
. ples of N, the entropy, OF
es of P and 1; exam
«ag S, contains P; exampl cts in all subtrees 5; 1s
informatio n eded to classify obje
ne
vo.+n:
BAS i=Ll ento
ly
| Entropy (E) : ed to as si gn a class to a random
need
am ou nt of in fo rmation (in bits) :
- Expected ma l, shortest-length code
r th e op ti ed
drawn object in S unde es re du ct io n in entropy achiev
Measur
fo rm at io n ga in i.e. gain (A) : du ction (maximizes
GAIN)
- Calculate in achi ev es mo st re
us e of the spl it. Ch oo se the split that
beca
Gain (A) = T(p,
0)-E(A) ,
¥s Data
: Warehousing & Mining Eh (MU-Sem. 6-Comp.)
cthachaiias Reotahbaanenlt _4-11 Classification, Prediction &o
RSS
{
Gain ratio should be big when data is evenly spread and small when al} data
one branch. So it considers number of branches and size of branches when Mg,
attribute to split. It seh, |
- (4,5 is a designed by Quinlan to address the following issues not given by ID3 : |
© It avoids over fitting the data.
o Itdetermines the depth of decision tree and reduces the error pruning. 7.
o It also handles continuous value attributes e.g. Salary or temperature. ;
o It works for missing value attribute and handles suitable attribute selection measure. 8.
o It gives better efficiency of computation. The algorithm to generate decision teeis 9.
given by Jiawei Han et al. as below : ‘
Input: ©
Data partition, D, which is a set of training tuples and their associated class labels;
Scanned by CamScanner
ta Warehousing & Minj
by, ¥ Da 2109 (MU-Sem, 6.
Comp)
“Ni ne galt = Classification, Prediction& Clustering
NChes A decision tree.
Method :
|, create a node N;
2. if tuples in D are all of the same class, C,
thes
| | =e N asa leaf node labeled with the class C:
3 if attribute_list is empty then. ‘
| -_ return N as a leaf node labeled with the majority
class in D; // majority
Scanned by CamScanner
¥ Data Warehousing & Mining (MU-Sem. 6-Comp.
© Decision trees provide a clear indication of which ficlds are most ins
prediction or classification. Porta,
— Because of noise or outliers, the generated tree may over fit due to many branches,
- To avoid over fitting, prune the tree so that it is not too specific.
Prepruning Class P: buys_¢
© Start pruning in the beginning while building the tree itself. ClassN: buys
o Stop the tree construction in early stage. Total number of
© Avoid splitting a node by checking the threshold with the goodness measure fli Count the numbe
below a threshold. : So number of ret
Ex. 4.2.1: Apply ID3 on the following training dataset from all electronics custo For age < 3(
database and extract the classification rule from the tree = with “ ‘yt
R= with
Table P. 4.2.1 : Training data of customer
Scanned by CamScanner
yg
: Ge Student Credit_rating | Class: buys. con uter |
P| a ee lt Yes ee
|
| 7 5 Low Yes Fair Yas =
a Low Yes Excellent No i
all || Low _ Yes Excellent Yes
a | Medium No Fair No
| 730 | Low _ Yes Fair Yes |
| “10 | Medium Yes Fair Yes
| ea [Medium Yes Excellent Yes |
Sal | | Medium No Excellent Yes
i.20| [High _ Yes | Fair Yes
40] | Medium No Excellent No
| gol :
”
class P : puys_computer = “yes
classN: buys computer= “no”
gotal number of records 14.
e n u m b e r of re co rd s w i t h “y es ” cl as s and “no” class.
t
Coun th
5 = 9 and “no” class= 5
1(p, 0) 1(9, 5)
(5/14)
il
log,
= — (9/14) log, (9/14) - (5/14)
= 0.940
y for age
Step 1: Compute the entrop
For age $30,
2 an d p; = with “no” class = 3
p,= with “yes” cl as s =
= 0-971-
Therefore , I(p, .0,) = 12,3)
Scanned by CamScanner
Similarly for different age ranges I(p,.n,) is
calculated as given below :
Table le P. 4.2.1(a): Se
2:
pe He j0n
ne
| age | [a] tom yore 09 Credit satin
| <30 | 2/3] ase" <
o971 consider ABS
[31...40 4,0 0
Vv
E@)= ¥ 23160) .
i= ptn
=]
AGE
Fig.P. 4.2.1
a.
Scanned by CamScanner
ww Dat
—<=—a$<Wa reho
—= using g Mining (Mu.
Sem,
step 2:
Scanned by CamScanner
_4-17 Classification, a
MU-Sem. 6-Comp.)
¥ Data Warehousing & 4 . o
.1 (c) and iin Hig
the va l ves from the Table P. 4.2
Calculate entropy using a
Bi
4 Ney 4 y 7 pata Warehou:
SS
= -
: rg n,)
E(A)
i=l ; ain Compute the
B (Income)
= 2/5 * 1 (0, 2)
+ 2/5* Il, 1)+ 1/5 *I(1, 0) = 4 For credit_rat
[Note : S..s is the total training set. a SEEN now's
Hence —: |
Gain(Sza. Income) = I (p, n)— E (income)
= 0.971-0.4=0.571 Similarly for
(ii) Compute the entropy for Student : (No , yes)
For Student = No,
P;= with “yes” class= 0 and n, = with “no” class=
e
Therefore, I(p, , n; ) = (0,3) =— (0/3) log,(0/3)— (3/3) 1og,(3/3) =
Similacy for different outlook ranges I(p; , n; ) is calculated as given below - F .
.
Table P. Llp
Scanned by CamScanner
For credit
. rating
= E ai .
: ith “yes : cl
ass s =]
and nh,
= With
“hy io”" Cl
ass =2
Therefor
e
K(pi
Pp
=
=
‘0;
stings Wa
) (2/3) log2 (2/3)
Scanned by CamScanner
* . i6-
=F Data Warehousing Mining (MU-Sem_ =
. 1 t
| 31...40| High Yes | — Fair Yes No Yag For credit_t@
it
Since for the attributes income and credit_rat rati ,
ing —_ p=aa with “y e:
buys_computer= yes, so assign class ‘yes’ to 31...40. Fig. P. 4.2.1(h) Therefore, IC,
Step 4: similarly for
Consider income and credit_rating for age: >40 and'count the number of tuples from ih
original given training set
Syyg = (age: >40) .
Table P. 4.2.1(g)
Scanned by CamScanner
~~ the num
ber of r
ecords Wi
yat th “y c
@ aber of FeCOTdS With “yogs
be Cla SS
$0 nu ; =o, nds,
“of jpformation gain = T(p, n) =_ . 1 Class
-
P +n [0g
) = 13 2) =
1(p® B48) tog, ays
Peak,
PH Mog, Der+n *
iapute the e ntropy PY for “red
it_rating
ot)
5) ]
) OR, (
all
Jang 10
@ credit_rating = Fair
\
= ‘
49 - with “yes class =3 and n= With “no” clnas
aw > 0
class — 2
mrerefore, I(p; ’n;) = I(0,2) =0
ee, we
Table P. 42.1¢) «i given below 2
hte tL kh A ht Lt
Calculate entropy using the values from
the Table P. 4.2.1(h)
klow: and the formula gi
ven
a: |
Vv
=
PI
ta”
+n
"
cu
be |
Hence
ee
0.970 -0 =0.970
Scanned by CamScanner
: 1 Classifica
_¥F ata warohousing & Mining (MU-Som, 6-G2M2) FET ENON, Prov,
w)
(High, medium, lo
(ii) Compute the entropy for inco me: FA pata warel
For Income= High, 6 xrracting clz
P;= with “yes” class = 0 and n,= with “no” class= ia
Therefore, I(p; +1) = 0,0) ==0 : pxamP
Similarly for different outlook ranges I(p, . m,) is calculated as given bejoy, page= "= 70
Table P. 4.2.1(i) page=" = 30
IF age = “3 1 *
IF age= > 40
2 | a5 B iuge='>
Ex. 4.2.2
Calculate Entropy using the values from the Table
P. 4.2. 1(@i) and the formula gj
E (Income)= 0/5 * I (0, 9 +35*a. 1) +265 5 “I, = 0.95) Ven by,
[Note: S.nisthe totaltrainingset. = =8 | eh eR
Gain(:
nei mcome) = I(p,n)—E (income)= 0.970— 0,95] = 0.019 AGE
srefore, ‘ S30
Ts
31.49
Gain(S.49, eis = 0.019 | *
Gai Student Yes
AGE
Student ns on
os
No
Yes aN
| | Savant Fair
No Yes | |
Fig. P * 4.2.1(a) : Decj .
No Yes \
Scanned by CamScanner
The weather attr;
They can have the race SUutlooK
Outlook = {sunny, Cnn 'NG valy.
a .
Temperature — hot, mil
Humidity = {high
> ye
wind = {weak, strong}
Sample data set 5 are -
. aan
|
Yes
No
| Yes
Yes
| i | Strong] Yes
| 12 | Overcast | Mild High | Strong] Yes
| 13 | Overcast | ‘Hot — Normal | Weak | Yes
| 14 | Rain | Mild High | Strong No
We need to find which attribute will be the root node in our decision tree. The gain
S calculated for all four attributes using formula of gain(A).
Scanned by CamScanner
Class P : Playball = “yes”
Class N : Playball = “n0™
Total number of records 14. att .
with “yes” class and “no” class.
Count the number of records
class = 9 and “no” class = 5
t
Scanned by CamScanner
Prediction & Chus
Gain (T, Temperature terin
:
Gain (T, Humidi )= 09
ty) = 0 -
. Gain (TT, Wind)
gutlook atoms = 0.04%
the highest gain
, sq itis useg
gquribAsute Ou
in tlthoo
e kro ha c. ~ 8S the decision
s only va
lues “sunny aaa
va Ove
soot node has three branches : rast,
, rain”, the SY Ovemact Fain
Fig. P. 4.22
& As pe
attr
l ibute outlook at
root, we have to de
cide On the
Femaining three
+ ara
Consider outl attribute for sunn
ook y
= Sunny and co
0 unt the number
of tupl es
from the Original give
hhh .
n tr
aining set
Ssumy ; = {1,2,8
,9,11) }= = 5 (From
Table P. 4.2.2, outl
% a
: k = sunny)
oo
Table P, 4.2.2(b)
a ee
Day | Outlook | Temperat | Hu
ure midity Wind | Play ball
a
] Sunny
,
Hot
High
Weak No
2 Sunny | Hot
.:
High Strong | No
8 | Sunny | Mild
I a
9 Sunny | Cool Normal | Weak | Yes
11 | Sunny | Mild Sorina _| SHDN] Yes
Note: ReferTable P. 4.2.
2 | of No tuple =3
Ia given Total number of Yes tuple = 2 and total number
— (3/5)log,(9/5) = 0.971
___1(p, n) =1 (2, 3) = — (2/5)log,(2/5)
ef
Scanned by CamScanner
ng & Mining (MU
mu-Ser econo) ‘ By patawt
¥ - warehous! a Min! ste B )is calculated as given beloy, 5 } w Entropy using, the
ranges
outlook ran MPI" 2c) caicolel®
Similarly for differen Table P. 4-2- ao :
aa
P. 4.2.2(c) and -
-» the values from the Table (C) and the fom, . Hence
Calculate Entropy usiné Gain(T,,
below : 1
. : ute the entropy
E(A) “5 Beto gan COmP ire
For wind =
B Temper) = mes 2) + 2/5* ICL, p= with “yes” class
Table P, 4.2.2(d)
Scanned by CamScanner
=I
3 n)
Weak | 1/2] 0.918
Strong | 1 | 1 1
Calculate Entropy using the values from the Table P. 4.2.2(e) and the formula given as:
y :
Scanned by CamScanner
4-27 —_ Classification, Pretoria,
Gain(Tam TOM
Gain(T nay?
wind) = 0«+eon below Outlook = “sunny”, Hf
as
0 " jder temPS
in: therefore, 11S ginal given
Humidity has the highest £2!
| ,
Sunny gyercast er ,
4 __
[amc | Day
oes
High Normal
Fig. P. 4.2.2(a)
_
5
prs
ie
Step 3: had
Consider now only temperature and wind for outlook = Overcast and count theMn, 14
tuples from the original given training set Consider thi
T = {3,7,12,13} = 4 (From Table P. 4.2.2, outlook = overcast) temperature and
Table P. 4.2.2(f) ClassP: |
ve ClassN
a, For Winc
unny Rai e
Ove ain P - = Wi
Yes . its
| - Win
x \ .
. e Pi = will
Be
Scanned by CamScanner
ClassP: Playball = “yes”
Class N : Playball = “no”
. al
Scanned by CamScanner
on, p "eh7,
e Classiefincaeti
e e e n me
9.) _— AO eene
g-comp-)
2
Mini= nging (MUN-Sem.
sing & a
*
! ousing &
YY Data Ware
- n,) = 100, 2) =0 1p, is calculated as given below, .
Therefore, I(P;+ i?
outloo k ra nges
Similarly for differem Table P. 4.2.2(
h)
on 7 |
4 (hh)
the T. ‘able P. and th
2.2
values from .
I e entr
Calculat opyy Uusing the
trop qherefore,
below : y
7 2 wind bas th
E(Wind) = 51, 0)+5 1(,2)=0
Hence Gain(T i Wind) = I (p, n)—E (Wind) = 0.970 ~0 = 0.979
(ii) Compute the entropy for Temperature : (Hot, mild , cool)
a 2/1] 0918
Cool 1
1 1 ;
Calculate Entropy usin th —
below & the values from the Table P. 4 2.2(i) and the formula git
V on
E(A) = > ota , E © decisio,
<, P+n! @n) a
= Bou
Scanned by CamScanner
Gain(T,,:, T
emperature)
= 0.019
Gain(T,,,,
Wind) =
0.979
hi in;
Wind bas the highest gain; therefore, it is below outlook
00k = “Tain”,
Sunny
reast Rain
High Ni al
@ [ay
A N oo Weak
Fig. P. 4.2.2(c)
Therefore the final decision tree is -
iy Net
Fig. P. 4.2.2(d) : Decision tree for play tennis
| Ite decision tree can also be expressed in rule format
| IFoutlook = sunny AND humidity = high THEN playball =n0
Scanned by CamScanner
ny AND humidit
IF outlook = Sun —
TF outlook = overcast THEN playball y fie i: 4.2.4:
Count thi
So numb
So Infor
IQ,
Step 1: Co
For Inc
Pi = wit
Scanned by CamScanner
& Minin MU-5,
ot
Using the for,
ag =
4S]
4%.
te bt
A
Soln.:
eee
‘
al
Scanned by CamScanner
WF Data Ware! n,) is calculated as given below .
es I(Pi +
Similarly for different 1nc0m° el p. 4.2.4(a) |
Income Pi a
Low - 0} 3 y
Calculate entropy using the values from the Table P. 4.2.4(a) and the formul, Bing I Income 4
below : " gsed 25 the de
? * i Now
oung |3/ 1] 0.811 of tuples |
Medi ”
|edu | i
3/2 | 0971
ee
rented
Pl
below : PY using the values from the Table P. 4 “ta
~ ‘+ 4.2.4(6) and the formula 8”give aSo t
Vv
Ex. 4.2
Scanned by CamScanner
nov (MU-Som.1. 6-Comp,
g Mining (MUS
_. Classification, Prediction & Clustarine
E (Age) = 4/12 * 1(3,1) + 5/12°1(3,2) + 3/12*1C4,2)
= 0.904
ae
: trainin sot. hai
tne 10'2 8
Aue gis
= 0.979-0,904
= 0.075
si? .
33
. we have used income at the root, now we have to decide on the age - e.
attribut .
since
inal given
ider income = “Very high” and count the number of tuples from the orig
Cons
j anil sel
Sveryhigh = 2
, so directly give “yes” as a class label
} Since both the tuples have class label ="yes”
tow “very high”.
Similarly check the tuples for income= “high” and
itome = “low”, are having the class label “yes” and
“ented” respectively.
Scanned by CamScanner
|=
a
a
Dashed | No | Triangle
jal
Gmen
IN| ola lalaln m
Soln. :
Class N: Shape = “Triangle”
Cais tags aie
Total number of records 14
Count the number of records with “triangle” class and “square” class
Scanned by CamScanner
rds with “tr:
ber of recor’ With “triangles Class
P(square) = 9/14
P(triangie) = 5/14
= ~~B_ toga =P . h ,
formation gain=I(p,n) P+n
>
so!
toga —D_ i?
I(p, n) = 19 5) +h Ptn
DPtn r
’ =
— (914
14)loga (iy
4)
the entropy PY f for Color ; (Rea
1 comp’ ute
’ Breen, Yellow)
Fig. P. 4.2.5(a)
Scanned by CamScanner
4-37 Clalassification, Prag 1
| . -
ris
. AzS(a) and ction
MU Sem. sa
Data Warehousing & Min
:: nous
:F ww
and the form g pa wi
the Table P. 4.4. ‘
ing the values from
Calculate Entropy using
i My
below :
E
Vv
E(A) “o ne
££ ree I (pis ni)
i=l
E (Color) = 5/14 * 1(3, 2) + 5/14*1(2, 3) 4 4/14*1(4, 0)
g
[Note: Sis the total traininset. ~
Hence _ Gain(S, color) = 1 (p,n) — E (Color) = 0.940 — 0,694 ~ 0.2%
Step 2: Compute the entropy for outline : (Dashed, solid)
Similarly for different outline values, I(p; , 0;) is calculated as given below :
Table P. 4.2.5(b)
Outline.
Dashed | 3 | 4 | 0.985
belowCalculate
« ues from from the the T; Table
Entropy usingg the values P. 4.2.5(b) and the formula fie
Therefore
,
Similar] y for diffe
rent dot values, I(p; , n;) is calculated |as given below
t
.| Stepa: a,
a
j Te
a Consia
Scanned by CamScanner
Gain(S, color) =
0.246
Gain(S, outline) = 0,137
Gain(S, dot) = 0.048
As color has highest gain,
it should be the root node.
Send: As attribute color is at the root, we have to decide on the remaining two attribute for
red branch node.
Consider color = red and count the number of tuples from the original given training set
Scanned by CamScanner
| 2 | Red | sotid
|3. i Red | Solid Yes | Triangle
| 4. | Red | solid No | Square
Calculate entropy v
4
|! : pe
.
(Outline) = 2/5 *1(1,1)43/541(2,1)
_
= 0951
[Fria
= 0.02
eee
Scanned by CamScanner
of
the entropy for Dot -
ee {0,ye5)
. italy for different Dot Values
Re
(f) and the formmy
B(A) = > Btn la Biven be}
ix] P+n!(,n,)
E Wor ~ 35 *1¢
=0 q
Heace
*
a-")
4
ae"
oC
shthee c
Color = “Reg |
al
tuk
ples with Dot == “y“yeses” from Sample S..4,
ra”)
=|
eo
it has class trian, io
¢ tuples with Dot = =
“no” from Sample Si |
eg , it has cl} nai A
So the partial tree for re to
d col ~«
yak ep
ar =
a")
J
7p a?
TY
Ee la
Pe
Scanned by CamScanner
FF data Warehousing & Mining (MU-Sem, 6-Comp.) _4-41 Sat eation. Precicten
eo
Step 5: Consider Color = Yellow and count the number of tuples from the op;
Bina
m)
training set.
Table P. 4.2.5(g)
Lf Attritut
Ye!
g
?
[2 | Yellow | Solid | Yes | Square
/3. | Yellow | Dashed | Yes Square
Check the tuples 1
[4 I Yellow | Solid | No | Square
tec ec
yn oe
Check the tuples 7
Scanned by CamScanner
Rae
chock the tuples with Outline —= “dasheg»
ere
teh
Eup
Pr Te
Ry
Pits et te ae
] 426: Apply statistical b
atay
BS
[ 2 | aim Tall
| 3 | Maggie | Female | 1.9m | Medium
Y
M
a
7 | Soln. :
P(Short) = 4/9
p(Medium) = 3/9 . .
PCTall) = 279 Lapae
lly Actual
e es as given below :
Divide the height attribute info six rang! aneaies a
1 1 2
[ $4? 4
Ez af t e
|
E 2 {| ‘o-1 too |
0 o
L 0 2 0 0
[ | [1.9,2.0] 0
— 0 1 0 0 12 =a
[ | (2.0,infinity | 0 0 1| mal
0 1n =
Use above values to classify new : Male
tuple as a tall : > | Fi emale
Consider new tuple as t = {Adam,
M, 1.95m}
secs,
P(tiShort) = 1/4*Q=9
P(tMedium) = 1/3 + 0=0
Scanned by CamScanner
— P(Short In.
Similarly p . "= if
(Medium) Be ote * 15
‘i The question is what transportation mode would Alex, Buddy and Cheery use?
Ns
Scanned by CamScanner
ousing & Mining (MU-Sem. SE Odictn,
re ’ ¥ Data ware
Scanned by CamScanner
Gain (S, Car Owners ™
lead
le Valu » he Toot node has
Since im all the attributes of Travel c three branches
ooika
, 50 assign class ‘Car’ to expe nsive. Ost (S\V/Km 5
NT
tandard. =
Ro
ne
Ft
st
wat
3
Fig. P. 4.2.7
] Cons idm
travel cost ($)/Km = Cheap and count the number
Bien of tuples from the original
Set
Scheap = 5
Scanned by CamScanner
Table P. 4.2.7(a) g Calcula
==
— te ent
[ Attributes | |
| Gender | Car ownership | Travel cost ($)/kmn | Income level Transorage
I Mate 0 Cheap __Low oy
[we
| Femate |
[1 1 [ow | ete | Oo
Cheap aa
[Fenate| 0 crew | tow | ag
te —
[rae| ia
Cieip | Medium [agoe :
~|_ y Comte
For Car ow
p; = with “t
Therefore,
1(P; » Gi, Mi)
= =
eee 0.722 —
Compute the entropy for gender : (Male, female
)
For gender = Male,
P; = with “Bus” class = 3, q, = with “Train” class = 0 and n, with “car” class =0
Therefore ,
1,40) = 1G,0,0) = — (3/3) log, (3/3) — (0/3) log,(0/3) — (0/3) log
= 0. ,(0) mam
Similarly for different genders
I(p; , q;) is calculated as given
below :
Table P, 4.2.7(b)
Scanned by CamScanner
caleulate entro, PY usiUSing the y, alueg
from
E i = p+ I
— =35.,. 0 "™
Hence ,
Gai
noth (Scop Bender) .
(i comp een tropy for Car owne
rs, .
for Car ownership = 0, hip ; (0,
ne with “bus” class = 2
Therefore,
rT
haha
Li
PPE
ey
Pt
Ua SUES
EAGER ee
= 0.171
Scanned by CamScanner
WF pata w & MuU-Som. cameo —eretlction4 | ;
fhousi
(il) Compute the entropy for Income level :
(Low smedium, high) ee
For income level = Low,
P, = with “Bus” class = 2 ,g; = with “Train” class = 0 and n, with “car” class 0
Therefore, E
1,,.4;+0,) = 1(2,0,0) =-(2/2-)(0/l2)loog,g(0/,2) — (O/2)log,(0/ C oF
(2/2)2)
= 0.
Similarly for different outlook ranges I(p, , q;, 0;) is calculated as given below
_ ‘Table P. 4.2.7(d) :
0 — bn
0.918 |
0
bet Calculate Entropy using the vafrom
luthees
Table P. 4.2.7-7(d) and the formula
gives
low : ;
E (A) z ey
ll
ptn
||
;
Scanned by CamScanner
Fig. P. 4.2.7(b)
Scanned by CamScanner
Classification, Procictg
——— el
: Bayesian Classification
- Naive Bayes
Syllabbius —
3
4.2.2 Bayesian Classification : Naive Bay es Classifier ID)
P(hl
ctical difficu
4.2.2(A) Bayes Theorem
pequire initi
— _ Itis also known as Bayes Rule. .
— Bayes theorem is used to find conditional probabilities.
significant ¢
ae
- The conditional probability of an event is a likelihood obtained with the ae 42-20)
information that some other event has previously occurred.
~ Zane: 7
~ P(XTY) is the conditional probability of event X occurring for the
event y hic
already occurred.
a
Scanned by CamScanner
a”
pD) : Independen, Probabiy;
difficulties Biven D
ca
ao
quire initial Knowledge Of many py,
Ff ificant computational cost
sign
abilitics,
ae) Naive Bayes Classifier . Example
Nisexample,
S| false | Yes
High false Yes
ee |
[Normal | tase Yes
Normal i true No
5
is Normala
unn true Yes
Yj mi
mi
d | Hiigh
Sunny
false | No
“tally coo] Normal false | Yes
Rain mild Normal false | Yes
vility | Sunny mild Normal | true Yes
| Overcast mild High true | Yes
| Overcast hot Normal | false | Yes
|__Rain mild High_| true | No
Given a training set, we can compute the probabilities as follows : a
~ TheClassification problem may be
formalized using a posteriod ee :
HCY) is the probability that the sample tuple Y = <y nate is of class C.
~ Assign to sample Y the class label C such that P(CIY) is maximal.
Scanned by CamScanner
y-som.6-comp)p)_4559
:53_— Classicaton, Pray
eet ,
— lata, calculate
the P probabilities for play tenn:
~ From the above given sample d Is aay
~ P(tue! Yes) . P(
Yes)
Scanned by CamScanner
24. 3A, 34, us
P(YINo) - P(No) =
P(sunny INo
i+
*P(No)
= WSUS. ays,
hoose the class so {hy
jaw © at it
ili be classified as no, st mi Zes
pan wi (don play) this: Probar:
P(RedINo) = 2/5
P(YellowINo) = 3/5
Scanned by CamScanner
P(SUVIYes) = 1/5 P(SUVINo) = 3/5
P(SPORTSIYes) = 4/5 P(SPORTSINo) =2/5
neofsovemia|
Ex. 4.2.10: Consider the
following data set S, which con
of sunbum : tains observatio
Sarah | Blonde
Ligh No Sunbumed
ee | Blonde Tall | Average| Yes None
Alex | Brown Short | Average | Yes
Annie | Blonde | Short
None - Anunseen sai
Average | No Sunbumed
Average | Heavy PCRISu
No | Sunbumed
Tall | Heavy
:
No None
Average | Heavy | P
No None
—Stot_| tight | Yes [None
Sa
Since 0,032
Scanned by CamScanner
7 lS
?
U-Sen,
se
‘
y
Sees Stabirned) Sas a
a 27, Pong
= P
®
P(Brownl Sunbumeg : F a
) KO (Blonde Noney — Vs a
PBrowntny
P(Averagel Sup ur
b
_ .
ed) = 9/3
-
PCFall p ¥
t
! Sunburn _ A
P(Shorti Sunburne
“D=0 PCTali
—aEel None) = 9
d) =1/3
Ses me
— P (Short! Non
PLight! Sunburned)
— 1/3
P(Nol Sunburned)
= 373 P(Nol None) = 2/5
P(Yes| Sunburne
d) =0
P(Yes| None) =
3/5
P(Sunburned)
= 3/8
P(None) = 5/8
. Anunseen sample X
= < brown , tall, av
erage ,no >
P(XiSunburned) *P(Sunbune
d) = P(Brownl Sunburned)
- P(talll Sunburned)
- P(averagel Sunburned) - P(
Nol Sunburned)
- P(sunburned) = 0
= 0.032
ice 0.032 > 0, our example gets classified as ‘NONE’. |
Scanned by CamScanner
F bata Warehousing
Predict a class label of an unkno
wn sample using Naive Bayogj
Ex, 4.2.11:
aset from all electronics Customer day
on the following training dat
s com, |
Age | Income | Student Credit rating | Class: buy
<30 | High No Fair No
Excellent ks
<30 | High | No
31.40} High | No Fair Yes
> 40 | Medium| No Fair ioe
No
.
Yes
ae
Fair Yes |
>40 | Medium Yes
a
P(yeslyes) = 6/9 an ©) = 4/5
Credit_Rating
P(fairlyes) = 6/9
=ose) ]
.- jo unseen sample X = < age= “< 3qp
egy
? Income= * * ”
- Since 0.028 > 0.007, Therefore the naive Bayesian classifier’ predicts Buys
computer = “Yes” for sample X.
&4212: Using Naive Bayesian classification on the following given training set, classify
the unseen tuple (Refu
Scanned by CamScanner
No Married
No
P(No) = 7/10
P(Yes) = 3/10
Soin. :
= 38x0x0=0
Since P(KINo) P(No) > P(XIYes)
P(Yes)
0.0466 x 7/10 >
0 x 3/10
Therefore P(NoIX)
> P(YesIX) > Class
=No
>No
Someti
Yes _| NoNon-n-mamammmmala’s
Yes Non-mammals_
No
——Ne__
ee |
Yes | Mamma
Scanned by CamScanner
P(Al M) PM) == 0.05 x5Zh5 = 0.0175
Scanned by CamScanner
Ww Data Warehousing & Mining
Classification, . p = edictio,
aa
|CREDIT HISTORY DEBT | INCOME}. RISK _
High | 15 to35 High
| Unknown
| High 0to15 High
| Unknown
| Good High | otots | High
| Bad Low | Over 35 | Moderate
| Unknown Low | 15to35 Moderate |
Scanned by CamScanner
PiVery High] REQUicgy ID)
= 9
P(High| REQUES1p)
T ~ 175
PiMedim| REQUEST ip) _ te
Plow REQUEST iD) « 9
0 P|
Scanned by CamScanner
Bobby ‘se 1.85m Medium
Medium
= 8 Rule-based
Rule-based
Tall
= 3
than superv:
Prepare the probability table as given below :
Learned me
phere es
P(FiMedium)= 6/8 P(FITall) = 0
FeMediam) = 2/8 P(MITall) = 33
Scanned by CamScanner
rehousing & Mining (Mu
pats WE
Scanned by CamScanner
PF pata WarehousiD 5
Neovers~
1D L.
= number
n § (i
: cas’“ goed . “
tionon.
.
neaed
then in it dat
tuples conflict resoluluti
IDI 1 == number of ved ; geri. ng rule Which .
eu
4 ack appr
_ ig
trigee $0 give hi ghest prior ity to that tr _
- Ifmore thano ne rule is der. hay go
.
ofr
_ Based on size, it has io pulZ¥ se
i ute tesst. ing of the rules. Rules are organj , =
maxii mum attrib zed batey —
sed on the ord eri ae op inion
. ‘ ape
Make the decision list ba tak ing B exP ot
- some measure of rul e qua lit y ot by
Most comm
43.1 Struct
excellent ; =a
7 - Regressio1
a of these re
_ age
2 =“ < 30” 30” AND student = Hapaed?
yes THEN buys_computer = “yes” 1, Lin
Scanned by CamScanner
warehousing & Mining
= 9 (MU-gpe m,
pa a EC
=Lomp,
ner Classifi
AA other Cation Methods l.
i .
nearest neighbor classifje,
? case-based reasoning
etic algorithm
Gen
’ govgh set approach
Po gs Topic
Topic : : Predicti
Prediction -
Linear R
————— esression,
Multipie Linear Regression
prediction
43
continuous-valued functions.
_ Most common
.
ly used methods for prediction is Tegressio in.
Scanned by CamScanner
4-67 Classification, Prediction 2 ous
waren
See)
comp.)
ata Warehousing
¥F Data Wa’ & MIDIS. us
.gom. &
228
——— : oh
Y= m+y%1+ nghet Taha + ont 1X + y
gression ch we are trying (o predict
Multiple re
t varia
Ye The dependen
Where:
xX = The inde
penden t va riable that we are using t0 Predict varigy).
m = The intercept
n = The slope
ai re ression
residual.
u = Thereg «differentiated wit4 h subscript
i
ed numbers.
regressions each j a
variableb l e i
— In multiple
finds ga Mathe
nt ables for prediction and
f random vari
- Regression uses a grou lp hip is depicted in
form of a s the
relationship between them. This relati
i es all the po: in ts in the best way. ne.
(linear regression) that approximat
odity, interest rat,
Regression may be used to determine for e.g. price of a comm
ri or sec tors.
industtries
price movement of an asset influenced by indus
gi 43.3 Multip!
4.3.2 Linear Regression Multiple lit
Regression tries to find the mathematical relationship between variables, if it is a g1«: It uses twe
line then it is a linear model and if it gives a curved line then it is a non linear model. dependent
Scanned by CamScanner
2
terest Tat Fig,
> the
ABs Linear regression
a Multiple Linear Reg
ression
~ Inlog linear regression a best fit between the data and a log linear model is found.
~ Major assumption : A linear relationship exists between the log of the dependent and
o error independent variables.
~ Log linear models are models that postulate a linear relationship between the independent —
* fF + 7 4 t
mee Variables and the logarithm of the dependent variable, for example :
log(y) = a + 41 Xy + A Xp «+ + Ay Xn
variables and
Where y is the dependent variable; x» i=l, -N are independent
™
(8, 0,...N } are parameters (coefficients) ofthe
Scanned by CamScanner
4-69 Classification, Prodict,
On 2,
muU-Sem. 6-CompP.
BF Data Warehousing & M ised to analyze categorical data te
widely | Pre;
- For example, Je, 18log i.linearIn models
this case,aF° 1”
the man reason
Teto transform frequencies ;. “4 7
a contingency (able. log-values is that, pre
dependent var. oy ode!”
‘ie sew fhe 4 Tables opab’
probabilities to. their TB the relationship between ic ansformed rs it, P
correlated with cach other, riables isa linear (additive) one. Per valida
pendent va
-ariable and the inde { ta cz
ee,
d s fecti
L Ss ycces ;
4 n and Selection
; nif
to estimate the accuracy of model.
- Validation test data is very useful
‘ 4.
All of them recognize
— Various methods for estimating a classifier’s accuracy are given below.
&
based on randomly sampled partitions of data :
| poss
© Holdout method i
o Random subsampling eee
o Cross-validation
o Bootstrap Actu:
~ If we want to compare classifiers to select the best one then the following methods
ap
used : : c
Oo Confidence intervals,
Scanned by CamScanner
Tep set isa sect of instances that
ies (coy, ta a
1 I's performance Is Svaluatey
able, Mts) as mie pility OF SUCCESS ON UNknow, dh . ol
cd gq are © Sr
© tain)
ao ita. “ee
jigation data is used for Parame Sting j Protess, The
* |My eMtimates
va can be the training dat ler tun thy
Yin
Oras é a
, i"
fa
tion EError : Mod ubset
:
of Wining om!
daty
1 the
© leat d
aty, Va lidat
generalize cl error On the lest d ion
ata,
2 Instance (record) class js Predicteg ie
;: seq instance cllass is pred rectly,
predic tedd Mcor
icte ; rectly,
nfusion matrix : It is 9 Useful
:
rent classes,
tool for an alyzi‘I
er tu ples of diffe nng ho W
well YOUr
—
Classifier
wwe have only two way Classifieg can
3 possible which are given tion then Only
below in th four Classifj
¢ form of a Confusion eae oucomes are
Class
s e n Predicted class r
eeLab
e el Cc . a
Se ee C, Sa
a
=' € Positi—ves (Tp
(TP) | False Negatives
; (FN) | p
18 are a False Positives (FP) | True Negati
ves (TN) | N
Total Pp’ N’
All
- TP:Class members which are classified as Class members,
Scanned by CamScanner
enousia:
ye
meas eee
iJ J
i ‘ santified-
are correctly iden Sensitivity = TPP . ‘ ed
te which is the proportion of negagi, weigh
recognition *4 © tupy Fo * -
6. Specificity + Tue Negative “yp recall 2° a
are correctly identified. cacity = TN/N
specificity f test set tupl
‘ Ayaan accuracy or recognition r ate : Percentage of fest set tuples that are te where B is aM
Jassificd.
classi Accuracy = (IP +TN)/AL 3 classifiers c2
OR speed
. TP+IN busthy
Accuracy = “p+N na ReScalabil1
itz SLEEN
Error‘rate =PN ———=
9. Precision
pasive. : Percenta ge ofofeen
tisameamne tuples which are correct] y classified
i as positive
iti are actl 4.4.2 Hold
; Precision = —2PI__
= TTP + IFPI Pesce
NS forte
.» Recall : Percentage of Positive tuples which . ont
measure of completeness ch the classifier labelled as positive. It!
Recal] = —/ZPI
ITP + [ENI
To trait
Use test
Scanned by CamScanner
F
Precision a
X Prog: a
'eRarjty, e Unie : _ weighted measure of Precisio, * Ttitlo
Sionns
+ RececCal]
ee] x Re .
8 ‘ wee as to precision.
1
c
e
rds) li - Re-substitution error
wcities rate is a Performance
: m,
fasure and is “quivalent to trai
ning data
VAT - pre
Itisferdiff
able,icult to get 0% error rate
: but it can be min
minimized, so low error rat
e is always
Syllabus Top
: Hol
idou
ct ==
actual 442 Holdout
|
~ Inholdout method, data is divide
d into training data set and testing
13 for testing, 2/3 for tr
aining). data set {asually
|
le Total number of examples ;
Training Set
Fig. 4.4.1
To train the classifier, training data set is used and once the classifier is constructed then
Use test data set to estimate the error rate of the classifier.
a a
Scanned by CamScanner
Classification, Prediotc ;
; 6-Comp. 4-73
na
WF data Warehousing & MINE we dif th the test dar, iy
is constructed and if
|
bett
Oe er mod el my
is more than t ste
_ If the training
the er ror estimalcs. é
some ¢ a
more accuralc ative. For example, Asse, “,
j be repre sent
with no instances at all,
_— Problem : The aon ee v eeeen
ith very few mista .
nin
represented Wi : thod which ensures that both trai & and testi,
_ 38 the metho ;
Solution : stratification Bt
class.
have equal number of samples of same
ng ee
— Syllabus Topic : Random Sampli
4 |. |
- For each data. split we retrain th m scra tch wit h the traini _- Theav
sifi er fro
esti: mate B; with the test examples, e clas training examples ui |
- over:
The het all accuracy 'Y j is calcula ‘
cac ai ted by taking the average of the accuracies obtained fron
) - Strati
,
.
. Perfo:
‘
E= Kl .
Da E;
: - Strat
i
i=]
eke ee:
ee Cross-Validation
| — ———— —
;
Oo '
. “
———
.
Validation (CV)
ij‘| 4,4,4 Cross-
i
Avoids overlapping test
sets
Scanned by CamScanner
& Mini z “i ais .
ng (MU-Sem, 8-Comp ) es
Pe) warehouse
i 4-74 Llassitication
q ¢ -vyalidation
,-fold a s Pradiction a Clustering
tep : Data is split into k SUD
sub:Sel
s of equ
: T ize
reat ctexamen ny by random Sampling)
=
Experiment 2 Training exampioa
a
Tast examples
Fig. 4.43
ese d a : Each subset in tur
n js used for testing
and the remainder for
a advantage is that all the SaRM traini
ples are used for both traini
ng and testin ie
error estimates are averag
_ The ed to yield an overall error est
1
imate.
K
°
E=xX x E,
nple 1=.
9 Forevery experiment, training uses N-1 examples and remaining example for testing.
|_ Theaverage error rate on test examples gives the true error.
mples and ; N
1
E = N ba E;
i=l ._~
ined from
- Stratified cross-validation: Subsets are stratified b efore the cross-validation is
performed,
Scanned by CamScanner
aq (MU-S0M 6: CODE —Sttion2
F pata warehousing
ng a = =<. Gh :
a ———
Experiment
Experiment 3 :
Experimen
Fig. 4.4.4
stesteriring
opice : : CluClu
r antifi
be ide
~ Syllyllabusabus Topi
_ - Biology :
4 What i . NS aa asseeibs
5 hat is Clustering ?
- Librari
Farle
4.5.1 What is' Clustering ? er) ordering
' a ae
> (MU - Dec. 2010, May 2012, Dec. 2012, De®
identitic
- Clustering is an unsupervised learning problem.
City-p)
in *,
que used to place data elemess
Clusterini g isi a data minvosing (machine learning) techni
.
_—
can be
related groups without advance knowledge of the group definitions.
are called asclustes
- It isa process of partitioning data objects into sub classes, which
Scanned by CamScanner
Ne fs contains data objects Which have ig
Bh inter ¢ MilaSION, Predict
6
i gAv ee
now this with a simple raphical ex
ample : W intra
Similarit
y,
~ Biology : Clustering can also be used in classifying plants and animals into different
classes based on their features.
* Libraries : Based on different details about books clustering can be used for book
ordering,
Insurance : With the help of clustering different prongs of aelea ea.
Identified. For e.g. policy holders with high average claim cost or identifying so ,
. City-planning : Using details like house type: geographical locations, groups of houses
Scanned by CamScanner
Data Warehousing & MI
can also be used
to identify dangerony, ,
Cl us te ri ng ‘3 ‘ONey
udies 1
Earthquake st
es.
earthquake epicentr groups of similar access Patter,
find
: Cl us te ri ng can be used [0
WWW cuments, : Usin,
be us ed for cl assification of do
data, It can als o ah
Requirements
y are :
eme nts tha t a clu ste ring algorithm should satisf
The main requir
Scalability.
s of attributes.
Dealing with different type
itrary shape.
Discovering clusters with arb
s,
wledge to determine input parameter
Minimal requirements for domain kno
Ability to deal with noise and outliers.
Insensitivity to order of input records.
High dimensionality.
- Because of time complexity, it creates problem to deal with large amount of data ‘a
- For distance based clustering, the effectiveness-of method depends on distance definitns
Scanned by CamScanner
worohQZe i
Scanned by CamScanner
sation,
2. ‘ Dissimilari arity
matrixi or by object structure eee a
Where a
0
d(2,1) 0
d3,1) 4.2) 0 9
ee
d(n,1) dm2) ~~ ase O
Using
Scanned by CamScanner
fication,
Clustering jlarity
i /Simi
i ilarity Metrj i: Ieee
.
is into
pis which is typically Metric 1a Ar;
su
nc
parate “quality” ‘
ere i 5° ee maaltty fnetion that Meagy
definitions ista, nce function,
of dista r C8 the« © distan.fee
ii
ean ¢, orical,
cates! i 1, ordin
i au land tio " n ec Ustuia Mn] Neay a“
*
te nolits should be associateg
. Of a hu
With ditt,*tiables ¥ Very 4iftereng j “ ‘ter,
v.
Cte, €
: welantics: nt Variables iT Met Val ac aler,
b
‘ hard to define “similar enough” epi
is
pe a
f data in clustering analysig ;
terval-Scaled Variables
{ Ini scaled va
riab les are Continuous
ement Measur
- : weight, height a
nd diff
ing and quantifying the weateren
her ce‘ebet
mPewee
ratunre.the Tvalu
hesees,atAntributes al
alues whose erence: diff ces i
are inte rpretable iNterval-scaleq
1] fave
consider the measur =—1 ifics
2p =_ Xc—™;
Scanned by CamScanner
__ Classification, Predict;
¥F ——
pata Warehousing se :
used to me as
ure t he similarity or dissimjj,,;
Milarity ling =
normally Mea
o Distances are “ance:
data objects. :
i dis
include : Minkowsk
o Some popular ones
q = q
dG, j) = Vix I + 2 — Xp + «+ Ip — Xp
o Both the Euclidean distance and Manhattan distance satisfy the following
requirements of a distance function : mathe
dij) = 0
d(i,i) = 0
dij) = d(.i)
daj) < d&k +dkj)
© Also one can use weighted distance, parametric.
4.6.2 Binary Variable
Tiable ;
| example of such a variable is ae cei of the states are not equally important. ~ 4
€ : ei ence
ifI} mae ee is “handicapped or not lee absence of a relatively rare attribute.
Y coded as 1 (present) and the other js coca om a Most important outcom** .
aDsent).
Scanned by CamScanner
“ingen table for binary data -
patio’
AK
object m
tonal] da
mnas
jwould be the number found in object ™ but not in obj ect = mee
jos the opposite to b and dis the number y
that are not f
matic eal? matching coefficient (invariant, if the binary y ia IN either object,
arable js symmetri
dG, j) = bic "
SED tonal (4.6.1
cient (non-iInvari ‘ 5
d Gi, j) = eran
+b+e --(4.6.2)
fmple
Table 4.6.1 : A Relational table containing mostly binary values
: ug ae a . 2 ee ia ce pourees pe
rariable, Se | eee S i
N P N N N
>. Here
P N
should o 7 N
e both Jim Me fF at esp N | N|NJN
etric binary
* Gender is a symmetric attribute the remaining attributes are asymm
to 0 as shown in the
Tet the Values Y and P be set to 1, and the value N be set
|
Tile 4.6.2,
“ing formula (Equation 4.6.2) of asymmetric variable.
Scanned by CamScanner
& Mining (MU-
g
Ww Data Warehousin
- B
‘ th
qT
Gender | Fever Cough | Tes
M 9 ° ° 9 _
| 0 _ I
F 1 0 I 0
M 1 ! " : ; 0
Distance between jack and mary (i.e. - dfjack mary)) is calculated Using
4
(Equation 4.6.2) and use contingency table given above-
7
Consider attributes : Fever, cough, Test-1, Test-2, Test-3, Test-4
gti No
pur “
Consider jack as object i and mary as object J
arith
a = attribute values 1 in jack and in mary also = 2
t of st
b = attribute values 1 in jack but 0 in mary = 0
c = attribute values 0 in jack but 1 in mary = 1
8 b+c
dij) = aeb+e
O+1 |
d(jack, mary) = 24041 70:33
Gim, mary) == 12
<4 5 ma= 0.75
~
d (jim ma
Scanned by CamScanner
.
venousing & Mining (MU-Sem Com
ae —_— — ———
p)
ja
dividual item
has 4 Certain
dic
, . vpnoaemir naofl the categories is nop Possib ©
variable Cate
go
ries can be
ie.
+ netic and igi
; ni examples of su *Perations on t n
ch Attributes ar
e :
’ ‘Car owner :
H Ordinal Variables
Scanned by CamScanner
& Mining (Mu-Sem, 0-COTE: BS een POC Ao, q
WF pata Wa
.
(3) Ratio-variables ontinnous positive measurements on a nop ti
. ci ‘
_ Ratio scaled a cial data but are not measured on a linear sca] . ar ty -
vail ty ane
e
Calories th
Way of co
Euclidean
: we hac
CtWeen ,
Scanned by CamScanner
. yet j , Oecas;
Scala gel
Orit ts. The choice
ri fF Similar
of distance/simitenn a
Measures diver
Ca
one Metric white
OP and gi wfie” entation of objects, Y Measures gent Often canes IM for
Visig oro T, Pends on the ed Similarity
and %
7 SF
able 4.7.11 Minkowst fami
surement type
2. CityPa
block L, aater cites
Scaleg i pee ee i=]
5 3. Minkowski Li. Pt |
ta pfa
i
: atk x P-QpP
i=]
: ty infinite, the equation (4) can be derived and it is called the chessboard distance in 2D, the
oagle minimax approximation, or the Chebyshev distance named after Pafnuty Lvovich
Chebyshev.
dimensions,
| - These distances (similarities) can be based on a single dimension or multiple
with each dimension representing a rule or condition for grouping objects.
of
~ Forexample, if we were to cluster fast foods, wej cou 1d take into account the number “i
— most straightforw
orice. subjecti
i their+ price, i
subjective ratings Bate EIetc. TheFenenadie
0 F taste, ete
i they contain,
calories
objects in a multi-dimensional sp
way of computing distances between
—_— Euclidean distances. — | al geometric distance
- " ‘ ctu
~ Ifwe had a two or three-dimensional space this nee Ppa
) Loedt waited with a ruler).
measured
ve
: between objects in the space (1.€., aS if
Scanned by CamScanner
Classification, Preg; W
w-
a Mining (MU-SOD. gS Ne |
Data wWarehousin tnods (k-means, k-medoids)
gina ;
“partitioning Me—— , vet
— Syllabus TOPIC:
hods
4.8 Partitioning Met
> (Mu. w
tract a partition of a database D of n object, inte , ; 4
Partitioning methods const” ty
K clusters.
ing methods
Diterent eo tively enumerate all partitions.
1. Glo bal optimal method : Exhaus
K-means and K-medoids algorithms.
2. Heuristic me thods : K-me
d by the centre of the cluster.
K-means : Each cluster is represented by K_mean
Scanned by CamScanner
ro ace fn
is algorithm aims at mip
imiz; an
« ©),
"1S given
wf
we
Ser. 9 fori=1tok
Replace m; with the mean of all of the samples for cluster i
o end_for
sa © end_until
on ~ Following three steps are repeated until convergence :
Scanned by CamScanner
CANT
PT
Arbitrarily choose K _
object as initial Update
cluster center the
cluster
2 ad means
ade Stop
“e423 4 $6 7 8 10) two groups are i
~ Distance objects to
centroids
Grouping based on |
~ minimum distance “
Scanned by CamScanner
pers which are close to me
are close t0 Mean m,= 4 are M1 = 3 ar
&louped into én froy
Ster
new mean for NEW chast
‘ ag? 3h Ko= the
7 calculate {4,10,12,20,30, 11,25), fig
1 =
K, = {2,4,10,12,3,11},
Mean =7
4 Re-assign
K, = {20,30,25},° _ Mean = 25
K, .= {2,4,10,12,3,11}, _ Mean =7
Scanned by CamScanner
W bate We ing
& Mining (MU-Sem. 6-Comp. 4 “91 Classification,
o p sation 5 a warghous® ‘
S GZ 5
K,= 3, 9. 18— mean = 19
= 2, 8, 15 — mean=
K, 8.3;
8.3;
- mean = 13.3
K,= 6, 12, 22
2. Re-assign = ; 50 the final answ’
ar one
K, = 2, 3, 6, 8, 9 - mean 25.6; ; Kyo mean = ‘ 4.84 * gro upe
groupe
K,= 12, 15, 18, 22 — mean = 16.75
3. Re-assign
Scanned by CamScanner
Consider four objects yw;
Ith two a
grouped together into two Cluster
values.
Attribute Y
8.4 : Graphical
Fig. P. 4.8.4: Gra representation of Data Points
Scanned by CamScanner
7 Attribute X ‘N
_
—s
‘z
t—
——
‘ ' ' it
4.5 ; eas
“a seateaeee § sri
1 :
’
secede 1.
7
er 7 aeoeneeee
eenans 1 eesuac
a5 cosnseeed ana
4 ‘ oandee basces
wsc ase feenesena
'
dewceeane aseeecere . eac
esne
ee: ee
D
ens
OS dee Be saeiits baeeseensbeneerens fosseen feveeo
0 +—_}+____ +++ i
0 1 2 3 4 5 8
A BCD
The above distance matrix shows the dist
ance of each o bject wi respect to the cen
ject with
of cluster Cy and C>,
For example, distance from object D
= (5,4) to the fi i ss $f
cond cnttid =, )ied2Aet NO CI= Ch Dis Saint
G° =| lean group - |
grou-p2
; : Based . |
centroid. As group-1 has only one object, a belong to group, calculate the ™
roid of group-1 is C, =(
1, 1).
Scanned by CamScanner
Fig. P.4 -8.4(b) : Clus
ter formation
after First Iterat
ion
: Calculate objects-centro
ids distances : Comput
e the distance of
wnew centroids. So new distance matrix wi Ib all objects with Tespec
t
Centrojd e D'is given below
ni [ 0 1 3.61 C,; =(1 1) group - 1
an 11 8
d to the “13.14 2.36 0.47 1.89] Cc, = G ; -
" oo
A B Cc D
imum { Make the new clusters of objects : Foll
ow the step 3 given above. Based on minimum
Toup- distance assign the group to each object. Now an objec
t A and B belongs to group-1 and
object
s C and D belongs to group-2 as given below:
g=[2
“LOO
0 0) group- 1
1 1) group-2
ABCD oa
ew
Again determine centroids : Repeat the step 4 to calculate new cen
Le? Jt -(15 7
% -(, T° 2
Scanned by CamScanner
4+5 34) (45 1
c, = (EE AE) = (49.39$)
Attribute X
Iteration 2
A Boc D
9. Make the Clusters
of objects
calculated using : Assi eac h object based on the minimu
Euclidean distan
ce, = m dius
o*=[) 1 0 "| Stroup - |
0014 group - 2
Scanned by CamScanner
ee
a
Enel bt he
rey au Sa
3
i
15 0.5
Rs
> Matrix for for simplicity we can find the adjacency matrix y Which gives distances
hich gives di of all
) object from
2.12
0 2 0.71 |
Scanned by CamScanner
WF pata warehousing & Mini . ; Classification, Prag
i warehousing é
2. Objects-centroids distance : Using Euclidean distance formula
agi » the qj Zz
object with respect to centroid C, and C, is given below : at ‘t Z For example, di
D°= [eats
os — 2) + (0.7
tteration-1, objec
6. minimum distance
Gi=
eb. ge
[a |B
*
The object is. represented by
5
column in the distance matrix. The first ro ; d
—~
distance of each object to the first centroid and the second
Tow to thesecond TeDTeseny os
Cen * .
3. Objects clustering * Each object is assigned based on the minimum distance “= By comparing mi
si.glance D is assigned
and matrix to group 1 and C and E to group 2. A value of 1;
if an object belongs to that group,
Thu ogg to-a ci
move any more
S aSSignedj te So final clusters a
G’= .
Ci(2,2)~Group 1
C,(1,1) — Group 2
gning the objects to their appropriate gro,
has three member thus the
centroid C, iste
‘ian Similarly Group 2 now has i
are ale among the two members Ex. 4.8.6: Suppo:
CG = £7o2t3 2424 1
re
3° 3) = (2.67, 1.67) pres‘
c, = (41S 140 | Met
5 2° 2 3) = (1.25, 0.75) The di
. eee objects-centroids distances :
objects to the new centroids, Similar to ’ : The next Step is to compute the distance ofl ie , ¢
show
D! = tep a we have distance matrix at iteration Lis (a) q
(b) -
Scanned by CamScanner
a
(Se
ania
By comparing the above Tesults we obs Ly
ea more to a different gtoup. Thus K erve that Ge G 1
SC
Means ¢] r » this shows that .
$0 final clusters are Group 1= ustering has reacheded itsite oc.4 ot b OO Hot
{ A, 21D } and Group 2= (6,5 stabi;
Stability,
PRTC
x
A
Es
q
Scanned by CamScanner
at dad C2(4,9)
| ° 9 168) ae
|
sterati
' | © A2(2,5) o— 7)
; B3(8,4)
ject
PIGAy ay
enna ail
0 + ¥ < i
=
0 2 4 6 8 10 F
7. Iterati
8. Iterati
D? =
9
Tteratj
The G,
Scanned by CamScanner
, Mining (MU-Sem. 6-Comp.) 4-100 Classification, Prediction & Clustering
6.52
the
clu ste rin :
g Sim ila r to step 3, we assign each object based on
0pjects
wn below.
'
yest" . tance. The Group matrix is sho
Jy m §
pinot
] 0 0 | X2(6,6)
0
0 o | 1 | 0 | X3(1.5,3.5)
centroids
iteration-2, determine
95)
x1 = (2+4)/2, (10+9)/2)= 3,
)/4)=(65, 5.25)
= ((8+5+7 +614, (44+84+544
3.5)
x3 = ((2+1)/2,(5+2)/2)=(15,
s
\ Iteration-2, objects-centroids distance
De
1.12 | X1G.9.5)
235e17.43 | 2.5 | 6.02 | 6.26 7:76 |
112/a
s
,5.25)
X2(6.25 PO
6541 451 | 195 | 3.13 | 0.56 | 1.35 | 6.38 | 7.68 | X20
5,3.5 )
| 6.52 | 1.58 | 6.52 | 5.70 | 5-70 J
4.52 | 1.58 | 6.04 | X3(.5
| et
1 in distance.
un
lleration-2, objects clustering : We assign each object based on the minim
Group matrix is shown below.
a
Scanned by CamScanner
0
fofi1{[o o] oj] 0 1
XI = (2+5+4)/3, (10+9
+ 8)/3) = (3.67, 9)
X2 = (8+7+6)/3, (44+5+4)/3)= (7, 4.33) strenofgth
K-ms
X3. = (2+ 1)/2, (5 +2)/2)=(1.5, 3.5) Relatively
ef
11, Iteration-3, objects-centroids distances where n is nu
D?= ae
Normally, k,
: K-means ofte
0.33 | X1(3.67,9) oe
| 6.01 | 5.04 | 1.05 | 4.17 | 0.67 | 1.05 |
6.44 | 5.55 X2(7 ,4.33) global optim
| 6.52 | 1.58 | 6.52 | 5.70 | 5.70 | 4.52 | 1.58 | 6.04 | X3(1.5,3.5)
_ Weakness of K.
12. _Iteration-3,
, 0 objects clusterining g :: We assign
‘on each :
object based on the minimum distance
Group matrix is shown below. Applicable c
Need to spec
G=
Unable to h:
Not suitable
0} 0 |] 1 |x1@.67,9)
1} 0 | O | x2(7,4.33)
O11] 0 | x3(1.5,3.5)
Scanned by CamScanner
of K-means clustering
adl atively efficient: O(ikn),
ere is number of objects,
"his number of clusters,i is number of iterations,
Tx vormally, ki<<n.
Scanned by CamScanner
K2 = {15,17,20}Cl
, =Mean= 17 3
3. Re-assign oe
K2 = {15,17,20}
Scanned by CamScanner
¢ ach SIP in algorithm,: medojgg are Chan 7
_ cost¢ hange for an item 4, ASSOciateg ed if
wit h
. n by Margaret H, §
Pe With Non-med ‘
i? ceca . :
.
ite \
;
ye" s 2 _ in} if set of elements.
qe oe 2
a Nele, Tents,
om
;
ired clusters,
mber of desired
i :
erence . ae ‘
Bo in| ;
° "doy, ft peateei™ !
ate
pagent Meee oe
pirat select k medoi
n
Tey.= 2 “Cin
j= 1
ids)
Advantages of PAM (Partitioning Around Medo
t handle large data sets
3 Wer.
but does no
i ‘small data sets, k is number of
iia is number of data,
where 1
Complevwxity isomO(aa k)*) forte each iteratio
n(n-sy |
.
; .
Clusters.
d outliers.
presence of nol se an oe |
re rob ust tha n k- me an s in the
~ PAMis mo
Scanned by CamScanner
w. Apply K-
ects are given belo
Coordinates of Obj
clusters = 2.
Table P. 4.8.8
1.0
5.0
5.0
5.0
10,0
25,0
25.0
25.0
25.0
29.0
Soln. :
y
10
8
6
4p e4 e4
2 03
e2
|
1 5 10 20
|
15 25
Fig.P. 4.8.8 : Two-dimensional example 30 " *| |
with 10 objects
4 (
2t
Objects 1 and 5 are the selected
oo
representative objects initially.
(Random selection)
men Stance of every object with :
Tespect to selected object
oo e clos st representative 1 and 5
object with Tespect to the selected
culate the average object.
value of minimal dissimilarity
Cost = Average value of
minimal dissimilarity =9
= 4
ase
Objects
Scanned by CamScanner
10 15 20 25
Fig, P. 4.8.8(a)
Scanned by CamScanner
gem. 6-COmp. 4-107 Classification, Pr
4: Repeat steP 7
Pei from object 4 |_ from object8_|__ dissimilarity
, 100 24.19 4.00 sampling Be
2 3.00 20.88 3.00
3 2.00 20.62 2.00 CLARA
0.00 20.22 0.00 Clustering le
4
5 5.00 15.30 _ Built in stat
6 20.00 3.00 3.00 Draw multi
7 20.10 1.00 1.00 Performsb
8 . 20.22 0.00
0.00 _ Efficiency
9 20.40 1.00 _ 1.00 LARANS
10 24.19 4.00 4.00 e °
Average 2.30 - CLARAN
search.
A
y — ‘CLARA
— Search is
ol - Itdraws
4 - It has tv
and nun
6 a q
L —_—_—_—_—_—————
4 = Sy!
49 Hierar
2r
12345 10 a 20
Fig. P. 4.8.8(b) : Clustering corresponding to the selections described in Table P. 48.70) wt
ari Sus hierar:
Step p 3; Single-lin
~ Calculation of swapping cost.
Complete
i193 ~ Swapping cost = New cost— Old cost ost == 2.30 9,32= —7. 02 ~ Average-
Scanned by CamScanner
. ping cost < 0
llf
oe medoid with new Selected Object,
pee bj
ne medoids are object 4 ang object g
perepeat step 2 and 3 until Swapping COSt en
0
3 samp ling Based Methoa
cARA
_ Clustering large Applications.
_ Built in statistical analysis Packages, such ass,
{9Hierarchical Clustering
> (MU - May 2014, Dec. 201 4)
7
Veous hierarchical clustering algorithms are :
Scanned by CamScanner
YF Data Warehousing & Mining (M
- Ward's method.
Hierarchical clustering technique (Basic algorithm)
Repeat.
mw
oe eee
In ney
the complete linka. 8¢
method, D(A,B) isfe com
an uted = na
d(i,j) |
object iis invtuster A and objectj :
is in cluster B} ee
Scanned by CamScanner
°
« Cluster4
Fig. 4.9.2 : Complete. é
Scanned by CamScanner
warenousi
wecgift as a dendrogram as shown a Fj go pa! ta Kowa
An HAC clustering is typically visualized
each merge is represented by a horizontal line. * 494 ex 49-1* techn
What is Dendrogram ? ‘hy
- Dendrogram: A tree data structure which illustrates hierarchical cluster}
n
— Each level shows clusters for that level. . ni
o Leaf — individual clusters
o Root—one cluster
r p6 in 2-d
| 5 [: Step 2: Calculat
all other
mee Fig. 4.9.4 : Dendrogram ‘ and plac
- ow: chart of agglomerative hieerarchici
} aera al clustering
i ‘
algorithm is. shown in Fig. 495 | sae
where x; is
on, aS Many attrit
;
In our case,
and p2, which h
d(pl, r
Scanned by CamScanner
ae
oe? —
o<
attributes x and y, so we plot the inn Wwe have 9 ss
p6 in 2-dimensional space : iad ae
op?! Calculate the distance from cach gs 04 "3 98
i . . ject (poj to
all other points, using Euclidean Pili,
easure,
and p lace the num bers ini a distance
° matrix. 0 01 0203 ; 04 os on™
The formula for Euclidean
distance between two points
j andj jis
vn in Fig. 4.9.5 d(i,
Gi ))j) = af Init —Xy + xg— xi + Fxg oe
where x;1 is the value of attribute 1 for i and x); is the value of
attribute1 for j and so
} «,asmany attributes we have ... shown upto p Le. Xp in the
formula.
In our case, we only have 2 attributes. So, the Euclidean distance between our points pl
ad p2, which have attributes x and y would be calculate
as follows:
d
Scanned by CamScanner
Distance matrix , —
p2 0.24 | 0
p3 | 0.22 0.15 | 0
p6 and p3 and make single cluster (p3,p6). Now re-compute the distance 7»,
Dendrogram
Disa
0.30
0.25
0.20
Oo 01 0203 04 05 06
Distance matrix
” The «
pl 0 . dist(
p2 0.24 | 0
(P3, p6) | 0.22 | 0.15 | o
p4 0.37 | 0.20}015 . | 9
ps 0.34 | 0.14 | 0.28 0.29 | 0
; pl p2
To calculate the distance (p3, p6) p+ pd
of pl from (p3,p6) :
Scanned by CamScanner
Classification, 7
& mini
s
sone single cluster is formed i. merge al the chuter ,
*
.
nll
4 9 oh
F015
+014
|
7Olt
| |
6 2 5
3°
0
0
0.22 | 0.15
{015 [0
037/020
pl | @2P5) (p3, p6) | p4
ca lc ul at ed as gi ven below:
and (p2, p5) is
between (p3 p6) (p 6, p2). dist(p3, P:
st(pe3, p2 ) » <i st di; 5),
p5) ) = MIN ( di
5)
i
a t i
ae ); (p 2, ’
,
6
dist( (p: 3, p ;
//from origigininal al
matrix
25, 0.28, 0.39)
MIN (0.15 0.
= 0.15
and again
distance i.¢.0.15
(p 3, p6 ) as h a v1 ing miinniimum
YMerge (p2, p5) and
e e - Pp.
’
matrix.
erge
compute distance
Scanned by CamScanner
4-115 Classification, p .
* Sem, 6-Comp.)
=F Data Warehousing & Mini MU-Set <Siction warehousing &M
so. looking
r and pl hav
* two ina si
oF
— there are nl
rit}
O14 space
3 62. §
o 010203 04 05 0.6
Distance matrix
pl 0
So, looking at the last distance matrix above, we see that (p2, p5, p3, p6) aad To stop fuer
have the smallest distance from all i.e. 0.15. So, we merge those two in a sing bas t0 make ests
cluster, and re-compute the distance matrix. merging of amar |
stop clustering at th:
Space dendrogram _ 7% Complete link :
Dendrogram Distanca —
Step land step 2:
0.20 .
Refer the sing]
an Distance matrix |
ofl
0 01 02 03 04 05 06 $ 62 5 4
Scanned by CamScanner
d. since we Dave more ClustersSto
so, looking at the last dixtn
and p! have the smallest die Matrix
0.29 two in a single cluster, There nee _
Ove,
Oy §
there are NO More clusters 4, tncrge,'S no
lo floes One lefty
14
O44
yA
9 0102030405 06 x
5 4,
he he
ndition :
gorring
indicated that “each object is placed ;
* WA“ N
‘ ma Separate cluste;
ir of clusters, 9 until certai,
aA
ee
gecosest Par 0.
ain termination conditions an * and at eachSED We merge
€ Satisfied”,
Any
.
Jo stopclustering either user has to specify the number of cise
é =:
ct
at which level. Through dendrges nn Mem
3, P6) and pd
,sn ukeof clusters stop clustering
decisionat tovarious distances. If merging of clusters is at high a WE can notice the
Tk
Wo in a Single
rid
igh distance then we can
WS A
epcustering at that level.
ip tomplete link :
«<
ALiUnoas
P
Sep land step 23
;i Dstance matrix :
4 pl |0
p2 | 0.24 | 0
p3 | 0.22} 0.1 1 ——
|5; 0
| 0.15
p4 | 0.37 | 0.20
0.28 |
p5 | 0.34 | 0.|14
25 | 0.11 | 92
p6 | 0.23 | 0.25 |00
Scanned by CamScanner
Step 3: Identify the two clusters with the shortest distance in the Matrix,
together. Re-compute the distance matrix, as those two clusters Mey
cluster. Me Roy ny ‘i new 2
Dendrogram 0
Distancg pl 0
p2
0
(p3s p6) 5
pt 0
ps :
o 0102030405 06 . pl
Scanned by CamScanner
se
.
' at the above distance
So, looking matrix ‘meta Pe 4 p5
— at p2 and .
fom all - 0.14. So, we merge those two ina Single cluster, and = seit Smallest distance
wing the following calculations. “ , Compute the distance matrix
Space : Davida
0.20
nal matrix
| | 0.10
x 3 6 2 5
0 01 02 03 0.4 05 06
] matrix:
|
dis (p2, p5), pl) = MAX (dist(p2, pl) dist(o5,P!))
original a
= MAX (0.24, 0-34) //from
= 0.34
Scanned by CamScanner
& Data Warehousi &M
pl 0 |
0.22
1 0.14
0.11
0 01 0.2 03 04 05 086 3 6 4
Scanned by CamScanner
Therefore new distance matrix
(p2, pS)’
(P3, p6, p4)
I =!
sep 6: Now consider the follow PY ©2,P5)
ing distance matrix
(p3,p6, py
pl 0 ry
-—-—.,
0.34 | 0
Ya
0.6 5
0.5
0.44
0 0102 03 04 05 06
Scanned by CamScanner
dist ((p2, p5, pl), (p3, p6, p4) == 0.39
03
Therefore new distance matrix
ason fo|
carsrp fos fo |
(p2, p5,p!) — (p3, p6, P4)
Finally, merge the cluster (p2, p5,p1) and (p3,p6,p4)
Space Dendrogram
rT
o 0.1
0 01 02 i 040508
3 ¢& = me
Average link: 5 dist( (p3 p6), pl.
Step 1 and Step 2 : “dist( (p3 p6), p2
p2 [0.24 0
mec
P3 | 0.22 | 0.15 0
P4037 | 0.20 | 0.15 | 9
PS | 0.34 | 0.14 | 0.28
| 9.29 0
0.231 0.25/ 0.11 | 0,
Pe [025] 025] 011 [aaa 039] | 0
Scanned by CamScanner
0 0102 03 04 05 og x
4 6
pl 0
_
Scanned by CamScanner
4-123 Classifica ing & Mini
Ww Data Warehousing & Mini mot, PreK, ~ warenoes
02
w the closest
Step 4: 3:
o ‘ tthe ma
So, looking at the above distance matrix above, we see that p2 and 5 hay ” jookin® ie
distance
§ from all - 0.14, So, we merge those two in a single cluster, ang ,, on"®t i ”
distance matrix. Dendrogram ;
Space
T O25 Y
06
TO25
y
0.6 wi
o4 [Os
TO414 o4
0.4
o4 TO05
Scanned by CamScanner
oO
w the closest clusters
CreRed,
Gas
0.29
O15
0.14
9.19
os 0 01 02 03 0405 o¢ x
pl e
).27 ||
(p2, p5) 0.24 | 0
|0 |
(p3, p6, p4) [0.27 [0.26
P4)
pl _(p2, p5) (p3, p6,
and p 1,
Sep6: So merge the cluster (p2,p5) (p3,p!)
= 1/9x (dist(p3
p2)+ dist(p3,p5) + dist
dist( (p3 p6p4 ), (p2,p5.P1)) 6 p5) + dist(p6,pl)
+ dist(p6 2) + dist(p
we p5) + dist(p4+P))
039 +023
sntry £0"
Scanned by CamScanner
o 01 0203 04 05 06
Distance matrix : 5
109 7 ¢
5] 9 8 5 .
dio;
da:
Scanned by CamScanner
ayo =i {di3,do3} = min (6, 3) =3
dua = in {dj,4424} = min { 10, 9} =9
e
dys = min {d,,5,d, 5} = min
{9, 8} =8 ©
ve Matrix (use 2 . ry 8
123 45 (1,2) 3-4 ‘
0 (1 2) 0
| (i, ’ (1, 2,3) 4 5
25 3 | 3.0 4 <6
g6 3 0 > 4) 9 70 > 51 5 49
° 09 7 0 51 8 540
i985 4 0 |
Scanned by CamScanner
¥‘ Data Warehousing & Mining (MU-Sem. 6-Com .) 4-127 Classifica, ‘on, Predictign ‘
Step3:
123465 (i,2)
3 4 5 (1,2,3) r
a,2) | 0 (1,2,3)] 9 an | c
0 : 3. 0 4 _
3} 6 3 0 => 4 /9 7 0 ~ * Le | se?
109 7 0 5 8 3 6
4 0
5} 9 8540 1
| 12 °
{
3} 6 3 |
(1,23) 46 4) 10 9
(1,2,3) | 9 5) 9 8
(45) | 5 4 da
dazaas) = min {di12.3),4,dq,2.35}= min{ apes
‘
Step 3: ;
s | 1
.
1] O
2 0
(ll) Complete link
3} 6 3
Step 1:
4110 9
P2449 (1,2) 3.4 5 5 9 g
1/0 ‘
1,2, 3),(4,5
4/10 9 7 9 s1 9 540
519 854 0
rr
Scanned by CamScanner
5 4 0
danas = MAX {daa 4dao5}= max (10, 9} hg
dyc4s) = Max {d54,d,5} = max {7,5} =7
step 3 *
1, 2 (1,2) 3 4 5 (1,2) 3 (4,5)
if 0 (1,2) | 0 (2) | 0
sta 0. | 3| 6 0 3 | 6 0
> 4 | 10 70 => (45){ 10 7 0
116 3.0
i|10 9 7 0 5 | 9 a4%
s}9 8 5 4.0
dha, 3) ¢4,5) = max {daa aspdsa.s} =. mt {
Scanned by CamScanner
(iii) Average link
Step 1: /a4as (2) 3 4 5 2
tl 0 (2); 0
2/2 0 a) ee 0 4| 9
a 95 7 0 ; 2 oO
3} 6 3 0 =
4/10 9 7 0 5 85 5 4 0 Je 28
5/9 8 5 4 0 Z 10 9 (7 0
1 6+3 5 4
dang = 7 dist hs) ="9 =F 5L? °= 61
1 1049 dga.sntt-® Gia +
1 9+8 @
days = 9 dist zs)=—9 =8.5 pa)
Step 2: ;
12345 (1,2)
3 4 5
1; O (1, 2) 0 (1,2) —————
2)2 2-0 "sige —x.4.9.3: Uset
3 45 0 3 link a
316 30 > 4/95 70 => (45)
4,10 9 7 0 5 85 5 4 0
5} 9 85 40 |
a2
das) = | (det dys + dag + dys}
1/4 { 10 +94 948}
I
ond 9 .
Soin. :
1
dys) = 7 {daat dss) =6
Scanned by CamScanner
(1,2,3) 0 2
4,
(4,5) 8 ‘
Scanned by CamScanner
¥+ pata Warehousing & Mining (MU-Sem. 6-Comp.).) 4-131 Classi
assification, Prog "
; °
For simplicity we can find the adjacency matrix which gives distances of al
each other. Using Euclidean distance we have Sbise, .
Di(i,j) = ix, — x1 +ly,-y,P 4
Step2: Now
twoc
Step 3: Clus
|
valu
Step4s in
(Cc)
Scanned by CamScanner
complete link
a
1! Closest clusters are Merged y,
. . Te the
the maximum dis tance between és e dia:
Stance
’ clusters,
Sep3: Cluster (A,B) and D can be merged together as they are having minimum distance
value. .
1um distance
(C.E).
Scanned by CamScanner
5 nousing & Mining (Mt
Classification rel
¥ pata Warehousing & Mining (MU-Sem. eaten Stodiction 4 i¢ oi pe Dene
Final dendrogram
2.24
1.44
1
0.71
c
1Cc
A BOD
matrix
a, OF _
following data and pita doc
Ex. 4.9.4: Discuss the agglomerative ¢ algorithm using = gn &
contains
using single link approach. ‘The following figure Sa (EA) B) = MD
indicating the distance between ane Seren Sai
ae xe ee Me ~
if eid fj,
one Paty A] he].
Distance matrix
EACBOD
E/0
A/1/0
C}2/2)0
B/2/5/1
D/3;3 16/3 /0
Step 1: From above given distance matrix, E and A clusters are having minimum distance ° Distance matrix
merge them together to form cluster (E, A).
Scanned by CamScanner
te at 134 Classification, Prediction & Clustarini¢
ing & mining
Dendrogram
Distance
2
MIN (dist(E.C), dist(A.0) = MIN (22) =
- MIN (dist(E.B), dist(A,B)) = MIN (2,5)
ma 2
‘ _
a
:
) =3
jE? »P MIN (dist(E.D), dist(A,D)) = MIN (3,3
| BA cB D
gc E> P
e A ,
ae .
7
) » d i s t ( C B), dist(C A))
ix : gist(B A
Distance matr
9, 2) =*
* MIN (2.5
acca
Scanned by CamScanner
& Mining (MU-Sem. 6-Comp. Classification, Pre
¥F pata Warehousing
tion &
dist( (BC), D) = MIN (dist(B,D), dist(C D))
= MIN(G,6) =3
EA B,C D
[r,A} 0
[ B,C 2
jp | 3 | 3 Jo
Step 3: Consider the distance matrix obtained in step 2 (given above)
[1
E A
TI
Bo oc! | =
dist((E A), (BC)) = MIN (dist(E,B), dist(E C), dist(A,B), dist(A
C))
= MIN (2, 2,2, 5,2) = 2 | Hom dhe: ak
E,A,B,C D distance-is betwe
E, A, B,C 0 and is called the
:
Step 2: Calcul:
D
2/0
St ep 4: : FinFi ally combini e D i Therefore,
with (EABC)
: Final dendrogram Déndrogram dis(C37,4) :
:
Similarly, ¢
Distance
fe
Scanned by CamScanner
Following table .,
clustering ang oo0V8S
on
gol
(14.25
= min4) 14.25
) , 18.1= 9)
tos
dis(C37,4 ) == minmi ( dis2 (3,4), dis(7,
iara [ete
ar.
ys ae the.other distances to get the distance matrix.
Q | 14.25 | 20.28 | 10 |
Scanned by CamScanner
In the above matrix distance
between paints 2
cluster as C25,
az. L_ |
In the above matrix dist.
ance between CI and
formed new cluster is C6 in minimum Which js 5.08,
C16.
sea 9 ney,
Step4: Calculate the new distan
ce matrix with cluster
C16,
Cluster number | C16 | C25_
cies | o [3554 —qa6: Sup
C25 r hays
Scanned by CamScanner
Clasaification, Pradintlan 4 Clyatarire
16.11 16
4.257 44
12
+10
671
t 6
28ST
4.01t 4
6 4 2 6&6 3 7
= 7
, x1 x2 5
a(t. 45 1) 4
‘
- 15
c J§ s
3 4 2
D /
eE | 4 4
F 3 35
0
0 2 4 6
and draw Dendrogram.
Apply Single linkage clustering
rix of size 6 by 6.
win.: We have given an input distance mat
A B c D E F
Dist
“| 0.00| 66:| 3.61 | 4.24 | 3.20
0.71 | 0.00 | 4.95 | 2.92 | 3.54 | 2.50.
BP
| ‘
| ae
Scanned by CamScanner
We summarized the results of computation as follow :
1. «In the beginning we have 6 clusters: A, B, C, D, E and F, 50 in
5
2. We merge cluster D and F into cluster (D, F) at distance 0,50.
4
3. We merge clusterA and cluster B into (A, B) at distance 0.71,
3
4. We merge cluster E and (D, F) into ((D, F), E) at distance 1.00,
i
5 We merge cluster ((D, F), E) and C into (((D, F), E), C) at distance 1.41,
4
6. We merge cluster (((D, F), E), C) and (A, B)
6
into (((D, F), E), C), (A, B)) at 2.5
distance 2.50. 20 acy Mat rix
7. The last cluster contain all the objects, thus ts pdjace’
. conclude the computation. ,
Using this information, we can now draw the Te
final results of a dendrogram.
05
Using Euclidean D
x1 x2 .
Aled 94 Dii,j)
B 1.5 15
Sue 3
D
84s D(A,B)
Ee |4 ay Distance matrix:
F 3 35
f ~ © technique
{ala set given. Create
to cluster the adjacency
the given matrix.
data, Draw the Use singe bt
dendrogram.
distance 1
dititn- dis
So ele a
Scanned by CamScanner
Gh as Dend
D(i,j) Ix2—xlI +ly2-yil rogram neve
dist( (B.C), A) =
Scanned by CamScanner
¥F bata Warehousing & Mining (MU-Sem. 6-Comp.) 4-141 __Classification, Prediction :
. 1.44
BC E oD
New Distance matrix :
| AT BC | DE:
A_|0
B,C | 1.41 | 0
D,E | 3.60 | 2.24 | 0
Step 3: Consider the distance matrix obtained in step
2
Since (B,C) and (A) distance is minimum, we combine them
Dendrogram Distance.
rif]
B CAE
D |:
or
_u
Scanned by CamScanner
MU-Sem. 6-Comp.) _ 4-142 ___ Classification, Prediction & Clustering
A,B,C | DE
A,B,C | 0
D,E 2.24 0
ne A,B,C and DE
| combi
2.24
1.41
B
| CA E OD
|
(based on distance formula)
ari son of the three method: s
comp
\ single-link :
“Loose” clusters.
al shapes.
strength : Can handle non-elliptic NH
a * aes of
- : Pa OT
ye eM Tee
ws, « sf. et
wn!
Pe:
a eats : we, * be
oud A
ee.LS "4"
tt ‘es
ae ke
ae Ee e =28 * * .* g s wre *.
we He e sae ot ef es
os "AE Eo ost 5 Te
tye
+ -_— o* -
aeeet? ff
seat
a ee> ° *, Me *5 ea
° * . = ae
a i
Scanned by CamScanner
I OO >a
a
Classification, Pradiction A
2. Complete-link :
“Tight” clusters.
te 4
'
< art
Le "te .
2 “a efe we Leas?
sha, ates, fe i oats Bas
oe a. see Meee3; ‘
ha 70k toe ‘ha "y as fe mt f
, a * “, ". "y eee "
af
re en a , ooh fas
oe abe oo * & 6 te soi ‘
‘ reheys
oe fot geet
ee ee ae . ‘ oe 1 a a. 4
* ee o? at's “erewe
meet, * 6 tte rie . ee, 6 .
get
#eFt me . oP it
* tat “eo a,
. . #. ‘ fant “Pent
. ate : * ots .
Limitations:
- “In between”.
- Compromise between Single and complete link.
~ Strengths : Less susceptible to noise and outliers.
Limitations ; Biased towards globular clusters.
Note: Which one is the best? Depends on what you need! ine
ele we
Agglomerative algorithm given by Margaret H. Dunham:
Input :
D= Ustaewitd
i set af elements
Aff Adjacency matrix showing distance between elements
Scanned by CamScanner
_aa
pl gj) Dendrowrar® TePtosented ag g gai 2
om ative algorithm;
p24
‘
en,
52 <d, k, K
pb > i Initially dendrp
pepe
Oldk = k;
d=dt1;
B= K~ {KD
1K y {KiuK}
K = oldk-~]
DE=DEU<q
*K, K> 1 New set
Uilk=2o f
MSs edo dnd
492 Divisive Hie
rarchical Clusterin
-
- Th
Scanned by CamScanner
te,,
n _ Prediction & Clus
catio
cl assifi
Divisive
(DIANA)
i c a o q ] c clustering
hiera r c h
ve
ive v/s pivisi
9.6 a
6 * A g glomerat
g. 4-9-
iFi”
rmative.
c t u r e th at is more info
q stru
Advantages hi erarchy;
n t s th e o utput as 4
Is simp] e and represe
It does not
split, it
o u p of © bj ects is merged or
Disadvantages @ gr
t po in ts is cr itical as once nd o wh at wa s done previously.
rge or sp li t u
Selection of me ne ra te d cl us ters and will no
will operate on
the new ly ge
le ad to lo w -quality clusters.
not sen may
well ¢ ho
me rg e or sp Jit decisions if
_ Thus
ve
n ag gl om er at ive and divisi
Differ ence betw ee
—<—$——]
Sr. Agglomerative
No.
l-inclusive cluster.
l clusters. Start with one, al
Start with the points as individua ch
cluster until ea
l.
At eac h ste p, spl it a
of
2, At each step, merge the closest pair nt (or there a
clusters until only one cluster (or k cluster contains a poi
k clusters).
clusters) left.
: This tec! i i i
Scanned by CamScanner
if pased on the notation of CR
; u
¢ is a height-balanceg = &
© that
rer of data points (vector) ig 1, .
cus sen
Numb
I
. er of items jp ' Yay im iy al
LS Linear Sum ofthe p . Chiste, her, Ns jn
i
Oints, , :
SS = Sumof thes
Divisive Uar
(DIANA) . . an gf Point
A cr Tree is a height-balanceg tree that ss 5
; we al nodes store the sums of
their
© the clu... 7
lusterin ig "
desce,
Ndants 8 feature, i icony
4 ,
yor ee structure is given as bej,,,elow :
ality clusters,
a
isive. (ei
Jusive cluster.
g Hierarchies)
data sets.
Scanned by CamScanner
oe
Data
Phaset; Load into memory by
buildingaCFtree — -
Initial CF treo qo
[esis rcs Condange into !
desirable rango by building a
smaller CF irae
Smaller CF tree
Scanned by CamScanner
x Cluster refining
{4ff passes OVEr the datase,
ie ast centroid from step 3,
sample :
’ of
CF = 3
ny: Number of data points N.Ls, SS) :7
N -
Z
Lee Se
N
i= l x B
SS: =x ,
i so a >
(3)
(2.8)
(4,5)
(4.7)
(3,8)
Scanned by CamScanner
WF ba ta Warehousing m p.) 4-149 Ula
Mining (MU-Sem. Se i
eee ——
Ne#5
BE I6ID) ta FRAO OS Bp
i.e 3°3? ++2°274 + 44 4744? Bs x
SS = (54,190)
ic, do 4 3? = 54 and424.+6745 An
<2 2 ta
- .
Advantage : Finds i with
a good clustering ith aa single
sing scan and im Proves
r
few additional scans, the iy a
Scanned by CamScanner
pined clusters Cannot be Split
+ 92 co s
SIg9 | goes BNE PIPES CTStering it a
Wali, ft
‘
ith i ang of big. cluster into
. sma Ma is no}
j axing ; Olay
ller © ison Sse,“eMsitiy
.
Teo,
, university Question
s ang ;
luste
{
2
2
or ESsion
0
1
6
int of (Ans, : Refer section 4.5.2, 4.9.1 and Ex. 4.9.4)
(10 Marks)
tee, 2010
=e
Scanned by CamScanner
rarget
Dec. 2011
(Ans. : Refer Ex, 4.9. 4)
packer©
1 Missio
Q.6 Write short notes on OM work P
Similarity and distan
(Ans. : Refer section ce measures in data
4.7) Mining An €xa
May 2012 of a st
(10
Q.7 Explain what iS meant ‘y produc
by clustering. State experic
example for each and explain the various
(Ans. : Refer Sections types Surrou!
4.5.1 and 4, 5.2)
= basket
= @.10 Apply
Clustering algo use Clusteri dend:
rithm with the he ng, Descri
lp of an exampl
(Ans. : Re
ter sections 4. e
5.1 and 4.8.1) aay 7
May 2013
(10 Maris}
brQ.e 9 lllustrate how :
the su Permar
ket can use cl
ustering Meth
ods to improv
e sales, (5 Marks)
Scanned by CamScanner
v > «
i, Jaser focus cuts down on w “ z
' . : ] om Selling
. lines ast < si ~
i elf space
7
5 Sheg; 18 to
(10 Ma Mission profiles reveal detaiy, ab °8 andj 00 Wifi tlie, ‘dentig,
Tks) ' ‘k Ip in, lunch time sh : Out en Mes Us Silty live
work Po ©Pping o¢ Wee Shoppers , Tal an tthe
‘sel am «
Aa example of the way this inform... €nd x OPpin P= top 4
¥ TMation § ey, ae ly Sl "
a ' éf a store 5 shoppers are Popping . Coy tag Peditions hopping after
0 In fi Ww
Markey products ne located nea; to Ws and te Peitas -
a
ience and disc Tont > Oe nu
ia
aT aii g the b Md ee Shoppers ‘7 Of the Store to i "RUE that ten
woh
th si; Surroundin reac’ and milk with Telatedime M8 8 conven; Ove stopping
Ultable pasket and the margin on these sho _, lMPUlse lines i COCE store insteag
° Mark ) PPSE Missions,
wT
Sul also help increase the
210 Apply Agglomerative Hierarchical Clustering
; . an
PL ke
dendrogram for the following distance matriy d draw Single link ang pie at
any One fo A/B/C;/DJ/E
ee
A/O |}2/6]10]9
) Marks)
vy
B/2 /0/3;9 /8
tte}
C}6 /3/0/]7 |5
Marks
Pout
; ) D}/10;9/7/0 |4
Scanned by CamScanner
May 2014
Q.14 a
Write detailed notes
on Hierarchical cluste
ring methods.
(10 Meri
(Ans. :Refer section 4.9) :
)
Dec, 2014
10
( Marky)
(Ans. : Refar Q.
10 of May 20 73)
Q.16 Explain hierarchical
clustering methods.
(Ans. : Refer Section (5 Mar
Q.17 5
Write . eat 4.9) (to
detailed notes on
K-Means clustering, Mats
May ay 201
2016 (Ans. Reter section 4 8.1)
(49
i)
Scanned by CamScanner
the help of th 7
(10 ms
Tks)
(lo Marks
OVE sales,
( 5 Marks)
sabus *
(10 Syll :
Marks) yilabus Topic : Market Basket Analysis
- Market basket analysis is a modelling technique which is also called as affinity analysis, it
helps identifying which items are likely to be purchased together.
- The market-basket problem assumes we have some large number of items, e-g., “bread”,
‘nilk.”, etc. Customers buy the subset of items as per their need and marketer gets a
information that which things customers have taken together. So the marketers use this
information to put the items on different position.
. o tends to buy a bread at the same
* For Example : If someone buys a packet of milk als
lime
Scanned by CamScanner
————
Mining Fred. Pattems & Asso, pa,
— 8M
Scanned by CamScanner
a =
A alysis of value-addeg
SCtViceg =
2 ether at a Point ;, ;: . en
ee ath mn time, i could 5
six mon * Rath
nsidy “CrViegg
ea Store, by yariows ways can be used to Siderj &
; apply Marke bap seryj
et of butte, ; Te 8. ip jal combo offers May @ Periog Of, let's on
5 Spec be off “set Analysis .
stomers Bens, ; Ted to the tte,
Keep, placement of items Nearby
inside we ers g
& Product,
tes tempted to buy one Product
ome With , with the Silas
The layout Ms ¥ tesyy
—i Min custOMOmMerstoget
of catalogue Of an én
. he] o pets; z He
we ting Tesult, at inventory may be manageg based May be
id
on : fined,
Ifa set of items say S is frequent then all its subsets are also frequent.
Scanned by CamScanner
SP,
bu t th er e is at fees
in Y
em of X :is ; oy 5
— Consider two itemsets X ane Me ie,
roper super-itemset of X. In this case ; tems... fl gP
3 ¥ is notaP
Y, which is not in X, then } ® MAX, 5 or objec
closed. Josed frequent itemset gh ee are
where. I, are se
The rule shoulc
jikely to also P
Large ttemsets
An itemmset 1S
]f some items
Support
-— The support
or in other v
Fig. 5.2.1 : Lattice diagram for maximal, closed and frequent itemsets
Ia _ The itemsets that are circled with thick lines are the frequent itemsets as they satis the
1 minimum support. Fig. 5.2.1, Frequent itemsets are { P,4,1,S,Pq,PS,9s.18,pqs} ~ The suppe
database t]
- The itemsets that are circled with the double lines are closed frequent itemsets. Fig- s2,
closed frequent itemsets are { p,r,rs,pqs}. For example {rs} is closed frequent itemset ® ~ An itemse
| all of its superset {prs ,qrs} have support less than 2. Confidence
- The itemsets that are circled with the double lines and shaded are maximal freq ~ The cons
itemsets.
Fig. 5.2.1, maximal frequent itemsets are (rs,pqs}. For example (+s) ' a transact;
frequent itemset as none of its immediate supersets like {prs, qrs} is frequent. | ~ 9 ,
Onside
Nd B tx
Scanned by CamScanner
ve
s or objects j
—
ositor are cone Rel
elatjfor
idereg tOna]
rf da labs “4 >
st Qe ‘ ‘
support
. The support of an itemset is the count of thati
iemset in the total number of
orin other words it is the percentage of the tran sactions in whi tran; sactions,
ch the items appear.
If A>B
# _ tuples_conta
bothini
A and
ngB
Support (A > B) =
atisfy the - total_#_of_tuples
- The support(s) for an association rule X = Y is the percentage of transactions in the
database that contains X U Y i.e. (X and Y together).
ig. 5.2.1, et if its support is above some threshold.
=mset as - Anitemset is considered to be a Jarge items
Scanned by CamScanner
#_ tuples_containing_bot
Confidence (A => B) =
h_A_ang p
#_tuples_containing A
Finding the large itemsets E
4
1. The Brute Force approach be
pe
- Find all the possible association rules. \- *
FP "
— Calculate the support and confidenc
e for cach rule generated in the
~ The Rules that fail above Step,
the minsup and minconf are prunned from
the above list,
The above steps would be a time consuming
process, we can have a better
asgiven below.
“PPT
2. A better approach
7 IE
4. Types of the values handled
in the rule : Here we use
association rules. Boolean and quantitative
5. Kinds of the rules
to be mined Here we use association rules and corr
based on the kinds
of the ry les to be elation nt
mined.
6. Kinds of pattern to
be mined : Here we
ig! use frequent itemset mining, seque
mining and structured patter
n mining,
ntial patie
Scanned by CamScanner
1
ave a better app TOach
j
p orl algorithm finding Frequent ‘. ;
oa Emset fi
S Using Cang > (wu = Dec. 201
_ The Apriori Algorithm solyes "
the frequ ent; Item Hate Conerat ton
setg Problem,
analyzes q ue
_ The algorithm
uti
—
to determing
together frequently. ch Combinations of items
4 Apriori al occur
gorithm is at
the core of
Various algorithms for data min;
best known problem is finding the asg iat mi
OClatio
ea on.mm
pasic Idea rules that hold in at
Z criteria :
An itemset can only be a large itemset if all its subsets are lar
the complete set of are large itemsets '
.
itemsets: The sets of items that have minimum
Se Frequent num support.
All the subsets of a frequent itemset must be fre quentforor €.g,e.g. {PQ} is: a frequent itemset
.
‘el association rules
be frequent.
{P} and {Q} must also
transactions;
D: a database of
1 correlation mules mu ! m support count th
reshold. |
; th e mi ni
min_sup
ts in D.
:L
ut : fr equent itemse
sequential pattem Ou tp
t_l-item sets(P)s
Sey find_frequen
Scanned by CamScanner
W Data Wa
’ rehousing & Mining (MU-
‘ Sem. 6-Comp 5-B
(2) for{k = 9, .) Mining Freq
. Pa tal
The Or k++){ M3
& a,
8) Gs §prion_g F pata ware’
en (Ly pi
(for each transa
ctio1n€ D {4 sean A avantaees
qhe algorit
6) Cc D for counts
= subset (Cy, t)s 4
&et the subsets of¢
(6) for each eg nd
idate ce C.
that are candidates
, The metho
@ 1. eal gorit
©.count +4
;
3. = es
pisadvanine
Although
‘
the overall
C1) retury L = Uk
Ips _ Due to Dat
Procedure pr
iori_gen (L_,
: frequent (k
(1) for each — 1)- itemsets) 54.3 Solve
it emsets el ]
(2) for each Hems
eteL, 1 Cr
(3) #U.A) = £ Ex. 5 4.1:
4) 9 42) [
AA Ch Ik 9) =4()
= 40k-2) 0 g
@
© = 4044: If join
y <4 [k~1) then
slep:generate ca {
(5) if has_infr ndidates
eque n L_subset(o
(6) , L._,) then
(7)
Soln. :
Step1: Scan
supp
: candidate
k-intemsep;
~itemsets); Cc, =
“use Prior
knowledge
Scanned by CamScanner
Y. Paterna & Asse. Patlerr
200 |235
300 |1235
400 |25
whe . “| ae
find the
walt Scan D for count of each candidate. The candidate list is {1, 2, 3,4, 5} and
support.
C=
Scanned by CamScanner
a
{12}
pA
{13}
{15}
{23}
{2 5}
{3 5}
Step 4: Scan D for coun! t of each candidate in C, and find the support
Step 8:
23} | 2
Step 9.
25} | 3
sri
Scanned by CamScanner
)_ Mining Freq. Pattems a ”
So, Patte,,
n support count (i.e, 50%)
oo a candidate C; from
j
i
sip? Scan
D for count of each ike
} G =
PhS
tats
pov;
és
'
¢
j| ds Sodata contain .
th ef enti
temset(2,3,5)
/ teTherefore
gu the ass
Ociation rule that can be generated from L, are as shown below
with
Pport and confidence.
Scanned by CamScanner
——_Data Warehousing & Mining (MU-Som, 6-Comp,) +12 __ Mining Freq. Pattems AS Pay,
n
Association Rule | Support | Confidence | Confidence %
243=55 2 2/2=1 100% i
3*$a>2 2 2/2=1 100% |
2'5=>3 2 2/3=0,66 66%
2=>345 2 2/3=0.66 66% _|
3=>245 2 2/3=0.66 66%
S=>243 2 2/3=0.66 66%
If the minimum confidence threshold is 70% (Given), then
only the first and Second ry
above are output, since these are the only ones generated Tules
that are strong.
Final rules are :
C=
{A} | 3 |
{B} | 2
{C} | 2
{D} 1
{E} |} 1
{F} |} 1
Scanned by CamScanner
Freq. Patterns gAsso ss
atte
Th
idence %
7
iaifiesci.
ctions , » wi
With a :
yt scan D for count of each ;
: candidate in C, and and f;
a Cz Ind the Support
|F} and find the 43: Compare candidate (C, ) Support count with the minimum support count
{A,C} 2
96: So data contain the frequent item 1(A,C)
Therefore the association rule that can be generated from L are as shown below with
the support and confidence
| Support Confidence Confidence % »
_Association Rule
A->C 2 2/3 = 0.66 66%
C->A 2 aaah im
Scanned by CamScanner
) 5-14 MiningSFreq.
—— Patte
S & Asso, Patter
ms n
0! & Mining (MU
-Som. 6-Comp.
then both the rules are output as the
va Warehousing
WY Data
is 50% (Given),
Minimum confidenc
¢ threshold
50 %.
confidence is above
Step 1: Scan the transaction Database D and find the count for item-1 set which is the
candidate. The candidate list is {I1, I2, 13, 14, I5} and find each candidates support.
its
t Step 4: Com
L-ltemsets | Sup-coun item
ll 6
L,=
EB 7
5 6
14 5
H 2 _
————-_
Scanned by CamScanner
Ng Freq. Pattorng 4 Aden. Pattern
‘Frequent | Sup-count
2-Itemsets | 5
1,2
vilmlalroleala
1,3
15
2,3
2,4
2,5
ae : |
Scanned by CamScanner
=
WF Data Warehousing & Mining (MU-Sem. 6-Comp.) 5-16 Mining Freq. Patterns RA -
Step 5: Generate candidate C, atm
from L,
GQ=
Frequent 3-Itemset
1,2,3
1,2,5
1,2,4
Step 6: Scan D for count of each candidate in C, and find their support count,
C=
Scanned by CamScanner
se
jf the minimum Confidence 7
as output, as they are Sttong 1, oth
5
count,
‘| Itemset | Sup-count
€ Association
I 3
ea
120 3
3 3
4 *, J@
5 3
6 1
yr. | ob
| 8 | ——
po
|
fo
|1
Scanned by CamScanner
° 3 Mining Freq. Patterns g
¥ Data Warehousing & Mining (MU-Sem. 6-Comp.) _5-18 2880, Patter
Step 2: Find out whether each candidate item is present in at least 297 of transaction,
support count given is 30%).
L,=
Itemset | Sup-count
1 3
g 3
3 3
4 2
5. 3
Step 3: Generate candidate C, from L, and find the support of 2-itemsets.
C,=
Ttemset |Sup-count
12 1
13 2
14 2
1,5 r
2,3 2
2,4 0
23 3
3,4 1
3 2
Step 4: Compare candidate (C,) generated in step 3 with the
support count, and prune those
itemsets which do not satisfy the minimum support count
.
L, =
Ttemset | Sup-count
1,3 2
1,4
a 2
2,5 3
Er —
Scanned by CamScanner
¢ pata Warehousing & Mining (MU-Sem. 6-Comp.) 5-19 Mining Freq. Patterns & Asso, Pattern
Itemset | Sup-count
| 2,3,5 Z
Therefore the database contains the frequent itemset{2,3,5}.
Following are the association rules that can be generated from L3 are as shown below
with the support and confidence.
Association Rule ‘Support | Confidence ‘Confidence %
243=>5 2 2/2=1 100%
345=>2 2 2/2=1 100%
245=>3 2 2/3=0.66 66%
2=>345 2 2/3=0.66 66%
3=>245 2 2/3=0.66 66%
§=>243 2 2/3=0.66 66%
Given minimum confidence threshold is 75%, so only the first and second rules above are
output, since these are the only ones generated that are strong.
Ex.5.4.5: Adatabase has four locimltbas Let min sup=60% and min conf= 80%.
“Tip.|’ ‘Date. |.“ Items-bought
T100 10/1 5/99 | {K, A, D, B}
T200 | 10/15/99 | {D, A, C, E, B}
T300 | 10/19/99 | {C,A,B,E}
T400 | 10/22/99 | {B, A, D}
Find all frequent itemsets using apriori algorithm
List strong association rules( with supports S and confidence C).
aa
Scanned by CamScanner
Soln. :
| a Data Warehous
i
Step 1: Scan D for cou nt of each candidate. The candidate list
list isis { {A, BODE, 7 asses]
support.
fing step 5: Compare
L,=
SteP 6: Generate o,
C=
SEP 7: Scan
D for g
C=
Step 8: Compare oa
Gz
L,=
Scanned by CamScanner
¢ pata Warehousing & Mining (MU-Sem,
6-Comp.) 5-24 Mining Fre . Patter
& ns
Asso. Partarn
op Compare candidate (C,) support count with the min
5 imum support count.
L,=
Itemset Sup-count
A,B 4
A,D 3
B,D 3
step 6: Generate candidate C, from L,.
C=
‘Itemset
A,B,D
step 7: Scan D for count of each candidate in C,.
C=
itemise t | Sup
A,B,D 3
Step 8: Compare candidate (C,) support count with the mini
mum support count.
L,=
‘temset | Sup
A,B,D “13
Step 9: So data contain the frequent itemset(A,B,D).
aa
Scanned by CamScanner
“Eocene sen et) th A ta
If the minimum confidence threshold is 80% (Given), then only the SECOND
>
AND LAST rules above are output, since these are the only ones generat
ed that arestent
ng
Ex. 5.4.6: Apply the Apriori alg orithm on the following data with Minimum suppoq——_
TID | List of tem_IDs |. =2
T100 11,12,14
T200 11,12,15
T300 11,13,15
T400 12,14
T500 12,13
T600 11,12,13,15
T700 14,13
Ta00 11,12,13
T900 12,13
T1000 13,15
Soln. :
C=
Blolalala
I2
3
14
I5
Step 2: Compare candidate support coun
t with minimum support count (i.e. 2).
L,= Step 5 Genera
re
Scanned by CamScanner
4
4
1,5 3
2,3 4
2,4 2 |
2,5 2
3,5 3
Step 5: Generate candidate C, from L,.
C,
1,2,3
123
1,2,4
135
235
—
XN
Scanned by CamScanner
| rrequent 3-Itemset | Sup-count
[ 1.2.3 2
*
a
| 0
1.3.5 2
23:5 0
Step 7: Compare candidate (C,) support count with the minimum support
count,
L,=
13,5 2
Step 8: So data contain the frequent itemsets are
{11,12,13 } and {11,12 »15} and {11
13.15},
Let us assume that the data contains the frequent
itemset = { 11,1215) the then
association rules that can be generated from
frequent itemset are as shown below
withthe support and confidence,
Scanned by CamScanner
Step 2: Compare candidate support count with minimum support count (i.e. 50%).
L,;=
1 6
2 5
3 1
5 6
Scanned by CamScanner
WF Data Warehousing & Mining (MU-Sem. 6-Comp.) _ 5-26 Mining Freq, Patt
ems
&A S89
Step 3; Generate candidate C, from L,and find the support.
soln
Itemset | Sup-count step hee
1: -
1,2 3 one
1,3 a
1,5 3
2,3 4
2,5 4
3,5 4
Step 4: Compare candidate (C,) support count with the minimum support count. ied |
” 7 - =
temset | Sup-count.
1,3 5)
So data contain the frequent itemset= {1, J}.
Therefore the association rule that can Oe cence from L, are as shown below With te Step3:
support and confi denice
~
ni é, determina tl
ge rules eee the; a ioe.‘asain
“Transaction | items
T1 Bread, Jelly, Butter
T2 Bread, Butter
T3 Bread, Milk, Butter
T4 Coke, Bread —
T5 Coke, Milk
Scanned by CamScanner
¢ pata warehousing & Mining (MU-Sem. 6-Comp.) 5-27 _ Mining Freq. Patterns & Asso. Pattem
7
In 1:* Sean D for Count of each candidate.
st e ” candidate list
a ‘
is { Bread, Jelly, Butter, Milk, Coke }
C=
I-Itemlist Sup-Count
Bread 4
Jelly I
Butter 3
Milk 2
Coke 2
step 2: Compare candidate support count with minimum support count (i.e. 2)
-. ]-Itemlist Sup-Count
Bread 4
Butter 3
Milk 2
Coke 2
Step 3: Generate C2 from LI and find the support
G ~ - : ™ oe a
Step 4: Compare candidate (C2) support count with the minimum support count
L=
‘Frequent 2 -Itemset | Sup — Count’
{Bread, Butter} 3
| aa
Scanned by CamScanner
Mini
‘himum confidence threshold
is 80% (Given)
Final rule is
Butter ~ Bread
=
Ex, 5.4.9;
Scanned by CamScanner
gteP 2: Generate 2 item sey :
Item | §
AB i = Hem [Stipa
AC |4 eet
AE
AD [4
3
fc{7-—
/DE tq
AF |] /DF- >
AH |2 DH een
AL | 2 DL [2
BC {5 EF [2
BD |6 [BH [2
BE 4 EK a
BF {1 H (3-. |
BH !|0 FH |92
BK |3 FK |0
BL |2 FL |2
cp |5 HK |1
[CE |2 HL [2
[cr [1 KL |1
Step 3: Generate3 item set:
Item sets of 3 items
“Item
set | Suppor
ABC 3
ABD 4
ABE 2
. ABK 1
ACD 3
ACE 1
ADE 2
AEK 1
AEL 1
BCD | 5.
BCE 2
BCK |
BDE 3
BDK
|PVA 2 4
Scanned by CamScanner
BEL
Fro’
CDE
DEK
final rul
DEL
Item set above 30 % support
—e——
23
Associat
section 5
ABDE.
BCDE ——.
Therefore ABCD 1s:
the large item set with ———
minimum support
Following 30%.
Rules generated —
—
The
efficien:
- Ha
For
buc
|
she
j - Tr
i ~~
na
Pa
|
ite
!
din
a
ear
en
of
Scanned by CamScanner
Boa tereresg
§ on ny 5,
4,
Data Warehousin
Of the partitions.
Scanned by CamScanner
=
_¥ Data Warehousing
& Mining (MU-Sem. sing 8
6-Comp.) 5-32 Mining Freq. Patterns
~All the frequen, & 4,
itemsets with resp 89. Patten,
itemsets, In ect to cach Partit
the second ph ion forms the global
Support of €a ase 0 f the algo
rith cand ns
ch item is fou m, a second scan Of database for a idate ’ pat?
nd, the Se are global frequent siZ@ onthe
Sampling : Rather itemsets. c tual e aa
than finding the frequent :
transactions
are Picked up and itemsets in . i proach is v
minimum Support searched for frequent the entire database D,a Subset This :
is considered as itemsets, A lower thy - compres sion
frequent itemse this reduces the poss faa, oF
ibility of MISs ay
t du eto a higher
support count sing the Old of It isa frequent
Algorithm : FP ,
An FP- tree is a tre
e Structure which co gowth.
nsists of :
One root labeled
as " null", Input :
A set of item Prefix - D,a transaction d:
oi: sub-trees wit; h each
node-link. node formed by thr
ee fie‘ lds - it
.
em-name, count, - mi
. ee , the mini
A frequent- item’
header table with two
node-link, fields for each Output :The complete
entry : item-name, head
| of Method -
It contains the co
if mplete informat
. ion fo r frequent
H Pattern mining
Th e size of the FP-tre
e is bounded by the
| AFP tree is co
size of the database ns
| sharing, the si , but due to frequent ;
ze of the tree items @ s
is usually mu ch smaller th cn the tran
~ High compaction is an its original
achieved by Placing database
more frequently item
thus more likely to
be shared), s closer to the root (b . eo‘ms, oc
eing
(b) Create th
e ro
the followin
g
Scanned by CamScanner
i re
weme size of the FP-tre, is < 3
e . Can i
sibility Gp NE tego
a qhis approach 1S Very effic; ent g Gidate
oe oO > Sub, .
8¢
ye . t Ue to 7
Compression of large aig :
ing © ao, of Ase; /
u It is a frequent Pattern _ ints Me
It adopts a divide-ang.,
nquer Str
he database of frequen; ite
“formation of items is Preserved is
hen mine each such data Se ge
(2 FP-Tree Algorithm
s
it due to frequent item
support counts. Sort F by support count in a
| database. items. ssation TesD €0
”
bel i1t ag “aul” For each
oser to the root (peike (0) Create the root of an FP tree, and la
the following :
ea6
» for mining freq
uent.
Scanned by CamScanner
Select and so
rt the fre quent items in
re
=P “F
wnt
fiFrequent Tr an s ac co rd in
item list in Tran g to th e order of L,
Temaining list. Cal} s be [p | P], wh 2
insert_tree ([p | P]-T ere p is the first element and “M-eg FP:
child N such ), which is perfor o
that N_item-nam med as follows, tp a,
create a new no
de
e = Prilem-n
am e, then
the FP
N, and let its co increment Vs
unt be I, its pare
liink to the
nodes . nt link be linked Sount by a /
with the same . ; to T. i i She ue
nonempty, call in item-name
via the node-lin i
And its ,
sert_tree (P, N) re k Structur e If ore: Tre
| 2. The FP-tree is mine cursively, dl
d by calling FP growth. FP ri
follows : tree, null/, which is imPl Fi ail
ementey 7A Ex:
Procedure FP —fro
wth (Tree, q) : 7 -
(1) if Tree contains a single
path P then Vee a
(2) for each combination (de
noted as B) of the nodes in the
path P
(3) generate pattern BUo. with support_count = minimum
(4) else for each a; in the
header of Tree {
d
(5) generate pattern B= :
a,U awith support_coun
t = a;,Support_count In. =
;
(6) construct B’s condit =
ional Pattern base and then Step 1: F
B’s conditional FP_tree
(7) if TreeB4 TreeB;
then .
(8) call FP_growth (Tree, B):)
Analysis
Scanned by CamScanner
ge
¢ case Scenario : Every transaction has a distinct set of items, i.c. no common items.
wors
: pp-tree size is as large as the original data,
0
FP-Tree storage is also higher, it needs to store the pointers between the nodes and
the counter.
ing
p-Tree size is dependent on the order of the items. Ordering of items by decreas
suF port will
not always result in a smaller FP -Tree size (it’s heuristic).
i example of FP Tree
61.
a, b,c, f,1,m,0
b, f, h, j, Oo fF
b,c, k, S, P i
a,f,c,e,1,p,m,n A
soln.
1; Find the minimum support of each item. | |
step ‘ a Ps
3 id
yet
|
|
GEERT EEE
th ceed
Scanned by CamScanner
Scanned by CamScanner
wpe insertion of Second transagi f
18+” ggction TH POMUNE 10
“gon T is pointi the root yyy ap ,
f a” . Header table
os
Scanned by CamScanner
(iv) Similarly cons
ider the next item a.
Header table
tenes
Scanned by CamScanner
jns
itt of third transaction (f, )
f° jp Header table
_ aster
f”
Tiss
the final FP-Tree.
Scanned by CamScanner
a
¥ Data Warehousing & Mining (MU-Som. 6-Comp. 5-40 Mining Freq. Patterns& Assn, Patt
fr
o For each item, conditional pattern-base is constructed, and then jt’s Condit,
FP-tree.
© Oneach newly created conditional FP-tree, repeat the process.
© The process is repeated until the resulting FP-tree is empty, or it has Only a sin
path(All the combinations of sub paths will be generated through that single path
each of which is a frequent pattern).
Example : Finding all the patterns with ‘p’ in the FP tree given below
Header table
(f:4,¢:3,a:9
m:2,p: 2) and
(c:1,b:1, p:1)
process these
further
Scanned by CamScanner
2
4, 6:38 at3, m2, Pp pi2) DCT
and (4 4. pty
bel,
wot actions containing 'p’ have p.count
oo awe nave (£:2, c: 12, a:2, m:2, p:2) and (e:1,
0
be
vee pl
qe spar of these we can remove ‘py’
js can
ioe ain filter away all items < minimum support threshold (ie, 3)
again
0 we 9, a:2, m: 2), (c:1, b:1) => (¢:3)
og,
ovwe erate (cp:3) paterns containingitem po we
: we are finding frequentthreshold
(Noteitem
5 r toc as ¢ is only that has min support .) —
P
apP endevil is taken from the sub-tree
. «
requent ee . band 7
Freq
prample -
- 4,6! 3a: 3
Bi
besae ee : 2) and
(f:4,0:3.8%
ae b:4,m:1)
», ap
Scanned by CamScanner
AA Data Warehousing & Mining (MU-Sem. G-Comp
-
.) _5-42 Mining Froq. Patterns 2, Aany p
2 try,
© In the above transaction we need to consider
m:2, based on
this we get f:2 and so on. Exclude p as we
don’t want p GD Ty
i.e. given in example.
O Path 2: (f:4, ¢:3, a:3, b:1, m:1) — (fl, c:1, ail, bi)
~ Build FP wee using (f:2, ¢:2, a:2) and (f:1, c:1, a:1, b:1)
Ca:3)
— Now we got ( £:3, ¢:3,
a:3 ,b:1)
- Initial Filtering removes b:1 (We again filter away all items < minimum a
threshold).
f
ee abi2 ee *asa
i ae ..o8hies / ota
; i b+ ne ‘s
! ou ei “ort bi4 e:4
‘, | “ Z
A cl 1
est
Scanned by CamScanner
vy a
sneer |
7200 | 12, 14
| T300| 12, 1g
—___| I, 12, I4
T500 | 11,13
| 7600| 12, 13
| 7700 | 11, 13
——__| l1, l2, 13, 5
T900 | 1 , 12, 13
Min support = 2 a a
Soln. :
| Support Count
7
12
6
i
I3 6
‘Ths, ;
15
2
for D.
h eT = malian
ou ee
———-—£°
Scanned by CamScanner
Scanned by CamScanner
rehousing & Mining (MU-Sem, g C »O* omp.)
for D
jaittonal Pattern base
(0
P = {(A:1,B:1,C:1),
(A:1,B:1),
(A:1,C:1),
(A:1),
— @B:1,C:1))
We have the following paths with ‘D’
(AA.:11,,BD.:11)),
p = {((A:1,B:1,C’ :1),AW (A:1,C:1 :
so append BC wale
e BCD: 1
We generat oo
FP-growth ACD, BCD. which are
apply , BD
6 Re cursively w i t h s up > Ip AD
u n d (
Ttemsets fo
-, Frequent
.
n c o n d i t i onal node D
o
oO
from CPB
°g°en erated
Scanned by CamScanner
| name and TID- transactional
set is the set data is repres
of transactions ented as item
- TID-set, Item
Oo Eg. of Vert co nt aining that it is the item
ical Data Fo em,
rm at
Scanned by CamScanner
I TD_sets of the fj :
Sets, Th Tequent k-itemsets to
a
2
: found . ieee & Stopped when
i
: Introd
uc tion oO n i
eT Ctlon to Mining g Mult eve Asso
cia
5 9 Introduction to
on ules
Mining Multilev
e| Association
Rules
> (MU- May 2012,
May 2013, Dec. 2013, Dec
Items are always in the . 201 4, May 2016)
form of hierarchy,
Items which are at leaf nod
es are having lower support.
An item can be either gene
ralized or specialized as per the described hierarchy of
and its levels can be powerfully preset in that item
transactions,
Rules which combine associations with. hierarchy of concepts are called
Multilevel
Association Rules.
em Depa rtment Foodstuff.
A : ! ‘| |. Dairy— - Ete...
fe Family _ Vegetable] | _ Fru
hy of concept
Fig.5 9.1: Hierarc
Scanned by CamScanner
a ata Warehousing& Mining (MU
-Sem.
Support and confidence of mul rely FLO. Patios 0 bay # pata Waret
tilevel association rules
20 ~ The Support and There are
Specialization value ofconattr
fidence
ibutes.
of an item is affected due to its Benetalizay:,
The support of &ner evel-by-
alized item is more than the support
of Specializeg item,
= iJ
Similarly the Suppor — Itsa
t of rules incre. ases from specialized to generaliz
eg itemsets,
If the Support is below the threshold w ‘The
value then that rule becomes invatig,
Confidence is not affected for general or specialized. node
Level 2
min support = 5%
Scanned by CamScanner
qhere are 4 se
arch Strategi
es i
evel-by-leve
l independen
t
_ Itsa full-bread
iy Search
The pae rent No
= Methog,
de
node is examin is 4 cc
e, Ked Whether it’
a Level-cross
filtering by
i sin
The children 2le item,
of Only freque
nt node.
ii) Level-cross fi
ltering by k. “i
‘S
temset
"XaMing
the
- Find the freq
; uent k ite
© item,S of - Only the
k itemse
t at nex:
May &et Miss (iv) Controlled
eg
level-crogg fi
Ciation
Tulesre
are
- So the item
s : Which do
threshold this pn Ot Satii sfy
is also Ca Minim
lled “Leyey Pag
“ SUpport are Ch
ecked for Minimu
m Support
Scanned by CamScanner
las
. : f possible values an
- Categorical attributes ; ;
This have finite number of po s there 8 pp
ordering among values, Example : brand, color.
- Quantitative attributes : These are numeric values and there is implicit Ordering
wig
values, Example : age, income.
ceicttalinsiiieais
Techniques for Mining MD Associations
0-D
Cuboid
(Gender)
pal
Cubold
Cuboid
Scanned by CamScanner
- If we are interested in association rules where two quantitative attributes appear on the left-hand side and one categorical attribute on the right-hand side, the rules are of the form :

A_quantitative1 ∧ A_quantitative2 → A_categorical

e.g. the age and salary of a customer and the type of the phone that the customer buys :

age(CUST, "30-34") ∧ salary(CUST, "24K - 48K") ⇒ buys(CUST, "Sony Xperia Z")

- The approach used to find such rules is the Association Rule Clustering System (ARCS) :

[Figure : a 2-D grid with Salary bins (<20K, 20-30K, 30-40K, 40-50K, 50-60K, 60-70K) on the vertical axis and Age (32 to 38) on the horizontal axis]
Fig. 5.10.2 : A 2-D grid for tuples representing customers who purchase Sony Xperia Z

- The following four CUSTs correspond to the rules :
o age(CUST, 34) ∧ salary(CUST, "30-40K") → buys(CUST, "Sony Xperia Z")
o age(CUST, 35) ∧ salary(CUST, "30-40K") → buys(CUST, "Sony Xperia Z")
o age(CUST, 34) ∧ salary(CUST, "40-50K") → buys(CUST, "Sony Xperia Z")
o age(CUST, 35) ∧ salary(CUST, "40-50K") → buys(CUST, "Sony Xperia Z")
- As the above rules are close to each other, they can be clustered together to form the following rule :

o age(CUST, "34-35") ∧ salary(CUST, "30-50K") → buys(CUST, "Sony Xperia Z")
Limitations of ARCS

- It works only for quantitative attributes on the LHS of rules.
- It is limited to 2-D, i.e. only two attributes on the LHS, so it doesn't work for more dimensions.

3. Distance-based association rules

- This is a dynamic discretization process that considers the distance between data points.
- This mining process has only two steps :
o Perform clustering to find the intervals of the attributes involved.
o Obtain association rules by searching for groups of clusters that occur together.
- The resultant rules of this method must satisfy :
o Clusters in the rule antecedent are strongly associated with clusters of rules in the consequent.
o Clusters in the antecedent occur together.
o Clusters in the consequent occur together.
5.11 University Questions and Answers

May 2010
Q. : Consider the transaction database given below. With minimum support count 2, generate all frequent itemsets and find the association rules with minimum confidence ≥ 70 %. (Ans. : Refer Ex. …)

TID | Items
100 | I1, I2, I4
200 | I1, I3
300 | I2, I3
400 | I1, I2, I4
500 | I1, I3
600 | …
700 | …
800 | …
900 | …

May 2011
Dec. 2011
Q. : For the following transaction database :

TID | Items
01 | A, B, C, D
02 | A, B, C, D, E, G
03 | A, C, G, H, K
04 | B, C, D, E, K
05 | …
06 | A, B, C, D, L
07 | B, E, …, K, L
08 | A, B, D, E, K
09 | A, E, F, H, L
10 | B, C, D, F

Apply the Apriori algorithm with minimum support of 30 % and minimum confidence of 70 %, and find all the association rules in the data set. (Ans. : Refer Ex. 5.4.9) (10 Marks)

Q. 9 : Define multidimensional and multilevel association mining. (Ans. : Refer sections 5.9 and 5.10) (10 Marks)
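For reference, a minimal Apriori sketch is given below; the four transactions shown are only an assumed subset of the question's data set, and min_sup would be 3 (30 % of 10 transactions, rounded up) on the full data.

    transactions = [
        {'A', 'B', 'C', 'D'},
        {'A', 'B', 'C', 'D', 'E', 'G'},
        {'A', 'C', 'G', 'H', 'K'},
        {'B', 'C', 'D', 'E', 'K'},
    ]
    min_sup = 2   # use 3 over the full 10-transaction data set

    def support(itemset):
        # Number of transactions containing the candidate itemset.
        return sum(itemset <= t for t in transactions)

    items = sorted({i for t in transactions for i in t})
    level = [frozenset([i]) for i in items if support(frozenset([i])) >= min_sup]
    frequent = list(level)

    k = 2
    while level:
        # Join step: union pairs of frequent (k-1)-itemsets into k-candidates;
        # the prune step keeps only those meeting minimum support.
        candidates = {a | b for a in level for b in level if len(a | b) == k}
        level = [c for c in candidates if support(c) >= min_sup]
        frequent += level
        k += 1

    for itemset in frequent:
        print(sorted(itemset), support(itemset))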
TID | Items
01 | …
02 | 2, 3, 5, 7
03 | 1, 2, 3, 5, 8
04 | 2, 5, 9, 10
05 | 1, 4

Apply the Apriori Algorithm with minimum support of 30 % and minimum confidence of 75 %, and find the large item set L. (Ans. : Refer Ex. 5.4.4) (10 Marks)
May 2014

Q. 14 : Discuss association rule mining and the Apriori algorithm. Apply AR mining to find all frequent item sets and association rules for the following dataset :
Minimum support count = 2
Minimum confidence = 70 % (Ans. : Refer Ex. 5.4.3) (10 Marks)
Q. : Write a short note on : Multidimensional association rules. (Ans. : Refer sections 5.9 and 5.10)

Chapter Ends
Module 6

Syllabus :

Spatial Data, Spatial Vs. Classical Data Mining, Spatial Data Structures, Mining Spatial Association and Co-location Patterns, Spatial Clustering Techniques : CLARANS Extension, Web Mining : Web Content Mining, Web Structure Mining, Web Usage Mining, Applications of Web Mining.
- Useful in certain applications where mining uncovers unexpected knowledge.
o Example : Shutting off an identified water pump saved human lives.
- How is the health of planet Earth ?
o Example : Characterize the effects of human activity on environment and ecology.
o Example : Characterize water delivery.
o Example : Predict the effect of weather and the economy.
- Traditional approach : Manually generate and test hypotheses; but spatial data is growing too fast to analyze manually.
o Satellite imagery, GPS tracks, sensors on highways.
- The number of possible geographic hypotheses is too large to explore manually.
o Large number of geographic features and locations.
o The number of interacting subsets of features grows exponentially.
o Ex. Find teleconnections between weather events across ocean and land areas.
- SDM may reduce the set of possible hypotheses.
o Identify hypotheses supported by the data.
o For further exploration using traditional statistical methods.
6.1.4 Spatial Data Mining : Actors

- Domain Expert :
o Identifies SDM goals and the spatial data set.
o Describes domain knowledge, e.g. well-known patterns such as correlates.
o Validation of new patterns.
- Data Mining Analyst :
o Helps identify pattern families and the SDM techniques to be used.
o Explains the SDM outputs to the Domain Expert.
- It is a joint effort.

6.1.5 Characteristics of Spatial Data Mining

- Auto correlation.
- Patterns usually have to be defined in the spatial attribute subspace and not in the complete attribute space.
- Objects refer to points, but can also refer to lines and regions.
- The space is large and continuous, defined by the spatial attributes.
- Regional knowledge and the geography of a particular area are important (spatial heterogeneity).
- Spatial relationships are implicit.

Classical data mining | Spatial data mining
Input : simple types | Complex types; spatial indexing
Statistical foundation : independence of samples | Spatial autocorrelation
Output : set-based | Spatial-based
Algorithm : generic divide and conquer | MBR, spatial indexing, plane sweeping, hierarchical structures, computational geometry
Spatial data consists of spatial objects made up of points, lines, regions, rectangles,
surfaces, volumes, and even data of higher dimension which includes time. Examples of
Spatial data include cities, rivers, roads, counties, states, crop coverages, mountain ranges,
- Rules for the R-tree are similar to those for the B-tree.
- An example given by Hanan Samet is given below :
[Figure : (a) an R-tree for the collection of line segments, (b) the spatial extents of the bounding rectangles]
Fig. 6.3.2
R-tree is not unique, rectangles depend on how objects are inserted and deleted from the
tree.
- The problem is that to find some object you might have to go through several rectangles or the whole database.
6.3.2 R+-tree

- The R+-tree and Cell Trees use the approach of decomposing space into cells; the R+-tree deals with collections of objects bounded by rectangles.
- The Cell tree deals with collections of objects bounded by convex polyhedra.
- The R+-tree is an extension of the k-d-B-tree.
- It tries not to overlap the rectangles.
- If an object is in multiple rectangles, it will appear multiple times.

[Figure : R+-tree node layout over rectangles R1, R2, R6, ...]
6.3.3 More Spatial Indexing

- Uniform Grid
- A spatial association rule is of the form P → Q (s%, c%), where at least one of the predicates P or Q is a spatial predicate.
- A rule is strong if the support of the predicate "P → Q" in the set S is not less than the minimum support threshold and its confidence is not less than the minimum confidence threshold.
6.4.2 Mining Co-location Patterns

- Co-location patterns capture spatial events that frequently occur together.
- For example :
o Ecology : Symbiotic relationships in animals or plants.
o Public health : Environmental factors and cancers.
o Public safety : Crime generators and crime events.
- A co-location pattern is a subset of spatial event types whose instances frequently occur together.
- Consider the following example :

[Figure : spatial event instances, where A.1 represents instance 1 with feature type A, and lines between instances represent neighbour relationships among a set of instances, e.g. A.1, B.1, C.2]
- From the above example,
o Spatial event types are A, B, C.
o Spatial event instances are A.1, A.2, A.3, B.1, B.2, ...
o Candidate co-locations are the combinations of event types (A, B), (B, C), (A, C), ...
o A neighbour relationship is represented by a solid line between event instances, e.g. (A.1, B.1), (A.1, C.2), (A.3, C.1), ...
o Table instance of (A, B), i.e. the direct relationships between events A and B from the graph :
(A.1, B.1)
(A.2, B.4)
(A.3, B.4)
- Interestingness measures are defined by the participation ratio (pr) and the participation index (pi).

1. The participation ratio is defined for the co-location pattern c = (f1, f2, ..., fk) as :

pr(c, fi) = |distinct instances of fi in Table.Instance(c)| / |Table.Instance(fi)|

2. The participation index for a given c can be calculated as :

pi(c) = min_i { pr(c, fi) }

- Example : Find pi.

Table instance for each pattern -
- The table instance of (A, B, C) is as given below, because only A.3, B.4, C.1 have direct relationships :

(A.3, B.4, C.1)

pr((A, B, C), C) = 1/3
pi(c) = min(1/4, 1/5, 1/3) = 1/5
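The computation can be sketched as below; the instance totals (4 A's, 5 B's, 3 C's) are read off the example above, and the rest is illustrative.

    n_instances = {'A': 4, 'B': 5, 'C': 3}

    # Table instance of the candidate co-location (A, B, C) from the example.
    table_instance = [('A.3', 'B.4', 'C.1')]

    def pr(pattern, feature):
        # Fraction of `feature` instances participating in the pattern.
        col = pattern.index(feature)
        distinct = {row[col] for row in table_instance}
        return len(distinct) / n_instances[feature]

    def pi(pattern):
        # Participation index: the minimum participation ratio.
        return min(pr(pattern, f) for f in pattern)

    pattern = ('A', 'B', 'C')
    print([pr(pattern, f) for f in pattern])   # [0.25, 0.2, 0.333...]
    print(pi(pattern))                         # 0.2, i.e. 1/5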
Co-location Mining Algorithm

- Start with k = 1.
- Iterate until no prevalent pattern remains :
o Generate the size-k co-location patterns {ck}.
o Generate the table instance of each ck.
o Compute each pi(ck); add ck to the result if it is prevalent.
o k = k + 1.

[Figure : the candidate lattice - k = 1 : A, B, C; k = 2 : AB, AC, BC; k = 3 : ABC]
pi(T4) = … ; pi(T5) = 0.5
- Generate the coarse table instance of each remaining ck ∈ Ck.
- Compute each pi(ck) based on the coarse resolution; if below the threshold, prune out ck.
- For each remaining ck, compute each pi(ck); if above the threshold, add ck to the set Pk.
6.5 Spatial Clustering Techniques - CLARANS

CLARANS (A Clustering Algorithm based on Randomized Search)

- Clustering is a descriptive task that seeks to identify homogeneous groups of objects based on the values of their attributes (Ester, M., Frommelt, A., Kriegel, H.-P. and Sander, J., 1998). In spatial data sets, clustering permits a generalization of the spatial component like explicit location and extension of spatial objects which define implicit relations of spatial neighbourhood.
o CLARANS improves on CLARA by using multiple different samples.
o For every step of the search, CLARANS draws a sample neighbour.
o So this does not confine the search to a localized area.
o It uses two additional parameters, numlocal and maxneighbor.
o Numlocal indicates the number of samples to be taken.
o Maxneighbor is the number of neighbors of a node to which any specific node can be compared.

Algorithm CLARANS given by Jiawei Han

1. Input parameters numlocal and maxneighbor. Initialize i to 1, and mincost to a large number.
2. Set current to an arbitrary node.
3. Set j to 1.
4. Consider a random neighbor S of current, and based on S, calculate the cost differential of the two nodes.
5. If S has a lower cost, set current to S, and go to Step 3.
6. Otherwise, increment j by 1. If j ≤ maxneighbor, go to Step 4.
7. Otherwise, when j > maxneighbor, compare the cost of current with mincost. If the former is less than mincost, set mincost to the cost of current and set bestnode to current.
8. Increment i by 1. If i > numlocal, output bestnode and halt. Otherwise, go to Step 2.
Steps 3 to 6 above search for nodes with progressively lower costs. But, if the current node has already been compared with the maximum number of the neighbors of the node (specified by maxneighbor) and is still of the lowest cost, the current node is declared to be a "local" minimum. Then, in Step 7, the cost of this local minimum is compared with the lowest cost obtained so far. The lower of the two costs above is stored in mincost. Algorithm CLARANS then repeats to search for other local minima, until numlocal of them have been found.
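A compact sketch of these steps is given below; the cost function (1-D distance to the nearest medoid), the toy data and the parameter values are all assumptions, not part of Han's formulation.

    import random

    def cost(medoids, points):
        # Total distance of each point to its nearest medoid (1-D for brevity).
        return sum(min(abs(p - m) for m in medoids) for p in points)

    def clarans(points, k, numlocal, maxneighbor):
        best_node, mincost = None, float('inf')
        for _ in range(numlocal):                  # Steps 1-2: random start node
            current = random.sample(points, k)
            j = 1
            while j <= maxneighbor:
                # Step 4: a random neighbour differs from current in one medoid.
                neighbor = list(current)
                neighbor[random.randrange(k)] = random.choice(points)
                if cost(neighbor, points) < cost(current, points):
                    current, j = neighbor, 1       # Step 5: move, reset j
                else:
                    j += 1                         # Step 6: try another neighbour
            # Step 7: current is a local minimum; keep the cheapest seen so far.
            if cost(current, points) < mincost:
                mincost, best_node = cost(current, points), current
        return best_node, mincost                  # Step 8: output bestnode

    points = [1, 2, 3, 10, 11, 12, 25, 26]
    print(clarans(points, k=3, numlocal=4, maxneighbor=20))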
Syllabus Topic : Web Mining

6.6 Web Mining
-
on 5, calculate the i¥ Data Warehousing
cosy diff.erentin = & Mining (MU-Som,
Me - ey ithelps ini solving
i the Problem
robs no of how,
_ The Process involves mining
es_
Lips ing HSIN the we sig ==
p 4. : .
nen
urent With mincost If th _ It is the process
web data.
of qj i
re oF gs Meanin
seton
bestnode to curren covering the "seful and Previous
t. = forme eee
| ISe, &0 go 1, to Step 2
- Web data is - “
o ” = anal
Wer costs. Bur Web content — text, nom
if the im:
of the Neighbors
> current node of the
ec
minimum is compares
wok the ogret Sti
o Web usage
~ —
~ pt ei
ove is'wtorea 2
ais a
to nd 5s, 1 —
|
tp logs, 4pp server logs,
etc, |
6.6.1 Web Mining is Different from Classical Data Mining

- Web Mining is similar to data mining.
- It differs in data collection.
- Data Mining : The collection of data is already done and stored in a data warehouse.
- Web Mining : Data collection is done by crawling through a number of target web pages.

6.6.2 Benefits of Web Data Mining

- Match your available resources to visitor interests.
- Increase the value of each visitor.
- Improve the visitor's experience at the website.
- Perform targeted resource management.
- Collect information in new ways.
- Test the relevance of content and web site architecture.
6.7.1 Introduction to Web Content Mining

- The process to discover useful information from the content of a web page.
o Relevance
o Novelty
o Interestingness
- Typical text mining tasks include :
o Text categorization
o Text clustering
o Concept/entity extraction
- Very little or no analysis of the data relationships among the accessed files and directories within the web space is provided by these tools.
- The tools for discovery and analysis of patterns may be classified into the following two categories :
- Web servers, Web proxies, and client applications can quite easily capture Web Usage data.
- Web Server Log : It is a file that is created by the server to record all the activities it performs.
- For example, when a user enters a URL into the browser's address bar or makes a request by clicking on a link.
- The page request sent to the web server maintains the following information in its log :
o Information about the URL.
o Whether the request was successful.
o The user's IP address.
o The time and date at which the page request was completed.
o The referrer header.
- Web Mining systems use the web usage data and, based on the analysis, the system can discover useful knowledge about a system's usage characteristics and users' interests, which finds its application in :
o Personalization and Collaboration in Web-based systems,
o Marketing,
o In order to improve the performance of the website, web usage logs can be used to extract useful web traffic patterns.
- Search engine transaction logs also provide valuable knowledge about user behaviour on Web searching.
- Such information is very useful for a better understanding of users' Web searching and information seeking behaviour, and can improve the design of Web search systems.
- Web usage mining is useful in finding interesting trends and patterns which can provide important knowledge about the users of a system.
- Several machine learning and data mining techniques, e.g. association rule mining, classification and clustering, can be applied.
- Web usage data provides an extremely useful way to learn users' interests.
- Web logs may be used in web usage mining to help identify users who have accessed similar web pages. These patterns may be useful in web searching and filtering.
- For e.g., Amazon.com uses collaborative filtering for recommending books to their customers based on the data collected from other customers with similar interests or purchasing history.
6.8.3 Web Usage Mining Activities
Pre-processing Web log :
o Cleanse.
o Sessionize.
Session : All the requests that a single client makes to a web server.
Pattern Discovery :
- Pattern Analysis :
o Once the patterns have been identified, they must be analyzed to determine how that information can be used.
o A process to gain knowledge about how visitors use the Website in order to :
(i) Prevent disorientation and help designers to place important information/functions exactly where the visitors look for them and in the way users need it.
(ii) Build up an adaptive Website server.
o The IP address of the remote host, e.g. 152.152.98.11.
o The name of the remote user (usually omitted and replaced by a dash '-').
[Two sample web-server log entries : a hit by the Googlebot/2.1 crawler, and a visitor request of 08/Oct/2007:11:17:55 -0400 with a "GET / HTTP/1.1" request line, a Google-search referrer for "log analyzer" (rls=org.mozilla:en-US) and a Mozilla/5.0 Gecko/20070914 Firefox/2.0.0.7 user agent]

The above lines show :

1. A Google spider, which retrieved and indexed the pages for their search engine.
2. Another visitor, from IP address 111.111.111.111, who gave a search for Nihuo Web Log Analyzer and reached the Log Analyzer homepage.
A few things to note :

1. A single hit is represented by each line in the file, for a file on the web server.
2. A web page hit is a page view. For e.g., if a web page contains 4 images, then a hit on that page will generate 5 hits on the web server : one hit for the web page and 4 for the images.
3. The IP address or a cookie is used to determine a unique visitor. A visit session is terminated if the user falls inactive for more than 30 minutes. So if a unique visitor visits the same website twice, then it gets reported as two visits.
Description of fields

111.111.111.111 - - [08/Oct/2007:11:17:55 -0400] "GET / HTTP/1.1" 200 10801
"http://www.google.com/search?q=log+analyzer&ie=utf-8&oe=utf-8&aq=t&rls=org.mozilla:en-US:official&client=firefox-a"

1. IP address : "111.111.111.111"
- This is the IP address of the visitor who visited the Nihuo Web Log Analyzer site.
- The identity field will return a dash until Identity Check is set on the web server.
- The remote-user field is available only when accessing content which is password protected by the authentication system of the web server.
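A minimal sketch of splitting such an entry into its fields is shown below; the regular expression follows the Combined Log Format, and the sample line (with a shortened referrer and user agent) is an assumption for illustration.

    import re

    line = ('111.111.111.111 - - [08/Oct/2007:11:17:55 -0400] '
            '"GET / HTTP/1.1" 200 10801 '
            '"http://www.google.com/search?q=log+analyzer" "Mozilla/5.0"')

    pattern = re.compile(
        r'(?P<ip>\S+) (?P<ident>\S+) (?P<user>\S+) \[(?P<time>[^\]]+)\] '
        r'"(?P<request>[^"]*)" (?P<status>\d{3}) (?P<size>\S+) '
        r'"(?P<referrer>[^"]*)" "(?P<agent>[^"]*)"')

    m = pattern.match(line)
    if m:
        fields = m.groupdict()
        print(fields['ip'])        # 111.111.111.111
        print(fields['time'])      # 08/Oct/2007:11:17:55 -0400
        print(fields['referrer'])  # the search that led the visitor here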
- Access request : "GET / HTTP/1.1"
- Referrer : "http://www.google.com/search?q=log+analyzer&ie=utf-8&oe=utf-8&aq=t&rls=org.mozilla:en-US:official&client=firefox-a"
- User Agent
The structure of a typical Web graph consists of Web pages as nodes, and hyperlinks as
edges connecting between two related pages.
[Figure : a Web graph - Web documents as nodes connected by hyperlink edges]
- The process of using graph theory to analyze the node and connection structure of a web site. Web structure mining can be divided into two kinds :
o Extract patterns from hyperlinks in the web. A hyperlink is a structural component that connects the web page to a different location.
o Mining the document structure, i.e. using the tree-like structure to analyze and describe the HTML or XML tags within the web page.
- Web structure mining has been largely influenced by research in :
o Social network analysis.
o Citation analysis (bibliometrics).
Web Structure Terminology

[Example : out-degrees of pages, e.g. OutDeg(P2) = 1, OutDeg(P3) = …]
6.9.2(B) CLEVER Technique

- Identify authoritative and hub pages by creating weights.

(i) Authoritative Pages :
o Highly important pages.
o Best source for the requested information.
(ii) Hub Pages :
o Contain links to highly important pages.
- Hubs and authorities are 'fans' and 'centers' in a bipartite core of a web graph.
- A good hub page is one that points to many good authority pages.
- A good authority page is one that is pointed to by many good hub pages.

[Figure : hub pages pointing to authority pages]
HITS Algorithm (Hyperlink-Induced Topic Search)

- The HITS (Hypertext Induced Topic Search) algorithm is a kind of rank algorithm that analyzes Web resources based on local links.
- The difference between Page Rank and HITS is that HITS is related to the query, and Page Rank is a query-unrelated algorithm. The Page Rank algorithm gives each page a rank value which is unique and unrelated to the query keyword, but the HITS algorithm gives each page two values, which are the Authority value and the Hub value.
- Authority page and Hub page are two important concepts in the HITS algorithm, and both are concepts related to the query keyword.
- An Authority page refers to a page that is most related to the query keyword and combination.
- A Hub page is a page that includes multiple Authority pages. The Hub page itself may not have a direct relation to the query content, but through it, the Authority pages with a direct relation can be linked.
Analysis of HITS Algorithm
- Then, HITS performs link analysis on this subset, and finds out the Authority pages and Hub pages related to the query in the subset.
- The selection of the subnet in the HITS algorithm is acquired by means of keyword matching. This subnet is defined as the root set R; link analysis is then used to acquire the set S from the root set R. S is the page set that includes the Authority pages and ultimately meets the query requirements. The process from R to S is called "Neighborhood Expand".
- The algorithm procedure for computing S is shown below :

Step 1 : Use text keyword matching to acquire the root set R, which includes thousands of URLs or more;
Step 2 : Define S to be R, that is, S and R are equal;
Step 3 : For each page p in R, put the hyperlinks included by p into set S; put the pages referring to p into set S;
Step 4 : S is the acquired expanded neighbourhood set.

Analysis of HITS Link Structure
- In step I, the Authority value of each page is the sum of the Hub values of the pages pointing to it; in step O, the Hub value of each page is the sum of the Authority values of the pages it points to. That is :

I : a_i = Σ_{j ∈ B(i)} h_j
O : h_i = Σ_{j ∈ F(i)} a_j

where B(i) is the set of pages pointing to page i, and F(i) is the set of pages that page i points to.
- Steps I and O are based on the fact that one Authority page is always referred to by many Hub pages, and one Hub page includes many Authority pages.
- The algorithm iteratively computes the two steps, I and O, till they converge. At last, a_i and h_i are the Authority and Hub values of page i.
- Fig. 6.9.3 shows the application of the HITS algorithm on a subgraph including 6 nodes.

[Figure : HITS values on a 6-node subgraph, e.g. h = 0.408, a = 0 at one node]
Fig. 6.9.3
- As shown in Fig. 6.9.3, the authority value of node 5 is equal to the sum of the Hub values of nodes 1 and 3 which refer to it; after normalization, the value is 0.816.
- Assume A_{n×n} is the adjacency matrix of the subgraph; the value at position (i, j) in matrix A is 1 if page i refers to page j, and 0 otherwise. Set a to be the authority value vector [a_1, a_2, ..., a_n] and h to be the hub value vector [h_1, h_2, ..., h_n]; then the iterations I and O can be expressed as a = A^T h, h = A a. After completing the iteration, the authority and hub values respectively satisfy a = c_1 (A^T A) a and h = c_2 (A A^T) h, where c_1 and c_2 are constants introduced to satisfy the normalization condition. Thus, vector a and vector h respectively become the eigenvectors of matrix A^T A and matrix A A^T. This feature is similar to Page Rank; the convergence speeds are decided by the eigenvectors.
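The I/O iteration above can be sketched directly with a power iteration; the 6-node adjacency matrix below is a toy assumption (A[i][j] = 1 means page i links to page j), not the book's Fig. 6.9.3 graph.

    import numpy as np

    A = np.array([
        [0, 1, 0, 0, 1, 0],
        [0, 0, 1, 0, 0, 0],
        [0, 0, 0, 1, 1, 0],
        [0, 0, 0, 0, 0, 1],
        [0, 0, 0, 0, 0, 1],
        [0, 0, 0, 0, 0, 0],
    ], dtype=float)

    n = A.shape[0]
    a = np.ones(n)          # authority values
    h = np.ones(n)          # hub values

    for _ in range(50):
        a = A.T @ h         # I step: authority = sum of hubs pointing to the page
        h = A @ a           # O step: hub = sum of authorities the page points to
        a /= np.linalg.norm(a)   # normalize so the iteration converges
        h /= np.linalg.norm(h)

    print(np.round(a, 3))   # converges toward the eigenvector of A.T @ A
    print(np.round(h, 3))   # converges toward the eigenvector of A @ A.T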
6.10 Web Crawlers
o Crawler : Visits pages based on the crawler and distiller scores.

Context focused crawler

- Context Graph :
o A context graph is created for each seed document.
o The root is the seed document.
o Nodes at each level show documents with links to documents at the next higher level.
o Updated during the crawl itself.
- Approach :
o Construct the context graph and classifiers using seed documents as training data.

[Figure : context graph layers - Level 1, Level 2 and Level 3 documents around the seed document]
Chapter Ends
Q. 1(a) What is spatial data ? Outline their importance. (Section …) (5 Marks)

Ans. :

Spatial data are data that have a spatial or location component. Spatial data can be viewed as data about objects that themselves are located in a physical space. This may be implemented with a specific location attribute(s) such as an address or latitude/longitude, or may be based on a partitioning of the database based on location. Geographic Information Systems (GIS) are used to store information related to locations on the surface of the earth.
Q. 1(b) What is Metadata ? Why do we need metadata when search engines like Google seem so effective ? (Section 1.7) (5 Marks)

Ans. :

- Word-based …
Q. 1(c) In real-world data, tuples with missing values for some attributes are a common occurrence. Describe various methods for handling this problem. (Section 3.12.3) (5 Marks)

Q. 1(d) With respect to web mining, is it possible to detect visual objects using meta-objects ? (5 Marks)

Ans. :

It is possible to detect visual objects using meta-objects, as visual objects are represented by a small rectangle that completely contains that object, which is the minimum bounding rectangle (MBR).
Q. 2(a) Suppose that a data warehouse for Big University consists of the following four dimensions : student, course, semester, and instructor, and two measures count and avg_grade. When at the lowest conceptual level (e.g., for a given student, course, semester, and instructor combination), the avg_grade measure stores the actual course grade of the student. At higher conceptual levels, avg_grade stores the average grade for the given combination. (10 Marks)

(i) Draw a snowflake schema diagram for the data warehouse.
(ii) Starting with the base cuboid [student, course, semester, instructor], what specific OLAP operations (e.g., roll-up from semester to year) should one perform in order to list the average grade of CS courses for each Big University student ?

Ans. :

(i) A snowflake schema is as shown :
[Figure : snowflake schema - a central fact table (student_id, course_id, semester_id, instructor_id, count, avg_grade) with dimension tables : student (student_id, student_name, area_id, major, status, university), area (area_id, city, province, country), course (course_id, course_name, ...), semester (semester_id, semester, ...), instructor (instructor_id, dept, ...)]
- Data warehouses are carefully designed databases that hold integrated data for secondary use. If you only need data from one system, but can't impact the performance of that system, then it is suggested to take a copy, i.e. a replicated data store; unless the depth and width of the data is very large and the various ways users require access to it are many, a data warehouse would be overkill.
- A replicated data store is a database that holds schemas from other systems, but doesn't truly integrate the data. This means it is typically in a format similar to what the source systems had. The value in a replicated data store is that it provides a single source for resources to go to in order to access data from any system without negatively impacting the performance of that system.
- Data replication is simply a method for creating copies of data in a distributed environment.
- Replication technology can be used to capture changes to source data.
o Synchronous replication : Synchronous replication is used for creating replicas in real time. In synchronous replication, data is written to the primary storage and the replica simultaneously. The primary copy and the replica should always remain synchronized.
o Asynchronous replication : It is used for creating time-delayed replicas. In asynchronous replication, data is written to the primary storage first and then copied to the replica.
- Synchronous replication is best suited for a data warehouse as it creates replicas in real time.
Q. 3(a) The following table consists of training data from an employee database. The data have been generalized. For example, "31...35" for age represents the age range of 31 to 35. For a given row entry, count represents the number of data tuples having the values for department, status, age, and salary given in that row. (10 Marks)
Department | Status | Age | Salary | Count
sales | senior | 31...35 | 46K...50K | 30
sales | junior | 26...30 | … | …
sales | junior | 31...35 | … | …
… | … | … | … | …

- If an attribute has many values, the information gain is biased towards attributes with a large number of branches. So, the gain ratio considers the number and size of branches when choosing an attribute to split.
- Intrinsic information (split information) :

SplitInfo(S, A) = - Σ_i (|S_i| / |S|) log2 (|S_i| / |S|)

where p_i = |S_i| / |S| is the relative frequency of branch i of attribute A.
- The gain ratio normalizes the information gain :

GainRatio(S, A) = Gain(S, A) / SplitInfo(S, A)
[Worked computation : Gain, SplitInfo and GainRatio are evaluated over the 165 tuples for each attribute; recoverable values include a Gain of 0.0315, a SplitInfo of 0.9636 for one attribute, and a SplitInfo of 0.9513 with GainRatio = 0.2053 for another]

The biggest GainRatio is obtained for Department.
[Further worked computation for the subtree : e.g. Gain = 0.0815 - 0.051 = 0.0305 with SplitInfo = 0.5099 for one attribute, and Gain = 0.0815 - 0.0773 = 0.0042 with SplitInfo = 0.8863, giving GainRatio = 0.0042 / 0.8863 = 0.0047 for another]
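The quantities above can be computed with a short helper like the following sketch; the class counts passed in at the bottom are toy assumptions, not the question's 165-tuple data.

    import math

    def entropy(counts):
        total = sum(counts)
        return -sum(c / total * math.log2(c / total) for c in counts if c)

    def gain_ratio(parent_counts, branches):
        # branches: list of per-branch class-count lists for one attribute.
        total = sum(parent_counts)
        # Information gain: parent entropy minus weighted branch entropy.
        gain = entropy(parent_counts) - sum(
            sum(b) / total * entropy(b) for b in branches)
        # SplitInfo penalizes attributes splitting into many small branches.
        split_info = entropy([sum(b) for b in branches])
        return gain / split_info if split_info else 0.0

    # e.g. 100 'senior' vs 65 'junior' tuples split by a 3-valued attribute
    print(gain_ratio([100, 65], [[30, 10], [40, 40], [30, 15]]))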
Q. 3(b)

Ans. :

Fig. 6 - Q. 3(a) : Updated decision tree

When we connect the node Age in Salary = [40 - 50], we already arrive in leaf nodes for all the age classes. Therefore, this is our final tree (Fig. 7 - Q. 3(a)).
Q. : Consider (i) converting the decision tree to rules, and (ii) pruning the decision tree. What advantage does (i) have over (ii) ?

Tree Pruning

- Use a different data set than the training set to get the best pruned tree.
- The decision tree built may overfit the training data. The too many branches may reflect anomalies, some of which are noise.
- Pruning addresses this issue of overfitting the data by removing the least reliable branches.
Q. : Describe, with an example, how to compute the dissimilarity between objects described by the following : (10 Marks)

(i) Nominal attributes
(ii) Asymmetric binary attributes

(i) Nominal Attributes

- Nominal attributes are also called categorical attributes and allow for only qualitative classification.
- Every individual item has a certain distinct category, but quantification or ranking of the order of the categories is not possible.
- Arithmetic operations on nominal attributes cannot be performed. Typical examples of such attributes are : …
- The dissimilarity between two objects i and j described by nominal attributes is computed as :

d(i, j) = (p - m) / p

where m is the number of matches (the attributes on which i and j have the same value) and p is the total number of attributes describing the objects.

(ii) Asymmetric Binary Attributes

- If the outcomes of the states are not equally important, e.g. "handicapped or not handicapped", the binary attribute is asymmetric.
- The most important outcome is coded as 1 (present) and the other is coded as 0 (absent).
(10 Marks)
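Both measures can be sketched as below. The example objects are assumptions, and the asymmetric binary formula used, d = (r + s) / (q + r + s), is the standard definition in which the negative (0-0) matches are ignored; it is stated here as an assumption since the text above does not spell it out.

    def d_nominal(obj_i, obj_j):
        # d(i, j) = (p - m) / p for nominal attributes.
        p = len(obj_i)                                 # number of attributes
        m = sum(a == b for a, b in zip(obj_i, obj_j))  # number of matches
        return (p - m) / p

    def d_asym_binary(obj_i, obj_j):
        # q = 1-1 matches, r = 1-0 mismatches, s = 0-1 mismatches.
        q = sum(a == 1 and b == 1 for a, b in zip(obj_i, obj_j))
        r = sum(a == 1 and b == 0 for a, b in zip(obj_i, obj_j))
        s = sum(a == 0 and b == 1 for a, b in zip(obj_i, obj_j))
        return (r + s) / (q + r + s)

    print(d_nominal(['red', 'round', 'A'], ['blue', 'round', 'A']))  # 1/3
    print(d_asym_binary([1, 0, 1, 1], [1, 1, 0, 0]))                 # 3/4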
Transaction ID | Items
T1 | A, B, D, E
T2 | D, A, C, E, B
T3 | C, A, B, E
T4 | B, A, D
T5 | D
T6 | D, B
T7 | A, D, E
T8 | B, C

Ans. :

Step 1 : Count the support of each item.

Item | Support
A | 5
B | 6
C | 3
D | 6
E | 4

Step 2 : Consider items with min. support = 30 % = 3.

As all items have support ≥ 3, consider all items and arrange the transaction items in descending order of support, i.e. B, D, A, E, C.

Transaction ID | Items | Ordered items
T1 | A, B, D, E | B, D, A, E
T2 | D, A, C, E, B | B, D, A, E, C
T3 | C, A, B, E | B, A, E, C
T4 | B, A, D | B, D, A
T5 | D | D
T6 | D, B | B, D
T7 | A, D, E | D, A, E
T8 | B, C | B, C
Fig. 1 - Q. 5(a) : FP-Tree
Q. 5(b) Differentiate between single linkage, average linkage and complete linkage algorithms. Use the complete linkage algorithm to find the clusters from the following data :

[Figure : the sample points P1 ... P5 plotted on a 2-D grid]
- Single linkage : The minimum distance between clusters is considered.
- Complete linkage : The maximum distance between clusters is considered.

Initial distance matrix :

     | P1     | P2     | P3    | P4
P2   | 4      | 0
P3   | 11.704 | 8.06   | 0
P4   | 20     | 16     | 9.848 | 0
P5   | 21.54  | 17.988 | 9.848 | …

Distance matrix after merging P1 and P2 :

     | (P1,P2) | P3    | P4
P3   | 8.06    | 0
P4   | 16      | 9.848 | 0
P5   | 17.888  | 9.848 | …

Distance matrix after merging P4 and P5 :

        | (P1,P2) | P3
P3      | 8.06    | 0
(P4,P5) | 16      | 9.848

Distance matrix after merging (P1, P2) and P3 :

        | (P1,P2,P3)
(P4,P5) | 9.848
[Figure : dendrogram over P1, P2, P3, P4, P5]
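A minimal complete-linkage sketch in the spirit of Q. 5(b) is given below; the point coordinates are assumptions, since the question's data is not reproduced here.

    points = {'P1': (1, 1), 'P2': (1, 5), 'P3': (5, 4),
              'P4': (9, 9), 'P5': (11, 10)}

    def dist(a, b):
        return ((a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2) ** 0.5

    clusters = [[name] for name in points]
    while len(clusters) > 2:
        # Complete linkage: cluster distance = MAXIMUM pairwise point distance.
        i, j = min(
            ((i, j) for i in range(len(clusters))
                    for j in range(i + 1, len(clusters))),
            key=lambda ij: max(dist(points[a], points[b])
                               for a in clusters[ij[0]]
                               for b in clusters[ij[1]]))
        clusters[i] += clusters.pop(j)   # merge the closest pair of clusters
        print(clusters)                  # shows each merge step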
- In theory this knowledge can be acquired with data query processing or simple statistical analysis, but it would require a considerable amount of manual work by expert market analysts, both in order to decide which queries to use or how to interpret the statistics, and due to the huge amount of data.
- A department store, for example, can use data mining to assist with its target marketing mail campaign. Using data mining functions such as association, the store can use the mined strong association rules to determine which products bought by one group of customers are likely to lead to the buying of certain other products. With this information the store can then mail marketing materials only to those kinds of customers who exhibit a high likelihood of purchasing additional products.
- Data query processing is used for data or information retrieval and does not have the means for finding association rules. Similarly, simple statistical analysis cannot handle large amounts of data such as those of customer records in a department store.