
Anil Neerukonda Institute of Technology & Sciences (Autonomous)
(Permanent Affiliation by Andhra University & Approved by AICTE,
Accredited by NBA (ECE, EEE, CSE, IT, Mech., Civil & Chemical) & NAAC)
Sangivalasa-531162, Bheemunipatnam Mandal, Visakhapatnam District
Phone: 08933-225083/84/87 Fax: 226395

Website: www.anits.edu.in email: principal@anits.edu.in

3-1 IT Mid-II (SET-II) Scheme & Key

Subject: IT314 DATA WAREHOUSE & DATA MINING

Scheme

1. With support threshold = 50% and confidence = 60%, construct a conditional FP-tree to find the association rules. FP-tree construction - 6M, identification of frequent items - 4M

2. Mine frequent itemsets using the vertical data format of the transaction dataset D. A minimum support count of either 2 or 3 is allowed. Vertical process - 6M and frequent item list - 4M

3. Apply a decision tree to the dataset and consider the classification attribute as the target variable. (Identify only the root node.) Calculation of entropy and information gain for each attribute (a1, a2, a3) - 3*4 = 12M and identification of root node - 3M

4. Naïve Bayes classifier algorithm - 5M. Apply Naïve Bayesian classification on the dataset in (Question 3) and predict the classification label yes/no for the test sample <True, Cool, Normal> - 10M

5a. k-means clustering algorithm - 3M. Apply k-means to cluster the following data {2, 3, 4, 10, 11, 12, 20, 25, 30} into two clusters by assuming mean1 = 4 and mean2 = 12 - 7M
5b. Write any two techniques to improve the classification accuracy - 5M

6a. Define agglomerative clustering - 2M. Apply agglomerative single-link clustering to the data {18, 22, 25, 42, 27, 43} - 6M and build the dendrogram - 2M
6b. Write any two types of clustering - 5M

Answer Key

1 ans) Support threshold = 50% => 0.5 * 6 = 3 => min_sup = 3

1. Count of each item

Item   Count
I1     4
I2     5
I3     4
I4     4
I5     2

2. Sort the items in descending order of count. (I5 is dropped because its count, 2, is below min_sup = 3.)

Item   Count
I2     5
I1     4
I3     4
I4     4
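
The threshold and ordering steps above can be checked with a short Python sketch. It uses only the item counts printed in the key; the variable names are illustrative, and the transaction list itself is not reproduced here because the key does not show it.

```python
import math

# Item counts copied from the first table of the key.
item_counts = {"I1": 4, "I2": 5, "I3": 4, "I4": 4, "I5": 2}
num_transactions = 6        # taken from "0.5 * 6 = 3" in the key
support_threshold = 0.5     # 50%

# Minimum support count implied by the percentage threshold.
min_sup = math.ceil(support_threshold * num_transactions)

# Keep only frequent items, then order by descending count.
# sorted() is stable, so ties keep their original I1, I3, I4 order.
frequent = {item: c for item, c in item_counts.items() if c >= min_sup}
ordered = sorted(frequent.items(), key=lambda kv: -kv[1])

print(min_sup)
print(ordered)
```

This ordered list is the order in which items are inserted along each FP-tree path.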

3. Build FP Tree

2 ans) Consider minimum support count = 2 or 3.

Itemset   TID_set
I1        {T100, T400, T500, T700, T800, T900}
I2        {T100, T200, T300, T400, T600, T800, T900}
I3        {T300, T500, T600, T700, T800, T900}
I4        {T200, T400}
I5        {T100, T800}

For support count = 2
2-itemsets in vertical data format (each TID_set is the intersection of the two 1-itemset TID_sets)

Itemset   TID_set
I1, I2    {T100, T400, T800, T900}
I1, I3    {T500, T700, T800, T900}
I1, I4    {T400}
I1, I5    {T100, T800}
I2, I3    {T300, T600, T800, T900}
I2, I4    {T200, T400}
I2, I5    {T100, T800}
I3, I5    {T800}

3-itemsets in vertical data format

Itemset      TID_set
I1, I2, I3   {T800, T900}
I1, I2, I5   {T100, T800}

Follow the same procedure for support count = 3.
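
As a cross-check of the tables above, the following is a minimal Python sketch of level-wise mining in the vertical format (in the style of Eclat). The TID sets are copied from the 1-itemset table; the function name `eclat` and its structure are illustrative assumptions, not part of the key.

```python
from itertools import combinations

# 1-itemset TID sets copied from the table in the key.
tidsets = {
    "I1": {"T100", "T400", "T500", "T700", "T800", "T900"},
    "I2": {"T100", "T200", "T300", "T400", "T600", "T800", "T900"},
    "I3": {"T300", "T500", "T600", "T700", "T800", "T900"},
    "I4": {"T200", "T400"},
    "I5": {"T100", "T800"},
}

def eclat(tidsets, min_sup):
    """Return {itemset: TID set} for every itemset with support >= min_sup."""
    current = {frozenset([item]): tids
               for item, tids in tidsets.items() if len(tids) >= min_sup}
    frequent = dict(current)
    while current:
        next_level = {}
        # Join two frequent k-itemsets that differ in exactly one item.
        for (a, tids_a), (b, tids_b) in combinations(current.items(), 2):
            union = a | b
            if len(union) == len(a) + 1:
                tids = tids_a & tids_b  # support = size of the intersection
                if len(tids) >= min_sup:
                    next_level[union] = tids
        frequent.update(next_level)
        current = next_level
    return frequent

frequent_2 = eclat(tidsets, 2)
frequent_3 = eclat(tidsets, 3)
for itemset, tids in frequent_2.items():
    print(sorted(itemset), sorted(tids))
```

With a support count of 3, the same function keeps only I1, I2, I3 and their three pairs, since {I1, I2, I3} occurs in just two transactions.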

3 ans) Decision tree
Attribute: a1
Values(a1) = True, False
S = [6+, 4-]         Entropy(S) = -6/10 log2(6/10) - 4/10 log2(4/10) = 0.9709
S_true = [1+, 4-]    Entropy(S_true) = -1/5 log2(1/5) - 4/5 log2(4/5) = 0.7219
S_false = [5+, 0-]   Entropy(S_false) = 0
Information Gain(S, a1) = Entropy(S) - 5/10 Entropy(S_true) - 5/10 Entropy(S_false)
                        = 0.9709 - 5/10 * 0.7219 - 5/10 * 0 = 0.6099
Attribute: a2
Values(a2) = Hot, Cool
S = [6+, 4-]         Entropy(S) = -6/10 log2(6/10) - 4/10 log2(4/10) = 0.9709
S_hot = [2+, 3-]     Entropy(S_hot) = -2/5 log2(2/5) - 3/5 log2(3/5) = 0.9709
S_cool = [4+, 1-]    Entropy(S_cool) = -4/5 log2(4/5) - 1/5 log2(1/5) = 0.7219

Information Gain(S, a2) = Entropy(S) - 5/10 Entropy(S_hot) - 5/10 Entropy(S_cool)
                        = 0.9709 - 5/10 * 0.9709 - 5/10 * 0.7219 = 0.1245
Attribute: a3
Values(a3) = High, Normal
S = [6+, 4-]          Entropy(S) = -6/10 log2(6/10) - 4/10 log2(4/10) = 0.9709
S_high = [2+, 4-]     Entropy(S_high) = -2/6 log2(2/6) - 4/6 log2(4/6) = 0.9183
S_normal = [4+, 0-]   Entropy(S_normal) = 0.0
Information Gain(S, a3) = Entropy(S) - 6/10 Entropy(S_high) - 4/10 Entropy(S_normal)
                        = 0.9709 - 6/10 * 0.9183 - 4/10 * 0.0 = 0.4199
Maximum information gain = 0.6099 (attribute a1)
Hence the root node is a1
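
The entropy and gain arithmetic above can be reproduced with a short Python sketch. The class counts are the [yes, no] pairs from the key; the helper names `entropy` and `information_gain` are illustrative. At full precision the gain for a1 comes out near 0.6100, while the key's 0.6099 results from using entropies already rounded to four decimals.

```python
from math import log2

def entropy(pos, neg):
    """Binary entropy (base 2) of a set with pos positive and neg negative examples."""
    total = pos + neg
    result = 0.0
    for count in (pos, neg):
        if count:                      # 0 * log2(0) is treated as 0
            p = count / total
            result -= p * log2(p)
    return result

def information_gain(parent, partitions):
    """Gain = Entropy(parent) - weighted sum of partition entropies."""
    total = sum(parent)
    remainder = sum((p + n) / total * entropy(p, n) for p, n in partitions)
    return entropy(*parent) - remainder

S = (6, 4)  # [6+, 4-]
splits = {
    "a1": [(1, 4), (5, 0)],   # True, False
    "a2": [(2, 3), (4, 1)],   # Hot, Cool
    "a3": [(2, 4), (4, 0)],   # High, Normal
}

gains = {attr: information_gain(S, parts) for attr, parts in splits.items()}
root = max(gains, key=gains.get)
for attr, g in gains.items():
    print(attr, round(g, 4))
print("root:", root)
```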

4 ans) The Naïve Bayes classifier is one of the simplest and most effective classification algorithms; it helps build fast machine learning models that can make quick predictions. It is a probabilistic classifier, which means it predicts on the basis of the probability of an object.

o Bayes' theorem is also known as Bayes' Rule or Bayes' law, which is used to determine the probability of a hypothesis with prior knowledge. It depends on the conditional probability.
o The formula for Bayes' theorem is given as:

P(A|B) = P(B|A) * P(A) / P(B)

Where,
P(A|B) is the posterior probability: probability of hypothesis A given the observed event B.

P(B|A) is the likelihood: probability of the evidence B given that hypothesis A is true.

a1       Yes   No
True     1/6   4/4
False    5/6   0/4

a2       Yes   No
Hot      2/6   3/4
Cool     4/6   1/4

a3       Yes   No
High     2/6   4/4
Normal   4/6   0/4
Given sample <True, Cool, Normal, ?>
P(yes) = 6/10 * 1/6 * 4/6 * 4/6 = 8/180 = 2/45 ≈ 0.044
P(no) = 4/10 * 4/4 * 1/4 * 0/4 = 0
P(yes) > P(no); hence we predict the label Yes for the sample: <True, Cool, Normal, Yes>
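
The same prediction can be reproduced with exact fractions. This is a minimal sketch that takes the priors and conditional probabilities directly from the tables above; the dictionary layout and the `score` helper are illustrative. Like the key, it applies no Laplace smoothing, so the zero entry for Normal given No forces P(no) to 0.

```python
from fractions import Fraction as F

# Class priors: 6 Yes and 4 No out of 10 rows.
priors = {"yes": F(6, 10), "no": F(4, 10)}

# Conditional probabilities P(value | class), copied from the tables in the key.
likelihood = {
    "yes": {
        "a1": {"True": F(1, 6), "False": F(5, 6)},
        "a2": {"Hot": F(2, 6), "Cool": F(4, 6)},
        "a3": {"High": F(2, 6), "Normal": F(4, 6)},
    },
    "no": {
        "a1": {"True": F(4, 4), "False": F(0, 4)},
        "a2": {"Hot": F(3, 4), "Cool": F(1, 4)},
        "a3": {"High": F(4, 4), "Normal": F(0, 4)},
    },
}

def score(label, sample):
    """Unnormalised posterior: prior times the product of attribute likelihoods."""
    p = priors[label]
    for attr, value in sample.items():
        p *= likelihood[label][attr][value]
    return p

sample = {"a1": "True", "a2": "Cool", "a3": "Normal"}
scores = {label: score(label, sample) for label in priors}
prediction = max(scores, key=scores.get)
print(scores, prediction)
```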

5a ans) K-means Algorithm:
Step-1: Select the number K to decide the number of clusters.
Step-2: Select K random points as the initial centroids. (They need not come from the input dataset.)
Step-3: Assign each data point to its closest centroid, which will form the predefined K clusters.
Step-4: Calculate the variance and place a new centroid for each cluster.
Step-5: Repeat the third step, which means reassign each data point to the new closest centroid of each cluster.
Step-6: If any reassignment occurs, then go to Step-4, else go to FINISH.
Step-7: The model is ready.

Iteration 1:
Items   C1=4       C2=12       Cluster number
2       |4-2|=2    |12-2|=10   1
3       1          9           1
4       0          8           1
10      6          2           2
11      7          1           2
12      8          0           2
20      16         8           2
25      21         13          2
30      26         18          2
Mean1 = (2+3+4)/3 = 3; Mean2 = (10+11+12+20+25+30)/6 = 18

Iteration 2:
Items   C1=3       C2=18       Cluster number
2       |3-2|=1    |18-2|=16   1
3       0          15          1
4       1          14          1
10      7          8           1
11      8          7           2
12      9          6           2
20      17         2           2
25      22         7           2
30      27         12          2
Mean1 = (2+3+4+10)/4 = 19/4 = 4.75; Mean2 = (11+12+20+25+30)/5 = 98/5 = 19.6

Iteration 3:
Items   C1=4.75         C2=19.6          Cluster number
2       |4.75-2|=2.75   |19.6-2|=17.6    1
3       1.75            16.6             1
4       0.75            15.6             1
10      5.25            9.6              1
11      6.25            8.6              1
12      7.25            7.6              1
20      15.25           0.4              2
25      20.25           5.4              2
30      25.25           10.4             2
Mean1 = (2+3+4+10+11+12)/6 = 7; Mean2 = (20+25+30)/3 = 25

Iteration 4:
Items   C1=7       C2=25       Cluster number
2       |7-2|=5    |25-2|=23   1
3       4          22          1
4       3          21          1
10      3          15          1
11      4          14          1
12      5          13          1
20      13         5           2
25      18         0           2
30      23         5           2
Mean1 = (2+3+4+10+11+12)/6 = 7; Mean2 = (20+25+30)/3 = 25
The means no longer change, so the algorithm stops.
Final clusters: c1 = {2, 3, 4, 10, 11, 12}, c2 = {20, 25, 30}
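
The four iterations above can be replayed with a small one-dimensional k-means sketch in Python. The data and starting means come from the question; the function name `kmeans_1d`, the `max_iter` guard and the returned history list are illustrative assumptions.

```python
def kmeans_1d(data, initial_means, max_iter=100):
    """1-D k-means: assign each point to the nearest mean, then recompute the means."""
    means = list(initial_means)
    history = [tuple(means)]
    clusters = []
    for _ in range(max_iter):
        clusters = [[] for _ in means]
        for x in data:
            nearest = min(range(len(means)), key=lambda k: abs(x - means[k]))
            clusters[nearest].append(x)
        # An empty cluster keeps its previous mean.
        new_means = [sum(c) / len(c) if c else m for c, m in zip(clusters, means)]
        if new_means == means:   # no change in the means: converged
            break
        means = new_means
        history.append(tuple(means))
    return clusters, means, history

data = [2, 3, 4, 10, 11, 12, 20, 25, 30]
clusters, means, history = kmeans_1d(data, [4, 12])
print(history)
print(clusters)
```

The history list holds the means used at the start of each iteration, matching the C1 and C2 column headers of the tables.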

5b ans) Techniques to improve classification accuracy:
1. Stacking: It is an ensemble learning technique that uses predictions from multiple models (for example decision tree, kNN or SVM) to build a new model. This model is used for making predictions on the test set.
 
2. Bagging: It combines the results of multiple models (for instance, all decision trees) to get a generalized result. If all the models are created on the same set of data and then combined, there is a high chance that they will give the same result, since they are getting the same input. One technique that solves this problem is bootstrapping. Bootstrapping is a sampling technique in which we create subsets of observations from the original dataset, with replacement. The size of each subset is the same as the size of the original set. The bagging (or Bootstrap Aggregating) technique uses these subsets (bags) to get a fair idea of the distribution (complete set). The size of the subsets created for bagging may also be less than the original set.
3. Boosting: It is a sequential process, where each subsequent model attempts to correct the errors of the previous model. The succeeding models are dependent on the previous model. Boosting works in the steps below.
o A subset is created from the original dataset.
o Initially, all data points are given equal weights.
o A base model is created on this subset.
o This model is used to make predictions on the whole dataset.
o Errors are calculated using the actual values and predicted values.
o The observations which are incorrectly predicted are given higher weights.
o Another model is created and predictions are made on the dataset. (This model tries to correct the errors from the previous model.)
o Similarly, multiple models are created, each correcting the errors of the previous model.
o The final model (strong learner) is the weighted mean of all the models (weak learners).
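
To make the bagging idea concrete, here is a minimal Python sketch of its two building blocks: drawing bootstrap samples with replacement, and combining model outputs by majority vote. The toy data, the fixed seed and the function names are assumptions for illustration; no actual classifier is trained.

```python
import random
from collections import Counter

def bootstrap_sample(data, rng):
    """Draw len(data) observations from data with replacement."""
    return [rng.choice(data) for _ in data]

def majority_vote(predictions):
    """Combine the labels predicted by several models into one label."""
    return Counter(predictions).most_common(1)[0][0]

rng = random.Random(0)            # fixed seed so the run is repeatable
data = list(range(10))            # stand-in for the original training set
bags = [bootstrap_sample(data, rng) for _ in range(3)]

for bag in bags:
    print(sorted(bag))            # duplicates show sampling with replacement

# Hypothetical predictions from three models for one test sample.
print(majority_vote(["yes", "no", "yes"]))
```

In a full bagging setup, one model would be trained on each bag and `majority_vote` would be applied to their predictions for every test sample.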

6a ans) Step 1:

        18   22   25   27   42   43
18       0
22       4    0
25       7    3    0
27       9    5    2    0
42      24   20   17   15    0
43      25   21   18   16    1    0

Step 2:

         18   22   25   27   42,43
18        0
22        4    0
25        7    3    0
27        9    5    2    0
42,43    24   20   17   15    0

Step 3:

         18   22   25,27   42,43
18        0
22        4    0
25,27     7    3    0
42,43    24   20   15      0

Step 4:

            18   22,25,27   42,43
18           0
22,25,27     4    0
42,43       24   15          0

Step 5:

               18,22,25,27   42,43
18,22,25,27     0
42,43          15             0

Step 6:

                     18,22,25,27,42,43
18,22,25,27,42,43     0
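
The merge order implied by the distance matrices above can be checked with a small single-link sketch in Python. The data come from the question; the function name `single_link` and the brute-force search over cluster pairs are illustrative choices, not an optimised implementation.

```python
def single_link(points):
    """Agglomerative single-link clustering of 1-D points.

    Returns a list of (distance, cluster_a, cluster_b) in merge order.
    """
    clusters = [[p] for p in sorted(points)]
    merges = []
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # Single link: distance between the closest pair of members.
                d = min(abs(a - b) for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        merges.append((d, clusters[i], clusters[j]))
        merged = sorted(clusters[i] + clusters[j])
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return merges

merges = single_link([18, 22, 25, 42, 27, 43])
for distance, a, b in merges:
    print(distance, a, b)
```

The merge distances (1, 2, 3, 4, 15) are the heights at which the corresponding branches join in the dendrogram.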

6b ans) Types of clustering:

Partitioning Clustering: It is a type of clustering that divides the data into non-hierarchical groups. It is also known as the centroid-based method. The most common example of partitioning clustering is the K-Means Clustering algorithm. In this type, the dataset is divided into a set of K groups, where K is used to define the number of pre-defined groups. The cluster centre is created in such a way that the distance between the data points of one cluster and its own centroid is minimum as compared to another cluster centroid.

Hierarchical Clustering: It can be used as an alternative to partitioned clustering, as there is no requirement of pre-specifying the number of clusters to be created. In this technique, the dataset is divided into clusters to create a tree-like structure, which is also called a dendrogram. The observations or any number of clusters can be selected by cutting the tree at the correct level. The most common example of this method is the Agglomerative Hierarchical algorithm.

Density-Based Clustering: It connects the highly dense areas into clusters, and arbitrarily shaped distributions are formed as long as the dense region can be connected. This algorithm does it by identifying different clusters in the dataset and connecting the areas of high density into clusters. The dense areas in data space are divided from each other by sparser areas.
