• Classification
• Prediction
• Association Rules & Recommenders
• Data & Dimension Reduction
• Data Exploration
• Visualization
SEMMA (from SAS)
• Sample
• Explore
• Modify
• Model
• Assess
CRISP-DM (SPSS/IBM)
• Business Understanding
• Data Understanding
• Data Preparation
• Modeling
• Evaluation
• Deployment
Note: These lecture notes are based on the textbook "Data Mining for Business Analytics in R" by Shmueli, Bruce, Yahav, Patel & Lichtendahl, and the textbook's accompanying slides.
Supervised Learning
• Goal: predict a single "target" or "outcome" variable Y
• Train on data where the target value is known
• Score data where the target value is not known
• Methods: Classification and Prediction
Unsupervised Learning
• Goal: segment data into meaningful segments; detect patterns
• There is no target (outcome) variable to predict or classify
Supervised: Classification
• Goal: predict a categorical target (outcome) variable
• Examples: purchase/no purchase, fraud/no fraud, creditworthy/not creditworthy
• Each row is a case (customer, tax return, applicant)
• Each column is a variable
• The target variable is often binary (yes/no)
Supervised: Prediction
• Goal: predict a numerical target (outcome) variable
• Examples: sales, revenue, performance
• As in classification:
  • Each row is a case (customer, tax return, applicant)
  • Each column is a variable
Preliminary Exploration in R
loading data, viewing it, summary statistics
R code for loading and creating subsets from the data:

housing.df <- read.csv("WestRoxbury.csv", header = TRUE)  # load data
dim(housing.df)   # find the dimension of the data frame
head(housing.df)  # show the first six rows
View(housing.df)  # show all the data in a new tab

# Practice showing different subsets of the data
housing.df[1:10, 1]  # show the first 10 rows of the first column only
housing.df[1:10, ]   # show the first 10 rows of each of the columns
housing.df[5, 1:10]  # show the fifth row of the first 10 columns
housing.df[5, c(1:2, 4, 8:10)]  # show the fifth row of some columns
housing.df[, 1]      # show the whole first column
housing.df$TOTAL_VALUE        # a different way to show the whole first column
housing.df$TOTAL_VALUE[1:10]  # show the first 10 rows of the first column
length(housing.df$TOTAL_VALUE)  # find the length of the first column
mean(housing.df$TOTAL_VALUE)    # find the mean of the first column
summary(housing.df)  # find summary statistics for each column
# random sample of 5 observations
s <- sample(row.names(housing.df), 5)
housing.df[s, ]

# oversample houses with over 10 rooms
s <- sample(row.names(housing.df), 5, prob = ifelse(housing.df$ROOMS > 10, 0.9, 0.01))
housing.df[s, ]
Types of Variables
• Determine the types of pre-processing needed, and the algorithms used
• Main distinction: categorical vs. numeric
• Numeric
  • Continuous
  • Integer
• Categorical
  • Ordered (low, medium, high)
  • Unordered (male, female)
• Handling categorical variables:
  • Naive Bayes can use them as-is
  • In most other algorithms, must create binary dummies (number of dummies = number of categories − 1) [see Table 2.6 for R code]
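The "categories − 1" rule can also be illustrated outside R. A minimal pandas sketch (the REMODEL column and its levels mirror the West Roxbury example; the tiny frame itself is made up):

```python
import pandas as pd

# toy frame mimicking the REMODEL factor with levels None, Old, Recent
df = pd.DataFrame({"REMODEL": ["None", "Old", "Recent", "None"]})

# three categories -> two dummies: drop_first=True removes the first level,
# whose value is implied when all remaining dummies are 0
dummies = pd.get_dummies(df["REMODEL"], prefix="REMODEL", drop_first=True)
print(list(dummies.columns))  # ['REMODEL_Old', 'REMODEL_Recent']
```

The dropped level plays the same role as REMODELRecent being dropped in Table 2.6: it avoids a redundant, perfectly collinear column.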
Reviewing Variables in R
TABLE 2.5 REVIEWING VARIABLES IN R

t(t(names(housing.df)))  # print a list of the variable names
 [1,] "TOTAL_VALUE"
 [2,] "TAX"
 [3,] "LOT.SQFT"
 [4,] "YR.BUILT"
 [5,] "GROSS.AREA"
 [6,] "LIVING.AREA"
 [7,] "FLOORS"
 [8,] "ROOMS"
 [9,] "BEDROOMS"
[10,] "FULL.BATH"
[11,] "HALF.BATH"
[12,] "KITCHEN"
[13,] "FIREPLACE"
[14,] "REMODEL"

> class(housing.df$REMODEL)  # REMODEL is a factor variable
[1] "factor"
> levels(housing.df[, 14])  # show the factor's categories (levels)
[1] "None"   "Old"    "Recent"
Creating binary dummies

# use model.matrix() to convert all categorical variables in the data frame into
# a set of dummy variables. We must then turn the resulting data matrix back into
# a data frame for further work.
xtotal <- model.matrix(~ 0 + BEDROOMS + REMODEL, data = housing.df)
xtotal$BEDROOMS[1:5]  # will not work because xtotal is a matrix
xtotal <- as.data.frame(xtotal)
t(t(names(xtotal)))  # check the names of the dummy variables
head(xtotal)
xtotal <- xtotal[, -4]  # drop one of the dummy variables.
# In this case, drop REMODELRecent.

# now put it all together again; drop original REMODEL from the data
housing.df <- cbind(housing.df[, -c(9, 14)], xtotal)
t(t(names(housing.df)))

> head(xtotal)
  BEDROOMS REMODELNone REMODELOld REMODELRecent
1        3           1          0             0
2        4           0          0             1
3        4           1          0             0
4        5           1          0             0
5        3           1          0             0
6        3           0          1             0
Note: R's lm function automatically creates dummies, so you can skip dummy creation
when using lm
Detecting Outliers
• An outlier is an observation that is "extreme", being distant from the rest of the data (the definition of "distant" is deliberately vague)
• Outliers can have a disproportionate influence on models (a problem if the outlier is spurious)
• An important step in data pre-processing is detecting outliers
• Once detected, domain knowledge is required to determine whether it is an error or truly extreme
• In some contexts, finding outliers is the purpose of the DM exercise (airport security screening). This is called "anomaly detection".
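The notes leave "distant" vague on purpose. As one concrete convention (an assumption here, not prescribed by the notes), the 1.5 × IQR rule flags points beyond the quartile fences. A Python sketch with made-up values:

```python
import numpy as np

def iqr_outliers(values):
    """Flag values more than 1.5 * IQR beyond the quartiles (a common rule of thumb)."""
    x = np.asarray(values, dtype=float)
    q1, q3 = np.percentile(x, [25, 75])
    fence = 1.5 * (q3 - q1)
    return (x < q1 - fence) | (x > q3 + fence)

rooms = [3, 4, 4, 5, 5, 6, 6, 50]   # 50 is far from the rest
flags = iqr_outliers(rooms)
print(flags)                        # only the last value is flagged
```

Whatever the rule used to flag a point, the notes' advice still applies: domain knowledge decides whether a flagged point is an error or a genuine extreme.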
Replacing Missing Data with Median
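The R code for this step appears later in the notes; the same median-imputation idea in a pandas sketch (the numbers are made up):

```python
import numpy as np
import pandas as pd

bedrooms = pd.Series([3, 2, np.nan, 4, 3, np.nan])
med = bedrooms.median()           # median of the non-missing values (3.0 here)
bedrooms = bedrooms.fillna(med)   # replace each NA with that median
print(bedrooms.tolist())
```

Like R's median(x, na.rm = TRUE), pandas' median() skips missing values by default, so the statistic is computed from the observed entries only.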
The Problem of Overfitting
• Statistical models can produce highly complex explanations of relationships between variables
• The "fit" may be excellent
• When used with new data, models of great complexity do not do so well
• A 100% fit is not useful for new data
[Figure: a highly complex curve fitted through every data point; x-axis: Expenditure (0 to 1000), y-axis: 0 to 1600]
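The figure's point can be reproduced with a short simulation (synthetic data, not the textbook's): a degree-7 polynomial fits 8 noisy training points exactly, a 100% fit, while its error on fresh data from the same process does not vanish.

```python
import numpy as np

rng = np.random.default_rng(0)
x_train = np.linspace(0, 1, 8)
y_train = 2 * x_train + rng.normal(0, 0.3, 8)   # linear truth + noise
x_new = np.linspace(0.05, 0.95, 50)             # fresh data, same process
y_new = 2 * x_new + rng.normal(0, 0.3, 50)

def rmse(coefs, x, y):
    return float(np.sqrt(np.mean((np.polyval(coefs, x) - y) ** 2)))

results = {}
for degree in (1, 7):   # a plain line vs. an interpolating polynomial
    coefs = np.polyfit(x_train, y_train, degree)
    results[degree] = (rmse(coefs, x_train, y_train), rmse(coefs, x_new, y_new))

print(results)  # degree 7 fits training exactly; its new-data error does not vanish
```

The degree-7 model "explains" the noise in the training sample, which is exactly what cannot be expected to repeat in new data.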
Overfitting (cont.)
• Causes:
Partitioning the Data
• Training partition: used to develop the model
• Validation partition: used to assess the model's performance on data it has not seen
• Test partition: when a model is developed on training data and selected using validation data, its validation performance can be optimistic; a separate test partition provides an unbiased assessment
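The three-way split can be sketched in a few lines (Python; the 50/30/20 proportions are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100                          # pretend the data frame has 100 rows
ids = rng.permutation(n)         # shuffle the row indices once

train_ids = ids[: int(0.5 * n)]                 # 50% training
valid_ids = ids[int(0.5 * n): int(0.8 * n)]     # 30% validation
test_ids = ids[int(0.8 * n):]                   # remaining 20% test

# disjoint partitions that together cover every row
print(len(train_ids), len(valid_ids), len(test_ids))
```

Shuffling once and slicing guarantees the three sets are disjoint, which is the property the R code later achieves with setdiff().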
Partitioning the Data
R code for partitioning the West Roxbury data into training, validation (and test) sets

# use set.seed() to get the same partitions when re-running the R code.
set.seed(1)

# randomly sample 50% of the row IDs for training
train.rows <- sample(rownames(housing.df), dim(housing.df)[1]*0.5)

# sample 30% of the row IDs into the validation set, drawing only from records
# not already in the training set
# use setdiff() to find records not already in the training set
valid.rows <- sample(setdiff(rownames(housing.df), train.rows),
                     dim(housing.df)[1]*0.3)
Example - Linear Regression
West Roxbury Housing Data

TAX: Tax bill amount based on total assessed value multiplied by the tax rate, in USD
[Table: descriptions of the remaining West Roxbury housing variables]
Fit model to the training data
Assess accuracy

pred <- predict(reg, newdata = valid.data)
library(forecast)
accuracy(pred, valid.data$TOTAL_VALUE)
Error metrics
• Error = actual − predicted
• ME: Mean error, ME = (1/n) Σ eᵢ
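With Error = actual − predicted, ME can be computed directly; RMSE is shown alongside it for contrast (added here, not from the slide) because, unlike ME, it does not let positive and negative errors cancel. A small Python sketch with made-up numbers:

```python
import numpy as np

actual = np.array([10.0, 12.0, 9.0, 11.0])
predicted = np.array([11.0, 11.0, 10.0, 10.0])

errors = actual - predicted            # Error = actual - predicted
me = errors.mean()                     # ME = (1/n) * sum(e_i)
rmse = np.sqrt((errors ** 2).mean())   # root-mean-squared error

# every prediction is off by 1, yet ME = 0.0 because errors cancel; RMSE = 1.0
print(me, rmse)
```

This is why ME alone can hide a badly miscalibrated model: a squared-error metric is needed to see the typical error magnitude.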
Summary
• Data Mining consists of supervised methods (Classification & Prediction) and unsupervised methods (Association Rules, Data Reduction, Data Exploration & Visualization)
• Before algorithms can be applied, data must be explored and pre-processed
• To evaluate performance and to avoid overfitting, data partitioning is used
• Models are fit to the training partition and assessed on the validation and test partitions
• Data mining methods are usually applied to a sample from a large database, and then the best model is used to score the entire database
R Computing:
The main software used is R, which is free from the R Project for Statistical Computing.

R Installation:
Download R: https://www.r-project.org/

Example:
chooseCRANmirror()            # select a mirror site: a web site close to you
install.packages("forecast")
library(forecast)             # use the "forecast" library
setwd("C:/DS535")             # set the working directory where the data file is located
R code
s <- sample(row.names(housing.df), 5)
housing.df[s, ]

# oversample houses with over 10 rooms
s <- sample(row.names(housing.df), 5, prob = ifelse(housing.df$ROOMS > 10, 0.9, 0.01))
housing.df[s, ]
# Option 2: use model.matrix() to convert all categorical variables in the data
# frame into a set of dummy variables. We must then turn the resulting data
# matrix back into a data frame for further work.
xtotal <- model.matrix(~ 0 + BEDROOMS + REMODEL, data = housing.df)
xtotal <- as.data.frame(xtotal)
t(t(names(xtotal)))  # check the names of the dummy variables
head(xtotal)
xtotal <- xtotal[, -4]  # drop one of the dummy variables.
# In this case, drop REMODELRecent.
rows.to.missing <- sample(row.names(housing.df), 10)
housing.df[rows.to.missing, ]$BEDROOMS <- NA
summary(housing.df$BEDROOMS)  # Now we have 10 NA's and the median of the
                              # remaining values is 3.

# replace the missing values using the median of the remaining values.
# use median() with na.rm = TRUE to ignore missing values when computing the median.
housing.df[rows.to.missing, ]$BEDROOMS <- median(housing.df$BEDROOMS, na.rm = TRUE)
summary(housing.df$BEDROOMS)
#### Table 2.9
# use set.seed() to get the same partitions when re-running the R code.
set.seed(1)

# randomly sample 50% of the row IDs for training
train.rows <- sample(rownames(housing.df), dim(housing.df)[1]*0.5)

# alternative code for validation (works only when row names are numeric):
# collect all the columns without training row ID into validation set
valid.data <- housing.df[-train.rows, ]  # does not work in this case

# sample 30% of the row IDs into the validation set, drawing only from records
# not already in the training set
# use setdiff() to find records not already in the training set
valid.rows <- sample(setdiff(rownames(housing.df), train.rows),
                     dim(housing.df)[1]*0.3)

# create the 3 data frames by collecting all columns from the appropriate rows
train.data <- housing.df[train.rows, ]
valid.data <- housing.df[valid.rows, ]

# the remaining row IDs serve as the test set
test.rows <- setdiff(rownames(housing.df), union(train.rows, valid.rows))
test.data <- housing.df[test.rows, ]
reg <- lm(TOTAL_VALUE ~ . - TAX, data = housing.df, subset = train.rows)  # remove variable "TAX"
tr.res <- data.frame(train.data$TOTAL_VALUE, reg$fitted.values, reg$residuals)
head(tr.res)

library(forecast)
# compute accuracy on training set
accuracy(reg$fitted.values, train.data$TOTAL_VALUE)