• Classification
• Prediction
• Association Rules & Recommenders
• Data & Dimension Reduction
• Data Exploration
• Visualization
SEMMA (from SAS)
• Sample
• Explore
• Modify
• Model
• Assess
CRISP-DM (SPSS/IBM)
• Business Understanding
• Data Understanding
• Data Preparation
• Modeling
• Evaluation
• Deployment
Note: These lecture notes are based on the textbook "Data Mining for Business Analytics in R" by Shmueli, Bruce, Yahav, Patel & Lichtendahl, and the textbook's accompanying slides.
Supervised Learning
• Goal: predict a single "target" or "outcome" variable Y
• Train on data where the target value is known
• Score data where the target value is not known
• Methods: Classification and Prediction
Unsupervised Learning
• Goal: segment data into meaningful segments; detect patterns
• There is no target (outcome) variable to predict or classify
Supervised: Classification
• Goal: predict a categorical target (outcome) variable
• Examples: purchase/no purchase, fraud/no fraud, creditworthy/not creditworthy
• Each row is a case (customer, tax return, applicant)
• Each column is a variable
• The target variable is often binary (yes/no)
Supervised: Prediction
• Goal: predict a numerical target (outcome) variable
• Examples: sales, revenue, performance
• As in classification:
  • Each row is a case (customer, tax return, applicant)
  • Each column is a variable
Preliminary Exploration in R
loading data, viewing it, summary statistics
R code for loading and creating subsets from the data:

housing.df <- read.csv("WestRoxbury.csv", header = TRUE)  # load data
dim(housing.df)   # find the dimension of the data frame
head(housing.df)  # show the first six rows
View(housing.df)  # show all the data in a new tab

# Practice showing different subsets of the data
housing.df[1:10, 1]  # show the first 10 rows of the first column only
housing.df[1:10, ]   # show the first 10 rows of each of the columns
housing.df[5, 1:10]  # show the fifth row of the first 10 columns
housing.df[5, c(1:2, 4, 8:10)]  # show the fifth row of some columns
housing.df[, 1]      # show the whole first column
housing.df$TOTAL_VALUE        # a different way to show the whole first column
housing.df$TOTAL_VALUE[1:10]  # show the first 10 rows of the first column
length(housing.df$TOTAL_VALUE)  # find the length of the first column
mean(housing.df$TOTAL_VALUE)    # find the mean of the first column
summary(housing.df)  # find summary statistics for each column
# random sample of 5 observations
s <- sample(row.names(housing.df), 5)
housing.df[s, ]

# oversample houses with over 10 rooms
s <- sample(row.names(housing.df), 5, prob = ifelse(housing.df$ROOMS > 10, 0.9, 0.01))
housing.df[s, ]
Types of Variables
• Determine the types of pre-processing needed, and the algorithms used
• Main distinction: categorical vs. numeric
• Numeric
  • Continuous
  • Integer
• Categorical
  • Ordered (low, medium, high)
  • Unordered (male, female)
• Handling categorical variables:
  • Naive Bayes can use them as-is
  • In most other algorithms, must create binary dummies (number of dummies = number of categories − 1) [see Table 2.6 for R code]
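The "categories − 1" rule can also be illustrated outside R. A minimal pandas sketch (the REMODEL column and its levels mirror the West Roxbury example; the tiny frame itself is made up):

```python
import pandas as pd

# toy frame mimicking the REMODEL factor with levels None, Old, Recent
df = pd.DataFrame({"REMODEL": ["None", "Old", "Recent", "None"]})

# three categories -> two dummies: drop_first=True removes the first level,
# whose value is implied when all remaining dummies are 0
dummies = pd.get_dummies(df["REMODEL"], prefix="REMODEL", drop_first=True)
print(list(dummies.columns))  # ['REMODEL_Old', 'REMODEL_Recent']
```

The dropped level plays the same role as REMODELRecent being dropped in Table 2.6: it avoids a redundant, perfectly collinear column.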
Reviewing Variables in R
TABLE 2.5 REVIEWING VARIABLES IN R

t(t(names(housing.df)))  # print a list of the variable names
 [1,] "TOTAL_VALUE"
 [2,] "TAX"
 [3,] "LOT.SQFT"
 [4,] "YR.BUILT"
 [5,] "GROSS.AREA"
 [6,] "LIVING.AREA"
 [7,] "FLOORS"
 [8,] "ROOMS"
 [9,] "BEDROOMS"
[10,] "FULL.BATH"
[11,] "HALF.BATH"
[12,] "KITCHEN"
[13,] "FIREPLACE"
[14,] "REMODEL"

> class(housing.df$REMODEL)  # REMODEL is a factor variable
[1] "factor"
> levels(housing.df[, 14])  # show the factor's categories (levels)
[1] "None"   "Old"    "Recent"
Creating binary dummies

# use model.matrix() to convert all categorical variables in the data frame into
# a set of dummy variables. We must then turn the resulting data matrix back into
# a data frame for further work.
xtotal <- model.matrix(~ 0 + BEDROOMS + REMODEL, data = housing.df)
xtotal$BEDROOMS[1:5]  # will not work because xtotal is a matrix
xtotal <- as.data.frame(xtotal)
t(t(names(xtotal)))  # check the names of the dummy variables
head(xtotal)
xtotal <- xtotal[, -4]  # drop one of the dummy variables.
# In this case, drop REMODELRecent.

# now put it all together again; drop original REMODEL from the data
housing.df <- cbind(housing.df[, -c(9, 14)], xtotal)
t(t(names(housing.df)))

> head(xtotal)
  BEDROOMS REMODELNone REMODELOld REMODELRecent
1        3           1          0             0
2        4           0          0             1
3        4           1          0             0
4        5           1          0             0
5        3           1          0             0
6        3           0          1             0
Note: R's lm function automatically creates dummies, so you can skip dummy creation
when using lm
Detecting Outliers
• An outlier is an observation that is "extreme", being distant from the rest of the data (the definition of "distant" is deliberately vague)
• Outliers can have a disproportionate influence on models (a problem if the outlier is spurious)
• An important step in data pre-processing is detecting outliers
• Once detected, domain knowledge is required to determine whether it is an error or truly extreme
• In some contexts, finding outliers is the purpose of the DM exercise (airport security screening). This is called "anomaly detection".
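The notes leave "distant" vague on purpose. As one concrete convention (an assumption here, not prescribed by the notes), the 1.5 × IQR rule flags points beyond the quartile fences. A Python sketch with made-up values:

```python
import numpy as np

def iqr_outliers(values):
    """Flag values more than 1.5 * IQR beyond the quartiles (a common rule of thumb)."""
    x = np.asarray(values, dtype=float)
    q1, q3 = np.percentile(x, [25, 75])
    fence = 1.5 * (q3 - q1)
    return (x < q1 - fence) | (x > q3 + fence)

rooms = [3, 4, 4, 5, 5, 6, 6, 50]   # 50 is far from the rest
flags = iqr_outliers(rooms)
print(flags)                        # only the last value is flagged
```

Whatever the rule used to flag a point, the notes' advice still applies: domain knowledge decides whether a flagged point is an error or a genuine extreme.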
Replacing Missing Data with Median
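The R code for this step appears later in the notes; the same median-imputation idea in a pandas sketch (the numbers are made up):

```python
import numpy as np
import pandas as pd

bedrooms = pd.Series([3, 2, np.nan, 4, 3, np.nan])
med = bedrooms.median()           # median of the non-missing values (3.0 here)
bedrooms = bedrooms.fillna(med)   # replace each NA with that median
print(bedrooms.tolist())
```

Like R's median(x, na.rm = TRUE), pandas' median() skips missing values by default, so the statistic is computed from the observed entries only.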
The Problem of Overfitting
• Statistical models can produce highly complex explanations of relationships between variables
• The "fit" may be excellent
• When used with new data, models of great complexity do not do so well
• A 100% fit is not useful for new data
[Figure: a highly complex curve fitted through every data point; x-axis: Expenditure (0 to 1000), y-axis: 0 to 1600]
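The figure's point can be reproduced with a short simulation (synthetic data, not the textbook's): a degree-7 polynomial fits 8 noisy training points exactly, a 100% fit, while its error on fresh data from the same process does not vanish.

```python
import numpy as np

rng = np.random.default_rng(0)
x_train = np.linspace(0, 1, 8)
y_train = 2 * x_train + rng.normal(0, 0.3, 8)   # linear truth + noise
x_new = np.linspace(0.05, 0.95, 50)             # fresh data, same process
y_new = 2 * x_new + rng.normal(0, 0.3, 50)

def rmse(coefs, x, y):
    return float(np.sqrt(np.mean((np.polyval(coefs, x) - y) ** 2)))

results = {}
for degree in (1, 7):   # a plain line vs. an interpolating polynomial
    coefs = np.polyfit(x_train, y_train, degree)
    results[degree] = (rmse(coefs, x_train, y_train), rmse(coefs, x_new, y_new))

print(results)  # degree 7 fits training exactly; its new-data error does not vanish
```

The degree-7 model "explains" the noise in the training sample, which is exactly what cannot be expected to repeat in new data.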
Overfitting (cont.)
• Causes:
Partitioning the Data
• Training partition: used to develop the model
• Validation partition: used to assess the model's performance on data it has not seen
• Test partition: when a model is developed on training data and selected using validation data, its validation performance can be optimistic; a separate test partition provides an unbiased assessment
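The three-way split can be sketched in a few lines (Python; the 50/30/20 proportions are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100                          # pretend the data frame has 100 rows
ids = rng.permutation(n)         # shuffle the row indices once

train_ids = ids[: int(0.5 * n)]                 # 50% training
valid_ids = ids[int(0.5 * n): int(0.8 * n)]     # 30% validation
test_ids = ids[int(0.8 * n):]                   # remaining 20% test

# disjoint partitions that together cover every row
print(len(train_ids), len(valid_ids), len(test_ids))
```

Shuffling once and slicing guarantees the three sets are disjoint, which is the property the R code later achieves with setdiff().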
Partitioning the Data
R code for partitioning the West Roxbury data into training, validation (and test) sets

# use set.seed() to get the same partitions when re-running the R code.
set.seed(1)

# randomly sample 50% of the row IDs for training
train.rows <- sample(rownames(housing.df), dim(housing.df)[1]*0.5)

# sample 30% of the row IDs into the validation set, drawing only from records
# not already in the training set
# use setdiff() to find records not already in the training set
valid.rows <- sample(setdiff(rownames(housing.df), train.rows),
                     dim(housing.df)[1]*0.3)
Example - Linear Regression
West Roxbury Housing Data

TAX: Tax bill amount based on total assessed value multiplied by the tax rate, in USD
[Table: descriptions of the remaining West Roxbury housing variables]
Fit model to the training data
Assess accuracy

pred <- predict(reg, newdata = valid.data)
library(forecast)
accuracy(pred, valid.data$TOTAL_VALUE)
Error metrics
• Error = actual − predicted
• ME: Mean error, ME = (1/n) Σ eᵢ
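With Error = actual − predicted, ME can be computed directly; RMSE is shown alongside it for contrast (added here, not from the slide) because, unlike ME, it does not let positive and negative errors cancel. A small Python sketch with made-up numbers:

```python
import numpy as np

actual = np.array([10.0, 12.0, 9.0, 11.0])
predicted = np.array([11.0, 11.0, 10.0, 10.0])

errors = actual - predicted            # Error = actual - predicted
me = errors.mean()                     # ME = (1/n) * sum(e_i)
rmse = np.sqrt((errors ** 2).mean())   # root-mean-squared error

# every prediction is off by 1, yet ME = 0.0 because errors cancel; RMSE = 1.0
print(me, rmse)
```

This is why ME alone can hide a badly miscalibrated model: a squared-error metric is needed to see the typical error magnitude.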
Summary
• Data Mining consists of supervised methods (Classification & Prediction) and unsupervised methods (Association Rules, Data Reduction, Data Exploration & Visualization)
• Before algorithms can be applied, data must be explored and pre-processed
• To evaluate performance and to avoid overfitting, data partitioning is used
• Models are fit to the training partition and assessed on the validation and test partitions
• Data mining methods are usually applied to a sample from a large database, and then the best model is used to score the entire database
R Computing:
The main software used is R, which is free from the R Project for Statistical Computing.

R Installation:
Download R: https://www.r-project.org/

Example:
chooseCRANmirror()            # select a mirror site: a web site close to you
install.packages("forecast")
library(forecast)             # use the "forecast" library
setwd("C:/DS535")             # set the working directory where the data file is located
R code
s <- sample(row.names(housing.df), 5)
housing.df[s, ]

# oversample houses with over 10 rooms
s <- sample(row.names(housing.df), 5, prob = ifelse(housing.df$ROOMS > 10, 0.9, 0.01))
housing.df[s, ]
# Option 2: use model.matrix() to convert all categorical variables in the data
# frame into a set of dummy variables. We must then turn the resulting data
# matrix back into a data frame for further work.
xtotal <- model.matrix(~ 0 + BEDROOMS + REMODEL, data = housing.df)
xtotal <- as.data.frame(xtotal)
t(t(names(xtotal)))  # check the names of the dummy variables
head(xtotal)
xtotal <- xtotal[, -4]  # drop one of the dummy variables.
# In this case, drop REMODELRecent.
rows.to.missing <- sample(row.names(housing.df), 10)
housing.df[rows.to.missing, ]$BEDROOMS <- NA
summary(housing.df$BEDROOMS)  # Now we have 10 NA's and the median of the
                              # remaining values is 3.

# replace the missing values using the median of the remaining values.
# use median() with na.rm = TRUE to ignore missing values when computing the median.
housing.df[rows.to.missing, ]$BEDROOMS <- median(housing.df$BEDROOMS, na.rm = TRUE)
summary(housing.df$BEDROOMS)
#### Table 2.9
# use set.seed() to get the same partitions when re-running the R code.
set.seed(1)

# randomly sample 50% of the row IDs for training
train.rows <- sample(rownames(housing.df), dim(housing.df)[1]*0.5)

# alternative code for validation (works only when row names are numeric):
# collect all the columns without training row ID into validation set
valid.data <- housing.df[-train.rows, ]  # does not work in this case

# sample 30% of the row IDs into the validation set, drawing only from records
# not already in the training set
# use setdiff() to find records not already in the training set
valid.rows <- sample(setdiff(rownames(housing.df), train.rows),
                     dim(housing.df)[1]*0.3)

# create the 3 data frames by collecting all columns from the appropriate rows
train.data <- housing.df[train.rows, ]
valid.data <- housing.df[valid.rows, ]

# the remaining row IDs serve as the test set
test.rows <- setdiff(rownames(housing.df), union(train.rows, valid.rows))
test.data <- housing.df[test.rows, ]
reg <- lm(TOTAL_VALUE ~ . - TAX, data = housing.df, subset = train.rows)  # remove variable "TAX"
tr.res <- data.frame(train.data$TOTAL_VALUE, reg$fitted.values, reg$residuals)
head(tr.res)

library(forecast)
# compute accuracy on training set
accuracy(reg$fitted.values, train.data$TOTAL_VALUE)