
08/11/2021, 09:13 3 Machine learning for classification - Machine Learning Bookcamp: Build a portfolio of real-life projects


3 Machine learning for classification

This chapter covers


Performing exploratory data analysis for identifying important features
Encoding categorical variables to use them in machine learning models
Using logistic regression for classification

In this chapter, we are going to use machine learning to predict churn.

Churn is when customers stop using the services of a company. Thus, churn prediction is about identifying customers who are likely to cancel their contracts soon. If the company can do that, it can offer discounts on these services in an effort to keep the users.

https://livebook.manning.com/book/machine-learning-bookcamp/chapter-3/79 1/74
Naturally, we can use machine learning for that: we can use past data
about customers who churned and, based on that, create a model for
identifying present customers who are about to leave. This is a binary
classification problem. The target variable that we want to predict is
categorical and has only two possible outcomes: churn or not churn.

In chapter 1, we learned that many supervised machine learning models exist, and we specifically mentioned ones that can be used for binary classification, including logistic regression, decision trees, and neural networks. In this chapter, we start with the simplest one: logistic regression. Even though it’s the simplest, it’s still powerful and has many advantages over other models: it’s fast and easy to understand, and its results are easy to interpret. It’s a workhorse of machine learning and one of the most widely used models in the industry.

3.1 Churn prediction project


The project we prepared for this chapter is churn prediction for a telecom company. We will use logistic regression and Scikit-learn for that.

Imagine that we are working at a telecom company that offers phone and internet services, and we have a problem: some of our customers are churning. They are no longer using our services and are going to a different provider. We would like to prevent that from happening, so we develop a system for identifying these customers and offer them an incentive to stay. We want to target them with promotional messages and give them a discount. We also would like to understand why the model thinks our customers churn, and for that, we need to be able to interpret the model’s predictions.

We have collected a dataset where we’ve recorded some information about our customers: what type of services they used, how much they paid, and how long they stayed with us. We also know who canceled their contracts and stopped using our services (churned). We will use this information as the target variable in the machine learning model and predict it using all other available information.

The plan for the project follows:

1. First, we download the dataset and do some initial preparation: rename columns and change values inside columns to be consistent throughout the entire dataset.
2. Then we split the data into train, validation, and test so we can validate our models.
3. As part of the initial data analysis, we look at feature importance to identify which features are important in our data.
4. We transform categorical variables into numeric variables so we can use them in the model.
5. Finally, we train a logistic regression model.

In the previous chapter, we implemented everything ourselves, using Python and NumPy. In this project, however, we will start using Scikit-learn, a Python library for machine learning. Namely, we will use it for

Splitting the dataset into train and test
Encoding categorical variables
Training logistic regression

3.1.1 Telco churn dataset


As in the previous chapter, we will use Kaggle datasets for data. This time we will use data from https://www.kaggle.com/blastchar/telco-customer-churn.

According to the description, this dataset has the following information:

Services of the customers: phone; multiple lines; internet; tech support and extra services such as online security, backup, device protection, and TV streaming
Account information: how long they have been clients, type of contract, type of payment method
Charges: how much the client was charged in the past month and in total
Demographic information: gender, age, and whether they have dependents or a partner
Churn: yes/no, whether the customer left the company within the past month

First, we download the dataset. To keep things organized, we first create a folder, chapter-03-churn-prediction. Then we go to that directory and use Kaggle CLI for downloading the data:

kaggle datasets download -d blastchar/telco-customer-churn

After downloading it, we unzip the archive to get the CSV file from there:

unzip telco-customer-churn.zip

We are ready to start now.

3.1.2 Initial data preparation


The first step is creating a new notebook in Jupyter. If it’s not running, start it:

jupyter notebook

We name the notebook chapter-03-churn-project (or any other name that we like).

As previously, we begin with adding the usual imports:

import pandas as pd
import numpy as np

import seaborn as sns
from matplotlib import pyplot as plt
%matplotlib inline

And now we can read the dataset:

df = pd.read_csv('WA_Fn-UseC_-Telco-Customer-Churn.csv')

We use the read_csv function to read the data and then write the results to a dataframe named df. To see how many rows it contains, let’s use the len function:

len(df)

It prints 7043, so there are 7,043 rows in this dataset. The dataset is not large but should be enough to train a decent model.

Next, let’s look at the first couple of rows using df.head() (figure 3.1). By default, it shows the first five rows of the dataframe.

Figure 3.1 The output of the df.head() command showing the first five rows of the telco churn dataset

This dataframe has quite a few columns, so they don’t all fit on the screen. Instead, we can transpose the dataframe using the T attribute, switching columns and rows so the columns (customerID, gender, and so on) become rows. This way we can see a lot more data (figure 3.2):

df.head().T

Figure 3.2 The output of the df.head().T command showing the first three rows of the telco churn dataset. The original rows are shown as columns: this way, it’s possible to see more data without having to use the slider.
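As a minimal sketch of what the transposition does, the toy dataframe below (made-up values, not rows from the real dataset) is flipped so that the column names become the row index:

```python
import pandas as pd

# Hypothetical two-row frame standing in for the real dataset.
toy = pd.DataFrame({
    "customerID": ["a-001", "b-002"],
    "gender": ["female", "male"],
    "tenure": [1, 34],
})

# head() keeps at most the first five rows; .T swaps rows and columns.
transposed = toy.head().T

print(transposed)
```

Each original row is now a column, so a wide dataframe can be read top to bottom instead of being scrolled sideways.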


We see that the dataset has a few columns:

CustomerID: the ID of the customer
Gender: male/female
SeniorCitizen: whether the customer is a senior citizen (0/1)
Partner: whether they live with a partner (yes/no)
Dependents: whether they have dependents (yes/no)
Tenure: number of months since the start of the contract
PhoneService: whether they have phone service (yes/no)
MultipleLines: whether they have multiple phone lines (yes/no/no phone service)
InternetService: the type of internet service (no/fiber/optic)
OnlineSecurity: if online security is enabled (yes/no/no internet)
OnlineBackup: if online backup service is enabled (yes/no/no internet)
DeviceProtection: if the device protection service is enabled (yes/no/no internet)
TechSupport: if the customer has tech support (yes/no/no internet)
StreamingTV: if the TV streaming service is enabled (yes/no/no internet)
StreamingMovies: if the movie streaming service is enabled (yes/no/no internet)
Contract: the type of contract (monthly/yearly/two years)
PaperlessBilling: if the billing is paperless (yes/no)
PaymentMethod: payment method (electronic check, mailed check, bank transfer, credit card)
MonthlyCharges: the amount charged monthly (numeric)
TotalCharges: the total amount charged (numeric)
Churn: if the client has canceled the contract (yes/no)

The most interesting one for us is Churn. As the target variable for our model, this is what we want to learn to predict. It takes two values: yes if the customer churned and no if the customer didn’t.

When reading a CSV file, Pandas tries to automatically determine the proper type of each column. However, sometimes it’s difficult to do it correctly, and the inferred types aren’t what we expect them to be. This is why it’s important to check whether the actual types are correct. Let’s have a look at them by using df.dtypes:

df.dtypes

We see (figure 3.3) that most of the types are inferred correctly. Recall that object means a string value, which is what we expect for most of the columns. However, we may notice two things. First, SeniorCitizen is detected as int64, so it has a type of integer, not object. The reason for this is that instead of the values yes and no, as we have in other columns, there are 1 and 0 values, so Pandas interprets this as a column with integers. It’s not really a problem for us, so we don’t need to do any additional preprocessing for this column.

Figure 3.3 Automatically inferred types for all the columns of the
dataframe. Object means a string. TotalCharges is incorrectly
identified as “object,” but it should be “float.”

The other thing to note is the type for TotalCharges. We would expect
this column to be numeric: it contains the total amount of money the
client was charged, so it should be a number, not a string. Yet Pandas
infers the type as “object.” The reason is that in some cases this
column contains a space (“ ”) to represent a missing value. When
coming across nonnumeric characters, Pandas has no other option
but to declare the column “object.”

IMPORTANT

Watch out for cases when you expect a column to be numeric, but Pandas says it’s not: most likely the column contains a special encoding for missing values that requires additional preprocessing.

We can force this column to be numeric by converting it to numbers using a special function in Pandas: to_numeric. By default, this function raises an exception when it sees nonnumeric data (such as spaces), but we can make it skip these cases by specifying the errors='coerce' option. This way Pandas will replace all nonnumeric values with a NaN (not a number):

total_charges = pd.to_numeric(df.TotalCharges, errors='coerce')

To confirm that the data indeed contains nonnumeric characters, we can now use the isnull() method of total_charges to select all the rows where Pandas couldn’t parse the original string:

df[total_charges.isnull()][['customerID', 'TotalCharges']]

We see that indeed there are spaces in the TotalCharges column (figure 3.4).

Figure 3.4 We can spot nonnumeric data in a column by parsing the content as numeric and seeing at which rows the parsing fails.
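The parse-and-inspect idea can be tried on a tiny in-memory CSV. This is only a sketch under stated assumptions: the three rows are invented, and the single space in row B imitates the missing-value encoding described above.

```python
import io
import pandas as pd

# Hypothetical miniature CSV; row B stores a single space instead of a number.
csv_text = "customerID,TotalCharges\nA,29.85\nB, \nC,108.15\n"
toy = pd.read_csv(io.StringIO(csv_text))

# The stray space prevents Pandas from inferring a numeric type.
is_numeric_before = pd.api.types.is_numeric_dtype(toy["TotalCharges"])

# errors='coerce' turns every unparseable entry into NaN instead of raising.
parsed = pd.to_numeric(toy["TotalCharges"], errors="coerce")

# The rows where parsing failed are the rows where the parsed value is null.
bad_rows = toy[parsed.isnull()]

print(is_numeric_before)
print(bad_rows)
```

Filtering the original frame with the null mask of the parsed series shows the raw, unparsed content of the offending rows, which is what makes the space visible.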

Now it’s up to us to decide what to do with these missing values. Although we could do many things with them, we are going to do the same thing we did in the previous chapter: set the missing values to zero.

df.TotalCharges = pd.to_numeric(df.TotalCharges, errors='coerce')
df.TotalCharges = df.TotalCharges.fillna(0)

In addition, we notice that the column names don’t follow the same naming convention. Some of them start with a lowercase letter, whereas others start with a capital letter, and there are also spaces in the values.

Let’s make it uniform by lowercasing everything and replacing spaces with underscores. This way we remove all the inconsistencies in the data. We use the exact same code we used in the previous chapter:

df.columns = df.columns.str.lower().str.replace(' ', '_')

string_columns = list(df.dtypes[df.dtypes == 'object'].index)

for col in string_columns:
    df[col] = df[col].str.lower().str.replace(' ', '_')
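To see the effect of this normalization in isolation, here is a small sketch on a made-up two-row dataframe. The values are illustrative, and the string column is named explicitly rather than detected by dtype.

```python
import pandas as pd

# Hypothetical frame with inconsistent column names and values.
toy = pd.DataFrame({
    "customerID": ["7590-VHVEG", "5575-GNVDE"],
    "PaymentMethod": ["Electronic check", "Mailed check"],
})

# Lowercase the column names and replace spaces with underscores.
toy.columns = toy.columns.str.lower().str.replace(" ", "_")

# Apply the same normalization to the values of a string column.
toy["paymentmethod"] = toy["paymentmethod"].str.lower().str.replace(" ", "_")

print(toy)
```

After this step, both the column names and the categorical values follow one convention, which keeps later filters such as `== 'yes'` free of case or spacing surprises.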


Next, let’s look at our target variable: churn. Currently, it’s categorical, with two values, “yes” and “no” (figure 3.5A). For binary classification, all models typically expect a number: 0 for “no” and 1 for “yes.” Let’s convert it to numbers:

df.churn = (df.churn == 'yes').astype(int)

When we use df.churn == 'yes', we create a Pandas series of type boolean. A position in the series is equal to True if it’s “yes” in the original series and False otherwise. Because the only other value it can take is “no,” this converts “yes” to True and “no” to False (figure 3.5B). When we perform casting by using the astype(int) function, we convert True to 1 and False to 0 (figure 3.5C). This is exactly the same idea that we used in the previous chapter when we implemented category encoding.

Figure 3.5 The expression (df.churn == 'yes').astype(int) broken down by individual steps
(A) The original Churn column: it’s a Pandas series that contains only “yes” and “no” values.
(B) The result of the == operator: it’s a Boolean series with True when the elements of the original series are “yes” and False otherwise.
(C) The result of converting the Boolean series to integer: True is converted to 1 and False is converted to 0.
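The three steps of figure 3.5 can be reproduced on a short, invented series; this is a sketch for illustration, not part of the project notebook:

```python
import pandas as pd

# (A) A hypothetical churn column with "yes"/"no" strings.
churn = pd.Series(["yes", "no", "no", "yes"])

# (B) Comparing with "yes" gives a Boolean series.
is_churn = churn == "yes"

# (C) Casting the Booleans to integers gives the 0/1 target.
as_int = is_churn.astype(int)

print(as_int.tolist())
```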

We’ve done a bit of preprocessing already, so let’s put aside some data for testing. In the previous chapter, we implemented the code for doing it ourselves. This is great for understanding how it works, but typically we don’t write such things from scratch every time we need them. Instead, we use existing implementations from libraries. In this chapter we use Scikit-learn, and it has a module called model_selection that can handle data splitting. Let’s use it.

The function we need to import from model_selection is called train_test_split:

from sklearn.model_selection import train_test_split


After importing, it’s ready to be used:

df_train_full, df_test = train_test_split(df, test_size=0.2, rando


The function train_test_split takes a dataframe df and creates two new dataframes: df_train_full and df_test. It does this by shuffling the original dataset and then splitting it in such a way that the test set contains 20% of the data and the train set contains the remaining 80% (figure 3.6). Internally, it’s implemented similarly to what we did ourselves in the previous chapter.

Figure 3.6 When using train_test_split, the original dataset is shuffled and then split such that 80% of the data goes to the train set and the remaining 20% goes to the test set.

This function takes a few parameters:

1. The first parameter that we pass is the dataframe that we want to split: df.
2. The second parameter is test_size, which specifies the size of the dataset we want to set aside for testing: 20% in our case.
3. The third parameter we pass is random_state. It’s needed for ensuring that every time we run this code, the dataframe is split in the exact same way.

Shuffling of data is done using a random-number generator; it’s important to fix the random seed to ensure that every time we shuffle the data, the final arrangement of rows will be the same.
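A short sketch of why fixing the seed matters: with the same random_state, two independent calls produce the same split. The ten-row dataframe and the seed value 42 are assumptions chosen for illustration, not the values used in the project.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical ten-row frame; the real project splits the churn dataframe.
toy = pd.DataFrame({"x": range(10)})

SEED = 42  # illustrative seed, not the one used in the chapter

# Two independent calls with the same seed.
first_train, first_test = train_test_split(toy, test_size=0.2, random_state=SEED)
second_train, second_test = train_test_split(toy, test_size=0.2, random_state=SEED)

# The shuffled indices of the two test sets are identical.
print(list(first_test.index), list(second_test.index))
```

Without random_state, each call would shuffle differently, and results measured on the test set would not be comparable between runs.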

We do see a side effect from shuffling: if we look at the dataframes after splitting by using the head() method, for example, we notice that the indices appear to be randomly ordered (figure 3.7).

Figure 3.7 The side effect of train_test_split: the indices (the first column) are shuffled in the new dataframes, so instead of consecutive numbers like 0, 1, 2, ..., they look random.

In the previous chapter, we split the data into three parts: train, validation, and test. However, the train_test_split function splits the data into only two parts: train and test. In spite of that, we can still split the original dataset into three parts; we just take one part and split it again (figure 3.8).

Figure 3.8 Because train_test_split splits a dataset into only two parts, we perform the split two times because we need three parts. First, we split the entire dataset into full train and test, and then we split full train into train and validation.

Let’s take the df_train_full dataframe and split it one more time into train and validation:

df_train, df_val = train_test_split(df_train_full, test_size=0.33


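y_train = df_train.churn.values  # target array for the train set
y_val = df_val.churn.values      # target array for the validation set

del df_train['churn']  # remove the target from the feature dataframes
del df_val['churn']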
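The arithmetic behind the two-stage split is easy to check on placeholder data. The sketch below assumes 100 dummy rows and an arbitrary seed of 0; only the test_size values 0.2 and 0.33 come from the text. Setting aside 33% of the remaining 80% leaves roughly 0.8 × 0.33 ≈ 26% of all rows for validation and 0.8 × 0.67 ≈ 54% for training.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for the dataset: 100 row identifiers.
rows = np.arange(100)

# Stage 1: hold out 20% of everything as the test set.
full_train, test = train_test_split(rows, test_size=0.2, random_state=0)

# Stage 2: hold out 33% of the remaining rows as the validation set.
train, val = train_test_split(full_train, test_size=0.33, random_state=0)

# Share of the original data that ends up in each part.
shares = {
    "train": len(train) / len(rows),
    "val": len(val) / len(rows),
    "test": len(test) / len(rows),
}
print(shares)
```

The exact row counts depend on how the library rounds fractional sizes, so the shares are approximate rather than exactly 54/26/20.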


Now the dataframes are prepared, and we are ready to use the training dataset for performing initial exploratory data analysis.

3.1.3 Exploratory data analysis


Looking at the data before training a model is important. The more we know about the data and the problems inside, the better the model we can build afterward.

We should always check for any missing values in the dataset because many machine learning models cannot easily deal with missing data. We have already found a problem with the TotalCharges column and replaced the missing values with zeros. Now let’s see if we need to perform any additional null handling:

df_train_full.isnull().sum()

It prints all zeros (figure 3.9), so we have no missing values in the dataset and don’t need to do anything extra.

Figure 3.9 We don’t have to handle missing values in the dataset: all the values in all the columns are present.

Another thing we should do is check the distribution of values in the target variable. Let’s take a look at it using the value_counts() method:

df_train_full.churn.value_counts()

It prints

0    4113
1    1521

The first column is the value of the target variable, and the second is the count. As we see, the majority of the customers didn’t churn.

We know the absolute numbers, but let’s also check the proportion of churned users among all customers. For that, we need to divide the number of customers who churned by the total number of customers. We know that 1,521 of 5,634 churned, so the proportion is

1521 / 5634 = 0.27

This gives us the proportion of churned users, or the probability that a customer will churn. As we see in the training dataset, approximately 27% of the customers stopped using our services, and the rest remained as customers.

The proportion of churned users, or the probability of churning, has a special name: churn rate.

There’s another way to calculate the churn rate: the mean() method. It’s more convenient to use than manually calculating the rate:

global_mean = df_train_full.churn.mean()

Using this method, we also get 0.27 (figure 3.10).

Figure 3.10 Calculating the global churn rate in the training dataset

The reason it produces the same result is the way we calculate the mean value. If you don’t remember, the formula for that is

$$\frac{1}{n} \sum_{i=1}^{n} y_i$$

where n is the number of items in the dataset.

Because each target value y_i can take only zeros and ones, when we sum all of them, we get the number of ones, or the number of people who churned. Then we divide it by the total number of customers, which is exactly the same as the formula we used for calculating the churn rate previously.
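The equivalence can be checked with plain Python, using only the two counts reported by value_counts() above. Rebuilding the target as a list is a simplification for illustration.

```python
# Counts reported by value_counts() for the training data.
n_not_churned = 4113
n_churned = 1521

# Rebuild the 0/1 target as a plain list (the order does not matter for the mean).
y = [0] * n_not_churned + [1] * n_churned
n = len(y)

# Manual churn rate: churned customers divided by all customers.
manual_rate = n_churned / n

# Mean of the 0/1 values: the sum counts the ones, then we divide by n.
mean_rate = sum(y) / n

print(round(manual_rate, 2), round(mean_rate, 2))
```

Both expressions divide the same numerator (the number of ones) by the same denominator, which is why mean() on a 0/1 column returns the churn rate.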

Our churn dataset is an example of a so-called imbalanced dataset. There were almost three times as many people who didn’t churn in our dataset as those who did churn, and we say that the nonchurn class dominates the churn class. We can clearly see that: the churn rate in our data is 0.27, which is a strong indicator of class imbalance. The opposite of imbalanced is the balanced case, when positive and negative classes are equally distributed among all observations.

EXERCISE 3.1

The mean of a Boolean array is

a) The percentage of False elements in the array: the number of False elements divided by the length of the array

b) The percentage of True elements in the array: the number of True elements divided by the length of the array

c) The length of an array

Both the categorical and numerical variables in our dataset are important, but they are also different and need different treatment. For that, we want to look at them separately.

We will create two lists:

categorical, which will contain the names of categorical variables
numerical, which, likewise, will have the names of numerical variables

Let’s create them:

categorical = ['gender', 'seniorcitizen', 'partner', 'dependents',
               'phoneservice', 'multiplelines', 'internetservice',
               'onlinesecurity', 'onlinebackup', 'deviceprotection',
               'techsupport', 'streamingtv', 'streamingmovies',
               'contract', 'paperlessbilling', 'paymentmethod']

numerical = ['tenure', 'monthlycharges', 'totalcharges']

First, we can see how many unique values each variable has. We already know we should have just a few for each column, but let’s verify it:

df_train_full[categorical].nunique()

Indeed, we see that most of the columns have two or three values and one (paymentmethod) has four (figure 3.11). This is good. We don’t need to spend extra time preparing and cleaning the data; everything is already good to go.

Figure 3.11 The number of distinct values for each categorical variable. We see that all the variables have very few unique values.

Now we come to another important part of exploratory data analysis: understanding which features may be important for our model.

3.1.4 Feature importance


Knowing how other variables affect the target variable, churn, is the key to understanding the data and building a good model. This process is called feature importance analysis, and it’s often done as a part of exploratory data analysis to figure out which variables will be useful for the model. It also gives us additional insights about the dataset and helps answer questions like “What makes customers churn?” and “What are the characteristics of people who churn?”

We have two different kinds of features: categorical and numerical. Each kind has different ways of measuring feature importance, so we will look at each separately.

Churn rate

Let’s start by looking at categorical variables. The first thing we can do is look at the churn rate for each variable. We know that a categorical variable has a set of values it can take, and each value defines a group inside the dataset.

We can look at all the distinct values of a variable. Then, for each value, there’s a group of customers: all the customers who have this value. For each such group, we can compute the churn rate, which is the group churn rate. When we have it, we can compare it with the global churn rate, that is, the churn rate calculated for all the observations at once.

If the difference between the rates is small, the value is not important when predicting churn because this group of customers is not really different from the rest of the customers. On the other hand, if the difference is not small, something inside that group sets it apart from the rest. A machine learning algorithm should be able to pick this up and use it when making predictions.

Let’s check first for the gender variable. This gender variable can take two values, female and male. There are two groups of customers: ones that have gender == 'female' and ones that have gender == 'male' (figure 3.12). To compute the churn rate for all female customers, we first select only rows that correspond to gender == 'female' and then compute the churn rate for them:

female_mean = df_train_full[df_train_full.gender == 'female'].chu


Figure 3.12 The dataframe is split by the values of the gender variable into two groups: a group with gender == "female" and a group with gender == "male".

We then do the same for all male customers:

male_mean = df_train_full[df_train_full.gender == 'male'].churn.me


When we execute this code and check the results, we see that the churn rate of female customers is 27.7% and that of male customers is 26.3%, whereas the global churn rate is 27% (figure 3.13). The difference between the group rates for both females and males is quite small, which indicates that knowing the gender of the customer doesn’t help us identify whether they will churn.

Figure 3.13 The global churn rate compared with churn rates
among males and females. The numbers are quite close, which
means that gender is not a useful variable when predicting
churn.


Now let’s take a look at another variable: partner. It takes values of yes and no, so there are two groups of customers: the ones for which partner == 'yes' and the ones for which partner == 'no'.

We can check the group churn rates using the same code as we used previously. All we need to change is the filter conditions:

partner_yes = df_train_full[df_train_full.partner == 'yes'].churn


partner_no = df_train_full[df_train_full.partner == 'no'].churn.me


As we see, the rates for those who have a partner are quite different from the rates for those who don’t: 20.5% and 33%, respectively. It means that clients with no partner are more likely to churn than the ones with a partner (figure 3.14).

Figure 3.14 The churn rate for people with a partner is significantly less than the rate for the ones without a partner (20.5% versus 33%), which indicates that the partner variable is useful for predicting churn.
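The select-then-average pattern used for gender and partner can be sketched on a small invented dataframe. The eight rows below are made up so the rates are easy to verify by hand; they are not taken from the churn dataset.

```python
import pandas as pd

# Hypothetical data: four customers with a partner, four without.
toy = pd.DataFrame({
    "partner": ["yes", "yes", "yes", "yes", "no", "no", "no", "no"],
    "churn":   [0, 0, 0, 1, 0, 1, 1, 0],
})

# Global churn rate: mean of the 0/1 target over all rows (3 of 8).
global_rate = toy.churn.mean()

# Group churn rates: filter the rows with a Boolean mask, then take the mean.
partner_yes = toy[toy.partner == "yes"].churn.mean()  # 1 of 4
partner_no = toy[toy.partner == "no"].churn.mean()    # 2 of 4

print(global_rate, partner_yes, partner_no)
```

Comparing each group rate with the global rate is the whole test: a group whose rate sits far from the global one carries information about churn.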


Risk ratio

In addition to looking at the difference between the group rate and the global rate, it’s interesting to look at the ratio between them. In statistics, the ratio between probabilities in different groups is called the risk ratio, where risk refers to the risk of having the effect. In our case, the effect is churn, so it’s the risk of churning:

risk = group rate / global rate

For gender == female, for example, the risk of churning is 1.02:

risk = 27.7% / 27% = 1.02

Risk is a number between zero and infinity. It has a nice interpretation that tells you how likely the elements of the group are to have the effect (churn) compared with the entire population.

If the difference between the group rate and the global rate is small, the risk is close to 1: this group has the same level of risk as the rest of the population. Customers in the group are as likely to churn as anyone else. In other words, a group with a risk close to 1 is not risky at all (figure 3.15, group A).

If the risk is lower than 1, the group has lower risks: the churn rate in this group is smaller than the global churn. For example, the value 0.5 means that the clients in this group are two times less likely to churn than clients in general (figure 3.15, group B).

On the other hand, if the value is higher than 1, the group is risky: there’s more churn in the group than in the population. So a risk of 2 means that customers from the group are two times more likely to churn (figure 3.15, group C).

Figure 3.15 Churn rate of different groups compared with the


global churn rate. In group (A), the rates are approximately the
same, so the risk of churn is around 1. In group (B), the group
churn rate is smaller than the global rate, so the risk is around
0.5. Finally, in group (C), the group churn rate is higher than the
global rate, so the risk is close to 2.
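The three situations in figure 3.15 can be reproduced with a few lines of plain Python. This is only a sketch: the function name risk_ratio and all the rates below are made up for illustration and are not taken from the churn dataset.

```python
def risk_ratio(group_rate, global_rate):
    """Return the churn rate of a group relative to the global churn rate."""
    return group_rate / global_rate


# Hypothetical global churn rate (not the value from the dataset)
global_rate = 0.30

# Hypothetical group churn rates chosen to mirror groups A, B, and C
group_rates = {"A": 0.30, "B": 0.15, "C": 0.60}

for name, rate in group_rates.items():
    risk = risk_ratio(rate, global_rate)
    print(f"group {name}: churn rate {rate:.0%}, risk {risk:.2f}")
```

Group A ends up with a risk of 1, group B with a risk below 1, and group C with a risk above 1, which matches the reading of the figure.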

The term risk originally comes from controlled trials, in which one
group of patients is given a treatment (a medicine) and the other
group isn't (only a placebo). Then we compare how effective the
medicine is by calculating the rate of negative outcomes in each
group and then calculating the ratio between the rates:

risk = negative outcome rate in group 1 / negative outcome rate in group 2

If the medicine turns out to be effective, it's said to reduce the risk of
having the negative outcome, and the value of the risk is less than 1.

Let's calculate the risks for gender and partner. For the gender
variable, the risks for both males and females are around 1 because the
rates in both groups aren't significantly different from the global
rate. Not surprisingly, it's different for the partner variable; having
no partner is more risky (table 3.1).

Table 3.1 Churn rates and risks for the gender and partner
variables. The churn rates for females and males are not
significantly different from the global churn rate, so the risks for
them to churn are low: both have risk values around 1. On the
other hand, the churn rate for people with no partner is
significantly higher than average, making them risky, with the
risk value of 1.22. People with partners tend to churn less, so for
them, the risk is only 0.75.

Variable  Value   Churn rate  Risk
gender    Female  27.7%       1.02
          Male    26.3%       0.97
partner   Yes     20.5%       0.75
          No      33%         1.22

We did this for only two variables. Let's now do this for all the
categorical variables. To do that, we need a piece of code that checks
all the values a variable has and computes the churn rate for each of
these values.

If we used SQL, that would be straightforward to do. For gender, we'd
need to do something like this:

SELECT
    gender, AVG(churn),
    AVG(churn) - global_churn,
    AVG(churn) / global_churn
FROM
    data
GROUP BY
    gender

This is a rough translation to Pandas:

global_mean = df_train_full.churn.mean()

df_group = df_train_full.groupby(by='gender').churn.agg(['mean'])  #1
df_group['diff'] = df_group['mean'] - global_mean                  #2
df_group['risk'] = df_group['mean'] / global_mean                  #3

df_group

In ❶ we calculate the AVG(churn) part. For that, we use the agg
function to indicate that we need to aggregate data into one value per
group: the mean value. In ❷ we create another column, diff, where
we will keep the difference between the group mean and the global
mean. Likewise, in ❸ we create the column risk, where we calculate
the ratio between the group mean and the global mean.

We can see the results in figure 3.16.

Figure 3.16 The churn rate for the gender variable. We see that
for both values, the difference between the group churn rate and
the global churn rate is not very large.
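To make the grouping logic explicit, here is a minimal sketch of the same computation in plain Python, without Pandas. The records are toy data invented for this example, and the dictionary keys mean, diff, and risk simply mirror the column names used in the Pandas version.

```python
from collections import defaultdict

# Toy records (made up): (gender, churn), with churn encoded as 0 or 1
records = [
    ("female", 1), ("female", 0), ("female", 0), ("female", 1),
    ("male", 0), ("male", 0), ("male", 1), ("male", 0),
]

# Global churn rate: the mean of the churn column over all records
global_mean = sum(churn for _, churn in records) / len(records)

# Collect the churn flags of every group, like GROUP BY gender
groups = defaultdict(list)
for gender, churn in records:
    groups[gender].append(churn)

# For each group, compute the mean, the difference, and the risk
stats = {}
for gender, flags in groups.items():
    mean = sum(flags) / len(flags)
    stats[gender] = {
        "mean": mean,
        "diff": mean - global_mean,
        "risk": mean / global_mean,
    }

for gender, row in stats.items():
    print(gender, row)
```

In this toy data the female group churns more often than average (risk above 1) and the male group less often (risk below 1).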

Let's now do that for all categorical variables. We can iterate through
them and apply the same code for each:

from IPython.display import display

for col in categorical:
    df_group = df_train_full.groupby(by=col).churn.agg(['mean'])
    df_group['diff'] = df_group['mean'] - global_mean
    df_group['risk'] = df_group['mean'] / global_mean
    display(df_group)

Two things are different in this code. First, instead of manually
specifying the column name, we iterate over all categorical variables.

The second difference is more subtle: we need to call the display
function to render a dataframe inside the loop. The way we typically
display a dataframe is to leave it as the last line in a Jupyter Notebook
cell and then execute the cell. If we do it that way, the dataframe is
displayed as the cell output. This is exactly how we managed to see
the content of the dataframe at the beginning of the chapter (figure
3.1). However, we cannot do this inside a loop. To still be able to see
the content of the dataframe, we call the display function
explicitly.

From the results (figure 3.17) we learn that

For gender, there is not much difference between females
and males. Both means are approximately the same, and for
both groups the risks are close to 1.
Senior citizens tend to churn more than nonseniors: the risk
of churning is 1.53 for seniors and 0.89 for nonseniors.
People with a partner churn less than people with no
partner. The risks are 0.75 and 1.22, respectively.
People who use phone service are not at risk of churning:
the risk is close to 1, and there's almost no difference with
the global churn rate. People who don't use phone service
are even less likely to churn: the risk is below 1, and the
difference with the global churn rate is negative.

Figure 3.17 Churn rate difference and risk for four categorical
variables: (A) gender, (B) seniorcitizen, (C) partner, and
(D) phoneservice

Some of the variables have quite significant differences (figure 3.18):

Clients with no tech support tend to churn more than those
who have it.
People with monthly contracts cancel the contract a lot
more often than others, and people with two-year contracts
churn very rarely.

Figure 3.18 Difference between the group churn rate and the
global churn rate for (A) techsupport and (B) contract. People
with no tech support and month-to-month contracts tend to
churn a lot more than clients from other groups, whereas people
with tech support and two-year contracts are very low-risk
clients.

This way, just by looking at the differences and the risks, we can
identify the most discriminative features: the features that are
helpful for detecting churn. Thus, we expect that these features will
be useful for our future models.

Mutual information

The kinds of differences we just explored are useful for our analysis
and important for understanding the data, but it's hard to use them
to say what the most important feature is and whether tech support
is more useful than the type of contract.

Luckily, the metrics of importance can help us: we can measure the
degree of dependency between a categorical variable and the target
variable. If two variables are dependent, knowing the value of one
variable gives us some information about another. On the other hand,
if a variable is completely independent of the target variable, it's not
useful and can be safely removed from the dataset.

In our case, knowing that the customer has a month-to-month
contract may indicate that this customer is more likely to churn than
not.

IMPORTANT

Customers with month-to-month contracts tend to churn a lot
more than customers with other kinds of contracts. This is exactly
the kind of relationship we want to find in our data. Without such
relationships in data, machine learning models will not work:
they will not be able to make predictions. The higher the degree
of dependency, the more useful a feature is.

For categorical variables, one such metric is mutual information,
which tells how much information we learn about one variable if we
learn the value of the other variable. It's a concept from information
theory, and in machine learning, we often use it to measure the
mutual dependency between two variables.

Higher values of mutual information mean a higher degree of
dependence: if the mutual information between a categorical variable
and the target is high, this categorical variable will be quite useful for
predicting the target. On the other hand, if the mutual information is
low, the categorical variable and the target are independent, and thus
the variable will not be useful for predicting the target.
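To see what this metric measures, here is a small sketch that computes mutual information directly from its definition, the sum over all value pairs of p(x, y) * log(p(x, y) / (p(x) * p(y))). The function name mutual_information and the toy lists are assumptions made for this example; the natural logarithm is used, which to my knowledge is also what Scikit-learn's mutual_info_score reports.

```python
import math
from collections import Counter


def mutual_information(xs, ys):
    """Mutual information (in nats) between two equally long label sequences."""
    n = len(xs)
    count_x = Counter(xs)            # how often each value of x occurs
    count_y = Counter(ys)            # how often each value of y occurs
    count_xy = Counter(zip(xs, ys))  # how often each (x, y) pair occurs

    mi = 0.0
    for (x, y), c in count_xy.items():
        p_xy = c / n
        p_x = count_x[x] / n
        p_y = count_y[y] / n
        mi += p_xy * math.log(p_xy / (p_x * p_y))
    return mi


# Toy data (made up): a contract type and two possible churn columns
contract = ["monthly", "monthly", "yearly", "yearly"]
churn_dependent = [1, 1, 0, 0]    # fully determined by the contract
churn_independent = [1, 0, 1, 0]  # unrelated to the contract

print(mutual_information(contract, churn_dependent))
print(mutual_information(contract, churn_independent))
```

When the contract fully determines churn, the score is as high as it can be for a balanced binary target (log 2); when the two columns are unrelated, it is zero.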

Mutual information is already implemented in Scikit-learn in the
mutual_info_score function from the metrics package, so we
can just use it:

from sklearn.metrics import mutual_info_score

def calculate_mi(series):                                        #1
    return mutual_info_score(series, df_train_full.churn)        #2

df_mi = df_train_full[categorical].apply(calculate_mi)           #3
df_mi = df_mi.sort_values(ascending=False).to_frame(name='MI')   #4

df_mi

In ❸, we use the apply method to apply the calculate_mi
function we defined in ❶ to each column of the df_train_full
dataframe. Because we include an additional step of selecting only
categorical variables, it's applied only to them. The function we
define in ❶ takes only one parameter: series. This is a column
from the dataframe on which we invoked the apply() method. In
❷, we compute the mutual information score between the series and
the target variable churn. The output is a single number, so the
output of the apply() method is a Pandas series. Finally, in ❹, we
sort the elements of the series by the mutual information score and
convert the series to a dataframe. This way, the result is rendered
nicely in Jupyter.

As we see, contract, onlinesecurity, and techsupport are
among the most important features (figure 3.19). Indeed, we've
already noted that contract and techsupport are quite
informative. It's also not surprising that gender is among the least
important features, so we shouldn't expect it to be useful for the
model.

Figure 3.19 Mutual information between categorical variables
and the target variable. Higher values are better. According to it,
contract is the most useful variable, whereas gender is the
least useful. (A) The most useful features according to the mutual
information score. (B) The least useful features according to the
mutual information score.

Correlation coefficient

Mutual information is a way to quantify the degree of dependency
between two categorical variables, but it doesn't work when one of
the features is numerical, so we cannot apply it to the three
numerical variables that we have.

We can, however, measure the dependency between a binary target
variable and a numerical variable. We can pretend that the binary
variable is numerical (containing only the numbers zero and one)
and then use the classical methods from statistics to check for any
dependency between these variables.

One such method is the correlation coefficient (sometimes referred to as
Pearson's correlation coefficient). It is a value from –1 to 1:

Positive correlation means that when one variable goes up,
the other variable tends to go up as well. In the case of a
binary target, when the values of the variable are high, we
see ones more often than zeros. But when the values of the
variable are low, zeros become more frequent than ones.
Zero correlation means no linear relationship between the two
variables.
Negative correlation occurs when one variable goes up and
the other goes down. In the binary case, if the values are
high, we see more zeros than ones in the target variable.
When the values are low, we see more ones.
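The sign behavior described in this list can be checked with a hand-written Pearson correlation. This is a sketch on invented toy data: the function name pearson and the lists tenure, charges, churn_a, and churn_b are assumptions for illustration, not values from the churn dataset.

```python
import math


def pearson(xs, ys):
    """Pearson's correlation coefficient between two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Toy data (made up): short-tenure customers churn, long-tenure ones stay
tenure = [1, 2, 3, 24, 36, 60]
churn_a = [1, 1, 0, 0, 0, 0]

# Toy data (made up): customers with higher charges churn
charges = [20, 30, 50, 70, 90, 100]
churn_b = [0, 0, 0, 1, 1, 1]

print(pearson(tenure, churn_a))   # negative: high values go with zeros
print(pearson(charges, churn_b))  # positive: high values go with ones
```

The binary target is simply treated as the numbers zero and one, which is the same trick the text describes.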

It’s very easy to calculate the correlation coefficient in Pandas:

df_train_full[numerical].corrwith(df_train_full.churn)


We see the results in figure 3.20:

The correlation between tenure and churn is –0.35: it has
a negative sign, so the longer customers stay, the less often
they tend to churn. For customers staying with the
company for two months or less, the churn rate is 60%; for
customers with tenure between 3 and 12 months, the churn
rate is 40%; and for customers staying longer than a year,
the churn rate is 17%. So the higher the value of tenure, the
smaller the churn rate (figure 3.21A).
monthlycharges has a positive coefficient of 0.19, which
means that customers who pay more tend to leave more
often. Only 8% of those who pay less than $20 monthly
churned; customers paying between $21 and $50 churn
more frequently with a churn rate of 18%; and 32% of people
paying more than $50 churned (figure 3.21B).
totalcharges has a negative correlation, which makes
sense: the longer people stay with the company, the more
they have paid in total, so it's less likely that they will
leave. In this case, we expect a pattern similar to tenure.
For small values, the churn rate is high; for larger values,
it's lower.

Figure 3.20 Correlation between numerical variables and churn.
tenure has a high negative correlation: as tenure grows, churn
rate goes down. monthlycharges has positive correlation: the
more customers pay, the more likely they are to churn.

After doing initial exploratory data analysis, identifying important
features, and getting some insights into the problem, we are ready to
do the next step: feature engineering and model training.

Figure 3.21 Churn rate for tenure (negative correlation of
–0.35) and monthlycharges (positive correlation of 0.19).
(A) Churn rate for different values of tenure. The correlation
coefficient is negative, so the trend is downward: for higher
values of tenure, the churn rate is smaller.
(B) Churn rate for different values of monthlycharges. The
correlation coefficient is positive, so the trend is upward: for
higher values of monthlycharges, the churn rate is higher.

3.2 Feature engineering


We had an initial look at the data and identified what could be useful
for the model. After doing that, we have a clear understanding of how
other variables affect churn, our target.

Before we proceed to training, however, we need to perform the
feature engineering step: transforming all categorical variables to
numeric features. We'll do that in the next section, and after that,
we'll be ready to train the logistic regression model.

3.2.1 One-hot encoding for categorical variables


As we already know from the first chapter, we cannot just take a
categorical variable and put it into a machine learning model. The
models can deal only with numbers in matrices. So, we need to
convert our categorical data into a matrix form, or encode.

One such encoding technique is one-hot encoding. We already saw
this encoding technique in the previous chapter, when creating
features for the make of a car and other categorical variables. There,
we mentioned it only briefly and used it in a very simple way. In this
chapter, we will spend more time understanding and using it.

If a variable contract has possible values (monthly, yearly, and
two-year), we can represent a customer with the yearly contract as
(0, 1, 0). In this case, the yearly value is active, or hot, so it gets 1,
whereas the remaining values are not active, or cold, so they are 0.

To understand this better, let's consider a case with two categorical
variables and see how we create a matrix from them. These variables
are

gender, with values female and male
contract, with values monthly, yearly, and two-year

Because the gender variable has only two possible values, we create
two columns in the resulting matrix. The contract variable has
three columns, and in total, our new matrix will have five columns:

gender=female
gender=male
contract=monthly
contract=yearly
contract=two-year

Let’s consider two customers (figure 3.22):

A female customer with a yearly contract
A male customer with a monthly contract

For the first customer, the gender variable is encoded by putting 1
in the gender=female column and 0 in the gender=male column.
Likewise, contract=yearly gets 1, whereas the remaining contract
columns, contract=monthly and contract=two-year, get 0.

As for the second customer, gender=male and contract=monthly get
ones, and the rest of the columns get zeros (figure 3.22).

Figure 3.22 The original dataset with categorical variables is on


the left and the one-hot encoded representation on the right. For
the first customer, gender=male and contract=monthly are the
hot columns, so they get 1. For the second customer, the hot
columns are gender=female and contract=yearly.
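The encoding of these two customers can be written out by hand in a few lines. This is only an illustrative sketch: the helper one_hot and the columns list are names introduced here, and the column order follows the list of five columns given in the text.

```python
# The five output columns, in the order listed in the text
columns = [
    "gender=female",
    "gender=male",
    "contract=monthly",
    "contract=yearly",
    "contract=two-year",
]


def one_hot(customer):
    """Encode a dict of categorical values as a list of zeros and ones."""
    # The "hot" columns are the name=value pairs present for this customer
    active = {f"{name}={value}" for name, value in customer.items()}
    return [1 if column in active else 0 for column in columns]


customers = [
    {"gender": "female", "contract": "yearly"},
    {"gender": "male", "contract": "monthly"},
]

matrix = [one_hot(customer) for customer in customers]
for row in matrix:
    print(row)
```

Each row contains exactly two ones: one for the gender column that applies and one for the contract column that applies.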


The way we implemented it previously was simple but quite limited.
We first looked at the top five values of the variable and then looped
over each value and manually created a column in the dataframe.
When the number of features grows, however, this process becomes
tedious.

Luckily, we don't need to implement this by hand: we can use
Scikit-learn. We can perform one-hot encoding in multiple ways in
Scikit-learn, but we will use DictVectorizer.

As the name suggests, DictVectorizer takes in a dictionary and
vectorizes it, that is, it creates vectors from it. Then the vectors are
put together as rows of one matrix. This matrix is used as input to a
machine learning algorithm (figure 3.23).

Figure 3.23 The process of creating a model. First, we convert a


dataframe to a list of dictionaries, then we vectorize the list to a
matrix, and finally, we use the matrix to train a model.

To use this method, we need to convert our dataframe to a list of
dictionaries, which is simple to do in Pandas using the to_dict
method with the orient="records" parameter:

train_dict = df_train[categorical + numerical].to_dict(orient='records')

If we take a look at the first element of this new list, we see

{'gender': 'male',
 'seniorcitizen': 0,
 'partner': 'yes',
 'dependents': 'yes',
 'phoneservice': 'yes',
 'multiplelines': 'no',
 'internetservice': 'no',
 'onlinesecurity': 'no_internet_service',
 'onlinebackup': 'no_internet_service',
 'deviceprotection': 'no_internet_service',
 'techsupport': 'no_internet_service',
 'streamingtv': 'no_internet_service',
 'streamingmovies': 'no_internet_service',
 'contract': 'two_year',
 'paperlessbilling': 'no',
 'paymentmethod': 'mailed_check',
 'tenure': 12,
 'monthlycharges': 19.7,
 'totalcharges': 258.35}

Each column from the dataframe is a key in this dictionary, with
values coming from the actual dataframe row values.

Now we can use DictVectorizer. We create it and then fit it to the
list of dictionaries we created previously:

from sklearn.feature_extraction import DictVectorizer

dv = DictVectorizer(sparse=False)

dv.fit(train_dict)


In this code we create a DictVectorizer instance, which we call
dv, and "train" it by invoking the fit method. The fit method
looks at the content of these dictionaries and figures out the possible
values for each variable and how to map them to the columns in the
output matrix. If a feature is categorical, it applies the one-hot
encoding scheme, but if a feature is numerical, it's left intact.

The DictVectorizer class can take in a set of parameters. We
specify one of them: sparse=False. This parameter means that the
created matrix will not be sparse; instead, a simple NumPy array
is created. If you don't know about sparse matrices, don't worry:
we don't need them in this chapter.

After we fit the vectorizer, we can use it for converting the
dictionaries to a matrix by using the transform method:

X_train = dv.transform(train_dict)


This operation creates a matrix with 45 columns. Let's have a look at
the first row, which corresponds to the customer we looked at
previously:

X_train[0]


When we put this code into a Jupyter Notebook cell and execute it, we
get the following output:

array([ 0. , 0. , 1. , 1. , 0. , 0. , 0. ,
        0. , 1. , 1. , 0. , 0. , 86.1, 1. ,
        0. , 0. , 0. , 1. , 0. , 0. , 1. ,
        1. , 0. , 1. , 1. , 0. , 0. , 0. ,
        1. , 0. , 0. , 0. , 1. , 0. , 0. ,
        0. , 0. , 1. , 71. , 6045.9])

As we see, most of the elements are ones and zeros: they're one-hot
encoded categorical variables. Not all of them are ones and zeros,
however. We see that three of them are other numbers. These are our
numeric variables: monthlycharges, tenure, and
totalcharges.

We can learn the names of all these columns by using the
get_feature_names method (in Scikit-learn 1.0 and later, the
equivalent method is get_feature_names_out; the old name was
removed in version 1.2):

dv.get_feature_names()

It prints

['contract=month-to-month',
 'contract=one_year',
 'contract=two_year',
 'dependents=no',
 'dependents=yes',
 # some rows omitted
 'tenure',
 'totalcharges']

As we see, for each categorical feature it creates one column per
distinct value. For contract, we have
contract=month-to-month, contract=one_year, and
contract=two_year, and for dependents, we have
dependents=no and dependents=yes. Features such as tenure
and totalcharges keep the original names because they are
numerical; therefore, DictVectorizer doesn't change them.

Now our features are encoded as a matrix, so we can move to the next
step: using a model to predict churn.
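To make the fit and transform steps less of a black box, here is a deliberately simplified stand-in written in plain Python. It is not Scikit-learn's implementation: the class TinyDictVectorizer, its attribute feature_names_, and the toy records are all invented for this sketch. It only mimics the behavior described above: string values become name=value columns, and numeric values keep their original column.

```python
class TinyDictVectorizer:
    """Simplified illustration of dictionary vectorization (hypothetical class)."""

    @staticmethod
    def _column(key, value):
        # Categorical (string) values get a "key=value" column;
        # numeric values keep the original key as the column name
        return f"{key}={value}" if isinstance(value, str) else key

    def fit(self, records):
        # Learn the set of output columns from the data
        names = {self._column(k, v) for record in records for k, v in record.items()}
        self.feature_names_ = sorted(names)
        return self

    def transform(self, records):
        rows = []
        for record in records:
            row = dict.fromkeys(self.feature_names_, 0.0)
            for key, value in record.items():
                name = self._column(key, value)
                if name in row:  # values not seen during fit are ignored
                    row[name] = 1.0 if isinstance(value, str) else float(value)
            rows.append([row[name] for name in self.feature_names_])
        return rows


# Toy records (made up), shaped like the output of to_dict(orient='records')
records = [
    {"contract": "two_year", "tenure": 12},
    {"contract": "month-to-month", "tenure": 1},
]

tv = TinyDictVectorizer().fit(records)
X = tv.transform(records)

print(tv.feature_names_)
for row in X:
    print(row)
```

The learned columns are stored once during fit, so transform can later be applied to new data (such as a validation set) and still produce the same column layout.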

EXERCISE 3.2

How would DictVectorizer encode the following list of
dictionaries?

records = [
    {'total_charges': 10, 'paperless_billing': 'yes'},
    {'total_charges': 30, 'paperless_billing': 'no'},
    {'total_charges': 20, 'paperless_billing': 'no'}
]

a) Columns: ['total_charges', 'paperless_billing=yes',
   'paperless_billing=no']
   Values: [10, 1, 0], [30, 0, 1], [20, 0, 1]

b) Columns: ['total_charges=10', 'total_charges=20',
   'total_charges=30', 'paperless_billing=yes',
   'paperless_billing=no']
   Values: [1, 0, 0, 1, 0], [0, 0, 1, 0, 1], [0, 1, 0, 0, 1]

3.3 Machine learning for classification


We have learned how to use Scikit-learn to perform one-hot
encoding for categorical variables, and now we can transform them
into a set of numerical features and put everything together into a
matrix.

When we have a matrix, we are ready to do the model training part. In
this section we learn how to train the logistic regression model and
interpret its results.

3.3.1 Logistic regression


In this chapter, we use logistic regression as a classification model,
and now we train it to distinguish churned and not-churned users.

Logistic regression has a lot in common with linear regression, the
model we learned in the previous chapter. If you remember, the
linear regression model is a regression model that can predict a
number. It has the form

g(x_i) = w_0 + x_i^T w

where

x_i is the feature vector corresponding to the ith
observation.
w_0 is the bias term.
w is a vector with the weights of the model.

We apply this model and get g(x_i), the prediction of what we think
the value for x_i should be. Linear regression is trained to predict the
target variable y_i, the actual value of the observation i. In the
previous chapter, this was the price of a car.

Linear regression is a linear model. It's called linear because it
combines the weights of the model with the feature vector linearly,
using the dot product. Linear models are simple to implement, train,
and use. Because of their simplicity, they are also fast.

Logistic regression is also a linear model, but unlike linear
regression, it's a classification model, not regression, even though
the name might suggest that. It's a binary classification model, so
the target variable y_i is binary; the only values it can have are zero
and one. Observations with y_i = 1 are typically called positive examples:
examples in which the effect we want to predict is present. Likewise,
examples with y_i = 0 are called negative examples: the effect we want
to predict is absent. For our project, y_i = 1 means that the customer
churned, and y_i = 0 means the opposite: the customer stayed with us.

The output of logistic regression is a probability: the probability that
the observation x_i is positive, or, in other words, the probability that
y_i = 1. For our case, it's the probability that the customer i will churn.

To be able to treat the output as a probability, we need to make sure
that the predictions of the model always stay between zero and one.
We use a special mathematical function for this purpose called
sigmoid, and the full formula for the logistic regression model is

g(x_i) = \text{sigmoid}(w_0 + x_i^T w)

If we compare it with the linear regression formula, the only
difference is this sigmoid function: in the case of linear regression, we
have only w_0 + x_i^T w. This is why both of these models are linear; they
are both based on the dot product operation.

The sigmoid function maps any value to a number between zero and
one (figure 3.24). It's defined this way:

\text{sigmoid}(x) = \frac{1}{1 + \exp(-x)}

Figure 3.24 The sigmoid function outputs values that are always
between 0 and 1. When the input is 0, the result of sigmoid is 0.5;
for negative values, the results are below 0.5 and start
approaching 0 for input values less than –6. When the input is
positive, the result of sigmoid is above 0.5 and approaches 1 for
input values starting from 6.
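A quick numeric check of the behavior described in the caption can be done in a few lines. The definition below follows the standard formula 1 / (1 + exp(-x)); the particular input values are just sample points chosen for illustration.

```python
import math


def sigmoid(score):
    """Map any real number to a value strictly between 0 and 1."""
    return 1 / (1 + math.exp(-score))


# Sample points (chosen for illustration): far left, near zero, far right
for score in [-6, -2, 0, 2, 6]:
    print(f"sigmoid({score:>2}) = {sigmoid(score):.4f}")
```

The output is exactly 0.5 at zero, very close to 0 at –6, and very close to 1 at 6, and the curve is symmetric: sigmoid(x) + sigmoid(–x) = 1.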


We know from chapter 2 that if the feature vector x_i is n-dimensional,
the dot product x_i^T w can be unwrapped as a sum, and we can write
g(x_i) as

g(x_i) = \text{sigmoid}(w_0 + x_{i1} w_1 + x_{i2} w_2 + \cdots + x_{in} w_n)

Or, using sum notation, as

g(x_i) = \text{sigmoid}\left(w_0 + \sum_{j=1}^{n} x_{ij} w_j\right)

Previously, we translated the formulas to Python for illustration.
Let's do the same here.

The linear regression model has the following formula:

g(x_i) = w_0 + \sum_{j=1}^{n} x_{ij} w_j

If you remember from the previous chapter, this formula translates
to the following Python code:

# bias, w, and n (the number of features) are the model parameters,
# assumed to be defined elsewhere
def linear_regression(xi):
    result = bias
    for j in range(n):
        result = result + xi[j] * w[j]
    return result

The translation of the logistic regression formula to Python is almost
identical to the linear regression case, except that at the end, we
apply the sigmoid function:

def logistic_regression(xi):
    score = bias
    for j in range(n):
        score = score + xi[j] * w[j]
    prob = sigmoid(score)
    return prob

Of course, we also need to define the sigmoid function:

import math

def sigmoid(score):
    return 1 / (1 + math.exp(-score))

We use score to mean the intermediate result before applying the
sigmoid function. The score can take any real value. The probability is
the result of applying the sigmoid function to the score; this is the
final output, and it can take only the values between zero and one.
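The functions in this section rely on a bias and a weight vector that come from outside. The sketch below passes them in as arguments so it runs on its own; the weights, the bias, and the feature vector are hypothetical values picked for illustration, not parameters of a trained model.

```python
import math


def sigmoid(score):
    return 1 / (1 + math.exp(-score))


def logistic_regression(xi, bias, w):
    """Return the predicted probability for one feature vector."""
    score = bias
    for xj, wj in zip(xi, w):
        score = score + xj * wj   # the dot product, one term at a time
    return sigmoid(score)         # squash the score into (0, 1)


# Hypothetical parameters and input (not learned from data)
bias = -1.0
w = [0.5, -0.25]
xi = [2.0, 4.0]

# score = -1.0 + 2.0 * 0.5 + 4.0 * (-0.25) = -1.0
prob = logistic_regression(xi, bias, w)
print(prob)
```

A negative score gives a probability below 0.5, a positive score gives one above 0.5, and a score of exactly zero gives 0.5.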

The parameters of the logistic regression model are the same as for
linear regression:

w_0 is the bias term.
w = (w_1, w_2, ..., w_n) is the weights vector.

To learn the weights, we need to train the model, which we will do
now using Scikit-learn.

EXERCISE 3.3

Why do we need sigmoid for logistic regression?

a) Sigmoid converts the output to values between –6 and 6, which is
easier to deal with.

b) Sigmoid makes sure the output is between zero and one, which can
be interpreted as probability.

3.3.2 Training logistic regression


To get started, we first import the model:

from sklearn.linear_model import LogisticRegression


Then we train it by calling the fit method:

model = LogisticRegression(solver='liblinear', random_state=1)

model.fit(X_train, y_train)


The class LogisticRegression from Scikit-learn encapsulates the
training logic behind this model. It's configurable, and we can change
quite a few parameters. In fact, we already specify two of them:
solver and random_state. Both are needed for reproducibility:

random_state. The seed number for the random-number
generator. It shuffles the data when training the model; to
make sure the shuffle is the same every time, we fix the
seed.
solver. The underlying optimization library. In the
current version (at the moment of writing, v0.20.3), the
default value for this parameter is liblinear, but
according to the documentation
(https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html),
it will change to a different one in version v0.22. To make
sure our results are reproducible in the later versions, we
also set this parameter.

Other useful parameters for the model include C, which controls the
regularization level. We talk about it in the next chapter when we
cover parameter tuning. Specifying C is optional; by default, it gets
the value 1.0.
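The training call and its parameters can be tried end to end on a tiny dataset. The following is a minimal sketch, not the chapter's actual pipeline: the six records, the labels, and the names `train_dicts`, `y_toy`, `dv_toy`, and `model_toy` are invented for illustration. In Scikit-learn, C is the inverse of the regularization strength, so smaller values mean stronger regularization.

```python
import numpy as np
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression

# Toy records invented for illustration; the real project uses df_train.
train_dicts = [
    {'contract': 'month-to-month', 'tenure': 1},
    {'contract': 'month-to-month', 'tenure': 3},
    {'contract': 'one_year', 'tenure': 20},
    {'contract': 'two_year', 'tenure': 48},
    {'contract': 'month-to-month', 'tenure': 2},
    {'contract': 'two_year', 'tenure': 60},
]
y_toy = np.array([1, 1, 0, 0, 1, 0])  # 1 = churned, 0 = stayed

# One-hot encode the categorical field and keep the numeric one as is.
dv_toy = DictVectorizer(sparse=False)
X_toy = dv_toy.fit_transform(train_dicts)

# C=1.0 is the default; it is spelled out here only to show where it goes.
model_toy = LogisticRegression(solver='liblinear', C=1.0, random_state=1)
model_toy.fit(X_toy, y_toy)

# One row per record, two columns: P(no churn) and P(churn).
proba = model_toy.predict_proba(X_toy)
print(proba.shape)
```

Fixing both solver and random_state, as the text recommends, keeps a run like this repeatable across Scikit-learn versions.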

The training takes a few seconds, and when it’s done, the model is
ready to make predictions. Let’s see how well the model performs.
We can apply it to our validation data to obtain the probability of
churn for each customer in the validation dataset.

To do that, we need to apply the one-hot encoding scheme to all the
categorical variables. First, we convert the dataframe to a list of
dictionaries and then feed it to the DictVectorizer we fit
previously:

val_dict = df_val[categorical + numerical].to_dict(orient='records')


X_val = dv.transform(val_dict)

As a result, we get X_val, a matrix with features from the validation
dataset. Now we are ready to put this matrix to the model. To get the
probabilities, we use the predict_proba method of the model:

y_pred = model.predict_proba(X_val)

The result of predict_proba is a two-dimensional NumPy array, or
a two-column matrix. The first column of the array contains the
probability that the target is negative (no churn), and the second
column contains the probability that the target is positive (churn)
(figure 3.25).

Figure 3.25 The predictions of the model: a two-column matrix.


The first column contains the probability that the target is zero
(the client won’t churn). The second column contains the
opposite probability (the target is one, and the client will churn).

These columns convey the same information. We know the
probability of churn (call it p), and the probability of not churning
is always 1 – p, so we don’t need both columns.

Thus, it’s enough to take only the second column of the prediction.
To select only one column from a two-dimensional array in NumPy,
we can use the slicing operation [:, 1] :

y_pred = model.predict_proba(X_val)[:, 1]

This syntax might be confusing, so let’s break it down. Two
positions are inside the brackets, the first one for rows and the
second one for columns.

When we use [:, 1] , NumPy interprets it this way:

: means select all the rows.
1 means select only the column at index 1, and because
the indexing starts at 0, it’s the second column.

As a result, we get a one-dimensional NumPy array that contains the
values from the second column only.
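The slicing can be checked on a small hand-written array. This is only a sketch: the numbers in `proba` are made up to have the same two-column shape that predict_proba returns.

```python
import numpy as np

# Made-up probabilities: one row per customer,
# column 0 = P(no churn), column 1 = P(churn).
proba = np.array([
    [0.93, 0.07],
    [0.17, 0.83],
    [0.60, 0.40],
])

# ":" keeps every row, "1" keeps only the column at index 1.
second_column = proba[:, 1]

print(second_column)        # the churn probabilities only
print(second_column.shape)  # one-dimensional: one value per customer
```

Because each row sums to one, the dropped first column can always be recovered as 1 minus the second.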

This output (probabilities) is often called soft predictions. These tell
us the probability of churning as a number between zero and one. It’s
up to us to decide how to interpret this number and how to use it.

Remember how we wanted to use this model: we wanted to retain
customers by identifying those who are about to cancel their contract
with the company and send them promotional messages, offering
discounts and other benefits. We do this in the hope that after
receiving the benefit, they will stay with the company. On the other
hand, we don’t want to give promotions to all our customers,
because it will hurt us financially: we will make less profit, if any.

To make the actual decision about whether to send a promotional
letter to our customers, using the probability alone is not enough. We
need hard predictions: binary values of True (churn, so send the
mail) or False (not churn, so don’t send the mail).

To get the binary predictions, we take the probabilities and cut them
at a certain threshold. If the probability for a customer is at or above
this threshold, we predict churn; otherwise, not churn. If we
select 0.5 to be this threshold, making the binary predictions is easy.
We just use the “ >= ” operator:

y_pred >= 0.5

The comparison operators in NumPy are applied element-wise, and
the result is a new array that contains only Boolean values: True
and False . Under the hood, it performs the comparison for each
element of the y_pred array. If the element is greater than 0.5 or
equal to 0.5, the corresponding element in the output array is True ,
and otherwise, it’s False (figure 3.26).

Figure 3.26 The >= operator is applied element-wise in NumPy.


For every element, it performs the comparison, and the result is
another array with True or False values, depending on the
result of the comparison.
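A short sketch of the thresholding step, using invented soft predictions (`y_pred_toy` is a hypothetical stand-in for the real y_pred). Note that a value of exactly 0.5 counts as churn because the operator is `>=`.

```python
import numpy as np

# Invented soft predictions for four customers.
y_pred_toy = np.array([0.07, 0.83, 0.50, 0.49])

# Element-wise comparison: each probability is checked against 0.5.
churn_toy = y_pred_toy >= 0.5

print(churn_toy)        # Boolean array, one decision per customer
print(churn_toy.dtype)  # bool
```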

Let’s write the results to the churn array:

churn = y_pred >= 0.5

When we have these hard predictions made by our model, we would
like to understand how good they are, so we are ready to move to the
next step: evaluating the quality of these predictions. In the next
chapter, we will spend a lot more time learning about different
evaluation techniques for binary classification, but for now, let’s do a
simple check to make sure our model learned something useful.

The simplest thing to check is to take each prediction and compare it
with the actual value. If we predict churn and the actual value is
churn, or we predict non-churn and the actual value is non-churn,
our model made the correct prediction. If the predictions don’t
match, they aren’t good. If we calculate the number of times our
predictions match the actual value, we can use it for measuring the
quality of our model.

This quality measure is called accuracy. It’s very easy to calculate
accuracy with NumPy:

(y_val == churn).mean()

Even though it’s easy to calculate, it might be difficult to understand
what this expression does when you see it for the first time. Let’s try
to break it down into individual steps.

First, we apply the == operator to compare two NumPy arrays:
y_val and churn . If you remember, the first array, y_val ,
contains only numbers: zeros and ones. This is our target variable:
one if the customer churned and zero otherwise. The second array
contains Boolean predictions: True and False values. In this case
True means we predict the customer will churn, and False means
the customer will not churn (figure 3.27).

Figure 3.27 Applying the == operator to compare the target


data with our predictions

Even though these two arrays have different types inside (integer
and Boolean), it’s still possible to compare them. The Boolean array
is cast to integer such that True values are turned to “1” and
False values are turned to “0.” Then it’s possible for NumPy to
perform the actual comparison (figure 3.28).

Figure 3.28 To compare the prediction with the target data, the
array with predictions is cast to integer.

Like the >= operator, the == operator is applied element-wise. In
this case, however, we have two arrays to compare, and here, we
compare each element of one array with the respective element of the
other array. The result is again a Boolean array with True or False
values, depending on the outcome of the comparison (figure 3.29).

Figure 3.29 The == operator from NumPy is applied element-wise
for two NumPy arrays.

In our case, if the true value in y_val matches our prediction in
churn , the label is True , and if it doesn’t, the label is False . In
other words, we have True if our prediction is correct and False if
it’s not.

Finally, we take the results of comparison (the Boolean array) and
compute its mean using the mean() method. This method, however,
is applied to numbers, not Boolean values, so before calculating the
mean, the values are cast to integers: True values to “1” and
False values to “0” (figure 3.30).

Figure 3.30 When computing the mean of a Boolean array,


NumPy first casts it to integers and then computes the mean.

Finally, as we already know, if we compute the mean of an array that
contains only ones and zeros, the result is the fraction of ones in that
array, which we already used for calculating the churn rate. Because
“1” ( True ) in this case is a correct prediction and “0” ( False ) is
an incorrect prediction, the resulting number tells us the percentage
of correct predictions.

After executing this line of code, we see 0.8 in output. This means
that the model predictions matched the actual value 80% of the time,
or the model makes correct predictions in 80% of cases. This is what
we call the accuracy of the model.
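The steps behind the accuracy expression can be traced on two tiny arrays. This is a sketch with invented values; `y_val_toy` and `churn_toy` are hypothetical stand-ins for the real y_val and churn.

```python
import numpy as np

# Invented targets (0/1) and hard predictions (True/False) for five customers.
y_val_toy = np.array([0, 1, 1, 0, 0])
churn_toy = np.array([False, True, False, False, True])

# Step 1: element-wise comparison; True where the prediction is correct.
matches = (y_val_toy == churn_toy)

# Step 2: the mean treats True as 1 and False as 0,
# so it equals the fraction of correct predictions.
accuracy = matches.mean()

print(matches)
print(accuracy)  # 3 correct out of 5
```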

Now we know how to train a model and evaluate its accuracy, but it’s
still useful to understand how it makes the predictions. In the next
section, we try to look inside the model and see how we can
interpret the coefficients it learned.

3.3.3 Model interpretation


We know that the logistic regression model has two parameters that
it learns from data:

w0 is the bias term.
w = (w1, w2, ..., wn) is the weights vector.

We can get the bias term from model.intercept_[0] . When we
train our model on all features, the bias term is –0.12.

The rest of the weights are stored in model.coef_[0] . If we look
inside, it’s just an array of numbers, which is hard to understand on
its own.

To see which feature is associated with each weight, let’s use the
get_feature_names method of the DictVectorizer . We can zip
the feature names together with the coefficients before looking at
them:

dict(zip(dv.get_feature_names(), model.coef_[0].round(3)))


This prints

{'contract=month-to-month': 0.563,

'contract=one_year': -0.086,

'contract=two_year': -0.599,

'dependents=no': -0.03,

'dependents=yes': -0.092,

... # the rest of the weights is omitted

'tenure': -0.069,

'totalcharges': 0.0}
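The get_feature_names method used here matches the Scikit-learn version the chapter was written against. In newer releases it was deprecated in favor of get_feature_names_out and later removed, so on a recent installation the call may fail. A small compatibility sketch, where the helper `feature_names` and the toy vectorizer `dv_demo` are hypothetical names introduced for illustration:

```python
from sklearn.feature_extraction import DictVectorizer


def feature_names(vectorizer):
    """Return feature names on both older and newer Scikit-learn releases."""
    if hasattr(vectorizer, 'get_feature_names_out'):
        return list(vectorizer.get_feature_names_out())
    return list(vectorizer.get_feature_names())


# Toy vectorizer fitted on two invented records.
dv_demo = DictVectorizer(sparse=False)
dv_demo.fit([
    {'contract': 'one_year', 'tenure': 12},
    {'contract': 'two_year', 'tenure': 30},
])

names = feature_names(dv_demo)
print(names)
```

With such a helper, the zip-with-coefficients idiom stays the same: `dict(zip(feature_names(dv), model.coef_[0].round(3)))`.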

To understand how the model works, let’s consider what happens
when we apply this model. To build the intuition, let’s train a
simpler and smaller model that uses only three variables:
contract , tenure , and totalcharges .

The variables tenure and totalcharges are numeric, so we don’t
need to do any additional preprocessing; we can take them as is. On
the other hand, contract is a categorical variable, so to be able to
use it, we need to apply one-hot encoding.

Let’s redo the same steps we did for training, this time using a
smaller set of features:

small_subset = ['contract', 'tenure', 'totalcharges']

train_dict_small = df_train[small_subset].to_dict(orient='records')
dv_small = DictVectorizer(sparse=False)

dv_small.fit(train_dict_small)

X_small_train = dv_small.transform(train_dict_small)

So as not to confuse it with the previous model, we add small to all
the names. This way, it’s clear that we use a smaller model, and it
saves us from accidentally overwriting the results we already have.
Additionally, we will use it to compare the quality of the small model
with the full one.

Let’s see which features the small model will use. For that, as
previously, we use the get_feature_names method from
DictVectorizer :

dv_small.get_feature_names()


It outputs the feature names:

['contract=month-to-month',

'contract=one_year',

'contract=two_year',

'tenure',

'totalcharges']

There are five features. As expected, we have tenure and
totalcharges , and because they are numeric, their names are not
changed.

As for the contract variable, it’s categorical, so DictVectorizer
applies the one-hot encoding scheme to convert it to numbers.
contract has three distinct values: month-to-month, one year,
and two years. Thus, the one-hot encoding scheme creates three new
features: contract=month-to-month , contract=one_year , and
contract=two_year .

Let’s train the small model on this set of features:

model_small = LogisticRegression(solver='liblinear', random_state=1)


model_small.fit(X_small_train, y_train)

The model is ready after a few seconds, and we can look inside the
weights it learned. Let’s first check the bias term:

model_small.intercept_[0]

It outputs –0.638. Then we can check the other weights, using the
same code as previously:

dict(zip(dv_small.get_feature_names(), model_small.coef_[0].round(3)))

This line of code shows the weight for each feature:


{'contract=month-to-month': 0.91,

'contract=one_year': -0.144,

'contract=two_year': -1.404,

'tenure': -0.097,

'totalcharges': 0.000}

Let’s put all these weights together in one table and call them w1,
w2, w3, w4, and w5 (table 3.2).

Table 3.2 The weights of a logistic regression model

         Bias     contract                        tenure   charges
                  month     year      2-year
Weight   w0       w1        w2        w3        w4       w5
Value    –0.639   0.91      –0.144    –1.404    –0.097   0.0

Now let’s take a look at these weights and try to understand what
they mean and how we can interpret them.

First, let’s think about the bias term and what it means. Recall that in
the case of linear regression, it’s the baseline prediction: the
prediction we would make without knowing anything else about the
observation. In the car price prediction project, it would be the price
of a car on average. This is not the final prediction; later, this baseline
is corrected with other weights.

In the case of logistic regression, it’s similar: it’s the baseline
prediction, or the score we would make on average. Likewise, we
later correct this score with the other weights. However, for logistic
regression, interpretation is a bit trickier because we also need to
apply the sigmoid function before we get the final output. Let’s
consider an example to help us understand that.

In our case, the bias term has the value of –0.639. This value is
negative. If we look at the sigmoid function, we can see that for
negative values, the output is lower than 0.5 (figure 3.31). For –0.639,
the resulting probability of churning is 34%. This means that on
average, a customer is more likely to stay with us than churn.

Figure 3.31 The bias term –0.639 on the sigmoid curve. The
resulting probability is less than 0.5, so the average customer is
more likely to not churn.
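The conversion from the bias term to a probability is a one-line computation. A minimal sketch, assuming the standard sigmoid 1 / (1 + exp(–score)); the helper name `sigmoid` is introduced here for illustration:

```python
import math


def sigmoid(score):
    """Map a raw score to a value between zero and one."""
    return 1 / (1 + math.exp(-score))


bias = -0.639  # the bias term of the small model, taken from the text
baseline_probability = sigmoid(bias)

print(round(baseline_probability, 3))  # roughly 0.345, below 0.5
print(sigmoid(0.0))                    # a score of zero maps to exactly 0.5
```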

The reason why the sign before the bias term is negative is the class
imbalance. There are a lot fewer churned users in the training data
than non-churned ones, meaning the probability of churn on average
is low, so this value for the bias term makes sense.

The next three weights are the weights for the contract variable.
Because we use one-hot encoding, we have three contract
features and three weights, one for each feature:

'contract=month-to-month': 0.91,

'contract=one_year': -0.144,

'contract=two_year': -1.404.

To build our intuition on how one-hot encoded weights can be
understood and interpreted, let’s think of a client with a month-to-month
contract. The contract variable has the following one-hot
encoding representation: the first position corresponds to the
month-to-month value and is hot, so it’s set to “1.” The remaining
positions correspond to one_year and two_year, so they are cold
and set to “0” (figure 3.32).

Figure 3.32 The one-hot encoding representation for a customer


with a month-to-month contract

We also know the weights w1, w2, and w3 that correspond to
contract=month-to-month , contract=one_year , and
contract=two_year (figure 3.33).

Figure 3.33 The weights of the contract=month-to-month,
contract=one_year, and contract=two_year features

To make a prediction, we perform the dot product between the
feature vector and the weights, which is multiplication of the values
in each position and then summation. The result of the multiplication
is 0.91, which turns out to be the same as the weight of the
contract=month-to-month feature (figure 3.34).

Figure 3.34 The dot product between the one-hot encoding


representation of the contract variable and the corresponding
weights. The result is 0.91, which is the weight of the hot feature.

Let’s consider another example: a client with a two-year contract. In
this case, the contract=two_year feature is hot and has a value of
“1,” and the rest are cold. When we multiply the vector with the
one-hot encoding representation of the variable by the weight vector, we
get –1.404 (figure 3.35).

Figure 3.35 For a customer with a two-year contract, the result of


the dot product is –1.404.

As we see, during the prediction, only the weight of the hot feature is
taken into account, and the rest of the weights are not considered in
calculating the score. This makes sense: the cold features have values
of zero, and when we multiply by zero, we get zero again (figure
3.36).

Figure 3.36 When we multiply the one-hot encoding representation
of a variable by the weight vector from the model, the result is the
weight corresponding to the hot feature.
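The two dot products described above can be reproduced directly with NumPy. This sketch uses the three contract weights from table 3.2; the variable names are chosen here for illustration.

```python
import numpy as np

# Weights w1, w2, w3 for month-to-month, one_year, two_year (table 3.2).
w_contract = np.array([0.91, -0.144, -1.404])

# One-hot vectors: exactly one position is hot (1), the others are cold (0).
month_to_month = np.array([1, 0, 0])
two_year = np.array([0, 0, 1])

# The dot product multiplies position by position and sums,
# so only the weight of the hot feature survives.
score_monthly = month_to_month.dot(w_contract)
score_two_year = two_year.dot(w_contract)

print(score_monthly)   # the month-to-month weight
print(score_two_year)  # the two_year weight
```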

The interpretation of the signs of the weights for one-hot encoded
features follows the same intuition as the bias term. If a weight is
positive, the respective feature is an indicator of churn, and vice
versa. If it’s negative, it’s more likely to belong to a non-churning
customer.

Let’s look again at the weights of the contract variable. The first
weight for contract=month-to-month is positive, so customers
with this type of contract are more likely to churn than not. The other
two features, contract=one_year and contract=two_year ,
have negative signs, so such clients are more likely to remain loyal to
the company (figure 3.37).

Figure 3.37 The sign of the weight matters. If it’s positive, it’s a
good indicator of churn; if it’s negative, it indicates a loyal
customer.

The magnitude of the weights also matters. For two_year , the
weight is –1.404, which is greater in magnitude than –0.144, the
weight for one_year . So, a two-year contract is a stronger indicator
of not churning than a one-year one. It confirms the feature
importance analysis we did previously. The risk ratios (the risk of
churning) for this set of features are 1.55 for monthly, 0.44 for
one-year, and 0.10 for two-year (figure 3.38).

Figure 3.38 The weights for the contract features and their
translation to probabilities. For contract=two_year , the
weight is –1.404, which translates to very low probability of
churn. For contract=one_year , the weight is –0.144, so the
probability is moderate. And for contract=month-to-month ,
the weight is 0.910, and the probability is quite high.

Now let’s have a look at the numerical features. We have two of
them: tenure and totalcharges . The weight of the tenure
feature is –0.097, which has a negative sign. This means the same
thing: the feature is an indicator of no churn. We already know from
the feature importance analysis that the longer clients stay with us,
the less likely they are to churn. The correlation between tenure
and churn is –0.35, which is also a negative number. The weight of
this feature confirms it: for every month that the client spends with
us, the total score gets lower by 0.097.

The other numerical feature, totalcharges , has a weight of zero.
Because it’s zero, no matter what the value of this feature is, the
model will never consider it, so this feature is not really important
for making the predictions.

To understand it better, let’s consider a couple of examples. For the
first example, let’s imagine we have a user with a month-to-month
contract, who spent a year with us and paid $1,000 (figure 3.39).

Figure 3.39 The score the model calculates for a customer with a
month-to-month contract and 12 months of tenure

This is the prediction we make for this customer:

We start with the baseline score. It’s the bias term with the
value of –0.639.
Because it’s a month-to-month contract, we add 0.91 to
this value and get 0.271. Now the score becomes positive, so
it may mean that the client is going to churn. We know that
a monthly contract is a strong indicator of churning.
Next, we consider the tenure variable. For each month
that the customer stayed with us, we subtract 0.097 from
the score so far. Thus, we get 0.271 – 12 · 0.097 = –0.893.
Now the score is negative again, so the likelihood of churn
decreases.
Now we add the amount of money the customer paid us
( totalcharges ) multiplied by the weight of this feature,
but because it’s zero, we don’t do anything. The result stays
–0.893.
The final score is a negative number, so we believe that the
customer is not very likely to churn soon.
To see the actual probability of churn, we compute the
sigmoid of the score, and it’s approximately 0.29. We can
treat this as the probability that this customer will churn.
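The steps above can be sketched as a few lines of plain Python. The weights and the bias come from table 3.2; the dictionary layout and the names `weights` and `customer_features` are just one convenient way to write the calculation, not the model's internal representation.

```python
import math

bias = -0.639
weights = {
    'contract=month-to-month': 0.91,
    'contract=one_year': -0.144,
    'contract=two_year': -1.404,
    'tenure': -0.097,
    'totalcharges': 0.0,
}

# Month-to-month contract, 12 months of tenure, $1,000 paid in total.
customer_features = {
    'contract=month-to-month': 1,
    'contract=one_year': 0,
    'contract=two_year': 0,
    'tenure': 12,
    'totalcharges': 1000,
}

# Baseline plus the weighted sum of the feature values.
score = bias + sum(weights[name] * value
                   for name, value in customer_features.items())

# The sigmoid turns the score into a probability of churn.
probability = 1 / (1 + math.exp(-score))

print(round(score, 3))        # -0.893
print(round(probability, 2))  # 0.29
```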

If we have another client with a yearly contract who stayed 24
months with us and spent $2,000, the score is –2.823 (figure 3.40).

Figure 3.40 The score that the model calculates for a customer
with a yearly contract and 24 months of tenure

After taking sigmoid, the score of –2.823 becomes 0.056, so the
probability of churn for this customer is even lower (figure 3.41).

Figure 3.41 The scores of –2.823 and –0.893 translated to


probability: 0.05 and 0.29, respectively


3.3.4 Using the model


Now we know a lot better how logistic regression works, and we can
also interpret what our model learned and understand how it makes
the predictions.

Additionally, we applied the model to the validation set, computed
the probabilities of churning for every customer there, and concluded
that the model is 80% accurate. In the next chapter we will evaluate
whether this number is satisfactory, but for now, let’s try to use the
model we trained. Now we can apply the model to customers for
scoring them. It’s quite easy.

First, we take a customer we want to score and put all the variable
values in a dictionary:

customer = {

'customerid': '8879-zkjof',

'gender': 'female',

'seniorcitizen': 0,

'partner': 'no',

'dependents': 'no',

'tenure': 41,

'phoneservice': 'yes',

'multiplelines': 'no',

'internetservice': 'dsl',

'onlinesecurity': 'yes',

'onlinebackup': 'no',

'deviceprotection': 'yes',

'techsupport': 'yes',

'streamingtv': 'yes',

'streamingmovies': 'yes',

'contract': 'one_year',

'paperlessbilling': 'yes',

'paymentmethod': 'bank_transfer_(automatic)',

'monthlycharges': 79.85,

'totalcharges': 3320.75,
}

NOTE

When we prepare items for prediction, they should undergo the
same preprocessing steps we did for training the model. If we
don’t do it in exactly the same way, the model might not get the
things it expects to see, and, in this case, the predictions could
get really off. This is why in the previous example, in the
customer dictionary, the field names and string values are
lowercased and spaces are replaced with underscores.
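One way to keep raw records consistent with the training data is to run them through a small normalization helper before scoring. This is a hypothetical sketch: `normalize_record` and the sample `raw` record are not part of the chapter's code, and they only mirror the lowercasing and underscore rules described in the note.

```python
def normalize_record(record):
    """Lowercase field names and string values; replace spaces with underscores."""
    cleaned = {}
    for key, value in record.items():
        new_key = key.lower().replace(' ', '_')
        if isinstance(value, str):
            value = value.lower().replace(' ', '_')
        cleaned[new_key] = value  # numeric values pass through unchanged
    return cleaned


# An invented raw record, written the way it might arrive from a source system.
raw = {
    'Contract': 'One year',
    'PaymentMethod': 'Bank transfer (automatic)',
    'tenure': 41,
}

clean = normalize_record(raw)
print(clean)
```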

Now we can use our model to see whether this customer is going to
churn. Let’s do it.

First, we convert this dictionary to a matrix by using the
DictVectorizer :

X_test = dv.transform([customer])

The input to the vectorizer is a list with one item: we want to score
only one customer. The output is a matrix with features, and this
matrix contains only one row, the features for this one customer:

[[ 0. , 1. , 0. , 1. , 0. , 0. , 0. ,

1. , 1. , 0. , 1. , 0. , 0. , 79.85,

1. , 0. , 0. , 1. , 0. , 0. , 0. ,

0. , 1. , 0. , 1. , 1. , 0. , 1. ,

0. , 0. , 0. , 0. , 1. , 0. , 0. ,

0. , 1. , 0. , 0. , 1. , 0. , 0. ,

1. , 41. , 3320.75]]

We see a bunch of one-hot encoding features (ones and zeros) as
well as some numeric ones ( monthlycharges , tenure , and
totalcharges ).

Now we take this matrix and put it into the trained model:

model.predict_proba(X_test)

The output is a matrix with predictions. For each customer, it outputs
two numbers, which are the probability of staying with the company
and the probability of churn. Because there’s only one customer, we
get a tiny NumPy array with one row and two columns:

[[0.93, 0.07]]

All we need from the matrix is the number at the first row and
second column: the probability of churning for this customer. To
select this number from the array, we use the brackets operator:

model.predict_proba(X_test)[0, 1]

We used this operator to select the second column from the array.
However, this time there’s only one row, so we can explicitly ask
NumPy to return the value from that row. Because indexes start from
0 in NumPy, [0, 1] means first row, second column.

When we execute this line, we see that the output is 0.073, so the
probability that this customer will churn is only 7%. It’s less
than 50%, so we will not send this customer a promotional mail.

We can try to score another client:

customer = {
    'gender': 'female',
    'seniorcitizen': 1,
    'partner': 'no',
    'dependents': 'no',
    'phoneservice': 'yes',
    'multiplelines': 'yes',
    'internetservice': 'fiber_optic',
    'onlinesecurity': 'no',
    'onlinebackup': 'no',
    'deviceprotection': 'no',
    'techsupport': 'no',
    'streamingtv': 'yes',
    'streamingmovies': 'no',
    'contract': 'month-to-month',
    'paperlessbilling': 'yes',
    'paymentmethod': 'electronic_check',
    'tenure': 1,
    'monthlycharges': 85.7,
    'totalcharges': 85.7
}

Let’s make a prediction:

X_test = dv.transform([customer])
model.predict_proba(X_test)[0, 1]

The output of the model is 83% likelihood of churn, so we should
send this client a promotional mail in the hope of retaining them.

So far, we’ve built intuition on how logistic regression works, how to
train it with Scikit-learn, and how to apply it to new data. We haven’t
covered the evaluation of the results yet; this is what we will do in
the next chapter.

3.4 Next steps


3.4.1 Exercises

You can try a couple of things to learn the topic better:

In the previous chapter, we implemented many things
ourselves, including linear regression and dataset splitting.
In this chapter we learned how to use Scikit-learn for that.
Try to redo the project from the previous chapter using
Scikit-learn. To use linear regression, you need
LinearRegression from the sklearn.linear_model
package. To use regularized regression, you need to import
Ridge from the same package sklearn.linear_model .
We looked at feature importance metrics to get some
insights into the dataset but did not really use this
information for other purposes. One way to use this
information could be removing features that aren’t useful
from the dataset to make the model simpler, faster, and
potentially better. Try to exclude the two least useful
features ( gender and phoneservice ) from the training
data matrix, and see what happens to validation accuracy.
What if we remove the most useful feature ( contract )?

3.4.2 Other projects


We can use classification in numerous ways to solve real-life
problems, and now, after learning the materials of this chapter, you
should have enough knowledge to apply logistic regression to solve
similar problems. In particular, we suggest these:

Classification models are often used for marketing
purposes, and one of the problems they solve is lead scoring. A
lead is a potential customer who may convert (become an
actual customer) or not. In this case, the conversion is the
target we want to predict. You can take a dataset from
https://www.kaggle.com/ashydv/leads-dataset and build a
model for that. You may notice that the lead-scoring
problem is similar to churn prediction, but in one case, we
want to get a new client to sign a contract with us, and in
another case, we want a client not to cancel the contract.
Another popular application of classification is default
prediction, which is estimating the risk of a customer’s not
paying back a loan. In this case, the variable we want to
predict is default, and it also has two outcomes: whether
the customer managed to pay back the loan in time (good
customer) or not (default). You can find many datasets
online for training a model, such as
https://archive.ics.uci.edu/ml/datasets/default+of+credit+card+clients
(or the same one available via Kaggle:
https://www.kaggle.com/pratjain/credit-card-default).

Summary
The risk ratio of a categorical feature value compares the
churn rate of the group that has this value with the global
churn rate, so it tells us whether that group is more or less
likely than average to have the condition we model. For churn,
values lower than 1.0 indicate low risk of churning, whereas
values higher than 1.0 indicate high risk of churning. It tells
us which features are important for predicting the target
variable and helps us better understand the problem we’re
solving.
Mutual information measures the degree of (in)dependence
between a categorical variable and the target. It’s a good
way of determining important features: the higher the
mutual information is, the more important the feature.
Correlation measures the dependence between two
numerical variables, and it can be used for determining if a
numerical feature is useful for predicting the target
variable.
One-hot encoding gives us a way to represent categorical
variables as numbers. Without it, it wouldn’t be possible to
easily use these variables in a model. Machine learning
models typically expect all input variables to be numeric, so
having an encoding scheme is crucial if we want to use
categorical features in modeling.
We can implement one-hot encoding by using
DictVectorizer from Scikit-learn. It treats string values
as categorical and applies the one-hot encoding scheme to
them while leaving numerical values intact. It’s very
convenient to use and doesn’t require a lot of coding on
our side.
Logistic regression is a linear model, just like linear
regression. The difference is that logistic regression has an
extra step at the end: it applies the sigmoid function to
convert the scores to probabilities (a number between zero
and one). That allows us to use it for classification. The
output is the probability of belonging to a positive class
(churn, in our case).
When the data is prepared, training logistic regression is
very simple: we create an instance of the LogisticRegression
class from Scikit-learn and call its fit method.
The model outputs probabilities, not hard predictions. To
binarize the output, we cut the predictions at a certain
threshold. If the probability is greater than or equal to 0.5,
we predict True (churn), and False (no churn)
otherwise. This allows us to use the model for solving our
problem: predicting customers who churn.
The weights of the logistic regression model are easy to
interpret and explain, especially when it comes to the
categorical variables encoded using the one-hot encoding
scheme. It helps us understand the behavior of the model
better and explain to others what it’s doing and how it’s
working.

In the next chapter we will continue with this project on churn
prediction. We will look at ways of evaluating binary classifiers and
then use this information for tuning the model’s performance.

Answers to exercises
Exercise 3.1 B) The percentage of True elements
Exercise 3.2 A) It will keep a numeric variable as is and
encode only the categorical variable.
Exercise 3.3 B) Sigmoid converts the output to a value
between zero and one.
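
The answers to exercises 3.1 and 3.3 can be checked with a short NumPy sketch. The array values and the sigmoid helper below are made up for illustration.

```python
import numpy as np

# Exercise 3.1: the mean of a Boolean array is the fraction of True elements.
flags = np.array([True, False, True, True])
fraction_true = flags.mean()


# Exercise 3.3: sigmoid maps any real-valued score to a value between 0 and 1.
def sigmoid(z):
    return 1 / (1 + np.exp(-z))


probs = sigmoid(np.array([-5.0, 0.0, 5.0]))
print(fraction_true, probs)
```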

Up next...
4 Evaluation metrics for classification
Accuracy as a way of evaluating binary classification models and its limitations
Determining where our model makes mistakes using a confusion table
Deriving other metrics like precision and recall from the confusion table
Using ROC (receiver operating characteristics) and AUC (area under the ROC curve) to further
understand the performance of a binary classification model
Cross-validating a model to make sure it behaves optimally


Tuning the parameters of a model to achieve the best predictive performance

© 2021 Manning Publications Co.
