2/7/23, 9:40 AM  Copy of IE506_Linear_Regression_PartII_High_Dimensions.ipynb - Colaboratory
Linear Regression in High Dimensions
In this session, we shall learn about linear regression when the predictor variables are high dimensional.
Let us first consider a sample data set, which will be useful for our study of linear regression in high dimensions.
import pandas as pd              # the pandas library is useful for data processing
import matplotlib.pyplot as plt  # the matplotlib library is useful for plotting purposes
# The following python directive helps to plot the graph in the notebook directly
%matplotlib inline
Now let us consider some open source data sets available on the internet. The data set we will consider represents airline costs for different airline companies as a response variable dependent on multiple attributes.
# load the data from the web
airline_costs_data = pd.read_csv('https://raw.githubusercontent.com/balamurugan-palaniappan-CEP/AIML_CEP_2023/main/data/airline_costs.txt',
                                 delim_whitespace=True, header=None)
airline_costs_data.head()
[output: first rows of the raw airline costs data — an airline name followed by eleven numeric attributes per row]
What do the numbers in the above data mean? Let us understand the data by reading its description.
import urllib.request  # this package is useful for accessing text files over the internet
airline_costs_data_description_file = urllib.request.urlopen('http://users.stat.ufl.edu/~winner/data/airline_costs.txt')
for line in airline_costs_data_description_file:
    decoded_line = line.decode("utf-8")
    print(decoded_line)
Source: (1958). "A Regression Analysis ...",
Journal of Air Law and Commerce, Vol.21, #3, pp.282-292.

Description: Regression relating operating costs per revenue ton-mile to:
length of flight, speed of plane, daily flight time per aircraft,
population served, ton-mile load factor, available tons per aircraft mile,
... except load factor.
Load factor and available tons (capacity)
for Northeast Airlines was imputed from summary calculations.

Variables/Columns
Length of flight (miles)
Speed of plane (miles per hour)
Daily flight time per plane (hours)   38-44
Population served (1000s)
Total operating cost (cents per revenue ton-mile)
Revenue tons per aircraft mile
Ton-mile load factor (proportion)
Available capacity (tons per mile)
Total assets ($100,000s)   86-92
Investments and special funds ($100,000s)   94-100
Adjusted assets ($100,000s)   102-108
Having seen the data description, let us insert the column names into the data now.
airline_costs_data.columns = ['Airline', 'Flight Length', 'Plane Speed', 'Daily flight time per plane',
                              'Population Served', 'TOC', 'Revenue', 'Load Factor', 'Capacity',
                              'Total Assets', 'Funds', 'Adjusted Assets']
# check by printing the data again
airline_costs_data.head()
[output: first rows of the data with named columns]
Let us move the TOC column to the second position, just after the airline company name.
TOC_column = airline_costs_data.pop('TOC')  # collect the contents of TOC column in a temporary object
# insert column into the dataframe using the insert(position, column name, column contents) function
airline_costs_data.insert(1, 'TOC', TOC_column)  # insert as second column, so position is 1
print("After shifting TOC column to second position")
airline_costs_data.head()
After shifting TOC column to second position
[output: first rows of the data with TOC as the second column]
Also note that the Adjusted Assets column is sufficient for our analysis, since it is obtained as the difference of the Total Assets and Funds columns. So we shall remove the Total Assets and Funds columns.
col_name_assets = airline_costs_data.pop('Total Assets')
col_name_funds = airline_costs_data.pop('Funds')
print('After dropping Total Assets and Funds columns')
airline_costs_data.head()
After dropping Total Assets and Funds columns
Seeing the data as mere numbers might not be interesting. So, let us use some graphical ways to visualize the data.
#we will plot multiple scatter plots of TOC vs other attributes
fig = plt.figure(figsize=(18, 16))
fig.constrained_layout = True
ax11 = fig.add_subplot(421)
ax12 = fig.add_subplot(422)
ax21 = fig.add_subplot(423)
ax22 = fig.add_subplot(424)
ax31 = fig.add_subplot(425)
ax32 = fig.add_subplot(426)
ax41 = fig.add_subplot(427)
ax42 = fig.add_subplot(428)

ax11.scatter(airline_costs_data['Flight Length'], airline_costs_data['TOC'])
ax11.set_title('TOC vs Flight Length')
ax11.set_xlabel('Flight Length')
ax11.set_ylabel('TOC')
ax12.scatter(airline_costs_data['Plane Speed'], airline_costs_data['TOC'])
ax12.set_title('TOC vs Plane Speed')
ax12.set_xlabel('Plane Speed')
ax12.set_ylabel('TOC')
ax21.scatter(airline_costs_data['Daily flight time per plane'], airline_costs_data['TOC'])
ax21.set_title('TOC vs Daily flight time')
ax21.set_xlabel('Daily flight time')
ax21.set_ylabel('TOC')
ax22.scatter(airline_costs_data['Population Served'], airline_costs_data['TOC'])
ax22.set_title('TOC vs Population Served')
ax22.set_xlabel('Population Served')
ax22.set_ylabel('TOC')
ax31.scatter(airline_costs_data['Revenue'], airline_costs_data['TOC'])
ax31.set_title('TOC vs Revenue')
ax31.set_xlabel('Revenue')
ax31.set_ylabel('TOC')
ax32.scatter(airline_costs_data['Load Factor'], airline_costs_data['TOC'])
ax32.set_title('TOC vs Load Factor')
ax32.set_xlabel('Load Factor')
ax32.set_ylabel('TOC')
ax41.scatter(airline_costs_data['Capacity'], airline_costs_data['TOC'])
ax41.set_title('TOC vs Capacity')
ax41.set_xlabel('Capacity')
ax41.set_ylabel('TOC')
ax42.scatter(airline_costs_data['Adjusted Assets'], airline_costs_data['TOC'])
ax42.set_title('TOC vs Adjusted Assets')
ax42.set_xlabel('Adjusted Assets')
ax42.set_ylabel('TOC')

plt.tight_layout(pad=0.4, w_pad=0.5, h_pad=2.0)
plt.show()
[figure: 4x2 grid of scatter plots of TOC vs each of the other attributes]
We see that there is a negative trend in most of the plots.
Another question: what do we mean by a linear (or linear-looking) trend between the response variable and the predictor variable when the predictor variable is high dimensional?
In the next figure, we shall illustrate the linear relationship between the response variable $y$ and the predictor variable denoted by $\mathbf{x} = (x_1, x_2)$, which is two-dimensional, using a plane in 3D space.
Linear relationship between x and y captured by a plane
from urllib.request import urlopen
from PIL import Image

img = Image.open(urlopen('https://raw.githubusercontent.com/balamurugan-palaniappan-CEP/AIML_CEP_2023/main/images/plane_pic.png'))
img
Hence, when we look for a linear trend (or linear relationship) when the predictor variable $\mathbf{x}$ is high dimensional, we look for the relationship to be approximated by a plane (or hyperplane).
Probabilistic interpretation
In probability terms, let us assume that $Y \in \mathbb{R}$ denotes the response random variable and $X \in \mathbb{R}^d$ denotes the predictor random variable in $d$ dimensions.
Then we assume that the expected value of $Y$ given some observation of $X = (x_1, x_2, \ldots, x_d)$ is represented as:
$$\mathbb{E}[Y \mid X = (x_1, x_2, \ldots, x_d)] = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_d x_d$$
This can be equivalently represented as:
$$\mathbb{E}[Y \mid X = (x_1, x_2, \ldots, x_d)] = \beta_0 + \sum_{j=1}^{d} \beta_j x_j$$
However, when we observe a data point $(\mathbf{x}^i, y^i)$ as a realization of the pair $(X, Y)$, where $\mathbf{x}^i = (x_1^i, x_2^i, \ldots, x_d^i)$, it is possible that the observed value $y^i$ of the response random variable $Y$ is not necessarily equal to the expected value $\mathbb{E}[Y \mid X = \mathbf{x}^i] = \beta_0 + \sum_{j=1}^{d} \beta_j x_j^i$.
In such a case, we assume that the discrepancy between the observed value $y^i$ and the expected value $\mathbb{E}[Y \mid X = \mathbf{x}^i]$ is captured by an error $e^i$ given by:
$$e^i = y^i - \mathbb{E}[Y \mid X = \mathbf{x}^i] = y^i - \Big(\beta_0 + \sum_{j=1}^{d} \beta_j x_j^i\Big).$$
Now we can assume that the data set $D$ comprises multiple realizations of the random variable pair $(X, Y)$.
Thus, when the data set $D$ contains $n$ observations of the form $\{(\mathbf{x}^1, y^1), (\mathbf{x}^2, y^2), \ldots, (\mathbf{x}^n, y^n)\}$, we can compute the errors as:
$$e^i = y^i - \Big(\beta_0 + \sum_{j=1}^{d} \beta_j x_j^i\Big), \quad \forall i \in \{1, 2, \ldots, n\}.$$
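This model and its error terms are easy to simulate; here is a minimal sketch with hypothetical coefficients and noise (all names and values below are illustrative, not taken from the airline data):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 50, 3                                   # hypothetical sample size and dimension
beta0 = 2.0                                    # hypothetical intercept
beta = np.array([1.5, -0.5, 3.0])              # hypothetical coefficients beta_1..beta_d

X = rng.normal(size=(n, d))                    # realizations x^i of the predictor
e = rng.normal(scale=0.1, size=n)              # error terms e^i
y = beta0 + X @ beta + e                       # observed responses y^i

# recover the errors from the observations: e^i = y^i - (beta0 + sum_j beta_j * x_j^i)
errors = y - (beta0 + X @ beta)
print(np.allclose(errors, e))                  # True
```

This just checks numerically that the errors defined above are exactly the gap between the observed responses and the conditional expectation.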
img_points_errors = Image.open(urlopen('https://raw.githubusercontent.com/balamurugan-palaniappan-CEP/AIML_CEP_2023/main/images/plane_pic_errors.png'))
img_points_errors
How to estimate the parameters $\beta_j,\ j = 0, 1, \ldots, d$?
One way to estimate the values of $\beta_j,\ j = 0, 1, \ldots, d$ is by minimizing the sum of squared errors given by $\min_{\beta} \sum_{i=1}^{n} (e^i)^2$, which can be equivalently written as the following optimization problem:
$$\min_{\beta} \sum_{i=1}^{n} \Big(y^i - \big(\beta_0 + \sum_{j=1}^{d} \beta_j x_j^i\big)\Big)^2$$
Note that in the high dimensional case as well, the optimization problem is called the ordinary least squares (OLS) problem, and the term $\sum_{i=1}^{n} \big(y^i - (\beta_0 + \sum_{j=1}^{d} \beta_j x_j^i)\big)^2$ is called the OLS objective function. We will denote the OLS objective by:
$$L(\beta_0, \beta_1, \ldots, \beta_d) = \sum_{i=1}^{n} \Big(y^i - \big(\beta_0 + \sum_{j=1}^{d} \beta_j x_j^i\big)\Big)^2$$
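The OLS objective above can be written directly as a small function; a sketch assuming NumPy arrays for the data (the function name and the one-point toy data are illustrative):

```python
import numpy as np

def ols_objective(beta0, beta, X, y):
    """OLS objective L(beta_0,...,beta_d): sum of squared errors over the data set."""
    residuals = y - (beta0 + X @ beta)   # e^i for every data point at once
    return np.sum(residuals ** 2)

# one-point sanity check: y = 5, x = (1, 1), beta0 = 1, beta = (2, 2)
# gives error 5 - (1 + 2 + 2) = 0, so the objective is 0
print(ols_objective(1.0, np.array([2.0, 2.0]), np.array([[1.0, 1.0]]), np.array([5.0])))  # 0.0
```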
Solving the OLS optimization problem:
Assuming the responses $y^1, y^2, \ldots, y^n$ are present in an $n \times 1$ matrix represented as:
$$\mathbf{y} = \begin{bmatrix} y^1 \\ y^2 \\ \vdots \\ y^n \end{bmatrix}$$
and the predictor variables $\mathbf{x}^1, \mathbf{x}^2, \ldots, \mathbf{x}^n$ are placed in an $n \times (d+1)$ matrix represented as:
$$X = \begin{bmatrix} x_1^1 & x_2^1 & \cdots & x_d^1 & 1 \\ x_1^2 & x_2^2 & \cdots & x_d^2 & 1 \\ \vdots & \vdots & & \vdots & \vdots \\ x_1^n & x_2^n & \cdots & x_d^n & 1 \end{bmatrix}$$
where note that the last column of matrix $X$ contains a column of all ones. This column is useful to incorporate the effect of the intercept parameter $\beta_0$ in the $X$ matrix.
Similarly, assume that we can write the coefficients $\beta_0, \beta_1, \ldots, \beta_d$ as a $(d+1) \times 1$ matrix represented as:
$$\beta = \begin{bmatrix} \beta_1 \\ \beta_2 \\ \vdots \\ \beta_d \\ \beta_0 \end{bmatrix}$$
(with $\beta_0$ placed last, matching the column of ones in $X$).
Now we can write the objective function as:
$$L(\beta) = \|\mathbf{y} - X\beta\|_2^2$$
To solve $\min_{\beta} L(\beta) = \|\mathbf{y} - X\beta\|_2^2$, we find the gradient with respect to $\beta$ and equate it to zero. Thus we get:
$$\nabla_{\beta} L(\beta) = 0 \implies -X^\top \mathbf{y} + X^\top X \beta = 0 \implies \beta = (X^\top X)^{-1} X^\top \mathbf{y}.$$
Note that the closed form expression for $\beta$ is valid only when $X^\top X$ is invertible. Otherwise we need to solve the system given by $X^\top X \beta = X^\top \mathbf{y}$ using a solver.
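As a numerical aside, explicitly inverting $X^\top X$ is usually avoided in practice; a sketch on hypothetical random data comparing the closed form with solving the normal equations via `np.linalg.solve`, and with the general least-squares solver `np.linalg.lstsq` (all data below is illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
X = np.hstack([rng.normal(size=(20, 3)), np.ones((20, 1))])  # ones in the last column, as above
y = rng.normal(size=(20, 1))

beta_inv   = np.linalg.inv(X.T @ X) @ (X.T @ y)    # closed form; needs X^T X invertible
beta_solve = np.linalg.solve(X.T @ X, X.T @ y)     # solve the normal equations directly
beta_lstsq = np.linalg.lstsq(X, y, rcond=None)[0]  # least-squares solver; works even when X^T X is singular

print(np.allclose(beta_inv, beta_solve), np.allclose(beta_solve, beta_lstsq))  # True True
```

All three agree on well-conditioned data; `solve` and `lstsq` are the numerically safer choices.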
Computing $\beta$ for the airline costs data set
n = len(airline_costs_data.index)  # number of data points
print("number of data points in the data set:", n)
cols = [2, 3, 4, 5, 6, 7, 8, 9]  # predictor variable columns
X_data = airline_costs_data[airline_costs_data.columns[cols]]
X_data
[output: first rows of X_data — the eight predictor columns: Flight Length, Plane Speed, Daily flight time per plane, Population Served, Revenue, Load Factor, Capacity, Adjusted Assets]
import numpy as np

# convert predictor variable columns into a numpy array
X_array = airline_costs_data[airline_costs_data.columns[cols]].to_numpy()
X = np.hstack((X_array, np.ones((X_array.shape[0], 1), dtype=X_array.dtype)))
response_cols = [1]  # the TOC column
y = airline_costs_data[airline_costs_data.columns[response_cols]].to_numpy()
print('X shape:', X.shape, 'y shape:', y.shape)
XtX = np.matmul(np.transpose(X), X)
print('XtX shape:', XtX.shape)
Xty = np.matmul(np.transpose(X), y)
print('Xty shape:', Xty.shape)
beta = np.matmul(np.linalg.inv(XtX), Xty)
print('beta:', beta)
[output: beta, a 9x1 array of estimated coefficients — eight predictor weights followed by the intercept]
Residual vs Target Plot
Sometimes it would be useful to plot the error (i.e., residual) $e^i$ versus the target values $y^i$.
residuals_list = []
for i in range(n):
    x_i = X[i]  # access i-th row of X
    y_i = y[i]  # access i-th row of y
    y_pred_i = np.dot(x_i, beta)  # compute the prediction obtained using the regression coefficients
    e_i = y_i - y_pred_i  # compute the residual
    residuals_list.append(e_i)  # append the value of e_i to the list

plt.scatter(airline_costs_data['TOC'], residuals_list)
plt.title('Residual plot')
plt.xlabel('TOC')
plt.ylabel('Residuals')
plt.grid()
plt.show()
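As a side note, a residual loop like the one above can also be expressed in a single vectorized step; a minimal sketch on hypothetical arrays (not the airline data):

```python
import numpy as np

rng = np.random.default_rng(2)
X = np.hstack([rng.normal(size=(10, 3)), np.ones((10, 1))])  # hypothetical design matrix with a ones column
beta = rng.normal(size=(4, 1))                               # hypothetical coefficients
y = X @ beta + rng.normal(scale=0.1, size=(10, 1))           # hypothetical responses

residuals = y - X @ beta  # all residuals in one matrix operation, no Python loop
print(residuals.shape)    # (10, 1)
```

Vectorizing avoids per-row Python overhead and gives the residuals as one array ready for plotting.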
Note that the residual plot helps to check the variance in the errors $e^i$. From the residual plot, we observe an outlier in terms of actual TOC value, since the TOC seems to be very high for that particular airline. There is another airline which has a slightly larger TOC value when compared to the other airlines. However, it is unclear if that can be considered an outlier.
In a residual vs target plot, the residuals are plotted against the target (or actual) values. Outliers in this plot can be identified as points that have large residuals compared to the other points. The residuals represent the difference between the actual value and the predicted value (from the linear regression model). If a point has a large residual, it means that the model is not fitting well to that point.
There are different methods to decide whether a point is an outlier or not, but a common approach is to use the median absolute deviation (MAD). The MAD is the median of the absolute deviations of the residuals from the median residual. Outliers can then be defined as points whose residuals are more than a certain number of MADs away from the median residual. This number is often taken as 3, but it can be adjusted based on the specifics of the data and the application.
In summary, in a residual vs target plot, outliers can be identified as points with large residuals compared to the other points, and this can be made precise using a MAD-based threshold.
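The MAD rule described above can be sketched as a small helper (the function name, the default threshold of 3 MADs, and the sample residuals are illustrative):

```python
import numpy as np

def mad_outliers(residuals, n_mads=3.0):
    """Flag residuals lying more than n_mads MADs away from the median residual."""
    residuals = np.asarray(residuals, dtype=float)
    med = np.median(residuals)                     # median residual
    mad = np.median(np.abs(residuals - med))       # median absolute deviation
    return np.abs(residuals - med) > n_mads * mad  # boolean mask of outliers

res = np.array([0.1, -0.2, 0.05, 0.15, -0.1, 5.0])  # the last residual is far from the rest
print(mad_outliers(res))  # only the last point is flagged
```

Applying such a mask to the residuals from the airline regression would single out the high-TOC airline noted earlier.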