2/17/23, 9:40 AM — Copy of IE506_Linear_Regression_PartII_High_Dimensions.ipynb - Colaboratory

Linear Regression in High Dimensions

In this session, we shall learn about linear regression when the predictor variables are high dimensional.

Let us first consider a sample data set, which will be useful for our study of linear regression in high dimensions.

```python
import pandas as pd              # the pandas library is useful for data processing
import matplotlib.pyplot as plt  # the matplotlib library is useful for plotting purposes
# The following python directive helps to plot the graph in the notebook directly
%matplotlib inline
```

Now let us consider some open source data sets available on the internet. The data set we will consider represents airline costs for different airline companies, with cost as a response variable dependent on multiple attributes.

```python
# load the data from the web
airline_costs_data = pd.read_csv('https://raw.githubusercontent.com/balamurugan-palaniappan-CEP/AIML_CEP_2023/main/data/airline_costs.txt')
airline_costs_data.head()
```

[table output: first rows of the data — airlines such as Braniff, Capital, Bonanza, Central, Eastern, Empire and Lake Central, each with their attribute values]

What do the numbers in the above data mean? Let us understand the data by seeing its description.

```python
import urllib.request  # this package is useful for accessing text files over the internet

# airline costs data description file
file = urllib.request.urlopen('http://users.stat.ufl.edu/~winner/data/airline_costs.txt')
for line in file:
    decoded_line = line.decode("utf-8")
    print(decoded_line)
```

```text
"A Regression Analysis of Airline Costs",
Journal of Air Law and Commerce, Vol. 21, #3, pp. 282-292.

Regression relating operating costs per revenue ton-mile to:
length of flight, speed of plane, daily flight time per aircraft,
population served, ton-mile load factor, available tons per aircraft mile.
Load factor and available tons (capacity) for Northeast Airlines
were imputed from summary calculations.

Variables/Columns:
  Airline
  Length of flight (miles)
  Speed of plane (miles per hour)
  Daily flight time per plane (hours)
  Population served (1000s)
  Total operating cost (cents per revenue ton-mile)
  Revenue tons per aircraft mile
  Ton-mile load factor (proportion)
  Available capacity (tons per mile)
  Total assets ($100,000s)
  Investments and special funds ($100,000s)
  Adjusted assets ($100,000s)
```

Having known the data description, let us insert the descriptions into the data now.

```python
airline_costs_data.columns = ['Airline', 'Flight Length', 'Plane Speed',
                              'Daily flight time per plane', 'Population Served',
                              'TOC', 'Revenue', 'Load Factor', 'Capacity',
                              'Total Assets', 'Funds', 'Adjusted Assets']
# check by printing the data again
airline_costs_data.head()
```

[table output: first rows of the data with the new column names]

Let us move the TOC column to the second position, just after the airline company name.
```python
TOC_column = airline_costs_data.pop('TOC')  # collect the contents of TOC column in a temporary object
# insert column into the dataframe using insert(position, column_name, column_contents)
airline_costs_data.insert(1, 'TOC', TOC_column)  # insert as second column, so position is 1
print("After shifting TOC column to second position")
airline_costs_data.head()
```

[table output: the data with TOC as the second column]

Also note that the Adjusted Assets column is sufficient for our analysis, since it is obtained as the difference of the Total Assets and Funds columns. So we shall remove the Total Assets and Funds columns.

```python
col_name_assets = airline_costs_data.pop('Total Assets')
col_name_funds = airline_costs_data.pop('Funds')
print('After dropping Total Assets and Funds columns')
airline_costs_data.head()
```

[table output: After dropping Total Assets and Funds columns]

Seeing the data as mere numbers might not be interesting. So let us use some graphical ways to visualize the data.

```python
# we will plot multiple scatter plots of TOC vs other attributes
fig = plt.figure(figsize=(18, 16))
fig.constrained_layout = True
ax11 = fig.add_subplot(421)
ax12 = fig.add_subplot(422)
ax21 = fig.add_subplot(423)
ax22 = fig.add_subplot(424)
ax31 = fig.add_subplot(425)
ax32 = fig.add_subplot(426)
ax41 = fig.add_subplot(427)
ax42 = fig.add_subplot(428)

ax11.scatter(airline_costs_data['Flight Length'], airline_costs_data['TOC'])
ax11.set_title('TOC vs Flight Length')
ax11.set_xlabel('Flight Length')
ax11.set_ylabel('TOC')

ax12.scatter(airline_costs_data['Plane Speed'], airline_costs_data['TOC'])
ax12.set_title('TOC vs Plane Speed')
ax12.set_xlabel('Plane Speed')
ax12.set_ylabel('TOC')

ax21.scatter(airline_costs_data['Daily flight time per plane'], airline_costs_data['TOC'])
ax21.set_title('TOC vs Daily flight time')
ax21.set_xlabel('Daily flight time')
ax21.set_ylabel('TOC')

ax22.scatter(airline_costs_data['Population Served'], airline_costs_data['TOC'])
ax22.set_title('TOC vs Population Served')
ax22.set_xlabel('Population Served')
ax22.set_ylabel('TOC')

ax31.scatter(airline_costs_data['Revenue'], airline_costs_data['TOC'])
ax31.set_title('TOC vs Revenue')
ax31.set_xlabel('Revenue')
ax31.set_ylabel('TOC')

ax32.scatter(airline_costs_data['Load Factor'], airline_costs_data['TOC'])
ax32.set_title('TOC vs Load Factor')
ax32.set_xlabel('Load Factor')
ax32.set_ylabel('TOC')

ax41.scatter(airline_costs_data['Capacity'], airline_costs_data['TOC'])
ax41.set_title('TOC vs Capacity')
ax41.set_xlabel('Capacity')
ax41.set_ylabel('TOC')

ax42.scatter(airline_costs_data['Adjusted Assets'], airline_costs_data['TOC'])
ax42.set_title('TOC vs Adjusted Assets')
ax42.set_xlabel('Adjusted Assets')
ax42.set_ylabel('TOC')

plt.tight_layout(pad=0.4, w_pad=0.5, h_pad=2.0)
plt.show()
```

[figure: 4×2 grid of scatter plots of TOC versus each attribute]

We see that there is a negative trend in most of the plots.

Another question: what do we mean by a linear (or linear-looking) trend between the response variable and the predictor variable when the predictor variable is high dimensional?

In the next figure, we shall illustrate the linear relationship between the response variable $y$ and the predictor variable denoted by $\mathbf{x} = (x_1, x_2)$, which is two-dimensional, using a plane in 3d space.

Linear relationship between x and y captured by a plane

```python
from urllib.request import urlopen
from PIL import Image
img = Image.open(urlopen('https://raw.githubusercontent.com/balamurugan-palaniappan-CEP/AIML_CEP_2023/main/images/plane_pic.png'))
img
```
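The image above is loaded from the course repository; a similar illustration can also be generated directly with matplotlib's 3-D plotting (a minimal sketch — the coefficients and grid below are arbitrary choices for illustration, not taken from the notebook):

```python
import numpy as np
import matplotlib.pyplot as plt

# illustrative (hypothetical) coefficients: y = b0 + b1*x1 + b2*x2
b0, b1, b2 = 1.0, 2.0, -0.5

# grid over the two predictor dimensions
x1, x2 = np.meshgrid(np.linspace(0, 5, 20), np.linspace(0, 5, 20))
y_plane = b0 + b1 * x1 + b2 * x2  # the plane of expected responses

fig = plt.figure()
ax = fig.add_subplot(111, projection='3d')
ax.plot_surface(x1, x2, y_plane, alpha=0.5)
ax.set_xlabel('x1')
ax.set_ylabel('x2')
ax.set_zlabel('y')
ax.set_title('Linear relationship captured by a plane')
plt.show()
```

Scattering observed points $(x_1, x_2, y)$ on the same axes (via `ax.scatter`) would show them lying near, but not exactly on, this plane.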
Hence, when we look for a linear trend (or linear relationship) when the predictor variable $\mathbf{x}$ is high dimensional, we look for the relationship to be approximated by a plane (or hyperplane).

Probabilistic interpretation

In probability terms, let us assume that $Y \in \mathbb{R}$ denotes the response random variable and $X \in \mathbb{R}^d$ denotes the predictor random variable in $d$ dimensions.

Then we assume that the expected value of $Y$ given some observation of $X = (x_1, x_2, \ldots, x_d)$ is represented as:

$$E[Y \mid X = (x_1, x_2, \ldots, x_d)] = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_d x_d.$$

This can be equivalently represented as:

$$E[Y \mid X = (x_1, x_2, \ldots, x_d)] = \beta_0 + \sum_{j=1}^{d} \beta_j x_j.$$

However, when we observe a data point $(\mathbf{x}^i, y^i)$ as a realization of the pair $(X, Y)$, where $\mathbf{x}^i = (x_1^i, x_2^i, \ldots, x_d^i)$, it is possible that the observed value $y^i$ of the response random variable $Y$ is not necessarily equal to the expected value $E[Y \mid X = \mathbf{x}^i] = \beta_0 + \sum_{j=1}^{d} \beta_j x_j^i$.

In such a case, we assume that the discrepancy between the observed value $y^i$ and the expected value $E[Y \mid X = \mathbf{x}^i]$ is captured by an error given by:

$$e^i = y^i - E[Y \mid X = \mathbf{x}^i] = y^i - \Big(\beta_0 + \sum_{j=1}^{d} \beta_j x_j^i\Big).$$

Now we can assume that the data set $D$ comprises multiple realizations of the random variable pair $(X, Y)$. Thus, when the data set $D$ contains observations of the form $\{(\mathbf{x}^1, y^1), (\mathbf{x}^2, y^2), \ldots, (\mathbf{x}^n, y^n)\}$, we can compute the errors as:

$$e^i = y^i - \Big(\beta_0 + \sum_{j=1}^{d} \beta_j x_j^i\Big), \quad \forall\, i = 1, 2, \ldots, n.$$

```python
# the image filename below is truncated in this copy of the notebook
img_points_errors = Image.open(urlopen('https://raw.githubusercontent.com/balamurugan-palaniappan-CEP/AIML_CEP_2023/main/images/plane_p...'))
img_points_errors
```

How to estimate the parameters $\beta_j,\ j = 0, 1, \ldots, d$?

One way to estimate the values of $\beta_j,\ j = 0, 1, \ldots, d$ is by minimizing the sum of squared errors, $\min \sum_{i=1}^{n} (e^i)^2$, which can be equivalently written as the following optimization problem:

$$\min_{\beta_0, \beta_1, \ldots, \beta_d} \ \sum_{i=1}^{n} \Big(y^i - \big(\beta_0 + \sum_{j=1}^{d} \beta_j x_j^i\big)\Big)^2.$$

Note that in the high dimensional case as well, the optimization problem is called the ordinary least squares (OLS) problem, and the term $\sum_{i=1}^{n} \big(y^i - (\beta_0 + \sum_{j=1}^{d} \beta_j x_j^i)\big)^2$ is called the OLS objective function. We will denote the OLS objective by

$$L(\beta_0, \beta_1, \ldots, \beta_d) = \sum_{i=1}^{n} \Big(y^i - \big(\beta_0 + \sum_{j=1}^{d} \beta_j x_j^i\big)\Big)^2.$$

Solving the OLS optimization problem:

Assuming the responses $y^1, y^2, \ldots, y^n$ are present in an $n \times 1$ matrix represented as:

$$\mathbf{y} = \begin{bmatrix} y^1 \\ y^2 \\ \vdots \\ y^n \end{bmatrix},$$

and the predictor variables $\mathbf{x}^1, \mathbf{x}^2, \ldots, \mathbf{x}^n$ are placed in an $n \times (d+1)$ matrix represented as:

$$X = \begin{bmatrix} x_1^1 & x_2^1 & \cdots & x_d^1 & 1 \\ x_1^2 & x_2^2 & \cdots & x_d^2 & 1 \\ \vdots & \vdots & & \vdots & \vdots \\ x_1^n & x_2^n & \cdots & x_d^n & 1 \end{bmatrix},$$

where note that the last column of matrix $X$ contains a column of all ones. This column is useful to incorporate the effect of the intercept parameter $\beta_0$ in the $X$ matrix.

Similarly, assume that we can write the coefficients $\beta_1, \beta_2, \ldots, \beta_d, \beta_0$ as a $(d+1) \times 1$ matrix represented as:

$$\boldsymbol{\beta} = \begin{bmatrix} \beta_1 \\ \vdots \\ \beta_d \\ \beta_0 \end{bmatrix}.$$

Now we can write the objective function as $L(\boldsymbol{\beta}) = \|\mathbf{y} - X\boldsymbol{\beta}\|_2^2$.

To solve $\min_{\boldsymbol{\beta}} L(\boldsymbol{\beta}) = \|\mathbf{y} - X\boldsymbol{\beta}\|_2^2$, we find the gradient with respect to $\boldsymbol{\beta}$ and equate it to zero. Thus we get:

$$\nabla_{\boldsymbol{\beta}} L(\boldsymbol{\beta}) = 0 \implies -X^T \mathbf{y} + X^T X \boldsymbol{\beta} = 0 \implies \boldsymbol{\beta} = (X^T X)^{-1} X^T \mathbf{y}.$$

Note that the closed form expression for $\boldsymbol{\beta}$ is valid only when $(X^T X)$ is invertible. Otherwise, we need to solve the system given by $X^T X \boldsymbol{\beta} = X^T \mathbf{y}$ using a solver.

Computing $\boldsymbol{\beta}$ for the airline costs data set

```python
n = len(airline_costs_data.index)  # number of data points
print("number of data points in the data set:", n)
cols = [2, 3, 4, 5, 6, 7, 8, 9]
X_data = airline_costs_data[airline_costs_data.columns[cols]]
X_data
```

[table output: the eight predictor columns of the airline costs data]
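As an aside, before applying the closed form to the airline data, it can be sanity-checked on small synthetic data. This is a minimal sketch (all names and values below are illustrative, not part of the notebook); it also shows `np.linalg.lstsq`, which solves the same least-squares problem without forming the inverse explicitly and therefore also handles the case where $X^T X$ is not invertible:

```python
import numpy as np

rng = np.random.default_rng(0)
n_syn, d_syn = 50, 3
X_feat = rng.normal(size=(n_syn, d_syn))
true_beta = np.array([2.0, -1.0, 0.5, 4.0])  # last entry plays the role of the intercept

# append a column of ones, matching the construction of the X matrix above
X_mat = np.hstack((X_feat, np.ones((n_syn, 1))))
y_vec = X_mat @ true_beta  # noise-free responses, so recovery should be (numerically) exact

# closed form: beta = (X^T X)^{-1} X^T y
beta_closed = np.linalg.inv(X_mat.T @ X_mat) @ (X_mat.T @ y_vec)

# preferred in practice: a least-squares solver
beta_lstsq, *_ = np.linalg.lstsq(X_mat, y_vec, rcond=None)

print(np.allclose(beta_closed, true_beta), np.allclose(beta_lstsq, true_beta))
```

Both approaches recover the generating coefficients here; on real, noisy data they minimize the same OLS objective, but the solver route is numerically safer.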
```python
import numpy as np

# convert predictor variable columns into a numpy array
X_array = airline_costs_data[airline_costs_data.columns[cols]].to_numpy()
X = np.hstack((X_array, np.ones((X_array.shape[0], 1), dtype=X_array.dtype)))

response_cols = [1]  # the TOC column
y = airline_costs_data[airline_costs_data.columns[response_cols]].to_numpy()
print('X shape:', X.shape, 'y shape:', y.shape)
```

```python
XtX = np.matmul(np.transpose(X), X)
print('XtX shape:', XtX.shape)
Xty = np.matmul(np.transpose(X), y)
print('Xty shape:', Xty.shape)
beta = np.matmul(np.linalg.inv(XtX), Xty)
print('beta:', beta)
```

[output: the estimated coefficient vector beta]

Residual vs Target Plot

Sometimes it would be useful to plot the error (i.e., the residual) $e^i$
versus the target values $y^i$.

```python
residuals_list = []
for i in range(n):
    x_i = X[i, :]  # access the i-th row of X
    y_i = y[i]     # access the i-th row of y
    y_pred_i = np.dot(x_i, beta)  # compute the prediction obtained using the regression coefficients
    e_i = y_i - y_pred_i
    residuals_list.append(e_i)  # append the value of e_i to the list

plt.scatter(airline_costs_data['TOC'], residuals_list)
plt.title('Residual plot')
plt.xlabel('TOC')
plt.ylabel('Residuals')
plt.grid()
plt.show()
```

[figure: residual plot — residuals versus TOC]

Note that the residual plot helps to check the variance in the errors $e^i$. From the residual plot, we observe an outlier in terms of the actual TOC value, since the TOC seems to be very high for that particular airline. There is another airline which has a slightly larger TOC value when compared to the other airlines. However, it is unclear whether that one can be considered an outlier.

In a residual vs target plot, the residuals are plotted against the target (or actual) values. Outliers in this plot can be identified as points that have large residuals compared to the other points. The residuals represent the difference between the predicted value (from the linear regression model) and the actual value. If a point has a large residual, it means that the model is not fitting that point well.

There are different methods to decide whether a point is an outlier or not, but a common approach is to use the median absolute deviation (MAD). The MAD is the median of the absolute deviations of the residuals from the median residual. Outliers can then be defined as points whose residuals are more than a certain number of MADs away from the median residual. This number is often taken as 3, but it can be adjusted based on the specifics of the data and the application.

In summary, in a residual vs target plot, outliers can be identified as points with large residuals compared to the other points, and this can be made precise with a rule such as the MAD-based threshold described above.
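The MAD rule described above can be sketched as follows (a minimal illustration with the threshold of 3 MADs mentioned in the text; the residual values below are synthetic, not taken from the airline data):

```python
import numpy as np

def mad_outliers(residuals, n_mads=3.0):
    """Flag residuals more than n_mads MADs away from the median residual."""
    residuals = np.asarray(residuals, dtype=float)
    med = np.median(residuals)                    # median residual
    mad = np.median(np.abs(residuals - med))      # median absolute deviation
    return np.abs(residuals - med) > n_mads * mad

# synthetic residuals: five small values plus one obvious outlier
res = np.array([0.1, -0.2, 0.05, 0.3, -0.1, 5.0])
flags = mad_outliers(res)
print(flags)
```

Applying the same function to `residuals_list` from the residual-plot cell would flag the high-TOC airline discussed above, while the borderline airline would be kept or flagged depending on the chosen number of MADs.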
