Agenda:

+ Introduce three tree methods
+ Decision Tree Example
+ Livecoding today

```python
!pip install pyspark
!pip install -U -q PyDrive
!apt install openjdk-8-jdk-headless -qq

import os
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
```

(Installation logs trimmed: pip downloads and builds the pyspark wheel and installs it together with its py4j dependency; apt installs openjdk-8-jdk-headless and registers the Java 8 tools via update-alternatives.)
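A quick sanity check that the install worked before going further; a minimal sketch (the exact version string depends on what pip resolved above):

```python
import pyspark
print(pyspark.__version__)  # confirms the package is importable and shows its version
```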
```python
from google.colab import drive
drive.mount('/content/drive')
```
```
Mounted at /content/drive
```

```python
%cd drive/MyDrive/Colab\ Notebooks
```
```
/content/drive/MyDrive/Colab Notebooks
```

```python
# import os
# cur_path = "/content/drive/My Drive/VandyCourses_design/DS5460_BigDataScaling_2021Spring/DS5460_BigDataScaling/Week
# os.chdir(cur_path)
%pwd
```
```
'/content/drive/My Drive/Colab Notebooks'
```

```python
# create a spark session
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('tree').getOrCreate()
```

Introduce three tree methods:

+ A single decision tree: https://spark.apache.org/docs/latest/ml-classification-regression.html#decision-tree-classifier
+ A random forest: https://spark.apache.org/docs/latest/ml-classification-regression.html#random-forest-classifier
+ A gradient-boosted tree classifier: https://spark.apache.org/docs/latest/ml-classification-regression.html#gradient-boosted-tree-classifier

```python
# Load training data
data = spark.read.csv(os.getcwd() + '/College.csv', inferSchema=True, header=True)
```

We will be using a college dataset to try to classify colleges as Private or Public based off these features:

+ Private: A factor with levels No and Yes indicating private or public university
+ Apps: Number of applications received
+ Accept: Number of applications accepted
+ Enroll: Number of new students enrolled
+ Top10perc: Pct. new students from top 10% of H.S. class
+ Top25perc: Pct. new students from top 25% of H.S. class
+ F.Undergrad: Number of fulltime undergraduates
+ P.Undergrad: Number of parttime undergraduates
+ Outstate: Out-of-state tuition
+ Room.Board: Room and board costs
+ Books: Estimated book costs
+ Personal: Estimated personal spending
+ PhD: Pct. of faculty with Ph.D.s
+ Terminal: Pct. of faculty with terminal degree
+ S.F.Ratio: Student/faculty ratio
+ perc.alumni: Pct. alumni who donate
+ Expend: Instructional expenditure per student
+ Grad.Rate: Graduation rate

```python
data.printSchema()
```
```
root
 |-- School: string (nullable = true)
 |-- Private: string (nullable = true)
 |-- Apps: integer (nullable = true)
 |-- Accept: integer (nullable = true)
 |-- Enroll: integer (nullable = true)
 |-- Top10perc: integer (nullable = true)
 |-- Top25perc: integer (nullable = true)
 |-- F_Undergrad: integer (nullable = true)
 |-- P_Undergrad: integer (nullable = true)
 |-- Outstate: integer (nullable = true)
 |-- Room_Board: integer (nullable = true)
 |-- Books: integer (nullable = true)
 |-- Personal: integer (nullable = true)
 |-- PhD: integer (nullable = true)
 |-- Terminal: integer (nullable = true)
 |-- S_F_Ratio: double (nullable = true)
 |-- perc_alumni: integer (nullable = true)
 |-- Expend: integer (nullable = true)
 |-- Grad_Rate: integer (nullable = true)
```

```python
data.head()
```
```
Row(School='Abilene Christian University', ...)
```
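Before formatting the data for MLlib, it can be worth a quick look at the class balance of the label we are about to predict; a small sketch using the `data` DataFrame loaded above:

```python
# Count private vs. public schools to see how balanced the label is.
data.groupBy('Private').count().show()
```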
+ Spark Formatting of Data

A few things we need to do before Spark can accept the data! It needs to be in the form of two columns: ("label", "features").

```python
# Import VectorAssembler and Vectors
from pyspark.ml.linalg import Vectors
from pyspark.ml.feature import VectorAssembler

data.columns
```
```
['School', 'Private', 'Apps', 'Accept', 'Enroll', 'Top10perc', 'Top25perc', 'F_Undergrad', 'P_Undergrad', 'Outstate', 'Room_Board', 'Books', 'Personal', 'PhD', 'Terminal', 'S_F_Ratio', 'perc_alumni', 'Expend', 'Grad_Rate']
```

```python
assembler = VectorAssembler(
    inputCols=['Apps', 'Accept', 'Enroll', 'Top10perc', 'Top25perc',
               'F_Undergrad', 'P_Undergrad', 'Outstate', 'Room_Board',
               'Books', 'Personal', 'PhD', 'Terminal', 'S_F_Ratio',
               'perc_alumni', 'Expend', 'Grad_Rate'],
    outputCol="features")

output = assembler.transform(data)
output.show(5)
```
```
+--------------------+-------+----+------+------+---------+---------+-----------+-----------+--------+----------+-----+...
|              School|Private|Apps|Accept|Enroll|Top10perc|Top25perc|F_Undergrad|P_Undergrad|Outstate|Room_Board|Books|...
+--------------------+-------+----+------+------+---------+---------+-----------+-----------+--------+----------+-----+...
|Abilene Christian...|    Yes|1660|  1232|   721|       23|       52|       2885|        537|    7440|      3300|  450|...
|  Adelphi University|    Yes|2186|  1924|   512|       16|       29|       2683|       1227|   12280|      6450|  750|...
|      Adrian College|    Yes|1428|  1097|   336|       22|       50|       1036|         99|   11250|      3750|  400|...
| Agnes Scott College|    Yes| 417|   349|   137|       60|       89|        510|         63|   12960|      5450|  450|...
|Alaska Pacific Uni...|   Yes| 193|   146|    55|       16|       44|        249|        869|    7560|      4120|  800|...
+--------------------+-------+----+------+------+---------+---------+-----------+-----------+--------+----------+-----+...
only showing top 5 rows
```

```python
# Deal with the Private column being "yes" or "no"
from pyspark.ml.feature import StringIndexer

indexer = StringIndexer(inputCol="Private", outputCol="PrivateIndex")
output_fixed = indexer.fit(output).transform(output)
final_data = output_fixed.select("features", "PrivateIndex")
# final_data.show(200)

train_data, test_data = final_data.randomSplit([0.7, 0.3])
```

The Classifiers

```python
from pyspark.ml.classification import DecisionTreeClassifier, GBTClassifier, RandomForestClassifier
from pyspark.ml import Pipeline
```

Create all three models:

```python
# Use mostly defaults to make this comparison "fair"
dtc = DecisionTreeClassifier(labelCol='PrivateIndex', featuresCol='features')
rfc = RandomForestClassifier(labelCol='PrivateIndex', featuresCol='features')
gbt = GBTClassifier(labelCol='PrivateIndex', featuresCol='features')
```

Train all three models:

```python
# Train the models (it's three models, so it might take some time)
dtc_model = dtc.fit(train_data)
rfc_model = rfc.fit(train_data)
gbt_model = gbt.fit(train_data)
```

+ Model Comparison

Let's compare each of these models!

```python
dtc_predictions = dtc_model.transform(test_data)
rfc_predictions = rfc_model.transform(test_data)
gbt_predictions = gbt_model.transform(test_data)

dtc_predictions.show(5)
```
```
+--------------------+------------+-------------+-----------+----------+
|            features|PrivateIndex|rawPrediction|probability|prediction|
+--------------------+------------+-------------+-----------+----------+
|[81.0,72.0,51.0,3...|         0.0|  [291.0,0.0]|  [1.0,0.0]|       0.0|
|[167.0,138.0,46.0...|         0.0|  [291.0,0.0]|  [1.0,0.0]|       0.0|
|[193.0,146.0,55.0...|         0.0|    [4.0,0.0]|  [1.0,0.0]|       0.0|
|[202.0,184.0,122....|         1.0|  [238.0,0.0]|  [1.0,0.0]|       0.0|
|[232.0,182.0,99.0...|         0.0|    [4.0,0.0]|  [1.0,0.0]|       0.0|
+--------------------+------------+-------------+-----------+----------+
only showing top 5 rows
```

Evaluation Metrics:

```python
from pyspark.ml.evaluation import MulticlassClassificationEvaluator

# Select (prediction, true label) and compute test error
acc_evaluator = MulticlassClassificationEvaluator(labelCol="PrivateIndex",
                                                  predictionCol="prediction",
                                                  metricName="accuracy")

dtc_acc = acc_evaluator.evaluate(dtc_predictions)
rfc_acc = acc_evaluator.evaluate(rfc_predictions)
gbt_acc = acc_evaluator.evaluate(gbt_predictions)

print("Here are the results!")
print('-'*80)
print('A single decision tree had an accuracy of: {0:2.2f}%'.format(dtc_acc*100))
print('-'*80)
print('A random forest ensemble had an accuracy of: {0:2.2f}%'.format(rfc_acc*100))
print('-'*80)
print('An ensemble using GBT had an accuracy of: {0:2.2f}%'.format(gbt_acc*100))
```
```
Here are the results!
--------------------------------------------------------------------------------
A single decision tree had an accuracy of: 91.98%
--------------------------------------------------------------------------------
A random forest ensemble had an accuracy of: 94.34%
--------------------------------------------------------------------------------
An ensemble using GBT had an accuracy of: 91.98%
```

Interesting!

Optional Assignment: play around with the parameters of each of these models. Can you squeeze some more accuracy out of them, or is the data the limiting factor? (One possible starting point is sketched below.)
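Here is a minimal sketch of one way to begin the assignment: a small cross-validated grid search over the random forest. The grid values are arbitrary illustrative choices, and the code reuses the `rfc` estimator and `acc_evaluator` defined above:

```python
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder

# Hypothetical grid: more/deeper trees are the obvious first knobs to turn.
param_grid = (ParamGridBuilder()
              .addGrid(rfc.numTrees, [20, 50, 100])
              .addGrid(rfc.maxDepth, [5, 10])
              .build())

cv = CrossValidator(estimator=rfc,
                    estimatorParamMaps=param_grid,
                    evaluator=acc_evaluator,   # accuracy, as defined above
                    numFolds=3)

cv_model = cv.fit(train_data)

# Score the best model found by the search on the held-out test set.
tuned_acc = acc_evaluator.evaluate(cv_model.transform(test_data))
print('Tuned random forest accuracy: {0:2.2f}%'.format(tuned_acc * 100))
```

Whether this beats the ~94% default random forest will tell you something about whether the data, rather than the parameters, is the limiting factor.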
+ Decision Tree Example

You've been hired by a dog food company to try to predict why some batches of their dog food are spoiling much quicker than intended! Unfortunately, this dog food company hasn't upgraded to the latest machinery, meaning that the amounts of the five preservative chemicals they are using can vary a lot. Which chemical has the strongest effect? The company first mixes up a batch of preservative that contains four different preservative chemicals (A, B, C, D) and then completes it with a "filler" chemical. The food scientists believe one of the A, B, C, or D preservatives is causing the problem, but they need your help to figure out which one!

Use a decision tree to find out which feature has the most predictive power, thus finding out which chemical causes the early spoiling! So create a model and then work out how you can decide which chemical is the problem. (A sketch of one approach appears after the code cells below.)

+ Pres_A: Percentage of preservative A in the mix
+ Pres_B: Percentage of preservative B in the mix
+ Pres_C: Percentage of preservative C in the mix
+ Pres_D: Percentage of preservative D in the mix
+ Spoiled: Label indicating whether or not the dog food batch was spoiled

```python
# Load training data
data = spark.read.csv(os.getcwd() + '/dog_food.csv', inferSchema=True, header=True)
data.printSchema()
```
```
root
 |-- A: integer (nullable = true)
 |-- B: integer (nullable = true)
 |-- C: double (nullable = true)
 |-- D: integer (nullable = true)
 |-- Spoiled: double (nullable = true)
```

```python
data.head()
```
```
Row(A=4, B=2, C=12.0, D=3, Spoiled=1.0)
```

```python
# Import VectorAssembler and Vectors
from pyspark.ml.linalg import Vectors
from pyspark.ml.feature import VectorAssembler

data.columns
```
```
['A', 'B', 'C', 'D', 'Spoiled']
```

```python
assembler = VectorAssembler(inputCols=['A', 'B', 'C', 'D'], outputCol="features")
output = assembler.transform(data)
```

```python
from pyspark.ml.classification import RandomForestClassifier, DecisionTreeClassifier

rfc = DecisionTreeClassifier(labelCol='Spoiled', featuresCol='features')
```

```python
output.printSchema()
```
```
root
 |-- A: integer (nullable = true)
 |-- B: integer (nullable = true)
 |-- C: double (nullable = true)
 |-- D: integer (nullable = true)
 |-- Spoiled: double (nullable = true)
 |-- features: vector (nullable = true)
```
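One way to finish the exercise: fit the tree and inspect `featureImportances`, whose indices follow the `inputCols` order ('A', 'B', 'C', 'D'). This sketch assumes the `rfc` decision tree and the assembled `output` DataFrame from the cells above:

```python
# For this question we care about which feature drives the prediction,
# so fit on the full assembled dataset rather than a train/test split.
dtc_model = rfc.fit(output)   # 'rfc' here is the DecisionTreeClassifier above

# A SparseVector of importances; index 0 -> A, 1 -> B, 2 -> C, 3 -> D.
# The chemical whose importance dominates is the likely culprit.
print(dtc_model.featureImportances)
```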

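One more idea for the model comparison earlier: accuracy depends on the default decision threshold, so area under the ROC curve is a useful second opinion. A minimal sketch, assuming the `*_predictions` DataFrames from the college example are still in scope:

```python
from pyspark.ml.evaluation import BinaryClassificationEvaluator

# Defaults to areaUnderROC, computed from the rawPrediction column.
auc_evaluator = BinaryClassificationEvaluator(labelCol='PrivateIndex')

for name, preds in [('Decision tree', dtc_predictions),
                    ('Random forest', rfc_predictions),
                    ('GBT', gbt_predictions)]:
    print('{0}: AUC = {1:.3f}'.format(name, auc_evaluator.evaluate(preds)))
```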