Agenda:

+ Introduce three tree methods
+ Decision Tree Example
+ Livecoding today

```python
!pip install pyspark
!pip install -U -q PyDrive
!apt install openjdk-8-jdk-headless -qq

import os
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
```

(Installation logs trimmed: pip downloads and builds the pyspark wheel and installs it together with its py4j dependency; apt installs openjdk-8-jdk-headless and registers the Java 8 tools via update-alternatives.)
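A quick sanity check that the install worked before going further; a minimal sketch (the exact version string depends on what pip resolved above):

```python
import pyspark
print(pyspark.__version__)  # confirms the package is importable and shows its version
```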
```python
from google.colab import drive
drive.mount('/content/drive')
```
```
Mounted at /content/drive
```

```python
%cd drive/MyDrive/Colab\ Notebooks
```
```
/content/drive/MyDrive/Colab Notebooks
```

```python
# import os
# cur_path = "/content/drive/My Drive/VandyCourses_design/DS5460_BigDataScaling_2021Spring/DS5460_BigDataScaling/Week
# os.chdir(cur_path)
%pwd
```
```
'/content/drive/My Drive/Colab Notebooks'
```

```python
# create a spark session
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('tree').getOrCreate()
```

Introduce three tree methods:

+ A single decision tree: https://spark.apache.org/docs/latest/ml-classification-regression.html#decision-tree-classifier
+ A random forest: https://spark.apache.org/docs/latest/ml-classification-regression.html#random-forest-classifier
+ A gradient-boosted tree classifier: https://spark.apache.org/docs/latest/ml-classification-regression.html#gradient-boosted-tree-classifier

```python
# Load training data
data = spark.read.csv(os.getcwd() + '/College.csv', inferSchema=True, header=True)
```

We will be using a college dataset to try to classify colleges as Private or Public based off these features:

+ Private: A factor with levels No and Yes indicating private or public university
+ Apps: Number of applications received
+ Accept: Number of applications accepted
+ Enroll: Number of new students enrolled
+ Top10perc: Pct. new students from top 10% of H.S. class
+ Top25perc: Pct. new students from top 25% of H.S. class
+ F.Undergrad: Number of fulltime undergraduates
+ P.Undergrad: Number of parttime undergraduates
+ Outstate: Out-of-state tuition
+ Room.Board: Room and board costs
+ Books: Estimated book costs
+ Personal: Estimated personal spending
+ PhD: Pct. of faculty with Ph.D.s
+ Terminal: Pct. of faculty with terminal degree
+ S.F.Ratio: Student/faculty ratio
+ perc.alumni: Pct. alumni who donate
+ Expend: Instructional expenditure per student
+ Grad.Rate: Graduation rate

```python
data.printSchema()
```
```
root
 |-- School: string (nullable = true)
 |-- Private: string (nullable = true)
 |-- Apps: integer (nullable = true)
 |-- Accept: integer (nullable = true)
 |-- Enroll: integer (nullable = true)
 |-- Top10perc: integer (nullable = true)
 |-- Top25perc: integer (nullable = true)
 |-- F_Undergrad: integer (nullable = true)
 |-- P_Undergrad: integer (nullable = true)
 |-- Outstate: integer (nullable = true)
 |-- Room_Board: integer (nullable = true)
 |-- Books: integer (nullable = true)
 |-- Personal: integer (nullable = true)
 |-- PhD: integer (nullable = true)
 |-- Terminal: integer (nullable = true)
 |-- S_F_Ratio: double (nullable = true)
 |-- perc_alumni: integer (nullable = true)
 |-- Expend: integer (nullable = true)
 |-- Grad_Rate: integer (nullable = true)
```

```python
data.head()
```
```
Row(School='Abilene Christian University', ...)
```
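Before formatting the data for MLlib, it can be worth a quick look at the class balance of the label we are about to predict; a small sketch using the `data` DataFrame loaded above:

```python
# Count private vs. public schools to see how balanced the label is.
data.groupBy('Private').count().show()
```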
+ Spark Formatting of Data

A few things we need to do before Spark can accept the data! It needs to be in the form of two columns: ("label", "features").

```python
# Import VectorAssembler and Vectors
from pyspark.ml.linalg import Vectors
from pyspark.ml.feature import VectorAssembler

data.columns
```
```
['School', 'Private', 'Apps', 'Accept', 'Enroll', 'Top10perc', 'Top25perc', 'F_Undergrad', 'P_Undergrad', 'Outstate', 'Room_Board', 'Books', 'Personal', 'PhD', 'Terminal', 'S_F_Ratio', 'perc_alumni', 'Expend', 'Grad_Rate']
```

```python
assembler = VectorAssembler(
    inputCols=['Apps', 'Accept', 'Enroll', 'Top10perc', 'Top25perc',
               'F_Undergrad', 'P_Undergrad', 'Outstate', 'Room_Board',
               'Books', 'Personal', 'PhD', 'Terminal', 'S_F_Ratio',
               'perc_alumni', 'Expend', 'Grad_Rate'],
    outputCol="features")

output = assembler.transform(data)
output.show(5)
```
```
+--------------------+-------+----+------+------+---------+---------+-----------+-----------+--------+----------+-----+...
|              School|Private|Apps|Accept|Enroll|Top10perc|Top25perc|F_Undergrad|P_Undergrad|Outstate|Room_Board|Books|...
+--------------------+-------+----+------+------+---------+---------+-----------+-----------+--------+----------+-----+...
|Abilene Christian...|    Yes|1660|  1232|   721|       23|       52|       2885|        537|    7440|      3300|  450|...
|  Adelphi University|    Yes|2186|  1924|   512|       16|       29|       2683|       1227|   12280|      6450|  750|...
|      Adrian College|    Yes|1428|  1097|   336|       22|       50|       1036|         99|   11250|      3750|  400|...
| Agnes Scott College|    Yes| 417|   349|   137|       60|       89|        510|         63|   12960|      5450|  450|...
|Alaska Pacific Uni...|   Yes| 193|   146|    55|       16|       44|        249|        869|    7560|      4120|  800|...
+--------------------+-------+----+------+------+---------+---------+-----------+-----------+--------+----------+-----+...
only showing top 5 rows
```

```python
# Deal with the Private column being "yes" or "no"
from pyspark.ml.feature import StringIndexer

indexer = StringIndexer(inputCol="Private", outputCol="PrivateIndex")
output_fixed = indexer.fit(output).transform(output)
final_data = output_fixed.select("features", "PrivateIndex")
# final_data.show(200)

train_data, test_data = final_data.randomSplit([0.7, 0.3])
```

The Classifiers

```python
from pyspark.ml.classification import DecisionTreeClassifier, GBTClassifier, RandomForestClassifier
from pyspark.ml import Pipeline
```

Create all three models:

```python
# Use mostly defaults to make this comparison "fair"
dtc = DecisionTreeClassifier(labelCol='PrivateIndex', featuresCol='features')
rfc = RandomForestClassifier(labelCol='PrivateIndex', featuresCol='features')
gbt = GBTClassifier(labelCol='PrivateIndex', featuresCol='features')
```

Train all three models:

```python
# Train the models (it's three models, so it might take some time)
dtc_model = dtc.fit(train_data)
rfc_model = rfc.fit(train_data)
gbt_model = gbt.fit(train_data)
```

+ Model Comparison

Let's compare each of these models!

```python
dtc_predictions = dtc_model.transform(test_data)
rfc_predictions = rfc_model.transform(test_data)
gbt_predictions = gbt_model.transform(test_data)

dtc_predictions.show(5)
```
```
+--------------------+------------+-------------+-----------+----------+
|            features|PrivateIndex|rawPrediction|probability|prediction|
+--------------------+------------+-------------+-----------+----------+
|[81.0,72.0,51.0,3...|         0.0|  [291.0,0.0]|  [1.0,0.0]|       0.0|
|[167.0,138.0,46.0...|         0.0|  [291.0,0.0]|  [1.0,0.0]|       0.0|
|[193.0,146.0,55.0...|         0.0|    [4.0,0.0]|  [1.0,0.0]|       0.0|
|[202.0,184.0,122....|         1.0|  [238.0,0.0]|  [1.0,0.0]|       0.0|
|[232.0,182.0,99.0...|         0.0|    [4.0,0.0]|  [1.0,0.0]|       0.0|
+--------------------+------------+-------------+-----------+----------+
only showing top 5 rows
```

Evaluation Metrics:

```python
from pyspark.ml.evaluation import MulticlassClassificationEvaluator

# Select (prediction, true label) and compute test error
acc_evaluator = MulticlassClassificationEvaluator(labelCol="PrivateIndex",
                                                  predictionCol="prediction",
                                                  metricName="accuracy")

dtc_acc = acc_evaluator.evaluate(dtc_predictions)
rfc_acc = acc_evaluator.evaluate(rfc_predictions)
gbt_acc = acc_evaluator.evaluate(gbt_predictions)

print("Here are the results!")
print('-'*80)
print('A single decision tree had an accuracy of: {0:2.2f}%'.format(dtc_acc*100))
print('-'*80)
print('A random forest ensemble had an accuracy of: {0:2.2f}%'.format(rfc_acc*100))
print('-'*80)
print('An ensemble using GBT had an accuracy of: {0:2.2f}%'.format(gbt_acc*100))
```
```
Here are the results!
--------------------------------------------------------------------------------
A single decision tree had an accuracy of: 91.98%
--------------------------------------------------------------------------------
A random forest ensemble had an accuracy of: 94.34%
--------------------------------------------------------------------------------
An ensemble using GBT had an accuracy of: 91.98%
```

Interesting!

Optional Assignment: play around with the parameters of each of these models. Can you squeeze some more accuracy out of them, or is the data the limiting factor? (One possible starting point is sketched below.)
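Here is a minimal sketch of one way to begin the assignment: a small cross-validated grid search over the random forest. The grid values are arbitrary illustrative choices, and the code reuses the `rfc` estimator and `acc_evaluator` defined above:

```python
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder

# Hypothetical grid: more/deeper trees are the obvious first knobs to turn.
param_grid = (ParamGridBuilder()
              .addGrid(rfc.numTrees, [20, 50, 100])
              .addGrid(rfc.maxDepth, [5, 10])
              .build())

cv = CrossValidator(estimator=rfc,
                    estimatorParamMaps=param_grid,
                    evaluator=acc_evaluator,   # accuracy, as defined above
                    numFolds=3)

cv_model = cv.fit(train_data)

# Score the best model found by the search on the held-out test set.
tuned_acc = acc_evaluator.evaluate(cv_model.transform(test_data))
print('Tuned random forest accuracy: {0:2.2f}%'.format(tuned_acc * 100))
```

Whether this beats the ~94% default random forest will tell you something about whether the data, rather than the parameters, is the limiting factor.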
+ Decision Tree Example

You've been hired by a dog food company to try to predict why some batches of their dog food are spoiling much quicker than intended! Unfortunately, this dog food company hasn't upgraded to the latest machinery, meaning that the amounts of the five preservative chemicals they are using can vary a lot. Which chemical has the strongest effect? The company first mixes up a batch of preservative that contains four different preservative chemicals (A, B, C, D) and then completes it with a "filler" chemical. The food scientists believe one of the A, B, C, or D preservatives is causing the problem, but they need your help to figure out which one!

Use a decision tree to find out which feature has the most predictive power, thus finding out which chemical causes the early spoiling! So create a model and then work out how you can decide which chemical is the problem. (A sketch of one approach appears after the code cells below.)

+ Pres_A: Percentage of preservative A in the mix
+ Pres_B: Percentage of preservative B in the mix
+ Pres_C: Percentage of preservative C in the mix
+ Pres_D: Percentage of preservative D in the mix
+ Spoiled: Label indicating whether or not the dog food batch was spoiled

```python
# Load training data
data = spark.read.csv(os.getcwd() + '/dog_food.csv', inferSchema=True, header=True)
data.printSchema()
```
```
root
 |-- A: integer (nullable = true)
 |-- B: integer (nullable = true)
 |-- C: double (nullable = true)
 |-- D: integer (nullable = true)
 |-- Spoiled: double (nullable = true)
```

```python
data.head()
```
```
Row(A=4, B=2, C=12.0, D=3, Spoiled=1.0)
```

```python
# Import VectorAssembler and Vectors
from pyspark.ml.linalg import Vectors
from pyspark.ml.feature import VectorAssembler

data.columns
```
```
['A', 'B', 'C', 'D', 'Spoiled']
```

```python
assembler = VectorAssembler(inputCols=['A', 'B', 'C', 'D'], outputCol="features")
output = assembler.transform(data)
```

```python
from pyspark.ml.classification import RandomForestClassifier, DecisionTreeClassifier

rfc = DecisionTreeClassifier(labelCol='Spoiled', featuresCol='features')
```

```python
output.printSchema()
```
```
root
 |-- A: integer (nullable = true)
 |-- B: integer (nullable = true)
 |-- C: double (nullable = true)
 |-- D: integer (nullable = true)
 |-- Spoiled: double (nullable = true)
 |-- features: vector (nullable = true)
```
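One way to finish the exercise: fit the tree and inspect `featureImportances`, whose indices follow the `inputCols` order ('A', 'B', 'C', 'D'). This sketch assumes the `rfc` decision tree and the assembled `output` DataFrame from the cells above:

```python
# For this question we care about which feature drives the prediction,
# so fit on the full assembled dataset rather than a train/test split.
dtc_model = rfc.fit(output)   # 'rfc' here is the DecisionTreeClassifier above

# A SparseVector of importances; index 0 -> A, 1 -> B, 2 -> C, 3 -> D.
# The chemical whose importance dominates is the likely culprit.
print(dtc_model.featureImportances)
```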

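One more idea for the model comparison earlier: accuracy depends on the default decision threshold, so area under the ROC curve is a useful second opinion. A minimal sketch, assuming the `*_predictions` DataFrames from the college example are still in scope:

```python
from pyspark.ml.evaluation import BinaryClassificationEvaluator

# Defaults to areaUnderROC, computed from the rawPrediction column.
auc_evaluator = BinaryClassificationEvaluator(labelCol='PrivateIndex')

for name, preds in [('Decision tree', dtc_predictions),
                    ('Random forest', rfc_predictions),
                    ('GBT', gbt_predictions)]:
    print('{0}: AUC = {1:.3f}'.format(name, auc_evaluator.evaluate(preds)))
```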