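Data Science Cheatsheet
Compiled by Maverick Lin (http://mavericklin.com)

Two paradigms of data research:
1. Hypothesis-Driven: Given a problem, what kind of data do we need to help solve it?
2. Data-Driven: Given some data, what interesting problems can be solved with it?
The heart of data science is asking questions. Always be curious about the world, starting with: what can we learn from this data?

Quantitative Data: Numerical, e.g. height, weight.
Categorical Data: Data that can be labeled or divided into groups, e.g. race, sex, hair color.
Big Data: Massive datasets, or data that contains greater variety, arriving in increasing volumes and with ever-higher velocity.
Most Common Data Formats: CSV, XML, SQL.
Two problems arise repeatedly in data science: classification (assigning something to a discrete set of possibilities, e.g. blood type A, B, AB, O) and regression (predicting a numerical value).

Probability theory provides a framework for reasoning about the likelihood of events.
Experiment: procedure that yields one of a possible set of outcomes, e.g. repeatedly tossing a die.
Sample Space S: set of possible outcomes of an experiment, e.g. if tossing a die, $S = \{1, 2, 3, 4, 5, 6\}$.
Event E: set of outcomes of an experiment.
Probability of an Outcome s, or P(s): number that satisfies two properties: $0 \le P(s) \le 1$ for each outcome s, and $\sum_{s \in S} P(s) = 1$.
Random Variable V: numerical function on the outcomes of a probability space.
Expected Value of Random Variable V: $E(V) = \sum_{s \in S} P(s)\, V(s)$
Independent Events: A and B are independent iff $P(A \cap B) = P(A)P(B)$.
Conditional Probability: $P(A|B) = P(A, B) / P(B)$
Bayes Theorem: $P(A|B) = P(B|A)P(A) / P(B)$
Joint Probability: $P(A, B) = P(B|A)P(A)$
Marginal Probability: $P(A)$
Probability Density Function (PDF): gives the probability that a random variable takes on the value x: $p_X(x) = P(X = x)$
Cumulative Density Function (CDF): gives the probability that a random variable is less than or equal to x: $F_X(x) = P(X \le x)$
Note: The PDF and the CDF of a given random variable contain exactly the same information.

Descriptive statistics provide a way of capturing a given data set or sample. There are two main types: centrality and variability measures.
Arithmetic Mean: Useful to characterize symmetric distributions without outliers. $\mu_X = \frac{1}{n} \sum_{i} x_i$
Geometric Mean: Useful for averaging ratios. Never greater than the arithmetic mean. $\sqrt[n]{a_1 a_2 \cdots a_n}$
Variability: Standard Deviation $\sigma = \sqrt{\sum_{i} (x_i - \bar{x})^2 / (N - 1)}$, Variance $V = \sigma^2$
Interpreting Variance: Variance is an inherent part of the universe. It is impossible to obtain the same results after repeated observations of the same event due to random noise/error. Variance can sometimes be explained away by attributing it to sampling or measurement errors. Other times, the variance is due to the random fluctuations of the universe.
Correlation coefficient r(X, Y): a statistic that measures the degree to which Y is a function of X and vice versa.
Pearson Coefficient: Measures the degree of the relationship between linearly related variables.

Data Cleaning
Data cleaning is the process of turning raw data into a clean and analyzable data set.
Errors vs. Artifacts:
1. Errors: information that is lost during acquisition and can never be recovered.
2. Artifacts: systematic problems that arise from the data cleaning process. These problems can be corrected, but we must first discover them.
Data Compatibility: Data compatibility problems arise when merging datasets. Make sure you are comparing "apples to apples" and not "apples to oranges". Main types of conversions/unifications:
units (metric vs. imperial)
numbers (decimals vs. integers)
names (John Smith vs. Smith, John)
time/dates (UNIX vs. UTC vs. GMT)
currency (currency type, inflation-adjusted, dividends)
Data Imputation: process of dealing with missing values.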
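The proper methods depend on the type of data we are working with. General methods include:
Heuristic-Based: make a reasonable guess based on knowledge of the underlying domain
Mean Value: fill in missing data with the mean
Nearest Neighbor: fill in missing data using similar data points
Miscellaneous: lowercasing, removing non-alphanumeric characters, unidecode, removing unknown characters.
Note: When cleaning data, always maintain both the raw data and the cleaned version(s). The raw data should be kept intact and preserved for future use.

Feature Engineering
Feature engineering is the process of using domain knowledge to create features or input variables that help machine learning algorithms perform better. As Andrew Ng puts it: "Coming up with features is difficult, time-consuming, requires expert knowledge. 'Applied machine learning' is basically feature engineering."
Continuous Data:
Binning: transforming numeric features into categorical (or binned) ones, e.g. values between 1-10 belong to A
Interactions: interactions between features
Statistical: log/power transform (helps turn skewed distributions more normal)
Row Statistics: number of NaNs, 0s, negative values
Dimensionality Reduction: using PCA, clustering, etc.
Discrete Data:
Encoding: since some ML algorithms cannot work on categorical data, we need to turn it into numerical data or vectors (e.g. [r, g, b] becomes [1, 2, 3])
Embeddings: if using words, convert words to vectors (word embeddings)

Statistical Analysis
Process of statistical reasoning: there is an underlying population of possible things we can potentially observe, and only a small subset of them are actually sampled (ideally at random). Probability theory describes what properties our sample should have given the properties of the population, but statistical inference allows us to deduce what the full population is like after analyzing the sample.
Inverse Transform Sampling: Sampling points from a given probability distribution is sometimes necessary to run simulations. The general technique is called inverse transform sampling or the Smirnov transform. First draw a random number p between 0 and 1. Compute the value x such that the CDF equals p: $F_X(x) = p$. Use x as the random value drawn from the distribution described by $F_X(x)$.
Monte Carlo Sampling: In higher dimensions, correctly sampling from a given distribution becomes more tricky. Monte Carlo methods typically define a domain of possible inputs, generate random inputs from a probability distribution over the domain, and analyze the results.

Binomial Distribution (Discrete): Assume X is distributed Bin(n, p). X is the number of "successes" achieved in n independent trials, where each success occurs with probability p and each failure occurs with probability q = 1 - p.
PDF: $P(X = x) = \binom{n}{x} p^x (1 - p)^{n - x}$, EV: $\mu = np$, Variance $= npq$
Normal/Gaussian Distribution (Continuous): Assume X is distributed $N(\mu, \sigma^2)$. It is a bell-shaped and symmetric distribution.
Implications: 68%-95%-99.7% rule. 68% of probability mass falls within $1\sigma$ of the mean, 95% within $2\sigma$, and 99.7% within $3\sigma$.
Poisson Distribution (Discrete): Assume X is distributed Pois($\lambda$). Poisson expresses the probability of a given number of events occurring in a fixed interval of time/space if these events occur independently and with a known constant rate $\lambda$.
PDF: $P(x) = e^{-\lambda} \lambda^x / x!$, EV: $\lambda$, Variance $= \lambda$
Power Law Distributions (Discrete)
PDF: $P(X = x) = c x^{-\alpha}$

Modeling is the process of incorporating information into a tool which can forecast and make predictions. Formally, we want to estimate a function f(X) such that $Y = f(X) + \epsilon$. Statistical learning is a set of approaches for estimating this f(X).
Why Estimate f(X)?
Prediction: once we have a good estimate $\hat{f}(X)$, we can use it to make predictions on new data. We treat $\hat{f}$ as a black box, since we only care about the accuracy of the predictions.
Inference: we want to understand the relationship between X and Y.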
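We can no longer treat $\hat{f}$ as a black box, since we want to understand how Y changes with respect to $X = (X_1, ..., X_p)$.
The error term $\epsilon$ is composed of the reducible and the irreducible error.
Reducible: error that can potentially be reduced by using the most appropriate statistical learning technique to estimate f. The goal is to minimize the reducible error.
Irreducible: error that cannot be reduced no matter how well we estimate f. It will always be an upper bound for the accuracy of our predictions.
Note: There will always be trade-offs between model flexibility (prediction) and model interpretability (inference). This is just another case of the bias-variance trade-off.

Occam's Razor: In modeling, if we are given two models that predict equally well, we should choose the simpler one.
Bias-Variance Trade-Off: Inherent part of predictive modeling, where models with lower bias will have higher variance and vice versa.
Bias: error from incorrect assumptions made to simplify the target function (high bias -> missing relevant relations, or underfitting)
Variance: error from sensitivity to fluctuations in the training data (high variance -> modeling noise, or overfitting)
No Free Lunch Theorem: No single machine learning algorithm is better than all the others on all problems. It is common to try multiple models and find one that works best for a particular problem.

Thinking Like Nate Silver
1. Think Probabilistically: Forecasts should be reported as probability distributions (including $\sigma$ along with the mean prediction $\mu$).
2. Incorporate New Information: Use live models, which continually update using new information.
3. Look For Consensus Forecast: Use multiple distinct sources of evidence. Some models operate this way, such as boosting and bagging, which use a large number of weak classifiers to produce a strong one.

We need to determine how good our model is. Evaluation metrics provide us with the tools to estimate errors.
Accuracy: ratio of correct predictions over total predictions. Misleading when class sizes are substantially different. $accuracy = \frac{TP + TN}{TP + TN + FN + FP}$
ROC curve (AUC): a perfect classifier has an area under the curve of 1, while random guessing gives 0.5.
Training Data: data used to fit your models.
Validation Data: data used to tune the parameters of a model.
Test Data: data used to evaluate how good your model is.
Non-Parametric: models that don't make any assumptions about f, which allows them to fit a wider range of shapes, but may lead to overfitting.
Leave-One-Out CV (LOOCV): split data into a training set and a validation set, where the validation set consists of 1 observation; repeat until every observation has been used for validation.
k-Fold CV: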
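randomly divide the data into k groups (folds) of approximately equal size. The first fold is used for validation and the rest for training; this is repeated k times and the k estimates are averaged.
Blackbox: models that make decisions, but we do not know what happens "under the hood", e.g. deep learning, neural networks
Descriptive: models that provide insight into why they make their decisions
First-Principle vs. Data-Driven
First-Principle: models based on a belief of how the system under investigation works; incorporates domain knowledge (ad hoc)
Data-Driven: models based on observed correlations between input and output variables
Absolute Error: $\Delta = y' - y$
Squared Error: $\Delta^2 = (y' - y)^2$
Mean-Squared Error: $MSE = \frac{1}{n} \sum_{i=1}^{n} (y'_i - y_i)^2$
Root Mean-Squared Error: $RMSD = \sqrt{MSE}$
Bootstrapping
Synthetic Data: create data similar to the real data

Linear Regression
Linear regression is a simple and useful tool for predicting a quantitative response. The relationship between input variables $X = (X_1, ..., X_p)$ and output variable Y takes the form:
$Y \approx \beta_0 + \beta_1 X_1 + ... + \beta_p X_p + \epsilon$
$\beta_0, ..., \beta_p$ are the unknown coefficients (parameters) which we are trying to determine. The best coefficients lead to the best "fit", found by minimizing the residual sum of squares (RSS), the sum of the squared differences between the actual and the predicted ith value: $RSS = \sum_{i=1}^{n} e_i^2$, where $e_i = y_i - \hat{y}_i$.
How to find the best fit?
Matrix Form: We can solve the closed-form equation for the coefficient vector w: $w = (X^T X)^{-1} X^T Y$.
Gradient Descent: First-order optimization algorithm. Starting at an arbitrary point, repeatedly take steps in the negative direction of the gradient. After several iterations, we will eventually converge to the minimum. In our case, the minimum corresponds to the coefficients with the minimum error. The learning rate $\alpha$ determines the size of the steps we take.
Gradient descent algorithm in two dimensions; repeat until convergence: $w_j^{t+1} := w_j^t - \alpha \frac{\partial}{\partial w_j} J(w_0, w_1)$ for $j = 0, 1$.
For non-convex functions there may be local minima. Instead, we should run the algorithm from different starting points and use the best local minimum we find as the solution.
Stochastic Gradient Descent: instead of taking a step after sampling the entire training set, we take a small batch of training data at random to determine our next step. Computationally more efficient and may lead to faster convergence.

Improving Linear Regression
Shrinkage/Regularization: all variables are used, but the estimated coefficients are shrunk towards zero. The tuning parameter $\lambda$ controls the amount of shrinkage. Ridge will always produce a model with p variables, while Lasso can force coefficients to be exactly zero.
Lasso: minimize $RSS + \lambda \sum_{j=1}^{p} |\beta_j|$
Ridge: minimize $RSS + \lambda \sum_{j=1}^{p} \beta_j^2$
Dimension Reduction: projecting p predictors into an M-dimensional subspace, where M < p.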
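The gradient descent procedure for fitting a line can be illustrated with a short sketch. It assumes a single predictor, mean-squared error as the cost J, and a fixed learning rate; the function name `fit_line` and the default values are illustrative choices, not something the cheatsheet prescribes.

```python
def fit_line(xs, ys, lr=0.05, steps=5000):
    """Fit y ~ w0 + w1 * x by batch gradient descent on the mean-squared error."""
    w0, w1 = 0.0, 0.0          # arbitrary starting point
    n = len(xs)
    for _ in range(steps):
        # Prediction errors under the current coefficients.
        errors = [(w0 + w1 * x) - y for x, y in zip(xs, ys)]
        # Partial derivatives of the MSE with respect to w0 and w1.
        grad_w0 = 2.0 / n * sum(errors)
        grad_w1 = 2.0 / n * sum(e * x for e, x in zip(errors, xs))
        # Step in the negative gradient direction, scaled by the learning rate.
        w0 -= lr * grad_w0
        w1 -= lr * grad_w1
    return w0, w1


# Toy data generated from y = 1 + 2x (no noise).
w0, w1 = fit_line([0, 1, 2, 3, 4], [1, 3, 5, 7, 9])
```

Too large a learning rate makes the updates diverge on this data, while a much smaller one only slows convergence, which is the trade-off the learning rate controls.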
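Dimension reduction is done by computing M different linear combinations or projections of the variables.
Residual Standard Error (RSE): $RSE = \sqrt{RSS / (n - 2)}$
$R^2$: Measure of fit that represents the proportion of variance explained, or the variability in Y that can be explained using X. It takes on a value between 0 and 1. $R^2 = 1 - RSS / TSS$, where Total Sum of Squares $TSS = \sum_{i} (y_i - \bar{y})^2$.
Evaluating Coefficient Estimates: The Standard Error (SE) of the coefficients can be used to perform hypothesis tests on the coefficients, i.e. whether there is a relationship between the predictor (X) and the response (Y).

Logistic regression is used for classification, where the response variable is categorical rather than numerical. The model predicts a probability (between 0 and 1) through the logistic function $p(X) = \frac{e^{\beta_0 + \beta_1 X}}{1 + e^{\beta_0 + \beta_1 X}}$. If the probability is above a certain predetermined threshold (e.g. P(Yes) > 0.5), then the model will predict Yes.
Maximum Likelihood: The coefficients are chosen to maximize the likelihood of the observed data: $L(\beta_0, \beta_1) = \prod_{i: y_i = 1} p(x_i) \prod_{i': y_{i'} = 0} (1 - p(x_{i'}))$
Imbalanced Classes: imbalance in the classes of the training data leads to poor classifiers. It can result in a lot of false positives and also leaves few training data. Solutions include forcing balanced data by removing observations from the larger class, replicating data from the smaller class, or weighting the training examples.
Multi-Class Classification: the more classes you try to predict, the harder it is for the classifier to be effective. Another approach, such as Linear Discriminant Analysis (LDA), may prove better.

Measuring Distances/Similarity Measure
Minkowski Distance Metric: $d_k(a, b) = \sqrt[k]{\sum_{i=1}^{d} |a_i - b_i|^k}$
The parameter k provides a way to trade off between the largest and the total dimensional difference. Selecting the right k can significantly impact the meaningfulness of your distance function. The most popular values are 1 and 2.
Manhattan (k=1): city block distance, or the sum of the absolute differences between two points
Euclidean (k=2): straight line distance
Weighted Minkowski: $d_k(a, b) = \sqrt[k]{\sum_{i=1}^{d} w_i |a_i - b_i|^k}$
Cosine Similarity: $\cos(a, b) = \frac{a \cdot b}{|a||b|}$, the similarity between 2 non-zero vectors, where $a \cdot b$ is the dot product. For vectors with non-negative components it lies between 0 and 1; higher values imply more similar vectors.
Kullback-Leibler Divergence: $KL(A||B) = \sum_{i=1}^{d} a_i \log_2 \frac{a_i}{b_i}$, a measure of the distance between probability distributions.

Distance functions allow us to identify the points closest to a given target, or the nearest neighbors (NN) to a given point. Given a positive integer k and a point $x_0$, the KNN classifier first identifies the k points in the training data most similar to $x_0$, then estimates the conditional probability of $x_0$ belonging to class j as the fraction of the k points whose values belong to j. The optimal value for k can be found using cross validation.
KNN Algorithm
1.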
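Compute distance D(a, b) from point b to all points
2. Select the k closest points and their labels
3. Output the class with the most frequent label among the k points
Comparing a query point against n training examples in d dimensions computes with a runtime of O(nd). Popular choices to speed up KNN include:
Voronoi Diagrams: partitioning the plane into regions based on distance to points in a specific subset of the plane
Locality Sensitive Hashing (LSH): abandons the idea of finding the exact nearest neighbors. Instead, batch up nearby points to quickly find the most appropriate bucket. LSH is defined by a hash function h(p) that takes a point/vector as input and produces a number/code as output, such that it is likely that h(a) = h(b) if a and b are close to each other, and h(a) != h(b) if they are far apart.

Clustering
Clustering is the problem of grouping points by similarity using distance metrics, which ideally reflect the similarities you are looking for.

K-Means Clustering
Simple and elegant algorithm to partition a dataset into K distinct, non-overlapping clusters.
1. Choose a K. Randomly assign a number between 1 and K to each observation.
2. Iterate until the cluster assignments stop changing:
(a) For each of the K clusters, compute the cluster centroid.
(b) Assign each observation to the cluster whose centroid is closest (where closest is defined using a distance metric).
Since the results of the algorithm depend on the initial random assignments, it is a good idea to repeat the algorithm from different random initializations to obtain the best overall results.

Hierarchical Clustering
Alternative clustering algorithm that does not require us to commit to a particular K. Another advantage is that it results in a nice visualization called a dendrogram.
1. Begin with n observations and treat each observation as its own cluster.
2. For i = n, n-1, ..., 2:
(a) Examine all pairwise inter-cluster dissimilarities among the i clusters and identify the pair of clusters that are least dissimilar (most similar). Fuse these two clusters. The dissimilarity between these two clusters indicates the height in the dendrogram where the fusion should be placed.
(b) Compute the new pairwise inter-cluster dissimilarities among the remaining clusters.
Linkage: Complete (max dissimilarity), Single (min), Average, Centroid (between centroids of clusters A and B)

Comparing ML Algorithms
Ease of Use: some models feature few parameters/decisions (linear regression/NN), while others require more decision making to optimize (SVMs)
Training Speed: models differ in how fast they fit the necessary parameters
Prediction Speed: models differ in how fast they make predictions

Naive Bayes
Problem: Suppose we need to classify vector $X = x_1 ... x_n$ into m classes, $C_1 ... C_m$. We need to compute the probability of each possible class given X, so we can assign X the label of the class with the highest probability. Using Bayes' Theorem:
$P(C_i|X) = \frac{P(X|C_i) P(C_i)}{P(X)}$
1. $P(C_i)$: the prior probability of belonging to class i
2. $P(X)$: normalizing constant, or probability of seeing the given input vector
3. $P(X|C_i)$: the conditional probability of seeing input vector X given that the class is $C_i$
The prediction model will formally look like $C(X) = \arg\max_{i} \frac{P(X|C_i) P(C_i)}{P(X)}$, where C(X) is the prediction returned for input X.

Decision Trees
Each node in the tree contains a simple feature comparison against some field ($x_i > 42$?). Also called classification and regression trees (CART).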
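The K-means steps described in this section can be sketched in plain Python. This is a minimal illustration, assuming squared Euclidean distance and points given as tuples of numbers; the names `kmeans` and `sq_dist`, and the choice to re-seed an empty cluster with a random observation, are assumptions of the sketch rather than part of the cheatsheet.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two points of equal dimension."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, seed=0, max_iter=100):
    """Return one cluster label (0..k-1) per point."""
    rng = random.Random(seed)
    # Step 1: randomly assign each observation to one of the k clusters.
    labels = [rng.randrange(k) for _ in points]
    for _ in range(max_iter):
        # Step 2(a): compute each centroid as the mean of its members.
        centroids = []
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if not members:
                # Assumption of this sketch: re-seed an empty cluster.
                members = [rng.choice(points)]
            centroids.append(tuple(sum(col) / len(members) for col in zip(*members)))
        # Step 2(b): reassign each observation to the closest centroid.
        new_labels = [
            min(range(k), key=lambda c: sq_dist(p, centroids[c])) for p in points
        ]
        if new_labels == labels:   # assignments stopped changing
            break
        labels = new_labels
    return labels


# Two well-separated groups of three points each.
data = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
labels = kmeans(data, k=2)
```

Because the outcome depends on the random start, a fuller version would run several seeds and keep the assignment with the lowest total within-cluster distance.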
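Support Vector Machines
Work by constructing a hyperplane that separates points between two classes. The hyperplane is determined using the maximal margin hyperplane, which is the hyperplane that is the maximum distance from the training observations.

Principal Component Analysis (PCA)
Principal components allow us to summarize a set of correlated variables with a smaller set of variables that collectively explain most of the variability in the original set. PCA is the process by which principal components are calculated, and the use of them to analyze and understand the data. PCA is used for dimensionality reduction, feature extraction, and data visualization. Scaling variables is also important while performing PCA.

Ensembles, Bagging, Random Forests, Boosting
Note: rarely do models use just one decision tree. Instead, many decision trees are combined using methods like ensembling, bagging, and boosting.
Bagging: build many decision trees on bootstrapped subsets of the training data, each tree trained on a distinct subsample.
Random Forests: builds on bagging by decorrelating the trees. We do everything the same as in bagging, but each time a split is considered, only a random sample of the p predictors is available as split candidates (typically $m \approx \sqrt{p}$).
Boosting: the main idea is to improve our model where it is not performing well. Tuning parameters include the interaction depth d (controls the interaction order of the model).

ML Terminology and Concepts
Train/Eval/Test: training is data used to optimize the model, evaluation is used to assess the model during training, test is used to provide the final result
Classification/Regression: regression is predicting a number (e.g. housing price), classification is predicting from a set of categories
Linear Regression: predicts an output by multiplying and summing input features with weights and biases
Logistic Regression: similar to linear regression but predicts a probability
Overfitting: model performs great on the input data but poorly on the test data (combat by dropout, early stopping, or reducing the number of nodes or layers)
Regularization: technique used to help prevent overfitting by adding weights to the loss function
Embeddings: mapping from discrete objects, such as words, to vectors of real numbers
Normalization: process of converting an actual range of values to a standard range of values
Independently and Identically Distributed (i.i.d.): data drawn from a distribution that doesn't change, where each value drawn does not depend on previously drawn values

What is Deep Learning?
Deep learning is a subset of machine learning. One popular DL technique is based on Neural Networks (NN). A NN with more than one hidden layer is "deep". Neural networks are useful as universal approximators: there exists a NN that can (approximately) do the job. We can increase the approximation (or complexity) by adding more hidden layers and neurons.

Popular Architectures
Feed-forward networks take inputs, multiply them with weights, and apply an activation function, producing values that represent "hidden layers".
CNN: convolutional NN, has a combination of convolutional, pooling, and dense layers
Transfer Learning: the idea is that highly trained existing models serve as a good starting point
RNN: recurrent NN, designed for handling a sequence of inputs. LSTMs are a fancy version of RNNs, popular for NLP
Wide and Deep: combines linear classifiers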
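with deep neural net classifiers

TensorFlow
TF is extremely popular/suitable for working with Neural Networks, since the way TF sets up the computational graph pretty much resembles a NN.
Tensors: In a graph, tensors are the edges and are multidimensional data arrays that flow through the graph. A tensor is the central unit of data in TF and consists of a set of primitive values shaped into an array of any number of dimensions. A tensor is characterized by its rank (number of dimensions), shape, and data type.
Placeholders: hold the place for a Tensor that will be fed at runtime.

Deep Learning Terminology and Concepts
Neuron: node in a NN that applies an activation function (nonlinear transformation) to a weighted sum of input values
Weights: edges in a NN; the goal of training is to determine the optimal weight for each feature. If weight = 0, the corresponding feature does not contribute
Neural Network: composed of neurons (simple building blocks that actually "learn"); contains activation functions that make it possible to predict non-linear outputs
Activation Functions: mathematical functions that introduce non-linearity to a network
Sigmoid Function: function that maps very negative numbers to a number very close to 0, very large numbers to a number close to 1, and 0 to 0.5
Gradient Descent/Backpropagation: fundamental loss optimizer algorithms
Optimizer: operation that changes the weights and biases to reduce loss
Weights/Biases: weights are values that the input features are multiplied by to predict an output value. Biases are the value of the output given a weight of 0
Converge: an algorithm that converges will eventually reach an optimal answer
Learning Rate: rate at which optimizers change weights and biases. A high learning rate generally trains faster but risks not converging, whereas a lower rate trains slower
Numerical Instability: issues with very large/small values due to the limits of floating point numbers in computers
Convolutional Layer: series of convolutional operations
Dropout: method for regularization in training NNs
Early Stopping: method for regularization that involves ending training early
Gradient Descent: technique to minimize loss by computing the gradients of loss with respect to the model's parameters, conditioned on training data

Data can no longer fit in memory on one machine (monolithic), so a new way of computing was devised.

Hadoop
Hadoop is an open-source distributed processing framework that manages data processing and storage for big data applications running in clustered systems. It is comprised of 3 main components:
Hadoop Distributed File System (HDFS): provides high-throughput access to application data by partitioning data across many machines
YARN: framework for job scheduling and cluster resource management (task coordination)
MapReduce: system to process large datasets on multiple machines
HDFS: A cluster is comprised of 1 master node, and the rest are worker/data nodes. Large files are broken up and distributed across multiple machines, and are also replicated across multiple machines to provide fault tolerance.
MapReduce: parallel programming paradigm.
YARN: Yet Another Resource Negotiator. Coordinates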
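tasks running on the cluster and assigns new nodes in case of failure. Comprised of 2 subcomponents: the resource manager and the node manager.

An entire ecosystem of tools has emerged around Hadoop, based on interacting with HDFS.
Hive: data warehouse software with SQL-like queries (HiveQL). Hive abstracts away the underlying MapReduce jobs.
Pig: high-level scripting language (Pig Latin) that enables writing complex data transformations. It pulls unstructured/incomplete data from sources, cleans it, and places it in a database/data warehouse. Pig performs ETL into the data warehouse, while Hive queries the data warehouse to perform analysis (GCP: DataFlow).
Spark: framework for writing fast, distributed programs for data processing and analysis. It supports SQL queries, streaming data, machine learning, and graph processing, and integrates well with Hadoop. Data is processed using Resilient Distributed Datasets (RDDs), which are immutable, lazily evaluated, and track lineage.
HBase: non-relational, NoSQL, column-oriented database management system that runs on top of HDFS.
Flink/Kafka: stream processing frameworks. Batch processing is for bounded datasets, with periodic updates and delayed processing. Stream processing is for unbounded datasets, with continuous updates and immediate processing. Streaming data can be grouped into windows using tumbling (non-overlapping time), sliding (overlapping time), or session (session gap) windows.
Beam: programming model to define and execute data processing pipelines, including ETL, batch, and stream (continuous) processing. After building the pipeline, it is executed by one of Beam's distributed processing back-ends (Apache Apex, Apache Flink, Apache Spark, and Google Cloud Dataflow). Modeled as a Directed Acyclic Graph (DAG).
Oozie: workflow scheduler system to manage Hadoop jobs.
Sqoop: transferring framework to transfer large amounts of data into HDFS from relational databases (MySQL).

Graph theory is the study of graphs, which are structures used to model relationships between objects. For example, we can model friendships, computer networks, social networks, and transportation systems all as graphs.
Graphs are made up of nodes (vertices) and edges; an edge connects two nodes together and represents the relationship between the nodes it connects. Formally, a graph G = (V, E) consists of a set V of nodes and a set E of edges.
Nodes represent entities (friends, etc.). Edges represent connections and can have a direction (directed) or not (undirected).
A weighted graph is a graph where the edges have weights:
Binary Weight: 0 or 1, simply tells us whether a link exists
Numeric Weight: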
expresses how strong the connection between a node and other nodes is
Normalized Weight: variant of numeric weight where all the outgoing edges of a node sum to 1
A graph can be represented by an adjacency matrix A, where $a_{ij} = 1$ if there is an edge between nodes i and j, and 0 otherwise. A weight matrix W expresses the edge weights between the nodes of a network. An adjacency list is an abbreviated representation of the adjacency matrix.
Route Optimization: model the transportation of a commodity from one place to another
Fraud Detection: identifying fraudulent transactions and uncovering fraudsters working together

Structured Query Language (SQL) is a declarative language used to access and manipulate data in databases. Usually the database is a Relational Database Management System (RDBMS), which stores data arranged in tables with columns and rows, where columns represent characteristics of the stored data and rows represent actual data entries.
Basic Queries
HAVING count(*) > 1
DISTINCT: return unique results
BETWEEN a AND b: limit the range; the values can be numbers, text, or dates
IN (a, b, c): check if the value is contained among the given values
Data Modification
INSERT INTO table1 (col1, col2) VALUES (val1, val2)
INSERT INTO table1 (col1, col2) SELECT col1, col2 FROM table2

Data structures are a way of storing and manipulating data. Combined with algorithms, data structures allow us to solve problems efficiently.
List: >>> a = [1, 2, 3.14, "hello"]
Tuple: >>> b = (1, 2, 3.14, "hello")
deque: double-ended queue, generalization of stacks and queues
Counter: dict subclass, where elements are stored as keys and their counts as values
>>> c = Counter('apple')
Counter({'p': 2, 'a': 1, 'l': 1, 'e': 1})
heapq: heap queue/priority queue algorithm; heaps are binary trees for which every parent node has a value less than or equal to any of its children (min-heap)
>>> heap = []
>>> heappush(heap, 2)

Data Science Design Manual (www.springer.com)
Introduction to Statistical Learning (www-bcf.usc.edu/~gareth/ISL/)
Probability Cheatsheet (www.wzchen.com/probability-cheats…)
Machine Learning Crash Course (developers.google.com/machine-learning/…)
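The Python containers mentioned above (Counter, deque, heapq) can be tried out with a short standard-library sketch. The variable names and sample values are illustrative only.

```python
from collections import Counter, deque
import heapq

# Counter: elements become keys, their counts become values.
letter_counts = Counter("apple")

# deque: cheap appends and pops at both ends, so it works as a stack or a queue.
queue = deque([1, 2, 3])
queue.appendleft(0)        # push on the left
queue.append(4)            # push on the right
first = queue.popleft()    # removes the leftmost item
last = queue.pop()         # removes the rightmost item

# heapq: a min-heap kept in a plain list; the smallest item is always heap[0].
heap = []
for value in [5, 2, 8, 1]:
    heapq.heappush(heap, value)
smallest = heapq.heappop(heap)
# Popping repeatedly yields the remaining items in ascending order.
ascending_rest = [heapq.heappop(heap) for _ in range(len(heap))]
```

A list would also work as a stack, but removing from the front of a list costs time proportional to its length, which is why deque is the usual choice for queues.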
