You are on page 1of 16
‘191723, 348 PM 14..EDA_Assignment_Day_14 Doneipynb -Colaboratory Exploratory Data Analysis Problem Statement: We have used Cars dataset from kagoe wit features including make, model year, engine, and other properties ofthe car used to predict its price Importing the necessary libraries {ngort watplotiib.pyplot as plt Avisualts fngort seaborn ae ste avisualisation Suratplot ib inline warnings. 1 terwarnings("Sgnore") Load the dataset into dataframe oF = pdareaderu("Veontent/Cars_eota.c3¥") ée.neaso cngine vanber ake ode Year "Yael EINE Engine Transmission oriven heels of market categot Dee 7 a ears © BMW Sees 2011. Unleaded 3950 «80 MANUAL rearwheslarve 20 TunerLumuyhig " Perarmane 2 BNW gure! 2011 traded 3000 «6.0 “MANUAL rearwhesleme — 20==«(Sumaitig onset) 3 BNW gay 20M enced 250080 MANUAL 20 us Patan (resoree) Now we observe the each features present Inthe dataset ake: The Make feature is the company name ofthe Car. Yodel: The Model feature isthe model or diffrent version of Car models ear: The year describes the model has been launched Engine Fuel Type: It defines the Fueltype ofthe car model Engine +: Its say the Horsepower that refers tothe power an engine produces. Fngie Cylinders: It define the nos of eylindersin present inthe engine Intps:ifeolab rosearch google. comltriva/ x8SszpUp7KEp33azDoRUKW\'SqIXEn6w_HprntMode=tue 16 ‘191723, 348 PM Driven, trees arket category: This features Its say about the about car size, The feature i all about the style that belongs tocar hagimay Hea: The average a car will get while driving on an open etretch of road without stepping or starting. ypealy ata higher speed, vehicle style: 14..EDA_Assignment_Day_14 Doneipynb -Colaboratory Inis the type of feature that describe about the car transmission type Le Manual or automatic The lype of wheel drive. It defined nos of éoars present inthe ca, about type of car or which category the car belongs. city apg: City MPG refers to driving with occasional stopping ane braking Ian refered to rating ofthat car or popularity of ca. sho The pice ofthat ca, Popularity, Check thi \e datatypes eF.antoo anger oF sen0ni0. ne ware fine Cylinders 1188 noncrall =) m8 1944 non-null #leatea(s), imtea(S), obsect(@) sunt) Fuel Type Sylsoaers B8ucee Dropping irrevalent columns 11 we consider all columns present inthe dataset then unneccessary columns wilimpact onthe mode's accuracy Not all he columns are important tous inthe given daterame, and hence we would drop the columns that are ievalento us. k would reflect four models accucary so we need to drop them. Otherwise it wll affect our model ‘Thelistcols_to.drop contains the names ofthe cols that are ievalent, drop all these cals from the dataframe, Intps:ifeolab rosearch google. comltriva/ x8SszpUp7KEp33azDoRUKW\'SqIXEn6w_HprntMode=tue 216 ‘191723, 348 PM 14..EDA_Assignment_Day_14 Done ipynb - Colaboratory col trop = [Engine Fuel Type", "harkat Category", "Venicle style", “Popularity, “Munber of Doors", Vehicle size") These features are not neccessary to obtaln the models accucary. It does not contan any relevant information inthe dataset cols te_dvop = ["Engine Fuel Type", “Market Category", "Vehicle Style", "Popularity", "Wunber of Doors", SF = ef deop(cois 20 crop, ae efsnesde) ake pode Year FEIN Engine TONSRLSSION osssen ypeens MEMMBY earn *5°5 20r1 5350 60 MANUAL rwarwacldive 25, 4 BNW 1Sees 2011 S000 60 MANUAL ~ 2 BNW 1Sere8 2001 3000 60 MANUAL ~ Renaming the columns 19 2» "Now; ts time for renaming the feature to useful feature name. It wll help to use them In model raining purpose. We have aeady dropped the unneccesary columns, and now we are let with useful columns. One extra thing that we would do iste rename the columns such thatthe name clearly represents the essence ofthe column, The given det represents (In key value pat the previous name, and the new name forthe dataframe columns rorename(coluans = (Rake :‘Conpany"), inplace = Teve) tforenana(columns = CASA": Price"), inplace = Thus) efcrenane(columns = {Engine HP°:"H?"), inplace = True) “« 7 1 eM geal auth 9000 CMAMUAL arwhenive 28 2 eww gel aot 90 CMAWAL raruhasave 28 3 aww 1 con 2300 60 MANUAL roar wheel drive 28 11908 pera 20K Zot2 3000 O.—AUTOMATIC. aetna 23, foto Aewa 20K 2012 300069 “AUTOMATC. atwhocidne 28 1 ose 9 pandas fu hare |ntps:ifeolab rosearch google. comlriva/ x8SszpUp7KEp33azDoRUKW\'SqIXEn6w_HprntMode=tue 6 16 40120 esr ‘Vehicle size") ane ‘191723, 348 PM # Prins she nese erhead) ° BMW 2 aww Dropping the duplicate rows of the datsfrane 1 Series 1 Sores 1 Sores zon zon aon 260 000 000 ojtinders 60 60 60 14..EDA_Assignment_Day_14 Doneipynb -Colaboratory ‘ype MANUAL MANUAL rearhoel ve rearubeel ve rearboel ve % 0 ~ 0 2» There are many rows in the dataframe which are duplicate, and hence they are just repeating the information as thoy dont add any value to the dataframe. ht betterif we remove these rows For given data, we would Ike to see how many rows were duplicates For his, we will count the numberof rows, remove the dublicated rows, ‘and again count the number of rows oF = ef.drop duplicates) etsheas0) + aww (9525, 10) Dropping the null or missing values 1 Sores 1 Series 1 sees aon Engine 60 Tranentssion MANUAL ‘Missing values are usually represented inthe form of Nan or nullor None inthe dataset Finding whether we have null values Inthe data is by using the isnull function iehway = ay 40850 ‘There are may values which ae missing, in pandas dataframe these vlues ar reffered to as np nan, We went to deal wth these values bbeause we cart use nan values to train models, Ether we can remave them to apply some svvategy to replace them with ather values, To keep things simple we willbe dropping nan values ef-ssnull().sue() |ntps:ifeolab rosearch google. comlriva/ x8SszpUp7KEp33azDoRUKW\'SqIXEn6w_HprntMode=tue ane ‘191723, 348 PM company caty moe [As we can see that the HP and Cylinders have rl values of 69 and 20. As these nul values will mpact on models’ accuracy. So to avoid the Impact we wil drop the these values As these values are small camparing with datase that will ot impact any major affect on model accuracy 0 we ill drop the values. #1 = 4F.droona(ants-0) 641.ssn100).s460) Srananassion Tye highway mee ef ante) 3 iponay 906827 nancnull types: floatea(2y, inte), cbseer’e) eF.asnali(). suet) ronission Type highway 9G 14..EDA_Assignment_Day_14 Doneipynb -Colaboratory eck nunber of nan values dn each col agein |ntps:ifeolab rosearch google. comlriva/ x8SszpUp7KEp33azDoRUKW\'SqIXEn6w_HprntMode=tue 56 ‘191723, 348 PM ‘count 10827, 000000 in 20% Removing ‘Sometimes a dataset can contain extreme values that are outside the range of what expected and unlike the ther data. These are called cuties and often machine learning modeling and model skin general can be improved by understanclng and even removing these outer 2mr0.89ex70 1990000000 2007.000000, outliers plt-boxplot (det Price’ 1) 14..EDA_Assignment_Day_14 Doneipynb -Colaboratory HP Engine cytinders woazrooo009 10827 a90000 254 953062 se0teo4 55:000000 0.000000 +173.900000 +,000000 ce" column An dataset “Sratplotlab. inet Linea® at 0x7/20008ab¢39>), aps" (a “natplotisb. seasons" [erat ghway Wee ehty ape ~0827-090000 70827000000, zeavene 19227607 ‘2000000 7.000000 22,900000 16:000000, Hines. Line2D at @x7#2@01Redsde>], LénstplotiibrLines-Linezo at 07#208184b810>] (eauepioel c observation: Here as you see that we got some values near to 1S and 2.0. So these values are called outliers, Because there are away from the normal values, Now we have detect the outliers of the feature of Price. Similarly we will eecking of anothers features. pit.bosolo(s cH) sooiaesie> |, *O82TODe OF 42493286008 2,900000640 21972500404 |ntps:ifeolab rosearch google. comlriva/ x8SszpUp7KEp33azDoRUKW\'SqIXEn6w_HprntMode=tue ene ‘191723, 348 PM 14..EDA_Assignment_Day_14 Done ipynb - Colaboratory “Ghiplociib.lines-inead at ecifeatsataede)y ton [coaplotib nes Lined st Wo riOLalfo), ‘SiritSie ines tinet st xrrasnsoneon, yen: [atplotlie ines Linea at OC FS00L41 0) sedion'! [enatpot is ins, ina or fieeenees], Fitera': [cneprolib Les Linea) at ex 500130700} see's observation: Here boxplots show the proper dstrbution of of 25 percentile and 75 percentile ofthe feature of HP, ou print all the columne which are of int or Neat datatype ino. Hint: Use loc with condition yoes(inciuées[np- Ant, float) Year HP engine Cylinders highay #96 clty ape Price 0 20m 350 60 26 9 48195 4 20m 3000 60 219 406850 3 20 200 60 2 18 20450 4 2011 2000 60 2 18 346500 stort 2012 2000 60 2 16 50620 stos2 2019 2000 60 216 0020 Save the colunn nanes of the above output in variable list named ‘1° Tef'vear’, "18", engine Cylinders", highay PS", city wpa "Prtce"y en Intps:ifeolab rosearch google. comltriva/ x8SszpUp7KEp33azDoRUKW\'SqIXEn6w_HprntMode=tue m6 ‘191723, 348 PM 14..EDA_Assignment_Day_14 Done ipynb - Colaboratory We Engine Cylinders highaay APG city ape 2 201 a0 60 280 36260 2 am gy! 201 s000 £0 MANUAL osrwhosline 28a. 860 1900 pou 70x 012 900 60 AUTOMATIC. stvhesline 2318 48120 time pou 20x 2012 800 0 AUTOMATIC abwhoaline «288.8870 Outliers removal techniques - IQR Method Here comes cool Fact for you! 1ORis the frst quartile subtracted fom the third quartile; these quartiles can be clearly seen on a box plot onthe data, + Calculate IQR and give a suitable threshold to remove the outliers and save this new dataframeinto 2, Let us help you to decide threshold: Ollie in this case are defined ae the observations that are below (Q1 ~1.5x QR) or above (03 + 15x IQR) 18 define QL ane @3 QL, B= wpwpercentslecae{‘Price'}s[25 , 75) 1 # define 198 (interauantse range) Ta ga wa gas serge lw = Gh - 2.58708 for 4 sn af 'Pr8ce") etsclow or UP ) outliers. appena(s) ef{‘Price'] = (op. if x in Outliers else x for x in ef "Price’1] entert"eece']) GL, B = wpspercentslecae('H#"16[25 » 751) 48 cefine 198 (Jntorquantsle range) Te = ga we = gas .sera8 lowe Qh 2.58108 Intps:ifeolab rosearch google. comltriva/ x8SszpUp7KEp33azDoRUKW\'SqIXEn6w_HprntMode=tue ane ‘191723, 348 PM 14..EDA_Assignment_Oay_14 Done ipynb - Colaboratory for sn dF 18") ‘(Selon or UP ) Ovt1iers.append(i) SATHPD = [apsNeN A x Jn OutlSors else x for x 30 "11 tentare'We")) sns.boxplot(de.tP, ontente"h*) “oatplotlib.sk08._subplots.eesSubplot st 0741803976760 [nt © cetine 442 after renoving ovtliers 812 = af dropna(axisn8) efa.intet) colin en-Mll Count otype types! Floatea(ay, snes), ebseee(e) 4 Fane the shape of of & a¢2 print(at. shine) (20827, 10) (os, '20) 003_coledf.relect_atypes( snclude=[09.0bJect]) soiieal |ntps:ifeolab rosearch google. comlriva/ x8SszpUp7KEp33azDoRUKW\'SqIXEn6w_HprntMode=tue one ‘191723, 348 PM 2 aww ‘1908 Acura 1 Fine unique values and there counts in each column an df using value counts fan for 4 40 dfcoluans print ( print(af(]-value_counts()) 1 Sones MANUAL auTowanie 14..EDA_Assignment_Day_14 Doneipynb -Colaboratory roar wool sive at wheel ve ay Intps:ifeolab rosearch google. comltriva/ x8SszpUp7KEp33azDoRUKW\'SqIXEn6w_HprntMode=tue 1016 ‘191723, 348 PM 14..EDA_Assignment_Day_14 Done ipynb - Colaboratory Visualising Univariate Distributions ‘We willuse seaborn Ibrary to visualize eye catchy univariate plots 10 you know? you have just now already explored one unvarate pt, guess which one? Yeah te box plot Histogram & Density Pl: Histograms and density plots show the frequency of a numeric variable along the y-axis, and the value along the x-axis, The sns.distplot() ss function plots a density curve. Notice that ths is aesthetically better than vanilla Aploting distplot for variable 1? ple figure( Figsize-(24,8)) Ens.oistplor(af[ #8" ],colon = ple titleCofstplot of vasa ple-xdaver (9°) ple-ylaver (*coune*) ple-stont) Detect arate Bs observation: We plo the Histogram of feature HP with help of distplot in seaborn, In this graph we can see that there is max values neat at 200, simiary we have also the 2nd highest value near 400 ane so on, It represents the overall distribution of continuous data variables Since seaborn uses matplotib behind he scenes, the usual matplotib functions work well with sesborn, For example, you can use subplots to plot multiple univariate dstebutions + Hint use matpltio subplot function Levene’, “HP, “Engine cylincers", “highway Ho", “city moa’, “Pesce'] H plot alt the colunns present sn List 1 together using subplot of ainention (2,3) Intps:ifeolab rosearch google. comltriva/ x8SszpUp7KEp33azDoRUKW\'SqIXEn6w_HprntMode=tue ne 11917, 948 Pat 14..EDA_Assignment_Day_14 Done ipynb - Colaboratory ple. figure (#4gss200(35,10)) ‘ for {im range(@,ten(2)) H tenayy putesubplet(2,3,442) ns countplotef[1[3))) Tt suptitle( "subplot of list) pit-eieleQ(ip) pit-wlabel s1) pit ylabet ("count") plt-stont) sus. pateplet(ef(t)) subplot st tt rs shoe ; h sl : | hs. “ i ‘eget amore ee Bar Chart Plots Plot ahistagram depicting the make in X axis and number of car in y axis, pit. Figure(rigsize = (52,8)) na histpioe 4 Company") ple tight Layout) Fuse nlangest and then plot to get bar plot 2tke below output -ntps:feolab rosearch google. com/riva/ x5SszpUp7KEpS3azD0RUKW\'SqIXEn6w_HprintModo=tue 1216 ‘191723, 348 PM 14..EDA_Assignment_Oay_14 Done ipynb - Colaboratory observation: In this plot we can see that we have plot the bar plot with the cars model and nos. of cars count Plot ‘Acount plot canbe thought of asa histogram across a categorical, instead of quantitative, variable. Plot a countslt fora variable Transmission vertically with hue as Drive mode ple figure(Figsszee(35,5)) Sns.countplot(x = "Transmission Type’ ,catarce2) plt-grig(color = "k’, Linestyle = "=~", Linewidth = @.5) 1 srcxcroun STYLING plt-legend(loc = 4) pis-stont) awuannatpLotlib.Legend:No handles with Labels found to put in legend ecm nee Observation: In this count plot, We have plot the feature of Transmission with help of hue. We can see thatthe the nos of count and the transmission type and automated manuals plotted, Drive mode as been given with help of hue Visualising Bivariate Distributions Bivariate datributions are simply two univariate distributions plotted on x and y axes respectively. They help you cbserve the relationship between the two variables Scatter Plots ‘Scatterplots are used to find the correlation between two continuos variables, Using seatteplot nd the correlation between HP’ and Price’ column af the data s(#igsizes(10,6)) pit. gure Figeize = (8,8)) Ens. seatterplot(et( HP"), Af Price’) ple legens(loe = 4) ple-stont) |ntps:ifeolab rosearch google. comlriva/ x8SszpUp7KEp33azDoRUKW\'SqIXEn6w_HprntMode=tue 1316 ‘191723, 348 PM 14..EDA_Assignment_Oay_14 Done ipynb - Colaboratory anuan:natplotlib.legend:Ne handles observation: Itis a type of plot or mathematical diagram using Cartesian coordinates to dlsplay values for typically two variables fr a set of data ‘We have plot the scatter plot with x axis as HP and y axis as Price. ‘The datapoints between the festures shouldbe same either wise it give errors, Plotting Aggregated Values across Categories Bar Plots - Mean, Median and Count Plots ar plots are used to daplay aggregated values ofa variable rather than entire distributions. This ie especially useful when youhave alot of data whichis cificu to visualise in a single figure. For example, say you want to visualise and compare the Price across Cylinders. The sns.barplot() function can be used to do that, et. grovpby("acear_oroxinity”, a5_tndexeratse){ median house value’ ).222n() Ssns-barplot(‘oceah_proxinity’, “median touse value’, datacdf, ci-False) ers grovpby( ocean proxinity" i necian_ house value’ J.naan()-piot-bar() sns.barplot(de[ Price” J, "Engine Cylinders"]) PIetitLe( the Price across Cylinders") plt-alasel(“price") fx-uaves ron ata plt.ylaved ‘eyLinder’) ple ston) |ntps:ifeolab rosearch google. comlriva/ x8SszpUp7KEp33azDoRUKW\'SqIXEn6w_HprntMode=tue 146 ‘191723, 348 PM 14..EDA_Assignment_Day_14 Done ipynb - Colaboratory len) the rice across Cynder Tex(@, 0:5, °0 Observation: By default seabor plots the mean value across categories, though you can plot the count, median, sum et. ‘Also, barplot compures and shows the confidence interval of the mean as well TT When you want to visualise having a large number of categories, it is helpful to plot the categories across the y-axis. Let's now érill down i Transmission sub categories. ‘These plots looks beutiful isntit? In Data Analyst life such chars ae there unavoidable frend.) Multivariate Plots Heatmaps [Aheat map is @ two-dimensional presentation of information withthe help of colors. Heat maps ean help the user visualize simple or complex information Using heatmaps plot the correlation between the featutes present inthe dataset. tind she correlation of features of the A Set size of graphs (12,8) observation: [Aheatmap contains values representing various shades of the same colour foreach value tobe plotted. Usually the darker shades ofthe chart represent higher values than the lighter shade, Fora very different value a completely diferent colour can also be used ‘The above heatmap plot shows corelation between various variables inthe colored scale of 1 to 1 Intps:ifeolab rosearch google. comltriva/ x8SszpUp7KEp33azDoRUKW\'SqIXEn6w_HprntMode=tue 19116 11917, 948 Pat 14..EDA_Assignment_Day_14 Doneipynb -Colaboratory -ntps:feolab rosearch google. com/riva/ x5SszpUp7KEpS3azD0RUKW\'SqIXEn6w_HprintModo=tue 1616

You might also like