ML_Project ML_Problem 1 - Jupyter Notebook

Problem 1: You are hired by one of the leading news channels, CNBE, which wants to analyse the recent elections. This survey was conducted on 1525 voters with 9 variables. You have to build a model to predict which party a voter will vote for on the basis of the given information, to create an exit poll that will help in predicting the overall win and seats covered by a particular party.

1.1 Read the dataset. Describe the data briefly. Interpret the inferences for each. Initial steps like head(), info(), data types, etc. Null value check, summary stats and skewness must be discussed.

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings('ignore')

from sklearn import metrics
from sklearn.model_selection import train_test_split, cross_validate, GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, roc_curve, classification_report, confusion_matrix

In [2]:
data_df = pd.read_csv("Election Data copy.csv")

In [3]:
data_df.head(10)

Out[3]: (first 10 rows, all Labour voters; columns: Unnamed: 0, vote, age, economic.cond.national, economic.cond.household, Blair, Hague, Europe, political.knowledge, gender)

In [4]:
data_df.tail(10)

Out[4]: (last 10 rows, indices 1515 to 1524, a mix of Labour and Conservative voters; same columns as above)

Data Description:

1. vote: Party choice: Conservative or Labour
2. age: in years
3. economic.cond.national: Assessment of current national economic conditions, 1 to 5
4. economic.cond.household: Assessment of current household economic conditions, 1 to 5
5. Blair: Assessment of the Labour leader, 1 to 5
6. Hague: Assessment of the Conservative leader, 1 to 5
7. Europe: an 11-point scale that measures respondents' attitudes toward European integration. High scores represent 'Eurosceptic' sentiment
8. political.knowledge: Knowledge of parties' positions on European integration, 0 to 3
9. gender: female or male
10. Unnamed: 0: serial number

In [5]:
data_df.info()

RangeIndex: 1525 entries, 0 to 1524
Data columns (total 10 columns):
 #  Column                   Non-Null Count  Dtype
 0  Unnamed: 0               1525 non-null   int64
 1  vote                     1525 non-null   object
 2  age                      1525 non-null   int64
 3  economic.cond.national   1525 non-null   int64
 4  economic.cond.household  1525 non-null   int64
 5  Blair                    1525 non-null   int64
 6  Hague                    1525 non-null   int64
 7  Europe                   1525 non-null   int64
 8  political.knowledge      1525 non-null   int64
 9  gender                   1525 non-null   object
dtypes: int64(8), object(2)
memory usage: 119.3+ KB

In [6]:
data_df.dtypes.value_counts()

Out[6]:
int64     8
object    2
dtype: int64

In [7]:
data_df.isnull().sum()

Out[7]:
Unnamed: 0                 0
vote                       0
age                        0
economic.cond.national     0
economic.cond.household    0
Blair                      0
Hague                      0
Europe                     0
political.knowledge        0
gender                     0
dtype: int64

Insights:
- The data consists of both categorical and numerical values.
- There are a total of 1525 rows representing voters, and 10 columns with 9 variables. Out of the 10 columns, 2 are of object type and 8 are of integer type.
- The data does not contain missing values.
- The first column ("Unnamed: 0") is only a serial number, so we can remove it.

In [8]:
data_df = data_df.drop('Unnamed: 0', axis=1)

In [9]:
data_df.head()

Out[9]: (first 5 rows, all Labour voters; columns: vote, age, economic.cond.national, economic.cond.household, Blair, Hague, Europe, political.knowledge, gender)

In [10]:
round(data_df.describe(include='all').T, 3)

Out[10]:
                         count unique     top  freq    mean     std  min  25%  50%  75%  max
vote                      1525      2  Labour  1063     NaN     NaN  NaN  NaN  NaN  NaN  NaN
age                       1525    NaN     NaN   NaN  54.182  15.711   24   41   53   67   93
economic.cond.national    1525    NaN     NaN   NaN   3.246   0.881    1    3    3    4    5
economic.cond.household   1525    NaN     NaN   NaN   3.140   0.930    1    3    3    4    5
Blair                     1525    NaN     NaN   NaN   3.334   1.175    1    2    4    4    5
Hague                     1525    NaN     NaN   NaN   2.747   1.231    1    2    2    4    5
Europe                    1525    NaN     NaN   NaN   6.729   3.298    1    4    6   10   11
political.knowledge       1525    NaN     NaN   NaN   1.542   1.084    0    0    2    2    3
gender                    1525      2  female   812     NaN     NaN  NaN  NaN  NaN  NaN  NaN

In [11]:
round(data_df.describe().T, 3)

Out[11]:
                          count    mean     std   min   25%   50%   75%   max
age                      1525.0  54.182  15.711  24.0  41.0  53.0  67.0  93.0
economic.cond.national   1525.0   3.246   0.881   1.0   3.0   3.0   4.0   5.0
economic.cond.household  1525.0   3.140   0.930   1.0   3.0   3.0   4.0   5.0
Blair                    1525.0   3.334   1.175   1.0   2.0   4.0   4.0   5.0
Hague                    1525.0   2.747   1.231   1.0   2.0   2.0   4.0   5.0
Europe                   1525.0   6.729   3.298   1.0   4.0   6.0  10.0  11.0
political.knowledge      1525.0   1.542   1.084   0.0   0.0   2.0   2.0   3.0

In [12]:
print('The number of rows of the dataframe is', data_df.shape[0], '.')
print('The number of columns of the dataframe is', data_df.shape[1], '.')

The number of rows of the dataframe is 1525 .
The number of columns of the dataframe is 9 .

Insights from Descriptive Statistics:
- After dropping "Unnamed: 0", the data now contains 1525 rows and 9 columns.
- Out[10] confirms the unique value counts for the variables "vote" and "gender".
- There are 2 voting parties, Labour and Conservative. From Out[10], the top party is the Labour Party.
- There are 2 genders voting, male and female, with female being the more numerous group of voters (Out[10]).
- The minimum age of a voter is 24 years and the maximum age is 93 years. The mean voting age is 54 years.
- The minimum assessment of current national economic conditions is 1 and the maximum is 5, with an average assessment of about 3.
- The minimum assessment of current household economic conditions is 1 and the maximum is 5, with an average assessment of about 3.
- The minimum assessment of the Labour leader Tony Blair is 1 and the maximum is 5, with a median assessment of 4.
- The minimum assessment of the Conservative leader William Hague is 1 and the maximum is 5, with a median assessment of 2.
- On the 11-point scale measuring respondents' attitudes toward European integration (minimum 1, maximum 11), the 75th percentile is 10, so a sizeable share of voters express strongly 'Eurosceptic' sentiment.
- On average, knowledge of parties' positions on European integration is about 2 (on a 0 to 3 scale). Approximately 25% of voters score 0, i.e. they hold no knowledge of the parties' positions, while the maximum score is 3.
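The section brief also asks for skewness to be discussed, and no skewness cell survives in the extracted notebook. A minimal, hedged sketch of how `DataFrame.skew()` surfaces asymmetry — the two columns below are synthetic stand-ins, not the election CSV:

```python
import pandas as pd

# Hypothetical stand-in columns (assumption: the election CSV is not bundled here),
# chosen so one is roughly symmetric and one is clearly right-skewed.
demo = pd.DataFrame({
    "age":    [24, 30, 35, 41, 53, 67, 70, 77, 93],
    "Europe": [1, 1, 1, 2, 2, 3, 4, 9, 11],
})

# skew() > 0 indicates a longer right tail, < 0 a longer left tail, ~0 symmetry.
skewness = demo.skew()
print(skewness.round(3))

# Common rule of thumb: |skew| < 0.5 roughly symmetric, 0.5-1 moderate, > 1 high.
highly_skewed = skewness[skewness.abs() > 1].index.tolist()
print("Highly skewed columns:", highly_skewed)
```

On the real data the same two lines applied to `data_df` would quantify the asymmetry already visible in the distribution plots.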
After dropping unnecessary attributes, the shape of our data has changed as mentioned below:
* The number of rows of the dataframe = 1525
* The number of columns of the dataframe = 9

In [13]:
# Check for duplicate values
dups = data_df.duplicated()
print('Number of duplicate rows = %d' % dups.sum())
data_df[dups]

Number of duplicate rows = 8

Out[13]: (the 8 duplicated rows, at indices 67, 626, 870, 983, 1154, 1208, 1244 and 1438 — six Labour voters and two Conservative voters)

Insights:
There are only 8 duplicated records in the data set. We can remove these records, as they are unlikely to add any additional value, and dropping them leaves only distinct records.

In [14]:
data_df.drop_duplicates(inplace=True)
print('Number of rows after removing duplicated records = %d' % data_df.shape[0], '.')
print('Number of columns after removing duplicated records = %d' % data_df.shape[1], '.')

Number of rows after removing duplicated records = 1517 .
Number of columns after removing duplicated records = 9 .

In [15]:
dups = data_df.duplicated()
print('Number of duplicate rows = %d' % dups.sum())
data_df[dups]

Number of duplicate rows = 0

Out[15]: (empty dataframe)

Insights:
- It is clear that there are no duplicated records left in the data set.
- Using the shape attribute, it is confirmed that the duplicated records have been dropped from the original data.
- Initially the data had 1525 records; it now contains 1517 records.
In [16]:
# Getting unique counts of all object columns
for column in data_df[['vote', 'gender']]:
    print(column.upper(), ': ', data_df[column].nunique())
    print(data_df[column].value_counts().sort_values())
    print('\n')

VOTE :  2
Conservative     460
Labour          1057
Name: vote, dtype: int64

GENDER :  2
male      709
female    808
Name: gender, dtype: int64

Insights:
- There are 2 voting parties, Labour and Conservative. The vote count for the Labour party is 1057 and for the Conservative party is 460 (which also confirms, as seen in Out[10], that the top party is Labour).
- There are 709 male voters and 808 female voters (which also confirms, as seen in Out[10], the higher count of female voters).

In [17]:
# Checking the object data-type columns
data_df[data_df.dtypes[(data_df.dtypes == 'object')].index].head()

Out[17]:
     vote  gender
0  Labour  female
1  Labour    male
2  Labour    male
3  Labour  female
4  Labour    male

In [18]:
data_df.info()

Int64Index: 1517 entries, 0 to 1524
Data columns (total 9 columns):
 #  Column                   Non-Null Count  Dtype
 0  vote                     1517 non-null   object
 1  age                      1517 non-null   int64
 2  economic.cond.national   1517 non-null   int64
 3  economic.cond.household  1517 non-null   int64
 4  Blair                    1517 non-null   int64
 5  Hague                    1517 non-null   int64
 6  Europe                   1517 non-null   int64
 7  political.knowledge      1517 non-null   int64
 8  gender                   1517 non-null   object
dtypes: int64(7), object(2)
memory usage: 118.5+ KB

In [19]:
continuous = data_df.dtypes[(data_df.dtypes == 'int64')].index
data_plot = data_df[continuous]
data_plot.boxplot(figsize=(20, 6))
plt.xlabel("Continuous Variables")
plt.ylabel("Density")
plt.title("Figure 1: Consolidated Boxplot of Continuous Data")

Out[19]: Text(0.5, 1.0, 'Figure 1: Consolidated Boxplot of Continuous Data')

Insights:
From Figure 1, the presence of outliers can be confirmed in the variables economic.cond.national and economic.cond.household.

1.2 Perform EDA (check the null values, data types, shape, univariate and bivariate analysis). Also check for outliers (4 pts). Interpret the inferences for each (3 pts). Distribution plots (histograms) or similar plots for the continuous columns. Box plots, correlation plots. Appropriate plots for categorical variables. Inferences on each plot. The outlier proportion should be discussed, and inferences from the above plots should be included. There is no restriction on how the learner wishes to implement this, but the code should produce the correct output, and the inferences should be logical and correct.
Univariate Analysis:

In [20]:
def univariateAnalysis_numeric(column, nbins):
    print("Description of " + column)
    print("-" * 40)
    print(data_df[column].describe(), end=' ')
    plt.figure()
    print("Distribution of " + column)
    print("-" * 40)
    sns.distplot(data_df[column], kde=True, color='b')
    plt.show()
    plt.figure()
    print("BoxPlot of " + column)
    print("-" * 40)
    ax = sns.boxplot(x=data_df[column], color='b')
    plt.show()

In [21]:
df_num = data_df.select_dtypes(include=['int64'])
df_cat = data_df.select_dtypes(include=['object'])
Categorical_column_list = list(df_cat.columns.values)
Numerical_column_list = list(df_num.columns.values)
Numerical_length = len(Numerical_column_list)
Categorical_length = len(Categorical_column_list)
print("Length of Numerical columns is :", Numerical_length)
print("Length of Categorical columns is :", Categorical_length)

Length of Numerical columns is : 7
Length of Categorical columns is : 2

In [22]:
df_cat.head()

Out[22]:
     vote  gender
0  Labour  female
1  Labour    male
2  Labour    male
3  Labour  female
4  Labour    male

In [23]:
df_num.head()

Out[23]: (first 5 rows of the 7 numerical columns: age, economic.cond.national, economic.cond.household, Blair, Hague, Europe, political.knowledge)

In [24]:
for x in Numerical_column_list:
    univariateAnalysis_numeric(x, 20)

(Output: Figures 2.1-2.7 — a description, distribution plot and boxplot for each numerical variable. For example, economic.cond.household: count 1517, mean 3.138, std 0.931, min 1, 25% 3, 50% 3, 75% 4.)

Insights from Univariate Analysis:
- For the variable "age": the minimum voting age is 24 years and the maximum is 93 years; the mean voting age is 54 years.
- For the variable "economic.cond.national": the minimum assessment of current national economic conditions is 1 and the maximum is 5, with an average assessment of about 3.
- For the variable "economic.cond.household": the minimum assessment of current household economic conditions is 1 and the maximum is 5, with an average assessment of about 3.
- For the variable "Blair": the minimum assessment of the Labour leader Tony Blair is 1 and the maximum is 5, with a median assessment of 4.
- For the variable "Hague": the minimum assessment of the Conservative leader William Hague is 1 and the maximum is 5, with a median assessment of 2.
- For the variable "Europe": on the 11-point scale (minimum 1, maximum 11) measuring respondents' attitudes toward European integration, the 75th percentile is 10, so a sizeable share of voters express strongly 'Eurosceptic' sentiment.
- On average, knowledge of parties' positions on European integration is about 2. Approximately 25% of voters score 0, i.e. they hold no knowledge of the parties' positions, while the maximum score is 3.
- The medians of the variables "Blair", "Hague", "economic.cond.national" and "economic.cond.household" coincide with the first quartile, which is why the median line overlaps a box edge in the boxplots (Figures 2.2-2.5). This could be because the data contains a large proportion of identical low values.
- We can also confirm the presence of outliers in the variables "economic.cond.national" and "economic.cond.household".
- Since the lower quartile and middle quartile values are the same (i.e. 0), the variable "political.knowledge" does not have a lower whisker.

Bivariate and Multivariate Analysis:

In [25]:
sns.pairplot(data_df)

(Output: Figure 3.1 — pairplot of all numerical variables.)

Since we need to predict which party a voter will vote for on the basis of the given information, we will do a bivariate analysis of the variable "vote" against the other variables, and also look at the pairwise relationships of the variables conditioned on "vote".
In [26]:
sns.pairplot(data_df, hue="vote")

(Output: Figure 3.2 — pairplot of the numerical variables, coloured by "vote".)

In [27]:
plt.figure(figsize=(6, 6))
sns.violinplot(data_df["vote"], data_df['economic.cond.national'], data=data_df, color="r")
plt.title("Figure 4.1: Violin Plot of vote and economic.cond.national")
plt.show()

(Output: Figure 4.1.)

In [28]:
plt.figure(figsize=(6, 6))
sns.violinplot(data_df["vote"], data_df['economic.cond.household'], data=data_df, color="r")
plt.title("Figure 4.2: Violin Plot of vote and economic.cond.household")
plt.show()

(Output: Figure 4.2.)

In [29]:
plt.figure(figsize=(6, 6))
sns.violinplot(data_df["vote"], data_df['Blair'], data=data_df, color="r")
plt.title("Figure 4.3: Violin Plot of vote and Blair")
plt.show()

(Output: Figure 4.3.)

In [30]:
plt.figure(figsize=(6, 6))
sns.violinplot(data_df["vote"], data_df['Hague'], data=data_df, color="r")
plt.title("Figure 4.4: Violin Plot of vote and Hague")
plt.show()

(Output: Figure 4.4.)

In [31]:
plt.figure(figsize=(6, 6))
sns.violinplot(data_df["vote"], data_df['Europe'], data=data_df, color="r")
plt.title("Figure 4.5: Violin Plot of vote and Europe")
plt.show()

(Output: Figure 4.5.)

In [32]:
plt.figure(figsize=(6, 6))
sns.violinplot(data_df["vote"], data_df['political.knowledge'], data=data_df, color="r")
plt.title("Figure 4.6: Violin Plot of vote and political.knowledge")
plt.show()

(Output: Figure 4.6.)

Heatmap of Continuous Variables:

In [33]:
data_df.columns

Out[33]:
Index(['vote', 'age', 'economic.cond.national', 'economic.cond.household',
       'Blair', 'Hague', 'Europe', 'political.knowledge', 'gender'],
      dtype='object')

In [34]:
plt.figure(figsize=(10, 5))
plt.title("Figure 5: Heatmap of Variables")
corr = data_df[['vote', 'age', 'economic.cond.national', 'economic.cond.household',
                'Blair', 'Hague', 'Europe', 'political.knowledge', 'gender']].corr()
sns.heatmap(corr, annot=True, cmap='Blues', cbar=True, mask=np.triu(corr, +1))

(Output: Figure 5 — lower-triangle correlation heatmap of the numerical variables.)

Insights from Bivariate and Multivariate Analysis:

Overall, the variables in the data do not look very well correlated. Listing a few observations from the heatmap:
- A negative correlation is an indication that the variables move in opposite directions: a voter who rates Blair highly tends not to rate Hague highly. Hence there is a negative correlation between the two, indicating an inverse relationship between the assessments.
- In general, correlation values between -0.30 and +0.30 represent weak correlation. The variables "Blair" and "Hague" both have weak correlations with the national and household economic conditions, but "Blair" has a slightly better correlation with these parameters (not much of a difference).
- National economic conditions have a very weak correlation with household economic conditions.

Listing a few observations from the violin plots:
- the white dot represents the median
- the thick gray bar in the centre represents the interquartile range
- the thin gray line represents the rest of the distribution, except for points determined to be "outliers" using a method based on the interquartile range
- where a violin is "fatter" there are more data points in that neighbourhood, and where it is "thinner" there are fewer

Checking for Outliers:

In [35]:
def check_outliers(data_df):
    vData_num = data_df.loc[:, data_df.columns != 'class']
    Q1 = vData_num.quantile(0.25)
    Q3 = vData_num.quantile(0.75)
    IQR = Q3 - Q1
    count = 0
    # Checking for outliers; True represents an outlier
    vData_num_mod = ((vData_num < (Q1 - 1.5 * IQR)) | (vData_num > (Q3 + 1.5 * IQR)))
    # Iterating over columns to count the outliers in each numerical attribute
    for col in vData_num_mod:
        if 1 in vData_num_mod[col].value_counts().index:
            print("No. of outliers in %s: %d" % (col, vData_num_mod[col].value_counts()[1]))
            count += 1
    print('\n\nNo of attributes with outliers are : ', count)

check_outliers(data_df)

No. of outliers in economic.cond.household: 65
No. of outliers in economic.cond.national: 37

No of attributes with outliers are :  2

In [36]:
Q1 = data_df.quantile(0.25)
Q3 = data_df.quantile(0.75)
IQR = Q3 - Q1
print(IQR)

age                        26.0
economic.cond.national      1.0
economic.cond.household     1.0
Blair                       2.0
Hague                       2.0
Europe                      6.0
political.knowledge         2.0
dtype: float64

Clearly, there are outliers in the variables "economic.cond.household" and "economic.cond.national". Now we will check the upper and lower ranges for both variables and then treat the outliers.

In [37]:
def detect_outlier(col):
    Q1, Q3 = np.percentile(col, [25, 75])
    IQR = Q3 - Q1
    lower_range = Q1 - (1.5 * IQR)
    upper_range = Q3 + (1.5 * IQR)
    return lower_range, upper_range

For Variable economic.cond.household:

In [38]:
lr, ur = detect_outlier(data_df['economic.cond.household'])
print("Lower range in economic.cond.household is", lr)
print("Upper range in economic.cond.household is", ur)
print('Number of outliers in economic.cond.household upper : ', data_df[data_df['economic.cond.household'] > ur].shape[0])
print('Number of outliers in economic.cond.household lower : ', data_df[data_df['economic.cond.household'] < lr].shape[0])
print('% of Outlier in economic.cond.household upper: ', round(100 * data_df[data_df['economic.cond.household'] > ur].shape[0] / len(data_df)), '%')
print('% of Outlier in economic.cond.household lower: ', round(100 * data_df[data_df['economic.cond.household'] < lr].shape[0] / len(data_df)), '%')

Lower range in economic.cond.household is 1.5
Upper range in economic.cond.household is 5.5
Number of outliers in economic.cond.household upper :  0
Number of outliers in economic.cond.household lower :  65
% of Outlier in economic.cond.household upper:  0 %
% of Outlier in economic.cond.household lower:  4 %

In [39]:
fig, (ax1, ax2, ax3) = plt.subplots(1, 3, figsize=(10, 4))
# boxplot
sns.boxplot(x='economic.cond.household', data=data_df, orient='v', ax=ax1, color='purple')
ax1.set_ylabel('Density', fontsize=15)
ax1.set_title('Figure 6.1: Distribution of economic.cond.household', fontsize=15)
ax1.tick_params(labelsize=15)
# distplot
sns.distplot(data_df['economic.cond.household'], ax=ax2, color='purple')
ax2.set_xlabel('economic.cond.household', fontsize=15)
ax2.tick_params(labelsize=15)
# histogram
ax3.hist(data_df['economic.cond.household'], color='purple')
ax3.set_ylabel('Density', fontsize=15)
ax3.set_xlabel('economic.cond.household', fontsize=15)
ax3.tick_params(labelsize=15)
plt.subplots_adjust(wspace=1)
plt.tight_layout()

(Output: Figure 6.1 — boxplot, distribution plot and histogram of economic.cond.household.)

Inference:
Clearly there are outliers in the variable "economic.cond.household". Listing a few observations:
- Lower range in economic.cond.household is 1.5
- Upper range in economic.cond.household is 5.5
- Number of outliers in economic.cond.household upper: 0
- Number of outliers in economic.cond.household lower: 65
- % of outliers in economic.cond.household upper: 0%
- % of outliers in economic.cond.household lower: 4%

Figure 6.1 also confirms the presence of outliers in the lower range. The median of "economic.cond.household" is identical to the first quartile, which is why there is an overlap in the figure. This could be because the data contains a large proportion of identical low values.
For Variable economic.cond.national:

In [40]:
lr, ur = detect_outlier(data_df['economic.cond.national'])
print("Lower range in economic.cond.national is", lr)
print("Upper range in economic.cond.national is", ur)
print('Number of outliers in economic.cond.national upper : ', data_df[data_df['economic.cond.national'] > ur].shape[0])
print('Number of outliers in economic.cond.national lower : ', data_df[data_df['economic.cond.national'] < lr].shape[0])
print('% of Outlier in economic.cond.national upper: ', round(100 * data_df[data_df['economic.cond.national'] > ur].shape[0] / len(data_df)), '%')
print('% of Outlier in economic.cond.national lower: ', round(100 * data_df[data_df['economic.cond.national'] < lr].shape[0] / len(data_df)), '%')

Lower range in economic.cond.national is 1.5
Upper range in economic.cond.national is 5.5
Number of outliers in economic.cond.national upper :  0
Number of outliers in economic.cond.national lower :  37
% of Outlier in economic.cond.national upper:  0 %
% of Outlier in economic.cond.national lower:  2 %

In [41]:
fig, (ax1, ax2, ax3) = plt.subplots(1, 3, figsize=(10, 4))
# boxplot
sns.boxplot(x='economic.cond.national', data=data_df, orient='v', ax=ax1, color='orange')
ax1.set_ylabel('Density', fontsize=15)
ax1.set_title('Figure 6.2: Distribution of economic.cond.national', fontsize=15)
ax1.tick_params(labelsize=15)
# distplot
sns.distplot(data_df['economic.cond.national'], ax=ax2, color='orange')
ax2.set_xlabel('economic.cond.national', fontsize=15)
ax2.tick_params(labelsize=15)
# histogram
ax3.hist(data_df['economic.cond.national'], color='orange')
ax3.set_ylabel('Density', fontsize=15)
ax3.set_xlabel('economic.cond.national', fontsize=15)
ax3.tick_params(labelsize=15)
plt.tight_layout()

(Output: Figure 6.2 — boxplot, distribution plot and histogram of economic.cond.national.)

Inference:
Clearly there are outliers in the variable "economic.cond.national". Listing a few observations:
- Lower range in economic.cond.national is 1.5
- Upper range in economic.cond.national is 5.5
- Number of outliers in economic.cond.national upper: 0
- Number of outliers in economic.cond.national lower: 37
- % of outliers in economic.cond.national upper: 0%
- % of outliers in economic.cond.national lower: 2%

Figure 6.2 also confirms the presence of outliers in the lower range. The median of "economic.cond.national" is identical to the first quartile, which is why there is an overlap in the figure. This could be because the data contains a large proportion of identical low values.

Since there are only a few outliers in these two features and the feature set is ordinal, we will treat them by capping the values at the whisker limits, and then compare consolidated boxplots of the variables before and after the treatment.

In [42]:
cols = ["economic.cond.national", "economic.cond.household"]

In [43]:
def remove_outlier(col):
    Q1, Q3 = np.percentile(col, [25, 75])
    IQR = Q3 - Q1
    lower_range = Q1 - (1.5 * IQR)
    upper_range = Q3 + (1.5 * IQR)
    return lower_range, upper_range

In [44]:
for column in data_df[cols].columns:
    lr, ur = remove_outlier(data_df[column])
    data_df[column] = np.where(data_df[column] > ur, ur, data_df[column])
    data_df[column] = np.where(data_df[column] < lr, lr, data_df[column])
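The cell that converts the two object-typed columns to numeric indicators did not survive extraction. A plausible reconstruction, hedged as an assumption, is `pd.get_dummies` with `drop_first=True`, which is consistent with the `vote_Labour` and `gender_male` columns that appear in the data-frame summary below:

```python
import pandas as pd

# Small stand-in frame with the same two object-typed columns as the election data.
df_demo = pd.DataFrame({
    "vote":   ["Labour", "Conservative", "Labour"],
    "gender": ["female", "male", "male"],
    "age":    [43, 36, 35],
})

# drop_first=True keeps one indicator per binary category, which would yield
# exactly the 'vote_Labour' and 'gender_male' columns seen in the notebook.
encoded = pd.get_dummies(df_demo, columns=["vote", "gender"], drop_first=True)
print(encoded.columns.tolist())
```

Applied to the election frame, the same call would leave the seven numeric columns untouched and replace "vote" and "gender" with the two indicators.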
After encoding the categorical columns, data_df.info() shows:

Int64Index: 1517 entries, 0 to 1524
Data columns (total 9 columns):
 #  Column                   Non-Null Count  Dtype
 0  age                      1517 non-null   int64
 1  economic.cond.national   1517 non-null   float64
 2  economic.cond.household  1517 non-null   float64
 3  Blair                    1517 non-null   int64
 4  Hague                    1517 non-null   int64
 5  Europe                   1517 non-null   int64
 6  political.knowledge      1517 non-null   int64
 7  vote_Labour              1517 non-null   uint8
 8  gender_male              1517 non-null   uint8
dtypes: float64(2), int64(5), uint8(2)
memory usage: 137.8 KB

This confirms that all the categorical data has now been converted to numerical data. We will divide the data into training and testing sets in a 70:30 proportion, with the random state fixed at 1 to ensure uniformity across multiple systems. Before the train-test split, we first separate the independent (X) and dependent (y) variables.

In [57]:
# Arrange data into independent variables and dependent variables
X = df.drop("vote_Labour", axis=1)   ## Features
y = df[["vote_Labour"]]              ## Target

In [58]:
# Split X and y into training and test sets in a 70:30 ratio
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=1)

In [59]:
print('Number of rows and columns of the training set for the independent variables:', X_train.shape)
print('Number of rows and columns of the training set for the dependent variable:', y_train.shape)
print('Number of rows and columns of the test set for the independent variables:', X_test.shape)
print('Number of rows and columns of the test set for the dependent variable:', y_test.shape)

Number of rows and columns of the training set for the independent variables: (1061, 8)
Number of rows and columns of the training set for the dependent variable: (1061, 1)
Number of rows and columns of the test set for the independent variables: (456, 8)
Number of rows and columns of the test set for the dependent variable: (456, 1)

In [60]:
X_train.head()

Out[60]: (first 5 rows of the 8 training features: age, economic.cond.national, economic.cond.household, Blair, Hague, Europe, political.knowledge, gender_male)
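With the train and test partitions in hand, the feature-scaling step that the discussion below defers to the individual models can be sketched. This is a hedged illustration on synthetic arrays (`StandardScaler` and the synthetic matrices are assumptions, not the notebook's actual scaling choice):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Synthetic stand-ins for the 8-column train/test feature matrices.
rng = np.random.default_rng(1)
X_train_demo = rng.normal(loc=50.0, scale=15.0, size=(1061, 8))
X_test_demo = rng.normal(loc=50.0, scale=15.0, size=(456, 8))

# Fit the scaler on the training data only, then reuse it on the test data,
# so no information from the test set leaks into the scaling parameters.
scaler = StandardScaler().fit(X_train_demo)
X_train_scaled = scaler.transform(X_train_demo)
X_test_scaled = scaler.transform(X_test_demo)

# After scaling, each training column has mean ~0 and standard deviation ~1.
print(X_train_scaled.mean(axis=0).round(6))
print(X_train_scaled.std(axis=0).round(6))
```

The fit-on-train, transform-both pattern matters here because the exit-poll models will be evaluated on the held-out 30%.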
In [61]:
y_train.head()

Out[61]: (first 5 rows of the training target vote_Labour)

In [62]:
X_test.head()

Out[62]: (first 5 rows of the 8 test features, at indices 1075, 1031, 1329, ...)

In [63]:
y_test.head()

Out[63]: (first 5 rows of the test target vote_Labour)

In general, algorithms that exploit distances or similarities (e.g. in the form of a scalar product) between data samples are sensitive to feature transformations; i.e. feature scaling is performed when we are dealing with gradient-descent-based algorithms (linear and logistic regression, neural networks) and distance-based algorithms (KNN, K-means, SVM), as these are very sensitive to the range of the data points.

The machine learning algorithms that require feature scaling are mostly KNN (K-Nearest Neighbours), neural networks, linear regression and logistic regression. The machine learning algorithms that do not require feature scaling are mostly non-linear ML algorithms such as decision trees, random forests, AdaBoost, Naive Bayes, etc.

Here, we are building a model to predict which party a voter will vote for on the basis of the given information, and to create an exit poll that will help in predicting the overall win and seats covered by a particular party. For our analysis we are expected to build models using Logistic Regression, LDA, KNN and Naive Bayes. For now we are not scaling the data; we will scale it depending on the models we run ahead. As mentioned, scaling may be necessary for two of the models and not for the other two.

1.4 Apply Logistic Regression and LDA (Linear Discriminant Analysis) (2 pts). Interpret the inferences of both models (2 pts). Successful implementation of each model.
Logical reason behind the selection of different values for the parameters involved in each model. Calculate train and test accuracies for each model. Comment on the validity of the models (overfitting or underfitting).

Logistic Regression

Logistic regression is a linear model for classification rather than regression. It is also known as logit regression. In this model, the probabilities describing the possible outcomes of a single trial are modelled using a logistic function.

Note:
- Regularization is applied by default, which is common in machine learning but not in statistics.
- Another advantage of regularization is that it improves numerical stability. No regularization amounts to setting C to a very high value.

There are two methods to solve a logistic regression problem:
1. Statsmodels
2. Scikit-learn

Here, we will use grid search (a scikit-learn method) to find the optimal hyperparameters of a model, which result in the most 'accurate' predictions, and obtain the best parameters.

Note: Grid search is the process of performing hyperparameter tuning in order to determine the optimal values for a given model. This is significant, as the performance of the entire model depends on the hyperparameter values specified.

The parameters used in GridSearchCV can be explained as:
- param_grid: requires a list of parameters and the range of values for each parameter of the specified estimator
- estimator: requires the model we are using for the hyperparameter tuning process
- cross-validation (cv): performed in order to determine the hyperparameter value set which provides the best accuracy levels
- n_jobs: controls the number of cores on which the package will attempt to run in parallel
- solver is a string ('liblinear' by default) that decides what solver to use for fitting the model; other options are 'newton-cg', 'lbfgs', 'sag' and 'saga'
- max_iter is an integer (100 by default) that defines the maximum number of iterations by the solver during model fitting
- verbose is a non-negative integer (0 by default) that defines the verbosity for the 'liblinear' and 'lbfgs' solvers

In [64]:
from sklearn.linear_model import LogisticRegression

In [65]:
# Logistic_model = LogisticRegression()
# Logistic_model.fit(X_train, y_train)

Logistic Regression Model Without Model Tuning

In [66]:
Logistic_model = LogisticRegression(solver='newton-cg', max_iter=10000, penalty='none', verbose=True, n_jobs=2)
Logistic_model.fit(X_train, y_train)
y_train_predict = Logistic_model.predict(X_train)
Logistic_model_score = Logistic_model.score(X_train, y_train)
print("Logistic Model Score for Train Data is ", Logistic_model_score)

[Parallel(n_jobs=2)]: Using backend LokyBackend with 2 concurrent workers.
Logistic Model Score for Train Data is  0.8341187558906692
[Parallel(n_jobs=2)]: Done 1 out of 1 | elapsed: 1.5s finished

Logistic Regression Model Without Model Tuning - Train Dataset
Problem Lipyn ass oso, 205 Proj ML Problem 1-Jpytes Notebook In (67 y_train_predict=Logistic_model.predict (x train) Logistic_model_score=Logistic_model.score(x train,y train) print (Logistic model_score) print (metrics.confusion_matrix(y train,y train predict)) print (metrics .classification_report(y train,y train predict)) 0.9341187558906692 [[197 110) [ 66 688}) precision recall fl-score support 0.75 0.6: 0.69 307 1 0.86 0.91 0.89 754 accuracy 0.83 1061 macro avg 0.81 0.78 0.79 1061 weighted avg 0.83 0.83 0.83 1061 In (68): ax=sns.heatmap(metrics.confusion plt.xlabel (‘Predicted Label") plt.ylabel( ‘Actual Label’) plt.title('Figure 9.1 plt.show() Figure 9.1 :Confusion Matrix of LR Model before Grid Search-Train D; Actual Label Predicted Label ecto 880 ebooks Dowland Machine LeamingProject MLProjst ML_Problem Lipp matrix(y train,y train predict) ,annot=True, ata fme='d! onfusion Matrix of LR Model before Grid Search-Train Data") osoiaa 120s In [69]: # Probability of Train Data Proj ML Problem 1-Jpytes Notebook y_train_prob=Logistic_model.predict_proba(Xx_train) pd.DataFrame(y train prob) -head() out [69]: ° © 0.999264 4 0.095272 2 0.293690 3 0.112080 4 0.016288 AUC_ROC Curve-Train Data 1 (0.066736 0.904728 0.706870 0.887870 0.988767 -eclhos 888/noebooks Downloads Machine Leaning Projet MLProjct ML Problem Lipya aos oso, 205 Proj ML Problem 1-Jpytes Notebook In [70]: # predict probabilities probs = Logistic model.predict_proba(x_train) # keep probabilities for the positive outcome only probs = probs[:, 1] # calculate AUC auc = roc_auc_score(y train, probs) print('AUC of LR model without Grid Search for Train Data is: %.3£' % auc) # calculate roc curve train_fpr, train_tpr, train thresholds = roc_curve(y train, probs) plt.plot({0, 1], [0, 1], linestyle='--') # plot the roc curve for the model plt.plot(train fpr, train tpr); plt.xlabel("False Positive Rat plt.ylabel("True Positive Rate") plt.title("Figure 10.1: AUC-ROC Train Data-LR Model without Gridsearch " AUC of 
LR model without Grid Search for Train Data is: 0.890 out {70 Text (0.5, 1-0, ‘Figure 10.1: AUC-ROC Train Data-LR Model without Grids earch ') Figure 10.1: AUC-ROC Train Data-LR Model without GridSearch 10 os 06 os “Fue Positive Rate 02 00 oo 02 oa 06 08 10 False Positive Rate ‘Model Without Model Tuning-Test Dataset In (71) y_test_predict-Logistic model.predict (x test) Logistic _model_score=Logistic model.score(X_test,y test) print ("Logistic Model Score for Test Data",Logistic_medel_score) Logistic Model Score for Test Data 0.8289473684210527 Iccthost 88noebooks Download Mashine Leaming Projet MLProjst ML. Problem Lipyn sos In (72): y_test_predict=Logistic model.predict (x test) Logistic_model_score=Logistic_model.score(x test,y test) print (Logistic model_score) print (metrics.confusion_matrix(y test,y_test_predict)) print (metrics.classification_report(y_test,y_test_predict)) 0.9289473684210527 (11 42) {36 2671) precision recall 0.76 73 1 0.36 +88 accuracy macro avg 0.1 +80 weighted avg 0.83 +83 In (73) Proje ML Problem Juste Notebook £l-score 0.74 0.87 0.83 +81 0.83 supp ort 153 303 436 456 456 # Confusion Matrix of Logistic Reg Model-Test Data axcsns -heatmap(metrics .confusion_matrix(y_test,y_test_predict),annoterue, plt.xlabel( Predicted Label") plt.ylabel( ‘Actual Label’) plt.title('Figure 9.2:Confusion Matrix of LR Model before Grid Search-Test Data’) plt.show() Figure 9.2:Confusion Matrix of LR Model before Grid Search-Test Data Actual Label Predicted Label | 0 i 250 150 100 ecto 880 ebooks Dowland Machine LeamingProject MLProjst ML_Problem Lipp ay In (74) Proj ML Problem 1-Jpytes Notebook y_test_prob=Logistic_model.predict_proba(x_test) pd.DataFrame(y test_prob) head) out [74]: ° © 0.426549 4 0.181457 0.006891 2 3 0.842674 4 0.063583 1 0573451 ostesas 0.993509 0.187828 0.936467 Iccthost 88noebooks Download Mashine Leaming Projet MLProjst ML. 
Problem Lipyn os osoiaa 120s Proj ML Problem 1-Jpytes Notebook In (75): probs_test = Logistic model.predict_proba(x test) probs test = probs test{t, 1] auc = roc_auc_score(y test, probs_test) print('AUC of LR model without Grid Search for Test Data is: %.3£' $ auc) test_fpr, test_tpr, test_thresholds = roc_curve(y_test, probs_test) plt.plot({0, 1], (0, 1], linestyle='--') plt.plot(train fpr, train tpr); plt.plot(train fpr, train tpr); plt.xlabel("False Positive Rate") plt.ylabel("?rue Positive Rate") plt.title("Figure 10.2: AUC-ROC Test Data-LR Model without GridSearch ") AUC of LR model without Grid Search for Test Data is: 0.883 out [75 Text (0.5, 1-0, ‘Figure 10.2: AUC-ROC Test Data-LR Model without Gridse arch ') Figure 10.2: AUC-ROC Test Data-LR Model without GridSearch 10 os “Fue Positive Rate 02 00 00 02 oa 06 08 vo False Positive Rate Inference of Logistic Regression Model Without GridSearch: Using the confusion matrix, the True Positive, False Positive, False Negative, and True Negative values can be extracted which will aid in the calculation of the accuracy score, precision score, recall score, and f1 score.Listing below model performance metrics before fine tuning the mode! Train Data: + True Positive:197 + False Positive:66 + False Negative:110 + True Negative:688 + AUC: 89% + Accuracy: 83% + Precision: 86% + ft-Score: 89% + Recall:91% Iocthost88noebooks Download Machin eam suns 120s Proj ML Problem 1-Jpytes Notebook Test Data: + True Positive:111 + False Positive:36 + False Negative:42 + True Negative:267 + AUC: 88.3% + Accuracy: 83% + Precision: 86% + ft-Score: 87% + Recall:88% + We know that, FPR tells us what proportion of the negative class got incorrectly classified by the classifier Here, we have higher TNR and a lower FPR which is desirable to classify the negative class. + Here, both Type | Error (False Positives) and Type Il Error ( False Negatives) are low indicating high Sensitivty/Recall, Precision, Specificity and F1 Score. 
+ Accuracy of the model is more than 70%, which can be considered a good accuracy score.
+ Train and Test data scores are mostly in line and the overall performance of the model looks good. Hence, it can be inferred that overall this model can be considered a good model.
+ We will further fine-tune the model using Grid Search in questions 1.6 and 1.7 to understand whether there is any improvement in the performance metrics.

Linear Discriminant Analysis Without Tuning

In [76]: from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

In [77]: LDA_model = LinearDiscriminantAnalysis()
         LDA_model.fit(X_train, y_train)

Out[77]: LinearDiscriminantAnalysis()

In [78]: # Training data class prediction with a cut-off value of 0.5
         y_train_predict = LDA_model.predict(X_train)
         LDA_model_score = LDA_model.score(X_train, y_train)
         print("LDA Model Score for Training Data without GridSearch is", LDA_model_score)

LDA Model Score for Training Data without GridSearch is 0.8341187558906692

In [79]: print(metrics.confusion_matrix(y_train, y_train_predict))
         print(metrics.classification_report(y_train, y_train_predict))

[[200 107]
 [ 69 685]]
              precision    recall  f1-score   support
           0       0.74      0.65      0.69       307
           1       0.86      0.91      0.89       754
    accuracy                           0.83      1061
   macro avg       0.80      0.78      0.79      1061
weighted avg       0.83      0.83      0.83      1061

In [80]: # Confusion Matrix of LDA Model-Train Data
         ax = sns.heatmap(metrics.confusion_matrix(y_train, y_train_predict), annot=True, fmt='d')
         plt.xlabel('Predicted Label')
         plt.ylabel('Actual Label')
         plt.title('Figure 11.1: Confusion Matrix of LDA Model without GridSearch-Train Data')
         plt.show()

[Figure 11.1: Confusion Matrix of LDA Model without GridSearch-Train Data]

AUC-ROC Curve LDA Model - Train Data

In [81]: probs_train = LDA_model.predict_proba(X_train)
         probs_train = probs_train[:, 1]
         auc = roc_auc_score(y_train, probs_train)
         print("the auc %.3f" % auc)
         train_fpr, train_tpr, train_thresholds = roc_curve(y_train, probs_train)
         plt.plot([0, 1], [0, 1], linestyle='--')
         plt.plot(train_fpr, train_tpr)
         plt.xlabel("False Positive Rate")
         plt.ylabel("True Positive Rate")
         plt.title("Figure 12.1: AUC-ROC Train Data-LDA Model without GridSearch")

the auc 0.890

[Figure 12.1: AUC-ROC Train Data-LDA Model without GridSearch]

In [82]: y_test_predict = LDA_model.predict(X_test)
         LDA_model_score = LDA_model.score(X_test, y_test)
         print("LDA Model Score for Test Data is", LDA_model_score)

LDA Model Score for Test Data is 0.831140350877193

In [83]: print(metrics.confusion_matrix(y_test, y_test_predict))
         print(metrics.classification_report(y_test, y_test_predict))

[[111  42]
 [ 35 268]]
              precision    recall  f1-score   support
           0       0.76      0.73      0.74       153
           1       0.86      0.88      0.87       303
    accuracy                           0.83       456
   macro avg       0.81      0.80      0.81       456
weighted avg       0.83      0.83      0.83       456

In [84]: # Confusion Matrix of LDA Model-Test Data
         ax = sns.heatmap(metrics.confusion_matrix(y_test, y_test_predict), annot=True, fmt='d')
         plt.xlabel('Predicted Label')
         plt.ylabel('Actual Label')
         plt.title('Figure 11.2: Confusion Matrix of LDA Model without GridSearch-Test Data')
         plt.show()

[Figure 11.2: Confusion Matrix of LDA Model without GridSearch-Test Data]

AUC-ROC Curve LDA Model - Test Data

In [85]: probs_test = LDA_model.predict_proba(X_test)
         probs_test = probs_test[:, 1]
         auc = roc_auc_score(y_test, probs_test)
         print("the auc curve %.3f " % auc)
         test_fpr, test_tpr, test_threshold = roc_curve(y_test, probs_test)
         plt.plot([0, 1], [0, 1], linestyle='--')
         plt.plot(test_fpr, test_tpr)
         plt.xlabel("False Positive Rate")
         plt.ylabel("True Positive Rate")
         plt.title("Figure 12.2: AUC-ROC Test Data-LDA Model without GridSearch")

the auc curve 0.888

[Figure 12.2: AUC-ROC Test Data-LDA Model without GridSearch]

Inference of LDA Model Without GridSearch:

Using the confusion matrix, the True Positive, False Positive, False Negative, and True Negative values can be extracted, which aid in the calculation of the accuracy, precision, recall, and f1 scores. Listing below the model performance metrics before fine-tuning the model:

Train Data:
+ True Positives: 200
+ False Positives: 69
+ False Negatives: 107
+ True Negatives: 685
+ AUC: 89%
+ Accuracy: 83%
+ Precision: 86%
+ f1-Score: 89%
+ Recall: 91%

Test Data:
+ True Positives: 111
+ False Positives: 35
+ False Negatives: 42
+ True Negatives: 268
+ AUC: 88.8%
+ Accuracy: 83%
+ Precision: 86%
+ f1-Score: 87%
+ Recall: 88%

+ We know that FPR tells us what proportion of the negative class got incorrectly classified by the classifier. Here we have a higher TNR and a lower FPR, which is desirable for classifying the negative class.
+ Both Type I errors (False Positives) and Type II errors (False Negatives) are low, indicating high Sensitivity/Recall, Precision, Specificity and F1 Score.
+ Accuracy of the model is more than 70%, which can be considered a good accuracy score.
+ Train and Test data scores are mostly in line and the overall performance of the model looks good. Hence, it can be inferred that overall this model can be considered a good model.
+ We will further fine-tune the model using Grid Search in questions 1.6 and 1.7 to understand whether there is any improvement in the performance metrics.

1.5 Apply KNN Model and Naive Bayes Model (2 pts). Interpret the inferences of each model (2 pts). Successful implementation of each model. Logical reason behind the selection of different values for the parameters involved in each model. Calculate Train and Test Accuracies for each model. Comment on the validness of models (overfitting or underfitting).

Naive Bayes Model

In [86]: from sklearn.naive_bayes import GaussianNB
         from sklearn import metrics

In [87]: NB_model = GaussianNB()
         NB_model.fit(X_train, y_train)

Out[87]: GaussianNB()

NB Model Without Tuning - Train Dataset

In [88]: y_train_predict = NB_model.predict(X_train)
         NB_model_score = NB_model.score(X_train, y_train)
         print("NB Model Score for Train Data is", NB_model_score)

NB Model Score for Train Data is 0.8341187558906692

In [89]: # Confusion Matrix of NB Model-Train Data
         print(metrics.confusion_matrix(y_train, y_train_predict))
         print(metrics.classification_report(y_train, y_train_predict))

[[212  95]
 [ 81 673]]
              precision    recall  f1-score   support
           0       0.72      0.69      0.71       307
           1       0.88      0.89      0.88       754
    accuracy                           0.83      1061
   macro avg       0.80      0.79      0.80      1061
weighted avg       0.83      0.83      0.83      1061

In [90]: ax = sns.heatmap(metrics.confusion_matrix(y_train, y_train_predict), annot=True, fmt='d')
         plt.xlabel('Predicted Label')
         plt.ylabel('Actual Label')
         plt.title('Figure 13.1: Confusion Matrix of NB Model-Train Data')
         plt.show()

[Figure 13.1: Confusion Matrix of NB Model-Train Data]

AUC-ROC Curve NB Model - Train Data

In [91]: probs_train = NB_model.predict_proba(X_train)
         probs_train = probs_train[:, 1]
         auc = roc_auc_score(y_train, probs_train)
         print("the auc %.3f" % auc)
         train_fpr, train_tpr, train_thresholds = roc_curve(y_train, probs_train)
         plt.plot([0, 1], [0, 1], linestyle='--')
         plt.plot(train_fpr, train_tpr)
         plt.xlabel("False Positive Rate")
         plt.ylabel("True Positive Rate")
         plt.title("Figure 14.1: AUC-ROC Train Data-NB Model without GridSearch")

the auc 0.889

[Figure 14.1: AUC-ROC Train Data-NB Model without GridSearch]

NB Model Without Tuning - Test Dataset

In [92]: y_test_predict = NB_model.predict(X_test)
         NB_model_score = NB_model.score(X_test, y_test)
         print("NB Model Score for Test Data is", NB_model_score)

NB Model Score for Test Data is 0.8223684210526315

In [93]: print(metrics.confusion_matrix(y_test, y_test_predict))
         print(metrics.classification_report(y_test, y_test_predict))

[[112  41]
 [ 40 263]]
              precision    recall  f1-score   support
           0       0.74      0.73      0.73       153
           1       0.87      0.87      0.87       303
    accuracy                           0.82       456
   macro avg       0.80      0.80      0.80       456
weighted avg       0.82      0.82      0.82       456

In [94]: ax = sns.heatmap(metrics.confusion_matrix(y_test, y_test_predict), annot=True, fmt='d')
         plt.xlabel('Predicted Label')
         plt.ylabel('Actual Label')
         plt.title('Figure 13.2: Confusion Matrix of NB Model Without GridSearch-Test Data')
         plt.show()

[Figure 13.2: Confusion Matrix of NB Model Without GridSearch-Test Data]

AUC-ROC Curve NB Model - Test Data

In [95]: probs_test = NB_model.predict_proba(X_test)
         probs_test = probs_test[:, 1]
         auc = roc_auc_score(y_test, probs_test)
         print("AUC for NB Model on Test Data Without GridSearch %.3f " % auc)
         test_fpr, test_tpr, test_threshold = roc_curve(y_test, probs_test)
         plt.plot([0, 1], [0, 1], linestyle='--')
         plt.plot(test_fpr, test_tpr)
         plt.xlabel("False Positive Rate")
         plt.ylabel("True Positive Rate")
         plt.title("Figure 14.2: AUC-ROC Test Data-NB Model without GridSearch")

AUC for NB Model on Test Data Without GridSearch 0.876

[Figure 14.2: AUC-ROC Test Data-NB Model without GridSearch]

Inference of NB Model Without GridSearch:

Using the confusion matrix, the True Positive, False Positive, False Negative, and True Negative values can be extracted, which aid in the calculation of the accuracy, precision, recall, and f1 scores. Listing below the model performance metrics without fine-tuning the model:

Train Data:
+ True Positives: 212
+ False Positives: 81
+ False Negatives: 95
+ True Negatives: 673
+ AUC: 88.9%
+ Accuracy: 83%
+ Precision: 88%
+ f1-Score: 88%
+ Recall: 89%

Test Data:
+ True Positives: 112
+ False Positives: 40
+ False Negatives: 41
+ True Negatives: 263
+ AUC: 87.6%
+ Accuracy: 82%
+ Precision: 87%
+ f1-Score: 87%
+ Recall: 87%

+ We know that FPR tells us what proportion of the negative class got incorrectly classified by the classifier. Here we have a higher TNR and a lower FPR, which is desirable for classifying the negative class.
+ Both Type I errors (False Positives) and Type II errors (False Negatives) are low, indicating high Sensitivity/Recall, Precision, Specificity and F1 Score.
+ Accuracy of the model is more than 70%, which can be considered a good accuracy score.
+ Train and Test data scores are mostly in line and the overall performance of the model looks good. Hence, it can be inferred that overall this model can be considered a good model.
+ We will further fine-tune the model using Grid Search in questions 1.6 and 1.7 to understand whether there is any improvement in the performance metrics.

KNN Model

Neighbors-based classification is a type of instance-based learning or non-generalizing learning: it does not attempt to construct a general internal model, but simply stores instances of the training data. Classification is computed from a simple majority vote of the nearest neighbors of each point: a query point is assigned the class which has the most representatives within the nearest neighbors of the point.

The KNN algorithm uses 'feature similarity' to predict the values of any new data points.
This means that the new point is assigned a value based on how closely it resembles the points in the training set. KNN has the following basic steps:

1. Calculate distance
2. Find closest neighbors
3. Vote for labels

In [96]: from sklearn.neighbors import KNeighborsClassifier
         from scipy.stats import zscore

In [97]: plt.plot(X)
         plt.title("Figure 15: Independent Variable Plot Before Scaling")
         plt.show()

[Figure 15: Independent Variable Plot Before Scaling]

In [98]: X[scale_cols] = X[scale_cols].apply(zscore)  # scale_cols: 'age', 'economic.cond.national', 'economic.cond.household', 'Blair', 'Hague', 'Europe', ... (column list truncated in the source)

In [99]: plt.plot(X)
         plt.title("Figure 16: Independent Variable Plot After Scaling")
         plt.show()

[Figure 16: Independent Variable Plot After Scaling]

In [100]: X.head(10)

Out[100]: (first ten rows of the z-scored predictors; e.g. row 0 has age -0.716361, economic.cond.national -0.301648, ...; every scaled column now has mean 0 and unit variance)

The default value of n_neighbors is 5. First we will build the KNN model with k=5.
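The three steps listed above (calculate distances, find the closest neighbors, vote for labels) can be sketched by hand. This is an illustrative toy implementation with made-up data, not part of the original notebook and not the scikit-learn code used below:

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_new, k=5):
    """Classify one point by majority vote of its k nearest neighbours."""
    # 1. Calculate the distance from the new point to every training point
    dists = np.linalg.norm(X_train - x_new, axis=1)
    # 2. Find the k closest neighbours
    nearest = np.argsort(dists)[:k]
    # 3. Vote for labels: the most common class among the neighbours wins
    return Counter(y_train[nearest]).most_common(1)[0][0]

# Tiny synthetic example: two clusters with labels 0 and 1
X_demo = np.array([[0.0, 0.0], [0.1, 0.2], [0.9, 1.0], [1.0, 0.8], [1.1, 1.1]])
y_demo = np.array([0, 0, 1, 1, 1])
print(knn_predict(X_demo, y_demo, np.array([0.95, 0.9]), k=3))  # -> 1
```

KNeighborsClassifier performs the same procedure, with optimized neighbor search structures and configurable distance metrics.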
In [101]: from sklearn.neighbors import KNeighborsClassifier
          KNN_model = KNeighborsClassifier()
          KNN_model.fit(X_train, y_train)

Out[101]: KNeighborsClassifier()

In [102]: y_train_predict = KNN_model.predict(X_train)
          KNN_model_score = KNN_model.score(X_train, y_train)
          print("KNN Model Score for Scaled Train Data is", KNN_model_score)

KNN Model Score for Scaled Train Data is 0.8539114043355325

In [103]: print(metrics.confusion_matrix(y_train, y_train_predict))
          print(metrics.classification_report(y_train, y_train_predict))

[[204 103]
 [ 52 702]]
              precision    recall  f1-score   support
           0       0.80      0.66      0.72       307
           1       0.87      0.93      0.90       754
    accuracy                           0.85      1061
   macro avg       0.83      0.80      0.81      1061
weighted avg       0.85      0.85      0.85      1061

In [104]: ax = sns.heatmap(metrics.confusion_matrix(y_train, y_train_predict), annot=True, fmt='d')
          plt.xlabel('Predicted Label')
          plt.ylabel('Actual Label')
          plt.title('Figure 17.1: Confusion Matrix of Train Data (k=5)')
          plt.show()

[Figure 17.1: Confusion Matrix of Train Data (k=5)]

AUC-ROC Curve KNN Model - Train Data (k=5)

In [105]: probs_train = KNN_model.predict_proba(X_train)
          probs_train = probs_train[:, 1]
          auc = roc_auc_score(y_train, probs_train)
          print("the auc %.3f" % auc)
          train_fpr, train_tpr, train_thresholds = roc_curve(y_train, probs_train)
          plt.plot([0, 1], [0, 1], linestyle='--')
          plt.plot(train_fpr, train_tpr)
          plt.xlabel("False Positive Rate")
          plt.ylabel("True Positive Rate")
          plt.title("Figure 18.1: AUC-ROC Train Data-KNN Model (K=5)")

the auc 0.923

[Figure 18.1: AUC-ROC Train Data-KNN Model (K=5)]

In [106]: y_test_predict = KNN_model.predict(X_test)
          KNN_model_score = KNN_model.score(X_test, y_test)
          print("KNN Model Score for Scaled Test Data for k=5 is", KNN_model_score)

KNN Model Score for Scaled Test Data for k=5 is 0.8157894736842105

In [107]: print(metrics.confusion_matrix(y_test, y_test_predict))
          print(metrics.classification_report(y_test, y_test_predict))

[[ 99  54]
 [ 30 273]]
              precision    recall  f1-score   support
           0       0.77      0.65      0.70       153
           1       0.83      0.90      0.87       303
    accuracy                           0.82       456
   macro avg       0.80      0.77      0.78       456
weighted avg       0.81      0.82      0.81       456

In [108]: ax = sns.heatmap(metrics.confusion_matrix(y_test, y_test_predict), annot=True, fmt='d')
          plt.xlabel('Predicted Label')
          plt.ylabel('Actual Label')
          plt.title('Figure 17.2: Confusion Matrix of Test Data (k=5)')
          plt.show()

[Figure 17.2: Confusion Matrix of Test Data (k=5)]

AUC-ROC Curve KNN Model - Test Data (k=5)

In [109]: probs_test = KNN_model.predict_proba(X_test)
          probs_test = probs_test[:, 1]
          auc = roc_auc_score(y_test, probs_test)
          print("the auc curve %.3f " % auc)
          test_fpr, test_tpr, test_threshold = roc_curve(y_test, probs_test)
          plt.plot([0, 1], [0, 1], linestyle='--')
          plt.plot(test_fpr, test_tpr)
          plt.xlabel("False Positive Rate")
          plt.ylabel("True Positive Rate")
          plt.title("Figure 18.2: AUC-ROC Test Data-KNN Model (K=5)")

the auc curve 0.853

[Figure 18.2: AUC-ROC Test Data-KNN Model (K=5)]

The default value of n_neighbors = 5 has given the model performance below:

Train Data:
+ AUC: 92.3%
+ Accuracy: 85%
+ Precision: 87%
+ f1-Score: 90%
+ Recall: 93%

Test Data:
+ AUC: 85.3%
+ Accuracy: 82%
+ Precision: 83%
+ f1-Score: 87%
+ Recall: 90%

We can see a considerable difference in model AUC between Train and Test Data, while the other parameters are mostly in line. Let's check the performance of the model for k=7.

In [110]: # Building KNN model with n_neighbors=7
          KNN_model = KNeighborsClassifier(n_neighbors=7)
          KNN_model.fit(X_train, y_train)

Out[110]: KNeighborsClassifier(n_neighbors=7)

In [111]: y_train_predict = KNN_model.predict(X_train)
          KNN_model_score = KNN_model.score(X_train, y_train)
          print("KNN Model Score for Scaled Train Data with K as 7 is", KNN_model_score)

KNN Model Score for Scaled Train Data with K as 7 is 0.8482563619227145

In [112]: print(metrics.confusion_matrix(y_train, y_train_predict))
          print(metrics.classification_report(y_train, y_train_predict))

[[202 105]
 [ 56 698]]
              precision    recall  f1-score   support
           0       0.78      0.66      0.72       307
           1       0.87      0.93      0.90       754
    accuracy                           0.85      1061
   macro avg       0.83      0.79      0.81      1061
weighted avg       0.84      0.85      0.84      1061

In [113]: ax = sns.heatmap(metrics.confusion_matrix(y_train, y_train_predict), annot=True, fmt='d')
          plt.xlabel('Predicted Label')
          plt.ylabel('Actual Label')
          plt.title('Figure 17.3: Confusion Matrix of Train Data (K=7)')
          plt.show()

[Figure 17.3: Confusion Matrix of Train Data (K=7)]

AUC-ROC Curve KNN Model - Train Data (K=7)

In [114]: probs_train = KNN_model.predict_proba(X_train)
          probs_train = probs_train[:, 1]
          auc = roc_auc_score(y_train, probs_train)
          print("the auc %.3f" % auc)
          train_fpr, train_tpr, train_thresholds = roc_curve(y_train, probs_train)
          plt.plot([0, 1], [0, 1], linestyle='--')
          plt.plot(train_fpr, train_tpr)
          plt.xlabel("False Positive Rate")
          plt.ylabel("True Positive Rate")
          plt.title("Figure 18.3: AUC-ROC Train Data-KNN Model (K=7)")

the auc 0.918

[Figure 18.3: AUC-ROC Train Data-KNN Model (K=7)]

In [115]: y_test_predict = KNN_model.predict(X_test)
          KNN_model_score = KNN_model.score(X_test, y_test)
          print("KNN Model Score for Scaled Test Data with K as 7 is", KNN_model_score)

KNN Model Score for Scaled Test Data with K as 7 is 0.8245614035087719

In [116]: print(metrics.confusion_matrix(y_test, y_test_predict))
          print(metrics.classification_report(y_test, y_test_predict))

[[ 99  54]
 [ 26 277]]
              precision    recall  f1-score   support
           0       0.79      0.65      0.71       153
           1       0.84      0.91      0.87       303
    accuracy                           0.82       456
   macro avg       0.81      0.78      0.79       456
weighted avg       0.82      0.82      0.82       456

In [117]: ax = sns.heatmap(metrics.confusion_matrix(y_test, y_test_predict), annot=True, fmt='d')
          plt.xlabel('Predicted Label')
          plt.ylabel('Actual Label')
          plt.title('Figure 17.4: Confusion Matrix of Test Data (K=7)')
          plt.show()

[Figure 17.4: Confusion Matrix of Test Data (K=7)]

AUC-ROC Curve KNN Model - Test Data (K=7)

In [118]: probs_test = KNN_model.predict_proba(X_test)
          probs_test = probs_test[:, 1]
          auc = roc_auc_score(y_test, probs_test)
          print("the auc curve %.3f " % auc)
          test_fpr, test_tpr, test_threshold = roc_curve(y_test, probs_test)
          plt.plot([0, 1], [0, 1], linestyle='--')
          plt.plot(test_fpr, test_tpr)
          plt.xlabel("False Positive Rate")
          plt.ylabel("True Positive Rate")
          plt.title("Figure 18.4: AUC-ROC Test Data-KNN Model (K=7)")

the auc curve 0.861

[Figure 18.4: AUC-ROC Test Data-KNN Model (K=7)]

Insights with k=7 on model performance are as follows (the AUC values here are the 0.918 and 0.861 printed above):

Train Data:
+ AUC: 91.8%
+ Accuracy: 85%
+ Precision: 87%
+ f1-Score: 90%
+ Recall: 93%

Test Data:
+ AUC: 86.1%
+ Accuracy: 82%
+ Precision: 84%
+ f1-Score: 87%
+ Recall: 91%

Inference of the K-Nearest Neighbours (KNN) Model:

+ KNN Model Score for Scaled Train Data for k=5 is 0.8539
+ KNN Model Score for Scaled Test Data for k=5 is 0.8157
+ KNN Model Score for Scaled Train Data with k=7 is 0.8482
+ KNN Model Score for Scaled Test Data with k=7 is 0.8245
+ There is a slight improvement in the accuracy score for Test data with k=7.
+ An accuracy score of 85% is generally considered a good accuracy score.

Further, to find the optimal value of k, we will look at k = 1, 3, 5, 7, ..., 19, store the test scores in a list (ac_score) and, using these scores, calculate the misclassification error (MCE) for each k and find the model with the lowest MCE using the formula:

Misclassification error (MCE) = 1 - Test accuracy score

In [119]: ac_score = []
          for k in range(1, 20, 2):
              knn = KNeighborsClassifier(n_neighbors=k)
              knn.fit(X_train, y_train)
              scores = knn.score(X_test, y_test)
              ac_score.append(scores)
          MCE = [1 - x for x in ac_score]
          MCE

Out[119]:
[0.2192982456140351,
 0.21052631578947367,
 0.1842105263157895,
 0.17543859649122806,
 0.19298245614035092,
 0.19517543859649122,
 0.19736842105263153,
 0.19298245614035092,
 0.20175438596491224,
 0.19517543859649122]

In [120]: ac_score

Out[120]:
[0.7807017543859649,
 0.7894736842105263,
 0.8157894736842105,
 0.8245614035087719,
 0.8070175438596491,
 0.8048245614035088,
 0.8026315789473685,
 0.8070175438596491,
 0.7982456140350878,
 0.8048245614035088]

In [121]: import matplotlib.pyplot as plt
          # plot misclassification error vs k
          plt.plot(range(1, 20, 2), MCE)
          plt.xlabel('Number of Neighbors K')
          plt.ylabel('Misclassification Error')
          plt.title("Figure 19: MCE Plot for KNN Model")
          plt.show()

[Figure 19: MCE Plot for KNN Model]

Hence, we can say that the lowest value of the misclassification error is at k=7.
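Because the MCE above is computed from a single train/test split, a cross-validated variant of the same k search is often more stable. The sketch below is an illustrative aside, not part of the original notebook; it uses synthetic data as a stand-in for the scaled election features:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the scaled X, y (assumption: the real data would be used here)
X_demo, y_demo = make_classification(n_samples=400, n_features=8, random_state=0)

cv_mce = {}
for k in range(1, 20, 2):
    # Mean accuracy over 5 folds; misclassification error = 1 - accuracy
    acc = cross_val_score(KNeighborsClassifier(n_neighbors=k), X_demo, y_demo, cv=5).mean()
    cv_mce[k] = 1 - acc

best_k = min(cv_mce, key=cv_mce.get)
print("k with lowest cross-validated MCE:", best_k)
```

Averaging over folds reduces the chance that a particular split favors one k by luck, at the cost of fitting the model several times per candidate k.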
Also, we have seen above that accuracy score for KNN Model at k=7 is 85% which is considered a good accuracy score and the difference between train and test accuracies is less than 10%, itis a valid model. Therefore, we can say that the optimal value of k is 7 for this particular model. ‘Type Markdown and LaTex: a’ 1.8 Model Tuning (¢ pts), Bagging (1.5 pts) and Boosting (1.5 pts). Apply grid search on each model (include all models) and make models on best_params. Define a logic behind choosing particular values for different hyper-parameters for grid search. Compare and comment on performances of all. Comment on feature importance if applicable. Successful implementation of both algorithms along with inferences and comments on the model performances. Model Tuning is the process of maximizing a mode'’s performance without overftting or creating too high of a variance. This is accomplished by selecting appropriate “hyperparameters.” which is crucial for model accuracy, but can be computationally challenging. Hyperparameters differ are not learned by the model automatically.instead, these parameters are set manually.Below mentioned are the three most commonly used approaches: 1. Grid Search- Grid search also known as parameter sweeping. This method involves manually defining a subset of the hyperparametric space and exhausting all combinations of the specified hyperparameter subsets. Each combination’s performance is then evaluated, typically using cross-validation, and the best performing hyperparametric combination is chosen, Iccthost88ebooks Downlanh/Mashine Leaming Project MLProjst ML_ Problem Lipp cans oso, 205 Proj ML Problem 1-Jpytes Notebook 2. Random Search- Random search can be said as a basic improvement on grid search. Instead of testing on a predetermined subset of hyperparameters, random search, as its name implies, randomly selects a ‘chosen number of hyperparametric pairs from a given domain and tests only those. 
This greatly simplifies the analysis without significantly sacrificing optimization. For example, if the region of hyperparameters that are near optimal occupies at least 5% of the grid, then random search with 60 trials will find that region with high probability (95%).

3. Bayesian Optimization- This process builds a probabilistic model of a given function and analyzes this model to decide where to next evaluate the function. It offers an efficient framework for optimizing highly expensive black-box functions without knowing their form, and is an efficient tool for hyperparameter tuning of complex models like deep neural networks.

Here, we will use the Grid Search method for model tuning.

Naive Bayes Model with Tuning- Grid Search

Explaining the parameters used to find the optimal combinations:

+ param_grid_NB: Dictionary that contains all of the parameters to try.
+ var_smoothing: Stability calculation to widen (or smooth) the curve and therefore account for more samples that are further away from the distribution mean.
+ np.logspace: Returns numbers spaced evenly on a log scale, starting from 0, ending at -9, and generating 100 samples.
+ estimator: Machine learning model of interest.
+ verbose: The verbosity; the higher, the more messages. In this case, it is set to 1.
+ cv: Cross-validation generator or an iterable; in this case, there is a 10-fold cross-validation.
+ n_jobs: Number of jobs to run in parallel; in this case, it is set to -1, which implies that all CPUs are used.

Here, we will build the Naive Bayes Model using GridSearchCV to find an optimal combination of hyperparameters that minimizes a predefined loss function to give better results.
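For comparison only, the random-search alternative described above can be sketched with scikit-learn's RandomizedSearchCV. This is illustrative (synthetic data, hypothetical settings such as n_iter=20) and not part of the notebook's pipeline, which uses grid search:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import RandomizedSearchCV
from sklearn.naive_bayes import GaussianNB

# Synthetic stand-in for the voter features
X, y = make_classification(n_samples=200, n_features=8, random_state=1)

# Sample 20 of the 100 candidate var_smoothing values instead of exhausting all of them
search = RandomizedSearchCV(
    estimator=GaussianNB(),
    param_distributions={'var_smoothing': np.logspace(0, -9, num=100)},
    n_iter=20, cv=5, random_state=1, n_jobs=-1)
search.fit(X, y)
print(search.best_params_)
```

With 20 sampled candidates instead of 100, the search fits 100 models (20 x 5 folds) rather than 500, at the cost of possibly missing the exact grid optimum.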
In [122]: param_grid_NB = {'var_smoothing': np.logspace(0, -9, num=100)}

In [123]: NB_model_grid = GridSearchCV(estimator=GaussianNB(), param_grid=param_grid_NB, verbose=1, cv=10, n_jobs=-1)
          NB_model_grid.fit(x_train, y_train)
          print(NB_model_grid.best_estimator_)

Fitting 10 folds for each of 100 candidates, totalling 1000 fits
GaussianNB(var_smoothing=0.0012328467394420659)

NB Model With Tuning on Train Data

Here, we will perform model prediction on training and testing data to evaluate the model's accuracy and efficiency after fine tuning the model.

In [124]: y_train_predict = NB_model_grid.predict(x_train)
          NB_model_grid_score = NB_model.score(x_train, y_train)
          print("NB Model Score after Grid Search for Train Data is", NB_model_grid_score)

NB Model Score after Grid Search for Train Data is 0.8341187558906692

In [125]: # Confusion Matrix of NB Model-Train Data
          print(metrics.confusion_matrix(y_train, y_train_predict))
          print(metrics.classification_report(y_train, y_train_predict))

[[210  97]
 [ 77 677]]

              precision    recall  f1-score   support
           0       0.73      0.68      0.71       307
           1       0.87      0.90      0.89       754

    accuracy                           0.84      1061
   macro avg       0.80      0.79      0.80      1061
weighted avg       0.83      0.84      0.83      1061

In [126]: ax=sns.heatmap(metrics.confusion_matrix(y_train, y_train_predict), annot=True, fmt='d')
          plt.xlabel('Predicted Label')
          plt.ylabel('Actual Label')
          plt.title('Figure 20.1: Confusion Matrix of NB Model With GridSearch-Train Data')
          plt.show()

[Figure 20.1: Confusion Matrix of NB Model With GridSearch-Train Data]

In [127]: probs_train = NB_model_grid.predict_proba(x_train)
          probs_train = probs_train[:, 1]
          auc = roc_auc_score(y_train, probs_train)
          print("AUC of NB Model with GridSearch is %.3f" % auc)
          train_fpr, train_tpr, train_thresholds = roc_curve(y_train, probs_train)
          plt.plot([0, 1], [0, 1], linestyle='--')
          plt.plot(train_fpr, train_tpr)
          plt.xlabel("False Positive Rate")
          plt.ylabel("True Positive Rate")
          plt.title("Figure 21.1: AUC-ROC Train Data-NB Model with GridSearch")

AUC of NB Model with GridSearch is 0.887

Out[127]: Text(0.5, 1.0, 'Figure 21.1: AUC-ROC Train Data-NB Model with GridSearch')

[Figure 21.1: AUC-ROC Train Data-NB Model with GridSearch]

NB Model With Tuning on Test Data

In [128]: y_test_predict = NB_model_grid.predict(x_test)
          NB_model_grid_score = NB_model.score(x_test, y_test)
          print("NB Model Score after Grid Search for Test Data is", NB_model_grid_score)

NB Model Score after Grid Search for Test Data is 0.8223684210526315

In [129]: print(metrics.confusion_matrix(y_test, y_test_predict))
          print(metrics.classification_report(y_test, y_test_predict))

[[111  42]
 [ 38 265]]

              precision    recall  f1-score   support
           0       0.74      0.73      0.74       153
           1       0.86      0.87      0.87       303

    accuracy                           0.82       456
   macro avg       0.80      0.80      0.80       456
weighted avg       0.82      0.82      0.82       456

In [130]: ax=sns.heatmap(metrics.confusion_matrix(y_test, y_test_predict), annot=True, fmt='d')
          plt.xlabel('Predicted Label')
          plt.ylabel('Actual Label')
          plt.title('Figure 20.2: Confusion Matrix of NB Model With GridSearch-Test Data')
          plt.show()

[Figure 20.2: Confusion Matrix of NB Model With GridSearch-Test Data]

In [131]: probs_test = NB_model_grid.predict_proba(x_test)
          probs_test = probs_test[:, 1]
          auc = roc_auc_score(y_test, probs_test)
          print("the auc curve %.3f" % auc)
          test_fpr, test_tpr, test_threshold = roc_curve(y_test, probs_test)
          plt.plot([0, 1], [0, 1], linestyle='--')
          plt.plot(test_fpr, test_tpr)
          plt.xlabel("False Positive Rate")
          plt.ylabel("True Positive Rate")
          plt.title("Figure 21.2: AUC-ROC Test Data-NB Model with GridSearch")

the auc curve 0.880

Out[131]: Text(0.5, 1.0, 'Figure 21.2: AUC-ROC Test Data-NB Model with GridSearch')

[Figure 21.2: AUC-ROC Test Data-NB Model with GridSearch]

Inference of NB Model With GridSearch:

Using the confusion matrix, the True Positive, False Positive, False Negative, and True Negative values can be extracted, which aid in the calculation of the accuracy score, precision score, recall score, and f1 score. Listing below the model performance metrics after fine tuning the model:

Train Data:

+ True Positive: 210
+ False Positive: 77
+ False Negative: 97
+ True Negative: 677
+ AUC: 88.7%
+ Accuracy: 84%
+ Precision: 87%
+ f1-Score: 89%
+ Recall: 90%

Test Data:

+ True Positive: 111
+ False Positive: 38
+ False Negative: 42
+ True Negative: 265
+ AUC: 88%
+ Accuracy: 82%
+ Precision: 86%
+ f1-Score: 87%
+ Recall: 87%

+ We know that FPR
tells us what proportion of the negative class got incorrectly classified by the classifier. Here, we have a higher TNR and a lower FPR, which is desirable for classifying the negative class.
+ Here, both Type I Error (False Positives) and Type II Error (False Negatives) are low, indicating high Sensitivity/Recall, Precision, Specificity and F1 Score.
+ Accuracy of the model is more than 70%, which can be considered a good accuracy score.
+ Train and Test data scores are mostly in line and the overall performance of the model looks good. Hence, it can be inferred that overall this model can be considered a good model.
+ After fine tuning the model we can see that the model has given mostly the same performance, with a very slight improvement in a few parameters. Hence, we can say that fine tuning this particular model does not make much of a difference to the model performance.

Logistic Regression Model with Tuning- Grid Search

Before using GridSearchCV, listing important parameters below:

+ estimator: The model or function on which we want to use GridSearchCV.
+ param_grid: Dictionary or list of parameters of the model or function from which GridSearchCV has to select.
+ scoring: Used as the evaluation metric for model performance to decide the best hyperparameters; if not specified, the estimator's score method is used.
+ solver: String ('liblinear' by default) that decides what solver to use for fitting the model. Other options are 'newton-cg', 'lbfgs', 'sag', and 'saga'. Here we are using newton-cg as it adaptively controls the accuracy of the solution without loss of the rapid convergence properties.
+ max_iter: Defines the maximum number of iterations by the solver during model fitting. Here, we are using 10000.
+ penalty: Imposes a penalty on the logistic model for having too many variables, shrinking the coefficients of the less contributive variables toward zero.
+ verbose: Non-negative integer (0 by default) that defines the verbosity.
+ n_jobs: Controls the number of cores on which the package will attempt to run in parallel.
+ cv: Cross-validation generator or an iterable; in this case, there is a 5-fold cross-validation.
+ scoring: Choosing F1 since it computes the Harmonic Mean between Recall and Precision; it tells us whether both Type I and Type II error are low or high on average.

In [132]: # Fit the Logistic Regression model
          Logistic_model_grid = LogisticRegression(solver='newton-cg', max_iter=10000, penalty='l2', verbose=True, n_jobs=2)
          Logistic_model_grid.fit(x_train, y_train)
          Logistic_grid = {'penalty': ['l2', 'none'], 'solver': ['newton-cg'], 'tol': [0.0001, 0.00001]}
          Logistic_grid_search = GridSearchCV(estimator=Logistic_model_grid, param_grid=Logistic_grid, scoring='f1', cv=5)
          Logistic_grid_search.fit(x_train, y_train)
          best_Logistic_model = Logistic_grid_search.best_estimator_
          # Prediction on the training and test sets
          ytrain_predict = best_Logistic_model.predict(x_train)
          ytest_predict = best_Logistic_model.predict(x_test)

[Parallel(n_jobs=2)]: Using backend LokyBackend with 2 concurrent workers.
[Parallel(n_jobs=2)]: Done 1 out of 1 | elapsed: 1.0s finished
[Parallel(n_jobs=2)]: Using backend LokyBackend with 2 concurrent workers.
[Parallel(n_jobs=2)]: Done 1 out of 1 | elapsed: 1.1s finished

Logistic Regression Model- Train Data

In [133]: # Accuracy - Training Data
          Logistic_model_grid.score(x_train, y_train)

Out[133]: 0.8341187558906692

In [134]: ## Getting the probabilities on the train set
          ytrain_predict_prob = best_Logistic_model.predict_proba(x_train)
          pd.DataFrame(ytrain_predict_prob).head()

Out[134]:           0         1
        0  0.951817  0.048183
        1  0.096823  0.903177
        2  0.291222  0.708778
        3  0.113949  0.886051
        4  0.016883  0.983117

In [135]: print(metrics.confusion_matrix(y_train, y_train_predict))
          print(metrics.classification_report(y_train, y_train_predict))

[[210  97]
 [ 77 677]]

              precision    recall  f1-score   support

           0       0.73      0.68      0.71       307
           1       0.87      0.90      0.89       754

    accuracy                           0.84      1061
   macro avg       0.80      0.79      0.80      1061
weighted avg       0.83      0.84      0.83      1061

In [136]: ax=sns.heatmap(metrics.confusion_matrix(y_train, y_train_predict), annot=True, fmt='d')
          plt.xlabel('Predicted Label')
          plt.ylabel('Actual Label')
          plt.title('Figure 22.1: Confusion Matrix of LR Model After Grid Search-Train Data')
          plt.show()
[Figure 22.1: Confusion Matrix of LR Model After Grid Search-Train Data]

In [137]: # Train Model ROC_AUC Score
          # predict probabilities
          probs = Logistic_model.predict_proba(x_train)
          # keep probabilities for the positive outcome only
          probs = probs[:, 1]
          # calculate AUC
          auc = roc_auc_score(y_train, probs)
          print('AUC: %.3f' % auc)
          # calculate roc curve
          train_fpr, train_tpr, train_thresholds = roc_curve(y_train, probs)
          plt.plot([0, 1], [0, 1], linestyle='--')
          # plot the roc curve for the model
          plt.plot(train_fpr, train_tpr)
          plt.xlabel("False Positive Rate")
          plt.ylabel("True Positive Rate")
          plt.title("Figure 23.1: AUC-ROC Train Data with GridSearch-Logistic Reg Model")

AUC: 0.890

Out[137]: Text(0.5, 1.0, 'Figure 23.1: AUC-ROC Train Data with GridSearch-Logistic Reg Model')

[Figure 23.1: AUC-ROC Train Data with GridSearch-Logistic Reg Model]

Logistic Regression Model-Test Data

In [138]: # Accuracy - Test Data
          Logistic_model.score(x_test, y_test)

Out[138]: 0.8289473684210527

In [139]: ## Getting the probabilities on the test set
          ytest_predict_prob = best_Logistic_model.predict_proba(x_test)
          pd.DataFrame(ytest_predict_prob).head()

Out[139]:           0         1
        0  0.425915  0.574085
        1  0.153046  0.846954
        2  0.009706  0.990294
        3  0.839472  0.160528
        4  0.065104  0.934896

In [140]: print(metrics.confusion_matrix(y_test, y_test_predict))
          print(metrics.classification_report(y_test, y_test_predict))

[[111  42]
 [ 38 265]]

              precision    recall  f1-score   support

           0       0.74      0.73      0.74       153
           1       0.86      0.87      0.87       303

    accuracy                           0.82       456
   macro avg       0.80      0.80      0.80       456
weighted avg       0.82      0.82      0.82       456

In [141]: ax=sns.heatmap(metrics.confusion_matrix(y_test, y_test_predict), annot=True, fmt='d')
          plt.xlabel('Predicted Label')
          plt.ylabel('Actual Label')
          plt.title('Figure 22.2: Confusion Matrix of LR Model After Grid Search-Test Data')
          plt.show()

[Figure 22.2: Confusion Matrix of LR Model After Grid Search-Test Data]

In [142]: # predict probabilities
          probs = Logistic_model.predict_proba(x_test)
          # keep probabilities for the positive outcome only
          probs = probs[:, 1]
          # calculate AUC
          test_auc = roc_auc_score(y_test, probs)
          print('AUC: %.3f' % auc)  # note: this prints the train AUC from In [137], not test_auc
          # calculate roc curve
          test_fpr, test_tpr, test_thresholds = roc_curve(y_test, probs)
          plt.plot([0, 1], [0, 1], linestyle='--')
          # plot the roc curve for the model
          plt.plot(test_fpr, test_tpr)
          plt.xlabel("False Positive Rate")
          plt.ylabel("True Positive Rate")
          plt.title("Figure 23.2: AUC-ROC Test Data with GridSearch-Logistic Reg Model")

AUC: 0.890

Out[142]: Text(0.5, 1.0, 'Figure 23.2: AUC-ROC Test Data with GridSearch-Logistic Reg Model')

[Figure 23.2: AUC-ROC Test Data with GridSearch-Logistic Reg Model]

Inference of Logistic Regression Model With GridSearch:

Using the confusion matrix, the True Positive, False Positive, False Negative, and True Negative values can be extracted, which aid in the calculation of the accuracy score, precision score, recall score, and f1 score.
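That extraction step can be done directly in scikit-learn with `ravel()`; a minimal self-contained sketch on a toy label vector (note that sklearn orders the 2x2 matrix as [[TN, FP], [FN, TP]], which may differ from the TP-first order the counts are read off in this report):

```python
from sklearn.metrics import confusion_matrix

y_true = [0, 0, 1, 1, 1, 0, 1]
y_pred = [0, 1, 1, 1, 0, 0, 1]

# sklearn's 2x2 confusion matrix is [[TN, FP], [FN, TP]]
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
accuracy = (tp + tn) / (tp + tn + fp + fn)
print(tn, fp, fn, tp, accuracy)  # 2 1 1 3 0.714...
```

From these four counts, precision (tp / (tp + fp)), recall (tp / (tp + fn)), and the F1 score follow directly.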
Listing below the model performance metrics after fine tuning the model:

Train Data:

+ True Positive: 197
+ False Positive: 66
+ False Negative: 110
+ True Negative: 688
+ AUC: 89%
+ Accuracy: 83%
+ Precision: 86%
+ f1-Score: 89%
+ Recall: 91%

Test Data:

+ True Positive: 111
+ False Positive: 36
+ False Negative: 42
+ True Negative: 267
+ AUC: 88.3%
+ Accuracy: 83%
+ Precision: 86%
+ f1-Score: 87%
+ Recall: 88%

+ We know that FPR tells us what proportion of the negative class got incorrectly classified by the classifier. Here, we have a higher TNR and a lower FPR, which is desirable for classifying the negative class.
+ Here, both Type I Error (False Positives) and Type II Error (False Negatives) are low, indicating high Sensitivity/Recall, Precision, Specificity and F1 Score.
+ Accuracy of the model is more than 70%, which can be considered a good accuracy score.
+ Train and Test data scores are mostly in line and the overall performance of the model looks good. Hence, it can be inferred that overall this model can be considered a good model.
+ After fine tuning the model we can see that the model has given mostly the same performance, with a very slight improvement in a few parameters. Hence, we can say that fine tuning this particular model does not make much of a difference to the model performance.

Ensemble Machine Learning:

It is a machine learning paradigm where multiple models (often called "weak learners") are trained to solve the same problem and combined to get better results. The main hypothesis is that when weak models are correctly combined we can obtain more accurate and/or robust models.

The three most popular methods for combining the predictions from different models are:

1. Bagging: Building multiple models (typically of the same type) from different subsamples of the training dataset.
2.
Boosting: Building multiple models (typically of the same type), each of which learns to fix the prediction errors of a prior model in the sequence of models.
3. Voting: Building multiple models (typically of differing types) and using simple statistics (like calculating the mean) to combine predictions.

Here, we will use the bagging and boosting techniques.

Bagging

Idea of Bagging: Fit several independent models and "average" their predictions in order to obtain a model with a lower variance. However, in practice, it requires too much data to fit fully independent models. So, we rely on the good "approximate properties" of bootstrap samples (representativity and independence) to fit models that are almost independent.

Note: A Bagging classifier is an ensemble meta-estimator that fits base classifiers each on random subsets of the original dataset and then aggregates their individual predictions (either by voting or by averaging) to form a final prediction. When samples are drawn with replacement, the method is known as Bagging.

Bootstrap Aggregation (or Bagging for short) is a simple and very powerful ensemble method. It is a general procedure that can be used to reduce the variance of algorithms that have high variance. One family of algorithms with high variance is decision trees, like classification and regression trees (CART).

Bagging is the application of the Bootstrap procedure to a high-variance machine learning algorithm, typically decision trees. Hence, we will build the Bagging model using a Decision Tree as the base estimator and then fit the model.

We know that decision trees are sensitive to the specific data on which they are trained. If the training data is changed (e.g.
a tree is trained on a subset of the training data), the resulting decision tree can be quite different and, in turn, the predictions can be quite different.

Bagging of the CART algorithm would work as follows:

1. Create many (e.g. 100) random sub-samples of our dataset with replacement.
2. Train a CART model on each sample.
3. Given a new dataset, calculate the average prediction from each model.

Listing below a few parameters used:

+ base_estimator: The base estimator to fit on random subsets of the dataset. Here, we are using the Decision Tree Classifier to improve accuracy and reduce variance.
+ n_estimators: The number of base estimators in the ensemble. Here we are taking 100.
+ random_state: Controls the random resampling of the original dataset (sample wise and feature wise).

In [143]: from sklearn.model_selection import train_test_split
          X_train, X_test, y_train, y_test = train_test_split(x, y, test_size=0.30, random_state=1)

In [144]: from sklearn.ensemble import BaggingClassifier
          from sklearn.tree import DecisionTreeClassifier

In [145]: cart = DecisionTreeClassifier()
          Bagging_model = BaggingClassifier(base_estimator=cart, n_estimators=100, random_state=1)
          Bagging_model.fit(X_train, y_train)
Out[145]: BaggingClassifier(base_estimator=DecisionTreeClassifier(), n_estimators=100,
                            random_state=1)

In [146]: y_train_predict = Bagging_model.predict(X_train)
          Bagging_model_score = Bagging_model.score(X_train, y_train)
          print("Bagging Model Score on Train Data is", Bagging_model_score)

Bagging Model Score on Train Data is 1.0

In [147]: print(metrics.confusion_matrix(y_train, y_train_predict))
          print(metrics.classification_report(y_train, y_train_predict))

[[307   0]
 [  0 754]]

              precision    recall  f1-score   support

           0       1.00      1.00      1.00       307
           1       1.00      1.00      1.00       754

    accuracy                           1.00      1061
   macro avg       1.00      1.00      1.00      1061
weighted avg       1.00      1.00      1.00      1061

In [148]: ax=sns.heatmap(metrics.confusion_matrix(y_train, y_train_predict), annot=True, fmt='d')
          plt.xlabel('Predicted Label')
          plt.ylabel('Actual Label')
          plt.title('Figure 25.1: Confusion Matrix of Bagging Model_Train Data')
          plt.show()

[Figure 25.1: Confusion Matrix of Bagging Model_Train Data]

AUC_ROC Curve of Bagging Model

In [149]: probs = Bagging_model.predict_proba(X_train)
          probs = probs[:, 1]
          auc = roc_auc_score(y_train, probs)
          print('AUC: %.3f' % auc)
          train_fpr, train_tpr, train_thresholds = roc_curve(y_train, probs)
          plt.plot([0, 1], [0, 1], linestyle='--')
          plt.plot(train_fpr, train_tpr)
          plt.xlabel("False Positive Rate")
          plt.ylabel("True Positive Rate")
          plt.title("Figure 26.1: AUC-ROC of Train Data_Bagging Model")

AUC: 1.000

Out[149]: Text(0.5, 1.0, 'Figure 26.1: AUC-ROC of Train Data_Bagging Model')

[Figure 26.1: AUC-ROC of Train Data_Bagging Model]

In [173]: y_test_predict = Bagging_model.predict(X_test)
          Bagging_model_score = Bagging_model.score(X_test, y_test)
          print("Bagging Model Score on Test Data is", Bagging_model_score)
Bagging Model Score on Test Data is 0.8201754385964912

In [151]: print(metrics.confusion_matrix(y_test, y_test_predict))
          print(metrics.classification_report(y_test, y_test_predict))

[[108  45]
 [ 37 266]]

              precision    recall  f1-score   support

           0       0.74      0.71      0.72       153
           1       0.86      0.88      0.87       303

    accuracy                           0.82       456
   macro avg       0.80      0.79      0.80       456
weighted avg       0.82      0.82      0.82       456

In [152]: ax=sns.heatmap(metrics.confusion_matrix(y_test, y_test_predict), annot=True, fmt='d')
          plt.xlabel('Predicted Label')
          plt.ylabel('Actual Label')
          plt.title('Figure 25.2: Confusion Matrix of Bagging Model_Test Data')
          plt.show()

[Figure 25.2: Confusion Matrix of Bagging Model_Test Data]

AUC_ROC Curve of Bagging Model_Test Data

In [153]: probs_test = Bagging_model.predict_proba(X_test)
          probs_test = probs_test[:, 1]
          auc = roc_auc_score(y_test, probs_test)
          print('AUC: %.3f' % auc)
          test_fpr, test_tpr, test_thresholds = roc_curve(y_test, probs_test)
          plt.plot([0, 1], [0, 1], linestyle='--')
          plt.plot(test_fpr, test_tpr)
          plt.xlabel("False Positive Rate")
          plt.ylabel("True Positive Rate")
          plt.title("Figure 26.2: AUC-ROC of Test Data_Bagging Model")

AUC: 0.881

Out[153]: Text(0.5, 1.0, 'Figure 26.2: AUC-ROC of Test Data_Bagging Model')

[Figure 26.2: AUC-ROC of Test Data_Bagging Model]

Inference of Bagging Model:

Using the confusion matrix, the True Positive, False Positive, False Negative, and True Negative values can be extracted, which aid in the calculation of the accuracy score, precision score, recall score, and f1 score. Listing below the model performance metrics:
Train Data:

+ True Positive: 307
+ False Positive: 0
+ False Negative: 0
+ True Negative: 754
+ AUC: 100%
+ Accuracy: 100%
+ Precision: 100%
+ f1-Score: 100%
+ Recall: 100%

Test Data:

+ True Positive: 108
+ False Positive: 37
+ False Negative: 45
+ True Negative: 266
+ AUC: 88.1%
+ Accuracy: 82%
+ Precision: 86%
+ f1-Score: 87%
+ Recall: 88%

+ Clearly, our model has better performance on the training set than on the test set, so it is likely that the model has overfitted. This is a big red flag, as our model has 100% accuracy on the training set but only 82% accuracy on the test set. Generally bagging is used to avoid problems of overfitting, but in this model, while sampling with replacement, some observations may have been repeated in each subset. Hence, our model is overfitting.
+ We know that FPR tells us what proportion of the negative class got incorrectly classified by the classifier. Here, we have a higher TNR and a lower FPR, which is desirable for classifying the negative class.
+ Here, both Type I Error (False Positives) and Type II Error (False Negatives) are low for Test Data, indicating high Sensitivity/Recall, Precision, Specificity and F1 Score.

We will now try to build the model using Boosting. While bagging and boosting are both ensemble methods, they approach the problem from opposite directions. Bagging uses complex base models and tries to "smooth out" their predictions, while boosting uses simple base models and tries to "boost" their aggregate complexity.

Boosting: In Boosting, base estimators are built sequentially, and one tries to reduce the bias of the combined estimator. The idea is to combine several weak models to produce a powerful ensemble.

Boosting is a sequential process, where each subsequent model attempts to correct the errors of the previous model. Boosting is focused on reducing the bias.
This makes the boosting algorithms prone to overfitting. To choose a different distribution for each round we use the following steps:

Step 1: The base learner takes all the distributions and assigns equal weight or attention to each observation.
Step 2: If there is any prediction error caused by the first base learning algorithm, then we pay higher attention to observations having prediction error. Then, we apply the next base learning algorithm.
Step 3: Iterate Step 2 till the limit of the base learning algorithm is reached or higher accuracy is achieved.

There are three types of Boosting Algorithms, which are as follows:

1. AdaBoost (Adaptive Boosting) algorithm.
2. Gradient Boosting algorithm.
3. XG Boost algorithm.

Here, we will use the AdaBoost and Gradient Boosting algorithms. Below are the key parameters for tuning:

+ n_estimators: It controls the number of weak learners.
+ learning_rate: Controls the contribution of weak learners in the final combination. There is a trade-off between learning_rate and n_estimators.
+ base_estimator: It helps to specify a different ML algorithm.

In [154]: from sklearn.ensemble import AdaBoostClassifier

In [155]: ADB_model = AdaBoostClassifier(n_estimators=100, random_state=1)
          ADB_model.fit(X_train, y_train)

Out[155]: AdaBoostClassifier(n_estimators=100, random_state=1)

In [156]: y_train_predict = ADB_model.predict(X_train)
          ADB_model_score = ADB_model.score(X_train, y_train)
          print("Model score with ADA Boosting algorithm is", ADB_model_score)

Model score with ADA Boosting algorithm is 0.8501413760603205

In [157]: print(metrics.confusion_matrix(y_train, y_train_predict))
          print(metrics.classification_report(y_train, y_train_predict))

[[214  93]
 [ 66 688]]

              precision    recall  f1-score   support

           0       0.76      0.70      0.73       307
           1       0.88      0.91      0.90       754

    accuracy                           0.85      1061
   macro avg       0.82      0.80      0.81      1061
weighted avg       0.85      0.85      0.85      1061
In [158]: ax=sns.heatmap(metrics.confusion_matrix(y_train, y_train_predict), annot=True, fmt='d')
          plt.xlabel('Predicted Label')
          plt.ylabel('Actual Label')
          plt.title('Figure 27.1: Confusion Matrix of Train Data_ADA Boosting')
          plt.show()

[Figure 27.1: Confusion Matrix of Train Data_ADA Boosting]

AUC_ROC Curve of Train Data_ADA Boosting:
