DSC 7: Introduction to Business Analytics

The logistic function is a mathematical function that assigns values between 0 and 1 based on the input variable. It is characterized by its S-shaped curve and is commonly used in statistics, machine learning, and neural networks to model non-linear relationships and provide probabilistic interpretations.

3.3.2 Estimation of probabilities using logistic function

The logistic function is often used for estimating probabilities in various fields. By applying the logistic function to a linear combination of input variables, such as in logistic regression, the output is transformed into a probability value between 0 and 1. This allows observations to be classified based on their likelihoods.

3.4 OMNIBUS TEST

The Omnibus test is used to test the joint significance of several model parameters at once. It examines whether the combined effect of the predictors is statistically significant by examining the difference in deviance between the full model (with predictors) and the reduced model (without predictors). The Omnibus statistic is:

Omnibus = D_r - D_f

where D_r represents the deviance of the reduced model (without predictors) and D_f represents the deviance of the full model (with predictors). The statistic approximately follows a chi-square distribution with degrees of freedom given by the difference in the number of predictors between the full and reduced models.

By comparing the statistic to the chi-square distribution and calculating the associated p-value, we can evaluate the collective statistical significance of the predictor variables. When the calculated p-value falls below a predefined significance level (e.g., 0.05), we reject the null hypothesis and conclude that the group of predictor variables collectively has a statistically significant effect on the dependent variable.

Let's consider an example of testing the overall significance of a logistic regression model with three predictor variables (X1, X2, X3) and a dependent variable (Y). We want to assess the collective significance of the predictor variables using the Omnibus test.
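The probability estimation described in Section 3.3.2 can be sketched in a few lines of Python. The coefficients and the input observation below are illustrative values, not taken from the text:

```python
import math

def logistic(z):
    """Logistic (sigmoid) function: maps any real z into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_probability(intercept, coefs, x):
    """Apply the logistic function to a linear combination of inputs."""
    z = intercept + sum(b * xi for b, xi in zip(coefs, x))
    return logistic(z)

# Hypothetical coefficients and one observation (illustrative only)
p = predict_probability(intercept=-1.5, coefs=[0.8, 0.3], x=[2.0, 1.0])
print(round(p, 3))  # 0.599 -- a probability strictly between 0 and 1
```

Whatever the linear combination evaluates to, the logistic function squashes it into (0, 1), which is what makes it usable as a probability.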
Step 1: Fit the Full Model

We start by fitting the full logistic regression model that includes all three predictor variables:

logit(p) = β0 + β1*X1 + β2*X2 + β3*X3

By using statistical software, we obtain the estimated coefficients and the deviance of the full model:

β0 = 8.463, β1 = 0.643, β2 = 0.24
Deviance_full = 5.274

Step 2: Fit the Reduced Model

Next, we fit the reduced model, which only includes the intercept term:

logit(p) = β0

© Department of Distance & Continuing Education, Campus of Open Learning, School of Open Learning, University of Delhi

Similarly, we obtain the deviance of the reduced model:

Deviance_reduced = 15.924

Step 3: Calculate the Omnibus Test Statistic

Using the deviances obtained from the full and reduced models, we can calculate the Omnibus test statistic:

Omnibus = Deviance_reduced - Deviance_full = 15.924 - 5.274 = 10.650

Step 4: Conduct the Hypothesis Test

To assess the statistical significance of the predictors, we compare the Omnibus test statistic to the chi-square distribution with degrees of freedom equal to the difference in the number of predictors between the full and reduced models. In this case, the difference is 3 (since the full model has three predictors and the reduced model has none).

By referring to the chi-square distribution table or using statistical software, we determine the p-value associated with the Omnibus test statistic: here it is approximately 0.014. Since this is below the predetermined significance level (e.g., 0.05), we reject the null hypothesis and conclude that the predictor variables (X1, X2, X3) collectively have a statistically significant effect on the dependent variable (Y).

3.5 WALD TEST

The Wald test is used to test the significance of an individual predictor by examining whether its estimated coefficient deviates significantly from a hypothesized value. The Wald test statistic for an individual predictor is:

W = (β - β0)^2 / Var(β)

where β is the estimated coefficient for the predictor variable of interest, β0 is the hypothesized value of the coefficient under the null hypothesis (typically 0, for testing whether the coefficient is zero), and Var(β) is the estimated variance of the coefficient.

The Wald test statistic is compared to the chi-square distribution.
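The Omnibus calculation in Steps 3 and 4 above can be reproduced without statistical software. The sketch below uses only the Python standard library; the chi-square survival function (1 minus the CDF) for 3 degrees of freedom has a closed form:

```python
import math

def chi2_sf_3df(x):
    """Survival function of the chi-square distribution with 3 degrees
    of freedom: erfc(sqrt(x/2)) + sqrt(2x/pi) * exp(-x/2)."""
    return math.erfc(math.sqrt(x / 2)) + math.sqrt(2 * x / math.pi) * math.exp(-x / 2)

deviance_reduced = 15.924  # intercept-only model, from the example
deviance_full = 5.274      # model with X1, X2, X3

omnibus = deviance_reduced - deviance_full
p_value = chi2_sf_3df(omnibus)

print(round(omnibus, 3))  # 10.65
print(p_value < 0.05)     # True: the predictors are jointly significant
```

The exact p-value comes out at about 0.014, below 0.05, so the conclusion of the hypothesis test is unchanged.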
The degrees of freedom are set to 1 (since we are testing a single parameter) to obtain the associated p-value. We reject the null hypothesis when the calculated p-value falls below a predetermined significance level (e.g., 0.05), indicating that the predictor variable has a statistically significant impact on the dependent variable.

The Wald test allows us to determine the individual significance of predictor variables by testing whether their coefficients significantly deviate from zero. It is a valuable tool for identifying which variables have a meaningful impact on the outcome of interest in a regression model.

Let's consider an example where we have a logistic regression model with two predictor variables (X1 and X2) and a binary outcome variable (Y), fitted on a sample dataset with the predictor variables and the binary outcome variable. We want to assess the significance of the coefficient for each predictor using the Wald test.

Step 1: Fit the Logistic Regression Model

We start by fitting the logistic regression model with the predictor variables X1 and X2:

logit(p) = β0 + β1*X1 + β2*X2

By using statistical software, we obtain the estimated coefficients and their standard errors:

β0 = -1.613, β1 = 0.921, β2 = 0.372
SE(β1) = 0.688, SE(β2) = 0.295
Step 2: Calculate the Wald Test Statistic

Next, we calculate the Wald test statistic for each predictor variable using the formula:

W = (β - 0)^2 / Var(β)

For X1:
W1 = (0.921 - 0)^2 / (0.688)^2 = 1.790

For X2:
W2 = (0.372 - 0)^2 / (0.295)^2 = 1.590

Step 3: Conduct the Hypothesis Test

To assess the significance of each predictor, we compare the Wald test statistic for each variable to the chi-square distribution with one degree of freedom (since we are testing a single parameter). By referring to the chi-square distribution table or using statistical software, we determine the p-value associated with each Wald test statistic: the p-value for X1 is approximately 0.181 and the p-value for X2 is approximately 0.207.

Step 4: Interpret the Results

For X1, since the p-value (0.181) is greater than the predetermined significance level (e.g., 0.05), we fail to reject the null hypothesis. This suggests that the coefficient for X1 is not statistically significantly different from zero, indicating that X1 may not have a significant effect on the binary outcome variable Y.

Similarly, for X2, since the p-value (0.207) is greater than the significance level, we fail to reject the null hypothesis. The coefficient for X2 is not significantly different from zero, indicating that X2 may not have a significant effect on the binary outcome variable Y.

In summary, based on the Wald tests, we do not have sufficient evidence to conclude that either X1 or X2 has a significant impact on the binary outcome variable in the logistic regression model.

IN-TEXT QUESTIONS

1. What distribution does the Wald test statistic follow?
a) The F-distribution
b) The t-distribution
c) The normal distribution
d) The chi-square distribution

2. What does the Omnibus test assess in a regression model?
a) The individual significance of predictor variables
b) The collinearity between predictor variables
c) The overall significance of the regression model
d) The goodness of fit of the regression model

3.6 HOSMER LEMESHOW TEST

The Hosmer-Lemeshow test is a statistical test used to evaluate the goodness-of-fit of a logistic regression model. It assesses how well the predicted probabilities from the model align with the observed outcomes. The test is based on dividing the observations into groups, or "bins," based on the predicted probabilities of the logistic regression model.
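The Wald statistics and p-values in the example above can be checked with a short script; for 1 degree of freedom, the chi-square survival function reduces to erfc(sqrt(x/2)) (standard library only):

```python
import math

def chi2_sf_1df(x):
    """Survival function of the chi-square distribution with 1 degree of freedom."""
    return math.erfc(math.sqrt(x / 2))

def wald_statistic(beta, se, beta0=0.0):
    """Wald statistic W = (beta - beta0)^2 / Var(beta), with Var(beta) = se^2."""
    return (beta - beta0) ** 2 / se ** 2

# Coefficients and standard errors from the worked example
w1 = wald_statistic(0.921, 0.688)
w2 = wald_statistic(0.372, 0.295)

p1 = chi2_sf_1df(w1)
p2 = chi2_sf_1df(w2)

print(round(w1, 2), round(p1, 2))  # 1.79 0.18 -> not significant at 0.05
print(round(w2, 2), round(p2, 2))  # 1.59 0.21 -> not significant at 0.05
```

Both p-values exceed 0.05, matching the interpretation in Step 4.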
The formula for the Hosmer-Lemeshow test statistic is:

HL = Σ (O_i - E_i)^2 / E_i

where O_i is the observed number of events in bin i, E_i is the expected number of events in bin i (the sum of the predicted probabilities of the cases in that bin), and the sum runs over all bins.

A non-significant result (p > 0.05) indicates that the predicted probabilities align closely with the observed outcomes. Conversely, a significant result (p < 0.05) suggests a lack of fit, indicating that the model may not accurately represent the data.

The Hosmer-Lemeshow test is a valuable tool in assessing the goodness-of-fit of logistic regression models, allowing us to evaluate the model's performance in predicting outcomes based on observed and predicted probabilities.

Let's consider the example again with the logistic regression model predicting the probability of a disease (Y) based on a single predictor variable (X). We will divide the predicted probabilities into three bins and calculate the observed and expected frequencies in each bin.

Step 1: Fit the Logistic Regression Model

By fitting the logistic regression model, we obtain the predicted probability for each observation based on the predictor variable X.

Step 2: Divide the Predicted Probabilities into Bins

Let's divide the predicted probabilities into three bins: [0.1-0.3], [0.3-0.5], and [0.5-0.9].

Step 3: Calculate the Observed and Expected Frequencies in Each Bin

Now, we calculate the observed and expected frequencies in each bin, where the expected count is the sum of the predicted probabilities of the cases in the bin.

Bin [0.1-0.3]:
Total cases in bin: 3
Observed cases (Y = 1): 1
Expected cases: 1.23

Bin [0.3-0.5]:
Total cases in bin: 4
Observed cases (Y = 1): 2
Expected cases: 3.52

Bin [0.5-0.9]:
Total cases in bin: 3
Observed cases (Y = 1): 2
Expected cases: 2.80

Step 4: Calculate the Hosmer-Lemeshow Test Statistic

We calculate the Hosmer-Lemeshow test statistic by summing the contributions from each bin:

HL = ((O1 - E1)^2 / E1) + ((O2 - E2)^2 / E2) + ((O3 - E3)^2 / E3)
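This summation is straightforward to express as a helper function. The sketch below uses observed counts of 1, 2 and 2 and expected counts of 1.23, 3.52 and 2.80 per bin, following the worked example (the third expected count is an assumed value):

```python
def hosmer_lemeshow(observed, expected):
    """Hosmer-Lemeshow statistic: sum over bins of (O - E)^2 / E."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

observed = [1, 2, 2]           # observed Y = 1 counts per bin
expected = [1.23, 3.52, 2.80]  # sums of predicted probabilities per bin

hl = hosmer_lemeshow(observed, expected)
print(round(hl, 2))  # 0.93
```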
HL = ((1 - 1.23)^2 / 1.23) + ((2 - 3.52)^2 / 3.52) + ((2 - 2.80)^2 / 2.80)
   = 0.043 + 0.656 + 0.229
   = 0.93 (approximately)

Step 5: Conduct the Hypothesis Test

We compare the Hosmer-Lemeshow test statistic (HL) to the chi-square distribution with degrees of freedom equal to (number of bins - 2); with three bins, this gives 1 degree of freedom. By referring to the chi-square distribution table or using statistical software, the critical value for a significance level of 0.05 is 3.841.

Since the calculated test statistic (0.93) is less than the critical value (3.841), we fail to reject the null hypothesis. This suggests that the logistic regression model fits the data well.

Step 6: Interpret the Results

Based on the Hosmer-Lemeshow test result in this example, the calculated test statistic (0.93) is below the critical value, indicating a good fit between the observed and expected frequencies in the different bins. There is no evidence to suggest a lack of fit for the logistic regression model.

In summary, the Hosmer-Lemeshow test assesses the goodness of fit of a logistic regression model by comparing the observed and expected frequencies in different bins of predicted probabilities. In this example, the test result indicates that the model fits the data well.

3.7 PSEUDO R SQUARE

Pseudo R-square is a measure used in regression analysis, particularly in logistic regression, to assess the proportion of variance in the dependent variable explained by the predictor variables. It is called "pseudo" because it is not directly comparable to the R-squared used in linear regression.

There are various methods to calculate pseudo R-squared, and one commonly used method is Nagelkerke's R-squared. The formula is as follows:

R^2 = [1 - (L_null / L_model)^(2/n)] / [1 - (L_null / L_max)^(2/n)]

where L_model is the likelihood of the full model, L_null is the likelihood of the null model (a model with only an intercept term), L_max is the likelihood of a hypothetical model that perfectly predicts all outcomes, and n is the number of observations.

Nagelkerke's R-squared ranges from 0 to 1, with 0 indicating that the predictors have no explanatory power, and 1 indicating a perfect fit of the model.
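A direct transcription of Nagelkerke's formula is given below as a minimal sketch. The log-likelihood values and sample size are illustrative assumptions, not results quoted from a fitted model; for binary outcomes L_max = 1, so its log-likelihood is 0:

```python
import math

def nagelkerke_r2(ll_null, ll_model, n):
    """Nagelkerke's R-squared from log-likelihoods: the Cox-Snell value
    1 - exp(2*(ll_null - ll_model)/n), rescaled by its maximum possible
    value 1 - exp(2*ll_null/n) (which uses L_max = 1)."""
    cox_snell = 1.0 - math.exp(2.0 * (ll_null - ll_model) / n)
    max_r2 = 1.0 - math.exp(2.0 * ll_null / n)
    return cox_snell / max_r2

# Illustrative log-likelihoods for an assumed n = 100 observations
r2 = nagelkerke_r2(ll_null=-48.21, ll_model=-31.44, n=100)
print(round(r2, 3))  # 0.461
```

The rescaling is what lets Nagelkerke's measure reach 1 for a perfect model, unlike the raw Cox-Snell value.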
However, it is important to note that Nagelkerke's R-squared is an adjusted measure and should not be interpreted in the same way as R-squared in linear regression.

Pseudo R-squared provides an indication of how well the predictor variables explain the variance in the dependent variable in logistic regression. While it does not have the same interpretation as the proportion of variance explained in linear regression, it serves as a relative measure for comparing the goodness-of-fit of different models, and it indicates the improvement of the fitted model compared to the null model.

One commonly used pseudo R-squared measure is the Cox and Snell R-squared. Let's calculate the Cox and Snell R-squared for the given example of a logistic regression model with predictor variables X1 and X2.

Step 1: Fit the Logistic Regression Model

To calculate the Cox and Snell R-squared, we first fit the logistic regression model and obtain the estimated coefficients for each predictor.

Step 2: Calculate the Null Log-Likelihood (LL0)

To calculate the null log-likelihood, we fit a null model with only an intercept term. Let's assume that the null log-likelihood (LL0) is -48.21.

Step 3: Calculate the Full Log-Likelihood (LLF)

The full log-likelihood represents the maximum value of the log-likelihood for the fitted model. Let's assume that the full log-likelihood (LLF) is -31.44.

Step 4: Calculate the Cox and Snell R-Squared

The Cox and Snell R-squared is given by

R^2 = 1 - (L_null / L_full)^(2/n)

where L_null and L_full are the likelihoods of the null and full models and n is the number of observations. In terms of log-likelihoods, R^2 = 1 - exp(2*(LL0 - LLF)/n); for example, with n = 100 observations, R^2 = 1 - exp(2*(-48.21 + 31.44)/100) ≈ 0.285.

3.8 EVALUATION METRICS OF A CLASSIFICATION MODEL

Suppose a classification model X is implemented for detecting whether an input cell is cancerous. The model is evaluated on 100 random cells, of which 10 are actually cancerous while the rest are non-cancerous. Here, a cancerous cell is considered the positive class, while a non-cancerous cell is considered the negative class. The confusion matrix is the primary building block on which the evaluation metrics of a classification model are defined. Its four counts are:

True Positive (TP): The number of cancerous input cells for which the classification model correctly predicts that they are cancerous. For example, for the model X, TP = 5.
True Negative (TN): The number of non-cancerous input cells for which the classification model correctly predicts that they are non-cancerous. For example, for the model X, TN = 75.

False Positive (FP): The number of non-cancerous input cells for which the classification model incorrectly predicts that they are cancerous. For example, for the model X, FP = 15.

False Negative (FN): The number of cancerous input cells for which the classification model incorrectly predicts that they are non-cancerous. For example, for the model X, FN = 5.

                          Actual
                  Cancerous   Non-Cancerous
Predicted
Cancerous          TP = 5       FP = 15
Non-Cancerous      FN = 5       TN = 75

3.8.1 Sensitivity

Sensitivity, also referred to as the True Positive Rate or Recall, is calculated as the ratio of correctly predicted cancerous cells to the total number of cancerous cells in the ground truth. To compute sensitivity, you can use the following formula:

Sensitivity = TP / (TP + FN)

3.8.2 Specificity

Specificity, also referred to as the True Negative Rate, is defined as the ratio of the number of input cells that are correctly predicted as non-cancerous to the total number of non-cancerous cells in the dataset. To compute specificity, we can use the formula:

Specificity = TN / (TN + FP)

3.8.3 Accuracy

Accuracy is calculated as the ratio of correctly classified cells to the total number of cells. To compute accuracy, you can use the following formula:

Accuracy = (TP + TN) / (TP + TN + FP + FN)

3.8.4 Precision

Precision is defined as the ratio of correctly predicted cancerous cells to the total number of cells predicted as cancerous by the model. To compute precision, you can use the following formula:

Precision = TP / (TP + FP)

3.9 GINI COEFFICIENT

A metric used to assess inequality is the Gini coefficient, also referred to as the Gini index. The Gini coefficient has a value between 0 and 1, and the performance of the model improves with increasing Gini coefficient values. The Gini coefficient can be computed from the ROC curve using the formula:

Gini Coefficient = 2 * AUC - 1

3.10 ROC

For logistic regression or other machine learning techniques, the performance of a classification model is assessed using a graphical representation called the Receiver Operating Characteristic (ROC) curve.
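The four metrics defined above can be computed directly from the confusion-matrix counts of model X (TP = 5, FP = 15, FN = 5, TN = 75):

```python
def classification_metrics(tp, fp, fn, tn):
    """Sensitivity, specificity, accuracy and precision from confusion-matrix counts."""
    return {
        "sensitivity": tp / (tp + fn),  # true positive rate (recall)
        "specificity": tn / (tn + fp),  # true negative rate
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "precision": tp / (tp + fp),
    }

m = classification_metrics(tp=5, fp=15, fn=5, tn=75)
print(m["sensitivity"])            # 0.5  -> half of the cancerous cells are caught
print(round(m["specificity"], 3))  # 0.833
print(m["accuracy"])               # 0.8
print(m["precision"])              # 0.25
```

Note how accuracy (0.8) looks respectable even though precision is only 0.25; with imbalanced classes such as 10 cancerous cells out of 100, no single metric tells the whole story.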
The ROC curve demonstrates the trade-off between the true positive rate (sensitivity) and the false positive rate (1 minus specificity) for various classification thresholds. Plotting the true positive rate (TPR) against the false positive rate (FPR) at various classification thresholds results in the ROC curve. The formulas for TPR and FPR are:

TPR = TP / (TP + FN)
FPR = FP / (FP + TN)

A perfect classifier corresponds to the top-left corner of the plot, with a TPR of 1 and an FPR of 0; the closer the ROC curve lies to the top-left corner, the better the model's discriminative ability.

3.11 AREA UNDER THE CURVE (AUC)

When comparing classifiers, the Area Under the ROC Curve (AUC) summarizes the entire ROC curve in a single number between 0 and 1. It can be interpreted as the probability that the model assigns a higher score to a randomly chosen positive instance than to a randomly chosen negative one; larger values indicate better performance.

4.2 INTRODUCTION

Decision Tree is a popular machine learning approach for classification and regression. Its structure is similar to a flowchart, where internal nodes represent features, branches depict decision rules, and leaf nodes signify outcomes or predicted values. The data are divided recursively by the decision tree algorithm, which chooses the best feature for partitioning the data at each stage by analysing parameters like information gain or Gini impurity; the goal is to divide the data so that the resulting subsets are as homogeneous as possible.

Fig 4.1: Decision Tree for a classification scenario of a mammal

After it has been constructed, the decision tree can be used to generate predictions on fresh, unseen data by choosing a path through the tree based on feature values. Figure 4.1 shows how a decision tree helps classify an animal based on a series of questions. The flowchart begins with the question, "Is it a mammal?" If the answer is "Yes," we follow the branch on the left. The next question asks, "Does it have spots?" If the answer is "Yes," we conclude that it is a leopard; if the answer is "No," we determine it is a cheetah. If the answer to the initial question, "Is it a mammal?", is "No,"
we follow the branch on the right. The next question asks, "Is it a bird?" If the answer is "Yes," we classify it as a parrot; if the answer is "No," we classify it as a fish. The tree demonstrates a classification scenario in which, by following the flowchart through the questions to a final classification, we determine the type of animal based on its specific attributes.

4.3 CLASSIFICATION AND REGRESSION TREE

A popular machine learning approach for classification and regression tasks is the Classification and Regression Tree (CART). It is a decision-tree-based method that divides the data into subsets according to the values of the input features and predicts the target variable. One of the reasons CART is popular is how easy it is to understand.

CART is expressed as a binary tree structure. Each internal node represents a feature and a split point, and each leaf node represents a class label or a predicted value. The method divides the data iteratively into subsets, with the goal of producing subsets that are homogeneous with regard to the predicted variable. In classification tasks, CART measures the quality of each candidate split within a node using a criterion such as Gini impurity or entropy, selecting the best feature and split point at each node so that the resulting tree correctly categorises the training data and can forecast new instances. In regression tasks, split quality is measured using the mean squared error (MSE).
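CART's split-quality criterion for classification can be sketched with a Gini-impurity helper; the label sets below are made up for illustration:

```python
def gini_impurity(labels):
    """Gini impurity of a set of class labels: 1 - sum over classes of p_k^2."""
    n = len(labels)
    if n == 0:
        return 0.0
    counts = {}
    for label in labels:
        counts[label] = counts.get(label, 0) + 1
    return 1.0 - sum((c / n) ** 2 for c in counts.values())

def split_impurity(left, right):
    """Weighted Gini impurity of a candidate binary split."""
    n = len(left) + len(right)
    return (len(left) / n) * gini_impurity(left) + (len(right) / n) * gini_impurity(right)

# A pure split (each side holds one class) scores 0; a mixed split is penalised.
pure = split_impurity(["Yes", "Yes"], ["No", "No"])
mixed = split_impurity(["Yes", "No"], ["Yes", "No"])
print(pure)   # 0.0
print(mixed)  # 0.5
```

At each node, CART evaluates candidate splits this way and keeps the feature and split point with the lowest weighted impurity.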
