MACHINE LEARNING (BE - COMPUTER)
Handcrafted by BackkBenchers Publications
(Front matter: chapter-wise marks distribution table for recent exam papers.)

CHAP - 1: INTRODUCTION TO MACHINE LEARNING

Q1. What is Machine Learning? Explain how supervised learning is different from unsupervised learning.
Q2. Define Machine Learning? Briefly explain the types of learning.
Ans: [5M | May17 & May18]

MACHINE LEARNING:
1. Machine learning is an application of Artificial Intelligence (AI).
2. It gives systems the ability to automatically learn and improve from experience without being explicitly programmed.
3. Machine learning teaches computers to do what comes naturally to humans: learn from experience.
4. The primary goal of machine learning is to allow systems to learn automatically, without human intervention or assistance, and adjust actions accordingly.
5. Real life examples: Google Search Engine, Amazon.
6. Machine learning is helpful in:
a. Improving business decisions.
b. Increasing productivity.
c. Detecting disease.
d. Forecasting weather.

TYPES OF LEARNING:
I) Supervised Learning:
1. Supervised learning, as the name indicates, involves the presence of a supervisor acting as a teacher.
2. Basically, supervised learning is learning in which we teach or train the machine using data that is well labelled.
3. After that, the machine is provided with a new set of examples (data) so that the supervised learning algorithm analyses the training data.
4. The machine then produces a correct outcome from the labelled data.
5. Supervised learning is classified into two categories of algorithms:
a. Classification
b. Regression

II) Unsupervised Learning:
1. Unlike supervised learning, no teacher is provided, which means no training will be given to the machine.
2. Unsupervised learning is the training of a machine using information that is neither classified nor labelled.
3. It allows the algorithm to act on that information without guidance.
4. Unsupervised learning is classified into two categories of algorithms:
a. Clustering
b. Association

DIFFERENCE BETWEEN SUPERVISED AND UNSUPERVISED LEARNING:
Supervised learning | Unsupervised learning
Uses known and labelled data. | Uses unknown and unlabelled data.
Less complex to develop. | Very complex to develop.
Uses off-line analysis. | Uses real-time analysis.

Q3. Applications of Machine Learning algorithms.
Q4. Machine learning applications.
Ans: [10M | May16, Dec16 & May17]

APPLICATIONS:
I) Virtual Personal Assistants:
1. Siri, Alexa and Google Now are some of the popular examples of virtual personal assistants.
2. As the name suggests, they assist in finding information when asked over voice.
3. All you need to do is activate them and ask "What is my schedule for today", "What are the flights from Germany to London", or similar questions.

II) Image Recognition:
1. It is one of the most common machine learning applications.
2. There are many situations where you can classify the object as a digital image.
3. For digital images, the measurements describe the outputs of each pixel in the image.
4. In the case of a black and white image, the intensity of each pixel serves as one measurement.
III) Speech Recognition:
1. Speech recognition (SR) is the translation of spoken words into text.
2. In speech recognition, a software application recognizes spoken words.
3. The measurements in this machine learning application might be a set of numbers that represent the speech signal.
4. We can segment the signal into portions that contain distinct words or phonemes.
5. In each segment, we can represent the speech signal by the intensities or energy in different time-frequency bands.

IV) Medical Diagnosis:
1. ML provides methods, techniques and tools that can help in solving diagnostic and prognostic problems in a variety of medical domains.
2. It is being used for the analysis of the importance of clinical parameters and of their combinations for prognosis.
3. E.g. prediction of disease progression, or the extraction of medical knowledge for outcomes research.

V) Search Engine Result Refining:
1. Google and other search engines use machine learning to improve the search results for you.
2. Every time you execute a search, the algorithms at the backend keep a watch on how you respond to the results.

VI) Statistical Arbitrage:
1. In finance, statistical arbitrage refers to automated short-term trading strategies that involve a large number of securities.
2. In such strategies, the user tries to implement a trading algorithm for a set of securities on the basis of a classification or estimation problem.

VII) Learning Associations:
1. Learning association is the process of developing insights into various associations between products.
2. A good example is how seemingly unrelated products may reveal an association to one another when analyzed in relation to the buying behaviour of customers.

VIII) Classification:
1. Classification is a process of placing each individual from the population under study into one of many classes.
2. The classes are identified using independent variables.
3. Classification helps analysts to use measurements of an object to identify the category to which that object belongs.
4. To establish an efficient rule, analysts use data.

IX) Prediction:
1. Consider the example of a bank computing the probability of any loan applicant defaulting on the loan.
2. To compute the probability of the default, the system will first need to classify the available data into certain groups.
3. The classification is described by a set of rules prescribed by the analysts.
4. Once the classification is done, as per need we can compute the probability.

X) Extraction:
1. Information Extraction (IE) is another application of machine learning.
2. It is the process of extracting structured information from unstructured data.
3. For example: web pages, articles, blogs, business reports, and e-mails.
4. The process of extraction takes as input a set of documents and produces structured data.
Q5. Explain the steps required for selecting the right machine learning algorithm.
Ans: [10M | May16 & Dec17]

STEPS:
I) Understand Your Data:
1. The type and kind of data we have plays a key role in deciding which algorithm to use.
2. Some algorithms can work with smaller sample sets while others require tons and tons of samples.
3. Certain algorithms work with certain types of data.
4. E.g. Naive Bayes works well with categorical input but is not at all sensitive to missing data.
5. It includes the following (a short sketch after this answer illustrates the basic checks):
a. Know your data: Look at summary statistics and visualizations. Percentiles can help identify the range for most of the data. Averages and medians can describe central tendency.
b. Visualize the data: Box plots can identify outliers. Density plots and histograms show the spread of the data. Scatter plots can describe bivariate relationships.
c. Clean your data: Deal with missing values. Missing data affects some models more than others. Missing data for certain variables can result in poor predictions.
d. Augment your data: Feature engineering is the process of going from raw data to data that is ready for modelling. It can serve multiple purposes. Different models may have different feature engineering requirements; some have built-in feature engineering.

II) Categorize the problem:
1. Categorize by input:
a. If you have labelled data, it's a supervised learning problem.
b. If you have unlabelled data and want to find structure, it's an unsupervised learning problem.
c. If you want to optimize an objective function by interacting with an environment, it's a reinforcement learning problem.
2. Categorize by output:
a. If the output of your model is a number, it's a regression problem.
b. If the output of your model is a class, it's a classification problem.
c. If the output of your model is a set of input groups, it's a clustering problem.
d. Do you want to detect an anomaly? That's anomaly detection.

III) Understand your constraints:
1. What is your data storage capacity? Depending on the storage capacity of your system, you might not be able to store gigabytes of classification/regression models or gigabytes of data to cluster.
2. Does the prediction have to be fast? In real-time applications, for example in autonomous driving, it is important that the classification of road signs be as fast as possible to avoid accidents.
3. Does the learning have to be fast? In some circumstances training models quickly is necessary; sometimes you need to rapidly update your model, on the fly, with a different dataset.

IV) Find the available algorithms:
1. Once the problem is categorized and the constraints are understood, identify the algorithms that are applicable and practical to implement.
2. Some of the factors affecting the choice of a model are:
a. Whether the model meets the business goals.
b. How much pre-processing the model needs.
c. How accurate the model is.
d. How explainable the model is.
e. How fast the model is.
f. How scalable the model is.

V) Try each algorithm, assess and compare.
VI) Adjust and combine, using optimization techniques.
VII) Choose, operate and continuously measure.
VIII) Repeat.
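The data-understanding checks listed in step I can be tried in a few lines of pandas. This is only a sketch: the file name data.csv and the column contents are placeholders, and pandas (plus matplotlib for the plots) is assumed to be installed; it is not part of the syllabus answer above.

```python
# Quick "understand your data" checks: summary statistics, percentiles,
# missing values, and simple plots. "data.csv" is a placeholder file name.
import pandas as pd

df = pd.read_csv("data.csv")

print(df.describe())                  # count, mean, std, min, percentiles, max
print(df.median(numeric_only=True))   # central tendency of numeric columns
print(df.isna().sum())                # missing values per column

df.plot(kind="box")                   # box plots to spot outliers
df.hist()                             # histograms to see the spread of the data
```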
Q6. What are the steps in designing a machine learning problem? Explain with the checkers problem.
Q7. Explain the steps in developing a machine learning application.
Q8. Explain the procedure to design a machine learning system.
Ans: [10M | May17, May18 & Dec18]

STEPS FOR DEVELOPING ML APPLICATIONS:
I) Gathering data:
1. This step is very important because the quality and quantity of the data that you gather will directly determine how good your predictive model can be.
2. We have to collect data from different sources for our ML application training purpose.
3. This may include collecting samples by scraping a website or extracting data from an RSS feed or an API.

II) Preparing the data:
1. Data preparation is where we load our data into a suitable place and prepare it for use in our machine learning training.
2. In this step the data is also put into a format that the chosen algorithms can work with.

III) Choosing a model:
1. There are many models that data scientists and researchers have created over the years.
2. Some are well suited for image data, others for sequence data, and some for numerical data.
3. It involves recognising patterns, identifying outliers and detecting novelty.

IV) Training:
1. In this step we will use our data to incrementally improve our model's ability to predict.
2. Depending on the algorithm, we feed it good, clean data from the previous steps and extract knowledge or information.
3. The knowledge extracted is stored in a format that is readily usable by a machine for the next steps.

V) Evaluation:
1. Once the training is complete, it is time to check whether the model is good, using evaluation.
2. This is where the testing dataset comes into play.
3. Evaluation allows us to test our model against data that has never been used for training.

VI) Parameter tuning:
1. Once we are done with evaluation, we want to see if we can further improve our training in any way.
2. We can do this by tuning our parameters.

VII) Prediction:
1. This is the step where we get to answer questions with the model.
2. It is the point where the value of machine learning is realized.

CHECKER LEARNING PROBLEM:
1. A computer program that learns to play checkers might improve its performance, as measured by its ability to win at the class of tasks involving playing checkers games, through experience obtained by playing games against itself.
2. Choosing a training experience:
a. The type of training experience available to a system can have a significant impact on the success or failure of the learning system.
b. One key attribute is whether the training experience provides direct or indirect feedback regarding the choices made by the performance system.
c. A second attribute is the degree to which the learner controls the sequence of training examples.
3. Assumptions:
a. Let us assume that our system will train by playing games against itself.
b. It is allowed to generate as much training data as time permits.
4. Issues related to experience:
a. What type of knowledge/experience should one learn?
b. How to represent the experience?
c. What should be the learning mechanism?
5. Target Function:
a. ChooseMove: B → M
b. ChooseMove is a function where the input B is the set of legal board states and the output M is the set of legal moves.
c. M = ChooseMove(B)
6. Representation of the Target Function:
a. x1: the number of white pieces on the board.
b. x2: the number of red pieces on the board.
c. x3: the number of white kings on the board.
d. x4: the number of red kings on the board.
e. x5: the number of white pieces threatened by red (i.e., which can be captured on red's next turn).
f. x6: the number of red pieces threatened by white.
g. V(b) = w0 + w1·x1 + w2·x2 + w3·x3 + w4·x4 + w5·x5 + w6·x6
7. The problem of learning a checkers strategy then reduces to the problem of learning values for the coefficients w0 through w6 in the target function representation.
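To make the checkers target function concrete, the following is a minimal sketch of the linear evaluation V(b) = w0 + w1·x1 + ... + w6·x6 together with an LMS-style weight update. The board feature values, training value and learning rate are illustrative only, not taken from the text.

```python
# Sketch of the checkers target function V(b) and an LMS-style weight update.
# Feature values, the training value v_train and the learning rate are made up.

def evaluate(weights, features):
    """V(b): weighted sum of board features; weights[0] is the constant term w0."""
    return weights[0] + sum(w * x for w, x in zip(weights[1:], features))

def lms_update(weights, features, v_train, lr=0.01):
    """Move each weight a small step toward the training value for this board."""
    error = v_train - evaluate(weights, features)
    weights[0] += lr * error                 # constant term uses x0 = 1
    for i, x in enumerate(features, start=1):
        weights[i] += lr * error * x
    return weights

# x1..x6 as defined above: piece counts, king counts, threatened pieces.
board_features = [12, 12, 0, 0, 1, 2]
weights = [0.0] * 7                          # w0..w6 start at zero
weights = lms_update(weights, board_features, v_train=1.0)
print(evaluate(weights, board_features))
```

Repeating this update over many self-play positions is what gradually tunes the coefficients w0 through w6.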
Q9. What are the key tasks of Machine Learning?
Ans: [5M | May16 & Dec17]

CLASSIFICATION:
1. If we have data, say pictures of animals, we can classify them.
2. This animal is a cat, that animal is a dog, and so on.
3. A computer can do the same task using a machine learning algorithm that is designed for the classification task.
4. In the real world, this is used for tasks like voice classification and object detection.
5. This is a supervised learning task: we give training data to teach the algorithm the classes they belong to (a minimal end-to-end example appears at the end of this chapter).

REGRESSION:
1. Sometimes you want to predict values: What are the sales next month? What is the salary for a job?
2. Those types of problems are regression problems.
3. The aim is to predict the value of a continuous response variable.
4. This is also a supervised learning task.

CLUSTERING:
1. Clustering is to create groups of data, called clusters.
2. Observations are assigned to a group based on the algorithm.
3. This is an unsupervised learning task; clustering happens fully automatically.
4. Imagine having a bunch of documents on your computer: the computer can organize them into clusters based on their content automatically.

FEATURE SELECTION:
1. Feature selection is the task of identifying the attributes that are most relevant to the problem.
2. Removing irrelevant or redundant attributes helps build models with higher accuracy.
3. It also reduces the dimensionality of the data and the training time.
4. Common techniques used for feature selection include filter, wrapper and embedded methods.

TESTING AND MATCHING:
1. Testing and matching tasks relate to comparing data items or data sets against each other.
2. A typical example is matching a new record against stored patterns or templates.

Q10. What are the issues in Machine Learning?
Ans: [5M | Dec16 & May18]

ISSUES:
1. What algorithms exist for learning general target functions from specific training examples? In what settings will particular algorithms converge to the desired function, given sufficient training data?
2. Which algorithms perform best for which types of problems and representations?
3. How much training data is sufficient?
4. What is the best way to reduce the learning task to one or more function approximation problems?
5. When and how can prior knowledge held by the learner guide the process of generalizing from examples?
6. What is the best strategy for choosing a useful next training experience, and how does the choice of this strategy alter the complexity of the learning problem?
7. How can the learner automatically alter its representation to improve its ability to represent and learn the target function?

Q11. Define well posed learning problem. Hence, define robot driving learning problem.
Ans: [5M | Dec18]

WELL-POSED LEARNING PROBLEMS:
1. A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E.
2. It identifies the following three features:
a. Class of tasks.
b. Measure of performance to be improved.
c. Source of experience.
3. Examples:
a. Learning to classify chemical compounds.
b. Learning to drive an autonomous vehicle.
c. Learning to play bridge.
d. Learning to parse natural language sentences.

ROBOT DRIVING PROBLEM:
1. Task (T): Driving on public, 4-lane highways using vision sensors.
2. Performance measure (P): Average distance travelled before an error (as judged by a human overseer).
3. Training experience (E): A sequence of images and steering commands recorded while observing a human driver.
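Before moving to regression, the steps from Q6-Q8 (prepare data, choose a model, train, evaluate, predict) and the classification task from Q9 can be seen together in one short scikit-learn sketch. scikit-learn is assumed to be available, and the built-in iris dataset and the decision tree model are stand-ins for whatever data and algorithm a real application would use.

```python
# End-to-end sketch of the ML application steps: prepare data, train a model,
# evaluate it on held-out data, then predict. Dataset and model are illustrative.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)                      # "gathered" data
X_train, X_test, y_train, y_test = train_test_split(   # prepare: split train/test
    X, y, test_size=0.3, random_state=0)

model = DecisionTreeClassifier(max_depth=3)            # choose a model
model.fit(X_train, y_train)                            # training

pred = model.predict(X_test)                           # evaluation on unseen data
print("accuracy:", accuracy_score(y_test, pred))

print(model.predict(X_test[:1]))                       # prediction for a new sample
```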
CHAP - 2: LEARNING WITH REGRESSION

Q1. Logistic Regression.
Ans: [10M | May17, May18 & Dec18]

LOGISTIC REGRESSION:
1. Logistic Regression is one of the basic and popular algorithms used to solve a classification problem.
2. It is the go-to method for binary classification problems (problems with two class values).
3. Linear regression algorithms are used to predict/forecast values, but logistic regression is used for classification tasks.
4. The term "Logistic" is taken from the logit function that is used in this method of classification.
5. The logistic function, also called the sigmoid function, was developed to describe the growth of a population in ecology: rising quickly and maxing out at the carrying capacity of the environment.
6. It is an S-shaped curve that can take any real-valued number and map it into a value between 0 and 1, but never exactly at those limits.
7. Figure 2.1 shows an example of the logistic function.
(Figure 2.1: Logistic (sigmoid) function.)

SIGMOID FUNCTION (LOGISTIC FUNCTION):
1. The logistic regression algorithm also uses a linear equation with independent predictors to predict a value.
2. The predicted value can be anywhere between negative infinity and positive infinity.
3. We need the output of the algorithm to be a class variable, i.e. 0 - no, 1 - yes.
4. Therefore, we squash the output of the linear equation into the range [0, 1].
5. To squash the predicted value between 0 and 1, we use the sigmoid function.

Linear Equation:
z = θ0 + θ1·x1 + θ2·x2 + ...

Sigmoid Function:
g(z) = 1 / (1 + e^(-z))

Squashed Output:
h = g(z) = 1 / (1 + e^(-(θ0 + θ1·x1 + θ2·x2 + ...)))

COST FUNCTION:
1. Since we are trying to predict class values, we cannot use the same cost function used in the linear regression algorithm.
2. Therefore, we use a logarithmic loss function to calculate the cost for misclassifying:
Cost(hθ(x), y) = -log(hθ(x))       if y = 1
Cost(hθ(x), y) = -log(1 - hθ(x))   if y = 0

CALCULATING GRADIENTS:
1. We take partial derivatives of the cost function with respect to each parameter (θ0, θ1, ...) to obtain the gradients.
2. With the help of these gradients, we can update the values of θ0, θ1, etc.
3. Logistic regression does assume a linear relationship between the logit of the explanatory variables and the response.
4. Independent variables can even be power terms or some other nonlinear transformations of the original independent variables.
5. The dependent variable does NOT need to be normally distributed, but it typically assumes a distribution from an exponential family (e.g. binomial, Poisson, multinomial, normal); binary logistic regression assumes a binomial distribution of the response.
6. The homogeneity of variance does NOT need to be satisfied.
7. Errors need to be independent but NOT normally distributed.
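The pieces described in this answer (linear score, sigmoid squashing, logarithmic loss and gradient updates) can be put together in a small NumPy sketch. The toy data, starting parameters and learning rate below are made up for illustration; NumPy is assumed to be available.

```python
# Logistic regression pieces: linear score z, sigmoid g(z), log loss, gradient step.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def predict_prob(theta, X):
    # z = theta0 + theta1*x1 + ...; X already carries a leading column of 1s
    return sigmoid(X @ theta)

def log_loss(theta, X, y):
    h = predict_prob(theta, X)
    return -np.mean(y * np.log(h) + (1 - y) * np.log(1 - h))

def gradient_step(theta, X, y, lr=0.1):
    h = predict_prob(theta, X)
    grad = X.T @ (h - y) / len(y)      # partial derivatives w.r.t. each theta
    return theta - lr * grad

X = np.array([[1.0, 0.5], [1.0, 1.5], [1.0, 3.0], [1.0, 4.5]])  # first column = bias
y = np.array([0, 0, 1, 1])
theta = np.zeros(2)
for _ in range(1000):
    theta = gradient_step(theta, X, y)
print(theta, log_loss(theta, X, y))
```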
Q2. Explain in brief Linear Regression Technique.
Q3. Explain the concepts behind Linear Regression.
Ans: [5M | May16 & Dec17]

LINEAR REGRESSION:
1. Linear Regression is a machine learning algorithm based on supervised learning.
2. It is a simple machine learning model for regression problems, i.e., when the target variable is a real value.
3. It is used to predict a quantitative response y from the predictor variable x.
4. It is made with the assumption that there is a linear relationship between x and y.
5. This method is mostly used for forecasting and finding out cause and effect relationships between variables.
6. Figure 2.2 shows an example of linear regression.
(Figure 2.2: Linear regression - data points with a fitted straight line.)
7. The red line in the above graph is referred to as the best fit straight line.
8. Based on the given data points, we try to plot a line that models the points the best.
9. For example, in a simple regression problem (a single x and a single y), the form of the model would be:
y = a0 + a1·x ... (Linear Equation)
10. The motive of the linear regression algorithm is to find the best values for a0 and a1.

COST FUNCTIONS:
1. The cost function helps us to figure out the best possible values for a0 and a1 which would provide the best fit line for the data points.
2. We convert this search problem into a minimization problem where we would like to minimize the error between the predicted value and the actual value.
3. The minimization and cost function are given below:
minimize (1/n) Σ (pred_i - y_i)²
J = (1/n) Σ (pred_i - y_i)²
4. The cost function (J) of Linear Regression is the Root Mean Squared Error (RMSE) between the predicted y value and the true y value.

GRADIENT DESCENT:
1. To update the a0 and a1 values in order to reduce the cost function and achieve the best fit line, the model uses Gradient Descent.
2. The idea is to start with random a0 and a1 values and then iteratively update the values, reaching minimum cost.

Q4. Explain Regression line, Scatter plot, Error in prediction and best fitting line.
Ans: [5M | Dec17]

REGRESSION LINE:
1. The regression line is the line that best fits the data, such that the overall distance from the line to the points (variable values) plotted on a graph is the smallest.
2. There are as many regression lines as variables.
3. Suppose we take two variables, say X and Y; then there will be two regression lines:
a. Regression line of Y on X: This gives the most probable values of Y from the given values of X: Y = a + bX.
b. Regression line of X on Y: This gives the most probable values of X from the given values of Y: X = a + bY.

SCATTER DIAGRAMS:
1. If data is given in pairs, then the scatter diagram of the data is just the points plotted on the xy-plane.
2. The scatter plot is used to visually identify relationships between the first and the second entries of paired data.
3. Example: (Figure: scatter plot of age vs. size of a plant.)
4. The scatter plot above represents the age vs. size of a plant.
5. It is clear from the scatter plot that as the plant ages, its size tends to increase.

ERROR IN PREDICTION:
1. The standard error of the estimate is a measure of the accuracy of predictions.
2. The standard error of the estimate is defined below:
σ_est = sqrt( Σ(Y - Y')² / N )
where σ_est is the standard error of the estimate, Y is an actual score, Y' is a predicted score, and N is the number of pairs of scores.

BEST FITTED LINE:
1. A line of best fit is a straight line that best represents the data on a scatter plot.
2. This line may pass through some of the points, none of the points, or all of the points.
3. Figure 2.3 shows an example of the best fitted line.
(Figure 2.3: Best fitted line example.)
4. The red line in the above graph is referred to as the best fit straight line.

Q5. The following table shows the midterm and final exam grades obtained for students in a database course. Use the method of least squares using regression to predict the final exam grade of a student who received 86 on the midterm exam.
Midterm exam (x) Final exam (y) 72 34 30 6 ai 77 74 78, 34 30 86 75 39 29] 33 73 a 77 S z 3 74 a 90 as TOM | May Finding x'y and x using given data x y xy x 72 | 84 | 6048 | sia 50 | 63 | 3150 | 2500 a | 77 | 6237 | ese 74 | 78 | S772 | S476 94 [30 | 8460 | aa36 86 | 75 | 6450 | 7396 [59 | 49 | zas1 [Saar 33 [79 | 6587 | 6365 6s | 77 | soos | 4225 33/52] 176 | 1089 a8 [74 | 652 | 7745 81 | 90 | 7290 | eser Here n =12 (total number of values in either x or y) Now we have to find 3x, 5, Sixty) and 5x? Where, 3x= sum of all x values By=sum of ally values Bbxty) = sum of all xty values Be = sum of all x’ values ix = 866 ty = 888 Zoey) = 66088 ! Be = 65942 & Handerafted by BackkBenchers Py ications, Page 14 of 102 chap-2| Learning with Regression www.BackkBenchers.com Now we have tofind a & b. pry ayy To Putting values in above equation, 1266088) ~ (866888) 12-65942)-(066)" nating final exam grade of a student who received 86 marks = y= a"x +b Hore,a= 0.59 be 3142 +86 sng these values in equation oy y (059° 86) + 31.42 82.16 marks Q6. The values of independent variable x and dependent value y are given below: Find the least square regression line y = ax + b. Estimate the value of y when xis 10. mY = Anse TOM | May?6) Fr bug (ety) and x*using given data [* ¥ xy x o}z,o | o TPs st a a ae] [16 Here n = 5 (total number of values in either x or y) Now we have to find 5x, By, Ey) and 5x7 xvalues Where, x= sum of ai By = sum of ally values Page 15 of 102 © Handcrafted by BackkBenchers Pul e VALENCE BONES Ay, Chap -2| Learning with Regro: ici anes Bho) = sum of all ty values, Bx?» cum of all? values x 20 dy #20 Dey 249 we 230 Now we have to find a & b. mir phy 86 marks Estimating final exam grade of a student who rceive yeateeb Here,a=0.9) b=22 x=10 Putting these values in equation of y. y= (09°10) +22 yen? 7. What is linear regression? Find best fitted line for following example: v2 Wa | 12635 s[ 66 | 2 [as “| @ | wr S| | te2 | 1570 | 7 | 6 | woz 7/ 7H | 169 [y692 ' Handcrafted by BackkBenchers Publications Page 16 of 102 >_> . =. chap -21 Learning with Regression www.BackkBenchers.com S772 | 1s [1754 a] 7 | i | 18s vo} 75 | 208 | 1938 Ans: [OM - Dects} LINEAR REGRESSION: Refer Q2 SUM: Finding x'y and using given data x y xy x] es | 7] Boor | 3965 em | a | 7744) | 4086 Ca a | 167 | 083s [761 @ |e |e | 276 | [6 | nove | soar [|e [983] sat 7 | es | 1880 518% 7 | er | was [5325 75 | 208 | 15600 | 5625 Here = 10 (total number of values in either x or y) wwe have wo find 3x. Sy, Sey) and 3x? ere, Sx= sum of all x values y= sum ofalll y values, Zev) bag um of all x'y values sum of all values. y = 693 y= Is8B Xory) = nl0896 Be = 48163 Now we have to find a & b. Putting values in above equation, Page 17 of 102 \¥ Handcrafted by BackkBenchers Publics a woruBackkBenchers<,, $n Finding best fitting ti yeatx tb hore, a= 6.14 bs =20071 x= unknown Patti ) these values in equation of y Y= OM x= 20671 Therefore, best fitting line y = 64x ~266:71 Graph for best fitting line: weight Y Handcrafted by BackkBenchers Publications pa Chap-3| Learning with Trees www.ToppersSolutions.com CHAP EARNING WITH TREES Qi, What is decision tree? How you will choose best attribute for decision tree classifier? Give suitable example. Ans: [lom | Dect} DECISION TREE: 1. Decision tree is the most powerful and popular tool for classification and prediction. ‘A Decision tree isa flowchart like tree structure. Each internal node denotes a test on an attribute. Each branch represents an outcome of the test. 
5. Each leaf node (terminal node) holds a class label.
6. The best attribute is the attribute that "best" classifies the available training examples.
7. There are two terms one needs to be familiar with in order to define the "best": entropy and information gain.
8. Entropy is a number that represents how heterogeneous a set of examples is based on their target class.
9. Information gain, on the other hand, shows how much the entropy of a set of examples will decrease if a specific attribute is chosen.
10. Criteria for selecting the "best" attribute:
a. We want to get the smallest tree.
b. Choose the attribute that produces the purest nodes.
EXAMPLE: (Figure: a small training table of predictor attributes and a target attribute, used to illustrate attribute selection.)

Q2. Explain the procedure to construct a decision tree.
Ans: [5M | Dec18]

DECISION TREE:
1. Decision tree is the most powerful and popular tool for classification and prediction.
2. A decision tree is a flowchart-like tree structure.
3. Each internal node denotes a test on an attribute.
4. Each branch represents an outcome of the test.
5. Each leaf node (terminal node) holds a class label.
6. In decision trees, for predicting a class label for a record, we start from the root of the tree.
7. We compare the values of the root attribute with the record's attribute, and on the basis of the comparison we follow the branch corresponding to that value and jump to the next node.
8. We continue comparing our record's attribute values with the other internal nodes of the tree until we reach a leaf node with the predicted class value.
9. This is how the modelled decision tree can be used to predict the target class or value.

PSEUDO CODE FOR DECISION TREE ALGORITHM:
1. Place the best attribute of the dataset at the root of the tree.
2. Split the training set into subsets.
3. Subsets should be made in such a way that each subset contains data with the same value for an attribute.
4. Repeat step 1 and step 2 on each subset until you find leaf nodes in all the branches of the tree.

CONSTRUCTION OF DECISION TREE:
(Figure 3.1: example of a decision tree for playing tennis; one of the test nodes is Wind.)
1. Figure 3.1 shows the example of a decision tree for playing tennis.
2. A tree can be "learned" by splitting the source set into subsets based on an attribute value test.
3. This process is repeated on each derived subset in a recursive manner, called recursive partitioning.
4. The recursion is completed when the subset at a node all has the same value of the target variable, or when splitting no longer adds value to the predictions.
5. The construction of a decision tree classifier does not require any domain knowledge or parameter setting, and is therefore appropriate for exploratory knowledge discovery.
6. Decision trees can handle high dimensional data.
7. In general, the decision tree classifier has good accuracy.
8. Decision tree induction is a typical inductive approach to learn knowledge on classification.

Q3. What are the issues in decision tree induction?
Ans: [10M | Dec16 & May18]

ISSUES IN DECISION TREE INDUCTION:
I) Instability:
1. The reliability of the information in the decision tree depends on feeding precise internal and external information at the onset.
2. Even a small change in input data can, at times, cause large changes in the tree.
3. The following things will require reconstructing the tree:
a. Changing variables.
b. Excluding duplicate information.
c. Altering the sequence midway.

II) Analysis Limitations:
1.
Ameng the major disadvantages of a decision tree analysis sits inherent limitations. 2. The major limitations include: a. Inadequacy in applying regression and predicting continuous values b. Possibility of spurious relationships. Unsuitability for estimation of tasks to predict values of a continuous attribute, d_ Difficulty in representing functions such as parity or exponential size. I) Over fitting: 1. Over fitting happens when learning algorithm continues to develop hypothesis 2. Itreduce training set error at the cost of an increased test set error 3. How to avoid over fitting a. Pre-pruning; It stops growing the tree very early, before it classifies the training set. b. Post-pruning: It aliows tree to perfectly classify the trail Iv) Attributes with many values: 1. attributes have a lot values, then the Gain could select any value for processing. ing set and then prune the tree. 2. This reduces the accuracy for classification, V) Handling with costs: 1. Strong quantitative and analytical knowledge required to build complex decision trees. 2. This raises the possibility of having to train people to complete a complex decision tree analysis. 3. The costs involved in such training makes decision tree analysis an expensive option. Vi) Unwieldy: 1. Decision trees, while providing easy to view illustrations, can also be unwieldy. 2. Even data that is perfectly divided into classes and uses only simple threshold tests may require a large decision tree. 3. Large trees are not intelligible, and pose presentation dit Drawing decision trees manually usually require several re-draws owing to space constraints at some ulties. sections 5. There is no fool proof way to predict the number of branches or spears that emit from decisions or sub-decisions. Handcrafted by BackkBenchers Publications Page 21 of 102 Chap - 3| Learning with Trees www. Topperssolyy, a 1. The attributes which have continuous values can't have a proper class prediction 2. For example, AGE or Temperature can have any values There is no solution for it until a range is defined in decision tree itself. Viti) Handling exanrples with missing attributes values: 1. Itis possible to have missing values in training set. 2. Toavoid this, most common value among examples can be selected for tuple in consideration, | Vil) Incorporating continuous valued attributes: | | | 1X) Unable to determine depth of decision tree: 1. Ifthe training set dees not have an end value ie. the set is given to be continuous. 2. This can lead to an infinite decision tree building. x) Complexity: 1. Among the major decision tree disadvantages are its complexity. 2. Decision trees are easy to use compared to other decision-making models. 3. Preparing decision trees, especially large ones with many branches, are complex and time-consumn, affairs, 4. Computing probabilities of different possible branches, determining the best split of each node. For the given data determine the entropy after classification using each attribute fe classification separately and find which attribute is best as decision attribute for the root finding information gain with respect to entropy of Temperature as reference attribute. Si_No | Temperature | Wind Humidity 7 Hot Weak High 2 Hot Strong High | 3 Mild | Weak Norrnal 4 Cool Strong High ] ae Weak ‘Norm: 6 | mila [Strong Norm 7 [Mite Weak High 6 ‘Hot ‘Strong. High | ‘3 __| Mile Weak Normal io Hot Stiong Normal Ans: {nom | Mayrs) First we have to find entropy of all attributes, ' 1. 
Temperature: ‘There are three distinct values in Temperature which are Hot, Mild and Cool. As there are three distinct values in reference attribute, Total information gain will be I{p, n, 1) Here, p= total count of Hot = 4 2 n= total count of Mild total count of coo! s=ptner=4+4+2=10 Therefore, lp.nn) ba ~ Slog, > — Flog, > ¥ Handcrafted by BackkBenchers Publications chap-3| Learning with Trees P www.Topperssolutions.com =-tlog. io l082 75 ~ 7oloese HPLP6A) £1522 onsen USING Calculator 2. Wind: There are two distinct values in Wind which are Strong and Weak. As there are two distinct values in reference attribute, Total information gain will be I(p,n), Here, p = total count of Strong = 5 total count of Weak = 5 n s=pt+n=5*5=10 Therefore, Np.n) 2 Mog? p,m) 21 as value of p and n are same, the answer will be 1 Humidity: ‘There are two distinct values in Humidity which are High and Normal As there are two distinct values in reference attribute, Total information gain will be I(p,n). Here, p = total count of High = 5 n= total count of Normal =5 5+5=10 p+ therefore, psn) as value of p and nare same, the answer will be1 Mp, n) Now we will find best root node using Temperature as reference attribute. Here, reference attribute is Temperature. a. There are three distinct values in Temperature which are Hot, Mild and Cool. al Information Gain for whole data using reference attribute. b. Here well ind Tot tinct values in reference attribute, Total information gain willbe ip, n, 1). c. As thereare three dist 4 d. Here, p=total count of Hot n= total count of Mild = 4 total count of cool entre 444e2510 Therefore, Jota! — Pegs? ~ Flons Wp." Page 23 of 102 Y Handcrafted by BackkBenchers Publications qa Chap -| Learning with Trees wow Topperssolutions.ce, Mp. var using calculator Now we valll find Intorrnation Gain, Entropy and Gain of other attributes except reference attribute 1. Wind: Wind attribut We will ind information gain of these distinct values as following fe 1410 distinct values which are weak and strong 1 Weak = p= no of Hot values related to weak = 1 115 0 Of Mild values related to weak = 3 = no of Coo! values related to weak = 1 So pemenele soles Therefore, Mwah = Np. 1) = ~E loge Mweak) © tp. nt) 1371 using calculator MW. Strong = Pp. no of Hot values related to strong = 3 f= no of Mild values related to strong 1.» no of Coot values related to stron sepenenss ed Therefore, s(weak) = Ip, n.1)= Nuveak) = fp. n,e) = 1371. using calculator Therefore, Wind = Distinct” values | Rotal related | otal related | (total related from Wind values of Hot) | valuesef tld) | values of Cool) r, » " ‘Weak 7 3 T iat a Btiong 3 T 7 TA ts *v Hianderafted by BackkBenchers Publica*ions ee es chap-3| Learning with Trees ‘www-Topperssolutions.com Now we will find Entropy of Wind as rollowing, Entropy of Wind = Shy MS x 1(p,, n,n) Here, p+n+r=total count of Hot, Mild and Cold from reference attribute = 10 Pisnar, = total count of related values from above table for distinct values in Wind attribute 1punan) = Information gain of particular distinct value of attribute Entropy of wind RE Mei mvn) + PALE 1pm) x1a7i+ Bt a7 3n Entropy of wind Gain of wind = Total Information Gain - Entropy of wind = 1522 ~ 1371 = 0181 2. Humidity: Humidity attribute have two distinct values which are High and Normal \We will find information gain of these distinct values as following i. High P.= no of Hot values related to High = 3 no of Mild values related to High 10 of Coo! values related to High = 1 sneneseleres Therefor High) KP. 
0) ons! — Togs? ~ Horst iog.? - HHogs$ — High) = Mp, 9, 1) = 1371 ansemnnne Using calculator M Normal = p.= no of Hot values related to Nortnal = n= nc of Mild values related to Normal |= no of Cool values related to Normal tne nel43s1s5: Therefore, Normal) (weak) = Mp, nr) = 1.371... using calculator Handerafted by BackkBenchors Publications eel che. RS WM TOPPETES iting, Chap -3 | Learning with Trees “ Therefore, . Humidity formation Gai Distinct values [total related | (total_—_related {otal related | Inf its - et from Humidity | values of Hot) values of Mild) values of Cool) 7, omet) 1 > om a —13 1 a, 7 Normal y 3 I Now we will find Entropy by Humidity as following, ' Entropy of Humidity = 2h Att x 1pm n) Here, p +n += total count of Hot, Mild and Cold from reference attribute = 10 Pisnyon = total COUNt of related values from above table for distinct values in Humidity attribute 1(Pu mr) = Information gain of particular distinct value of attribute X 1p mar) + ARAL x 1(py my) Entropy of Humidity xaa7i+ 4 x 1371 Entropy of Humidity Gain of wind = Total information Gain ~ Entropy of wind = 1522 - 1371 = OST Gain of wind = Gain of humidity = 0.151 Here both values are same so we can take any one attribute as root node. If they were different then we would have selected biggest value from it. QS. Create a decision tree for the attribute “class” using the respective values: Tye Colour|Maried | Sex Hair Length | Class Brown [Yes Male Long Feotball | Blue Yes Male Short Football Brown Yes Male Tong Football Brown No. Female tong Netball [Brown No Female Long Netball ‘| Blue No Male Tong Football Brown No Fomale Long Netbalt Brown No Male Short Football Brown Yes Female Short Netball Brown No Female Long Netball Blue No Male Tong Football Blue No Male Short Football Ans: Tom | Dect Finaing total information gain lip, n) using class attribute, There are two distinct values in Class which are Football and Netball, Here, p= total count of Football = 7 © Handcrafted by BackkBenchers Publications page 26 of [a chap-3| Learning with Trees ‘www.ToppersSolutions.com n= total count of Netbal s=ptn=7+S=12 Therefore, p.m) Plog.’ - “tog, Hip, n) = 0.980 .. Using calculator Now we will find Information Gain, Entropy and Gain of other attributes except reference attribute Eye Colour: Wind attribute have two distinct values which are Brown and Blue, We will find information gain of these distinct values as following 1. Brown = P.= no of Football values related te weak 1.= no of Netball values related to weak = 5 S=pons3+5=8 Therefore, Nerown) = lip.n) = ~2tog32 ~ logs? Brown) = Kp, n} = 0.955 = using ealeulator w Blue= P.= no of Football values retated to Blue = 4 n= no of Netball values related to Blue = 0 pene4+0=4 Therefore, = Mog! 
NBtue| (p.9) Therefore, Eye Colour Distinct values From | (otal related values oF | (total related values of | Information Gain of value Eye Colour Football) Netball Hund nm Brown 5 5 0955 ES - : a eo by BackkBenchers Publications oe ' Page27 of 102 WOW TOPPEE Solu, Chap ~3 | Learning with Trees % Now we will find Entropy of Eye Colour as following, Entropy of Eye Colour = £2 x Fam) rilbute = 12 Here, p +n=total count of Football and Netball from class attribute art Filing Px + 1 total count of related values from above table for distinct v trib, information gain of particular distinct value of attribute Mun wn + BALERS 56 1,3) of Blue Entropy of Eye Colour = Pemlertee™ 5 1(p,n,) of Br 25 xooss+ 42 x0 Entropy of Eye Colour = 0637 Gain of Eye Colour = Total information Gain - Entropy of Eye Colour = 0.980 - 0.637 = 0.343 2 Married: Morried attribute have two distinct values which are Yes and No. We will ind information gain of y distinct values as following L Yes P:= ne of Football values related to yes = 3 = no of Netballl values related to yes. S=ptns3e Therefore, Wes) = lip, n) Therefore, (No) = lip, n) Therefore, [Married pystinet values from J (total related values of | total related values of | information Cain of value Married Football) Netball) Hoan) » ™ Yes 3 7 osi2 No 4 4 1 @ Handcrafted by BackkBenchers Publications chap-3| Learning with Trees www.Topperssolutions.com Now we will ind Entropy of Martied as following, entropy of Married = 1m) Here, p+ N= total Count of Football and Netball from class attribute = 12 Puen, = total count of related values from above table for distinct values in Married attribute Hpyn) = Information gain of particular distinct value of attribute: Entropy of Married =A x p,.0,) of Yes + PME se H¢p,n,) of No Mt yosizg M4 x1 Entropy of Married = 0.938 Gain of Married = Total Information Gain - Entropy of Married = 0980 - 0.938 = 3 Sex: Windattribute have two distinct values which are Male and Female. We will find information gain of these distinct values as following 1. Mal P.= no of Football values related to Male =7 1. = no of Netball values related to Male = 0 ptn Therefore, lives) = lp.) ois? — tons? Mt¥es) = I(p, n) = 90 using calculato: Fomale = P.= no of Foothail values related to Female = 0 No of Netball values related to Femak 4525 Therefore, No) = fp, n) = — Blogs! — Flog. Spe! ~ flows . If both values are same then the answer will be Ltor Information gain Therefore, Sec = = eeraaieanean Distinct values from | (kotal related values of | (cotal related values of | Information Gain of vais Sex Football) Netball) Hn) » ny Male 7 9 Female °. 5. by BackkBenchers Publications www.Toppersso,, Chap -3] Learning with Trees i, Now we will find Entropy of Sex as following. x1) Entropy of Sex p+n=total count of Football and Netball otal count of related values from aby ion gain of particular distinct value of a from class attribute = 12 ple for distinct values in Sex attribute Here, ove tal Peon, pun) = informati ttribute cst Poste 1(yenof NO Entropy of Sex = Pantene x 1(p..m4) of Yes + x Mp nd of | 278 x04 3 x0 entropyotsex=0 Gain of Sex = Total Information Gain - Entropy of Sex = 0.980 - 0 = 0.980 4 HairLength: Hair Length attribute have two distinct values which are Long and Short. We will find information these distinct values as following Long = p.= no of Football values related to Long = 4 = no of Netball values related to Long = 4 S=pene4e4e8 Therefore, (Long) = (p.m) = i is same so answer will be T KLeng) = Kp, 9) As buth values are wh M. 
Short 10 of Foctball values related to Short = 3 Pe 1.= no of Netbali values related to Short sept ne3+124 Therefore, short = ip.n)=—Eiog, — 2 tog? oi Therefore, Hair Length Distinct values from | (total related values of | (total related values of | Information Gain of value Hair length Football) Netball) Hon) ym n tong z {Short 3 oaia = Handcrafted by BackkBenchers Publications Pag | chap- 3] Learning with Trees www.ToppersSolutions.com Now wewillfind Entropy of Hair length as following, entropy of Hair Length = Di" x 1,0) Here, p += total count of Football and Netball from class attribute = 12 ‘otal COUNt Of related values from above table for distinct values in Hair Length attribute Pram 1(pun) = Information gain of particular distinct value of attribute Entropy of Hair Length = Het 510, 9,) of Long 4 Panter 1(p,,m,) of Short S14 xomi2 Entropy of Hair Length =0938 Gain of Hair Length = Total Information Gain - Entropy of Hair Length = 0,980 - 0.938 = 0.042 Gain of all Attributes except class attribute, Gain (Eye Colour) = 0.343 ‘cain (Married) = 0.042 in (Sex) = 0980 ain (Hair Length) = 0.042 Here, Sex attribute have largest value so attribute ‘Sex’ we be root node of Decision tree. First we will go for Male value to get its child node, Now we have to repeat the entire process where Sex = Maie We will take those tuples from given data which contains Male as Sex. And construct a table of those tuples. “EyeColour [| Married Sex Hair Length Class Brown Yes Male tong Football Blue Yes Male Short Football Brown Yes Male Tong Football Blue No Male tong Football Brown No Male Short Football Blue No Male tong Football Blue No Male ‘Short Football Here we can see that Football is the only one value of class which is related to Male value of Sex class. “owe can say that all Male plays Football Now we will go for Female value to get its child node, We will take those tuples from given data which conti Ue, 's Female as Sex. And construct a table of those. fidltlanderafted by BackkBenchers Publications pape'sTishiol WHO TOPPETSSOltian, Chap -3] Learning with Trees ri Sex Hairtength | Class ee Brown No Female tong Netball Brown No Female ‘Long Netball Brown Yes Female Shoit Netball Brown No Female Tong Netball Here we can see that Netball is the only one value of class which is related to Female value of Sex clas, So we can say that all Female plays Netball So Final Decision Tree will be as following Cm Creme a Oe [aned] See ror] cas colo Leon Fotball Brown | — no | Female] tng FetBal feotball Brown] ns [Female | tong_| Metta Footbal Grow) na Female [tong Netball etal Brown | ves tea eat] ‘row Wo Netball otal Q6. For the given data determine the entropy after classification using each attribute fo classification separately and fine which attribute is best as decision attr finding information gain with respect to entropy of Temperature as reference attribute. ute for the rect by Sr.No | Temperature | Wind Homidiy Hot Weak Normal 2 Hot Strong High 3 wile Weak Normal z mild Strong High 5 Coal Weak Normal] é Mild Strong | Normal 7 mad Weak High 8 Hot Strong “| Normai 3 wild ‘Strong Normal 0 Cool Strong Normal Ans: '¥ Handcrafted by BackkSenchers Publications ag chap 31 Learning with Trees www.-ToppersSolutions.com girst we have to find entropy of all attributes, 1, Temperature: There are three distinct values in Temperature which are Hot, Mild and Cool as there are three distinct values Here, p= total count of Hot = 3 = total count of Mild reference attribute, Total information gain will be Hip, n. 
6) 1 = total count of cool = 2 sepentr=3+5+2=10 Therefore, pine) =—Blogs? ~ Slog,® — Slog, using calculator 2 ‘There are two distinct values in Wind which are Strong and Weak As there are two distinct values in reference attribute, Total information gain will be I(p, n) ‘otal count of Strong = 6 as value of pand n are same, the answer viill be 1 There are two distinct values in Humidity which are High and Normal As there are two distinct values in reference attribute, Total information gain will be l(p, n). Here, p= total count of High = 3 N= total count of Normal = 7 S=p+n=3+7=10 Therefore, Kip, n) No.) as value of p and n are same, the answer will be 1 882 fafted by BackkBenchers Publications Paye 33 of oz ia iii wir Topper, H Chap ~3 | Learning with Trees J reference attribute. Now we will find best root node using Temperature as referer Hore, reference attribute is Temperature. — and Cool, There are three distinct values in Temperature which are Hot, Mild an‘ teiin As there are three distinct values in reference attribute, Total information gain 1) otal count of Hot = 3 Here, p 1 = total count of Mil total count of cool =2 +5+2=10 s=ptns Therefore, logs ~ Zions & Mp. 0,1) = 1.486. wu: using calculator Now we will find Information Gain, Entropy and Gain of other attributes except reference attribute 1. Win Wind attribute have two distinct values which are weak and strong, Weill find information gain of these distinct values as following 1 Weak= No of Hot values related to weak = 10 of Mild values related te weak = 2 = no of Cool values related to weak = Sspenenslezedes Therefore, Mweak) = lip, 4 Mweak) = ip, n,1) = 1.8 using calculator I, Strong= P.= no of Hot values related to strong n= no of Mild values related to stron. 1.= no of Cool values related to strong 1 S=pienene243e1 Therefore, loge? ~ tloge? (weak) 2 Nweak) = I(p, n, 1) = .- using calculator Y Handcrafted by BackkBenchers Publications ~~ - chap-3| Learning with Trees \Www.ToppersSolutions.com meretore: Wind Distinct values [ (total “related | (total related | (total __ related | Information Gain ot valu from Wind values of Hot) | values of Mild) | values of Cool) n Heron) m 1 Wok v z T 1s ron9 2 3 T 1460 Now wewill find Entropy of Wind as following, ropy of Wind = 3,2 Pee 36 1p, nun) es Here, p*+n-+ = total count of Hot, Mild and Cold from reference attribute = 10 ‘otal count of related values from above table for distinct values in Wind attribute: eed te. 1) = Information gain of particular distinct value of attribute Peemensorweet x 1, my.n) 4 LAREN 5 H(p,.n,. 1) Entropy of Wind. Ht xis+§ x 1.460 tropy of wind 1476 Gain of wind Entropy of Reference - Entropy of wind = 1486 - 1.476 = 0.01 2. Humidity: Humidity attribute have two distinct values which are High and Normal. We will find information gain of these distinct values as following High no of Hot values related to High f= no of Mild values retated to High = 2 = no of Cool values related to High tnensle2+O=3 Therefore, High) Kp. 9.0) High) ML Normal = 10 of Hot values related to Normal = 2 Pp 1 = no of Mild values related to Normal = 3 1 = no of Cool values related to Normal = 2 septnens2+342=7 Therefore, Normal) weak) = Ip, n,1) = 1557 using calculator Handcrafted by BackkBenchers Publications Page 35 of 102 i Www-TOPPETSSolutign, Chap - 3 | Learning with Trees = « Therefore, ~ Humidity 1 Information Gai Distinct values | (total related | (total related fora! 
related [int - ain oF a) it 1M from Humidity | valuescof Hot) | valuesot Mild) | values of Cool Pamir) » O19 High T 2 0 1 Normal 2 3 2 = ' Now we will find Entropy by Humidity as following, 1 Entropy of Humidity = By. at x MDM) 3 =10 “ Here, p+n+r= total count of Hot, Mild and Cold from reference attribute (otal count of related values from above table for distinct values in Humidity attribute information gain of particular distinct value of attribute Planer forweak Entropy of Humicity = = 5 yy) + MELEE 3 Ep, mn) Entropy of Humidity = #22 2 x 1.557 xo919 + 222 Entropy of Humidity = 1366 Gain of wind = Entropy of Reference - Entropy of wind = 1.486 - 1366 = 0.12 Gain of wind = 0.01 Gain of humidity = 0.12 Here value of Gain(Humidity) is biggest so we will take Humidity attribute as root node. Q7. Fora SunBurn dataset given below, construct # de [Name Hair Height Weight Class [sania Bionde Average Tight Yes nite Blonde Tall Average | Ves Wo Kavita Brown Short merge ves We Sushma Blonde Short wrerage No Vee Xavier Red Average Feavy We Yee Balaji Brown Tall Heavy No No Ramesh Brown Average Heavy Wo Wo Swetha ‘Blonde Troe Tight Ves Wo Ans: [OM - May17 & Mayi First we will find entropy of Class attribute as following, There are two distinct values in Class which are Yes and No. Here, p= total count of Yes=5 s=p+n=5+3=8 Therefore, v Handcrafted by BackkBenchers Publications chap-3 | Learning with Trees www: ToppersSolutions.com pn) =~ Flos} — Flows} Now we will find Information Gain, Entropy and Gain of other attributes except reference attribute 1. Hair Wind attribute have three distinct values which are Blonde, Brown and Red. We will find information gain ofthese distinct values as following 1. Blonde = P= no of Yes values related to Blonde = n= no of No values related to Blonde = 2 24224 erefore, NBlonde) = lip,n) = ~Blogs® ~ logs donde) = I(p, n) = 1 a... a8 value of p and n is same, so answer will be 1 grown = no of Yes values related to Brown = 0 no of No values related to Brown = 2 =0 swe ILanyone val swer wil + Inf oa MW Red = p= noof Yes values related to Ped no of No values related to Red ry = ptmel+O“2 Therefore, Welue) = lip.) = tog, =0 mate Handcrafted by BackkBenchers Publication: Page 37 of 102 www-Topperssojyy, | ‘Chap -3 | Learning with Trees ion, Therefore, on formation Gain of Distinet values from ] fotal related values of | (total elated values of | Ini Gain aie Eye Colour Football) Netball) » nm ‘Blonde 2 Zz z Brown 2 © 3 Red i 0 Now we will find Entropy of Hair as following, Entropy of Hair = Dis 22% x 1(pum) Here, p+n-= total count of Yes and No from class attribute = 8 P+ 1 = total count of related values from above table for distinct values in Hair attribute 1m) = Information gain of particular distinct value of attribute Entropy of Eye Colour = nidof Blonde + W280 5 1(y,m,) of Brown + Pate Han) of Red Entropy of Hair =0S Gain of Hair = Entropy of Class ~ Entropy of Hair = 0.955 - 05 = 0.455 Height: Height attribute have three distinct values which are Average, Tall and Short. We will find informats gain of these distinct values as following Tall = P = no of Yes values reiated to Tall = 0 1p, = no of No values related to Tall = 2 s=ptm=0%2=2 Therefore, WAverage) = fp.) = Average) = lip, n) = 0 .- Ifany value from p and nis then answer will be O 1. 
Average p= no of Yes values related to Average = n= no of No values related to Average = 1 s=ptn=2+ Therefore, Waverage) = I(p,n) = @ Handcrafted by RackkBerichers Publications coop [Lenrains WIN Tikes: www.Topperssolutions.com " p= neof Yes values related to Short n No Values related to Red geptnsl+2=3 Therefore, short) = ip. n) = —2tog, Therefore, at [Dstinet values From | (Rotal lated values | Height Yes) total related values of | information Gain of value | | Noy Hp) | n | a a ig 0 fz iu co) p 2 O38 Now we will fing Entropy of Height as following, Entropy of Heigh SESS xn) Here, pon fetal Count of Yes and No from class attribute = 8 P+ m= total count of related values from above table for distinct values in Height attribute (pn) = information gain of particular distinct value of attribute MAHL x yun) of Average + Hotere 0919 Entropy of Height = 0.690 Con of Height = Entropy of Class ~ Entropy of Height = 0955 -0.690 = 0265 % Weight: Weight attribute have three distinct values which are Heavy, Average and Light. We will ind information ‘Gain of these distinct values as following L Heavy = P.= no of Yes values related to Heavy =1 = no of No values related to Heavy =2 WAverage) = I(p, n) logs WAverage) = I(p, n) (ibagares encv nchers Publications page 39.6f10a WonW.TOPPETSSolu, Chap-3) Learning with Trees * Average = = P= n0 Of Yes values related to Average = 1 ‘ N= n0 Of No values related to Averags SDA nsle2es ‘ therefore, NAvorage) = ip,n) = —2tog, m, 0 of Yes values related to Light N= NO Of No values related to Light se pe netste2 Theratore, NLight) = ip, n) = ~2logs? =H a1 AS value of p and n are same, so answer will be THitre(6r@, Wight | Distinct values from | (total related values oF {total related values of | information Gain of value Weight Yes) No) Mun) | " | é i 7 z o5ie | Average 1 2 0919 | Light l Now we will find Entropy of Weight as following, Entropy of Weight = Zh, 24% x 1(pym) Here, p+ n= total count of Yes and No from class attribute = 8 1% + ny = total count of related values from above table for distinct values in Weight attribute pen) = Information gain of particular distinct value of attribute Entropy of Height= PML 10m, n,) of Heavy + Memlerseree Fon X 1pm) of Average + Apu ny) of Light xo919 + Ht x1 Gain of Height = Entropy of Class - Entropy of Height = 0.955 - 04 - 0.015 \@ Handcrafted by BackkBenchers Publications chap -3| Learning with Trees www.ToppersSolutions.com 4 Location: Location attribute have two distinct values which are Ves and No. We will find information gain of these distinct values as following 1 Yes p= n0 of Yes values related to Ye N= no of No values related to Yes =3 5=ptn=0+3= Therefore, WAverage) = I(p,n} Waverage) = I(p,n} =0 if any one value from p and nis 0 then answer will be O . No= p.= no of Yes values related to No=3 1n,= No of No values related to No = 2 3+2=5 aspen Therefore, WAverage) = Ip. 
n) [otal related values of | (total elated values of | information Gan ofvalue 1 Yes) No) Hp) n my o 3 ° 3 2 oo7 Now we will find Entropy of Location as following, Pes 3c 1pm) Entropy of Location = Start Here, p+ n= total count of Yes and No from class attribute i+ n= total count of related values rom above table for distinct values in Location attribute (pun) = Information gain of particular distinct value of attribute Entropy of Location = MZ x I(Punid of Yes + AA" x 1(my m1) of No 2 94 HE x O971 Entropy of Height = 0.607 3in of Height = Entropy of Class ~ Entrop: Wy of Height = 0.955 - 0,607 = 0.348 = afted by BackxBenchers Publications Page 41 of 102 wwW.TOPPEFSSOlutin, Chap -3 | Learning with Trees Gain (Height) = 0265 Gain (Weight) Gain (Location) = 0348 bute as root node. Here, we ean see that Gain of Hair attribute is highest So we will ake Hair att =) a> Cao Now we will construct a table for each distinct value of Hair attribute as following, Bionde - Name Height TWeight Location Class [Senita ‘Average Light No Yes Anita Tall ‘Average Yes No | Sushma Short Average No Ves | Swetha Short Light Ves No Brown, Name Height Weight Location Class, heave Shon Average = Te Balaji | Tall Heavy ~ Te a Ramesh Average Heavy No =I Red - Name Height Weight Location Class Xavier ‘| Average” Fieavy No" Var Ase can see that for table of brown values, class value is No forall data tuples and samo fer table of fe values where class value is Yes for all data tuples, ‘Sowe will expand the node of Blonde value in decision tree. We will now use following table which we have constructed at ove for Blonde value, Name Feight Weight Tocation Class Sant Twenge tight No Yes Anita Tall Average Yes No Sushma Short ‘Average Ne Vos Swetha ‘Short Tight [ve No 'v Handcrafted by BackkBenchers Publications chap -3| Learning with Tr www.ToppersSolutions.com Now we will ind Entropy of class attribute again as following, ‘There are two distinct values in Class which are Yes and No. Here, p= total count of Yes = 2 jotal count of No =2 ne sepene2* Therefore, ip. As value of p and n are same, so answer will be 1 Now we wall find Information Gain, Entropy and Gain of other attributes except reference attribute 1. Height: night attribute have three distin a \ Howght attribute have three dlistinet values which are Average, Tall and Short. We will find information gain of these Tale stinct values as following P= no of Yes values related to Tall = 0 no of No values related to Tal sepensorl Therefore, Tall) = Hp, n) = ~Ztoy al Alp.n)=0 Hany value from p and nis @ then answer will be 0 MN. Average = p= no of Yes values related te Average 1 = no of No values related to Average = 0 S=penel+o= Therefore, WAverage) = ip.) If any value from pand nis 0 then answer will be 0 ML short = no of Yes values related to Short =1 Therefore, tod by BackkBenchers Publications Page 43 of 102 5 Chap ~3 | Learning with Trees www.ToppersSoluti logs} — Hogs on. AS value of p and n are same, answer will be? 
Therefore, Height ‘Distinct values from | (total related values of | (total related values of | Information Gain of value Height Yes} No) Kron) nr Tall ° 7 o Average 7 ‘to _ 4 Short 1 7 T Now we will ind Entropy of Height as following, 0 Entropy of Height Here, p + n= total count of Yes and No from class attribute = 4 1, + n, = total count of related values from above table for distinct values in Height attribute pun) = tnformation gain of particular distinct value of attribute Hee ey) of Tall + MAE (psn) of Average + ME Entropy of Height Hn) of Short =o x04 vor it xt Entropy of Height = 05 Gain of Height = Entropy ef Class ~ Entropy of Height =1-05 = 0.5 2: Weight: Weight attribute have three distinct valu2s which are Heavy, Average and Light. We wil find informate gain of these distinct values as following 1 Heavy= 10 of Yes values related to Heavy = 0 pe n= no of No values related to Heavy = 0 s=ptn=0+0=0 Therefore, As value of p and n are 0, ul. Average p.=noof Yes values related to Average =1 no of No values related to Averag penslsl=2 Therefore, @ Handerafted by BackkBenchers Publications mae ant nap-31 Learning with Trees : www.TeppersSolutions.com ~~svommna 8 Value Of p and n are same. m. Light= 10 of Yes values related to Light A no of No values related to Light = 1 s=ptnel+ Therefore, Light) Np, AS value of p and n are same, so answer will be 1 Therefore, Weight Distinct values from | (total related values of | (otal related values of | Information Gain of value Weight Yes) No) Hund I » m4 Heay jo ° ° erage 7 T Tight 7 7 7 Now we will fina Entropy of Weight a: following, Entropy of Weight = S222 x Mt) ete, p+ n= tocal count of Yes and iNo from class attribute 1b. + nj = total count of related values from above table for distinct values in Weight attibute 1(pj.n) = Information gain of particular distinct vaiue of attribute Mom tetneee 5 p,m) of Average + BRUCHIM Entropy of Weights MASetttt x (p.m) of Heavy + Nun) of Light 8 gy HE tt Entropy of Weight Gain of Weight = Entropy of Class - Entropy of 3 Location: cation attribute have two distit tinct values as following fb ves B= no of Yes values related f Weight = 1 nct values Which are Yes and No. We will find information gain of these oes =O 1.= noof Novalues related to Yes =2 sis ptnzo+2=2 Chap -3 | Learning with Trees ch Therefore, bi yes) = K(p, Flops! — Mog? yer =~ log: $ = Flog: jue romp and nisothen answer will be o W¥e3) = fp, 1) =O osname if ary one VOTE u. 
No:
p = number of Yes values related to No = 2
n = number of No values related to No = 0
s = p + n = 2 + 0 = 2
Therefore,
I(No) = I(p, n) = -(p/s) log2(p/s) - (n/s) log2(n/s) = 0 ... if any one value from p and n is 0, then the answer will be 0.

Therefore,

Distinct values from Location | p (total related values of Yes) | n (total related values of No) | Information gain of value I(p_i, n_i)
Yes | 0 | 2 | 0
No  | 2 | 0 | 0

Now we will find the Entropy of Location as follows:
Entropy of Location = SUM_i ((p_i + n_i) / (p + n)) x I(p_i, n_i)
Here,
p + n = total count of Yes and No from the Class attribute = 4
p_i + n_i = total count of related values from the above table for each distinct value of the Location attribute
I(p_i, n_i) = information gain of the particular distinct value of the attribute
Entropy of Location = (2/4) x 0 + (2/4) x 0 = 0
Gain of Location = Entropy of Class - Entropy of Location = 1 - 0 = 1

Here,
Gain (Height) = 0.5
Gain (Weight) = 0
Gain (Location) = 1

As the Gain of Location is the largest value, we will take the Location attribute as the node for the Blonde branch.
Now we will construct a table for each distinct value of the Location attribute as follows.

Location = Yes:
Name | Height | Weight | Class
Anita | Tall | Average | No
Swetha | Short | Light | No

Location = No:
Name | Height | Weight | Class
Sanita | Average | Light | Yes
Sushma | Short | Average | Yes

As we can see, the class value is No for every tuple with Location = Yes and Yes for every tuple with Location = No, so there is no need for further classification.

The final decision tree is therefore: Hair is the root node; Hair = Brown gives class No, Hair = Red gives class Yes, and Hair = Blonde leads to a Location node, where Location = Yes gives class No and Location = No gives class Yes.
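The same I(p, n) and entropy formulas are applied attribute by attribute throughout this answer, so a small script is a convenient way to check the arithmetic. The sketch below is not part of the original solution; it is a minimal Python illustration, assuming the eight SunBurn rows exactly as tabulated above, and it recomputes the class entropy (about 0.954) and the gain of every attribute so the hand-worked values (Hair about 0.45, Height about 0.27, Weight about 0.02, Location about 0.35) and the choice of Hair as the root can be verified.

from collections import Counter
from math import log2

# (Hair, Height, Weight, Location, Class) -- rows as read from the table above
rows = [
    ("Blonde", "Average", "Light",   "No",  "Yes"),  # Sanita
    ("Blonde", "Tall",    "Average", "Yes", "No"),   # Anita
    ("Brown",  "Short",   "Average", "Yes", "No"),   # Kavita
    ("Blonde", "Short",   "Average", "No",  "Yes"),  # Sushma
    ("Red",    "Average", "Heavy",   "No",  "Yes"),  # Xavier
    ("Brown",  "Tall",    "Heavy",   "No",  "No"),   # Balaji
    ("Brown",  "Average", "Heavy",   "No",  "No"),   # Ramesh
    ("Blonde", "Short",   "Light",   "Yes", "No"),   # Swetha
]

def entropy(labels):
    # I(p, n) = -(p/s) log2(p/s) - (n/s) log2(n/s), generalised to any counts
    total = len(labels)
    return -sum((c / total) * log2(c / total) for c in Counter(labels).values())

def gain(rows, attr_index):
    # Information gain of the attribute at attr_index with respect to the class column
    labels = [r[-1] for r in rows]
    base = entropy(labels)
    remainder = 0.0
    for value in set(r[attr_index] for r in rows):
        subset = [r[-1] for r in rows if r[attr_index] == value]
        remainder += len(subset) / len(rows) * entropy(subset)
    return base - remainder

print("Entropy(Class) =", round(entropy([r[-1] for r in rows]), 3))   # about 0.954
for i, name in enumerate(["Hair", "Height", "Weight", "Location"]):
    print("Gain(%s) = %.3f" % (name, gain(rows, i)))  # Hair is the largest, so it becomes the root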
CHAP - 4: SUPPORT VECTOR MACHINES

Q1. What are the key terminologies of Support Vector Machine?
Q2. What is SVM? Explain the following terms: hyperplane, separating hyperplane, margin and support vectors with a suitable example.
Q3. Explain the key terminologies of Support Vector Machine.
Ans: [5M | May16, Dec16 & May17]

SUPPORT VECTOR MACHINE:
1. A support vector machine is a supervised learning algorithm that sorts data into two categories.
2. A support vector machine is also known as a support vector network (SVN).
3. It is trained with a series of data already classified into two categories, building the model as it is initially trained.
4. An SVM outputs a map of the sorted data with the margins between the two categories as far apart as possible.
5. SVMs are used in text categorization, image classification, handwriting recognition and in the sciences.

HYPERPLANE:
1. A hyperplane is a generalization of a plane.
2. SVMs are based on the idea of finding a hyperplane that best divides a dataset into two classes/groups.
3. Figure 4.1 shows an example of a hyperplane.
4. As a simple example, for a classification task with only two features as shown in figure 4.1, you can think of a hyperplane as a line that linearly separates and classifies a set of data.
5. When new testing data is added, whichever side of the hyperplane it lands on decides the class that we assign to it.

SEPARATING HYPERPLANE:
1. From figure 4.1, we can see that it is possible to separate the data.
2. We can use a line to separate the data.
3. All the data points representing men will be above the line.
4. All the data points representing women will be below the line.
5. Such a line is called a separating hyperplane.

MARGIN:
1. A margin is the separation between the line and the closest class points.
2. The margin is calculated as the perpendicular distance from the line to only the closest points.
3. A good margin is one where this separation is larger for both classes.
4. A good margin allows the points to be in their respective classes without crossing into the other class.
5. The wider the margin, the more optimal the hyperplane we get.

SUPPORT VECTORS:
1. The vectors (cases) that define the hyperplane are the support vectors.
2. The two groups of vectors are separated using the hyperplane.
3. Support vectors are mostly the points from each group that lie closest to the hyperplane used to classify them.
4. Figure 4.2 shows an example of support vectors.

Q4. Define Support Vector Machine (SVM) and further explain the maximum margin linear separators concept.
Ans: [10M | Dec17]

SUPPORT VECTOR MACHINE:
1. A support vector machine is a supervised learning algorithm that sorts data into two categories.
2. A support vector machine is also known as a support vector network (SVN).
3. It is trained with a series of data already classified into two categories, building the model as it is initially trained.
4. An SVM outputs a map of the sorted data with the margins between the two categories as far apart as possible.
5. SVMs are used in text categorization, image classification, handwriting recognition and in the sciences.

MAXIMAL-MARGIN CLASSIFIER / SEPARATOR:
1. The Maximal-Margin Classifier is a hypothetical classifier that best explains how SVM works in practice.
2. The numeric input variables (x) in your data (the columns) form an n-dimensional space.
3. For example, if you had two input variables, this would form a two-dimensional space.
4. A hyperplane is a line that splits the input variable space.
5. In SVM, a hyperplane is selected to best separate the points in the input variable space by their class, either class 0 or class 1.
6. In two dimensions you can visualize this as a line, and let us assume that all of our input points can be completely separated by this line.
7. For example: B0 + (B1 x X1) + (B2 x X2) = 0, where the coefficients B1 and B2 determine the slope of the line, the intercept B0 is found by the learning algorithm, and X1 and X2 are the two input variables.
8. You can make classifications using this line.
9. By plugging input values into the line equation, you can calculate whether a new point is above or below the line.
10. Above the line, the equation returns a value greater than 0 and the point belongs to the first class.
11. Below the line, the equation returns a value less than 0 and the point belongs to the second class.
12. A value close to the line returns a value close to zero and the point may be difficult to classify.
13. If the magnitude of the value is large, the model may have more confidence in the prediction.
14. The distance between the line and the closest data points is referred to as the margin.
15. The best or optimal line that can separate the two classes is the line that has the largest margin.
16. This is called the Maximal-Margin hyperplane.
17. The margin is calculated as the perpendicular distance from the line to only the closest points.
18. Only these points are relevant in defining the line and in the construction of the classifier.
19. These points are called the support vectors.
20. They support or define the hyperplane.
21. The hyperplane is learned from the training data using an optimization procedure that maximizes the margin.
Figure 4.3: Maximum margin concept.
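As a rough illustration of the maximal-margin idea, the sketch below fits a linear SVM with scikit-learn on a small linearly separable set (the same eight points that appear in Q11 later in this chapter) and reports w, b and the margin 2/||w||. The very large C value, used here only to approximate a hard margin, is an assumption for illustration and is not part of the original answer.

import numpy as np
from sklearn.svm import SVC

# Toy linearly separable data (illustrative choice)
X = np.array([[1, 1], [2, 1], [1, -1], [2, -1],   # class -1
              [4, 0], [5, 1], [5, -1], [6, 0]])   # class +1
y = np.array([-1, -1, -1, -1, 1, 1, 1, 1])

# A very large C approximates a hard margin (almost no slack allowed)
clf = SVC(kernel="linear", C=1e6).fit(X, y)

w = clf.coef_[0]          # weights B1, B2 of the separating line
b = clf.intercept_[0]     # bias B0
margin = 2 / np.linalg.norm(w)

print("w =", w, " b =", b)
print("margin 2/||w|| =", margin)
print("support vectors:\n", clf.support_vectors_)

# Classify a new point: the sign of (w . x + b) gives the class
x_new = np.array([3.5, 0.5])
print("side of hyperplane:", np.sign(w @ x_new + b))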
Q5. What is Support Vector Machine (SVM)? How to compute the margin?
Q6. What is the goal of the Support Vector Machine (SVM)? How to compute the margin?
Ans: [10M | May17 & May18]

SUPPORT VECTOR MACHINE:
Refer Q4 (SVM part).

MARGIN:
1. A margin is the separation between the line and the closest class points.
2. The margin is calculated as the perpendicular distance from the line to only the closest points.
3. A good margin is one where this separation is larger for both classes.
4. A good margin allows the points to be in their respective classes without crossing into the other class.
5. The wider the margin, the more optimal the hyperplane we get.

EXAMPLE OF HOW TO FIND THE MARGIN:
1. Consider building an SVM over the (very little) data set shown in figure 4.4.
2. Working geometrically, the maximum-margin separator is perpendicular to the line joining the two closest points of opposite class and passes through its midpoint.
3. So the SVM decision boundary is: x1 + 2.x2 - 5.5 = 0.
4. Working algebraically, with the standard constraint that y_i(w . x_i + b) >= 1, we seek to minimize ||w||; this constraint is satisfied with equality by the two support vectors.
5. The resulting weight vector is w = (2/5, 4/5) with b = -11/5, which describes the same boundary as above.
6. The margin is: 2/||w|| = 2/sqrt(4/25 + 16/25) = 2/(2*sqrt(5)/5) = sqrt(5).
7. This answer can be confirmed geometrically by examining figure 4.4.

Q7. Write a short note on: Soft margin SVM.
Ans: [10M | Dec16]

SOFT MARGIN SVM:
1. Soft margin SVM is an extended version of hard margin SVM.
2. The hard margin formulation was given by Boser et al. (1992) at COLT and the soft margin formulation by Vapnik et al. (1995).
3. Hard margin SVM can work only when the data is completely linearly separable without any errors (noise or outliers).
4. In case of errors, either the margin is smaller or hard margin SVM fails entirely.
5. On the other hand, soft margin SVM was proposed by Vapnik to solve this problem by introducing slack variables.
6. Since soft margin is an extended version of hard margin SVM, the difference in their usage lies in how errors are handled, as described below.
7. The allowance of softness in margins (i.e. a low cost setting) allows for errors to be made while fitting the model (support vectors) to the training/discovery data set.
8. Conversely, hard margins will result in fitting of a model that allows zero errors.
9. Sometimes it can be helpful to allow for errors in the training set.
10. It may produce a more generalizable model when applied to new datasets.
11. Forcing rigid margins can result in a model that performs perfectly on the training set but is possibly over-fit / less generalizable when applied to a new dataset.
12. Identifying the best setting for 'cost' is related to the specific data set you are working with.
13. Currently, there are not many good solutions for simultaneously optimizing cost, features, and kernel parameters (if using a non-linear kernel).
14. In both the soft margin and hard margin cases we are maximizing the margin between support vectors, i.e. minimizing ||w||.
15. In the soft margin case, we let our model give some relaxation to a few points.
16. If we consider these points as support vectors, our margin might reduce significantly and our decision boundary will be poorer.
17. So instead of considering them as support vectors we consider them as error points.
18. And we give a certain penalty for them which is proportional to the amount by which each data point violates the hard constraint.
19. Slack variables can be added to allow misclassification of difficult or noisy examples.
20. These variables represent the deviation of the examples from the margin.
21. Doing this we are relaxing the margin; we are using a soft margin.
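To make the effect of the cost setting concrete, the following sketch (not from the original text) trains a linear SVM on two overlapping Gaussian blobs with three different values of C. The data and the particular C values are illustrative assumptions; the point is only that a small C widens the margin and tolerates more violations, while a large C behaves more like a hard margin.

import numpy as np
from sklearn.svm import SVC

rng = np.random.RandomState(0)
# Two overlapping blobs, so a few points inevitably violate the margin
X = np.vstack([rng.normal(loc=[0, 0], scale=1.0, size=(50, 2)),
               rng.normal(loc=[2, 2], scale=1.0, size=(50, 2))])
y = np.array([0] * 50 + [1] * 50)

for C in (0.01, 1, 100):
    clf = SVC(kernel="linear", C=C).fit(X, y)
    margin = 2 / np.linalg.norm(clf.coef_[0])
    print("C=%-6s  margin=%.3f  support vectors=%d  train accuracy=%.2f"
          % (C, margin, clf.n_support_.sum(), clf.score(X, y)))
# Small C: wider margin, more violations tolerated.
# Large C: narrower margin, the fit tries harder to classify every training point correctly.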
Q8. What is a kernel? How can a kernel be used with SVM to classify non-linearly separable data? Also, list standard kernel functions.
Ans: [10M | May]

KERNEL:
1. A kernel is a similarity function.
2. SVM algorithms use a set of mathematical functions that are defined as the kernel.
3. The function of the kernel is to take data as input and transform it into the required form.
4. It is a function that you provide to a machine learning algorithm.
5. It takes two inputs and spits out how similar they are.
6. Different SVM algorithms use different types of kernel functions.
7. For example: linear, nonlinear, polynomial, radial basis function (RBF), and sigmoid.

EXAMPLE EXPLAINING HOW A KERNEL CAN BE USED FOR CLASSIFYING NON-LINEARLY SEPARABLE DATA:
1. To predict if a dog is of a particular breed, we load in millions of dog properties/features like height, skin colour, body hair length etc.
2. In ML language, these properties are referred to as 'features'.
3. A single entry of this list of features is a data instance, while the collection of everything is the training data, which forms the basis of your prediction.
4. I.e. if you know the skin colour, body hair length, height and so on of a particular dog, then you can predict the breed it will probably belong to.
5. In support vector machines, this looks somewhat like figure 4.5, which separates the blue balls from the red.
6. Therefore the hyperplane of the two-dimensional space in figure 4.5 is a one-dimensional line dividing the red and blue dots.
7. From the example above of trying to predict the breed of a particular dog, it goes like this:
8. Data (all breeds of dog) + Features (skin colour, hair etc.) + Learning algorithm.
9. If we want to solve the example shown in figure 4.6 in a linear manner, it is not possible to separate the classes by a straight line as we did in the steps above.
10. The red and blue balls cannot be separated by a straight line as they are randomly distributed.
11. Here the kernel comes into the picture.
12. In machine learning, a "kernel" is usually used to refer to the kernel trick, a method of using a linear classifier to solve a non-linear problem.
13. It entails transforming linearly inseparable data (figure 4.6) into linearly separable data (figure 4.5).
14. The kernel function is what is applied to each data instance to map the original non-linear observations into a higher-dimensional space in which they become separable.
15. Using the dog breed prediction example again, kernels offer a better alternative.
16. Instead of defining a slew of features, you define a single kernel function to compute similarity between breeds of dog.
17. You provide this kernel, together with the data and labels, to the learning algorithm, and out comes a classifier.
18. Mathematical definition: K(x, y) = <f(x), f(y)>. Here K is the kernel function, x and y are n-dimensional inputs, f is a map from n-dimensional to m-dimensional space, and <x, y> denotes the dot product. Usually m is much larger than n.
24 Now, through all the trouble to get this one number question we have is: do we really need to. go 24, Do we really have to go to the m-dimensional space? No, if you find a clever kernel 22 (16,200,000 = Wi YoY function f(x) © (kok it, 4, XIX) Hake, Hoi, 9K), XX, XK), the Kernel is Kixy 20. 10 make this mor plug in some numbe intuitive: 28, Suppose x= (1,2, 49 © (45,6). Then: F 19 ,.2,5,2,4,6,5,6,9) F (/) > (16, 20, 24, 20, 25, 30, 24, 50, 36) F(0,F (y) = 16 + 40.172 + 40.4 100 + 180 + 72 + 180 + 324 = 1024 29. Now let us use the kernel instead: K bay) = (4410 + 1B) = 32° = 1024 40. Sarne result, but this calculation is so much easier. Q9. Quadratic Programming solution for finding maximum margin separation in Support Vector Machine. Ani 1. The linear programming model is a very powerful tool for the analysis of a wide variety [10M - Mayi6 problemsin th science industry, engineering, and business, 2. However, it does have its limits, 3. Notalll phenomena are linear. 4. Once nonlinearities enter the picture an LP model is at best only a first-order approximation, 5. The nowt level of complexity beyond linear programming is quadratic programming, 6. This model allows us to include nonlinearities of a quadratic nature into the objective function. 7. As we shall see this will be a useful tool for including the Markowitz mean-variance models ¢ uncertainty in the selection of optimal port-folios. 8. A quadratic program (QP) is an optimization problem wherein one either minimizes or maximizes? uadiatic objective function ofa finite number of decision variable subject tea finite number of ine! inequality and/or equality constraints. 9, A quadratic function of a finite number of variables x = (x1, %3....., &)" is any function of the form: mr: falas Yess + pe Doamsy 10. Using matrix notation, this expression sirnplifies to Y Handerafted by BackkBaiiche!s Publications chap ~ 4 | Support Voctor Machines worn ToppersSolutions.com 1 Me) 4s tn Bet Me t+ 5e7Qe where a 90 2 Hin ‘ 2 ant ge {| ™ ft Int ed om I. The factor of one half preceding the quadratic tern in the functi convenic 6 IL simplifies the expr sons for the first and secot 2 With no loss in generality, we may “well “ume that the reatriz Qe (QT Oem HatQe s Qh) = MEE 18 And so We are free to replace the matric @by tho symmetric mateke gagt Henceforth, we will ascurne hat the mattiz Q is oymenetne. 1 the QP standard form that we ase is, Q minimize xs bel Qe subject to Ar< bh, 0< x, where Ae I and be HE. Je Just as in the ease + of linear prog rimming, every quadratic prog standard form 1 Observed that we can have simplified the expression for the ob erm «sit {A plays no role in optimization step, Qi0. Explain how support Vector Machine can be used to find optimal hyperplane to classify linearly separable data. Give suitable example. ans: [10M | Decig] OPTIMAL HYPERPLANE: 1 Optimal hyperplane is completely defined by 5 oport vectors 2. the optimal hyperplane is the one which marimnizes the margin of the training data SUPPORT VECTOR MACHINE: | Asuypport vector machin is a supervised learning algorithrn that sons data into two egories 2 support vector mnachine is ako known ae support vector network (Sv) 3. ttistrained with a ries of data already classified into two categories, building the model as it is intially trained. 
4 An SVM outputs a map of the sorted data with the margins between the two as far apart as possible SVMs are used in text «: regorization, i \age classification, handwnting recognition and in the sciences, HYPERPLANE: hyperplane a generalization of a plane Inderafted by BackkBenchers Publications age SS of 102 AMET OPIRIORAS ONS a, Chap 4| Support Vector anes ee 2. Svaisare based on the idea of finding a hyperplar a dataget inter tues lawn ne that best divi 2 ‘ Re F . st. : 3. Asa simple example, for a classification task with only two features (Hike the irriatye aterye. 706 ies a set of dat think of a hyperplane as a line that lineatly separates and cla vill Uecitte thes elvan te 4. When new testing data is added, whatever side of the hyperplane it fa 1e assign tot 5S. Hyperplane: (wx) +b WeERN,bER Corresponding decision function: 7 Optimal hyperplane (maximal marain): all — Xl]: ER" s(w + x) + B= 0) with y.€ (41, “1p holes: y,= (ex) +b) > for alli =... ix) = san(v 0) +b) max.» mint Where, wand b not unique wand b can be scaled, so that |[w_ xi) + bl = for the x, closest to the hyperplane Canonical form (w, b) ef hyperplane, now holds: yo((we-x) + b)2 Hor all i Margin of optimal hyperplane in canonical form equal: Qu. _ Find optimal hyperplane for the data points: £01, 1, (251, (5M (2s WD, (4: OF (5, Wy (5, Mh (6, 03) Ans: fom - Deeté) Plotting given support vector points on a graph (In exam, you can draw it roughly in paper) We are assuming that the hyperplane will appear in graph in following region [as a l=—* $ os, ok 6 5001 T=% al. $ } as) Page rive Handcrafted by BackkBonchers Publications 4] Support Vector Machines hown beh 1s a es ° 4 - Ss ° re 6? a ° 8 o 1s will find augmented vectors by adding | as bias p , based on following three linear equation: B)+ (asx §x G24 b+ 8-8) Cb GE b+ @)-Clrb- Orb @)- Gh ‘Simplifying these equations, fay» 2.2) + (1% 1) (HDDS Ces 122) + TF DD + (ay 142) ORD HOD = fey ve f(2. 2) + (1 x 1) +L DDD Eee X12) + KD DD + fey 1G 8 294 OD EX DN fag (2264) + (1. 0) + (1 2 IDNDY Cen X12 VO DD 9 ly x14 A) + (OKO) HX DNA Weer, ted by BackkBenchers Publications page $7 of 02 Chap 4| Support Vector Machines ee 4a, + 6a + 9ety 9a, +93 + Vay _aag.and as = 35 =a ate postive c1ass from Neda iv inate pos 0 class. SO we vit By solving above equations we get To find hyperplane, we have to discrimi equation, w= Pas Putting values of a and § in above equation tO D> aS + arg tas fof We get, Now we will remove the bias point which we have added to support vectors to get augmented ven So we will use hyperplane equation which is y = wx + vere w= (2) as we have removed the bias point from it which s-5 And b = -3 which is bias point. We can write this as b + 3 = O also. So we will use w and b to plot the hyperplane on graph asw = (2) soit means it isan vertical line.tfitwere (?) then the line would be horizontal line, And b + 3 = 0,50/t means the hyperplane will go from point 3 as show below. 1s 1 | os -5| Learning with chap-5 | Learning with Classification www.ToppersSolutions.com CHAP - 5: LEARNING WITH CLASSIFICATION ay Explain with suitable example the advantages of Bayesian approach over classical approaches to probability. @ Explain classification using Bayesian Bellet Network with an example. gs Explain, in brief, Bayesian Belicf network. soe [10M | Deci6, Dect7 & Deci8} BAYESIAN BELIEF NETWORK: 1. 
ABayesian network is a graphical model of a situation Itrepresents a set of variables and the dependencies between them by using probability, The nodes in a Bayesian network represent the variables. ‘The directional arcs represent the dependencies between the variables. The direction of the arrows show the direction of the dependency. Each variable is associated with a conditional probability table (CPT). CPT gives the probability of this variable for different values of the variables on which this node depends, Pp ee Bo 8 Using this model, it is possible to perform inference and learning, 9. BBN provides a graphical model of casual relationship on which learning can be performed 10, We can use a trained Bayesian network for classification, 1. There aie two components that defines a BBN a. Directed Acyclic Graphs (DAG) b. Aset of Conditional Probability Tables (CPT) 12. As an example, consider the following scenario. Pablo travels by air,ithe is on an official visit. Ithe ison a personal visit, he travels by air if he has money. Ithe does not travel by plane, he travels by train but sometimes also takes a bus 3. The variables involved are: a Pablo travels by air (A) b. Goes on official visitlF) c. Pablo has money (M) d. Pablo travels by train (T) €. Pablo travels by bus (8) 14. This situation is converted into a belief network as shown in figure 5:1 below, 16 tn the graph, we ean see the dependencies with respect to the variables, 16. The probability values at a variable are dependent on the value of its parents 1. inthis case, the variable A is dependent on F and M 18. The variable T is dependent on A and variable Bis dependent on A. 18, The variables F and M are independent variables which do not have any parent node. 20.S0 their probabilities are not dependent on any other variable, 2. Node @ has the biggest conditional probability table as A depends on F andi M 2 Tand 8 depend on A landerafted by BackkBenchers Publications Bene baehioa
