Mathematics behind the Naive Bayes algorithm and its application

Rana singh · Jul 12 · 6 min read

[Figure: Naive Bayes classifier diagram. Source: https://insightimiwordpress.com/2020/04/04/naive-bayes]

Bayes' Theorem states that the conditional probability of an event, based on the occurrence of another event, is equal to the likelihood of the second event given the first event multiplied by the probability of the first event. Naive Bayes is a classification algorithm built on this probability-based technique.

$$P(A \mid B) = \frac{P(B \mid A)\, P(A)}{P(B)}$$

where P(A|B) is the posterior, P(B|A) the likelihood, P(A) the prior, and P(B) the evidence.

Conditional probability:

Conditional probability is defined as the likelihood of an event or outcome occurring, based on the occurrence of a previous event or outcome. For example: what is the probability that a rolled die shows a value less than 4, knowing that the value is an odd number? Formally, P(value < 4 | value is odd).

Examples at: https://en.wikipedia.org/wiki/Conditional_probability

Independent vs mutually exclusive events:

* If P(a|b) = P(a), or equivalently P(b|a) = P(b), then A and B are known as independent events.
* If P(a|b) = P(b|a) = 0, then A and B are known as mutually exclusive events.

In summary: mutually exclusive events cannot occur together, while independent events can have an intersection but the outcome of one cannot affect the other. Events (with non-zero probability) cannot be both mutually exclusive and independent.

Example at: https://en.wikipedia.org/wiki/Bayes%27_theorem
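To make the definitions above concrete, here is a minimal sketch (illustrative code added here, not from the original post; the helpers `prob` and `cond_prob` are hypothetical names) that checks the dice example and Bayes' theorem by brute-force enumeration:

```python
from fractions import Fraction

# Sample space for one roll of a fair six-sided die.
OUTCOMES = [1, 2, 3, 4, 5, 6]

def prob(event):
    # P(event) under a uniform roll: favourable outcomes / total outcomes.
    return Fraction(sum(1 for o in OUTCOMES if event(o)), len(OUTCOMES))

def cond_prob(a, b):
    # P(A | B) = P(A and B) / P(B).
    return prob(lambda o: a(o) and b(o)) / prob(b)

def less_than_4(o):  # event A
    return o < 4

def odd(o):          # event B
    return o % 2 == 1

# P(A | B): value is less than 4, knowing the value is odd -> 2/3.
print(cond_prob(less_than_4, odd))

# Bayes' theorem gives the same answer: P(A|B) = P(B|A) * P(A) / P(B).
assert cond_prob(less_than_4, odd) == \
       cond_prob(odd, less_than_4) * prob(less_than_4) / prob(odd)
```

Enumerating the sample space and applying Bayes' rule agree exactly, which is the whole point of the theorem: it lets us flip the direction of the conditioning.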
Naive Bayes classifier algorithm:

Assumptions made by Naive Bayes:

* features are assumed to be conditionally independent given the class
* each feature makes an equal contribution to the outcome

Probabilistic model:

Abstractly, naive Bayes is a conditional probability model: given a problem instance to be classified, represented by a vector X = (x_1, x_2, ..., x_n) of n features (independent variables), it assigns to this instance the probabilities P(C_k | x_1, x_2, ..., x_n) for each possible class C_k.

The problem with the above formulation is that if the number of features n is large, or if a feature can take on a large number of values, then basing such a model on probability tables is infeasible. Using Bayesian probability terminology, the equation can be written as

$$P(C_k \mid \mathbf{x}) = \frac{P(C_k)\, P(\mathbf{x} \mid C_k)}{P(\mathbf{x})}, \qquad \text{posterior} = \frac{\text{prior} \times \text{likelihood}}{\text{evidence}}$$

In practice, there is interest only in the numerator of that fraction, because the denominator does not depend on C and the values of the features x_i are given, so the denominator is effectively constant.

The numerator is equivalent to the joint probability model P(C_k, x_1, ..., x_n), which can be rewritten as follows, using the chain rule for repeated applications of the definition of conditional probability:

$$
\begin{aligned}
p(C_k, x_1, \ldots, x_n) &= p(x_1, \ldots, x_n, C_k) \\
&= p(x_1 \mid x_2, \ldots, x_n, C_k)\, p(x_2, \ldots, x_n, C_k) \\
&= p(x_1 \mid x_2, \ldots, x_n, C_k)\, p(x_2 \mid x_3, \ldots, x_n, C_k)\, p(x_3, \ldots, x_n, C_k) \\
&= \cdots \\
&= p(x_1 \mid x_2, \ldots, x_n, C_k) \cdots p(x_{n-1} \mid x_n, C_k)\, p(x_n \mid C_k)\, p(C_k)
\end{aligned}
$$

(https://en.wikipedia.org/wiki/Naive_Bayes_classifier)

Now the "naive" conditional independence assumptions come into play: assume that all features x_i are mutually independent, conditional on the category C_k. Under this assumption,

$$p(x_i \mid x_{i+1}, \ldots, x_n, C_k) = p(x_i \mid C_k)$$

Thus, the joint model can be expressed as

$$p(C_k \mid x_1, \ldots, x_n) \propto p(C_k, x_1, \ldots, x_n) \propto p(C_k) \prod_{i=1}^{n} p(x_i \mid C_k)$$

This means that under the above independence assumptions, the conditional distribution over the class variable C is

$$p(C_k \mid x_1, \ldots, x_n) = \frac{1}{Z}\, p(C_k) \prod_{i=1}^{n} p(x_i \mid C_k)$$

where the evidence Z = p(x) = Σ_k p(C_k) p(x | C_k) is a scaling factor dependent only on x_1, ..., x_n.

Constructing a classifier from the probability model:

The corresponding classifier, a Bayes classifier, is the function that assigns a class label ŷ = C_k for some k as follows:

$$\hat{y} = \underset{k \in \{1, \ldots, K\}}{\operatorname{argmax}}\; p(C_k) \prod_{i=1}^{n} p(x_i \mid C_k)$$

The classification may be multiclass; in that case we pick the class variable y with the maximum posterior probability.

Refer to this blog for an example: http://shatterline.com/blog/

Time and space complexity:

Training:

* time complexity: O(n·d)
* space complexity: O(d·c)

where n is the number of training examples, d is the number of features, and c is the number of classes. We store only probabilities, so the model is very memory-efficient at run time.

Note:

* Applications: spam filtering, classifying reviews as positive or negative — mainly text classification problems.
* Naive Bayes is considered a benchmark (baseline) model, especially for text classification.

Laplace smoothing in the Naive Bayes algorithm (use alpha = 1):

What if a word in a test review was not present in the training dataset?

Query review = w1 w2 w3 w'

In the likelihood table, we have P(w1 | positive), P(w2 | positive), P(w3 | positive), and P(positive). Oh, wait, but where is P(w' | positive)?

In a bag-of-words model, we count the occurrences of words. The occurrences of word w' in training are 0, so its estimated likelihood is 0, which zeroes out the entire product.

Laplace smoothing:

Laplace smoothing is a smoothing technique that handles the problem of zero probability in Naive Bayes. Using Laplace smoothing, we can represent P(w' | positive) as

$$P(w' \mid \text{positive}) = \frac{(\text{number of reviews with } w' \text{ and } y = \text{positive}) + \alpha}{N + \alpha K}$$

Here, alpha represents the smoothing parameter, K represents the number of dimensions (features) in the data, and N represents the number of reviews with y = positive. If we choose a value of alpha ≠ 0, the probability will no longer be zero even if a word is not present in the training dataset.
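As a quick illustration of this formula, here is a minimal sketch (the toy corpus and function name are hypothetical, invented for this example) that computes smoothed likelihoods for seen and unseen words:

```python
# Toy corpus of positive reviews (hypothetical data, for illustration only).
positive_reviews = ["good movie", "great good acting", "great plot"]

N = len(positive_reviews)                                 # reviews with y=positive
vocab = {w for r in positive_reviews for w in r.split()}  # feature dimensions
K = len(vocab)

def p_word_given_positive(word, alpha=1):
    # (number of positive reviews containing the word + alpha) / (N + alpha * K)
    count = sum(word in r.split() for r in positive_reviews)
    return (count + alpha) / (N + alpha * K)

print(p_word_given_positive("good"))      # seen word: 3/8 = 0.375
print(p_word_given_positive("terrible"))  # unseen word w': 1/8 = 0.125, not zero
```

With alpha = 0 the unseen word would get probability 0 and zero out the whole product; with alpha = 1 it gets a small but non-zero likelihood.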
Using higher alpha values pushes the likelihood of a word towards a probability of 0.5 for both the positive and negative classes. Since we then get little information from that word, large alpha is not preferable; therefore, it is preferred to use alpha = 1. As alpha increases, the likelihood probabilities move towards a uniform distribution.

Note: smoothing should be applied to all likelihood estimates, not only to words missing from the training data.

Bias and variance tradeoff:

The Laplace smoothing parameter alpha determines underfitting and overfitting.

* Case 1: alpha = 0 (or very small). A small change in the data causes a large change in the model — high variance, i.e. overfitting.
* Case 2: alpha very large. The likelihoods are pushed towards the uniform distribution and the model ignores the data — high bias, i.e. underfitting.

So, to find an appropriate alpha (a hyperparameter), use simple or 10-fold cross-validation.

Impact of imbalanced data on NB:

NB is impacted by imbalanced data. In the decision rule

$$\hat{y} = \underset{k}{\operatorname{argmax}}\; p(C_k) \prod_{i=1}^{n} p(x_i \mid C_k)$$

suppose the likelihood products are roughly the same for both classes, but the class priors are P(y=1) = 0.8 and P(y=0) = 0.2. The prediction then depends on the prior probabilities p(y=1) and p(y=0), so the majority (dominating) class has an advantage.

Solutions:

* upsampling or downsampling
* drop the prior probability terms p(y=1), p(y=0)
* use a different alpha for each class, because the same alpha impacts the minority class more than the majority class

Handling outliers in NB:

1. Ignore (remove) an outlier from the training/test data if its frequency is small (occurring fewer than, say, 10 times).
2. Use the Laplace smoothing hyperparameter.

Missing value treatment in NB:

1. Case 1: text data — there is no real case of missing data.
2. Case 2: categorical features — consider NaN itself as the missing category.
3. Case 3: numerical features — use feature imputation (for numerical features we generally use Gaussian NB).

Numerical stability issue: all probability values are between 0 and 1, so after multiplying many of them the product becomes insignificantly small (underflow). Use log probabilities instead.

Types of Naive Bayes classifier:

Multinomial Naive Bayes: this is mostly used for document classification problems, i.e. whether a document belongs to the category of sports, politics, technology, etc. The features/predictors used by the classifier are the frequencies of the words present in the document.

Bernoulli Naive Bayes: this is similar to multinomial naive Bayes, but the predictors are boolean variables. The parameters that we use to predict the class variable take up only the values yes or no, for example whether a word occurs in the text or not.

Gaussian Naive Bayes: when the predictors take up a continuous value and are not discrete, we assume that these values are sampled from a Gaussian (normal) distribution. Since the way the values are distributed changes, the formula for the conditional probability changes to

$$P(x_i \mid y) = \frac{1}{\sqrt{2\pi\sigma_y^2}} \exp\!\left(-\frac{(x_i - \mu_y)^2}{2\sigma_y^2}\right)$$
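Pulling the pieces above together — the MAP decision rule, Laplace smoothing, and log probabilities to avoid underflow — here is a minimal from-scratch sketch of a multinomial-style text classifier (the toy corpus and `predict` helper are hypothetical, for illustration only, not the post's code):

```python
import math
from collections import Counter, defaultdict

# Tiny labelled corpus (hypothetical data, for illustration only).
train = [
    ("positive", "good great movie"),
    ("positive", "great acting good plot"),
    ("negative", "bad boring movie"),
    ("negative", "awful bad plot"),
]

alpha = 1.0
vocab = {w for _, text in train for w in text.split()}
word_counts = defaultdict(Counter)                    # per-class word counts
class_counts = Counter(label for label, _ in train)   # class frequencies (priors)
for label, text in train:
    word_counts[label].update(text.split())

def predict(text):
    # MAP rule: argmax_k [ log p(Ck) + sum_i log p(w_i | Ck) ],
    # computed in log space so long documents do not underflow to 0.
    best_label, best_score = None, -math.inf
    for label in class_counts:
        score = math.log(class_counts[label] / sum(class_counts.values()))
        total = sum(word_counts[label].values())
        for w in text.split():
            # Laplace-smoothed likelihood: never exactly zero.
            score += math.log((word_counts[label][w] + alpha)
                              / (total + alpha * len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

print(predict("good plot"))   # expected: positive
print(predict("boring bad"))  # expected: negative
```

Because the argmax is unaffected by the constant evidence term Z, the sketch simply omits it, exactly as the derivation above suggests.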
Note: NB cannot be used with distance/similarity matrices the way KNN is. NB is not a distance-based method; it is a probability-based method.

Best and worst case:

Best case:

* high interpretability
* low run-time and train-time complexity
* low run-time space

Worst case:

* it can easily overfit, so use Laplace smoothing to avoid this

Reference: https://scikit-learn.org/stable/modules/naive_bayes.html
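As an end-to-end illustration with the referenced library, a minimal sketch (standard scikit-learn API; the dataset choice is mine, not the post's). GaussianNB fits continuous features, while MultinomialNB and BernoulliNB — whose `alpha` parameter implements the Laplace smoothing discussed above — are the analogous choices for count and boolean features:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

# Iris has continuous features, so Gaussian NB is the natural variant.
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = GaussianNB()
model.fit(X_train, y_train)
print(model.score(X_test, y_test))  # mean accuracy on held-out data
```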
