
Q1. Explain the working principle of a Naïve Bayes Text Classifier.

Ans: The Naive Bayes (NB) model is easy to build and is particularly useful for very large data sets. In text classification, our goal is to find the best class for a document. The best class in NB classification is the most likely, or maximum a posteriori (MAP), class c_map:
c_map = argmax_{c ∈ C} P(c|d) = argmax_{c ∈ C} [P(d|c) P(c) / P(d)] = argmax_{c ∈ C} P(d|c) P(c)

Here Bayes’s rule is applied, and we drop the denominator P(d) in the last step because P(d) is the same for all classes and does not affect the argmax.

The conditional distribution P(d|c) is

P(d|c) = P((t_1, …, t_k, …, t_nd) | c)

where (t_1, …, t_k, …, t_nd) is the sequence of terms as it occurs in d.
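
As a sketch of the MAP rule in code: the priors and per-term likelihoods below are the smoothed values from the China example in Q2 (treat them as assumed inputs), and log probabilities are used because multiplying many small probabilities underflows in floating point.

# A minimal sketch of the MAP decision rule, using the smoothed priors
# and per-term likelihoods from the China example in Q2 (assumed values).
import math

priors = {"China": 3 / 4, "not-China": 1 / 4}
likelihoods = {
    "China":     {"chinese": 3 / 7, "tokyo": 1 / 14, "japan": 1 / 14},
    "not-China": {"chinese": 2 / 9, "tokyo": 2 / 9,  "japan": 2 / 9},
}
doc = ["chinese", "chinese", "chinese", "tokyo", "japan"]

# c_map = argmax_c [ log P(c) + sum_k log P(t_k | c) ]; summing logs
# instead of multiplying probabilities avoids numerical underflow.
scores = {
    c: math.log(priors[c]) + sum(math.log(likelihoods[c][t]) for t in doc)
    for c in priors
}
print(max(scores, key=scores.get))  # -> China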

Q2. Consider the text classification example of China class. Why the
probabilities found in the pen-and-paper computation differ from the
probability generated by the scikit-learn MultinomialNB classifier.
Ans:
P(c|d5) ∝ 3/4 · (3/7)^3 · 1/14 · 1/14 ≈ 0.0003

P(c̄|d5) ∝ 1/4 · (2/9)^3 · 2/9 · 2/9 ≈ 0.0001

These hand-computed values do not match the probabilities printed by the MultinomialNB classifier, because the pen-and-paper values are only proportional to the posterior: we dropped the constant denominator P(d), so the two scores do not sum to 1.
By the law of total probability, the probabilities of an event happening and not happening must sum to 1, and scikit-learn's predict_proba enforces this by normalizing the class scores.
Normalizing the unrounded hand-computed scores the same way, 0.000301 / (0.000301 + 0.000136) ≈ 0.69 for China and ≈ 0.31 for not-China, reproduces the classifier's output.
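
The following is a minimal sketch that reproduces this in scikit-learn. The four training documents and their labels are inferred from the count matrix shown in Q4 and the priors used above (3 China documents, 1 not-China), so treat the exact strings as an assumption.

# A minimal sketch of the China example, assuming these four training
# documents (strings inferred from the count matrix shown in Q4).
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

train_docs = [
    "Chinese Macao",             # document 0, class China
    "Chinese Tokyo Japan",       # document 1, class not-China
    "Chinese Chinese Shanghai",  # document 2, class China
    "Chinese Beijing Chinese",   # document 3, class China
]
y_train = ["China", "not-China", "China", "China"]

count_vect = CountVectorizer()
X_train = count_vect.fit_transform(train_docs)

clf = MultinomialNB()  # Laplace smoothing (alpha=1.0) by default
clf.fit(X_train, y_train)

X_test = count_vect.transform(["Chinese Chinese Chinese Tokyo Japan"])
print(clf.predict_proba(X_test))
# roughly [[0.69, 0.31]] for classes ['China', 'not-China']: the
# unnormalized scores 0.000301 and 0.000136, rescaled to sum to 1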
Q3. What does the Python shape function provide in general? What is it giving in our text classification example? Explain with the statements containing X_train_counts.shape, X_train_tf.shape, and X_train_tfidf.shape. Why are all of them showing the same values?
Ans: The function "shape" returns the shape of an array. The shape is a tuple of
integers. These numbers denote the lengths of the corresponding array dimension.
Shape function giving – number of dimensions and number elements in each
dimension.
Here, X_train_counts.shape =(4, 6)
X_train_tf.shape = (4, 6)
X_train_tf.shape = (4, 6)
number of documents = 4 and vocabulary size = 6 of each document.
That’s why all of them giving the same values
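
A small sketch of these three statements, following the pipeline from the scikit-learn text tutorial (X_train_tf built with use_idf=False, X_train_tfidf with the defaults); the document strings are the assumed ones from the China example:

# Sketch of the three .shape checks, assuming the same four training
# documents as in the China example above.
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

train_docs = [
    "Chinese Macao",
    "Chinese Tokyo Japan",
    "Chinese Chinese Shanghai",
    "Chinese Beijing Chinese",
]

count_vect = CountVectorizer()
X_train_counts = count_vect.fit_transform(train_docs)

tf_transformer = TfidfTransformer(use_idf=False).fit(X_train_counts)
X_train_tf = tf_transformer.transform(X_train_counts)

tfidf_transformer = TfidfTransformer()
X_train_tfidf = tfidf_transformer.fit_transform(X_train_counts)

print(X_train_counts.shape)  # (4, 6): 4 documents, 6 vocabulary terms
print(X_train_tf.shape)      # (4, 6): same entries, re-weighted
print(X_train_tfidf.shape)   # (4, 6): same entries, re-weighted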

Q4. Read the scikit-learn manual for understanding. Explain how the
CountVectorizer() method works. Use the results of
count_vect.get_feature_names(), X_train_counts.toarray(), and X_train_counts in
your explanation.

Ans: Scikit-learn’s CountVectorizer is used to convert a collection of text documents to a matrix of term/token counts, one count vector per document. It also enables pre-processing of the text data (such as lowercasing and tokenization) prior to generating the vector representation. This functionality makes it a highly flexible feature representation module for text.
count_vect.get_feature_names() returns all the tokens in the vocabulary, in sorted order. They are:
'beijing'
'chinese'
'japan'
'macao'
'shanghai'
'tokyo'
X_train_counts.toarray() gives the dense vector representation, one row per document and one column per vocabulary term:
[[0 1 0 1 0 0]
 [0 1 1 0 0 1]
 [0 2 0 0 1 0]
 [1 2 0 0 0 0]]
Printing X_train_counts itself shows the underlying sparse matrix as (document, term) index pairs with their counts:
(0, 1) 1 - in document 0, term 1 (chinese) occurs 1 time.
(0, 3) 1 - in document 0, term 3 (macao) occurs 1 time.
(1, 1) 1 - in document 1, term 1 (chinese) occurs 1 time.
(1, 5) 1 - in document 1, term 5 (tokyo) occurs 1 time.
(1, 2) 1 - in document 1, term 2 (japan) occurs 1 time.
(2, 1) 2 - in document 2, term 1 (chinese) occurs 2 times.
(2, 4) 1 - in document 2, term 4 (shanghai) occurs 1 time.
(3, 1) 2 - in document 3, term 1 (chinese) occurs 2 times.
(3, 0) 1 - in document 3, term 0 (beijing) occurs 1 time.
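
A minimal sketch that reproduces these three outputs; note that get_feature_names() was renamed get_feature_names_out() in newer scikit-learn releases, and the document strings are the assumed ones from the China example:

# Sketch reproducing the three CountVectorizer outputs discussed above,
# assuming the four training documents of the China example.
from sklearn.feature_extraction.text import CountVectorizer

train_docs = [
    "Chinese Macao",
    "Chinese Tokyo Japan",
    "Chinese Chinese Shanghai",
    "Chinese Beijing Chinese",
]

count_vect = CountVectorizer()  # lowercases and tokenizes by default
X_train_counts = count_vect.fit_transform(train_docs)

print(count_vect.get_feature_names_out())
# ['beijing' 'chinese' 'japan' 'macao' 'shanghai' 'tokyo']

print(X_train_counts.toarray())  # dense document-term count matrix

print(X_train_counts)            # sparse (doc, term) -> count triples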

Q5. Explain how the TfidfTransformer() works. Show the computation of how the TF and TF-IDF results come up in the output of X_train_tf.toarray() and X_train_tfidf.toarray().

Ans:
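With scikit-learn's default settings (norm='l2', use_idf=True, smooth_idf=True), TfidfTransformer re-weights the count matrix in two steps: each column is multiplied by a smoothed inverse document frequency, idf(t) = ln((1 + n) / (1 + df(t))) + 1, where n is the number of documents and df(t) is the number of documents containing term t; then each row is normalized to unit L2 length. X_train_tf (built with use_idf=False) is therefore just the count matrix with each row rescaled to unit length. Below is a minimal sketch of this computation, using the assumed four documents from the China example and checking the hand computation against scikit-learn's own output:

# A sketch of TfidfTransformer's default computation (norm='l2',
# use_idf=True, smooth_idf=True), checked against scikit-learn itself;
# the four documents are the assumed ones from the China example.
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

train_docs = [
    "Chinese Macao",
    "Chinese Tokyo Japan",
    "Chinese Chinese Shanghai",
    "Chinese Beijing Chinese",
]
X_counts = CountVectorizer().fit_transform(train_docs)
counts = X_counts.toarray().astype(float)
n_docs = counts.shape[0]

# TF (use_idf=False): each row of raw counts is scaled to unit L2 norm.
tf = counts / np.linalg.norm(counts, axis=1, keepdims=True)
sk_tf = TfidfTransformer(use_idf=False).fit_transform(X_counts).toarray()
print(np.allclose(tf, sk_tf))  # True: matches X_train_tf.toarray()

# TF-IDF: smoothed idf(t) = ln((1 + n) / (1 + df(t))) + 1, where df(t)
# counts the documents containing term t; multiply, then L2-normalize.
df = (counts > 0).sum(axis=0)
idf = np.log((1.0 + n_docs) / (1.0 + df)) + 1.0
tfidf = counts * idf
tfidf /= np.linalg.norm(tfidf, axis=1, keepdims=True)
sk_tfidf = TfidfTransformer().fit_transform(X_counts).toarray()
print(np.allclose(tfidf, sk_tfidf))  # True: matches X_train_tfidf.toarray()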
