
Unstructured Data Classification

Why this course?


This course gives you practical experience in solving unstructured text classification problems.
If you're wondering why unstructured text matters:
"80% of business-relevant information originates in unstructured form, primarily text,"
says Seth Grimes, a leading analytics strategy consultant.
What Would you Need to Follow Along?
 Have a basic understanding of machine learning concepts.

 Try out the code snippets given for the case study.

 Refer to the links to gain an in-depth understanding of other machine learning techniques.

"Programming is usually taught by examples" -Niklaus Wirth


Introduction
Unstructured data, as the name suggests, does not have a pre-defined structure, even though it may contain data such as dates, numbers, or facts.
This results in irregularities and ambiguities that make it difficult to understand using traditional programs, compared with data stored in fielded form in databases or annotated (semantically tagged) in documents.
Source: Wikipedia.

A few examples of unstructured data are:


 Emails

 Word Processing Files

 PDF files

 Spreadsheets

 Digital Images

 Video

 Audio

 Social Media Posts etc.

Identify the unstructured data from the following:
Excel Data | Image | Data from mySQL DB

Problem Description
Let us understand unstructured data classification through the following case study:
SMS Spam Detection:
In our day-to-day lives, we receive a large number of spam/junk messages, either as text (SMS) or e-mail. It is important to filter these spam messages since they are not truthful or trustworthy.
In this case study, we apply various machine learning algorithms to categorize the messages
depending on whether they are spam or not.
Your Playground
You can try the hands-on exercises using Katacoda or by setting up the coding environment on your local machine.
For Katacoda Users:
 Open the link : https://www.katacoda.com/courses/python/playground

 Type the terminal commands in the pane below.

 You can use the Python editor (by default you have app.py file) for trying out the code snippets
given in this course.

 You can execute the Python code by clicking the Run command from the left pane.

Your Playground...
Note: In case you don't find any of the required packages while playing around with the case
study, you can do the following :
 pip install nltk --target=./ . Here, nltk is an example of a package you need to download.

 For NLTK, you have a few other dependent packages. You can perform the following steps to
download them :
o Open the Python terminal in the command prompt (type python).

o Type import nltk

o Type nltk.download()

o Type d for download

o Type all to download all dependent packages of NLTK.

Setup Your Local Machine


To run the code locally:
 Install Python 2.7+ on your machine.

 Install the required packages - Pandas, Sklearn, NumPy (use pip install).


 Use any IDE (PyCharm, Spyder etc.) for trying out the code snippets.

Note: You can find brief descriptions of the python packages here.
Dataset Download
The dataset is available at the SMS Spam dataset link.
Open the terminal and type the following command to download it:
curl https://www.researchgate.net/profile/Tiago_Almeida4/publication/258050002_SMS_Spam_Collection_v1/data/00b7d526d127ded162000000/SMSSpamCollection.txt > dataset.csv
This command downloads the data and saves it as dataset.csv.
Dataset Description
The dataset contains SMS messages labeled as spam or ham (legitimate).
The following is a description of our dataset:
 No. of Classes: 2 (Spam / Ham)

 No. of attributes (Columns): 2

 No. of instances (Rows) : 5574

Data Loading
To start with data loading, import the required python package and load the downloaded CSV file.
The data can be stored as dataframe for easy data manipulation/analysis. Pandas is one of the
most widely used libraries for this.
import pandas as pd
import csv
#Data Loading
messages = [line.rstrip() for line in open('dataset.csv')]
print(len(messages))
#Appending column headers
messages = pd.read_csv('dataset.csv', sep='\t', quoting=csv.QUOTE_NONE, names=["label", "message"])
As you can see, our dataset has 2 columns without any headers.
This code snippet reads the data using pandas and labels the column names as label and
message.

Data Analysis
Analyzing data is a must in any classification problem. The goal of data analysis is to derive
useful information from the given data for making decisions.
In this section, we will analyze the dataset in terms of its size and headers, and view a summary and a sample of the data.
You can see the dataset size using :
data_size=messages.shape
print(data_size)
Column names can be viewed by :
messages_col_names=list(messages.columns)
print(messages_col_names)
To understand aggregate statistics easily, use the following command :
print(messages.groupby('label').describe())
To see a sample data, use the following command :
print(messages.head(3))

Target Identification
Target is the class/category to which you will assign the data.
 In this case, you aim to identify whether the message is spam or not.

 By observing the columns, the label column has the values spam or ham. We can call this case study a Binary Classification, since it has only two possible outcomes.

#Identifying the outcome/target variable.


message_target=messages['label']
print(message_target)
What kind of classification is our case study 'Spam Detection'?
Binary | Multi class | Multi label

Tokenization
Tokenization is a method to split a sentence/string into substrings. These substrings are called
tokens.
In Natural Language Processing (NLP), tokenization is the initial step. Splitting a sentence into tokens helps to remove unwanted information in the raw text, such as white spaces, line breaks, and so on.
import nltk
from nltk.tokenize import word_tokenize
def split_tokens(message):
    message = message.lower()
    message = unicode(message, 'utf8')  #convert bytes into proper unicode (Python 2)
    word_tokens = word_tokenize(message)
    return word_tokens
messages['tokenized_message'] = messages.apply(lambda row: split_tokens(row['message']), axis=1)

Lemmatization
 Lemmatization is a method to convert a word into its base/root form.

 Lemmatizer removes affixes of the words present in its dictionary. A sketch of this step is given below.
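A minimal sketch of this step using NLTK's WordNetLemmatizer (assuming the tokenized_message column created in the previous step; the stop word removal step that follows reads from the lemmatized_message column this creates):
from nltk.stem import WordNetLemmatizer
def lemmatize_tokens(message):
    lemmatizer = WordNetLemmatizer()
    #convert each token to its base/root form
    return [lemmatizer.lemmatize(word) for word in message]
messages['lemmatized_message'] = messages.apply(lambda row: lemmatize_tokens(row['tokenized_message']), axis=1)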

Stop Word Removal


Stop words are common words that do not add any relevance for classification (e.g., "the", "a", "an", "in"). Hence, it is essential to remove these words.
from nltk.corpus import stopwords
def stopword_removal(message):
    stop_words = set(stopwords.words('english'))
    filtered_sentence = ' '.join([word for word in message if word not in stop_words])
    return filtered_sentence
messages['preprocessed_message'] = messages.apply(lambda row: stopword_removal(row['lemmatized_message']), axis=1)
Training_data=pd.Series(list(messages['preprocessed_message']))
Training_label=pd.Series(list(messages['label']))

Why is Feature Extraction Important?


To perform machine learning on text documents, you first need to turn the text content into
numerical feature vectors.
In Python, you have a few packages defined under sklearn.
We will be looking into a few specific ones used for unstructured data.
Bag Of Words(BOW)
 Bag of Words (BOW) is one of the most widely used methods for generating features in
Natural Language Processing.

 Representing/Transforming a text into a bag of words helps to identify various measures to characterize the text.

 Predominantly used for calculating the term (word) frequency, i.e., the number of times a term occurs in a document/sentence.

 It can be used as a feature for training the classifier.

Term Document Matrix


 The Term Document Matrix (TDM) is a matrix that contains the frequency of occurrence of
terms in a collection of documents.

 In a TDM, the rows represent documents and columns represent the terms.

from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

tf_vectorizer = CountVectorizer(ngram_range=(1, 2), min_df=(1/len(Training_label)), max_df=0.7)
Total_Dictionary_TDM = tf_vectorizer.fit(Training_data)
message_data_TDM = Total_Dictionary_TDM.transform(Training_data)
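As a quick sanity check (our suggestion, not part of the original snippet), you can inspect the size of the resulting matrix and a few vocabulary entries:
print(message_data_TDM.shape)  #(number of messages, number of terms)
print(list(Total_Dictionary_TDM.vocabulary_.items())[:5])  #sample term-to-index mappings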

Term Frequency Inverse Document Frequency (TFIDF)
 In a Term Frequency Inverse Document Frequency (TFIDF) matrix, the term importance is
expressed by Inverse Document Frequency (IDF).

 IDF diminishes the weight of the most commonly occurring words and increases the weightage
of rare words.

from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

tfidf_vectorizer = TfidfVectorizer(ngram_range=(1, 2), min_df=(1/len(Training_label)), max_df=0.7)
Total_Dictionary_TFIDF = tfidf_vectorizer.fit(Training_data)
message_data_TFIDF = Total_Dictionary_TFIDF.transform(Training_data)
Let's take the TDM matrix for further evaluation. You can also try out the same using the TFIDF matrix.
Which preprocessing technique is used to remove the most commonly used words?
Tokenization | Lemmatization | Stopword removal
Classification Algorithms
There are various algorithms to solve classification problems. The code to try out a few of these algorithms is presented in the upcoming cards.
We will discuss the following :
 Decision Tree Classifier

 Stochastic Gradient Descent Classifier

 Support Vector Machine Classifier

 Random Forest Classifier

Note: The explanation for these algorithms is given in the Machine Learning Axioms course. Refer to the course for further details.
How Does a Classifier Work?
The following are the steps involved in building a classification model:
1. Initialize the classifier to be used.
2. Train the classifier - All classifiers in scikit-learn use a fit(X, y) method to fit (train) the model for the given train data X and train label y.

3. Predict the target - Given an unlabeled observation X, predict(X) returns the predicted label y.

4. Evaluate the classifier model - The score(X, y) method returns the score for the given test data X and test label y.

Train and Test Data


The code snippet below partitions the data into train and test sets for building the classifier model. This split will be used to explain the classification algorithms.
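A minimal sketch using the older sklearn.cross_validation API this course uses elsewhere (newer scikit-learn versions expose the same train_test_split from sklearn.model_selection; the 10% test size and seed are our assumptions, chosen to match the values used later):
from sklearn.cross_validation import train_test_split
#hold out 10% of the TDM features and labels for testing
train_data, test_data, train_label, test_label = train_test_split(
    message_data_TDM, Training_label, test_size=0.1, random_state=7)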
Decision Tree Classification
 It is one of the commonly used classification techniques for performing binary as well as multi-class classification.

 The decision tree model predicts the class/target by learning simple decision rules from the features of the data, as the sketch below shows.
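A minimal sketch, assuming the train/test split shown above:
from sklearn.tree import DecisionTreeClassifier
classifier = DecisionTreeClassifier()
classifier = classifier.fit(train_data, train_label)  #train
message_predicted_target = classifier.predict(test_data)  #predict
print(classifier.score(test_data, test_label))  #evaluate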

Stochastic Gradient Descent Classifier


 It is used for large-scale learning.

 It supports different loss functions & penalties for classification (see the sketch below).
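Only the initialization differs from the Decision Tree sketch; the parameters here match the ones used later in the cross-validation snippet:
from sklearn.linear_model import SGDClassifier
classifier = SGDClassifier(loss='modified_huber', shuffle=True)
classifier.fit(train_data, train_label)
print(classifier.score(test_data, test_label))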

Support Vector Machine


 Support Vector Machine (SVM) is effective in high-dimensional spaces.

 It is effective in cases where the number of dimensions is greater than the number of samples.

 It works well with a clear margin of separation (see the sketch below).
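A sketch with the kernel and C value used later in the cross-validation snippet:
from sklearn.svm import SVC
classifier = SVC(kernel="linear", C=0.025)
classifier.fit(train_data, train_label)
print(classifier.score(test_data, test_label))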


Random Forest Classifier
 Controls overfitting.

 Here, a random forest fits a number of decision tree classifiers on various sub-samples of the dataset and uses averaging to improve the predictive accuracy (see the sketch below).
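A sketch with the parameters used later in this course:
from sklearn.ensemble import RandomForestClassifier
classifier = RandomForestClassifier(max_depth=5, n_estimators=10, max_features=10)
classifier.fit(train_data, train_label)
print(classifier.score(test_data, test_label))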

Model Tuning
The classification algorithms in machine learning are parameterized. Modifying any of those parameters can influence the results, so algorithm/model tuning is essential to find the best model.
For example, let's take the Random Forest Classifier and change the values of a few parameters (n_estimators, max_features).
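One possible way to try this (a sketch; the grid of values below is our illustration, not prescribed by the course):
from sklearn.ensemble import RandomForestClassifier
for n_est in [10, 50, 100]:
    for max_feat in ['sqrt', 'log2']:
        clf = RandomForestClassifier(n_estimators=n_est, max_features=max_feat)
        clf.fit(train_data, train_label)
        print(n_est, max_feat, clf.score(test_data, test_label))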
Partitioning the Data
It is a methodological mistake to test and train on the same dataset: the classifier could simply memorize the training samples, score well on them, and still fail to predict correctly for unseen data. This is known as overfitting.
To avoid this problem,
 Split the data to train set, validation set and test set.
o Training Set: The data used to train the classifier.

o Validation Set: The data used to tune the classifier model parameters i.e., to understand how
well the model has been trained (a part of training data).

o Testing Set: The data used to evaluate the performance of the classifier (unseen data by the
classifier).
 This will help you know the efficiency of your model.

Cross Validation
 Cross validation is a model validation technique to evaluate the performance of a model on
unseen data (validation set).

 It gives a better estimate of accuracy on unseen data than the training accuracy does.

Points to remember:
 Cross validation gives high variance if the testing set and training set are not drawn from the same population.

 Allowing training data to be included in testing data will not give actual performance results.

In cross validation, the number of samples used for training the model is reduced and the results
depend on the choice of the pair of training and testing sets.
You can refer to the various CV approaches here.
Stratified Shuffle Split
The StratifiedShuffleSplit splits the data randomly while preserving the percentage of samples from each class.
StratifiedShuffleSplit suits our case study, as the dataset has a class imbalance, which can be seen from the following code snippet:
seed = 7
from sklearn.cross_validation import StratifiedShuffleSplit
#creating cross validation object with 10% test size
cross_val = StratifiedShuffleSplit(Training_label, 1, test_size=0.1, random_state=seed)
test_size=0.1 denotes that 10% of the dataset is used for testing.

Stratified Shuffle Split Contd...


This selection is then used to split the data into test and train sets.
#imports for all classifiers used below (some were missing in the original snippet)
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import SGDClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import SVC
from sklearn import svm
classifiers = [
    DecisionTreeClassifier(),
    SGDClassifier(loss='modified_huber', shuffle=True),
    SVC(kernel="linear", C=0.025),
    KNeighborsClassifier(),
    OneVsRestClassifier(svm.LinearSVC()),
    RandomForestClassifier(max_depth=5, n_estimators=10, max_features=10),
]
for clf in classifiers:
    score = 0
    for train_index, test_index in cross_val:
        X_train, X_test = message_data_TDM[train_index], message_data_TDM[test_index]
        y_train, y_test = Training_label[train_index], Training_label[test_index]
        clf.fit(X_train, y_train)
        score = score + clf.score(X_test, y_test)
    print(score)
The above code runs a collection of classifiers through cross validation. It helps to select the best classifier based on the cross validation scores. The classifier with the highest score can be used for building the classification model.
Note: You may add or remove classifiers based on the requirement.
Cross-validation technique is used to evaluate a classifier by dividing the data set
into training set to train the classifier and testing set to test the same.
TRUE or FALSE
Classification Accuracy
 The classification accuracy is defined as the percentage of correct predictions.

from sklearn.metrics import accuracy_score

classifier = classifier.fit(train_data, train_label)
message_predicted_target = classifier.predict(test_data)
score = classifier.score(test_data, test_label)
print('Accuracy Score', accuracy_score(test_label, message_predicted_target))
test_label.value_counts()
 This simple classification accuracy will not tell us the types of errors our classifier makes.

 It is an easy metric to compute, but it does not reveal the underlying distribution of the response values.

Confusion Matrix
It is a technique to evaluate the performance of a classifier.
 It depicts the performance in a tabular form that has 2 dimensions namely “actual” and
“predicted” sets of data.

 The rows and columns of the table show the count of false positives, false negatives, true
positives and true negatives.

from sklearn.metrics import confusion_matrix

print('Confusion Matrix', confusion_matrix(test_label, message_predicted_target))
The first parameter shows true values and the second parameter shows predicted values.
Confusion Matrix
This image is a confusion matrix for a two class classifier.
In the table,
 TP (True Positive) - The number of correct predictions that the occurrence is positive

 FP (False Positive) - The number of incorrect predictions that the occurrence is positive

 FN (False Negative) - The number of incorrect predictions that the occurrence is negative

 TN (True Negative)- The number of correct predictions that the occurrence is negative

 TOTAL - The total number of occurrences

Plotting Confusion Matrix


To evaluate the quality of the output, it is always better to plot and analyze the results.
For our case study, we have plotted the confusion matrix of the Decision Tree Classifier, which is given in the above image.
A sketch of the plotting function is given below.
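A minimal matplotlib sketch of such a plotting function (the function name and the ['ham', 'spam'] class ordering are our assumptions):
import numpy as np
import matplotlib.pyplot as plt

def plot_confusion_matrix(cm, classes, title='Confusion matrix'):
    #draw the matrix as a colored grid with labeled axes
    plt.imshow(cm, interpolation='nearest', cmap=plt.cm.Blues)
    plt.title(title)
    plt.colorbar()
    tick_marks = np.arange(len(classes))
    plt.xticks(tick_marks, classes, rotation=45)
    plt.yticks(tick_marks, classes)
    plt.ylabel('True label')
    plt.xlabel('Predicted label')
    plt.show()

plot_confusion_matrix(confusion_matrix(test_label, message_predicted_target), classes=['ham', 'spam'])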
Classification Report
The classification_report function shows a text report with the commonly used classification
metrics.
from sklearn.metrics import classification_report
target_names = ['ham', 'spam']  #names must follow the sorted order of the labels ('ham' < 'spam')
print(classification_report(test_label, message_predicted_target, target_names=target_names))
Precision
 When a positive value is predicted, how often is the prediction correct?

Recall
 It is the true positive rate.

 When the actual value is positive, how often is the prediction correct?
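In terms of the confusion matrix counts defined earlier:
Precision = TP / (TP + FP)
Recall = TP / (TP + FN)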

To know more about model evaluation, check this link.


Other Libraries
For demonstration purposes, we have used Python with NLTK. There are many more libraries specific to Java, Ruby, etc.
You can find the reference link here:
NLP Libraries
True Negative is when the predicted instance and the actual instance are positive.
TRUE OR FALSE
True Positive is when the predicted instance and the actual instance are not negative.
TRUE OR FALSE
Unstructured Data Classification - Course Summary
In this course, we discussed the following :
 Identifying unstructured data.
 Selecting the ideal features for processing.

 Various pre-processing steps for text classification with practical exercises.

 A few of the classification algorithms.

 Classifier performance evaluation.

Q&A
Cross-validation causes over-fitting. TRUE OR FALSE
In document classification, each document has to be converted from full text to a document vector TRUE / FALSE
a) Download the dataset from:
https://inclass.kaggle.com/c/si650winter11/download/training.txt and load it to the
variable 'sentiment_analysis_data'.
b) Give the column names as 'label' and 'message'
c) Try out the code snippets and answer the questions.
What is the output of the following command: print(sentiment_analysis_data['label'].unique())
[yes no]
None of these
[true false]
[1 0]
A classifier that can compute using numeric as well as categorical values is
Naive Bayes Classifier Decision Tree Classifier
SVM Classifier Random Forest Classifier
Stemming and lemmatization give the same result. True or false
a) Download the dataset from:
https://inclass.kaggle.com/c/si650winter11/download/training.txt and load it to the
variable 'sentiment_analysis_data'.
b) Give the column names as 'label' and 'message'
c) Try out the code snippets and answer the questions.
To view the first 3 rows of the dataset, which of the following commands are used?
sentiment_analysis_data.get(3)
sentiment_analysis_data.select(3)
sentiment_analysis_data.top(3)
sentiment_analysis_data.head(3)
In Supervised learning, class labels of the training samples are
Partially known Known Unknown Does not matter
An optimal scenario will allow for the algorithm to correctly determine the class labels for unseen
instances. This requires the learning algorithm to generalize from the training data to unseen situations in
a "reasonable" way (see inductive bias).
The parallel task in human and animal psychology is often referred to as concept learning.
a) Download the dataset from:
https://inclass.kaggle.com/c/si650winter11/download/training.txt and load it to the
variable 'sentiment_analysis_data'.
b) Give the column names as 'label' and 'message'
c) Try out the code snippets and answer the questions.
High classification accuracy always indicates a good classifier. False or True
In machine learning, multiclass or multinomial classification is the problem of classifying instances into
one of three or more classes. (Classifying instances into one of the two classes is called binary
classification.)
While some classification algorithms naturally permit the use of more than two classes, others are by
nature binary algorithms; these can, however, be turned into multinomial classifiers by a variety of
strategies.
Multiclass classification should not be confused with multi-label classification, where multiple labels are to
be predicted for each instance.
General strategies
The existing multi-class classification techniques can be categorized into (i) Transformation to binary (ii)
Extension from binary and (iii) Hierarchical classification. [1]
Transformation to binary
This section discusses strategies for reducing the problem of multiclass classification to multiple binary
classification problems. It can be categorized into One vs Rest and One vs One. The techniques developed
based on reducing the multi-class problem into multiple binary problems can also be called problem
transformation techniques.
One-vs.-rest
One-vs.-rest[2]:182, 338 (or one-vs.-all, OvA or OvR, one-against-all, OAA) strategy involves training a single
classifier per class, with the samples of that class as positive samples and all other samples as negatives.
This strategy requires the base classifiers to produce a real-valued confidence score for its decision, rather
than just a class label; discrete class labels alone can lead to ambiguities, where multiple classes are
predicted for a single sample.[3]:182[note 1]
In pseudocode, the training algorithm for an OvA learner constructed from a binary classification learner L
is as follows:
Inputs:
 L, a learner (training algorithm for binary classifiers)
 samples X
 labels y where yi ∈ {1, … K} is the label for the sample Xi

Output:
 a list of classifiers fk for k ∈ {1, …, K}

Procedure:
 For each k in {1, …, K}
 Construct a new label vector z where zi = 1 if yi = k and zi = 0 otherwise
 Apply L to X, z to obtain fk

Making decisions means applying all classifiers to an unseen sample x and predicting the label k for which the corresponding classifier reports the highest confidence score:
ŷ = arg max fk(x), where the maximum is taken over k ∈ {1, …, K}
Although this strategy is popular, it is a heuristic that suffers from several problems. Firstly, the scale of the
confidence values may differ between the binary classifiers. Second, even if the class distribution is
balanced in the training set, the binary classification learners see unbalanced distributions because
typically the set of negatives they see is much larger than the set of positives. [3]:338
One-vs.-one
In the one-vs.-one (OvO) reduction, one trains K (K − 1) / 2 binary classifiers for a K-way multiclass
problem; each receives the samples of a pair of classes from the original training set, and must learn to
distinguish these two classes. At prediction time, a voting scheme is applied: all K (K − 1) / 2 classifiers
are applied to an unseen sample and the class that got the highest number of "+1" predictions gets
predicted by the combined classifier.[3]:339
Like OvR, OvO suffers from ambiguities in that some regions of its input space may receive the same
number of votes.[3]:183
Extension from binary
This section discusses strategies of extending the existing binary classifiers to solve multi-class
classification problems. Several algorithms have been developed based on neural networks, decision trees,
k-nearest neighbors, naive Bayes, support vector machines and Extreme Learning Machines to address
multi-class classification problems. These types of techniques can also be called algorithm adaptation techniques.
Neural networks
Multilayer perceptrons provide a natural extension to the multi-class problem. Instead of just having one
neuron in the output layer, with binary output, one could have N binary neurons leading to multi-class
classification. In practice, the last layer of a neural network is usually a softmax function layer, which is the
algebraic simplification of N logistic classifiers, normalized per class by the sum of the N-1 other logistic
classifiers.
Extreme learning machines
Extreme Learning Machines (ELM) is a special case of single hidden layer feed-forward neural networks
(SLFNs) where in the input weights and the hidden node biases can be chosen at random. Many variants
and developments are made to the ELM for multiclass classification.
k-nearest neighbours
k-nearest neighbors (kNN) is considered among the oldest non-parametric classification algorithms. To
classify an unknown example, the distance from that example to every other training example is measured.
The k smallest distances are identified, and the most represented class by these k nearest neighbours is
considered the output class label.
Naive Bayes
Naive Bayes is a successful classifier based upon the principle of maximum a posteriori (MAP). This
approach is naturally extensible to the case of having more than two classes, and was shown to perform
well in spite of the underlying simplifying assumption of conditional independence.
Decision trees
Decision trees are a powerful classification technique. The tree tries to infer a split of the training data
based on the values of the available features to produce a good generalization. The algorithm can naturally
handle binary or multiclass classification problems. The leaf nodes can refer to either of the K classes
concerned.
Support vector machines
Support vector machines are based upon the idea of maximizing the margin i.e. maximizing the minimum
distance from the separating hyperplane to the nearest example. The basic SVM supports only binary
classification, but extensions have been proposed to handle the multiclass classification case as well. In
these extensions, additional parameters and constraints are added to the optimization problem to handle
the separation of the different classes.
Hierarchical classification
Hierarchical classification tackles the multi-class classification problem by dividing the output space into a tree. Each parent node is divided into multiple child nodes and the process is continued until each child
node represents only one class. Several methods have been proposed based on hierarchical classification.
Learning paradigms
Based on learning paradigms, the existing multi-class classification techniques can be classified into batch
learning and online learning. Batch learning algorithms require all the data samples to be available
beforehand. It trains the model using the entire training data and then predicts the test sample using the
found relationship. The online learning algorithms, on the other hand, incrementally build their models in
sequential iterations. In iteration t, an online algorithm receives a sample xt and predicts its label ŷt using the current model; the algorithm then receives yt, the true label of xt, and updates its model based on the
sample-label pair: (xt, yt). Recently, a new learning paradigm called progressive learning technique has
been developed.[4] The progressive learning technique is capable of not only learning from new samples but
also capable of learning new classes of data and yet retain the knowledge learnt thus far.
Which of the following is not a performance evaluation measure?
Accuracy score (X) DecisionTree Confusion matrix Classification report
26/07/2018 (1)
In a Term Document Matrix (TDM) each row represents ______?
TF-IDF value TF value document word
The Term Document Matrix (TDM) is a matrix that contains the frequency of occurrence of terms in a
collection of documents. In a TDM, the rows represent documents and columns represent the
terms.
email spam data is an example of
Unstructured Data Structured Data
High classification accuracy always indicates a good classifier. TRUE FALSE(X)
It is false because accuracy might be high while the errors that remain are still unacceptable (for example, on a rare but important class).
a) Download the dataset from: https://inclass.kaggle.com/c/si650winter11/download/training.txt and
load it to the variable 'sentiment_analysis_data'.
b) Give the column names as 'label' and 'message'
c) Try out the code snippets and answer the questions.
To view the first 3 rows of the dataset, which of the following commands are used?
sentiment_analysis_data.get(3)(X)
sentiment_analysis_data.select(3)
sentiment_analysis_data.head(3)
sentiment_analysis_data.top(3)
Which one of the following is not a classification technique?
SGDClassifier SVM StratifiedShuffleSplit Random Forest
Classification Algorithms
 Decision Tree Classifier

 Stochastic Gradient Descent Classifier


 Support Vector Machine Classifier

 Random Forest Classifier

A technique used to depict the performance in a tabular form that has 2 dimensions
namely “actual” and “predicted” sets of data.
Confusion Matrix Cross Validation Classification Report Classification Accuracy
Confusion Matrix is a technique to evaluate the performance of a classifier. It depicts
the performance in a tabular form that has 2 dimensions namely “actual” and
“predicted” sets of data.The rows and columns of the table show the count of false
positives, false negatives, true positives and true negatives.
a) Download the dataset from: https://inclass.kaggle.com/c/si650winter11/download/training.txt and
load it to the variable 'sentiment_analysis_data'.
b) Give the column names as 'label' and 'message'
c) Try out the code snippets and answer the questions.
(maybe, as multiclass is for more than two classes)
Higher value of which of the following hyperparameters is better for decision tree
algorithm?
Number of samples used for split Depth of tree
Cannot say Samples for leaf
Usually, if we increase the depth of the tree, it will cause overfitting. Learning rate is not a hyperparameter in random forest. An increase in the number of trees will cause underfitting.
27/07/18 (2)
Pruning is a technique associated with
Decision tree Logistic regression SVM Linear regression
In document classification, each document has to be converted from full text to a
document vector TRUE / FALSE
a) Download the dataset from: https://inclass.kaggle.com/c/si650winter11/download/training.txt and
load it to the variable 'sentiment_analysis_data'.
b) Give the column names as 'label' and 'message'
c) Try out the code snippets and answer the questions.
Supervised learning differs from unsupervised learning in that supervised learning
requires
None of the options raw data Labeled data Unlabeled data
The key difference between supervised and unsupervised learning is that supervised learning tries to predict the labels, P(Y|X), while unsupervised learning tries to build a model of the distribution of X, P(X).
And you may ask what P(X) is and what makes a good P(X).
a) Download the dataset from: https://inclass.kaggle.com/c/si650winter11/download/training.txt and
load it to the variable 'sentiment_analysis_data'.
b) Give the column names as 'label' and 'message'
c) Try out the code snippets and answer the questions.
What is the output of the following command:
print(sentiment_analysis_data['label'].unique())
[yes no] [true false] None of these [1 0]
a) Download the dataset from: https://inclass.kaggle.com/c/si650winter11/download/training.txt and
load it to the variable 'sentiment_analysis_data'.
b) Give the column names as 'label' and 'message'
c) Try out the code snippets and answer the questions.
Select pre-processing techniques from the options
All the options Lemmatization Stemming Stopword removal Tokenization
What is the output of the sentence “Good words bring good feelings to the heart” after
performing tokenization, lemmatization and stop word removal.
'Good words bring good feelings heart'
['Good', 'words', 'bring', 'good', 'feelings', 'to', 'the', 'heart']
['Good', 'word', 'bring', 'good', 'feeling', 'to', 'the', 'heart']
'Good word bring good feeling heart'
Which of the following is not a performance evaluation measure?
Confusion matrix Accuracy score DecisionTree Classification report
26/07/18 (3)
Cross-validation causes over-fitting. TRUE FALSE
In document classification, each document has to be converted from full text to a
document vector TRUE FALSE
High classification accuracy always indicates a good classifier. TRUE FALSE
Pruning is a technique associated with
Decision tree Logistic regression SVM Linear regression
a) Download the dataset from: https://inclass.kaggle.com/c/si650winter11/download/training.txt and
load it to the variable 'sentiment_analysis_data'.
b) Give the column names as 'label' and 'message'
c) Try out the code snippets and answer the questions.
Supervised learning differs from unsupervised learning in that supervised learning
requires
Unlabeled data None of the options Labeled data raw data
a) Download the dataset from: https://inclass.kaggle.com/c/si650winter11/download/training.txt and
load it to the variable 'sentiment_analysis_data'.
b) Give the column names as 'label' and 'message'
c) Try out the code snippets and answer the questions.
To view the first 3 rows of the dataset, which of the following commands are used?
sentiment_analysis_data.get(3)
sentiment_analysis_data.head(3)
sentiment_analysis_data.top(3)
sentiment_analysis_data.select(3)
SVM is a
weakly supervised learning algorithm.
Semi-supervised learning algorithm.
supervised learning algorithm.
unsupervised learning algorithm.
TF-IDF is a feature extraction technique
False True
Lemmatization offers better precision than stemming
True False
Choose the correct sequence for classifier building from the following:
None of the options
Train -> Test -> Initialize -> Predict
Initialize -> Evaluate -> Train -> Predict
Initialize -> Train -> Predict -> Evaluate
27/07/18 (1)
The data you have is called 'mixed data' because it has both numerical and categorical values. And since you
have class labels; therefore, it is a classification problem. One option is to go with decision trees,
which you already tried. Other possibilities are naive Bayes where you model numeric attributes by a Gaussian
distribution or so. You can also employ a minimum distance or KNN based approach; however, the cost function
must be able to handle data for both types together. If these approaches don't work then try ensemble
techniques. Try bagging with decision trees or else Random Forest that combines bagging and random
subspace. With mixed data, choices are limited and you need to be cautious and creative with your choices.
a) Download the dataset from: https://inclass.kaggle.com/c/si650winter11/download/training.txt and
load it to the variable 'sentiment_analysis_data'.
b) Give the column names as 'label' and 'message'
c) Try out the code snippets and answer the questions.
To view the first 3 rows of the dataset, which of the following commands are used?
sentiment_analysis_data.head(3) sentiment_analysis_data.select(3)
sentiment_analysis_data.get(3) sentiment_analysis_data.top(3)
In document classification, each document has to be converted from full text to a
document vector False True
a) Download the dataset from: https://inclass.kaggle.com/c/si650winter11/download/training.txt and
load it to the variable 'sentiment_analysis_data'.
b) Give the column names as 'label' and 'message'
c) Try out the code snippets and answer the questions.
Is there a class imbalance problem in the given data set? Yes No
Inverse Document frequency is used in term document matrix. False True
a) Download the dataset from: https://inclass.kaggle.com/c/si650winter11/download/training.txt and
load it to the variable 'sentiment_analysis_data'.
b) Give the column names as 'label' and 'message'
c) Try out the code snippets and answer the questions.
Which of the following command is used to view the dataset SIZE and what is the value
returned?
sentiment_analysis_data.shape,(7086, 3)
sentiment_analysis_data.shape(),(7086, 2)
sentiment_analysis_data.size(),(7086, 2)
sentiment_analysis_data.size,(7086, 3)
a) Download the dataset from: https://inclass.kaggle.com/c/si650winter11/download/training.txt and
load it to the variable 'sentiment_analysis_data'.
b) Give the column names as 'label' and 'message'
c) Try out the code snippets and answer the questions.
Is there a class imbalance problem in the given data set? Yes No
A technique used to depict the performance in a tabular form that has 2 dimensions
namely “actual” and “predicted” sets of data.
Classification Report Classification Accuracy
Confusion Matrix Cross Validation
a) Download the dataset from: https://inclass.kaggle.com/c/si650winter11/download/training.txt and
load it to the variable 'sentiment_analysis_data'.
b) Give the column names as 'label' and 'message'
c) Try out the code snippets and answer the questions.
What is the output of the following command:
print(sentiment_analysis_data['label'].unique())
[true false] [yes no] None of these [1 0]
Select pre-processing techniques from the options
Stemming Lemmatization Tokenization Stopword removal All the options
In a Term Document Matrix (TDM) each row represents ______?
Word document TF-IDF value TF value
In a TDM, the rows represent documents and columns represent the terms.
a) Download the dataset from:
https://inclass.kaggle.com/c/si650winter11/download/training.txt and load it to the
variable 'sentiment_analysis_data'.
b) Give the column names as 'label' and 'message'
c) Try out the code snippets and answer the questions.
Which of the following command is used to view the dataset SIZE and what is the value returned?
sentiment_analysis_data.size,(7086, 3) sentiment_analysis_data.shape(),(7086, 2)
sentiment_analysis_data.size(),(7086, 2) sentiment_analysis_data.shape,(7086, 3)
In Supervised learning, class labels of the training samples are
Unknown Doesn’t matter Known Partially known
TF-IDF is a feature extraction technique TRUE (X) or FALSE
Which of the following is not a performance evaluation measure?
Accuracy score DecisionTree Classification report Confusion matrix
a) Download the dataset from: https://inclass.kaggle.com/c/si650winter11/download/training.txt and
load it to the variable 'sentiment_analysis_data'.
b) Give the column names as 'label' and 'message'
c) Try out the code snippets and answer the questions.
What does the command sentiment_analysis_data['label'].value_counts()
return?
The total count of elements in 'label' column
Number of columns in the dataset
Number of rows in the dataset
counts of unique values in the 'label' column
a) Download the dataset from: https://inclass.kaggle.com/c/si650winter11/download/training.txt and
load it to the variable 'sentiment_analysis_data'.
b) Give the column names as 'label' and 'message'
c) Try out the code snippets and answer the questions.
In document classification, each document has to be converted from full text to a
document vector TRUE FALSE
SVM is a
supervised learning algorithm. unsupervised learning algorithm.
Semi-supervised learning algorithm. weakly supervised learning algorithm.
An algorithm that counts how many times a word appears in a document
Bag-of-words(BOW) TF-IDF DTM TDM
Q1. Movie Recommendation systems are an example of:
1. Classification
2. Clustering
3. Reinforcement Learning
4. Regression

Options:
A. 2 Only
B. 1 and 2
C. 1 and 3
D. 2 and 3
E. 1, 2 and 3
F. 1, 2, 3 and 4
Solution: (E)
Generally, movie recommendation systems cluster the users in a finite number of similar groups based on
their previous activities and profile. Then, at a fundamental level, people in the same cluster are made
similar recommendations.
In some scenarios, this can also be approached as a classification problem for assigning the most
appropriate movie class to the user of a specific group of users. Also, a movie recommendation system can
be viewed as a reinforcement learning problem where it learns by its previous recommendations and
improves the future recommendations.
Q2. Sentiment Analysis is an example of:
1. Regression
2. Classification
3. Clustering
4. Reinforcement Learning

Options:
A. 1 Only
B. 1 and 2
C. 1 and 3
D. 1, 2 and 3
E. 1, 2 and 4
F. 1, 2, 3 and 4
Solution: (E)
Sentiment analysis at the fundamental level is the task of classifying the sentiments represented in an image,
text or speech into a set of defined sentiment classes like happy, sad, excited, positive, negative, etc. It can
also be viewed as a regression problem for assigning a sentiment score of say 1 to 10 for a corresponding
image, text or speech.
Another way of looking at sentiment analysis is to consider it using a reinforcement learning perspective
where the algorithm constantly learns from the accuracy of past sentiment analysis performed to improve
the future performance.
Q3. Can decision trees be used for performing clustering?
A. True
B. False
Solution: (A)
Decision trees can also be used to form clusters in the data, but clustering often generates natural clusters and is not dependent on any objective function.
Q4. Which of the following is the most appropriate strategy for data cleaning before performing
clustering analysis, given less than desirable number of data points:
1. Capping and flooring of variables
2. Removal of outliers

Options:
A. 1 only
B. 2 only
C. 1 and 2
D. None of the above
Solution: (A)
Removal of outliers is not recommended if the data points are few in number. In this scenario, capping and flooring of variables is the most appropriate strategy.
Q5. What is the minimum no. of variables/ features required to perform clustering?
A. 0
B. 1
C. 2
D. 3
Solution: (B)
At least a single variable is required to perform clustering analysis. Clustering analysis with a single
variable can be visualized with the help of a histogram.
Q6. For two runs of K-Means clustering, is it expected to get the same clustering results?
A. Yes
B. No
Solution: (B)
The K-Means clustering algorithm converges to local minima, which might also correspond to the global minimum in some cases, but not always. Therefore, it's advised to run the K-Means algorithm multiple times before drawing inferences about the clusters.
However, note that it's possible to receive the same clustering results from K-means by setting the same seed value for each run. But that is done by simply making the algorithm choose the same set of random numbers for each run.
Q7. Is it possible that the assignment of observations to clusters does not change between successive iterations in K-Means?
A. Yes
B. No
C. Can’t say
D. None of these
Solution: (A)
When the K-Means algorithm has reached the local or global minima, it will not alter the assignment of data
points to clusters for two successive iterations.
Q8. Which of the following can act as possible termination conditions in K-Means?
1. For a fixed number of iterations.
2. Assignment of observations to clusters does not change between iterations. Except for cases with a
bad local minimum.
3. Centroids do not change between successive iterations.

4. Terminate when RSS falls below a threshold.


Options:
A. 1, 3 and 4
B. 1, 2 and 3
C. 1, 2 and 4
D. All of the above
Solution: (D)
All four conditions can be used as possible termination condition in K-Means clustering:
1. This condition limits the runtime of the clustering algorithm, but in some cases the quality of the
clustering will be poor because of an insufficient number of iterations.
2. Except for cases with a bad local minimum, this produces a good clustering, but runtimes may be
unacceptably long.
3. This also ensures that the algorithm has converged at the minima.
4. Terminate when RSS falls below a threshold. This criterion ensures that the clustering is of a desired
quality after termination. Practically, it’s a good practice to combine it with a bound on the number of
iterations to guarantee termination.

Q9. Which of the following clustering algorithms suffers from the problem of convergence at local
optima?
1. K- Means clustering algorithm
2. Agglomerative clustering algorithm
3. Expectation-Maximization clustering algorithm
4. Diverse clustering algorithm

Options:
A. 1 only
B. 2 and 3
C. 2 and 4
D. 1 and 3
E. 1,2 and 4
F. All of the above
Solution: (D)
Out of the options given, only the K-Means clustering algorithm and the EM clustering algorithm have the drawback of converging at local minima.
Q10. Which of the following algorithm is most sensitive to outliers?
A. K-means clustering algorithm
B. K-medians clustering algorithm
C. K-modes clustering algorithm
D. K-medoids clustering algorithm
Solution: (A)
Out of all the options, K-Means clustering algorithm is most sensitive to outliers as it uses the mean of
cluster data points to find the cluster center.
Q11. After performing K-Means Clustering analysis on a dataset, you observed the following
dendrogram. Which of the following conclusion can be drawn from the dendrogram?
A. There were 28 data points in clustering analysis
B. The best no. of clusters for the analyzed data points is 4
C. The proximity function used is Average-link clustering
D. The above dendrogram interpretation is not possible for K-Means clustering analysis
Solution: (D)
A dendrogram is not possible for K-Means clustering analysis. However, one can create a cluster gram
based on K-Means clustering analysis.
Q12. How can Clustering (Unsupervised Learning) be used to improve the accuracy of Linear
Regression model (Supervised Learning):
1. Creating different models for different cluster groups.
2. Creating an input feature for cluster ids as an ordinal variable.
3. Creating an input feature for cluster centroids as a continuous variable.
4. Creating an input feature for cluster size as a continuous variable.
Options:
A. 1 only
B. 1 and 2
C. 1 and 4
D. 3 only
E. 2 and 4
F. All of the above
Solution: (F)
Creating an input feature for cluster ids as ordinal variable or creating an input feature for cluster centroids
as a continuous variable might not convey any relevant information to the regression model for
multidimensional data. But for clustering in a single dimension, all of the given methods are expected to
convey meaningful information to the regression model. For example, to cluster people in two groups based
on their hair length, storing clustering ID as ordinal variable and cluster centroids as continuous variables
will convey meaningful information.
Q13. What could be the possible reason(s) for producing two different dendrograms using
agglomerative clustering algorithm for the same dataset?
A. Proximity function used
B. No. of data points used
C. No. of variables used
D. B and C only
E. All of the above
Solution: (E)
Change in either of Proximity function, no. of data points or no. of variables will lead to different clustering
results and hence different dendrograms.
Q14. In the figure below, if you draw a horizontal line at y=2, what will be the number of clusters formed?
A. 1
B. 2
C. 3
D. 4
Solution: (B)
Since the number of vertical lines intersecting the red horizontal line at y=2 in the dendrogram are 2,
therefore, two clusters will be formed.
Q15. What is the most appropriate no. of clusters for the data points represented by the following
dendrogram:
A. 2
B. 4
C. 6
D. 8
Solution: (B)
The decision of the no. of clusters that can best depict different groups can be chosen by observing the
dendrogram. The best choice of the no. of clusters is the no. of vertical lines in the dendrogram cut by a
horizontal line that can transverse the maximum distance vertically without intersecting a cluster.
In the above example, the best choice of no. of clusters will be 4 as the red horizontal line in the
dendrogram below covers maximum vertical distance AB.
Q16. In which of the following cases will K-Means clustering fail to give good results?
1. Data points with outliers
2. Data points with different densities
3. Data points with round shapes
4. Data points with non-convex shapes

Options:
A. 1 and 2
B. 2 and 3
C. 2 and 4
D. 1, 2 and 4
E. 1, 2, 3 and 4
Solution: (D)
K-Means clustering algorithm fails to give good results when the data contains outliers, the density spread
of data points across the data space is different and the data points follow non-convex shapes.
Q17. Which of the following metrics, do we have for finding dissimilarity between two clusters in
hierarchical clustering?
1. Single-link
2. Complete-link
3. Average-link

Options:
A. 1 and 2
B. 1 and 3
C. 2 and 3
D. 1, 2 and 3
Solution: (D)
All of the three methods i.e. single link, complete link and average link can be used for finding dissimilarity
between two clusters in hierarchical clustering.
Q18. Which of the following are true?
1. Clustering analysis is negatively affected by multicollinearity of features

2. Clustering analysis is negatively affected by heteroscedasticity


Options:
A. 1 only
B. 2 only
C. 1 and 2
D. None of them
Solution: (A)
Clustering analysis is not negatively affected by heteroscedasticity but the results are negatively impacted
by multicollinearity of features/ variables used in clustering as the correlated feature/ variable will carry
extra weight on the distance calculation than desired.
Q19. Given, six points with the following attributes:
Which of the following clustering representations and dendrogram depicts the use of MIN or Single
link proximity function in hierarchical clustering:
A.
B.
C.
D.
Solution: (A)
For the single link or MIN version of hierarchical clustering, the proximity of two clusters is defined to be
the minimum of the distance between any two points in the different clusters. For instance, from the table,
we see that the distance between points 3 and 6 is 0.11, and that is the height at which they are joined into
one cluster in the dendrogram. As another example, the distance between clusters {3, 6} and {2, 5} is given
by dist({3, 6}, {2, 5}) = min(dist(3, 2), dist(6, 2), dist(3, 5), dist(6, 5)) = min(0.1483, 0.2540, 0.2843,
0.3921) = 0.1483.
Q20 Given, six points with the following attributes:
Which of the following clustering representations and dendrogram depicts the use of MAX or
Complete link proximity function in hierarchical clustering:
A.
B.
C.
D.
Solution: (B)
For the complete link or MAX version of hierarchical clustering, the proximity of two clusters is defined to be the maximum of the distance between any two points in the different clusters. Similarly, here points 3 and 6
are merged first. However, {3, 6} is merged with {4}, instead of {2, 5}. This is because the dist({3, 6}, {4})
= max(dist(3, 4), dist(6, 4)) = max(0.1513, 0.2216) = 0.2216, which is smaller than dist({3, 6}, {2, 5}) =
max(dist(3, 2), dist(6, 2), dist(3, 5), dist(6, 5)) = max(0.1483, 0.2540, 0.2843, 0.3921) = 0.3921 and dist({3,
6}, {1}) = max(dist(3, 1), dist(6, 1)) = max(0.2218, 0.2347) = 0.2347.
Q21 Given, six points with the following attributes:
Which of the following clustering representations and dendrogram depicts the use of Group average
proximity function in hierarchical clustering:
A.
B.
C.
D.
Solution: (C)
For the group average version of hierarchical clustering, the proximity of two clusters is defined to be the
average of the pairwise proximities between all pairs of points in the different clusters. This is an
intermediate approach between MIN and MAX. This is expressed by the following equation:
Here are the distances between some of the clusters: dist({3, 6, 4}, {1}) = (0.2218 + 0.3688 + 0.2347)/(3 ∗ 1) =
0.2751. dist({2, 5}, {1}) = (0.2357 + 0.3421)/(2 ∗ 1) = 0.2889. dist({3, 6, 4}, {2, 5}) = (0.1483 + 0.2843 +
0.2540 + 0.3921 + 0.2042 + 0.2932)/(6∗1) = 0.2637. Because dist({3, 6, 4}, {2, 5}) is smaller than dist({3,
6, 4}, {1}) and dist({2, 5}, {1}), these two clusters are merged at the fourth stage
Q22. Given, six points with the following attributes:
Which of the following clustering representations and dendrogram depicts the use of Ward’s method
proximity function in hierarchical clustering:
A.
B.
C.
D.
Solution: (D)
Ward method is a centroid method. Centroid method calculates the proximity between two clusters by
calculating the distance between the centroids of clusters. For Ward’s method, the proximity between two
clusters is defined as the increase in the squared error that results when two clusters are merged. Applying Ward's method to the sample data set of six points produces a clustering somewhat different from those produced by MIN, MAX, and group average.
Q23. What should be the best choice of no. of clusters based on the following results:
A. 1
B. 2
C. 3
D. 4
Solution: (C)
The silhouette coefficient is a measure of how similar an object is to its own cluster compared to other
clusters. Number of clusters for which silhouette coefficient is highest represents the best choice of the
number of clusters.
Q24. Which of the following is/are valid iterative strategy for treating missing values before clustering
analysis?
A. Imputation with mean
B. Nearest Neighbor assignment
C. Imputation with Expectation Maximization algorithm
D. All of the above
Solution: (C)
All of the mentioned techniques are valid for treating missing values before clustering analysis but only
imputation with EM algorithm is iterative in its functioning.
Q25. The K-Means algorithm has some limitations. One of its limitations is that it makes hard assignments (a point either completely belongs to a cluster or does not belong at all) of points to clusters.
Note: A soft assignment can be considered as the probability of being assigned to each cluster: say K = 3, and for some point xn, p1 = 0.7, p2 = 0.2, p3 = 0.1.
Which of the following algorithm(s) allows soft assignments?
1. Gaussian mixture models

2. Fuzzy K-means

Options:
A. 1 only
B. 2 only
C. 1 and 2
D. None of these
Solution: (C)
Both Gaussian mixture models and Fuzzy K-means allow soft assignments.
Q26. Assume, you want to cluster 7 observations into 3 clusters using K-Means clustering algorithm.
After first iteration clusters, C1, C2, C3 has following observations:
C1: {(2,2), (4,4), (6,6)}
C2: {(0,4), (4,0)}
C3: {(5,5), (9,9)}
What will be the cluster centroids if you want to proceed for second iteration?
A. C1: (4,4), C2: (2,2), C3: (7,7)
B. C1: (6,6), C2: (4,4), C3: (9,9)
C. C1: (2,2), C2: (0,0), C3: (5,5)
D. None of these
Solution: (A)
Finding centroid for data points in cluster C1 = ((2+4+6)/3, (2+4+6)/3) = (4, 4)
Finding centroid for data points in cluster C2 = ((0+4)/2, (4+0)/2) = (2, 2)
Finding centroid for data points in cluster C3 = ((5+9)/2, (5+9)/2) = (7, 7)
Hence, C1: (4,4), C2: (2,2), C3: (7,7)
Q27. Assume, you want to cluster 7 observations into 3 clusters using K-Means clustering algorithm.
After first iteration clusters, C1, C2, C3 has following observations:
C1: {(2,2), (4,4), (6,6)}
C2: {(0,4), (4,0)}
C3: {(5,5), (9,9)}
What will be the Manhattan distance for observation (9, 9) from cluster centroid C1 in the second iteration?
A. 10
B. 5*sqrt(2)
C. 13*sqrt(2)
D. None of these
Solution: (A)
Manhattan distance between centroid C1 i.e. (4, 4) and (9, 9) = (9-4) + (9-4) = 10
Q28. If two variables V1 and V2 are used for clustering, which of the following are true for K-means clustering with k = 3?
1. If V1 and V2 have a correlation of 1, the cluster centroids will be in a straight line
2. If V1 and V2 have a correlation of 0, the cluster centroids will be in a straight line

Options:
A. 1 only
B. 2 only
C. 1 and 2
D. None of the above
Solution: (A)
If the correlation between the variables V1 and V2 is 1, then all the data points will be in a straight line.
Hence, all the three cluster centroids will form a straight line as well.
Q29. Feature scaling is an important step before applying the K-Means algorithm. What is the reason behind this?
A. In distance calculation, it will give the same weight to all features
B. You always get the same clusters whether or not you use feature scaling
C. In Manhattan distance it is an important step, but in Euclidean it is not
D. None of these
Solution: (A)
Feature scaling ensures that all the features get the same weight in the clustering analysis. Consider a scenario of clustering people based on their weights (in kg), with range 55-110, and heights (in feet), with range 5.6 to 6.4. In this case, the clusters produced without scaling can be very misleading, as the range of weight is much higher than that of height. Therefore, it's necessary to bring them to the same scale so that they have equal weightage on the clustering result.
Q30. Which of the following methods is used for finding the optimal number of clusters in the K-Means algorithm?
A. Elbow method
B. Manhattan method
C. Euclidean method
D. All of the above
E. None of these
Solution: (A)
Out of the given options, only elbow method is used for finding the optimal number of clusters. The elbow
method looks at the percentage of variance explained as a function of the number of clusters: One should
choose a number of clusters so that adding another cluster doesn’t give much better modeling of the data.
Q31. What is true about K-Means clustering?
1. K-means is extremely sensitive to cluster center initializations
2. Bad initialization can lead to poor convergence speed
3. Bad initialization can lead to bad overall clustering
Options:
A. 1 and 3
B. 1 and 2
C. 2 and 3
D. 1, 2 and 3
Solution: (D)
All three of the given statements are true. K-means is extremely sensitive to cluster center initialization;
bad initialization can lead to poor convergence speed as well as bad overall clustering.
Q32. Which of the following can be applied to get good results for the K-means algorithm (corresponding
to the global minimum)?
1. Try to run the algorithm for different centroid initializations
2. Adjust the number of iterations
3. Find out the optimal number of clusters
Options:
A. 2 and 3
B. 1 and 3
C. 1 and 2
D. All of the above
Solution: (D)
All of these are standard practices that are used in order to obtain good clustering results.
Q33. What should be the best choice for the number of clusters based on the following results (elbow plot):
A. 5
B. 6
C. 14
D. Greater than 14
Solution: (B)
Based on the above results, the best choice of number of clusters using elbow method is 6.
Q34. What should be the best choice for the number of clusters based on the following results (average silhouette coefficient and SSE plots):
A. 2
B. 4
C. 6
D. 8
Solution: (C)
Generally, a higher average silhouette coefficient indicates better clustering quality. In this plot, the average
silhouette coefficient is highest at k = 2. However, the SSE of that clustering solution (k = 2) is too large. At
k = 6, the SSE is much lower, and the average silhouette coefficient at k = 6 is also very high, only slightly
lower than at k = 2. Thus, the best choice is k = 6.
Q35. Which of the following sequences is correct for a K-Means algorithm using Forgy method of
initialization?
1. Specify the number of clusters
2. Assign cluster centroids randomly
3. Assign each data point to the nearest cluster centroid
4. Re-assign each point to nearest cluster centroids
5. Re-compute cluster centroids
Options:
A. 1, 2, 3, 5, 4
B. 1, 3, 2, 4, 5
C. 2, 1, 3, 4, 5
D. None of these
Solution: (A)
The methods used for initialization in K-means are Forgy and Random Partition. The Forgy method
randomly chooses k observations from the data set and uses these as the initial means. The Random
Partition method first randomly assigns a cluster to each observation and then proceeds to the update step,
thus computing the initial mean to be the centroid of the cluster's randomly assigned points.
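In sklearn, init='random' corresponds to the Forgy-style choice of k random observations as the initial means; a minimal sketch on synthetic data (make_blobs is an assumption for illustration):

from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=3, random_state=0)
# init='random' picks k observations at random as the initial centroids;
# n_init=10 restarts with different initializations and keeps the best run
km = KMeans(n_clusters=3, init='random', n_init=10, random_state=0).fit(X)
print(km.cluster_centers_)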
Q36. If you are using multinomial mixture models with the expectation-maximization algorithm for
clustering a set of data points into two clusters, which of the following assumptions is important?
A. All the data points follow two Gaussian distributions
B. All the data points follow n Gaussian distributions (n > 2)
C. All the data points follow two multinomial distributions
D. All the data points follow n multinomial distributions (n > 2)
Solution: (C)
In the EM algorithm for clustering, it is essential to choose the same number of clusters as the number of
different distributions the data points are expected to be generated from, and the distributions must be of the
same type.
Q37. Which of the following is/are not true about the centroid-based K-Means clustering algorithm and
the distribution-based expectation-maximization clustering algorithm:
1. Both start with random initializations
2. Both are iterative algorithms
3. Both have strong assumptions that the data points must fulfill
4. Both are sensitive to outliers
5. Expectation maximization algorithm is a special case of K-Means
6. Both require prior knowledge of the no. of desired clusters
7. The results produced by both are non-reproducible.
Options:
A. 1 only
B. 5 only
C. 1 and 3
D. 6 and 7
E. 4, 6 and 7
F. None of the above
Solution: (B)
All of the above statements are true except the 5 th as instead K-Means is a special case of EM algorithm in
which only the centroids of the cluster distributions are calculated at each iteration.
Q38. Which of the following is/are not true about DBSCAN clustering algorithm:
1. For data points to be in a cluster, they must be in a distance threshold to a core point
2. It has strong assumptions for the distribution of data points in dataspace
3. It has a substantially high time complexity of order O(n³)
4. It does not require prior knowledge of the no. of desired clusters
5. It is robust to outliers
Options:
A. 1 only
B. 2 only
C. 4 only
D. 2 and 3
E. 1 and 5
F. 1, 3 and 5
Solution: (D)
 DBSCAN can form a cluster of any arbitrary shape and does not have strong assumptions for the
distribution of data points in the dataspace.
 DBSCAN has a low time complexity of order O(n log n) only.
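A minimal sketch of these properties with sklearn's DBSCAN (the eps and min_samples values are illustrative assumptions):

from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

# two interleaving half-moons: a shape K-Means cannot separate
X, _ = make_moons(n_samples=200, noise=0.05, random_state=0)
# eps is the distance threshold to a core point; no number of clusters is given
labels = DBSCAN(eps=0.3, min_samples=5).fit_predict(X)
print(set(labels))  # cluster ids; label -1 marks outliers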
Q39. Which of the following are the lower and upper bounds of the F-score?
A. [0,1]
B. (0,1)
C. [-1,1]
D. None of the above
Solution: (A)
The lowest and highest possible values of the F-score are 0 and 1, with 1 representing that every data point is
assigned to the correct cluster and 0 representing that the precision and/or recall of the clustering analysis
are 0. In clustering analysis, a high F-score is desired.
Q40. Following are the results observed for clustering 6000 data points into 3 clusters: A, B and C:
What is the F1-Score with respect to cluster B?
A. 0.3
B. 0.4
C. 0.5
D. 0.6
Solution: (D)
Here,
True Positive, TP = 1200
True Negative, TN = 600 + 1600 = 2200
False Positive, FP = 1000 + 200 = 1200
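The results table (and hence the false-negative count) is not reproduced above, so here is a generic sketch of the computation; the counts passed in below are placeholders for the values read off such a table:

def f1_from_counts(tp, fp, fn):
    # F1 is the harmonic mean of precision and recall
    precision = tp / float(tp + fp)
    recall = tp / float(tp + fn)
    return 2 * precision * recall / (precision + recall)

print(f1_from_counts(tp=1200, fp=1200, fn=800))  # fn=800 is an assumed example value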
29-Aug-18
Select the correct option which directly achieves multi-class classification (without
support of binary classifiers)
K Nearest Neighbor / SVM / Neural networks / Decision trees
Classification where each data point is mapped to more than one class is called
Multi class classification / Multi label classification (answer) / Binary classification
The classification where each data point is mapped to more than one class is called Multi Label Classification.
Images and documents are examples of Unstructured Data (not Structured Data).
The most widely used package for machine learning in python is
Pillow / bottle / django / sklearn (answer: sklearn)
Sentiment classification is a special task of text classification whose objective is to classify a text according to
the sentimental polarities of the opinions it contains (Pang et al., 2002), e.g., favorable or unfavorable, positive or
negative. Scikit-learn is an open-source machine learning library for the Python programming language.
Imagine you have just finished training a decision tree for spam classification and it is
showing abnormally bad performance on both your training and test sets. Assume that
your implementation has no bugs. What could be the reason for this problem?
Your decision trees are too shallow. (answer: bad performance on both sets indicates underfitting)
You need to increase the learning rate
You are overfitting.
All the options
a) Download the dataset from:
https://inclass.kaggle.com/c/si650winter11/download/training.txt and load it to the
variable 'sentiment_analysis_data'.
b) Give the column names as 'label' and 'message'
c) Try out the code snippets and answer the questions.
What does the command sentiment_analysis_data['label'].value_counts() return?
counts of unique values in the 'label' column (answer)
Number of rows in the dataset
Number of columns in the dataset
The total count of elements in 'label' column
Which numerical statistic is used to identify the importance of a rare word in a document?
None of the options / TF-IDF / DF / TF (answer: TF-IDF; strictly speaking, the IDF component is what captures rarity)
a) Download the dataset from:
https://inclass.kaggle.com/c/si650winter11/download/training.txt and load it to the
variable 'sentiment_analysis_data'.
b) Give the column names as 'label' and 'message'
c) Try out the code snippets and answer the questions.
What command should be given to tokenize a sentence into words?
from nltk import sentence_tokenize; word_tokens = sentence_tokenize(sentence)
from nltk.tokenize import word_tokenize; word_tokens = word_tokenize(sentence) (answer)
from nltk.tokenizer import word_tokenizer; word_tokens = word_tokenizer(sentence)
from nltk import tokenize_words; word_tokens = tokenize_words(sentence)
19/09/2018
Select the correct statements about nonlinear classification
Kernel tricks are used by nonlinear classifiers to achieve maximum-margin hyperplanes.
The concept of slack variables is used in SVM for nonlinear classification.
The kernel trick is used in SVM for non-linear classification.
The fit(X, y) is used to
Initialize the classifier
Test the classifier
Train the Classifier (answer)
Evaluate the classifier
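A minimal sketch of the initialize -> fit -> predict flow with sklearn (the toy counts below are assumptions, not the case-study data):

from sklearn.naive_bayes import MultinomialNB

X_train = [[2, 0], [0, 3], [1, 1]]  # toy word-count features
y_train = [0, 1, 0]                 # toy labels
clf = MultinomialNB()               # initialize the classifier
clf.fit(X_train, y_train)           # fit(X, y) trains the classifier
print(clf.predict([[0, 2]]))        # predict on unseen data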
Model Tuning helps to increase the accuracy
Can't say / False / True (answer: True)
TF and IDF use matrix representations - True
Identify the stop words from the following: "computer", "fragment", "it", "the"
Answer: Both "the" and "it"
Which of the given hyper parameter(s), when increased, may cause random forest to over fit
the data?
Number of Trees / Learning Rate / Depth of Tree (answer: Depth of Tree)
Usually, increasing the depth of a tree will cause overfitting. Learning rate is not a hyperparameter in
random forest, and increasing the number of trees does not by itself cause overfitting.
a) Download the dataset from: https://inclass.kaggle.com/c/si650winter11/download/training.txt and
load it to the variable 'sentiment_analysis_data'.
b) Give the column names as 'label' and 'message'
c) Try out the code snippets and answer the questions.
What does the command sentiment_analysis_data['label'].value_counts()
return?
Number of columns in the dataset
Number of rows in the dataset
counts of unique values in the 'label' column (answer)
The total count of elements in 'label' column
Which of the following is not a preprocessing method used for unstructured data
classification?
confusion_matrix (answer) / stop word removal / lemmatization / stemming
Which NLP technique uses a lexical knowledge base to obtain the correct base form of the
words?
stop word removal / lemmatization (answer) / Tokenization / object standardization
In a Term Frequency-Inverse Document Frequency (TF-IDF) matrix, the term importance is expressed by
Inverse Document Frequency (IDF).
IDF diminishes the weight of the most commonly occurring words and increases the weight of
rare words.
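A minimal sketch with sklearn's TfidfVectorizer on two made-up documents:

from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["good words bring good feelings",
        "free prize win money now"]
vec = TfidfVectorizer()
tfidf = vec.fit_transform(docs)  # terms common to many documents are damped by IDF
print(sorted(vec.vocabulary_))   # the learned vocabulary
print(tfidf.toarray())           # one TF-IDF row per document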
In a Document Term Matrix (DTM) each row represents ______?
Document (answer) / TF value / word / TF-IDF value
Supervised learning differs from unsupervised learning in that supervised learning
requires
Unlabeled data / None of the options / Labeled data (answer) / raw data
An algorithm that counts how many times a word appears in a document
DTM / Bag-of-words (BOW) (answer) / TF-IDF / TDM
Clustering is a supervised classification - False
Clustering is an unsupervised classification.
What is the purpose of lemmatization?
To remove redundant words / To split into sentences /
To convert a sentence to words / To convert words to a proper base form (answer)
SVM is a
weakly supervised learning algorithm. / supervised learning algorithm. (answer) /
Semi-supervised learning algorithm. / unsupervised learning algorithm.
Can we consider sentiment classification as a text classification problem?
No / Yes (answer)
Which type of cross validation is used for imbalanced dataset?
K-Fold / Leave One Out / Stratified Shuffle Split (answer)
Cross-validation causes over-fitting. - False
Pruning is a technique associated with
Decision tree (answer) / Linear regression / Logistic regression / SVM
What are the advantages of Naive Bayes?
1. It will converge quicker than discriminative models like logistic regression AND it requires less training data (answer)
2. Requires less training data
3. None of the options
4. It will converge quicker than discriminative models like logistic regression
The fit(X, y) is used to
1. Initialize the classifier
2. Train the Classifier (answer)
3. Test the classifier
4. Evaluate the classifier
Higher value of which of the following hyper-parameters is better for decision tree algorithm?
1. Cannot say (answer: it depends on the data)
2. Number of samples used for split
3. Depth of tree
4. Samples for leaf
Which of the given hyper parameter(s), when increased, may cause random forest to over fit the data?
1. Number of Trees
2. Learning Rate
3. Depth of Tree (answer)
Choose the correct sequence for classifier building from the following:
1. Initialize -> Train -> Predict -> Evaluate (answer)
2. Train -> Test -> Initialize -> Predict
3. Initialize -> Evaluate -> Train -> Predict
4. None of the options
Which numerical statistic is used to identify the importance of a rare word in a document?
1. TF
2. TF-IDF (answer)
3. None of the options
4. DF
Supervised learning differs from unsupervised learning in that supervised learning requires
1. Raw data
2. Labeled data (answer)
3. Unlabeled data
4. None of the options
Select the correct statements about nonlinear classification
1. Kernel tricks are used by nonlinear classifiers to achieve maximum-margin hyperplanes.
2. The kernel trick is used in SVM for non-linear classification
3. The concept of slack variables is used in SVM for nonlinear classification
Which NLP technique uses a lexical knowledge base to obtain the correct base form of the words?
1. lemmatization (answer)
2. tokenization
3. object standardization
4. stop word removal
What is the output of the sentence "Good words bring good feelings to the heart" after performing
tokenization, lemmatization and stop word removal?
1. ['Good', 'words', 'bring', 'good', 'feelings', 'to', 'the', 'heart']
2. ['Good', 'word', 'bring', 'good', 'feeling', 'to', 'the', 'heart']
3. 'Good word bring good feeling heart' (answer)
4. 'Good words bring good feelings heart'
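A minimal sketch of this pipeline with NLTK (assuming the 'punkt', 'wordnet' and 'stopwords' data have been downloaded as described earlier):

from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer
from nltk.corpus import stopwords

lemmatizer = WordNetLemmatizer()
tokens = word_tokenize("Good words bring good feelings to the heart")
lemmas = [lemmatizer.lemmatize(t) for t in tokens]  # words -> word, feelings -> feeling
filtered = [t for t in lemmas if t.lower() not in stopwords.words('english')]
print(' '.join(filtered))  # 'Good word bring good feeling heart'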
Classification where each data point is mapped to more than one class is called
1. Binary classification
2. Multi Label Classification (answer)
3. Multi Class Classification
email spam data is an example of
1. Structured Data
2. Unstructured Data (answer)
SVM is a
1. weakly supervised learning algorithm.
2. Semi-supervised learning algorithm.
3. supervised learning algorithm. (answer)
4. unsupervised learning algorithm.
Stemming and lemmatization give the same result. (true/false)
 false
Which type of cross validation is used for imbalanced dataset?
 Leave One Out
 K-Fold
 Stratified Shuffle Split (answer: it preserves the class proportions in each split)
An algorithm that counts how many times a word appears in a document
1. TF-IDF (Term Frequency-Inverse Document Frequency)
2. DTM
3. Bag-of-words (BOW)
4. TDM
 answer: 3. Bag-of-words is predominantly used for calculating the term (word) frequency, i.e., the number
of times a term occurs in a document/sentence.
 The Term Document Matrix (TDM) is a matrix that contains the frequency of occurrence of terms in a
collection of documents.
 In a Term Frequency-Inverse Document Frequency (TF-IDF) matrix, the term importance is expressed by
Inverse Document Frequency (IDF). IDF diminishes the weight of the most commonly occurring words and
increases the weight of rare words.
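A minimal sketch of a bag-of-words count with sklearn's CountVectorizer on a made-up sentence:

from sklearn.feature_extraction.text import CountVectorizer

docs = ["good words bring good feelings"]
vec = CountVectorizer()
bow = vec.fit_transform(docs)  # counts how many times each word occurs
print(vec.vocabulary_)         # word -> column index
print(bow.toarray())           # 'good' appears twice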
In a Document Term Matrix (DTM) each row represents ______? - a document.
A document-term matrix or term-document matrix is a mathematical matrix that describes the frequency of
terms that occur in a collection of documents. In a document-term matrix, rows correspond to documents in
the collection and columns correspond to terms.
Pruning is a technique associated with
1. Logistic regression
2. SVM
3. Linear regression
4. Decision tree (answer)
Images and Documents are examples of
 Unstructured data
TF and IDF use matrix representations
 true
 Term Frequency-Inverse Document Frequency
Which of the following is not a pre-processing method used for unstructured data classification?
1. stemming
2. confusion matrix (answer)
3. lemmatization
4. stop word removal
Choose the correct sequence from the following:
1. PreProcessing -> Model Building -> Predict
2. Data Analysis -> Pre-Processing -> Model Building -> Predict (answer)
3. Data Analysis -> Pre-Processing -> Predict -> Train
4. Pre-Processing -> Predict -> Train
Lemmatization offers better precision than stemming (true)
TF-IDF is a feature extraction technique (true)
Clustering is a supervised classification (false: clustering is unsupervised)
Can we consider sentiment classification as a text classification problem? (yes)
Which of the following is not a performance evaluation measure?
 Confusion Matrix
 Classification Report
 Decision Tree (answer)
 Accuracy score
Which of the following commands is used to view the dataset SIZE and what is the value returned?
a) Download the dataset from:
https://inclass.kaggle.com/c/si650winter11/download/training.txt and load it to the
variable 'sentiment_analysis_data'.
b) Give the column names as 'label' and 'message'
c) Try out the code snippets and answer the questions.
 sentiment_analysis_data.shape, (7086, 3)
Imagine you have just finished training a decision tree for spam classification and it is showing abnormally
bad performance on both your training and test sets. Assume that your implementation has no bugs. What
could be the reason for this problem?
 Your decision trees are too shallow. (answer: bad performance on both sets indicates underfitting, not overfitting)
 You need to increase the learning rate
 You are overfitting
 All the options
The most widely used package for machine learning in python is
 sklearn
What is the tokenized output of the sentence "if you cannot do great things, do small things in a great
way"?
A technique used to depict the performance in a tabular form that has 2 dimensions, namely 'actual' and
'predicted' sets of data:
 Confusion Matrix
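A minimal sketch with sklearn's confusion_matrix on made-up labels:

from sklearn.metrics import confusion_matrix

y_actual    = [1, 0, 1, 1, 0, 0]
y_predicted = [1, 0, 0, 1, 1, 0]
print(confusion_matrix(y_actual, y_predicted))  # rows: actual, columns: predicted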
What is the output of the sentence "Good words bring good feelings to the heart" after performing
tokenization, lemmatization and stop word removal?
 'Good word bring good feeling heart'
Can we consider sentiment classification as a text classification problem?
 YES
Select the correct option which directly achieves multi-class classification (without support of binary
classifiers)
 SVM - SVMs are inherently two-class classifiers.
 Neural networks - can directly perform multi-class classification, e.g., with a softmax output layer.
 Decision trees - Decision trees are a powerful classification technique. The tree tries to infer a split of
the training data based on the values of the available features to produce a good generalization. The algorithm
can naturally handle binary or multi-class classification problems.
 K Nearest Neighbor - k-nearest neighbors (kNN) is considered among the oldest non-parametric
classification algorithms and naturally handles multiple classes.
To view the first 3 rows of the dataset, which of the following commands is used?
a) Download the dataset from:
https://inclass.kaggle.com/c/si650winter11/download/training.txt and load it to the
variable 'sentiment_analysis_data'.
b) Give the column names as 'label' and 'message'
c) Try out the code snippets and answer the questions.
 sentiment_analysis_data.head(3)
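A minimal sketch of the repeated setup steps a) and b) with pandas, assuming the file has been downloaded to training.txt and is tab-separated:

import pandas as pd

sentiment_analysis_data = pd.read_csv('training.txt', sep='\t',
                                      names=['label', 'message'])
print(sentiment_analysis_data.head(3))  # first 3 rows
print(sentiment_analysis_data.shape)    # (rows, columns)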
Select pre-processing techniques from the options
 Stopword removal
 Lemmatization
 All the options (answer)
 Tokenization
 Stemming
High classification accuracy always indicates a good classifier.
 False. For an imbalanced dataset, a classifier can score high accuracy simply by always predicting the
majority class, so high accuracy alone does not indicate a good classifier.
Inverse Document frequency is used in term document matrix.
 True
 TF-IDF, short for term frequency-inverse document frequency, is a numerical statistic that is intended to
reflect how important a word is to a document in a collection or corpus.
Which one of the following is not a classification technique?
1. SGDClassifier
2. StratifiedShuffleSplit (answer: it is a cross-validation splitter, not a classifier)
3. SVM
4. Random Forest
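StratifiedShuffleSplit is a cross-validation splitter rather than a classifier; a minimal sketch showing why it suits imbalanced data (the toy labels below are assumptions):

import numpy as np
from sklearn.model_selection import StratifiedShuffleSplit

X = np.arange(20).reshape(10, 2)
y = np.array([0] * 8 + [1] * 2)  # imbalanced toy labels
sss = StratifiedShuffleSplit(n_splits=3, test_size=0.3, random_state=0)
for train_idx, test_idx in sss.split(X, y):
    print(np.bincount(y[test_idx]))  # class proportions preserved in each split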
A classifier that can compute using numeric as well as categorical values is
1. Decision Tree Classifier (answer)
2. SVM Classifier
3. Random Forest Classifier
4. Naive Bayes Classifier
What is the purpose of lemmatization?
 To convert words to their base form
Model Tuning helps to increase the accuracy
 True
What is the output of the following command:
print(sentiment_analysis_data['label'].unique())
a) Download the dataset from:
https://inclass.kaggle.com/c/si650winter11/download/training.txt and load it to the
variable 'sentiment_analysis_data'.
b) Give the column names as 'label' and 'message'
c) Try out the code snippets and answer the questions.
 [yes no]
 None of these
 [1 0] (answer: the labels in this dataset are 1 and 0)
 [true false]
What command should be given to tokenize a sentence into words?
a) Download the dataset from:
https://inclass.kaggle.com/c/si650winter11/download/training.txt and load it to the
variable 'sentiment_analysis_data'.
b) Give the column names as 'label' and 'message'
c) Try out the code snippets and answer the questions.
from nltk.tokenize import word_tokenize
word_tokens = word_tokenize(sentence)
Let's assume you are solving a classification problem with a highly imbalanced class. The majority class is
observed 99% of the time in the training data. Which of the following is true when your model has 99%
accuracy after taking the predictions on test data?
1. For imbalanced class problems, precision and recall metrics aren't good.
2. For imbalanced class problems, accuracy metric is a good idea.
3. For imbalanced class problems, accuracy metric is not a good idea. (answer)
Which of the following commands is used to view the dataset SIZE and what is the value
returned?
sentiment_analysis_data.shape, (7086, 3) (answer)
sentiment_analysis_data.size, (7086, 3)
sentiment_analysis_data.size(), (7086, 2)
sentiment_analysis_data.shape(), (7086, 2)
What command should be given to tokenize a sentence into words?
from nltk.tokenize import word_tokenize; word_tokens = word_tokenize(sentence) (answer)
from nltk import tokenize_words; word_tokens = tokenize_words(sentence)
from nltk.tokenizer import word_tokenizer; word_tokens = word_tokenizer(sentence)
from nltk import sentence_tokenize; word_tokens = sentence_tokenize(sentence)
What is the tokenized output of the sentence "Only do what your heart tells you"?
'Only', 'heart', 'tells'
'Only', 'do', 'what', 'your', 'heart', 'tell', 'you'
'Only', 'do', 'what', 'heart', 'tells'
'Only', 'do', 'what', 'your', 'heart', 'tells', 'you' (answer: tokenization alone neither lemmatizes nor removes stop words)
Choose the correct sequence from the following:
Data Analysis -> PreProcessing -> Model Building -> Predict (answer)
PreProcessing -> Predict -> Train
PreProcessing -> Model Building -> Predict
Data Analysis -> PreProcessing -> Predict -> Train
Which of the given hyper parameter(s), when increased, may cause random forest to over fit the data?
1. Number of Trees
2. Learning Rate
3. Depth of Tree (answer)
What kind of classification is the given case study (Sentiment Analysis dataset)?
Multi class classification / Multi label classification / Binary classification (answer: the labels are 1 and 0)