
6. Use the entire dataset.

Take the first 80% of the dataset for training and the remaining 20% for testing. On
the train set, obtain TFIDF features (with 50K vocabulary) and learn a multinomial Naïve
Bayes model. Report the accuracy on the test set for this five-class classification problem.
Accuracy should be reported as class-wise precision, recall and F1. Submit q5.py. [10 marks]

Step 1: Install required libraries

- For the dataframe

i. Pandas

- For machine learning model

i. sklearn.feature_extraction.text -> TfidfVectorizer (creates the TFIDF vectors)

ii. sklearn.naive_bayes -> MultinomialNB (for naïve bayes model)

iii. sklearn.pipeline -> make_pipeline (to create a pipeline of the aforementioned)

iv. sklearn.model_selection -> train_test_split (to split the data)

v. sklearn -> metrics (to compute metrics such as precision and recall)

vi. sklearn.metrics -> confusion_matrix, accuracy_score, roc_auc_score, roc_curve, auc, f1_score

- For visual representations

i. seaborn

ii. matplotlib.pyplot

Step 2: Import the aforementioned libraries

- Once the libraries are installed, they have to be imported in order for us to use them.

Step 3: Import the json file and split the data

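- Place the input file in the source path location and read the data using the pandas read_json
function.

- Apply the train_test_split function to the dataset before fitting the ML model. Because
train_test_split shuffles the rows by default, shuffling has to be disabled (shuffle=False) to
take the first 80% for training, as the question requires. This step creates 4 variables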

i. x_train – the training set independent variable

ii. x_test – testing set independent variable


iii. y_train – the training set target variable

iv. y_test – the testing set target variable
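
The split described above can be sketched as follows. This is only an illustration under stated assumptions: the column names "text" and "label" are hypothetical, and a small in-memory JSON string stands in for the real input file, whose name is not given here.

```python
from io import StringIO

import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for the input file: 10 records with assumed
# column names "text" and "label" (five classes, 1 to 5).
records = ",".join(
    '{"text": "sample review %d", "label": %d}' % (i, i % 5 + 1)
    for i in range(10)
)
raw_json = StringIO("[" + records + "]")

# With the real data this would be pd.read_json(<path to the input file>).
df = pd.read_json(raw_json, orient="records")

# shuffle=False keeps the original row order, so the first 80% of the
# rows go to the train set and the last 20% to the test set.
x_train, x_test, y_train, y_test = train_test_split(
    df["text"], df["label"], test_size=0.2, shuffle=False
)
```

With 10 rows, this leaves rows 0 to 7 in the train set and rows 8 and 9 in the test set.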

Step 4: Create the model pipeline; train and test the model.

- Use the make_pipeline function to create a pipeline of the TfidfVectorizer and the
MultinomialNB (multinomial naïve Bayes) model.

- Pass the argument max_features to the TfidfVectorizer in order to limit the vocabulary to 50k.

- Call model.fit on the pipeline with the training dataset. This trains the model.

- Use the model to predict on the test dataset. The predicted values are stored in the variable ‘label’.
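
A minimal sketch of this step is shown below. The tiny toy corpus is an assumption made purely so the example runs on its own; in the actual solution x_train, y_train and x_test come from the split in Step 3.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Toy stand-ins for the real split (assumption: texts are plain strings
# and the five classes are the integers 1 to 5).
x_train = [
    "terrible awful",
    "bad poor",
    "okay average",
    "good nice",
    "excellent superb",
]
y_train = [1, 2, 3, 4, 5]
x_test = ["terrible", "superb excellent"]

# max_features caps the vocabulary at the 50,000 most frequent terms.
model = make_pipeline(TfidfVectorizer(max_features=50000), MultinomialNB())

# Train on the training set only.
model.fit(x_train, y_train)

# Predict the class of each test document.
label = model.predict(x_test)
```

Because the vectorizer sits inside the pipeline, its vocabulary is learned from the train set only and then reused when transforming the test set.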

Step 5: Create the confusion matrix

- The confusion matrix shows the predictions for the test set (label) against the true
values (y_test). It helps us to visualise how accurately the model is predicting.

- A heatmap of label vs y_test will help us create the confusion matrix.
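
One way to build and draw the matrix is sketched below. The toy values of y_test and label are assumptions for illustration, and plot_confusion is a hypothetical helper name; the plotting imports sit inside the helper so the matrix itself can be computed without a display.

```python
from sklearn.metrics import confusion_matrix

# Toy stand-ins for the true and predicted test labels (assumption).
y_test = [1, 2, 3, 4, 5, 5, 1]
label = [1, 2, 3, 5, 5, 4, 1]

classes = sorted(set(y_test))

# Rows are the true classes, columns are the predicted classes.
cm = confusion_matrix(y_test, label, labels=classes)


def plot_confusion(cm, classes):
    """Draw the confusion matrix as an annotated heatmap (hypothetical helper)."""
    import matplotlib.pyplot as plt
    import seaborn as sns

    ax = sns.heatmap(
        cm, annot=True, fmt="d", cbar=False,
        xticklabels=classes, yticklabels=classes,
    )
    ax.set_xlabel("predicted label")
    ax.set_ylabel("true label")
    plt.tight_layout()
    return ax
```

Calling plot_confusion(cm, classes) followed by plt.show() renders the heatmap. Entries on the diagonal are correct predictions; everything off the diagonal is a misclassification.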

Step 6: Compute the metrics

- The metrics package can be used to calculate the precision, recall and F1.

- The classification_report function of the metrics package gives us the required class-wise numbers.
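
As a rough sketch, the report can be produced as follows. The toy values of y_test and label are again assumptions; in the actual solution they come from Steps 3 and 4.

```python
from sklearn.metrics import classification_report

# Toy stand-ins for the true and predicted test labels (assumption).
y_test = [1, 2, 3, 4, 5, 5, 1]
label = [1, 2, 3, 5, 5, 4, 1]

# Text report with one row per class: precision, recall, F1 and support.
# zero_division=0 reports 0 instead of warning when a class is never predicted.
print(classification_report(y_test, label, digits=3, zero_division=0))

# The same numbers as a nested dict, keyed by class label as a string.
report = classification_report(y_test, label, output_dict=True, zero_division=0)
```

The dict form is convenient when individual values are needed, for example report["5"]["recall"] for the recall of class 5.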

Output:

Confusion matrix

Metrics
