Take the first 80% of the dataset for training and the remaining 20% for testing. On
the training set, obtain TF-IDF features (with a 50K vocabulary) and learn a multinomial Naïve
Bayes model. Report the accuracy on the test set for this five-class classification problem.
Accuracy should be reported as class-wise precision, recall and F1. Submit q5.py. [10 marks]
i. Pandas
v. sklearn -> metrics (to compute accuracy metrics such as precision and recall)
i. seaborn
ii. matplotlib.pyplot
- Once the libraries are installed, they have to be imported in order for us to use them.
- Place the input file in the source path location and read the data using the pandas
read_json function.
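- Apply the train_test_split function to the dataset with a test size of 20% before building
the ML model. Because the question asks for the first 80% of the rows as the training set, pass
shuffle=False; by default the function shuffles the rows before splitting. This step creates
4 variables: the training and test features and the training and test labels.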
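The read and split steps above can be sketched as follows. This is a minimal illustration, not the submitted q5.py: the column names "text" and "label", the line-delimited JSON layout and the ten in-memory records are all assumptions that stand in for the real input file.

```python
import io

import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for the real input file: ten JSON records with
# assumed column names "text" and "label" (five classes, 1 to 5).
raw = io.StringIO(
    "\n".join(
        '{"text": "review number %d", "label": %d}' % (i, i % 5 + 1)
        for i in range(10)
    )
)
df = pd.read_json(raw, lines=True)

# shuffle=False keeps the original row order, so the first 80% of the rows
# form the training set and the last 20% form the test set.
X_train, X_test, y_train, y_test = train_test_split(
    df["text"], df["label"], test_size=0.2, shuffle=False
)

print(len(X_train), len(X_test))
```

With a real file, the StringIO object would be replaced by the file path; whether lines=True is needed depends on how the JSON file is laid out.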
Step 4: Create the model pipeline; train and test the model.
- Use the make_pipeline function to create a pipeline of the TF-IDF vectorizer
(TfidfVectorizer) and the multinomial Naïve Bayes classifier (MultinomialNB).
- Pass the argument ‘max_features’ to the TF-IDF vectorizer in order to limit the vocabulary to 50k.
- Call model.fit on the pipeline with the training dataset. This trains both the vectorizer
and the classifier.
- Use the trained model on the test dataset. The predicted labels are stored in the variable ‘label’.
- The confusion matrix compares the predictions for the test set (label) with the true labels
of the test set (y_test). It helps us visualise how accurately the model predicts each class.
- The sklearn metrics package can be used to calculate the class-wise precision, recall and F1.
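Step 4 can be sketched as follows. This is a minimal illustration under stated assumptions: the tiny three-class toy corpus is hypothetical and replaces the real five-class split, and classification_report is one possible way to obtain the class-wise figures, not necessarily the call used in the original script.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical toy data standing in for the real train/test split.
X_train = [
    "terrible awful bad", "bad poor awful",
    "okay average fine", "fine decent okay",
    "great excellent superb", "superb great love",
]
y_train = [1, 1, 3, 3, 5, 5]
X_test = ["awful bad", "okay fine", "great superb"]
y_test = [1, 3, 5]

# max_features keeps at most the 50,000 most frequent terms in the vocabulary.
model = make_pipeline(TfidfVectorizer(max_features=50000), MultinomialNB())

# Fit the vectorizer and the classifier on the training data only.
model.fit(X_train, y_train)

# Predicted class for each test document.
label = model.predict(X_test)

# Rows are the true classes (y_test); columns are the predicted classes (label).
cm = confusion_matrix(y_test, label)
print(cm)

# Class-wise precision, recall and F1 for the test set.
print(classification_report(y_test, label, zero_division=0))
```

To draw the matrix as a figure, cm can be passed to seaborn's heatmap function (for example with annot=True) and displayed with matplotlib.pyplot.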
Output:
Confusion matrix
Metrics