
Practice problem

Natural Language Processing


Session 3

Madhuri Prabhala
Working with text data

Details on the dataset:


The dataset contains tweets collected from different IDs.
The tweets discuss various government policies.
We want to identify the important topics of discussion.
We want to understand the general perception of these topics.
Please answer the following for the given Excel file

1. What are the steps involved?


2. How many rows are there?
3. How many columns are there?
4. What are the names of the variables?
5. What is the type of each variable in the dataset?
6. How will you get the overall information about the dataset?
7. What do the values of the dataset look like?
8. Are there any missing values in the dataset?
9. Which is the column of interest?
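Questions 2 to 9 can be explored with a few pandas calls. The sketch below is only illustrative: the column names `user_id` and `tweet`, the sample rows, and the file name in the comment are assumptions, not details of the actual Excel file.

```python
import pandas as pd

# Hypothetical stand-in for the Excel file. In practice the data would be
# loaded with something like pd.read_excel("tweets.xlsx") (file name assumed).
df = pd.DataFrame(
    {
        "user_id": [101, 102, 103, 104],
        "tweet": [
            "New tax policy helps small businesses",
            "Fuel prices are too high",
            None,  # a missing tweet, to illustrate question 8
            "Great move on the education policy",
        ],
    }
)

n_rows, n_cols = df.shape                # questions 2 and 3
column_names = list(df.columns)          # question 4
column_types = df.dtypes                 # question 5
df.info()                                # question 6: overall summary
print(df.head())                         # question 7: first few rows
missing_per_column = df.isnull().sum()   # question 8
text_column = "tweet"                    # question 9 (assumed column name)
```

Here `df.info()` prints the row count, column types, and non-null counts in one place, which covers most of the questions at a glance.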
Please answer the following for the given Excel file

10. How many unique words are there in the text corpus?
11. What are the 10 most frequent words?
12. What are the 10 least frequent words?
13. Create a Word Cloud.
14. What are your observations?
15. What will you do next?
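Questions 10 to 12 come down to counting tokens. The following is a minimal sketch using only the standard library; the sample tweets and the simple regex tokenizer are assumptions made for illustration, and a real analysis would likely also remove stop words.

```python
import re
from collections import Counter

# Hypothetical sample tweets standing in for the text column.
tweets = [
    "New tax policy helps small businesses",
    "The tax policy is unfair to farmers",
    "Great move on the education policy",
]


def tokenize(text):
    """Lowercase the text and keep alphabetic tokens only."""
    return re.findall(r"[a-z]+", text.lower())


# Flatten all tweets into one list of tokens and count them.
tokens = [tok for tweet in tweets for tok in tokenize(tweet)]
word_counts = Counter(tokens)

n_unique = len(word_counts)                          # question 10
most_frequent = word_counts.most_common(10)          # question 11
least_frequent = word_counts.most_common()[:-11:-1]  # question 12
```

For question 13, the third-party `wordcloud` package can build a word cloud from these counts, for example with `WordCloud().generate_from_frequencies(word_counts)`, assuming the package is installed.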
Sentiment Analysis

16. What are the different sentiment scores that can be calculated?
Use TextBlob and VADER.
17. What is your take based on the sentiment scores?
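As a rough sketch of question 16: TextBlob gives a polarity and a subjectivity score, and VADER gives negative, neutral, positive, and compound scores. The helper names below are hypothetical, the code assumes the `textblob` and `vaderSentiment` packages are installed, and the 0.05 cut-off is a commonly used convention rather than a requirement.

```python
def label_from_compound(compound, threshold=0.05):
    """Map a VADER compound score in [-1, 1] to a coarse sentiment label."""
    if compound >= threshold:
        return "positive"
    if compound <= -threshold:
        return "negative"
    return "neutral"


def score_tweet(text):
    """Return TextBlob and VADER sentiment scores for one tweet."""
    # Imported lazily so the labelling helper above works without these packages.
    from textblob import TextBlob
    from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

    blob_sentiment = TextBlob(text).sentiment
    vader_scores = SentimentIntensityAnalyzer().polarity_scores(text)
    return {
        "textblob_polarity": blob_sentiment.polarity,          # -1 to 1
        "textblob_subjectivity": blob_sentiment.subjectivity,  # 0 to 1
        "vader_neg": vader_scores["neg"],
        "vader_neu": vader_scores["neu"],
        "vader_pos": vader_scores["pos"],
        "vader_compound": vader_scores["compound"],            # -1 to 1
        "vader_label": label_from_compound(vader_scores["compound"]),
    }
```

Calling `score_tweet` on each tweet in the column of interest and summarising the labels is one way to approach question 17.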
Topic Model

18. What are the topics of discussion?


19. How will you choose the optimum number of topics?
20. How many topics are optimum for this dataset?
21. What are the insights you can draw from the topics?
Topic Model – Assessing optimum number of topics

❑ Latent Dirichlet Allocation


o Each topic is a mixture of underlying words
o Each document is a mixture of underlying topics

❑ Perplexity
o Measures how well the model predicts held-out data.
o Lower perplexity scores are considered better.

❑ Coherence
o Measures how closely the identified topics match human intuition.
o There are multiple measures of coherence, such as:
c_v, c_umass, c_npmi, c_a
o We choose models with higher coherence scores.
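One possible way to compare candidate numbers of topics is to fit an LDA model for each candidate and record its coherence and perplexity. The sketch below assumes the `gensim` package; the function names, the `passes` and `seed` values, and the choice of `c_v` are illustrative assumptions. For brevity it evaluates perplexity on the training corpus, whereas a held-out set would be preferable.

```python
def pick_best_num_topics(coherence_by_k):
    """Return the number of topics with the highest coherence score."""
    return max(coherence_by_k, key=coherence_by_k.get)


def evaluate_num_topics(tokenized_docs, candidate_ks, seed=42):
    """Fit one LDA model per candidate k and collect evaluation scores.

    tokenized_docs is a list of token lists, one per document.
    Returns two dicts keyed by k: coherence (c_v) and per-word bound.
    """
    # Imported lazily so the selection helper above works without gensim.
    from gensim.corpora import Dictionary
    from gensim.models import CoherenceModel, LdaModel

    dictionary = Dictionary(tokenized_docs)
    corpus = [dictionary.doc2bow(doc) for doc in tokenized_docs]

    coherence_by_k, bound_by_k = {}, {}
    for k in candidate_ks:
        lda = LdaModel(
            corpus=corpus,
            id2word=dictionary,
            num_topics=k,
            random_state=seed,
            passes=10,
        )
        coherence = CoherenceModel(
            model=lda,
            texts=tokenized_docs,
            dictionary=dictionary,
            coherence="c_v",
        )
        coherence_by_k[k] = coherence.get_coherence()
        # log_perplexity returns a per-word likelihood bound;
        # a higher bound corresponds to a lower perplexity.
        bound_by_k[k] = lda.log_perplexity(corpus)
    return coherence_by_k, bound_by_k
```

For example, `pick_best_num_topics({2: 0.41, 3: 0.52, 4: 0.47})` returns 3; the scores in this call are made up for illustration. In practice it helps to plot coherence against the number of topics and check that the chosen topics are also interpretable.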
