
Data Analytics with R

Mid-Term Exam Assignment

Submitted By:
Sakshi Shah
190201089
Section B
Question 1:
Download 4000 tweets from Twitter API under the given condition:

a. Use Geocode of Delhi with 100 miles.


b. Use the Keyword “Covid-19 Vaccine”.
c. Collect popular tweets.
d. Clean all tweets such as text extraction, stopword deletion, etc.

Code:
The code below was used to extract tweets for the keyword “Covid-19 vaccine” with the geocode of
Delhi and a radius of 100 miles. The 4000 most recent tweets were extracted. After extracting the
tweets and building the corpus, the data was cleaned: stopwords, punctuation, numbers, URLs, etc. were removed.
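The extraction step could look roughly like the sketch below. The original code is not reproduced in this document, so the choice of the twitteR package, the coordinates used for Delhi, and the helper name fetch_vaccine_tweets are all assumptions. The helper is only defined, not run, because it needs valid API credentials.

```r
# Geocode string in the "latitude,longitude,radius" form used by the
# Twitter search API. The coordinates for Delhi are an assumption.
delhi_geocode <- paste(28.6139, 77.2090, "100mi", sep = ",")

# Hypothetical helper: downloads up to n tweets for the keyword near Delhi.
# It assumes twitteR::setup_twitter_oauth() has already been called with
# valid API keys, so it is defined here but not executed.
fetch_vaccine_tweets <- function(n = 4000, result_type = "recent") {
  tweets <- twitteR::searchTwitter(
    "Covid-19 Vaccine",
    n = n,
    geocode = delhi_geocode,
    resultType = result_type  # "popular", "recent", or "mixed"
  )
  twitteR::twListToDF(tweets)  # one row per tweet; the text is in $text
}

print(delhi_geocode)
```

Passing result_type = "popular" would request popular tweets, as condition (c) of the question asks.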

• Build Corpus:
Code:
Output:

• Clean Text:
Output:
Data Cleaning
• As seen in the output above, 174 English stopwords were removed from the tweets, as they are
redundant words that provide no insight.
• Special characters and punctuation such as , / ! ] [ were also removed.
• URLs were removed from the tweets to make the analysis more meaningful.
• Numbers were removed from the tweets, as they were not meaningful to the analysis.
• stemDocument was used to reduce words to their root form, so that different forms of the same word are not counted separately.
• Extra blank spaces were removed with stripWhitespace.
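The cleaning steps listed above can be sketched with the tm package as follows. The two tweets are made up for illustration, and the order of the steps and the URL pattern are assumptions rather than the code that was actually submitted.

```r
library(tm)  # stemDocument additionally needs the SnowballC package

# Two made-up tweets standing in for the downloaded data.
tweets <- c(
  "Got my 1st dose of the Covid-19 vaccine today!! https://example.com/abc",
  "India is fighting covid together, 2 lakh doses given"
)

corpus <- VCorpus(VectorSource(tweets))

# Custom transformation that drops anything starting with "http".
remove_url <- content_transformer(function(x) gsub("http[^[:space:]]*", "", x))

corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, remove_url)          # before punctuation is stripped
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, removeNumbers)
corpus <- tm_map(corpus, removeWords, stopwords("english"))
corpus <- tm_map(corpus, stemDocument)        # keep only the root of each word
corpus <- tm_map(corpus, stripWhitespace)

cleaned <- trimws(sapply(corpus, as.character))
print(cleaned)
```

URLs are removed before punctuation in this sketch because stripping punctuation first would break the "http://..." pattern and leave fragments of the link behind.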

• Term Document Matrix


Code:

Output:
This matrix describes the frequency of the terms that occur in a collection of documents.
It tells us how many times a particular word is used in each document; for example, as seen above,
covid is used once.
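A minimal sketch of how such a matrix can be built and read, assuming the tm package and three made-up, already-cleaned documents:

```r
library(tm)

# Three tiny, already-cleaned documents (made up for illustration).
docs <- c("covid vaccine dose", "fight covid", "vaccine drive vaccine")
corpus <- VCorpus(VectorSource(docs))

tdm <- TermDocumentMatrix(corpus)
m <- as.matrix(tdm)  # rows are terms, columns are documents
print(m)

# Total frequency of each term across all documents, most frequent first.
term_freq <- sort(rowSums(m), decreasing = TRUE)
print(term_freq)
```

Each cell holds the number of times the row's term occurs in the column's document, so the row sums give the overall term frequencies that feed a word cloud.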
Question 2:

After downloading 4000 tweets from Twitter API under the above condition

a. Create a wordcloud for one, two, and three-gram tokenizer.


Code:

Output:

Unigram:
Bigram:

Trigram:
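The idea behind the one-, two-, and three-gram word clouds can be sketched in base R as below. The helper names ngrams and ngram_freq and the two tweets are hypothetical, and this hand-written tokenizer stands in for whichever tokenizer the original code used. The drawing step assumes the wordcloud package and is skipped if it is not installed.

```r
# Build all n-grams (as space-separated strings) from one piece of text.
ngrams <- function(text, n) {
  words <- strsplit(text, "[[:space:]]+")[[1]]
  words <- words[nzchar(words)]
  if (length(words) < n) return(character(0))
  starts <- seq_len(length(words) - n + 1)
  vapply(starts, function(i) paste(words[i:(i + n - 1)], collapse = " "),
         character(1))
}

# Frequency table of n-grams over all texts, most frequent first.
ngram_freq <- function(texts, n) {
  sort(table(unlist(lapply(texts, ngrams, n = n))), decreasing = TRUE)
}

# Made-up cleaned tweets.
tweets <- c("nation spirit fight covid", "spirit fight covid vaccine")

bigrams  <- ngram_freq(tweets, 2)
trigrams <- ngram_freq(tweets, 3)
print(bigrams)
print(trigrams)

# Draw the bigram cloud only if the wordcloud package is available.
if (requireNamespace("wordcloud", quietly = TRUE)) {
  wordcloud::wordcloud(names(bigrams), as.numeric(bigrams), min.freq = 1)
}
```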
b. Find the sentiment analysis of all 4000 tweets.
c. What is the proportion of positive and negative sentiment in the whole document
(4000 tweets)?

Code:

Output:
The sentiment score is 0.701, which tells us that the tweets are on the positive side.

Around 5000 sentiments are positive, compared with around 2000 negative sentiments.
Sentiments related to trust are also very high (3000).

Approximately 26% are positive and 10% are negative.

Combining all positive emotions (positive, trust, surprise, joy), 50% of the emotions seem
positive.

Combining the negative emotions (negative, sadness, anger, fear, disgust), 42% are on
the negative side.
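The category names reported above match the NRC lexicon, so the proportions were presumably computed along the following lines. This is a sketch that assumes the syuzhet package and uses made-up tweets; its numbers are not the ones reported above.

```r
library(syuzhet)

# Made-up tweets standing in for the cleaned data.
tweets <- c(
  "happy and hopeful about the vaccine",
  "afraid of a second wave",
  "trust the doctors and scientists"
)

# One row per tweet, one column per NRC emotion or sentiment.
nrc <- get_nrc_sentiment(tweets)

# Total count per category and its share of all counted sentiments.
totals <- colSums(nrc)
shares <- round(100 * prop.table(totals), 1)
print(shares)
```

Because the shares are taken over all ten categories together, the positive and negative groups need not add up to 100%; the remainder belongs to categories placed in neither group.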

d. Write the conclusion of the whole analysis obtained from wordcloud and sentiment
analysis.
Word clouds

• Unigram: The word cloud shows that the most frequent words in the Covid-19 vaccine
tweets are covid, India, vaccine, nation, fight, dose, etc.
We can infer that people feel the country should be united as a nation in the fight
against the virus, and that the development of a vaccine is important.
• Bigram: As with the unigrams, the most frequent phrases include fight covid,
nation spirit, lakh dose, bharat biotech, and spirit fight. These show people's sentiments
about the vaccination drive under way in the country and about how the nation
should fight together against Covid-19.
• Trigram: A similar picture emerges from the trigrams, the most frequent being
spirit fight covid, nation spirit fight, covid vaccine drive, etc.
Therefore, to conclude, people are mostly tweeting about how the nation should come
together to fight Covid-19, and about how concerned yet positive they are regarding the
development of the vaccine.

Sentiment Analysis:

• The sentiment analysis shows us that people are on the positive side and are hopeful
that the pandemic will come under control because of the vaccination.
• Emotions like trust show us that the people of India believe in the country and are
ready to fight together.
• Emotions like anticipation, fear, and surprise show us that even though vaccination is
here, people are still unsure about its effectiveness and fear negative
symptoms or a second wave of Covid-19.
• People are also sad and angry because of the havoc created by this pandemic.

Overall, people are hopeful that things will get better this year and are trying to stay
positive, especially because of the vaccination drive. They are ready to fight together as one nation.

Question 3:

Using ‘R’ programming and dataset ‘MBA_Salary’

a) Calculate the descriptive analytics for the features age, salary, and gender and
discuss their significance.
• Mean of age and salary:

Code:

Output:

This tells us that the average age in our sample is 28.3 years.

Code:
The missing values were excluded when calculating the mean salary, which is
570997.3.
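Excluding missing values from a mean is usually done with the na.rm argument. A minimal sketch with made-up salaries (not the MBA_Salary data):

```r
# Made-up salaries with one missing value.
salary <- c(500000, 620000, NA, 580000)

mean_with_na    <- mean(salary)                # NA, because one value is missing
mean_without_na <- mean(salary, na.rm = TRUE)  # mean of the three observed values

print(mean_with_na)
print(mean_without_na)
```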

• Standard deviation and variance of age:

Code:

Output:

This shows us that the standard deviation of age is 14.31 and the variance is 205.83.
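For reference, a small sketch of the two functions on made-up ages. Both sd() and var() use the sample (n - 1) denominator, and the variance is the square of the standard deviation.

```r
age <- c(24, 26, 28, 30, 32)  # made-up ages, not the MBA_Salary data

age_sd  <- sd(age)   # sample standard deviation
age_var <- var(age)  # sample variance, equal to age_sd^2

print(c(sd = age_sd, var = age_var))
```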

• Frequency table for gender:

Code and output:

This tells us that 169 people are coded as sex "1" and 59 people as sex "2".
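The frequency table, and the percentage share of each code, can be reproduced from the two counts reported above. The vector below is rebuilt from those counts, not read from the MBA_Salary file.

```r
# Rebuild a gender vector with the reported counts (169 and 59).
gender <- rep(c(1, 2), times = c(169, 59))

counts  <- table(gender)
percent <- round(100 * prop.table(counts), 1)

print(counts)
print(percent)
```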

b) Draw bar chart, pie chart, scatterplot, and histogram for one of the variables of
MBA_Salary.
• Bar chart: Variable: number of working years

Code:

This bar chart shows the number of people in each working-years category.

• Pie chart: Variable: gender


Code:

The pie chart shows that, of the total, 74.1% of the people in the sample were male and 25.9% were female.
• Scatterplot: Variables: GMAT score and working years
Code:

This shows a scatterplot of GMAT score against working years. No specific pattern could be noted from
the plot.
• Histogram: Variable: age
Code:
This shows the distribution of age.
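The four charts can be sketched with base R graphics as follows. The data frame is made up, the column names are hypothetical rather than those of MBA_Salary, and reading code 1 as male and code 2 as female is an assumption based on the percentages quoted above.

```r
# Made-up data; the column names are hypothetical.
df <- data.frame(
  age        = c(24, 25, 26, 26, 27, 28, 28, 29, 30, 45),
  gender     = c(1, 1, 2, 1, 1, 2, 1, 1, 2, 1),
  work_years = c(1, 2, 2, 3, 3, 3, 4, 5, 5, 20),
  gmat       = c(560, 600, 620, 580, 640, 700, 610, 590, 650, 630)
)

pdf(NULL)  # draw to a null device; remove this line to see the plots

# Bar chart: count of people per number of working years.
barplot(table(df$work_years), xlab = "Working years", ylab = "Count")

# Pie chart: share of each gender code, labelled with percentages.
gender_counts <- table(df$gender)
pie(gender_counts,
    labels = paste0(c("Male", "Female"), " ",
                    round(100 * prop.table(gender_counts), 1), "%"))

# Scatterplot: GMAT score against working years.
plot(df$work_years, df$gmat, xlab = "Working years", ylab = "GMAT score")

# Histogram: distribution of age.
h <- hist(df$age, main = "Distribution of age", xlab = "Age")

invisible(dev.off())
```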

c) Find the outlier of features age and salary using boxplot.

Age

Code:

Output:
The plot shows that all ages beyond 35 are outliers, as denoted by the dots on the plot.

Salary:

Code:

Output:

The plot shows that all salary values below 1e+04.5 are outliers, as denoted by the
dots on the plot.
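The dots on a boxplot can also be listed as values. By default, R's boxplot() marks as outliers the points lying more than 1.5 times the interquartile range beyond the box, and returns them in the out component. A sketch with made-up ages:

```r
age <- c(24, 25, 26, 26, 27, 28, 28, 29, 30, 45)  # made-up ages

pdf(NULL)  # draw to a null device; remove this line to see the plot
bp <- boxplot(age, ylab = "Age")
invisible(dev.off())

age_outliers <- bp$out  # values drawn as individual dots
print(age_outliers)

# The same values without drawing anything.
print(boxplot.stats(age)$out)
```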
d) Find the missing values in the feature ‘satis’ and replace them with an
appropriate statistical measure.

Code:
• Finding which rows of the feature satis have missing values

• Replacing the missing values with the mean value of satis (mean imputation).
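The two steps above can be sketched in base R as follows, using a made-up satis column rather than the MBA_Salary data:

```r
df <- data.frame(satis = c(5, NA, 7, 6, NA, 6))  # made-up satisfaction scores

# Step 1: find the rows where satis is missing.
missing_rows <- which(is.na(df$satis))
print(missing_rows)

# Step 2: replace the missing values with the mean of the observed ones.
satis_mean <- mean(df$satis, na.rm = TRUE)
df$satis[missing_rows] <- satis_mean
print(df$satis)
```

Mean imputation keeps the column mean unchanged but shrinks its variance, which is worth keeping in mind when the imputed column is analysed further.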

Output:
