
2018 5th International Conference on Industrial Engineering and Applications

Sentiment Analysis of Twitter Corpus Related to Artificial Intelligence Assistants

Chae Won Park, Dae Ryong Seo


Paul Math School
Chungcheongbuk-do, Republic of Korea
e-mail: rachae011@gmail.com, dragonseo@hotmail.com

Abstract—Providing an enhanced experience is one of the most significant current issues in user research. A process that improves the user's experience is required to evaluate usability and emotion. Above all, sentiment analysis based on users' opinions can be used to understand user tendencies. This paper aims to establish a criterion for deciding which artificial intelligence assistant is statistically better. Users' opinions about three artificial intelligence assistants were collected from Twitter and classified into positive, negative, and neutral opinions by a lexicon named Valence Aware Dictionary and sEntiment Reasoner (VADER). We also analyzed the tweets with the independent samples t-test, the Kruskal-Wallis test, and the Mann-Whitney test to show the statistical significance among groups. The results suggest a ranking of the three artificial intelligence assistants based on statistical analysis.

Keywords-sentiment analysis, user research, artificial intelligence assistant, twitter corpus, lexicon

I. INTRODUCTION

A user should obtain the best possible experience from products or services. To enhance the quality of experience, user research, which consists of understanding, observing, and gathering feedback from users, has attracted the attention of researchers [1]. Many studies focus on evaluating the experience in terms of usability and emotion. Among them, users' opinions are generally applied to evaluate emotion [2], [3].

Sentiment analysis, also known as opinion mining, is a kind of big data mining. Researchers can utilize data from social media such as Twitter, Facebook, and Instagram, where users' subjective opinions appear, which improves the accuracy of understanding the user. Each user's opinion carries its own sentiment according to the user's values, circumstances, or interests, making it an important indicator for investigating user tendencies. Thus, sentiment analysis can help provide the best possible experience.

In this paper, users' opinions about three artificial intelligence assistants are examined: Siri by Apple, Google Assistant by Google, and Cortana by Microsoft. These opinions were collected from Twitter, a text-based social media service. Tweets are classified into positive, negative, and neutral opinions using Valence Aware Dictionary and sEntiment Reasoner (VADER), which converts the opinions into sentiment scores. Each opinion is quantified in a document matrix, which is used to demonstrate the statistical significance between groups.

The paper is composed of five sections. Section 2 provides a literature review of sentiment analysis from earlier studies to the present day. Section 3 describes the research process: how the data were collected and how the lexicon was employed, together with the hypotheses. Section 4 presents the sentiment score of each tweet mentioning the artificial intelligence assistants, along with the results of the t-tests, Kruskal-Wallis tests, and Mann-Whitney tests on the document matrix. The final section summarizes the results and suggests further research.

II. RELATED WORK

Earlier studies on sentiment analysis emerged from natural language processing [4], [5], [6]. They encompass machine learning (e.g., the Support Vector Machine (SVM) and the Naive Bayes classifier) and pattern recognition (e.g., K-Nearest Neighbor (KNN)), drawing on data sources such as Amazon reviews and blog posts. Pang and Lee [4] applied the Naive Bayes classifier, SVM, and Maximum Entropy for comparison, analyzing movie reviews with those machine learning techniques.

A number of papers examined polls using the Twitter API and predicted political outcomes through statistical analysis [7], [8], [9]. Both descriptive and inferential statistics were conducted. In [7], O'Connor et al. focused on consumer confidence and political opinions. Correlation analysis, a linear least-squares model, and a regression model were employed to understand people's sentiment. The correlation between sentiment ratio and consumer confidence was identified to obtain an indicator of polls. Furthermore, a regression model was used for forecasting, and its results distinguished poor predictors from better predictors of consumer confidence. This implied that qualitative phenomena could be measured through sentiment analysis in various cases.

In addition, in [8], candidates' profiles provided an indicator that included positive emotion, negative emotion, anger, and other dimensions. The work had two aspects: comparing tweets with election results and identifying ideological ties. The correspondence between politicians' positions and parties was demonstrated by an analysis of over 100,000 tweets.

To summarize, these studies implied that users' opinions are a practical indicator in user research, applicable to many areas in which users are involved.

III. METHODS

Our experimental procedure consisted of collecting a Twitter corpus and classifying it using the lexicon. After classification, the proportions of positive, negative, and neutral opinions were extracted from the lexicon output and applied to user research.
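A minimal sketch of this procedure is shown below. It is an illustration only, not the authors' code: the keyword filter, the two-word toy lexicon, and the weights stand in for the actual Streaming API pipeline and VADER's full dictionary.

```python
# Toy end-to-end sketch: filter tweets mentioning an assistant, score
# them with a miniature VADER-style lexicon, and report the proportion
# of positive, negative, and neutral tweets per assistant.
# The lexicon and its weights are illustrative assumptions.
ASSISTANTS = ("Siri", "Google Assistant", "Cortana")
LEXICON = {"happy": 0.52, "love": 0.60, "sad": -0.48, "hate": -0.57}

def score(text):
    """Sum the signed weights of lexicon words found in the text."""
    return sum(LEXICON.get(w, 0.0) for w in text.lower().split())

def label(text):
    """Classify a tweet as positive, negative, or neutral (score of 0)."""
    s = score(text)
    return "pos" if s > 0 else "neg" if s < 0 else "neu"

def proportions(tweets):
    """Per-assistant percentage of positive/negative/neutral tweets."""
    result = {}
    for name in ASSISTANTS:
        matched = [t for t in tweets if name.lower() in t.lower()]
        if not matched:
            continue
        counts = {"pos": 0, "neg": 0, "neu": 0}
        for t in matched:
            counts[label(t)] += 1
        result[name] = {k: 100.0 * v / len(matched) for k, v in counts.items()}
    return result
```

In the study itself, the classification step is performed by VADER and the proportions correspond to the percentages reported later in Table 1.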

978-1-5386-5748-5/18/$31.00 ©2018 IEEE 495


A. Corpus

Twitter was the most popular microblogging service; its number of monthly active users reached approximately 330 million as of October 2017. Twitter has often been used in the field of sentiment analysis: it is text based, and people post their opinions in real time. Their topics, such as politics, current news, comments on products, and musical taste, can be regarded as social data. Since users post tweets spontaneously, their opinions can be understood more effectively.

Tweets were collected in real time using the Twitter Streaming API for the month of November. Only English tweets were considered, since there are enormous numbers of users in English-speaking countries such as the U.S. Each tweet contained one of the entities Siri, Google Assistant, or Cortana. Generally, 2,000 to 5,000 tweets were gathered per day, with many more tweets extracted for Siri than for Cortana.

However, an event concerning Siri occurred during the data collection period. Siri's tweets were strongly influenced by the Paradise Papers, whose aftermath affected not only Apple but also Siri. It was an extremely negative issue relating Apple and Siri to tax evasion, and users posted far more tweets than on an ordinary day. Therefore, we treated the Siri event as an important issue from November 6 onward.

Google Assistant also had an event that influenced the results from November 6: a new function that can recognize music was released. People regarded it as a positive issue and posted tweets promoting Google Assistant's new function.

B. Lexicon

VADER is a lexicon focused on human-centered design by Hutto and Gilbert [10]. It employs qualitative and quantitative methods, does not demand training data, and is fast enough to operate on streaming data. The authors verified the reliability of their lexicon by comparing it with other lexicons, such as SentiWordNet, Linguistic Inquiry and Word Count, and the General Inquirer. For accurate sentiment analysis, they assigned a weight to each word, which could be positive, negative, or neutral. For example, the positive word 'happy' is mapped to a sentiment score of 0.52. If an adverb such as 'so' is added to the sentence, the score increases to 0.61, above that of 'happy' alone. On the other hand, the negative word 'sad' has a sentiment score of -0.48, a negative number. Thus, the weight of each word indicates the sentiment score more accurately.

VADER classified the tweets related to the artificial intelligence assistants into three categories: positive, negative, and neutral. Sentiment scores were assigned to a document matrix; if a tweet did not contain any positive or negative word, the matrix entry was zero. Moreover, the proportions of positive, negative, and neutral words could be expressed as percentages.

C. Hypothesis

The null and alternative hypotheses related to the three dates were derived as:

H0: μ1 = μ2 = μ3
H1: at least one mean is different.

Here μ is based on the data collection period (μ1 = mean of 1106, μ2 = mean of 1107, μ3 = mean of 1108, i.e., November 6, 7, and 8). In addition, the null and alternative hypotheses related to the artificial intelligence assistants were specified as:

H0: μ1 = μ2 = μ3
H1: at least one mean is different.

Here μ is based on the artificial intelligence assistants (μ1 = mean of Siri, μ2 = mean of Google Assistant, μ3 = mean of Cortana).

IV. RESULTS

The tweets related to each artificial intelligence assistant were classified using VADER. Furthermore, we considered differences across times, dates, and artificial intelligence assistants.

A. The Proportion of Words

The tweets were categorized as Siri, Google Assistant, and Cortana. The percentage of sentiment words has three parts: positive, negative, and neutral (see Table 1).

TABLE I. SENTIMENT SCORE OF ARTIFICIAL INTELLIGENCE ASSISTANTS (SENTIMENT SCORE / %)

Type  ST    06_1   06_2   07_1   07_2   08_1   08_2
SI    Pos   14.06  15.62   1.01   1.27   4.28   5.10
      Neg    9.86  11.31  30.93  30.64  20.62  17.13
      Neu   76.08  73.06  68.06  68.08  75.10  77.78
GA    Pos   30.00  32.18  34.06  21.23  19.39  19.56
      Neg    2.50   1.15   0.99   1.33   2.60   1.70
      Neu   67.50  66.67  64.95  77.44  78.01  78.74
CO    Pos    0.00  11.36  12.95  17.83  11.39  24.44
      Neg    6.67   6.82  13.99  19.75  16.03   4.89
      Neu   93.33  81.82  73.06  62.42  72.57  70.67

The positive score of Siri tended to decrease in line with the Paradise Papers event, which surfaced on Twitter after November 6, 19:33. Concurrently, the negative score increased rapidly to 30.93%, although it recovered over time. Google Assistant retained a positive score thanks to the addition of the music recognition function, with its neutral score increasing as time passed. The negative score of Cortana reached its worst proportion, 19.75%, on November 7 and eventually recovered to 4.89%.

B. Statistical Analysis

The scores of each assistant were divided by time of day in Table 2. The first time period covers 07:00 to 18:59, and the second covers the remaining hours. The result of the t-test for Siri is shown in Table 3.
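The split into the two time periods can be expressed as a simple rule. The helper below is a hypothetical illustration of that grouping, not the authors' code:

```python
from datetime import time

def time_period(t):
    """Return 1 for 07:00-18:59 and 2 for the remaining hours."""
    return 1 if time(7, 0) <= t < time(19, 0) else 2

def split_scores(scored_tweets):
    """Partition (timestamp, sentiment score) pairs into the two periods."""
    groups = {1: [], 2: []}
    for t, s in scored_tweets:
        groups[time_period(t)].append(s)
    return groups
```

The per-period lists produced this way correspond to the groups whose sizes, means, and standard deviations appear in Table 2.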

TABLE II. THE RESULT OF DESCRIPTIVE STATISTICS

Type  Time  N     Mean   SD
SI    1     2806  -.078  .210
      2     5932  -.053  .214
GA    1     1470   .082  .197
      2     1650   .064  .175
CO    1     445    .012  .240
      2     470    .072  .247

TABLE III. LEVENE'S TEST AND INDEPENDENT SAMPLES T-TEST ON SIRI

                        Levene's Test      t-test
                        F      Sig.    t       df     Sig. (2-tailed)
Equal variances
not assumed             29.5   .000    -5.119  5587   .000

Levene's test was conducted for equality of variances, and the independent samples t-test was carried out for equality of means. For Siri, scores in the second time period were statistically higher (M = -.053, SD = .214) than scores in the first time period (M = -.078, SD = .21), t(5587) = -5.119, p < .05, d = 0.12. According to the result of Levene's test, equal variances were not assumed (F = 29.5, p = .000). In addition, the result of the t-test for Google Assistant is shown in Table 4.

TABLE IV. LEVENE'S TEST AND INDEPENDENT SAMPLES T-TEST ON GOOGLE ASSISTANT

                        Levene's Test      t-test
                        F      Sig.    t       df     Sig. (2-tailed)
Equal variances
not assumed             6.3    .012    2.662   2961   .008

In the case of Google Assistant, scores in the first time period were higher (M = .082, SD = .197) than scores in the second (M = .064, SD = .175), t(2961) = 2.662, p < .05, d = 0.09. Thus, the "equal variances not assumed" result from SPSS is reported (F = 6.3, p = .012).

TABLE V. LEVENE'S TEST AND INDEPENDENT SAMPLES T-TEST ON CORTANA

                        Levene's Test      t-test
                        F      Sig.    t       df     Sig. (2-tailed)
Equal variances
not assumed             13.4   .000    -3.734  912    .000

Table 5 shows the result of the t-test for Cortana. Scores in the second time period were statistically higher (M = .072, SD = .247) than scores in the first (M = .012, SD = .24), t(912) = -3.734, p < .05, d = 0.25. Equal variances were not assumed according to the result of Levene's test (F = 13.4, p = .000).

In addition, the Kruskal-Wallis test was used instead of one-way ANOVA to address the violation of homogeneity of variances across dates (p = .000). The null hypothesis was rejected by the Kruskal-Wallis test in Table 6.

TABLE VI. THE RESULT OF KRUSKAL-WALLIS TEST ON THREE DATES

Total N   Test Statistic   DF   Asymptotic Sig. (2-sided test)
12,773    107.861          2    .000

According to these statistical results, we identified that the means were not all equal. Consequently, differences for each artificial intelligence assistant are demonstrated in Tables 7 and 8.

TABLE VII. THE RESULT OF KRUSKAL-WALLIS TEST ON SIRI

Total N   Test Statistic   DF   Asymptotic Sig. (2-sided test)
8,738     519.685          2    .000

TABLE VIII. THE RESULT OF KRUSKAL-WALLIS TEST ON GOOGLE ASSISTANT

Total N   Test Statistic   DF   Asymptotic Sig. (2-sided test)
3,120     12.525           2    .002

The null hypotheses were also rejected by the Kruskal-Wallis test in Table 7 for Siri and in Table 8 for Google Assistant: there were statistically significant differences across the three dates for both Siri and Google Assistant.

To investigate the differences among the artificial intelligence assistants, the Kruskal-Wallis test was performed (see Table 9).

TABLE IX. KRUSKAL-WALLIS TEST ON THREE ARTIFICIAL INTELLIGENCE ASSISTANTS

Total N   Test Statistic   DF   Asymptotic Sig. (2-sided test)
12,773    1,308.468        2    .000

There was a statistically significant difference among Siri, Google Assistant, and Cortana according to the Kruskal-Wallis test in Table 9. The Mann-Whitney test was then applied to demonstrate the difference between each pair of independent samples.

TABLE X. MANN-WHITNEY TEST BETWEEN SIRI AND GOOGLE ASSISTANT

Type    N      Mean Rank   Sum of Ranks
SI      8738   5407.47     47250515.50
GA      3120   7391.50     23061495.50
Total   11858

Mann-Whitney U   9069824.500    Z                        -35.837
Wilcoxon W       47250515.500   Asymp. Sig. (2-tailed)   .000

First, Siri and Google Assistant were examined by the Mann-Whitney test in Table 10. It showed that the rank for Google Assistant was statistically higher (Mdn = 0) than the rank for Siri (Mdn = 0), U = 9069824.5, p = .000, r = -.33.

TABLE XI. MANN-WHITNEY TEST BETWEEN SIRI AND CORTANA

Type    N      Mean Rank   Sum of Ranks
SI      8738   4747.09     41480062.50
CO      915    5590.13     5114968.50
Total   9653

Mann-Whitney U   3299371.500    Z                        -11.229
Wilcoxon W       41480062.500   Asymp. Sig. (2-tailed)   .000
In Table 11, the criterion between Siri and Cortana was verified by the Mann-Whitney test. The rank of Cortana was statistically greater (Mdn = 0) than the rank of Siri (Mdn = 0), U = 3299371.5, p = .000, r = -.11.

TABLE XII. MANN-WHITNEY TEST BETWEEN GOOGLE ASSISTANT AND CORTANA

Type    N      Mean Rank   Sum of Ranks
GA      3120   2088.46     6515982.00
CO      915    1777.76     1626648.00
Total   4035

Mann-Whitney U   1207578.000    Z                        -9.071
Wilcoxon W       1626648.000    Asymp. Sig. (2-tailed)   .000

Google Assistant and Cortana also differed in a statistically significant way according to the Mann-Whitney test in Table 12. The rank for Google Assistant was statistically greater (Mdn = 0) than that for Cortana (Mdn = 0), U = 1207578, p = .000, r = -.14.
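As a rough consistency check, some of the reported statistics can be recomputed from the summary values above: Welch's t statistic for Siri from the means, standard deviations, and sample sizes in Table 2, and the Mann-Whitney effect sizes r = Z / sqrt(N) from Tables 10-12. This sketch uses only the rounded values printed in the tables, so the recomputed t differs slightly from the reported t(5587) = -5.119.

```python
import math

def welch_t(m1, s1, n1, m2, s2, n2):
    """Welch's t statistic and degrees of freedom from summary stats."""
    v1, v2 = s1**2 / n1, s2**2 / n2
    t = (m1 - m2) / math.sqrt(v1 + v2)
    df = (v1 + v2)**2 / (v1**2 / (n1 - 1) + v2**2 / (n2 - 1))
    return t, df

def effect_size_r(z, n_total):
    """Effect size r for a Mann-Whitney test, r = Z / sqrt(N)."""
    return z / math.sqrt(n_total)

# Siri, Table 2: time period 1 vs. 2 (values rounded in the paper).
t, df = welch_t(-0.078, 0.21, 2806, -0.053, 0.214, 5932)

# Effect sizes from Tables 10-12 (Z and total N).
r_siri_ga = effect_size_r(-35.837, 11858)  # about -0.33
r_siri_co = effect_size_r(-11.229, 9653)   # about -0.11
r_ga_co = effect_size_r(-9.071, 4035)      # about -0.14
```

The recomputed effect sizes match the r values reported with Tables 10-12, and U can likewise be recovered from the rank sums (e.g., 47,250,515.5 minus 8738 x 8739 / 2 gives 9,069,824.5 for Table 10).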
V. CONCLUSION

In this paper, we analyzed tweets that mentioned one of three entities: Siri, Cortana, and Google Assistant. The tweets were collected using the Streaming API and divided into positive, negative, and neutral opinions by VADER, the sentiment dictionary. The change in sentiment score was described through positive, negative, and neutral percentages, reflecting the positive and negative events concerning Siri and Google Assistant. The Kruskal-Wallis test showed the influence of the artificial intelligence assistants' events, and the differences among the three dates and the three artificial intelligence assistants were identified by the statistical tests. The null hypotheses on the dates for Siri and Google Assistant were rejected, and the alternative hypothesis on the three artificial intelligence assistants was accepted. Furthermore, the three artificial intelligence assistants' mean ranks and sums of ranks were obtained by the Mann-Whitney test: Google Assistant had the highest rank, and Siri had the lowest rank among them.

Ultimately, this paper established a novel criterion that can support the decision of choosing a better artificial intelligence assistant through analysis of tweets. Users' opinions were used to understand users more systematically. However, natural language processing is not yet fully optimized. We hope that natural language processing will develop further, that the emotion evoked by products or services will be actively researched through sentiment analysis, and that this work will be applied in further research.

REFERENCES

[1] M. Kuniavsky, Observing the User Experience: A Practitioner's Guide to User Research. Morgan Kaufmann, 2003.
[2] A. Pak and P. Paroubek, "Twitter as a corpus for sentiment analysis and opinion mining," in LREC, vol. 10, May 2010, pp. 1320-1326.
[3] E. Boiy, P. Hens, K. Deschacht, and M. F. Moens, "Automatic sentiment analysis in on-line text," ELPUB, Jun. 2007, pp. 349-360.
[4] B. Pang and L. Lee, "Thumbs up?: sentiment classification using machine learning techniques," Proc. ACL-02 Conference on Empirical Methods in Natural Language Processing, Association for Computational Linguistics, Jul. 2002, pp. 79-86.
[5] B. Pang and L. Lee, "Opinion mining and sentiment analysis," Foundations and Trends in Information Retrieval, vol. 2, 2008, pp. 1-135.
[6] G. Vinodhini and R. M. Chandrasekaran, "Sentiment analysis and opinion mining: a survey," International Journal, vol. 2, 2012, pp. 282-292.
[7] B. O'Connor, R. Balasubramanyan, B. R. Routledge, and N. A. Smith, "From tweets to polls: Linking text sentiment to public opinion time series," Proc. The Fourth International AAAI Conference on Weblogs and Social Media, ICWSM, vol. 11, 2010, pp. 122-129.
[8] A. Tumasjan, T. O. Sprenger, P. G. Sandner, and I. M. Welpe, "Predicting elections with Twitter: What 140 characters reveal about political sentiment," Proc. The Fourth International AAAI Conference on Weblogs and Social Media, ICWSM, vol. 10, 2010, pp. 178-185.
[9] A. Bermingham and A. Smeaton, "On using Twitter to monitor political sentiment and predict election results," Proc. Workshop on Sentiment Analysis where AI meets Psychology, SAAIP, 2011, pp. 2-10.
[10] C. J. Hutto and E. Gilbert, "VADER: A parsimonious rule-based model for sentiment analysis of social media text," Proc. The Eighth International AAAI Conference on Weblogs and Social Media, May 2014, pp. 2-10.

