CITS 2401

Computer Analysis
and Visualisation

Assignment 1
Tweet Analysis
Worth: 5% of the unit

Submission: Answer the questions on the quiz server.


Deadline: 18 March 2022, 5pm
Late submissions: late submissions attract a 5% raw penalty per day for up to 7 days (i.e., until 25 March 2022, 5pm). After that, the mark
will be 0 (zero). Any plagiarised work will also be marked zero.

1. Outline
Throughout the tumultuous pandemic period, people from all over the world have expressed their
opinions and sentiments on various social media platforms such as Twitter. Analysing tweets tagged with
keywords like “covid-19” can help us gain an overview of the wider public opinion on topics and events
related to the pandemic. These insights can be used to inform social or political actions, targeting specific
concerns. This analysis process falls under the umbrella of natural language processing (NLP). The series
of assignments in this course aims to give you a brief look into NLP and its applications to real-world
datasets.

In this assignment, you will be carrying out simple data analysis tasks using tweets as outlined in the
Tasks section below, mostly just to test your basic Excel competency. More complex tasks will be carried
out in other assignments (stay tuned!).

Note 1: This is an individual assignment, so please don't share your solution, code or files with others (only
high-level discussion is allowed, e.g., the syntax of a formula or the use of array formulas with other examples).
If your submission is found not to be your original work, you may be penalised.

Note 2: You may use sensible formatting and colour combinations to display your worksheet in an
understandable manner. However, don't over-decorate the worksheet.

2. Tasks
Task 1
Import original.txt into Excel word by word. Here the term "word" refers to any sequence of characters
separated by a space. Note that the text qualifier should be set to {none} when you import the text. This
sheet should be named words_data.

Note: If you are using Excel 2019 or newer, you should enable the legacy text import method (a quick
web search should show you how).

Next, the whole data range should be named words; apart from the first 2 columns, each row contains the
words from one tweet. Column A should contain the sentiment of the tweet, one of “anger”, “fear”,
“joy”, “sadness” or “neutral”. Name this data range emotions. Finally, Column B should contain the
follower count of the Twitter account that posted the tweet in the same row. Name this data range followers.
Figure 1 shows an example of the expected output when this task is done correctly. These tweets
may not seem like they form coherent sentences, but that’s because they’ve been pre-processed using
various NLP techniques (you’ll get to explore these in Assignment 2!).

Figure 1. words_data sheet snippet.
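
To make the "word by word" layout concrete, here is a minimal Python sketch of how one line could be split into cells. The line layout (emotion, follower count, then the words, all separated by single spaces) is an assumption based on the description above, and parse_tweet_line and the sample line are hypothetical. Because no text qualifier is applied, a stray quotation mark simply stays part of its word.

```python
def parse_tweet_line(line):
    """Split one line into (emotion, followers, words) on single spaces."""
    # Assumed layout: emotion, follower count, then the tweet's words.
    emotion, followers, *words = line.rstrip("\n").split(" ")
    return emotion, int(followers), words


# Hypothetical line, not taken from original.txt.
sample = 'joy 1520 stay "safe wear mask'
print(parse_tweet_line(sample))
```

This only illustrates the intended cell layout; the assignment itself must be completed with the Excel text import.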

Task 2
Create a new sheet named uniques_data. Import the list of unique words from the uniques.txt file
provided. These are unique words extracted from the tweets. The words should be located from Cell A1.
The whole range should be named uniques.

Task 3
1. In Column B: Calculate the frequency of each unique word in the words_data sheet. You must
use an array formula to do this. Name the cell range freq. Note that matching is not case sensitive
(e.g., Covid and covid are both counted for "Covid"), but a word with punctuation attached does not
match (e.g., Covid; and covid! are not counted for "Covid").
2. In Column C: Calculate the total number of letters each word contributes to words_data. This can
be calculated by simply multiplying the number of letters in the word by its frequency count.
Name the cell range letters.
3. In Column D: Calculate the rank of each word based on the value of freq (rank 1 goes to the
largest value). You must use an array formula to do this. Name the cell range rank.

In addition, apply conditional formatting on rank so that the top 10 ranks (i.e., the 10 highest
frequency values) are formatted with a light red fill and dark red text.
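
The three calculations in Task 3 can be sketched outside Excel to check your understanding of the expected numbers. The following Python sketch is illustrative only: word_stats and the sample data are hypothetical, and giving tied frequencies the same rank is an assumption about the intended ranking.

```python
def word_stats(rows, uniques):
    """Return (freq, letters, rank) lists aligned with `uniques`."""
    # Flatten all tweet words and lower-case them for case-insensitive matching.
    all_words = [w.lower() for row in rows for w in row]
    # Whole-cell matches only, so "covid!" does not count towards "covid".
    freq = [all_words.count(u.lower()) for u in uniques]
    # Letters = word length multiplied by its frequency.
    letters = [len(u) * f for u, f in zip(uniques, freq)]
    # Rank 1 goes to the largest frequency; tied values share a rank (assumption).
    rank = [1 + sum(other > f for other in freq) for f in freq]
    return freq, letters, rank


# Hypothetical data, not taken from the assignment files.
rows = [["Covid", "vaccine", "covid!"], ["covid", "mask", "vaccine"]]
uniques = ["covid", "vaccine", "mask"]
freq, letters, rank = word_stats(rows, uniques)
print(freq, letters, rank)
```

Here "covid!" is ignored because of the attached punctuation, while "Covid" and "covid" are counted together.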

Task 4
Create a new sheet named stats. Add the following row labels in cells A2 to A7:
1. Average
2. Max
3. Min
4. Median
5. Mode
6. SD

Note, SD stands for standard deviation. Also, Average and SD should be rounded to 2 decimal places.

Next, add labels as follows (Cell: Value):


1. B1: Frequency
2. C1: Letters
3. D1: >Average
4. E1: <Average

• The "Frequency" category (Column B) uses freq to calculate the statistics.
• The "Letters" category (Column C) uses letters.
• The ">Average" category (Column D) uses letters where values are greater than the average (i.e.,
value in cell C2).
• The "<Average" category (Column E) uses letters where values are smaller than the average.

Populate all the statistics fields (i.e., B2:E7). You must not use any other supporting cells (i.e., you
should calculate all those stats directly using Excel formulas using previously populated cells only).

Note: the values in the uniques_data tab (i.e., freq and letters) should be treated as the entire
population.
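
As a rough cross-check of the statistics above, the same six values can be computed in Python. This is a sketch under stated assumptions: summarise and the sample values are hypothetical, and the filter compares against the unrounded average. Since the data is treated as the entire population, the sketch uses the population standard deviation (statistics.pstdev), which divides by N rather than N - 1.

```python
import statistics


def summarise(values):
    """Six summary statistics, treating `values` as the entire population."""
    return {
        "Average": round(statistics.mean(values), 2),
        "Max": max(values),
        "Min": min(values),
        "Median": statistics.median(values),
        "Mode": statistics.mode(values),
        # pstdev is the population standard deviation (divides by N, not N - 1).
        "SD": round(statistics.pstdev(values), 2),
    }


# Hypothetical values, not real assignment data.
letters = [4, 4, 4, 8, 14, 14, 16]
average = statistics.mean(letters)
above = [v for v in letters if v > average]  # the ">Average" category
below = [v for v in letters if v < average]  # the "<Average" category

print(summarise(letters))
print(summarise(above))
print(summarise(below))
```

Values exactly equal to the average fall into neither filtered category.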

Task 5
We would also like to see a breakdown of tweets by emotion label, as well as the volume of tweets per
emotion. Under your calculated statistics, add the following row labels in cells A10 to A14:
1. anger
2. fear
3. joy
4. sadness
5. neutral

Add the following labels (Cell: Value) and complete the following tasks for each emotion. You must not
use any other supporting cells:
1. B9: Frequency. Calculate the frequency of the tweets per emotion.
2. C9: Total Followers. Calculate the total number of followers for tweets per emotion.
3. D9: >90 Percentile. It would be useful to additionally understand the impact of tweets coming
from higher-follower accounts as opposed to low-follower accounts. We’ll define a high-impact
tweet as one that comes from an account with a follower count in the 90th percentile of ALL
accounts in our dataset. Calculate the total number of followers of high-impact tweets per emotion.
Note: you do not need to consider the percentile on a per emotion basis, just the percentiles over
all the tweets you have.
4. E9: Relative Impact. Calculate the proportion of followers for high-impact tweets per emotion, in
relation to the total number of followers for all high-impact tweets. Values should be percentages
with 2 decimal places.
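
The high-impact calculation can be sketched as follows. Two details here are assumptions rather than part of the specification: the percentile is computed by linear interpolation between sorted values, and a tweet counts as high-impact when its follower count is at or above the threshold (use a strict comparison if that is what the quiz expects). The function names and data are hypothetical.

```python
def percentile(values, p):
    """p-th percentile (0-100) using linear interpolation between sorted values."""
    ordered = sorted(values)
    position = (len(ordered) - 1) * p / 100
    lower = int(position)
    upper = min(lower + 1, len(ordered) - 1)
    fraction = position - lower
    return ordered[lower] + (ordered[upper] - ordered[lower]) * fraction


def relative_impact(emotions, followers, p=90):
    """Share of high-impact followers per emotion, as percentages (2 d.p.)."""
    # One threshold over ALL tweets, not one per emotion.
    threshold = percentile(followers, p)
    high = {}
    for emotion, count in zip(emotions, followers):
        if count >= threshold:  # assumption: at or above the threshold
            high[emotion] = high.get(emotion, 0) + count
    total = sum(high.values())
    return {e: round(100 * v / total, 2) for e, v in high.items()}


# Hypothetical data, not taken from the assignment files.
emotions = ["joy", "fear", "joy", "anger", "fear", "joy",
            "neutral", "sadness", "joy", "fear", "anger"]
followers = [10, 20, 30, 40, 50, 60, 70, 80, 90, 300, 100]
print(relative_impact(emotions, followers))
```

Emotions with no high-impact tweets are simply absent from the result, which corresponds to 0% in the worksheet.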

Task 6
In the words_data sheet, apply conditional formatting on:

1. emotions, where:
o anger is highlighted red with white text,
o fear is highlighted purple with white text,
o joy is highlighted yellow, and
o sadness is highlighted light blue.
o neutral should receive no formatting.

2. followers, where the follower counts of high-impact tweets are highlighted green. You must use
a formula to determine which cells to format.

Task 7
Create a new sheet named charts. In this sheet, create a histogram of letters. The bins should
start from 0 and the gap between consecutive bins is 20. For this, you must use an array formula.
Note: you should choose a number of bins that covers all your data.
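
The binning itself can be sketched in Python to see which counts the histogram should show. This sketch assumes each bin value is an inclusive upper bound (the convention used by Excel's FREQUENCY function) and that all values are non-negative; bin_counts and the sample values are hypothetical.

```python
import math


def bin_counts(values, width=20):
    """Count values per bin, where each bin edge is an inclusive upper bound."""
    # Enough edges (0, 20, 40, ...) to cover the largest value.
    n_edges = math.ceil(max(values) / width) + 1
    edges = [i * width for i in range(n_edges)]
    counts = [0] * n_edges
    for v in values:
        # Edge 0 holds v <= 0; edge k*width holds (k-1)*width < v <= k*width.
        index = max(0, math.ceil(v / width))
        counts[index] += 1
    return edges, counts


# Hypothetical values, not real assignment data.
letters = [4, 10, 14, 20, 21, 40, 55]
edges, counts = bin_counts(letters)
print(list(zip(edges, counts)))
```

Note that a value of exactly 20 lands in the bin labelled 20, not the bin labelled 40.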

Then, format the chart as follows:


1. The gap width is set to 0%.
2. The series outline is solid black line.
3. The series data labels are set to "Outside End".
4. The title is removed.

Task 8
In the "charts" sheet, create a scatter chart with freq on the x-axis and rank on the y-axis.

Then, format the chart as follows:


1. Remove the title.
2. Add a power trend line and display the R² value.
3. Add x-axis label "Frequency".
4. Add y-axis label "Rank".
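
For intuition about what a power trend line represents, the sketch below fits y = a * x^b by ordinary least squares on the logarithms of both variables and reports R² in log space. This is one common way to fit a power curve and is not claimed to reproduce Excel's output exactly; power_fit and the sample data are hypothetical, and all values must be positive.

```python
import math


def power_fit(xs, ys):
    """Fit y = a * x**b by least squares on (ln x, ln y); return (a, b, r2)."""
    lx = [math.log(x) for x in xs]
    ly = [math.log(y) for y in ys]
    n = len(lx)
    mean_x, mean_y = sum(lx) / n, sum(ly) / n
    sxx = sum((x - mean_x) ** 2 for x in lx)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(lx, ly))
    syy = sum((y - mean_y) ** 2 for y in ly)
    b = sxy / sxx                       # exponent of the power law
    a = math.exp(mean_y - b * mean_x)   # multiplicative constant
    r2 = (sxy ** 2) / (sxx * syy)       # R² measured in log space
    return a, b, r2


# Hypothetical data that follows rank = 8 / freq exactly.
freq = [1, 2, 4, 8]
rank = [8, 4, 2, 1]
a, b, r2 = power_fit(freq, rank)
print(round(a, 3), round(b, 3), round(r2, 3))
```

A negative exponent b is expected here, since higher frequencies correspond to smaller (better) rank numbers.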

Task 9
Insert a Treemap of uniques and letters into the "charts" sheet. Remove the title and the legend.
Change the size to 15cm height and 25cm width.

Figure 2. Sample image of the charts sheet

Task 10
Insert a 2D Pie Chart showing the relative impact calculated in Task 5.

Then, format the chart as follows:


1. Give the chart the title “Relative Impact per Emotion”.
2. Fill in each segment with the appropriate colour from Task 6. You can leave ‘neutral’ as white or
colour the segment in as green for contrast.
3. Add data labels for each segment and give each data label a white fill.

NOTE: The sample image of the charts sheet is provided in Figure 2, and your solution may look similar
to this. However, the data shown in the image is sample data and is not the correct result (i.e., your
figures may look different).

3. Submission
You should answer the questions related to the tasks above on the quiz server by the due date of 18 March
2022, 5pm (final cut-off 25 March 2022, 5pm, with a 5% raw penalty per day; the UWA late submission
policy applies).

Submit your Excel workbook on the quiz server. You should name the file A1_[student id].xlsx.
For example, if your student ID is 12345678, then your file name is A1_12345678.xlsx.
Failure to follow this naming convention will result in a penalty of 50%.

4. Rubrics
Excel functions (10 marks)
Criteria:
• Understand various Excel functions.
• Demonstrate the ability to carry out various Excel functionalities and tools.
Highly Satisfactory (D, HD): Demonstrated the ability to use Excel functions fluently:
• Correct use of Excel functions as appropriate.
Satisfactory (P, CR): Demonstrated the ability to use Excel functions:
• Some correct uses of Excel functions.
Unsatisfactory (N): Failed to demonstrate the ability to use Excel functions:
• Incorrect use of Excel functions.

Excel formulas (20 marks)
Criteria:
• Understand the use of Excel formulas.
Highly Satisfactory (D, HD): Demonstrated the ability to utilise Excel formulas fluently:
• Correct use of Excel formulas most adequate for the problem.
• Comprehensive understanding of Excel formulas and their usage.
Satisfactory (P, CR): Demonstrated the ability to utilise Excel formulas:
• Correct use of Excel formulas for the problem.
• Understanding of Excel formulas and their usage.
Unsatisfactory (N): Failed to demonstrate the ability to utilise Excel formulas:
• Incorrect use of Excel formulas.
• Misunderstanding of Excel formulas and their usage.

Excel visualisation (20 marks)
Criteria:
• Understand the use of Excel visualisation tools.
Highly Satisfactory (D, HD): Demonstrated the ability to visualise using Excel:
• The visualisation generated is accurate and comprehensive.
Satisfactory (P, CR): Demonstrated the ability to visualise using Excel:
• The visualisation generated is accurate.
Unsatisfactory (N): Demonstrated the ability to visualise using Excel:
• The visualisation generated is not accurate.

This assignment is worth a total of 50 marks.
