Gareth Tyson
https://theconversation.com/what-is-mastodon-a-social-media-expert-explains-how-the-federated-network-works-and-why-it-wont-be-a-new-twitter-194329
How does Mastodon work?
Remote follow
Learning objectives
1. To understand the importance of social messaging applications, and
how we can gather data from them
Kiran Garimella and Gareth Tyson. WhatsApp, Doc? A First Look at WhatsApp Public Group Data. In 12th International AAAI Conference on Web and Social Media (ICWSM), Stanford, CA (2018).
Getting data from WhatsApp is tough!
• About 80-90% of messages are unicast
(Figure: example chat with the messages "Hi!", "Hi", "Hello both!", ":)")
So, can we collect data from these public groups?
Data collection options
• Manually
• WhatsApp Web
• Rooted phone
• Jailbroken WhatsApp
Step 1: Obtain a list of public group URLs
• Use public listings, e.g.
• https://joinwhatsappgroup.com/
• https://whatsgrouplink.com/
• Use search engines or social media
• Search for chat.whatsapp.com
• And then manually filter
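The search-and-filter step can be partly automated before the manual pass. The following is a minimal sketch, assuming invite links have the form https://chat.whatsapp.com/ followed by an alphanumeric code; the function name extract_invite_links is hypothetical and not part of the released scripts.

```python
import re

# Assumption: an invite link is the chat.whatsapp.com host followed by an
# alphanumeric invite code. Real links may carry extra path or query parts.
INVITE_RE = re.compile(r"https?://chat\.whatsapp\.com/([A-Za-z0-9]+)")


def extract_invite_links(text):
    """Return unique invite links found in text, in order of first appearance."""
    seen = set()
    links = []
    for match in INVITE_RE.finditer(text):
        code = match.group(1)
        if code not in seen:
            seen.add(code)
            links.append("https://chat.whatsapp.com/" + code)
    return links
```

Running this over scraped listing pages or search results gives a de-duplicated candidate list, which still needs manual filtering for relevance.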
Step 2: Join the groups
• Create a dedicated WhatsApp account
https://github.com/gvrkiran/whatsapp-public-groups
Step 3: Receive the updates
• Messages will start to come through (via the phone app)
• Our script extracts from the phone’s local SQLite database file
• storage/WhatsApp/Databases/msgstore.db.crypt12
https://arxiv.org/pdf/1507.07739.pdf
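As a rough illustration of the extraction step, the sketch below reads group messages with Python's built-in sqlite3 module. It assumes the database has already been decrypted to a plain SQLite file, and the table name (messages), the column names, and the '@g.us' suffix for group identifiers are assumptions that may not match a given WhatsApp version; this is not the authors' script.

```python
import sqlite3


def read_group_messages(db_path):
    """Return (group_id, sender, timestamp, text) rows for group text messages.

    Assumed schema (hypothetical, varies across app versions):
      messages(key_remote_jid, remote_resource, timestamp, data)
    """
    conn = sqlite3.connect(db_path)
    try:
        query = (
            "SELECT key_remote_jid, remote_resource, timestamp, data "
            "FROM messages "
            # Assumption: group chat identifiers end in '@g.us'.
            "WHERE key_remote_jid LIKE '%@g.us' AND data IS NOT NULL "
            "ORDER BY timestamp"
        )
        return conn.execute(query).fetchall()
    finally:
        conn.close()
```

Rows with no text (for example media-only messages) are skipped here, which is one reason text-based analysis alone misses images and audio.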
What sort of data might you see?
Group metadata
Text Content
Rana Tallal Javed, Mirza Elaaf Shuja, Muhammad Usama, Junaid Qadir, Waleed Iqbal, Gareth Tyson, Ignacio Castro and Kiran Garimella. A Deep Dive into COVID-19-Related Messages on WhatsApp in Pakistan. In Social Network Analysis and Mining (SNAM) (2022).
How are public WhatsApp groups used to share COVID-19 (mis)information?
Data collection (Step 1)
• We compiled a list of 227 political groups in Pakistan using
Google and Twitter
5K messages across the measurement period
Data collection (Step 3)
• But this won’t work for images…
• Two annotators tagged a total of 6,699 images
35% of images are COVID-19 related!
Let’s ask some questions of our data…
What type of messages are shared?
• Majority of content is simple information, e.g. news articles, government actions
• Large volume of religious commentary
Maros, Alexandre, et al. "Analyzing the use of audio messages in Whatsapp groups." Proceedings of The Web Conference 2020. 2020.
Audio messages are growing in popularity
Let’s ask some questions…
• RQ1: What are the characteristics of audio messages in terms of
content properties and propagation dynamics?
32% (truckers) and 21% (election) of all users in the monitored groups shared at least one audio message
How long are messages?
What is in the audio messages?
• Used LIWC to categorize words in the transcript
• Calculate relative difference between messages shared more than 20 times vs. a single time
• Most popular words were related to
• Sad emotions, negations, needs, achievement, family, work, time, money, anxiety, and future
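The comparison above can be sketched as follows. LIWC itself is a licensed dictionary, so this example uses a tiny hypothetical word-to-category lexicon in its place, and it assumes "relative difference" means (rate in popular messages minus rate in single-share messages) divided by the single-share rate; the slides do not state the exact formula.

```python
from collections import Counter

# Hypothetical stand-in for a LIWC-style dictionary (word -> category).
TOY_LEXICON = {
    "sad": "sadness",
    "cry": "sadness",
    "money": "money",
    "pay": "money",
    "not": "negation",
}


def category_rates(transcripts, lexicon):
    """Fraction of all words in the transcripts that fall in each category."""
    counts = Counter()
    total_words = 0
    for transcript in transcripts:
        for word in transcript.lower().split():
            total_words += 1
            category = lexicon.get(word)
            if category is not None:
                counts[category] += 1
    if total_words == 0:
        return {}
    return {cat: n / total_words for cat, n in counts.items()}


def relative_difference(popular, single, lexicon):
    """Assumed definition: (popular_rate - single_rate) / single_rate.

    Only categories that occur in the single-share set are reported,
    which avoids dividing by zero.
    """
    popular_rates = category_rates(popular, lexicon)
    single_rates = category_rates(single, lexicon)
    return {
        cat: (popular_rates.get(cat, 0.0) - rate) / rate
        for cat, rate in single_rates.items()
    }
```

A positive value means the category is over-represented in widely shared audio; a negative value means it is more typical of messages shared only once.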
https://www.liwc.app/
Wow. So, misbehavior is common!
What else might happen…
Dissemination of spam in public groups
“Spam” is unsolicited and unwanted messages sent out in bulk
(Figure: the plaintext "Hi Holly! How are you?" appears at both ends of the conversation, with the ciphertext "XA8-07j83F1l::Laa$1bb" in between)
So, let’s build some spam classifiers…
Classifier 1: Let’s assume we don’t have end-to-end encryption…so
we can run a text spam classifier on the server (similar to email)
Content-based spam detection on-server
• We use off-the-shelf email spam classifier
• Accuracy of 87%
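To show what a content-based classifier does with message text, here is a toy multinomial Naive Bayes filter in plain Python. It is a stand-in for illustration only: the study used an off-the-shelf email spam classifier, and the class name, the two-label setup, and whitespace tokenisation are all assumptions of this sketch.

```python
import math
from collections import Counter


class NaiveBayesSpamFilter:
    """Toy multinomial Naive Bayes over whitespace-separated words."""

    LABELS = ("spam", "ham")

    def __init__(self):
        self.word_counts = {label: Counter() for label in self.LABELS}
        self.doc_counts = Counter()
        self.vocab = set()

    def fit(self, texts, labels):
        # Training data must contain at least one example of each label,
        # otherwise the class prior below would be log(0).
        for text, label in zip(texts, labels):
            words = text.lower().split()
            self.word_counts[label].update(words)
            self.doc_counts[label] += 1
            self.vocab.update(words)
        return self

    def _log_score(self, words, label):
        total_docs = sum(self.doc_counts.values())
        score = math.log(self.doc_counts[label] / total_docs)
        total_words = sum(self.word_counts[label].values())
        for word in words:
            # Laplace (add-one) smoothing so unseen words do not zero the score.
            count = self.word_counts[label][word] + 1
            score += math.log(count / (total_words + len(self.vocab)))
        return score

    def predict(self, text):
        words = text.lower().split()
        return max(self.LABELS, key=lambda label: self._log_score(words, label))
```

The filter needs the message text at prediction time, which is exactly what the server does not have once messages are end-to-end encrypted.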
Relies on text content…
Classifier 2: The problem is that message text is inaccessible to the server
because of end-to-end encryption!
Metadata-based spam detection on-server
• We build a ‘metadata’ classifier
Feature              Importance
Posted message       0.52
Non-domestic number  0.15
Posted URL           0.12
• Create user profiles containing