WSC Week11 Messaging

Social and Web Computing
Gareth Tyson
WEEK 11: ONLINE MESSAGING PLATFORMS
Recap
How did the web work?
● More symmetrical power structures?
● More resilient?
● No central points of failure?
● Superior privacy?
How does the web work?
● Critical mass of users?
● Orchestrated deployments?
● Economies of scale?
How does the Fediverse work?
https://theconversation.com/what-is-mastodon-a-social-media-expert-explains-how-the-federated-network-works-and-why-it-wont-be-a-new-twitter-194329
How does Mastodon work?
[Figure: remote follow]
Learning objectives
1. To understand the importance of social messaging applications, and how we can gather data from them
2. To understand how information and misinformation spread on these messaging platforms
3. To understand ways to mitigate abuse on these platforms, and some further
What do you think when you hear “social media”?
But this only covers a fraction of online communications…
Who uses WeChat?
Questions you might want to ask…
• How much data do you provide WeChat?
• How much data does WeChat expose?
• To whom is that data exposed?
• What are the risks of this centralization?
• How can such platforms be misused?

Next question:
Who uses WhatsApp?
You’re not alone…
• WhatsApp is one of the most popular social apps in the world
• 1.5 billion active users each day!
• 5 billion downloads from the Android Play Store alone!
• Over 60 billion texts, 100 million audio and 55 million video calls daily!
WhatsApp Data Collection for Social Computing Studies
Kiran Garimella and Gareth Tyson. WhatsApp, Doc? A First Look at WhatsApp Public Group Data. In 12th International AAAI Conference on Web and Social Media (ICWSM), Stanford, CA (2018).
Getting data from WhatsApp is tough!
• About 80-90% of messages are unicast
• There are no Application Programming Interfaces providing control of WhatsApp
• End-to-end encryption prevents even the platform from gathering message data

But WhatsApp also supports groups
[Figure: a group chat in which members exchange greetings: “Hi!”, “Hi”, “Hello both! :)”]
So, can we collect data from these public groups?
Data collection options
• Manually
• Web WhatsApp
• Rooted phone
• Jailbroken WhatsApp
Step 1: Obtain a list of public group URLs
• Use public listings, e.g.
• https://joinwhatsappgroup.com/
• https://whatsgrouplink.com/
• Use search engines or social media
• Search for chat.whatsapp.com
• And then manually filter
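The search-and-filter step can be partly automated. A minimal sketch, assuming invite links look like https://chat.whatsapp.com/<code>; the character set allowed in the code and the example codes are assumptions for illustration, not a documented format.

```python
import re

# Assumed shape of an invite link; the code's alphabet is a guess.
INVITE_RE = re.compile(r"https?://chat\.whatsapp\.com/([A-Za-z0-9_-]+)")


def extract_invite_links(text: str) -> list[str]:
    """Return de-duplicated invite URLs in order of first appearance."""
    seen: dict[str, None] = {}
    for match in INVITE_RE.finditer(text):
        # Normalise to https so http/https variants count as one link.
        seen.setdefault(f"https://chat.whatsapp.com/{match.group(1)}", None)
    return list(seen)


# Made-up page text with made-up invite codes.
page = """
Join our group: https://chat.whatsapp.com/AbC123xyz
Mirror: http://chat.whatsapp.com/AbC123xyz and https://chat.whatsapp.com/Zz9-Other
"""
links = extract_invite_links(page)
print(links)
```

The output still needs the manual filtering step, since a regular expression cannot tell whether a group is relevant or still active.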
Step 2: Join the groups
• Create a dedicated WhatsApp account
• Run our script (which uses the web.whatsapp.com interface)

• It takes a list of WhatsApp join URLs and programmatically joins them
https://github.com/gvrkiran/whatsapp-public-groups
Step 3: Receive the updates
• Messages will start to come through (via the phone app)
• Our script extracts them from the phone’s local SQLite database file
• storage/WhatsApp/Databases/msgstore.db.crypt12
• But it is encrypted - this is where rooting is necessary
https://arxiv.org/pdf/1507.07739.pdf
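Once the database has been decrypted, reading it is ordinary SQLite work. The sketch below is not the authors' script: the table name `messages`, its columns, the millisecond timestamps and the `@g.us` suffix for group identifiers are all assumptions about the schema, and an in-memory database stands in for the real file.

```python
import sqlite3
from datetime import datetime, timezone


def read_group_messages(conn: sqlite3.Connection) -> list[dict]:
    """Read group messages, oldest first, from an (assumed) messages table."""
    rows = conn.execute(
        "SELECT key_remote_jid, data, timestamp FROM messages "
        "WHERE key_remote_jid LIKE '%@g.us' ORDER BY timestamp"
    )
    return [
        {
            "group": group,
            "text": text,
            # Timestamps are assumed to be milliseconds since the Unix epoch.
            "time": datetime.fromtimestamp(ts / 1000, tz=timezone.utc),
        }
        for group, text, ts in rows
    ]


# Stand-in database with the assumed schema and made-up rows.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE messages (key_remote_jid TEXT, data TEXT, timestamp INTEGER)"
)
conn.executemany(
    "INSERT INTO messages VALUES (?, ?, ?)",
    [
        ("12345-678@g.us", "Hello group", 1526860800000),
        ("447700900000@s.whatsapp.net", "private chat", 1526860860000),
        ("12345-678@g.us", "Second message", 1526860920000),
    ],
)
messages = read_group_messages(conn)
for m in messages:
    print(m["time"].isoformat(), m["group"], m["text"])
```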
What sort of data might you see?
Group metadata
Text content
User behaviour
Geographic information

COVID-19 (Mis)Information Sharing on WhatsApp
Rana Tallal Javed, Mirza Elaaf Shuja, Muhammad Usama, Junaid Qadir, Waleed Iqbal, Gareth Tyson, Ignacio Castro and Kiran Garimella. A Deep Dive into COVID-19-Related Messages on WhatsApp in Pakistan. In Social Network Analysis and Mining (SNAM) (2022).
How are public WhatsApp groups used to share COVID-19 (mis)information?
Data collection (Step 1)
• We compiled a list of 227 political groups in Pakistan using Google and Twitter
• 60K messages from 18.5K users

Message Type   #       %
Text           28.5K   47%
Images         14.6K   24.5%
Videos         2.6K    18.6%
URLs           3.2K    2.5%

• Next, we need to extract COVID-19-related messages!
• Compiled a list of keywords, e.g. covid, covid19
• 5K messages across the measurement period
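The keyword step can be sketched as a simple token match. Only "covid" and "covid19" come from the slide; the other keywords and the tokenisation rule are assumptions for illustration.

```python
import re

# "covid" and "covid19" are from the slide; the rest are hypothetical additions.
COVID_KEYWORDS = ["covid", "covid19", "corona", "coronavirus"]


def is_covid_related(text: str, keywords: list[str] = COVID_KEYWORDS) -> bool:
    """True if any keyword appears as a whole token (case-insensitive)."""
    tokens = set(re.findall(r"\w+", text.lower()))
    return any(keyword in tokens for keyword in keywords)


# Made-up example messages.
examples = ["COVID-19 cases rising", "Covid19 lockdown extended", "Cricket match tonight"]
flags = [is_covid_related(message) for message in examples]
print(flags)
```

Matching whole tokens keeps the filter conservative; it will miss misspellings and transliterations, which is one reason keyword lists tend to give a lower bound.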
• But this won’t work for images…
• Two annotators tagged a total of 6,699 images
• 35% of images are COVID-19 related!
Let’s ask some questions of our data…
What type of messages are shared?
• Majority of content is simple information, e.g. news articles, government actions
• Large volume of religious commentary
• But 14% of the total messages are misinformation…
What types of misinformation are shared?
• Fake news covers 45% of misinformation texts, e.g.
• COVID-related deaths of world figures such as Ivanka Trump, Prince William, Imran Khan
• Conspiracy theories about Bill Gates intending to place RFID chips in people to track COVID-19
• Fake origins also prominent, e.g.
• COVID-19 developed in a research lab in Lake Corona in Kazakhstan
• Predicted in films such as Resident Evil
• Fake remedies less prominent but circulate for longer
How long do these messages last?
Most prominent in the tail: 2% of misinformation exceeds 100 hours
Let’s drill into the details…
• ‘Fake News’ category has the shortest lifespan
• ‘Fake Remedies’ category has a mean life of 10 hrs
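A lifespan figure like the ones above can be computed from sighting timestamps. A minimal sketch, assuming lifespan means the time between the first and last appearance of the same message; the slides do not spell out the definition.

```python
def lifespans_hours(sightings):
    """Map each message id to hours between its first and last sighting.

    sightings: iterable of (message_id, unix_seconds) pairs, in any order.
    """
    first, last = {}, {}
    for message_id, ts in sightings:
        first[message_id] = min(ts, first.get(message_id, ts))
        last[message_id] = max(ts, last.get(message_id, ts))
    return {mid: (last[mid] - first[mid]) / 3600 for mid in first}


# Made-up sightings: message "a" is seen three times, "b" only once.
sightings = [("a", 0), ("a", 36000), ("b", 100), ("a", 7200)]
spans = lifespans_hours(sightings)
print(spans)
```

A message seen only once gets a lifespan of zero, so such messages pull the mean down when categories are compared.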
Who shares what?
The majority share “information”
Only 37 users exclusively shared misinformation
Does content spread across platforms?
• We gathered 67K Twitter images using hashtags, e.g. #CovidPakistan, #CoronaFreePakistan
• 1.5K shared across both WhatsApp and Twitter
• 1/3 were COVID-19 related
• Largest category shared across both Twitter and WhatsApp is misinformation (29%)
Who influences whom?
Okay, so maybe graphs are useful for understanding the interconnection of these groups…?
Let’s look at how graph data can be used
Using data from Brazil

Data gathered from Brazil
• Truck drivers’ strike in Brazil
• May 21st to June 2nd 2018
• Brazilian presidential election campaign
• August 16th to October 7th 2018
What type of images are shared?
A look at the group network
[Figure panels: truck drivers’ strike; election]

A look at the user network
Can we use these graphs to study misinformation spread?
Labelling misinformation
Misinformation spread
• Each node is a group
• Edge indicates the group spread information to another group
• Size of a node represents the number of images with misinformation posted on that group
• Color represents the total number of images that were “first seen” in that group
• Few groups are responsible for spreading misinformation
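The graph described above can be built from a log of image postings. A minimal sketch, assuming an edge is drawn from the group where an image was first seen to every other group that posted it later; the exact edge rule used in the study is not stated on the slide.

```python
from collections import defaultdict


def build_spread_graph(posts):
    """Build a group-to-group spread graph from image postings.

    posts: iterable of (image_id, group_id, unix_seconds).
    Returns (edges, posted, first_seen):
      edges[(src, dst)] = images first seen in src that later appeared in dst
      posted[group]     = distinct images posted in the group (node size)
      first_seen[group] = images whose earliest sighting was there (node color)
    """
    by_image = defaultdict(list)
    for image, group, ts in posts:
        by_image[image].append((ts, group))

    edges, posted, first_seen = defaultdict(int), defaultdict(int), defaultdict(int)
    for sightings in by_image.values():
        sightings.sort()  # earliest sighting first
        origin = sightings[0][1]
        first_seen[origin] += 1
        for group in {g for _, g in sightings}:
            posted[group] += 1
            if group != origin:
                edges[(origin, group)] += 1
    return dict(edges), dict(posted), dict(first_seen)


# Made-up postings of two misinformation images across three groups.
posts = [
    ("img1", "A", 1), ("img1", "B", 5), ("img1", "C", 9),
    ("img2", "B", 2), ("img2", "A", 3),
]
edges, posted, first_seen = build_spread_graph(posts)
print(edges, posted, first_seen)
```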
How does this differ from the web?
• The authors used Google to find webpages that host the same misinformation images
• Twitter shares many misinformation images…but who influences whom?
• The central node represents all WhatsApp groups
• Color represents the average time difference between the appearance of an image on WhatsApp and on the specific domain
• Images that were first published on the Web take much longer to reach the WhatsApp groups (more than a year) than the other way around (only a few days) for both types of images
But images aren’t the only modality…
Maros, Alexandre, et al. "Analyzing the use of audio messages in Whatsapp groups." Proceedings of The Web Conference 2020. 2020.
Audio messages are growing in popularity
Let’s ask some questions…
• RQ1: What are the characteristics of audio messages in terms of content properties and propagation dynamics?
• RQ2: What are the properties of audio content (e.g., gender of speaker, music versus speech content) and how do these properties correlate with propagation dynamics?
Data summary
32% (truckers) and 21% (election) of all users in the monitored groups shared 1+ audio message
How long are messages?
What is in the audio messages?
• Used LIWC to categorize words in the transcript
• Calculated the relative difference between messages shared more than 20 times vs. a single time
• Most popular words were related to sad emotions, negations, needs, achievement, family, work, time, money, anxiety, and future
https://www.liwc.app/
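The comparison above can be illustrated with a toy lexicon. This is a sketch under stated assumptions: the two-category `LEXICON` is a hypothetical stand-in for the licensed LIWC dictionary, and the relative difference is taken as (rate in popular messages - rate in single-share messages) / rate in single-share messages, which the slide does not specify.

```python
import re
from collections import Counter

# Hypothetical mini-lexicon; the real LIWC dictionary is far larger.
LEXICON = {
    "money": {"pay", "salary", "price"},
    "family": {"mother", "son", "family"},
}


def category_rates(transcripts):
    """Fraction of all tokens that fall in each lexicon category."""
    counts, total = Counter(), 0
    for text in transcripts:
        for token in re.findall(r"\w+", text.lower()):
            total += 1
            for category, words in LEXICON.items():
                if token in words:
                    counts[category] += 1
    return {c: (counts[c] / total if total else 0.0) for c in LEXICON}


def relative_difference(popular, single):
    """Relative change in category rate, popular vs. single-share messages."""
    p, s = category_rates(popular), category_rates(single)
    # Categories absent from the baseline are skipped to avoid dividing by zero.
    return {c: (p[c] - s[c]) / s[c] for c in LEXICON if s[c] > 0}


# Made-up transcripts.
popular = ["pay the price", "family"]
single = ["my mother will pay"]
diff = relative_difference(popular, single)
print(diff)
```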
Wow. So, misbehavior is common!
What else might happen…
Dissemination of spam in public groups
“Spam” is unsolicited and unwanted messages sent out in bulk
“Ham” refers to the remaining messages
Data collection
• Gathered data from 5,051 political groups: 2.6 million messages posted by over 172K users
• We keep messages in Hindi, English, Telugu and Tamil (74%) and filter out boilerplate
• Labeled posts as spam vs ham
• Identify similar text and images and group them into “message clusters”, aka spam campaigns
Identifying spam
1. Create a ground truth
• Identify a seed set of users who were manually removed from at least two groups by their admins (257 users, 68K messages)
• Use human annotators to manually label frequently seen messages as spam or ham
2. Construct a dictionary of spam words
1. Extract commonly occurring (5 times) words
2. Manually filter strong signals to produce 324 spam words
3. Extract all messages containing spam words from frequently sent messages
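The dictionary-building steps can be sketched as follows. This assumes "5 times" means a word occurs at least five times in the seed messages; the manual filtering step is represented by a hand-picked set, and all messages are made up.

```python
import re
from collections import Counter


def tokenize(message: str) -> list[str]:
    return re.findall(r"\w+", message.lower())


def candidate_words(messages, min_count=5):
    """Words occurring at least min_count times across the seed messages."""
    counts = Counter(token for m in messages for token in tokenize(m))
    return {word for word, count in counts.items() if count >= min_count}


def contains_spam_word(message: str, spam_words: set[str]) -> bool:
    return any(token in spam_words for token in tokenize(message))


# Made-up seed messages from users removed by admins.
seed = ["Win free reward now"] * 5 + ["hello friend"]
candidates = candidate_words(seed)

# Stand-in for the manual filtering step: keep only the strong signals.
spam_words = {"free", "reward"} & candidates

flagged = [
    m for m in ["Claim your FREE gift", "See you tomorrow"]
    if contains_spam_word(m, spam_words)
]
print(sorted(candidates), flagged)
```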
Who spreads spam?
• Not individual phone numbers!
• And large clusters of messages tend to be spam
• Mean 83.6 in spam clusters
• …vs 35 for ham clusters
• Mostly from India but…

What is contained in spam?
Jobs largely containing phone numbers
Ham tend to not include URLs or phone numbers
Yet click & earn are mostly URLs
Over half of spam contains a URL
How long does spam circulate?
Spam messages circulate for longer
But non-spam users live much longer
Why do non-spammers live longer?
The majority of removals are spammers
Few spammers are added by admins
How can spammers avoid removal?
• Fewer than half of the days during a spam campaign are active with 10+ messages
• With noticeable outliers
• Get Free Win Award!
• Pay with Reward!
• Spammers also tend to leave & join regularly
But how can we deal with spam on WhatsApp?
Particularly with end-to-end encryption!

What is end-to-end encryption?
[Figure: the sender types “Hi Holly! How are you?”; in transit the message is ciphertext (XA8-07j83F1l::Laa$1bb); only the recipient’s device recovers “Hi Holly! How are you?”]
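The idea can be shown with a toy example: the two endpoints agree on a key that the server never sees, so the server only relays ciphertext. This is a teaching sketch, not how WhatsApp works in practice (it uses the Signal protocol). The small prime, the generator and the hash-based keystream are assumptions chosen for brevity and are not secure.

```python
import hashlib
import secrets

P = 2**127 - 1  # a Mersenne prime; far too small for real use
G = 3           # assumed generator, for illustration only


def keypair():
    """Return (private, public) for a toy Diffie-Hellman exchange."""
    private = secrets.randbelow(P - 2) + 1
    return private, pow(G, private, P)


def shared_key(my_private: int, their_public: int) -> bytes:
    """Both endpoints derive the same 32-byte key; the server cannot."""
    secret = pow(their_public, my_private, P)
    return hashlib.sha256(secret.to_bytes(16, "big")).digest()


def xor_stream(key: bytes, data: bytes) -> bytes:
    """Toy cipher: XOR with a SHA-256 keystream (same call encrypts and decrypts)."""
    out = bytearray()
    for offset in range(0, len(data), 32):
        block = hashlib.sha256(key + offset.to_bytes(8, "big")).digest()
        out.extend(b ^ k for b, k in zip(data[offset:offset + 32], block))
    return bytes(out)


# Only the public halves travel via the server.
alice_private, alice_public = keypair()
holly_private, holly_public = keypair()

message = b"Hi Holly! How are you?"
key_alice = shared_key(alice_private, holly_public)
ciphertext = xor_stream(key_alice, message)    # what the server relays

key_holly = shared_key(holly_private, alice_public)
plaintext = xor_stream(key_holly, ciphertext)  # recovered on Holly's device
print(ciphertext.hex(), plaintext.decode())
```

Because the server holds only the public values and the ciphertext, any spam classifier running on the server cannot read message text.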
So, let’s build some spam classifiers…
Classifier 1: Let’s assume we don’t have end-to-end encryption…so we can run a text spam classifier on the server (similar to email)
Content-based spam detection on-server
• We use an off-the-shelf email spam classifier
• Accuracy of 87%
Relies on text content…
Classifier 2: The problem is that message text is inaccessible to the server because of end-to-end encryption!
Metadata-based spam detection on-server
• We build a ‘metadata’ classifier
• Create user profiles containing counts for the different actions
• Train a Random Forest Classifier
• Accuracy of 90%
Hmmm, but isn’t this private?

Feature               Importance
Posted message        0.52
Non-domestic number   0.15
Posted URL            0.12
Joined via link       0.08
Posted phone number   0.05
Left group            0.04
Added by member       0.023
Added by admin        0.021
Removed from group    0.01
Number changed        0.003
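The user-profile step can be sketched as counting actions per user. The action names below are hypothetical labels for the count-based features in the importance table; the "non-domestic number" flag and the Random Forest training itself are left out, and the event log is made up.

```python
from collections import Counter

# Hypothetical action labels mirroring the count-based features in the table.
ACTIONS = [
    "posted_message", "posted_url", "posted_phone_number", "joined_via_link",
    "left_group", "added_by_member", "added_by_admin", "removed_from_group",
    "number_changed",
]


def build_profiles(events):
    """Turn (user, action) events into fixed-length count vectors per user."""
    counts: dict[str, Counter] = {}
    for user, action in events:
        counts.setdefault(user, Counter())[action] += 1
    return {user: [c[action] for action in ACTIONS] for user, c in counts.items()}


# Made-up metadata events; no message content is needed.
events = [
    ("u1", "joined_via_link"), ("u1", "posted_message"), ("u1", "posted_url"),
    ("u1", "posted_message"), ("u1", "left_group"),
    ("u2", "added_by_admin"), ("u2", "posted_message"),
]
profiles = build_profiles(events)
print(profiles)
```

Each vector could then be paired with a spam/ham label and passed to any Random Forest implementation as a feature row.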
Classifier 3: Why not run a classifier on the user’s device?!
Content & metadata detection on-device
~80% of groups do well!
Others very poorly…
• We can use both user profile and text-based scores
• We train a local Random Forest model for each group on device
• 86% mean accuracy across groups
Let’s conclude with some challenges…
Challenges in using WhatsApp for social computing studies
• Public messaging group data is highly biased
• We only ever have a lower bound of activity
• Groups are independent and may be managed differently
• Difficult to definitively link behaviours across platforms or identify causality
Learning objectives
1. To understand the importance of social messaging applications, and how we can gather data from them
2. To understand how information and misinformation spread on these messaging platforms
3. To understand ways to mitigate abuse on these platforms, and some further
Further Reading
• Pushkal Agarwal, Aravind Raman, Damilola Ibosiola, Nishanth Sastry, Kiran Garimella, Gareth Tyson. “Countering Spam in the Era of End-to-End Encryption: A study of Indian Political WhatsApp Groups”. In Web Conference (WWW), Lyon, France (2022).
• Rana Tallal Javed, Mirza Elaaf Shuja, Muhammad Usama, Junaid Qadir, Waleed Iqbal, Gareth Tyson, Ignacio Castro, Kiran Garimella. “A First Look at COVID-19 Messages on WhatsApp in Pakistan”. In IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining (ASONAM), The Hague, Netherlands (2020).
• Resende, Gustavo, et al. "(Mis) information dissemination in WhatsApp: Gathering, analyzing and countermeasures." The World Wide Web Conference. 2019.
• Resende, Gustavo, et al. "Analyzing textual (mis) information shared in WhatsApp groups." Proceedings of the 10th ACM conference on web science. 2019.
• Kiran Garimella & Gareth Tyson. “WhatsApp, Doc? A First Look at WhatsApp Public Group Data”. In AAAI International Conference on Web and Social Media (ICWSM), Stanford, CA (2018).
