Pushkal Agarwal
King’s College London, London, UK
Abstract
Why use social media for news when we already have print and television? The simple answer is
that social media brings both onto a single platform. For example, Times of India and NDTV, which
belong to different media types, share a common screen space on Twitter, the most popular
micro-blogging platform on the Internet, and each has close to 10 million followers.
Another prominent reason is that social media is dynamic and instantaneous. It is dynamic in the
sense that a news item is updated as and when required; items come and go. It is instantaneous in the
sense that news appears in print the next day and on television after a few hours, but on social media
it is updated as soon as it is reported, and in some cases it even forms the basis for further reporting
and investigation. One popular example is the killing of Osama Bin Laden, which was first reported on
social media by a person near the scene who saw a few helicopters circling a building and commandos
climbing down; other news handles then picked up the story from this post.
Furthermore, using social media as a news source makes it easy to keep track of posts, which
helps in determining trends and performing various analyses.
In this project we investigate how a particular news handle reports a particular news item and its
affinity or bias towards a political party. We present an analytical model that captures and quantifies
the level of bias of a news handle by mining and analysing the tweets of currently popular news
handles.
Keywords: media, bias, news, social, hash-tag, sentiment
Contents
1 Introduction
1.1 The Area of Work
1.2 Problem Addressed
2 Literature Survey
2.1 A Machine Learning Approach to Twitter User Classification [1]
2.2 A Comparative Study of Two Models for Celebrity Identification on Twitter [2]
2.3 Perceptions of Media Bias: viewing the news through ideological cues [3]
2.4 A Measure of Media Bias [4]
2.5 A Small Case Study
3 Proposed Work
3.1 Data Collection
3.2 Data Pre-processing
3.3 Algorithmic Procedure
Chapter 1
Introduction
The basic idea is to follow popular news handles on Twitter and keep track of the popular
hash-tags for the same news items across different news handles. This allows us to investigate how
a particular news handle reports a particular news item and its affinity or bias towards a political
party.
We also check the reach of news handles using parameters such as re-tweets per tweet and likes
per tweet, scaled by the number of followers. This lets us find the popular and active news handles
for further analysis. Finally, we perform sentiment analysis on the tweets of various news handles to
determine their fairness.
This quantification can have multiple applications. One is the ability to generate a summary for
the trending hash-tags, in which all the pros and cons pointed out in the tweets of news handles with
different biases are compiled into a comprehensive summary. Another is the identification of news
handles that are relatively unbiased and whose news items can be trusted more to get a neutral point
of view.
We are in the process of creating interactive applications for visualizing the results obtained by
the various analysis and data modeling processes. The tool we used for this task is Shiny by RStudio,
which combines the computational power of R with the interactivity of the modern web.
We also created word clouds to gain more insight into the actual terms used for a hash-tag by
various news handles.
The role of news agencies is to provide factual, neutral and unbiased news without any opinion of
the agency or the person reporting it. This helps the government and society at large to be aware of
causes, reactions and effects. Clean and clear media is a major contributing factor in helping a
nation progress.
Unfortunately, at present, especially in our country, news channels and newspapers tend to follow
a particular ideology. Some of them even tend to support one political party or another, whether in
power or in opposition. As a result, any policy, scheme, law or decision by the government, or any
other political event, is reported in a starkly different manner by different news handles.
This usually leaves the common man without the complete picture and leads to ill-informed
decisions. People do not know which source to trust and form wrong assumptions, often leading to
conflicts and the failure of policies.
Our approach to this problem can help the people to look at the quantification of the neutrality
claimed by news handles in a very easy to understand and user-friendly manner. The people can analyze
the results personally and decide which news handle they would like to follow. Thus the system may
be able to push the media to be more precise and factual for the greater good of everyone. Also, our
system will provide the people with a comprehensive summary of an event possibly containing all the
major and important points put forward by various news handles.
Chapter 2
Literature Survey
A study by Marco Pennacchiotti and Ana-Maria Popescu of Yahoo! Labs, Sunnyvale, USA
proposed a machine learning approach that automatically infers the values of user attributes such as
political orientation, ethnicity or being a fan of a particular business by leveraging observable
information such as user behaviour, network structure and the linguistic content of the user's Twitter feed.
The approach uses the successful micro-blogging platform Twitter to classify users. Twitter is widely
used for recommendation services, real-time news sources and content sharing venues in addition to
communicating with friends, family or acquaintances.
Profile information including name, age, location and a short summary of interests is available on
Twitter, although it may be incomplete (a user may choose not to post bio details) or misleading (a
user may choose to list an imaginary place, "Wonderland", as her location). Such profile information,
along with the user's tweeting behaviour, social network and linguistic content, is extracted and
processed to determine user attributes such as political orientation, ethnicity or being a fan of a
particular business.
The machine learning framework for social media user classification relies on four general feature
classes: user profile, user posting behaviour, linguistic content of user messages and user social
network features. The method used is Gradient Boosted Decision Trees, which are on par with Support
Vector Machines but have a shorter response time.
Previous work explored attributes such as gender and location from blogs, e-mail and traditional
text. Micro-blogging websites like Twitter are being explored now. The feature classes defined include:
1. Profile Features - Who you are? Regular expression matching is used to determine
ethnicity and age, e.g.
(I|i)(m|am|'m)[0-9]+(yo|year old)
white(man|woman|boy|girl)
Profile picture identification was used to determine gender. Both methods yielded poor
results and were judged good enough only for bootstrapping training data.
2. Tweeting Behaviour - How you tweet? Properties such as the number of tweets posted by the user,
the number and fraction of tweets that are re-tweets, the number and fraction of tweets that are
replies, the average number of hash-tags and URLs per tweet, the fraction of tweets that are
truncated, the average time and standard deviation between tweets, the average number and standard
deviation of tweets per day, and the fraction of tweets posted in each of the 24 hours.
3. Linguistic Content - What you tweet? Prototypical words (LING-WORD) are lexical
expressions typical of people in a specific class, e.g. younger people tend to use "dude" or "lol"
more, and Democrats tend to use the expression "health care" more than Republicans.
They employed a probabilistic model for automatically extracting proto words; it needs only
a few seed users and is easily portable to different tasks. Given n classes, each class $c_i$ is
represented by a set of seed users $S_i$. Each word w issued by at least one of the seed users is
assigned a score for each of the classes. The score estimates the conditional probability of the
class given the word as follows:
$$\mathrm{proto}(w, c_i) = \frac{|w, S_i|}{\sum_{j=1}^{n} |w, S_j|}$$
where $|w, S_i|$ is the number of times the word w is issued by all users for class $c_i$. For each
class, we retain as proto words the highest scoring k words. The n*k proto words collected across
all classes serve as features for representing a given user: for each proto word $w_p$ the user u is
assigned the score:
$$f_{\mathrm{proto}\_w_p}(u) = \frac{|u, w_p|}{\sum_{w \in W_u} |u, w|}$$
where $|u, w_p|$ is the number of times the proto word $w_p$ is issued by user u, and $W_u$ is the set
of all words issued by u.
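The two proto-word scores can be illustrated with a minimal Python sketch. It assumes that each class is given as a plain list of seed-user tweets and that words are obtained by lower-casing and splitting on whitespace; the function names (`proto_scores`, `top_proto_words`, `user_feature`) are hypothetical and are not taken from [1].

```python
from collections import Counter


def proto_scores(seed_tweets_by_class):
    """Return {class: {word: proto(w, c)}} from {class: [tweet text, ...]}."""
    # |w, S_i|: occurrences of word w across all seed tweets of class c_i.
    # ASSUMPTION: whitespace tokenisation on lower-cased text.
    counts = {
        c: Counter(w for tweet in tweets for w in tweet.lower().split())
        for c, tweets in seed_tweets_by_class.items()
    }
    # Denominator: occurrences of w summed over all classes.
    totals = Counter()
    for class_counts in counts.values():
        totals.update(class_counts)
    return {
        c: {w: n / totals[w] for w, n in class_counts.items()}
        for c, class_counts in counts.items()
    }


def top_proto_words(scores, k):
    """Keep the k highest-scoring words per class (ties broken alphabetically)."""
    return {c: sorted(ws, key=lambda w: (-ws[w], w))[:k] for c, ws in scores.items()}


def user_feature(user_words, proto_word):
    """Share of all words issued by the user that are the given proto word."""
    counts = Counter(user_words)
    total = sum(counts.values())
    return counts[proto_word] / total if total else 0.0
```

With seed tweets "health care now" for one class and "tax cuts now" for another, the shared word "now" scores 0.5 for both classes, so it ranks below the class-specific words.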
2.3 Perceptions of Media Bias: viewing the news through ideological cues [3]
Haley Devaney, University of California, San Diego, Citations: 17
Individuals form opinions of media bias based on their own prejudices, with little evidence or
justification. It is natural for people to perceive bias in news that does not match their ideological
beliefs. But is this because people do not believe the sources of the article and the information
presented, or simply because they do not agree with the article's ideology and information? This is the
primary issue the study focuses on.
The research does not focus on whether media bias exists or who is to blame for it. Rather, it
focuses on why the hype about media bias is so prominent among the general population and on
understanding why they believe so. Our work complements this research, as we give a quantification
of media bias that is not based on any prejudice.
Figure 2.1 Follower Count of Republic TV
Chapter 3
Proposed Work
(a) Extracted the hash-tags of each tweet, if any, using regex matching.
(b) Broke each extracted hash-tag down into individual words using case-based regex matching.
(c) Removed stop words and applied stemming.
(d) Built a unique list of all words used in all the hash-tags.
(e) Mapped those words manually to specific hash-tags.
(f) Examined every tweet; if a tweet contains a word or a combination of words that we had
mapped earlier, it is labelled with that hash-tag.
(g) Finally, all tweets are labelled with some hash-tag: a political one where possible, and
otherwise a generic hash-tag so that they can be excluded from further analysis.
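The labelling steps (a) to (g) could be sketched in Python roughly as follows. This is an illustration under stated assumptions, not the code used in this work: the regular expressions, the stop-word set, the `#Generic` label and all function names are hypothetical, stemming is left out to keep the sketch dependency-free, and only single-word matches are handled (not combinations of words).

```python
import re

# ASSUMPTION: a hash-tag is '#' followed by letters, digits or underscores.
HASHTAG_RE = re.compile(r"#(\w+)")
# Case-based split: acronyms, capitalised or lower-case words, digit runs.
CASE_SPLIT_RE = re.compile(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+|\d+")
# Hypothetical, deliberately tiny stop-word list.
STOP_WORDS = {"the", "of", "a", "in", "ka"}


def extract_hashtags(tweet):
    """Step (a): return the hash-tags of a tweet without the leading '#'."""
    return HASHTAG_RE.findall(tweet)


def split_hashtag(tag):
    """Step (b): break a hash-tag into lower-cased words at case boundaries."""
    return [w.lower() for w in CASE_SPLIT_RE.findall(tag)]


def hashtag_vocabulary(tweets):
    """Steps (c)-(d): unique non-stop words over all hash-tags (no stemming)."""
    vocab = set()
    for tweet in tweets:
        for tag in extract_hashtags(tweet):
            vocab.update(w for w in split_hashtag(tag) if w not in STOP_WORDS)
    return sorted(vocab)


def label_tweet(tweet, keyword_to_hashtag, default="#Generic"):
    """Steps (e)-(g): label a tweet via a manually built keyword mapping."""
    words = set(re.findall(r"\w+", tweet.lower()))
    for keyword, hashtag in keyword_to_hashtag.items():
        if keyword in words:
            return hashtag
    return default
```

For example, `split_hashtag("KamalKaFOOL")` yields `["kamal", "ka", "fool"]`, and a tweet that mentions "Jadhav" without any hash-tag is still labelled `#Jadhav` if the manual mapping contains `{"jadhav": "#Jadhav"}`.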
(a) The first step is to generate a weighted list of terms for a particular hash-tag by considering
all the tweets of that hash-tag. The weighting scheme used is TF-IDF.
(b) The next step is to generate a weighted list of terms, or dictionary, for a political party. This
is achieved using the same TF-IDF scheme, and the terms are extracted from three sources:
• Official websites of the party, which contain the names of major leaders, schemes and agendas.
• Their Wikipedia pages.
• Descriptions of official pages or screen names of the party and party leaders on Twitter
itself.
(c) A similarity score between the hash-tag list and the party dictionaries is found using the
Pearson correlation coefficient.
5. A subset of tweets with a particular hash-tag from a particular news handle is taken. The average
sentiment score of this subset is calculated.
6. The bias of every news handle towards each party for a particular hash-tag is then equal to the
average sentiment score of that subset multiplied by the respective similarity score of that hash-tag
with that party.
7. The final bias of that news handle towards a political party is the average of all such scores over
all top hash-tags.
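The similarity and bias steps can be sketched in Python as below. This is only a sketch under assumptions: the TF-IDF weights are taken as given dictionaries mapping terms to weights, the two weight lists are aligned over the union of their terms (a missing term counts as 0), and the per-tweet sentiment scores are assumed to come from some sentiment analyser with negative values for negative sentiment. All function names are hypothetical.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length numeric lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    dev_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    dev_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    # A constant vector has no defined correlation; treat it as 0 here.
    return cov / (dev_x * dev_y) if dev_x and dev_y else 0.0


def similarity(hashtag_weights, party_weights):
    """Similarity of a hash-tag term list and a party dictionary."""
    # ASSUMPTION: align both weight lists on the union of their terms.
    vocab = sorted(set(hashtag_weights) | set(party_weights))
    x = [hashtag_weights.get(t, 0.0) for t in vocab]
    y = [party_weights.get(t, 0.0) for t in vocab]
    return pearson(x, y)


def hashtag_bias(sentiment_scores, similarity_score):
    """Step 6: average sentiment of the subset times the similarity score."""
    average = sum(sentiment_scores) / len(sentiment_scores)
    return average * similarity_score


def final_bias(per_hashtag_biases):
    """Step 7: average of the per-hash-tag bias scores."""
    return sum(per_hashtag_biases) / len(per_hashtag_biases)
```

Under these assumptions, a handle whose tweets on a hash-tag average a sentiment of -0.3, for a hash-tag with similarity 0.5 to a party, gets a bias of -0.15 towards that party for that hash-tag.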
Chapter 4
The black line is for RepublicTV, which has been consistent, with the number of tweets per day in the
200-300 range. The green line is for ZeeNews, which has outliers in various places; interestingly, in
the Feb 5 to Feb 10 time frame, due to trending news about Kulbhushan Jadhav, ZeeNews posted more
than 1000 tweets per day. The rest of the lines represent the other news channels, which are fairly
inconsistent and unpopular.
Figure 4.3 Number of tweets per day
The black line is for RepublicTV, red for NDTV, and blue for ZeeNews. Their re-tweet count is
greater than 10 with 50% probability and less than 1000 with 100% probability. The other colours
represent other news handles, which have an average re-tweet count of less than 2.
These two plots show that RepublicTV, ZeeNews and NDTV are the only news handles with a
significant online presence, as is evident from their high frequency of tweets per day and from the
CDF, in which half of their tweets receive more than 10 re-tweets. We therefore narrow our discussion
and analysis down to these 3 news handles. Also, for the sake of simplicity, only Congress and BJP are
considered for bias estimation.
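The two quantities read off these plots, tweets per day and the cumulative distribution of re-tweet counts, can be computed as in the following Python sketch. The timestamp format (ISO strings such as `2018-02-05T10:30:00`) and the function names are assumptions made for illustration; they do not describe how the plots in this work were produced.

```python
from collections import Counter


def tweets_per_day(timestamps):
    """Count tweets per calendar day.

    ASSUMPTION: timestamps are ISO strings whose first 10 characters
    are the date, e.g. '2018-02-05T10:30:00'.
    """
    return dict(Counter(ts[:10] for ts in timestamps))


def empirical_cdf(values, x):
    """Fraction of observations that are less than or equal to x."""
    return sum(v <= x for v in values) / len(values)
```

For a handle whose tweets received 1, 5, 20 and 300 re-tweets, `empirical_cdf(counts, 10)` is 0.5, which corresponds to "greater than 10 with 50% probability" in the description of the plot.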
4.3 Word-cloud of tweets of a particular Hash-tag
Here we show the word-clouds of tweets for a particular hash-tag and a particular news handle.
Comparing the word-clouds of various news handles for a particular hash-tag gives a clear idea of
the types of words and the approach used by different news handles to describe the same event.
For example, the word-clouds for #GST from RepublicTV and ZeeNews show that RepublicTV
focused more on keywords such as tax, income, returns, product chain and accounts, while ZeeNews
kept to keywords such as business, VAT, sales, purchase and distribution, demonstrating how different
the perspectives of two news handles can be on the same policy.
4.4 Top Trending Hash-tags
The top trending hash-tags across all news handles in 2018, after careful mapping, are:
1. #Jadhav
2. #CBSE
3. #FodderScam
4. #2GVerdict
5. #Doklam
6. #TripleTalaq
7. #SSCScam
8. #DelhiCS
9. #PappuDiwas
10. #KamalKaFOOL
4.5 Results
Figure 4.8 Tweet Frequency of #Jadhav
Figure 4.10 Final Average Sentiment Score of #Jadhav with Political Party
4.5.2 Overall Analysis Results
This shows that NDTV speaks more negatively about BJP than about Congress. Similarly, ZeeNews
is more favourable towards BJP. RepublicTV comes out as neutral, criticising both parties
equally. Also note that the majority of the news reported is negative.
Figure 4.13 Tweet Frequency of #2GVerdict
Interestingly, for hash-tags like #2GVerdict, which are expected to produce negative publicity for
the party involved, NDTV chose to minimize the number of tweets with this hash-tag while ZeeNews
chose to maximize it, in order to reduce and increase the political mileage of the event respectively.
This is consistent with the bias we found above.
4.6 Validation
We took 34 responses in total for the following questions from LNMIIT students and a few more
people who were not students to validate whether our results match the common public perception.
Figure 4.15 Validation 2
Figure 4.18 Validation 5
The validation results match our analysis results for NDTV and ZeeNews. ZeeNews is perceived
to be favourable towards BJP, while NDTV is perceived to be favourable towards Congress.
Interestingly, in the case of RepublicTV, half the respondents find it neutral or unbiased while the
other half find it favourable towards BJP.
Chapter 5
5.1 Conclusion
Gathering data from social information web-sites and deriving useful social insights by fusing
information across social media platforms can carve out a niche in a competitive market. We
addressed the problem of the ever-increasing bias of news agencies towards political parties and
tried to alleviate it, to some extent, by providing a quantified measure of their bias.
We demonstrated the change in the sentiment of a news handle towards a trending news item on a
daily basis and in the long run. We also compared the sentiment of various news handles towards a
trending news item. Finally, we labelled hash-tags with political parties and used these labels to
quantify the bias of news handles.
We were able to successfully find a bias for News Channels like NDTV, ZeeNews, and RepublicTV
and showcase their affinity towards political parties.
Various applications of this system were discussed and a real-time simulation of the system is being
worked upon.
The thesis leaves a lot of room for improvement and has a lot of scope for future work.
1. Finding datasets dating back to the Congress-ruled era to verify that the bias is not merely an
effect of which party is in government.
2. Analyzing whether the bias is towards a particular person in the party or towards the whole
party/ideology.
3. Making a summary of the news item/hash-tag using tweets of various news handles with different
biases towards the labelled political party, so that all the pros and cons of that event/policy are
compiled and delivered in a comprehensive manner.
We need to identify the most significant tweets to be used in our summary. To achieve this, we
need to represent each tweet as a weighted term vector. We will use the vector space model (VSM)
for text summarization. The approach is as follows:
(a) We will obtain a list of unique terms of a hash-tag from all the tweets of that hash-tag across
all the news handles. The obtained list is stemmed and stop-words removed.
(b) Let $T = (t_1, t_2, \ldots, t_m)$, where m is the total number of unique terms, represent all the
unique terms in the final processed list.
(c) Each tweet of that hash-tag is decomposed into a similar list of terms. Let the tweets be
represented as $D = (d_1, d_2, \ldots, d_n)$, where n is the total number of unique tweets.
(d) Then each tweet $d_i$ can be represented as a weighted term vector
$d_i = (w_{i1}, w_{i2}, \ldots, w_{ik})$, where $k > 0$ is the total number of unique terms in the tweet $d_i$.
(e) The weight $w_{ij}$ assigned can be calculated using the tf-idf weighting scheme.
(f) This scheme aims at balancing the local and global occurrences of terms in the tweets:
$$w_{ij} = tf_{ij} \cdot idf_j = tf_{ij} \cdot \log\frac{n}{df_j}$$
where $tf_{ij}$ (the local weight) is the number of occurrences of term $t_j$ in the tweet $d_i$ and
$df_j$ denotes the number of tweets containing the term $t_j$.
(g) The next step is to calculate the cosine similarity between each pair of tweets using the term
weights:
$$\cos(d_i, d_l) = \frac{\sum_{j=1}^{m} w_{ij} w_{lj}}{\sqrt{\sum_{j=1}^{m} w_{ij}^2 \, \sum_{j=1}^{m} w_{lj}^2}}, \qquad i = 1, 2, \ldots, n, \quad i \neq l$$
(i) After finding the relevance score for each sentence, we arrange the sentences in descending
order of their relevance score. Extract the sentences starting from the highest relevance score
till the compression ratio is obtained.
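The summarization steps can be sketched in Python as follows. Two points are assumptions rather than part of the procedure as stated: the relevance score of a tweet is taken to be the sum of its cosine similarities to all other tweets (the text does not define it), and the compression ratio is interpreted as the fraction of tweets to keep. The tweets are assumed to be already tokenized, stemmed and stripped of stop words; the function names are hypothetical.

```python
from collections import Counter
from math import log, sqrt


def tfidf_vectors(tweets):
    """tweets: list of token lists. Returns one {term: weight} dict per tweet."""
    n = len(tweets)
    # df[t]: number of tweets that contain term t at least once.
    df = Counter(term for tokens in tweets for term in set(tokens))
    return [
        {t: tf * log(n / df[t]) for t, tf in Counter(tokens).items()}
        for tokens in tweets
    ]


def cosine(u, v):
    """Cosine similarity of two sparse term-weight vectors."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = sqrt(sum(w * w for w in u.values()))
    norm_v = sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def rank_tweets(tweets, ratio):
    """Return indices of the tweets to keep, most relevant first."""
    vectors = tfidf_vectors(tweets)
    # ASSUMPTION: relevance = sum of cosine similarity to every other tweet.
    scores = [
        sum(cosine(vectors[i], vectors[l]) for l in range(len(vectors)) if l != i)
        for i in range(len(vectors))
    ]
    # ASSUMPTION: the compression ratio is the fraction of tweets kept.
    keep = max(1, round(ratio * len(tweets)))
    order = sorted(range(len(tweets)), key=lambda i: (-scores[i], i))
    return order[:keep]
```

Note that with this weighting a term that occurs in every tweet of the hash-tag gets weight log(1) = 0, so it does not contribute to the similarity between tweets.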
Bibliography
[1] Marco Pennacchiotti and Ana-Maria Popescu, Yahoo! Labs, Sunnyvale, USA. A Machine Learning Approach to Twitter User Classification, pennac,amp@yahoo-inc.com
[2] MS Srinivasan, IBM India, Bangalore; Srinath Srinivasa, IIIT-Bangalore, Bangalore; Sunil Thulasidasan, Los Alamos National Laboratory, NM. Celebrity Identification, A Comparative Study of Two Models for Celebrity Identification on Twitter
[3] Perceptions of Media Bias: viewing the news through ideological cues, Haley Devaney, University of California, San Diego, Citations: 17
[4] A Measure of Media Bias, Tim Groseclose and Jeffrey Milyo, The Quarterly Journal of Economics, November 2005, Citations: 931
[5] http://www.thedailyrecords.com/2018-2019-2020-2021/world-famous-top-10-list/india/best-news-channel-in-india-most-watched-english/14983/