
Detecting Political Bias in News Channels

Nirmal Kumar Sivaraman, Ujjwal Madan, Jigyasa Yadav, Sakshi


The LNM Institute of Information Technology, Jaipur

Pushkal Agarwal
King’s College London, London, UK
Abstract

Why use social media when print and television media already provide news? The answer is simple: social media combines both into a single platform. For example, the Times of India and NDTV, which belong to different media types, share a common screen space on Twitter, the most popular micro-blogging platform on the Internet; each has close to 10 million followers.
Another prominent reason is that social media is dynamic and instantaneous. It is dynamic in the sense that a particular news item is updated as and when required; it comes and goes. By instantaneous we mean that a news item appears in print media the next day and on television after a few hours, but on social media it is posted as soon as it is reported, and in some cases it even forms the basis for further reporting and investigation. One popular example is the killing of Osama Bin Laden, which was first reported on social media by a person near the scene who saw helicopters encircling a building and commandos climbing down; his posts were later picked up by other news handles.
Furthermore, using social media as a news source makes it easy to keep track of posts, which is helpful in determining trends and performing various analyses.
In this project we investigate how a particular news handle reports a particular news item and its affinity, or bias, towards a political party. We present an analytical model that captures and quantifies the level of bias of a news handle by mining and analysing the tweets of popular news handles.
Keywords: media, bias, news, social, hash-tag, sentiment

Contents

1 Introduction
  1.1 The Area of Work
  1.2 Problem Addressed

2 Literature Survey
  2.1 A Machine Learning Approach to Twitter User Classification [1]
  2.2 A Comparative Study of Two Models for Celebrity Identification on Twitter [2]
  2.3 Perceptions of Media Bias: Viewing the News through Ideological Cues [3]
  2.4 A Measure of Media Bias [4]
  2.5 A Small Case Study

3 Proposed Work
  3.1 Data Collection
  3.2 Data Pre-processing
  3.3 Algorithmic Procedure

4 Simulation and Results
  4.1 Data-set Description
  4.2 Shortlisting News Handles
    4.2.1 Number of tweets per day
    4.2.2 Log Scaled Re-tweet Count Cumulative Distributed Frequency
  4.3 Word-cloud of tweets of a particular Hash-tag
  4.4 Top Trending Hash-tags
  4.5 Results
    4.5.1 Example of #Jadhav Analysis
    4.5.2 Overall Analysis Results
  4.6 Validation

5 Conclusions and Future Work
  5.1 Conclusion
  5.2 Future Work

Chapter 1

Introduction

1.1 The Area of Work

The basic idea is that we follow popular news handles on Twitter and keep track of popular hash-tags for the same news items across different news handles. This allows us to investigate how a particular news handle reports a particular news item and its affinity, or bias, towards a political party.
We also check the reach of news handles based on various parameters, such as re-tweets per tweet and likes per tweet, scaled according to the number of followers. We want to find the popular and active news handles for further analysis through the above approach. Finally, we perform sentiment analysis on various news handles to determine their fairness.
This quantification can have multiple applications. One is the ability to generate a summary for trending hash-tags such that all the pros and cons pointed out in the tweets of news handles with different biases are compiled into a comprehensive summary. Another application is the identification of news handles that are relatively unbiased and whose news items can be trusted more in order to get a neutral point of view.
We are in the process of creating interactive applications for visualizing the results obtained by our various analysis and data-modelling processes. For this task, the tool we used is Shiny by RStudio, which combines the computational power of R with the interactivity of the modern web. Subsequently, we also created some word clouds to gain more insight into the actual terms used for a hash-tag by various news handles.

1.2 Problem Addressed

The role of news agencies is to provide factual, neutral and unbiased news, without any opinion of the agency or the person reporting it. This is what helps the government and society at large to be aware of causes, reactions and effects. Clean and clear media is a major contributing factor in helping a nation progress.

Unfortunately, in the present times, especially in our country, news channels and newspapers have a tendency to follow a particular ideology. Some of them even tend to support one or the other political party, in power or in opposition. In the process, any policy, scheme, law or decision by the government, or any other political event, is reported in starkly different ways by different news handles.
This often leaves the common man without the complete picture, leading to ill-informed decisions. People do not know which source to trust and form wrong assumptions, often leading to conflicts and the failure of policies.
Our approach to this problem can help people look at a quantification of the neutrality claimed by news handles in an easy-to-understand, user-friendly manner. People can analyze the results personally and decide which news handle they would like to follow. Thus the system may be able to push the media to be more precise and factual, for the greater good of everyone. Also, our system will provide people with a comprehensive summary of an event, possibly containing all the major and important points put forward by various news handles.

Figure 1.1 News Media Groups

Chapter 2

Literature Survey

2.1 A Machine Learning Approach to Twitter User Classification [1]

A study by Marco Pennacchiotti and Ana-Maria Popescu from Yahoo! Labs in Sunnyvale, USA proposed a machine learning approach that can automatically infer the values of user attributes such as political orientation, ethnicity or business fan status by leveraging observable information such as user behaviour, network structure and the linguistic content of the user's Twitter feed.
The approach uses the successful micro-blogging platform Twitter to classify users. Twitter is widely used for recommendation services, real-time news and content sharing, in addition to communicating with friends, family and acquaintances.
Profile information including name, age, location and a short summary of interests is available on Twitter, although it may be incomplete (a user may choose not to post bio details) or misleading (a user may list an imaginary place - "Wonderland" - as her location). This profile information, along with the user's tweeting behaviour, social network and linguistic content, is extracted and processed to determine user attributes such as political orientation, ethnicity or business fan status.
The machine learning framework for social media user classification relies on four general feature classes: user profile, user posting behaviour, linguistic content of user messages and user social network features. The method used is Gradient Boosted Decision Trees, which are on par with Support Vector Machines at a lower response time.
Previous work explored gender, location, etc. from blogs, e-mail and traditional text. Micro-blogging websites like Twitter are being explored now. Several feature classes are defined:

1. Profile Features - who you are. Regular expression matching is used to determine ethnicity and age, e.g.

('m|am|m)[0-9]+(yo|year old)

white (man|woman|boy|girl)

Profile picture identification was used to determine gender. Both methods yielded poor results and were concluded to be good only for bootstrapping training data.

2. Tweeting Behaviour - how you tweet. Properties such as the number of tweets posted by the user; the number and fraction of tweets that are re-tweets; the number and fraction of tweets that are replies; the average number of hash-tags and URLs per tweet; the fraction of tweets that are truncated; the average time and standard deviation between tweets; the average number and standard deviation of tweets per day; and the fraction of tweets posted in each of the 24 hours.

3. Linguistic Content - what you tweet. Prototypical words (LING-WORD) are lexical expressions characteristic of people of a specific class, e.g. younger people tend to use "dude" or "lol" more, and Democrats tend to use the expression "health care" more than Republicans.
The authors employed a probabilistic model for automatically extracting proto words - it only needs a few seed users and is easily portable to different tasks. Given n classes, each class c_i is represented by a set of seed users S_i. Each word w issued by at least one of the seed users is assigned a score for each of the classes. The score estimates the conditional probability of the class given the word as follows:

proto(w, c_i) = |w, S_i| / Σ_{j=1}^{n} |w, S_j|

where |w, S_i| is the number of times the word w is issued by the seed users of class c_i. For each class, we retain as proto words the highest scoring k words. The n × k proto words collected across all classes serve as features for representing a given user: for each proto word w_p, the user u is assigned the score:
f_proto_{w_p}(u) = |u, w_p| / Σ_{w ∈ W_u} |u, w|

where |u, w_p| is the number of times the proto word w_p is issued by user u, and W_u is the set of all words issued by u.
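As a sketch, the proto-word score above can be computed from toy seed data. The class names and word lists below are invented for illustration; real seed users' tweets would supply the word counts.

```python
from collections import Counter

def proto_scores(seed_words_by_class):
    """proto(w, c_i) = |w, S_i| / sum over j of |w, S_j|.

    `seed_words_by_class` maps a class name to the list of all words
    issued by that class's seed users (a toy stand-in for Twitter data).
    """
    counts = {c: Counter(words) for c, words in seed_words_by_class.items()}
    vocab = set().union(*counts.values())
    scores = {}
    for w in vocab:
        total = sum(counts[c][w] for c in counts)
        # Score of word w for each class: its share of occurrences there.
        scores[w] = {c: counts[c][w] / total for c in counts}
    return scores

s = proto_scores({
    "young": ["dude", "lol", "dude", "dude"],
    "adult": ["tax", "dude"],
})
# "dude" appears 3 times for "young" and once for "adult",
# so proto("dude", young) = 3/4.
```

For each class, the highest-scoring k words would then be kept as proto-word features, as described in the text.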

2.2 A Comparative Study of Two Models for Celebrity Identification on Twitter [2]
A study by MS Srinivasan (IBM India), Srinath Srinivasa (IIIT-Bangalore) and Sunil Thulasidasan (Los Alamos National Laboratory) focused on recognizing celebrities on social media platforms.
Celebrities are well-known personalities who attract a lot of attention. Social media breeds its own celebrities, e.g. Balaji Vishwanathan on Quora. Another definition says that a celebrity is a person whose fame transcends the exact reason he or she became famous in the first place, e.g. Kim Kardashian.
Celebrity identification covers two primary models: source credibility and source attractiveness. The source credibility model is a function of expertise and credibility, while the source attractiveness model is based on familiarity, likeability and similarity. Friedman proposed another model based on the attention, recall and loyalty of a follower.

2.3 Perceptions of Media Bias: Viewing the News through Ideological Cues [3]
Haley Devaney, University of California, San Diego, Citations: 17
Individuals formulate opinions of media bias based on their own prejudices, with little evidence or justification. It is natural for people to perceive bias in news that does not align with their ideological beliefs. But is it because people do not believe the sources of the article and the information presented, or simply because they do not agree with the article's ideology and information? This is the primary issue the study focuses on.
The research does not focus on whether media bias exists or who is to blame for it. Rather, it focuses on why the media-bias hype is so prominent among the general population and on understanding why people believe so. Our work complements this research, as we give a quantification of media bias that is not based on any prejudice.

2.4 A Measure of Media Bias [4]


Tim Groseclose and Jeffrey Milyo, The Quarterly Journal of Economics, Nov 2005, Citations: 931
Media bias is measured by estimating ideological scores for several major media outlets. To compute the measure, the number of times a media outlet cites various think tanks and other policy groups was counted. This was then compared with the number of times members of Congress cite the same think tanks in their speeches on the floor of the House and Senate. By comparing the citation patterns, a score relating senators to think tanks, and eventually to the newspaper, was generated. As an oversimplified example, imagine that there were only two think tanks, and suppose that the New York Times cited the first think tank twice as often as the second. The method asks: what is the estimated score of a member of Congress who exhibits the same frequency (2:1) in his or her speeches? This is the score that their method would assign the New York Times.
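The matching idea in the example above can be sketched in a few lines. The member names, citation counts and ideology scores below are invented for illustration; the actual Groseclose-Milyo method estimates scores statistically rather than by nearest-ratio matching.

```python
def closest_member_score(outlet_counts, members):
    """Toy version of the citation-matching idea: assign the outlet the
    ideology score of the member of Congress whose think-tank citation
    ratio is closest to the outlet's.

    `outlet_counts` is (citations of think tank A, citations of think tank B);
    `members` maps a hypothetical member name to (citation_counts, score).
    """
    def ratio(counts):
        a, b = counts
        return a / (a + b)  # share of citations going to think tank A

    target = ratio(outlet_counts)
    best = min(members.values(), key=lambda m: abs(ratio(m[0]) - target))
    return best[1]

# The outlet cites think tank A twice as often as B (2:1), as in the text.
score = closest_member_score((2, 1), {
    "member_x": ((4, 2), -0.5),  # also cites A and B in a 2:1 ratio
    "member_y": ((1, 3), 0.8),
})
```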

2.5 A Small Case Study


Mr. Arnab Goswami, the editor-in-chief of the Times Now news channel, resigned and announced his upcoming venture, Republic TV. The major selling point of the venture is Arnab's claim that it is not funded by any political party or business group and hence is not biased towards any political party.
Our aim is to track their posts, perform sentiment analyses and verify this claim. Another feature we kept an eye on is the flooding of hash-tags, i.e. how much response the hash-tags aired by the channel receive from users.
As Republic TV is a new venture, started on 6th May this year, we also keep track of its follower count and calculate the percentage change per day. An interesting observation was that on days with big news headlines, like #BeefBan, the follower count rose comparatively faster, which is represented by peaks in our graph. Below are the graphs for follower count and percentage change respectively.

Figure 2.1 Follower Count of Republic TV

Figure 2.2 Percentage Change of Republic TV Twitter Followers

Chapter 3

Proposed Work

3.1 Data Collection


The data collection step involved extracting the 3200 most recent tweets from the shortlisted news handles on Twitter. We use the twitteR package in R for connecting to Twitter and extracting the tweets; Twitter caps the number of tweets that can be extracted from a single timeline at 3200.
We also maintain a database of tweets, and every time the system is run, it updates the database with any new tweets from the news handles. This lets us see how the trending news is changing and allows a more in-depth analysis of a particular news handle.
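The incremental update step can be sketched as follows. This is a minimal Python sketch keyed on tweet ids, with an in-memory dict standing in for the real database; the thesis itself uses the twitteR package in R for the actual extraction.

```python
def update_database(db, new_tweets):
    """Merge freshly fetched tweets into the stored set, keyed by tweet id,
    so that repeated runs only add tweets not seen before.

    `db` is a dict {tweet_id: tweet}; `new_tweets` is a list of dicts
    with (hypothetical) 'id' and 'text' fields. Returns how many tweets
    were actually new.
    """
    added = 0
    for t in new_tweets:
        if t["id"] not in db:
            db[t["id"]] = t
            added += 1
    return added

db = {}
update_database(db, [{"id": 1, "text": "a"}, {"id": 2, "text": "b"}])
# A second run that re-fetches tweet 2 only adds the genuinely new tweet 3.
n = update_database(db, [{"id": 2, "text": "b"}, {"id": 3, "text": "c"}])
```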

3.2 Data Pre-processing


The collected tweets go through a pre-processing phase in which the hash-tags from each tweet of a particular news handle are extracted and their frequencies maintained. At the end, the hash-tags with the highest frequency across all news handles are chosen as our working set; their tweets are organized and the other tweets are discarded. This approach usually provides us with the most popular and significant hash-tags.
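A minimal Python sketch of this phase, assuming tweets are plain strings (the real pipeline works per news handle and keeps the tweets, not just the tags):

```python
import re
from collections import Counter

def top_hashtags(tweets, k=3):
    """Extract hash-tags with a regex, count their frequencies across all
    tweets, and keep the k most frequent as the working set."""
    counts = Counter()
    for text in tweets:
        counts.update(tag.lower() for tag in re.findall(r"#(\w+)", text))
    return [tag for tag, _ in counts.most_common(k)]

tags = top_hashtags([
    "Verdict today #Jadhav #Pakistan",
    "More on #Jadhav case",
    "#CBSE paper leak",
], k=2)
# "#Jadhav" appears twice, so it tops the working set.
```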

3.3 Algorithmic Procedure


1. Collected tweets of top news handles using the twitteR package in R. The data-set contains 200,000+ (2 lakh) tweets from Dec 25 to April 15, updated automatically.

2. Labelled each tweet with a hash-tag.

(a) Extracted the hash-tag of each tweet, if there is one, using regex matching.
(b) Broke each extracted hash-tag down into individual words using case-based regex matching.
(c) Removed stop words and applied stemming.
(d) Obtained a unique list of all words used across all the hash-tags.
(e) Mapped those words manually to specific hash-tags.
(f) Examined every tweet; if a tweet contains a word or a combination of words mapped earlier, it is labelled with the corresponding hash-tag.
(g) Finally, every tweet is labelled with at least one hash-tag - political if possible, otherwise a generic hash-tag so that it can be excluded from further analysis.

3. Found out the top hash-tags using their frequency.

4. Now every hash-tag is to be given a similarity score with political parties.

(a) The first step is to generate a weighted list of terms of a particular hash-tag by considering
all the tweets of that hash-tag. The weighting scheme used is TF-IDF.
(b) The next step is to generate a weighted list of terms, or a dictionary, for each political party. This is achieved using the same TF-IDF scheme, with terms extracted from three sources:
• Official websites of the party, containing the names of major leaders, schemes and agendas.
• Their Wikipedia pages.
• Descriptions of the official pages or screen names of the party and party leaders on Twitter itself.
(c) A similarity score between the hash-tag list and each party dictionary is computed using the Pearson correlation coefficient.

5. A subset of tweets with a particular hash-tag from a particular news handle is taken. The average sentiment score of this subset is calculated.

6. The bias of a news handle towards a party for a particular hash-tag is then the average sentiment score of that hash-tag's tweets multiplied by the similarity score of that hash-tag with that party.

7. The final bias of that news handle with a political party is the average of all such scores over all
top hash-tags.
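Steps 4(c), 5 and 6 above can be sketched as follows. The term weights and the sentiment value below are made-up stand-ins for real TF-IDF and sentiment-analysis outputs.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def bias(hashtag_terms, party_terms, avg_sentiment):
    """Correlate the hash-tag's weighted term list with a party dictionary
    over their combined vocabulary (step 4c), then scale the hash-tag's
    average sentiment by that similarity (step 6)."""
    vocab = sorted(set(hashtag_terms) | set(party_terms))
    x = [hashtag_terms.get(t, 0.0) for t in vocab]
    y = [party_terms.get(t, 0.0) for t in vocab]
    return avg_sentiment * pearson(x, y)

# Toy TF-IDF weights: the hash-tag's terms overlap heavily with the party
# dictionary, and the coverage is negative on average.
hashtag_terms = {"jadhav": 0.9, "verdict": 0.4, "pakistan": 0.7}
party_terms = {"jadhav": 0.8, "pakistan": 0.6, "scheme": 0.2}
b = bias(hashtag_terms, party_terms, avg_sentiment=-0.3)
# Positive similarity times negative sentiment gives a negative bias value.
```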

Chapter 4

Simulation and Results

4.1 Data-set Description


We took the 10 leading English news handles in India as reported by The Daily Records [5]. Figures 4.1 and 4.2 show the total statistics of the data-set and the corresponding statistics of a single news handle.

Figure 4.1 Total Data-set

Figure 4.2 RepublicTV Data-set

4.2 Shortlisting News Handles

4.2.1 Number of tweets per day

The black line is for RepublicTV, which has been consistent with 200-300 tweets per day. The green line is for ZeeNews, which has outliers in various places; interestingly, in the Feb 5 to Feb 10 time frame, ZeeNews exceeded 1000 tweets per day due to the trending Kulbhushan Jadhav news. The rest of the lines represent the other news channels, which are fairly inconsistent and unpopular.

Figure 4.3 Number of tweets per day

4.2.2 Log Scaled Re-tweet Count Cumulative Distributed Frequency

The black line is for RepublicTV, red for NDTV, and blue for ZeeNews. Their re-tweet count is greater than 10 with 50% probability and less than 1000 with 100% probability. The other colours represent the other news handles, whose average re-tweet count is less than 2.

Figure 4.4 Log Scaled Re-tweet Count CDF Plot

These two plots help us determine that RepublicTV, ZeeNews and NDTV are the only news handles with a significant online presence, as evident from their high tweet frequency per day and a median re-tweet count above 10. So we narrow our discussion and analysis down to these 3 news handles only. Also, for the sake of simplicity, only Congress and BJP are considered for bias estimation.

4.3 Word-cloud of tweets of a particular Hash-tag

Here, we will show the word-clouds of tweets of a particular hash-tag and a particular news handle.
Comparing word-clouds of various news handles for a particular hash-tag will give us a clear idea about
the types of words and approach used by different news handles to describe the same event.

e.g. The word-clouds for #GST from RepublicTV and ZeeNews clearly show that RepublicTV focused more on keywords like tax, income, returns, product chain and accounts, while ZeeNews stuck to keywords like business, VAT, sales, purchase and distribution, demonstrating how different the perspectives of two news handles can be on the same policy.

Figure 4.5 Word-Cloud @RepublicTV

Figure 4.6 Word-Cloud @ZeeNews

4.4 Top Trending Hash-tags

Top Trending Hash-tags after careful mapping across all news handles in 2018 are -

1. #Jadhav

2. #CBSE

3. #FodderScam

4. #2GVerdict

5. #Doklam

6. #TripleTalaq

7. #SSCScam

8. #DelhiCS

9. #PappuDiwas

10. #KamalKaFOOL

4.5 Results

4.5.1 Example of #Jadhav Analysis

Figure 4.7 Similarity Score of #Jadhav with Political Party

Figure 4.8 Tweet Frequency of #Jadhav

Figure 4.9 Average Sentiment Score of #Jadhav

Figure 4.10 Final Average Sentiment Score of #Jadhav with Political Party

4.5.2 Overall Analysis Results

Figure 4.11 Similarity Scores and Average Sentiment Scores

Figure 4.12 Biasness Values

This clearly shows that NDTV speaks more negatively about the BJP than about Congress. Similarly, ZeeNews is more favourable towards the BJP. RepublicTV comes out neutral, criticizing both parties equally. Also note that the majority of the news reported is negative.

Figure 4.13 Tweet Frequency of #2GVerdict

Interestingly, for hash-tags like #2GVerdict, which are expected to produce negative publicity for the party involved, NDTV chose to minimize the number of tweets with this hash-tag while ZeeNews chose to maximize it, in order to reduce and increase the political mileage of the event respectively. This clearly depicts the bias discovered above.

4.6 Validation

We took 34 responses in total for the following questions, from LNMIIT students and a few non-students, to validate whether our results match common public perception.

Figure 4.14 Validation 1

Figure 4.15 Validation 2

Figure 4.16 Validation 3

Figure 4.17 Validation 4

Figure 4.18 Validation 5

Figure 4.19 Validation 6

The validation results match our analysis results for NDTV and ZeeNews: ZeeNews is perceived to be favourable towards the BJP while NDTV is perceived to be favourable towards Congress. But, interestingly, in the case of RepublicTV, half the respondents find it neutral or unbiased while the other half find it favourable towards the BJP.

Chapter 5

Conclusions and Future Work

5.1 Conclusion

Gathering data from various social information websites and deriving useful social insights by fusing information across social media platforms carves out a niche in a competitive market. We addressed the problem of the ever-increasing bias of news agencies towards various political parties and tried to alleviate it by providing a quantification of their bias.
We demonstrated the change in sentiment of a news handle towards a trending news item on a daily basis and in the long run. We also compared various news handles' sentiment towards a trending news item. Finally, we labelled hash-tags with political parties and used these labels to quantify the bias of news handles.
We were able to successfully find a bias for news channels like NDTV, ZeeNews and RepublicTV and showcase their affinity towards political parties.
Various applications of this system were discussed, and a real-time simulation of the system is being worked on.

5.2 Future Work

The thesis leaves a lot of room for improvement and has much scope for future work.

1. Finding data-sets dating back to the Congress-ruled era and verifying that the bias is not present merely because of the ruling government.

2. Analyzing whether the bias is towards a particular person in the party or towards the whole party or ideology.

3. Making a summary of a news item/hash-tag using tweets of various news handles with different biases towards the labelled political party, so that all the pros and cons of that event/policy are compiled and delivered in a comprehensive manner.

We need to identify the most significant tweets to be used in our summary. To achieve this, we need to represent each tweet as a vector. We will use the Vector Space Model (VSM) for text summarization. The approach followed is as follows -

(a) We obtain a list of unique terms for a hash-tag from all the tweets of that hash-tag across all the news handles. The obtained list is stemmed and stop-words are removed.
(b) Let T = (t_1, t_2, ..., t_m), where m is the total number of unique terms, represent all the unique terms in the final processed list.
(c) Each tweet of that hash-tag is decomposed into a similar list of terms. Let the tweets be represented as D = (d_1, d_2, ..., d_n), where n is the total number of tweets.
(d) Then for each tweet d_i, a weighted term vector can be represented as d_i = (w_i1, w_i2, ..., w_ik), where 0 < k ≤ m is the number of unique terms in the tweet d_i.
(e) The weight wij assigned can be calculated using the tf-idf weighting scheme.
(f) This scheme aims at balancing the local and global occurrences in the tweets:

w_ij = tf_ij × idf_j = tf_ij × log(n / df_j)

where tf_ij (the local weight) is the number of occurrences of term t_j in the tweet d_i, and df_j denotes the number of tweets containing the term t_j.
(g) The next step is to calculate the cosine similarity between each pair of tweets using the term weights:

cos(D_i, D_l) = ( Σ_{j=1}^{m} w_ij w_lj ) / ( sqrt(Σ_{j=1}^{m} w_ij^2) × sqrt(Σ_{j=1}^{m} w_lj^2) )

where i, l = 1, 2, ..., n.


(h) Our approach to summarization is to arrange the sentences in descending order of relevance score and to extract sentences from the top until the compression ratio is satisfied. The relevance score can be generated using this formula:

R_Score(D_i) = Σ_{l=1, l≠i}^{n} C(D_i, D_l),   i = 1, 2, ..., n

(i) After finding the relevance score for each sentence, we arrange the sentences in descending order of relevance score and extract sentences starting from the highest until the compression ratio is reached.
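A compact Python sketch of steps (a)-(i), using whitespace tokenization in place of full stemming and stop-word removal, and a fixed top-k cut in place of a compression ratio:

```python
import math

def summarize(tweets, top_k=1):
    """tf-idf weight each tweet, score each tweet by its summed cosine
    similarity to the others (the relevance score), and extract the
    highest-scoring ones."""
    docs = [t.lower().split() for t in tweets]
    n = len(docs)
    df = {}  # document frequency of each term
    for d in docs:
        for term in set(d):
            df[term] = df.get(term, 0) + 1
    # Per-tweet sparse weight vectors: w_ij = tf_ij * log(n / df_j).
    weights = [{t: d.count(t) * math.log(n / df[t]) for t in set(d)}
               for d in docs]

    def cos(wi, wl):
        num = sum(wi[t] * wl.get(t, 0.0) for t in wi)
        ni = math.sqrt(sum(v * v for v in wi.values()))
        nl = math.sqrt(sum(v * v for v in wl.values()))
        return num / (ni * nl) if ni and nl else 0.0

    # Relevance score: sum of similarities to all other tweets.
    scores = [sum(cos(weights[i], weights[l]) for l in range(n) if l != i)
              for i in range(n)]
    ranked = sorted(range(n), key=lambda i: scores[i], reverse=True)
    return [tweets[i] for i in ranked[:top_k]]

best = summarize([
    "jadhav verdict icj",
    "jadhav verdict stay",
    "cricket score update",
], top_k=1)
```

The two tweets about the same story reinforce each other's relevance score, so one of them is extracted rather than the off-topic tweet.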

Bibliography

[1] Marco Pennacchiotti and Ana-Maria Popescu. A Machine Learning Approach to Twitter User Classification. Yahoo! Labs, Sunnyvale, USA.

[2] MS Srinivasan (IBM India, Bangalore), Srinath Srinivasa (IIIT-Bangalore) and Sunil Thulasidasan (Los Alamos National Laboratory, NM). A Comparative Study of Two Models for Celebrity Identification on Twitter.

[3] Haley Devaney. Perceptions of Media Bias: Viewing the News through Ideological Cues. University of California, San Diego.

[4] Tim Groseclose and Jeffrey Milyo. A Measure of Media Bias. The Quarterly Journal of Economics, November 2005.

[5] http://www.thedailyrecords.com/2018-2019-2020-2021/world-famous-top-10-list/india/best-news-channel-in-india-most-watched-english/14983/

