

Research methods in the age of digital journalism

Article · February 2013


DOI: 10.1080/21670811.2012.714928


RESEARCH METHODS IN THE AGE OF DIGITAL
JOURNALISM
Ilias Flaounas*, Omar Ali*, Thomas Lansdall-Welfare*, Tijl De Bie*, Nick Mosdell†,
Justin Lewis†, and Nello Cristianini*

*Intelligent Systems Laboratory, University of Bristol, Bristol, UK

†Cardiff School of Journalism, Media and Cultural Studies, Cardiff University, UK

Abstract

News content analysis is usually preceded by a labour-intensive coding phase, where experts
extract key information from news items. The cost of this phase imposes limitations on the
sample sizes that can be processed, and therefore on the kinds of questions that can be
addressed. In this paper we describe an approach that incorporates text-analysis technologies
for the automation of some of these tasks, enabling us to analyse data sets that are many
orders of magnitude larger than those normally used. The patterns detected by our method
include: 1) similarities in writing style among several outlets, which reflect reader
demographics; 2) gender imbalance in media content and its relation with topic; 3) the
relationship between topic and popularity of articles.

Keywords
Automation of content analysis, large-scale text analysis, pattern discovery, data mining,
automation of coding

This is an Author's Original Manuscript of an article whose final and definitive form has been
published in Digital Journalism, 2012, copyright Taylor & Francis, available online at:
http://www.tandfonline.com/doi/full/10.1080/21670811.2012.714928

RESEARCH METHODS IN THE AGE OF DIGITAL JOURNALISM
Massive-scale automated analysis of news content: topics, style and gender

Introduction

In recent years the social sciences have shown strong interest in computational methods
that allow large scale quantitative studies to be performed automatically. This has produced
studies ranging from the analysis of massive social networks (Watts, 2007) to the content of
millions of books (Michel, 2011), to the point that some have heralded the advent of
computational social sciences (CSS) (Lazer, 2009) and ‘culturomics’ (Michel, 2011). Recent
work in CSS examines human interactions, including the study of friendships using mobile
phone data (Eagle, 2009); the discovery of patterns in email exchange (Eckmann, 2004); and
the study of player interactions in online games (Szell, 2010). The introduction of computer
science into the social sciences is still at an immature stage (especially when compared to the
physical sciences) for a variety of reasons, such as the difficulties of studying social
interactions and the lack of digital data (Watts, 2007).

This study presents a large scale investigation of the content of online news outlets, covering
2.5 million articles published on the main pages of their online editions. It demonstrates how
automated approaches can access both semantic and stylistic properties of content, and
therefore how content analysis can be scaled to sizes that were previously unreachable. A
study by Gilens and Hertzman (2000), for example, required two assistants to code 113 news
articles. Similarly, Len-Rios et al. (2006) required three assistants to code articles from 42
issues of two newspapers. Indeed, authors of this article have conducted many content
analyses, and while these can sometimes produce sample sizes of a few thousand they
generally require coding teams working over a period of several months to produce the data
(e.g. Lewis, Cushion and Thomas, 2005; Lewis, Inthorn and Wahl-Jorgensen, 2005; Lewis
and Cushion, 2009). Our approach, in this article, is to explore the application of modern
Artificial Intelligence (AI) techniques, including data mining, machine learning and natural
language processing for the large-scale automated analysis of news media content. Since this
approach is new and its reliability needs to be validated, we chose two areas of analysis –
writing style and gender representation – with some fairly predictable outcomes. Our aim, in
part, was to replicate earlier findings on a much larger scale.

Methods

We based our analysis on state-of-the-art AI techniques, including data mining (Liu, 2007),
machine learning (Shawe-Taylor, 2004) and natural language processing (Manning, 1999).
The outlets we tracked were mainstream traditional media that offer their
content online in news feed format. We monitored the main feed of each outlet in this study,
that is, the feed advertised on the outlet's home page. We performed an automatic coding
of news articles and annotated them according to their topic using the well-established
approach of Support Vector Machines (SVMs) from the field of machine learning. This involved
training a machine to perform the annotation, much as a scholar would train
human coders. We trained one SVM classifier for each topic we wanted to detect. To train
the classifiers we used two well-known corpora, namely the Reuters corpus (Lewis et al, 2004)
and The New York Times corpus (Sandhaus, 2008) (see Note 1). These classifiers will therefore
reflect any bias that the editors of Reuters and The New York Times had when they annotated their
articles by topic.
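
The one-classifier-per-topic scheme, in which an article may receive several topic labels, can be sketched as follows. This is a minimal illustration rather than the system used in the study: the vocabulary, weight vectors and bias terms are hypothetical placeholders standing in for SVM models that would be trained on the annotated corpora.

```python
# Hypothetical linear models, one per topic: (weights over a toy vocabulary, bias).
# In the study each model would be an SVM trained on annotated corpora; the
# numbers here are invented purely to show the multi-label decision rule.
VOCABULARY = ["goal", "match", "court", "police", "election", "vote"]

TOPIC_MODELS = {
    "Sport": ([1.2, 0.9, -0.3, -0.2, -0.4, -0.3], -0.5),
    "Crime": ([-0.2, -0.1, 1.0, 1.3, -0.1, -0.2], -0.5),
    "Politics": ([-0.3, -0.2, 0.2, 0.0, 1.1, 1.0], -0.5),
}


def featurise(text):
    """Count how often each vocabulary term occurs in the text."""
    tokens = text.lower().split()
    return [tokens.count(term) for term in VOCABULARY]


def assign_topics(text):
    """Return every topic whose binary classifier gives a positive score."""
    features = featurise(text)
    topics = []
    for topic, (weights, bias) in TOPIC_MODELS.items():
        score = sum(w * x for w, x in zip(weights, features)) + bias
        if score > 0:
            topics.append(topic)
    return topics
```

Because each topic has its own independent binary decision, an article can end up with no topic, one topic or several, which matches the multi-label annotation described here.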

Monitoring these feeds allowed us to collect 2.5 million articles from 498 different English-language news
outlets spanning a continuous period of ten months. We automatically annotated them into 15
topic areas (we allowed an article to belong to more than one topic). We then scored the
articles based on two properties of their writing style - their readability and their linguistic
subjectivity - and extracted the name and gender of all the people mentioned in them. Our
analysis focuses on two units of analysis: topics and outlets. We compared topics according to
their writing style and the male/female ratio of the most frequently mentioned people in that
topic. We also compared 15 major US and UK newspapers according to the same criteria and
their topic selection bias, and examined the popularity (in terms of readers'
preferences) of a subset of articles.

At every step of our analysis, we checked our partial and final findings against external
benchmarks to give us confidence in them. So, for example, we found that,
as we might expect, Op/Ed pieces are indeed more linguistically subjective than average;
that news for children is more readable than average; and that articles about female sports are
more likely to mention women.

Altogether these findings corroborate our thesis that macroscopic patterns in the collective
contents of large numbers of news outlets can now be detected by automated means, opening
up the possibility of asking new types of questions and revealing biases and properties that cut across
the news/media sphere.

Findings

We analysed 2,490,429 articles gathered from 498 online English-language news outlets,
from 99 different countries, from January 1st, 2010 to October 31st, 2010. The data collection
and analysis were based on our system, described in detail in Flaounas (2011). This system has
been used successfully before in several media analysis studies, such as the analysis of
factors that affect the choices of media editors (Flaounas, 2010). Each article in our corpus
was classified automatically without human interaction (using Support Vector Machines)
into 15 different generic news categories such as ‘Crime’ or ‘Sport’. The articles were then
assessed for readability (i.e., the ease of reading an article); for linguistic subjectivity (based
on the ratio of sentimental adjectives over the total number of adjectives); and for gender
imbalances (among the most frequently mentioned people).

Writing Style

In the first set of experiments we compared topics based on two properties of writing style:
their readability and their linguistic subjectivity. We assessed readability using the Flesch
Reading Ease Test scoring method (Flesch, 1948). Readability scores range from 0 to 100:
the higher the score, the more readable the text. We acknowledge, of course, that readability
cannot be entirely reduced to a set of linguistic properties, but such measures do provide a useful
framework in which certain properties (such as shorter words or sentences) are likely to be
associated with higher levels of readability.
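
As an illustration of how such a score is computed, the sketch below implements the standard Flesch Reading Ease formula, 206.835 − 1.015 × (words per sentence) − 84.6 × (syllables per word). The sentence splitter and the vowel-group syllable counter are crude assumptions made for this example only; they are not the tooling used in the study.

```python
import re


def count_syllables(word):
    """Rough heuristic (an assumption): one syllable per group of vowels."""
    groups = re.findall(r"[aeiouy]+", word.lower())
    return max(1, len(groups))


def flesch_reading_ease(text):
    """Flesch Reading Ease: higher scores indicate more readable text."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    if not sentences or not words:
        raise ValueError("text must contain at least one sentence and one word")
    syllables = sum(count_syllables(w) for w in words)
    words_per_sentence = len(words) / len(sentences)
    syllables_per_word = syllables / len(words)
    return 206.835 - 1.015 * words_per_sentence - 84.6 * syllables_per_word


easy = flesch_reading_ease("The cat sat. The dog ran.")
hard = flesch_reading_ease(
    "Parliamentary deliberations regarding environmental legislation "
    "intensified considerably."
)
```

Short sentences of short words score highest, which is the property the text relies on. For extreme inputs such as the toy sentences above, the raw formula can produce values outside the nominal 0 to 100 range.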

Figure 1 ranks topics based on their mean readability scores. We found that ‘Sports’ and
‘Arts’ were the most readable topics while ‘Politics’ and ‘Environment’ were the least
readable. For validation we added a set of articles from the BBC's CBBC Newsround,
a current affairs programme aimed specifically at children. As expected
the CBBC news items were found to be the most readable collection of articles with a mean
readability score of 62.50 and standard error of the mean (SEM) equal to 0.27.
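
The standard error of the mean quoted here is the sample standard deviation divided by the square root of the sample size. A minimal sketch, using invented readability scores rather than data from the study:

```python
import math
import statistics


def mean_and_sem(scores):
    """Return the mean and the standard error of the mean of a list of scores."""
    if len(scores) < 2:
        raise ValueError("at least two scores are needed to estimate the SEM")
    mean = statistics.fmean(scores)
    sem = statistics.stdev(scores) / math.sqrt(len(scores))
    return mean, sem


# Hypothetical per-article readability scores for one collection.
mean, sem = mean_and_sem([60.0, 62.0, 64.0, 66.0])
```

Because the SEM shrinks with the square root of the sample size, very large collections of articles yield small SEM values even when individual articles vary widely.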

For each article we also measured the percentage of all adjectives that express a judgement,
words such as ‘terrible’ or ‘wonderful’ for example, a quantity that we refer to as linguistic
subjectivity (although we acknowledge that many other forms of language can express
subjectivity – our focus was on the most overt subjective expressions). The discovery of
adjectives was based on the ‘Stanford Log-linear Part-Of-Speech Tagger’ (Toutanova, 2003)
and the measurement of their sentimental content was based on SentiWordNet (Baccianella,
2010). We categorised an adjective as subjective if its subjectivity weight was above 0.25. If a
word had several different weights we used their average. The gender of people named in each
article was extracted from the corpus using the open source tool GATE (Cunningham, 2002).
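
The subjectivity measure described above can be sketched as follows. The sketch assumes the adjectives have already been extracted by a part-of-speech tagger, and it uses a tiny invented lexicon in place of SentiWordNet; the weights are illustrative only. Treating adjectives missing from the lexicon as non-subjective is also an assumption of this example.

```python
# Hypothetical mini-lexicon: adjective -> subjectivity weights, one per word sense.
TOY_LEXICON = {
    "terrible": [0.75, 0.625],
    "wonderful": [0.875],
    "annual": [0.0],
    "financial": [0.0, 0.125],
    "fine": [0.5, 0.125, 0.0],
}
THRESHOLD = 0.25  # threshold stated in the text


def is_subjective(adjective, lexicon=TOY_LEXICON, threshold=THRESHOLD):
    """An adjective is subjective if its average weight exceeds the threshold."""
    weights = lexicon.get(adjective.lower())
    if not weights:
        return False  # assumption: unknown adjectives count as non-subjective
    return sum(weights) / len(weights) > threshold


def linguistic_subjectivity(adjectives):
    """Percentage of adjectives that are classed as subjective."""
    if not adjectives:
        return 0.0
    subjective = sum(1 for adjective in adjectives if is_subjective(adjective))
    return 100.0 * subjective / len(adjectives)
```

For instance, `linguistic_subjectivity(["terrible", "annual", "wonderful", "financial"])` counts two subjective adjectives out of four. The entry for "fine" shows the averaging step: its mean weight falls below 0.25 even though one of its senses is above it.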

For each article in each topic we measured the linguistic subjectivity of its title and its first
three sentences. Adjectives were found by parsing the text, and their level of subjectivity was
found by using a standard database (Baccianella, 2010). Figure 2 illustrates our findings.
‘Fashion’ and ‘Art’ articles were the most linguistically subjective, insofar as they use the
most expressive adjectives. Topics such as ‘Business’, ‘Politics’, and ‘Elections’ appear to
use the least overtly subjective language.

This confirms, on a much larger scale, what many academics and media analysts would
predict: that the language of news varies according to topic. As Martin Conboy writes:
“Within specialist sections of the newspaper, sports, fashion and entertainment for instance,
there is more latitude for the language to show traces of opinion and even judgement of taste”
(Conboy, 2007, p.9). It might also confirm that certain subjects are more suited to a narrative
and conversational approach (see for example Jacobs, 1996 and Connel, 1998) while the less
subjective story subjects are also constrained to some extent by the relative complexity of the
subject matter. Nonetheless, our findings raise some other interesting issues which we will
discuss shortly.

Proper validation of the approach is difficult in the absence of any agreed upon gold standard
for sentiment analysis (Pang, 2008). As an indicator of the validity of our approach we
compared 15 leading UK and US newspapers and validated the hypothesis that tabloids
would use more sentimental adjectives than broadsheets. We also observed that linguistically
subjective language is more pronounced in Op/Ed pieces, as one would expect. We collected
5,766 Op/Ed articles from 57 different media and found that their linguistic subjectivity has a
mean of 31.20% (SEM=0.20%), well above the average subjectivity of all articles (22.41%,
SEM=0.02%). Again this would confirm the view that “opinion pieces [...] exist to provoke
reaction and response...” (Conboy, 2007, p.9).

We plotted the two writing style properties - readability and subjectivity - by topic in 2D
space, as illustrated in Fig. 3. This allowed us to visualize different topics on both scales. We also
found a significant correlation of 73.49% between readability and linguistic subjectivity
(Spearman correlation, p = 0.0018). In other words, for the topics we examined, the more
readable a topic is, the more linguistically subjective it tends to be. While there are many
possible explanations for this, it does open up the possibility that these two stylistic features
have become associated with one another in journalistic conventions. This, in turn, allows us
to imagine new conventions that break down these associations - so, for example, political
coverage that strives for high readability without linguistic subjectivity (a stylistic
combination achieved, to some extent, by sports coverage).
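
A Spearman correlation of this kind is the Pearson correlation computed on ranks rather than raw values. The sketch below shows the calculation on invented per-topic means; the numbers are hypothetical and the significance test (the p-value) is not reproduced.

```python
import math


def average_ranks(values):
    """Rank values from 1..n, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical mean readability and subjectivity (%) for five topics.
readability = [30.0, 40.0, 50.0, 60.0, 70.0]
subjectivity = [15.0, 20.0, 18.0, 25.0, 30.0]
rho = spearman(readability, subjectivity)
```

Working on ranks means the measure only asks whether more readable topics also tend to rank higher on subjectivity, without assuming the relationship is linear.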

Gender Bias

The analysis of the representation of gender in the news media has a long history within
media, communication and cultural studies, often involving complex judgements of
stereotyping and language as well as more straightforward measures such as the relative
incidence of male and female sources and actors (see, for example, Carter, 1998). Of the
1,000 most mentioned people in our dataset, the vast majority are men. This is consistent with
traditional content studies, as well as with patterns in other domains such as the distribution of income.
Figure 4 presents the ranking of topics based on the gender bias of their articles. ‘Sport’ and
financial articles are the most male biased, while ‘Fashion’ and ‘Arts’ are the least biased,
with ‘Fashion’ articles having almost equal references to males and females.

This accords with previous work on gender bias in sports coverage, which has found that
females account for between only 7% and 25% of coverage (Alexander, 1994; Eastman,
2000; Bishop, 2003). A broader analysis was carried out by Len-Rios et al. (2006), examining
gender bias across seven topics in two U.S. newspapers, and they also found the most male
bias in sports articles, with the least in entertainment. We note that Len-Rios et al. based their
research on articles from two newspapers for a period of three weeks. The scale of our study, we would
suggest, allows us to be more definitive about these trends.

Within the field of politics, studies have examined the amount and type of coverage that
female politicians and candidates receive in comparison to their male counterparts. A study
by Heldman et al. (2005) found that Elizabeth Dole, a candidate for the Republican presidential nomination,
received significantly less coverage than George W. Bush and John McCain, despite her
status in opinion polls at the time. Again this study was based on a relatively small sample of
421 news articles.

We validated our tools for gender detection by using the Freebase database
(http://www.freebase.com). Freebase contains information about the gender of many
celebrities in a computer readable format. We found 38,480 entities in Freebase that exactly
match our system's entries. Of those entities, the gender of 30,569 was detected
correctly by our system; the gender of 7,285 was left unlabelled; and the gender of only 626 was
detected incorrectly (errors were balanced between males and females).
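
The counts above can be turned into rates with a few lines of arithmetic. The function and key names below are our own labels for this illustration, not terms used in the study.

```python
def validation_rates(correct, unlabelled, wrong):
    """Summarise a validation run as coverage, precision and overall accuracy."""
    matched = correct + unlabelled + wrong
    labelled = correct + wrong
    return {
        "matched": matched,
        "coverage": labelled / matched,    # share of entities given any label
        "precision": correct / labelled,   # share of labels that were right
        "accuracy": correct / matched,     # share of all entities labelled right
    }


# Counts reported in the text for the Freebase comparison.
rates = validation_rates(correct=30_569, unlabelled=7_285, wrong=626)
```

With these counts, the system assigns a gender to about 81.1% of the matched entities, about 98.0% of the assigned labels agree with Freebase, and about 79.4% of all matched entities are labelled correctly.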

Popularity of Topics

For a subset of the 498 outlets, we were able to explore the appeal of different topics to
readers. More formally, we measured the conditional probability of an article becoming
popular given its topic. To measure the popularity of articles we tracked the special news feed
provided by some outlets that carries the ‘Most Popular’ articles. In our corpus we tracked 16
outlets that provided this special feed. From those outlets we collected a total of 92,956
popular articles and 200,750 articles that appeared in the main feed of the same outlets at the
same time. The intersection of those articles that appeared both in the main feed of the outlet
and in the popular feed gave us a sample of 24,409 articles. Figure 5 presents the ranking of
the topics by popularity. We found that the most appealing topics are ‘Disasters’ and ‘Crime’,
while the least appealing topics are ‘Markets’ and ‘Prices’. This confirms a long-held belief
amongst newspaper editors that ‘if it bleeds it leads’ (Williams, 1998; Allan, 2000; Harrison,
2006), although it runs counter to the significant growth, over the last decade or so, in
Business News (Roush, 2006; Svennevig, 2007), which would appear to remain a niche area
for most people. If the main findings here are as we might predict, the popularity of
environmental stories and the unpopularity of stories about sport are much less predictable.
This suggests that some assumptions about reader preferences may be misplaced.
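
The conditional probability of an article becoming popular given its topic can be estimated as the share of main-feed articles on that topic that also appear in the ‘Most Popular’ feed. The sketch below uses a handful of invented articles; the data structure is an assumption made for illustration.

```python
from collections import Counter


def popularity_by_topic(articles):
    """Estimate P(popular | topic) from (topics, is_popular) pairs."""
    total = Counter()
    popular = Counter()
    for topics, is_popular in articles:
        for topic in topics:  # an article may carry several topics
            total[topic] += 1
            if is_popular:
                popular[topic] += 1
    return {topic: popular[topic] / total[topic] for topic in total}


# Hypothetical main-feed articles: topic labels and whether each article
# also appeared in the outlet's 'Most Popular' feed.
ARTICLES = [
    (["Crime"], True),
    (["Crime", "Politics"], True),
    (["Crime"], False),
    (["Markets"], False),
    (["Markets"], False),
    (["Markets", "Politics"], True),
    (["Politics"], False),
]
probabilities = popularity_by_topic(ARTICLES)
ranking = sorted(probabilities, key=probabilities.get, reverse=True)
```

Sorting topics by this estimate gives a popularity ranking of the kind shown in Figure 5.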

We then examined the relation between the popularity of articles and their writing style. We
found that the popular articles tend to be more readable and more linguistically subjective.
This is illustrated in Fig. 1 and Fig. 2 where the ‘Average’ bar indicates a random selection of
news articles, and the ‘Most Popular’ bar indicates the popular articles. This would lend
further weight to what Franklin refers to as the growth of ‘infotainment’ or ‘Newszak’,
characterised by “insensitive conjoining of the sentimental and the sensational, the prurient
and the populist” (Franklin, 1997, p.3), and demonstrates that there is a substantial appetite
for softer news stories. While we cannot be sure about the causal factors at work here, our
findings suggest the possibility, at least, that the language of hard news and dry factual
reporting is as much a deterrent to readers and viewers as the content.

Comparing Newspapers (US and UK)

Finally, we conducted a case study of different news outlets in order to demonstrate the way
in which our analysis allows us to cluster different outlets according to a multiplicity of
variables. We focused on 15 leading newspapers, eight from the US and seven from the UK,
and compared their content. Both the UK and US newspaper samples comprised three
tabloids and four broadsheets. The 15 newspapers accounted for 218,302 of the
2.5 million articles in the dataset. This comparison was based on the topics they choose to
cover, their writing style and the ratio of males over females featured in their coverage of
people.

Figure 6 illustrates the topic selection bias of selected newspapers. In this figure we project
newspapers into a space where the axes have no particular meaning, but where the distances
reflect the proximity of the outlets based on the topics they prefer to cover. To achieve this
result we represented each outlet as a 15-dimensional vector, i.e. one dimension per topic, and
then we embedded the 15-dimensional vectors in a 2D plane suitable for visualisation by
utilising Multidimensional Scaling. While this gives us clusters that are, broadly speaking, as
we might expect, it also gives us outliers - notably The Wall Street Journal, whose focus on
business stories clearly differentiates it from the others.
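
The embedding step can be sketched with classical (Torgerson) multidimensional scaling. The text does not say which MDS variant or distance measure was used, so classical MDS on Euclidean distances is an assumption here, as are the three-topic vectors for five hypothetical outlets (the study used 15 topics). Eigenvectors are found by simple power iteration to keep the example dependency-free.

```python
import math


def pairwise_distances(vectors):
    """Euclidean distance between every pair of outlet vectors."""
    return [[math.dist(a, b) for b in vectors] for a in vectors]


def top_eigenpair(matrix, iterations=1000):
    """Largest eigenvalue and unit eigenvector of a symmetric PSD matrix."""
    n = len(matrix)
    v = [float(i + 1) for i in range(n)]  # fixed start keeps the sketch deterministic
    for _ in range(iterations):
        w = [sum(matrix[i][j] * v[j] for j in range(n)) for i in range(n)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm < 1e-12:
            return 0.0, [0.0] * n
        v = [x / norm for x in w]
    mv = [sum(matrix[i][j] * v[j] for j in range(n)) for i in range(n)]
    return sum(v[i] * mv[i] for i in range(n)), v


def classical_mds(distances, dims=2):
    """Embed points in `dims` dimensions from their pairwise distances."""
    n = len(distances)
    squared = [[d * d for d in row] for row in distances]
    row_mean = [sum(row) / n for row in squared]
    grand_mean = sum(row_mean) / n
    # Double-centre the squared distances to obtain a Gram matrix.
    b = [[-0.5 * (squared[i][j] - row_mean[i] - row_mean[j] + grand_mean)
          for j in range(n)] for i in range(n)]
    coords = [[0.0] * dims for _ in range(n)]
    for k in range(dims):
        eigenvalue, v = top_eigenpair(b)
        if eigenvalue <= 1e-12:
            break
        scale = math.sqrt(eigenvalue)
        for i in range(n):
            coords[i][k] = scale * v[i]
        # Deflate so the next pass finds the next eigenvector.
        b = [[b[i][j] - eigenvalue * v[i] * v[j] for j in range(n)] for i in range(n)]
    return coords


# Hypothetical outlets: share of coverage devoted to three topics.
OUTLETS = {
    "Outlet A": (0.7, 0.2, 0.1),
    "Outlet B": (0.6, 0.3, 0.1),
    "Outlet C": (0.2, 0.2, 0.6),
    "Outlet D": (0.1, 0.3, 0.6),
    "Outlet E": (0.4, 0.4, 0.2),
}
vectors = list(OUTLETS.values())
distances = pairwise_distances(vectors)
embedding = classical_mds(distances)
```

As in Figure 6, the axes of the resulting plane carry no meaning of their own; only the distances between outlets are interpretable. In this toy case the topic shares sum to one, so the points already lie in a plane and the 2D embedding preserves the distances exactly, whereas with 15 topics some distortion is to be expected.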

As we might expect, UK tabloids tend to cluster together, while the UK’s Independent has a
distinct position compared to both the US media and other UK broadsheets. While further
work is needed to explore these similarities/differences, our analysis shows how automated
methods can be used to identify patterns and clusters in a way that traditional content analysis
would find more difficult.

Figure 7 compares the newspapers based on their writing style. We found that although all the
UK newspapers in our sample have clear editorial biases, the UK tabloids tend to be more
linguistically subjective than the UK broadsheets. This result, while not unexpected, suggests
that the editorial bias of newspapers is exacerbated by their form, and that bias may be made
more pronounced by the linguistic style of the tabloid press. This analysis also throws up
some less predictable findings. We might have expected, for example, the Guardian's
readability levels to be closer to other UK broadsheets (although its position may reflect the
highly educated nature of its readership), and while the Daily Mail's subjectivity reflects its
mid-market tabloid status, its readability does not.

Figure 8 compares the newspapers based on the gender bias of the people they cover. The
dominance of males over females ranges from a threefold to a six-fold ratio. As we might
expect, given their focus on celebrity and entertainment, UK tabloids have a higher
proportion of females than the UK broadsheets. For the subset of UK outlets it was also possible to
observe a significant correlation of 36.19% (p=0.022) between style and reader demographic
profiles (See Note 2). In other words, we found that outlets that have similar writing styles
tend to get the attention of similar audiences. We found no significant correlation between
writing style and topics, or between topics and demographics in respect to outlets. Thus, it
appears, audiences relate more to writing style than to choice of topic - an interesting finding
since it is commonly assumed that readers respond to both.

Conclusions

The automation of many tasks in news content analysis will not replace the human judgment
needed for fine-grained, qualitative forms of analysis, but it allows researchers to focus their
attention on a scale far beyond the sample sizes of traditional forms of content analysis.
Rather than spending precious labour on the coding phase of raw data, analysts could focus
on designing experiments and comparisons to test their hypotheses, leaving to computers the
task of finding all articles of a given topic, measuring various features of their content such as
their readability, use of certain forms of language, sources etc. (just a few of the tasks that can
now be automated).

Similar studies have been conducted describing how machine translation can be used to
access the contents of all key news outlets of Europe for one year (Flaounas et al, 2010),
showing that content similarities among countries reflect cultural, economic and geographic
ties. This kind of research demands a scale and scope that is beyond the reach of traditional
content analysis. Longitudinal studies have also been conducted, based on comprehensive
rather than sample data sets. So, for example, it was possible to conduct an analysis of all
crime stories of the New York Times over 20 years, showing that the perpetrators of violent
crime tended to be male, while the victims tended to be women and children. While crime
figures suggest the first finding reflects the world of recorded crime, the second runs counter
to it (Sudhahar et al, 2011). Other work towards the automation of news media content
analysis includes the ‘Europe Media Monitor’ (Steinberger, 2009) and the ‘Lydia’ system
(Lloyd, 2005).

Our approach – apart from freeing scholars from more mundane tasks – allows researchers to
turn their attention to higher level properties of global news content, and to begin to explore
the features of what has become a vast, multi-dimensional communications system. This level
of analysis is needed now more than ever, since the range of communications outlets now
available to people – together with trends towards concentration of ownership (Bagdikian,
2000; McChesney and Nichols, 2010) – makes it increasingly difficult to isolate one media
form. So while it was possible, hitherto, to focus on a dominant communications medium like
television to explore the relationship between media content and public
understanding/opinion, the contemporary media environment makes this more difficult.

The work of George Gerbner and the Cultural Indicators project, for example, tried to isolate
television as an information system (Morgan, 2002). Their very reasonable premise was that
television’s impact could be explored when heavy TV viewers expressed understandings of
the world that matched television’s dominant representations – especially where these
diverged from real world comparisons. While television remains a dominant medium which
may have a profound impact on the stories we use to understand the world (Miller, 2009),
such impacts may now be masked by cross-media ownership patterns combined with the
presence of multiple media outlets. It is distinctly possible, in such a world, that light
television viewers – whether they get their information from a newspaper, a website or a
podcast – are receiving much the same stories as those told by television. In this context,
finding no differences between light and heavy viewers does not mean television has no
influence. It may simply mean that both television and other media outlets are telling similar
stories about the world and that both are equally influential.

Our approach raises the possibility of exploring the whole range of media outlets and
identifying, in all their complexity, moments of similarity and divergence – a cultural
indicators project writ large across a multi-dimensional media world. It also allows us to
systematically explore the genealogy of stories and ideas, tracking their emergence and
passage through different media over time. So, for example, we can explore the relationship
between news outlets and the burgeoning world of blogs to see how ideas travel between
them.

The findings presented here confirm a number of features of media content that might have
been predicted on the basis of theoretical assumptions or smaller scale content studies in the
social sciences, notably the scale and character of gender bias in media coverage and the
readability of certain news outlets and types of stories. At this stage in the development of
automated analysis, we were keen to produce findings that confirmed rather than questioned
assumptions based on earlier research – in part to test the plausibility of our approach.

Even at this exploratory stage, however, our findings throw up some intriguing
patterns alongside the more predictable results, patterns that shed light on debates and raise some
interesting questions. So, for example, the failure of many citizens to engage with political
news (Lewis, 2005) or with environmental problems like climate change may, in part, reflect
the fact that these news topics tend to be written in less readable language than most others.
Indeed, the failure of many citizens (notably those in countries like the US and the UK) to
understand what climate change is or the scale of the scientific consensus and alarm about it
(Lewis and Boyce, 2009) may not simply be a product of efforts to make it appear
controversial. The fact that the least readable topics – the environment and politics – are
precisely the places where climate change is most likely to be discussed may also play a role
in maintaining this confusion. And yet, intriguingly, the environmental news appears to be a
topic people want to read about.

Similarly, the fact that politics and elections are topics notable for their lack of adjectival
excess or subjectivity, while laudable in one sense, may also be a factor in explaining low
levels of political interest and engagement (Lewis et al, 2005), as well as the popularity of
overtly biased outlets like Fox News (Cushion and Lewis, 2010). This may oblige us to
question assumptions that the popularity of more subjective forms of political coverage is a
reflection of political attitudes. So, for example, the popularity of Fox News or right-leaning
tabloid newspapers may be as much a matter of style as of political preference – indeed the
two (in the form of a certain kind of right-wing populism) may have developed a symbiotic
relationship.

We can also begin to separate different linguistic features in ways that allow new kinds of
journalistic writing. It may be possible, for example, for political coverage to strive for
greater levels of readability while retaining low levels of subjectivity: to be popular rather
than populist.

These are early days for this form of analysis, and our findings remain suggestive. The ability
to automatically extract the key actors of the news narrative, generating a network of their
interactions (Sudhahar et al, 2011); or to compare the preferences of readers with those of
editors on a large scale (Hensinger et al, 2012), allows us to begin to develop these findings.
This technology, we would suggest, can complement the skills of human
scholars, allowing the social sciences to be both more ambitious and more comprehensive in
scale.

Acknowledgements

I. Flaounas and N. Cristianini are supported by FP7 Project CompLACS; O. Ali is supported
by a DTA grant; N. Cristianini has been supported by a Royal Society Wolfson Merit Award;
the members of the Intelligent Systems Laboratory are supported by the ‘Pascal2’ Network
of Excellence.

Notes

1. We used the LibSVM implementation of SVMs (Chang, 2011) and the cosine similarity as
a measure of proximity between articles. The C parameter of SVMs was adjusted empirically.
The articles were pre-processed with typical data mining techniques: stop word removal,
stemming (Porter, 1980), and transformation into the TF-IDF space.
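
The pre-processing and proximity measure in Note 1 can be sketched as follows. This is a simplified illustration under several assumptions: the stop word list is a tiny invented one, stemming is omitted, and the weighting is raw term frequency multiplied by log(N/df), which is one common TF-IDF variant and not necessarily the one used in the study.

```python
import math
from collections import Counter

STOP_WORDS = {"the", "a", "of", "in", "and", "to", "on"}  # tiny illustrative list


def tokenise(text):
    """Lower-case, split on whitespace and drop stop words (no stemming here)."""
    return [token for token in text.lower().split() if token not in STOP_WORDS]


def tfidf_vectors(documents):
    """Map each document to a sparse {term: tf * log(N / df)} vector."""
    tokenised = [tokenise(document) for document in documents]
    n = len(tokenised)
    doc_freq = Counter(term for tokens in tokenised for term in set(tokens))
    vectors = []
    for tokens in tokenised:
        term_freq = Counter(tokens)
        vectors.append({term: count * math.log(n / doc_freq[term])
                        for term, count in term_freq.items()})
    return vectors


def cosine_similarity(u, v):
    """Cosine of the angle between two sparse vectors."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(weight * weight for weight in u.values()))
    norm_v = math.sqrt(sum(weight * weight for weight in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


DOCUMENTS = [
    "the striker scored a goal in the match",
    "a late goal decided the match",
    "the bank raised interest rates",
]
vectors = tfidf_vectors(DOCUMENTS)
sports_pair = cosine_similarity(vectors[0], vectors[1])
mixed_pair = cosine_similarity(vectors[0], vectors[2])
```

The two sports sentences share weighted terms and so have a positive similarity, while the sports and finance sentences share none and score zero.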

2. For the subset of UK outlets we defined both a similarity measure between demographic
profiles (as obtained from Newspaper Marketing Agency at http://www.nmauk.co.uk) and a
similarity in stylistic space (spanned by readability and linguistic subjectivity). These two
distances were found to be significantly correlated (36.19% Kendall correlation, p=0.022).
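
The comparison in Note 2 can be sketched by computing the pairwise distances between outlets in each space and correlating the two lists with Kendall's tau. The outlet profiles below are invented, Euclidean distance is an assumed choice of measure, and the significance test is not reproduced; the tau shown is the simple tau-a variant, which does not correct for ties.

```python
import math
from itertools import combinations


def pairwise(points):
    """Euclidean distances for every unordered pair of points, in a fixed order."""
    return [math.dist(a, b) for a, b in combinations(points, 2)]


def kendall_tau(x, y):
    """Kendall's tau-a: (concordant - discordant) / number of pairs."""
    concordant = discordant = 0
    for i, j in combinations(range(len(x)), 2):
        sign = (x[i] - x[j]) * (y[i] - y[j])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    pairs = len(x) * (len(x) - 1) / 2
    return (concordant - discordant) / pairs


# Hypothetical outlets: (readability, subjectivity %) and a two-group
# demographic profile (shares of readers in each group).
STYLE = [(45.0, 30.0), (48.0, 28.0), (38.0, 18.0), (36.0, 17.0)]
DEMOGRAPHICS = [(0.30, 0.70), (0.36, 0.64), (0.70, 0.30), (0.78, 0.22)]

style_distances = pairwise(STYLE)
demographic_distances = pairwise(DEMOGRAPHICS)
tau = kendall_tau(style_distances, demographic_distances)
```

A positive tau indicates that pairs of outlets that are close in stylistic space also tend to be close in demographic space, which is the pattern reported for the UK outlets.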

REFERENCES

ALLAN, STUART (2000) “News Culture”, Milton Keynes: Open University Press.

ALEXANDER, SUE (1994) “Newspaper coverage of athletics as a function of gender”,
Women’s Studies International Forum 17, pp. 655–662.

BACCIANELLA, STEFANO, ESULI, ANDREA and SEBASTIANI, FABRIZIO (2010)
“SentiWordNet 3.0: An enhanced lexical resource for sentiment analysis and opinion
mining”, Seventh conference on International Language Resources and Evaluation 25, pp.
2200–2204.

BAGDIKIAN, BEN H. (2000) “The media monopoly”, Beacon Press.

BISHOP, RONALD (2003) “Missing in action: Feature Coverage of Women’s
Sports in Sports Illustrated”, J. of Sport and Social Issues 27, pp. 184–194.

CARTER, CYNTHIA, BRANSTON, GILL and ALLAN, STUART (1998) “News, Gender
and Power”, London: Routledge.

CHANG, CHIH-CHUNG and LIN, CHIH-JEN (2011) “LIBSVM: a library for support
vector machines”, ACM Transactions on Intelligent Systems and Technology 2(3), pp. 1-27.

CONBOY, MARTIN (2007) “The Language of the News”, London: Routledge.

CONNEL, IAN (1998) “Mistaken Identities: Tabloid and Broadsheet News Discourse”,
Javnost - The Public: Tabloidization and the Media 5(3), pp. 11-31.

CRISTIANINI, NELLO and SHAWE-TAYLOR, JOHN (2000) “An Introduction to Support


Vector Machines and other Kernel-based learning methods”, Cambridge University Press.

CUNNINGHAM, HAMISH, MAYNARD, DIANA, BONTCHEVA, KALINA and TABLAN,
VALENTIN (2002) “GATE: A framework and graphical development environment for robust
NLP tools and applications”, Proc. of the 40th Anniversary Meeting of the Association for
Computational Linguistics, pp. 168–175.

CUSHION, STEPHEN and LEWIS, JUSTIN (2009) “Towards a ‘Foxification’ of 24 hour
news channels in Britain? An analysis of market driven and publicly funded news coverage”,
Journalism: Theory, Practice and Criticism 10(2), pp. 131-153.

EAGLE, NATHAN, PENTLAND, ALEX and LAZER, DAVID (2009) “Inferring friendship
network structure by using mobile phone data”, Proc. of the National Academy of Sciences
106, pp. 15274–15278.

EASTMAN, SUSAN T. and BILLINGS, ANDREW C., (2000) “Sportscasting and Sports
Reporting: The Power of Gender Bias”, J. of Sport and Social Issues 24, pp. 192–213.

ECKMANN, JEAN-PIERRE, MOSES, ELISHA and SERGI, DANILO (2004) “Entropy of
dialogues creates coherent structures in e-mail traffic”, Proc. of the National Academy of
Sciences 101, pp. 14333–14337.

FLAOUNAS, ILIAS, TURCHI, MARCO, ALI, OMAR, FYSON, NICK, DE BIE, TIJL,
MOSDELL, NICK, LEWIS, JUSTIN, and CRISTIANINI, NELLO (2010) “The Structure of
EU Mediasphere” PLoS ONE 5, pp. e14243.

FLAOUNAS, ILIAS, ALI, OMAR, TURCHI, MARCO, SNOWSILL, TRISTAN, NICART,
FLORENT, DE BIE, TIJL, and CRISTIANINI, NELLO (2011) “NOAM: News Outlets
Analysis and Monitoring System”, Proceedings of the 2011 ACM SIGMOD International
Conference on Management of Data, pp. 1275–1278.

FLESCH, RUDOLPH (1948) “A New Readability Yardstick”, Journal of Applied Psychology
32, pp. 221–233.

GILENS, MARTIN, and HERTZMAN, CRAIG (2000) “Corporate Ownership and News
Bias: Newspaper Coverage of the 1996 Telecommunications Act”, The Journal of Politics 62,
pp. 369–386.

HARRISON, JOHN (2006) News, London: Routledge.

HELDMAN, CAROLINE, CARROLL, SUSAN J., and OLSON, STEPHANIE (2005). “'She
Brought Only a Skirt': Print Media Coverage of Elizabeth Dole's Bid for the Republican
Presidential Nomination”, Political Communication 22, pp. 315–335.

JACOBS, RONALD N. (1996) “Producing the News, Producing the Crisis: Narrativity,
Television and News Work”, Media, Culture & Society 18, pp. 373-397.

LAZER, DAVID, PENTLAND, ALEX, ADAMIC, LADA, ARAL, SINAN, BARABASI,
ALBERT-LASZLO, BREWER, DEVON, CHRISTAKIS, NICHOLAS, CONTRACTOR,
NOSHIR, FOWLER, JAMES, GUTMANN, MYRON, et al. (2009) “Computational
Social Science”, Science 323, pp. 721-723.

LEN-RIOS, MARIA E., RODGERS, SHELLY, THORSON, ESTHER, and YOON, DOYLE
(2006) “Representation of women in news and photos: Comparing content to perceptions”
Journal of Communication 55, pp. 152–168.

PANG, BO, and LEE, LILLIAN (2008) “Opinion mining and sentiment analysis”,
Foundations and Trends in Information Retrieval 2, pp. 1–135.

FRANKLIN, BOB (1997) “Newszak and News Media”. London: Hodder Arnold.

HENSINGER, ELENA, FLAOUNAS, ILIAS, and CRISTIANINI, NELLO (2012) “What
makes us click? Modelling and Predicting the Appeal of News Articles”, Proc. of
International Conference on Pattern Recognition Applications and Methods, pp. 41-50.

LEWIS, DAVID D., YANG, YIMING, ROSE, TONY G., and LI, FAN (2004) “RCV1: A
New Benchmark Collection for Text Categorization Research”, Journal of Machine Learning
Research 5, pp. 361–397.

LEWIS, JUSTIN, CUSHION, STEPHEN and THOMAS, JAMES (2005) “Immediacy,
Convenience or Engagement? An analysis of 24-hour news channels in the UK”, Journalism
Studies 6(4), pp. 461-478.

LEWIS, JUSTIN, INTHORN, SANNA and WAHL-JORGENSEN, KARIN (2005) “Citizens
or Consumers: The Media and the Decline of Political Participation”, Milton Keynes: Open
University Press.

LEWIS, JUSTIN and BOYCE, TAMMY (2009) “Climate Change and the Media: The Scale
of the Challenge”, Climate Change and the Media, New York: Peter Lang.

LEWIS, JUSTIN and CUSHION, STEPHEN (2009) “The thirst to be first: an analysis of
breaking news stories and their impact on the quality of 24-hour news coverage in the UK”,
Journalism Practice 3(3).

LIU, BING (2007) “Web Data Mining, Exploring Hyperlinks, Contents, and Usage Data”
Springer.

LLOYD, LEVON, KECHAGIAS, DIMITRIOS, and SKIENA, STEVEN (2005) “Lydia: A
system for large-scale news analysis”, String Processing and Information Retrieval,
pp. 161-166.

MANNING, CHRISTOPHER, and SCHUTZE, HINRICH (1999) “Foundations of Statistical
Natural Language Processing”, MIT Press, Cambridge Mass.

McCHESNEY, ROBERT W., and NICHOLS, JOHN. (2010). The Death and Life of
American Journalism: The Media Revolution that Will Begin the World Again, Nation Books.

MICHEL, JEAN-BAPTISTE, SHEN, YUAN K., AIDEN, AVIVA P., VERES, ADRIAN,
GRAY, MATTHEW K., et al. (2011) “Quantitative Analysis of Culture Using Millions of
Digitized Books”, Science 331, pp. 176–182.

MILLER, TOBY (2009) “Television Studies: The Basics” New York: Routledge.

MORGAN, MICHAEL (ed.) (2002) “Against the Mainstream: The Selected Works of
George Gerbner”, New York: Peter Lang Publishing.

PORTER, MARTIN F. (1980) An Algorithm for Suffix Stripping, Program, 14, pp. 130–137.

ROUSH, CHRIS (2006) The Need for More Business Education in Mass Communication
Schools. Journalism and Mass Communication Educator 61(2), pp. 196-204.

SANDHAUS, EVAN (2008) “The New York Times Annotated Corpus”, The New York
Times Company, Research and Development.

SHAWE-TAYLOR, JOHN, and CRISTIANINI, NELLO (2004) “Kernel Methods for Pattern
Analysis”, Cambridge University Press.

STEINBERGER, RALF, POULIQUEN, BRUNO, and VAN DER GOOT, ERIK (2009) “An
introduction to the Europe Media Monitor family of applications”, Information Access in a
Multilingual World - Proceedings of the SIGIR, pp. 1-8.

SUDHAHAR, SAATVIGA, FRANZOSI, ROBERTO and CRISTIANINI, NELLO (2011)
“Automating Quantitative Narrative Analysis of News Data”, Proceedings of the Journal of
Machine Learning Research in Conjunction with the Second Workshop on Applications of
Pattern Analysis, pp. 63-71.

SVENNEVIG, MICHAEL (2007) “BBC Coverage of Business in the UK: A Content
Analysis of Business News Coverage”, London: BBC Trust.

SZELL, MICHAEL, LAMBIOTTE, RENAUD and THURNER, STEFAN (2010) “Multirelational
organization of large-scale social networks in an online world”, Proceedings of the National
Academy of Sciences 107, pp. 13636–13641.

TOUTANOVA, KRISTINA, KLEIN, DAN, MANNING, CHRISTOPHER and SINGER,
YORAM (2003) “Feature-rich part-of-speech tagging with a cyclic dependency network”,
Proc. of HLT-NAACL, pp. 252-259.

WATTS, DUNCAN (2007) “A twenty-first century science”, Nature 445, pp. 489.

WILLIAMS, KEVIN (1998) “Get Me a Murder a Day! A history of mass communications in
Britain”, London: Arnold.

Figure 1. Comparison of topics based on their Readability.

Figure 2. Comparison of topics based on their Linguistic Subjectivity.



Figure 3. Comparison of topics based on their writing style.

Figure 4. Comparison of topics based on their male/female ratio.



Figure 5. Comparison of topics based on their popularity.

Figure 6. Comparison of a selection of US and UK outlets based on the topics they choose to
cover.

Figure 7. Comparison of a selection of US and UK outlets based on their writing style.

Figure 8. Comparison of a selection of US and UK outlets based on their Male/Female ratio.
