
1. INTRODUCTION

1.1 INTRODUCTION TO FAKE NEWS DETECTION

As an increasing amount of our lives is spent interacting online through social media platforms, more and more people tend to seek out and consume news from social media rather than from traditional news organizations. The reasons for this change in consumption behavior are inherent in the nature of these social media platforms:
(i) it is often more timely and less expensive to consume news on social media than through traditional news media, such as newspapers or television; and
(ii) it is easier to share, comment on, and discuss the news with friends or other readers on social media. For example, 62 percent of U.S. adults got news on social media in 2016, while in 2012 only 49 percent reported seeing news on social media. It was also found that social media now outperforms television as the major news source. Despite the advantages provided by social media, the quality of news on social media is lower than that of traditional news organizations. Moreover, because it is cheap to provide news online and much faster and easier to disseminate through social media, large volumes of fake news, i.e., news articles with intentionally false information, are produced online for a variety of purposes, such as financial and political gain. It was estimated that over 1 million tweets were related to the fake news story "Pizzagate" by the end of the 2016 presidential election. Given the prevalence of this new phenomenon, "fake news" was even named word of the year by the Macquarie Dictionary in 2016.

The extensive spread of fake news can have serious negative impacts on individuals and society. First, fake news can break the authenticity balance of the news ecosystem. For example, it is evident that the most popular fake news was even more widely spread on Facebook than the most popular authentic mainstream news during the 2016 U.S. presidential election. Second, fake news intentionally persuades consumers to accept biased or false beliefs. Fake news is usually manipulated by propagandists to convey political messages or influence. For example, some reports show that Russia created fake accounts and social bots to spread false stories. Third, fake news changes the way people interpret and respond to real news. For example, some fake news is created simply to trigger people's distrust and confuse them, impeding their ability to differentiate what is true from what is not. To help mitigate these negative effects, both for the benefit of the public and of the news ecosystem, it is critical that we develop methods to automatically detect fake news on social media.

Detecting fake news on social media poses several new and challenging research problems. Though fake news itself is not a new problem, as nations and groups have used the news media to execute propaganda or influence operations for centuries, the rise of web-generated news on social media makes fake news a more powerful force that challenges traditional journalistic norms. Several characteristics of this problem make it uniquely challenging for automated detection. First, fake news is intentionally written to mislead readers, which makes it nontrivial to detect based on news content alone. The content of fake news is rather diverse in terms of topics, styles, and media platforms, and fake news attempts to distort truth with diverse linguistic styles while simultaneously mocking true news. For example, fake news may cite true evidence within an incorrect context to support a non-factual claim. Thus, existing hand-crafted and data-specific textual features are generally not sufficient for fake news detection, and other auxiliary information, such as knowledge bases and user social engagements, must also be applied to improve detection. Second, exploiting this auxiliary information leads to another critical challenge: the quality of the data itself. Fake news is usually related to newly emerging, time-critical events, which may not have been properly verified by existing knowledge bases due to the lack of corroborating evidence or claims. In addition, users' social engagements with fake news produce data that is big, incomplete, unstructured, and noisy. Effective methods to differentiate credible users, extract useful post features, and exploit network interactions are an open area of research that needs further investigation.
In this article, we present an overview of fake news detection and discuss
promising research directions. The key motivations of this survey are
summarized as follows:
• Fake news on social media has been occurring for several years; however, there is no agreed-upon definition of the term "fake news". To better guide future directions of fake news detection research, appropriate clarifications are necessary.

• Social media has proved to be a powerful source of fake news dissemination. There are emerging patterns that can be utilized for fake news detection on social media. A review of existing fake news detection methods under various social media scenarios can provide a basic understanding of the state-of-the-art fake news detection methods.

• Fake news detection on social media is still at an early stage of development, and there are many challenging issues that need further investigation. It is necessary to discuss potential research directions that can improve fake news detection and mitigation capabilities.

To facilitate research in fake news detection on social media, in this survey we will review two aspects of the fake news detection problem: characterization and detection. As shown in Figure, we will first describe the background of the fake news detection problem using theories and properties from psychology and social studies; then we present the detection approaches. The major contributions of this survey are summarized as follows:

• We discuss the narrow and broad definitions of fake news that cover most existing definitions in the literature, and further present the unique characteristics of fake news on social media and its implications compared with traditional media;

• We give an overview of existing fake news detection methods, with a principled way to group representative methods into different categories; and

• We discuss several open issues and provide future directions for fake news detection on social media.

The remainder of this survey is organized as follows. In Section, we present the definition of fake news and characterize it by comparing different theories and properties in both traditional and social media. In Section, we formally define the fake news detection problem and summarize the methods to detect fake news. In Section, we discuss the datasets and evaluation metrics used by existing methods. We briefly introduce areas related to fake news detection on social media in Section. Finally, we discuss the open issues and future directions in Section and conclude this survey in Section.

1.1.1 HISTORY OF FAKE NEWS DETECTION

Giant man-bats that spent their days collecting fruit and holding animated
conversations; goat-like creatures with blue skin; a temple made of polished sapphire.
These were the astonishing sights witnessed by John Herschel, an eminent British
astronomer, when, in 1835, he pointed a powerful telescope “of vast dimensions” towards
the Moon from an observatory in South Africa. Or that, at least, was what readers of the New
York Sun were told in a series of newspaper reports.

This caused a sensation. People flocked to buy each day’s edition of the Sun. The
paper’s circulation shot up from 8,000 to over 19,000 copies, overtaking the Times of
London to become the world’s bestselling daily newspaper. There was just one small hitch.
The fantastical reports had in fact been concocted by Richard Adams Locke, the Sun’s
editor. Herschel was conducting genuine astronomical observations in South Africa. But
Locke knew it would take months for his deception to be revealed, because the only means
of communication with the Cape was by letter. The whole thing was a giant hoax – or, as
we would say today, “fake news”. This classic of the genre illuminates the pros and cons of
fake news as a commercial strategy – and helps explain why it has re-emerged in the
internet era.
That fake news shifted copies had been known since the earliest days of printing. In the
16th and 17th centuries, printers would crank out pamphlets, or newsbooks, offering
detailed accounts of monstrous beasts or unusual occurrences. A newsbook published in
Catalonia in 1654 reports the discovery of a monster with “goat’s legs, a human body,
seven arms and seven heads”; an English pamphlet from 1611 tells of a Dutch woman who
lived for 14 years without eating or drinking. So what if they weren’t true? Printers argued,
as internet giants do today, that they were merely providing a means of distribution, and
were not responsible for ensuring accuracy.

But newspapers were different. They contained a bundle of different stories, not just
one, and appeared regularly under a consistent title. They therefore had reputations to
maintain. The Sun, founded in 1833, was the first modern newspaper, funded primarily by
advertisers rather than subscriptions, so it initially pursued readership at all costs. At first it
prospered from the Moon hoax, even collecting its reports in a bestselling pamphlet. But it
was soon exposed by rival papers. Editors also realised that an infinite supply of genuine
human drama could be found by sending reporters to the courts and police stations to write
true-crime stories – a far more sustainable model. As the 19th century progressed,
impartiality and objectivity were increasingly venerated at the most prestigious newspapers.

But in recent years search engines and social media have blown apart newspapers’
bundles of stories. Facebook shows an endless stream of items from all over the web. Click
an interesting headline and you may end up on a fake-news site, set up by a political
propagandist or a teenager in Macedonia to attract traffic and generate advertising revenue.
Peddlers of fake stories have no reputation to maintain and no incentive to stay honest; they
are only interested in the clicks. Hence the bogus stories, among the most popular of 2016,
that the pope had endorsed Donald Trump, or that Hillary Clinton had sold weapons to

Islamic State. The impetus behind these was commercial rather than political; it transpired
that Trump supporters were more likely to click and share bogus stories.

Thanks to internet distribution, fake news is again a profitable business. This flowering of fabricated stories corrodes trust in the media in general, and makes it easier for unscrupulous politicians to peddle half-truths. Media organisations and technology companies are struggling to determine how best to respond. Perhaps more overt fact-checking or improved media literacy will help. But what is clear is that a mechanism that held fake news in check for nearly two centuries (the bundle of stories from an organisation with a reputation to protect) no longer works. We will need to invent new ones.

1.1.2 DEFINITION OF FAKE NEWS

Fake news has existed for a very long time, nearly as long as news has circulated widely, ever since the printing press was invented in 1439. However, there is no agreed definition of the term "fake news". Therefore, we first discuss and compare some widely used definitions of fake news in the existing literature, and provide the definition of fake news that will be used for the remainder of this survey. A narrow definition of fake news is news articles that are intentionally and verifiably false and could mislead readers. There are two key features of this definition: authenticity and intent. First, fake news includes false information that can be verified as such. Second, fake news is created with dishonest intention to mislead consumers. This definition has been widely adopted in recent studies. Broader definitions of fake news focus on either the authenticity or the intent of the news content. Some papers regard satire news as fake news since its contents are false, even though satire is often entertainment-oriented and reveals its own deceptiveness to consumers. Other literature directly treats deceptive news as fake news, which includes serious fabrications, hoaxes, and satires.

In this article, we use the narrow definition of fake news. Formally, we state this definition as follows:

Definition 1 (Fake News): Fake news is a news article that is intentionally and verifiably false.

The reasons for choosing this narrow definition are threefold. First, the underlying intent of fake news provides both theoretical and practical value that enables a deeper understanding and analysis of this topic. Second, any techniques for truth verification that apply to the narrow conception of fake news can also be applied under the broader definition. Third, this definition is able to eliminate the ambiguities between fake news and related concepts that are not considered in this article. The following concepts are not fake news according to our definition:
(1) satire news with proper context, which has no intent to mislead or deceive consumers and is unlikely to be misperceived as factual;
(2) rumors that did not originate from news events;
(3) conspiracy theories, which are difficult to verify as true or false;
(4) misinformation that is created unintentionally;
(5) hoaxes that are only motivated by fun or by the aim to scam targeted individuals.

1.1.3 FAKE NEWS ON TRADITIONAL NEWS MEDIA

Fake news itself is not a new problem. The media ecology of fake news has been changing over time, from newsprint to radio and television and, recently, to online news and social media. We denote "traditional fake news" as the fake news problem before social media had important effects on its production and dissemination. Next, we will describe several psychological and social science foundations that describe the impact of fake news at both the individual and the social information ecosystem levels.

Psychological Foundations of Fake News. Humans are naturally not very good at differentiating between real and fake news. There are several psychological and cognitive theories that can explain this phenomenon and the influential power of fake news. Traditional fake news mainly targets consumers by exploiting their individual vulnerabilities. There are two major factors which make consumers naturally vulnerable to fake news: (i) Naïve Realism: consumers tend to believe that their perceptions of reality are the only accurate views, while others who disagree are regarded as uninformed, irrational, or biased; and (ii) Confirmation Bias: consumers prefer to receive information that confirms their existing views. Due to these cognitive biases inherent in human nature, fake news can often be perceived as real by consumers. Moreover, once a misperception is formed, it is very hard to correct. Psychological studies show that correcting false information (e.g., fake news) by presenting true, factual information is not only unhelpful in reducing misperceptions, but may sometimes even increase them, especially among ideological groups.

Social Foundations of the Fake News Ecosystem. Considering the entire news consumption ecosystem, we can also describe some of the social dynamics that contribute to the proliferation of fake news. Prospect theory describes decision making as a process by which people make choices based on the relative gains and losses compared to their current state. This desire to maximize the reward of a decision applies to social gains as well, for instance, continued acceptance by others in a user's immediate social network. As described by social identity theory and normative influence theory, this preference for social acceptance and affirmation is essential to a person's identity and self-esteem, making users likely to choose "socially safe" options when consuming and disseminating news information, following the norms established in the community even if the news being shared is fake news.

This rational theory of fake news interactions can be modeled from an economic game-theoretic perspective by formulating the news generation and consumption cycle as a two-player strategy game. For explaining fake news, we assume there are two kinds of key players in the information ecosystem: publishers and consumers. The process of news publishing is modeled as a mapping from an original signal s to a resultant news report a with an effect of distortion bias b, written s →(b) a, where b ∈ {−1, 0, 1} indicates whether a [left, no, right] bias takes effect in the news publishing process. Intuitively, this captures the degree to which a news article may be biased or distorted to produce fake news. The utility for the publisher stems from two perspectives:
(i) short-term utility: the incentive to maximize profit, which is positively correlated with the number of consumers reached;
(ii) long-term utility: their reputation in terms of news authenticity. The utility of consumers consists of two parts:
(i) information utility: obtaining true and unbiased information (usually requiring extra investment cost);
(ii) psychology utility: receiving news that satisfies their prior opinions and social needs, e.g., confirmation bias and prospect theory. Both publisher and consumer try to maximize their overall utilities in this strategy game of the news consumption process. Fake news happens when the short-term utility dominates a publisher's overall utility, psychology utility dominates the consumer's overall utility, and an equilibrium is maintained. This explains the social dynamics that lead to an information ecosystem in which fake news can thrive.
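To make this utility framing concrete, the following toy sketch shows the condition under which publishing fake news becomes the publisher's better response; all weights and payoff values are illustrative assumptions, not part of the original model:

```python
# Toy sketch of the publisher/consumer utility model described above.
# All weights and payoff values below are illustrative assumptions.

def publisher_utility(reach, reputation, w_short=0.8):
    """Overall utility = weighted short-term utility (reach/profit)
    plus long-term utility (reputation for authenticity)."""
    return w_short * reach + (1 - w_short) * reputation

def consumer_utility(information, psychology, w_psych=0.7):
    """Overall utility = weighted information utility plus psychology utility."""
    return (1 - w_psych) * information + w_psych * psychology

# Fake news thrives when short-term utility dominates the publisher's side:
fake_pub = publisher_utility(reach=1.0, reputation=0.0)  # sensational, false story
true_pub = publisher_utility(reach=0.4, reputation=1.0)  # factual reporting
print(fake_pub > true_pub)  # True: fake news is the better response here
```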

1.1.4 FAKE NEWS ON SOCIAL MEDIA

In this subsection, we will discuss some unique characteristics of fake news on social media. Specifically, we will highlight the key features of fake news that are enabled by social media. Note that the aforementioned characteristics of traditional fake news also apply to social media.

Malicious Accounts on Social Media for Propaganda. While many users on social media are legitimate, some social media users may be malicious, and in some cases are not even real humans. The low cost of creating social media accounts also encourages malicious user accounts, such as social bots, cyborg users, and trolls. A social bot refers to a social media account that is controlled by a computer algorithm to automatically produce content and interact with humans (or other bot users) on social media. Social bots can become malicious entities designed specifically with the purpose to do harm, such as manipulating and spreading fake news on social media. Studies show that social bots distorted the 2016 U.S. presidential election online discussions on a large scale, and that around 19 million bot accounts tweeted in support of either Trump or Clinton in the week leading up to election day. Trolls, real human users who aim to disrupt online communities and provoke consumers into an emotional response, also play an important role in spreading fake news on social media. For example, evidence suggests that there were 1,000 paid Russian trolls spreading fake news about Hillary Clinton. Trolling behaviors are highly affected by people's mood and the context of online discussions, which enables the easy dissemination of fake news among otherwise "normal" online communities. The effect of trolling is to trigger people's inner negative emotions, such as anger and fear, resulting in doubt, distrust, and irrational behavior. Finally, cyborg users can spread fake news in a way that blends automated activities with human input. Usually, cyborg accounts are registered by humans as a camouflage and set up automated programs to perform activities on social media. The easy switch of functionalities between human and bot offers cyborg users unique opportunities to spread fake news. In a nutshell, these highly active and partisan malicious accounts on social media become powerful sources for the production and proliferation of fake news.

Echo Chamber Effect. Social media provides a new paradigm of information creation and consumption for users. The information seeking and consumption process is changing from a mediated form (e.g., by journalists) to a more disintermediated way. Consumers are selectively exposed to certain kinds of news because of the way news feeds appear on their homepages in social media, amplifying the psychological challenges to dispelling fake news identified above. For example, users on Facebook often follow like-minded people and thus receive news that promotes their favored existing narratives. Therefore, users on social media tend to form groups of like-minded people in which they polarize their opinions, resulting in an echo chamber effect. The echo chamber effect facilitates the process by which people consume and believe fake news due to the following psychological factors:
(1) social credibility, which means people are more likely to perceive a source as credible if others perceive it as credible, especially when there is not enough information available to assess the truthfulness of the source; and
(2) frequency heuristic, which means that consumers may naturally favor information they hear frequently, even if it is fake news. Studies have shown that increased exposure to an idea is enough to generate a positive opinion of it, and in echo chambers, users continue to share and consume the same information. As a result, this echo chamber effect creates segmented, homogeneous communities with a very limited information ecosystem. Research shows that homogeneous communities become the primary driver of information diffusion, which further strengthens polarization.

1.1.5 DETECTING FAKE NEWS ONLINE

Fake news has become increasingly prevalent over the last few years, with over 100 incorrect articles and rumors spread incessantly with regard to the 2016 United States presidential election alone. These fake news articles tend to come from satirical news websites or individual websites with an incentive to propagate false information, either as clickbait or to serve a purpose. Since they typically hope to intentionally promote incorrect information, such articles are quite difficult to detect. When identifying a source of information, one must look at many attributes, including but not limited to the content of the article and its social media engagements. Specifically, the language is typically more inflammatory in fake news than in real articles, in part because the purpose is to confuse and generate clicks. Furthermore, modeling techniques such as n-gram encodings and bag of words have served as other linguistic techniques to determine the legitimacy of a news source. On top of that, researchers have determined that visual-based cues also play a factor in categorizing an article; specifically, some features can be designed to assess whether a picture is legitimate and provides more clarity on the news. There are also many social context features that can play a role, as well as the pattern by which the news spreads. Websites such as Snopes try to detect this information manually, while certain universities are trying to build mathematical models to do this themselves.

1.1.6 FAKE NEWS DETECTION IN INDIA

Fake news in India has led to episodes of violence between castes and religions,
interfering with public policies. It often spreads through the smartphone instant
messenger WhatsApp, which had 200 million monthly active users in the country as of
February 2017.

On November 8, 2016, India introduced a 2,000-rupee currency note on the same day as the Indian 500- and 1,000-rupee note demonetization. Fake news went viral over WhatsApp that the note came equipped with spying technology that could track bills 120 meters below the earth. Finance Minister Arun Jaitley refuted the falsities, but not before they had spread to the country's mainstream news outlets. Later, in May 2017, seven people were lynched in a village as rumors of child abductions spread through WhatsApp.

Prabhakar Kumar of the Indian media research agency CMS told The Guardian that India was hit harder by fake news because the country lacked a media policy for verification. Law enforcement officers in India have arrested individuals on charges of creating fictitious articles, predominantly where there was a likelihood that the articles would inflame societal conflict.

In April 2018, the Information and Broadcasting Ministry said the government
would cancel the accreditation of journalists found to be sharing fake news, but this was
quickly retracted after criticism that this was an attack on freedom of the press.

In June 2018, mobs murdered a government employee, Sukanta Chakraborty, who was fighting against false news and rumors, as well as two other unrelated people. More people were severely injured. The local government temporarily shut down mobile Internet and texting services.

To tackle the menace of fake news in Kashmir, Amir Ali Shah, a youth from south Kashmir's Anantnag district, has developed a website called "Stop Fake in Kashmir" where news and facts can be verified. The website is the first of its kind developed in the Kashmir valley.

1.2 BACKGROUND OF FAKE NEWS DETECTION

The Internet gave everyone the opportunity to enter the online news business, and many readers had already rejected the traditional news sources that had once enjoyed a high level of public trust and credibility. According to one survey, general trust in mass media has collapsed to its lowest point in the history of the business, especially on the political right: 51% of Democrats but only 14% of Republicans in the USA expressed a great deal of trust in mass media as a news source.

It has come to be known that information that is repeated is more likely to be rated true than information that has not been heard before; familiarity with false news increases its perceived truthfulness. Moreover, it does not stop there: repeated false stories can even create false memories. The authors who first observed the "illusory-truth effect" reported that subjects rated repeated statements as truer than new statements. They presented a case study showing that participants who had read false news or stories over five consecutive weeks believed the false stories to be more truthful and more plausible than did participants who had not been exposed to them.

News can be perceived as true if the information it expresses is familiar. Familiarity is an automatic consequence of exposure, so it influences perceived truth, and this influence is entirely unintentional. Even in cases where the source or the agency that circulated a story warned that the source might not be credible, people did not stop believing the story, due to its familiarity. Another study presented experiments in which half of the statements shown were true and half were false; the results showed that participants rated repeated statements, even false ones, as truer than statements they heard for the first time (Bacon et al. 1979). Source monitoring is the ability to check and identify the origin of the news we read, and some studies clearly indicate that participants use familiarity to infer the source of their memory. Another study proposed that general knowledge and semantic memory do not depend on the conditions of learning, but only help a person recall when and where the information was learned. Similarly, a person may have some knowledge about an event but not remember the event itself, in which case the knowledge comes from semantic memory.

State-of-the-art semantic models do an excellent job of detecting semantic similarity, for example by calculating the cosine similarity between two word vectors. Such a model will be able to tell that cappuccino, espresso, and americano are similar to each other. However, it cannot predict the semantic differences between words. If you can tell that an americano is similar to a cappuccino and an espresso but you cannot tell the difference between them, you do not know what an americano is. As a consequence, any semantic model that is only good at similarity detection will be of limited practical use.
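As a concrete illustration of the similarity computation mentioned above, the sketch below computes cosine similarity between two word vectors; the three-dimensional vectors are made-up toy values, since real embeddings have hundreds of dimensions:

```python
import numpy as np

def cosine_similarity(u, v):
    """Cosine of the angle between vectors u and v (1.0 = same direction)."""
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

# Toy 3-dimensional "embeddings" -- invented values for illustration only.
cappuccino = np.array([0.9, 0.8, 0.1])
espresso = np.array([0.8, 0.9, 0.2])

print(round(cosine_similarity(cappuccino, espresso), 2))  # ~0.99, very similar
```

As the surrounding text argues, a high score here only says the two words occur in similar contexts; it says nothing about how an americano differs from an espresso.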

1.3 CONCEPTUAL FRAMEWORK

The conceptual framework of this research is presented in Figure 1. This framework combines the entire procedure required for designing a system suitable for detecting fake news, specifically for combating its spread. The framework also aids the explanation of who created the fake news and how it was spread, as well as the algorithm to detect and filter online fake news. It presents a general outline suitable for designing a detection system. Both the sources of online fake news and the detection that a news item is fake, as well as the filtering of online fake news, can be addressed by this framework. The structure of the framework, as presented in Figure 1, comprises five sections, namely: an outline for detecting sources of online fake news; the designed technique for use in detection; the verification of source node(s); the determination of the status of the news; and the filtering of the news.

The first part of the framework concerns the approach for determining fake news sources. There is no established network of fake news sources; for that reason, the creation of fake news can be seen as initiated by entities, usually unreliable ones. Due to technical constraints, most of these unreliable entities are difficult to identify. Therefore, barring these entities from further spreading fake news is challenging to implement. The entity combines falsehood and truth to produce the fake news. This news can be published using a fake headline and true content, a true headline and fake content, or a combination of fake and true headline and content. The fake article produced will be published online using any website hosted by the entity. In most cases, the website hosted by the entity looks similar to a reputable, authentic, and reliable website. The links from these websites or articles are shared on microblogging sites because the news is considered credible by some parties, who share this so-called credible news to provide information to other microblogging site users. Hence, the path followed is outlined and then drafted.

Following this path, the node identifiers are tagged. This is the first point where a verification check is required. As a result, a verification function for the legitimacy of the node(s), mostly the domain name or IP address, is set. Furthermore, the entire work of the function in this verification step is given a name; a specific name like "the verification of IP address" is most appropriate. In the examination of node(s), network analyzer tools like the Wireshark application are used as a preliminary tool. Although there are online tools that can be used to check the information regarding an entered IP address or domain name, in this framework a function suitable for direct evaluation of the domain and IP address is required to examine the patterns of sources that constitute fake news. Thus, examining the source identifier is a lead for determining fake news: for instance, if the IP address is constantly changing, this occurrence is known as a DNS hijack, and the result will be an invalid IP address. If the IP address is static and correct, the system will report a valid IP address; otherwise, the result will be invalid.

In the next phase, the proposed detection evaluates the content of information from the source that makes up a piece of fake news. A function is set whose responsibility is to check the source node's content, which comprises the article, the title of the article, the author of the article, and the background information of the article. This check is appended with the validity guides and guidelines. A dictionary of those guides is provided in a database, and access to the database is open. For each piece of information required to be analyzed from a source, a check function against the database should be invoked. If it is found in the database, the system returns "authentic source". If it is not found in the database, the system returns "ambiguous" and saves the entry in a different database where it will be analyzed by the verification team later. Next, the title of the article is analyzed: the topic is automatically searched online. If the title is found, a true value is returned and the output is ensured to be from a legitimate website. If the title is not found, the title is sent to the verification team for confirmation and a false value is returned.

The content of the article is manually checked by the validation team, where validation is done by supporting the claims. If the content is right, a true value is returned for the content; otherwise, false is returned. On the article, if the author's name is available, the database is checked. If the author is found in the database, the system returns a true value for the author. If the author's name is not available on the article or is not found in the database, false is returned. If the website value is true, the title value is true, the content value is true, and the author value is true, the article is verified; otherwise, the article is considered ambiguous.
The next aspect of detection is how to determine the status of the news. This is based on the result of the verification of the source and the validity of the content. If the result from source verification is legitimate and the content is valid, then the status of the news is "verified". If the source of the news is ambiguous and the content of the news is valid, then the news is still usable, and its status is "claim". If the source of the news is verified and the content of the news is invalid, the status of the news is labelled "fake" or "claim". If the source of the news is ambiguous and the content is invalid, the news is labelled "fake". Various unforeseen events are possible, but in general a matrix of source node and content of news is the backbone of this proposed conceptual framework.
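A minimal sketch of that source/content decision matrix follows; the function and label names are my own, and the last case reflects my reading of the (ambiguous) original description:

```python
def news_status(source_verified: bool, content_valid: bool) -> str:
    """Decision matrix combining source verification and content validation,
    following the framework described above."""
    if source_verified and content_valid:
        return "verified"
    if not source_verified and content_valid:
        return "claim"          # ambiguous source, valid content
    if source_verified and not content_valid:
        return "fake or claim"  # verified source, invalid content
    return "fake"               # ambiguous source, invalid content

print(news_status(True, True))    # verified
print(news_status(False, False))  # fake
```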

Finally, after labelling a news item as either fake or reliable via the status label, a tag is appended for each round displaying the status of what has been requested by the user. These tags are then filtered, and the sequence of the news is determined. If the status is verified, the news is labelled "verified" by the tag and placed at the top of the newsfeed of microblogging sites. If the news is considered a claim, then the news will be labelled "claim" and placed below all the verified news. If the news is marked as fake news, it will be labelled "fake" and placed below the list of claim news.

Figure 1: The conceptual framework

1.4 STATEMENT OF THE PROBLEM

1.4.1 PROBLEM DEFINITION

In this subsection, we present the details of the mathematical formulation of fake news detection on social media. Specifically, we will introduce the definitions of the key components of fake news and then present the formal definition of fake news detection. The basic notations are defined below:

• Let a refer to a news article. It consists of two major components: publisher and content. The publisher p_a includes a set of profile features describing the original author, such as name, domain, and age, among other attributes. The content c_a consists of a set of attributes that represent the news article, including the headline, text, images, etc.

• We also define social news engagements as a set of tuples E = {e_it} to represent the process of how news spreads over time among n users U = {u_1, u_2, ..., u_n} and their corresponding posts P = {p_1, p_2, ..., p_n} on social media regarding news article a. Each engagement e_it = {u_i, p_i, t} represents a user u_i spreading news article a using post p_i at time t. Note that we set t = Null if the article a does not have any engagements yet, in which case u_i represents the publisher.

Definition 2 (Fake News Detection): Given the social news engagements E among n users for news article a, the task of fake news detection is to predict whether the news article a is a piece of fake news or not, i.e., to learn a prediction function F : E → {0, 1} such that

F(a) = 1 if a is a piece of fake news, and F(a) = 0 otherwise,

where F is the prediction function we want to learn. Note that we define fake news detection as a binary classification problem for the following reason: fake news is essentially a distortion bias introduced into information manipulated by the publisher, and according to previous research on media bias theory, distortion bias is usually modeled as a binary classification problem.

Next, we propose a general data mining framework for fake news detection which includes two phases:
(i) feature extraction and
(ii) model construction.
The feature extraction phase aims to represent news content and related auxiliary information in a formal mathematical structure, and the model construction phase further builds machine learning models to better differentiate fake news from real news based on the feature representations.
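A minimal end-to-end sketch of these two phases is shown below; the TF-IDF features, the logistic regression classifier, and the toy corpus are placeholder assumptions, since the framework itself prescribes no particular features or model:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy labeled corpus: article text -> label (1 = fake, 0 = real).
articles = [
    "Shocking! Miracle cure hidden by doctors",
    "City council approves new budget for road repairs",
    "Celebrity secretly replaced by clone, insiders say",
    "Local school wins regional science competition",
]
labels = [1, 0, 1, 0]

# Phase (i): feature extraction (TF-IDF); phase (ii): model construction.
F = make_pipeline(TfidfVectorizer(), LogisticRegression())
F.fit(articles, labels)

# The learned F maps an article to {0, 1}, as in Definition 2.
print(F.predict(["Aliens endorse presidential candidate, sources claim"]))
```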

Feature Extraction:
Fake news detection on traditional news media mainly relies on news content, while on social media, extra social context can be used as auxiliary information to help detect fake news. Thus, we will present the details of how to extract and represent useful features from both news content and social context.

News Content Features
News content features c_a describe the meta information related to a piece of news. A list of representative news content attributes is given below:
• Source: the author or publisher of the news article.
• Headline: short title text that aims to catch the attention of readers and describes the main topic of the article.
• Body Text: the main text that elaborates the details of the news story; there is usually a major claim that is specifically highlighted and that shapes the angle of the publisher.
• Image/Video: part of the body content of a news article that provides visual cues to frame the story.
Based on these raw content attributes, different kinds of feature representations can be built to extract the discriminative characteristics of fake news. Typically, the news content we are looking at will mostly be linguistic-based and visual-based, described in more detail below.

Linguistic-based: Since fake news pieces are intentionally created for financial or political gain rather than to report objective claims, they often contain opinionated and inflammatory language, crafted as "clickbait" (i.e., to entice users to click on the link to read the full article) or to incite confusion. Thus, it is reasonable to exploit linguistic features that capture the different writing styles and sensational headlines to detect fake news. Linguistic-based features are extracted from the text content in terms of document organization at different levels, such as characters, words, sentences, and documents. In order to capture the different aspects of fake news and real news, existing work has utilized both common linguistic features and domain-specific linguistic features. Common linguistic features are often used to represent documents for various tasks in natural language processing. Typical common linguistic features are:
(i) lexical features, including character-level and word-level features, such as total words, characters per word, frequency of large words, and unique words; and
(ii) syntactic features, including sentence-level features, such as the frequency of function words and phrases or punctuation, and parts-of-speech (POS) tagging.
Domain-specific linguistic features are specifically aligned to the news domain, such as quoted words, external links, number of graphs, and the average length of graphs. Moreover, other features can be specifically designed to capture the deceptive cues in writing styles to differentiate fake news, such as lying-detection features.
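As an illustration, a small sketch of a few such lexical and punctuation features is given below; the exact feature set is a simplified assumption, since published systems use much richer inventories and proper POS tagging:

```python
import re

def lexical_features(text: str) -> dict:
    """A few common character- and word-level features mentioned above."""
    words = re.findall(r"[A-Za-z']+", text)
    return {
        "total_words": len(words),
        "unique_words": len(set(w.lower() for w in words)),
        "chars_per_word": sum(len(w) for w in words) / max(len(words), 1),
        "large_words": sum(1 for w in words if len(w) >= 7),
        "exclamations": text.count("!"),  # crude punctuation-level cue
    }

print(lexical_features("SHOCKING!!! You won't BELIEVE what happened next!"))
```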

Visual-based: Visual cues have been shown to be an important manipulator for fake news propaganda. As we have characterized, fake news exploits the individual vulnerabilities of people and thus often relies on sensational or even fake images to provoke anger or other emotional responses in consumers. Visual-based features are extracted from visual elements (e.g., images and videos) to capture the different characteristics of fake news. Fake images have been identified based on various user-level and tweet-level hand-crafted features using a classification framework. Recently, various visual and statistical features have been extracted for news verification. Visual features include clarity score, coherence score, similarity distribution histogram, diversity score, and clustering score. Statistical features include count, image ratio, multi-image ratio, hot image ratio, long image ratio, etc.
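The following toy sketch computes a few of the statistical features named above over the posts around one news event; the exact ratio definitions are plausible readings of the feature names, not the precise published formulas:

```python
from collections import Counter

def image_statistics(posts):
    """Statistical visual features over the posts about one news event.
    Each post is modeled as a list of image identifiers (possibly empty)."""
    n = len(posts)
    all_images = [img for post in posts for img in post]
    counts = Counter(all_images)
    return {
        "count": len(all_images),                        # total attached images
        "image_ratio": sum(1 for p in posts if p) / n,   # posts carrying an image
        "multi_image_ratio": sum(1 for p in posts if len(p) > 1) / n,
        "hot_image_ratio": sum(1 for c in counts.values() if c > 1)
        / max(len(counts), 1),                           # images reused across posts
    }

print(image_statistics([["img_a"], ["img_a", "img_b"], [], ["img_c"]]))
```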

Social Context Features:
In addition to features related directly to the content of the news articles, additional social context features can be derived from the user-driven social engagements of news consumption on social media platforms. Social engagements represent the news proliferation process over time, which provides useful auxiliary information for inferring the veracity of news articles. Note that few papers in the literature detect fake news using social context features. However, because we believe this is a critical aspect of successful fake news detection, we introduce a set of common features utilized in similar research areas, such as rumor veracity classification on social media. Generally, there are three major aspects of the social media context that we want to represent: users, generated posts, and networks. Below, we investigate how we can extract and represent social context features from these three aspects to support fake news detection.

User-based: As mentioned earlier, fake news pieces are likely to be created and spread by non-human accounts, such as social bots or cyborgs. Thus, capturing users' profiles and characteristics via user-based features can provide useful information for fake news detection. User-based features represent the characteristics of those users who have interacted with the news on social media. These features can be categorized across different levels: the individual level and the group level. Individual-level features are extracted to infer the credibility and reliability of each user using various aspects of user demographics, such as registration age, number of followers/followees, number of tweets the user has authored, etc. Group-level user features capture the overall characteristics of groups of users related to the news. The assumption is that the spreaders of fake news and real news may form different communities with unique characteristics that can be depicted by group-level features. Commonly used group-level features come from aggregating (e.g., averaging and weighting) individual-level features, such as 'percentage of verified users' and 'average number of followers'.

Post-based: People express their emotions or opinions towards fake news through social media posts, such as skeptical opinions, sensational reactions, etc. Thus, it is reasonable to extract post-based features to help find potential fake news via the reactions of the general public as expressed in posts. Post-based features focus on identifying useful information for inferring the veracity of news from various aspects of relevant social media posts. These features can be categorized at the post level, group level, and temporal level. Post-level features generate feature values for each post. The aforementioned linguistic-based features and some embedding approaches for news content can also be applied to each post. Specifically, there are unique features for posts that represent the social response of the general public, such as stance, topic, and credibility. Stance features (or viewpoints) indicate the users' opinions towards the news, such as supporting, denying, etc. Topic features can be extracted using topic models, such as latent Dirichlet allocation (LDA). Credibility features for posts assess the degree of reliability. Group-level features aim to aggregate the feature values of all relevant posts for specific news articles using the "wisdom of crowds". For example, average credibility scores are used to evaluate the credibility of the news; more comprehensive lists of group-level post features can be found in the literature. Temporal-level features consider the temporal variations of post-level feature values. Unsupervised embedding methods, such as recurrent neural networks (RNN), are utilized to capture the changes in posts over time. Based on the shape of the time series of various metrics of relevant posts (e.g., number of posts), mathematical features can be computed, such as SpikeM parameters.
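As a small illustration of the LDA-based topic features mentioned above, the sketch below fits a two-topic model over a handful of invented reaction posts; the corpus, topic count, and preprocessing are all assumptions:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

posts = [
    "this is fake, the photo is clearly edited",
    "cannot believe this, sharing right now",
    "official sources already debunked this claim",
    "so true, everyone needs to see this",
]

# Topic features: a small LDA over the posts reacting to one article.
vectorizer = CountVectorizer(stop_words="english")
X = vectorizer.fit_transform(posts)
lda = LatentDirichletAllocation(n_components=2, random_state=0)
topic_features = lda.fit_transform(X)  # one topic-distribution row per post

print(topic_features.round(2))
```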

Network-based: Users form different networks on social media in terms of interests, topics, and relations. As mentioned before, fake news dissemination processes tend to form an echo chamber cycle, highlighting the value of extracting network-based features to represent these types of network patterns for fake news detection. Network-based features are extracted by constructing specific networks among the users who published related social media posts. Different types of networks can be constructed. The stance network can be built with nodes indicating all the tweets relevant to the news and edges indicating the similarity weights of stances. Another type of network is the co-occurrence network, which is built based on user engagements by counting whether those users wrote posts relevant to the same news articles. In addition, the friendship network indicates the following/followee structure of users who post related tweets. An extension of the friendship network is the diffusion network, which tracks the trajectory of the spread of news; nodes represent users and edges represent the information diffusion paths among them. That is, a diffusion path between two users u_i and u_j exists if and only if (1) u_j follows u_i, and (2) u_j posts about a given news article only after u_i does so. After these networks are properly built, existing network metrics can be applied as feature representations. For example, degree and clustering coefficient have been used to characterize the diffusion network and the friendship network. Other approaches learn latent node embedding features using SVD or network propagation algorithms.
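A toy sketch of a diffusion network and the two metrics just named, using networkx (the edge list is invented for illustration):

```python
import networkx as nx

# Toy diffusion network: an edge u -> v means v follows u and posted
# about the article only after u did (the condition described above).
G = nx.DiGraph()
G.add_edges_from([
    ("publisher", "u1"), ("publisher", "u2"),
    ("u1", "u3"), ("u1", "u4"), ("u2", "u4"),
])

# Standard network metrics used as feature representations.
print(dict(G.out_degree()))                      # fan-out of each spreader
print(nx.average_clustering(G.to_undirected()))  # clustering coefficient
```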

Model Construction
In the previous section, we introduced features extracted from different sources, i.e., news content and social context, for fake news detection. In this section, we discuss the details of the model construction process for several existing approaches. Specifically, we categorize existing methods based on their main input sources as News Content Models and Social Context Models.

News Content Models
In this subsection, we focus on news content models, which mainly rely on news content features and existing factual sources to classify fake news. Specifically, existing approaches can be categorized as knowledge-based and style-based.

Knowledge-based: Since fake news attempts to spread false claims in news content, the most straightforward means of detecting it is to check the truthfulness of the major claims in a news article to decide the news veracity. Knowledge-based approaches aim to use external sources to fact-check proposed claims in news content. The goal of fact-checking is to assign a truth value to a claim in a particular context. Fact-checking has attracted increasing attention, and many efforts have been made to develop feasible automated fact-checking systems. Existing fact-checking approaches can be categorized as expert-oriented, crowdsourcing-oriented, and computational-oriented.


Expert-oriented fact-checking relies heavily on human domain experts to investigate relevant data and documents in order to construct verdicts on claim veracity; examples include PolitiFact, Snopes, etc. However, expert-oriented fact-checking is an intellectually demanding and time-consuming process, which limits its potential for high efficiency and scalability.

Crowdsourcing-oriented fact-checking exploits the "wisdom of crowds" to enable ordinary people to annotate news content; these annotations are then aggregated to produce an overall assessment of the news veracity. For example, Fiskkit allows users to discuss and annotate the accuracy of specific parts of a news article. As another example, an anti-fake-news bot named "For real" is a public account in the instant communication mobile application LINE, which allows people to report suspicious news content that is then further checked by editors.

Computational-oriented fact-checking aims to provide an automatic, scalable system to classify true and false claims. Previous computational-oriented fact-checking methods try to solve two major issues:
(i) identifying check-worthy claims and

(ii) discriminating the veracity of factual claims. To identify check-worthy claims, factual claims that convey key statements and viewpoints are extracted from news content, facilitating the subsequent fact-checking process. Fact-checking for specific claims largely relies on external resources to determine the truthfulness of a particular claim. Two typical external sources are the open web and structured knowledge graphs. Open web sources are utilized as references that can be compared with given claims in terms of both consistency and frequency. Knowledge graphs are integrated from linked open data as a structured network topology; examples include DBpedia and the Google Relation Extraction Corpus. Fact-checking using a knowledge graph aims to check whether the claims in news content can be inferred from existing facts in the knowledge graph.
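A deliberately tiny sketch of the knowledge-graph idea: a claim triple is accepted only if it is present in the fact store, which here stands in for "inferable from". Real systems query large graphs such as DBpedia and perform genuine inference:

```python
# Toy fact store of (subject, predicate, object) triples.
facts = {
    ("Barack Obama", "born_in", "Hawaii"),
    ("Paris", "capital_of", "France"),
}

def fact_check(subject: str, predicate: str, obj: str) -> bool:
    """True if the claim triple matches an existing fact in the graph."""
    return (subject, predicate, obj) in facts

print(fact_check("Barack Obama", "born_in", "Hawaii"))  # True -> supported
print(fact_check("Barack Obama", "born_in", "Kenya"))   # False -> flag claim
```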

Style-based: Fake news publishers often have malicious intent to spread distorted and misleading information and to influence large communities of consumers, which requires particular writing styles to appeal to and persuade a wide scope of consumers in ways not seen in true news articles. Style-based approaches try to detect fake news by capturing the manipulators in the writing style of news content. There are two main categories of style-based methods: deception-oriented and objectivity-oriented.

• Deception-oriented stylometric methods capture deceptive statements or claims in news content. The motivation of deception detection originates from forensic psychology (i.e., the Undeutsch Hypothesis), and various forensic tools, including Criteria-based Content Analysis and Scientific-based Content Analysis, have been developed. More recently, advanced natural language processing models have been applied to spot deception from two perspectives: deep syntax and rhetorical structure. Deep syntax models have been implemented using probabilistic context-free grammars (PCFG), with which sentences can be transformed into rules that describe their syntactic structure. Based on the PCFG, different rules can be developed for deception detection, such as unlexicalized/lexicalized production rules and grandparent rules. Rhetorical structure theory can be utilized to capture the differences between deceptive and truthful sentences. Deep network models, such as convolutional neural networks (CNN), have also been applied to classify fake news veracity.

• Objectivity-oriented approaches capture style signals that can indicate decreased objectivity of news content and thus the potential to mislead consumers, such as hyperpartisan styles and yellow journalism. Hyperpartisan styles represent extreme behavior in favor of a particular political party, which often correlates with a strong motivation to create fake news. Linguistic-based features can be applied to detect hyperpartisan articles. Yellow journalism represents articles that do not contain well-researched news but instead rely on eye-catching headlines (i.e., clickbait) with a propensity for exaggeration, sensationalization, scare-mongering, etc. Often, news titles summarize the major viewpoints of the article that the author wants to convey, so misleading and deceptive clickbait titles can serve as a good indicator for recognizing fake news articles.

Social Context Models:
The nature of social media provides researchers with additional resources to supplement and enhance news content models. Social context models include relevant user social engagements in the analysis, capturing this auxiliary information from a variety of perspectives. We can classify existing approaches to social context modeling into two categories: stance-based and propagation-based. Note that very few existing fake news detection approaches have utilized social context models; thus, we also introduce similar methods for rumor detection using social media, which have potential application to fake news detection.

Stance-based: Stance-based approaches utilize users' viewpoints from relevant post contents to infer the veracity of the original news articles. The stance of users' posts can be represented either explicitly or implicitly. Explicit stances are direct expressions of emotion or opinion, such as the "thumbs up" and "thumbs down" reactions on Facebook. Implicit stances can be automatically extracted from social media posts. Stance detection is the task of automatically determining from a post whether the user is in favor of, neutral toward, or against some target entity, event, or idea. Previous stance classification methods mainly rely on hand-crafted linguistic or embedding features of individual posts to predict stances. Topic model methods, such as latent Dirichlet allocation (LDA), can be applied to learn latent stances from topics. Using these methods, we can infer the news veracity based on the stance values of relevant posts. Tacchini et al. proposed constructing a bipartite network of users and Facebook posts using the "like" stance information; based on this network, a semi-supervised probabilistic model was used to predict the likelihood of Facebook posts being hoaxes. Jin et al. explored topic models to learn latent viewpoint values and further exploited these viewpoints to learn the credibility of relevant posts and news content.

Propagation-based: Propagation-based approaches for fake news detection reason about the interrelations of relevant social media posts to predict news credibility. The basic assumption is that the credibility of a news event is highly related to the credibility of relevant social media posts. Both homogeneous and heterogeneous credibility networks can be built for the propagation process. Homogeneous credibility networks consist of a single type of entity, such as posts or events. Heterogeneous credibility networks involve different types of entities, such as posts, sub-events, and events. Gupta et al. proposed a PageRank-like credibility propagation algorithm that encodes users' credibility and tweets' implications on a three-layer user-tweet-event heterogeneous information network. Jin et al. proposed to include news aspects (i.e., latent sub-events), build a three-layer hierarchical network, and utilize a graph optimization framework to infer event credibility. Recently, conflicting viewpoint relationships have been included to build a homogeneous credibility network among tweets and to guide the process of evaluating their credibility.
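The sketch below illustrates the flavor of such credibility propagation with an off-the-shelf PageRank over a single-layer toy network; the three-layer heterogeneous structure of the cited work is collapsed here, and the edge weights are invented viewpoint-consistency scores:

```python
import networkx as nx

# Toy credibility network: nodes are tweets, edge weights encode how
# consistent two tweets' viewpoints are (higher = more supportive).
G = nx.Graph()
G.add_weighted_edges_from([
    ("t1", "t2", 0.9),  # mutually supporting tweets
    ("t2", "t3", 0.8),
    ("t1", "t4", 0.1),  # conflicting viewpoint, weak link
])

# PageRank-like propagation: credibility flows along consistent viewpoints.
scores = nx.pagerank(G, weight="weight")
print({t: round(s, 3) for t, s in scores.items()})
```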

1.5 SCOPE AND LIMITATIONS OF FAKE NEWS DETECTION

1.5.1 GROWTH OF FAKE NEWS DETECTION


In the past few years, the research community has dedicated growing interest to the issue of false news circulating on social networks. The widespread attention to detecting and characterizing deceptive information has been motivated by considerable political and social backlashes in the real world. As a matter of fact, social media platforms exhibit peculiar characteristics, with respect to traditional news outlets, which have been particularly favorable to the proliferation of false news. They also present unique challenges for all kinds of potential interventions on the subject.

As this issue becomes of global concern, it is also gaining more attention in academia. The aim of this survey is to offer a comprehensive study of the recent advances in the detection, characterization, and mitigation of false news that propagates on social media, as well as the challenges and open questions that await future research in the field. We use a data-driven approach, focusing on a classification of the features used in each study to characterize false information and on the datasets used for training classification methods. At the end of the survey, we highlight the emerging approaches that look most promising for addressing false news.

1.5.2 HOW THE GOVERNMENT IS PREVENTING FAKE NEWS

The U.S. government is currently developing programs that detect fake news and
false information.

In the U.S. Department of Defense, the DARPA (Defense Advanced Research Projects
Agency) Media Forensics program has used AI to detect manipulated videos and photos.

These manipulated media use visuals to spread false information across the Internet.
Whether this specific program was developed in response to Russian interference in the
2016 presidential election is unclear, but it is a great starting point by the government
in the fight to counter the dissemination of false information.

Political Polarization:

The spread of misinformation, especially during election seasons, has led to


increased political polarization in numerous countries. Many democracies, including
the United States and France, have consequently seen an increase in the targeting of
political parties.

Many multinational companies, such as Microsoft, have taken initiative in


defending the values of democracy. Microsoft’s “Defending Democracy Program” aims to
tackle cybersecurity threats posed by foreign entities. They do this by protecting
campaigns, increasing advertising transparency, protecting the integrity of the electoral
process, and defending against disinformation by identifying, targeting, and obtaining
domains that are used for misinformation campaigns.

The foundation of any democracy rests on the freedoms of speech and protest and the
right to vote. Governments around the world have plans to address the fake news issue, but
to varying degrees. One question comes to mind: does a law to discourage or halt fake news
truly promote democracy?

This will vary on a case-by-case basis, but authoritarian countries (such as China
and Russia) jail dissidents who, they claim, are spreading false information online. This is
often not the case and is only being used as justification to jail political critics and
journalists. The Chinese model of media censorship limits the freedoms of speech in the
public sphere on social media.

Free Speech vs. Fake News:

Western nations and institutions, such as the European Union, the United Kingdom, and
the United States, have enacted laws and regulations that attempt to remove social
media posts and accounts of individuals who disseminate fake news and conspiracies.

Many groups interpret this as an attempt to suppress free speech. The fine line is
being drawn on a country-by-country basis. In Europe, large social media companies such
as Facebook and Twitter may face fines and tougher regulations if they disregard or
fail to remove posts and accounts that are deemed “harmful and/or illegal”.

The European Union voted in favor of Articles 11 and 13. To sum up the articles,
they will call for companies like Google “to pay media companies a so-called “link tax”
when sharing their content.” Article 13 would require social media platforms to monitor
content uploaded to posts “ahead of their publication by using automated software that
would detect and filter out intellectual property violations” (Quartz).

What the People Can Do to Help:

Drawing the fine line between regulating fake news and allowing freedom of speech is
challenging. This is something governments worldwide will have to figure out
and come to terms with soon enough.

Cybersecurity threats are no longer only the conventional, short-term plays of hacking
voter systems, banks, or government entities, but are now long-term plays aimed at
undermining democracy and politically polarizing countries.

Unfortunately, government regulations can only do so much in stopping the spread


of fake news. It’s up to the public to decide what is real and what isn’t, and many people
will already have their minds made up.

1.5.3 COMPANIES/ STARTUPS USING FAKE NEWS DETECTION

1. Distil Networks promises to help their clients fight the bad bots and gain
visibility over web-based traffic. Distil Networks was founded in April 2011 by Rami
Essaid, Engin Akyol and Andrew Stein. They claim to be the pioneers of ‘bot mitigation’.
They can identify a bot’s source with the help of a technology called ‘device
fingerprinting’.
2. PressCoin is a platform that offers trustworthy news, while the people using the
platform can earn PressCoins.
3. Digital Shadows, a cybersecurity startup, helps fight fake news by monitoring
activity on the dark web. The company manages and remediates digital risk across data
sources within the open, deep, and dark web to protect an organization’s business, brand,
and reputation.

4. Another website that works on setting the record straight is AltNews. Founded by
Pratik Sinha and the anonymous ‘Unofficial Sususwamy’, AltNews busts propaganda and
misinformation.
5. PerimeterX provides protection against automated attacks by detecting malicious
web behavior. It uses human behavior analysis as well as analysis of applications and
networks to catch automated attacks in real-time. The company was founded in 2014 by
CEO Omri Iluz.
6. Indian startup SM Hoax Slayer began in 2015 as a Facebook page. Founded by
Pankaj Jain, the website now deals with fake news in any form, be it, religious, political, or
communal.
7. Headed by CEO Dhruv Ghulati, Factmata, is a startup that calls itself a ‘fact
checking community’. It uses AI to help journalists and fact checkers detect, verify and fact
check media information in close to real time. They also help advertisers avoid placing
advertising on fake news, hate speech, and extremist content.
8. Storyzy, an ad-tech startup launched in 2012, helps verify quotes that are
attributed to public figures or celebrities using Natural Language Processing (NLP) in
real time.
9. Userfeeds is using blockchain technology to protect news. The idea is that if
every piece of information is swaddled in encrypted protection, its integrity can be vouched
for.
10. Crisp Thinking, which started in 2005, provides services that protect the social
media reputations of companies and also protect children and teens from cyberbullying
and inappropriate content.
11. Check4Spam was founded by Shammas Oliyath and Bal Krishn Birla in 2015,
to help bust fake news and hoaxes.
12. Rappler, a Philippines-based startup, has won the 2017 Democracy Award from
the National Democratic Institute for its journalistic efforts in curbing the spread of fake
news. The name, ‘Rappler’, is an amalgamation of ‘rap’ (discuss) and ‘ripple’ (create
waves).
The ‘Postcard News’ incident is a wake-up call for the millions of social media
users who blindly believe every forward and every morphed photograph. However,
awareness is being spread among the public by the above-mentioned startups and
companies. Given time, better technology, and funding, a whole army of startups could be
out there fighting fake news.

1.5.4 SOURCE OF FAKE NEWS DETECTION

Lately the fact-checking world has been in a bit of a crisis. Sites like Politifact and
Snopes have traditionally focused on specific claims, which is admirable but tedious; by the
time they’ve gotten through verifying or debunking a fact, there’s a good chance it’s
already traveled across the globe and back again.

Social media companies have also had mixed results limiting the spread of
propaganda and misinformation. Facebook plans to have 20,000 human moderators by the
end of the year, and is putting significant resources into developing its own
fake-news-detecting algorithms.

Researchers from MIT’s Computer Science and Artificial Intelligence Lab (CSAIL)
and the Qatar Computing Research Institute (QCRI) believe that the best approach is to
focus not only on individual claims, but on the news sources themselves. Using this tack,
they’ve demonstrated a new system that uses machine learning to determine if a source is
accurate or politically biased.

“If a website has published fake news before, there’s a good chance they’ll do it
again,” says postdoc Ramy Baly, the lead author on a new paper about the system. “By
automatically scraping data about these sites, the hope is that our system can help figure out
which ones are likely to do it in the first place.”

Baly says the system needs only about 150 articles to reliably detect if a news
source can be trusted — meaning that an approach like theirs could be used to help stamp
out new fake-news outlets before the stories spread too widely.

The system is a collaboration between computer scientists at MIT CSAIL and
QCRI, which is part of the Hamad Bin Khalifa University in Qatar. Researchers first took
data from Media Bias/Fact Check (MBFC), a website with human fact-checkers who
analyze the accuracy and biases of more than 2,000 news sites; from MSNBC and Fox
News; and from low-traffic content farms.

They then fed those data to a machine learning algorithm, and programmed it to
classify news sites the same way as MBFC. When given a new news outlet, the system was
then 65 percent accurate at detecting whether it has a high, low or medium level of
factuality, and roughly 70 percent accurate at detecting if it is left-leaning, right-leaning, or
moderate.

The team determined that the most reliable ways to detect both fake news and
biased reporting were to look at the common linguistic features across the source’s stories,
including sentiment, complexity, and structure.

For example, fake-news outlets were found to be more likely to use language that is
hyperbolic, subjective, and emotional. In terms of bias, left-leaning outlets were more likely
to have language that related to concepts of harm/care and fairness/reciprocity, compared to
other qualities such as loyalty, authority, and sanctity. (These qualities represent a popular
theory — that there are five major moral foundations — in social psychology.)

Co-author Preslav Nakov, a senior scientist at QCRI, says that the system also
found correlations with an outlet’s Wikipedia page, which it assessed for general length —
longer is more credible — as well as for target words such as “extreme” or “conspiracy
theory.” It even found correlations with the text structure of a source’s URLs: those with
many special characters and complicated subdirectories, for example, were associated
with less reliable sources.

“Since it is much easier to obtain ground truth on sources [than on articles], this
method is able to provide direct and accurate predictions regarding the type of content
distributed by these sources,” says Sibel Adali, a professor of computer science at
Rensselaer Polytechnic Institute who was not involved in the project.

Nakov is quick to caution that the system is still a work in progress, and that, even
with improvements in accuracy, it would work best in conjunction with traditional
fact-checkers.

“If outlets report differently on a particular topic, a site like Politifact could instantly
look at our fake news scores for those outlets to determine how much validity to give to
different perspectives,” says Nakov.

Baly and Nakov co-wrote the new paper with MIT Senior Research Scientist James
Glass alongside graduate students Dimitar Alexandrov and Georgi Karadzhov of Sofia
University. The team will present the work later this month at the 2018 Empirical Methods
in Natural Language Processing (EMNLP) conference in Brussels, Belgium.

The researchers also created a new open-source dataset of more than 1,000 news
sources, annotated with factuality and bias scores, that is the world’s largest database of its
kind. As next steps, the team will explore whether the English-trained system can be
adapted to other languages, as well as going beyond traditional left/right bias to explore
region-specific biases (like the Muslim world’s division between religious and secular).

“This direction of research can shed light on what untrustworthy websites look like
and the kind of content they tend to share, which would be very useful for both web
designers and the wider public,” says Andreas Vlachos, a senior lecturer at the University
of Cambridge who was not involved in the project.

Nakov says that QCRI also has plans to roll out an app that helps users step out of
their political bubbles, responding to specific news items by offering users a collection of
articles that span the political spectrum.

“It’s interesting to think about new ways to present the news to people,” says
Nakov. “Tools like this could help people give a bit more thought to issues and explore
other perspectives that they might not have otherwise considered.”

1.5.5 VISION OF FAKE NEWS DETECTION

Fake news – or we can also just call it lies – is certainly not a phenomenon peculiar
to the modern, digital world. Nor is using it to achieve ideological, political or economic
aims anything new. Nevertheless, the issue of fake news is at the centre of public debate –
at the latest since the election of the current President of the USA, Donald Trump – and it is
associated directly with digitalisation and the social media. Social media are usually the
first channels that rumours, lies and fake news appear on, from which they are extensively
distributed and from which they find their way into public debate and awareness. You will
no doubt have heard how, in 2015, Chancellor Angel Merkel had her photo taken with the
Brussels assassin, whom she had let into the country as a refugee. This is, of course, an
absurd piece of fake news. But everyone who saw it remembers the accompanying photo.
Worse still, in many cases it brings back the memory of the fake news item even outside the
wrong context.

The strategy behind this story is nothing new. Much of it may even, with a great
deal of effort, have been possible 200 years ago. What is new is that the organization,
technology and manpower needed are much less today. And it is precisely this that makes
fake news seem to us to be so dangerous in the “Digital Age”.

The openness and anonymity provided by social networks make possible a great
amount of diversity and freedom of opinion, as well as protection wherever the free
expression of opinion in an “analogue” world is dangerous. But, to the same extent, they
offer opportunities for abuse. The abstract nature of a simple user account and, at the same
time, the availability of programming interfaces with social networks make it possible to
spread and duplicate all types of content in huge amounts – sometimes even automatically.
Unlike the (likewise automated) distribution of spam mails, the aim is not to reach largely
uninterested users by means of massive replication. The metadata on users that are freely
available on social networks, as well as the networking of interest groups, permit content to
be distributed in a highly targeted way. If a (usually ideologically motivated) piece of fake
news is placed in a suitable environment, it is often forwarded without being checked, or
even deliberately.

The path taken by fake news does not, however, end there. It really becomes
important when it makes the jump from social media to those media working with editors
and which are often considered to be trustworthy and have a large reach. This jump
succeeds because topics from social media increasingly serve as triggers for journalistic
stories. And it is not rare for social media communities to be seen as a group representing
society.

The question remains: can we not protect ourselves from fake news by using
technology? An automatic recognition of fake news would mean being able to state
mechanically whether the content of a piece of news is true or false. At present, there is no
method of reliably doing this – nor is there any in sight. Not for nothing does Facebook
deploy a host of checkers to detect fake news. Demonstrating the existence of so-called
social bots is not always effective. Even when automated profiles are discovered, it is not as
a rule clear whether they are part of a campaign or simply utility software. And in any case,
not all campaigns are carried out in a fully automated way.

One universally applicable approach for identifying automation and fake news
would appear to be the detection of campaigns themselves. If the existence of these is
proven, then both content and players can be easily extracted and checked. Currently, this
approach has not been researched to any great extent, and it remains a highly interesting
open issue for future research.

1.6 SIGNIFICANCE OF FAKE NEWS DETECTION

Using social media as a medium for news updates is a double-edged sword. On one
hand, social media provides for easy access, little to no cost, and the spread of information
at an impressive rate (Shu, Sliva, Wang, Tang, & Liu, 2017). However, on the other hand,
social media provides the ideal place for the creation and spread of fake news. Fake news
can become extremely influential and has the ability to spread exceedingly fast. With the
increase of people using social media, they are being exposed to new information and
stories every day. Misinformation can be difficult to correct and may have lasting
implications. For example, people can base their reasoning on what they are exposed to
either intentionally or subconsciously, and if the information they are viewing is not
accurate, then they are establishing their logic on lies. In addition, since false information is
able to spread so fast, not only does it have the ability to harm people, but it can also be
detrimental to huge corporations and even the stock market. For instance, in October of
2008, a journalist posted a false report that Steve Jobs had a heart attack. This report was
posted through CNN’s iReport.com, which is an unedited and unfiltered site, and
immediately people retweeted the fake news report. There was much confusion and
uncertainty because of how widespread it became in such a short amount of time. The stock
of Jobs’s company, Apple Inc., fluctuated dramatically that day due to one false news report
that had been mistaken for authentic news reporting.

However, the biggest reason why false information is able to thrive continuously is
that humans fall victim to Truth-Bias, Naïve Realism, and Confirmation Bias. When
referring to people being naturally “truth-biased” this means that they have “the
presumption of truth” in social interactions, and “the tendency to judge an interpersonal
message as truthful, and this assumption is possibly revised only if something in the
situation evokes suspicion” (Rubin, 2017). Basically, humans are poor lie detectors and
often do not realize that they may be lied to. Users of
social media tend to be unaware that there are posts, tweets, articles or other written
documents that have the sole purpose of shaping the beliefs of others in order to influence
their decisions. Information manipulation is not a well-understood topic and generally not
on anyone’s mind, especially when fake news is being shared by a friend. Users tend to let
their guard down on social media and potentially absorb all the false information as if it
were the truth. This is also even more detrimental considering how young users tend to rely
on social media to inform them of politics, important events, and breaking news (Rubin,
2017). For instance, “Sixty-two percent of U.S. adults get news on social media in 2016,
while in 2012, only forty-nine percent reported seeing news on social media,” which
demonstrates how more and more people are becoming tech savvy and relying on social
media to keep them updated (Shu et al., 2017). In addition, people tend to believe that their
own views on life are the only ones that are correct, and if others disagree then those people
are labeled as “uninformed, irrational, or biased,” otherwise known as Naïve Realism.

This leads to the problem of Confirmation Bias, which is the notion that people
favor receiving information that only verifies their own current views. Consumers only
want to hear what they believe and do not want to find any evidence against their views.
For instance, someone could be a big believer in unrestricted gun control and may desire to
use any information they come across in order to support and justify their beliefs further,
whether that is random articles from unreliable sites, posts from friends, re-shared
tweets, or anything online that agrees with their principles. Consumers do not wish to
find anything that contradicts what they believe because it is simply not how humans
function. People cannot help but favor what they like to hear and have a predisposition for
confirmation bias. It is only those who strive for certain academic standards that may be
able to avoid or limit such bias, but the average person who is unaware of false
information to begin with will not be able to fight these unintentional urges. In addition, not
only does fake news negatively affect individuals, but it is also harmful to society in the
long run. With all this false information floating around, fake news is capable of ruining the
“balance of the news ecosystem” (Shu et al., 2017). For instance, in the 2016 Presidential
Election, the “most popular fake news was even more widely spread on Facebook” instead
of the “most popular authentic mainstream news” (Shu et al., 2017). This demonstrates how
users may pay more attention to manipulated information than authentic facts. This is a
problem not only because fake news “persuades consumers to accept biased or false
beliefs” in order to communicate a manipulator’s agenda and gain influence, but also fake
news changes how consumers react to real news (Shu et al., 2017). People who engage in
information manipulation desire to cause confusion so that a person’s ability to decipher the
true from the false is further impeded. This, along with influence, political agendas, and
manipulation, is one of the many motives why fake news is generated.

1.6.1 CONTRIBUTORS OF FAKE NEWS

While many social media users are very much real, those who are malicious and out
to spread lies may or may not be real people. There are three main types of fake news
contributors: social bots, trolls, and cyborg users (Shu et al., 2017). Since the cost to create
social media accounts is very low, the creation of malicious accounts is not discouraged. If
a social media account is being controlled by a computer algorithm, then it is referred to as
a social bot. A social bot can automatically generate content and even interact with social
media users. Social bots are not necessarily harmful; it depends entirely on how
they are programmed. If a social bot is designed with the sole purpose of causing harm,
such as spreading fake news on social media, then it can be a very malicious entity and
contribute greatly to the creation of fake news. For example, “studies show that social bots
distorted the 2016 US presidential election discussions on a large scale, and around 19
million bot accounts tweeted in support of either Trump or Clinton in the week leading up
to the election day,” which demonstrates how influential social bots can be on social media.

However, fake humans are not the only contributors to the dissemination of false
information; real humans are very much active in the domain of fake news. As implied,
trolls are real humans who “aim to disrupt online communities” in hopes of provoking
social media users into an emotional response (Shu et al., 2017). For instance, there has
been evidence that claims “1,000 Russian trolls were paid to spread fake news on Hillary
Clinton,” which reveals how actual people are performing information manipulation in
order to change the views of others (Shu et al., 2017). The main goal of trolling is to stir
up negative feelings harbored by social media users, such as fear and even
anger, so that users will develop strong emotions of doubt and distrust (Shu et al., 2017).
When a user has doubt and distrust in their mind, they won’t know what to believe and may
start doubting the truth and believing the lies instead.

While contributors of fake news can be either real or fake, what happens when it’s a
blend of both? Cyborg users are a combination of “automated activities with human input”
(Shu et al., 2017). The accounts are typically registered by real humans as a cover, but use

39
programs to perform activities in social media. What makes cyborg users even more
powerful is that they are able to switch the “functionalities between human and bot,” which
gives them a great opportunity to spread false information.

Now that we know some of the reasons why and how fake news progresses, it
would be beneficial to discuss the methods of detecting online deception in word-based
format, such as e-mails. The two main categories for detecting false information are the
Linguistic Cue and Network Analysis approaches.

2. LITERATURE REVIEW

2.1 REVIEW OF RELATED LITERATURES

Data mining techniques have been used in the past to explore and analyze data in
order to find better ways of doing business in an organization. A huge amount of
unexplored data is freely available on the web, and data mining (DM) techniques have
been applied to extract hidden, useful information that may enhance an organization's
business. The literature supports the fact that DM techniques have been used to develop
new business opportunities. Among the various applications of DM techniques, sentiment
analysis and opinion mining can be applied to unstructured data. Sentiment analysis, or
opinion mining, is a technique for classifying and evaluating other people's opinions.
Nowadays people build their perceptions and make decisions by analyzing facts and the
reviews of other people, either manually or computationally. Since everything is online
today, the Internet has become an integral part of human lives and is used for exchanging
all aspects of human life: sentiments, emotions, affection, support, opinions, trade,
business, and so on. With the onset of social media there have been numerous platforms,
such as blogs, discussion forums, review sites, and social networks, where individuals can
post their reviews and feedback, list their likes and dislikes for a product's attributes or
features, or compare different products (with the same or different features). These
reviews are gathered and analyzed to evaluate the overall orientation of the collected
reviews. This chapter focuses on past work related to sentiment analysis and opinion
mining. We present the outcomes of research papers that have shown the application of
machine learning techniques to online reviews. This chapter also discusses research
papers on methods and techniques for gathering and analyzing reviews and extracting
phrases based on subjectivity; some work on calculating the semantic orientation of
collected reviews is also discussed. Sentiment analysis is part of subjectivity analysis and
is also well known by the name opinion mining. Opinion mining is mainly concerned with
the analysis of natural-language expressions of individuals' opinions about certain
products, or any other area where public opinion or reviews matter most. Subjectivity
analysis aims at determining the attitude of the writer or author of an opinion with respect
to some topic, product, or service, or the overall contextual polarity or tonality of a
document or review. The attitude may involve the user's experience, evaluation, judgment,
emotional state, or intended emotional effect. It is a Natural Language Processing and
Information Extraction task that identifies the writer's feelings and experiences expressed
in positive and negative comments, questions, and requests by analyzing the massive
amount of information available on the web. The major force behind the emergence of
opinion mining today is the exponential increase in Internet usage and the exchange and
sharing of public views and opinions. It has been observed that some opinions can be
topic-based, where documents are classified into predefined topic classes, e.g., science,
sports, entertainment, politics. Topic-related words are important in topic-based
classification.

[Subjectivity Analysis; Opinion Mining; Review Mining; Sentiment Analysis; Appraisal Extraction]

Figure 2.1: Names often interchangeably used for sentiment analysis

However, in sentiment classification these topic words are given little weight. Here, the
classification is at the document level: the whole document is classified based on its
polarity, i.e., words that indicate negative or positive opinions (sometimes neutral) are
important, e.g., great, poor, excellent, bad, disgusting. This classification can also be
extended to the sentence level and to comparative sentences, i.e., classifying each sentence
as expressing a positive, negative, or neutral opinion. User-generated online reviews have
been criticized by some researchers as being perceived as having lower credibility and
trustworthiness than traditional word of mouth because of the absence of source cues in a
virtual world like the Internet. Some researchers have also reported in the literature that
traditional word of mouth frequently relies on contextual cues (e.g., the social connection
between word-of-mouth communicators) that can enhance its persuasiveness. In situations
like online reviews, however, these contextual cues may not be available. The absence of
such cues in online reviews forces customers to evaluate their influence exclusively on the
basis of the limited content available. Nevertheless, significant findings of a past study also
revealed that buyers rated online reviews as more reliable and useful when they perceived
agreement between the reviews and their own opinions. To enable this kind of analysis,
the phrases in product reviews in which customers have presented their views must be
identified. These opinions consist of the user's viewpoint, preference, attitude, sensibility,
etc. The reviews can concern a product's features and attributes, or they may compare
different products of the same domain. Fully dissecting and organizing opinions involves
tasks that require fairly deep semantic and syntactic analysis of the text.
These include recognizing not only that the text is subjective, but also what the
opinion is about, and which of many possible positions the holder of the opinion expresses
with respect to that subject. Next we present summaries of some of the research papers on
work that has been done in sentiment analysis.

2.1.1 WORK RELATED TO ONLINE REVIEWS

The investigation of online product reviews has received a great deal of attention
from researchers, and recently a growing number of studies have investigated diverse
aspects of product reviews. The empirical findings of various studies related to sentiment
analysis of product reviews have shown that detailed analysis of these reviews is highly
beneficial from both the customer's and the organization's point of view. An early study
on online word-of-mouth data investigated the effect of negative customer reviews. The
outcomes demonstrate that the impact of negative online word of mouth on perceived
retailer reliability and consumer purchase intentions is negative, and this negative impact
depends on familiarity with the retailer. Customers who are less acquainted with a retailer
are more likely to be influenced by negative reviews. The study also proposes that the
extent of word-of-mouth inquiry depends on the customer's reasons for picking an online
vendor. Another study analyzed reviews collected on books from the Amazon website and
found that customer feedback influenced other consumers toward better deals. (Sorensen
and Rasmussen, 2004) did similar work: they studied the effect of New York Times
reviews on different fiction titles. In their work they concluded that consumer feedback
matters a great deal to the purchaser's mind; positive reviews influenced a large segment
of customers. The impact of online product reviews on the relative sales of two online
bookshops has also been investigated.

(Sen and Lerman, 2007) found in their work that the polarity or orientation of
customer reviews influences the customer's mindset; these reviews have a significant
impact on it. Similar research was conducted by Li and Hitt (2008); there has been major
progress in research on online reviews and their effect on buyers' behavioral intentions,
states of mind, and purchase decision processes. Their study proposed a model to analyze
the particular preferences of early purchasers and their effect on long-run customer
purchase behavior, as well as on the social welfare created by review systems. The impact
of negative and positive electronic word of mouth for different product types (search
versus experience goods) was likewise examined by Park and Lee (2007).
The outcomes demonstrated that the electronic word-of-mouth effect was greater
for negative electronic word of mouth than for positive electronic word of mouth. The
experiments additionally reported that established websites showed more electronic
word-of-mouth effect, and this effect was greater for experience products than for search
merchandise (Park and Lee, 2007).

It was observed by (Lee and Bradlow, 2011) that online reviews have also been
used for automated market research to support the analysis and visualization of market
structure. Their study proposes that market structure analysis can be performed by
automatically eliciting product attributes from online consumer reviews. This sort of
market structure analysis can facilitate investigation of product substitutes and
complements. The impact of third-party product reviews on the financial valuation of
firms introducing new products was also studied in a later piece of research (Chen, Liu,
and Zhang, 2012). The outcomes suggested that such reviews play a large part in
influencing firm value, as investors update their expectations about a new product's sales
potential (Chen et al., 2012). Some later studies have further proposed that the analysis of
product reviews at various granularity levels can reveal product attribute strengths and
weaknesses (Zhang, Xu, and Wan, 2012), which in turn can explain the particular
preferences of each customer (Wei, Chen, Yang, and Yang, 2010).

Work related to Sentiment analysis and opinion mining: Sentiment analysis, or
opinion mining, is a comparatively new research topic. Early work on sentiment detection
began in the late 1990s (Kessler, Nunberg, & Schütze, 1997; Spertus, 1997; Argamon,
Koppel, & Avneri, 1998), but only in the early 2000s did it become a major subfield of the
information management discipline (Kobayashi, Inui, & Inui, 2001; Raubern &
Muller-Kogler, 2001). Most of the work has focused on various product reviews. L. Dini
& G. Mazzini in 2002 studied customer views about products from the web and applied
syntactic-structure and lexical parsing to them in order to extract a controlled
representation from natural text for later handling with data mining procedures. Their
methodology primarily consisted of shallow syntactic parsing with a method known as
chunking, combined with a lexical parsing rule set implementing a template-matching
system (i.e., an advanced regular expression matcher) that applies a sentiment separation
to the constructed lines. (Peter D. Turney, 2002) published a paper suggesting an
unsupervised learning algorithm for categorizing reviews as recommended or not
recommended. The procedure uses part-of-speech tagging to identify adjectives and
adverbs, and then uses pointwise mutual information combined with information retrieval
to quantify the similarity of the extracted opinion-bearing words with known opinion
reference words (e.g., "excellent" for positive). For every review, the similarity scores of
all the opinion-bearing words are averaged; if the average is on the positive side, the
review is given a positive label, otherwise a negative one. Turney attained 74 percent
average accuracy on reviews ranging from automobiles to movies to banks and tourist
destinations (Turney, P., 2002).

(Pang & Lee, 2002) studied sentiment classification in the area of film reviews,
performing it without prior domain knowledge or training. (Pang et al., 2004) used a very
simple form of machine learning in their work, employing it to learn the performance of
words chosen by subject. This work was carried forward by a pair of computer science
students, who systematically tested the prevailing practice of using human-selected seed
words for sentiment analysis, such as that used by Turney in the same year. The two
students proposed lists of favorable and unfavorable words, respectively, which were then
vetted against a list created from self-assessment and basic measurements of the data. The
resulting list proved considerably more consistent than those based on human-recommended
seed words. Although the study was limited in scale, it suggested that automatically
supervised feature selection could produce features better than those generated by
intuition-based methods. (Pang et al., 2008) subsequently exploited this processed
information to produce a very simple, fully data-driven opinion classifier using machine
learning.

Pang and Lee (2008) wrote a book that offers a complete overview of research in
the area. Pang et al. (2002) conducted early polarity classification of reviews using
supervised approaches. The techniques explored were Support Vector Machines (SVMs),
Naive Bayes, and Maximum Entropy; the study used data sets with different sets of
features, for instance unigrams, bigrams, binary and term-frequency feature weights, and
others. The outcome of their observations was that sentiment classification is harder than
regular topic-based classification; they also concluded that using an SVM classifier with
binary unigram-based features generates the best output. A succeeding advancement was
the identification and removal of the neutral portions of documents and the application of
a polarity classifier to the remainder (Pang and Lee, 2004). This helped achieve textual
coherence, with contiguous text spans expected to belong to the same subjectivity or
objectivity class. Documents were represented as graphs with sentences as nodes and
association scores between them as edge weights; two supplementary nodes characterized
the subjective and objective classes. The weights between nodes were derived using three
different empirical decay functions. Finding a partition that minimizes the cost function
splits the objective from the subjective sentences. They stated a statistically significant
enhancement over a Naive Bayes baseline using the complete text, however with only a
very small gain compared with using an SVM classifier on the overall document (Pang,
B. and L. Lee, 2008).

Mullen and Collier (2004) used SVMs and expanded the feature set for
representing documents with favorable methods from a range of sources. They introduced
features based on Osgood's theory of semantic differentiation (Osgood, 1967), using
WordNet to derive the potency, activity, and evaluative values of adjectives, and Turney's
semantic orientation (Turney, 2002). Their conclusions showed that a hybrid SVM
classifier, which uses as features the distance of documents from the separating
hyperplane, together with all the stated features, yields the best outcomes (Mullen, T. and
N. Collier, 2004).

Whitelaw et al. (2005) contributed very fine-grained semantic distinctions to the
feature set. Their approach was built on a lexicon formed in a semi-supervised
environment and then manually fine-tuned. It consisted of 1,329 adjectives and their
modifiers categorized under several taxonomies of appraisal attributes based on Martin
and White's Appraisal Theory (2005). They combined appraisal groups with
unigram-based document representations as features to a Support Vector Machine
classifier (Witten and Frank, 1999), resulting in a substantial gain in precision.
Lexicon-based procedures rely on a sentiment lexicon, a collection of known, precompiled
sentiment terms. Major contributions were made by (Popescu and Etzioni, 2005; Scharl
and Weichselbraun, 2008; Taboada et al., 2011). Machine learning methodologies make
use of syntactic and/or linguistic features (Pak and Paroubek, 2010b; Go et al., 2009; Boiy
and Moens, 2009). There also exist hybrid approaches, with sentiment lexicons playing a
vital part in many procedures, e.g., (Diakopoulos et al., 2010). For example, (Moghaddam
and Popowich, 2010) establish the polarity of reviews by identifying the polarity of the
adjectives that occur in them, with a reported accuracy about 10% greater than pure
machine learning approaches.
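
As a simple illustration of the lexicon-based family of methods just mentioned, the sketch below scores a text by counting precompiled positive and negative terms; the tiny word lists are placeholder assumptions standing in for a real sentiment lexicon.

import re

POSITIVE = {"great", "excellent", "good", "amazing"}   # toy lexicon entries
NEGATIVE = {"poor", "bad", "terrible", "disgusting"}

def lexicon_polarity(text):
    words = re.findall(r"[a-z]+", text.lower())
    score = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
    return "positive" if score > 0 else "negative" if score < 0 else "neutral"

print(lexicon_polarity("Great camera but terrible, terrible battery"))  # -> negative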

However, such relatively effective performances often do not carry over when switched
to different domains or text genres, owing to the ambiguity of sentiment terms. A
sentiment term may indicate polarity, yet there may be insufficient context to determine
its semantic orientation, principally for adjectives in sentiment lexicons (Mullaly et al.,
2010). Numerous evaluations have shown the importance of contextual information
(Weichselbraun et al., 2010; Wilson et al., 2009) and have identified context words with a
greater influence on the polarity of ambiguous terms (Gindl et al., 2010).
For example, the adjective "unpredictable" might have a negative orientation in an
automotive review, in an expression such as "unpredictable steering", but a positive
orientation in a movie review, in an expression such as "unpredictable plot".
Consequently, pairs of consecutive words are extracted, where one member of the pair is
an adjective or an adverb and the other provides context.
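
The adjective-plus-context extraction just described can be approximated with off-the-shelf part-of-speech tagging; the sketch below uses NLTK (resource names can vary across NLTK versions) and made-up sentences mirroring the example above.

import nltk

nltk.download("punkt", quiet=True)                       # tokenizer model
nltk.download("averaged_perceptron_tagger", quiet=True)  # POS tagger model

def opinion_bigrams(sentence):
    tagged = nltk.pos_tag(nltk.word_tokenize(sentence))
    # Keep pairs whose first word is an adjective (JJ*) or adverb (RB*);
    # the second word supplies the disambiguating context.
    return [(w1, w2) for (w1, t1), (w2, _) in zip(tagged, tagged[1:])
            if t1.startswith(("JJ", "RB"))]

print(opinion_bigrams("The car has unpredictable steering"))
print(opinion_bigrams("The movie has an unpredictable plot"))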

In the recent past, procedures for opinion mining have started to target various
social media, together with an inclination toward applying them as a proactive instead of
a reactive technique. Understanding public opinion can have vital consequences for
assessing and forecasting upcoming events and trends.

One of the most predominant and recognizable applications of this is review
rating: (Peter D. Turney, 2002) found that, with 410 reviews from Epinions, the algorithm
achieves an average accuracy of 74%. It also appears that film reviews are harder to
characterize, because the whole is not necessarily the sum of the parts; accuracy on movie
reviews is around 66%. On the other hand, for financial institutions, banks, and cars, the
whole does appear to be the sum of the parts, and the accuracy is 80% to 84%. Tourism
reviews are an intermediate case. Another application is in the area of stock market
forecasting: (Bollen and Mao, 2011) found that, contrary to the expectation that a falling
stock market would shift the general population's mood toward the negative, a fall in the
general population's mood in fact acts as a precursor to a collapse in the stock market.

Practically all the work on opinion mining from Twitter has used machine learning
techniques. (Pak and Paroubek, 2010b) aimed to classify arbitrary tweets as carrying
positive, negative, or neutral sentiment, building a simple binary classifier which uses
n-gram and POS features, trained on instances which had been annotated according to the
presence of positive and negative emotions.

Their methodology has much in common with an earlier sentiment classifier
developed by (Go et al., 2009), which also uses unigrams, bigrams and POS tags, though
the former observed that the distribution of certain POS tags differs between positive and
negative posts. One of the main causes of the relative paucity of linguistic procedures for
opinion mining on social media is most likely the difficulty of applying NLP to
low-quality text, something which machine learning procedures can, to some extent,
bypass given ample training data. For example, the accuracy of the Stanford NER drops
from 90.8% to 45.88% when applied to a corpus of tweets (Liu et al., 2010). (Ritter et al.,
2011) likewise exhibit some of the complications of applying orthodox POS tagging,
chunking and Named Entity Recognition techniques to tweets, recommending a solution
based on Labeled LDA (Ramage et al., 2009).

Opinion mining can be helpful in several ways. For instance, in advertising it aids
in gauging the success of an ad campaign or new product launch, determining which
versions of a product or service are popular, and even identifying which demographics
like or dislike particular features. Hence, opinion classification is beneficial both to
prospective consumers (buyers) and to product manufacturers.

For a prospective consumer, although he or she could go through all the reviews
of various products at merchants' sites to mentally assess and weigh the pros and cons of
each product in order to decide which of them would be beneficial, it is much more
convenient and far less time-consuming to see a visual, feature-by-feature summary of
opinions in the reviews. A system like ours can be installed at a vendor's site that hosts
evaluations, so that prospective buyers can compare not only prices and product details
(which can already be done at some sites), but also the voice of existing customers. For a
product vendor, assessing customers' thoughts about its products and those of its
competitors to find their weaknesses and strengths is critical for marketing intelligence
and for product benchmarking. This type of work is usually performed manually, which
is intensive, hectic, and time-consuming; the present method is very supportive and comes
in handy in this case.

The opinions communicated by shoppers in online product reviews can be an
important wellspring of marketing intelligence that can help managers and advertisers
understand consumers' concerns and interests. For dissecting the vast volume of product
review data and for extracting intelligence from it, tools and best practices based on text
mining and natural language processing have been proposed in (Chung and Tseng, 2012).

Applying free-text processing procedures to web-based reviews can uncover
patterns and topics that may be important to larger consumer and business groups
(Gamon, Aue, Corston-Oliver, and Ringger, 2005). Most relevant research on handling
web-based consumer reviews concentrates on sentiment analysis and opinion mining,
which aim to discover reviewers' attitudes, whether positive or negative, concerning a
product as a whole or different components of the product.

Study related to Sentiment Classification Methods: In this section we discuss
various sentiment classification techniques. In the earlier section we explained how an
opinion can be represented in a document in the form of an entity (e) and a sentiment (s).
The problem of sentiment classification can be framed in two ways: if s takes a categorical
value, the task is a classification problem; if s takes a numeric value, it becomes a
regression problem. First we discuss sentiment classification, which means classifying
tagged reviews into polarity classes: positive and negative. Positive polarity implies a
positive orientation of the review, and negative polarity implies a negative orientation.

The basic approach to classifying opinion is to treat the problem as a topic-based
text classification problem; any text classification algorithm can then be applied to
determine the semantic orientation of the tagged reviews, such as Naive Bayes, SVM, or
kNN (Yugowati P.; Shaou-Gang M.; Hui-Ming W., 2013). The orientation can also be
determined using a score function. We discuss three main approaches: Naive Bayes,
Support Vector Machines, and Maximum Entropy. (Bo Pang et al., 2008) proposed a
machine learning strategy that applies text classification procedures to just the subjective
portion of the document, using the following procedure:
(1) label the sentences in the document as either subjective or objective, discarding
the latter; and then
(2) apply a standard machine learning classifier to the resulting extract.
The authors used an efficient and intuitive graph-based formulation relying on finding
minimum cuts. Their experiments involved classifying film reviews as either positive or
negative. To assemble subjective sentences or phrases, the authors gathered 5,000 movie
review snippets from www.rottentomatoes.com, and for objective data they drew on the
Internet Movie Database (www.imdb.com). Both Naive Bayes and SVMs can be trained
on the subjectivity dataset and then used as a basic subjectivity detector.
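
A minimal, self-contained sketch of the basic topic-classification-style approach described above follows; the four labeled reviews are invented, and Naive Bayes is shown simply because it is one of the three approaches discussed (an SVM such as scikit-learn's LinearSVC could be swapped in unchanged).

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

reviews = ["great product, excellent quality",
           "poor build, broke in a week",
           "excellent value, works great",
           "bad experience, disgusting service"]
labels = ["positive", "negative", "positive", "negative"]

# Bag-of-words features feeding a Naive Bayes polarity classifier.
model = make_pipeline(TfidfVectorizer(), MultinomialNB())
model.fit(reviews, labels)

print(model.predict(["excellent quality, works great"]))  # -> ['positive']
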
(N. Kobayashi et al., 2004) experimented with machine learning techniques and suggested
that proper investigation of useful attributes from reviews can help in extracting useful
information from online reviews. The authors conducted experiments with Japanese web
documents, used SVM as the machine learning classifier, and applied a feature selection
method to identify the best attributes for dimensionality reduction.

(Xavier et al., 2011) proposed a deep learning approach which learns to extract a
meaningful representation for each review in an unsupervised fashion. It relies on an
algorithm for finding intermediate representations built in a hierarchical manner. The data
were collected in the form of reviews from Amazon. Their analysis demonstrates that
linear classifiers trained on this higher-level learned feature representation of reviews beat
the current state of the art.

(Andrew L. Maas et al., 2011) exhibited a model that uses a blend of unsupervised
and supervised methods to learn word vectors capturing semantic term-document
information as well as rich sentiment content; the model captures both semantic and
sentiment similarities among words. The authors evaluated this model on document-level
and sentence-level classification tasks in the domain of online movie reviews. For the
document and sentence levels they compared the model's word representations with
several bag-of-words weighting techniques and alternative approaches to inducing word
vectors. For the experiments they used the IMDB review dataset, evaluating classifier
performance after cross-validating classifier parameters on the training set, using a linear
SVM in all cases. Their model showed superior performance to other approaches and
performed best when concatenated with a bag-of-words representation. (Andrew et al.,
2011) likewise performed sentence-level subjectivity classification. For this task a
classifier is trained to decide whether a given sentence is subjective, expressing the
writer's opinions, or objective, expressing purely facts. The authors used the dataset of
(Pang and Lee, 2004), which contains subjective sentences from film review summaries
and objective sentences from movie plot synopses. They randomly split the 10,000
examples into 10 folds and report 10-fold cross-validation accuracy using the SVM
training protocol of Pang and Lee (2004). The authors found that their model provided
superior features when compared against other SVM baselines. (Gizem et al., 2012)
proposed and evaluated new features to be used in a word-polarity-based approach to
sentiment classification.

(Wei Wang et al., 2013) proposed a novel hybrid association rule mining
technique for implicit feature identification in Chinese product reviews. The authors first
extract candidate feature indicators based on word segmentation, part-of-speech tagging
and feature clustering, then compute the co-occurrence degree between the candidate
feature indicators and feature words using five collocation extraction algorithms. For the
experiments, data were crawled from the Chinese shopping site 360buy.com. The authors
designed five rules for implicit feature identification and found that the basic rule
performed best among the five.

(S. Saha et al., 2011) show that it is possible to develop a model using
multi-objective optimization techniques based on genetic algorithms. For the experiments
the authors used BART, a modular toolkit for anaphora resolution that supports
state-of-the-art statistical approaches to the task and enables efficient feature engineering.
The authors evaluated their approach on the ACE-02 dataset, which is divided into three
subsets: bnews, npaper, and nwire. The authors claim that optimizing according to
multiple metrics simultaneously may yield better results with respect to each individual
metric than optimizing according to that metric alone.

(Siddu P. Algur et al., 2010) proposed a technique for identifying spam reviews
in a dataset. Their approach to spam review detection used a conceptual-level similarity
measure. The test results demonstrate that a larger number of duplicate spam reviews are
identified using the conceptual-level similarity measure.

(Ee-Peng Lim et al., 2004) introduced scoring techniques to quantify the degree of
spam for every reviewer. With the labeled spammers, a linear regression model is trained
to predict the number of spam votes from a given reviewer's spamming behaviors, i.e., the
GD, ED, TP, and TG scores. A product review dataset taken from Amazon.com was used
for the experiments. The authors' results show that their proposed procedures are effective
in identifying spammers and outperform other baseline methods.

(C.L. Lai et al., 2010) exhibited a novel review spam detection technique supported
by an unsupervised inferential language modeling framework. In their experiments the
authors represented every review by a TF-IDF vector, and each pair of reviews of a
product class was then compared using the cosine similarity measure. A support vector
machine was also applied to classify untruthful reviews, and all default parameters of the
SVMlight package were used. The researchers' test results demonstrate that the proposed
inferential language model, equipped with high-order concept association knowledge, is
effective in untruthful review detection when compared with other baseline strategies.

(A. Mukherjee et al., 2011) projected an effective procedure to identify review
spam groups. The proposed method can be divided into three stages:
Step 1 - Frequent pattern mining to find candidate groups;
Step 2 - Computing spam indicator values;
Step 3 - Ranking using SVMrank.
Their trials were conducted using a large number of reviewers and reviews of
manufactured products from Amazon.com. The results show that the proposed ranking
method is very efficient and that the rankings replicate people's judgments of spam and
non-spam.

(M. Ott et al.) developed and tested three methodologies for the identification of
deceptive opinion spam - genre identification, psycholinguistic deception detection, and
text categorization - and developed a classifier to detect opinion spam. The researchers
extracted all 6,977 reviews from the 20 most popular Chicago hotels on TripAdvisor and
ranked the three automated methodologies for identifying deceptive opinion spam. The
authors used SVMlight to train their linear SVM models on all three approaches and found
that automated classifiers outperform human judges on every metric except truthful recall,
where JUDGE 2 performs best. The authors achieve nearly 90% accuracy on a held-out
set of gold-standard opinion spam.
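
Loosely following the TF-IDF and cosine-similarity representation used by Lai et al. above, and with made-up reviews, a near-duplicate spam check can be sketched as follows; the 0.8 threshold is an arbitrary assumption.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

reviews = [
    "best phone ever, amazing battery and camera",
    "best phone ever, amazing camera and battery",   # near-duplicate
    "screen cracked after two days, disappointed",
]

sim = cosine_similarity(TfidfVectorizer().fit_transform(reviews))

# Flag distinct review pairs whose similarity exceeds the threshold.
for i in range(len(reviews)):
    for j in range(i + 1, len(reviews)):
        if sim[i, j] > 0.8:
            print(f"possible duplicate spam: review {i} vs review {j}")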

(R.Y.K. Lau et al., 2012) proposed computational models for recognizing fake
reviews. A text mining model is developed and integrated into a semantic language model
for the detection of untruthful reviews. The dataset was collected from Amazon.com and
evaluated against real-world data, with a support vector machine (SVM) classifier used
for training and testing.

(Zheng, L., et al., 2014) presented in their work the significance of customer attitude
towards products. They presented a multidimensional approach for sentiment analysis,
used a sentiment-lexicon approach, and also removed word ambiguity across the different
dimensions. They proposed a new algorithm and conducted experiments on a very large
dataset consisting of approximately 28 million reviews. (Zhang, Yongfeng, et al., 2014)
made use of an Explicit Factor Model (EFM) for a recommendation system and applied
phrase-level sentiment analysis to online user reviews. Their work generated results on
real-world datasets and predicted top-k recommendations, and it also shows that different
features can be useful for different categories of users.

(Zhang, Y., et al., 2015) used phrase-level sentiment analysis on user reviews in a
recommender system. Their approach mainly focused on explicit features in user reviews,
and they used a collaborative-filtering approach in their work. (Vinodhini, G., et al., 2014)
presented a hybrid model using principal component analysis for the classification of
product reviews. They used logistic regression and support vector machines as the
machine-learning methods and showed experimentally that the hybrid model for opinion
mining is more promising. (D'Avanzo, et al., 2015) presented and investigated how online
reviews help purchasing decisions: most buyers consult online reviews before shopping,
and the shopping experience of various buyers can be found in the reviews they post. The
authors presented a cognitively based procedure that mines users' opinions from specific
kinds of markets.

3. SOFTWARE
3.1 PYTHON SOFTWARE

Python's name is derived from the British comedy group Monty Python, whom
Python creator Guido van Rossum enjoyed while developing the language, which was first
released in 1991. Python's design philosophy emphasizes code readability, notably through
its use of significant whitespace. Its language constructs and object-oriented approach aim
to help programmers write clear, logical code for small and large-scale projects.

Python is dynamically typed and garbage-collected. It supports multiple programming
paradigms, including procedural, object-oriented, and functional programming. Python is
often described as a "batteries included" language due to its comprehensive standard
library.

History: Python was conceived in the late 1980s as a successor to the ABC
language. Python 2.0, released in 2000, introduced features such as list comprehensions
and a garbage-collection system capable of collecting reference cycles. Python 3.0,
released in 2008, was a major revision of the language that is not completely backward-
compatible, and much Python 2 code does not run unmodified on Python 3. Due to concern
about the amount of code written for Python 2, support for Python 2.7 (the last release in
the 2.x series) was extended to 2020. Language developer Guido van Rossum shouldered
sole responsibility for the project until July 2018 but now shares leadership as a member of
a five-person steering council.

The Python 2 language, i.e. Python 2.7.x, was sunset on January 1, 2020; the
volunteer Python team will not fix security issues or otherwise improve it after that date.
With this end-of-life, only Python 3.6.x and later are supported.

Python interpreters are available for many operating systems. A global community
of programmers develops and maintains CPython, an open source reference
implementation. A non-profit organization, the Python Software Foundation, manages and
directs resources for Python and CPython development.

Python's large standard library, commonly cited as one of its greatest
strengths, provides tools suited to many tasks. For Internet-facing applications, many
standard formats and protocols such as MIME and HTTP are supported. It includes
modules for creating graphical user interfaces, connecting to relational
databases, generating pseudorandom numbers, arithmetic with arbitrary-precision
decimals, manipulating regular expressions, and unit testing.

Some parts of the standard library are covered by specifications (for example,
the Web Server Gateway Interface (WSGI) implementation wsgiref follows PEP 333), but
most modules are not. They are specified by their code, internal documentation, and test
suites (if supplied). However, because most of the standard library is cross-platform Python
code, only a few modules need altering or rewriting for variant implementations.

3.1.1 PACKAGES
As of March 2018, the Python Package Index (PyPI), the official repository for
third-party Python software, contains over 130,000 packages with a wide range of
functionality, including:

 Graphical user interfaces
 Web frameworks
 Multimedia
 Databases
 Networking
 Test frameworks
 Automation
 Web scraping
 Documentation
 System administration
 Scientific computing
 Text processing
 Image processing

Human vision is unique and superior in that it detects and discriminates the objects
around us with ease. It can perceive 3-D structures with precision and categorize them
efficiently, and the texture of an object is well distinguished by the human eye. Computer
vision, by contrast, is a broad term describing a computer performing the function of the
eye by applying mathematical algorithms to a digital image. Computer vision researchers
have made tremendous progress, and much of their work has been applied practically in
many fields of day-to-day life. Computer vision is precise in its identification and quick in
execution when given good, uncomplicated data. Human visual perception degrades at very
high spatial frequencies due to physical limitations, unlike computer vision, which has no
such constraints. However, computer vision involves intensive processing of huge amounts
of data, which consumes quite a lot of the computer's resources and memory.

When a 2-D image is converted and represented numerically, it becomes a digital
image. It is made up of pixels having spatial coordinates x and y, with the amplitude values
of the function being finite, discrete quantities. Digital images do not possess a uniform
distribution of intensities, and the changes in these intensities form repeated patterns on the
image surface, giving rise to image texture. These patterns may be formed by the roughness
of the real object or by changes in the reflectance of light. Texture is considered a structure
composed of a large number of more or less ordered similar elements or patterns. Texture
analysis is a process in which a class of mathematical algorithms and techniques extracts
the features of the digital image by characterizing the patterns on the image. Texture
analysis is widely used in digital image processing.

The basic types of an image are binary, grayscale and true colour (colour image).
A binary image has only two possible values, 0 or 1, for each pixel. In an 8-bit grayscale
image each pixel is a shade of gray, with values ranging from 0 (black) to 255 (white). In
true colour images (24-bit), each pixel is a representation of different amounts of red, green
and blue; each of the three channels can take 256 possible values, giving each pixel over 16
million possible colours.

Digital image processing refers to the manipulation of digital images to obtain
information and to accentuate or de-emphasize certain details present in the image. It also
involves statistical and other analytical techniques to extract meaningful information from
the image. It is a type of signal processing in which computer algorithms manipulate digital
images and produce an output image or a set of image features.

Spatial resolution is the density of pixels over the image; it determines the smallest
discernible detail in an image. Greater spatial resolution means that more pixels are
involved in displaying the image. The number of gray levels used to represent an image is
called quantization. Convolution is a process by which a mask is moved from pixel to pixel
in an image, and at each pixel a predefined quantity is computed using mathematical
operations. The output image so obtained is of the same size as the input image.
Convolution can be used to implement operators such as spatial filters and feature
detectors, and to achieve image smoothing and image sharpening.
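
A minimal sketch of this sliding-mask computation, written in the correlation form
commonly used for filtering (the 5x5 image and 3x3 mean mask are illustrative):

import numpy as np

def convolve2d(image, mask):
    # Slide the mask over every pixel; zero-pad so the output keeps the input size.
    m, n = mask.shape
    padded = np.pad(image, ((m // 2, m // 2), (n // 2, n // 2)))
    out = np.zeros_like(image, dtype=float)
    for r in range(image.shape[0]):
        for c in range(image.shape[1]):
            # Weighted sum of the neighbourhood under the mask (correlation form).
            out[r, c] = (padded[r:r + m, c:c + n] * mask).sum()
    return out

img = np.arange(25.0).reshape(5, 5)
smooth = convolve2d(img, np.ones((3, 3)) / 9)  # 3x3 mean filter smooths the image
print(smooth.round(1))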

Texture is an areal construct that defines the local spatial organization of spatially
varying spectral values, repeated over a region of larger spatial scale; thus, the perception
of texture is a function of spatial and radiometric scales. The texture of a region can also be
defined as a description of the properties of the pixel pattern attributes in that region.

The texture of an image is made up of repetitive patterns which have different
intensity attributes. The texture of an image is characterized by the types and number of its
primitives and by the spatial organization or arrangement of those primitives. Patterns
formed from large primitives are called macrotextures; within these primitives, further
smaller textures can be found in some images, forming the microtexture. A statistical
approach is used to study microtextures, while a structural approach is employed to study
macrotextures, as it is essential to find the shape and its structural properties.

The similar textural elements that are replicated over the region of an image are
called texels. Texels are the basic elements of textures and they have certain specific
characteristics that are explained below.

1. The texels are oriented in different directions.

2. The texels are of various sizes and show degrees of uniformity.

3. The texels are placed at varying distances and in various directions.

4. The contrast they exhibit shows different magnitudes and variations.

5. Between the texels, a variable background may be exhibited.

6. The variations which form the texture may have varying degrees of randomness versus
regularity.

The texture has intuitive properties of its own, which were summarized by Tuceryan et
al. as given below:

• Texture characterises areas or regions; the texture of a point is undefined. Therefore,
texture can be considered a contextual property, defined in terms of the gray values in a
spatial neighborhood. The size of the spatial neighbourhood depends on the type of texture
and on the size of the primitives which define the texture.

• Since texture involves the spatial distribution of gray levels, the use of co-occurrence
matrices is considered a favourable texture-analysis tool; a sketch follows this list.

• When the texture of an image is observed at various resolutions, it can be perceived at
different scales.

• A region in an image is observed to have texture when the primitives in the region are
numerous. When only a few primitives are present, a group of countable objects is
perceived instead. Therefore, a texture can be observed only when notable individual forms
are absent.
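
To make the co-occurrence idea above concrete, the following minimal sketch
(illustrative only, not part of the project code) builds a normalized gray-level co-occurrence
matrix in NumPy and derives two classic statistics, contrast and energy:

import numpy as np

def glcm(image, levels, dx=1, dy=0):
    # Count pixel pairs (p, q) where q lies at offset (dx, dy) from p.
    P = np.zeros((levels, levels))
    rows, cols = image.shape
    for r in range(rows):
        for c in range(cols):
            r2, c2 = r + dy, c + dx
            if 0 <= r2 < rows and 0 <= c2 < cols:
                P[image[r, c], image[r2, c2]] += 1
    return P / P.sum()  # normalize counts to joint probabilities

img = np.array([[0, 0, 1, 1],
                [0, 0, 1, 1],
                [0, 2, 2, 2],
                [2, 2, 3, 3]])  # a tiny 4-level "image"
P = glcm(img, levels=4)
i, j = np.indices(P.shape)
contrast = ((i - j) ** 2 * P).sum()  # high for rough textures
energy = (P ** 2).sum()              # high for uniform textures
print(contrast, energy)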

Uniformity, density, coarseness, roughness, regularity, linearity, directionality,
direction, frequency and phase are important properties which describe texture. Some of
these properties are not independent: frequency is not independent of density, and the
property of direction applies only to directional textures.

The two significant characteristics of a texture are directionality and coarseness, and
the two prime texture-analysis approaches are statistical and structural. Textures in an
image can be created artificially, or, in the case of natural images, observed in the captured
image. Natural images have naturally occurring patterns or textures, while artificially
created images have textures produced by mechanical influence. The repeated arrangement
of texels with the same intensity values over an area is called spatial frequency, which
describes the spatial distribution of gray values.

Basically, human vision detects the patterns, geometrical forms and variations in an
image, which helps it to identify and determine an object. Human vision can therefore
assess textures in an image only qualitatively; since quantitative assessment of textures is
needed, texture properties must be defined and computed mathematically.

3.1.2 TEXTURE ANALYSIS

Texture analysis is a process which characterises the textural content of an image
using mathematical algorithms or models in order to extract useful information. Spatial
variations in the pixel intensities of the image give rise to the unique repetitive patterns, or
texture, which texture analysis quantifies.

Texture analysis is used to extract features of an image for recognition. The useful
information extracted by texture analysis is then interpreted by various methods for
identification or classification, and the mathematical computations for quantitative texture
analysis are performed through various algorithms.

A region in an image is an area with similar pixel values, computed over a large
neighbourhood. In natural images, homogeneous regions may be surrounded by non-
homogeneous regions with irregular boundaries, and such regions can be studied easily
with segmentation and edge-detection techniques. Non-homogeneous regions have varying
attributes of intensity and colour, which provide the cue for segmented analysis of the
texture. The structural approach is well suited to textures which have an even structure,
with textural primitives large enough to be segmented individually; it represents texture by
well-defined textural elements which appear repeatedly according to placement rules.

In general, texture analysis involves a few broad steps: preprocessing, feature
extraction, texture classification and texture segmentation. Preprocessing is used for noise
attenuation, correction of image orientation and so on. Preprocessing techniques such as
homomorphic filtering, histogram equalization, adaptive histogram equalization, contrast-
limited adaptive histogram equalization and gamma correction are widely used.
Preprocessing is considered to improve the contrast of the image, especially before the
textures are computed.

Feature extraction recognizes and determines a set of unique, well-described
features to characterise a texture. Detecting the perceived qualities of texture in an image is
the first important step towards building mathematical models of texture. The intensity
variations in an image which characterize texture are mostly due to some underlying
physical variation in the scene.

Feature extraction methods for describing the characteristics of texture fall into
different types of models, which include statistical, structural, geometrical, model-based
and signal-processing methods. Since the textures of different objects vary and are
composed of complicated parameters, a variety of approaches is necessary to characterize
and classify them.

3.1.3 SENTIMENT ANALYSIS

The sentiment analysis phase identifies sentiment-based cliques with respect to
various issues or events. The normalized and transformed tweets, in the form of unigrams
produced by the preprocessing phase, are used to identify the sentiments of the tweets.
Twitter Sentiment Analysis (TSA) has been used for several applications, including
product reviews, political-orientation extraction and stock-market prediction. Moreover,
real-time analysis of tweets on various issues has become a strong indicator for analysing
human behaviour and reactions. For example, the tweets generated in the names of the
leading chief-ministerial candidates in Delhi during the declaration of election results
illustrate this: the analysis of such tweets gives strong insight into the popularity of these
persons in the community.

Topical diversity in content, linguistic flexibility in expression and the sheer volume
of tweets are the important challenges in analysing tweets. Topical diversity in content
demands a domain-independent solution for sentiment analysis. We propose an
unsupervised and domain-independent approach that uses the polarity scores from three
lexical resources - SentiWordNet 3.0, SenticNet 2 and SentiSlangNet. SentiWordNet
contains polarity scores of unigrams, while SenticNet 2 is a publicly available semantic
resource which contains commonly used polarity concepts.

Our method exploits SenticNet along with SentiWordNet to analyse Twitter
sentiments. A sentiment lexicon for slang and acronyms, called SentiSlangNet, is also
created using SenticNet. The algorithm has been implemented in the Parallel Python
framework and shown to scale well across multiple cores for large volumes of data. The
following sections discuss the issues in existing approaches and describe the method and
its implementation in the Parallel Python environment to address the scalability issue posed
by the volume of data.
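
A minimal sketch of the unigram polarity-scoring idea, with a tiny hand-made
lexicon standing in for SentiWordNet, SenticNet and SentiSlangNet (the words and scores
below are illustrative assumptions, not the real resources):

# Toy polarity lexicon standing in for SentiWordNet / SenticNet / SentiSlangNet.
LEXICON = {"good": 0.7, "great": 0.9, "bad": -0.7, "terrible": -0.9, "lol": 0.3}

def tweet_polarity(tweet):
    # Average the polarity scores of the unigrams found in the lexicon.
    scores = [LEXICON[w] for w in tweet.lower().split() if w in LEXICON]
    return sum(scores) / len(scores) if scores else 0.0

def label(score):
    return "positive" if score > 0 else "negative" if score < 0 else "neutral"

for t in ["great result lol", "terrible counting delays"]:
    print(t, "->", label(tweet_polarity(t)))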

3.1.4 SYSTEM ARCHITECTURE

The system architecture illustrates the general framework for the sentiment
analysis. The Twitter corpus consists of the n-grams generated from the tweets by the
preprocessing module. The sentiment analyser makes use of three sentiment lexicons
(SentiWordNet, SenticNet and SentiSlangNet) to find the polarity of each tweet and uses
this information to generate cliques based on the sentiments on each issue.

4. ANALYSIS
4.1 CODING
#!/usr/bin/env python
# coding: utf-8

# In[1]:

import numpy as np
import pandas as pd
import itertools
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import PassiveAggressiveClassifier
from sklearn.metrics import accuracy_score, confusion_matrix

# In[2]:

# Load the dataset from a local CSV file and inspect its shape and first rows
df=pd.read_csv('D:\\DATA FLAIR\\news.csv')
df.shape
df.head()

# In[3]:

# The 'label' column holds the FAKE/REAL target values
labels=df.label
labels.head()

# In[4]:

# Hold out 20% of the articles for testing; random_state makes the split reproducible
x_train,x_test,y_train,y_test=train_test_split(df['text'],labels,test_size=0.2, random_state=7)

# In[5]:

# Learn the TF-IDF vocabulary on the training text only, then apply it to the test text;
# English stop words and terms appearing in over 70% of documents are discarded
tfidf_vectorizer=TfidfVectorizer(stop_words='english', max_df=0.7)
tfidf_train=tfidf_vectorizer.fit_transform(x_train)
tfidf_test=tfidf_vectorizer.transform(x_test)

# In[6]:

# Train the Passive Aggressive classifier and report accuracy on the held-out set
pac=PassiveAggressiveClassifier(max_iter=50)
pac.fit(tfidf_train,y_train)
y_pred=pac.predict(tfidf_test)
score=accuracy_score(y_test,y_pred)
print(f'Accuracy: {round(score*100,2)}%')

# In[7]:

# Confusion matrix: rows are true labels, columns are predictions (order: FAKE, REAL)
confusion_matrix(y_test,y_pred, labels=['FAKE','REAL'])

4.2 ADVANCED PYTHON PROJECT - DETECTING FAKE NEWS

Do you trust all the news you hear from social media? Not all of it is real, right?
So how can you detect the fake news? The answer is Python. By practicing this advanced
Python project of detecting fake news, you will easily be able to tell real news from fake.
Before moving ahead in this advanced Python project, get familiar with the terms related to
it, such as fake news, TfidfVectorizer and PassiveAggressiveClassifier.

4.2.1 WHAT IS A TFIDFVECTORIZER?

TF (Term Frequency): the number of times a word appears in a document is its term
frequency. A higher value means a term appears more often than others, so the document is
a good match when the term is part of the search terms.
IDF (Inverse Document Frequency): words that occur many times in one document, but
also occur many times in many other documents, may be irrelevant. IDF is a measure of
how significant a term is in the entire corpus.
The TfidfVectorizer converts a collection of raw documents into a matrix of TF-IDF
features.
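
As a quick illustration, the sketch below fits a TfidfVectorizer on three made-up
documents (get_feature_names_out assumes scikit-learn 1.0 or later):

from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["fake news spreads fast",
        "real news is verified",
        "fake accounts spread fake stories"]
vec = TfidfVectorizer()
X = vec.fit_transform(docs)          # sparse matrix, one row per document
print(vec.get_feature_names_out())   # the learned vocabulary
print(X.toarray().round(2))          # TF-IDF weight of each term per document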

4.2.2 WHAT IS A PASSIVE AGGRESSIVE CLASSIFIER?

Passive Aggressive algorithms are online learning algorithms. Such an algorithm
remains passive on a correct classification outcome and turns aggressive in the event of a
misclassification, updating and adjusting its weights. Unlike most other algorithms, it does
not converge; its purpose is to make updates that correct the loss while causing very little
change in the norm of the weight vector.
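
Because the algorithm is online, it can also be trained incrementally with
partial_fit. The loop below is a sketch under the assumption that labelled examples arrive
in mini-batches; the data is synthetic:

import numpy as np
from sklearn.linear_model import PassiveAggressiveClassifier

rng = np.random.default_rng(0)
pac = PassiveAggressiveClassifier()
classes = np.array([0, 1])
for step in range(5):
    # Synthetic mini-batch: class-1 points are shifted away from class-0 points.
    y = rng.integers(0, 2, size=20)
    X = rng.normal(size=(20, 2)) + 2 * y[:, None]
    pac.partial_fit(X, y, classes=classes)  # aggressive update only on mistakes
# Evaluate on fresh points drawn around the class-1 centre.
print(pac.score(rng.normal(size=(50, 2)) + 2, np.ones(50, dtype=int)))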

Detecting Fake News with Python – About the Python Project

This advanced Python project of detecting fake news deals with fake and real news.
Using sklearn, we build a TfidfVectorizer on our dataset. Then, we initialize a
PassiveAggressiveClassifier and fit the model. In the end, the accuracy score and the
confusion matrix tell us how well our model fares.

4.3 INSTALL PACKAGES
4.3.1 WHAT IS NUMPY?
NumPy is a general-purpose array-processing package. It provides a high-
performance multidimensional array object and tools for working with these arrays.
It is the fundamental package for scientific computing with Python. It contains various
features, including these important ones:

 A powerful N-dimensional array object
 Sophisticated (broadcasting) functions
 Tools for integrating C/C++ and Fortran code
 Useful linear algebra, Fourier transform, and random number capabilities

Besides its obvious scientific uses, NumPy can also be used as an efficient
multi-dimensional container of generic data. Arbitrary data types can be defined, which
allows NumPy to integrate seamlessly and speedily with a wide variety of databases.
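
A short sketch of the N-dimensional array and broadcasting (the values are
illustrative):

import numpy as np

a = np.arange(6).reshape(2, 3)  # 2x3 array: [[0 1 2], [3 4 5]]
b = np.array([10, 20, 30])
print(a + b)                    # broadcasting adds b to every row
print(a.mean(axis=0))           # column means: [1.5 2.5 3.5]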

Installation:
 Windows does not have a package manager analogous to those in Linux or macOS.
Please download the pre-built Windows installer for NumPy (according to your system
configuration and Python version) and then install the package manually.
 Alternatively, to install NumPy with pip, open a command prompt and type:
 pip install numpy

To import numpy module in python


>>> import numpy

4.3.2 WHAT IS PANDAS?

Pandas is a popular Python package for data science, and with good reason: it offers
powerful, expressive and flexible data structures that make data manipulation and analysis
easy, among many other things. The DataFrame is one of these structures.
To install pandas in Python, open a command prompt and type:
pip install pandas
To import the pandas module in Python:
>>> import pandas
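
A minimal DataFrame sketch (the columns mimic the news dataset used later; the
rows are made-up examples):

import pandas as pd

df = pd.DataFrame({"title": ["story A", "story B"],
                   "label": ["FAKE", "REAL"]})
print(df.shape)                    # (2, 2)
print(df["label"].value_counts())  # count of each label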
4.3.3 WHAT IS ITERTOOLS?
The Python itertools module is a collection of tools for handling iterators. Simply
put, iterators are data types that can be stepped through in a for loop; the most common
iterable in Python is the list. itertools is part of the Python standard library, so it requires no
separate installation.
To import the itertools module in Python:
>>> import itertools
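
A tiny illustration of two such tools, count/islice and chain (the values are
arbitrary):

import itertools

evens = (2 * n for n in itertools.count())  # infinite generator: 0, 2, 4, ...
print(list(itertools.islice(evens, 5)))     # take the first 5 items: [0, 2, 4, 6, 8]
print(list(itertools.chain("ab", [1, 2])))  # concatenate iterables: ['a', 'b', 1, 2]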

4.3.4 WHAT IS SKLEARN?

Scikit-learn is a library in Python that provides many unsupervised and supervised
learning algorithms. It's built upon some of the technology you might already be familiar
with, like NumPy, pandas, and Matplotlib!

The functionality that scikit-learn provides includes (a minimal sketch follows the list):

 Regression, including Linear and Logistic Regression
 Classification, including K-Nearest Neighbors
 Clustering, including K-Means and K-Means++
 Model selection
 Preprocessing, including Min-Max Normalization
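
As an illustration of the typical estimator workflow (fit on a training split, score on
a test split), here is a minimal sketch using the iris dataset bundled with scikit-learn; the
model choice is arbitrary:

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=7)
clf = LogisticRegression(max_iter=200).fit(X_tr, y_tr)  # fit on the train split
print(clf.score(X_te, y_te))                            # accuracy on held-out data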

4.4 INTERPRETATION

Follow the steps below for detecting fake news and complete your first advanced
Python project; each step is illustrated with the corresponding lines from the listing in
Section 4.1.

1. Make the necessary imports:
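
import numpy as np
import pandas as pd
import itertools
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import PassiveAggressiveClassifier
from sklearn.metrics import accuracy_score, confusion_matrix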

2. Now, let’s read the data into a DataFrame, and get the shape of the data and the first 5
records.
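
df=pd.read_csv('D:\\DATA FLAIR\\news.csv')
df.shape
df.head()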

3. And get the labels from the DataFrame.
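
labels=df.label
labels.head()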

4. Split the dataset into training and testing sets.
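
x_train,x_test,y_train,y_test=train_test_split(df['text'],labels,test_size=0.2, random_state=7)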

5. Let’s initialize a TfidfVectorizer with stop words from the English language and a
maximum document frequency of 0.7 (terms with a higher document frequency will be
discarded). Stop words are the most common words in a language that are to be filtered out
before processing the natural language data. And a TfidfVectorizer turns a collection of raw
documents into a matrix of TF-IDF features.
Now, fit and transform the vectorizer on the train set, and transform the vectorizer on the
test set.
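
tfidf_vectorizer=TfidfVectorizer(stop_words='english', max_df=0.7)
tfidf_train=tfidf_vectorizer.fit_transform(x_train)
tfidf_test=tfidf_vectorizer.transform(x_test)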

6. Next, we'll initialize a PassiveAggressiveClassifier and fit it on tfidf_train and y_train.

Then, we’ll predict on the test set from the TfidfVectorizer and calculate the accuracy with
accuracy_score() from sklearn.metrics.
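
pac=PassiveAggressiveClassifier(max_iter=50)
pac.fit(tfidf_train,y_train)
y_pred=pac.predict(tfidf_test)
score=accuracy_score(y_test,y_pred)
print(f'Accuracy: {round(score*100,2)}%')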

7. We got an accuracy of 93.05% with this model. Finally, let’s print out a confusion matrix
to gain insight into the number of false and true negatives and positives.
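
confusion_matrix(y_test,y_pred, labels=['FAKE','REAL'])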

5. CONCLUSION
Fake news is an important challenge that plays out over social and computational
infrastructure. In this paper, we have proposed multiple hypotheses related to three different
characteristics of fake news: origin, proliferation and linguistic tone. The hypotheses were
tested using statistical methods on the FakeNewsNet dataset collected via Twitter. The
results suggest the following: 1) fake news is not published on popular websites, but rather
by lesser-known media outlets or websites; 2) fake news is proliferated more by unverified
users than by verified users; and 3) fake news stories are written in a specific linguistic
tone, though it is inconclusive which one (negative, positive or neutral). These results
expand the understanding of fake news as a phenomenon and motivate future work, which
includes expanding the study to additional hypotheses, and designing and developing a
multifarious fusion model to better classify given news as fake or legitimate.

6. BIBLIOGRAPHY
Sadia Afroz, Michael Brennan, and Rachel Greenstadt. Detecting hoaxes, frauds, and
deception in writing style online. In ISSP'12.
Hunt Allcott and Matthew Gentzkow. Social media and fake news in the 2016 election.
Technical report, National Bureau of Economic Research, 2017.
Solomon E Asch and H Guetzkow. Effects of group pressure upon the modification and
distortion of judgments. Groups, leadership, and men, pages 222–236, 1951.
Meital Balmas. When fake news becomes real: Combined exposure to multiple news
sources and political attitudes of inefficacy, alienation, and cynicism. Communication
Research, 41(3):430–454, 2014.
Michele Banko, Michael J Cafarella, Stephen Soderland, Matthew Broadhead, and Oren
Etzioni. Open information extraction from the web. In IJCAI'07.
Alessandro Bessi and Emilio Ferrara. Social bots distort the 2016 US presidential election
online discussion. First Monday, 21, 2016.
Prakhar Biyani, Kostas Tsioutsiouliklis, and John Blackmer. "8 amazing secrets for getting
more clicks": Detecting clickbaits in news streams using article informality. In AAAI'16.
Jonas Nygaard Blom and Kenneth Reinecke Hansen. Click bait: Forward-reference as lure
in online news headlines. Journal of Pragmatics, 76:87–100, 2015.
Paul R Brewer, Dannagal Goldthwaite Young, and Michelle Morreale. The impact of real
news about fake news: Intertextual processes and political satire. International Journal of
Public Opinion Research, 25:323–343, 2013.
Carlos Castillo, Mohammed El-Haddad, Jurgen Pfeffer, and Matt Stempeck.
Characterizing the life cycle of online news stories using social media reactions. In
CSCW'14.
Carlos Castillo, Marcelo Mendoza, and Barbara Poblete. Information credibility on
Twitter. In WWW'11.
Abhijnan Chakraborty, Bhargavi Paranjape, Sourya Kakarla, and Niloy Ganguly. Stop
clickbait: Detecting and preventing clickbaits in online news media. In ASONAM'16.
