
Computing the Linguistic-Based Cues of Fake News in the Philippines Towards its Detection

Aaron Carl T. Fernandez
Mapua University
Manila, Philippines
actfernandez@mymail.mapua.edu.ph

Dr. Madhavi Devaraj
Mapua University
Manila, Philippines
mdevaraj@mapua.edu.ph

ABSTRACT

Fake news deliberately beguiles its consumers into accepting false or biased ideologies. But as menacing as it is to society, it also poses a good classification problem for computer science. This paper presents the disparity in writing style between legitimate news and fake news in the Philippines, and how effective these disparities are as machine learning features. To support this, credible news samples as well as fake news samples from the Philippines were harvested online. The best feature set was consistent across all experiments and attained a precision of 94% on both the feature selection process and the test set of the final model. It also attained a precision of 93% on more recently collected data that was not part of the initial corpus. Furthermore, this paper shows how legitimate and fake news in the Philippines can be differentiated by their headline or content alone, at a precision of 87% and 88%, respectively.

KEYWORDS

Fake News; Natural Language Processing; Feature Selection; Naïve Bayes; Logistic Regression; Support Vector Machines

1 Introduction

Humans are instinctively poor at differentiating between legitimate and fake news. Findings over the years have been consistent that humans are only 4% better than the 50% base chance level at distinguishing truth from lies in text [1].

This vulnerability is exploited by some politicians and their political parties as a cheap mechanism to manipulate electoral outcomes and people's opinions about certain topics [2], [3]. For example, several media reports suggested that Donald Trump would not have been elected president were it not for the influence of fake news [4], [5], [6]. The two biggest false stories during the 2016 US elections were in favor of Trump [7]. These amassed over 30 million Facebook shares [7], compared to the overall 8 million fake news shares favoring Hillary Clinton.

Similarly, President Rodrigo Duterte of the Philippines has been acclaimed to have won the 2016 Philippine elections on the back of a misleading depiction of the Philippines as a "narco-state" [8]. According to the United Nations Office on Drugs and Crime, the prevalence of drug use in the Philippines is lower than the global average [8]. However, this false narco-state message was used to justify more than 7,000 extra-judicial killings, earning Duterte an immense popularity and notoriety that won the hearts and minds of most Filipino voters [8].

This paper is limited to Philippine English news only. The primary reason is that there are few lexical resources such as dictionaries, lexicons, and part-of-speech taggers for Tagalog, much less for other Philippine native languages [9]. Moreover, constructing such resources requires the collective and sustained effort of many natural language processing researchers [9].

Despite this limitation, the study remains significant, as most print media such as newspapers and magazines in the Philippines are predominantly in English [10]. It has also been found that there is no significant difference between American news and Philippine English news [11], owing to the tendency of Filipino news writers to treat American English as the standard and to adhere to the international journalism conventions shaped by American and European journalists [11].

2 Background of the study

The "Fake News Challenge – FNC-1" was one of the first initiatives of computer science in solving the "fake news" problem. The UCL Machine Reading team, who placed third in the competition [12], derived their model from Davis and Proctor's initial work on the competition's "Fake News Challenge Dataset" [13]. Both represented their news samples as a bag-of-words representation of the term frequency (TF) vectors of the headline and the news body, together with their term frequency-inverse document frequency (TF-IDF) cosine similarity, which was then fed into a Multi-Layer Perceptron (MLP).

The objective of the whole competition, however, was only to identify the stance of the news content with respect to its headline, not to determine whether the news is fake. Perhaps a better way is to look for hints that a fake news writer may unconsciously leave in their writing style, which could enable a machine to discover the critical patterns that let it discriminate between fake and real.

Such "linguistic-based cues" are what Zhou, et al. sought in [14] to distinguish deceptive from truthful messages. In their experiment, liars used more words, verbs, and noun phrases, were less diversified at both the content and the lexical level, and used less punctuation.

These were some of the linguistic cues that Yang, et al. in [15] combined with their proposed "visual cues", such as the resolution of the images and the number of faces in them. Their model was a concatenated "textual" and "visual" convolutional neural network (CNN), and they found that the visual cues gave only a slight improvement over the textual cues, given that their textual-only CNN, Long Short-Term Memory, and Gated Recurrent Unit models performed almost as well as their proposed ensemble textual-and-visual CNN model.
This study delves deeper into the power of these linguistic-based cues for the task of fake news detection by coming up with a parsimonious list of linguistic features, investigating which of these have a significant impact on the task at hand, and measuring how well they discriminate when used with machine learning algorithms.

3 Methodology

3.1 Corpus Construction

The construction of this experiment's dataset, the Philippine Fake News Corpus¹, followed the guidelines Rubin, et al. described in [16]. The credible news samples were obtained from three national broadsheets in the Philippines, namely The Philippine Daily Inquirer, Manila Bulletin, and The Manila Times. Further bolstering the credibility of these sources, these are the newspapers available in the National Library of the Philippines².

For the negative samples, this study referred to the exact list presented during the proposition of Senate Bill No. 1492, the anti-fake news act in the Philippines³. The list is consistent with the Center for Media Freedom and Responsibility / National Union of Journalists of the Philippines (CMFR/NUJP)⁴ and the Catholic Bishops' Conference of the Philippines (CBCP)⁵ lists of news websites that carry fake or unverified content.

¹ https://github.com/aaroncarlfernandez/Philippine-Fake-News-Corpus
² http://web.nlp.gov.ph/nlp/?q=node/8270
³ http://verafiles.org/articles/aquino-list-shows-many-fake-news-sites-bear-dutertes-name
⁴ https://cmfr-phil.org/in-context/knowing-your-source-think-before-you-click/
⁵ http://www.cbcplaiko.org/2017/01/31/appendix-ii-partial-list-of-web-news-blog-sites-in-the-philippines-with-fake-or-unverified-content/

At the time of data extraction, most of the identified fake news websites were no longer available. The remaining cited fake news sources from which the fake news samples were extracted are: Adobo Chronicles, GR Pundit, Get Real Philippines, VerifiedPH, Pinoy Trending Altervista, Pinoy Trending News, Thinking Pinoy, Duterte Today, Pinoy News Blogger, Pilipinas Online Updates, Hot News Philippines, News Media Philippines, and Philippine News Courier.

Only hard news under the "Nation" category was taken into the corpus. This aligns the news genre of the samples in the dataset and ensures that article lengths are consistent across all individual data points.

Both [17] and [18] affirmed that the emergence of fake news in the Philippines started in the run-up to the 2016 Philippine Presidential Elections. Thus, it is only logical to pin the collection of news samples to the start of the campaign period set by Philippine law, which was February 9, 2016. Having said that, indirect campaigning in the form of propaganda and fake news peddling started as early as late 2015 [18]. To capture this, the timeframe over which the news articles were collected was January 1, 2016 to October 30, 2018.

The news articles were scraped using BeautifulSoup [19]. The details collected for each news article were the Headline, News Content, Authors, Date, URL, and News Source, but only the Headline and the News Content were considered the important fields of each article. Any news item with missing, corrupted, or duplicate values in these fields was dropped. All other fields were included only for auditing purposes and to ease investigation during the experiments.

The constructed Philippine Fake News Corpus contains 14,802 legitimate news samples and 7,656 fake news samples, all of which had their Headline and News Body cleaned of any attached bylines or other forms of metadata (e.g. "Like Share Subscribe", "Like us on Facebook", "Manila, Philippines - ") that may have been included during scraping. Data cleaning was done with regular expressions (RegEx), which can extract information from any type of text given the patterns to find or their positions.
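To make this cleaning step concrete, the following is a minimal Python sketch of the kind of RegEx-based stripping described above. The patterns and the clean_body helper are illustrative assumptions by the editor, not the authors' actual expressions.

```python
import re

# Hypothetical byline/metadata patterns: the paper does not publish its exact
# expressions, so these only illustrate the kind of cleaning it describes.
BOILERPLATE_PATTERNS = [
    r"^MANILA,\s*Philippines\s*-\s*",    # dateline prefix
    r"Like\s+us\s+on\s+Facebook.*$",     # social-media call-outs
    r"Like\s+Share\s+Subscribe.*$",
    r"^\s*By\s+[A-Z][\w.\s]+$",          # attached bylines on their own line
]

def clean_body(text: str) -> str:
    """Strip bylines and metadata left over from scraping a news body."""
    for pattern in BOILERPLATE_PATTERNS:
        text = re.sub(pattern, "", text, flags=re.IGNORECASE | re.MULTILINE)
    # Collapse the whitespace left behind by the removals.
    return re.sub(r"\s{2,}", " ", text).strip()

print(clean_body("MANILA, Philippines - The senator filed the bill today. Like us on Facebook!"))
# -> "The senator filed the bill today."
```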
3.2 Feature Engineering

A total of 76 linguistic features were computed for this study. These were further sorted into 8 broad categories, namely: Readability Scores, Linguistic Dimensions, Summative Cues, Affective Cues, Informality Cues, Cognitive Cues, Punctuation Cues, and Time-Orientation Cues (see Table 1).
Table 1: Linguistic features computed and their respective categories.

Readability Scores
Flesch Kincaid Grade: see Equation 1 [20]
Flesch Reading Ease: see Equation 2 [21]
Coleman Liau Index: see Equation 3 [22]
Automated Readability Index: see Equation 4 [23]
Dale Chall Readability Score: see Equation 5 [24]
Gunning Fog Index: see Equation 6 [25]
SMOG Grading: see Equation 7 [26]

Linguistic Dimensions
Word Count: no. of words
Syllables Count: no. of syllables
Sentences Count: no. of sentences
Words per Sentence: total words / total sentences
Long Words Count: no. of words with > 6 letters
Difficult Words Count: no. of words with > 3 syllables
Type Token Ratio: no. of unique words / total no. of words
No. of Words in All Caps: no. of words in all uppercase
Function Words %: percentage of the total no. of pronouns, prepositions, articles, conjunctions, and auxiliary verbs
Pronouns %: percentage of the total no. of personal pronouns, first-person singular pronouns, etc.
Personal Pronouns %: Examples: I, we, she
First-Person Singular %: Examples: I, me
First-Person Plural %: Examples: we, us
Second Person %: Examples: you, your
Third-Person Singular %: Examples: she, he, her
Third-Person Plural %: Examples: they, them
Impersonal Pronouns %: Examples: it, that, anything
Articles %: Examples: a, an, the
Prepositions %: Examples: below, all, much
Auxiliary Verbs %: Examples: have, did, are
Common Adverbs %: Examples: just, usually, even
Conjunctions %: Examples: until, so, and
Negations %: Examples: no, never, not
Common Verbs %: Examples: run, walk, climb
Common Adjectives %: Examples: enormous, silly, fun
Comparisons %: Examples: after, better, great
Interrogatives %: Examples: what, how, why
Concrete Figures %: Examples: 100, 250, 453.12
Direct Quotes %: percentage of the total no. of words, phrases, or sentences inside quotes
Quantifiers %: Examples: many, much, few
Dictionary Words %: no. of words in a news article that are in the default LIWC2015 dictionary [27]

Summative Cues
Analytical thinking: percentage of categorical language use (articles and prepositions), which is associated with better academic performance across all four years of college according to [28]
Clout: percentage of word usage that shows status, power, dominance, and prestige [29]
Authenticity: percentage of usage of first-person singular pronouns, third-person pronouns, negative emotion words, exclusive words, and motion verbs [30]
Emotional tone: summative variable for positive and negative word usage, wherein a higher percentage means a more positive text and a percentage below 50% denotes a more negative tone [31]

Affective Cues
Affective Processes: Examples: cried, happy
Positive Emotion: Examples: sweet, nice, good
Achievement: Examples: success, better, win
Negative Emotion: Examples: ugly, nasty, hurt
Anxiety: Examples: fearful, worried
Anger: Examples: killed, annoyed, hate
Sadness: Examples: grief, depressed, crying

Cognitive Cues
Cognitive Processes: Examples: know, ought, cause
Insight: Examples: think, thought, knew
Causation: Examples: since, effect, because
Discrepancy: Examples: would, should
Tentative: Examples: perhaps, maybe
Fillers: Examples: you know, I mean
Certainty: Examples: never, always
Differentiation: Examples: but, else, hasn't

Informality Cues
Informal Language: total no. of swear words, sexual words, slang, and non-fluencies
Swear Words: Examples: shit, damn, fuck
Internet Slang: Examples: LOL, LMAO, thx, btw
Sexual Words: Examples: incest, horny, vagina
Non-fluencies: Examples: uhm, hm, er

Time-Orientation Cues
Past Focus: Examples: talked, did, ago
Present Focus: Examples: now, is, today
Future Focus: Examples: soon, will, may

Punctuation Cues
Punctuations %: percentage of the total number of ".", ",", ":", ";", "?", etc. in a text
Periods %: percentage of the total no. of "."
Commas %: percentage of the total no. of ","
Colons %: percentage of the total no. of ":"
Semicolons %: percentage of the total no. of ";"
Question Marks %: percentage of the total no. of "?"
Exclamation Marks %: percentage of the total no. of "!"
Dashes %: percentage of the total no. of "-"
Apostrophes %: percentage of the total no. of "'"
Parentheses %: percentage of the total no. of "(" and ")"
Other Punctuations %: percentage of the total no. of "$", "%", "#", etc.

The readability formulas referenced in Table 1 are:

FKG = 0.39 (total words / total sentences) + 11.8 (total syllables / total words) - 15.59   (1)

FRE = 206.835 - 1.015 (total words / total sentences) - 84.6 (total syllables / total words)   (2)

CLI = 0.0588 (characters per 100 words) - 0.296 (sentences per 100 words) - 15.8   (3)

ARI = 4.71 (total characters / total words) + 0.5 (total words / total sentences) - 21.43   (4)

DCRS = 0.1579 (difficult words / total words × 100) + 0.0496 (total words / total sentences)   (5)

GFI = 0.4 (total words / total sentences) + 0.0496 (difficult words / total words)   (6)

SMOG = 1.043 × sqrt((words with > 3 syllables) × (30 / total sentences)) + 3.1291   (7)

The readability scores and all other features that do not require a dictionary, such as the word count, syllable count, words per sentence, and type-token ratio, were implemented by the authors in Python using both regular expressions and the Natural Language Toolkit⁶.

⁶ https://www.nltk.org/
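As an illustration of how these non-dictionary features can be computed, here is a minimal Python sketch of Equation 1 (the Flesch-Kincaid Grade) and the type-token ratio using NLTK. The tokenizers and the naive vowel-group syllable counter are the editor's assumptions, since the paper does not publish its exact implementation.

```python
import re
import nltk  # requires the "punkt" tokenizer data: nltk.download("punkt")

def count_syllables(word: str) -> int:
    # Rough estimate: count vowel groups (an assumption, not the authors' exact counter).
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def flesch_kincaid_grade(text: str) -> float:
    # Equation 1: FKG = 0.39 (words / sentences) + 11.8 (syllables / words) - 15.59
    sentences = nltk.sent_tokenize(text)
    words = [w for w in nltk.word_tokenize(text) if w.isalpha()]
    syllables = sum(count_syllables(w) for w in words)
    return 0.39 * (len(words) / len(sentences)) + 11.8 * (syllables / len(words)) - 15.59

def type_token_ratio(text: str) -> float:
    # Linguistic Dimension feature: unique words over total words.
    words = [w.lower() for w in nltk.word_tokenize(text) if w.isalpha()]
    return len(set(words)) / len(words)
```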
The Linguistic Inquiry and Word Count 2015 software by Pennebaker, et al. [32] was used to extract all dictionary-based features, such as the numbers of function words, pronouns, articles, and punctuation marks, as well as the psycholinguistic features (summative cues, affective cues, cognitive cues, informality cues, and time-orientation cues) of the compiled news data. The works of [28], [29], [30], and [31] attest that the LIWC 2015 software can effectively reflect the underlying psychology of various demographic characteristics, social contexts, and group dynamics. Thus, it can be considered an important text analysis tool for applied natural language processing [32].
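LIWC2015 itself is proprietary, so the following is only a toy illustration of how a dictionary-based percentage feature of this kind is computed. The tiny word lists are hypothetical stand-ins, not the actual LIWC categories.

```python
import re

# Hypothetical category word lists for illustration only; the real LIWC2015
# dictionary is proprietary and far larger.
TOY_DICTIONARY = {
    "anger": {"killed", "annoyed", "hate"},
    "anxiety": {"fearful", "worried"},
    "present_focus": {"now", "is", "today"},
}

def category_percentages(text: str) -> dict:
    """Return each category's share of total words, mimicking LIWC-style output."""
    words = re.findall(r"[a-z']+", text.lower())
    total = len(words) or 1
    return {cat: 100.0 * sum(w in vocab for w in words) / total
            for cat, vocab in TOY_DICTIONARY.items()}

print(category_percentages("He is annoyed and worried now."))
```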
All 76 features in Table 1 were extracted from both the Headline (except the readability scores) and the News Content, resulting in a total of 145 features. The readability scores were left out of the Headline computation because these formulas were designed for texts of more than 100 words; they would not give an accurate measure for short texts like news headlines.

3.3 Features and Parameters Selection

Three learning algorithms were used in this experiment, namely Gaussian Naïve Bayes, Logistic Regression, and Linear Support Vector Machines. The best-performing features for each machine learning model were determined using the Recursive Feature Elimination (RFE) algorithm. The optimal C-value for the Logistic Regression and Support Vector Machine models was determined using the Grid Search algorithm.

In the feature and parameter selection process of this study, the Recursive Feature Elimination algorithm was embedded inside the Grid Search algorithm, resulting in a nested 3-fold cross-validation. In this process, each fold of the Grid Search is further split into three folds for the RFE algorithm to determine the optimal feature set for each parameter value being evaluated. The evaluation score is the accuracy averaged over all folds, and the constructed model with the highest score is always selected. This prevents the information leakage and overfitting that occur when the same data is used both to evaluate the feature/parameter selection and to build the final model [33]. This process is referred to as "RFE+GS" in the succeeding sections of this paper.

The Gaussian Naïve Bayes classifier did not go through the Grid Search algorithm since it has no parameters to tune.
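A minimal scikit-learn sketch of how RFE can be embedded inside Grid Search to obtain this nested 3-fold cross-validation is shown below. The use of RFECV, the LinearSVC base estimator, and the candidate C grid are assumptions made for illustration, not the authors' exact code.

```python
from sklearn.feature_selection import RFECV
from sklearn.model_selection import GridSearchCV
from sklearn.svm import LinearSVC

# Inner loop: RFECV runs its own 3-fold CV to pick the best feature subset
# for a given value of C.
selector = RFECV(estimator=LinearSVC(max_iter=10000), step=1, cv=3, scoring="accuracy")

# Outer loop: Grid Search runs 3-fold CV over the candidate C values, so the
# whole procedure is a nested 3-fold cross-validation (the paper's "RFE+GS").
rfe_gs = GridSearchCV(
    estimator=selector,
    param_grid={"estimator__C": [0.01, 0.1, 1, 10]},  # assumed grid of C values
    cv=3,
    scoring="accuracy",
)
# rfe_gs.fit(X_selection, y_selection)      # the first half of the balanced corpus
# rfe_gs.best_estimator_.support_           # boolean mask of the selected features
```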
3.4 Building the final model

The best-performing feature set and C-value determined in Section 3.3 were used to build the final model. Due to the uneven distribution of legitimate and fake news samples in the constructed Philippine Fake News Corpus, the legitimate news articles were under-sampled to match the 7,656 fake news samples. This produced an even distribution of 15,312 legitimate and fake news samples.

This balanced set was further divided 50:50, giving 7,656 evenly distributed legitimate and fake news samples per split. The first split was used in the feature and parameter selection described in Section 3.3 (RFE+GS), while the second half was used in building the final model. In this way, the final model is built on data that has not been seen by the feature and parameter selection process (RFE+GS).

In building the final model, the data was divided into a 60:40 train and test set. In addition, another set of data was harvested from the same sources mentioned in Section 3.1 but over a different timeframe (January 2019), to evaluate how well the models classify news that is not part of the Philippine Fake News Corpus.

The new dataset has 609 legitimate news samples and 85 fake news samples and is referred to as "Jan2019_data". The legitimate news samples were again under-sampled during the experiments to match the number of fake news samples, resulting in 170 evenly distributed news samples.
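A hedged sketch of this sampling and splitting scheme, assuming the corpus is held in a pandas DataFrame with a label column (an assumed name) and one column per linguistic feature, could look like the following.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

def build_splits(df: pd.DataFrame, seed: int = 42):
    # Under-sample the legitimate articles to match the number of fake samples.
    fake = df[df["label"] == "fake"]
    legit = df[df["label"] == "legit"].sample(n=len(fake), random_state=seed)
    balanced = pd.concat([legit, fake]).sample(frac=1.0, random_state=seed)  # shuffle

    # 50:50 split: one half for RFE+GS, the other half for the final model.
    selection_half, final_half = train_test_split(
        balanced, test_size=0.5, stratify=balanced["label"], random_state=seed)

    # The final-model half is further split 60:40 into train and test sets.
    train, test = train_test_split(
        final_half, test_size=0.4, stratify=final_half["label"], random_state=seed)
    return selection_half, train, test
```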
4 Results and Discussion

4.1 Disparity between Legitimate News and Fake News in the Philippines

The central tendencies (medians) of the features in Table 1 were computed from the constructed Philippine Fake News Corpus to investigate the differences between legitimate news from credible news sources in the Philippines (n = 14,802) and news from the fake news sources cited by the Philippine Senate, CMFR/NUJP, and CBCP (n = 7,656). Wilcoxon rank-sum tests were also used to confirm whether the following findings are statistically significant (a minimal sketch of this test is given after the list):

• Fake news headlines have more words than legitimate news headlines (Leg. x̃ = 7, Fake x̃ = 11, Z = -75.6, p < .001). Having said that, legitimate news content has more words than fake news content (Leg. x̃ = 372, Fake x̃ = 203, Z = 49.37, p < .001).

• Legitimate news also has more sentences in its content than fake news (Leg. x̃ = 18, Fake x̃ = 11, Z = 37.83, p < .001).

• Fake news content is slightly angrier than legitimate news (Leg. x̃ = 0.34, Fake x̃ = 0.46, Z = -2.87, p = .004), based on the usage of words tagged as "angry" in the LIWC2015 dictionary [32].

• Legitimate news content focuses more on the past than fake news (Leg. x̃ = 4.78, Fake x̃ = 3.40, Z = 41.23, p < .001). On the other hand, fake news content focuses more on the present than legitimate news (Leg. x̃ = 4.27, Fake x̃ = 6.74, Z = -60.44, p < .001). This is based on the time-orientation cues scored by the LIWC software [32] and suggests that legitimate news tends to report only events that have actually happened, in contrast to present-centric fake news.

• Legitimate news in the Philippines requires a higher education level to read than fake news, as confirmed by the following readability scores: Flesch-Kincaid Grade (Leg. x̃ = 13.51, Fake x̃ = 12.05, Z = 38.57, p < .001), Dale-Chall Readability Score (Leg. x̃ = 10.96, Fake x̃ = 9.17, Z = 56.87, p < .001), Gunning Fog Index (Leg. x̃ = 14.27, Fake x̃ = 12.69, Z = 39.19, p < .001), and SMOG (Leg. x̃ = 12.42, Fake x̃ = 11.07, Z = 42.53, p < .001).
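The following is a minimal SciPy sketch of the rank-sum comparison used for the findings above; the toy input values are illustrative only and are not the corpus statistics.

```python
from scipy.stats import ranksums

def compare_feature(legit_values, fake_values):
    """Wilcoxon rank-sum test between the two classes for one linguistic feature."""
    statistic, p_value = ranksums(legit_values, fake_values)
    return statistic, p_value

# Toy example with made-up headline word counts (not the corpus data):
z, p = compare_feature([7, 8, 6, 7, 9], [11, 12, 10, 13, 11])
print(f"Z = {z:.2f}, p = {p:.4f}")
```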
4.2 The Linguistic-Based Cues for Automated Fake News Detection

A total of 25 experiments were run for this study. The first three sets of experiments concentrated on the use of headline and content features and were fitted using Gaussian Naïve Bayes, Logistic Regression, and Support Vector Machines.

After these experiments, it was observed that the Logistic Regression and Support Vector Machine models were at par in terms of prediction quality, but the latter was more computationally expensive and took longer to train. Hence, SVM was dropped in the succeeding set of experiments, wherein each of the linguistic categories in Table 1 (Readability Scores, Linguistic Dimensions, Summative Cues, Affective Cues, Informality Cues, Cognitive Cues, Punctuation Cues, and Time-Orientation Cues) was tried out on its own to see how well it could classify fake and real news without relying on features from the other categories.

4.2.1 Best Feature Set. The first experiment set fed all 145 combined headline and content features into the RFE+GS algorithm for each of the learning algorithms. Table 2 enumerates the best feature set determined by the RFE+GS algorithm for this experiment set, while the optimal C-values determined for the Logistic Regression and Support Vector Machine models were C = 1 and C = 0.1, respectively. The accuracies achieved during this experiment are detailed in Table 3, under the "Both Headline and Content Features" sub-section. This feature set is referred to as "headline + content" in the succeeding discussions of this paper.

The Logistic Regression and Support Vector Machine models both outperformed the Gaussian Naïve Bayes model, attaining an accuracy of 94% on both the RFE+GS and the test set of the final model. Both models were also able to classify the news samples in the "Jan2019_data" equally well, at a precision of 93%.
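For reference, here is a hedged scikit-learn sketch of how the accuracy, precision, and Cohen's kappa values reported in Table 3 could be computed for one model; the synthetic data merely stands in for the selected feature matrix and labels.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, precision_score, cohen_kappa_score

# Synthetic stand-in for the RFE+GS feature subset and the 60:40 split of Section 3.4.
rng = np.random.default_rng(0)
X_train, y_train = rng.normal(size=(200, 10)), rng.integers(0, 2, 200)
X_test, y_test = rng.normal(size=(80, 10)), rng.integers(0, 2, 80)

model = LogisticRegression(C=1, max_iter=1000).fit(X_train, y_train)  # C = 1 as reported above
y_pred = model.predict(X_test)

print("accuracy :", accuracy_score(y_test, y_pred))
print("precision:", precision_score(y_test, y_pred))   # precision of the positive (fake) class
print("kappa    :", cohen_kappa_score(y_test, y_pred))
```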
4.2.2 Prominent Features across all experiments. The exact feature sets determined by the RFE+GS for the succeeding 22 experiments are not detailed in this paper for brevity.
However, most of the features, if not all, that the RFE+GS selected in these experiments revolved around the same features listed in Table 2.

As an alternative, all features determined by the RFE+GS across all 25 experiments were summarized as tag clouds in Figure 1 and Figure 2.

Based on Figure 1, the most common features across all experiments were computed from the news content. All features rendered in a large font size in Figure 1 were chosen by the RFE+GS algorithm 8 times across the 25 experiments in this study.

Overall, the most common feature selected by the RFE+GS algorithm, regardless of whether it came from the headline or the news content, was "Exclamation Marks %". This feature was chosen 14 times in the 25 experiments, followed by "Long Words Count", "Words per Sentence", "Analytical thinking", "Commas %", and "Semicolons %", which were selected 13 times, and "Function Words %", "Syllables Count", "Type Token Ratio", "Prepositions %", "Question Marks %", and "Apostrophes %", which were selected 12 times.
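A small sketch of how such selection frequencies can be tallied before rendering them as a tag cloud is given below; the per-experiment lists and feature names are assumed placeholders.

```python
from collections import Counter

# Assumed structure: one list of selected feature names per experiment, as
# returned by RFE+GS in each of the 25 runs.
selected_per_experiment = [
    ["content: Exclamation Marks %", "content: Long Words Count", "content: Words per Sentence"],
    ["headline: Word Count", "content: Exclamation Marks %", "content: Commas %"],
    # ... one list for each of the remaining runs
]

selection_counts = Counter(f for run in selected_per_experiment for f in run)
for feature, count in selection_counts.most_common(5):
    print(f"{feature}: selected in {count} of {len(selected_per_experiment)} experiments")
# A tag-cloud library (e.g. the wordcloud package) would scale font sizes by these counts.
```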
4.2.3 Classifying Fake vs. Real just by its Headline or its Content. The Logistic Regression model can differentiate legitimate news and fake news just by their headline or their content, at a precision of 87% and 88%, respectively. It is worth noting that the Support Vector Machine was almost as good, except that it fell short in classifying the headlines in the "Jan2019_data": it only reached a precision of 80% on this data set, which is sub-par compared to its performance on the final model test set, which was 86%.

4.2.4 Classifying power of each Linguistic-based cue category. The results in Table 2 and Table 3 suggest that the "Linguistic Dimensions" features have the strongest classification power among all the features proposed in Table 1.

Much of the accuracy and precision of the headline + content features can be attributed to this category, which achieved a precision of 93% on the test set of the final model even without the reinforcement of any feature from the other categories. Having said that, it did not achieve the same consistency as the headline + content feature set, as its performance on the "Jan2019_data" suggests.

Although these "Linguistic Dimensions" features have strong predictive power for classifying real and fake news articles on their own, their correlation with features from the other categories provides the stability they lack, as the consistent performance of the headline + content features on the RFE+GS, the final model, and the "Jan2019_data" implies.

The other linguistic categories performed satisfactorily, although not anywhere close to the "Linguistic Dimensions"; all still performed better than chance level.
4.2.5 Correlations among features. The superior performance of the Logistic Regression and Support Vector Machine models compared to the Gaussian Naïve Bayes model suggests that there is a high correlation among the proposed features.

Naïve Bayes classifiers consider their features to be conditionally independent [34]. Naïve Bayes is a generative model that predicts the posterior probability P(y|X) by computing the joint distribution of the features X and the target class y [34]. In other words, it calibrates each feature's weight autonomously according to how much that feature corresponds to the target class. This results in poor predictions when the model is fed features that are highly dependent on each other.

On the other hand, discriminative models like Logistic Regression and Linear Support Vector Machines set all the feature weights collectively, producing a linear decision function that is drawn high or low for the positive or negative class, respectively [34]. The weights or coefficients assigned to each feature are determined by optimizing the conditional likelihood, resulting in a discriminative training that can offset the problems that arise when conditional independence among features is wrongly assumed [34].
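The following small synthetic experiment is the editor's illustration (not from the paper) of this effect: when two features share most of their variance and only their difference carries the class signal, Gaussian Naïve Bayes degrades while a discriminative model such as Logistic Regression does not.

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
n = 2000
y = rng.integers(0, 2, n)
shared_noise = rng.normal(scale=3.0, size=n)            # variance common to both features
x1 = shared_noise
x2 = shared_noise + y + rng.normal(scale=0.3, size=n)   # the class signal lives in x2 - x1
X = np.column_stack([x1, x2])

# Naive Bayes scores each correlated feature independently and misses the signal;
# Logistic Regression weighs the features jointly and recovers it.
print("GNB:", cross_val_score(GaussianNB(), X, y, cv=3).mean())
print("LR :", cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=3).mean())
```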
Table 2: The best headline + content feature set determined by the Recursive Feature Elimination algorithm for each of the learning algorithms used.

Gaussian Naïve Bayes
Headline Features
Linguistic Dimensions: Word Count, Syllables Count, Difficult Words Count, Words per Sentence, Articles %, No. of Words in All Caps
Content Features
Readability Scores: Flesch Kincaid Grade, Flesch Reading Ease, Dale-Chall Readability Score, Gunning Fog Index, SMOG
Linguistic Dimensions: Words per Sentence, Analytical thinking, Long Words Count, Type Token Ratio, Pronouns %, Impersonal Pronouns %, Common Adverbs %
Informality Cues: Informal Language
Time-Orientation Cues: Present Focus
Punctuation Cues: Exclamation Marks %, Apostrophes %

Logistic Regression
Headline Features
Linguistic Dimensions: Syllables Count, Sentences Count, Words per Sentence, Long Words Count, Difficult Words Count, Type Token Ratio, No. of Words in All Caps, Function Words %, Articles %, Prepositions %, Negations %, Common Adjectives %, Comparisons %, Interrogatives %, Direct Quotes %, Dictionary Words %
Summative Cues: Analytical thinking
Affective Cues: Anxiety
Cognitive Cues: Discrepancy, Tentative, Differentiation
Informality Cues: Informal Language
Punctuation Cues: Periods %, Commas %, Semicolons %, Question Marks %, Exclamation Marks %, Apostrophes %, Other Punctuations %
Content Features
Readability Scores: Flesch Kincaid Grade, Dale Chall Readability Score, Gunning Fog Index, SMOG
Linguistic Dimensions: Sentences Count, Long Words Count, Type Token Ratio, No. of Words in All Caps, Dictionary Words %, Function Words %, Pronouns %, First-Person Plural %, Second Person %, Impersonal Pronouns %, Prepositions %, Common Adverbs %, Conjunctions %, Negations %, Common Adjectives %, Interrogatives %, Concrete Figures %, Quantifiers %, Direct Quotes %
Summative Cues: Analytical thinking, Authenticity
Affective Cues: Affective Processes, Positive Emotion, Anxiety
Cognitive Cues: Insight, Discrepancy, Tentative, Certainty
Informality Cues: Informal Language, Swear Words, Internet Slang, Sexual Words
Time-Orientation Cues: Past Focus, Present Focus
Punctuation Cues: Commas %, Colons %, Semicolons %, Question Marks %, Exclamation Marks %, Dashes %, Apostrophes %, Parentheses %

Support Vector Machines
Headline Features
Linguistic Dimensions: Word Count, Syllables Count, Difficult Words Count, Words per Sentence, Long Words Count, No. of Words in All Caps, Function Words %, Dictionary Words %, Articles %, Prepositions %, Negations %, Comparisons %, Interrogatives %, Direct Quotes %
Summative Cues: Analytical thinking
Informality Cues: Informal Language
Punctuation Cues: Periods %, Commas %, Semicolons %, Question Marks %, Exclamation Marks %, Apostrophes %, Other Punctuations %
Table 3: Accuracy, precision, and kappa of each model on the RFE+GS, the final model, and the Jan2019_data.

Model   RFE+GS (accuracy)   Final model (accuracy)   Final model (precision, kappa)   Jan2019_data (precision, kappa)

Both Headline and Content Features
GNB     87.79%              89.78%                   90%, k = .796                    85%, k = .682
LR      93.93%              94.38%                   94%, k = .888                    93%, k = .859
SVM     94.06%              94.42%                   94%, k = .888                    93%, k = .859

Headline Features Only
GNB     83.31%              83.64%                   84%, k = .673                    78%, k = .541
LR      86.68%              86.35%                   87%, k = .727                    84%, k = .671
SVM     86.72%              86.19%                   86%, k = .724                    80%, k = .6

Content Features Only
GNB     77.38%              78.75%                   80%, k = .575                    82%, k = .553
LR      87.85%              88.34%                   88%, k = .767                    90%, k = .8
SVM     87.72%              88.31%                   88%, k = .766                    89%, k = .776

Readability Scores Only
GNB     65.19%              67.16%                   68%, k = .343                    64%, k = .271
LR      69.08%              69.80%                   70%, k = .396                    74%, k = .471

Linguistic Dimensions Only
GNB     87.37%              87.79%                   88%, k = .756                    84%, k = .682
LR      92.02%              92.59%                   93%, k = .852                    89%, k = .788

Summative Cues Only
GNB     62.03%              62.91%                   66%, k = .259                    67%, k = .282
LR      66.22%              66.47%                   68%, k = .330                    72%, k = .4

Affective Cues Only
GNB     58.44%              59.84%                   60%, k = .197                    57%, k = .129
LR      58.48%              60.20%                   60%, k = .204                    58%, k = .153

Cognitive Cues Only
GNB     63.78%              64.54%                   67%, k = .292                    65%, k = .224
LR      64.13%              65.16%                   65%, k = .304                    67%, k = .329

Informality Cues Only
GNB     61.29%              61.64%                   69%, k = .234                    67%, k = .224
LR      63.21%              63.14%                   64%, k = .263                    67%, k = .329

Time-Orientation Cues Only
GNB     70.30%              70.06%                   71%, k = .401                    76%, k = .494
LR      70.27%              69.24%                   69%, k = .385                    74%, k = .471

Punctuation Cues Only
GNB     71.36%              72.22%                   77%, k = .445                    74%, k = .376
LR      79.70%              80.57%                   81%, k = .612                    78%, k = .553

Figure 1: A tag cloud of the headline and content features determined by the RFE+GS across all 25 experiments.

Figure 2: A tag cloud of all the features determined by the RFE+GS across all 25 experiments, regardless of whether a feature was used in the headline or the news content.

5 Conclusion

There are statistically significant differences between legitimate news and fake news in the Philippines in terms of word count, sentence count, verb tenses used, and readability scores.
While most of the features proposed in this paper could discriminate between fake and real news considerably well, the "Linguistic Dimensions" features were the strongest and can be considered the most essential feature set.

The combined headline + content feature set performed best and was consistent across the feature selection process, final model construction, and validation on new data. However, there could be other linguistic features not covered in this paper, such as grammatical and spelling scores, which may further improve the performance of these linguistic-based models.

Another interesting extension of this work is to investigate how simpler textual representations, such as a token count matrix or a normalized TF or TF-IDF matrix, would fare against the proposed models when used on the same dataset constructed in this study.

REFERENCES

[1] J. P. Blair, T. R. Levine, and A. S. Shaw (2010). Content in Context Improves Deception Detection Accuracy. Human Communication Research. 36(3), 423-442. DOI: https://doi.org/10.1111/j.1468-2958.2010.01382.x.
[2] Pay to sway: report reveals how easy it is to manipulate elections with fake news (2017). https://www.theguardian.com/media/2017/jun/13/fake-news-manipulate-elections-paid-propaganda. Accessed: 2018-11-08.
[3] The Fake News Machine: How Propagandists Abuse the Internet and Manipulate the Public (2017). https://documents.trendmicro.com/assets/white_papers/wp-fake-news-machine-how-propagandists-abuse-the-internet.pdf. Accessed: 2018-11-08.
[4] Facebook Fake-News Writer: "I Think Donald Trump is in the White House Because of Me." (2017). https://www.washingtonpost.com/news/the-intersect/wp/2016/11/17/facebook-fake-news-writer-i-think-donald-trump-is-in-the-white-house-because-of-me/?utm_term=.8189afb168f6. Accessed: 2018-11-08.
[5] Click and Elect: how fake news helped Donald Trump win a real election (2016). https://www.theguardian.com/commentisfree/2016/nov/14/fake-news-donald-trump-election-alt-right-social-media-tech-companies. Accessed: 2018-11-08.
[6] Donald Trump Won Because of Facebook (2016). http://nymag.com/intelligencer/2016/11/donald-trump-won-because-of-facebook.html. Accessed: 2018-11-08.
[7] Read all about it: The biggest fake news stories of 2016 (2016). https://www.cnbc.com/2016/12/30/read-all-about-it-the-biggest-fake-news-stories-of-2016.html. Accessed: 2018-11-08.
[8] A. Yee (2017). Post-Truth Politics & Fake News in Asia. Global Asia. 12(2), 67-71.
[9] R. Raga (2016). Reflections on the Awareness and Progress of Natural Language Processing (NLP) Research in the Philippines. Philippine Computing Journal. 11(1), 1-9.
[10] K. Bolton and M. L. S. Bautista (2008). Philippine English: Linguistic and Literary Perspectives (Asian Englishes Today). Hong Kong University Press, HKU.
[11] L. Gustilo (2002). A Contrastive Analysis of American English and Philippine English News Leads. Philippine Journal of Linguistics. 33(2), 53-66.
[12] B. Riedel, I. Augenstein, G. P. Spithourakis, and S. Riedel (2017). A simple but tough-to-beat baseline for the Fake News Challenge stance detection task. arXiv preprint arXiv:1707
[13] R. Davis and C. Proctor (2017). Fake News, Real Consequences: Recruiting Neural Networks for the Fight Against Fake News. https://web.stanford.edu/class/cs224n/reports/2761239.
[14] L. Zhou, J. Burgoon, J. F. Nunamaker, and D. Twitchell (2004). Automating Linguistics-Based Cues for Detecting Deception in Text-Based Asynchronous Computer-Mediated Communication. Group Decision and Negotiation. 13(1), 81-106. DOI: https://doi.org/10.1023/B:GRUP.0000011944.62889.6f.
[15] Y. Yang, L. Zheng, J. Zhang, Q. Cui, Z. Li, and P. S. Yu (2018). TI-CNN: Convolutional Neural Networks for Fake News Detection. arXiv preprint arXiv:1806.00749.
[16] V. Rubin, Y. Chen, and N. J. Conroy (2015). Deception Detection for News: Three Types of Fakes. In Proceedings of the 78th ASIS&T Annual Meeting: Information Science with Impact: Research in and for the Community. 52(1), 1-4. DOI: https://doi.org/10.1002/pra2.2015.145052010083.
[17] J. C. Ong and J. V. Cabanes (2018). Architects of Networked Disinformation. The Newton Tech4Dev Network. University of Leeds, Leeds, UK.
[18] M. D. Labiste (2017). Journalists, Bishops Battle Fake News. Asian Policy & Politics. 9(4), 697-700. DOI: https://doi.org/10.1111/aspp.12348.
[19] L. Richardson (2007). Beautiful Soup. Available: https://www.crummy.com/software/BeautifulSoup/
[20] J. P. Kincaid, R. P. Fishburne, R. L. Rogers, and B. S. Chissom (1975). Derivation of New Readability Formulas (Automated Readability Index, Fog Count and Flesch Reading Ease Formula) for Navy Enlisted Personnel. Naval Air Station Memphis, Millington, TN, USA.
[21] R. Flesch (1979). How to Write Plain English. New York: Harper & Row.
[22] M. Coleman and T. L. Liau (1975). A computer readability formula designed for machine scoring. Journal of Applied Psychology. 60(2), 283-284. DOI: 10.1037/h0076540.
[23] R. J. Senter and E. A. Smith (1967). Automated Readability Index. Wright-Patterson Air Force Base, p. iii. AMRL-TR-6620.
[24] J. S. Chall and E. Dale (1995). Readability Revisited: The New Dale-Chall Readability Formula. Cambridge, Mass.: Brookline Books.
[25] R. Gunning (1968). The Technique of Clear Writing. New York, USA: McGraw-Hill.
[26] G. H. McLaughlin (1969). SMOG grading: A new readability formula. Journal of Reading. 12(8), 639-646.
[27] Y. R. Tausczik and J. W. Pennebaker (2010). The Psychological Meaning of Words: LIWC and Computerized Text Analysis Methods. Journal of Language and Social Psychology. 29(1), 24-54. DOI: 10.1.1.470.9621.
[28] J. W. Pennebaker, C. K. Chung, J. Frazee, G. M. Lavergne, and D. I. Beaver (2014). When Small Words Foretell Academic Success: The Case of College Admissions Essays. PLoS ONE. 9(12). DOI: 10.1371/journal.pone.0115844.
[29] E. Kacewicz, J. W. Pennebaker, M. Davis, M. Jeon, and A. C. Graesser (2014). Pronoun Use Reflects Standings in Social Hierarchies. Journal of Language and Social Psychology. 33(2), 125-143. DOI: https://doi.org/10.1177/0261927X13502654.
[30] M. L. Newman, J. W. Pennebaker, D. S. Berry, and J. M. Richards (2003). Lying words: predicting deception from linguistic styles. Personality & Social Psychology Bulletin. 29(5), 665-675. DOI: https://doi.org/10.1177/0146167203029005010.
[31] M. A. Cohn, M. R. Mehl, and J. W. Pennebaker (2004). Linguistic markers of psychological change surrounding September 11, 2001. Psychological Science. 15, 687-693. DOI: https://doi.org/10.1111/j.0956-7976.2004.00741.x.
[32] J. W. Pennebaker, R. J. Booth, and M. F. Francis (2007). Linguistic Inquiry and Word Count: LIWC [Computer Software]. Austin, TX: liwc.net.
[33] G. C. Cawley and N. L. C. Talbot (2010). On over-fitting in model selection and subsequent selection bias in performance evaluation. Journal of Machine Learning Research. (11), 2079-2107.
[34] A. Y. Ng and M. I. Jordan (2001). On discriminative vs. generative classifiers: a comparison of logistic regression and naïve Bayes. In Proceedings of the 14th International Conference on Neural Information Processing Systems: Natural and Synthetic. DOI: 10.1007/s11063-008-9088-7.
