Introduction to Data Analysis:
Examples of Research Studies
Tomás Baviera
Universitat Politècnica de València
October 2019
Is it possible to quantify something as complex as personality?
1978: NEO (3 factors)
1985: NEO-PI (5 factors)
Participants and their Likes are represented as a matrix, where entries are set to
1 if there exists an association between a participant and a Like and 0 otherwise
(second panel). The matrix is used to fit five LASSO linear regression models
(16), one for each self-rated Big Five personality trait (third panel). A 10-fold
cross-validation is applied to avoid overfitting. The models are built on participants
having at least 20 Likes.
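The modelling step described in this caption can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' code: the data are synthetic, the coordinate-descent solver, the penalty grid `alphas` and the threshold `MIN_LIKES = 2` (a toy stand-in for the 20-Like minimum) are all hypothetical choices, and only one trait is fitted. Repeating the fit once per trait would give the five models mentioned above.

```python
import random

def soft_threshold(value, penalty):
    """Shrink value toward zero by penalty; this is what zeroes out weak Likes."""
    if value > penalty:
        return value - penalty
    if value < -penalty:
        return value + penalty
    return 0.0

def fit_lasso(X, y, alpha, n_sweeps=200):
    """Minimise (1/2n) * sum((y - b0 - X w)^2) + alpha * sum(|w|) by coordinate descent."""
    n, p = len(X), len(X[0])
    b0, w = sum(y) / n, [0.0] * p
    residual = [yi - b0 for yi in y]
    col_sq = [sum(row[j] ** 2 for row in X) / n for j in range(p)]
    for _ in range(n_sweeps):
        for j in range(p):
            if col_sq[j] == 0.0:
                continue  # no training participant has this Like
            # Correlation of Like j with the residual once its own effect is added back.
            rho = sum(X[i][j] * residual[i] for i in range(n)) / n + col_sq[j] * w[j]
            new_wj = soft_threshold(rho, alpha) / col_sq[j]
            step = new_wj - w[j]
            if step:
                residual = [r - row[j] * step for r, row in zip(residual, X)]
                w[j] = new_wj
        shift = sum(residual) / n  # the intercept is not penalised
        b0 += shift
        residual = [r - shift for r in residual]
    return b0, w

def predict(model, row):
    b0, w = model
    return b0 + sum(wj * xj for wj, xj in zip(w, row))

def cv_mse(X, y, alpha, k=10, seed=0):
    """Mean squared error over k held-out folds (10-fold by default)."""
    order = list(range(len(X)))
    random.Random(seed).shuffle(order)
    squared_error = 0.0
    for f in range(k):
        test_idx = order[f::k]
        held_out = set(test_idx)
        train_idx = [i for i in order if i not in held_out]
        model = fit_lasso([X[i] for i in train_idx], [y[i] for i in train_idx], alpha)
        squared_error += sum((y[i] - predict(model, X[i])) ** 2 for i in test_idx)
    return squared_error / len(X)

# Toy data (hypothetical): 80 participants, 6 Likes; the trait depends on Likes 0 and 3.
rng = random.Random(42)
likes = [[rng.randint(0, 1) for _ in range(6)] for _ in range(80)]
trait = [1.5 * row[0] - 1.0 * row[3] + rng.gauss(0, 0.1) for row in likes]

MIN_LIKES = 2  # toy stand-in for the 20-Like minimum in the text
kept = [i for i, row in enumerate(likes) if sum(row) >= MIN_LIKES]
X = [likes[i] for i in kept]
y = [trait[i] for i in kept]

alphas = [0.01, 0.1, 1.0]  # hypothetical penalty grid
best_alpha = min(alphas, key=lambda a: cv_mse(X, y, a))
intercept, weights = fit_lasso(X, y, best_alpha)
print(best_alpha, [round(wj, 2) for wj in weights])
```

The cross-validation loop is what guards against overfitting here: the penalty is chosen by its error on participants the model did not see during fitting.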
Computer-based personality judgment accuracy (y axis), plotted against the number of Likes
available for prediction (x axis).
The red line represents the average accuracy of computers’ judgment across the
five personality traits. The five-trait average accuracy of human judgments is
positioned onto the computer accuracy curve. For example, the accuracy of an
average human individual (r = 0.49) is matched by that of the computer models
based on around 90–100 Likes. The gray ribbon represents the 95% CI.
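As a rough illustration of the accuracy measure on the y axis, the sketch below assumes that r is the Pearson correlation between a judgment (here a computer prediction) and the participant's self-rated score. The score lists are invented for the example and do not come from the study.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation between two equal-length lists of numbers."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical scores for five participants on one trait.
self_rated = [2.0, 3.5, 4.0, 1.5, 3.0]
computer_judgment = [2.4, 3.1, 3.8, 2.0, 3.3]

accuracy = pearson_r(self_rated, computer_judgment)
print(round(accuracy, 2))
```

Computing this value per trait and averaging over the five traits would give one point on a curve like the one described in the caption.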
The external validity of personality judgments and self-ratings across the range of life outcomes,
expressed as correlation (continuous variables; Upper) or AUC (dichotomous variables; Lower).
The red, yellow, and blue bars indicate the external validity of self-ratings, human
judgments, and computer judgments, respectively. For example, self-rated scores
allow predicting network size with accuracy of r = 0.23, human judgments achieve
r = 0.17 accuracy (or 0.06 less than self-ratings). Compound variables (variables
represented across a few subvariables) are marked with an asterisk.
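For the dichotomous outcomes, AUC can be read as the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative one. The sketch below computes it directly from that definition; the function name and the example scores are illustrative assumptions, not taken from the study.

```python
def auc(scores, labels):
    """Area under the ROC curve via pairwise comparison; ties count as half a win."""
    positives = [s for s, label in zip(scores, labels) if label == 1]
    negatives = [s for s, label in zip(scores, labels) if label == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative case")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))

# Hypothetical personality-based scores and a yes/no life outcome.
example_scores = [0.1, 0.4, 0.35, 0.8]
example_labels = [0, 0, 1, 1]
print(auc(example_scores, example_labels))  # 3 of 4 positive/negative pairs are ordered correctly: 0.75
```

An AUC of 0.5 corresponds to chance, which is why it serves as the baseline for the lower panel in the same way that r = 0 does for the upper one.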
European Commission - Press release
A Europe that protects: EU reports on progress in fighting disinformation ahead of European Council
Brussels, 14 June 2019
Today the Commission and the High Representative report on the progress achieved in the fight against disinformation and the main lessons drawn from the European elections, as a contribution to the discussions by EU leaders next week.
Protecting our democratic processes and institutions from disinformation is a major challenge for societies across the globe. To tackle this, the EU has demonstrated leadership and put in place a robust framework for coordinated action, with full respect for European values and fundamental rights. Today's joint Communication sets out how the Action Plan against Disinformation and the Elections Package have helped to fight disinformation and preserve the integrity of the European Parliament elections.
High Representative/Vice President Federica Mogherini, Vice-President for the Digital Single Market Andrus Ansip, Commissioner for Justice, Consumers and Gender Equality Věra Jourová, Commissioner for the Security Union Julian King, and Commissioner for the Digital Economy and Society Mariya Gabriel said in a joint statement:
“The record high turnout in the European Parliament elections has underlined the increased interest of citizens in European democracy. Our actions, including the setting-up of election networks at national and European level, helped in protecting our democracy from attempts at manipulation.
We are confident that our efforts have contributed to limit the impact of disinformation operations, including from foreign actors, through closer coordination between the EU and Member States. However, much remains to be done. The European elections were not after all free from disinformation; we should not accept this as the new normal. Malign actors constantly change their strategies. We must strive to be ahead of them. Fighting disinformation is a common, long-term challenge for EU institutions and Member States.
Ahead of the elections, we saw evidence of coordinated inauthentic behaviour aimed at spreading divisive material on online platforms, including through the use of bots and fake accounts. So online platforms have a particular responsibility to tackle disinformation. With our active support, Facebook, Google and Twitter have made some progress under the Code of Practice on disinformation. The latest monthly reports, which we are publishing today, confirm this trend. We now expect online platforms to maintain momentum and to step up their efforts and implement all commitments under the Code.”
While it is still too early to draw final conclusions about the level and impact of disinformation in the recent European Parliament elections, it is clear that the actions taken by the EU – together with
How can we tell a real account
from an automated one?
Structure: 1,150 numerical data points associated with the account, which represent its behaviour
[...] identified a large, representative sample of users by monitoring a Twitter stream, accounting for approximately 10% of public tweets, for 3 months starting in October 2015. This approach avoids known biases of other methods such as snowball and breadth-first sampling, which rely on the selection of an initial group of users (Gjoka et al. 2010; Morstatter et al. 2013). We focus on English speaking users as they represent the largest group on Twitter (Mocanu et al. 2013).
To restrict our sample to recently active users, we introduce the further criteria that they must have produced at least 200 tweets in total and 90 tweets during the three-month observation window (one per day on average). Our final sample includes approximately 14 million user accounts that meet both criteria. For each of these accounts, we collected their tweets through the Twitter Search API. We restricted the collection to the most recent 200 tweets and 100 mentions of each user, as described earlier. Owing to Twitter API limits, this greatly improved our data collection speed. This choice also reduces the response time of our service and API. However the limitation adds noise to the features, due to the scarcity of data available to compute them.

Manual Annotations
We computed classification scores for each of the active accounts using our initial classifier trained on the honeypot dataset. We then grouped accounts by their bot scores, allowing us to evaluate our system across the spectrum of human and bot accounts without being biased by the distribution of bot scores. We randomly sampled 300 accounts from each bot-score decile. The resulting balanced set of 3000 accounts were manually annotated by inspecting their public Twitter profiles. Some accounts have obvious flags, such as using a stock profile image or retweeting every message of another account within seconds. In general, however, there is no simple set of rules to assess whether an account is human or bot. With the help of four volunteers, we analyzed profile appearance, content produced and retweeted, and interactions with other users in terms of retweets and mentions. Annotators were not given a precise set of instructions to perform the classification task, but rather shown a consistent number of both positive and negative examples. The final decisions reflect each annotator’s opinion and are restricted to: human, bot, or undecided. Accounts labeled as undecided were eliminated from further analysis.
We annotated all 3000 accounts. We will refer to this set of accounts as the manually annotated data set. Each annotator was assigned a random sample of accounts from each decile. We enforced a minimum 10% overlap between annotations to assess the reliability of each annotator. This yielded an average pairwise agreement of 75% and moderate inter-annotator agreement (Cohen’s κ = 0.41). We also computed the agreement between annotators and classifier outcomes, assuming that a classification score above 0.5 is interpreted as a bot. This resulted in an average pairwise agreement of 79% and a moderately high Cohen’s κ = 0.5. These results suggest high confidence in the annotation process, as well as in the agreement between annotations and model predictions.

Figure 1: ROC curves of models trained and tested on different datasets. Accuracy is measured by AUC.

Evaluating Models Using Annotated Data
To evaluate our classification system trained on the honeypot dataset, we examined the classification accuracy separately for each bot-score decile of the manually annotated dataset. We achieved classification accuracy greater than 90% for the accounts in the (0.0, 0.4) range, which includes mostly human accounts. We also observe accuracy above 70% for scores in the (0.8, 1.0) range (mostly bots). Accuracy for accounts in the grey-area range (0.4, 0.8) fluctuates between 60% and 80%. Intuitively, this range contains the most challenging accounts to [...]
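The two agreement measures reported in the excerpt, pairwise agreement and Cohen's κ, can be computed as follows. This is a minimal sketch: the function names and the example labels are illustrative assumptions, and only the 0.5 threshold for reading a classifier score as "bot" comes from the text.

```python
from collections import Counter

def pairwise_agreement(labels_a, labels_b):
    """Fraction of accounts on which two raters give the same label."""
    return sum(a == b for a, b in zip(labels_a, labels_b)) / len(labels_a)

def cohens_kappa(labels_a, labels_b):
    """Agreement corrected for the agreement expected by chance."""
    n = len(labels_a)
    observed = pairwise_agreement(labels_a, labels_b)
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    expected = sum(counts_a[label] * counts_b[label]
                   for label in set(counts_a) | set(counts_b)) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1.0 - expected)

def label_from_score(score, threshold=0.5):
    """Interpret a classifier score above the threshold as a bot."""
    return "bot" if score > threshold else "human"

# Hypothetical overlap set: one annotator's labels and the classifier's scores.
annotator = ["bot", "bot", "human", "human"]
classifier = [label_from_score(s) for s in [0.9, 0.3, 0.2, 0.1]]

print(pairwise_agreement(annotator, classifier))  # 3 of 4 labels match: 0.75
print(cohens_kappa(annotator, classifier))        # (0.75 - 0.5) / (1 - 0.5) = 0.5
```

κ is lower than raw agreement because part of the matching labels would occur even if both raters labelled at random with the same label frequencies.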
Figure 2: Distribution of classifier score for human and bot
accounts in the two datasets.
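The sampling scheme described earlier (keep recently active accounts, then draw the same number of accounts from each bot-score decile) amounts to binning accounts by score and sampling within each bin. The sketch below assumes equal-width deciles over the [0, 1] score range, which the excerpt does not confirm. The function names and the toy scores are hypothetical.

```python
import random
from collections import defaultdict

def is_active(total_tweets, window_tweets):
    """Activity criteria from the text: at least 200 tweets overall and 90 in the window."""
    return total_tweets >= 200 and window_tweets >= 90

def decile_of(score):
    """Map a bot score in [0, 1] to a bin index 0..9 (assumes equal-width bins)."""
    return min(int(score * 10), 9)

def sample_per_decile(scores, per_decile, seed=0):
    """scores maps account id -> bot score; returns decile -> list of sampled ids."""
    rng = random.Random(seed)
    buckets = defaultdict(list)
    for account, score in scores.items():
        buckets[decile_of(score)].append(account)
    return {
        decile: rng.sample(sorted(ids), min(per_decile, len(ids)))
        for decile, ids in sorted(buckets.items())
    }

# Toy data: 100 hypothetical accounts with scores spread evenly over [0, 1].
toy_scores = {f"account_{i}": (i + 0.5) / 100 for i in range(100)}
balanced = sample_per_decile(toy_scores, per_decile=3)
print({decile: len(ids) for decile, ids in balanced.items()})
```

Sampling an equal number per bin is what makes the annotated set balanced across the score range, rather than dominated by whichever scores are most common.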
I argue that advocacy organizations are most likely to stimulate comments from
new social media audiences if they create “cultural bridges,” or produce messages
that connect discursive themes that are seldom discussed together. Such
messages may not only provoke comments from multiple audiences but also put
these audiences into conversations with one another, creating new, hybrid
conversational themes, or “cultural trellises,” within a social media advocacy field.
Three-stage sampling process used to recruit advocacy organizations to install Facebook
application.