. Note that notationally we exclude Tea Party candidates from the Republican set. When it is interesting to analyze the inclusion or exclusion of Tea Party candidates we employ the notation Rep+TP and Rep-TP respectively. Using Twitter's API, we downloaded 460,038 tweets for candidate accounts dating back to March 25, 2007. Figure 1 shows the number of tweets in the days (a) and hours (b) surrounding Election Day. We see temporal patterns, as less activity is observed during weekends and nights. As expected, the volume of tweets increases toward November 2nd, abruptly decreasing afterward.
The data include 84, 81 and 522 candidates from the Senate elections, the gubernatorial elections and the Congressional elections respectively, covering about 50% of the number of candidates in each of the races. We crawled all the edges connecting users in our dataset. To identify social structures we consider a “follows” relation as a directed edge going from the follower to the followed user (identifying 4,429 such edges between candidates in our pool). To enrich the dataset we crawled the homepage of candidates who maintained one, as well as each of the valid URLs that appeared in the tweets, and considered them as additional documents. Out of 351,926 URLs (186,000 distinct) 233,296 were valid pages (132,376 distinct),
The Tea Party classification was obtained from The New York Timesfeature “Where Tea Party Candidates are Running,” October 14, 2010(nytimes.com/interactive/2010/10/15/us/politics/tea-party-graphic.html).
which eventually contributed 96% of the content to the dataset (182,523,302 terms out of 190,290,041). We filtered out stop words and extracted both unigram and bigram terms. We found no significant difference when n-grams of higher order were considered.
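The term-extraction step described above can be sketched as follows. This is an illustrative reconstruction, not the authors' code; the stop-word list is a toy stand-in, and forming bigrams over the stop-word-filtered token stream (rather than the raw text) is an assumption.

```python
from collections import Counter

# Toy stop-word list; a real pipeline would use a standard one.
STOP_WORDS = {"the", "a", "an", "of", "to", "and", "in", "is", "it"}

def extract_terms(text):
    """Tokenize, drop stop words, and count unigram and bigram terms."""
    tokens = [w for w in text.lower().split() if w not in STOP_WORDS]
    bigrams = [f"{a} {b}" for a, b in zip(tokens, tokens[1:])]
    return Counter(tokens + bigrams)

counts = extract_terms("Vote early and vote often in the election")
# "vote" is counted twice; "vote often" is among the bigrams
```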
In this work, we analyze two aspects of the data: the content produced by the users and the structure of the network formed by the follower edges. We start by providing some theoretical background to our content analysis methods.
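The directed follower network described above can be represented with simple adjacency sets; the sketch below uses hypothetical account names and shows the indegree (follower count) that drives node size in a layout such as Figure 2.

```python
from collections import defaultdict

# Hypothetical (follower, followed) pairs among candidate accounts.
edges = [("alice", "bob"), ("carol", "bob"), ("alice", "carol")]

following = defaultdict(set)  # outgoing edges: whom each user follows
followers = defaultdict(set)  # incoming edges: who follows each user

for src, dst in edges:
    following[src].add(dst)
    followers[dst].add(src)

# Indegree = number of followers of each followed user.
indegree = {u: len(f) for u, f in followers.items()}
```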
User Profile Model
Our system consists of a set of candidates $C$, where each candidate $u \in C$ has a set of documents $D_u$ associated with her. The entire corpus is denoted by $D = \bigcup_{u \in C} D_u$. Documents are represented using the Bag of Words model, where each term $t$ is associated with its number of occurrences in the document, $\mathrm{tf}(t,d)$. The vocabulary of the corpus is denoted by $V$. Our model is based on the TF-IDF model; therefore we make use of the document frequency of a term, $\mathrm{df}(t)$, and the inverse document frequency, $\mathrm{idf}(t) = \log(|D| / \mathrm{df}(t))$. We denote the document frequency of a term in the set of user $u$'s documents by $\mathrm{df}_u(t)$. We also make use of $p_{ML}(t \mid d) = \mathrm{tf}(t,d) / |d|$, the maximum likelihood estimation of the probability to find term $t$ in document $d$.

We set the initial weight of a term in a user LM to be

$$w_u(t) = \mathrm{df}_u(t) \cdot \mathrm{idf}(t) \cdot \overline{\mathrm{tf}}(t),$$

where $\overline{\mathrm{tf}}(t)$ stands for the average frequency of term $t$ in the collection $D$. In addition, we calculate the marginal probability of $t$ in the language model of the entire corpus as

$$p(t \mid D) = \frac{\sum_{d \in D} \mathrm{tf}(t,d)}{\sum_{t' \in V} \sum_{d \in D} \mathrm{tf}(t',d)}.$$

These values are then normalized in order to obtain a probability distribution over the terms.

We then smooth the weights using the LM of the corpus,

$$\hat{w}_u(t) = \lambda \cdot w_u(t) + (1 - \lambda) \cdot p(t \mid D),$$

using a normalization factor of $\lambda$. Finally, we divide these values by their sum to normalize them.
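The weighting and smoothing steps described above can be sketched as follows. This is a reconstruction under assumptions: the exact weighting (df times idf times average term frequency) and the linear interpolation with the corpus LM follow a standard TF-IDF / Jelinek-Mercer reading of the text, and the interpolation factor `lam` is illustrative.

```python
import math
from collections import Counter

def build_user_lm(user_docs, corpus_docs, lam=0.5):
    """Sketch of the LM-based user profile: weight each of the user's terms
    by df_u * idf * average tf, normalize to a distribution, then smooth
    by interpolating with the corpus language model (factor lam)."""
    n_docs = len(corpus_docs)
    df = Counter()        # document frequency over the whole corpus
    total_tf = Counter()  # total occurrences of each term in the corpus
    for doc in corpus_docs:
        tf = Counter(doc)
        total_tf.update(tf)
        df.update(tf.keys())

    # Corpus LM: marginal probability of each term.
    corpus_len = sum(total_tf.values())
    p_corpus = {t: c / corpus_len for t, c in total_tf.items()}

    # Initial user weights: df_u(t) * idf(t) * average tf of t per containing doc.
    df_u = Counter()
    for doc in user_docs:
        df_u.update(set(doc))
    weights = {}
    for t, dfu in df_u.items():
        idf = math.log(n_docs / df[t])
        avg_tf = total_tf[t] / df[t]
        weights[t] = dfu * idf * avg_tf

    # Normalize, smooth with the corpus LM, and renormalize to sum to 1.
    z = sum(weights.values()) or 1.0
    smoothed = {t: lam * (w / z) + (1 - lam) * p_corpus[t]
                for t, w in weights.items()}
    z2 = sum(smoothed.values())
    return {t: v / z2 for t, v in smoothed.items()}

# Usage with a toy corpus of tokenized documents.
corpus = [["tax", "cut", "tax"], ["tax", "hike"], ["health", "care"]]
profile = build_user_lm(user_docs=[corpus[0]], corpus_docs=corpus)
# profile is a probability distribution over the user's terms ("tax", "cut")
```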
In a similar manner we constructed a LM-based profile for the Democratic and Republican parties, as well as for the group of Tea Party members. In order to compute the LM-based profile of a group, we applied the same process
Figure 2. Plot of the candidate network (force-directed graph embedding layout, modified to emphasize separation; node size proportional to indegree).

Figure 3. Number of explicit follower edges and unique @mention edges (follower / mention).