
Karmanomics 

Analysis of Reddit’s popularity ranking system

grimtrigger
12/8/2010

Contents
Introduction
Data
    Descriptive Statistics
Models
    Model 1
    Model 2
    Model 3
    Model 4
    Model 5
    Model 6
Empirical Results
    Model 1
    Model 2
    Model 3
    Model 4
    Model 5
    Model 6
    Discussion of variables across models
Conclusion
    Summary
    Further Research
    Limitations

Introduction 
Social media is a radical new way for individuals to reach online content. It gives the user
access to millions of sources, and gives non-traditional sources a level playing field with
traditional sources. Understanding how readers digest these sources is important to content
creators and special interests seeking to promote specific content.

This paper looks into one site in particular, Reddit.com. Reddit users, referred to as “Redditors”,
can give submissions either positive or negative votes. The sum of these votes is referred to as
“Karma” and reflects the popularity of such posts. It is important to note that Karma begins at 1
and cannot drop below 0. The most popular posts make it to the front page, where they receive
the most attention.

Submissions can take many forms, specifically: articles, blogs, self-posts, images, audio/visual
and other. Most of these are self-explanatory but self-posts refer to simple text submissions by
Redditors. They do not link outside of Reddit.com, and are similar to a blog post with
Reddit.com as the blog host. These different types of content are likely to receive different levels
of Karma since they are digested differently.

Reddit also has a subReddit system. SubReddits are communities organized around ideologies,
interests, or lifestyles. Users who identify with certain subReddits have an option to subscribe,
and their personalized front page will then contain submissions to those subReddits. SubReddits
are also biased towards articles which reflect their ideology, so a liberal article posted to
/r/conservative can expect very little Karma.

The size of the subReddit to which an item was submitted is a likely predictor of that
submission’s Karma score for two reasons. First, the number of readers in that subReddit
reflects the level of interest in that subject matter. Second, the number of readers in a
subReddit indicates how much exposure a new submission will get before it is bumped off.
In other words, there is a tradeoff between submitting to a large subReddit with a lot of
interest but too many voices, and a small subReddit with little interest but where each voice
is well heard. In between these two extremes, there is likely a sweet spot which is most fertile for
Karma.

It is important to note an anomaly in the subReddit system: the main Reddit.com subReddit.
This subReddit has over 33,000 readers but has no discernable special interest. It is a vestigial
part of Reddit from before the subReddit system. I flag submissions to this subReddit in my
data to see whether posting there affects Karma.

The most important predictor of Karma is immeasurable: engagement. Some posts are more
popular simply because they better catch or hold the user’s attention. I use number of
comments as an imperfect proxy for engagement. The rationale for this is that commenting takes
more effort than simply voting. Commenting on a post indicates that the reader has made
enough of a connection to the submission to invest a response.
Data 
A sample of 200 submissions was randomly selected from the 100,000,000 submissions made
to Reddit prior to submission number 24274635, which was made on December 5th, 2010.
These 100,000,000 only include submissions made after Reddit’s subreddit feature was brought
out of beta in 2008. Data was obtained by randomly generating 200 numbers within the
range, converting each number to base 36, and appending the result to the URL
www.reddit.com/comments/ . Of these 200 observations, 51 were deemed spam and excluded,
leaving 149 in the data.
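
The sampling step described above can be sketched as follows. This is a minimal illustration rather than the script actually used: the function names, the seed, and the lower bound of the ID range are assumptions, since the text gives only the upper bound (submission number 24274635).

```python
import random

ALPHABET = "0123456789abcdefghijklmnopqrstuvwxyz"


def to_base36(n):
    """Convert a non-negative integer to a lowercase base-36 string."""
    if n < 0:
        raise ValueError("n must be non-negative")
    if n == 0:
        return "0"
    digits = []
    while n:
        n, rem = divmod(n, 36)
        digits.append(ALPHABET[rem])
    return "".join(reversed(digits))


def sample_urls(low, high, k, seed=0):
    """Draw k distinct submission numbers in [low, high] and build comment URLs."""
    rng = random.Random(seed)  # fixed seed so the sample can be reproduced
    ids = rng.sample(range(low, high + 1), k)
    return ["www.reddit.com/comments/" + to_base36(i) for i in ids]


# HIGH comes from the text; LOW is a hypothetical placeholder for the start of the range.
HIGH = 24274635
LOW = 1
urls = sample_urls(LOW, HIGH, 200)
print(urls[:3])
```

Each URL still has to be visited by hand or by a scraper to record Karma, comments, and the other variables, and spam submissions have to be screened out afterwards.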

Karma
Popularity ranking

Comments
Number of comments

Readers
Number of Redditors subscribed to that specific subreddit as of December 5th, 2010

Main
A binary variable; 1 if it was submitted to the “Reddit.com” subreddit, 0 if it was not

Type
Article, Blog, Self, AV, Image, Other are mutually exclusive binary variables which tell what type
of submission it was; AV is short for Audio/Visual

Descriptors
Political and NSFW are binary variables that describe the link; NSFW is short for Not Safe For
Work and usually refers to pornographic submissions

Descriptive Statistics 
            Mean       Median    Max       Min   Std dev
Karma       11.39      1         397       0     46.12
Comments    6.36       0         205       0     22.26
Readers     232478.2   334022    437813    10    19037.1
Main        .26        0         1         0     .44
Article     .31        0         1         0     .46
Blog        .13        0         1         0     .34
Self        .20        0         1         0     .32
AV          .13        0         1         0     .33
Image       .19        0         1         0     .40
Other       .05        0         1         0     .22
Political   .12        0         1         0     .32
NSFW        .03        0         1         0     .18
Models 

Model 1 
LOG(KARMA+1) = β1 LOG(COMMENTS+1) + β2 LOG(READERS) + β3 LOG(READERS)^2 + β4 MAIN + β5 ARTICLE
               + β6 BLOG + β7 SELF + β8 IMAGE + β9 AV + β10 POLITICAL + β11 NSFW

This is the basic model I will be testing. It takes logarithms of the quantitative variables,
since their distributions are heavily right-skewed (roughly exponential). It is regressed through
the origin since it can be assumed that if all the independent variables evaluate to 0,
LOG(KARMA+1) should be 0.
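
To make the specification concrete, the sketch below builds the Model 1 regressors for a single submission and evaluates the fitted equation. It assumes LOG is the natural logarithm and that OTHER is the omitted submission type; the function names and the example submission are hypothetical, while the coefficients are copied from the Model 1 results table later in the paper.

```python
import math

TYPES = ("ARTICLE", "BLOG", "SELF", "IMAGE", "AV")  # OTHER is the omitted category

# Coefficients copied from the Model 1 results table.
MODEL1_COEFS = {
    "LOG(COMMENTS+1)": 0.806755, "LOG(READERS)": 0.210524,
    "LOG(READERS)^2": -0.015733, "MAIN": 0.336567, "ARTICLE": 0.487596,
    "BLOG": -0.152499, "SELF": -0.279417, "IMAGE": 0.415287,
    "AV": 0.117190, "POLITICAL": 0.133671, "NSFW": -0.318165,
}


def model1_regressors(comments, readers, main, sub_type, political, nsfw):
    """Return the Model 1 right-hand-side variables for one submission."""
    log_readers = math.log(readers)
    row = {
        "LOG(COMMENTS+1)": math.log(comments + 1),
        "LOG(READERS)": log_readers,
        "LOG(READERS)^2": log_readers ** 2,
        "MAIN": float(main),
    }
    for t in TYPES:
        row[t] = 1.0 if sub_type == t else 0.0
    row["POLITICAL"] = float(political)
    row["NSFW"] = float(nsfw)
    return row


def predict_log_karma(row, coefs):
    """Fitted LOG(KARMA+1): no intercept, so an all-zero row predicts 0."""
    return sum(coefs[name] * value for name, value in row.items())


# The all-zero case behind the regression through the origin:
# no comments, one reader (log 1 = 0), type OTHER, no flags set.
zero_row = model1_regressors(0, 1, 0, "OTHER", 0, 0)
print(predict_log_karma(zero_row, MODEL1_COEFS))

# A hypothetical image post with 10 comments in a 1,000-reader subReddit.
example = model1_regressors(10, 1000, 0, "IMAGE", 0, 0)
print(math.exp(predict_log_karma(example, MODEL1_COEFS)) - 1)  # implied Karma
```

The first call shows what the no-intercept assumption means in practice: the only submission for which every regressor is zero is an OTHER-type post with no comments in a one-reader subReddit.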

Model 2 
LOG(KARMA+1) = β1 ABS(SELF-1)*LOG(COMMENTS+1) + β2 LOG(READERS) + β3 LOG(READERS)^2 + β4 MAIN
               + β5 ARTICLE + β6 BLOG + β7 SELF + β8 IMAGE + β9 AV + β10 POLITICAL + β11 NSFW

This is similar to Model 1, but includes COMMENTS as an independent variable only for
non-SELF submissions. This model recognizes that self-posts often have more comments not
because they are more engaging, but because they are asking a question. Karma is ma
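
The interaction term in Model 2 can be illustrated with a few lines of code. This is only a sketch with a hypothetical function name, assuming LOG is the natural logarithm: because SELF is 0 or 1, ABS(SELF-1) acts as a switch that is 1 for non-self submissions and 0 for self-posts.

```python
import math


def comments_term(comments, is_self):
    """ABS(SELF-1)*LOG(COMMENTS+1): the comments regressor, zeroed out for self-posts."""
    return abs(is_self - 1) * math.log(comments + 1)


# The same hypothetical comment count contributes differently depending on the post type.
print(comments_term(50, is_self=0))  # link submission: log(51)
print(comments_term(50, is_self=1))  # self-post: 0.0
```

With this term, any difference in Karma between self-posts and other submissions has to be absorbed by the SELF dummy alone.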

Model 3 
LOG(KARMA+1) = β2 LOG(READERS) + β3 LOG(READERS)^2 + β4 MAIN + β5 ARTICLE + β6 BLOG + β7 SELF
               + β8 IMAGE + β9 AV + β10 POLITICAL + β11 NSFW

This model does not include COMMENTS. It recognizes that COMMENTS is an imperfect
measure of how engaging a submission is, and it may be better to leave it out altogether.

Model 4 
LOG(KARMA+1) = β1 LOG(COMMENTS+1) + β2 LOG(READERS) + β3 LOG(READERS)^2 + β4 MAIN + β5 ARTICLE
               + β6 BLOG + β7 SELF + β8 IMAGE + β9 AV + β10 POLITICAL + β11 NSFW + C

This is the same as Model 1 but includes an intercept. It goes against the idea that if all the
independent variables evaluate to 0, then LOG(KARMA+1) should be 0; however, it grants the
model more flexibility.

Model 5 
LOG(KARMA+1) = β1 ABS(SELF-1)*LOG(COMMENTS+1) + β2 LOG(READERS) + β3 LOG(READERS)^2 + β4 MAIN
               + β5 ARTICLE + β6 BLOG + β7 SELF + β8 IMAGE + β9 AV + β10 POLITICAL + β11 NSFW + C

This is the same as Model 2 but includes an intercept. It goes against the idea that if all the
independent variables evaluate to 0, then LOG(KARMA+1) should be 0; however, it grants the
model more flexibility.

Model 6 
LOG(KARMA+1) = β2 LOG(READERS) + β3 LOG(READERS)^2 + β4 MAIN + β5 ARTICLE + β6 BLOG + β7 SELF
               + β8 IMAGE + β9 AV + β10 POLITICAL + β11 NSFW + C

This is the same as Model 3 but includes an intercept. It goes against the idea that if all the
independent variables evaluate to 0, then LOG(KARMA+1) should be 0; however, it grants the
model more flexibility.

Empirical Results 

Model 1 
Dependent Variable: LOG(KARMA+1)

Variable          Coefficient  Std. Error  t-Statistic  Prob.
LOG(COMMENTS+1)    0.806755    0.058561    13.77636     0.0000
LOG(READERS)       0.210524    0.060243     3.494584    0.0006
LOG(READERS)^2    -0.015733    0.004187    -3.757704    0.0003
MAIN               0.336567    0.165573     2.032741    0.0440
ARTICLE            0.487596    0.239094     2.039354    0.0434
BLOG              -0.152499    0.262717    -0.580469    0.5626
SELF              -0.279417    0.251761    -1.109850    0.2690
IMAGE              0.415287    0.255414     1.625938    0.1063
AV                 0.117190    0.271638     0.431421    0.6669
POLITICAL          0.133671    0.200026     0.668270    0.5051
NSFW              -0.318165    0.356252    -0.893090    0.3734

R-squared 0.631585    Adjusted R-squared 0.604295

The R-squared and adjusted R-squared of the model are promising, explaining 63% and 60% of the
variance respectively. The absolute t-statistics for the non-binary variables, LOG(COMMENTS+1),
LOG(READERS), and LOG(READERS)^2, are large: the null hypothesis of a zero coefficient is
rejected for each of them at the 1% level. The absolute t-statistics for MAIN and ARTICLE are
smaller, but the null hypothesis would still be rejected at the 5% level.
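
The summary statistics in the table above are related by two simple formulas, sketched below with values copied from the Model 1 table. The function names are illustrative. The observation count is an assumption: the text reports 149 non-spam submissions, but 146 is the count that reproduces the reported adjusted R-squared with 11 estimated parameters, which may mean a few observations were dropped from the regression.

```python
def t_statistic(coefficient, std_error):
    """t-statistic for the null hypothesis that the coefficient is zero."""
    return coefficient / std_error


def adjusted_r_squared(r_squared, n_obs, n_params):
    """Adjusted R-squared: R-squared penalized for the number of estimated parameters."""
    return 1.0 - (1.0 - r_squared) * (n_obs - 1) / (n_obs - n_params)


# Coefficient and standard error of LOG(COMMENTS+1) from the Model 1 table.
t_comments = t_statistic(0.806755, 0.058561)

# N_OBS is an assumption (see the paragraph above); it is not stated in the text.
N_OBS = 146
adj = adjusted_r_squared(0.631585, N_OBS, 11)

print(round(t_comments, 2), round(adj, 4))
```

A large absolute t-statistic means the estimated coefficient is many standard errors away from zero, which is what produces the small values in the Prob. column.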

Although the explanatory power of the model is still in question, it does seem to match the
theory. LOG(COMMENTS+1) has a positive coefficient. The positive and negative coefficients
of LOG(READERS) and LOG(READERS)^2 form a downward facing parabola that matches the
theory.

One thing that does not make sense is the positive coefficient of MAIN. However, it does not
necessarily justify discarding the model.

LOG(KARMA+1) is maximized when a subReddit has 646,711 readers. This is, worryingly,
above the maximum number of READERS observed.
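
The turning point of a quadratic in LOG(READERS) can be checked directly from the two READERS coefficients. The sketch below assumes that LOG is the natural logarithm and that the squared term is (LOG(READERS))^2; under those assumptions the fitted curve peaks where LOG(READERS) = -β2/(2β3). For comparison it also evaluates the same expression without the factor of 2, which comes out close to the figure quoted above. The function name is illustrative.

```python
import math


def turning_point_readers(b_log, b_log_sq):
    """Readers count at which b_log*L + b_log_sq*L**2 peaks, where L = ln(readers)."""
    if b_log_sq >= 0:
        raise ValueError("the quadratic has no maximum unless b_log_sq is negative")
    # Setting the derivative b_log + 2*b_log_sq*L to zero gives the vertex.
    log_readers_at_peak = -b_log / (2.0 * b_log_sq)
    return math.exp(log_readers_at_peak)


# Model 1 coefficients for LOG(READERS) and LOG(READERS)^2, copied from the table.
B2, B3 = 0.210524, -0.015733

vertex = turning_point_readers(B2, B3)
# For comparison only: the same expression without the factor of 2.
without_factor_two = math.exp(-B2 / B3)

print(f"vertex of the fitted parabola: {vertex:,.0f} readers")
print(f"without the factor of 2: {without_factor_two:,.0f} readers")
```

The same function can be applied to the READERS coefficients of the other models, provided the coefficient on the squared term is negative.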
 

Model 2 

Dependent Variable: LOG(KARMA+1)

Variable                     Coefficient  Std. Error  t-Statistic  Prob.
ABS(SELF-1)*LOG(COMMENTS+1)   0.896279    0.085624    10.46765     0.0000
LOG(READERS)                  0.210645    0.069645     3.024566    0.0030
LOG(READERS)^2               -0.014996    0.004837    -3.099978    0.0024
MAIN                          0.475060    0.194943     2.436923    0.0161
ARTICLE                       0.376671    0.275028     1.369576    0.1731
BLOG                         -0.263911    0.302394    -0.872741    0.3844
SELF                          1.047127    0.286963     3.649001    0.0004
IMAGE                         0.196042    0.296138     0.661997    0.5091
AV                            0.047760    0.312909     0.152633    0.8789
POLITICAL                     0.182298    0.230359     0.791364    0.4301
NSFW                         -0.207078    0.413124    -0.501249    0.6170

R-squared 0.510749    Adjusted R-squared 0.474509

The R-squared and adjusted R-squared both fell relative to Model 1. The absolute t-statistics of
LOG(READERS) and LOG(READERS)^2 became slightly smaller, but the null hypothesis of a zero
coefficient would still be rejected for both at the 1% level. This model has less explanatory
power than Model 1, but it lowers the t-statistic of the comments term from 13.78 to 10.47.

One thing that does not make sense is the positive coefficient of MAIN. However, it does not
necessarily justify discarding the model.

LOG(KARMA+1) is maximized when a subReddit has 1,260,155 readers. Once again, this is
alarmingly high.

Model 3 
Dependent Variable: LOG(KARMA+1)

Variable          Coefficient  Std. Error  t-Statistic  Prob.
LOG(READERS)       0.331916    0.091763     3.617118    0.0004
LOG(READERS)^2    -0.020601    0.006424    -3.206884    0.0017
MAIN              -0.086370    0.249326    -0.346416    0.7296
ARTICLE            0.215134    0.365959     0.587864    0.5576
BLOG              -0.358161    0.403927    -0.886697    0.3768
SELF               0.513964    0.377419     1.361787    0.1755
IMAGE              0.560227    0.393019     1.425446    0.1563
AV                -0.001797    0.418129    -0.004299    0.9966
POLITICAL          0.311065    0.306698     1.014238    0.3123
NSFW              -0.926862    0.544405    -1.702522    0.0909

R-squared 0.114504    Adjusted R-squared 0.056333

Compared to the previous two models, this one seems much less able to explain the variation in
LOG(KARMA+1). It seems clear that COMMENTS should be included in the model in one way
or another. This is also the first of the models to give MAIN a negative coefficient.

LOG(KARMA+1) is maximized when a subReddit has 9,935,701 readers. Once again, this is
alarmingly high.

Model 4 
Dependent Variable: LOG(KARMA+1)

Variable          Coefficient  Std. Error  t-Statistic  Prob.
LOG(COMMENTS+1)    0.823607    0.058245    14.14045     0.0000
LOG(READERS)      -0.075551    0.142707    -0.529417    0.5974
LOG(READERS)^2    -0.000407    0.008085    -0.050355    0.9599
MAIN               0.251460    0.167757     1.498955    0.1362
ARTICLE            0.263746    0.256681     1.027523    0.3060
BLOG              -0.374099    0.277856    -1.346376    0.1805
SELF              -0.529140    0.272857    -1.939254    0.0546
IMAGE              0.133272    0.282462     0.471823    0.6378
AV                -0.136549    0.291516    -0.468411    0.6403
POLITICAL          0.084747    0.198470     0.427003    0.6701
NSFW              -0.338229    0.351383    -0.962566    0.3375
C                  1.436578    0.651598     2.204700    0.0292

R-squared 0.644481    Adjusted R-squared 0.615297

Compared to its intercept-lacking counterpart, Model 1, this model explains the variance in
LOG(KARMA+1) slightly better: adding an intercept raises the R-squared from 0.632 to 0.644.
Apart from LOG(COMMENTS+1) and C, the null hypothesis of a zero coefficient would not be
rejected for any variable at the 5% level.

However, the large coefficient of C runs contrary to theory. The positive coefficient of
MAIN also does not make sense. Even more troubling, both LOG(READERS) and
LOG(READERS)^2 have negative coefficients, meaning a subReddit of 1 reader is most fertile
for Karma. This is obviously untrue, and I feel it is enough to throw this model out.

Model 5 
Dependent Variable: LOG(KARMA+1)

Variable                     Coefficient  Std. Error  t-Statistic  Prob.
ABS(SELF-1)*LOG(COMMENTS+1)   0.898469    0.085850    10.46556     0.0000
LOG(READERS)                  0.107513    0.165289     0.650453    0.5165
LOG(READERS)^2               -0.009460    0.009390    -1.007499    0.3155
MAIN                          0.442396    0.201006     2.200903    0.0295
ARTICLE                       0.293853    0.300690     0.977264    0.3302
BLOG                         -0.345612    0.325407    -1.062092    0.2901
SELF                          0.963876    0.311929     3.090049    0.0024
IMAGE                         0.093972    0.331711     0.283295    0.7774
AV                           -0.045044    0.341285    -0.131983    0.8952
POLITICAL                     0.165556    0.232088     0.713334    0.4769
NSFW                         -0.217200    0.414193    -0.524393    0.6009
C                             0.521011    0.756962     0.688292    0.4925

R-squared 0.512473    Adjusted R-squared 0.472452

Relative to Model 2, which lacks an intercept, this model explains about the same amount of
variance. The absolute t-statistics are small for all variables other than
ABS(SELF-1)*LOG(COMMENTS+1), SELF, and MAIN. And although including an intercept runs
contrary to theory, the intercept is relatively small.

LOG(KARMA+1) is maximized when a subReddit has 86,250 readers. This is the first such
value to make practical sense.

Model 6 

Dependent Variable: LOG(KARMA+1)

Variable          Coefficient  Std. Error  t-Statistic  Prob.
LOG(READERS)       0.286928    0.220038     1.303996    0.1944
LOG(READERS)^2    -0.018185    0.012518    -1.452713    0.1486
MAIN              -0.101241    0.258768    -0.391243    0.6962
ARTICLE            0.178752    0.401224     0.445516    0.6567
BLOG              -0.394004    0.435486    -0.904747    0.3672
SELF               0.476977    0.412836     1.155366    0.2500
IMAGE              0.515965    0.440681     1.170835    0.2437
AV                -0.042448    0.456793    -0.092925    0.9261
POLITICAL          0.303863    0.309424     0.982028    0.3278
NSFW              -0.932055    0.546788    -1.704600    0.0906
C                  0.227920    1.012463     0.225114    0.8222

R-squared 0.114834    Adjusted R-squared 0.049748

The addition of an intercept to Model 3 barely changes the explanatory power of the model.
However, the absolute t-statistic of each individual variable is small; the null hypothesis
of a zero coefficient would be rejected only for NSFW, and only at the 10% level. Also
promising is that even though including an intercept runs contrary to theory, the coefficient
is small. The negative coefficient of MAIN is also promising.

LOG(KARMA+1) is maximized when a subReddit has 7,119,007 readers. Once again, this is
alarmingly high.

Discussion of variables across models 
The inclusion of LOG(COMMENTS+1) in some form consistently increased the explanatory
power of the model; however, its removal increased the probability that the other variables were
part of the model. Model 5 seems to exemplify the best tradeoff, where comments relate to
Karma only for non-self posts.

All but Model 4 imply that there is a number of readers at which Karma is maximized.
However, Model 5 was the only one to give a maximizing number of readers within the
range of readers observed. 86,250 is the magic number of readers, according to Model 5.

MAIN was negative in only 2 models, neither of which is Model 5, which seems the most
promising. In Model 5, however, MAIN would be rejected at the 5% level. More testing is needed
to see how submitting to the Reddit.com subReddit affects Karma.

ARTICLE and IMAGE have positive coefficients in every model, and with comfortable
t-statistics. It seems conclusive that articles and images receive more Karma than their
counterparts.

AV and SELF sometimes have positive, and sometimes have negative coefficients. In Model 5,
AV has a negative effect and SELF has a positive effect. However, more testing is needed to
verify.

BLOG was the only submission type to be negative in every single model. Furthermore, none of the
models would reject the effect of BLOG at the 5% level.

POLITICAL has a consistently positive coefficient, while NSFW has a consistently negative
coefficient.
Conclusion 

Summary 
My analysis shows that articles, images, and submissions with a political spin lend themselves
to more Karma. Blogs and pornography, however, generate less Karma. Self posts and
audio/visual submissions are somewhere in between.

There is also a sweet spot in the size of a subReddit. 86,250 is a rough estimate of where that
sweet spot may be. SubReddits smaller than this don’t display the numbers needed to maximize
Karma, while subReddits larger than this crowd it out.

There is a lack of evidence to conclude that comments are a good predictor of Karma.

Further Research 
The use of comments as a proxy for engagement seemed to be the most notable flaw in my
models. If a better proxy could be found, better models could be created.

Limitations 
The sample size was unfortunately small, 149 not including spam posts. A larger sample size
would greatly increase the validity of the model as well as better expose variable interactions.

Another limitation was the blurry line between the submission categories, particularly between
articles and blogs. It is hard to decipher what is a blog when established sources hire bloggers
to create content, and individual bloggers attempt to present themselves as established
sources.

Another difficult question is what would be considered political. Is a news story which is not
overtly editorialized but slightly biased considered political? If the bias wasn’t obvious at a
cursory glance, I wouldn’t mark it as political. However, a deeper reading might reveal it was so.
