

Nov 16, 2015, 09:04am

How Much Of The Internet Does The Wayback Machine Really Archive?
Kalev Leetaru Contributor
AI & Big Data
I write about the broad intersection of data and society.


Internet Archive servers in 2006 (AP Photo/Ben Margot)

The Internet Archive turns 20 years old next year, having archived nearly two
decades and 23 petabytes of the evolution of the World Wide Web. Yet, surprisingly
little is known about what exactly is in the Archive’s vaunted Wayback Machine.
Beyond saying it has archived more than 445 billion webpages, the Archive has never
published an inventory of the websites it archives or the algorithms it uses to
determine what to capture and when. Given the Archive’s recent announcements of
new efforts to make its web archive accessible to scholarly research, it is critically

important to understand what precisely makes up this 445-billion-page archive and how that composition might affect the kinds of research scholars can perform with it.

Regular users of the Wayback Machine are familiar with the myriad oddities of its
holdings. For example, despite CNN.com launching in September 1995, the Archive's first snapshot of its homepage does not appear until June 2000. In contrast, BBC's website has been archived since December 1996, but the volume of snapshots ebbed and flowed in fits and starts through 2012. To truly understand the Archive, it is clear we must move beyond casual anecdotes to a systematic assessment of the collection's holdings.

Since the Archive does not publish a master inventory of the domains preserved in the Wayback Machine, the Alexa ranking of the top one million most popular websites in the world was used, which is compiled from browsing activity in more than 70 countries. The complete history of all snapshots ever recorded by the Archive for the homepage of each website was requested using the Wayback CDX Server API through November 5, 2015. While this only reflects snapshots of homepages, rather than sites as a whole, it nonetheless captures a key metric of how often the Archive is crawling each site.
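
For readers who want to approximate this data pull themselves, the sketch below queries the Archive's public CDX Server API for a homepage's capture history. It is illustrative only: the field list and date bounds are assumptions, not the exact script used for this analysis.

```python
# A minimal sketch (not the exact script used here) of pulling a homepage's
# complete capture history from the public Wayback CDX Server API.
# The field list and date bounds are illustrative assumptions.
import requests

CDX_ENDPOINT = "http://web.archive.org/cdx/search/cdx"

def snapshot_history(domain, from_date="1996", to_date="20151105"):
    """Return every recorded capture of a site's homepage as [timestamp, status, length] rows."""
    resp = requests.get(
        CDX_ENDPOINT,
        params={
            "url": domain,                      # homepage only, not the whole site
            "from": from_date,
            "to": to_date,
            "output": "json",
            "fl": "timestamp,statuscode,length",
        },
        timeout=60,
    )
    resp.raise_for_status()
    rows = resp.json()
    return rows[1:] if rows else []             # first row is the column header

if __name__ == "__main__":
    captures = snapshot_history("savy.lt")
    print(f"{len(captures)} captures recorded for savy.lt")
```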


The enormous technical resources required to crawl and archive the open web can be seen in this data. In all, the homepages of the top one million Alexa sites have been snapshotted by the Internet Archive just over 240 million times since 1996. Just over 2 terabytes of bandwidth was consumed downloading those homepages, with more than 307 gigabytes required in 2015 alone.

Looking at 2015, the top 15 sites with the most snapshots were seriesyonkis.sx (a Spanish site offering free access to TV and movies, which Chrome currently blocks due to security risks and which was previously shut down for alleged movie pirating), avtozapchasty.ru (a Russian autoparts website), savy.lt (a Lithuanian loans website), videox-amateur.org (a pornography website), most.bg (a Bulgarian computer parts website), fastpic.ru (a Russian image hosting site that appears to host a large amount of pornography), royalkona.com (a Hawaiian resort hotel), trampolinepartsandsupply.com (a trampoline parts website), radikal.ru (another Russian image hosting site), youtube.com, zohraa.com (an Indian women's fashion site), arcelikal.com (a Turkish appliance and electronics website), localiser-ip.com (an IP whois lookup), jobsalibaba.com (an online jobs website), and myspace.com.

Thus, of the top 15 websites with the most snapshots taken by the Archive thus far
this year, one is an alleged former movie pirating site, one is a Hawaiian hotel, two
are pornography sites and five are online shopping sites. The second-most
snapshotted homepage is that of a Russian autoparts website and the eighth-most-
snapshotted site is a parts supplier for trampolines.

Looking in more detail at the Wayback's archive of the Lithuanian loans website savy.lt, it can be seen that the Archive crawled the site sporadically from January 1999 to May 2003, then did not return for more than a decade. In 2015 it crawled it heavily in late March and April and then very heavily in May and June, a few times on July 1, and never again in the following four months. In all, the Archive's crawlers accessed savy.lt a total of 203,945 times over this period, most of it in a single massive burst of crawling. Yet, the public Wayback profile of the site asserts it has only been crawled 868 times.

The reason for this is that the public-facing Wayback website reports the number of hours with at least one snapshot, rather than the actual total number of snapshots, which is why it reports a maximum of 24 captures per day, rather than the thousands of captures per day it actually sees for some websites. Unfortunately, the Archive does not clarify this on its website, instead casually referencing it deep within the technical documentation for its CDX Server API on GitHub.
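
The distinction matters in practice. As a rough illustration, the same CDX query can be run with and without the API's hourly collapse, and the two counts diverge sharply for heavily recrawled sites; the helper below is an assumption about how one might count, not the Archive's own tooling.

```python
# Illustrative sketch: raw capture count vs. distinct capture hours.
# "collapse=timestamp:10" keeps one capture per YYYYMMDDHH prefix, which
# roughly mirrors the hourly counting shown on the public Wayback page.
import requests

CDX_ENDPOINT = "http://web.archive.org/cdx/search/cdx"

def capture_count(domain, year=None, collapse=None):
    params = {"url": domain, "output": "json", "fl": "timestamp"}
    if year:
        params["from"], params["to"] = f"{year}0101", f"{year}1231"
    if collapse:
        params["collapse"] = collapse
    rows = requests.get(CDX_ENDPOINT, params=params, timeout=60).json()
    return max(len(rows) - 1, 0)                # drop the header row

raw = capture_count("savy.lt", year=2015)
hourly = capture_count("savy.lt", year=2015, collapse="timestamp:10")
print(f"{raw} raw captures vs {hourly} distinct capture hours in 2015")
```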

Reranking the top one million sites by the number of distinct hours with at least one snapshot and calculating the percentage of hours since 12:01 AM January 1, 2015 for which there is a snapshot, the top 15 sites are myspace.com (93%), yahoo.com (86%), cnn.com (80%), youtube.com (78%), msn.com (76%), twitter.com (76%), facebook.com (72%), msnbc.com (70%), abcnews.go.com (70%), today.com (69%), nbcnews.com (67%), cbsnews.com (65%), infoseek.co.jp (65%), cnbc.com (63%), and tinypic.com (58%). Nine of the top 15 websites by hourly snapshots are news websites, offering what appears to be a more reasonable ranking. Indeed, news websites make up many of the domains in the top 50.
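
The coverage percentage used in this reranking is straightforward to compute once the capture timestamps are in hand. A hedged sketch follows; the bucketing into YYYYMMDDHH hours is an assumption about the method, not taken from the Archive's code.

```python
# Sketch of the coverage metric: share of hours since January 1, 2015 that
# contain at least one capture. Timestamps are CDX-style strings such as
# "20150409130212"; the hourly bucketing is an illustrative assumption.
from datetime import datetime

def percent_hours_covered(timestamps, start="20150101000000", end="20151105000000"):
    fmt = "%Y%m%d%H%M%S"
    total_hours = int((datetime.strptime(end, fmt)
                       - datetime.strptime(start, fmt)).total_seconds() // 3600)
    covered = {ts[:10] for ts in timestamps if start <= ts <= end}  # YYYYMMDDHH buckets
    return 100.0 * len(covered) / total_hours

print(round(percent_hours_covered(["20150409130212", "20150409154501"]), 3))
```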

Yet, a closer look at this ranking also reveals a number of anomalies. The site walb.com has an Alexa ranking of 100,803, yet is ranked 24th for most hours with a snapshot, while mountvernonnews.com is ranked 363,013 in Alexa and 43rd by snapshot hours. This appears to be a general trend, with no noticeable connection between Alexa rank and the number of times or hours a website homepage has been snapshotted.

In fact, the total number of snapshots and the total number of hours with at least one snapshot are only weakly correlated at r=0.35. Alexa rank and number of snapshots are not meaningfully correlated at r=-0.03, while Alexa rank and number of distinct hours with snapshots are inversely correlated at r=-0.15. Put into simpler terms, these numbers mean that the number of snapshots and the number of hours with at least one snapshot are largely unrelated to a site's Alexa ranking. More popular sites do not have more snapshots than less popular sites. On the one hand, this might make sense, since the popularity of a site is not necessarily indicative of how frequently it updates. Yet, on the circa-2015 web highly popular sites tend to update constantly with new content – a site that is updated once every few years will likely draw little traffic. Thus, one could argue that the content update rate of a site and its popularity are at least somewhat related.
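
The r values above are standard correlation coefficients computed across the one million sites. A toy sketch of the calculation is below; the column names and values are invented purely to show the mechanics, not drawn from the actual dataset.

```python
# Toy sketch of the correlation calculation, assuming one row per site with
# its Alexa rank, total snapshot count, and distinct snapshot hours.
# The column names and values are invented purely to show the mechanics.
import pandas as pd

sites = pd.DataFrame({
    "alexa_rank":     [1, 2, 3, 100803, 363013],
    "snapshots":      [5200, 4800, 3100, 900, 650],
    "snapshot_hours": [4100, 3900, 2500, 870, 600],
})

print(sites["snapshots"].corr(sites["snapshot_hours"]))   # snapshots vs hours
print(sites["alexa_rank"].corr(sites["snapshots"]))       # rank vs snapshots
print(sites["alexa_rank"].corr(sites["snapshot_hours"]))  # rank vs hours
```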

Looking across years, the correlation of Alexa rank with hours and snapshots is
remarkably consistent from 2013 to 2015, varying from -0.15 to -0.17 for hours and
-0.03 to -0.04 for snapshots. However, the correlation between hours and snapshots
varies considerably, changing from 0.35 in 2015 to 0.29 in 2014 to 0.46 in 2013 to
0.38 in 2012. The fact that the correlation of captures with Alexa rank remains constant across the last three years suggests that the Archive does not factor popularity into crawling behavior. On the other hand, the considerable change in the correlation of total snapshots with snapshot hours suggests that the recrawl behavior of the Archive is
constantly changing, which will have a profound effect on research using the Archive
as a dataset to study web evolution.

News outlets represent a special kind of website that combines a high update rate of
new content with considerable societal importance from the standpoint of archival.
To examine how well the Archive has been preserving online news, the top 20,000
news websites by volume monitored by the GDELT Project were selected and the
country of origin for each outlet identified. The total number of snapshot hours was summed for all news outlets from each country for 2013, 2014, and 2015, and divided
by the total number of monitored outlets from each country, yielding the following
maps of the average number of snapshot hours per news outlet in each country by
year.
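
The per-country averages shown in the maps below amount to a simple group-and-divide. A hedged sketch follows; the outlet table and its columns are assumptions for illustration, not GDELT's actual schema.

```python
# Sketch of the per-country aggregation: snapshot hours summed by country of
# origin, divided by the number of monitored outlets from that country.
# The outlet table and column names are illustrative assumptions.
import pandas as pd

outlets = pd.DataFrame({
    "domain":         ["cnn.com", "bbc.co.uk", "lemonde.fr", "lefigaro.fr"],
    "country":        ["US", "GB", "FR", "FR"],
    "snapshot_hours": [5800, 4200, 900, 700],
})

per_country = outlets.groupby("country")["snapshot_hours"].agg(
    total_hours="sum", num_outlets="count"
)
per_country["avg_hours_per_outlet"] = (
    per_country["total_hours"] / per_country["num_outlets"]
)
print(per_country)
```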


Average number of hours with at least one snapshot by outlet for online news outlets by country in 2013

Average number of hours with at least one snapshot by outlet for online news outlets by country in 2014


Average number of hours with at least one snapshot by outlet for online news outlets by country in 2015

Clearly visible in this sequence of maps is a strong centralization of the Archive's crawling resources towards a relatively small number of countries in terms of
snapshot hours. In 2013 there were just a few outliers, with most countries having
relatively similar hours per outlet. Over the three years there has been a steady
reorientation towards a more uneven breakdown of archival resources. The
significant geographic change over time adds further evidence that the behavior of
the Archive’s crawlers is constantly changing in profound and undocumented ways.

Taken together, these findings suggest that far greater understanding of the Internet
Archive's Wayback Machine is required before it can be used for robust, reliable scholarly research on the evolution of the web. Historical documentation on the algorithms and inputs of its crawlers is absolutely imperative, especially the
workflows and heuristics that control its archival today. One possibility would be for
the Archive to create a historical archive where it preserves every copy of the code
and workflows powering the Wayback Machine over time, making it possible to look
back at the crawlers from 1997 and compare them to 2007 and 2015.

More detailed logging data is also clearly a necessity, especially of the kinds of
decisions that lead to situations like the extremely bursty archival of savy.lt or why
the CNN.com homepage was not archived until 2000. If the Archive simply opens its doors and releases tools to allow data mining of its web archive without conducting this kind of research into the collection's biases, it is clear that the findings that result will be highly skewed and in many cases fail to accurately reflect the phenomena
being studied.

What can we learn from all of this? Perhaps the most important lesson is that, as with so many of the massive data archives that define the “big data” world, we have precious little understanding of what is actually in the data we use. Few researchers stop to ask the kinds of questions explored here and even fewer archives make any kind of detailed statistics available about their holdings. Instead, the “big data” era is unfortunately being increasingly defined by headline-grabbing results computed from datasets being plucked off the shelf with little attempt to understand their inner
biases.

Another theme is that of unexpected discovery. This analysis originally began as a study of online news archiving practices of the Internet Archive, intended to explore
whether it archived Western outlets more frequently than those of other countries.
The original expectation was that the Archive’s holdings would reflect popularity and
rate-of-change, with language and geographic location being the primary
differentiators. However, once the data was examined, it was clear the archival
landscape of the Wayback Machine was far more complex.

The interfaces we use to access these vast archives often silently transform them in ways
that are not apparent or visibly documented but that can have profound impacts on
our understanding of the results we obtain from them. For example, neither the
Wayback homepage nor the detailed FAQ inform users that the snapshot counts on
the web interface report the number of distinct hours with at least one snapshot,
rather than the actual number of times that the Archive crawled a page. This fact is
only available buried deep within a technical API reference page on GitHub.

In my opening keynote address at the 2012 IIPC General Assembly at the Library of
Congress, I noted that for scholars to be able to use web archives for research, we
needed far greater information on how those archives were being constructed. Three
and a half years later few major web archives have produced such documentation,
especially relating to the algorithms that control what websites their crawlers visit,
how they traverse those websites, and how they decide what parts of an infinite web
to preserve with their limited resources. In fact, it is entirely unclear how the
Wayback Machine has been constructed, given the incredibly uneven landscape it
offers of the top one million websites, even over the past year.

The findings above demonstrate how critical this kind of insight is. When archiving
an infinite web with finite resources, countless decisions must be made as to which
narrow slices of the web to preserve. At the most basic level, one can choose either completely random archival (selecting pages without regard to any other factors), archival prioritized by rate of change (archiving more often those pages that change more frequently – though this tends to emphasize dynamically generated sites), or archival prioritized by popularity (this emphasizes the pages the most people use today, but
risks failing to preserve relatively unknown pages that may become important in the
future). Human input can also play a critical role as with the Archive's specialized
Archive-It program.

Each approach has distinct benefits and risks. One might reasonably ask: 20 years
from now, which are we more likely to want to look back at: a Lithuanian loans website, a trampoline parts supplier, or the breaking news homepage of a major news outlet like CNN? Decisions as critical as what to preserve for the future require far greater input from the community, especially the scholars who rely on these collections. Given the current state of the Archive's holdings, it is clear that far greater visibility is needed into its algorithms and that critical engagement is needed with the broader scholarly community. We simply can't leave something as important as the preservation of the online world to the decisions of blind algorithms whose functioning we do not understand.

Indeed, just as libraries have formalized over thousands of years how they make
acquisition and collection decisions based on community engagement, it is clear that web archives must adopt similar processes and partner with a wide range of organizations to help them do so. Given that up to 14% of all online news monitored by the GDELT Project is no longer accessible after two months, it is clear that the web
is disappearing before our very eyes and thus it is imperative that we do a better job
of archiving the online world and do it before this material is lost forever.

Kalev Leetaru


Based in Washington, DC, I founded my first internet startup the year after the Mosaic web browser debuted, while still in eighth grade, and have spent the last 20 years...

