June 2013
parse-output/segment/[segment]/[ARC_file]. It allows a given web document to be linked to the ARC file it is stored in. ARC file names are unique, so the segment name is not necessary for identification.

[Figure: EMR Hadoop cluster (master and core nodes, m1.xlarge spot instances) processing 25 thousand ARC files from s3://aws-publicdatasets/common-crawl/; 35 x x 6h / cluster = 1260 inst. hours]
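Because ARC file names are unique, the last component of a path of the form parse-output/segment/[segment]/[ARC_file] is enough to identify the file. The following is a minimal sketch of that idea; the function name `arc_file_key` and the example paths are hypothetical and not taken from the corpus.

```python
import posixpath


def arc_file_key(path):
    """Return the ARC file name, i.e. the last component of the path.

    The segment directory is dropped on the assumption (stated in the
    text) that ARC file names are unique across segments.
    """
    return posixpath.basename(path.rstrip("/"))


# Hypothetical paths that only illustrate the pattern
# parse-output/segment/[segment]/[ARC_file].
paths = [
    "parse-output/segment/segment-0001/file-a.arc.gz",
    "parse-output/segment/segment-0002/file-b.arc.gz",
]
keys = [arc_file_key(p) for p in paths]
print(keys)
```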
TLD       Absolute frequency   Relative frequency
.com           2,139,229,462               0.5587
.org             230,777,285               0.0603
.net             208,147,478               0.0544
.de              181,658,774               0.0474
.uk              132,414,696               0.0346
.pl               68,528,722               0.0179
.ru               65,147,873               0.0170
.nl               54,871,489               0.0143
.info             50,395,860               0.0132
.it               49,719,965               0.0130
.fr               49,648,844               0.0130
.jp               43,790,880               0.0114
others           554,450,743               0.1448
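The relative frequencies in the table are each TLD's document count divided by the total count. The sketch below recomputes them from the absolute counts listed above and shows one simple way to count a hostname under its TLD, so that hosts under public suffixes such as .co.uk fall under .uk. The helper names `tld_of` and `relative_frequencies` are assumptions made for illustration; this is not the code used to produce the table.

```python
def tld_of(hostname):
    """Return the top-level domain of a hostname, e.g. '.uk' for 'a.co.uk'.

    Taking only the last label means public suffixes such as .co.uk
    and .org.uk are counted under .uk.
    """
    return "." + hostname.lower().rstrip(".").rsplit(".", 1)[-1]


def relative_frequencies(counts):
    """Divide every count by the sum of all counts."""
    total = sum(counts.values())
    return {tld: n / total for tld, n in counts.items()}


# Absolute document counts per TLD, copied from the table above.
tld_counts = {
    ".com": 2_139_229_462,
    ".org": 230_777_285,
    ".net": 208_147_478,
    ".de": 181_658_774,
    ".uk": 132_414_696,
    ".pl": 68_528_722,
    ".ru": 65_147_873,
    ".nl": 54_871_489,
    ".info": 50_395_860,
    ".it": 49_719_965,
    ".fr": 49_648_844,
    ".jp": 43_790_880,
    "others": 554_450_743,
}

rel_freq = relative_frequencies(tld_counts)
for tld, freq in rel_freq.items():
    print(f"{tld:<7} {freq:.4f}")
```

Rounded to four decimal places, the printed values should agree with the relative frequency column of the table.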
understand more about the structure of the corpus and its representativeness.

Top-level domains

One of the main questions was which TLDs had been crawled and what their percentage was with respect to the total corpus. For this, we aggregated counts of public suffixes like .org.uk and .co.uk under .uk. Figure 3 summarizes these statistics by listing all TLDs above a relative frequency of 0.01, i.e. 1%. For the 2012 corpus, there are 12 TLDs above this threshold.

It becomes immediately apparent that more than half of the crawled documents are registered under the .com domain. This can be explained by the fact that this TLD contains sites from all over the world.

Comparing these figures to the general usage of TLDs for websites provided by the web technology survey [10], it is possible to make assumptions about the representativeness of the CC corpus as a sample of the internet and whether it is biased towards certain TLDs. The Spearman rank correlation coefficient gave a value of 0.84 for ρ, which indicates a good positive correlation for the top 75 TLDs.

To estimate the bias, we took the ratio of the relative frequency in the CC corpus to the expected value from the web technology survey. We labelled TLDs with values above 1 as overrepresented and the ones with values below 1 as underrepresented. The top 10 over- and underrepresented domains are listed in Tables 1a and 1b. Most of the overrepresented TLDs are domains of English-speaking or European countries. The underrepresented domains are mostly Asian, and some are South American.

Second-level domains

In Tables 2a and 2b the top 10 second-level domains are given by document frequency and by data in terabytes. With 2.5% of all websites and 4.2% of the total data, youtube.com is the highest ranking second-level domain in the CC corpus 2012. Other high-ranking domains are blog publishing services like blogspot.com, wordpress.

[10] Source: http://w3techs.com/technologies/overview/top_level_domain/all (March 2013).
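The two measures used in the top-level domain analysis, the Spearman rank correlation and the ratio of observed to expected relative frequency, can be sketched in a few lines. This is a minimal illustration under stated assumptions: the TLD names `tld_a` to `tld_d` and all frequency values are made up, the function names are hypothetical, and the "as expected" label for a ratio of exactly 1 is an addition, since the text only defines the cases above and below 1.

```python
from math import sqrt


def average_ranks(values):
    """Rank values from 1 (smallest) to n; ties share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman_rho(xs, ys):
    """Spearman's rho, computed as the Pearson correlation of the ranks."""
    rx, ry = average_ranks(xs), average_ranks(ys)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / sqrt(var_x * var_y)


def representation_ratio(corpus_freq, expected_freq):
    """Ratio of corpus frequency to expected frequency for shared TLDs."""
    return {t: corpus_freq[t] / expected_freq[t]
            for t in corpus_freq if t in expected_freq}


def label(ratio):
    """Above 1 is overrepresented, below 1 underrepresented (as in the text)."""
    if ratio > 1:
        return "overrepresented"
    if ratio < 1:
        return "underrepresented"
    return "as expected"  # case not defined in the text; added for completeness


# Made-up relative frequencies, for illustration only.
corpus = {"tld_a": 0.40, "tld_b": 0.30, "tld_c": 0.20, "tld_d": 0.10}
survey = {"tld_a": 0.50, "tld_b": 0.15, "tld_c": 0.25, "tld_d": 0.10}

tlds = sorted(corpus)
rho = spearman_rho([corpus[t] for t in tlds], [survey[t] for t in tlds])
ratios = representation_ratio(corpus, survey)
labels = {t: label(r) for t, r in ratios.items()}
print(rho, labels)
```

Computing rho on ranks rather than raw frequencies keeps the very large .com share from dominating the comparison, which is one reason a rank correlation suits this kind of skewed distribution.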