June 2013
parse-output/segment/[segment]/[ARC_file]. It allows a given web document to be linked to the ARC file it is stored in. ARC file names are unique, so the segment name is not necessary for identification.

[Figure: 25 thousand ARC files (s3://aws-publicdatasets/common-crawl/); EMR Hadoop cluster with master and core nodes, all m1.xlarge spot instances; 35 x x 6h / cluster = 1260 inst. hours]
TLD Abs. freq. Rel. freq.
.com 2,139,229,462 0.5587
.org 230,777,285 0.0603
.net 208,147,478 0.0544
.de 181,658,774 0.0474
.uk 132,414,696 0.0346
.pl 68,528,722 0.0179
.ru 65,147,873 0.0170
.nl 54,871,489 0.0143
.info 50,395,860 0.0132
.it 49,719,965 0.0130
.fr 49,648,844 0.0130
.jp 43,790,880 0.0114
others 554,450,743 0.1448
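
The relative frequencies in the table are document counts per TLD divided by the total number of documents, with public suffixes such as .co.uk and .org.uk counted under .uk. The following is a minimal sketch of that aggregation, not the authors' code: the function names `tld_of` and `tld_relative_frequencies` and the sample URLs are hypothetical, and taking the last host label as the TLD is an assumption that ignores hosts given as IP addresses.

```python
from collections import Counter
from urllib.parse import urlsplit


def tld_of(url: str) -> str:
    """Return the last host label, so .co.uk and .org.uk both map to .uk."""
    host = (urlsplit(url).hostname or "").rstrip(".")
    return "." + host.rsplit(".", 1)[-1] if host else ""


def tld_relative_frequencies(urls):
    """Count documents per TLD and divide by the total number of documents."""
    counts = Counter(tld_of(u) for u in urls)
    counts.pop("", None)  # drop URLs that have no host
    total = sum(counts.values())
    return {tld: n / total for tld, n in counts.items()}


# Hypothetical sample URLs, for illustration only.
sample = [
    "http://example.com/a",
    "http://example.com/b",
    "http://news.example.co.uk/",
    "http://example.org.uk/page",
]
freqs = tld_relative_frequencies(sample)
```

On the four sample URLs, `.com` and `.uk` each receive half of the documents. A full run over the corpus would stream URLs from the metadata instead of holding them in a list.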
understand more about the structure of the corpus and its representativeness.

Top-level domains

One of the main questions was which TLDs had been crawled and what their share of the total corpus was. For this, we aggregated counts of public suffixes like .org.uk and .co.uk under .uk. Figure 3 summarizes these statistics by listing all TLDs with a relative frequency above 0.01, i.e. 1%. For the 2012 corpus, there are 12 TLDs above this threshold.

It becomes immediately apparent that more than half of the crawled documents are registered under the .com domain. This can be explained by the fact that this TLD contains sites from all over the world.

Comparing these figures to the general usage of TLDs for websites provided by the web technology survey [10], it is possible to make assumptions about the representativeness of the CC corpus as a sample of the internet and whether it is biased towards certain TLDs. The Spearman rank correlation coefficient gave a value of ρ = 0.84, which indicates a good positive correlation for the top 75 TLDs.

To extrapolate the bias, we took the ratio of the relative frequency in the CC corpus to the expected value from the web technology survey. We labelled TLDs with values above 1 as overrepresented and the ones with values below 1 as underrepresented. The top 10 over- and underrepresented domains are listed in Tables 1a and 1b. Most of the overrepresented TLDs are domains of English-speaking or European countries. The underrepresented domains are mostly Asian and some are South American.

Second-level domains

In Tables 2a and 2b the top 10 second-level domains are given by document frequency and by data in terabytes. With 2.5% of all documents and 4.2% of the total data, youtube.com is the highest ranking second-level domain in the CC corpus 2012. Other high-ranking domains are blog publishing services like blogspot.com, wordpress.

[10] Source: http://w3techs.com/technologies/overview/top_level_domain/all (March 2013).
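
The representativeness check described in the Top-level domains section combines two steps: a Spearman rank correlation between corpus and survey frequencies, and a per-TLD ratio that is labelled over- or underrepresented relative to 1. The sketch below illustrates both under stated assumptions. The function names are hypothetical, the TLDs `.aa`, `.bb`, `.cc` and their values are synthetic placeholders rather than corpus or survey figures, and Spearman's ρ is computed here as the Pearson correlation of average ranks, which may differ from the tool the authors used.

```python
def representation_ratios(corpus_freq, survey_freq):
    """Ratio of corpus relative frequency to the survey's expected value."""
    return {
        tld: corpus_freq[tld] / survey_freq[tld]
        for tld in corpus_freq
        if survey_freq.get(tld, 0) > 0
    }


def label(ratio):
    """Label a ratio relative to 1, as described in the text."""
    if ratio > 1:
        return "overrepresented"
    if ratio < 1:
        return "underrepresented"
    return "balanced"


def average_ranks(values):
    """Rank values from 1 (smallest); ties share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman_rho(xs, ys):
    """Spearman's rho as the Pearson correlation of the two rank vectors."""
    rx, ry = average_ranks(xs), average_ranks(ys)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


# Synthetic placeholder values, not taken from the corpus or the survey.
corpus = {".aa": 0.50, ".bb": 0.30, ".cc": 0.20}
survey = {".aa": 0.40, ".bb": 0.35, ".cc": 0.25}

ratios = representation_ratios(corpus, survey)
labels = {tld: label(r) for tld, r in ratios.items()}
tlds = sorted(corpus)
rho = spearman_rho([corpus[t] for t in tlds], [survey[t] for t in tlds])
```

With these placeholder inputs both lists have the same ordering, so ρ is 1, while `.aa` is labelled overrepresented and the other two underrepresented. The sketch does not guard against a constant input, where the rank variance is zero.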
Rank SLD Abs. freq. Rel. freq.
1 youtube.com 95,866,041 0.0250
2 blogspot.com 45,738,134 0.0119
3 tumblr.com 30,135,714 0.0079
4 flickr.com 9,942,237 0.0026
5 amazon.com 6,470,283 0.0017
6 google.com 2,782,762 0.0007
7 thefreedictionary.com 2,183,753 0.0006
8 tripod.com 1,874,452 0.0005
9 hotels.com 1,733,778 0.0005
10 flightaware.com 1,280,875 0.0003
(a) .com domains by document frequency.

Rank SLD Terabytes Rel. data
1 youtube.com 8.7560 0.0417
2 wordpress.com 1.0128 0.0048
3 flickr.com 0.7500 0.0036
4 hotels.com 0.2839 0.0014
5 typepad.com 0.1693 0.0008
6 federal-hotel.com 0.1617 0.0008
7 shopzilla.com 0.1230 0.0006
8 shopping.com 0.1210 0.0006
9 yoox.com 0.1081 0.0005
10 tripadvisor.es 0.1074 0.0005
(b) .com domains by data in terabytes.

Rank SLD Abs. freq. Rel. freq.
1 citysite.net 1,194,938 0.0003
2 yahoo.co.jp 1,022,024 0.0003
3 amazon.de 864,516 0.0002
4 wrzuta.pl 827,315 0.0002
5 dancesportinfo.net 675,029 0.0002
6 atwiki.jp 665,594 0.0002
7 weblio.jp 642,366 0.0002
8 blogg.se 611,502 0.0002
9 kijiji.ca 608,583 0.0002
10 rakuten.co.jp 564,760 0.0001
(c) Non-.com domains by document frequency.

Rank SLD Terabytes Rel. data
1 tripadvisor.es 0.1074 0.0005
2 tripadvisor.in 0.1051 0.0005
3 ca.gov 0.0857 0.0004
4 epa.gov 0.0803 0.0004
5 iha.fr 0.0781 0.0004
6 amazon.fr 0.0768 0.0004
7 who.int 0.0763 0.0004
8 europa.eu 0.0590 0.0003
9 autotrends.be 0.0569 0.0003
10 astrogrid.org 0.0555 0.0003
(d) Non-.com domains by data in terabytes.
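
Entries such as yahoo.co.jp and rakuten.co.jp in the tables above suggest that documents were grouped by the domain registered directly under a public suffix. The source does not describe how this grouping was done, so the following is only a sketch under assumptions: `registered_domain` is a hypothetical helper, and `MULTI_LABEL_SUFFIXES` is a tiny hand-picked subset standing in for the full Public Suffix List that a real run would need.

```python
from collections import Counter
from urllib.parse import urlsplit

# Tiny illustrative subset; a real run would load the full Public Suffix List.
MULTI_LABEL_SUFFIXES = {"co.uk", "org.uk", "co.jp"}


def registered_domain(url: str) -> str:
    """Return the domain registered directly under the (assumed) public suffix."""
    host = (urlsplit(url).hostname or "").rstrip(".")
    labels = host.split(".")
    if len(labels) < 2:
        return host
    # Keep three labels when the last two form a known multi-label suffix.
    two_label_suffix = ".".join(labels[-2:]) in MULTI_LABEL_SUFFIXES
    keep = 3 if two_label_suffix and len(labels) >= 3 else 2
    return ".".join(labels[-keep:])


# Hypothetical sample URLs, for illustration only.
sample = [
    "http://www.youtube.com/watch?v=x",
    "http://m.youtube.com/",
    "http://auctions.yahoo.co.jp/",
    "http://www.epa.gov/air",
]
sld_counts = Counter(registered_domain(u) for u in sample)
```

Dividing each count by the total number of documents would then give a relative frequency column like the one in tables (a) and (c).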
Character encoding Abs. freq. Rel. freq.
utf-8 1,866,333,314 0.4874
unknown 1,647,477,248 0.4303
iso-8859-1 229,671,038 0.0600
windows-1251 26,798,707 0.0070
iso-8859-2 10,088,397 0.0026
iso-8859-15 8,605,343 0.0022
windows-1256 5,454,253 0.0014
shift-jis 5,289,261 0.0014
windows-1252 5,173,227 0.0014
euc-jp 4,201,400 0.0011
others 18,074,100 0.0047

Media type Abs. freq. Rel. freq.
text/html 3,532,930,141 0.9227
application/pdf 92,710,175 0.0242
text/xml 80,184,383 0.0209
text/css 22,872,511 0.0060
application/x-javascript 21,198,040 0.0055
image/jpeg 14,116,839 0.0037
application/javascript 11,548,630 0.0030
text/plain 10,713,438 0.0028
application/msword 6,648,861 0.0017
application/xml 4,999,123 0.0013
application/rss+xml 4,200,583 0.0011