June 2013
[Figure: 25 thousand ARC files are read from s3://aws-publicdatasets/common-crawl/ and processed on an EMR Hadoop cluster made up of one master and several core nodes, all m1.xlarge spot instances. Cost label: 35 x ... x 6h / cluster = 1260 inst. hours.]

parse-output/segment/[segment]/[ARC_file]. This makes it possible to link a given web document to the ARC file in which it is stored. ARC file names are unique, so the segment name is not necessary for identification.
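Because ARC file names are unique, the last component of a path in the parse-output/segment/[segment]/[ARC_file] layout is enough to identify the file. The following is a minimal sketch of that idea in Python, not the code used in the original work. The function name and the sample path are hypothetical and chosen only for illustration.

```python
from pathlib import PurePosixPath


def arc_file_name(path: str) -> str:
    """Return the last path component, taken here to be the ARC file name.

    Assumes the layout parse-output/segment/[segment]/[ARC_file];
    the segment component is ignored because ARC file names are unique.
    """
    return PurePosixPath(path).name


# Hypothetical path: the segment and file names are made up for this example.
sample = "parse-output/segment/SEGMENT_A/EXAMPLE_0.arc.gz"
print(arc_file_name(sample))
```

Keying records by the file name alone keeps lookups independent of the segment in which a file happens to be stored.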