Working With RDDs in Spark: Chapter 11
201509
Course Chapters
Distributed Data Processing with Spark
10 Spark Basics
11 Working with RDDs in Spark
12 Aggregating Data with Pair RDDs
13 Writing and Deploying Spark Applications
14 Parallel Processing in Spark
15 Spark RDD Persistence
16 Common Patterns in Spark Data Processing
17 Spark SQL and DataFrames
© Copyright 2010-2015 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera.
Working With RDDs
Chapter Topics
RDDs
Creating RDDs From Collections
§ Useful when
– Testing
– Generating data programmatically
– Integrating
Creating RDDs from Files (1)
Creating RDDs from Files (2)
§ textFile maps each line in a file to a separate RDD element
File contents:
I've never seen a purple cow.\n
I never hope to see one;\n
But I can tell you, anyhow,\n
I'd rather see than be one.\n
Resulting RDD elements:
I've never seen a purple cow.
I never hope to see one;
But I can tell you, anyhow,
I'd rather see than be one.
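The line-to-element mapping can be modeled in plain Python, without a Spark installation. This is only a sketch of the semantics: `text_file_elements` is a hypothetical helper and not a Spark API. In PySpark the corresponding call would be `sc.textFile(path)`.

```python
def text_file_elements(contents):
    """Plain-Python model of textFile: one element per line of input."""
    # splitlines() drops the line terminator, so elements carry no "\n"
    return contents.splitlines()


poem = (
    "I've never seen a purple cow.\n"
    "I never hope to see one;\n"
    "But I can tell you, anyhow,\n"
    "I'd rather see than be one.\n"
)

elements = text_file_elements(poem)
for element in elements:
    print(repr(element))
```

Four lines in the file become four elements, none of which keeps its trailing newline.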
Input and Output Formats (1)
Input and Output Formats (2)
Whole File-Based RDDs (1)
§ sc.wholeTextFiles(directory)
– Maps entire contents of each file in a directory to a single RDD element
– Works only for small files (element must fit in memory)
file2.json
{
"firstName":"Barney",
"lastName":"Rubble",
"userid":"234"
}
(file1.json,{"firstName":"Fred","lastName":"Flintstone","userid":"123"} )
(file2.json,{"firstName":"Barney","lastName":"Rubble","userid":"234"} )
(file3.xml,… )
(file4.xml,… )
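The (file, contents) pairing can be modeled in plain Python as a rough sketch. `whole_text_files` is a hypothetical helper, not the Spark API, and the temporary directory stands in for the input directory. In Spark itself the key of each pair is the file's path rather than the bare file name used here.

```python
import json
import tempfile
from pathlib import Path


def whole_text_files(directory):
    """Plain-Python model of wholeTextFiles: one (name, contents) pair per file."""
    files = sorted(p for p in Path(directory).iterdir() if p.is_file())
    return [(p.name, p.read_text()) for p in files]


with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "file1.json").write_text(
        '{"firstName":"Fred","lastName":"Flintstone","userid":"123"}')
    Path(tmp, "file2.json").write_text(
        '{"firstName":"Barney","lastName":"Rubble","userid":"234"}')
    pairs = whole_text_files(tmp)

# Each value holds a complete document, so it can be parsed in one step
records = [json.loads(contents) for _name, contents in pairs]
print([name for name, _contents in pairs])
print(records[1]["lastName"])
```

Because every element holds a whole file, multi-line formats such as JSON or XML stay intact, which a line-per-element read would not guarantee.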
Whole File-Based RDDs (2)
Chapter Topics
Some Other General RDD Operations
§ Single-RDD Transformations
– flatMap – maps one element in the base RDD to multiple elements
– distinct – filter out duplicates
– sortBy – use provided function to sort
§ Multi-RDD Transformations
– intersection – create a new RDD with all elements in both original RDDs
– union – add all elements of two RDDs into a single new RDD
– zip – pair each element of the first RDD with the corresponding element of the second
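As a rough illustration of two operations from the list, sortBy and intersection, here is a plain-Python model of what they compute. Python lists stand in for RDDs and the data is made up for the example. The PySpark calls named in the comments are `rdd.sortBy(keyfunc)` and `rdd.intersection(other)`.

```python
a = [3, 1, 2, 3, 5]
b = [3, 5, 5, 8]

# sortBy: order elements by the value a key function returns.
# PySpark equivalent: rdd_a.sortBy(lambda x: -x)
descending = sorted(a, key=lambda x: -x)

# intersection: elements present in both inputs, without duplicates.
# PySpark equivalent: rdd_a.intersection(rdd_b), which does not guarantee
# output order; sorted() is used here only to make the result stable.
common = sorted(set(a) & set(b))

print(descending)
print(common)
```

Note that intersection removes duplicates even when the inputs contain them, while sortBy keeps every element.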
Example: flatMap and distinct
Python
> sc.textFile(file) \
    .flatMap(lambda line: line.split()) \
    .distinct()
Scala
> sc.textFile(file).
    flatMap(line => line.split(' ')).
    distinct()
Input file:
I've never seen a purple cow.
I never hope to see one;
But I can tell you, anyhow,
I'd rather see than be one.
After flatMap: I've, never, seen, a, purple, cow, I, never, hope, to, …
After distinct: I've, never, seen, a, purple, cow, I, hope, to, …
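The same pipeline can be traced in plain Python to see what each step produces. This is a model of the semantics, with a list standing in for the RDD, and not Spark code. Spark's distinct makes no promise about output order, so the first-seen order below is an artifact of the model.

```python
from itertools import chain

lines = [
    "I've never seen a purple cow.",
    "I never hope to see one;",
]

# flatMap: each line yields several words, flattened into one sequence
words = list(chain.from_iterable(line.split() for line in lines))

# distinct: drop repeated words (dict.fromkeys keeps first-seen order)
unique = list(dict.fromkeys(words))

print(words)
print(unique)
```

With plain whitespace splitting, punctuation stays attached to the words ("cow.", "one;"), so a real word count would usually normalize the tokens first.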
Examples: Multi-RDD Transformations
rdd1: Chicago, Boston, Paris, San Francisco, Tokyo
rdd2: San Francisco, Boston, Amsterdam, Mumbai, McMurdo Station
rdd1.union(rdd2): Chicago, Boston, Paris, San Francisco, Tokyo, San Francisco, Boston, Amsterdam, Mumbai, McMurdo Station
rdd1.subtract(rdd2): Tokyo, Paris, Chicago
rdd1.zip(rdd2): (Chicago,San Francisco), (Boston,Boston), (Paris,Amsterdam), (San Francisco,Mumbai), (Tokyo,McMurdo Station)
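The three results above can be reproduced with a plain-Python model in which lists stand in for the RDDs. This sketches the semantics only. In Spark, element order after subtract is not guaranteed, and zip assumes both RDDs have the same number of partitions and the same number of elements in each partition.

```python
rdd1 = ["Chicago", "Boston", "Paris", "San Francisco", "Tokyo"]
rdd2 = ["San Francisco", "Boston", "Amsterdam", "Mumbai", "McMurdo Station"]

# union: all elements of both inputs, duplicates kept
union = rdd1 + rdd2

# subtract: elements of rdd1 that do not appear in rdd2
in_rdd2 = set(rdd2)
subtract = [city for city in rdd1 if city not in in_rdd2]

# zip: pair elements by position
zipped = list(zip(rdd1, rdd2))

print(len(union))
print(subtract)
print(zipped[0])
```

Union keeps "Boston" and "San Francisco" twice; applying distinct afterwards would be one way to remove the repeats.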
Some Other General RDD Operations
Chapter Topics
Essential Points
§ RDDs can be created from files, parallelized data in memory, or other RDDs
§ sc.textFile reads newline-delimited text, one line per RDD record
§ sc.wholeTextFiles reads entire files into single RDD records
§ Generic RDDs can consist of any type of data
§ Generic RDDs provide a wide range of transformation operations
Chapter Topics
Homework: Process Data Files with Spark