
Collect action ------ Avoid the collect action on large datasets.

collect() reads all the data from the file and returns it to the driver, writing everything into memory, so it is very expensive.

Name is the DataFrame; sc is the SparkContext.

We are applying the collect() function to the RDD.


RDDname.countByValue() == counts how many times each name or word is repeated in the list

RDD.take(5) ##### reads only the first 5 results. We can change the number.
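The three actions above can be sketched side by side. This is a minimal sketch, not the code from the notes: the function name inspect_names and the sample data are hypothetical, and sc is assumed to be an existing SparkContext (for example one obtained with SparkContext.getOrCreate()).

```python
def inspect_names(sc, names):
    """Assumes `sc` is an existing SparkContext; `names` is a Python list."""
    rdd = sc.parallelize(names)        # hypothetical RDD built from a local list
    first_five = rdd.take(5)           # action: returns at most 5 elements
    counts = dict(rdd.countByValue())  # action: {value: number of occurrences}
    everything = rdd.collect()         # action: pulls ALL elements into driver memory
    return first_five, counts, everything
```

On a small list all three are cheap; on a large file, take(5) is usually the safer way to look at the data, because collect() brings every element back to the driver.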

Transformations

If there is no data dependency between partitions, shuffling is not required: the data in RDD1 on node1 moves directly to RDD2 on node1.

Example: If we want to keep only the odd numbers, we just apply the filter transformation.
Map:

Use map to apply a function to every element of an RDD. map produces exactly one output element for each input element, so the number of elements is unchanged, and so is the number of partitions.

Example: To multiply every element by 2, we create a lambda function and map it over the RDD.

.collect() is called to show the output.
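A minimal sketch of the doubling example, assuming num is an existing RDD of numbers (for instance sc.parallelize([1, 2, 3, 4])); the function name double_all is hypothetical.

```python
def double_all(num):
    """`num` is assumed to be an existing RDD of numbers."""
    doubled = num.map(lambda a: a * 2)  # transformation: lazy, nothing runs yet
    return doubled.collect()            # action: runs the job and returns a list
```

With the RDD built from [1, 2, 3, 4], this returns [2, 4, 6, 8]: one output element per input element.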

num is the RDD.

.map is a transformation, not an action; .collect() is the action that triggers the computation.

In the lambda, a is raised to the power 2, so every element is squared.
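A sketch of the squaring step, assuming num is an existing RDD of numbers. The use of pow(a, 2) is a guess at what the original example used; a ** 2 would be equivalent. The function name square_all is hypothetical.

```python
def square_all(num):
    """`num` is assumed to be an existing RDD of numbers."""
    squared = num.map(lambda a: pow(a, 2))  # transformation: raise each element to the power 2
    return squared.collect()                # action: returns the squared values as a list
```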

****Adding word to all words in list*****

We use the lambda function a: "mr. " + a and map it over the RDD.
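The prefixing step might look like the following sketch, assuming names is an existing RDD of strings; the function name add_prefix is hypothetical.

```python
def add_prefix(names):
    """`names` is assumed to be an existing RDD of strings."""
    prefixed = names.map(lambda a: "mr. " + a)  # prepend "mr. " to every word
    return prefixed.collect()
```

For an RDD built from ["Bob", "Raj"], this returns ["mr. Bob", "mr. Raj"].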


FlatMap: the number of output elements need not equal the number of input elements, because each input element can produce zero or more output elements.
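To make the difference from map concrete, here is a small sketch that splits lines into words. It assumes lines is an existing RDD of strings; the function name and the choice of splitting on spaces are illustrative, not taken from the notes.

```python
def words_map_vs_flatmap(lines):
    """`lines` is assumed to be an existing RDD of strings."""
    split = lambda line: line.split(" ")
    nested = lines.map(split).collect()    # one list per line: same element count as input
    flat = lines.flatMap(split).collect()  # lists are flattened: one element per word
    return nested, flat
```

For two input lines "a b" and "c", map gives two elements ([["a", "b"], ["c"]]) while flatMap gives three (["a", "b", "c"]).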

Filter
Filter returns a new dataset containing only the elements of the source that satisfy a filter criterion, for example the odd values, the even values, or the multiples of a number.

********Finding even numbers in list by Filter*****
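A minimal sketch of the even-number filter, assuming num is an existing RDD of integers; the function name even_numbers is hypothetical.

```python
def even_numbers(num):
    """`num` is assumed to be an existing RDD of integers."""
    evens = num.filter(lambda a: a % 2 == 0)  # keep only elements divisible by 2
    return evens.collect()
```

Changing the predicate to a % 2 != 0 would keep the odd numbers instead.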

*****Filter the words which start with the letter "B"*****
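The same filter pattern works on strings. This sketch assumes names is an existing RDD of strings; the function name is hypothetical, and the match is case-sensitive.

```python
def names_starting_with_b(names):
    """`names` is assumed to be an existing RDD of strings."""
    b_names = names.filter(lambda w: w.startswith("B"))  # case-sensitive match on "B"
    return b_names.collect()
```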

Union
Combines both datasets.

Union adds the two datasets together. It does not sort the result.
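A short sketch of union, assuming rdd1 and rdd2 are existing RDDs; the function name combine is hypothetical. Note that RDD union keeps duplicates, so the result has as many elements as both inputs together.

```python
def combine(rdd1, rdd2):
    """`rdd1` and `rdd2` are assumed to be existing RDDs."""
    both = rdd1.union(rdd2)  # transformation: all elements of both RDDs, duplicates kept
    return both.collect()    # no sorting is applied to the result
```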

Sample
Wide Transformation

GroupBY:

We apply groupBy to one dataset, passing a lambda function that computes the grouping key.

lambda x: x[0] takes the first letter of each element and groups by it, so all names starting with the letter B end up in one group, and so on.

To check the results, we iterate over the keys and values: a for loop over (k, v) in the new RDD dataset prints each group.
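The groupBy and the checking loop could be sketched as follows, assuming names is an existing RDD of non-empty strings; the function name is hypothetical. Each value v is an iterable of the grouped elements, so it is converted to a sorted list before printing.

```python
def group_by_first_letter(names):
    """`names` is assumed to be an existing RDD of non-empty strings."""
    grouped = names.groupBy(lambda x: x[0])  # key = first letter of each name
    result = {}
    for k, v in grouped.collect():           # each item is a (key, iterable) pair
        result[k] = sorted(v)
        print(k, result[k])
    return result
```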

Intersection: To find the records common to two RDDs, we can use intersection.

Order doesn't matter: the RDDs can be given in either position and the result is the same set of common records, like an inner join.
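A minimal sketch of intersection, assuming rdd1 and rdd2 are existing RDDs of sortable values; the function name common_records is hypothetical. The result is sorted here only because the order of the returned elements is not guaranteed.

```python
def common_records(rdd1, rdd2):
    """`rdd1` and `rdd2` are assumed to be existing RDDs of sortable values."""
    common = rdd1.intersection(rdd2)  # elements present in both RDDs
    return sorted(common.collect())   # sorted only to get a stable order
```

Calling common_records(rdd2, rdd1) gives the same result as common_records(rdd1, rdd2).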

Subtract

It works like rdd1 - rdd2 (rdd1 minus rdd2): it returns the elements of rdd1 that are not in rdd2.
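A short sketch of subtract, assuming rdd1 and rdd2 are existing RDDs of sortable values; the function name only_in_first is hypothetical. Unlike intersection, the order of the arguments matters here.

```python
def only_in_first(rdd1, rdd2):
    """`rdd1` and `rdd2` are assumed to be existing RDDs of sortable values."""
    remaining = rdd1.subtract(rdd2)     # elements of rdd1 that do not appear in rdd2
    return sorted(remaining.collect())  # sorted only to get a stable order
```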

Distinct
