CHAPTER 3 ■ DSTREAMS: REAL-TIME RDDS

The Anatomy of a Spark Streaming Application

Let's dissect the example translation application:

1. Depending on the submission process, the driver program begins executing on either the user's local machine or a cluster node. On execution, the driver program creates a new StreamingContext and connects to the Spark subsystem.

   Note that a StreamingContext can also be created from a checkpointed file or an existing SparkContext. Checkpointing is the process of saving the state of the StreamingContext or a DStream to secondary storage. In the case of DStreams, checkpointing ensures that the root of the lineage graph remains within bounds. That is why checkpointing is a mandatory step for stateful DStream operations, as you will see shortly.

   Internally, StreamingContext creates a SparkContext object and a JobScheduler object. The latter is used to ship streaming jobs to Spark.

2. The rest of the lines before start() set up the computation but do not actually perform the execution. Let's wait on those lines and first examine start(). Behind the scenes, start() kicks off the execution of JobScheduler, which in turn starts JobGenerator. JobGenerator is in charge of creating jobs from DStreams, RDD checkpointing, and metadata checkpointing every batch interval. awaitTermination() uses a condition variable-like mechanism to block until either the application explicitly invokes stop() or the application is terminated.

3. The first actual processing line to be executed is the creation of an input stream using textFileStream(), which in turn invokes fileStream(). fileStream() returns a FileInputDStream object.

4. FileInputDStream internally monitors the specified directory on the filesystem, and every batch interval picks up new files that have become visible. Each one of these files is turned into an RDD. FileInputDStream in its compute() method returns a UnionRDD of all these files.

5. Invoking map() on the FileInputDStream results in the creation of a MappedDStream, which simply invokes the map function on the underlying RDD.

6. Once the map function has finished execution, saveAsTextFiles() results in the invocation of saveAsTextFile() for each RDD in the DStream.

7. Steps 4 to 6 are repeated (potentially) forever.

Figure 3-2 shows the execution of a single batch. Each blue box represents a single DStream with a single RDD (because you only analyzed a single book). There are four DStreams. The first contains a UnionRDD for the input data. textFileStream() internally performs a map transformation to cull the key from the key-value pair of line position (byte offset) and line content. That accounts for the first MapPartitionsRDD.

It is worth highlighting that behind the scenes, an operation on a DStream invokes the same operation on its underlying RDDs. For example, a map on a DStream applies the map to each RDD in it. The second map operation is from line 30 of Listing 3-1. The last MapPartitionsRDD is for the output operation on line 31 (internally, saveAsTextFiles makes a call to saveAsTextFile on each RDD, which in turn calls a save function for each partition in that RDD using a map).
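The walkthrough above can be condensed into a minimal sketch. It is not Listing 3-1: the object name TranslationSketch, the translate helper, the tiny in-line dictionary, the 10-second batch interval, and the two command-line arguments are all assumptions made for illustration. Only the Spark calls (StreamingContext, textFileStream, map, saveAsTextFiles, start, awaitTermination) correspond to the steps described, and the sketch assumes the spark-streaming library is on the classpath.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

// Hypothetical stand-in for the translation application; not the book's Listing 3-1.
object TranslationSketch {

  // Pure helper (assumed): swap each whitespace-separated word that has a dictionary entry.
  def translate(line: String, dictionary: Map[String, String]): String =
    line.split("\\s+").map(word => dictionary.getOrElse(word, word)).mkString(" ")

  def main(args: Array[String]): Unit = {
    require(args.length == 2, "usage: TranslationSketch <inputDir> <outputPrefix>")
    val Array(inputDir, outputPrefix) = args

    // Assumed toy dictionary; a real application would load one from storage.
    val dictionary = Map("hello" -> "hola", "world" -> "mundo")

    // Step 1: the driver creates the StreamingContext (and with it a SparkContext).
    // The master falls back to local mode only if spark-submit did not set one.
    val conf = new SparkConf()
      .setAppName("TranslationSketch")
      .setIfMissing("spark.master", "local[2]")
    val ssc = new StreamingContext(conf, Seconds(10)) // batch interval is an assumption

    // Steps 3 and 4: an input DStream that picks up new files in inputDir every batch.
    val lines = ssc.textFileStream(inputDir)

    // Step 5: map on the DStream is applied to the RDD of each batch.
    val translated = lines.map(line => translate(line, dictionary))

    // Step 6: output operation; one saveAsTextFile call per batch RDD.
    translated.saveAsTextFiles(outputPrefix, "txt")

    // Step 2: nothing above has executed yet. start() begins scheduling batches,
    // and awaitTermination() blocks the driver until stop() or termination.
    ssc.start()
    ssc.awaitTermination()
  }
}
```

With this layout, each batch is written to its own output directory named from the prefix, the batch time in milliseconds, and the suffix, so the output location grows by one directory per batch interval for as long as the application runs.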
