
Why do we Require Spark:

Spark provides both a REPL (Read-Evaluate-Print Loop) environment and a stable
environment for deployment.
It is a general-purpose engine for data processing and analysis.
It provides SQL features, a REPL environment, and a stable deployment environment.
It is a part of the Hadoop ecosystem.
It uses distributed computing.
Spark is a faster, more flexible successor to Hadoop MapReduce.
Spark uses the RDD (Resilient Distributed Dataset) as its fundamental unit.
RDDs are in-memory objects, and all data is processed through them.
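
To make this concrete, here is a minimal spark-shell (Scala) sketch; the variable names are illustrative and not taken from the examples later in these notes. It builds an RDD from a local collection, applies a lazy transformation, caches the result in memory, and then runs an action that triggers the actual computation.

val numbers = sc.parallelize(1 to 100)      // create an RDD from a local collection
val squares = numbers.map(n => n * n)       // transformation: defined lazily, not run yet
squares.cache()                             // keep the resulting RDD in memory for reuse
squares.count()                             // action: triggers the distributed computation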

COMPONENTS OF SPARK:
Spark Core - Contains the basic functionality of Spark and provides the API for RDDs. It is
the computing engine.
Storage System - Stores the data. Spark supports the local file system, HDFS, Hive,
Cassandra, etc.
Cluster Manager - Distributes the work across a cluster of machines. Spark ships with its
own standalone cluster manager and can also run on Apache Mesos or Hadoop's YARN.
Spark SQL - Provides a SQL interface while retaining in-memory performance. It
allows SQL queries and Python/Java/Scala manipulations on the same dataset within a
single program (see the sketch after this list).
Spark Streaming - Enables real-time data processing.
MLlib - Provides machine learning functionality. It makes built-in libraries usable from
Python/R/Scala in a distributed environment. As machine learning requires multiple
passes over the data, in-memory RDDs are faster and give better performance compared
to MapReduce.
GraphX - Used for graph processing and graph computations.
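
As a rough illustration of the Spark SQL component, the sketch below assumes a spark-shell where the SparkSession is available as spark; the file path and column names are made up for the example. It loads a CSV file with a header into a DataFrame, registers it as a temporary view, and queries it with SQL while the data stays in memory.

val df = spark.read.option("header", "true").csv("path to airlines.csv")   // load CSV, first line as header
df.createOrReplaceTempView("airlines")                                      // register the DataFrame as a SQL view
spark.sql("SELECT Code, Description FROM airlines LIMIT 10").show()         // query it with plain SQL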

A Few Examples -

1. We have an airlines file with a Code column and a Description column. We will perform some
operations on it.

-- val airlines = sc.textFile("path to file")

-- airlines.collect()

This will display the file's contents as an array of lines.

-- airlines.first()

This will return the first line, i.e. the header row containing the column names Code and Description.

-- airlines.take(10)

This will return only the first 10 rows.

-- airlines.count()

This gives the total number of entries.

-- airlines.filter(x => !x.contains("Description"))


This filters out the header row, which contains the word "Description". The result is a new RDD
containing only the data rows.
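
Putting it together, a hedged sketch (assuming each line is comma-separated with exactly two fields, Code and Description): filter() returns a new RDD, so store it and keep working with the header-free data.

val noHeader = airlines.filter(x => !x.contains("Description"))   // new RDD without the header row
noHeader.count()                                                   // number of data rows only
val pairs = noHeader.map(line => {
  val cols = line.split(",", 2)                                    // split at the first comma
  (cols(0), cols(1))                                               // (code, description) pair
})
pairs.take(5)                                                      // inspect the first five pairs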
