Spark provides both a REPL (Read-Evaluate-Print Loop) environment and a stable
environment for deployment.
It is a general-purpose engine for data processing and analysis.
It provides SQL support, a REPL environment, and a stable deployment environment.
It is a part of the Hadoop ecosystem.
It uses distributed computing.
Spark is an improved alternative to Hadoop MapReduce.
Spark uses the RDD (Resilient Distributed Dataset) as its fundamental unit.
RDDs are in-memory objects, and all data is processed through them.
COMPONENTS OF SPARK:
Spark Core - Contains the basic functionality of Spark and provides the API for
RDDs. It is the computing engine.
Storage System - Stores the data. Spark supports the local file system, HDFS, Hive,
Cassandra, etc.
Cluster Manager - Distributes the work across a cluster of machines. Spark ships
with its own standalone cluster manager by default, and also supports Apache Mesos
and Hadoop's YARN.
Spark SQL - Provides a SQL interface while retaining in-memory performance. It
allows Python/Java data manipulations on the same dataset within a single program.
Spark Streaming - Enables real-time data processing.
MLlib - Provides machine learning functionality. It brings built-in machine
learning libraries for Python/R/Scala into a distributed environment. Since machine
learning requires processing the data through multiple iterations, the in-memory
RDD is faster and gives better performance compared to MapReduce.
GraphX - Used for graph processing and graph-parallel computation.
A Few Examples -
1. We have an airlines file with a code column and a description column. We will
perform some operations on it.
-- airlines.collect() : returns all records as a list
-- airlines.first() : returns the first record
-- airlines.take(10) : returns the first 10 records
-- airlines.count() : returns the total number of records