Harnessing Big Data: OLTP: Online Transaction Processing (DBMSs) • OLAP: Online Analytical Processing (Data Warehousing) • RTAP: Real-Time Analytics Processing (Big Data architecture & technology)
Characteristics of Big Data: Scale (Volume), Complexity (Variety) & Speed (Velocity)

HDFS should have the following properties: Distributed - Fault and error tolerant - Scalable - High performance - High availability - Able to store and manipulate large files - Low (or medium) cost.

Distributed FS: Mounted on multiple servers (transparent to the client) - Consistent namespace across all servers (clients can think of it as one big system, without worrying about location) - Concurrency - Failure tolerant - Should allow remote access.

HDFS Concepts: Latency (measures speed of response, e.g. if you hit a key, how fast you get the data back) - Throughput (measures how much data you are able to process in a unit of time) - HDFS is designed for large batch jobs, not smaller interactive jobs.

Features of HDFS: multi-node system - fault-tolerant and uses redundancy for storage - best suited for batch jobs and can provide high throughput of data processing - uses inexpensive, commodity hardware.

Block size in HDFS is much larger than in other file systems, to minimize disk seek time relative to data transfer time.

NameNode: The NameNode does not directly read or write data (this is one of the reasons for HDFS's scalability). - Clients interact with the NN only to update the namespace (file names) or to get block locations. - Clients interact with DNs to get data (HDFS is single-write, multiple-read). - The NN checks periodically on DataNode state.

How does the NN persist metadata? Two different constructs:
- fsimage: a point-in-time snapshot of the file system's metadata. Think of it like Mac's Time Machine - it is good for writing bulk changes but not very suitable for incremental changes; e.g. if you rename a file, you don't want to rewrite the whole fsimage.
- edits file: smaller incremental changes are recorded here. Thus the major changes (checkpoints) are recorded in the fsimage file and incremental updates are in the edits file.

NameNode data storage and recovery operations: The secondary namenode periodically merges the fsimage and edits file into a new fsimage file - The NN stores all metadata about block-node mapping in memory - fsimage is a point-in-time snapshot of the metadata - Incremental edits after the point-in-time snapshot are written to the edits file.

NN Failure: The namenode is a single point of failure. What happens if it crashes? Two solutions: - Back up the files that make up the persistent state to multiple filesystems. - Run a secondary namenode, which periodically merges the latest namespace image with the edit log and creates checkpoints.

Properties of HDFS as a storage medium: Distributed, Partitioned, Fault-tolerant (by using replication), Write-once read-many, Commodity hardware, Files stored as blocks, Designed for high-latency, high-throughput batch processing.
HDFS Federation: In Hadoop 2.x, the concept of HDFS federation was introduced. It allows a cluster to scale by adding namenodes, each of which manages a portion of the filesystem namespace.

Design decisions: What would happen if a cluster needs more processing power? - HDFS is based on the concept of "scale-out", not "scale-up". - That means you can add more commodity hardware easily; you don't need to upgrade to expensive hardware. - Another advantage: the failure of an individual commodity node does not bring down the cluster.

Diff B/W RDBMS & HDFS: • RDBMS is for structured data; HDFS can work with unstructured, larger datasets. • RDBMS is transactional and has ACID properties; HDFS is batch-oriented and failure tolerant. • RDBMS supports read-write operations; HDFS has the write-once, read-many property.

What is high throughput? How is it achieved? - Throughput measures the amount of processing done in a unit of time. - By performing computation in a distributed, parallel way, HDFS achieves high throughput. - HDFS provides streaming access to data, which ensures the entire data stream can be accessed and processed in the most efficient manner.

What is commodity hardware? Would you use it for the namenode also? - Commodity hardware is inexpensive hardware that is not of especially high quality or high availability. Hadoop can be installed on average commodity hardware; we don't need supercomputers or high-end machines to run Hadoop. - No: the namenode should not run on commodity hardware, because the entire HDFS relies on it. It is the single point of failure in HDFS, so the namenode has to be a high-availability machine.

What is the secondary namenode? Is it a backup node to the main namenode? - The secondary namenode keeps checkpoints of the namenode's metadata: it reads the edit log from the namenode at regular intervals and applies it to its own copy of the fsimage, so that the fsimage file always reflects a recent state of HDFS. It is a checkpointing helper, not a hot standby.

Hadoop MapReduce Architecture (YARN)
The concept of YARN was introduced in Hadoop 2. It splits the functionalities of resource management and job scheduling/monitoring into separate daemons, and allows multiple application engines and frameworks such as graph processing, stream processing, batch processing, etc.
The ResourceManager is the ultimate authority that arbitrates resources among all the applications in the system. It has two main components: the Scheduler and the ApplicationsManager.
The Scheduler is responsible for allocating resources to the various running applications, subject to familiar constraints of capacities, queues, etc. The Scheduler allocates resources based on the idea of a container, which incorporates elements such as memory, CPU, disk, network, etc. Scheduling policies can be the FIFO scheduler, the CapacityScheduler or the FairScheduler.
The ApplicationsManager is responsible for accepting job submissions, negotiating the first container of an application from the ResourceManager (i.e. starting application masters on worker nodes), and providing the service for restarting the ApplicationMaster container on failure.
The NodeManager is the per-machine framework agent responsible for containers, monitoring their resource usage (CPU, memory, disk, network) and reporting the same to the ResourceManager/Scheduler. The NodeManager runs on the worker nodes.
The per-application ApplicationMaster has the responsibility of negotiating appropriate resource containers from the Scheduler, tracking their status and monitoring progress. Key points: - There is one ApplicationMaster for every application running in the cluster. - The ApplicationMaster can request more containers by contacting the Scheduler in the ResourceManager. - One ApplicationMaster can manage multiple application containers.

Task Tracker: In classic (pre-YARN) MapReduce, a process called the TaskTracker runs on each DataNode. It monitors the tasks and communicates results to a JobTracker that runs on the NameNode.

Speculative Execution: A MapReduce job is dominated by its slowest task. MapReduce attempts to locate slow tasks (stragglers) and run redundant (speculative) copies that will, optimistically, commit before the corresponding stragglers. This process is known as speculative execution. Only one copy of a straggler is allowed to be speculated. Whichever of the two copies of a task commits first becomes the definitive copy, and the other copy is killed by the JobTracker.

MapReduce design considerations: process vast amounts of data (multi-terabyte datasets) - parallel processing - large clusters (thousands of nodes) of commodity hardware - reliable - fault-tolerant - should be able to increase processing power by adding more nodes ("scale-out", not "scale-up") - sharing data or processing between nodes is bad (ideally we want a "shared-nothing" architecture) - want batch processing (process the entire dataset, not random seeks).

Key-Value Pairs • Mappers and Reducers are user-provided code (functions) • They just need to obey the key-value pair interface • Mappers: consume <key, value> pairs and produce <key, value> pairs • Reducers: consume <key, <list of values>> pairs and produce <key, value> pairs • Shuffling and sorting: a hidden phase between mappers and reducers that groups all identical keys from all mappers, sorts them and passes them to a particular reducer in the form of <key, <list of values>>. (A minimal word-count sketch follows.)
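To make the key-value interface concrete, here is a minimal word-count sketch in the Hadoop Streaming style (an illustration, not the course's reference code; the wc.py file name and the map/reduce command-line switch are made up):

#!/usr/bin/env python3
# Minimal word-count sketch in the Hadoop Streaming style (illustrative only).
# mapper() emits <word, 1> pairs; the framework's shuffle and sort groups
# identical keys before reduce() sees them.
import sys
from itertools import groupby

def mapper(lines):
    for line in lines:
        for word in line.strip().split():
            print(f"{word}\t1")

def reducer(lines):
    # Input is assumed to be sorted by key, as the shuffle/sort phase guarantees.
    parsed = (line.rstrip("\n").split("\t", 1) for line in lines)
    for word, group in groupby(parsed, key=lambda kv: kv[0]):
        print(f"{word}\t{sum(int(count) for _, count in group)}")

if __name__ == "__main__":
    # e.g. run locally as: cat input.txt | python wc.py map | sort | python wc.py reduce
    mode = sys.argv[1] if len(sys.argv) > 1 else "map"
    mapper(sys.stdin) if mode == "map" else reducer(sys.stdin)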
Spark

In MapReduce, the only way to share data across jobs is stable storage, which is slow! Slow due to replication and disk I/O, but necessary for fault tolerance. The goal is in-memory sharing - 10-100x faster than network/disk - but then how do you get fault tolerance?

Resilient Distributed Datasets (RDDs):
They are fault-tolerant, in-memory collections of elements that can be operated on in parallel. A restricted form of distributed shared memory:
- Immutable (read-only), partitioned collections of records
- Can only be built through coarse-grained deterministic transformations (map, filter, join, ...)
Efficient fault recovery using lineage:
- Log one operation to apply to many elements
- Recompute lost partitions on failure - no cost if nothing fails
Despite their restrictions, RDDs can express surprisingly many parallel algorithms.

2 ways of creating RDDs:
- parallelize an existing collection
- reference an existing dataset in an external storage system, e.g. HDFS, HBase, etc.

RDDs support two types of operations:
1. Transformations: create a new dataset from an existing one (map, filter, sample, groupByKey, reduceByKey, sortByKey, flatMap, union, join, cogroup, cross, mapValues).
2. Actions: return a value to the driver program after running a computation on the dataset (collect, reduce, count, first, save, lookup(key)).

All transformations in Spark are lazy. They do not compute their results right away. Instead, they just remember the transformations applied to some base dataset (e.g. a file). The transformations are only computed when an action requires a result to be returned to the driver program.
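A minimal PySpark sketch of RDD creation, lazy transformations and actions (illustrative; assumes a local SparkContext, and the HDFS path in the comment is hypothetical):

from pyspark import SparkContext

sc = SparkContext("local[*]", "rdd-demo")

# Two ways of creating RDDs:
nums = sc.parallelize([1, 2, 3, 4, 5])          # 1) parallelize an existing collection
# lines = sc.textFile("hdfs:///data/input.txt") # 2) reference an external dataset (hypothetical path)

# Transformations are lazy: nothing is computed here, only the lineage is recorded.
squares_of_evens = nums.filter(lambda x: x % 2 == 0).map(lambda x: x * x)

# An action triggers the actual computation and returns a result to the driver.
print(squares_of_evens.collect())        # [4, 16]
print(nums.reduce(lambda a, b: a + b))   # 15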
Key-Value Pair Operations:
1) reduceByKey(func): combine values with the same key.
2) groupByKey(): group values with the same key.
3) subtractByKey(rdd): remove elements whose key is present in the other RDD.
4) join(rdd): perform an inner join between two RDDs.
5) rightOuterJoin(rdd): join where the result contains an entry for every key of the other (right) RDD; values from the source RDD are optional.
6) leftOuterJoin(rdd): join where the result contains an entry for every key of the source (left) RDD; values from the other RDD are optional.
7) cogroup(rdd): group data from both RDDs sharing the same key.
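A short PySpark illustration of the pair-RDD operations above (sketch only; the sales/prices data is made up):

from pyspark import SparkContext

sc = SparkContext("local[*]", "pair-rdd-demo")

sales = sc.parallelize([("apple", 2), ("pear", 5), ("apple", 3)])
prices = sc.parallelize([("apple", 1.0), ("pear", 2.0), ("kiwi", 4.0)])

print(sales.reduceByKey(lambda a, b: a + b).collect())  # [('apple', 5), ('pear', 5)] (order may vary)
print(sales.join(prices).collect())                     # inner join on key: ('apple', (2, 1.0)), ...
print(sales.leftOuterJoin(prices).collect())            # keeps every key of `sales`
print(prices.subtractByKey(sales).collect())            # [('kiwi', 4.0)]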
Broadcast Variables: broadcast variables let the programmer keep a read-only variable cached on each machine rather than shipping a copy of it with tasks - for example, to give every node a copy of a large input dataset efficiently.

Accumulators: accumulators are variables that can only be added to, through an associative operation. They are used to implement counters and sums efficiently in parallel.
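A small PySpark sketch of both shared-variable types (illustrative; the country-code lookup table is made up; accumulator values are only reliable when read after an action has run each task exactly once):

from pyspark import SparkContext

sc = SparkContext("local[*]", "shared-vars-demo")

# Broadcast: a read-only lookup table cached once per executor instead of shipped per task.
country_codes = sc.broadcast({"IN": "India", "US": "United States"})

# Accumulator: add-only counter aggregated back at the driver.
unknown = sc.accumulator(0)

def resolve(code):
    if code not in country_codes.value:
        unknown.add(1)
        return "unknown"
    return country_codes.value[code]

names = sc.parallelize(["IN", "US", "XX"]).map(resolve)
print(names.collect())   # ['India', 'United States', 'unknown']
print(unknown.value)     # 1, since collect() ran the map tasks exactly once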
Spark provides three options for persisting RDDs:
(1) in-memory storage as deserialized Java objects (fastest; the JVM can access the RDD natively)
(2) in-memory storage as serialized data (when space is limited, choose a more compact representation at some performance cost)
(3) on-disk storage (when the RDD is too large to keep in memory and too costly to recompute)

RDD fault tolerance & recovery: RDDs maintain lineage information that can be used to reconstruct lost partitions. RDDs track the graph of transformations that built them (their lineage) and use it to rebuild lost data.

Benefits of the RDD model • Consistency is easy due to immutability • Inexpensive fault tolerance (log the lineage rather than replicating/checkpointing data) • Locality-aware scheduling of tasks on partitions • Despite being restricted, the model seems applicable to a broad variety of applications.

RDD dependencies:
1) Narrow dependencies: map, filter, join with co-partitioned inputs
2) Wide dependencies: groupByKey, join with inputs that are not co-partitioned

Important features of Apache Spark:
- Open-source cluster computing framework
- Developed to provide real-time, low-latency queries on data stored in a cluster, such as in Hadoop
- Uses partitioned, distributed in-memory datasets, known as Resilient Distributed Datasets (RDDs), to speed up computation
- Disk I/O, the limiting factor in traditional MapReduce jobs, is avoided by using RDDs
- Runs programs up to 100x faster than Hadoop MapReduce in memory, or 10x faster on disk
- Uses lazy evaluation for efficient processing
- RDDs are immutable, i.e. they cannot be updated once created
- Spark Core is the base engine for computation

DataFrames: DataFrames are part of Spark SQL. Like RDDs, DataFrames (DFs) are immutable, distributed, partitioned collections of data. They have all the properties of RDDs, such as lazy evaluation and recovery through lineage graphs, and they add specialized APIs for working with tabular data with named columns.
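A minimal DataFrame sketch (illustrative; assumes a SparkSession and made-up data):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("df-demo").getOrCreate()

# DataFrames are immutable, partitioned collections with named columns (Spark SQL).
df = spark.createDataFrame([("alice", 34), ("bob", 29)], ["name", "age"])

# Like RDD transformations, select/filter are lazy; show() is the action.
df.filter(df.age > 30).select("name").show()

# DataFrames can also be queried with SQL.
df.createOrReplaceTempView("people")
spark.sql("SELECT name FROM people WHERE age > 30").show()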
ML (Spark MLlib)
Feature extractors available in Spark MLlib for converting text to numerical vectors: CountVectorizer, Word2Vec, FeatureHasher, TF-IDF.

Feature transformation algorithms in MLlib that can be applied to text data: Tokenizer, StopWordsRemover, n-gram, Binarizer, PCA, PolynomialExpansion, Discrete Cosine Transform (DCT), StringIndexer, IndexToString, OneHotEncoder, VectorIndexer, Interaction, Normalizer, StandardScaler, RobustScaler, MinMaxScaler, MaxAbsScaler, Bucketizer, ElementwiseProduct, SQLTransformer, VectorAssembler, VectorSizeHint, QuantileDiscretizer, Imputer.

FeatureHasher: works on categorical or numerical variables, outputs a lower-dimensionality feature vector, and uses the "MurmurHash 3" hash algorithm.

TF-IDF feature extractor:
- For computing TF, either HashingTF or CountVectorizer can be used.
- The size of the output of the HashingTF transformer is determined by the numFeatures parameter.
- Two separate transformers need to be used to compute the final value.
- Inverse document frequency (IDF) is a numerical measure of how much information a term provides.
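A sketch of the two-stage TF-IDF flow inside a Pipeline (illustrative; the column names, the 1024 numFeatures value and the toy documents are made up):

from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.ml.feature import Tokenizer, HashingTF, IDF

spark = SparkSession.builder.appName("tfidf-demo").getOrCreate()

docs = spark.createDataFrame(
    [(0, "spark is fast"), (1, "hadoop mapreduce is batch oriented")],
    ["id", "text"],
)

tokenizer = Tokenizer(inputCol="text", outputCol="words")
# numFeatures fixes the size of the hashed term-frequency vector.
tf = HashingTF(inputCol="words", outputCol="rawFeatures", numFeatures=1024)
idf = IDF(inputCol="rawFeatures", outputCol="features")   # second transformer: the IDF step

model = Pipeline(stages=[tokenizer, tf, idf]).fit(docs)
model.transform(docs).select("id", "features").show(truncate=False)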
Tokenizer: tokenization is the process of taking text (such as a sentence) and breaking it into individual terms (usually words).
StopWordsRemover: takes as input a sequence of strings (e.g. the output of a Tokenizer) and drops all the stop words from the input sequences.
n-gram: an n-gram is a sequence of n tokens (typically words) for some integer n. The NGram class can be used to transform input features into n-grams.
Binarizer: binarization is the process of thresholding numerical features to binary (0/1) features.
StringIndexer: encodes a column of categorical string values into a column of label indices.
IndexToString: maps a column of label indices back to a column containing the original labels as strings.
OneHotEncoder: converts categorical variables (label indices) into binary vectors with at most a single one-value per vector.
VectorIndexer: can automatically identify the categorical features in a dataset of vectors.
StandardScaler: transforms a dataset of Vector rows, normalizing each feature to have unit standard deviation and/or zero mean.
RobustScaler: transforms a dataset of Vector rows, removing the median and scaling the data according to a specific quantile range (by default the IQR: interquartile range, between the 1st and 3rd quartiles).
VectorAssembler: a transformer that combines a given list of columns into a single vector column. It is useful for combining raw features and features generated by different feature transformers into a single feature vector, in order to train ML models like logistic regression and decision trees.
Normalizer: a transformer which transforms a dataset of Vector rows, normalizing each vector to have unit norm. It takes a parameter p, which specifies the p-norm used for normalization (p=2 by default).
- If the input vector is [2.0, 1.0, 1.0] and p = 1, the output vector would be [0.5, 0.25, 0.25].
- It transforms a dataset of Vector rows into another column of vectors having unit norm.
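A short illustrative pipeline combining StringIndexer, OneHotEncoder and VectorAssembler (a sketch assuming Spark 3.x, where OneHotEncoder is fit as an estimator; the toy data and column names are made up):

from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.ml.feature import StringIndexer, OneHotEncoder, VectorAssembler

spark = SparkSession.builder.appName("transformers-demo").getOrCreate()

df = spark.createDataFrame(
    [("red", 1.0, 0.0), ("blue", 0.0, 1.0), ("red", 2.0, 1.0)],
    ["color", "x1", "x2"],
)

indexer = StringIndexer(inputCol="color", outputCol="colorIndex")      # string category -> label index
encoder = OneHotEncoder(inputCol="colorIndex", outputCol="colorVec")   # label index -> sparse binary vector
assembler = VectorAssembler(inputCols=["colorVec", "x1", "x2"],        # combine everything into one
                            outputCol="features")                      # feature vector column

pipeline = Pipeline(stages=[indexer, encoder, assembler])
pipeline.fit(df).transform(df).select("color", "features").show(truncate=False)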
Logistic regression model:
regParam decreases → risk of overfitting increases
regParam increases → risk of underfitting increases

LinearRegression algorithm:
- By default regParam = 0, i.e. there is no regularization.
- There are two steps involved in model training: 1. defining the algorithm and specifying its hyperparameters, 2. fitting the algorithm on the training data.
- It can work with various types of regularizers, such as L1 and L2.
- The aim of the algorithm is to minimize the specified loss function, with regularization.
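A minimal LinearRegression sketch showing the two training steps (illustrative; the toy data and hyperparameter values are made up; elasticNetParam selects between L2 (0.0) and L1 (1.0) regularization):

from pyspark.sql import SparkSession
from pyspark.ml.regression import LinearRegression
from pyspark.ml.linalg import Vectors

spark = SparkSession.builder.appName("linreg-demo").getOrCreate()

train = spark.createDataFrame(
    [(1.0, Vectors.dense([0.0, 1.1])),
     (2.0, Vectors.dense([2.0, 1.0])),
     (4.0, Vectors.dense([4.0, 0.5]))],
    ["label", "features"],
)

# Step 1: define the algorithm and its hyperparameters.
lr = LinearRegression(maxIter=50, regParam=0.1, elasticNetParam=0.0)

# Step 2: fit it on the training data.
model = lr.fit(train)
print(model.coefficients, model.intercept)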
Gradient-Boosted Trees (GBTs) algorithm:
- GBTs iteratively train an ensemble of decision trees in order to minimize a loss function.
- In each new iteration of the training phase, the added decision tree helps correct the previous iteration's mistakes.
- Log loss is an appropriate loss function for a classification task.
- GBTs can handle non-linearities and feature interactions in the data.
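An illustrative GBTClassifier sketch (the toy binary-classification data and hyperparameters are made up):

from pyspark.sql import SparkSession
from pyspark.ml.classification import GBTClassifier
from pyspark.ml.linalg import Vectors

spark = SparkSession.builder.appName("gbt-demo").getOrCreate()

train = spark.createDataFrame(
    [(0.0, Vectors.dense([0.0, 1.0])), (1.0, Vectors.dense([1.0, 0.0])),
     (0.0, Vectors.dense([0.2, 0.8])), (1.0, Vectors.dense([0.9, 0.1]))],
    ["label", "features"],
)

# Each of the maxIter boosting rounds adds one tree that corrects the
# residual errors of the ensemble built so far.
gbt = GBTClassifier(maxIter=10, maxDepth=3, labelCol="label", featuresCol="features")
model = gbt.fit(train)
model.transform(train).select("label", "prediction").show()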
MapReduce: Structured & Unstructured Data
Structured data: processing relational data with MapReduce.
Unstructured data: basics of indexing and retrieval; inverted indexing in MapReduce.

Boolean Retrieval
Users express queries as Boolean expressions (AND, OR, NOT), which can be arbitrarily nested. Retrieval is based on the notion of sets: any given query divides the collection into two sets, retrieved and not-retrieved. Pure Boolean systems do not define an ordering of the results.

Ranked Retrieval
Order documents by how likely they are to be relevant: estimate relevance(q, d_i), sort documents by relevance, and display the sorted results.

Term Weighting
Term weights consist of two components:
- Local: how important is the term in this document?
- Global: how important is the term in the collection?
Terms that appear often in a document should get high weights; terms that appear in many documents should get low weights.
How do we capture this mathematically?
- Term frequency (local)
- Inverse document frequency (global)
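One standard way to combine the local and global components is the TF-IDF weight (a common textbook formulation; the exact variant used by a given system may differ):

w_{t,d} = \mathrm{tf}_{t,d} \times \log\frac{N}{\mathrm{df}_t}

where tf_{t,d} is the frequency of term t in document d, df_t is the number of documents containing t, and N is the total number of documents in the collection.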
