Lecture 3. Weak entity: owner entity and weak entity have a one-to-many relationship; the weak entity must have total participation.
B-tree stores data for efficient retrieval in block-oriented storage. Each B-tree node fits into a disk block or page, allowing retrieval of an entire node with a single disk read. Reduces the number of disk seeks and reads; high spatial locality and pre-fetching can improve I/O efficiency. B-trees are used in DBMS to implement indexing structures. B-trees are balanced, store keys in sorted order, use a hierarchical index and use partially full blocks. In B+ trees, only leaf nodes contain pointers to data records; internal nodes hold only keys. B+ trees are more storage efficient and support range queries.
B+ tree vs LSM tree:
- Write: writes data in blocks (pages) of 4 KB | sequentially writes data in blocks of a few MB
- Update/delete: updates and deletes data directly | append-only, consolidation at segment compaction stage
- Data write: O(log n) | O(1)
- Data read: O(log n) | O(n); O(log n) with index and cache optimisation
- Cost: constant node splits and merges due to lots of insertion operations | frequent compaction involves many compares and merges (CPU and I/O heavy)
Query Processing (use statistical info on data to reduce the cost of query evaluation): SQL Query -> Parse & Rewrite Query -> Logical Query Plan (relational algebra tree) -> Physical Query Plan -> Query Execution -> Disk. Equivalence rules in relational algebra generate logically equivalent query plans: commutative, associative, and pushing predicates to be applied first (R⋈S = S⋈R, R∪S = S∪R). Selinger algorithm finds the optimal join sequence in a left-deep join tree using dynamic programming: O(2^n) < O(n!). f*({R1, R2, R3}) = min( f*({R1, R2}) ⋈ R3, f*({R1, R3}) ⋈ R2, f*({R2, R3}) ⋈ R1 )
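The recurrence above can be sketched as a dynamic program over left-deep plans; the cardinalities, the uniform join selectivity, and the intermediate-result cost model below are made-up illustration values, not anything from the lecture.

```python
from itertools import combinations

# Made-up relation cardinalities and a uniform join selectivity.
card = {"R1": 1000, "R2": 100, "R3": 10}
SEL = 0.01

def result_size(rels):
    # Estimated size of the join of `rels`: product of cardinalities,
    # one selectivity factor per join predicate applied.
    size = 1.0
    for r in rels:
        size *= card[r]
    return size * SEL ** (len(rels) - 1)

def best_plan(rels):
    rels = frozenset(rels)
    # best[S] = (cost, order) of the cheapest left-deep plan joining S.
    best = {frozenset([r]): (0.0, (r,)) for r in rels}
    for k in range(2, len(rels) + 1):
        for subset in combinations(sorted(rels), k):
            S = frozenset(subset)
            candidates = []
            for r in subset:  # r is the relation joined last
                rest = S - {r}
                cost, order = best[rest]
                # Cost model: pay the size of the intermediate result.
                candidates.append((cost + result_size(rest), order + (r,)))
            best[S] = min(candidates)
    return best[rels]

cost, order = best_plan(["R1", "R2", "R3"])  # smallest relations first wins
```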
Lecture 4. ACID (SQL): Atomicity: all or nothing. Consistency: consistent with constraints. Isolation: all transactions run in isolation. Durability:
committed transactions persist. BASE (Distributed, NoSQL): Basic Availability (multiple servers): database appears to work most of the time. Soft State:
temporary inconsistency amongst replicas. Eventual Consistency. BASE transactions: weak consistency, availability first, best effort, approximate answer
SQL DBs are based on ACID. NoSQL systems do not prioritise ACID, as it is difficult to ensure ACID properties in a distributed parallel system.
CAP: Consistency: reading the same data from any node. Availability: no downtime. Partition Tolerance: functions even when communication among
servers is unreliable. RDBMS (MySQL): only one server running and no data duplication, thus no data consistency issue, and the system is always available. However, compared to a distributed NoSQL DB, an RDBMS has no partition tolerance. NoSQL systems have 'P' because they are distributed, always have multiple nodes, and need partition tolerance for high scalability. CP: MongoDB, HBase, Redis. AP: CouchDB, Cassandra, DynamoDB, Riak.
Horizontal Scaling. Master-slave architecture: focuses on maintaining consistency. Only the master node handles write requests; failure of the master node leads to application downtime. Read is scalable, write is not. Master-slave architecture with sharding: data is partitioned into groups of keys; the single point of failure becomes multiple points of failure. A common means of increasing availability when a master node fails is to employ a master failover protocol.
Challenges: fault-tolerantly store big data; process big data in parallel; manage continuously evolving schema and metadata for semi-structured and unstructured data. Internet data is massive, sparse and (semi-)unstructured, whereas DBMS is optimised for dense, largely uniform, structured data.
RDBMS: for dense, structured data; designed for a "single" machine and vertical scaling. NoSQL: large data volumes, schema flexibility, ACID -> BASE, async inserts & updates, scalable, distributed and massively parallel computing, consistency -> availability, lower cost, limited query capabilities (cannot join), no standardization, queries need to return answers quickly, mostly query, few updates.
(Replication diagram: the application sends writes to the Master (R/W), which replicates to read-only Slaves that serve reads.)
CAP theorem scenario: N1 gets a request to update id2's salary from 800 to 1000. Due to a connection issue, N1 is not able to send the update to N2. If a read request comes to N2 and N2 responds with salary=800, the system is available but inconsistent. However, if N2 responds with an error, the system is consistent by not returning inconsistent data, but is unavailable.
Log structured merge tree: a storage structure (and NOT a tree data structure). High write throughput; append-only, with very rare random reads. Memtable: key-value pairs sorted by key. When updating a value, LSM appends another record; no need to search for the key. During a read, find the most recent key-value pair. Examples: HBase, BigTable.
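A toy sketch of this update/read path, with plain dicts standing in for the sorted memtable and on-disk segments (`lsm_get` and the data are made-up illustration values):

```python
# Updates append to the in-memory memtable; a read returns the most recent
# value for a key by checking the memtable first, then older segments
# newest-first (segments here are just dicts, newest last in the list).
memtable = {}
segments = [{"k1": "v0"}]          # an older, already-flushed segment

def lsm_get(key):
    if key in memtable:
        return memtable[key]
    for segment in reversed(segments):   # newest segment first
        if key in segment:
            return segment[key]
    return None

memtable["k1"] = "v1"   # update = append a new record; no search for the key
# lsm_get("k1") now returns "v1", shadowing the older "v0"
```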
Key-value: designed to handle huge amounts of data; schema-less; stores data as hashtables where each key is unique and the value can be a string, JSON, basic large object, etc. Keys need to be hashable; follows AP. Good for shopping cart contents and user session attributes. Examples: Redis, Amazon DynamoDB.
Columnar: high performance on aggregation queries and better compression, as values in a column have the same type. Hybrid of RDBMS and key-value. Values are stored in groups of zero or more columns (column families), but in column order. Values are queried by matching keys. Good for time series data, marketing data, financial data, and high-velocity data where capturing the data first matters. Examples: Google BigTable, HBase.
Graph: collection of nodes and edges, which can be labelled to narrow searches. Each node and edge can have any number of attributes/"columns". Use cases: key player identification, PageRank, community detection, label propagation, path finding, cycle detection. Examples: Neo4j, ArangoDB, Apache Giraph.
Document: a document is a loosely structured set of key/value pairs. Documents are treated as a whole and are addressed in the database via a unique key. Documents have flexible schemas. Comparison between RDBMS and document: Tables -> Collections, Rows -> Documents, Columns -> Key-value pairs. Examples: MongoDB, CouchDB, Firestore. Use a document DB for a more complex structure, whereas for key-value, the internal structure of the "value" is opaque to the DB.
Lecture 5. Data Pipeline & Orchestration. Workflow management with Airflow: define tasks and dependencies; scalable, as it schedules, distributes and executes tasks across worker nodes; monitors job status; accessibility of log files; ease of deployment of workflow changes; handles errors and failures; tasks can pass parameters downstream; integration with other infrastructure. Data pipeline for streaming data: Lambda Architecture – duplicate code, data quality issues, added complexity, two distributed systems. Kappa – simplified, single codebase, single processing path for data -> improved consistency and ease of maintenance.
Twitter case study: high scale and throughput in real-time processing led to data loss and inaccuracies, exacerbated by back pressure. Back pressure is where data is produced faster than it can be consumed, leading to overload and congestion.
Lecture 6. Cloud computing: use of hosted services such as data storage, servers, databases, networking and software
over the internet. Cloud computing benefits: pay for what you use, more flexibility, more scalability, no server space
required, better data security, provides disaster recovery, automatic software updates, teams can collaborate from widespread locations, data can be accessed and shared anywhere over the internet, rapid implementation.
Symmetric multi-processing on shared-memory architecture: not easily scalable, as memory and I/O bus width are bottlenecks. Massively parallel processing on shared-disk architecture: SMP clusters share a common disk; requires a coordination protocol for cache coherency. Sharding on shared-nothing architecture: data is partitioned across multiple computation nodes. Shared-nothing has become the dominant system. Scalability: data horizontally partitioned across nodes, each node responsible for the rows on its local disk; cheap commodity hardware. However, heterogeneous workloads (high I/O, light compute vs low I/O, heavy compute) and membership changes (node failures, system resize) lead to significant performance impact, as computing resources are used for data shuffling rather than serving requests; it is also hard to keep the system online during upgrades. Multi-cluster, shared-data architecture: S3 for data storage. Cloud services: manage virtual warehouses, queries, transactions, all metadata, access control information, usage stats. Virtual warehouse: each can contain one or more computation servers; different virtual warehouse sizes allocate different numbers of machines.
Sharding: chop the data into multiple pieces. No cache coherence problem anymore.
Rank (CSV, XML, JSON, Avro):
Storage efficiency: Avro (binary) > CSV (delimiters make it compact) > JSON > XML (opening and closing tags).
Scalability (how much we can split it and run in parallel): Avro (binary) > CSV (text format) > JSON (not splittable, but less text than XML) > XML (not splittable).
Ease of use: (JSON, CSV) (Excel & Google Sheets available to load and read) > XML (text) > Avro.
(Diagram notes: shared-nothing nodes have hardware separation of memory space, so there is no coherence problem if all requests are read-only; shared-disk nodes are connected to the same hard disk.)
Google BigQuery Example: columnar-oriented storage for OLAP use cases. Capacitor (file format) uses Run-Length Encoding to encode sparse data. Encoding is not trivial: reordering rows is an NP-complete problem, and not all columns are equal. Encoding columns with long strings provides more gain in improvements; some columns are more likely to be selected in a query, and some are more likely to be used in a filter.
Colossus (Distributed File System): Sharding and geo-replication (same data in different regions and different zones in the same region) for failover and serving purposes.
Storage optimisation (Partition): for large amounts of data with a low number of distinct values; partitions are >1GB and are stored in the same data block. Storage optimisation (Clustering): lower-level data rearrangement; data is sorted for faster selection and filtering, partitioned into different blocks and stored based on clustering columns (e.g. a Creation_date | Tags | Title table partitioned by creation_date and clustered by tags).
Query processing (Dremel): high-speed data transfer between Colossus and Dremel; data moves from Colossus to Dremel, not compute. Workers process data and store intermediate results in a shuffle layer, then other workers fetch data from the shuffle. Steps of executing a query: API request management -> lexing and parsing SQL -> query planning (catalog resolution) -> query execution (scheduling and dynamic planning) -> finalise results.
Hadoop Example: 3 features: Fault tolerance, Data movement and Lifecycle management.
ResourceManager (RM) arbitrates resources among all the applications in the system. NodeManager is the per-machine framework agent responsible for containers, monitoring their resource usage (cpu, memory, disk, network), and sending heartbeats to the RM. RM has 2 components: Scheduler (SC) and ApplicationsManager (AM). SC allocates resources to the various running applications. AM is responsible for accepting job-submissions and negotiating the first container for executing the application-specific ApplicationMaster. The per-application ApplicationMaster has the responsibility of negotiating appropriate resource containers from the SC, tracking their status and monitoring progress; it works with the NodeManager to execute tasks. After completion of an application, the ApplicationMaster un-registers all containers and un-registers itself.
Client submits application -> RM starts ApplicationMaster and allocates containers to the ApplicationMaster on a needs basis -> NodeManager sends heartbeats to RM.
Every NodeManager registers itself as (NEW); after successful registration it becomes (RUNNING). Based on heartbeats, the final status will be (DECOMMISSIONED, REBOOTED, LOST).
HDFS: fault-tolerant distributed file system run on commodity hardware. Moving computation > moving data. Write-once-read-many and, mostly, append-to. Master/slave architecture to keep metadata and coordinate access. HDFS cluster: 1 NameNode, a master server that manages the filesystem namespace and metadata and regulates access to files by clients. Data never flows through the NameNode. In addition, there are many DataNodes, usually one per node in the cluster. Internally, a file is split into one or more blocks (64MB or 128MB). HDFS uses a rack-aware replica placement policy: blocks are replicated across DataNodes on different machines for fault tolerance. The NameNode gets heartbeats and Blockreports from the DataNodes.
Apache Storm Example: Kafka is a distributed event streaming platform that sends data from source to destination. Apache Storm: a computation graph where nodes are individual computations; nodes send data between one another in the form of tuples. A stream is an unbounded sequence of tuples. A spout is the source of a stream: it reads data from an external data source and emits tuples. Bolts subscribe to streams; they are stream transformers that process incoming streams and produce new ones. A topology is deployed as a set of processing entities over a cluster of computation resources.
Parallelism is achieved by running multiple replicas of the same spout/bolt.
Storm architecture: MasterNode runs Nimbus, a central job master to which topologies are submitted. It is in charge of scheduling, job orchestration, communication and fault tolerance; it distributes code around the cluster and assigns tasks to worker nodes. WorkerNodes are where applications are executed. Each WorkerNode runs a supervisor; each supervisor listens for work assigned to its worker node by Nimbus and starts/stops worker processes.
Frameworks for distributed computing: Apache Spark, Flink, Beam, Storm, Dask. Streaming data processing platforms: Kafka, Spark Streaming, Flink, Storm, Amazon Kinesis.
MapReduce pipeline: Input Splitting (split the input) -> Mapping (create key-value pairs) -> Shuffling (data with the same key is transferred to the same reducer node) -> Reducing.
Hadoop MapReduce is a software framework to process vast amounts of data in parallel on large clusters of commodity hardware in a reliable, fault-tolerant manner. A MapReduce job splits the input data into independent chunks, which are processed by the map tasks in parallel. The outputs of the maps are sorted, then input to the reduce tasks. Both the input and the output of the job are stored in HDFS, which runs on the same set of nodes, so the framework can effectively schedule tasks on the nodes where data is already present. The framework takes care of scheduling tasks, monitoring them and re-executing the failed tasks.
MAPPER: the key is what the data will be grouped on, and the value is the information pertinent to the analysis in the reducer. REDUCER: takes the grouped data as input and runs a reduce function once per key grouping. The function is passed the key and an iterator over all of the values associated with that key. The data can be aggregated, filtered, and combined. MapReduce runtime concerns: orchestration of distributed computation, scheduling of tasks, handling data distribution, handling synchronisation, handling errors and faults (detect worker failures and restart jobs); everything happens on top of a distributed file system.
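The mapper/reducer contract above, simulated in a single process with word count (the in-memory grouping stands in for the framework's sort/shuffle):

```python
from collections import defaultdict

def map_fn(line):
    # Key = word (what the data is grouped on), value = count of 1.
    for word in line.split():
        yield word, 1

def reduce_fn(key, values):
    # Runs once per key grouping, over all values for that key.
    return key, sum(values)

def map_reduce(inputs, map_fn, reduce_fn):
    groups = defaultdict(list)
    for record in inputs:               # map phase
        for k, v in map_fn(record):
            groups[k].append(v)         # "shuffle": group values by key
    return dict(reduce_fn(k, vs) for k, vs in groups.items())  # reduce phase

counts = map_reduce(["big data big", "data stream"], map_fn, reduce_fn)
# counts == {"big": 2, "data": 2, "stream": 1}
```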
5 Vs of Big Data: Volume, Velocity, Variety, Veracity, Value. Data stream processing deals with velocity. A data stream is a real-time, continuous, ordered sequence of structured records: massive volume at a high rate, impossible to store locally, and insights must be generated in real time by a continuous query. Structure: infinite sequence of items with the same structure, an explicit or implicit timestamp, and a physical or logical representation of time. Applications: transactional (logs interactions), measurement (monitor evolution of entity states). A Data Stream Management Service manages continuous data streams and executes a continuous query that is permanently installed; the stream query processor produces new results as long as new data arrives at the system. Scratch space is temporary storage used as a buffer where new data joins the queue to be processed. With stream algorithms, generate insights by going over the data once; as data grows linearly, the summary statistics should grow sub-linearly, at most log.
PageRank: R(t) = αHR(t-1) + (1-α)s. The damping factor α represents the probability that a user continues clicking on links versus jumping to a random page. Mapper emits (child(X), share of Y) for each Y; reducer groups all the child(X) together and sums the shares of Y.
Kmeans clustering
Mapper emits (key: which centroid, value: (point, 1)). Reducer groups all the keys with the same centroid, sums up all the values of the points, and divides to find the new mean, which is the new centroid. Broadcast the new means to each machine on the cluster.
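One iteration of this map/reduce step can be sketched with 1-D points (the data, centroids and function names are made-up illustration values):

```python
from collections import defaultdict

def mapper(point, centroids):
    # Emit (nearest centroid, (point, 1)).
    nearest = min(centroids, key=lambda c: abs(point - c))
    return nearest, (point, 1)

def reducer(pairs):
    # Group by centroid, sum points and counts, divide for the new mean.
    sums = defaultdict(lambda: [0.0, 0])
    for centroid, (point, count) in pairs:
        sums[centroid][0] += point
        sums[centroid][1] += count
    return {c: total / n for c, (total, n) in sums.items()}

points = [1.0, 2.0, 9.0, 11.0]
centroids = [0.0, 10.0]
pairs = [mapper(p, centroids) for p in points]   # map phase
new_centroids = reducer(pairs)                    # shuffle + reduce
# new_centroids == {0.0: 1.5, 10.0: 10.0}; these would then be broadcast
```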
Linear Regression
Mapper receives a subset (batch) of the data and computes a partial gradient. Reduce phase: the gradients from all the mappers are aggregated to compute the total gradient. Update phase: update the weights using the total gradient computed from all subsets.
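A sketch of the map/reduce/update phases for least-squares linear regression with a single weight and no bias (the data, batch split and learning rate are made-up illustration values):

```python
def partial_gradient(batch, w):
    # d/dw of 0.5 * (w*x - y)^2, summed over the batch.
    return sum((w * x - y) * x for x, y in batch)

batches = [[(1.0, 2.0), (2.0, 4.0)], [(3.0, 6.0)]]  # data follows y = 2x
n = 3                                                # total number of points
w = 0.0
grads = [partial_gradient(b, w) for b in batches]    # map: partial gradients
total = sum(grads)                                   # reduce: aggregate
w = w - 0.1 * total / n                              # update: gradient step
```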
Heavy hitters (potential candidates for further investigation). s-heavy hitters: count(e) at time t > s * (length of stream so far). Maintain a structure containing [ceiling(1/s) - 1] s-heavy hitter elements. False positives are okay but no false negatives; however, false positives are problematic if heavy hitters are used for billing/punishment. Window-based algorithm: window based on an ordering attribute, on record counts, or on explicit markers.
CQ: Continuous queries
Lossy counting. N: current length of the stream; s ∈ (0, 1): support threshold; ε ∈ (0, 1): error. Output: elements with counter values exceeding (s - ε) * N.
Rule of thumb: ε = 0.1 * s.
Guarantees: frequencies are underestimated by at most ε * N; no false negatives; false positives have true frequency at least (s - ε) * N; all elements exceeding frequency s * N will be output.
Example: a user is interested in identifying all items whose frequency is at least 10%: s = 0.1, assume ε = 0.01. Guarantee 1: all elements with frequency > 10% will be returned. Guarantee 2: no element with frequency below 9% will be returned; false positives between 9% and 10% might or might not be output. Guarantee 3: all individual frequencies are less than their true frequencies by at most ε * N.
Lossy counting (step 1): divide the stream into windows of size w = ceiling(1 / ε). (Step 2) Go through the elements: if a counter exists, increase it by one; if not, create one and initialise it to one. (Step 3) At each window boundary, decrement all counters by 1; if a counter reaches zero for a specific element, drop it.
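A minimal sketch of these three steps (the stream and parameters are made-up illustration values):

```python
import math

def lossy_count(stream, s, eps):
    window = math.ceil(1 / eps)      # step 1: window size w = ceil(1/eps)
    counters = {}
    n = 0
    for element in stream:
        n += 1
        counters[element] = counters.get(element, 0) + 1   # step 2
        if n % window == 0:                                # step 3: boundary
            for e in list(counters):
                counters[e] -= 1
                if counters[e] == 0:
                    del counters[e]
    # Output: elements whose counter exceeds (s - eps) * N.
    return {e for e, c in counters.items() if c > (s - eps) * n}

stream = ["a"] * 40 + ["b"] * 5 + ["c"] * 55
hitters = lossy_count(stream, s=0.3, eps=0.03)
# hitters == {"a", "c"}: "b" (5%) is pruned, no false negatives
```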
(Assumption: data is stationary.)
Sticky Sampling: a counting algorithm using a sampling approach. Probabilistic sampling decides whether a counter for a distinct element is created; if a counter exists for a certain element, every future instance of this element will be counted. N: current length of the stream; s ∈ (0, 1): support threshold; ε ∈ (0, 1): error; δ ∈ (0, 1): probability of failure.
Guarantees: the same as for lossy counting, except for the small probability that it might fail to provide correct answers.
Example: (step 1) Dynamic window, doubling the window size for each new window, where t = (1/ε) log(1/(s * δ)), W0 = 2t, and Wk = 2^k * t for k >= 1. (Step 2) Go through the elements: if a counter exists, increase it; if not, create a counter with probability 1/r and initialise it to one. The sampling rate r starts at 1 and grows in proportion to the window size W. (Step 3) At each window boundary, go through each counter: repeatedly toss a coin until a toss is successful, diminishing the counter by one for every unsuccessful toss; if a counter becomes zero, drop it. Decrements happen probabilistically, so heavy hitters have a higher chance of staying in the dictionary.
DBMS vs DSMS:
- Data nature: DBMS – persistent data, relatively static. DSMS – read-only (append-only) data; a DSMS processes data to generate insights, not to modify the streams.
- Storage: DBMS – "unbounded" disk store; can theoretically load the entire dataset into memory and generate insights. DSMS – bounded main memory.
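A sketch of sticky sampling following the steps above (`sticky_sampling` and the parameters are made-up illustration values):

```python
import math
import random

def sticky_sampling(stream, s, eps, delta, rng):
    t = (1 / eps) * math.log(1 / (s * delta))
    counters = {}
    r = 1                       # sampling rate, doubles at each boundary
    boundary = 2 * t            # first window W0 = 2t
    n = 0
    for element in stream:
        n += 1
        if element in counters:
            counters[element] += 1          # existing elements always counted
        elif rng.random() < 1 / r:
            counters[element] = 1           # created with probability 1/r
        if n >= boundary:                   # window boundary
            r *= 2
            boundary += r * t               # next window Wk = 2^k * t
            for e in list(counters):
                while rng.random() >= 0.5:  # unsuccessful toss: decrement
                    counters[e] -= 1
                    if counters[e] <= 0:
                        del counters[e]
                        break
    return {e for e, c in counters.items() if c > (s - eps) * n}

# With these parameters t ≈ 106, so no boundary is hit in 100 elements and
# the run is deterministic: only "x" (90%) exceeds (s - eps) * N = 45.
hitters = sticky_sampling(["x"] * 90 + ["y"] * 10,
                          s=0.5, eps=0.05, delta=0.01, rng=random.Random(0))
```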
class StreamingSum:
    def __init__(self):
        self.result = 0
    def update(self, element):
        self.result += element
Each node can have at most m key fields and m+1 pointer fields. Half-full must be satisfied (except for the root node):
If m is even and m = 2d: leaf node half full = at least d entries; non-leaf node half full = at least d entries.
If m is odd and m = 2d + 1: leaf node half full = at least d + 1 entries; non-leaf node half full = at least d entries (i.e., d + 1 pointers).
class StreamingMinimum:
    def __init__(self):
        self.result = float("inf")
    def update(self, element):
        self.result = min(self.result, element)
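The half-full rules above can be encoded as a small helper (`bplus_min_entries` is a hypothetical name):

```python
def bplus_min_entries(m):
    # m = maximum number of key fields per node.
    d = m // 2
    if m % 2 == 0:                         # m = 2d
        return {"leaf": d, "non_leaf": d}
    else:                                  # m = 2d + 1
        return {"leaf": d + 1, "non_leaf": d}

# e.g. m = 4 -> leaf and non-leaf need at least 2 entries;
#      m = 5 -> leaf needs 3, non-leaf needs 2 (i.e. 3 pointers).
```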
If s = 1/2, this is the majority element problem. If s = 1/3, this is the 1/3-heavy hitters problem. Example stream (subscripts are positions): E1, D2, B3, D4, D5, D6, B7, B8, B9, B10, B11, E12, E13, E14, E15, E16. At time 1 the 1/3-heavy hitter is E; at time 5 it is D; at time 11, both B and D; at time 15, it is B; at time 16, B and E.
Comparing lossy counting and sticky sampling for Unique and Zipf distributions (Zipf: frequency of an item is inversely proportional to its rank in the frequency table): sticky sampling's space usage is due to its tendency to remember every unique element that gets sampled; lossy counting is good at pruning low-frequency elements quickly. For highly skewed data, both algorithms require much less space than their worst-case bounds.
Microprocessor speed (vertical scaling) stagnated: thermal barrier, system bottleneck, material limits
Hadoop MapReduce is a software framework for easily writing applications which process vast amounts of data (multi-terabyte data-sets) in-parallel on large clusters (thousands of nodes) of commodity hardware in a reliable, fault-tolerant manner.
A MapReduce job usually splits the input data-set into independent chunks which are processed by
the map tasks in a completely parallel manner. The framework sorts the outputs of the maps, which
are then input to the reduce tasks. Typically both the input and the output of the job are stored in a
file-system. The framework takes care of scheduling tasks, monitoring them and re-executes the
failed tasks.
Typically the compute nodes and the storage nodes are the same, that is, the MapReduce framework
and the Hadoop Distributed File System (see HDFS Architecture Guide) are running on the same set of
nodes. This configuration allows the framework to effectively schedule tasks on the nodes where data
is already present, resulting in very high aggregate bandwidth across the cluster.
MAPPER: The key is what the data will be grouped on and the value is the information pertinent to
the analysis in the reducer.
REDUCER: The reducer takes the grouped data as input and runs a reduce function once per key
grouping. The function is passed the key and an iterator over all of the values associated with that
key. A wide range of processing can happen in this function, as we’ll see in many of our patterns. The
data can be aggregated, filtered, and combined in a number of ways.
Three features managed by Hadoop: Fault tolerance, data movement and lifecycle management
ResourceManager is the ultimate authority that arbitrates resources among all the applications in the system. The NodeManager is the per-machine framework agent that is responsible for containers, monitoring their resource usage (cpu, memory, disk, network), and sending heartbeats to the ResourceManager/Scheduler.
The ResourceManager has two main components: Scheduler and ApplicationsManager. The Scheduler is responsible for allocating resources to the various running applications. The ApplicationsManager is responsible for accepting job-submissions, negotiating the first container for executing the application-specific ApplicationMaster, and provides the service for restarting the ApplicationMaster container on failure. The per-application ApplicationMaster has the responsibility of negotiating appropriate resource containers from the Scheduler, tracking their status and monitoring for progress.
HDFS: a fault-tolerant distributed file system designed to run on commodity hardware. Moving computation is cheaper than moving data. Write-once-read-many and, mostly, append-to. Master/slave architecture to keep metadata and coordinate access. An HDFS cluster consists of a single NameNode, a master server that manages the file system namespace and regulates access to files by clients. The NameNode is the arbitrator and repository for all HDFS metadata. The system is designed in such a way that user data never flows through the NameNode. In addition, there are a number of DataNodes, usually one per node in the cluster, which manage storage attached to the nodes that they run on. Internally, a file is split into one or more blocks (64MB or 128MB) and these blocks are stored in a set of DataNodes. The blocks of a file are replicated for fault tolerance. The NameNode makes all decisions regarding replication of blocks. It periodically receives a Heartbeat and a Blockreport from each of the DataNodes in the cluster.
The NameNode manages the filesystem namespace: it maintains the filesystem tree and the metadata for all the files and directories in the tree.