Lecture 3. Weak entity: owner entity and weak entity have a one-to-many relationship; the weak entity must have total participation.
B-tree stores data for efficient retrieval in block-oriented storage. Each B-tree node fits into a disk block or page, allowing retrieval of an entire node with a single disk read. Reduces the number of disk seeks and reads; high spatial locality and pre-fetching can improve I/O efficiency. B-trees are used in DBMS to implement indexing structures. B-trees are balanced, store keys in sorted order, use a hierarchical index and use partially full blocks. In B+ trees, only leaf nodes contain pointers to data records; internal nodes hold only keys. B+ trees are more storage efficient and support range queries.
B+ tree vs LSM tree:
- Write: writes data in blocks (pages) of 4 KB | sequentially writes data in blocks of a few MB
- Update/delete: updates and deletes data directly | append-only, consolidation at segment compaction stage
- Data write: O(log n) | O(1)
- Data read: O(log n) | O(n); O(log n) with index and cache optimisation
- Cost: constant node splits and merges due to lots of insertion operations | frequent compaction involves many compares and merges (CPU and I/O heavy)
Query Processing (use statistical info on data to reduce the cost of query evaluation): SQL Query -> Parse & Rewrite Query -> Logical Query Plan (relational algebra tree) -> Physical Query Plan -> Query Execution -> Disk. Equivalence rules in relational algebra generate logically equivalent query plans: commutative, associative, and pushing predicates to be applied first (R⋈S = S⋈R, R∪S = S∪R). Selinger algorithm finds the optimal join sequence in a left-deep join tree using dynamic programming: O(2^n) < O(n!). f*({R1, R2, R3}) = min( f*({R1, R2}) ⋈ R3, f*({R1, R3}) ⋈ R2, f*({R2, R3}) ⋈ R1 )
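The recurrence above can be sketched as a dynamic program over left-deep plans; the cardinalities, the uniform join selectivity, and the intermediate-result cost model below are made-up illustration values, not anything from the lecture.

```python
from itertools import combinations

# Made-up relation cardinalities and a uniform join selectivity.
card = {"R1": 1000, "R2": 100, "R3": 10}
SEL = 0.01

def result_size(rels):
    # Estimated size of the join of `rels`: product of cardinalities,
    # one selectivity factor per join predicate applied.
    size = 1.0
    for r in rels:
        size *= card[r]
    return size * SEL ** (len(rels) - 1)

def best_plan(rels):
    rels = frozenset(rels)
    # best[S] = (cost, order) of the cheapest left-deep plan joining S.
    best = {frozenset([r]): (0.0, (r,)) for r in rels}
    for k in range(2, len(rels) + 1):
        for subset in combinations(sorted(rels), k):
            S = frozenset(subset)
            candidates = []
            for r in subset:  # r is the relation joined last
                rest = S - {r}
                cost, order = best[rest]
                # Cost model: pay the size of the intermediate result.
                candidates.append((cost + result_size(rest), order + (r,)))
            best[S] = min(candidates)
    return best[rels]

cost, order = best_plan(["R1", "R2", "R3"])  # smallest relations first wins
```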
Lecture 4. ACID (SQL): Atomicity: all or nothing. Consistency: consistent with constraints. Isolation: all transactions run in isolation. Durability:
committed transactions persist. BASE (Distributed, NoSQL): Basic Availability (multiple servers): database appears to work most of the time. Soft State:
temporary inconsistency amongst replicas. Eventual Consistency. BASE transactions: weak consistency, availability first, best effort, approximate answer
SQL DBs are based on ACID. NoSQL systems do not prioritise ACID, as it is difficult to ensure ACID properties in a distributed parallel system.
CAP: Consistency: reading the same data from any node. Availability: no downtime. Partition Tolerance: functions even when communication among
servers is unreliable. RDBMS (MySQL): only one server running and no data duplication, thus no data consistency issue, and the system is always available. However, compared to a distributed NoSQL DB, an RDBMS has no partition tolerance. NoSQL systems have 'P' because they are distributed, always have multiple nodes, and need partition tolerance for high scalability. CP: MongoDB, HBase, Redis. AP: CouchDB, Cassandra, DynamoDB, Riak.
Horizontal Scaling. Master-slave architecture: focuses on maintaining consistency. Only the master node handles write requests; failure of the master node leads to application downtime. Read is scalable, write is not. Master-slave architecture with sharding: data is partitioned into groups of keys; the single point of failure becomes multiple points of failure. A common means of increasing availability when a master node fails is to employ a master failover protocol.
Challenges: fault-tolerantly store big data; process big data in parallel; manage continuously evolving schema and metadata for semi-structured and unstructured data. Internet data is massive, sparse and (semi-)unstructured, whereas DBMS is optimised for dense, largely uniform, structured data.
RDBMS: for dense, structured data; designed for a "single" machine and vertical scaling. NoSQL: large data volumes, schema flexibility, ACID -> BASE, async inserts & updates, scalable, distributed and massively parallel computing, consistency -> availability, lower cost, limited query capabilities (cannot join), no standardization, queries need to return answers quickly, mostly query, few updates.
(Replication diagram: the application sends writes to the Master (R/W), which replicates to read-only Slaves that serve reads.)
CAP theorem scenario: N1 gets a request to update id2's salary from 800 to 1000. Due to a connection issue, N1 is not able to send the update to N2. If a read request comes to N2 and N2 responds with salary=800, the system is available but inconsistent. However, if N2 responds with an error, the system is consistent by not returning inconsistent data, but is unavailable.
Log structured merge tree: a storage structure (and NOT a tree data structure). High write throughput; append-only, with very rare random reads. Memtable: key-value pairs sorted by key. When updating a value, LSM appends another record; no need to search for the key. During a read, find the most recent key-value pair. Examples: HBase, BigTable.
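A toy sketch of this update/read path, with plain dicts standing in for the sorted memtable and on-disk segments (`lsm_get` and the data are made-up illustration values):

```python
# Updates append to the in-memory memtable; a read returns the most recent
# value for a key by checking the memtable first, then older segments
# newest-first (segments here are just dicts, newest last in the list).
memtable = {}
segments = [{"k1": "v0"}]          # an older, already-flushed segment

def lsm_get(key):
    if key in memtable:
        return memtable[key]
    for segment in reversed(segments):   # newest segment first
        if key in segment:
            return segment[key]
    return None

memtable["k1"] = "v1"   # update = append a new record; no search for the key
# lsm_get("k1") now returns "v1", shadowing the older "v0"
```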
Key-value: designed to handle huge amounts of data; schema-less; stores data as hashtables where each key is unique and the value can be a string, JSON, basic large object, etc. Keys need to be hashable; follows AP. Good for shopping cart contents and user session attributes. Examples: Redis, Amazon DynamoDB.
Columnar: high performance on aggregation queries and better compression, as values in a column have the same type. Hybrid of RDBMS and key-value. Values are stored in groups of zero or more columns (column families), but in column order. Values are queried by matching keys. Good for time series data, marketing data, financial data, and high-velocity data where capturing the data first matters. Examples: Google BigTable, HBase.
Graph: collection of nodes and edges, which can be labelled to narrow searches. Each node and edge can have any number of attributes/"columns". Use cases: key player identification, PageRank, community detection, label propagation, path finding, cycle detection. Examples: Neo4j, ArangoDB, Apache Giraph.
Document: a document is a loosely structured set of key/value pairs. Documents are treated as a whole and are addressed in the database via a unique key. Documents have flexible schemas. Comparison between RDBMS and document: Tables -> Collections, Rows -> Documents, Columns -> Key-value pairs. Examples: MongoDB, CouchDB, Firestore. Use a document DB for a more complex structure, whereas for key-value, the internal structure of the "value" is opaque to the DB.
Lecture 5. Data Pipeline & Orchestration. Workflow management with Airflow: define tasks and dependencies; scalable, as it schedules, distributes and executes tasks across worker nodes; monitors job status; accessibility of log files; ease of deployment of workflow changes; handles errors and failures; tasks can pass parameters downstream; integration with other infrastructure. Data pipeline for streaming data: Lambda Architecture – duplicate code, data quality issues, added complexity, two distributed systems. Kappa – simplified, single codebase, single processing path for data -> improved consistency and ease of maintenance.
Twitter case study: high scale and throughput in real-time processing led to data loss and inaccuracies, exacerbated by back pressure. Back pressure is where data is produced faster than it can be consumed, leading to overload and congestion.
Lecture 6. Cloud computing: use of hosted services such as data storage, servers, databases, networking and software
over the internet. Cloud computing benefits: pay for what you use, more flexibility, more scalability, no server space
required, better data security, provides disaster recovery, automatic software updates, teams can collaborate from widespread locations, data can be accessed and shared anywhere over the internet, rapid implementation.
Symmetric multi-processing on shared-memory architecture: not easily scalable, as memory and I/O bus width are bottlenecks. Massively parallel processing on shared-disk architecture: SMP clusters share a common disk; requires a coordination protocol for cache coherency. Sharding on shared-nothing architecture: data is partitioned across multiple computation nodes. Shared-nothing has become the dominant system. Scalability: data horizontally partitioned across nodes, each node responsible for the rows on its local disk; cheap commodity hardware. However, heterogeneous workloads (high I/O, light compute vs low I/O, heavy compute) and membership changes (node failures, system resize) lead to significant performance impact, as computing resources are used for data shuffling rather than serving requests; it is also hard to keep the system online during upgrades. Multi-cluster, shared-data architecture: S3 for data storage. Cloud services: manage virtual warehouses, queries, transactions, all metadata, access control information, usage stats. Virtual warehouse: each can contain one or more computation servers; different virtual warehouse sizes allocate different numbers of machines.
Sharding: chop the data into multiple pieces. No cache coherence problem anymore.
Rank (CSV, XML, JSON, Avro):
Storage efficiency: Avro (binary) > CSV (delimiters make it compact) > JSON > XML (opening and closing tags).
Scalability (how much we can split it and run in parallel): Avro (binary) > CSV (text format) > JSON (not splittable, but less text than XML) > XML (not splittable).
Ease of use: (JSON, CSV) (Excel & Google Sheets available to load and read) > XML (text) > Avro.
(Diagram notes: shared-nothing nodes have hardware separation of memory space, so there is no coherence problem if all requests are read-only; shared-disk nodes are connected to the same hard disk.)
Google BigQuery Example: columnar-oriented storage for OLAP use cases. Capacitor (file format) uses Run-Length Encoding to encode sparse data. Encoding is not trivial: reordering rows is an NP-complete problem, and not all columns are equal. Encoding columns with long strings provides more gain in improvements; some columns are more likely to be selected in a query, and some are more likely to be used in a filter.
Colossus (Distributed File System): Sharding and geo-replication (same data in different regions and different zones in the same region) for failover and serving purposes.
Storage optimisation (Partition): for large amounts of data with a low number of distinct values; partitions are >1GB and are stored in the same data block. Storage optimisation (Clustering): lower-level data rearrangement; data is sorted for faster selection and filtering, partitioned into different blocks and stored based on clustering columns (e.g. a Creation_date | Tags | Title table partitioned by creation_date and clustered by tags).
Query processing (Dremel): high-speed data transfer between Colossus and Dremel; data moves from Colossus to Dremel, not compute. Workers process data and store intermediate results in a shuffle layer, then other workers fetch data from the shuffle. Steps of executing a query: API request management -> lexing and parsing SQL -> query planning (catalog resolution) -> query execution (scheduling and dynamic planning) -> finalise results.
Hadoop Example: 3 features: Fault tolerance, Data movement and Lifecycle management.
ResourceManager (RM) arbitrates resources among all the applications in the system. NodeManager is the per-machine framework agent responsible for containers, monitoring their resource usage (cpu, memory, disk, network), and sending heartbeats to the RM. RM has 2 components: Scheduler (SC) and ApplicationsManager (AM). SC allocates resources to the various running applications. AM is responsible for accepting job-submissions and negotiating the first container for executing the application-specific ApplicationMaster. The per-application ApplicationMaster has the responsibility of negotiating appropriate resource containers from the SC, tracking their status and monitoring progress; it works with the NodeManager to execute tasks. After completion of an application, the ApplicationMaster un-registers all containers and un-registers itself.
Client submits application -> RM starts ApplicationMaster and allocates containers to the ApplicationMaster on a needs basis -> NodeManager sends heartbeats to RM.
Every NodeManager registers itself as (NEW); after successful registration it becomes (RUNNING). Based on heartbeats, the final status will be (DECOMMISSIONED, REBOOTED, LOST).
HDFS: fault-tolerant distributed file system run on commodity hardware. Moving computation > moving data. Write-once-read-many and, mostly, append-to. Master/slave architecture to keep metadata and coordinate access. HDFS cluster: 1 NameNode, a master server that manages the filesystem namespace and metadata and regulates access to files by clients. Data never flows through the NameNode. In addition, there are many DataNodes, usually one per node in the cluster. Internally, a file is split into one or more blocks (64MB or 128MB). HDFS uses a rack-aware replica placement policy: blocks are replicated across DataNodes on different machines for fault tolerance. The NameNode gets heartbeats and Blockreports from the DataNodes.
Apache Storm Example: Kafka is a distributed event streaming platform that sends data from source to destination. Apache Storm: a computation graph where nodes are individual computations; nodes send data between one another in the form of tuples. A stream is an unbounded sequence of tuples. A spout is the source of a stream: it reads data from an external data source and emits tuples. Bolts subscribe to streams; they are stream transformers that process incoming streams and produce new ones. A topology is deployed as a set of processing entities over a cluster of computation resources.
Parallelism is achieved by running multiple replicas of the same spout/bolt.
Storm architecture: MasterNode runs Nimbus, a central job master to which topologies are submitted. It is in charge of scheduling, job orchestration, communication and fault tolerance; it distributes code around the cluster and assigns tasks to worker nodes. WorkerNodes are where applications are executed. Each WorkerNode runs a supervisor; each supervisor listens for work assigned to its worker node by Nimbus and starts/stops worker processes.
Frameworks for distributed computing: Apache Spark, Flink, Beam, Storm, Dask. Streaming data processing platforms: Kafka, Spark Streaming, Flink, Storm, Amazon Kinesis.
MapReduce pipeline: Input Splitting (split the input) -> Mapping (create key-value pairs) -> Shuffling (data with the same key is transferred to the same reducer node) -> Reducing.
Hadoop MapReduce is a software framework to process vast amounts of data in parallel on large clusters of commodity hardware in a reliable, fault-tolerant manner. A MapReduce job splits the input data into independent chunks, which are processed by the map tasks in parallel. The outputs of the maps are sorted, then input to the reduce tasks. Both the input and the output of the job are stored in HDFS, which runs on the same set of nodes, so the framework can effectively schedule tasks on the nodes where data is already present. The framework takes care of scheduling tasks, monitoring them and re-executing the failed tasks.
MAPPER: the key is what the data will be grouped on, and the value is the information pertinent to the analysis in the reducer. REDUCER: takes the grouped data as input and runs a reduce function once per key grouping. The function is passed the key and an iterator over all of the values associated with that key. The data can be aggregated, filtered, and combined. MapReduce runtime concerns: orchestration of distributed computation, scheduling of tasks, handling data distribution, handling synchronisation, handling errors and faults (detect worker failures and restart jobs); everything happens on top of a distributed file system.
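The mapper/reducer contract above, simulated in a single process with word count (the in-memory grouping stands in for the framework's sort/shuffle):

```python
from collections import defaultdict

def map_fn(line):
    # Key = word (what the data is grouped on), value = count of 1.
    for word in line.split():
        yield word, 1

def reduce_fn(key, values):
    # Runs once per key grouping, over all values for that key.
    return key, sum(values)

def map_reduce(inputs, map_fn, reduce_fn):
    groups = defaultdict(list)
    for record in inputs:               # map phase
        for k, v in map_fn(record):
            groups[k].append(v)         # "shuffle": group values by key
    return dict(reduce_fn(k, vs) for k, vs in groups.items())  # reduce phase

counts = map_reduce(["big data big", "data stream"], map_fn, reduce_fn)
# counts == {"big": 2, "data": 2, "stream": 1}
```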
5 Vs of Big Data: Volume, Velocity, Variety, Veracity, Value. Data stream processing deals with velocity. A data stream is a real-time, continuous, ordered sequence of structured records: massive volume at a high rate, impossible to store locally, and insights must be generated in real time by a continuous query. Structure: infinite sequence of items with the same structure, an explicit or implicit timestamp, and a physical or logical representation of time. Applications: transactional (logs interactions), measurement (monitor evolution of entity states). A Data Stream Management Service manages continuous data streams and executes a continuous query that is permanently installed; the stream query processor produces new results as long as new data arrives at the system. Scratch space is temporary storage used as a buffer where new data joins the queue to be processed. With stream algorithms, generate insights by going over the data once; as data grows linearly, the summary statistics should grow sub-linearly, at most log.
PageRank: R(t) = αHR(t-1) + (1-α)s. The damping factor α represents the probability that a user continues clicking on links versus jumping to a random page. Mapper emits (child(X), share of Y) for each Y; reducer groups all the child(X) together and sums the shares of Y.
Kmeans clustering
Mapper emits (key: which centroid, value: (point, 1)). Reducer groups all the keys with the same centroid, sums up all the values of the points, and divides to find the new mean, which is the new centroid. Broadcast the new means to each machine on the cluster.
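One iteration of this map/reduce step can be sketched with 1-D points (the data, centroids and function names are made-up illustration values):

```python
from collections import defaultdict

def mapper(point, centroids):
    # Emit (nearest centroid, (point, 1)).
    nearest = min(centroids, key=lambda c: abs(point - c))
    return nearest, (point, 1)

def reducer(pairs):
    # Group by centroid, sum points and counts, divide for the new mean.
    sums = defaultdict(lambda: [0.0, 0])
    for centroid, (point, count) in pairs:
        sums[centroid][0] += point
        sums[centroid][1] += count
    return {c: total / n for c, (total, n) in sums.items()}

points = [1.0, 2.0, 9.0, 11.0]
centroids = [0.0, 10.0]
pairs = [mapper(p, centroids) for p in points]   # map phase
new_centroids = reducer(pairs)                    # shuffle + reduce
# new_centroids == {0.0: 1.5, 10.0: 10.0}; these would then be broadcast
```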
Linear Regression
Mapper receives a subset (batch) of the data and computes a partial gradient. Reduce phase: the gradients from all the mappers are aggregated to compute the total gradient. Update phase: update the weights using the total gradient computed from all subsets.
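A sketch of the map/reduce/update phases for least-squares linear regression with a single weight and no bias (the data, batch split and learning rate are made-up illustration values):

```python
def partial_gradient(batch, w):
    # d/dw of 0.5 * (w*x - y)^2, summed over the batch.
    return sum((w * x - y) * x for x, y in batch)

batches = [[(1.0, 2.0), (2.0, 4.0)], [(3.0, 6.0)]]  # data follows y = 2x
n = 3                                                # total number of points
w = 0.0
grads = [partial_gradient(b, w) for b in batches]    # map: partial gradients
total = sum(grads)                                   # reduce: aggregate
w = w - 0.1 * total / n                              # update: gradient step
```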
Heavy hitters (potential candidates for further investigation). s-heavy hitters: count(e) at time t > s * (length of stream so far). Maintain a structure containing [ceiling(1/s) - 1] s-heavy hitter elements. False positives are okay but no false negatives; however, false positives are problematic if heavy hitters are used for billing/punishment. Window-based algorithm: window based on an ordering attribute, on record counts, or on explicit markers.
CQ: Continuous queries
Lossy counting. N: current length of the stream; s ∈ (0, 1): support threshold; ε ∈ (0, 1): error. Output: elements with counter values exceeding (s - ε) * N.
Rule of thumb: ε = 0.1 * s.
Guarantees: frequencies are underestimated by at most ε * N; no false negatives; false positives have true frequency at least (s - ε) * N; all elements exceeding frequency s * N will be output.
Example: a user is interested in identifying all items whose frequency is at least 10%: s = 0.1, assume ε = 0.01. Guarantee 1: all elements with frequency > 10% will be returned. Guarantee 2: no element with frequency below 9% will be returned; false positives between 9% and 10% might or might not be output. Guarantee 3: all individual frequencies are less than their true frequencies by at most ε * N.
Lossy counting (step 1): divide the stream into windows of size w = ceiling(1 / ε). (Step 2) Go through the elements: if a counter exists, increase it by one; if not, create one and initialise it to one. (Step 3) At each window boundary, decrement all counters by 1; if a counter reaches zero for a specific element, drop it.
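A minimal sketch of these three steps (the stream and parameters are made-up illustration values):

```python
import math

def lossy_count(stream, s, eps):
    window = math.ceil(1 / eps)      # step 1: window size w = ceil(1/eps)
    counters = {}
    n = 0
    for element in stream:
        n += 1
        counters[element] = counters.get(element, 0) + 1   # step 2
        if n % window == 0:                                # step 3: boundary
            for e in list(counters):
                counters[e] -= 1
                if counters[e] == 0:
                    del counters[e]
    # Output: elements whose counter exceeds (s - eps) * N.
    return {e for e, c in counters.items() if c > (s - eps) * n}

stream = ["a"] * 40 + ["b"] * 5 + ["c"] * 55
hitters = lossy_count(stream, s=0.3, eps=0.03)
# hitters == {"a", "c"}: "b" (5%) is pruned, no false negatives
```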
(Assumption: data is stationary.)
Sticky Sampling: a counting algorithm using a sampling approach. Probabilistic sampling decides whether a counter for a distinct element is created; if a counter exists for a certain element, every future instance of this element will be counted. N: current length of the stream; s ∈ (0, 1): support threshold; ε ∈ (0, 1): error; δ ∈ (0, 1): probability of failure.
Guarantees: the same as for lossy counting, except for the small probability that it might fail to provide correct answers.
Example: (step 1) Dynamic window, doubling the window size for each new window, where t = (1/ε) log(1/(s * δ)), W0 = 2t, and Wk = 2^k * t for k >= 1. (Step 2) Go through the elements: if a counter exists, increase it; if not, create a counter with probability 1/r and initialise it to one. The sampling rate r starts at 1 and grows in proportion to the window size W. (Step 3) At each window boundary, go through each counter: repeatedly toss a coin until a toss is successful, diminishing the counter by one for every unsuccessful toss; if a counter becomes zero, drop it. Decrements happen probabilistically, so heavy hitters have a higher chance of staying in the dictionary.
DBMS vs DSMS:
- Data nature: DBMS – persistent data, relatively static. DSMS – read-only (append-only) data; a DSMS processes data to generate insights, not to modify the streams.
- Storage: DBMS – "unbounded" disk store; can theoretically load the entire dataset into memory and generate insights. DSMS – bounded main memory.
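A sketch of sticky sampling following the steps above (`sticky_sampling` and the parameters are made-up illustration values):

```python
import math
import random

def sticky_sampling(stream, s, eps, delta, rng):
    t = (1 / eps) * math.log(1 / (s * delta))
    counters = {}
    r = 1                       # sampling rate, doubles at each boundary
    boundary = 2 * t            # first window W0 = 2t
    n = 0
    for element in stream:
        n += 1
        if element in counters:
            counters[element] += 1          # existing elements always counted
        elif rng.random() < 1 / r:
            counters[element] = 1           # created with probability 1/r
        if n >= boundary:                   # window boundary
            r *= 2
            boundary += r * t               # next window Wk = 2^k * t
            for e in list(counters):
                while rng.random() >= 0.5:  # unsuccessful toss: decrement
                    counters[e] -= 1
                    if counters[e] <= 0:
                        del counters[e]
                        break
    return {e for e, c in counters.items() if c > (s - eps) * n}

# With these parameters t ≈ 106, so no boundary is hit in 100 elements and
# the run is deterministic: only "x" (90%) exceeds (s - eps) * N = 45.
hitters = sticky_sampling(["x"] * 90 + ["y"] * 10,
                          s=0.5, eps=0.05, delta=0.01, rng=random.Random(0))
```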
class StreamingSum:
    def __init__(self):
        self.result = 0
    def update(self, element):
        self.result += element
Each node can have at most m key fields and m+1 pointer fields. Half-full must be satisfied (except for the root node):
If m is even and m = 2d: leaf node half full = at least d entries; non-leaf node half full = at least d entries.
If m is odd and m = 2d + 1: leaf node half full = at least d + 1 entries; non-leaf node half full = at least d entries (i.e., d + 1 pointers).
class StreamingMinimum:
    def __init__(self):
        self.result = float("inf")
    def update(self, element):
        self.result = min(self.result, element)
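The half-full rules above can be encoded as a small helper (`bplus_min_entries` is a hypothetical name):

```python
def bplus_min_entries(m):
    # m = maximum number of key fields per node.
    d = m // 2
    if m % 2 == 0:                         # m = 2d
        return {"leaf": d, "non_leaf": d}
    else:                                  # m = 2d + 1
        return {"leaf": d + 1, "non_leaf": d}

# e.g. m = 4 -> leaf and non-leaf need at least 2 entries;
#      m = 5 -> leaf needs 3, non-leaf needs 2 (i.e. 3 pointers).
```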
If s = 1/2, this is the majority element problem. If s = 1/3, this is the 1/3-heavy hitters problem. Example stream (subscripts are positions): E1, D2, B3, D4, D5, D6, B7, B8, B9, B10, B11, E12, E13, E14, E15, E16. At time 1 the 1/3-heavy hitter is E; at time 5 it is D; at time 11, both B and D; at time 15, it is B; at time 16, B and E.
Comparing lossy counting and sticky sampling for Unique and Zipf distributions (Zipf: frequency of an item is inversely proportional to its rank in the frequency table): sticky sampling's space usage is due to its tendency to remember every unique element that gets sampled; lossy counting is good at pruning low-frequency elements quickly. For highly skewed data, both algorithms require much less space than their worst-case bounds.
Microprocessor speed (vertical scaling) stagnated: thermal barrier, system bottleneck, material limits
Hadoop MapReduce is a software framework for easily writing applications which process vast amounts of data (multi-terabyte data-sets) in-parallel on large clusters (thousands of nodes) of commodity hardware in a reliable, fault-tolerant manner.
A MapReduce job usually splits the input data-set into independent chunks which are processed by
the map tasks in a completely parallel manner. The framework sorts the outputs of the maps, which
are then input to the reduce tasks. Typically both the input and the output of the job are stored in a
file-system. The framework takes care of scheduling tasks, monitoring them and re-executes the
failed tasks.
Typically the compute nodes and the storage nodes are the same, that is, the MapReduce framework
and the Hadoop Distributed File System (see HDFS Architecture Guide) are running on the same set of
nodes. This configuration allows the framework to effectively schedule tasks on the nodes where data
is already present, resulting in very high aggregate bandwidth across the cluster.
MAPPER: The key is what the data will be grouped on and the value is the information pertinent to
the analysis in the reducer.
REDUCER: The reducer takes the grouped data as input and runs a reduce function once per key
grouping. The function is passed the key and an iterator over all of the values associated with that
key. A wide range of processing can happen in this function, as we’ll see in many of our patterns. The
data can be aggregated, filtered, and combined in a number of ways.
Three features managed by Hadoop: Fault tolerance, data movement and lifecycle management
ResourceManager is the ultimate authority that arbitrates resources among all the applications in the system. The NodeManager is the per-machine framework agent that is responsible for containers, monitoring their resource usage (cpu, memory, disk, network), and sending heartbeats to the ResourceManager/Scheduler.
The ResourceManager has two main components: Scheduler and ApplicationsManager. The Scheduler is responsible for allocating resources to the various running applications. The ApplicationsManager is responsible for accepting job-submissions, negotiating the first container for executing the application-specific ApplicationMaster, and provides the service for restarting the ApplicationMaster container on failure. The per-application ApplicationMaster has the responsibility of negotiating appropriate resource containers from the Scheduler, tracking their status and monitoring for progress.
HDFS: a fault-tolerant distributed file system designed to run on commodity hardware. Moving computation is cheaper than moving data. Write-once-read-many and, mostly, append-to. Master/slave architecture to keep metadata and coordinate access. An HDFS cluster consists of a single NameNode, a master server that manages the file system namespace and regulates access to files by clients. The NameNode is the arbitrator and repository for all HDFS metadata. The system is designed in such a way that user data never flows through the NameNode. In addition, there are a number of DataNodes, usually one per node in the cluster, which manage storage attached to the nodes that they run on. Internally, a file is split into one or more blocks (64MB or 128MB) and these blocks are stored in a set of DataNodes. The blocks of a file are replicated for fault tolerance. The NameNode makes all decisions regarding replication of blocks. It periodically receives a Heartbeat and a Blockreport from each of the DataNodes in the cluster.
The NameNode manages the filesystem namespace: it maintains the filesystem tree and the metadata for all the files and directories in the tree.