
Quizzes & Exams

QUIZ 1 (Classes 1-7)

1 – Despite its capabilities, TEZ still needs the storage of intermediate output to HDFS.
(2020, 2021)
FALSE
(Week 4, slide 48)

2 – Volume drives the need for processing and storage parallelism, and its management during
processing of large datasets.
(2020, 2021)
TRUE
(Week 1, slide 35)

3 – The servers where the data resides can only perform the map operation.
(2020, 2021)
TRUE
(Week 4, slide 27)

4 – The process of generating a new fsimage from a merge operation is called the Checkpoint process.
(2020, 2021)
TRUE
(Week 3, slide 15)

5 – Hive follows a “schema on read” approach, unlike RDBMS, which enforces “schema on write.”
(2020, 2021)
TRUE
(Week 5, slide 29)

6 – Big Data is a characterization only for volumes of data above one petabyte.
(2020, 2021)
FALSE
(Definition)

7 – The term MapReduce refers exclusively to a programming model.


(2020, 2021)
FALSE
(Week 4, slide 8)

8 – Hadoop is considered a schema-on-write system regarding write operations.


(2020, 2021)
FALSE
(Week 2, slide 30)

9 – One of the key design principles of HDFS is that it should favor low latency random access over high
sustained bandwidth.
(2020, 2021)
FALSE
(Week 2, slide 34)

10 – One of the key design principles of HDFS is that it should be able to use commodity hardware.
(2020, 2021)
TRUE
(Week 2, slide 24)

11 – The fundamental architectural principles of Hadoop are: large scale, distributed, shared everything
systems, connected by a good network, working together to solve the same problem.
(2020, 2021)
FALSE
(Week 2, slide 23)
12 – Apache Tez is an engine built on top of Apache Hadoop YARN.
(2020, 2021)
TRUE
(Week 4, slide 42)

13 – The name node is not involved in the actual data transfer.


(2020, 2021)
FALSE
(Week 3, slide 33)

14 – Business benefits are frequently higher when addressing the variety of data than when addressing
volume
(2020, 2021)
TRUE
(Week 1, slide 37)

15 – In object storage, data is stored close to processing, just like in HDFS, but with rich metadata.
(2020, 2021)
FALSE
(Week 5, slide 6)

16 – The map function in MapReduce processes key/value pairs to generate a set of intermediate
key/value pairs.
(2020, 2021)
TRUE
(Week 4, slide 26)

17 - Essentially, MapReduce divides a computation into two sequential stages: map and reduce.
(2020, 2021)
FALSE
(Week 4, slide 26)

18 – When a HiveQL query is executed, Hive translates the query into MapReduce, saving the time and effort
of writing actual MapReduce jobs.
(2020, 2021)
TRUE
(Week 5, slide 21)

19 – When we drop an external table in Hive, both the data and the schema will be dropped.
(2020, 2021)
FALSE
(Week 6, slide 34)
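
As a hedged illustration of the behaviour behind this answer (the table name is hypothetical):
-- For an EXTERNAL table, DROP removes only the table definition from the metastore;
-- the data files at the table's LOCATION are left untouched.
DROP TABLE logs_ext;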

20 – Hive can be considered a data warehousing layer over Hadoop that allows for data to be exposed
as structured tables.
(2020, 2021)
TRUE
(Week 5, slide 20)

--//--

21 – YARN manages resources and monitors workloads, in a secure multitenant environment, while
ensuring high availability across multiple Hadoop clusters.
(2021)
TRUE
(Week 4, slide 34)
22 – TEZ provides the capability to build an application framework that allows for a complex DAG of tasks
for high-performance data processing only in batch mode.
(2021)
FALSE
(Week 4, slide 42)

23 – The operation
CREATE TABLE external
(col1 STRING,
col2 INT)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ' ';
creates an external table in Hive.
(2021)
FALSE
(Should be CREATE EXTERNAL)
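
For reference, a minimal sketch of how an equivalent, genuinely external table could be declared. The column definitions and delimiter are the quiz's; the table name and the HDFS path are hypothetical (the name is also changed to avoid clashing with the EXTERNAL keyword), and external tables normally point at data that already exists:
-- The EXTERNAL keyword (usually together with a LOCATION) is what makes the table external.
CREATE EXTERNAL TABLE logs_ext (
  col1 STRING,
  col2 INT)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ' '
LOCATION '/data/logs_ext';  -- hypothetical path; dropping the table leaves these files in place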

24 – Internal tables in Hive can only be stored as TEXTFILE.


(2021)
FALSE
(Week 6, slide 50)
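
As an illustration, a minimal sketch of an internal (managed) table stored in a columnar format rather than TEXTFILE (table and column names are hypothetical):
-- Managed tables can use other storage formats, e.g. ORC or Parquet.
CREATE TABLE sales_orc (
  id STRING,
  amount DOUBLE)
STORED AS ORC;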

25 – We can say that internal tables in Hive adhere to a principle called “data on schema”.
(2021)
TRUE
(Week 5, slide 31)

QUIZ 2 (Classes 8-12)

1 – Consider the scenario in Fig. 1, for a log analytics solution. The web server in this example is an
Amazon Elastic Compute Cloud (EC2) instance. In step 1, Amazon Kinesis Firehose will continuously pull
log records from the Apache Web Server.
(2020, 2021)
FALSE

2 – We can configure, in Step 1, each shard in Amazon Kinesis Firehose to ingest up to 1MB/sec and 1000
records/sec, and emit up to 2MB/sec.
(2020, 2021)
FALSE

3 – In step 2, Amazon Kinesis Firehose will write each log record to Amazon Simple Storage Service
(Amazon S3) for durable storage of the raw log data.
(2020, 2021)
TRUE

4 – In step 2, also, Amazon Kinesis Analytics application will continuously run a Kinesis Streaming Python
script against the streaming input data.
(2021)
FALSE

5 – In step 3, the Amazon Kinesis Analytics application will create an aggregated data set every minute
and output that data to a second Firehose delivery stream.
(2021)
TRUE

6 – All data in DynamoDB is replicated in two availability zones


(2020, 2021)
FALSE

7 – An end user provisions a Lambda service with similar steps as it provisions an EC2 instance
(2020, 2021)
FALSE

8 – The concept of partitioning can be used to reduce the cost of querying the data
(2020, 2021)
TRUE

9 – In a publish/subscribe model, although data producers are decoupled from data consumers,
publishers know who the consumers are.
(2020, 2021)
FALSE

10 – In Hive most of the optimizations are not based on the cost of query execution.
(2020, 2021)
TRUE
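
As a hedged aside, cost-based optimization can be switched on explicitly, and statistics then have to be gathered for the optimizer to use (the table name is hypothetical, and property names may vary by Hive version):
-- Enable the cost-based optimizer and collect table-level statistics for it to use.
SET hive.cbo.enable=true;
ANALYZE TABLE customer COMPUTE STATISTICS;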

11 – The number of shards cannot be modified after the Kinesis stream is created.
(2020, 2021)
FALSE

12 – The basic idea of vectorized query execution is to process a batch of columns as an array of line
vectors.
(2020, 2021)
FALSE

13 – In Hive bucketing, data is evenly distributed between all buckets based on the hashing principle.
(2020, 2021)
TRUE
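
As an illustration, a minimal sketch of a bucketed table, where rows are assigned to buckets by hashing the bucketing column (table and column names are hypothetical):
-- Rows are distributed across the 4 buckets by hashing user_id.
CREATE TABLE clicks_bucketed (
  user_id STRING,
  url STRING)
CLUSTERED BY (user_id) INTO 4 BUCKETS;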

14 – The selection of the partition key is always an important factor for performance. It should always
be a low-cardinality attribute to avoid the overhead of too many sub-directories.
(2020, 2021)
TRUE

15 – To partition the table customer by country we use the following HiveQL statement
CREATE TABLE customer (id STRING, name STRING, gender STRING,
state STRING, country STRING) PARTITIONED BY (country STRING)
(2020, 2021)
FALSE
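
The statement above fails because the partition column is repeated in the regular column list. A minimal sketch of a version that would work (Hive derives the partition column solely from the PARTITIONED BY clause):
-- country appears only as a partition column, not as a regular column.
CREATE TABLE customer (
  id STRING,
  name STRING,
  gender STRING,
  state STRING)
PARTITIONED BY (country STRING);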

16 – We can configure the values for the Amazon S3 buffer size (1 MB to 128 MB) or buffer interval (60
seconds to 900 seconds). The condition satisfied first triggers data delivery to Amazon S3.
(2020, 2021)
TRUE

17 – Regarding Fig. 2, the box marked with an X refers to an Amazon Kinesis Analytics service.
(2021)
TRUE

18 – Kinesis Analytics outputs its results to Kinesis Streams or Kinesis Firehose.


(2021)
TRUE

19 – AWS Lambda polls the stream periodically (once per second) for new records. When it detects new
records, it invokes the Lambda function by passing the new records as a parameter. If no new records
are detected, the Lambda function is not invoked.
(2020, 2021)
TRUE

20 – DynamoDB tables do not have fixed schemas, but all items must have a similar number of attributes.
(2020, 2021)
FALSE

21 – The main drawback of Kinesis Data Firehose is that to scale up or down you need to manually
provision servers using the "AWS Kinesis Data Firehose Scaling API".
(2021)
FALSE

22 – Kinesis Data Firehose can convert a stream of JSON data to Apache Parquet or Apache ORC.
(2021)
TRUE
23 – Since it uses SQL as query language, Amazon Athena is a relational/transactional database.
(2021)
FALSE

24 – AWS Glue ETL jobs are Spark-based.


(2021)
TRUE

25 – Kinesis Data Streams is a good choice for long-term data storage and analytics
(2021)
FALSE

26 – Amazon OpenSearch/Elasticsearch stores CSV documents.


(2021)
FALSE

-- // --

27 – One of the drawbacks of Amazon Kinesis Firehose is that it is unable to encrypt the data, limiting its
usability for sensitive applications.
(2020)
FALSE

28 – Stream processing applications process data continuously in real time, usually after a store operation
(in HD or SSD).
(2020)
FALSE

29 – You cannot use streaming data services for real-time applications such as application monitoring,
fraud detection, and live leaderboards, because these use cases require millisecond end-to-end
latencies—from ingestion, to processing, all the way to emitting the results to target data stores and
other systems.
(2020)
FALSE

30 – Kinesis Firehose is a fully managed service that automatically scales to match the throughput of
the data and requires no ongoing administration.
(2020)
TRUE
BDF2020 Review Questions 1

1. What is HDFS, and what are HDFS design goals?


HDFS is a highly scalable, distributed, load-balanced, portable, and fault-tolerant storage component
of Hadoop (with built-in redundancy at the software level).
When HDFS was implemented originally, certain assumptions and design goals were discussed:
• Horizontal scalability—Based on the scale-out model. HDFS can run on thousands of nodes.
• Fault tolerance—Keeps multiple copies of data to recover from failure.
• Capability to run on commodity hardware—Designed to run on commodity hardware.
• Write once and read many times—Based on a concept of write once, read multiple times, with an
assumption that once data is written, it will not be modified. Its focus is thus retrieving the data in
the fastest possible way.
• Capability to handle large data sets and streaming data access—Targeted to small numbers of very
large files for the storage of large data sets.
• Data locality—Every slave node in the Hadoop cluster has a data node (storage component) and
a TaskTracker (processing component). Processing is done where data exists, to avoid data
movement across nodes of the cluster.
• High throughput—Designed for parallel data storage and retrieval.
• HDFS file system namespace—Uses a traditional hierarchical file organization in which any user or
application can create directories and recursively store files inside them.

2. In terms of storage, what does a name node contain and what do data nodes contain?
• HDFS stores and maintains file system metadata and application data separately. The name node
(master of HDFS) contains the metadata related to the file system (information about each file, as
well as the history of changes to the file metadata). Data nodes (workers of HDFS) contain
application data in a partitioned manner for parallel writes and reads.
• The name node holds the entire metadata, called the namespace (a hierarchy of files and
directories), in physical memory, for quicker response to client requests; its on-disk image is called
the fsimage. Changes are recorded in a transactional file called the edit log. For persistence, both
of these files are written to the host OS drives. The name node simultaneously responds to multiple
client requests (it is a multithreaded system) and provides the information the client needs to
connect to data nodes to write or read the data. While writing, a file is broken down into multiple
chunks, called blocks, of 128MB by default (64MB in older versions). Each block is stored as a
separate file on data nodes. Based on the replication factor of a file, multiple copies or replicas of
each block are stored for fault tolerance.

3. What is the default data block placement policy?


By default, three copies, or replicas, of each block are placed, per the default block placement policy
mentioned next. The objective is a properly load-balanced, fast-access, fault-tolerant file system:
• The first replica is written to the data node creating the file.
• The second replica is written to another data node within the same rack.
• The third replica is written to a data node in a different rack.

4. What is the replication pipeline? What is its significance?


Data nodes maintain a pipeline for data transfer. This means that data node 1 does not need to wait
for a complete block to arrive before it can start transferring it to data node 2 in the flow. In fact, the
data transfer from the client to data node 1 for a given block happens in smaller chunks of 4KB. When
data node 1 receives the first 4KB chunk from the client, it stores this chunk in its local repository and
immediately starts transferring it to data node 2 in the flow. Likewise, when data node 2 receives the
first 4KB chunk from data node 1, it stores this chunk in its local repository and immediately starts
transferring it to data node 3, and so on. This way, all the data nodes in the flow (except the last one)
receive data from the previous data node and, at the same time, transfer it to the next data node in
the flow, to improve the write performance by avoiding a wait at each stage.

5. What is client-side caching, and what is its significance when writing data to HDFS?
HDFS uses several optimization techniques. One is to use client-side caching, by the HDFS client, to
improve the performance of the block write operation and to minimize network congestion. The HDFS
client transparently caches the file into a temporary local file; when it accumulates enough data for a
block size, the client reaches out to the name node. At this time, the name node responds by inserting
the filename into the file system hierarchy and allocating data nodes for its storage. The client then
flushes the block of data from the local, temporary file to the closest data node, and that data node
transfers the block to other data nodes (as instructed by the name node, based on the replication
factor of the file). This client-side caching avoids continuous use of the network and minimizes the risk
of network congestion.

6. How can you enable rack awareness in Hadoop?


You can make the Hadoop cluster rack aware by using a script that enables the master node to map
the network topology of the cluster using the properties topology.script.file.name or
net.topology.script.file.name, available in the core-site.xml configuration file. First, you must change
this property to specify the name of the script file. Then you must write the script and place it in the
file at the specified location. The script should accept a list of IP addresses and return the
corresponding list of rack identifiers. For example, the script would take host.foo.bar as an argument
and return /rack1 as the output.

7. What is the data block replication factor?


An application or a job can specify the number of replicas of a file that HDFS should maintain. The
number of copies or replicas of each block of a file is called the replication factor of that file. The
replication factor is configurable and can be changed at the cluster level or for each file when it is
created, or even later for a stored file.

8. What is block size, and how is it controlled?


When a client writes a file to a data node, it splits the file into multiple chunks, called blocks. This data
partitioning helps in parallel data writes and reads. Block size is controlled by the dfs.blocksize
configuration property in the hdfs-site.xml file and applies to files that are created without a block
size specification. When creating a file, the client can also specify a block size to override
the cluster-wide configuration.

9. What is a checkpoint, and who performs this operation?


The process of generating a new fsimage by merging transactional records from the edit log to the
current fsimage is called checkpoint. The secondary name node periodically performs a checkpoint by
downloading fsimage and the edit log file from the name node and then uploading the new fsimage
back to the name node. The name node performs a checkpoint upon restart (not periodically,
though—only on name node start-up).

10. How does a name node ensure that all the data nodes are functioning properly?
Each data node in the cluster periodically sends heartbeat signals and a block-report to the name
node. Receipt of a heartbeat signal implies that the data node is active and functioning properly. A
block-report from a data node contains a list of all blocks on that specific data node.

11. How does a client ensure that the data it receives while reading is not corrupted? Is there a way
to recover an accidentally deleted file from HDFS?
When writing blocks of a file, an HDFS client computes the checksum of each block of the file and
stores these checksums in a separate hidden file in the same HDFS file system namespace. Later, while
reading the blocks, the client references these checksums to verify that these blocks were not
corrupted (corruption might happen because of faults in a storage device, network transmission faults,
or bugs in the program). When the client realizes that a block is corrupted, it reaches out to another
data node that has a replica of the corrupted block, to get another copy of the block. As for accidental
deletions, HDFS offers a trash facility: when it is enabled (via the fs.trash.interval property), a deleted
file is first moved to the user's trash directory and can be restored from there until the trash is purged.

12. How can you access and manage files in HDFS?


You can access the files and data stored in HDFS in many different ways. For example, you can use
HDFS FS Shell commands, leverage the Java API available in the classes of the org.apache.hadoop.fs
package, write a MapReduce job, or write Hive or Pig queries. In addition, you can even use a web
browser to browse the files from an HDFS cluster.

13. What two issues does HDFS encounter in Hadoop 1.0?


First, the name node in Hadoop 1.0 is a single point of failure. You can configure a secondary name
node, but it’s not an active-passive configuration. The secondary name node thus cannot be used for
failover in case the name node fails. Second, as the number of data nodes grows beyond 4,000, the
performance of the name node degrades, setting a kind of upper limit to the number of nodes in a
cluster.

14. What is a daemon?


The word daemon comes from the UNIX world. It refers to a process or service that runs in the
background. On a Windows platform, we generally refer to it as a service. For example, in HDFS, we
have daemons such as the name node, data node, and secondary name node.

15. What is YARN and what does it do?


In Hadoop 2.0, MapReduce has undergone a complete overhaul, with a new layer created on top of
HDFS. This new layer, called YARN (Yet Another Resource Negotiator), takes care of two major
functions: resource management and application life-cycle management. The JobTracker previously
handled those functions. Now MapReduce is just a batch-mode computational layer sitting on top of
YARN, whereas YARN acts like an operating system for the Hadoop cluster by providing resource
management and application life-cycle management functionalities. This makes Hadoop a general-
purpose data processing platform that is not constrained only to MapReduce.

16. What is uber-tasking optimization?


The concept of uber-tasking in YARN applies to smaller jobs. Those jobs are executed in the same
container, or the same JVM, in which the application-specific Application Master is running. The
basic idea behind uber-tasking optimization is that, for smaller jobs, the overhead of distributed task
allocation and management exceeds the benefits of executing the tasks in parallel, hence it is optimal
to execute smaller jobs in the same JVM or container as the Application Master.

17. What are the different components of YARN?


Aligning to the original master-slave architecture principle, YARN has a global or master Resource
Manager for managing cluster resources and a per-node (slave) Node Manager that takes direction
from the Resource Manager and manages resources on its node. These two form the computation
fabric of YARN. Apart from that, there is a per-application Application Master, which is merely an
application-specific library tasked with negotiating resources from the global Resource Manager and
coordinating with the Node Manager(s) to execute the tasks and monitor their execution. Containers
also are present—these are a group of computing resources, such as memory, CPU, disk, and network.

What were the key design strategies for HDFS to become fault tolerant?
HDFS is a highly scalable, distributed, load-balanced, portable, and fault-tolerant storage component
of Hadoop (with built-in redundancy at the software level).
When HDFS was implemented originally, certain assumptions and design goals were discussed:
• Horizontal scalability—Based on the scale-out model. HDFS can run on thousands of nodes.
• Fault tolerance—Keeps multiple copies of data to recover from failure.
• Capability to run on commodity hardware—Designed to run on commodity hardware.
• Write once and read many times—Based on a concept of write once, read multiple times, with an
assumption that once data is written, it will not be modified. Its focus is thus retrieving the data in
the fastest possible way.
• Capability to handle large data sets and streaming data access—Targeted to small numbers of very
large files for the storage of large data sets.
• Data locality—Every slave node in the Hadoop cluster has a data node (storage component) and a
TaskTracker (processing component). Processing is done where data exists, to avoid data movement
across nodes of the cluster.
• High throughput—Designed for parallel data storage and retrieval.
• HDFS file system namespace—Uses a traditional hierarchical file organization in which any user or
application can create directories and recursively store files inside them.

To what does the term data locality refer?


Data locality is the concept of processing data locally wherever possible. This concept is central to
Hadoop, a platform that intentionally attempts to minimize the amount of data transferred across the
network by bringing the processing to the data instead of the reverse.
2nd Exam 2020

You may assume that you are a Data Scientist in a consulting assignment with the CDO (Chief Data
Officer) of Exportera (a large company). The CDO asks you lots of questions and greatly appreciates
precise answers to them.

1. You are hired by the Municipality of Terras do Ouro as a Data Scientist, to help prevent the effects
of flooding from the River Guardião. The municipality has already distributed IoT devices across
the river that are able to measure the flow of the water. In the kick-off meeting, you say: "I have
an idea for a possible solution. We may need to use a number of AWS services.” And then you
explain your solution. What do you say?
In order to collect and process these real time streams of data records regarding the flow of the water,
the data from the IoT devices across the river is sent to AWS Kinesis Data Streams. Then this data is
processed using the AWS Lambda module and stored in AWS S3 buckets. The data collected into
Kinesis Data Streams can be used for simple data analysis and reporting in real time, using AWS
Elasticsearch. Finally we can send the processed records to dashboards to visualize the variability of
the flow of water, using Amazon QuickSight.

You do not need to worry about compute resources if you use AWS Kinesis. This system will enable
you to acquire data through a gateway on the cloud that will receive your data; then you can process
it using AWS Lambda and later store the data in a storage system, e.g. AWS S3.

2. You go to a conference and hear a speaker saying the following: "Real-time analytics is a key
factor for digital transformation. Companies everywhere are using their data lakes based on
technologies such as Hadoop and HDFS to give key insights, in real time, of their sales and
customer preferences." You raise your hand for a comment. What are you going to say? Please,
justify carefully.
Firstly, I would like to clarify that Hadoop is a stack of different components, one of which is HDFS,
its distributed file system. Moreover, although HDFS is still thriving, there are plenty of
other choices, such as Amazon S3 or Microsoft Azure Data Lake Store (ADLS) for cloud storage, or
Apache Kudu for IoT and analytic data.

Hadoop is a framework that incorporates several technologies, including batch, streaming (which you
mentioned) and others. HDFS is Hadoop's file system.

3. In a whiteboarding session, you say to the CDO of Exportera: "OK, I'll explain what the Hive
Driver is and what are its key functions/roles." What do you say, trying to be very complete and
precise?
To begin with, Hive is Hadoop’s SQL-like analytics query engine, which enables the writing of data
processing logic or queries in a declarative language similar to SQL, called HiveQL. The brain of Hive is
the Hive Driver, which maintains the lifecycle of a HiveQL statement: it comprises a query compiler,
optimizer and executor, executing the task plan generated by the compiler in proper dependency
order while interacting with the underlying Hadoop instance.

Hive is Hadoop's SQL analytics query framework. It is the same as saying Hive is a SQL abstraction layer
over Hadoop MapReduce with a SQL-like query engine. Through its driver, Hive also enables several
types of connections to other databases; the Hive Driver can, for instance, connect through ODBC to
relational databases.
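
One way to see the Driver's compile and optimize steps at work is to ask Hive for the execution plan it generates before handing the tasks to Hadoop. A minimal, hedged sketch (the customer table is hypothetical):
-- EXPLAIN prints the stage plan (e.g. map/reduce stages) produced by the compiler and optimizer.
EXPLAIN
SELECT country, COUNT(*) AS customers
FROM customer
GROUP BY country;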

4. On implementing Hadoop, the CDO is worried on understanding how a name node ensures that
all the data nodes are functioning properly. What can you tell him to reassure him?
The name node, or master node, contains the metadata related to the Hadoop Distributed File System,
such as the file name, file path or configurations. It keeps track of all the data nodes, or slave nodes,
using the heartbeat methodology. Each data node regularly sends heartbeat signals to the name node.
After receiving these signals, the name node knows that the slave nodes are alive and functioning
properly. In the event that the name node does not receive a heartbeat signal from a data node,
that particular data node is considered inactive, and the name node starts the process of replicating
its blocks on some other data node.

Each data node in the cluster periodically sends heartbeat signals and a block-report to the name
node. Receipt of a heartbeat signal implies that the data node is active and functioning properly. A
block-report from a data node contains a list of all blocks on that specific data node.

5. The CDO is considering using Amazon Kinesis Firehose to deliver data to Amazon S3. He is
concerned about losing data if the data delivery to the destination is falling behind data writing to
the delivery stream. Can you help him understand how this process works, in order to
alleviate his concerns?
In Amazon Kinesis Firehose, the scaling is handled automatically, up to gigabytes per second, in order
to meet the demand. Amazon Kinesis Firehose will write each log record to Amazon Simple Storage
Service (Amazon S3) for durable storage of the raw log data, and the Amazon Kinesis Analytics
application will continuously run a Kinesis Streaming SQL statement against the streaming input data.
This way, you should not be concerned.
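
To make this concrete, here is a minimal, hedged sketch of the kind of Kinesis Streaming SQL statement such an Analytics application might run (stream and column names are hypothetical; SOURCE_SQL_STREAM_001 is the default name given to the input stream, and the exact windowing syntax may vary by version):
-- One-minute tumbling-window aggregation over the incoming log stream.
CREATE OR REPLACE STREAM "DESTINATION_SQL_STREAM" (status_code INTEGER, request_count INTEGER);

CREATE OR REPLACE PUMP "AGG_PUMP" AS
  INSERT INTO "DESTINATION_SQL_STREAM"
  SELECT STREAM "status_code", COUNT(*) AS request_count
  FROM "SOURCE_SQL_STREAM_001"
  GROUP BY "status_code",
           STEP("SOURCE_SQL_STREAM_001".ROWTIME BY INTERVAL '60' SECOND);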

6. The CDO of Exportera asks you to prepare the talking points for a presentation he must make
to the board regarding a budget increase for his team. It is important that the board members
understand what the impacts of data variety versus data volume are. What can you tell
them?
When it comes to Big Data applications, there are 4 characteristics which need to be addressed:
Volume, Velocity, Variety and Variability (also known as the 4 V’s). Volume is the most commonly
recognized characteristic of Big Data, representing the large amount of data available for analysis to
extract valuable information. The time and expense required to process large datasets drives the need
for processing and storage parallelism. On the other hand, data variety represents the need to analyze
data from multiple sources, domains and data types to bring added value to the business, which drives
the need for distributed processing. This way, a budget increase is needed to meet the added
requirements that big data processing brings to the table.

While the volume can be handled by scaling the processing and storage resources, the issues related
to variety require customization of processing, namely interfaces and the programming of handlers.
For instance, handling objects coming from SQL sources or OSI PI sources requires different types of
programming skills. Not having the right skills to handle this can ruin the ambition of storing diverse,
and therefore rich, data.
To have these skills, we need a budget increase in the department.

7. The CDO of Exportera tells you: “From what I understand, in Hive, the partitioning of tables can
only be made on the basis of one parameter, making it not as useful as it could be." What can
you answer him?
You are wrong. Hive partitioning can be made with more than one column of the original table.
However, the partition key should always be a low-cardinality attribute to avoid the overhead of too
many partitions. It is very useful, as a query with partition filtering will only load data from the specified
partitions, so it can execute much faster than a normal query that filters by a non-partitioning field.
Hive partitioning can be made with more than one column from the original table: the table is divided
into sub-directories (folders), one for each combination of partition values. The partitioning can even
be further optimized with bucket files.
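
A minimal sketch of what multi-level partitioning and partition pruning can look like (table, column, and value names are hypothetical):
-- Two partition keys: one sub-directory per (country, year) combination.
CREATE TABLE sales_part (
  id STRING,
  amount DOUBLE)
PARTITIONED BY (country STRING, year INT);

-- Only the matching partition directories are read.
SELECT SUM(amount)
FROM sales_part
WHERE country = 'PT' AND year = 2020;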

8. The CDO of Exportera has been exploring stream processing and is worried about the latency of
a solution based on a platform such as Kinesis Data Streams. To address the question, what can
you tell him?
Kinesis Data Streams can be used efficiently to solve a variety of streaming data problems. It ensures
durability and elasticity, which enables you to scale the stream up or down, so that you never lose
data records before they expire. The delay between the time a record is put into the stream and the
time it can be retrieved is typically less than 1 second, so you don’t have to worry!

The Kinesis Data Streams platform is built on high-throughput, large-bandwidth components.
Kinesis ensures the durability and elasticity of data acquired through streaming. The elasticity of Kinesis
Data Streams enables you to scale the stream up or down, so that you never lose data records before
they expire (1-day default retention).

9. The CDO of Exportera is confused: "I've read about MapReduce, but I still do not fully
understand it." Then you explain it very clearly to him with a small example: counting the words of
the following text:
"Mary had a little lamb
Little lamb, little lamb
Mary had a little lamb
Its fleece was white as snow"
Justify carefully.
MapReduce is a process which divides a computation into three sequential stages: map, shuffle and
reduce. In the map phase, the required data is read from the HDFS and processed in parallel by
multiple independent map tasks. A map() function is defined which produces key-value outputs. Then,
the map outputs are sorted in the shuffle phase. Finally, a reduce() function is defined to aggregate all
the values for each key, summarizing the inputs.
One of the applications of the MapReduce algorithm is to count words. If we take into consideration
the example mentioned:
1 – In the first step, the map() phase, all the words in the text are split into a list ordered from the first
word of the text until the last one:
“Mary”,
“had”,
“a”,
“little”,
“lamb”,
“Little”,
“lamb”,
(…)
“white”,
“as”,
“snow”

Still in the map() phase, each word is converted into a key-value pair, with the word as the key and
1 as the value:
(“Mary”: 1),
(“had”: 1),
(“a”: 1),
(…)
(“snow”:1).
2 – In the shuffle phase, all the key-value pairs are sorted alphabetically by the key, so that identical
keys are grouped together
3 – Finally, in the reduce phase, there is an aggregation by key (word), performing a count of the
respective values. This way, the final result is the key-value pairs of the distinct words and their count
of occurrences throughout the text

MapReduce is a programming model, or a processing technique, that can be applied in several contexts. Its
implementation is usually divided into three steps: mapping the data elements considered, shuffling
them following a pattern/directive, and then aggregating/reducing them.
Let us take the text above as reference for the following example: We want to count the number of
words in the text.
In the first step we would map all the words, let's say, continuously, getting a list like:
"Mary
had
a
little
...
as
snow"
If one wants to count the number of words, a number should be associated with each word for later
counting. So we can associate "1" with each word, creating, still in the map step, a kind of key-value
dataset associating each word with "1", e.g.:
<"Mary",1>, where the word is the key.
<Mary,1>
<little,1>
In the next step one could shuffle this dataset so that all identical keys are grouped together. But
since one just wants to count the number of words, we just need to sum the values, and this operation
is the reduce operation in this case.
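
Since Hive compiles HiveQL into MapReduce jobs, the same word count can also be sketched in HiveQL, assuming a hypothetical table docs with one line of text per row:
-- split() breaks each line into an array of words and explode() emits one row per word (the map side);
-- GROUP BY brings identical words together and COUNT(*) aggregates them (the shuffle and reduce sides).
SELECT word, COUNT(*) AS occurrences
FROM (
  SELECT explode(split(line, ' ')) AS word
  FROM docs
) w
GROUP BY word;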

10. In the Exportera cafeteria you hear someone saying the following: "Oh, processing data in
motion is no different than processing data at rest." If you had to join the conversation, what
would you say to them?
I don’t agree with you. Processing data in motion is very different from processing data at rest. The
operational difference between streaming data and data at rest lies in when the data is stored and
analyzed. When dealing with data at rest, each observation is recorded and stored before performing
analysis on the data. On the other hand, in streaming data, each event is processed as it is read, and
subsequent results are stored in a database. However, there are some similarities: data at rest and
streaming data can come from the same source, they can be processed with the same analytics and
can be stored with the same storage service
Processing data in motion (streaming) is much different from processing data at rest. The main
differences are that:
1. Analytics over streaming data needs to be done on incoming data, not on data already at rest.
2. Data in motion needs buffering to absorb the incoming volumes, while similar processing of data
at rest can be done through querying.
3. The concept of incoming data in streaming does not apply to data at rest. In the former case one
needs to adapt the acquisition to the amount of incoming data.
Of course, there are also some similarities, namely in data quality control, such as the detection of
common types of errors like invalid or missing values. Anyway, the processing (which is what we
are talking about) is much different in the two cases.

--//--

1. The CDO of Exportera is confused: "I've read about MapReduce, but I still do not fully understand
it. Can you explain very clearly its processes, and the role of the NameNode and the DataNodes in a
MapReduce operation?" Can you help him out?
MapReduce is a process which divides a computation into three sequential stages: map, shuffle and
reduce. In the map phase, the required data is read from the HDFS and processed in parallel by
multiple independent map tasks. A map() function is defined which produces key-value outputs. Then,
the map outputs are sorted in the shuffle phase. Finally, a reduce() function is defined to aggregate all
the values for each key, summarizing the inputs.
During this process, the NameNode allocates which DataNodes, meaning individual servers, will
perform the map operation and which will perform the reduce operation. This way, mapper
DataNodes perform the map phase and produce key-value pairs and the reducer DataNodes apply the
reduce function to the key-value pairs and generate the final output.

2. The CDO of Exportera asks you to prepare the talking points for a presentation he must make to
the board regarding a budget increase for his team. It is important that the board members
understand what data variety versus data variability is, as well as their impacts on the analytics
platforms (and, of course, the business benefits of addressing them). What do you write to make it
crystal clear?
When it comes to Big Data applications, there are 4 characteristics which need to be addressed:
Volume, Velocity, Variety and Variability (also known as the 4 V’s).
While the variety characteristic represents the need to analyze data from multiple sources and data
types, the variability refers to the changes in dataset characteristics, whether in the data flow rate,
format/structure and/or volume. When it comes to variety of data, distributed processing should be
applied on different types of data, followed by individual pre-analytics. On the other hand, variability
implies the need to scale-up or scale-down to efficiently handle the additional processing load, which
justifies the use of cloud computing. Finally, regarding the benefits to the business, both have their
advantages: variety of data brings an additional richness to the business, as more details from multiple
domains are available; handling variability keeps the systems efficient, as we don’t always have to
design and resource for the expected peak capacity.
Big Data Foundations
Week 1
Syllabus
Defining Big Data
Class outline

• Syllabus
• Defining Big Data

2
Class outline

• Syllabus
• Defining Big Data

3
Course agenda
• Defining Big Data
• Datacenter scale computing
• Foundational Big Data Core Components
• Cloud Storage and Modern Data Lake Components
• Batch Analysis
• Query Optimization for Batch Analysis
• Data collection
• Near-real time
• Real time
• Streaming Analysis
• Use case: Designing a modern Big Data solution

4
Evaluation and contact

• Evaluation:
• First assessment period:
• Quizzes in class (2): 2 x 5%
• Case analysis + status report: 40%
• Final Exam: 50%
• Second assessment period:
• Second Exam: 100%
• Office hours:
• Tuesdays & Wednesdays, 18h00-18h30, by appointment
• Contact:
• hcarreiro@novaims.unl.pt

5
Quizzes

• The quizzes will be a set of multiple-choice or Y/N questions


• They will be answered in class time
• It will be mostly an “exam training”

6
Bibliography

• Class slides, scientific papers, analysts reports.


• Kunigk, J., Buss, I., Wilkinson, P. & George, L. (2019). Architecting
Modern Data Platforms. O’Reilly.
• Gorelik, A. (2019). The Enterprise Big Data Lake. O’Reilly.
• Talia, D., Trunfio, P. & Marozzo, F. (2016). Data Analysis in the
Cloud: Models, Techniques and Applications. Elsevier.

7
Technology stack

• Amazon Web Services Stack


• Power BI as special guest
• As possible complements/alternatives:
• Cloudera Quickstart VM
• VMware Workstation Player
• Microsoft Azure

8
Where to download Cloudera VM

• VMWare
• https://downloads.cloudera.com/demo_vm/vmware/cloudera-
quickstart-vm-5.13.0-0-vmware.zip

9
Approach

• The objective of the course will be to prepare the students to


help drive the strategies for data platforms in their companies or
in consulting assignments
• Big Data is a rapidly evolving topic, with some very large players
launching new services almost every week, so we will also look
at what’s happening in the market to spot trends
• Our approach will be, whenever possible, multiplatform

10
A note about online classes

• The course will be available also through Zoom classes


• But, the best experience for students will be in the classroom
• The course has a lot of hands-on practice (labs) – and the
classroom is the best experience for sharing your results
with your colleagues, addressing questions and sharing overall
learnings during the execution of the labs

11
Your expectations for the
course?

12
Case Analysis:
MovieLens Dataset(s)

13
MovieLens Dataset(s)

• The objectives of the project are the following:


• Becoming familiar with a scalable dataset that can start small and grow
to be really big (yet, manageable with limited hardware)
• Becoming familiar with the Hadoop ecosystem tools that can help you
analyse such a dataset

14
MovieLens Dataset(s)
• Based initially on the 100K MovieLens dataset, get to know the dataset and start
asking some questions that you might find interesting. Once comfortable, you
can scale up to the 20M dataset (see below), or even to the MovieLens 1B.
• Some possible questions
• What are the movie genres in the dataset?
• What is the number of movies per genre?
• What is the movie rating distribution per user?
• What are the movies with the highest average rating?
• The more interesting the questions, the deeper the answers.
• You can start small and grow to larger versions of the dataset (available in
https://grouplens.org/datasets/movielens)
• You should, at least, use HDFS and/or a cloud store (for sure) and Hive/Presto,
although you can explore other Hadoop components.
• Feel free to use Power BI or other visualization tool to present your insights.
Deliverables (1)

• The project shall be done by groups of up to 4 students. It is not
recommended that the project be done by a single person, but it
can be, under justifiable circumstances.
• You should present your assessment and conclusions in a set of
slides.
• You should include a video of the presentation (it may be just the
slides with your voice-over, or it may include demos).
• The maximum length of the video should be 10 minutes.
Deliverables (2)

• All the complementary information that you feel is important
should be put into background slides. The total number of
slides, the ones used for the presentation plus the ones used for
background info, should not exceed 30 (hopefully, less than
that).
• The slides should be delivered in PDF format, through Moodle.
• (Don’t forget to put your names and student numbers in each
deliverable.)

17
Status report

• You should submit a status report in Moodle at a set time


• The status report is the answer to a short questionnaire

18
Descriptors for evaluating the case analysis
Each item below lists the evaluation item, its descriptors, and its percentage of the total classification.

1. Quality and comprehensiveness of the questions in order to explore the dataset (15%). The questions:
a) explore the possibilities presented by the dataset,
b) give rise to relevant insights in the context.

2. Quality and scalability of the presented solution (40%). The solution:
a) is in accordance with the conditions of the proposed case,
b) gives a sustained answer to the questions referred to in point 1,
c) was created with performance optimization in mind,
d) is adaptable to larger datasets.

3. Quality of the presentation (20%). The presentation:
a) is fluid and clear,
b) uses language understandable to non-specialists,
c) is visually appealing.

4. Quality of the report, in the form of the presentation text and complementary slides (15%). The report:
a) presents the solution clearly,
b) indicates the main options taken,
c) indicates the limitations,
d) indicates possibilities for future work.

5. Respect for presentation time (10%). The time allocated to the group for the presentation was respected.

19
Logistics

• The delivery will be through Moodle


• This includes the videos (if too large, can be shared by
WeTransfer)
• You should upload your presentation as PDF, as well as any scripts
and .PBIX files, in one .ZIP file, if possible
• The name of the .ZIP file should include your numbers (Mxxxxxx)

20
What is Big Data?

21
Example of a modern Big Data stack: AWS

22
Example of a modern Big Data stack: Apache
Foundation projects

Source: Cloudera
23
Example: Azure server generations

Processor and memory per generation:
Gen 2: 2 x 6 Core 2.1 GHz, 32 GiB
Gen 3: 2 x 8 Core 2.1 GHz, 128 GiB
HPC: 2 x 12 Core 2.4 GHz, 128 GiB
Gen 4: 2 x 12 Core 2.4 GHz, 192 GiB
Godzilla: 2 x 16 Core 2.0 GHz, 512 GiB
Gen 5.1: 2 x 20 Core 2.3 GHz, 256 GiB
GPU: 2 x 8 Core 2.6 GHz, 256 GiB
Gen 5 “Beast”: 4 x 18 Core 2.5 GHz, 4096 GiB
Gen 6: 2 x Skylake 24 Core 2.7 GHz, 192 GiB DDR4

(The original table also lists per-generation hard drive, SSD, NIC (from 1 Gb/s up to 40 Gb/s, including InfiniBand and FPGA options) and GPU configurations.)

24
Example of a Big Data architecture in the
enterprise: Uber

25
Data science and Big Data

Data science is the extraction of useful knowledge directly from


data through a process of discovery, or of hypothesis formulation
and hypothesis testing.

A data scientist is a practitioner who has sufficient knowledge in


the overlapping regimes of business needs, domain knowledge,
analytical skills, and software and systems engineering to
manage the end-to-end data processes in the analytics life cycle.
Source: NIST

26
Class outline

• Syllabus
• Defining Big Data

27
Big Data characteristics

• Moore’s Law:
• In a 1965 paper, Gordon Moore
estimated that the density of
transistors on an integrated
circuit was doubling
every two years.
• The growth rates of data
volumes are estimated to be
faster than Moore’s Law, with
data volumes more than
doubling every eighteen
months.

Source: Intel 28
Source: https://ourworldindata.org/technological-progress 29
Nielsen’s Law

30
Kryder's law

• Kryder's law is the storage equivalent of Moore's Law: Seagate's


vice president of research said back in 2005 that magnetic disk
storage density doubles approximately every 18 months.
• That also means the cost of storage halves every eighteen
months, enabling online services to give us more storage
without charging any more for it.
• It's worth mentioning that SSDs aren't subject to Kryder's Law:
as they're solid state, Moore's Law is more relevant.

31
Big Data: what’s your own definition?

Exercise: try to write a comprehensive definition of Big Data.

32
Big Data NIST definition

NIST Definition: “Big Data consists of extensive


datasets⎯primarily in the characteristics of volume,
velocity, variety, and/or variability⎯that require a scalable
architecture for efficient storage, manipulation, and
analysis.”

Source: https://doi.org/10.6028/NIST.SP.1500-1r1
33
The origins of the Big Data “V’s”

• Doug Laney, 3D Data


Management: Controlling
Data Volume, Velocity, and
Variety, Meta Group, 2001

34
Parallelizing data handling

• Big Data refers to the need to parallelize the data handling in


data-intensive applications.
• Characteristics of Big Data that force new architectures:
• Volume (i.e., the size of the dataset);
• Velocity (i.e., rate of flow);
• Variety (i.e., data from multiple repositories, domains, or types); and
• Variability (i.e., the change in velocity or structure).

35
Big Data: the impacts of the 4 V’s in terms of
services architecture?

Exercise: try to write some of the impacts.

36
Volume

• The most commonly recognized characteristic of Big Data is the


presence of extensive datasets—representing the large amount of
data available for analysis to extract valuable information.
• Much of the advances from machine learning arise from those
techniques that process more data. As an example, object
recognition in images significantly improved when the numbers of
images that could be analyzed went from thousands into millions
through the use of scalable techniques.
• The time and expense required to process massive datasets was one
of the original drivers for distributed processing.
• Volume drives the need for processing and storage parallelism, and
its management during processing of large datasets.
37
Velocity

• Velocity is a measure of the rate of data flow. Traditionally, high-velocity


systems have been described as streaming data.
• Data in motion is processed and analyzed in real time, or near real time,
and must be handled in a very different way than data at rest (i.e.,
persisted data).
• Data in motion tends to resemble event-processing architectures, and
focuses on real-time or operational intelligence applications.
• The need for real-time data processing, even in the presence of large data
volumes, drives a different type of architecture where the data is not
stored, but is processed typically in memory.
• Time constraints for real-time processing can create the need for
distributed processing even when the datasets are relatively small—a
scenario often present in the Internet of Things (IoT).

38
Variety
• The variety characteristic represents the need to analyze data from
multiple repositories, domains, or types.
• The variety of data from multiple domains was previously handled
through the identification of features that would allow alignment of
datasets, and their fusion into a data warehouse.
• Distributed processing allows individual pre-analytics on different
types of data, followed by different analytics to span these interim
results.
• While volume and velocity allow faster and more cost-effective
analytics, it is the variety of data that allows analytic results that were
never possible before.
• Business benefits are frequently higher when addressing the variety
of data than when addressing volume.

39
Variability
• Variability refers to changes in a dataset, whether in the data flow rate,
format/structure, and/or volume, that impacts its processing.
• Impacts can include the need to refactor architectures, interfaces,
processing/algorithms, integration/fusion, or storage.
• Variability in data volumes implies the need to scale up or scale down virtualized
resources to efficiently handle the additional processing load, one of the
advantageous capabilities of cloud computing.
• Dynamic scaling keeps systems efficient, rather than having to design and
resource to the expected peak capacity (where the system at most times sits idle).
• It should be noted that this variability refers to changes in dataset characteristics,
whereas the term volatility refers to the changing values of individual data
elements.
• Since the latter does not affect the architecture—only the analytics—it is only
variability that affects the architectural design.

40
Structured, semi-structured, unstructured data

• Structured: Datasets where each record is consistently
structured and can be described efficiently in a relational model.
Records are conceptualized as the rows in a table where data
elements are in the cells.
• Semi-structured: Datasets in formats such as XML (eXtensible
Markup Language) or JavaScript Object Notation (JSON) that
have an overarching structure, but with individual elements that
are unstructured.
• Unstructured: Unstructured datasets, such as text, image, or
video, do not have a predefined data model or are not
organized in a predefined way.

41
Recommended reading

• Chang, W. L., & Boyd, D. (2018). NIST Big Data Interoperability


Framework: Volume 1, Big Data Reference Architecture (No.
Special Publication (NIST SP)-1500-1r1).

42
Class takeaways

• Syllabus
• Defining Big Data

43
Big Data Foundations

Week 2
Core Components Overview
Course agenda
• Defining Big Data
• Datacenter scale computing
• Hadoop Core Components
• Cloud Storage
• Batch Analysis with Hive/Presto
• Optimizing Hive/Presto
• Data collection
• Near-real time
• Real time
• Streaming Analysis
• Use case: Designing a modern Big Data solution

2
Class outline

• Defining Big Data (slides from the previous class)


• Scalability for Big Data
• Core Components Overview
• HDFS

3
Class outline

• Scalability for Big Data


• Core Components Overview
• HDFS

4
Wise words

In pioneer days they used oxen


for heavy pulling, and when one
ox couldn’t budge a log, they
didn’t try to grow a larger ox.
We shouldn’t be trying for
bigger computers, but for more
systems of computers.

Rear Admiral Grace Murray


Hopper, American computer
scientist

5
Vertical versus horizontal scaling

• The Big Data Paradigm consists of the distribution of data


systems across horizontally coupled, independent resources to
achieve the scalability needed for the efficient processing of
extensive datasets.
• Vertical Scaling is increasing the performance of data processing
through improvements to processors, memory, storage, or connectivity.
• Horizontal Scaling is increasing the performance of distributed data
processing through the addition of nodes in a cluster.

6
What do you call a few PB of free space?

Denis Serenyi, Google

7
What do you call a few PB of free space?

An emergency low disk space condition

Denis Serenyi, Google

8
Example: Google platform

• 130 trillion pages


• Index 100 PB (stacking 2TB drives up: 0.8 mile)
• 3 billion searches per day (or 35K per second)

Data sources:
https://www.google.com/insidesearch/howsearchworks/thestory
http://www.seobook.com/learn-seo/infographics/how-search-works.php
http://www.ppcblog.com/how-google-works

9
GFS (and Colossus)
• Google’s GFS is an example of a storage system with a simple file-like abstraction
(Google’s Colossus system has since replaced GFS, but follows a similar
architectural philosophy).
• GFS was designed to support the web search indexing system (the system
that turned crawled web pages into index files for use in web search), and
therefore focuses on high throughput for thousands of concurrent readers/writers
and robust performance under high hardware failure rates.
• GFS users typically manipulate large quantities of data, and thus GFS is further
optimized for large operations. The system architecture consists of a primary
server (master), which handles metadata operations, and thousands of
chunkserver (secondary) processes running on every server with a disk drive, to
manage the data chunks on those drives.
• In GFS, fault tolerance is provided by replication across machines instead of within
them, as is the case in RAID systems. Cross-machine replication allows the system
to tolerate machine and network failures and enables fast recovery, since replicas
for a given disk or machine can be spread across thousands of other machines.

Source: Barroso, L. A., Hölzle, U., & Ranganathan, P. (2018). 10


Warehouse Data Center Systems Stack

• Here are some terms used to describe the different software layers in
a typical WSC deployment.
• Platform-level software: The common firmware, kernel, operating system
distribution, and libraries expected to be present in all individual servers to
abstract the hardware of a single machine and provide a basic machine
abstraction layer.
• Cluster-level infrastructure: The collection of distributed systems software
that manages resources and provides services at the cluster level. Ultimately,
we consider these services as an operating system for a data center.
Examples are distributed file systems, schedulers and remote procedure call
(RPC) libraries, as well as programming models that simplify the usage of
resources at the scale of data centers, such as MapReduce, Dryad, Hadoop,
Sawzall, BigTable, Dynamo, Dremel, Spanner, and Chubby.

Source: Barroso, L. A., Hölzle, U., & Ranganathan, P. (2018). 11


Warehouse Data Center Systems Stack

• Application-level software: Software that implements a specific


service. It is often useful to further divide application-level software into
online services and offline computations, since they tend to have
different requirements. Examples of online services are Google Search,
Gmail, and Google Maps. Offline computations are typically used in
large-scale data analysis or as part of the pipeline that generates
the data used in online services, for example, building an index of
the web or processing satellite images to create map tiles for the
online service.
• Monitoring and development software: Software that keeps track of
system health and availability by monitoring application performance,
identifying system bottlenecks, and measuring cluster health.

12
Google Clusters at the beginning

Image source: Abhijeet Desai. "Google Cluster Architecture".


http://www.slideshare.net/abhijeetdesai/google-cluster-architecture 13
A server board, an accelerator board (Google’s
Tensor Processing Unit [TPU]), and a disk tray.

14
Source: Barroso, L. A., Hölzle, U., & Ranganathan, P. (2018).
Building blocks

15
Source: Barroso, L. A., Hölzle, U., & Ranganathan, P. (2018).
Optional reading

• Barroso, L. A., Hölzle, U., & Ranganathan, P. (2018). The


datacenter as a computer: Designing warehouse-scale
machines. Synthesis Lectures on Computer Architecture, 13(3), i-
189.

16
Class outline

• Scalability for Big Data


• Core Components Overview
• HDFS

17
The birth of the Big Data industry
• Many of the ideas that underpin the Apache Hadoop project are decades
old. Academia and industry have been exploring distributed storage and
computation since the 1960s.
• Real, practical, useful, massively scalable, and reliable systems simply could
not be found—at least not cheaply—until Google confronted the problem
of the internet in the late 1990s and early 2000s. Collecting, indexing, and
analyzing the entire web was impossible, using commercially available
technology of the time.
• Google dusted off the decades of research in large-scale systems. Its
architects realized that, for the first time ever, the computers and
networking they required could be had, at reasonable cost.
• Its work—on the Google File System (GFS) for storage and on the
MapReduce framework for computation—created the big data
industry.

Source: Kunigk, J., Buss, I., Wilkinson, P. & George, L. (2019). 18


The Hadoop stack

Source: Kunigk, J., Buss, I., Wilkinson, P. & George, L. (2019).


19
Source: Kunigk, J., Buss, I., Wilkinson, P. & George, L. (2019). 20
Source: Kunigk, J., Buss, I., Wilkinson, P. & George, L. (2019). 21
Demo

• Introducing our stack

22
Hadoop and modern platforms
• Once, the only storage system was the Hadoop Distributed File
System (HDFS), based on GFS.
• Today, HDFS is thriving, but there are plenty of other choices:
Amazon S3 or Microsoft Azure Data Lake Store (ADLS) for cloud
storage, for example, or Apache Kudu for IoT and analytic data.
• Similarly, MapReduce was originally the only option for analyzing
data. Now, users can choose among MapReduce, Apache Spark for
stream processing and machine learning workloads, SQL engines like
Apache Impala and Apache Hive, and more.
• All of these new projects have adopted the fundamental
architecture of Hadoop: large-scale, distributed, shared-nothing
systems, connected by a good network, working together to
solve the same problem.

Source: Kunigk, J., Buss, I., Wilkinson, P. & George, L. (2019). 23


Shared-Nothing
• Commodity servers connected via commodity networking
• DB storage is “strictly local” to each node
[Figure: Nodes 1 through K, each with its own CPU and memory (co-located compute and storage), connected by an interconnection network]

• Design scales extremely well

Source: David J. DeWitt, Willis Lang. “Data Warehousing in the Cloud – The Death of
Shared Nothing.” http://mitdbg.github.io/nedbday/2017/#program
24
Some misconceptions

• Data in Hadoop is schemaless


• One copy of the data
• One huge cluster

Source: Kunigk, J., Buss, I., Wilkinson, P. & George, L. (2019). 25


Some general trends

• Horizontal Scaling
• Adoption of Open Source
• Embracing Cloud Compute
• Decoupled Compute and Storage

26
Shared Nothing to Shared Storage

Image source: David J. DeWitt, Willis Lang. “Data Warehousing in the Cloud – The Death of
Shared Nothing.” http://mitdbg.github.io/nedbday/2017/#program
27
Why Shared Storage? Flexible Scaling

Image source: David J. DeWitt, Willis Lang. “Data Warehousing in the Cloud – The Death of
Shared Nothing.” http://mitdbg.github.io/nedbday/2017/#program
28
Apache Hadoop

• Hadoop is a data storage and processing platform, based upon


a central concept: data locality.
• Data locality refers to the processing of data where it
resides by bringing the computation to the data, rather
than the typical pattern of requesting data from its location
(for example, a database management system) and sending
this data to a remote processing system or host.
• With Internet-scale data it is no longer efficient, or even possible
in some cases, to move the large volumes of data required for
processing across the network at compute time.
29
Apache Hadoop

• Hadoop is known as a schema-on-read system.


• This means that it can store and process a wide range of data,
from unstructured text documents, to semi-structured JSON
(JavaScript Object Notation) or XML documents, to well-
structured extracts from relational database systems.

30
Apache Hadoop

• Schema-on-read systems are a fundamental departure from


the relational databases we are accustomed to, which are, in
contrast, broadly categorized as schema-on-write systems,
where data is typically strongly typed and a schema is
predefined and enforced upon INSERT, UPDATE, or UPSERT
operations.

31
Apache Hadoop

• Because the schema is not interpreted during write operations


to Hadoop, there are no indexes, statistics, or other constructs
typically employed by database systems to optimize query
operations and filter or reduce the amount of data returned to a
client.
• This further reinforces the need for data locality.
• Hadoop is designed to find needles in haystacks, doing so by
dividing and conquering a large problem into a set of smaller
problems applying the concepts of data locality and shared
nothing as previously introduced.
32
Inside Hadoop

• Access unstructured and semi-structured data (e.g., log files, social


media feeds, other data sources)
• Break the data up into “parts,” which are then loaded into a file system
made up of multiple nodes running on commodity hardware using
HDFS
• Each “part” is replicated multiple times and loaded into the file system
for replication and failsafe processing
• A node acts as the Facilitator and another as Job Tracker
• Jobs are distributed to the clients, and once completed the results are
collected and aggregated using MapReduce

33
HDFS

• Although Hadoop can interact with many different filesystems,


HDFS is Hadoop’s primary input data source and target for data
processing operations.
• There are several key design principles behind HDFS.
• These principles require that the filesystem
• is scalable (economically), is fault tolerant,
• uses commodity hardware, supports high concurrency,
• favors high sustained bandwidth over low latency random access.
The HDFS write process and how blocks are
distributed across DataNodes

Source: Kunigk, J., Buss, I., Wilkinson, P. & George, L. (2019).


35
Introduction to HDFS

• HDFS is a highly scalable, distributed, load-balanced,


portable, and fault-tolerant (with built-in redundancy at the
software level) storage component of Hadoop.
• In other words, HDFS is the underpinnings of the Hadoop
cluster.
• It provides a distributed, fault-tolerant storage layer for storing
Big Data in a traditional, hierarchical file organization of
directories and files.

36
Horizontal scalability

• HDFS is based on a scale-out model and can scale up to


thousands of nodes, for terabytes or petabytes of data.
• As the load increases, we can keep increasing nodes (or data
nodes) for additional storage and more processing power.

37
Fault tolerance

• HDFS assumes that failures (hardware and software) are


common and transparently ensures data redundancy (by
default, creating three copies of data: two copies on the same
rack and one copy on a different rack so that it can survive even
rack failure) to provide fault-tolerant storage.
• If one copy becomes inaccessible or gets corrupted, the
developers and administrators don’t need to worry about it—
the framework itself takes care of it transparently.

38
Fault tolerance

• In other words, instead of relying on hardware to deliver


high availability, the framework itself was designed to
detect and handle failures at the application layer.
• Hence, it delivers a highly reliable and available storage service
with automatic recovery from failure on top of a cluster of
machines, even if the machines (disk, node, or rack) are prone to
failure.

39
Class takeaways

• Scalability for Big Data


• Core Components Overview

40
Big Data Foundations
Week 3
Lab Hive
Core Components Overview 2
Class outline

• Lab
• HDFS architecture and access
•A

2
Hadoop Cluster Terminology
• A cluster is a group of computers working together
–Provides data storage, data processing, and resource management
• A node is an individual computer in the cluster
–Master nodes manage distribution of work and data to worker nodes
• A daemon is a program running on a node
–Each Hadoop daemon performs a specific function in the cluster

[Figure: a Hadoop cluster made up of master nodes and worker nodes]

5
Cluster Components

• Three main components of a cluster: Storage, Processing, and Resource Management
• Work together to provide distributed data processing
• We will start with the Storage component
– HDFS

6
HDFS Basic Concepts (1)
• HDFS is a filesystem written in Java
–Based on Google’s GFS
• Sits on top of a native filesystem
–Such as ext3, ext4, or xfs
• Provides redundant storage for massive amounts of data
–Using readily-available, industry-standard computers

HDFS

Native OS filesystem

Disk Storage

7
HDFS Basic Concepts (2)

• HDFS performs best with a ‘modest’ number of large files


–Millions, rather than billions, of files
–Each file typically 100MB or more
• Files in HDFS are ‘write once’
–No random writes to files are allowed
• HDFS is optimized for large, streaming reads of files
–Rather than random reads

8
How Files Are Stored

• Data files are split into 128MB blocks which are distributed at load time
• Each block is replicated on multiple data nodes (default 3x)
• NameNode stores metadata
[Figure: a very large data file is split into Blocks 1-4; each block is replicated on three DataNodes, while the NameNode stores the metadata (information about files and blocks)]
9
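As a rough illustration of the block placement shown above, the following Python sketch (a teaching aid, not the actual HDFS placement algorithm, which is rack-aware) splits a file into 128 MB blocks and assigns three replicas of each block to distinct DataNodes:

# Minimal sketch, not the real HDFS policy: split a file into 128 MB
# blocks and assign 3 replicas of each block to distinct DataNodes.
BLOCK_SIZE = 128 * 1024 * 1024   # 128 MB
REPLICATION = 3

def place_blocks(file_size_bytes, datanodes):
    num_blocks = -(-file_size_bytes // BLOCK_SIZE)   # ceiling division
    placement = {}
    for b in range(num_blocks):
        # naive round-robin; real HDFS places 2 replicas on one rack
        # and 1 on another
        placement[b] = [datanodes[(b + r) % len(datanodes)]
                        for r in range(REPLICATION)]
    return placement

print(place_blocks(400 * 1024 * 1024, ["A", "B", "C", "D", "E"]))
# 400 MB -> 4 blocks, each stored on 3 different DataNodes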
HDFS Secondary NameNode
The NameNode daemon must be running at all times
– If the NameNode stops, the cluster becomes inaccessible

HDFS classic mode


– One NameNode
– One “helper” node called the Secondary NameNode
– Bookkeeping, not backup
[Figure: NameNode and Secondary NameNode]

10
Hierarchical file organization

• HDFS is based on a traditional hierarchical file organization.


• A user or application can create directories or subdirectories and
store files inside.
• This means that we can create a file, delete a file, rename a file,
or move a file from one directory to another.
Fsimage

• All this information, along with information related to data


nodes and blocks stored in each of the data nodes, is recorded
in the file system namespace, called fsimage and stored as a file
on the local host OS file system at the name node daemon.
• This fsimage file is not updated with every addition or removal
of a block in the file system. Instead, the name node logs and
maintains these add/remove operations in a separate edit log
file, which exists as another file on the local host OS file system.
• Appending updates to a separate edit log achieves faster I/O.
Secondary name node

• A secondary name node is another daemon.


• Contrary to its name, the secondary name node is not a standby
name node, so it is not meant as a backup in case of name node
failure.
• The primary purpose of the secondary name node is to periodically
download the name node fsimage and edit the log file from the
name node, create a new fsimage by merging the older fsimage and
edit the log file, and upload the new fsimage back to the name node.
• By periodically merging the namespace fsimage with the edit log, the
secondary name node prevents the edit log from becoming too
large.
Checkpoint process

• The fsimage and the edit log file are central data structures that
contain HDFS file system metadata and namespaces.
• Any corruption of these files can cause the HDFS cluster
instance to become nonfunctional.
• For this reason, the name node can be configured to support
maintaining multiple copies of the fsimage and edit log to
another machine.
• This is where the secondary name node comes into play.
Checkpoint process

• The process of generating a


new fsimage from a merge
operation is called the
Checkpoint process.
• Usually the secondary name
node runs on a separate
physical machine from the
name node; it also requires
plenty of CPU and as much
memory as the name node to
perform the Checkpoint
operation.
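To make the fsimage/edit log mechanics concrete, here is a minimal, purely illustrative Python sketch (a toy model, not HDFS internals): file operations are appended to an edit log, and a checkpoint merges the log into a new fsimage and clears the log.

# Toy model of fsimage + edit log (illustration only, not HDFS code).
fsimage = {"/": set()}           # namespace: directory -> file names
edit_log = []                    # operations since the last checkpoint

def add_file(path):
    edit_log.append(("add", path))      # fast, append-only write

def remove_file(path):
    edit_log.append(("remove", path))

def checkpoint():
    # what the secondary name node does: merge the edit log into the
    # old fsimage to produce a new fsimage, then truncate the log
    for op, path in edit_log:
        if op == "add":
            fsimage["/"].add(path)
        else:
            fsimage["/"].discard(path)
    edit_log.clear()

add_file("031512.log")
add_file("042313.log")
checkpoint()
print(fsimage)    # the new fsimage now contains both files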
Example: Storing and Retrieving Files (1)
[Figure: two local files, /logs/031512.log and /logs/042313.log, are to be stored in an HDFS cluster of Nodes A through E]

16
Example: Storing and Retrieving Files (2)

[Figure: the NameNode metadata maps /logs/031512.log to blocks B1, B2, B3 and /logs/042313.log to blocks B4, B5; block locations are B1: A,B,D; B2: B,D,E; B3: A,B,C; B4: A,B,E; B5: C,E,D. The blocks are spread across Nodes A through E.]

17
Example: Storing and Retrieving Files (3)

[Figure: a client asks the NameNode for /logs/042313.log and receives the block list B4, B5]

18
Example: Storing and Retrieving Files (4)

[Figure: continuation of the retrieval example: knowing blocks B4 and B5, the client reads them from the DataNodes that hold them]

19
HDFS Set up for High Availability

Set up for High Availability


– Two NameNodes: Active and Standby
[Figure: Active NameNode and Standby NameNode]

20
Options for Accessing HDFS
From the command line
– FsShell: $ hdfs dfs

In Python
– Several libraries available, such as snakebite, by Spotify

Other programs
– Java API
– Used by Hadoop MapReduce, Impala, Hue, Sqoop, Flume, etc.
– RESTful interface

[Figure: a client uses put and get to move files between the local machine and the HDFS cluster]

21
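Scripts can also wrap the same FsShell commands. A minimal sketch, assuming the hdfs client is installed and configured on the machine running the script:

# Minimal sketch: call the FsShell client from Python via subprocess.
# Assumes the `hdfs` command is on the PATH and points at the cluster.
import subprocess

def hdfs_ls(path="/"):
    result = subprocess.run(["hdfs", "dfs", "-ls", path],
                            capture_output=True, text=True, check=True)
    return result.stdout

print(hdfs_ls("/user"))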
Demo

• A few HDFS commands


• Put command
• List command
• Get command
• Cat command
• Tail command
• Text command
• Make directory command
• Move command
• Change file permissions
• Remove command
• Remove recursively command
22
HDFS Command Line Examples (1)
Copy file foo.txt from local disk to the user’s directory in HDFS

$ hdfs dfs -put foo.txt foo.txt

– This will copy the file to /user/username/foo.txt


Get a directory listing of the user’s home directory in HDFS

$ hdfs dfs -ls

Get a directory listing of the HDFS root directory

$ hdfs dfs -ls /

Further reference: https://hadoop.apache.org/docs/r2.8.5/hadoop-project-dist/hadoop-hdfs/HDFSCommands.html


23
HDFS Command Line Examples (2)
Display the contents of the HDFS file /user/fred/bar.txt

$ hdfs dfs -cat /user/fred/bar.txt

Copy that file to the local disk, named as baz.txt

$ hdfs dfs -get /user/fred/bar.txt baz.txt

Create a directory called input under the user’s home directory

$ hdfs dfs -mkdir input

Note: copyFromLocal is a synonym for put; copyToLocal is a synonym for get

24
HDFS Command Line Examples (3)

Delete the directory input_old and all its contents

$ hdfs dfs -rm -r input_old

25
The Hue HDFS File Browser
• The File Browser in Hue (Hadoop User Experience) lets you view and manage your
HDFS directories and files
–Create, move, rename, modify, upload, download and delete directories and files
–View file contents

26
Hue in AWS

• Graphical front-end for


applications on your EMR
cluster
• IAM integration: Hue Super-
users inherit IAM roles
• S3: Can browse & move data
between HDFS and S3

27
HDFS Recommendations
• HDFS is a repository for all your data
–Structure and organize carefully!
• Best practices include
–Define a standard directory structure
–Include separate locations for staging data
• Example organization
–/user/… – data and configuration belonging only to a single user
–/etl – Work in progress in Extract/Transform/Load stage
–/tmp – Temporary generated data shared between users
–/data – Data sets that are processed and available across the
organization for analysis
–/app – Non-data files such as configuration, JAR files, SQL files, etc.

28
AWS Elastic MapReduce (EMR)

• Managed Hadoop framework


on EC2 instances
• Includes Spark, HBase, Presto,
Flink, Hive & more
• EMR Notebooks
• Several integration points with
AWS

29
An EMR Cluster

• Master node: manages the


cluster
• Single EC2 instance
• Core node: Hosts HDFS data and
runs tasks
• Can be scaled up & down, but
with some risk
• Task node: Runs tasks, does not
host data
• No risk of data loss when
removing

30
EMR Storage

• HDFS
• EMRFS: access S3 as if it were HDFS
• EMRFS Consistent View – Optional for S3 consistency
• Uses DynamoDB to track consistency
• Local file system
• EBS for HDFS

31
Class outline

• Lab
• HDFS architecture and access
•A

32
Review questions

• In terms of storage, what does a name node contain and what


do data nodes contain?
• What were the key design strategies for HDFS to become fault
tolerant?
• What is a checkpoint, and who performs this operation?
• What is the default data block placement policy?
• How does a name node ensure that all the data nodes are
functioning properly?
• To what does the term data locality refer?
33
Big Data Foundations
Week 4

Core Components Overview 3


Course agenda
• Defining Big Data
• Datacenter scale computing
• Hadoop Core Components: HDFS, MapReduce, YARN, TEZ
• Cloud Storage
• Batch Analysis with Hive
• Optimizing Hive
• Data collection
• Near-real time
• Real time
• Streaming Analysis
• Use case: Designing a modern Big Data solution

2
Class outline

• Lab: Oil import prices with Hive,


• MapReduce (with a sideline on YARN)
• TEZ

•A

3
Lab: Oil import prices
analysis with Hive

4
Recommended reading

• Kunigk, J., Buss, I., Wilkinson, P. & George, L. (2019). Architecting


Modern Data Platforms. O’Reilly.
• (Chapter 1, mainly the parts related to HDFS, MapReduce, YARN, Spark,
as well as Hive, Impala and Kafka, related to future classes).

5
Class outline

• Lab: Oil import prices with Hive, Part 2


• MapReduce (with a sideline on YARN)
• TEZ

•A

6
Why MapReduce?

• Distributes the processing of data on the cluster


• Divides data up into partitions that are MAPPED (transformed)
and REDUCED (aggregated) by mapper and reducer functions
defined by the user
• Resilient to failure – an application master monitors the
mappers and reducers on each partition
MapReduce can refer to…

• The programming model


• The execution framework (aka “runtime”)
• The specific implementation

• Usage is usually clear from context!

8
Divide and Conquer

“Work”
Partition

w1 w2 w3

worker worker worker

r1 r2 r3

“Result” Aggregate

Source: Jimmy Lin, University of Waterloo


Parallelization Challenges
• How do we assign work units to workers?
• What if we have more work units than workers?
• What if workers need to communicate partial results?
• What if workers need to access shared resources?
• How do we know when a worker has finished? (Or is simply waiting?)
• What if workers die?

Difficult because:
• We don’t know the order in which workers run…
• We don’t know when workers interrupt each other…
• We don’t know when workers need to communicate partial results…
• We don’t know the order in which workers access shared resources…

What’s the common theme of all of these challenges?


Source: Jimmy Lin, University of Waterloo
Common Theme?

Parallelization challenges arise from:


• Need to communicate partial results
• Need to access shared resources

(In other words, sharing state)

How do we tackle these challenges?

Source: Jimmy Lin, University of Waterloo


“Current” Tools

Basic primitives
Semaphores (lock, unlock)
Conditional variables (wait, notify, broadcast)
Barriers

Awareness of Common Problems


Deadlock, livelock, race conditions...
Dining philosophers, sleeping barbers, cigarette smokers...

Source: Jimmy Lin, University of Waterloo


“Current” Tools
Programming Models
– Shared memory
– Message passing
[Figure: processes P1 through P5 either sharing a common memory or passing messages to each other]

Design Patterns
– Producer/consumer
– Coordinator with a work queue and workers

Source: Jimmy Lin, University of Waterloo


When Theory Meets Practice
Concurrency is already difficult to reason about…

Now throw in:


• The scale of clusters and (multiple) datacenters
• The presence of hardware failures and software bugs
• The presence of multiple interacting services

The reality:
• Lots of one-off solutions, custom code
• Write your own dedicated library, then program with it
• Burden on the programmer to explicitly manage everything

Bottom line: it’s hard!


Source: Jimmy Lin, University of Waterloo
Source: Ricardo Guimarães Herrmann
MapReduce Implementations

Google has a proprietary implementation in C++


Bindings in Java, Python

Hadoop provides an open-source implementation in Java


Development begun by Yahoo, later an Apache project
Used in production at Facebook, Twitter, LinkedIn, Netflix, …
Large and expanding software ecosystem
Potential point of confusion: Hadoop is more than MapReduce today

Lots of custom research implementations

Source: Jimmy Lin, University of Waterloo


Hadoop 1.0 to Hadoop 2.0

17
MR works like a UNIX Command Sequence

• grep | sort | count myfile.txt


• will produce a word count of the text document called myfile.txt.

18
grep | sort | count

19
Exercise

• Create an algorithm to count the words of the following text


using parallel processing (!)

• “We are going to a picnic near our house. Many of our friends
are coming. You are welcome to join us. We will have fun.”

• Hint: Split the text in sentences.

20
WordCount Example: Myfile.txt

• Myfile.txt : We are going to a picnic near our house. Many of our


friends are coming. You are welcome to join us. We will have fun.
• Split into a few equal segments. Could be done with each sentence
as a separate piece of text. The four segments will look as following:
• Segment1: We are going to a picnic near our house.
• Segment2: Many of our friends are coming.
• Segment3: You are welcome to join us.
• Segment4: We will have fun.

• Thus there will be 4 Map tasks, one for each segment of data.

21
Results of Map on each segment

Segment 1: (we, 1), (are, 1), (going, 1), (to, 1), (a, 1), (picnic, 1), (near, 1), (our, 1), (house, 1)
Segment 2: (many, 1), (of, 1), (our, 1), (friends, 1), (are, 1), (coming, 1)
Segment 3: (you, 1), (are, 1), (welcome, 1), (to, 1), (join, 1), (us, 1)
Segment 4: (we, 1), (will, 1), (have, 1), (fun, 1)

22
Sorted results of Map operations

Segment 1: (a, 1), (are, 1), (going, 1), (house, 1), (near, 1), (our, 1), (picnic, 1), (to, 1), (we, 1)
Segment 2: (are, 1), (coming, 1), (friends, 1), (many, 1), (of, 1), (our, 1)
Segment 3: (are, 1), (join, 1), (to, 1), (us, 1), (welcome, 1), (you, 1)
Segment 4: (fun, 1), (have, 1), (we, 1), (will, 1)

23
Results after Reduce phase

Key Value
a 1
are 3
coming 1
friends 1
fun 1
going 1
have 1
house 1
join 1
many 1
near 1
of 1
our 2
picnic 1
to 2
us 1
we 2
welcome 1
will 1
you 1

24
Other example

25
Recap: Map, shuffle, reduce
• Essentially, MapReduce divides a computation into three sequential stages:
map, shuffle, and reduce.
• In the map phase, the relevant data is read from HDFS and processed in
parallel by multiple independent map tasks.
• These tasks should ideally run wherever the data is located—usually we
aim for one map task per HDFS block.
• The user defines a map() function (in code) that processes each record in
the file and produces key-value outputs ready for the next phase.
• In the shuffle phase, the map outputs are fetched by MapReduce and
shipped across the network to form input to the reduce tasks.
• A user-defined reduce() function receives all the values for a key in turn
and aggregates or combines them into fewer values which summarize the
inputs.

Source: Kunigk, J. et al (2019)


26
The following processes take place during
MapReduce

• The client sends a request for a task.


• The NameNode allocates DataNodes (individual servers) that will perform the
map operation and ones that will perform the reduce operation.
• Note that the selection of the DataNode server is dependent upon whether the
data that is required for the operation is local to the server. The servers where
the data resides can only perform the map operation.
• DataNodes perform the map phase and produce key-value (k,v) pairs.
• As the mapper produces the (k,v) pairs, they are sent to these reduce nodes based
on the keys the node is assigned to compute. The allocation of keys to servers is
dependent upon a partitioner function, which could be as simple as a hash value
of the key (this is default in Hadoop).
• Once the reduce node receives its set of data corresponding to the keys it is
responsible to compute on, it applies the reduce function and generates the final
output.

27
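The default hash partitioner mentioned above can be sketched in a few lines of Python (illustration only; Hadoop's actual default is a Java HashPartitioner):

# Illustrative partitioner: decide which reduce node receives a key.
def partition(key, num_reducers):
    return hash(key) % num_reducers

# every occurrence of the same key is routed to the same reducer
for word in ["we", "are", "going", "we"]:
    print(word, "-> reducer", partition(word, 3))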
Another example in diagram format

28
Pseudo-code for WordCount
map(String key, String value):
  // key: document name; value: document contents
  for each word w in value:
    EmitIntermediate(w, "1");

reduce(String key, Iterator values):
  // key: a word; values: a list of counts
  int result = 0;
  for each v in values:
    result += ParseInt(v);
  Emit(AsString(result));
29
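A runnable Python equivalent of this pseudo-code, simulating the map, shuffle, and reduce phases in a single process (a teaching sketch, not an actual Hadoop job):

# Teaching sketch: simulate map, shuffle, and reduce for WordCount.
from collections import defaultdict

def map_phase(document):
    # emit (word, 1) for every word, like EmitIntermediate(w, "1")
    return [(w.strip(".,").lower(), 1) for w in document.split()]

def shuffle(pairs):
    # group all values by key, as the framework does between map and reduce
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    return key, sum(values)          # like result += ParseInt(v)

text = ("We are going to a picnic near our house. Many of our friends "
        "are coming. You are welcome to join us. We will have fun.")
counts = dict(reduce_phase(k, v) for k, v in shuffle(map_phase(text)).items())
print(counts["are"], counts["we"], counts["our"])   # 3 2 2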
MapReduce as the core of Google Search for many years

Source: Maheshwari, Anil (2017). Big Data, McGraw-Hill Education. 30


Limitations…

• Urs Hölzle, senior vice president of technical infrastructure at the


Mountain View, California-based giant, said MapReduce got too
cumbersome once the size of the data reached a few petabytes.

Source: https://www.datacenterknowledge.com/archives/2014/06/25/
31
MapReduce at Google Search: long tenure

32
Hive and MapReduce

33
A few words about YARN: Yet Another
Resource Negotiator
• YARN is a large-scale, distributed operating system for Big Data
applications.
• It manages resources and monitors workloads, in a secure multi-
tenant environment, while ensuring high availability across
multiple Hadoop clusters.
• YARN is a common platform to run multiple tools and
applications such as interactive SQL (e.g. Hive), real-time
streaming (e.g. Spark), and batch processing (MapReduce), etc.
YARN application execution

Three clients run applications with different resource demands, which are translated into different-sized
containers and spread across the NodeManagers for execution.
Source: Kunigk, J. et al (2019)
35
YARN

• YARN runs a daemon on each worker node, called a NodeManager,


which reports in to a master process, called the ResourceManager.
• Each NodeManager tells the ResourceManager how much compute
resource (in the form of virtual cores, or vcores) and how much
memory is available on its node.
• Resources are parceled out to applications running on the cluster in
the form of containers, each of which has a defined resource
demand—say, 10 containers each with 4 vcores and 8 GB of RAM.
• The NodeManagers are responsible for starting and monitoring
containers on their local nodes and for killing them if they exceed
their stated resource allocations.

Source: Kunigk, J. et al (2019)


36
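A toy sketch of the bookkeeping just described (purely illustrative, not the YARN scheduler): each NodeManager advertises vcores and memory, and container requests are granted only while capacity remains.

# Illustration only: NodeManagers report capacity, containers consume it.
class NodeManager:
    def __init__(self, name, vcores, memory_gb):
        self.name, self.vcores, self.memory_gb = name, vcores, memory_gb

    def try_allocate(self, vcores, memory_gb):
        if self.vcores >= vcores and self.memory_gb >= memory_gb:
            self.vcores -= vcores
            self.memory_gb -= memory_gb
            return True
        return False

nodes = [NodeManager("worker1", 16, 64), NodeManager("worker2", 16, 64)]
granted = 0
for _ in range(10):                  # an application asks for 10 containers,
    for node in nodes:               # each needing 4 vcores and 8 GB of RAM
        if node.try_allocate(4, 8):
            granted += 1
            break
print("containers granted:", granted)   # 8: vcores run out after 4 per node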
YARN

• An application that needs to run computations on the cluster must


first ask the ResourceManager for a single container in which to run
its own coordination process, called the ApplicationMaster (AM).
• Despite its name, the AM actually runs on one of the worker
machines.
• ApplicationMasters of different applications will run on different
worker machines, thereby ensuring that a failure of a single worker
machine will affect only a subset of the applications running on the
cluster.
• Once the AM is running, it requests additional containers from the
ResourceManager to run its actual computation.

Source: Kunigk, J. et al (2019)


37
YARN

• The ResourceManager runs a special thread, which is


responsible for scheduling application requests and ensuring
that containers are allocated equitably between applications and
users running applications on the cluster.
• This scheduler strives to allocate cores and memory fairly
between tenants.
• Tenants and workloads are divided into hierarchical pools, each
of which has a configurable share of the overall cluster
resources.

Source: Kunigk, J. et al (2019)


38
Class outline

• Lab: Oil import prices with Hive, Part 2


• MapReduce
• TEZ

•A

39
Apache Tez

• The MapReduce framework has been designed as a batch-mode


processing platform running over a large amount of data, but this
does not fit well with some important current use cases that expect
near-real-time query processing and response.
• This is why Apache Tez, a distributed execution framework, came into
the picture.
• Apache Tez expresses computations as a data flow graph and
allows for dynamic performance optimizations based on real
information about the data and the resources required to
process it.
• It meets demands for fast response time and extreme throughput at a
petabyte scale.
Apache Tez

• Projects such as Apache Hive or Apache Pig (or even


MapReduce) can leverage the Apache Tez engine to execute a
complex DAG of tasks to process data that earlier required
multiple MapReduce jobs (in a Map-Reduce -> Map-Reduce
pattern; in which the Mapper of the next step takes input from
the Reducer of the previous step via intermediate storage of
data in HDFS).

41
Apache Tez

• Originally developed by Hortonworks, Apache Tez is an engine


built on top of Apache Hadoop YARN to provide the capability
to build an application framework that allows for a complex
directed acyclic graph (DAG) of tasks for high-performance
data processing in either batch or interactive mode.
• A DAG consists of a graph with finitely many vertices and edges (also
called arcs), with each edge directed from one vertex to another, such
that there is no way to start at any vertex v and follow a consistently-
directed sequence of edges that eventually loops back to v again.

42
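The "no way to loop back" property of a DAG can be checked mechanically. A small generic sketch (not the Tez DAG API) that verifies a task graph contains no directed cycle:

# Generic sketch: verify that a task graph is a DAG (no directed cycles).
def is_dag(edges):
    # edges: {vertex: [vertices it points to]}
    WHITE, GRAY, BLACK = 0, 1, 2
    color = {v: WHITE for v in edges}

    def visit(v):
        color[v] = GRAY                              # on the current DFS path
        for w in edges.get(v, []):
            if color.get(w, WHITE) == GRAY:          # back edge -> cycle
                return False
            if color.get(w, WHITE) == WHITE and not visit(w):
                return False
        color[v] = BLACK
        return True

    return all(visit(v) for v in edges if color[v] == WHITE)

print(is_dag({"map1": ["reduce1"], "map2": ["reduce1"], "reduce1": []}))  # True
print(is_dag({"a": ["b"], "b": ["a"]}))                                   # False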
Side note: SEM as DAG

43
The MapReduce framework versus the Tez
framework for processing.

44
Apache Tez

• Tez provides a distributed parallel-execution framework that


negotiates resources from the Resource Manager (a component
of YARN) and ensures the recovery of failed steps inside a job.
• It is capable of horizontal scaling, provides resource elasticity,
and comes with a shared library of ready-to-use components
(DAG API and Runtime API).

45
Apache Tez

• The two main components of Tez are:


• A master for the data processing application, with which you can put
together arbitrary data-processing tasks into a job represented as a DAG,
to process data as desired, using the DAG API.
• The data processing pipeline engine, where you can plug in input,
processing, and output implementations to perform arbitrary data
processing, using the Runtime API.
• The Runtime API provides the interfaces through which the runtime
framework and the user application code interact with each other.

46
Apache Tez

• Every task in Tez has the following triplet to specify what


actually executes in each task on the cluster nodes:
• Input (key/value pairs)—Reads the data correctly from native format
and converts it into the data format that the processor can understand
• Processor—Applies your business logic on the data received as input
by applying filters, transformations, and so on
• Output—Collects the processed key/value pairs and writes the output
correctly so that end users or the next consumer/input can consume it

47
Apache Tez

• Benefits of using the Tez execution framework:


• Eliminated storage of intermediate output to HDFS—(By default, 3x
replication for intermediate data in HDFS causes overhead.) With Tez,
either the next consumer or the next input directly consumes the
output, without writing an intermediate result to the HDFS.
• Minimized queue length—Job launch overhead of workflow jobs is
eliminated, as is queue and resource contention suffered by workflow
jobs (or to launch an application master) that are started after a
predecessor job completes.
• Efficient use of resources—Tez reuses resource containers to
maximize utilization and minimize the overhead of initializing them
every time. In various cases, it can prelaunch and prewarm the
container, for better performance.

48
Apache Tez

• Higher developer productivity—Application developers can focus on


application business logic instead of Hadoop internals. (Tez takes care
of Hadoop internals anyway.)
• Reduced network usage—Tez uses new data shuffle and movement
patterns for better data locality and minimal data transfer.
• Simple deployment—Tez is a completely client-side application that
leverages YARN local resources and distributed cache. You can simply
upload Tez libraries to HDFS and then use your Tez client to submit
requests with those libraries. Tez allows side-by-side execution, in which
you can have two or more versions of the libraries on your cluster. This
helps evaluate a new release on the same cluster side by side where the
older version is used in production workloads.

49
Apache Tez

• Traditionally, Hive and Pig use MapReduce as their execution


engine. Any query produces a graph of MapReduce jobs
potentially interspersed with some local/client-side work.
• This leads to many inefficiencies in the planning and execution
of queries.
• However, the latest releases of Apache Hive and Apache Pig
have been updated to leverage the Tez distributed
execution framework as well; hence, when you run Hive
or Pig jobs leveraging Tez, you will notice significant
improvements in response times.
50
When you can choose: Hive on Tez

• Hive on Tez performs better:


• Cost-based optimization plans—A cost-based optimization (CBO)
engine uses statistics (table and column levels) within Hive tables to
produce more efficient and optimal query plans. This eventually leads
to better cluster utilization.
• Performance improvements of vectorized queries—When you
enable the vectorization feature, it fetches data in batches of 1000 rows
at a time instead of 1 row at a time for processing, enabling improved
cluster utilization and faster processing. For now, the vectorization
feature works only on Hive tables with the Optimized Row Columnar
(ORC) file format.

51
Spark versus Tez

52
Class takeaways

• Lab: Oil import prices with Hive


• MapReduce (with a sideline on YARN)
• TEZ

•A

53
BIG DATA
FOUNDATIONS
Week 5

Data analysis with Hive


COURSE AGENDA
• Defining Big Data
• Datacenter scale computing
• Hadoop Core Components: HDFS, MapReduce, YARN, TEZ
• Cloud Storage
• Batch Analysis with Hive
• Optimizing Hive
• Data collection
• Near-real time
• Real time
• Streaming Analysis
• Use case: Designing a modern Big Data solution

2
CLASS OUTLINE

1. From GFS/HDFS to object storage


2. Apache Hive

3
FROM GFS/HDFS TO OBJECT
STORAGE

4
FROM GFS/HDFS TO MODERN APPROACHES

• Widening gap between network and disk performance


• Disk locality less relevant ► Design simplification
• Flash brings further questions
• Flash performance can achieve over 100x the throughput of a disk drive
• Can saturate high performing network ports (40 Gb/s)
• Network will evolve to match
• but
• NVM (non-volatile memory) will provide higher bandwidth and sub-
microsecond access latency

Source: Barroso, L. A., Hölzle, U., & Ranganathan, P. (2018). 5


OBJECT STORAGE

• Object storage takes each piece of data and designates it as an


object.
• Data is kept in separate storehouses and is bundled with
associated (custom, rich) metadata and a unique identifier to
form a storage pool.
• Examples:
• Amazon Web Services/AWS Simple Storage Service/S3
• Microsoft Azure Blob Storage
• Rackspace Files (code was donated to Openstack project and released
as OpenStack Swift)
• Google Cloud Storage

6
AWS S3

• Amazon S3 allows the storage of objects (files) in “buckets”


(directories)
• Buckets must have a globally unique name
• Buckets are defined at the region level
• Naming convention
• No uppercase
• No underscore
• 3-63 characters long
• Not an IP
• Must start with lowercase letter or number

7
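Those naming rules can be sketched as a simple validator (illustrative only; AWS enforces additional rules not encoded here):

# Illustrative check of the S3 bucket-name rules listed above.
import re

def looks_like_valid_bucket_name(name):
    if not 3 <= len(name) <= 63:                        # 3-63 characters
        return False
    if re.fullmatch(r"\d{1,3}(\.\d{1,3}){3}", name):    # must not be an IP
        return False
    # lowercase letters, digits, dots and hyphens only (no uppercase,
    # no underscore), starting with a lowercase letter or a digit
    return re.fullmatch(r"[a-z0-9][a-z0-9.-]*", name) is not None

print(looks_like_valid_bucket_name("my-data-lake-2024"))   # True
print(looks_like_valid_bucket_name("My_Bucket"))           # False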
RICH CONFIGURATION OPTIONS

• Amazon S3 Standard - General Purpose


• Amazon S3 Standard-Infrequent Access (IA)
• Amazon S3 One Zone-Infrequent Access
• Amazon S3 Intelligent Tiering
• Amazon Glacier

8
AWS S3 STANDARD

• High durability (99.999999999%) of objects across multiple AZ


• If you store 10,000,000 objects with Amazon S3, you can on average
expect to incur a loss of a single object once every 10,000 years
• 99.99% Availability over a given year
• Sustain 2 concurrent facility failures

• Use cases: Analytics, mobile applications, content distribution…

9
AWS GLACIER
• Low-cost object storage meant for archiving / backup
• Data is retained for the longer term (10s of years)
• Alternative to on-premise magnetic tape storage
• Average annual durability is 99.999999999%
• Very low cost per storage per month + retrieval cost
• Each item in Glacier is called “Archive” (up to 40TB)

• Archives are stored in “Vaults”

• 3 retrieval options:
• Expedited (1 to 5 minutes retrieval)
• Standard (3 to 5 hours)
• Bulk (5 to 12 hours)

10
S3 LIFECYCLE RULES
• Rules to move data between different tiers, to save storage cost
• General Purpose ► Infrequent Access ► Glacier

• Transition actions
• It defines when objects are transitioned to another storage class.
• Example: We can choose to move objects to the Standard IA class 60 days after
they are created, or move them to Glacier for archiving after 6 months.
• Moving to Glacier is helpful for backup / long term retention / regulatory
needs.

• Expiration actions
• Helps to configure objects to expire (be deleted) after a certain time period.
• Example: Access log files can be set to delete after a specified period of time.

11
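With the AWS SDK for Python (boto3), a rule like the example above could be sketched roughly as follows; the bucket name and prefix are made up for illustration, and the exact rule structure should be checked against the boto3 documentation:

# Rough sketch (verify against the boto3 docs): move objects under logs/
# to Standard-IA after 60 days, to Glacier after 180, expire after 365.
import boto3

s3 = boto3.client("s3")
s3.put_bucket_lifecycle_configuration(
    Bucket="my-example-bucket",              # hypothetical bucket name
    LifecycleConfiguration={
        "Rules": [{
            "ID": "archive-logs",
            "Status": "Enabled",
            "Filter": {"Prefix": "logs/"},   # hypothetical prefix
            "Transitions": [
                {"Days": 60, "StorageClass": "STANDARD_IA"},
                {"Days": 180, "StorageClass": "GLACIER"},
            ],
            "Expiration": {"Days": 365},
        }]
    },
)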
APACHE HIVE

12
INTRODUCTION TO APACHE HIVE

• Hive, developed at Facebook, now an Apache project, can be


seen as a SQL abstraction layer over Hadoop MapReduce with a
SQL-like query engine.

• Hive enables the writing of data processing logic or queries in a


SQL-like declarative language, called HiveQL, that is similar to
SQL.

• All documentation at: http://hive.apache.org.


HIVE USE CASES

• Hive is designed for a data warehouse type of workload


• Best suited for batch jobs over large sets of append-only data (such as
web logs).
• Does not work well with an OLTP type of workload that requires real-
time queries or row-level updates.

14
INTRODUCTION TO APACHE HIVE

• When a HiveQL query is executed, Hive translates the query into


MapReduce, saving the time and effort of writing actual
MapReduce jobs.
• Then Hive executes the query.

• Beware: a Hadoop cluster is not an RDBMS

15
HIVE ARCHITECTURE

External Interfaces — CLI, WebUI, JDBC,


ODBC programming interfaces

Thrift Server — Cross Language service


framework

Metastore — Meta data about the Hive


tables, partitions

Driver — Brain of Hive. Compiler,


Optimizer and Execution engine
HIVE THRIFT SERVER

• Framework for cross language services


• Server written in Java
• Support for clients written in different languages
- JDBC (java), ODBC (C++), PHP, Perl, Python
METASTORE

• System catalog which contains metadata about the Hive tables


• Stored in an RDBMS or the local filesystem. HDFS is too slow (not optimized for random access)
• Objects of Metastore
➢ Database — Namespace of tables
➢ Table — list of columns, types, owner, storage, SerDes
➢ Partition — Partition specific column, Serdes and storage
HIVE DRIVER

• Driver — Maintains the lifecycle of HiveQL statement


• Query Compiler — Compiles HiveQL into a DAG of MapReduce tasks
• Executor — Executes the tasks plan generated by the compiler in proper
dependency order. Interacts with the underlying Hadoop instance
HIGH-LEVEL HIVE QUERY EXECUTION FLOW

20
HIGH-LEVEL HIVE QUERY EXECUTION FLOW

21
HIGH-LEVEL HIVE QUERY EXECUTION FLOW
• Step-1: Execute Query – An interface of Hive, such as the command line or the web user interface, delivers the
query to the driver for execution. The UI calls the execute interface of the driver (for example, over ODBC or JDBC).
• Step-2: Get Plan – The driver creates a session handle for the query and transfers the query to the
compiler to build the execution plan. In other words, the driver interacts with the compiler.
• Step-3: Get Metadata – The compiler sends a metadata request to the metastore to obtain the
metadata it needs.
• Step-4: Send Metadata – The metastore returns the metadata to the compiler.
• Step-5: Send Plan – The compiler sends the execution plan it has built back to the driver.
• Step-6: Execute Plan – The driver sends the execution plan to the execution engine, which runs the job,
including any DFS (metadata) operations.
• Step-7: Fetch Results – When the results are retrieved from the data nodes, the execution engine returns
them to the driver.
• Step-8: Send Results – The driver sends the results to the user interface (UI).

22
HIVE “SCHEMA ON READ”

• Hive abstraction layer allows for data to be exposed as


structured tables that supports both simple and complex ad hoc
queries using a query language called HiveQL for data
summarization, ad hoc queries, and analysis.

• Hive follows a “schema on read” approach, unlike an RDBMS,


which enforces “schema on write.”

23
HIVE TABLES

• Hive tables are analogous to tables in relational databases.


Tables can be filtered, projected, joined and unioned.
• Additionally, all the data of a table is stored in a directory in
HDFS or an object in an object store.
• Hive also supports the notion of external tables wherein a
table can be created on preexisting files or directories in
HDFS by providing the appropriate location to the table
creation DDL.
• The rows in a table are organized into typed columns similar to
Relational Databases.
24
INTERNAL AND EXTERNAL TABLES IN HIVE

Internal tables
• Create the table and load the data: data on schema.
• Both data and schema will be removed if the table is dropped.
• We use internal tables:
✓ When data is temporary
✓ When data is not needed after deletion
✓ When Hive is using the table data completely, not allowing any external sources to use the table.

External tables
• In this case the data is already available on HDFS and the table is created on that HDFS data: schema on data.
• At the time of dropping the table, only the schema will be dropped; the data will still be available in HDFS as before.
• It also provides the option of creating multiple schemas for the data stored in HDFS, instead of deleting the data every time the schema is updated.
• We use external tables:
✓ When data is available in HDFS
✓ When files are being used outside of Hive
INTERNAL TABLES IN HIVE

• The default option is to create a Hive internal table.


• Directories for internal tables are managed by Hive, and a DROP
TABLE statement for an internal table will delete the
corresponding files from HDFS.
• Internal table (folders are deleted when the table is dropped); default location (/hive/warehouse/table1):

CREATE TABLE table1
(col1 STRING,
 col2 INT)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ' ';

26
INTERNAL TABLES IN HIVE WITH LOCATION
SPECIFIED

• Stored in a custom location (but still internal, so the folder is deleted when the table is dropped):

CREATE TABLE table2
(col1 STRING,
 col2 INT)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ' '
STORED AS TEXTFILE LOCATION '/data/table2';

27
EXTERNAL TABLES IN HIVE

• External tables are created by specifying the keyword


EXTERNAL in the CREATE TABLE statement.
• This provides the schema and location for the object in HDFS,
but a DROP TABLE operation does not delete the directory and
files.
• External table (folders and files are left intact when the table is dropped):

CREATE EXTERNAL TABLE table3
(col1 STRING,
 col2 INT)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ' '
STORED AS TEXTFILE LOCATION '/data/table3';

28
CLASS TAKEAWAY

From GFS/HDFS to object storage


Introduction to Apache Hive

29
BIG DATA
FOUNDATIONS
Week 6

Optimizing Hive
LAB: Hive project best practices
WHERE ARE WE?
• Defining Big Data
• Datacenter scale computing
• Hadoop Core Components: HDFS, MapReduce, YARN, TEZ
• Cloud Storage
• Batch Analysis with Hive
• Optimizing Hive
• Data collection
• Real time
• Near-real time
• Batch
• Streaming Analysis
• Use case: Designing a modern Big Data solution

2
CLASS OUTLINE

• Optimizing Hive
• LAB: Hive project best practices

3
1. OPTIMIZING HIVE

4
INTRODUCTION TO DATA
STORAGE FORMATS

5
DATA STORAGE FORMATS

• There are a few formats in which data can be stored in Hadoop,


and the selection of the optimal storage format depends on
your requirements in terms of read/write I/O speeds, how well
the files can be compressed and decompressed on demand,
and how easily the file can be split since the data will be
eventually stored as blocks.

6
DATA STORAGE FORMATS

• Parquet: Parquet is a columnar data storage format.


• This helps improve performance, sometimes significantly, by
permitting data storage and access on a per-column basis.
• Example: If you were working on a 1 GB file with 100 columns
and 1 million rows, and wanted to query data from only one of
the 100 columns, being able to access just the individual column
would be more efficient than having to access the entire file.

7
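For instance, with the pyarrow library (an assumption; any Parquet reader behaves similarly), only the requested column needs to be read back:

# Sketch using pyarrow (assumed to be installed): write a small Parquet
# file, then read back a single column without touching the others.
import pyarrow as pa
import pyarrow.parquet as pq

table = pa.table({"movieId": [1, 2, 3],
                  "rating": [4.0, 3.5, 5.0],
                  "userId": [10, 11, 12]})
pq.write_table(table, "ratings.parquet")

only_ratings = pq.read_table("ratings.parquet", columns=["rating"])
print(only_ratings.to_pydict())    # {'rating': [4.0, 3.5, 5.0]}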
DATA STORAGE FORMATS

• ORC Files: ORC stands for Optimized Row-Columnar.


• Further layer of optimization over pure columnar formats such as
Parquet.
• ORC files store data not only by columns, but also by groups of rows,
known as stripes.
• A file with data in tabular format can thus be split into multiple
smaller stripes, where each stripe comprises a subset of rows from
the original file.
• By splitting data in this manner, if a user task requires access to only
a small subsection of the data, the process can interrogate the
specific stripe that holds the data.

8
2. PARTITIONING AND
BUCKETING

9
SOME HIVE OPTIMIZATION TECHNIQUES

1. Use partitioning and bucketing


2. Use Hive on Tez (when available)
3. Use a file format such as ORC
4. Use vectorization
5. Use cost based query optimization
(Beware: not all platforms or versions support all optimizations.)

10
PARTITIONS

• Table partitioning means dividing the table data into parts


based on the unique values of particular columns (for example,
city and country) and segregating the input data records into
different files or directories.
• Partitioning in Hive is used to increase query performance.
• Hive is used to perform queries on large datasets, often scanning
an entire table, so even a simple query may take a long time to return
the result.
• The concept of partitioning can be used to reduce the cost of
querying the data.
11
PARTITIONED BY

• Partitioning in Hive is done using the PARTITIONED BY clause in


the create table statement of table.
• A table can have one or more partitions.
• A table can be partitioned on the basis of one or more columns.
The columns on which partitioning is done cannot be included
in the data of the table.

12
SETTING DYNAMIC PARTITIONING

• set hive.exec.dynamic.partition = true


• set hive.exec.dynamic.partition.mode = nonstrict

13
PARTITION TABLE DESIGN

• A query with partition filtering will only load data from the
specified partitions (sub-directories), so it can execute much
faster than a normal query that filters by a non-partitioning
field.
• The selection of the partition key is always an important factor
for performance. It should always be a low-cardinality attribute,
to avoid the overhead of too many sub-directories.

14
PARTITION TABLE DESIGN

• The following are some attributes commonly used as partition


keys:
• Partitions by date and time: Use date and time, such as year, month,
and day (even hours), as partition keys when data is associated with the
date/time columns, such as load_date, business_date, run_date, and so
on
• Partitions by location: Use country, territory, state, and city as
partition keys when data is location related
• Partitions by business logic: Use department, sales region,
applications, customers, and so on as partition keys when data can be
separated evenly by business logic

15
EXAMPLE

• Let's take an example of a customer’s table data and imagine


that we have the data of different customers of different
countries.
• If we don't enable any partitioning, then by default all data will
go into one directory.
• If the data is 1TB and we query for customers belonging to
Portugal, then this query will be executed on the entire dataset.
• By enabling partitioning, the query execution can be faster.

16
SPLITTING THE DATA

• If we want to split the data on the country basis, then the


following command can be used to create a table with the
partitioned column country:
• CREATE TABLE customer(id STRING, name STRING, gender
STRING, state STRING)
PARTITIONED BY (country STRING);

17
SPLITTING THE DATA

• The partitioning of tables changes the structure of storing the


data.
• A root-level directory structure remains the same as a normal
table; for example, if we create this customer table in the xyz
database, there will be a root-level directory, as shown here:
• hdfs://hadoop_namenode_server/user/hive/warehouse/xy
z.db/customer

18
SPLITTING THE DATA

• However, Hive will now create subdirectories reflecting the


partitioning structure, for example:
• .../customer/country=OI
• .../customer/country=UK
• .../customer/country=IN
• ...
• These subdirectories have the data of respective countries. Now
if a query is executed for a particular country, then only a
selected partition will be used to return the query result.

19
EXAMPLE

• Partitioning can also be done on the basis of multiple parameters. In


the preceding example, we have a field, state, in a customer record.
• Now if we want to keep the data of each state in different file for
each country, then we can partition the data on the country as well
as state attributes.
• The following command can be used for this purpose:
• CREATE TABLE customer(id STRING, name STRING,
gender STRING) PARTITIONED BY (country STRING,
state STRING);
• ...

20
EXAMPLE

• Hive will now create subdirectories for state as well.


• .../customer/country=OI/state=AB
• .../customer/country=OI/state=DC
• .../customer/country=UK/state=JR
• .../customer/country=IN/state=UP
• .../customer/country=IN/state=DL
• .../customer/country=IN/state=RJ

21
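The directory naming that Hive uses for these partitions can be mimicked in a few lines of Python (an illustration of the naming scheme only, not what Hive actually executes):

# Illustration of Hive's partition directory naming: country=.../state=...
import os

WAREHOUSE = "/user/hive/warehouse/xyz.db/customer"    # root of the table

def partition_path(record):
    return os.path.join(WAREHOUSE,
                        "country=" + record["country"],
                        "state=" + record["state"])

rows = [{"id": "1", "country": "IN", "state": "DL"},
        {"id": "2", "country": "UK", "state": "JR"}]
for r in rows:
    print(partition_path(r))
# /user/hive/warehouse/xyz.db/customer/country=IN/state=DL
# /user/hive/warehouse/xyz.db/customer/country=UK/state=JR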
SELECT

• SELECT * FROM customer WHERE country = 'IN' AND state =


'DL’;

22
BUCKETS

• Data in each partition may in turn be divided into Buckets based


on the hash of a column in the table.
• Each bucket is stored as a file in the partition directory.
• Bucketing allows the system to efficiently evaluate queries that
depend on a sample of data (these are queries that use the
TABLESAMPLE clause on the table).
• Enabling bucketing:
• set hive.enforce.bucketing=true;

23
CREATE A BUCKETED TABLE

• You can create a bucketed table using the following command:


• CREATE [EXTERNAL] TABLE
[db_name.]table_name [(col_name data_type
[COMMENT col_comment], ...)]
CLUSTERED BY (col_name, col_name, ...)
INTO N BUCKETS;
• The preceding command will create a bucketed table based on
the columns provided in the CLUSTERED BY clause. The number
of buckets will be as specified in the CREATE TABLE statement.

24
CREATE A BUCKETED TABLE

• The following is the example of bucketing the sales data of a


sales_bucketed table:
• CREATE TABLE sales_bucketed (id INT, fname STRING,
lname STRING, address STRING, city STRING, state
STRING, zip STRING, IP STRING, prod_id STRING, date1
STRING)
CLUSTERED BY (id) INTO 10 BUCKETS;

25
INSERTING DATA INTO A BUCKETED TABLE

• You can use the INSERT statement to insert the data into a
bucketed table.
• Let's put the data from another table sales into our bucketed
table sales_bucketed:
• INSERT INTO sales_bucketed SELECT * from sales;

26
BUCKET HASHING

• In this example, we have defined the id attribute as a bucketing


column and the number of buckets is equal to 10.
• All the data is distributed into 10 buckets based on the hashing
of the id attribute. Data is evenly distributed between all buckets
based on the hashing principle.
• Now when a query is executed to fetch a record for a particular id, the
framework will use the hashing algorithm to identify the bucket
that holds that record and will return the result. For example:
• SELECT * FROM sales_bucketed WHERE id = 1000;
• Rather than scanning the entire table, the preceding query will
be executed only on the particular bucket that contains the records with id = 1000.

27
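The bucket lookup described above boils down to a hash modulo the number of buckets. A tiny conceptual sketch in Python (Hive uses its own Java hash function per column type, so the actual bucket numbers will differ):

# Conceptual sketch: map a record's id to one of 10 buckets.
NUM_BUCKETS = 10

def bucket_for(record_id):
    return hash(record_id) % NUM_BUCKETS

# a query for id = 1000 only needs to scan the one bucket file below
print("id 1000 lives in bucket", bucket_for(1000))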
3. LAB MOVIELENS

28
THE DATASET

• We will use the MovieLens 100K dataset to optimize performance


during the class
• The practices here should be extensible to other sizes of the
dataset
• Suggestion: go incremental if at all possible. It is better to conclude
the project with 100K first, getting a feel for the data and the possible
results. Then optimize the performance (partitioning,
bucketing, and others) and/or grow the dataset.
• This will minimize frustration and lost time from being stuck
on some optimization that doesn’t yield the expected results.
29
DATA INGESTION: START

30
FILE BROWSER

31
FROM PREVIOUS EXERCISE

32
CREATE DIRECTORY

33
CREATE

--Create the database


DROP DATABASE IF EXISTS ml100k;
CREATE DATABASE ml100k;
USE ml100k;

34
CREATE DATABASE

35
CONFIRM THE CREATION

36
THE DATABASE IN THE WAREHOUSE IS…

37
…JUST A DIRECTORY

38
DRAG AND DROP TO DOWNLOADS IN THE
LINUX VM

39
UPLOAD TO HDFS

40
NOW, IN FOLDER MOVIELENS IN HDFS

41
LOOKING AT A FILE

42
MOVE EACH FILE TO A DIRECTORY WITH THE
SAME NAME

43
PARAMETERS YOU MIGHT WANT TO SET

-- PARTITIONS
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;
SET hive.exec.max.dynamic.partitions.pernode=300;
SET hive.optimize.sort.dynamic.partition=true;

-- BUCKETING
SET hive.enforce.bucketing=true;
SET hive.enforce.sorting=true;

44
PARAMETERS YOU MIGHT WANT TO SET
-- VECTORIZATION
SET hive.vectorized.execution.enabled=true;
SET hive.vectorized.execution.reduce.enabled=true;
SET hive.vectorized.input.format.excludes=;

-- COST-BASED OPTIMIZATION
SET hive.cbo.enable=true;
SET hive.compute.query.using.stats=true;
SET hive.stats.fetch.column.stats=true;
SET hive.stats.fetch.partition.stats=true;

45
CREATE AN EXTERNAL TEMPORARY FILE FOR
RATINGS
DROP TABLE IF EXISTS ml100k.aux;
CREATE EXTERNAL TABLE ml100k.aux
(
userId INT,
movieId INT,
rating FLOAT,
timestamp BIGINT
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION '/user/cloudera/movielens/ratings'
TBLPROPERTIES ("skip.header.line.count"="1");

46
CREATING THE TABLE

47
MOVE THE EXTERNAL TABLE TO AN INTERNAL HIVE
TABLE

DROP TABLE IF EXISTS ml100k.ratings;

CREATE TABLE ml100k.ratings


(
userId INT,
movieId INT,
rating FLOAT,
timestamp TIMESTAMP
)
CLUSTERED BY (movieId) SORTED BY (movieId ASC) INTO 4 BUCKETS
STORED AS ORC;

48
CREATING THE TABLE

49
MOVING THE DATA FROM EXTERNAL TO
INTERNAL

INSERT OVERWRITE TABLE ml100k.ratings


SELECT userId,
movieId,
rating,
from_unixtime(timestamp) as timestamp
FROM ml100k.aux;

50
MOVING THE DATA FROM EXTERNAL TABLE TO
INTERNAL TABLE

51
CHECKING

52
THE DATA MODEL

• Fact tables
• Movie ratings
• Tags, relevance
• Movie metrics
• Dimension tables
• Time and date
• Movie dimension
• Tags dimensions
• Genre dimensions

53
