1 – Despite its capabilities, TEZ still needs to store intermediate output to HDFS.
(2020, 2021)
FALSE
(Week 4, slide 48)
2 – Volume drives the need for processing and storage parallelism, and its management during
processing of large datasets.
(2020, 2021)
TRUE
(Week 1, slide 35)
3 – The servers where the data resides can only perform the map operation.
(2020, 2021)
TRUE
(Week 4, slide 27)
4 – The process of generating a new fsimage from a merge operation is called the Checkpoint process.
(2020, 2021)
TRUE
(Week 3, slide 15)
5 – Hive follows a “schema on read” approach, unlike RDBMS, which enforces “schema on write.”
(2020, 2021)
TRUE
(Week 5, slide 29)
6 – Big Data is a characterization only for volumes of data above one petabyte.
(2020, 2021)
FALSE
(Definition)
9 – One of the key design principles of HDFS is that it should favor low latency random access over high
sustained bandwidth.
(2020, 2021)
FALSE
(Week 2, slide 34)
10 – One of the key design principles of HDFS is that it should be able to use commodity hardware.
(2020, 2021)
TRUE
(Week 2, slide 24)
11 – The fundamental architectural principles of Hadoop are: large scale, distributed, shared everything
systems, connected by a good network, working together to solve the same problem.
(2020, 2021)
FALSE
(Week 2, slide 23)
12 – Apache Tez is an engine built on top of Apache Hadoop YARN.
(2020, 2021)
TRUE
(Week 4, slide 42)
14 – Business benefits are frequently higher when addressing the variety of data than when addressing
volume.
(2020, 2021)
TRUE
(Week 1, slide 37)
15 – In object storage, data is stored close to processing, just like in HDFS, but with rich metadata.
(2020, 2021)
FALSE
(Week 5, slide 6)
16 – The map function in MapReduce processes key/value pairs to generate a set of intermediate
key/value pairs.
(2020, 2021)
TRUE
(Week 4, slide 26)
17 - Essentially, MapReduce divides a computation into two sequential stages: map and reduce.
(2020, 2021)
FALSE
(Week 4, slide 26)
18 – When a HiveQL query is executed, Hive translates the query into MapReduce, saving the time and effort
of writing actual MapReduce jobs.
(2020, 2021)
TRUE
(Week 5, slide 21)
19 – When we drop an external table in Hive, both the data and the schema will be dropped.
(2020, 2021)
FALSE
(Week 6, slide 34)
20 – Hive can be considered a data warehousing layer over Hadoop that allows for data to be exposed
as structured tables.
(2020, 2021)
TRUE
(Week 5, slide 20)
--//--
21 – YARN manages resources and monitors workloads, in a secure multitenant environment, while
ensuring high availability across multiple Hadoop clusters.
(2021)
TRUE
(Week 4, slide 34)
22 – TEZ provides the capability to build an application framework that allows for a complex DAG of tasks
for high-performance data processing only in batch mode.
(2021)
FALSE
(Week 4, slide 42)
23 – The operation
CREATE TABLE external
(col1 STRING,
col2 INT)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ' ';
creates an external table in Hive.
(2021)
FALSE
(Should be CREATE EXTERNAL TABLE; as written, the statement creates an internal, managed table named external.)
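For reference, a minimal corrected sketch in HiveQL (the LOCATION path is illustrative; external tables usually point at an existing HDFS directory):
CREATE EXTERNAL TABLE ext_example
(col1 STRING,
col2 INT)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ' '
LOCATION '/user/cloudera/ext_example'; -- hypothetical path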
25 – We can say that internal tables in Hive adhere to a principle called “data on schema”.
(2021)
TRUE
(Week 5, slide 31)
1 – Consider the scenario in Fig. 1, for a log analytics solution. The web server in this example is an
Amazon Elastic Compute Cloud (EC2) instance. In step 1, Amazon Kinesis Firehose will continuously pull
log records from the Apache Web Server.
(2020, 2021)
FALSE
2 – We can configure, in Step 1, each shard in Amazon Kinesis Firehose to ingest up to 1MB/sec and 1000
records/sec, and emit up to 2MB/sec.
(2020, 2021)
FALSE
3 – In step 2, Amazon Kinesis Firehose will write each log record to Amazon Simple Storage Service
(Amazon S3) for durable storage of the raw log data.
(2020, 2021)
TRUE
4 – In step 2, also, Amazon Kinesis Analytics application will continuously run a Kinesis Streaming Python
script against the streaming input data.
(2021)
FALSE
5 – In step 3, the Amazon Kinesis Analytics application will create an aggregated data set every minute
and output that data to a second Firehose delivery stream.
(2021)
TRUE
7 – An end user provisions a Lambda service with similar steps as it provisions an EC2 instance
(2020, 2021)
FALSE
8 – The concept of partitioning can be used to reduce the cost of querying the data
(2020, 2021)
TRUE
9 – In a publish/subscribe model, although data producers are decoupled from data consumers,
publishers know who the consumers are.
(2020, 2021)
FALSE
10 – In Hive most of the optimizations are not based on the cost of query execution.
(2020, 2021)
TRUE
11 – The number of shards cannot be modified after the Kinesis stream is created.
(2020, 2021)
FALSE
12 – The basic idea of vectorized query execution is to process a batch of columns as an array of line
vectors.
(2020, 2021)
FALSE
13 – In Hive bucketing, data is evenly distributed between all buckets based on the hashing principle.
(2020, 2021)
TRUE
14 – The selection of the partition key is always an important factor for performance. It should always
be a low-cardinality attribute to avoid the overhead of too many sub-directories.
(2020, 2021)
TRUE
15 – To partition the table customer by country we use the following HiveQL statement
CREATE TABLE customer (id STRING, name STRING, gender STRING,
state STRING, country STRING)PARTITIONED BY (country STRING)
(2020, 2021)
FALSE
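The statement fails because a partition column cannot also appear in the regular column list; a corrected sketch:
CREATE TABLE customer (id STRING, name STRING, gender STRING,
state STRING)
PARTITIONED BY (country STRING);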
16 – We can configure the values for the Amazon S3 buffer size (1 MB to 128 MB) or buffer interval (60
seconds to 900 seconds). The condition satisfied first triggers data delivery to Amazon S3.
(2020, 2021)
TRUE
17 – Regarding Fig. 2, the box marked with an X refers to an Amazon Kinesis Analytics service.
(2021)
TRUE
19 – AWS Lambda polls the stream periodically (once per second) for new records. When it detects new
records, it invokes the Lambda function by passing the new records as a parameter. If no new records
are detected, the Lambda function is not invoked.
(2020, 2021)
TRUE
20 – DynamoDB tables do not have fixed schemas, but all items must have a similar number of attributes.
(2020, 2021)
FALSE
21 – The main drawback of Kinesis Data Firehose is that to scale up or down you need to manually
provision servers using the "AWS Kinesis Data Firehose Scaling API".
(2021)
FALSE
22 – Kinesis Data Firehose can convert a stream of JSON data to Apache Parquet or Apache ORC.
(2021)
TRUE
23 – Since it uses SQL as query language, Amazon Athena is a relational/transactional database.
(2021)
FALSE
25 – Kinesis Data Streams is a good choice for long-term data storage and analytics
(2021)
FALSE
-- // --
27 – One of the drawbacks of Amazon Kinesis Firehose is that it is unable to encrypt the data, limiting its
usability for sensitive applications.
(2020)
FALSE
28 – Stream processing applications process data continuously in real time, usually after a store operation
(in HD or SSD).
(2020)
FALSE
29 – You cannot use streaming data services for real-time applications such as application monitoring,
fraud detection, and live leaderboards, because these use cases require millisecond end-to-end
latencies—from ingestion, to processing, all the way to emitting the results to target data stores and
other systems.
(2020)
FALSE
29 – Kinesis Firehose is a fully managed service that automatically scales to match the throughput of
the data and requires no ongoing administration.
(2020)
TRUE
BDF2020 Review Questions 1
2. In terms of storage, what does a name node contain and what do data nodes contain?
• HDFS stores and maintains file system metadata and application data separately. The name node
(master of HDFS) contains the metadata related to the file system (information about each file, as
well as the history of changes to the file metadata). Data nodes (workers of HDFS) contain
application data in a partitioned manner for parallel writes and reads.
• The name node holds the entire metadata, called the namespace (a hierarchy of files and
directories), in physical memory for quicker response to client requests; this image is called the
fsimage. Any change is recorded in a transactional file called the edit log. For persistence, both of
these files are written to the host OS drives. The name node simultaneously responds to multiple
client requests (in a multithreaded system) and provides the information clients need to connect
to data nodes to write or read the data. While writing, a file is broken down into multiple chunks
called blocks (128MB by default, 64MB in older configurations). Each block is stored as a separate
file on data nodes. Based on the replication factor of a file, multiple copies or replicas of each
block are stored for fault tolerance.
5. What is client-side caching, and what is its significance when writing data to HDFS?
HDFS uses several optimization techniques. One is to use client-side caching, by the HDFS client, to
improve the performance of the block write operation and to minimize network congestion. The HDFS
client transparently caches the file into a temporary local file; when it accumulates enough data for a
block size, the client reaches out to the name node. At this time, the name node responds by inserting
the filename into the file system hierarchy and allocating data nodes for its storage. The client then
flushes the block of data from the local, temporary file to the closest data node, and that data node
transfers the block to other data nodes (as instructed by the name node, based on the replication
factor of the file). This client-side caching avoids continuous use of the network and minimizes the risk
of network congestion.
10. How does a name node ensure that all the data nodes are functioning properly?
Each data node in the cluster periodically sends heartbeat signals and a block-report to the name
node. Receipt of a heartbeat signal implies that the data node is active and functioning properly. A
block-report from a data node contains a list of all blocks on that specific data node.
11. How does a client ensure that the data it receives while reading is not corrupted? Is there a way
to recover an accidentally deleted file from HDFS?
When writing blocks of a file, an HDFS client computes the checksum of each block of the file and
stores these checksums in a separate hidden file in the same HDFS file system namespace. Later, while
reading the blocks, the client references these checksums to verify that the blocks were not corrupted
(corruption might happen because of faults in a storage device, network transmission faults, or bugs
in the program). When the client realizes that a block is corrupted, it reaches out to another data node
that has a replica of the corrupted block, to get another copy of the block. As for accidental deletion,
HDFS provides a trash facility: when it is enabled, deleted files are first moved to the user's trash
directory, from which they can be restored until the configured retention interval expires.
You may assume that you are a Data Scientist in a consulting assignment with the CDO (Chief Data
Officer) of Exportera (a large company). The CDO asks you lots of questions and very much appreciates
precise answers to his questions.
1. You are hired by the Municipality of Terras do Ouro as a Data Scientist, to help prevent the effects
of flooding from the River Guardião. The municipality has already distributed IoT devices across
the river that are able to measure the flow of the water. In the kick-off meeting, you say: "I have
an idea for a possible solution. We may need to use a number of AWS services.” And then you
explain your solution. What do you say?
In order to collect and process these real-time streams of data records regarding the flow of the water,
the data from the IoT devices across the river is sent to AWS Kinesis Data Streams. This data is then
processed using AWS Lambda and stored in Amazon S3 buckets. The data collected into Kinesis Data
Streams can be used for simple data analysis and reporting in real time, using Amazon Elasticsearch
Service. Finally, we can send the processed records to dashboards to visualize the variability of the
flow of water, using Amazon QuickSight.
You do not need to worry about compute resources if you use AWS Kinesis. This service enables you
to acquire data through a gateway on the cloud that receives your data; then you can process it using
AWS Lambda and later store it in a storage system, e.g. AWS's S3.
2. You go to a conference and hear a speaker saying the following: "Real-time analytics is a key
factor for digital transformation. Companies everywhere are using their datalakes based on
technologies such as Hadoop and HDFS to give key insights, in real time, of their sales and
customer preferences." You raise your hand for a comment. What are you going to say? Please, justify
carefully.
Firstly, I would like to clarify that Hadoop is a stack of different components, one of which is HDFS,
its distributed file system. Moreover, although HDFS is still thriving, there are plenty of other choices,
such as Amazon S3 or Microsoft Azure Data Lake Store (ADLS) for cloud storage, or Apache Kudu for
IoT and analytic data. More importantly, Hadoop and HDFS are batch-oriented: HDFS favors high
sustained bandwidth over low-latency access, so real-time insights call for streaming components
rather than a plain HDFS-based data lake.
Hadoop is a framework that incorporates several technologies, including batch, streaming (as you
mentioned) and others. HDFS is Hadoop's file system.
3. In a whiteboarding session, you say to the CDO of Exportera: "OK, I'll explain what the Hive
Driver is and what are its key functions/roles." What do you say, trying to be very complete and
precise?
To begin with, Hive is Hadoop’s SQL-like analytics query engine, which enables the writing of data
processing logic or queries in a declarative language similar to SQL, called HiveQL. The brain of Hive is
the Hive Driver, which maintains the lifecycle of a HiveQL statement: it comprises a query compiler,
optimizer, and executor, and it executes the task plan generated by the compiler in proper dependency
order while interacting with the underlying Hadoop instance.
Hive is Hadoop's SQL analytics query framework; that is the same as saying that Hive is a SQL
abstraction layer over Hadoop MapReduce with a SQL-like query engine. It also enables several types
of connections to other databases through the Hive Driver, which can, for instance, connect via ODBC
to relational databases.
4. On implementing Hadoop, the CDO is worried on understanding how a name node ensures that
all the data nodes are functioning properly. What can you tell him to reassure him?
The name node, or master node, contains the metadata related to the Hadoop Distributed File System,
such as file names, file paths, or configurations. It keeps track of all the data nodes, or slave nodes,
using the heartbeat methodology. Each data node regularly sends heartbeat signals to the name node.
After receiving these signals, the name node knows that the slave nodes are live and functioning
properly. In the event that the name node does not receive a heartbeat signal from a data node, that
particular data node is considered inactive, and the name node starts the process of replicating its
blocks on some other data node.
Each data node in the cluster periodically sends heartbeat signals and a block-report to the name
node. Receipt of a heartbeat signal implies that the data node is active and functioning properly. A
block-report from a data node contains a list of all blocks on that specific data node.
5. The CDO is considering using Amazon Kinesis Firehose to deliver data to Amazon S3. He is
concerned on losing data if the data delivery to the destination is falling behind data writing to
the delivery stream. Can you help him out understanding how this process works, in order to
alleviate his concerns?
In Amazon Kinesis Firehose, the scaling is handled automatically, up to gigabytes per second, in order
to meet the demand. Firehose also buffers incoming records before delivery: we can configure the
Amazon S3 buffer size (1 MB to 128 MB) or buffer interval (60 to 900 seconds), and the condition
satisfied first triggers data delivery to Amazon S3, where each record is durably stored as raw log data.
Meanwhile, the Amazon Kinesis Analytics application will continuously run a Kinesis Streaming SQL
statement against the streaming input data. This way, you should not be concerned.
6. The CDO of Exportera asks you to prepare the talking points for a presentation he must make
to the board regarding a budget increase for his team. It is important that the board members
understand what are the impacts of the data variety versus the data volume. What can you tell
them?
When it comes to Big Data applications, there are 4 characteristics which need to be addressed:
Volume, Velocity, Variety and Variability (also known as the 4 V’s). Volume is the most commonly
recognized characteristic of Big Data, representing the large amount of data available for analysis to
extract valuable information. The time and expense required to process large datasets drives the need
for processing and storage parallelism. On the other hand, data variety represents the need to analyze
data from multiple sources, domains and data types to bring added value to the business, which drives
the need for distributed processing. This way, a budget increase is needed to meet the added
requirements that big data processing brings to the table.
While volume can be handled by scaling the processing and storage resources, the issues related to
variety require customization of processing, namely interfaces, and the programming of handlers. For
instance, handling objects coming from SQL sources or OSI-PI sources requires different types of
programming skills. Not having the right skills to handle this can ruin the ambition of storing different,
and therefore rich, data.
To have this kind of skills we need a budget increase in the department.
7. The CDO of Exportera tells you: “From what I understand, in Hive, the partitioning of tables can
only be made on the basis of one parameter, making it not as useful as it could be." What can
you answer him?
You are wrong: Hive partitioning can be made with more than one parameter of the original table.
However, each partition key should always be a low-cardinality attribute to avoid the overhead of too
many partitions. It is very useful, as a query with partition filtering will only load data from the
specified partitions, so it can execute much faster than a normal query that filters by a
non-partitioning field.
Hive partitioning can be made with more than one parameter from the original table: the table is
divided into sub-directories (small tables by folders), one for each combination of partition values.
The partitioning can even be further optimized with bucketing.
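A hedged HiveQL sketch of multi-column partitioning (table and column names are illustrative):
CREATE TABLE sales_part (id STRING, amount FLOAT)
PARTITIONED BY (country STRING, year INT);
-- A filter on the partition keys reads only the matching sub-directories:
SELECT * FROM sales_part WHERE country = 'PT' AND year = 2020;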
8. The CDO of Exportera has been exploring stream processing and is worried about the latency of
a solution based on a platform such as Kinesis Data Streams. To address the question, what can
you tell him?
Kinesis Data Streams can be used efficiently to solve a variety of streaming data problems. It ensures
durability and elasticity, which enables you to scale the stream up or down, so that you never lose
data records before they expire. The delay between the time a record is put into the stream and the
time it can be retrieved is typically less than 1 second, so you don’t have to worry!
The Kinesis Data Streams platform is built on high-throughput, high-bandwidth components.
Kinesis ensures durability and elasticity of data acquired through streaming. The elasticity of Kinesis
Data Streams enables you to scale the stream up or down, so that you never lose data records before
they expire (1 day by default).
9. The CDO of Exportera is confused: "I've read about MapReduce, but I still do not fully
understand it." Then you explain it very clearly to him with a small example: counting the words of
the following text:
"Mary had a little lamb
Little lamb, little lamb
Mary had a little lamb
Its fleece was white as snow"
Justify carefully.
MapReduce is a process which divides a computation into three sequential stages: map, shuffle and
reduce. In the map phase, the required data is read from the HDFS and processed in parallel by
multiple independent map tasks. A map() function is defined which produces key-value outputs. Then,
the map outputs are sorted in the shuffle phase. Finally, a reduce() function is defined to aggregate all
the values for each key, summarizing the inputs.
One of the applications of the MapReduce algorithm is to count words. If we take into consideration
the example mentioned:
1 – In the first step, the map() phase, all the words in the text are split into a list ordered from the first
word of the text until the last one:
“Mary”,
“had”,
“a”,
“little”,
“lamb”,
“Little”,
“lamb”,
(…)
“white”,
“as”,
“snow”
Still in the map() phase, each word is converted into a key-value pair, where the key is the word and
the value is 1:
(“Mary”: 1),
(“had”: 1),
(“a”: 1),
(…)
(“snow”:1).
2 – In the shuffle phase, all the key-value pairs are sorted alphabetically by key, so that similar keys
are brought together.
3 – Finally, in the reduce phase, there is an aggregation by key (word), performing a count of the
respective values. This way, the final result is the set of key-value pairs of the distinct words and their
counts of occurrences throughout the text.
MapReduce is a programming model, or a processing technique, that can be applied in several contexts.
Its implementation is usually divided into three steps: mapping the data variables considered, shuffling
those variables following a pattern/directive, and then aggregating/reducing them.
Let us take the text above as reference for the following example: We want to count the number of
words in the text.
In the first step we would map all the words, let say, continuously, getting a list like:
"Mary
had
a
little
...
as
snow"
If one wants to count the number of words, a number should be associated with each word for later
counting. So, still in the Map step, we can associate "1" with each word, creating a kind of key-value
database, e.g.:
<"Mary",1>, where the word is the Key.
<Mary,1>
<little,1>
In the next step one could shuffle this database so that all similar keys get together. But since one
just wants to count the number of words, we only need to sum the values; this operation is the
Reduce operation in this case.
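As a side note, Hive compiles HiveQL into exactly this kind of map/shuffle/reduce pipeline, so the same word count can be sketched in HiveQL (assuming a hypothetical table docs with a single STRING column line holding the text):
SELECT word, COUNT(*) AS occurrences
FROM docs
LATERAL VIEW explode(split(line, ' ')) w AS word
GROUP BY word;
-- split() tokenizes each line (the map), GROUP BY forces the shuffle,
-- and COUNT(*) performs the reduce.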
10. In the Exportera cafeteria you hear someone saying the following: "Oh, processing data in
motion is no different than processing data at rest." If you had to join the conversation, what
would you say to them?
I don’t agree with you. Processing data in motion is very different from processing data at rest. The
operational difference between streaming data and data at rest lies in when the data is stored and
analyzed. When dealing with data at rest, each observation is recorded and stored before performing
analysis on the data. On the other hand, in streaming data, each event is processed as it is read, and
subsequent results are stored in a database. However, there are some similarities: data at rest and
streaming data can come from the same source, they can be processed with the same analytics and
can be stored with the same storage service
Processing data in motion (streaming) is much different from processing data at rest. The main
differences are that:
1. Analytics over streaming data needs to be done on incoming data, not on steady, stored data.
2. Data in motion needs buffering to process the incoming volumes, while similar processing of
steady data can be done through querying.
3. The concept of incoming data in streaming does not apply to data at rest; in the former case one
needs to adapt the acquisition to the amount of incoming data.
Of course, there are also some similarities, namely in data quality control, such as the detection of
common types of errors like invalid or missing values. Anyway, the processing, which is what we are
talking about, is much different in the two cases.
--//--
1. The CDO of Exportera is confused: "I've read about MapReduce, but I still do not fully understand
it. Can you explain very clearly its processes, and the role of the NameNode and the DataNodes in a
MapReduce operation?" Can you help him out?
MapReduce is a process which divides a computation into three sequential stages: map, shuffle and
reduce. In the map phase, the required data is read from the HDFS and processed in parallel by
multiple independent map tasks. A map() function is defined which produces key-value outputs. Then,
the map outputs are sorted in the shuffle phase. Finally, a reduce() function is defined to aggregate all
the values for each key, summarizing the inputs.
During this process, the NameNode allocates which DataNodes, meaning individual servers, will
perform the map operation and which will perform the reduce operation. This way, mapper
DataNodes perform the map phase and produce key-value pairs and the reducer DataNodes apply the
reduce function to the key-value pairs and generate the final output.
2. The CDO of Exportera asks you to prepare the talking points for a presentation he must make to
the board regarding a budget increase for his team. It is important that the board members
understand what data variety versus data variability is, as well as its impacts on the analytics
platforms (and, of course, business benefits of addressing them.) What do you write to make it
crystal clear?
When it comes to Big Data applications, there are 4 characteristics which need to be addressed:
Volume, Velocity, Variety and Variability (also known as the 4 V’s).
While the variety characteristic represents the need to analyze data from multiple sources and data
types, the variability refers to the changes in dataset characteristics, whether in the data flow rate,
format/structure and/or volume. When it comes to variety of data, distributed processing should be
applied on different types of data, followed by individual pre-analytics. On the other hand, variability
implies the need to scale-up or scale-down to efficiently handle the additional processing load, which
justifies the use of cloud computing. Finally, regarding the benefits to the business, both have their
advantages: variety of data brings an additional richness to the business, as more details from multiple
domains are available; variability keeps the systems efficient, as we don’t always have to design and
resource for the expected peak capacity.
Big Data Foundations
Week 1
Syllabus
Defining Big Data
Class outline
• Syllabus
• Defining Big Data
2
Class outline
• Syllabus
• Defining Big Data
3
Course agenda
• Defining Big Data
• Datacenter scale computing
• Foundational Big Data Core Components
• Cloud Storage and Modern Data Lake Components
• Batch Analysis
• Query Optimization for Batch Analysis
• Data collection
• Near-real time
• Real time
• Streaming Analysis
• Use case: Designing a modern Big Data solution
4
Evaluation and contact
• Evaluation:
• First assessment period:
• Quizzes in class (2): 2 x 5%
• Case analysis + status report: 40%
• Final Exam: 50%
• Second assessment period:
• Second Exam: 100%
• Office hours:
• Tuesdays & Wednesdays, 18h00-18h30, by appointment
• Contact:
• hcarreiro@novaims.unl.pt
5
Quizzes
6
Bibliography
7
Technology stack
8
Where to download Cloudera VM
• VMWare
https://downloads.cloudera.com/demo_vm/vmware/cloudera-quickstart-vm-5.13.0-0-vmware.zip
9
Approach
10
A note about online classes
11
Your expectations for the
course?
12
Case Analysis:
MovieLens Dataset(s)
13
MovieLens Dataset(s)
14
MovieLens Dataset(s)
• Based initially on the 100K MovieLens dataset, get to know the dataset and start
asking some questions that you might find interesting. Once comfortable, you
can scale up to the 20M dataset (see below), or even to the MovieLens 1B.
• Some possible questions
• What are the movie genres in the dataset?
• What is the number of movies per genre?
• What is the movie rating distribution per user?
• What are the movies with the highest average rating?
• The more interesting the questions, the deeper the answers.
• You can start small and grow to larger versions of the dataset (available in
https://grouplens.org/datasets/movielens)
• You should, at least, use HDFS and/or a cloud store (for sure) and Hive/Presto,
although you can explore other Hadoop components.
• Feel free to use Power BI or other visualization tool to present your insights.
Deliverables (1)
17
Status report
18
Descriptors for evaluating the case analysis
EVALUATION ITEM / DESCRIPTORS / PERCENTAGE OF THE TOTAL CLASSIFICATION
1 – Quality and comprehensiveness of the questions used to explore the dataset (15%). The questions:
a) explore the possibilities presented by the dataset;
b) give rise to relevant insights in the context.
4 – Quality of the report, in the form of the presentation text and complementary slides (15%). The report:
a) presents the solution clearly,
b) indicates the main options taken,
c) indicates the limitations,
d) indicates possibilities for future work.
5 – Respect for presentation time (10%). The time allocated to the group for the presentation was respected.
19
Logistics
20
What is Big Data?
21
Example of a modern Big Data stack: AWS
22
Example of a modern Big Data stack: Apache
Foundation projects
Source: Cloudera
23
Example: Azure server generations
Gen 2: Processor 2 x 6 Core 2.1 GHz; Memory 32 GiB; NIC 1 Gb/s
Gen 3: Processor 2 x 8 Core 2.1 GHz; Memory 128 GiB; NIC 10 Gb/s
HPC: Processor 2 x 12 Core 2.4 GHz; Memory 128 GiB; NIC 40 Gb/s
Gen 4: Processor 2 x 12 Core 2.4 GHz; Memory 192 GiB; NIC 40 Gb/s
Godzilla: Processor 2 x 16 Core 2.0 GHz; Memory 512 GiB; NIC 40 Gb/s
Gen 5.1: Processor 2 x 20 Core 2.3 GHz; Memory 256 GiB; NIC 40 Gb/s
GPU: Processor 2 x 8 Core 2.6 GHz; Memory 256 GiB; GPU 2 x 2 Compute GPU
Gen 5 Beast: Processor 4 x 18 Core 2.5 GHz; Memory 4096 GiB; NIC 40 Gb/s
Gen 6: Processor 2 x Skylake 24 Core 2.7 GHz; Memory 192 GiB DDR4; NIC 40 Gb/s; FPGA Yes
24
Example of a Big Data architecture in the
enterprise: Uber
25
Data science and Big Data
26
Class outline
• Syllabus
• Defining Big Data
27
Big Data characteristics
• Moore’s Law:
• In a 1965 paper, Gordon Moore
estimated that the density of
transistors on an integrated
circuit was doubling
every two years.
• The growth rates of data
volumes are estimated to be
faster than Moore’s Law, with
data volumes more than
doubling every eighteen
months.
Source: Intel 28
Source: https://ourworldindata.org/technological-progress 29
Nielsen’s Law
30
Kryder's law
31
Big Data: what’s your own definition?
32
Big Data NIST definition
Source: https://doi.org/10.6028/NIST.SP.1500-1r1
33
The origins of the Big Data “V’s”
34
Parallelizing data handling
35
Big Data: the impacts of the 4 V’s in terms of
services architecture?
36
Volume
38
Variety
• The variety characteristic represents the need to analyze data from
multiple repositories, domains, or types.
• The variety of data from multiple domains was previously handled
through the identification of features that would allow alignment of
datasets, and their fusion into a data warehouse.
• Distributed processing allows individual pre-analytics on different
types of data, followed by different analytics to span these interim
results.
• While volume and velocity allow faster and more cost-effective
analytics, it is the variety of data that allows analytic results that were
never possible before.
• Business benefits are frequently higher when addressing the variety
of data than when addressing volume.
39
Variability
• Variability refers to changes in a dataset, whether in the data flow rate,
format/structure, and/or volume, that impacts its processing.
• Impacts can include the need to refactor architectures, interfaces,
processing/algorithms, integration/fusion, or storage.
• Variability in data volumes implies the need to scale-up or scale-down virtualized
resources to efficiently handle the additional processing load, one of the
advantageous capabilities of cloud computing.
• Dynamic scaling keeps systems efficient, rather than having to design and
resource to the expected peak capacity (where the system at most times sits idle).
• It should be noted that this variability refers to changes in dataset characteristics,
whereas the term volatility refers to the changing values of individual data
elements.
• Since the latter does not affect the architecture—only the analytics—it is only
variability that affects the architectural design.
40
Structured, semi-structured, unstructured data
41
Recommended reading
42
Class takeaways
• Syllabus
• Defining Big Data
43
Big Data Foundations
Week 2
Core Components Overview
Course agenda
• Defining Big Data
• Datacenter scale computing
• Hadoop Core Components
• Cloud Storage
• Batch Analysis with Hive/Presto
• Optimizing Hive/Presto
• Data collection
• Near-real time
• Real time
• Streaming Analysis
• Use case: Designing a modern Big Data solution
2
Class outline
3
Class outline
4
Wise words
5
Vertical versus horizontal scaling
6
What do you call a few PB of free space?
7
What do you call a few PB of free space?
8
Example: Google platform
Data sources:
https://www.google.com/insidesearch/howsearchworks/thestory
http://www.seobook.com/learn-seo/infographics/how-search-works.php
http://www.ppcblog.com/how-google-works
9
GFS (and Colossus)
• Google’s GFS is an example of a storage system with a simple file-like abstraction
(Google’s Colossus system has since replaced GFS, but follows a similar
architectural philosophy).
• GFS was designed to support the web search indexing system (the system
that turned crawled web pages into index files for use in web search), and
therefore focuses on high throughput for thousands of concurrent readers/writers
and robust performance under high hardware failures rates.
• GFS users typically manipulate large quantities of data, and thus GFS is further
optimized for large operations. The system architecture consists of a primary
server (master), which handles metadata operations, and thousands of
chunkserver (secondary) processes running on every server with a disk drive, to
manage the data chunks on those drives.
• In GFS, fault tolerance is provided by replication across machines instead of within
them, as is the case in RAID systems. Cross-machine replication allows the system
to tolerate machine and network failures and enables fast recovery, since replicas
for a given disk or machine can be spread across thousands of other machines.
• Here are some terms used to describe the different software layers in
a typical WSC deployment.
• Platform-level software: The common firmware, kernel, operating system
distribution, and libraries expected to be present in all individual servers to
abstract the hardware of a single machine and provide a basic machine
abstraction layer.
• Cluster-level infrastructure: The collection of distributed systems software
that manages resources and provides services at the cluster level. Ultimately,
we consider these services as an operating system for a data center.
Examples are distributed file systems, schedulers and remote procedure call
(RPC) libraries, as well as programming models that simplify the usage of
resources at the scale of data centers, such as MapReduce, Dryad, Hadoop,
Sawzall, BigTable, Dynamo, Dremel, Spanner, and Chubby.
12
Google Clusters at the beginning
14
Source: Barroso, L. A., Hölzle, U., & Ranganathan, P. (2018).
Building blocks
15
Source: Barroso, L. A., Hölzle, U., & Ranganathan, P. (2018).
Optional reading
16
Class outline
17
The birth of the Big Data industry
• Many of the ideas that underpin the Apache Hadoop project are decades
old. Academia and industry have been exploring distributed storage and
computation since the 1960s.
• Real, practical, useful, massively scalable, and reliable systems simply could
not be found—at least not cheaply—until Google confronted the problem
of the internet in the late 1990s and early 2000s. Collecting, indexing, and
analyzing the entire web was impossible, using commercially available
technology of the time.
• Google dusted off the decades of research in large-scale systems. Its
architects realized that, for the first time ever, the computers and
networking they required could be had, at reasonable cost.
• Its work—on the Google File System (GFS) for storage and on the
MapReduce framework for computation—created the big data
industry.
22
Hadoop and modern platforms
• Once, the only storage system was the Hadoop Distributed File
System (HDFS), based on GFS.
• Today, HDFS is thriving, but there are plenty of other choices:
Amazon S3 or Microsoft Azure Data Lake Store (ADLS) for cloud
storage, for example, or Apache Kudu for IoT and analytic data.
• Similarly, MapReduce was originally the only option for analyzing
data. Now, users can choose among MapReduce, Apache Spark for
stream processing and machine learning workloads, SQL engines like
Apache Impala and Apache Hive, and more.
• All of these new projects have adopted the fundamental
architecture of Hadoop: large-scale, distributed, shared-nothing
systems, connected by a good network, working together to
solve the same problem.
Interconnection Network
Source: David J. DeWitt, Willis Lang. “Data Warehousing in the Cloud – The Death of
Shared Nothing.” http://mitdbg.github.io/nedbday/2017/#program
24
Some misconceptions
• Horizontal Scaling
• Adoption of Open Source
• Embracing Cloud Compute
• Decoupled Compute and Storage
26
Shared Nothing to Shared Storage
Image source: David J. DeWitt, Willis Lang. “Data Warehousing in the Cloud – The Death of
Shared Nothing.” http://mitdbg.github.io/nedbday/2017/#program
27
Why Shared Storage? Flexible Scaling
Image source: David J. DeWitt, Willis Lang. “Data Warehousing in the Cloud – The Death of
Shared Nothing.” http://mitdbg.github.io/nedbday/2017/#program
28
Apache Hadoop
30
Apache Hadoop
31
Apache Hadoop
33
HDFS
36
Horizontal scalability
37
Fault tolerance
38
Fault tolerance
39
Class takeaways
40
Big Data Foundations
Week 3
Lab Hive
Core Components Overview 2
Class outline
• Lab
• HDFS architecture and access
•A
2
Class outline
• Lab
• HDFS architecture and access
•A
3
Class outline
• Lab
• HDFS architecture and access
•A
4
Hadoop Cluster Terminology
• A cluster is a group of computers working together
–Provides data storage, data processing, and resource management
• A node is an individual computer in the cluster
–Master nodes manage distribution of work and data to worker nodes
• A daemon is a program running on a node
–Each Hadoop daemon performs a specific function in the cluster
Worker Node
Worker Node
Worker Node
…
5
Cluster Components
• Resource Management
• Storage
6
HDFS Basic Concepts (1)
• HDFS is a filesystem written in Java
–Based on Google’s GFS
• Sits on top of a native filesystem
–Such as ext3, ext4, or xfs
• Provides redundant storage for massive amounts of data
–Using readily-available, industry-standard computers
HDFS
Native OS filesystem
Disk Storage
7
HDFS Basic Concepts (2)
8
How Files Are Stored
• Data files are split into 128MB blocks which are distributed at load time
• Each block is replicated on multiple data nodes (default 3x)
• NameNode stores metadata
[Diagram: a very large data file is split into Blocks 1–4; each block is replicated on three data nodes, while the NameNode holds the metadata (information about files and blocks)]
9
HDFS Secondary NameNode
The NameNode daemon must be running at all times
– If the NameNode stops, the cluster becomes inaccessible
10
Hierarchical file organization
• The fsimage and the edit log file are central data structures that
contain HDFS file system metadata and namespaces.
• Any corruption of these files can cause the HDFS cluster
instance to become nonfunctional.
• For this reason, the name node can be configured to support
maintaining multiple copies of the fsimage and edit log to
another machine.
• This is where the secondary name node comes into play.
Checkpoint process
[Diagram: an HDFS cluster (Nodes A–E) storing the files /logs/031512.log and /logs/042313.log]
16
Example: Storing and Retrieving Files (2)
[Diagram: the blocks (1–5) of /logs/031512.log and /logs/042313.log distributed and replicated across Nodes A–E]
17
Example: Storing and Retrieving Files (3)
[Diagram: a client asks the NameNode for /logs/042313.log]
18
Example: Storing and Retrieving Files (4)
[Diagram: the NameNode directs the client to the data nodes holding the blocks of /logs/042313.log]
19
HDFS Set up for High Availability
20
Options for Accessing HDFS
From the command line
– FsShell: $ hdfs dfs (e.g. put / get to copy files to and from the HDFS cluster)
In Python
– Several libraries available, such as snakebite, by Spotify
Other programs
– Java API
– Used by Hadoop MapReduce, Impala, Hue, Sqoop, Flume, etc.
– RESTful interface
21
Demo
24
HDFS Command Line Examples (3)
25
The Hue HDFS File Browser
• The File Browser in Hue (Hadoop User Experience) lets you view and manage your
HDFS directories and files
–Create, move, rename, modify, upload, download and delete directories and files
–View file contents
26
Hue in AWS
27
HDFS Recommendations
• HDFS is a repository for all your data
–Structure and organize carefully!
• Best practices include
–Define a standard directory structure
–Include separate locations for staging data
• Example organization
–/user/… – data and configuration belonging only to a single user
–/etl – Work in progress in Extract/Transform/Load stage
–/tmp – Temporary generated data shared between users
–/data – Data sets that are processed and available across the
organization for analysis
–/app – Non-data files such as configuration, JAR files, SQL files, etc.
28
AWS Elastic MapReduce (EMR)
29
An EMR Cluster
30
EMR Storage
• HDFS
• EMRFS: access S3 as if it were HDFS
• EMRFS Consistent View – Optional for S3 consistency
• Uses DynamoDB to track consistency
• Local file system
• EBS for HDFS
31
Class outline
• Lab
• HDFS architecture and access
•A
32
Review questions
2
Class outline
•A
3
Lab: Oil import prices
analysis with Hive
4
Recommended reading
5
Class outline
•A
6
Why MapReduce?
8
Divide and Conquer
[Diagram: “Work” is partitioned into w1, w2, w3; workers produce partial results r1, r2, r3, which are aggregated into the “Result”]
Difficult because:
• We don’t know the order in which workers run…
• We don’t know when workers interrupt each other…
• We don’t know when workers need to communicate partial results…
• We don’t know the order in which workers access shared resources…
Basic primitives
Semaphores (lock, unlock)
Conditional variables (wait, notify, broadcast)
Barriers
[Diagram: processes P1–P5 accessing shared memory]
Design Patterns
• Producer/consumer with a work queue
• Coordinator and workers
The reality:
• Lots of one-off solutions, custom code
• Write your own dedicated library, then program with it
• Burden on the programmer to explicitly manage everything
17
MR works like a UNIX Command Sequence
18
grep | sort | count
19
Exercise
• “We are going to a picnic near our house. Many of our friends
are coming. You are welcome to join us. We will have fun.”
20
WordCount Example: Myfile.txt
• Thus there will be 4 Map tasks, one for each segment of data.
21
Results of Map on each segment
22
Sorted results of Map operations
23
Results after Reduce phase
Key Value
a 1
are 3
coming 1
friends 1
fun 1
going 1
have 1
house 1
join 1
many 1
near 1
of 1
our 2
picnic 1
to 2
us 1
we 2
welcome 1
will 1
you 1
24
Other example
25
Recap: Map, shuffle, reduce
• Essentially, MapReduce divides a computation into three sequential stages:
map, shuffle, and reduce.
• In the map phase, the relevant data is read from HDFS and processed in
parallel by multiple independent map tasks.
• These tasks should ideally run wherever the data is located—usually we
aim for one map task per HDFS block.
• The user defines a map() function (in code) that processes each record in
the file and produces key-value outputs ready for the next phase.
• In the shuffle phase, the map outputs are fetched by MapReduce and
shipped across the network to form input to the reduce tasks.
• A user-defined reduce() function receives all the values for a key in turn
and aggregates or combines them into fewer values which summarize the
inputs.
27
Another example in diagram format
28
Pseudo-code for WordCount
map(String key, String value):
    // key: document name; value: document contents
    for each word w in value:
        EmitIntermediate(w, "1");
reduce(String key, Iterator values):
    // key: a word; values: a list of counts
    int result = 0;
    for each v in values:
        result += ParseInt(v);
    Emit(AsString(result));
Source: https://www.datacenterknowledge.com/archives/2014/06/25/
31
MapReduce at Google Search: long tenure
32
Hive and MapReduce
33
A few words about YARN: Yet Another
Resource Negotiator
• YARN is a large-scale, distributed operating system for Big Data
applications.
• It manages resources and monitors workloads, in a secure multi-
tenant environment, while ensuring high availability across
multiple Hadoop clusters.
• YARN is a common platform to run multiple tools and
applications such as interactive SQL (e.g. Hive), real-time
streaming (e.g. Spark), and batch processing (MapReduce), etc.
YARN application execution
Three clients run applications with different resource demands, which are translated into different-sized
containers and spread across the NodeManagers for execution.
Source: Kunigk, J. et al (2019)
35
YARN
•A
39
Apache Tez
41
Apache Tez
42
Side note: SEM as DAG
43
The MapReduce framework versus the Tez
framework for processing.
44
Apache Tez
45
Apache Tez
46
Apache Tez
47
Apache Tez
48
Apache Tez
49
Apache Tez
51
Spark versus Tez
52
Class takeaways
•A
53
BIG DATA
FOUNDATIONS
Week 5
2
CLASS OUTLINE
3
FROM GFS/HDFS TO OBJECT
STORAGE
4
FROM GFS/HDFS TO MODERN APPROACHES
6
AWS S3
7
RICH CONFIGURATION OPTIONS
8
AWS S3 STANDARD
9
AWS GLACIER
• Low-cost object storage meant for archiving / backup
• Data is retained for the longer term (10s of years)
• Alternative to on-premise magnetic tape storage
• Average annual durability is 99.999999999%
• Very low cost per storage per month + retrieval cost
• Each item in Glacier is called “Archive” (up to 40TB)
• 3 retrieval options:
• Expedited (1 to 5 minutes retrieval)
• Standard (3 to 5 hours)
• Bulk (5 to 12 hours)
10
S3 LIFECYCLE RULES
• Rules to move data between different tiers, to save storage cost
• General Purpose ► Infrequent Access ► Glacier
• Transition actions
• It defines when objects are transitioned to another storage class.
• Example: We can choose to move objects to the Standard IA class 60 days after they are created,
or move them to Glacier for archiving after 6 months.
• Moving to Glacier is helpful for backup / long term retention / regulatory
needs.
• Expiration actions
• Helps to configure objects to expire (be deleted) after a certain time period.
• Example: Access log files can be set to delete after a specified period of time.
11
APACHE HIVE
12
INTRODUCTION TO APACHE HIVE
14
INTRODUCTION TO APACHE HIVE
15
HIVE ARCHITECTURE
20
HIGH-LEVEL HIVE QUERY EXECUTION FLOW
21
HIGH-LEVEL HIVE QUERY EXECUTION FLOW
• Step-1: Execute Query – An interface of Hive, such as the command line or the web user interface, delivers the
query to the driver to execute. In this step, the UI calls the execute interface of the driver over a connector
such as ODBC or JDBC.
• Step-2: Get Plan – The driver creates a session handle for the query and transfers the query to the compiler
to build the execution plan. In other words, the driver interacts with the compiler.
• Step-3: Get Metadata – In this step, the compiler sends a metadata request to the metastore and obtains the
necessary metadata from it.
• Step-4: Send Metadata – The metastore transfers the metadata back to the compiler as an acknowledgement.
• Step-5: Send Plan – The compiler sends the execution plan it has built back to the driver.
• Step-6: Execute Plan – The execution plan is sent by the driver to the execution engine, which runs the job,
including the required DFS and metadata operations.
• Step-7: Fetch Results – The user interface (UI) fetches the results from the driver.
• Step-8: Send Results – When the results are retrieved from the data nodes, the execution engine returns
them to the driver, which sends them on to the user interface (UI).
22
HIVE “SCHEMA ON READ”
23
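To make the idea concrete, a hedged HiveQL sketch (paths and names are illustrative): the schema is applied only when the data is read, so a file already sitting in HDFS can be exposed as a table without rewriting it.
CREATE EXTERNAL TABLE logs (ts BIGINT, msg STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
LOCATION '/data/logs'; -- no data is validated or moved at CREATE time
SELECT * FROM logs LIMIT 10; -- mismatched fields simply read as NULL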
HIVE TABLES
26
INTERNAL TABLES IN HIVE WITH LOCATION
SPECIFIED
27
EXTERNAL TABLES IN HIVE
28
CLASS TAKEAWAY
29
BIG DATA
FOUNDATIONS
Week 6
Optimizing Hive
LAB: Hive project best practices
WHERE ARE WE?
• Defining Big Data
• Datacenter scale computing
• Hadoop Core Components: HDFS, MapReduce, YARN, TEZ
• Cloud Storage
• Batch Analysis with Hive
• Optimizing Hive
• Data collection
• Real time
• Near-real time
• Batch
• Streaming Analysis
• Use case: Designing a modern Big Data solution
2
CLASS OUTLINE
• Optimizing Hive
• LAB: Hive project best practices
3
1. OPTIMIZING HIVE
4
INTRODUCTION TO DATA
STORAGE FORMATS
5
DATA STORAGE FORMATS
6
DATA STORAGE FORMATS
7
DATA STORAGE FORMATS
8
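A minimal sketch of how a storage format is chosen in Hive (table name illustrative); the format is a single clause in the DDL:
CREATE TABLE ratings_orc (userId INT, movieId INT, rating FLOAT)
STORED AS ORC; -- columnar; alternatives include PARQUET, AVRO, TEXTFILE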
2. PARTITIONING AND
BUCKETING
9
SOME HIVE OPTIMIZATION TECHNIQUES
10
PARTITIONS
12
SETTING DYNAMIC PARTITIONING
13
PARTITION TABLE DESIGN
• A query with partition filtering will only load data from the
specified partitions (sub-directories), so it can execute much
faster than a normal query that filters by a non-partitioning
field.
• The selection of the partition key is always an important factor
for performance. It should always be a low-cardinality attribute
to avoid the overhead of too many sub-directories.
14
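A hedged sketch of the effect (names illustrative): each partition value maps to a sub-directory, and a filter on the partition key loads only that sub-directory:
-- .../customer/country=PT/ , .../customer/country=ES/ , ...
SELECT COUNT(*) FROM customer WHERE country = 'PT'; -- reads only country=PT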
PARTITION TABLE DESIGN
15
EXAMPLE
16
SPLITTING THE DATA
17
SPLITTING THE DATA
18
SPLITTING THE DATA
19
EXAMPLE
20
EXAMPLE
21
SELECT
22
BUCKETS
23
CREATE A BUCKETED TABLE
24
CREATE A BUCKETED TABLE
25
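The INSERT example below targets a table called sales_bucketed; a plausible DDL sketch for it (the bucketing column and bucket count are assumptions):
CREATE TABLE sales_bucketed (id INT, item STRING, amount FLOAT)
CLUSTERED BY (id) INTO 4 BUCKETS;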
INSERTING DATA INTO A BUCKETED TABLE
• You can use the INSERT statement to insert the data into a
bucketed table.
• Let's put the data from another table sales into our bucketed
table sales_bucketed:
• INSERT INTO sales_bucketed SELECT * from sales;
26
BUCKET HASHING
27
3. LAB MOVIELENS
28
THE DATASET
30
FILE BROWSER
31
FROM PREVIOUS EXERCISE
32
CREATE DIRECTORY
33
CREATE
34
CREATE DATABASE
35
CONFIRM THE CREATION
36
THE DATABASE IN THE WAREHOUSE IS…
37
…JUST A DIRECTORY
38
DRAG AND DROP TO DOWNLOADS IN THE
LINUX VM
39
UPLOAD TO HDFS
40
NOW, IN FOLDER MOVIELENS IN HDFS
41
LOOKING AT A FILE
42
MOVE EACH FILE TO A DIRECTORY WITH THE
SAME NAME
43
PARAMETERS YOU MIGHT WANT TO SET
-- PARTITIONS
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;
SET hive.exec.max.dynamic.partitions.pernode=300;
SET hive.optimize.sort.dynamic.partition=true;
-- BUCKETING
SET hive.enforce.bucketing=true;
SET hive.enforce.sorting=true;
44
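With the dynamic-partition parameters above set, a hedged example of a dynamic-partition load (table names illustrative); the partition value comes from the last SELECT column instead of being hard-coded:
INSERT OVERWRITE TABLE customer PARTITION (country)
SELECT id, name, gender, state, country FROM customer_staging;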
PARAMETERS YOU MIGHT WANT TO SET
-- VECTORIZATION
SET hive.vectorized.execution.enabled=true;
SET hive.vectorized.execution.reduce.enabled=true;
SET hive.vectorized.input.format.excludes=;
-- COST-BASED OPTIMIZATION
SET hive.cbo.enable=true;
SET hive.compute.query.using.stats=true;
SET hive.stats.fetch.column.stats=true;
SET hive.stats.fetch.partition.stats=true;
45
CREATE AN EXTERNAL TEMPORARY TABLE FOR
RATINGS
DROP TABLE IF EXISTS ml100k.aux;
CREATE EXTERNAL TABLE ml100k.aux
(
userId INT,
movieId INT,
rating FLOAT,
`timestamp` BIGINT
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION '/user/cloudera/movielens/ratings'
TBLPROPERTIES ("skip.header.line.count"="1");
46
CREATING THE TABLE
47
MOVE THE EXTERNAL TABLE TO AN INTERNAL HIVE
TABLE
48
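A hedged sketch of this step, building on the aux table above (the ORC format and the target table name are assumptions):
CREATE TABLE ml100k.ratings (
userId INT,
movieId INT,
rating FLOAT,
`timestamp` BIGINT
)
STORED AS ORC;
INSERT OVERWRITE TABLE ml100k.ratings
SELECT * FROM ml100k.aux;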
CREATING THE TABLE
49
MOVING THE DATA FROM EXTERNAL TO
INTERNAL
50
MOVING THE DATA FROM EXTERNAL TABLE TO
INTERNAL TABLE
51
CHECKING
52
THE DATA MODEL
• Fact tables
• Movie ratings
• Tags, relevance
• Movie metrics
• Dimension tables
• Time and date
• Movie dimension
• Tags dimensions
• Genre dimensions
53