
First Exam Version 5


Started on Wednesday, 19 January 2022, 6:30 PM


State Finished
Completed on Wednesday, 19 January 2022, 7:44 PM
Time taken 1 hour 13 mins

Information

Introduction

The exam is 75 minutes long and consists of yes/no and multiple-choice questions as well as open-ended questions.
The rating for yes/no and multiple-choice questions is 0.5 for a correct answer, 0 if the question is not answered, and -0.25 if the
answer is incorrect.
For open-ended questions, the rating is indicated on the left side of the question. Try to be precise in your answers.
The only permitted reference material is an A4 sheet of paper handwritten on both sides.
Any irregularity involving reference materials or otherwise may result in the exam being cancelled.

Question 1
Complete

Marked out of 0.50

The need for real-time data processing, even in the presence of large data volumes, drives a different type of architecture where the data
is not stored, but is processed typically in memory.

Select one:
a. False

b. N/R

c. True

Question 2
Complete

Marked out of 0.50

Value is used as a measure of the inherent potential in datasets.

Select one:
a. False

b. N/R

c. True


Question 3
Complete

Marked out of 0.50

You are working for a large telecom provider who has chosen the AWS platform for its data and analytics needs. It has agreed to using a
data lake and S3 as the platform of choice for the data lake. The company is getting data generated from DPI (deep packet inspection)
probes in near real time and looking to ingest it into S3 in batches of 100 MB or 2 minutes, whichever comes first. Which of the following
is an ideal choice for the use case without any additional custom implementation?

Select one:
a. Amazon Kinesis Data Streams

b. N/R

c. Amazon S3 Select

d. Amazon Kinesis Data Firehose

e. Amazon Kinesis Data Analytics

Question 4
Complete

Marked out of 0.50

We can configure, in Kinesis Data Firehose, the values for the Amazon S3 buffer size (from 1 MB to 128 MB) or buffer interval (from 60
seconds to 900 seconds). The condition satisfied first, triggers data delivery to Amazon S3.

Select one:
a. False

b. True

c. N/R

Question 5
Complete

Marked out of 0.50

Variety is a measure of the rate of data flow.

Select one:
a. True

b. False

c. N/R


Question 6
Complete

Marked out of 0.50

With ORC, a file with data in tabular format can be split into multiple smaller stripes where each stripe comprises a subset of columns
from the original file.

Select one:
a. False


b. True

c. N/R

Question 7
Complete

Marked out of 0.50

YARN monitors workloads, in a secure single tenant environment, while ensuring high availability across multiple Hadoop clusters.

Select one:
a. False

b. True

c. N/R

Question 8
Complete

Marked out of 0.50

A DataNode is always aware of the file to which a particular block belongs.

Select one:
a. True

b. False

c. N/R

Question 9
Complete

Marked out of 0.50

In HDFS, Hive tables are nothing more than directories containing files.

Select one:
a. True

b. False

c. N/R


Question 10
Complete

Marked out of 0.50

Big Data frameworks are frequently designed to take advantage of data locality on each node when distributing the processing, aiming
to maximize the movement of data between nodes.

Select one:
a. True

b. N/R

c. False

Question 11

Complete

Marked out of 0.50

One of the reasons you may want to convert data from CSV to Parquet before querying it with a service such as Athena is to save on
costs.

Select one:
a. True

b. N/R

c. False

Question 12
Complete

Marked out of 0.50

Kinesis Data Streams is not a good choice for long-term data storage and analytics.

Select one:
a. N/R

b. False

c. True

Question 13
Complete

Marked out of 0.50

In Hive, most of the optimizations are based on the cost of query execution.

Select one:
a. True

b. N/R

c. False


Question 14
Complete

Marked out of 0.50

We can say internal tables in Hive adhere to a principle called “schema on data”.

Select one:
a. False

b. True

c. N/R

Question 15
Complete

Marked out of 0.50

While the data may have high veracity (accurate representation of the real-world processes that created it), there are times when the data
is no longer valid for the hypothesis being tested.

Select one:
a. True

b. False

c. N/R

Question 16
Complete

Marked out of 0.50

Velocity drives the need for processing and storage parallelism, and its management during processing of large datasets.

Select one:
a. False

b. N/R

c. True

Question 17

Complete

Marked out of 0.50

The Checkpoint process is when the NameNode checks if the DataNodes are active.

Select one:
a. False

b. N/R

c. True


Question 18
Complete

Marked out of 0.50

HDFS is optimized to store very large amounts of highly mutable data with files being typically accessed in long sequential scans.

Select one:
a. N/R

b. False

c. True

Question 19
Complete

Marked out of 0.50

In HDFS, a client writes to the first DataNode and then receives an ACK (Acknowledge) message confirming the correct storage of the
block. After receiving that message, it writes to the second DataNode, and the process is repeated until the block is correctly stored on the
third DataNode. In this way, HDFS guarantees the safe storage of the block and its replicas.

Select one:
a. True

b. False

c. N/R

Question 20
Complete

Marked out of 0.50

Which of the following is a valid mechanism for performing data transformations from Amazon Kinesis Data Firehose?

Select one:
a. AWS Lambda

b. AWS Glue

c. Amazon SageMaker

d. Amazon Athena

e. N/R


Question 21
Complete

Marked out of 0.50

The selection of the partition key is always a key factor for performance. It should always be a prime number, and part of a key pair
determined by a hash function, to avoid the overhead of too many sub-directories.

Select one:
a. N/R

b. True

c. False

Question 22
Complete

Marked out of 0.50

To partition the table customer by country we use the following HiveQL statement
CREATE TABLE customer (id STRING, name STRING, gender STRING, state STRING) PARTITIONED BY (country STRING);

Select one:
a. True

b. N/R

c. False

Question 23
Complete

Marked out of 0.50

One of the key design principles of HDFS is that it should favor high sustained bandwidth over low latency random access.

Select one:
a. True

b. False

c. N/R

Question 24
Complete

Marked out of 0.50

Apache YARN is an engine built on top of Apache Hadoop Tez.

Select one:
a. False

b. True

c. N/R


Question 25
Complete

Marked out of 1.00

You are working as a consultant for a telecommunications company. The data scientists have requested direct access to the data to dive
deep into the structure of the data and build models. They have good knowledge of SQL. Which tool or tools will you choose to provide
them with direct access to the data and reduce the infrastructure and maintenance overhead while ensuring that access to data on
Amazon S3 can be provided? Be careful with your answer.

Amazon Redshift uses SQL to analyze structured and semi-structured data across data warehouses, operational databases, and data lakes,
using AWS-designed hardware and machine learning to deliver the best price performance at any scale.
Redshift lets us easily save the results of our queries back to our S3 data lake using open formats like Apache Parquet, for further analysis
with other analytics services like Amazon EMR, Amazon Athena, and Amazon SageMaker.

Question 26
Complete

Marked out of 1.00

In a whiteboarding session, you say to the CDO of Exportera: "OK, I'll explain what the Hive Driver is and what are its key functions/roles."
What do you say, trying to be very complete and precise?

The Hive Driver, Compiler, Optimizer, and Executor work together to turn a query into a set of Hadoop jobs. The Driver acts like a
controller which receives the HiveQL statements. The Driver starts the execution of a statement by creating sessions, and it monitors the
life cycle and progress of the execution. The Driver stores the necessary metadata generated during the execution of a HiveQL statement.
It also acts as a collection point for the data or query results obtained after the Reduce operation.

Question 27
Complete

Marked out of 1.50

The CDO is considering using Amazon Kinesis Firehose to deliver data to Amazon S3. He is concerned about losing data if the data delivery
to the destination falls behind data writing to the delivery stream. Can you help him understand how this process works, in
order to alleviate his concerns?

If Kinesis Data Firehose encounters errors while delivering or processing data, it retries until the configured retry duration expires. If the
retry duration ends before the data is delivered successfully, Kinesis Data Firehose backs up the data to the configured S3 backup bucket.
If the destination is Amazon S3 and delivery fails or if delivery to the backup S3 bucket fails, Kinesis Data Firehose keeps retrying until the
retention period ends.
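As a minimal illustration of where these settings live (a sketch only, assuming Python with boto3; the stream name, bucket ARNs and IAM role ARN below are placeholders rather than values from the exam), the buffering and S3 backup behaviour can be configured when creating the delivery stream:

# Hypothetical sketch: configuring buffer size / buffer interval and a backup
# of the raw source records for a Kinesis Data Firehose delivery stream.
import boto3

firehose = boto3.client("firehose", region_name="eu-west-1")

firehose.create_delivery_stream(
    DeliveryStreamName="dpi-probe-stream",        # placeholder name
    DeliveryStreamType="DirectPut",
    ExtendedS3DestinationConfiguration={
        "RoleARN": "arn:aws:iam::123456789012:role/firehose-delivery-role",  # placeholder
        "BucketARN": "arn:aws:s3:::my-data-lake-bucket",                     # placeholder
        # Whichever condition is satisfied first (size or interval)
        # triggers delivery of the buffered batch to Amazon S3.
        "BufferingHints": {"SizeInMBs": 100, "IntervalInSeconds": 120},
        # Optionally keep a copy of the raw source records in a second bucket.
        "S3BackupMode": "Enabled",
        "S3BackupConfiguration": {
            "RoleARN": "arn:aws:iam::123456789012:role/firehose-delivery-role",  # placeholder
            "BucketARN": "arn:aws:s3:::my-backup-bucket",                        # placeholder
        },
    },
)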


Question 28
Complete

Marked out of 1.50

You are hired by the Municipality of Terras do Ouro as Data Scientist, to help prevent the effects of flooding from the River Guardião. The
municipality has already distributed IoT devices across the river that are able to measure the flow of the water. In the kick-off meeting, you
say: "I have an idea for a possible solution. We may need to use a number of AWS services." And then you explain your solution. What do
you say?

We can use Amazon Kinesis to process streaming data from IoT devices, in this case, the devices that measure the river flow, then use the
data to send real-time alerts or take other actions when a device detects certain water level thresholds. We can also use AWS sample IoT
analytics code, to build our application.

Schematically:
- INPUT - Water sensors send data to Amazon Kinesis Data Streams

- AMAZON KINESIS DATA STREAMS - Ingests and stores sensor data streams for processing
- AWS LAMBDA – AWS Lambda is triggered and runs code to detect trends in the sensor data, identify water levels, and initiate alerts

- OUTPUT – Alerts are received when the water level reaches a certain threshold, so actions can be taken.
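A minimal sketch of the alerting piece of this idea, assuming Python for the Lambda function; the record layout (sensor_id, water_level_cm), the threshold value and the SNS topic used for notifications are illustrative assumptions, not part of the architecture above:

# Hypothetical alerting Lambda, triggered by the Kinesis data stream.
import base64
import json
import boto3

ALERT_THRESHOLD_CM = 350                                           # assumed flood threshold
TOPIC_ARN = "arn:aws:sns:eu-west-1:123456789012:flood-alerts"      # placeholder topic

sns = boto3.client("sns")

def handler(event, context):
    """Check each sensor reading and publish an alert for high water levels."""
    for record in event["Records"]:
        # Kinesis delivers the payload base64-encoded inside the event.
        payload = json.loads(base64.b64decode(record["kinesis"]["data"]))
        if payload.get("water_level_cm", 0) >= ALERT_THRESHOLD_CM:
            sns.publish(
                TopicArn=TOPIC_ARN,
                Subject="River Guardiao flood alert",
                Message=json.dumps(payload),
            )
    return {"records_processed": len(event["Records"])}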


Question 29
Complete

Marked out of 1.50

The CDO of Exportera is confused: "I've read about MapReduce, but I still do not fully understand it." Can you explain it very clearly to him
with a small example: counting the words of the following text:
"Mary had a little lamb

Little lamb, little lamb


Mary had a little lamb

Its fleece was white as snow"

Justify carefully.

Essentially, MapReduce divides a computation into three sequential stages: map, shuffle, and reduce. Below is how the MapReduce word
count program executes and outputs the number of occurrences of each word in a given input file.
- Mapper Phase: the text from the input file is split into individual tokens, i.e. words, to form a key-value pair for every word present in
the input. The key is the word from the input file and the value is '1'. In this case, the entire text will be split into 20 tokens (one for each
word), each with a value of 1, as shown below:
(mary,1)

(had,1)
(a,1)

(little,1)
(lamb, 1)

(little,1)

(lamb,1)
(little,1)

(lamb,1)
(mary,1)

(had,1)
(a,1)

(little,1)
(lamb, 1)

(its, 1)

(fleece, 1)
(was, 1)

(white, 1)
(as, 1)

(snow, 1)

- Shuffle Phase: after the map phase execution is completed, the shuffle phase is executed automatically, wherein the key-value pairs
generated in the map phase are taken as input and then sorted in alphabetical order. The output will look like this:

(a,1)
(a,1)

(as, 1)
(fleece, 1)

(had,1)
(had,1)

(its, 1)

(lamb, 1)
(lamb, 1)


(lamb,1)

(lamb,1)
(little,1)

(little,1)
(little,1)

(little,1)
(mary,1)

(mary,1)

(snow, 1)
(was, 1)

(white, 1)

Reduce Phase: this is like an aggregation phase for the keys generated by the map phase. The reducer phase takes the output of the shuffle
phase as input and then reduces the key-value pairs to unique keys, with the values added up. In our example:
(a,2)

(fleece, 1)

(had,2)
(its, 1)

(lamb, 4)
(little, 4)

(mary,2)
(snow, 1)

(was, 1)

(white, 1)
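The same three stages can also be sketched in code. Below is a minimal, self-contained Python simulation of the word count (an illustration only, not the exam's required answer; in a real Hadoop job the shuffle/sort happens between distributed map and reduce tasks rather than inside one process):

# Illustrative map / shuffle / reduce simulation of the word count above.
from itertools import groupby

text = """Mary had a little lamb
Little lamb, little lamb
Mary had a little lamb
Its fleece was white as snow"""

# Map: emit a (word, 1) pair for every token.
pairs = [(word.strip(",").lower(), 1) for word in text.split()]

# Shuffle: sort the pairs so that equal keys end up next to each other.
pairs.sort(key=lambda kv: kv[0])

# Reduce: add up the values for each unique key.
counts = {key: sum(v for _, v in group)
          for key, group in groupby(pairs, key=lambda kv: kv[0])}

print(counts)  # e.g. {'a': 2, 'as': 1, ..., 'lamb': 4, 'little': 4, 'mary': 2, ...}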

Question 30
Complete

Marked out of 1.50

An upcoming gaming startup is collecting gaming logs from its recently launched and hugely popular game. The logs are arriving in JSON
format with 500 different attributes for each record. The CMO has requested a dashboard based on six attributes that indicate the revenue
generated based on the in-game purchase recommendations as generated by the marketing department's ML team. The data is on S3 in
raw JSON format, and a report is being generated using a dashboard platform on the data available. Currently the report creation takes an
hour, whereas publishing the report is very quick. Furthermore, the IT department has complained about the cost of data scans on S3.
They have asked you as a solutions architect to provide a solution that improves performance and optimizes the cost. What would you
suggest? Be careful with your answer.

The main aspect of controlling storage costs is to match the data with the correct storage class according to how the data is utilized. The
best way to do this is to examine your data and determine how frequently it is accessed, in order to decide whether you need S3 Standard.
Glacier storage is a great option if you are looking for long-term storage.
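As one concrete illustration of matching storage classes to access patterns (a sketch only, assuming Python with boto3; the bucket name, prefix and day thresholds are hypothetical), an S3 lifecycle rule can tier the raw logs down to Standard-IA and then to Glacier:

# Hypothetical lifecycle rule: move aging raw logs to cheaper storage classes.
import boto3

s3 = boto3.client("s3")

s3.put_bucket_lifecycle_configuration(
    Bucket="game-logs-raw",                       # placeholder bucket
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "tier-down-raw-logs",
                "Status": "Enabled",
                "Filter": {"Prefix": "raw/"},     # placeholder prefix
                "Transitions": [
                    {"Days": 30, "StorageClass": "STANDARD_IA"},
                    {"Days": 90, "StorageClass": "GLACIER"},
                ],
            }
        ]
    },
)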



Quiz 1 Study and Tips
1. Big Data is a characterization only for volumes of data above one petabyte (F, Definition).
2. The fundamental architectural principles of Hadoop are: large scale, distributed, shared everything
systems, connected by a good network, working together to solve the same problem. (F, W2, 23)
3. Volume drives the need for processing and storage parallelism, and its management during processing of
large datasets. (T, W1, 37)
4. Business benefits are frequently higher when addressing the variety of data than when addressing
volume. (T, W1, 37)
5. Hadoop is considered a schema-on-write system regarding write operations. (F, W2, 30)
6. One of the key design principles of HDFS is that it should be able to use commodity hardware. (T, W2,
34)
7. One of the key design principles of HDFS is that it should favor low latency random access over high
sustained bandwidth. (F, W2, 34)
8. The process of generating a new fsimage from a merge operation is called the Checkpoint process. (T,
W3, 15)
9. The map function in MapReduce processes key/value pairs to generate a set of intermediate key/value
pairs. (T, W4, 26)
10. YARN manages resources and monitors workloads, in a secure multitenant environment, while ensuring
high availability across multiple Hadoop clusters. (T, W4, 34)
11. When a HiveQL is executed, Hive translates the query into MapReduce, saving the time and effort of
writing actual MapReduce jobs. (T, W5, 15)
12. Hive follows a “schema on read” approach, unlike RDBMS, which enforces “schema on write.” (T, W5,
23)
13. Hive can be considered a data warehousing layer over Hadoop that allows for data to be exposed as
structured tables. (T, W5, 23)
14. Apache Tez is an engine built on top of Apache Hadoop YARN. (T, W4, 42)
15. Tez provides the capability to build an application framework that allows for a complex DAG of tasks for
high-performance data processing only in batch mode. (F, W4, 42)
16. Despite its capabilities, Tez still needs to store intermediate output in HDFS. (F, W4, 48)
17. The name node is not involved in the actual data transfer. (T, W3, 9)
18. The operation
CREATE TABLE external
(col1 STRING,
col2 INT)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ' ';
creates an external table in Hive. (F, should be CREATE EXTERNAL TABLE)
19. The term MapReduce refers in exclusive to a programming model. (F, W4, 8)
20. Essentially, MapReduce divides a computation into two sequential stages: map and reduce. (F, W4, 26)
21. The servers where the data resides can only perform the map operation. (T, W4, 27)
22. In object storage, data is stored close to processing, just like in HDFS, but with rich metadata. (F, W5, 6)
23. When we drop an external table in Hive, both the data and the schema will be dropped. (F, W5, 28)
24. Internal tables in Hive can only be stored as TEXTFILE. (F, W6, 48)

25. We can say internal tables in Hive adhere to a principle called “data on schema”. (T, W5, 25)
Quiz 2

Fig. 1

Fig. 2

By default, the Hive query execution engine processes one column of a table at a time. (F)

Clickstream data from web applications can be collected directly in a data lake and a portion of that data
can be moved out to a data warehouse for daily reporting. We think of this concept as inside-out data
movement. (T)

Consider the scenario in Fig. 1, for a log analytics solution. The web server in this example is an
Amazon Elastic Compute Cloud (EC2) instance. In step 1, Amazon Kinesis Firehose will continuously pull
log records from the Apache Web Server. (F)

Data storage formats like ORC and Parquet rely on metadata which describes a set of values in a section
of the data, called a stripe. If, for example, the user is interested in values < 10 and the metadata says all
the data in this stripe is between 20 and 30, the stripe is not relevant to the query at all, and the query
can skip over it. (T)

In Athena, if your files are too large or not splittable, parallelism can be limited due to query processing
halting until one reader has finished reading the complete file. (T)

In Athena, to change the name of Table1 to Table2, we would use the following instruction, as we would
in Hive: ALTER TABLE Table1 RENAME TO Table2; (F)

Fig. 1: In step 2, also, the Amazon Kinesis Analytics application will continuously run a Presto script against
the streaming input data. (F)

Fig. 1: In step 2, Amazon Kinesis Firehose will write each log record to Amazon Simple Storage Service
(Amazon S3) for durable storage of the raw log data. (T)

Fig. 1: In step 3, the Amazon Kinesis Analytics application will create an aggregated data set every minute
and output that data to a second Firehose delivery stream. (T)

Kinesis Analytics outputs its results to Kinesis Streams or Kinesis Firehose. (T)

Kinesis Data Firehose can convert a stream of JSON data to Apache Parquet or Apache ORC. (T)

Kinesis Data Streams is a good choice for long-term data storage and analytics. (F)

One of the best use cases for AWS Glue, since it is a fully managed service, is if you require extensive
configuration changes from it. (F)

One of the characteristics of a serverless architecture, such as Athena's, is that, as the name implies,
there are no servers provisioned to begin with, and it is up to the user to completely provision all servers
and services. (F)

Presto is an open-source distributed SQL query engine optimized for batch ETL type jobs. (F)

Regarding Fig. 2, the box marked with an X refers to an Amazon Kinesis Analytics service. (T)

The basic idea of vectorized query execution is to process a batch of columns as an array of line vectors.
(F)

Fig. 1: We can configure, in Step 1, each shard in Amazon Kinesis Firehose to ingest up to 1MB/sec and
1000 records/sec, and emit up to 2MB/sec. (F)

You can use precisely the same set of tools to collect, prepare, and process real-time streaming data as
those tools that you have traditionally used for batch analytics. That is the basic premise of the
Lambda architecture. (F)

You may copy the product catalog data stored in your database to your search service to make it easier
to look through your product catalog and offload the search queries from the database. We think of this
concept as data movement outside-in. (F)

Quiz 2 Questions

Fig. 1 A log analytics solution


1. Consider the scenario in Fig. 1, for a log analytics solution. The web server in this example is an Amazon
Elastic Compute Cloud (EC2) instance. In step 1, Amazon Kinesis Firehose will continuously pull log
records from the Apache Web Server. (F)
2. We can configure, in Step 1, each shard in Amazon Kinesis Firehose to ingest up to 1MB/sec and 1000
records/sec, and emit up to 2MB/sec. (F)
3. In step 2, Amazon Kinesis Firehose will write each log record to Amazon Simple Storage Service (Amazon
S3) for durable storage of the raw log data. (T)
4. In step 2, also, Amazon Kinesis Analytics application will continuously run a Kinesis Streaming Python
script against the streaming input data. (F)
5. In step 3, the Amazon Kinesis Analytics application will create an aggregated data set every minute and
output that data to a second Firehose delivery stream. (T)
6. All data in DynamoDB is replicated in two availability zones (F)
7. An end user provisions a Lambda service with similar steps as it provisions an EC2 instance. (F)
8. The concept of partitioning can be used to reduce the cost of querying the data. (T)
9. In a publish/subscribe model, although data producers are decoupled from data consumers, publishers
know who the consumers are. (F)
10. In Hive most of the optimizations are not based on the cost of query execution. (T)
11. The number of shards cannot be modified after the Kinesis stream is created. (F)
12. The basic idea of vectorized query execution is to process a batch of columns as an array of line vectors.
(F)
13. In Hive bucketing, data is evenly distributed between all buckets based on the hashing principle. (T)
14. The selection of the partition key is always an important factor for performance. It should always be a
low-cardinality attribute to avoid the overhead of too many sub-directories. (T)
15. To partition the table customer by country we use the following HiveQL statement
CREATE TABLE customer (id STRING, name STRING, gender STRING, state STRING, country
STRING) PARTITIONED BY (country STRING); (F)
16. We can configure the values for the Amazon S3 buffer size (1 MB to 128 MB) or buffer interval (60
seconds to 900 seconds). The condition satisfied first triggers data delivery to Amazon S3. (T)

Fig. 2 An alert solution
17. Regarding Fig. 2, the box marked with an X refers to an Amazon Kinesis Analytics service. (T)
18. Kinesis Analytics outputs its results to Kinesis Streams or Kinesis Firehose. (T)
19. AWS Lambda polls the stream periodically (once per second) for new records. When it detects new
records, it invokes the Lambda function by passing the new records as a parameter. If no new records
are detected, the Lambda function is not invoked. (T)
20. DynamoDB tables do not have fixed schemas, but all items must have a similar number of attributes. (F)
21. The main drawback of Kinesis Data Firehose is that to scale up or down you need to manually provision
servers using the "AWS Kinesis Data Firehose Scaling API". (F)
22. Kinesis Data Firehose can convert a stream of JSON data to Apache Parquet or Apache ORC. (T)
23. Since it uses SQL as query language, Amazon Athena is a relational/transactional database. (F)
24. AWS Glue ETL jobs are Spark-based. (T)
25. Kinesis Data Streams is a good choice for long-term data storage and analytics. (F)
26. Amazon OpenSearch/Elasticsearch stores CSV documents. (F)

27 – One of the drawbacks of Amazon Kinesis Firehose is that it is unable to encrypt the data, limiting its
usability for sensitive applications.
(2020)
FALSE

28 – Stream processing applications process data continuously in real time, usually after a store operation
(on HDD or SSD).
(2020)
FALSE

29 – You cannot use streaming data services for real-time applications such as application monitoring,
fraud detection, and live leaderboards, because these use cases require millisecond end-to-end
latencies—from ingestion, to processing, all the way to emitting the results to target data stores and
other systems.
(2020)
FALSE

29 – Kinesis Firehose is a fully managed service that automatically scales to match the throughput of
the data and requires no ongoing administration.
(2020)
TRUE

23 – Since it uses SQL as query language, Amazon Athena is a relational/transactional database.
(2021)
FALSE

24 – AWS Glue ETL jobs are Spark-based.


(2021)
TRUE

25 – Kinesis Data Streams is a good choice for long-term data storage and analytics.
(2021)
FALSE

26 – Amazon OpenSearch/Elasticsearch stores CSV documents.


(2021)
FALSE

In terms of storage, what does a name node contain and what do data nodes contain?

HDFS stores and maintains file system metadata and application data separately. The name node
(master of HDFS) contains the metadata related to the file system (information about each file, as
well as the history of changes to the file metadata). Data nodes (slaves of HDFS) contain application
data in a partitioned manner for parallel writes and reads.

The name node keeps the entire metadata, called the namespace (a hierarchy of files and directories),
in physical memory for quicker response to client requests. This is called the fsimage. Any changes are
recorded in a transactional file called the edit log. For persistence, both of these files are written to host OS
drives. The name node simultaneously responds to multiple client requests (in a multithreaded
system) and provides information to the client to connect to data nodes to write or read the data.
While writing, a file is broken down into multiple chunks of 128MB by default (called blocks; 64MB in
older Hadoop versions). Each block is stored as a separate file on data nodes. Based on the replication
factor of a file, multiple copies or replicas of each block are stored for fault tolerance.

What were the key design strategies for HDFS to become fault tolerant?

HDFS is a highly scalable, distributed, load-balanced, portable, and fault-tolerant storage component
of Hadoop (with built-in redundancy at the software level).

When HDFS was implemented originally, certain assumptions and design goals were discussed:

Horizontal scalability—Based on the scale-out model. HDFS can run on thousands of nodes.

Fault tolerance—Keeps multiple copies of data to recover from failure.

Capability to run on commodity hardware—Designed to run on commodity hardware.

Write once and read many times—Based on a concept of write once, read multiple times, with an
assumption that once data is written, it will not be modified. Its focus is thus retrieving the data in
the fastest possible way.

Capability to handle large data sets and streaming data access—Targeted to small numbers of very
large files for the storage of large data sets.

Data locality—Every slave node in the Hadoop cluster has a data node (storage component) and a
JobTracker (processing component). Processing is done where data exists, to avoid data movement
across nodes of the cluster.

High throughput—Designed for parallel data storage and retrieval.

HDFS file system namespace—Uses a traditional hierarchical file organization in which any user or
application can create directories and recursively store files inside them.

What is a checkpoint, and who performs this operation?

The process of generating a new fsimage by merging transactional records from the edit log to the
current fsimage is called checkpoint. The secondary name node periodically performs a checkpoint
by downloading fsimage and the edit log file from the name node and then uploading the new
fsimage back to the name node. The name node performs a checkpoint upon restart (not
periodically, though—only on name node start-up).

What is the default data block placement policy?

By default, three copies, or replicas, of each block are placed, per the default block placement policy
mentioned next. The objective is a properly load-balanced, fast-access, fault-tolerant file system:

The first replica is written to the data node creating the file.

The second replica is written to another data node within the same rack.

The third replica is written to a data node in a different rack.

How does a name node ensure that all the data nodes are functioning properly?

Each data node in the cluster periodically sends heartbeat signals and a block-report to the name
node. Receipt of a heartbeat signal implies that the data node is active and functioning properly. A
block-report from a data node contains a list of all blocks on that specific data node.

To what does the term data locality refer?

Data locality is the concept of processing data locally wherever possible. This concept is central to
Hadoop, a platform that intentionally attempts to minimize the amount of data transferred across
the network by bringing the processing to the data instead of the reverse.
Review Questions 2

Answers

1. What is the default data block placement policy?


By default, three copies, or replicas, of each block are placed, per the default block
placement policy mentioned next. The objective is a properly load-balanced, fast-access,
fault-tolerant file system:
The first replica is written to the data node creating the file.
The second replica is written to another data node within the same rack.
The third replica is written to a data node in a different rack.
2. What is the replication pipeline? What is its significance?
Data nodes maintain a pipeline for data transfer. Having said that, data node 1 does not
need to wait for a complete block to arrive before it can start transferring it to data node 2
in the flow. In fact, the data transfer from the client to data node 1 for a given block happens
in smaller chunks of 4KB. When data node 1 receives the first 4KB chunk from the client, it
stores this chunk in its local repository and immediately starts transferring it to data node 2
in the flow. Likewise, when data node 2 receives the first 4KB chunk from data node 1, it
stores this chunk in its local repository and immediately starts transferring it to data node 3,
and so on. This way, all the data nodes in the flow (except the last one) receive data from
the previous data node and, at the same time, transfer it to the next data node in the flow,
to improve the write performance by avoiding a wait at each stage.
3. What is client-side caching, and what is its significance when writing data to HDFS?
HDFS uses several optimization techniques. One is to use client-side caching, by the HDFS
client, to improve the performance of the block write operation and to minimize network
congestion. The HDFS client transparently caches the file into a temporary local file; when it
accumulates enough data for a block size, the client reaches out to the name node. At this
time, the name node responds by inserting the filename into the file system hierarchy and
allocating data nodes for its storage. The client then flushes the block of data from the local,
temporary file to the closest data node, and that data node transfers the block to other data
nodes (as instructed by the name node, based on the replication factor of the file). This
client-side caching avoids continuous use of the network and minimizes the risk of network
congestion.
4. How can you enable rack awareness in Hadoop?
You can make the Hadoop cluster rack aware by using a script that enables the master node
to map the network topology of the cluster using the properties topology.script.file.name or
net.topology.script.file.name, available in the core-site.xml configuration file. First, you must
change this property to specify the name of the script file. Then you must write the script
and place it in the file at the specified location. The script should accept a list of IP addresses
and return the corresponding list of rack identifiers. For example, the script would take
host.foo.bar as an argument and return /rack1 as the output.
5. What is the data block replication factor?
An application or a job can specify the number of replicas of a file that HDFS should
maintain. The number of copies or replicas of each block of a file is called the replication
factor of that file. The replication factor is configurable and can be changed at the cluster
level or for each file when it is created, or even later for a stored file.
6. What is block size, and how is it controlled?
When a client writes a file to a data node, it splits the file into multiple chunks, called blocks.
This data partitioning helps in parallel data writes and reads. Block size is controlled by the
dfs.blocksize configuration property in the hdfs-site.xml file and applies for files that are

created without a block size specification. When creating a file, the client can also specify a
block size specification to override the cluster-wide configuration.
7. How does a client ensure that the data it receives while reading is not corrupted? Is there
a way to recover an accidentally deleted file from HDFS?
When writing blocks of a file, an HDFS client computes the checksum of each block of the file
and stores these checksums in a separate hidden file in the same HDFS file system
namespace. Later, while reading the blocks, the client references these checksums to verify
that these blocks were not corrupted (corruption might happen because of faults in a
storage device, network transmission faults, or bugs in the program). When the client
realizes that a block is corrupted, it reaches out to another data node that has the replica of
the corrupted block, to get another copy of the block.
8. How can you access and manage files in HDFS?
You can access the files and data stored in HDFS in many different ways. For example, you
can use HDFS FS Shell commands, leverage the Java API available in the classes of the
org.apache.hadoop.fs package, write a MapReduce job, or write Hive or Pig queries. In
addition, you can even use a web browser to browse the files from an HDFS cluster.
9. What two issues does HDFS encounter in Hadoop 1.0?
First, the name node in Hadoop 1.0 is a single point of failure. You can configure a secondary
name node, but it’s not an active-passive configuration. The secondary name node thus
cannot be used for failover in case the name node fails. Second, as the number of data nodes
grows beyond 4,000, the performance of the name node degrades, setting a kind of upper
limit to the number of nodes in a cluster.
10. What is a daemon?
The word daemon comes from the UNIX world. It refers to a process or service that runs in
the background. On a Windows platform, we generally refer to it as a service. For
example, in HDFS, we have daemons such as name node, data node, and secondary name
node.
11. What is YARN and what does it do?
In Hadoop 2.0, MapReduce has undergone a complete overhaul, with a new layer created on
top of HDFS. This new layer, called YARN (Yet Another Resource Negotiator), takes care of
two major functions: resource management and application life-cycle management. The
JobTracker previously handled those functions. Now MapReduce is just a batch-mode
computational layer sitting on top of YARN, whereas YARN acts like an operating system for
the Hadoop cluster by providing resource management and application life-cycle
management functionalities. This makes Hadoop a general-purpose data processing
platform that is not constrained only to MapReduce.
12. What is uber-tasking optimization?
The concept of uber-tasking in YARN applies to smaller jobs. Those jobs are executed in the
same container or in the same JVM in which that application-specific Application Master is
running. The basic idea behind uber-tasking optimization is that the distributed task
allocation and management overhead exceeds the benefits of executing tasks in parallel for
smaller jobs, hence it is optimal to execute smaller jobs in the same JVM or container of the
Application Master.
13. What are the different components of YARN?
Aligning to the original master-slave architecture principle, even YARN has a global or master
Resource Manager for managing cluster resources and a per-node and -slave Node Manager
that takes direction from the Resource Manager and manages resources on the node. These
two, form the computation fabric for YARN. Apart from that is a per-application Application
Master, which is merely an application-specific library tasked with negotiating resources
from the global Resource Manager and coordinating with the Node Manager(s) to execute

the tasks and monitor their execution. Containers also are present—these are a group of
computing resources, such as memory, CPU, disk, and network.

BDF2020 Review Questions 1 Questions

1. What is HDFS, and what are HDFS design goals?


HDFS is a highly scalable, distributed, load-balanced, portable, and fault-tolerant storage component
of Hadoop (with built-in redundancy at the software level).
When HDFS was implemented originally, certain assumptions and design goals were discussed:
• Horizontal scalability—Based on the scale-out model. HDFS can run on thousands of nodes.
• Fault tolerance—Keeps multiple copies of data to recover from failure.
• Capability to run on commodity hardware—Designed to run on commodity hardware.
• Write once and read many times—Based on a concept of write once, read multiple times, with an
assumption that once data is written, it will not be modified. Its focus is thus retrieving the data in the
fastest possible way.
• Capability to handle large data sets and streaming data access—Targeted to small numbers of very
large files for the storage of large data sets.
• Data locality—Every slave node in the Hadoop cluster has a data node (storage component) and
a JobTracker (processing component). Processing is done where data exists, to avoid data
movement across nodes of the cluster.
• High throughput—Designed for parallel data storage and retrieval.
• HDFS file system namespace—Uses a traditional hierarchical file organization in which any user or
application can create directories and recursively store files inside them.

2. In terms of storage, what does a name node contain and what do data nodes contain?
• HDFS stores and maintains file system metadata and application data separately. The name node
(master of HDFS) contains the metadata related to the file system (information about each file, as
well as the history of changes to the file metadata). Data nodes (workers of HDFS) contain
application data in a partitioned manner for parallel writes and reads.
• The name node contains an entire metadata called namespace (a hierarchy of files and directories)
in physical memory, for quicker response to client requests. This is called the fsimage. Any changes
are recorded in a transactional file called the edit log. For persistence, both of these files are written to host
OS drives. The name node simultaneously responds to the multiple client requests (in a
multithreaded system) and provides information to the client to connect to data nodes to write
or read the data. While writing, a file is broken down into multiple chunks of 128MB by default
(called blocks; 64MB in older Hadoop versions). Each block is stored as a separate file on data nodes. Based on
the replication factor of a file, multiple copies or replicas of each block are stored for fault
tolerance.

3. What is the default data block placement policy?


By default, three copies, or replicas, of each block are placed, per the default block placement policy
mentioned next. The objective is a properly load-balanced, fast-access, fault-tolerant file system:
• The first replica is written to the data node creating the file.
• The second replica is written to another data node within the same rack.
• The third replica is written to a data node in a different rack.

4. What is the replication pipeline? What is its significance?


Data nodes maintain a pipeline for data transfer. Having said that, data node 1 does not need to wait
for a complete block to arrive before it can start transferring it to data node 2 in the flow. In fact, the
data transfer from the client to data node 1 for a given block happens in smaller chunks of 4KB. When
data node 1 receives the first 4KB chunk from the client, it stores this chunk in its local repository and
immediately starts transferring it to data node 2 in the flow. Likewise, when data node 2 receives the
first 4KB chunk from data node 1, it stores this chunk in its local repository and immediately starts
transferring it to data node 3, and so on. This way, all the data nodes in the flow (except the last one)
receive data from the previous data node and, at the same time, transfer it to the next data node in
the flow, to improve the write performance by avoiding a wait at each stage.

5. What is client-side caching, and what is its significance when writing data to HDFS?
HDFS uses several optimization techniques. One is to use client-side caching, by the HDFS client, to
improve the performance of the block write operation and to minimize network congestion. The HDFS
client transparently caches the file into a temporary local file; when it accumulates enough data for a
block size, the client reaches out to the name node. At this time, the name node responds by inserting
the filename into the file system hierarchy and allocating data nodes for its storage. The client then
flushes the block of data from the local, temporary file to the closest data node, and that data node
transfers the block to other data nodes (as instructed by the name node, based on the replication
factor of the file). This client-side caching avoids continuous use of the network and minimizes the risk
of network congestion.

6. How can you enable rack awareness in Hadoop?


You can make the Hadoop cluster rack aware by using a script that enables the master node to map
the network topology of the cluster using the properties topology.script.file.name or
net.topology.script.file.name, available in the core-site.xml configuration file. First, you must change
this property to specify the name of the script file. Then you must write the script and place it in the
file at the specified location. The script should accept a list of IP addresses and return the
corresponding list of rack identifiers. For example, the script would take host.foo.bar as an argument
and return /rack1 as the output.

7. What is the data block replication factor?


An application or a job can specify the number of replicas of a file that HDFS should maintain. The
number of copies or replicas of each block of a file is called the replication factor of that file. The
replication factor is configurable and can be changed at the cluster level or for each file when it is
created, or even later for a stored file.

8. What is block size, and how is it controlled?


When a client writes a file to a data node, it splits the file into multiple chunks, called blocks. This data
partitioning helps in parallel data writes and reads. Block size is controlled by the dfs.blocksize
configuration property in the hdfs-site.xml file and applies for files that are created without a block
size specification. When creating a file, the client can also specify a block size specification to override
the cluster-wide configuration.

9. What is a checkpoint, and who performs this operation?


The process of generating a new fsimage by merging transactional records from the edit log to the
current fsimage is called checkpoint. The secondary name node periodically performs a checkpoint by
downloading fsimage and the edit log file from the name node and then uploading the new fsimage
back to the name node. The name node performs a checkpoint upon restart (not periodically,
though—only on name node start-up).

10. How does a name node ensure that all the data nodes are functioning properly?
Each data node in the cluster periodically sends heartbeat signals and a block-report to the name
node. Receipt of a heartbeat signal implies that the data node is active and functioning properly. A
block-report from a data node contains a list of all blocks on that specific data node.

11. How does a client ensure that the data it receives while reading is not corrupted? Is there a way
to recover an accidently deleted file from HDFS?
When writing blocks of a file, an HDFS client computes the checksum of each block of the file and
stores these checksums in a separate hidden file in the same HDFS file system namespace. Later, while
reading the blocks, the client references these checksums to verify that these blocks were not
corrupted (corruption might happen because of faults in a storage device, network transmission faults,
or bugs in the program). When the client realizes that a block is corrupted, it reaches out to another
data node that has the replica of the corrupted block, to get another copy of the block.

12. How can you access and manage files in HDFS?


You can access the files and data stored in HDFS in many different ways. For example, you can use
HDFS FS Shell commands, leverage the Java API available in the classes of the org.apache.hadoop.fs
package, write a MapReduce job, or write Hive or Pig queries. In addition, you can even use a web
browser to browse the files from an HDFS cluster.

13. What two issues does HDFS encounter in Hadoop 1.0?


First, the name node in Hadoop 1.0 is a single point of failure. You can configure a secondary name
node, but it’s not an active-passive configuration. The secondary name node thus cannot be used for
failover in case the name node fails. Second, as the number of data nodes grows beyond 4,000, the
performance of the name node degrades, setting a kind of upper limit to the number of nodes in a
cluster.

14. What is a daemon?


The word daemon comes from the UNIX world. It refers to a process or service that runs in the
background. On a Windows platform, we generally refer to it as a service. For example, in HDFS, we
have daemons such as name node, data node, and secondary name node.

15. What is YARN and what does it do?


In Hadoop 2.0, MapReduce has undergone a complete overhaul, with a new layer created on top of
HDFS. This new layer, called YARN (Yet Another Resource Negotiator), takes care of two major
functions: resource management and application life-cycle management. The JobTracker previously
handled those functions. Now MapReduce is just a batch-mode computational layer sitting on top of
YARN, whereas YARN acts like an operating system for the Hadoop cluster by providing resource
management and application life-cycle management functionalities. This makes Hadoop a general-
purpose data processing platform that is not constrained only to MapReduce.

16. What is uber-tasking optimization?


The concept of uber-tasking in YARN applies to smaller jobs. Those jobs are executed in the same
container or in the same JVM in which that application-specific Application Master is running. The
basic idea behind uber-tasking optimization is that the distributed task allocation and management
overhead exceeds the benefits of executing tasks in parallel for smaller jobs, hence it is optimal to
execute smaller jobs in the same JVM or container of the Application Master.

17. What are the different components of YARN?


Aligning to the original master-slave architecture principle, even YARN has a global or master Resource
Manager for managing cluster resources and a per-node and -slave Node Manager that takes direction
from the Resource Manager and manages resources on the node. These two, form the computation
fabric for YARN. Apart from that is a per-application Application Master, which is merely an
application-specific library tasked with negotiating resources from the global Resource Manager and
coordinating with the Node Manager(s) to execute the tasks and monitor their execution. Containers
also are present—these are a group of computing resources, such as memory, CPU, disk, and network.
Second exam (Offered to João Homem)

Started on Friday, 29 January 2021, 6:25 PM


State Finished
Completed on Friday, 29 January 2021, 7:34 PM
Time taken 1 hour 9 mins
Grade Not yet graded

Question 1
Complete
Marked out of 2.00

You are hired by the Municipality of Terras do Ouro as Data Scientist, to help prevent the effects of flooding from the River Guardião. The
municipality has already distributed IoT devices across the river that are able to measure the flow of the water. In the kick-off meeting, you
say: "I have an idea for a possible solution. We may need to use a number of AWS services." And then you explain your solution. What do
you say?

You do not need to worry about compute resources if you use AWS Kinesis. This system will enable you to acquire data
through a gateway on the cloud that will receive your data; then you can process it using the AWS Lambda module and later
store the data in a storage system, e.g. AWS S3.

Question 2
Complete
Marked out of 2.00

You go to a conference and hear a speaker saying the following: "Real-time analytics is a key factor for digital transformation. Companies
everywhere are using their data lakes based on technologies such as Hadoop and HDFS to give key insights, in real time, of their sales and
customer preferences." You raise your hand for a comment. What are you going to say? Please justify carefully.

Hadoop is a framework that incorporates several technologies, including batch, streaming (which you mentioned) and others.
HDFS is Hadoop's file system.

Question 3
Complete
Marked out of 2.00

In a whiteboarding session, you say to the CDO of Exportera: "OK, I'll explain what the Hive Driver is and what are its key functions/roles."
What do you say, trying to be very complete and precise?

Hive is Hadoop's SQL analytics query framework. It is the same as saying Hive is a SQL abstraction layer over Hadoop
MapReduce with a SQL-like query engine. It enables several types of connections to other databases using the Hive driver.
The Hive Driver can, for instance, connect through ODBC to relational databases.

Question 4
Complete
Marked out of 2.00

On implementing Hadoop, the CDO is worried about understanding how a name node ensures that all the data nodes are functioning
properly. What can you tell him to reassure him?

Each data node in the cluster periodically sends heartbeat signals and a block-report to the name node. Receipt of a
heartbeat signal implies that the data node is active and functioning properly. A block-report from a data node contains a
list of all blocks on that specific data node.

Question 5
Complete
Marked out of 2.00

The CDO is considering using Amazon Kinesis Firehose to deliver data to Amazon S3. He is concerned about losing data if the data
delivery to the destination falls behind data writing to the delivery stream. Can you help him understand how this process works, in order
to alleviate his concerns?

Amazon Kinesis Firehose automatically scales to meet demand.


Question 6
Complete
Marked out of 2.00

The CDO of Exportera asks you to prepare the talking points for a presentation he must make to the board regarding a budget increase
for his team. It is important that the board members understand what the impacts of data variety versus data volume are. What can you
tell them?

Handling Big Data issues relates not only to data volume but also to other aspects, one of them being the variety of data types.
While volume can be handled by scaling the processing and storage resources, the issues related to variety require
customization of processing, namely interfaces, and programming of handlers. For instance, handling objects coming
from SQL sources or OSI-PI sources requires different types of programming skills. Not having the right skills to handle this
can ruin the ambition of storing different, and therefore rich, data.
To acquire these skills we need a budget increase in the department.

Question 7
Complete
Marked out of 2.00

The CDO of Exportera tells you: "From what I understand, in Hive, the partitioning of tables can only be made on the basis of one
parameter, making it not as useful as it could be." What can you answer him?

Hive partitioning can be made with more than one parameter from the original table. The partitioning can be made by
dividing the table into smaller tables, organized in folders, one for each combination of parameters. The partitioning can even
be further optimized with bucket files.

Question 8
Complete
Marked out of 2.00

The CDO of Exportera has been exploring stream processing and is worried about the latency of a solution based on a platform such as Kinesis Data Streams. To address the question, what can you tell him?

The Kinesis Data Streams platform is built on high-throughput, high-bandwidth components, so a record is typically available for reading well under a second after it is written. Kinesis also ensures durability and elasticity of the data acquired through streaming: the stream can be scaled up or down, so that you never lose data records before they expire (one-day retention by default).
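As a small illustration of the put-to-get path (boto3; the stream name is a placeholder and the stream is assumed to already exist with a single shard), a record written with put_record can be read back almost immediately:

    # Sketch: write one record to a Kinesis data stream and read it back,
    # measuring the put-to-get delay. The stream name is a placeholder and
    # the stream is assumed to already exist with a single shard.
    import time
    import boto3

    kinesis = boto3.client("kinesis", region_name="eu-west-1")
    stream = "exportera-demo-stream"

    t0 = time.time()
    resp = kinesis.put_record(
        StreamName=stream,
        Data=b'{"sensor": "flow-01", "value": 3.2}',
        PartitionKey="flow-01",
    )

    # Read back from the shard, starting at the record we just wrote.
    iterator = kinesis.get_shard_iterator(
        StreamName=stream,
        ShardId=resp["ShardId"],
        ShardIteratorType="AT_SEQUENCE_NUMBER",
        StartingSequenceNumber=resp["SequenceNumber"],
    )["ShardIterator"]

    records = kinesis.get_records(ShardIterator=iterator, Limit=1)["Records"]
    print("records read:", len(records),
          "put-to-get delay (s):", round(time.time() - t0, 3))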
Question 9
Complete
Marked out of 2.00

The CDO of Exportera is confused: "I've read about MapReduce, but I still do not fully understand it." Then you explain it very clearly to him with a small example: counting the words of the following text:

"Mary had a little lamb
Little lamb, little lamb
Mary had a little lamb
Its fleece was white as snow"

Justify carefully.

MapReduce is a programming model, or a processing technique, that can be applied in several contexts. Its implementation is usually divided into three steps: mapping the data items considered, shuffling them according to a pattern/directive, and then aggregating/reducing them.
Let us take the text above as a reference for the following example: we want to count the number of words in the text.
In the first step we would map all the words, one after the other, getting a list like:

"Mary
had
a
little
...
as
snow"

If one wants to count the words, a number should be associated with each word for later counting. So, still in the Map step, we can associate the value "1" with each word, creating key-value pairs of the form <"Mary",1>, where the word is the key:

<Mary,1>
<little,1>
...

In the next step one shuffles these key-value pairs so that all pairs with the same key are grouped together. Finally, the Reduce step sums the values: summing per key gives the count of each distinct word, and summing all the values gives the total number of words in the text.
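A plain-Python sketch of the same idea (no Hadoop involved; it only mirrors the map, shuffle, and reduce steps locally) could look like this:

    # Plain-Python sketch of the MapReduce word count described above.
    # No Hadoop involved; it only mirrors the map / shuffle / reduce steps.
    from collections import defaultdict

    text = """Mary had a little lamb
    Little lamb, little lamb
    Mary had a little lamb
    Its fleece was white as snow"""

    # Map: emit a <word, 1> pair for every word.
    mapped = [(word.strip(",").lower(), 1) for word in text.split()]

    # Shuffle: group the pairs by key (the word).
    groups = defaultdict(list)
    for word, one in mapped:
        groups[word].append(one)

    # Reduce: sum the values per key for per-word counts, and sum
    # everything for the total number of words.
    counts = {word: sum(ones) for word, ones in sorted(groups.items())}
    print(counts)                # e.g. {'a': 2, 'as': 1, 'fleece': 1, ...}
    print(sum(counts.values()))  # 20 words in total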

Question 10
Complete
Marked out of 2.00

In the Exportera cafeteria you hear someone saying the following: "Oh, processing data in motion is no different than processing data at rest." If you had to join the conversation, what would you say to them?

Processing data in motion (streaming) is quite different from processing data at rest. The main differences are:

1 - Analytics over streaming data must be done on the data as it arrives, not on data that is already at rest.
2 - Data in motion needs buffering to process the incoming volumes, while the equivalent processing of data at rest can be done through querying.
3 - The concept of continuously incoming data does not apply to data at rest; with streaming, one needs to adapt the acquisition to the rate of incoming data.
Of course there are also some similarities, namely in data quality control, such as the detection of common types of errors like invalid or missing values. Still, the processing (which is what we are talking about) is quite different in the two cases.



1 - You are hired by the Municipality of Terras do Ouro as Data Scientist, to help prevent the effects of flooding from the River Guardião. The municipality has already distributed IoT devices across the river that are able to measure the flow of the waters. In the kick-off meeting, you say: "I have an idea for a possible solution. We may need to use a number of AWS services.” And then you explain your solution. What do you say?

We can use Amazon Kinesis to process streaming data from IoT devices, in this case the devices that measure the river flow, and then use the data to send real-time alerts or take other actions when a device detects that certain water-level thresholds have been reached. We can also use AWS sample IoT analytics code to build our application.

Schematically:

INPUT - Water sensors send data to Amazon Kinesis Data Streams

AMAZON KINESIS DATA STREAMS - Ingests and stores the sensor data streams for processing

AWS LAMBDA - AWS Lambda is triggered and runs code to detect trends in the sensor data, identify water levels, and initiate alerts

OUTPUT - Alerts are received when the water level reaches a certain threshold, so actions can be taken.
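A minimal sketch of what the Lambda function in this pipeline might look like (the record format, threshold value, and SNS topic ARN are assumptions made for illustration):

    # Sketch of an AWS Lambda handler triggered by Kinesis Data Streams.
    # The JSON record format, threshold, and SNS topic ARN are assumptions
    # made for illustration only.
    import base64
    import json
    import boto3

    sns = boto3.client("sns")
    WATER_LEVEL_THRESHOLD = 4.5                       # metres, assumed
    ALERT_TOPIC_ARN = "arn:aws:sns:eu-west-1:123456789012:flood-alerts"

    def handler(event, context):
        for record in event["Records"]:
            # Kinesis delivers the payload base64-encoded.
            payload = json.loads(base64.b64decode(record["kinesis"]["data"]))
            level = payload.get("water_level")
            sensor = payload.get("sensor_id", "unknown")

            if level is not None and level >= WATER_LEVEL_THRESHOLD:
                sns.publish(
                    TopicArn=ALERT_TOPIC_ARN,
                    Subject="Flood alert",
                    Message=f"Sensor {sensor} reports water level {level} m",
                )
        return {"processed": len(event["Records"])}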

2 - You go to a conference and hear a speaker saying the following: "Real-time analytics is a key factor for digital transformation. Companies everywhere are using their datalakes based on technologies such as Hadoop and HDFS to give key insights, in real time, of their sales and customer preferences." You raise your hand for a comment. What are you going to say? Please, justify carefully.

As mentioned, real-time data analytics is a key factor in digital transformation, since it combines
the power of parallel processing with the value of real-time data. For most companies, having
data means having access to wealth.
Datalakes are central repositories of data in a natural or raw format, having been pulled from a
variety of sources. With the datalakes, users can extract structured metadata from unstructured
data on a regular basis and store it in the operational data lake for quick and easy querying, thus
enabling better real-time data analysis. Hadoop’s ability to efficiently process large volumes of
data in parallel provides great benefits, but there are also a number of use cases that require more
“real time” processing of data—processing the data as it arrives, rather than through batch
processing. Fortunately, this need for more real-time processing is being addressed with the
integration of new tools into the Hadoop ecosystem.

3 - In a whiteboarding session, you say to the CDO of Exportera: "OK, I'll explain what the Hive
Driver is and what are its key functions/roles." What do you say, trying to be very complete and
precise?

The Hive Driver, Compiler, Optimizer, and Executor work together to turn a query into a set of Hadoop jobs. The Driver acts like a controller that receives the HiveQL statements. It starts the execution of a statement by creating sessions, and it monitors the life cycle and progress of the execution. The Driver stores the necessary metadata generated during the execution of a HiveQL statement, and it also acts as the collection point for the data or query result obtained after the Reduce operation.

4 - On implementing Hadoop, the CDO is worried about understanding how a name node ensures that all the data nodes are functioning properly. What can you tell him to reassure him?

With HDFS, data is written on the server once, and read and reused numerous times after that.
The NameNode is the master node in the Apache Hadoop HDFS Architecture, which keeps track
of where file data is kept in the cluster and maintains and manages the blocks present on the
DataNodes. Data is broken down into separate blocks and distributed among the various
DataNodes for storage.
The DataNodes are in constant communication with the NameNode to determine if the
DataNodes need to complete specific tasks. Consequently, the NameNode is always aware of the
status of each DataNode. If the NameNode realizes that one DataNode isn't working properly, it
can immediately reassign that DataNode's task to a different node containing the same data block.
DataNodes also communicate with each other, which enables them to cooperate during normal
file operations.
5 - The CDO is considering using Amazon Kinesis Firehose to deliver data to Amazon S3. He is concerned about losing data if the data delivery to the destination is falling behind data writing to the delivery stream. Can you help him understand how this process works, in order to alleviate his concerns?

If Kinesis Data Firehose encounters errors while delivering or processing data, it retries until the
configured retry duration expires. If the retry duration ends before the data is delivered
successfully, Kinesis Data Firehose backs up the data to the configured S3 backup bucket. If the
destination is Amazon S3 and delivery fails or if delivery to the backup S3 bucket fails, Kinesis Data
Firehose keeps retrying until the retention period ends.

6 - The CDO of Exportera asks you to prepare the talking points for a presentation he must make to the board regarding a budget increase for his team. It is important that the board members understand what data variety versus data variability is, as well as their impacts on the analytics platforms (and, of course, the business benefits of addressing them). What do you write to make it crystal clear?

Big Data, as defined by the National Institute of Standards and Technology, consists of extensive datasets - primarily in the characteristics of volume, velocity, variety, and/or variability - that require a scalable architecture for efficient storage, manipulation, and analysis.
So, while many other characteristics have been attributed to Big Data, only the above four drive
the shift to new scalable architectures for data-intensive applications in order to achieve cost-
effective performance. These characteristics have the following definitions:
- Volume: the size of the dataset
- Velocity: rate of flow
- Variety: data from multiple repositories, domains, or types. Note that while volume and
velocity allow faster and more cost-effective analytics, it is the variety of data that allows
analytic results that were never possible before. Business benefits are frequently higher
when addressing the variety of data than when addressing volume.
- Variability: the changes in dataset, whether data flow rate, format/structure, semantics,
and/or quality that impact the analytics application. Variability in data volumes implies the
need to scale-up or scale-down virtualized resources to efficiently handle the additional
processing load, one of the advantageous capabilities of cloud computing.
7 - The CDO of Exportera tells you: “From what I understand, in Hive, the partitioning of tables
can only be made on the basis of one parameter, making it not as useful as it could be." What
can you answer him?

Partitioning in Hive is used to increase query performance. It is done by using the PARTITIONED
BY clause in the create table statement. A table can be partitioned on the basis of one or more
columns. It is a way of dividing a table into related parts based on the values of columns like date,
city, and department. Each table in Hive can have one or more partition keys to identify a
particular partition. The columns on which partitioning is done cannot be included in the data
table.

8 - The CDO of Exportera has been exploring stream processing and is worried about the latency of a solution based on a platform such as Kinesis Data Streams. To address the question, what can you tell him?

Latency is a measure of delay. Amazon Kinesis Data Streams is specifically used to build custom real-time applications. You can use Amazon Kinesis Data Streams to collect and process large streams of data records in real time. One of the advantages of AWS Kinesis is that it enables us to ingest, buffer, and process streaming data in real time, so as to derive insights in seconds or minutes instead of hours or days.

9 - The CDO of Exportera is confused: "I've read about MapReduce, but I still do not fully understand it." Then you explain it very clearly to him with a small example: counting the words of the following text:

"Mary had a little lamb


Little lamb, little lamb
Mary had a little lamb
Its fleece was white as snow"

Justify carefully.

Essentially, MapReduce divides a computation into three sequential stages: map, shuffle and
reduce. Below is how the MapReduce word count program executes and outputs the number of
occurrences of a word in any given input file.
- Mapper Phase: the text from the input text file will be split into individual tokens, i.e. words,
to form a key value pair with all the words present in the input text file. The key is the word
from the input file and the value is ‘1’. In this case, the entire text will be split into 20 tokens
(one for each word) with a value 1 as shown below:

(mary,1)
(had,1)
(a,1)
(little,1)
(lamb, 1)
(little,1)
(lamb,1)
(little,1)
(lamb,1)
(mary,1)
(had,1)
(a,1)
(little,1)
(lamb, 1)
(its, 1)
(fleece, 1)
(was, 1)
(white, 1)
(as, 1)
(snow, 1)

- Shuffle Phase: after the map phase execution is completed, shuffle phase is executed
automatically wherein the key-value pairs generated in the map phase are taken as input and
then sorted in alphabetical order. The output will look like this:

(a,1)
(a,1)
(as, 1)
(fleece, 1)
(had,1)
(had,1)
(its, 1)
(lamb, 1)
(lamb, 1)
(lamb,1)
(lamb,1)
(little,1)
(little,1)
(little,1)
(little,1)
(mary,1)
(mary,1)
(snow, 1)
(was, 1)
(white, 1)

- Reduce Phase: this is like an aggregation phase for the keys generated by the map phase. The
reducer phase takes the output of shuffle phase as input and then reduces the key-value pairs to
unique keys with values added up. In our example:

(a,2)
(as, 1)
(fleece, 1)
(had,2)
(its, 1)
(lamb, 4)
(little, 4)
(mary,2)
(snow, 1)
(was, 1)
(white, 1)

10 - In the Exportera cafeteria you hear someone saying the following: "Oh, processing data in motion is no different than processing data at rest." If you had to join the conversation, what would you say to them?

Data at rest refers to data that has been collected from various sources and is analysed after the event occurs. The point where the data is analysed and the point where action is taken on it occur at two separate times.
The collection process for data in motion is similar to that of data at rest. However, the difference lies in the analytics. In this case, the analytics occur in real time, as the event happens. Data in motion is processed and analysed in real time, or near real time, and must be handled in a very different way than data at rest (i.e., persisted data). Data in motion tends to resemble event-processing architectures and focuses on real-time or operational intelligence applications.
For data at rest, a batch processing method would most likely be used. There is no need for "always on" infrastructure. This approach provides access to high-performance processing capabilities as needed. For data in motion, you would want to use a real-time processing method. In this case, latency becomes a key consideration because a lag in processing could result in a missed opportunity to improve business results.
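To make the contrast concrete, here is a toy sketch in plain Python (illustrative only, with no real storage or streaming system): the batch function has the whole dataset at rest and answers in one pass, while the streaming class must update its answer incrementally as each record arrives.

    # Toy contrast between processing data at rest and data in motion.
    # Pure Python, for illustration only.

    def batch_average(readings):
        # Data at rest: the full dataset is available, so one query-style
        # pass over it is enough.
        return sum(readings) / len(readings)

    class StreamingAverage:
        # Data in motion: records arrive one by one, so the state must be
        # updated incrementally and an answer must be available at any time.
        def __init__(self):
            self.count = 0
            self.total = 0.0

        def update(self, value):
            self.count += 1
            self.total += value
            return self.total / self.count   # current answer, low latency

    # Usage
    stored = [3.1, 3.4, 2.9, 3.8]
    print(batch_average(stored))             # 3.3, computed after the fact

    live = StreamingAverage()
    for value in [3.1, 3.4, 2.9, 3.8]:
        print(live.update(value))            # answer refreshed per record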
