Dashboard / My courses / 2021-2022 / 2º Ciclo / Pós-Graduações / Outono / BDF-200209-202122-S1 / Exams / First Exam Version 5
Information
Introduction
The exam is 75 minutes long and consists of yes/no and multiple-choice questions as well as open-ended questions.
The rating for yes/no and multiple-choice questions is 0.5 for a correct answer, 0 if the question is not answered, and -0.25 if the
answer is incorrect.
For open-ended questions, the rating is indicated on the left side of the question. Try to be precise in your answers.
The only permitted reference material is an A4 sheet of paper handwritten on both sides.
Any irregularity regarding reference materials or otherwise may result in the exam being cancelled.
Question 1
Complete
The need for real-time data processing, even in the presence of large data volumes, drives a different type of architecture where the data
is not stored, but is processed typically in memory.
Select one:
a. False
b. N/R
c. True
Question 2
Complete
Select one:
a. False
b. N/R
c. True
Question 3
Complete
You are working for a large telecom provider who has chosen the AWS platform for its data and analytics needs. It has agreed to using a
data lake and S3 as the platform of choice for the data lake. The company is getting data generated from DPI (deep packet inspection)
probes in near real time and looking to ingest it into S3 in batches of 100 MB or 2 minutes, whichever comes first. Which of the following
is an ideal choice for the use case without any additional custom implementation?
Select one:
a. Amazon Kinesis Data Streams
b. N/R
c. Amazon S3 Select
Question 4
Complete
We can configure, in Kinesis Data Firehose, the values for the Amazon S3 buffer size (from 1 MB to 128 MB) or buffer interval (from 60
seconds to 900 seconds). The condition satisfied first triggers data delivery to Amazon S3.
Select one:
a. False
b. True
c. N/R
Question 5
Complete
Select one:
a. True
b. False
c. N/R
Question 6
Complete
With ORC, a file with data in tabular format can be split into multiple smaller stripes where each stripe comprises a subset of columns
from the original file.
Select one:
a. False
b. True
c. N/R
Question 7
Complete
YARN monitors workloads, in a secure single tenant environment, while ensuring high availability across multiple Hadoop clusters.
Select one:
a. False
b. True
c. N/R
Question 8
Complete
Select one:
a. True
b. False
c. N/R
Question 9
Complete
In HDFS, Hive tables are nothing more than directories containing files.
Select one:
a. True
b. False
c. N/R
Question 10
Complete
Big Data frameworks are frequently designed to take advantage of data locality on each node when distributing the processing, aiming
to maximize the movement of data between nodes.
Select one:
a. True
b. N/R
c. False
Question 11
Complete
One of the reasons you may want to convert data from CSV to Parquet before querying it with a service such as Athena is to save on
costs.
Select one:
a. True
b. N/R
c. False
Question 12
Complete
Kinesis Data Streams is not a good choice for long-term data storage and analytics.
Select one:
a. N/R
b. False
c. True
Question 13
Complete
Select one:
a. True
b. N/R
c. False
Question 14
Complete
We can say internal tables in Hive adhere to a principle called “schema on data”.
Select one:
a. False
b. True
c. N/R
Question 15
Complete
While the data may have high veracity (accurate representation of the real-world processes that created it), there are times when the data
is no longer valid for the hypothesis being asked.
Select one:
a. True
b. False
c. N/R
Question 16
Complete
Velocity drives the need for processing and storage parallelism, and its management during processing of large datasets.
Select one:
a. False
b. N/R
c. True
Question 17
Complete
The Checkpoint process is when the NameNode checks if the DataNodes are active.
Select one:
a. False
b. N/R
c. True
Question 18
Complete
HDFS is optimized to store very large amounts of highly mutable data with files being typically accessed in long sequential scans.
Select one:
a. N/R
b. False
c. True
Question 19
Complete
In HDFS, a client writes to the first DataNode, and then receives an ACK (Acknowledge) message confirming the correct storage of the
block. After receiving that message, it writes to the second DataNode, and the process is repeated until the correct storage in the third
DataNode. In this way, HDFS guarantees the safe storage of the block and its replicas.
Select one:
a. True
b. False
c. N/R
Question 20
Complete
Which of the following is a valid mechanism to do data transformations from Amazon Kinesis Firehose?
Select one:
a. Amazon Lambda
b. AWS Glue
c. Amazon SageMaker
d. Amazon Athena
e. N/R
Question 21
Complete
The selection of the partition key is always a key factor for performance. It should always be a prime number, and part of a key-pair
determined by a hash function, to avoid so many sub-directories overhead.
Select one:
a. N/R
b. True
c. False
Question 22
Complete
To partition the table customer by country we use the following HiveQL statement
CREATE TABLE customer (id STRING, name STRING, gender STRING, state STRING) PARTITIONED BY (country STRING);
Select one:
a. True
b. N/R
c. False
Question 23
Complete
One of the key design principles of HDFS is that it should favor high sustained bandwidth over low latency random access.
Select one:
a. True
b. False
c. N/R
Question 24
Complete
Select one:
a. False
b. True
c. N/R
Question 25
Complete
You are working as a consultant for a telecommunications company. The data scientists have requested direct access to the data to dive
deep into the structure of the data and build models. They have good knowledge of SQL. Which tool or tools will you choose to provide
them with direct access to the data and reduce the infrastructure and maintenance overhead while ensuring that access to data on
Amazon S3 can be provided? Be careful with your answer.
Amazon Redshift uses SQL to analyze structured and semi-structured data across data warehouses, operational databases, and data lakes,
using AWS-designed hardware and machine learning to deliver the best price performance at any scale.
Redshift lets us easily save the results of our queries back to our S3 data lake using open formats like Apache Parquet, to further analyze
them from other analytics services like Amazon EMR, Amazon Athena, and Amazon SageMaker.
Question 26
Complete
In a whiteboarding session, you say to the CDO of Exportera: "OK, I'll explain what the Hive Driver is and what are its key functions/roles."
What do you say, trying to be very complete and precise?
The Hive Driver, Compiler, Optimizer and Executor work together to turn a query into a set of Hadoop jobs. The Driver acts like a
controller which receives the HiveQL statements. The Driver starts the execution of a statement by creating sessions, and it monitors the
life cycle and progress of the execution. The Driver stores the necessary metadata generated during the execution of a HiveQL statement.
It also acts as a collection point of the data or query result obtained after the Reduce operation.
Question 27
Complete
The CDO is considering using Amazon Kinesis Firehose to deliver data to Amazon S3. He is concerned about losing data if the data
delivery to the destination falls behind data writing to the delivery stream. Can you help him understand how this process works, in
order to alleviate his concerns?
If Kinesis Data Firehose encounters errors while delivering or processing data, it retries until the configured retry duration expires. If the
retry duration ends before the data is delivered successfully, Kinesis Data Firehose backs up the data to the configured S3 backup bucket.
If the destination is Amazon S3 and delivery fails or if delivery to the backup S3 bucket fails, Kinesis Data Firehose keeps retrying until the
retention period ends.
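The retry-then-backup behaviour described in this answer can be sketched as a small toy model in Python. This is not the AWS API: `deliver_with_backup`, `FlakySink` and the retry budget (standing in for the configured retry duration) are all made up for illustration.

```python
def deliver_with_backup(record, deliver, backup, max_retries=3):
    """Toy model of Kinesis Data Firehose error handling: keep retrying
    delivery until the retry budget is exhausted, then back the record up
    to a simulated S3 backup bucket."""
    for _ in range(max_retries + 1):
        if deliver(record):          # delivery succeeded within the retry window
            return "delivered"
    backup(record)                   # retry window exhausted: back up to S3
    return "backed_up"

class FlakySink:
    """Simulated destination that fails a given number of times, then succeeds."""
    def __init__(self, failures):
        self.failures = failures
    def __call__(self, record):
        if self.failures > 0:
            self.failures -= 1
            return False
        return True
```

For example, a destination that fails twice still gets the record within the retry budget, while one that keeps failing sees the record land in the backup bucket instead of being lost.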
Question 28
Complete
You are hired by the Municipality of Terras do Ouro as Data Scientist, to help prevent the effects of flooding from the River Guardião. The
municipality has already distributed IoT devices along the river that are able to measure the flow of the water. In the kick-off meeting, you
say: "I have an idea for a possible solution. We may need to use a number of AWS services." And then you explain your solution. What do
you say?
We can use Amazon Kinesis to process streaming data from IoT devices, in this case, the devices that measure the river flow, then use the
data to send real-time alerts or take other actions when a device detects certain water level thresholds. We can also use AWS sample IoT
analytics code, to build our application.
Schematically:
- INPUT - Water sensors send data to Amazon Kinesis Data Streams
- AMAZON KINESIS DATA STREAMS - Ingests and stores sensor data streams for processing
- AWS LAMBDA - AWS Lambda is triggered and runs code to detect trends in sensor data, identify water levels and initiate alerts
- OUTPUT - Alerts are received when the water level reaches a certain threshold, so actions can be taken.
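The Lambda step of this pipeline can be sketched as a plain Python function (names, units and the threshold are illustrative only, not part of any AWS SDK):

```python
def detect_alerts(readings, threshold):
    """Toy stand-in for the Lambda function in the pipeline above: scan a
    batch of (sensor_id, water_level) records pulled from the stream and
    emit an alert for each reading at or above the threshold."""
    return [
        {"sensor": sensor_id, "level": level}
        for sensor_id, level in readings
        if level >= threshold
    ]
```

In the real solution this logic would run inside the Lambda function invoked with each batch of Kinesis records, and the returned alerts would be pushed to a notification service.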
Question 29
Complete
The CDO of Exportera is confused: "I've read about MapReduce, but I still do not fully understand it." Can you explain it very clearly to him
with a small example: counting the words of the following text:
"Mary had a little lamb, little lamb, little lamb, Mary had a little lamb, its fleece was white as snow"
Justify carefully.
Essentially, MapReduce divides a computation into three sequential stages: map, shuffle and reduce. Below is how the MapReduce word
count program executes and outputs the number of occurrences of a word in any given input file.
- Mapper Phase: the text from the input text file will be split into individual tokens, i.e. words, forming a key-value pair for each word
present in the input text file. The key is the word from the input file and the value is 1. In this case, the entire sentence will be split into 20
tokens (one for each word), each with a value 1, as shown below:
(mary,1)
(had,1)
(a,1)
(little,1)
(lamb, 1)
(little,1)
(lamb,1)
(little,1)
(lamb,1)
(mary,1)
(had,1)
(a,1)
(little,1)
(lamb, 1)
(its, 1)
(fleece, 1)
(was, 1)
(white, 1)
(as, 1)
(snow, 1)
- Shuffle Phase: after the map phase execution is completed, shuffle phase is executed automatically wherein the key-value pairs
generated in the map phase are taken as input and then sorted in alphabetical order. The output will look like this:
(a,1)
(a,1)
(as, 1)
(fleece, 1)
(had,1)
(had,1)
(its, 1)
(lamb, 1)
(lamb, 1)
(lamb,1)
(lamb,1)
(little,1)
(little,1)
(little,1)
(little,1)
(mary,1)
(mary,1)
(snow, 1)
(was, 1)
(white, 1)
Reduce Phase: this is like an aggregation phase for the keys generated by the map phase. The reducer phase takes the output of shuffle
phase as input and then reduces the key-value pairs to unique keys with values added up. In our example:
(a,2)
(as, 1)
(fleece, 1)
(had,2)
(its, 1)
(lamb, 4)
(little, 4)
(mary,2)
(snow, 1)
(was, 1)
(white, 1)
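The three phases of this example can be reproduced in a few lines of Python. This is a single-process sketch of the MapReduce idea, not Hadoop itself: the function names are ours, and the sort stands in for the distributed shuffle.

```python
from itertools import groupby

def map_phase(text):
    # Mapper: emit a (word, 1) pair for every token in the input text.
    return [(word, 1) for word in text.lower().split()]

def shuffle_phase(pairs):
    # Shuffle: sort the pairs by key so identical keys become adjacent groups.
    return sorted(pairs)

def reduce_phase(sorted_pairs):
    # Reducer: sum the values for each group of identical keys.
    return {
        key: sum(count for _, count in group)
        for key, group in groupby(sorted_pairs, key=lambda kv: kv[0])
    }

verse = ("mary had a little lamb little lamb little lamb "
         "mary had a little lamb its fleece was white as snow")
counts = reduce_phase(shuffle_phase(map_phase(verse)))
```

Running this on the verse yields the same per-word totals as the reduce phase above (e.g. 4 for "little" and "lamb", 2 for "mary", "had" and "a").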
Question 30
Complete
An upcoming gaming startup is collecting gaming logs from its recently launched and hugely popular game. The logs are arriving in JSON
format with 500 different attributes for each record. The CMO has requested a dashboard based on six attributes that indicate the revenue
generated based on the in-game purchase recommendations as generated by the marketing department's ML team. The data is on S3 in
raw JSON format, and a report is being generated using a dashboard platform on the data available. Currently the report creation takes an
hour, whereas publishing the report is very quick. Furthermore, the IT department has complained about the cost of data scans on S3.
They have asked you as a solutions architect to provide a solution that improves performance and optimizes the cost. What would you
suggest? Be careful with your answer.
The main aspect of optimizing storage costs is to match the data with the correct storage class based on how the data is utilized. The
best way to do this is to examine your data and determine how frequently it is accessed, to decide whether you need S3 Standard. Glacier
storage is a great option if you are looking for long-term storage.
25. We can say internal tables in Hive adhere to a principle called "data on schema". (T, W5, 25)
Quiz 2
[Fig. 1: a log analytics architecture - image not included in this extract]
[Fig. 2: an alert solution - image not included in this extract]
By default, the Hive query execution engine processes one column of a table at a time. (F)
Clickstream data from web applications can be collected directly in a data lake and a portion of that data
can be moved out to a data warehouse for daily reporting. We think of this concept as inside-out data
movement. (T)
Consider the scenario in Fig. 1, for a log analytics solution. The web server in this example is an
Amazon Elastic Compute Cloud (EC2) instance. In step 1, Amazon Kinesis Firehose will continuously pull
log records from the Apache Web Server. (F)
Data storage formats like ORC and Parquet rely on metadata which describes a set of values in a section
of the data, called a stripe. If, for example, the user is interested in values < 10 and the metadata says all
the data in this stripe is between 20 and 30, the stripe is not relevant to the query at all, and the query
can skip over it. (T)
In Athena, if your files are too large or not splittable, parallelism can be limited due to query processing
halting until one reader has finished reading the complete file. (T)
In Athena, to change the name of Table1 to Table2, we would use the following instruction, as we would
in Hive: ALTER TABLE Table1 RENAME TO Table2; (F)
Fig. 1 - In step 2, also, the Amazon Kinesis Analytics application will continuously run a Presto script against
the streaming input data. (F)
Fig. 1 - In step 2, Amazon Kinesis Firehose will write each log record to Amazon Simple Storage Service
(Amazon S3) for durable storage of the raw log data. (T)
Fig. 1 - In step 3, the Amazon Kinesis Analytics application will create an aggregated data set every minute
and output that data to a second Firehose delivery stream. (T)
Kinesis Analytics outputs its results to Kinesis Streams or Kinesis Firehose. (T)
Kinesis Data Firehose can convert a stream of JSON data to Apache Parquet or Apache ORC. (T)
Kinesis Data Streams is a good choice for long-term data storage and analytics. (F)
One of the best use cases for AWS Glue, since it is a fully managed service, is if you require extensive
configuration changes from it. (F)
One of the characteristics of a serverless architecture, such as Athena's, is that, as the name implies,
there are no servers provisioned to begin with, and it is up to the user to completely provision all servers
and services. (F)
Presto is an open-source distributed SQL query engine optimized for batch ETL type jobs. (F)
Fig. 2 - Regarding Fig. 2, the box marked with an X refers to an Amazon Kinesis Analytics service. (T)
The basic idea of vectorized query execution is to process a batch of columns as an array of line vectors.
(F)
Fig. 1 - We can configure, in Step 1, each shard in Amazon Kinesis Firehose to ingest up to 1 MB/sec and
1000 records/sec, and emit up to 2 MB/sec. (F)
You can use precisely the same set of tools to collect, prepare, and process real-time streaming data as
those tools that you have traditionally used for batch analytics. That's what is the basic premise of
Lambda architecture. (F)
You may copy the product catalog data stored in your database to your search service to make it easier
to look through your product catalog and offload the search queries from the database. We think of this
concept as data movement outside-in. (F)
Quiz 2 Questions
Fig. 2: An alert solution
17. Regarding Fig. 2, the box marked with an X refers to an Amazon Kinesis Analytics service. (T)
18. Kinesis Analytics outputs its results to Kinesis Streams or Kinesis Firehose. (T)
19. AWS Lambda polls the stream periodically (once per second) for new records. When it detects new
records, it invokes the Lambda function by passing the new records as a parameter. If no new records
are detected, the Lambda function is not invoked. (T)
20. DynamoDB tables do not have fixed schemas, but all items must have a similar number of attributes. (F)
21. The main drawback of Kinesis Data Firehose is that to scale up or down you need to manually provision
servers using the "AWS Kinesis Data Firehose Scaling API". (F)
22. Kinesis Data Firehose can convert a stream of JSON data to Apache Parquet or Apache ORC. (T)
23. Since it uses SQL as query language, Amazon Athena is a relational/transactional database. (F)
24. AWS Glue ETL jobs are Spark-based. (T)
25. Kinesis Data Streams is a good choice for long-term data storage and analytics. (F)
26. Amazon OpenSearch/Elasticsearch stores CSV documents. (F)
27 – One of the drawbacks of Amazon Kinesis Firehose is that it is unable to encrypt the data, limiting its
usability for sensitive applications.
(2020)
FALSE
28 – Stream processing applications process data continuously in real time, usually after a store operation
(on HDD or SSD).
(2020)
FALSE
29 – You cannot use streaming data services for real-time applications such as application monitoring,
fraud detection, and live leaderboards, because these use cases require millisecond end-to-end
latencies—from ingestion, to processing, all the way to emitting the results to target data stores and
other systems.
(2020)
FALSE
29 – Kinesis Firehose is a fully managed service that automatically scales to match the throughput of
the data and requires no ongoing administration.
(2020)
TRUE
2
23 – Since it uses SQL as query language, Amazon Athena is a relational/transactional database.
(2021)
FALSE
25 – Kinesis Data Streams is a good choice for long-term data storage and analytics.
(2021)
FALSE
In terms of storage, what does a name node contain and what do data nodes contain?
HDFS stores and maintains file system metadata and application data separately. The name node
(master of HDFS) contains the metadata related to the file system (information about each file, as
well as the history of changes to the file metadata). Data nodes (slaves of HDFS) contain application
data in a partitioned manner for parallel writes and reads.
The name node holds the entire metadata, called the namespace (a hierarchy of files and directories),
in physical memory, for quicker response to client requests. This is called the fsimage. Any changes are
recorded in a transactional file called the edit log. For persistence, both of these files are written to host OS
drives. The name node simultaneously responds to the multiple client requests (in a multithreaded
system) and provides information to the client to connect to data nodes to write or read the data.
While writing, a file is broken down into multiple chunks called blocks (64 MB by default in Hadoop 1.x, 128 MB in later versions). Each
block is stored as a separate file on data nodes. Based on the replication factor of a file, multiple
copies or replicas of each block are stored for fault tolerance.
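The block-and-replica arithmetic described above can be sketched in a few lines of Python. This is a back-of-the-envelope illustration, not HDFS code: the 64 MB block size and replication factor 3 are the classic Hadoop 1.x defaults, and real placement also involves rack awareness, which this toy ignores.

```python
import math

def plan_blocks(file_size_bytes, block_size=64 * 1024 * 1024, replication=3):
    """How many blocks a file becomes, and how many physical copies HDFS
    stores for it given the replication factor."""
    n_blocks = max(1, math.ceil(file_size_bytes / block_size))
    return {"blocks": n_blocks, "stored_copies": n_blocks * replication}

plan = plan_blocks(200 * 1024 * 1024)   # a 200 MB file: 4 blocks, 12 copies
```

A 200 MB file, for instance, occupies four 64 MB blocks (the last one partially filled), and with the default replication factor the cluster stores twelve block files in total.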
What were the key design strategies for HDFS to become fault tolerant?
HDFS is a highly scalable, distributed, load-balanced, portable, and fault-tolerant storage component
of Hadoop (with built-in redundancy at the software level).
When HDFS was implemented originally, certain assumptions and design goals were discussed:
Horizontal scalability—Based on the scale-out model. HDFS can run on thousands of nodes.
Write once and read many times—Based on a concept of write once, read multiple times, with an
assumption that once data is written, it will not be modified. Its focus is thus retrieving the data in
the fastest possible way.
Capability to handle large data sets and streaming data access—Targeted to small numbers of very
large files for the storage of large data sets.
Data locality—Every slave node in the Hadoop cluster has a data node (storage component) and a
JobTracker (processing component). Processing is done where data exists, to avoid data movement
across nodes of the cluster.
HDFS file system namespace—Uses a traditional hierarchical file organization in which any user or
application can create directories and recursively store files inside them.
The process of generating a new fsimage by merging transactional records from the edit log to the
current fsimage is called checkpoint. The secondary name node periodically performs a checkpoint
by downloading fsimage and the edit log file from the name node and then uploading the new
fsimage back to the name node. The name node performs a checkpoint upon restart (not
periodically, though—only on name node start-up).
By default, three copies, or replicas, of each block are placed, per the default block placement policy
mentioned next. The objective is a properly load-balanced, fast-access, fault-tolerant file system:
The first replica is written to the data node creating the file.
The second replica is written to another data node within the same rack.
The third replica is written to a data node in a different rack.
How does a name node ensure that all the data nodes are functioning properly?
Each data node in the cluster periodically sends heartbeat signals and a block-report to the name
node. Receipt of a heartbeat signal implies that the data node is active and functioning properly. A
block-report from a data node contains a list of all blocks on that specific data node.
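The heartbeat bookkeeping on the name node side can be sketched as a toy function (the timeout value here is illustrative, not the real HDFS stale/dead-node default):

```python
def live_data_nodes(last_heartbeat, now, timeout=30):
    """Toy name-node view of the mechanism above: a data node counts as
    alive if its last heartbeat arrived within `timeout` seconds of `now`.
    `last_heartbeat` maps node id -> timestamp of the last heartbeat."""
    return {node for node, t in last_heartbeat.items() if now - t <= timeout}
```

A node whose heartbeats stop arriving simply drops out of the live set, at which point the real name node would schedule re-replication of that node's blocks.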
Data locality is the concept of processing data locally wherever possible. This concept is central to
Hadoop, a platform that intentionally attempts to minimize the amount of data transferred across
the network by bringing the processing to the data instead of the reverse.
Review Questions 2
Answers
created without a block size specification. When creating a file, the client can also specify a
block size specification to override the cluster-wide configuration.
7. How does a client ensure that the data it receives while reading is not corrupted? Is there
a way to recover an accidentally deleted file from HDFS?
When writing blocks of a file, an HDFS client computes the checksum of each block of the file
and stores these checksums in a separate hidden file in the same HDFS file system
namespace. Later, while reading the blocks, the client references these checksums to verify
that these blocks were not corrupted (corruption might happen because of faults in a
storage device, network transmission faults, or bugs in the program). When the client
realizes that a block is corrupted, it reaches out to another data node that has the replica of
the corrupted block, to get another copy of the block.
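The write-then-verify checksum flow described above can be sketched in Python. Note the hedge in the comments: HDFS actually uses CRC32 checksums per chunk, stored in hidden files; SHA-256 and the dict layout here are purely illustrative.

```python
import hashlib

def block_checksum(data):
    # Real HDFS uses CRC32 per chunk; SHA-256 here is just for illustration.
    return hashlib.sha256(data).hexdigest()

def write_block(data):
    # On write, the client stores a checksum alongside the block.
    return {"data": data, "checksum": block_checksum(data)}

def read_block(stored):
    # On read, the client recomputes the checksum; a mismatch means the
    # block is corrupted and a replica should be fetched from another data node.
    if block_checksum(stored["data"]) != stored["checksum"]:
        raise IOError("block corrupted: fetch a replica from another data node")
    return stored["data"]
```

Flipping a single byte of the stored data makes the read fail, which is exactly the signal the client uses to fall back to another replica.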
8. How can you access and manage files in HDFS?
You can access the files and data stored in HDFS in many different ways. For example, you
can use HDFS FS Shell commands, leverage the Java API available in the classes of the
org.apache.hadoop.fs package, write a MapReduce job, or write Hive or Pig queries. In
addition, you can even use a web browser to browse the files from an HDFS cluster.
9. What two issues does HDFS encounter in Hadoop 1.0?
First, the name node in Hadoop 1.0 is a single point of failure. You can configure a secondary
name node, but it's not an active-passive configuration. The secondary name node thus
cannot be used for failover, in case the name node fails. Second, as the number of data nodes
grows beyond 4,000, the performance of the name node degrades, setting a kind of upper
limit to the number of nodes in a cluster.
10. What is a daemon?
The word daemon comes from the UNIX world. It refers to a process or service that runs in
the background. On a Windows platform, we generally refer to it as a service. For
example, in HDFS, we have daemons such as name node, data node, and secondary name
node.
11. What is YARN and what does it do?
In Hadoop 2.0, MapReduce has undergone a complete overhaul, with a new layer created on
top of HDFS. This new layer, called YARN (Yet Another Resource Negotiator), takes care of
two major functions: resource management and application life-cycle management. The
JobTracker previously handled those functions. Now MapReduce is just a batch-mode
computational layer sitting on top of YARN, whereas YARN acts like an operating system for
the Hadoop cluster by providing resource management and application life-cycle
management functionalities. This makes Hadoop a general-purpose data processing
platform that is not constrained only to MapReduce.
12. What is uber-tasking optimization?
The concept of uber-tasking in YARN applies to smaller jobs. Those jobs are executed in the
same container or in the same JVM in which that application-specific Application Master is
running. The basic idea behind uber-tasking optimization is that the distributed task
allocation and management overhead exceeds the benefits of executing tasks in parallel for
smaller jobs, hence it is optimal to execute smaller jobs in the same JVM or container as the
Application Master.
13. What are the different components of YARN?
Aligning to the original master-slave architecture principle, YARN has a global or master
Resource Manager for managing cluster resources and a per-node (worker) Node Manager
that takes direction from the Resource Manager and manages resources on the node. These
two form the computation fabric for YARN. Apart from that is a per-application Application
Master, which is merely an application-specific library tasked with negotiating resources
from the global Resource Manager and coordinating with the Node Manager(s) to execute
the tasks and monitor their execution. Containers also are present—these are a group of
computing resources, such as memory, CPU, disk, and network.
BDF2020 Review Questions 1
2. In terms of storage, what does a name node contain and what do data nodes contain?
• HDFS stores and maintains file system metadata and application data separately. The name node
(master of HDFS) contains the metadata related to the file system (information about each file, as
well as the history of changes to the file metadata). Data nodes (workers of HDFS) contain
application data in a partitioned manner for parallel writes and reads.
• The name node holds the entire metadata, called the namespace (a hierarchy of files and directories),
in physical memory, for quicker response to client requests. This is called the fsimage. Any changes
are recorded in a transactional file called the edit log. For persistence, both of these files are written to host
OS drives. The name node simultaneously responds to the multiple client requests (in a
multithreaded system) and provides information to the client to connect to data nodes to write
or read the data. While writing, a file is broken down into multiple chunks of 128 MB by default
(64 MB in older versions), called blocks. Each block is stored as a separate file on data nodes. Based on
the replication factor of a file, multiple copies or replicas of each block are stored for fault
tolerance.
5. What is client-side caching, and what is its significance when writing data to HDFS?
HDFS uses several optimization techniques. One is to use client-side caching, by the HDFS client, to
improve the performance of the block write operation and to minimize network congestion. The HDFS
client transparently caches the file into a temporary local file; when it accumulates enough data for a
block size, the client reaches out to the name node. At this time, the name node responds by inserting
the filename into the file system hierarchy and allocating data nodes for its storage. The client then
flushes the block of data from the local, temporary file to the closest data node, and that data node
transfers the block to other data nodes (as instructed by the name node, based on the replication
factor of the file). This client-side caching avoids continuous use of the network and minimizes the risk
of network congestion.
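The accumulate-then-flush behaviour of client-side caching can be sketched as a small Python class. This is a toy model: the `flush` callable stands in for the whole name-node/data-node handoff, and the tiny block size used in the example below is purely for illustration.

```python
class CachingWriter:
    """Sketch of the client-side write path described above: buffer data in
    a local cache and flush to a (simulated) data node only when a full
    block has accumulated."""

    def __init__(self, block_size, flush):
        self.block_size = block_size
        self.flush = flush          # callable standing in for the data-node write
        self.cache = b""

    def write(self, data):
        self.cache += data
        # Ship a block as soon as enough data has accumulated.
        while len(self.cache) >= self.block_size:
            self.flush(self.cache[:self.block_size])
            self.cache = self.cache[self.block_size:]

    def close(self):
        # Flush any final partial block.
        if self.cache:
            self.flush(self.cache)
            self.cache = b""
```

The point of the design is visible in the sketch: many small `write` calls produce few large flushes, which is what keeps the network out of the per-write critical path.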
10. How does a name node ensure that all the data nodes are functioning properly?
Each data node in the cluster periodically sends heartbeat signals and a block-report to the name
node. Receipt of a heartbeat signal implies that the data node is active and functioning properly. A
block-report from a data node contains a list of all blocks on that specific data node.
11. How does a client ensure that the data it receives while reading is not corrupted? Is there a way
to recover an accidentally deleted file from HDFS?
When writing blocks of a file, an HDFS client computes the checksum of each block of the file and
stores these checksums in a separate hidden file in the same HDFS file system namespace. Later, while
reading the blocks, the client references these checksums to verify that these blocks were not
corrupted (corruption might happen because of faults in a storage device, network transmission faults,
or bugs in the program). When the client realizes that a block is corrupted, it reaches out to another
data node that has the replica of the corrupted block, to get another copy of the block.
Dashboard / My courses / 2020-2021 / 2º Ciclo / Pós-Graduações / Outono / BDF-200209-202021-S1 / Second exam / Second exam
Question 1
Complete. Marked out of 2.00
You are hired by the Municipality of Terras do Ouro as Data Scientist, to help prevent the effects of flooding from the River Guardião. The
municipality has already distributed IoT devices along the river that are able to measure the flow of the water. In the kick-off meeting, you
say: "I have an idea for a possible solution. We may need to use a number of AWS services." And then you explain your solution. What do
you say?
You do not need to worry about computer resources if you use AWS Kinesis. This system will enable you to acquire data
through a gateway on the cloud that will receive your data; then you can process it using AWS Lambda and later
store the data in a storage system, e.g. AWS's S3.
Question 2
Complete. Marked out of 2.00
You go to a conference and hear a speaker saying the following: "Real-time analytics is a key factor for digital
transformation. Companies everywhere are using their data lakes based on technologies such as Hadoop and HDFS to
give key insights, in real time, of their sales and customer preferences." You raise your hand for a comment. What are you going
to say? Please, justify carefully.
Hadoop is a framework that incorporates several technologies, including batch and streaming processing (which you mentioned), and others.
HDFS is Hadoop's file system.
Question 3
Complete. Marked out of 2.00
In a whiteboarding session, you say to the CDO of Exportera: "OK, I'll explain what the Hive Driver is and what are its key
functions/roles." What do you say, trying to be very complete and precise?
Hive is Hadoop's SQL analytics query framework. It is the same as saying Hive is a SQL abstraction layer over Hadoop
MapReduce with a SQL-like query engine. It enables several types of connections to other databases using the Hive Driver.
The Hive Driver can, for instance, connect through ODBC to relational databases.
Question 4
Complete. Marked out of 2.00
On implementing Hadoop, the CDO is worried about understanding how a name node ensures that all the data nodes are
functioning properly. What can you tell him to reassure him?
Each data node in the cluster periodically sends heartbeat signals and a block-report to the name node. Receipt of a
heartbeat signal implies that the data node is active and functioning properly. A block-report from a data node contains a
list of all blocks on that specific data node.
Question 5
Complete. Marked out of 2.00
The CDO is considering using Amazon Kinesis Firehose to deliver data to Amazon S3. He is concerned about losing data if
the data delivery to the destination falls behind data writing to the delivery stream. Can you help him understand how
this process works, in order to alleviate his concerns?
Handling Big Data relates not only to data volume but to other aspects as well, one of them being the variety of data types.
While volume can be handled by scaling the processing and storage resources, the issues related to variety require customization of processing, namely interfaces and the programming of handlers. For instance, handling objects coming from SQL sources or OSI-PI sources requires different types of programming skills. Not having the right skills to handle this can ruin the ambition of storing different, and therefore rich, data.
To acquire this kind of skills, we need a budget increase for the department.
Question 7
Complete
Marked out of 2.00
The CDO of Exportera tells you: "From what I understand, in Hive, the partitioning of tables can only be made on the basis of one parameter, making it not as useful as it could be." What can you answer him?

Hive partitioning can be made with more than one parameter from the original table. The partitioning divides the table into smaller parts stored in folders, one folder for each combination of the partition parameters' values. The partitioning can even be further optimized by bucketing the files.
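The folder-per-combination layout can be sketched as follows; the table name `sales` and the partition columns `country` and `year` are purely illustrative.

```python
# Sketch of Hive-style partition directories: one folder per combination
# of partition-column values. Table and column names are illustrative.

def partition_path(table, **partition_values):
    """Build the warehouse subdirectory for one partition combination."""
    parts = [f"{col}={val}" for col, val in partition_values.items()]
    return "/".join([table] + parts)

print(partition_path("sales", country="PT", year=2021))
```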
Question 8
Complete
Marked out of 2.00
The CDO of Exportera has been exploring stream processing and is worried about the latency of a solution based on a platform such as Kinesis Data Streams. To address the question, what can you tell him?

The Kinesis Data Streams platform is built from high-throughput, high-bandwidth components. Kinesis ensures the durability and elasticity of data acquired through streaming. The elasticity of Kinesis Data Streams enables you to scale the stream up or down, so that you never lose data records before they expire (24 hours by default).
Question 9
Complete
Marked out of 2.00
The CDO of Exportera is confused: "I've read about MapReduce, but I still do not fully understand it." Then you explain it very clearly to him with a small example: counting the words of the following text:
"Mary had a little lamb
Little lamb, little lamb
Mary had a little lamb
Its fleece was white as snow"
Justify carefully.
MapReduce is a programming model, or a processing technique, that can be applied in several contexts. Its implementation is usually divided into three steps: mapping the data items considered, shuffling those items following a pattern/directive, and then aggregating/reducing them.
Let us take the text above as the reference for the following example: we want to count the number of words in the text.
In the first step we would map all the words, say, continuously, getting a list like:
"Mary
had
a
little
...
as
snow"
If one wants to count the number of words, a number should be associated with each word for later counting. We can associate "1" with each word, so that, still in the Map step, we create key-value pairs associating each word with "1", e.g. <"Mary",1>, where the word is the key:
<Mary,1>
<little,1>
In the next step one could shuffle these pairs so that all identical keys end up together. But since we just want the total number of words, we only need to sum the values, and that summation is the Reduce operation in this case.
Question 10
Complete
Marked out of 2.00
In the Exportera cafeteria you hear someone saying the following: "Oh, processing data in motion is no different than processing data at rest." If you had to join the conversation, what would you say to them?

Processing data in motion (streaming) is much different from processing data at rest. The main differences are that:
1 - analytics over streaming data needs to be done on data as it arrives, not on data already settled in storage;
2 - data in motion needs buffering in order to process the incoming volumes, while the same processing over data at rest can be done through querying;
3 - the concept of incoming data in streaming does not apply to data at rest; in the streaming case one needs to adapt the acquisition to the amount of incoming data.
Of course there are also some similarities, namely in data quality control, such as the detection of common types of errors like invalid or missing values. Still, the processing, which is what we are talking about, is much different in the two cases.
We can use Amazon Kinesis to process streaming data from IoT devices, in this case the devices that measure the river flow, and then use the data to send real-time alerts or take other actions when a device detects certain water-level thresholds. We can also use AWS sample IoT analytics code to build our application.
Schematically:
AMAZON KINESIS DATA STREAMS - ingests and stores sensor data streams for processing
AWS LAMBDA - AWS Lambda is triggered and runs code to detect trends in sensor data, identify water levels, and initiate alerts
OUTPUT - alerts are received when the water level reaches a certain threshold, so actions can be taken
2 - You go to a conference and hear a speaker saying the following: "Real-time analytics is a key
factor for digital transformation. Companies everywhere are using their datalakes based on
technologies such as Hadoop and HDFS to give key insights, in real time, of their sales and
customer preferences." You raise your hand for a comment. What are you going to say? Please, justify
carefully.
As mentioned, real-time data analytics is a key factor in digital transformation, since it combines
the power of parallel processing with the value of real-time data. For most companies, having
data means having access to wealth.
Datalakes are central repositories of data in a natural or raw format, having been pulled from a
variety of sources. With the datalakes, users can extract structured metadata from unstructured
data on a regular basis and store it in the operational data lake for quick and easy querying, thus
enabling better real-time data analysis. Hadoop’s ability to efficiently process large volumes of
data in parallel provides great benefits, but there are also a number of use cases that require more
“real time” processing of data—processing the data as it arrives, rather than through batch
processing. Fortunately, this need for more real-time processing is being addressed with the
integration of new tools into the Hadoop ecosystem.
3 - In a whiteboarding session, you say to the CDO of Exportera: "OK, I'll explain what the Hive
Driver is and what are its key functions/roles." What do you say, trying to be very complete and
precise?
The Hive Driver, Compiler, Optimizer, and Executor work together to turn a query into a set of
Hadoop jobs. The Driver acts like a controller which receives the HiveQL statements. The Driver
starts the execution of a statement by creating sessions, and it monitors the life cycle and progress of
the execution. The Driver stores the necessary metadata generated during the execution of
a HiveQL statement. It also acts as a collection point of data or query results obtained after the
Reduce operation.
4 - On implementing Hadoop, the CDO is worried about understanding how a name node ensures
that all the data nodes are functioning properly. What can you tell him to reassure him?
With HDFS, data is written on the server once, and read and reused numerous times after that.
The NameNode is the master node in the Apache Hadoop HDFS Architecture, which keeps track
of where file data is kept in the cluster and maintains and manages the blocks present on the
DataNodes. Data is broken down into separate blocks and distributed among the various
DataNodes for storage.
The DataNodes are in constant communication with the NameNode to determine if the
DataNodes need to complete specific tasks. Consequently, the NameNode is always aware of the
status of each DataNode. If the NameNode realizes that one DataNode isn't working properly, it
can immediately reassign that DataNode's task to a different node containing the same data block.
DataNodes also communicate with each other, which enables them to cooperate during normal
file operations.
5 - The CDO is considering using Amazon Kinesis Firehose to deliver data to Amazon S3. He is
concerned about losing data if the data delivery to the destination falls behind data writing
to the delivery stream. Can you help him understand how this process works, in order to
alleviate his concerns?
If Kinesis Data Firehose encounters errors while delivering or processing data, it retries until the
configured retry duration expires. If the retry duration ends before the data is delivered
successfully, Kinesis Data Firehose backs up the data to the configured S3 backup bucket. If the
destination is Amazon S3 and delivery fails or if delivery to the backup S3 bucket fails, Kinesis Data
Firehose keeps retrying until the retention period ends.
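The retry-then-backup behaviour can be sketched in miniature. The destination class, retry count, and record shape below are invented for illustration and are not Firehose internals.

```python
# Sketch of retry-then-backup delivery: try the destination a fixed
# number of times; on persistent failure, write the record to a backup
# store instead. Stores and retry count are illustrative, not Firehose
# internals.

def deliver(record, destination, backup, max_retries=3):
    """Attempt delivery; fall back to the backup bucket if all retries fail."""
    for _ in range(max_retries):
        if destination.put(record):
            return "delivered"
    backup.append(record)  # stand-in for the backup S3 bucket
    return "backed_up"

class FlakyDestination:
    """A destination that always fails, to exercise the backup path."""
    def put(self, record):
        return False

backup_bucket = []
status = deliver({"id": 1}, FlakyDestination(), backup_bucket)
print(status, backup_bucket)
```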
6 - The CDO of Exportera asks you to prepare the talking points for a presentation he must make
to the board regarding a budget increase for his team. It is important that the board members
understand what data variety versus data variability is, as well as its impacts to the analytics
platforms (and, of course, business benefits of addressing them). What do you write to make it
crystal clear?
Big Data, as defined by the National Institute of Standards and Technology, consists of extensive
datasets - primarily characterized by volume, velocity, variety, and/or variability - that
require a scalable architecture for efficient storage, manipulation, and analysis.
So, while many other characteristics have been attributed to Big Data, only the above four drive
the shift to new scalable architectures for data-intensive applications in order to achieve cost-
effective performance. These characteristics have the following definitions:
- Volume: the size of the dataset
- Velocity: rate of flow
- Variety: data from multiple repositories, domains, or types. Note that while volume and
velocity allow faster and more cost-effective analytics, it is the variety of data that allows
analytic results that were never possible before. Business benefits are frequently higher
when addressing the variety of data than when addressing volume.
- Variability: the changes in dataset, whether data flow rate, format/structure, semantics,
and/or quality that impact the analytics application. Variability in data volumes implies the
need to scale-up or scale-down virtualized resources to efficiently handle the additional
processing load, one of the advantageous capabilities of cloud computing.
7 - The CDO of Exportera tells you: “From what I understand, in Hive, the partitioning of tables
can only be made on the basis of one parameter, making it not as useful as it could be." What
can you answer him?
Partitioning in Hive is used to increase query performance. It is done by using the PARTITIONED
BY clause in the create table statement. A table can be partitioned on the basis of one or more
columns. It is a way of dividing a table into related parts based on the values of columns like date,
city, and department. Each table in Hive can have one or more partition keys to identify a
particular partition. The columns on which partitioning is done cannot be included in the data
table.
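A toy illustration of why multi-key partitioning pays off at query time: rows are grouped under one partition per combination of key values, and a query that fixes both keys reads only that partition. The partition keys and rows below are invented for the example.

```python
# Toy illustration of multi-key partition pruning: rows are grouped under
# one partition per (country, year) combination, and a query that fixes
# both keys reads only that partition. Keys and rows are illustrative.

partitions = {
    ("PT", 2021): [{"id": 1}, {"id": 2}],
    ("PT", 2022): [{"id": 3}],
    ("ES", 2021): [{"id": 4}],
}

def scan(partitions, country, year):
    """Read only the partition matching both keys, skipping all others."""
    return partitions.get((country, year), [])

print(scan(partitions, "PT", 2022))
```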
8 - The CDO of Exportera has been exploring stream processing and is worried about the latency
of a solution based on a platform such as Kinesis Data Streams. To address the question,
what can you tell him?
Latency is a measure of delay. Amazon Kinesis Data Streams is used specifically to build custom
real-time applications. You can use Amazon Kinesis Data Streams to collect and process
large streams of data records in real time. One of the advantages of AWS Kinesis is that it enables us to
ingest, buffer, and process streaming data in real time, driving insights in seconds or minutes
instead of hours or days.
9 - The CDO of Exportera is confused: "I've read about MapReduce, but I still do not fully
understand it. Then you explain very clearly to him with a small example: counting the words of
the following text:
Justify carefully.
Essentially, MapReduce divides a computation into three sequential stages: map, shuffle and
reduce. Below is how the MapReduce word count program executes and outputs the number of
occurrences of a word in any given input file.
- Mapper Phase: the text from the input text file will be split into individual tokens, i.e. words,
to form a key value pair with all the words present in the input text file. The key is the word
from the input file and value is ‘1’. In this case, the entire sentence will be split into 20 tokens
(one for each word) with a value 1 as shown below:
(mary,1)
(had,1)
(a,1)
(little,1)
(lamb, 1)
(little,1)
(lamb,1)
(little,1)
(lamb,1)
(mary,1)
(had,1)
(a,1)
(little,1)
(lamb, 1)
(its, 1)
(fleece, 1)
(was, 1)
(white, 1)
(as, 1)
(snow, 1)
- Shuffle Phase: after the map phase execution is completed, shuffle phase is executed
automatically wherein the key-value pairs generated in the map phase are taken as input and
then sorted in alphabetical order. The output will look like this:
(a,1)
(a,1)
(as, 1)
(fleece, 1)
(had,1)
(had,1)
(its, 1)
(lamb, 1)
(lamb, 1)
(lamb,1)
(lamb,1)
(little,1)
(little,1)
(little,1)
(little,1)
(mary,1)
(mary,1)
(snow, 1)
(was, 1)
(white, 1)
Reduce Phase: this is like an aggregation phase for the keys generated by the map phase. The
reducer phase takes the output of shuffle phase as input and then reduces the key-value pairs to
unique keys with values added up. In our example:
(a,2)
(as, 1)
(fleece, 1)
(had,2)
(its, 1)
(lamb, 4)
(little, 4)
(mary,2)
(snow, 1)
(was, 1)
(white, 1)
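The three phases above can be sketched end-to-end in Python. The tokenizer (lower-casing and stripping the comma) is a simplification chosen so the tokens match the listing above.

```python
# Word count via explicit map / shuffle / reduce steps over the rhyme.
from collections import defaultdict

TEXT = ("Mary had a little lamb "
        "Little lamb, little lamb "
        "Mary had a little lamb "
        "Its fleece was white as snow")

def map_step(text):
    """Emit a (word, 1) pair per token, lower-cased and comma-stripped."""
    return [(w.strip(",").lower(), 1) for w in text.split()]

def shuffle_step(pairs):
    """Group the values of identical keys together, sorted by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return dict(sorted(groups.items()))

def reduce_step(groups):
    """Sum each key's values to get its total occurrence count."""
    return {key: sum(values) for key, values in groups.items()}

counts = reduce_step(shuffle_step(map_step(TEXT)))
print(counts)
```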
10 - In the Exportera cafeteria you hear someone saying the following: "Oh, processing data in
motion is no different than processing data at rest." If you had to join the conversation, what
would you say to them?
Data at rest refers to data that has been collected from various sources and is then analysed after
the event occurs. The point where the data is analysed and the point where action is taken on it
occur at two separate times.
The collection process for data in motion is similar to that of data at rest. However, the difference
lies in the analytics. In this case, the analytics occur in real-time as the event happens. Data in
motion is processed and analysed in real time, or near real time, and must be handled in a very
different way than data at rest (i.e., persisted data). Data in motion tends to resemble
event-processing architectures and focuses on real-time or operational intelligence applications.
For data at rest, a batch processing method would be most likely used. There is no need for
“always on” infrastructure. This approach provides access to high-performance processing
capabilities as needed. For data in motion, you would want to utilize a real-time processing
method. In this case, latency becomes a key consideration because a lag in processing could result
in a missed opportunity to improve business results.
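The contrast can be made concrete with a minimal sketch: a batch computation waits for the whole dataset, while a streaming computation keeps running state and yields a result per arriving record. The running-average example is illustrative only.

```python
# Contrast batch processing (all data available at once) with streaming
# (one record at a time, running state updated incrementally).

def batch_average(values):
    """Data at rest: compute over the complete dataset in one pass."""
    return sum(values) / len(values)

class StreamingAverage:
    """Data in motion: update a running average as each record arrives."""
    def __init__(self):
        self.count = 0
        self.total = 0.0

    def update(self, value):
        self.count += 1
        self.total += value
        return self.total / self.count  # a result is available immediately

readings = [2.0, 4.0, 6.0]
stream = StreamingAverage()
latest = [stream.update(v) for v in readings]
print(batch_average(readings), latest)
```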