
Analyzing Big Data

1st Exam 2021

1 – In Databricks notebooks, you can: *


a) Program only in Python and Spark
b) Program only in the language defined in the Notebook creation
c) Select a language at the cell level
d) Select a language for a set of cells

2 – In a Databricks notebook, to access the cluster driver node console, what magic command is used?
a) %fs
b) %drive
c) dbutils.fs.mount()
d) %sh

3 – What is an RDD?*
a) A Hadoop data format
b) A dataset in-memory
c) A dataset in-disk
d) A dataset in-disk and in-memory

4 – What is MapReduce?
a) A programming language
b) A programming model
c) A set of functions for processing big data
d) The original Apache Software Foundation query engine for big data

5 – What is Avro in Hadoop?


a) A program to load data with high parallelization
b) A column-based data format
c) A row-based data format
d) A full text-based data format for compatibility and portability

6 – What is a pair RDD?


a) Two sets of RDDs in a transformation
b) An RDD with only two rows
c) An RDD with two data types
d) An RDD with only two columns

7 – What is the result of the Spark statement below?


sc.parallelize(mydata,3)
a) Creates an RDD with a minimum of 3 partitions
b) Creates an RDD named mydata and the value 3
c) Generates an error of too many parameters
d) Creates 3 RDDs with mydata

8 – What is the output object type that results from applying a map() function to an RDD that was
created from a text file with the sc.textFile() method?
a) String
b) Tuple
c) List
d) Dictionary

9 – Select the right statement to create an RDD:


a) myRDD = ["Alice", "Carlos", "Frank", "Barbara"]
b) myRDD = sc.load("Alice", "Carlos", "Frank", "Barbara")
c) myRDD = sc.parallelize(["Alice", "Carlos", "Frank", "Barbara"])
d) myRDD = load("Alice", "Carlos", "Frank", "Barbara")

10 – What Spark function has as an output, a Pair RDD?


a) map
b) flatMap
c) textFile
d) wholeTextFiles

11 – Select the right statement regarding reduceByKey():


a) reduceByKey() is a wide transformation
b) reduceByKey() is a narrow transformation
c) reduceByKey() is a lazy transformation
d) reduceByKey() is an action

12 – Select the false statement regarding Spark terminology:


a) A Job is a set of tasks executed as a result of an action
b) A Stage is a set of tasks in a job that can be executed in parallel
c) A Task is an individual unit of work sent to an executor
d) An Application can only contain one job (set of jobs managed by a driver)

13 – What is a lambda function?*


a) It’s a function defined without a name and with only one parameter
b) It’s a function defined without a name and with only one expression
c) It’s a function that can be reused with many parameters
d) It’s a function that can be reused with many expressions

14 – collect() is a Spark function that:


a) You can extensively use to display Dataframes content
b) It’s not available for Dataframes
c) You can extensively use to display RDDs content
d) You should use with caution to display RDDs content
We should use collect() on smaller datasets, usually after filter(), group(), count(), etc. Calling collect()
on a large dataset can cause out-of-memory errors on the driver.
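A minimal sketch of this usage pattern, assuming an existing SparkContext sc (the sample data is made up):

rdd = sc.parallelize(range(1000))           # a larger RDD
small = rdd.filter(lambda x: x % 100 == 0)  # reduce the data first
small.collect()                             # safe: only 10 elements return to the driver
rdd.take(5)                                 # preferable to collect() for a quick look at a big RDD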

15 – With the instruction sc.textFile("file:/data") you are:


a) Reading a file from your hdfs file system
b) Reading a file called “File:/data”
c) Reading a file from your local non-Hadoop file system
d) Reading a file called “data” stored in a folder called “file”

16 – Select the right instruction to create a Dataframe?


a) Dataframe = spark.range(10) Out[1]: DataFrame[id: bigint]
b) Dataframe = sc.textFile("mydata")
c) Dataframe = spark.dataFrame("Mydata")
d) Dataframe = sc.parallelize("mydata")

17 – Select the right statement regarding Spark transformations:


a) Wide transformations are very efficient because they don’t move data from the node
b) Narrow transformations are very efficient because they don’t move data from the node
c) Both wide and narrow transformations move data from the node
d) None of the narrow or wide transformations move data from the node

18 – In Spark, lazy execution means that:*


a) Execution will take some time because it needs to be sent to the worker nodes
b) Execution will take some time because the code is interpreted
c) Execution is done one line at the time
d) Execution is triggered only when an action is found

19 – What is the difference between Spark Streaming and Structured Streaming?*
a) Structured Streaming is for structured streaming data processing and Spark Streaming is for
unstructured streaming data processing
b) Spark Streaming is the new ASF library for Streaming Data and Structured Streaming the old
one
c) Structured Streaming is a stream processing engine and Spark Streaming is an extension to
the core Spark API to streaming data processing
d) Structured Streaming relies on micro batch and RDDs while Spark Streaming relies on
DataFrames and Datasets

20 – What are DStreams?


a) Data abstractions provided from Spark core library
b) Data abstractions provided from Spark MLlib
c) Data abstractions provided from Spark Streaming
d) Data abstractions provided from Spark Structured Streaming

21 – Window operations in Spark are used to:


a) Define a window to display data
b) Select a window of data to display
c) Freeze data in memory from a window of data
d) Apply transformation over a window of data

22 – Spark ML library can be classified as:


a) A mature ML library with a very wide range of predictive and descriptive models to choose
from
b) A strong Deep Learning library
c) A complete ML framework for data analysis
d) A ML library with a reasonable set of models but still with work in progress

23 – What is the main difference between Spark MLlib and ML?*


a) There is no difference apart from the bigger set of algorithms available on Spark ML
b) Spark ML works with Streaming Data
c) Spark MLlib is faster
d) Spark ML works with Dataframes
Spark MLlib carries the original API built on top of RDDs.
Spark ML contains a higher-level API built on top of DataFrames for constructing ML pipelines.

24 – In a Spark ML program, what is the purpose of the code below?


model.transform(mydata)
a) Create a machine learning model based on the data of ‘mydata’
b) Apply the model in ‘model’ to the data in ‘mydata’
c) Create a new model based on ‘mydata’
d) Adjust the model based on the data in ‘mydata’

25 – In a Spark ML program, what is the purpose of the code below?


model.fit(mydata)
a) Train a machine learning model based on the data of ‘mydata’
b) Apply the model in ‘model’ to the data in ‘mydata’
c) Create a new model based on ‘mydata’
d) Adjust the model based on the data in ‘mydata’

26 – What is the result of the Spark ML instruction below?


lr = LogisticRegression(maxIter = 10)
a) A logistic regression object is declared with a maximum of 10 iterations
b) A logistic regression object is executed with a maximum of 10 iterations
c) A logistic regression object is trained with a maximum of 10 iterations
d) A logistic regression object is estimated with a maximum of 10 iterations

27 – The vertex DataFrame in a GraphFrame is:*
a) A free form DataFrame
b) A DataFrame that must contain a column named ‘id’
c) A DataFrame that must contain a column named ‘src’ and ‘dst’
d) A DataFrame that must contain a column named ‘id’, ‘src’ and ‘dst’

28 – What is DSL used for in GraphFrames?


a) Formatting the output of a GraphFrame query
b) Declare a GraphFrame object
c) Search for patterns in a graph
d) Define properties in a GraphFrame

29 – Based on the figure below, explain the algorithm shown line by line and the expected outcome of
each line. Additionally, explain what the expected input content for the algorithm is and where it (the
object) is identified in the code.

count = rdd.flatMap(lambda line: line.split()) \


.map(lambda word: (word, 1)) \
.reduceByKey(lambda a,b: a+b)
The code is a map-reduce algorithm to perform a word count.
o In the first line, the outcome is a list of words (one element per word). Each line is split on whitespace
and flatMap flattens the resulting lists into a single RDD of words.
o In the second line, the map function applies a lambda function to each word to create the tuple
(word, 1). The outcome is therefore a key-value pair RDD, where the word is the key and the value is 1.
o In the third line, an aggregation is performed by key (word), so the outcome is a list of the distinct
words with the count of their occurrences.
The input content of the algorithm is an RDD containing at least one line of text; it is referenced in the
first line, before the first lambda function.
Example:
rdd = sc.parallelize(["This is the first line", "This is the second line", "This is the last line"])
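For reference, collecting the result of the algorithm on this example input would return something like the output below (the ordering of the pairs is not guaranteed):

count.collect()
# [('This', 3), ('is', 3), ('the', 3), ('first', 1), ('line', 3), ('second', 1), ('last', 1)]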

30 – Explain the differences between the map() and flatMap() transformations in Spark.
Complementary, show an example created by you (with code) and explain the output differences.
• Both map() and flatMap() are transformations that apply a function to the elements of an RDD and
return a new RDD with the transformed elements.
• On the one hand, the map() transformation takes one element and produces one element (a one-to-one
transformation). On the other hand, flatMap() takes one element and produces zero, one or more
elements (a one-to-many transformation).
• In this example, let’s create an RDD which has a list of 2 lines of text
o rdd = sc.parallelize(["First Word", "Second Word"])
• If we perform an upper case transformation using the map(), the output will be a list with the two lines
of text with all the words in uppercase:
o code: rdd.map(lambda line: line.upper()).collect()
o output: ['FIRST WORD', 'SECOND WORD']
• If we perform an upper case transformation using flatMap(), the output will be a list with all the
characters in upper case, as the results are flattened:
o code: rdd.flatMap(lambda line: line.upper()).collect()
o output: ['F', 'I','R','S', 'T',' ','W','O','R','D','S','E','C','O','N','D',' ','W','O','R','D']
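• To obtain words instead of characters, flatMap() is normally combined with split(); a small sketch
using the same rdd:
o code: rdd.flatMap(lambda line: line.split()).collect()
o output: ['First', 'Word', 'Second', 'Word']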

31 – Explain the differences between Real-time data processing and Batch data processing.
Additionally, give an example of each data processing type.
• Batch data processing deals with groups of transactions that have already been collected over a period of time. The goal of a batch
processing system is to automatically execute periodic jobs in a batch. It is ideal for large volumes of data/transactions, as it increases
efficiency compared with processing them individually. However, there can be a delay between the collection of the data and getting
the result after the batch process, as it is normally a very time-consuming process. An example of batch processing is the monthly
payroll processing within a company.
• On the other hand, real-time data processing deals with continuously flowing data in real time. Real-time processing systems need to
be very responsive and active all the time, in order to supply an immediate response at every instant. In these systems, the information is
always up to date. However, the complexity of this process is higher than in batch processing. Examples include radar systems,
weather forecasting, or temperature measurement, which normally involve several IoT sensors.
2nd Exam 2021

Question 1
Write a Databricks Notebook program to do the following tasks
1. Display your Spark session version, master and AppName

2. Print the list of the files in /FileStore/tables

3. Copy one of the files (you may suggest a non-existing name) to a new version with the name
prefix "New_"

1 – sc
2 – %fs ls /FileStore/tables
3 - dbutils.fs.cp("dbfs:/FileStore/tables/File.csv", "dbfs:/FileStore/tables/New_File.csv", True)
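A more explicit variant for task 1, assuming the notebook provides the usual spark session and sc context objects (a sketch, not the only valid answer):

print(spark.version)  # Spark session version
print(sc.master)      # cluster master
print(sc.appName)     # application name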

Question 2
Write a Spark program to do the following tasks:
1. Create a Python list of Temperatures in ºF as in: [50, 59.2, 59, 57.2, 53.5, 53.2, 55.4, 51.8,
53.6, 55.4, 54.7]
2. Create an RDD based on the list above
3. Display the main stats of the RDD (like count, mean, stdev etc.)

4. Create a program to convert the elements of the RDD from ºF to ºC


5. Display the result of the new RDD with the elements in ºC

1 – List = [50, 59.2, 59, 57.2, 53.5, 53.2, 55.4, 51.8, 53.6, 55.4, 54.7]
2 – rdd = sc.parallelize(List)
rdd.collect()
3 – rdd.stats()
4 – rdd_Celsius = rdd.map(lambda T: (T-32)*5/9)
5 – rdd_Celsius.collect()
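If the individual statistics are wanted instead of the single StatCounter returned by stats(), they can also be obtained one by one (a sketch reusing the same rdd):

print(rdd.count())           # number of elements
print(rdd.mean())            # mean temperature in ºF
print(rdd.stdev())           # standard deviation
print(rdd.min(), rdd.max())  # minimum and maximum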

Question 3
Write a Spark program to do the following tasks:
1. Create a RDD with the following 7 lines of text: "First Line", "Now the 2nd", "This is the 3rd
line", "This is not the 3rd line?", "This is the 5th ", "This is the 6th", "Last Line"

2. Create a RDD (based on the previous one with Spark functions) with only the lines that start
with the word "This"

3. Create a RDD (based on the previous one with Spark functions) without lines with the word
"not"

4. Create a RDD (based on the previous one with Spark functions) with the text in capital letters
5. Display only two elements of the resulting RDD

1–
text = ["First Line", "Now the 2nd", "This is the 3rd line", "This is not the 3rd line?", "This is the 5th ", "This is the 6th", "Last
Line"]
rdd = sc.parallelize(text)
rdd.collect()

2–
rdd_2 = rdd.filter(lambda line: line.startswith("This"))
rdd_2.collect()

3–
rdd_3 = rdd.filter(lambda line: "not" not in line)
rdd_3.collect()

4–
rdd_4 = rdd.map(lambda line: line.upper())
rdd_4.collect()

5 – rdd_4.take(2)
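For reference, with the input above the filters leave three lines, so rdd_4.take(2) should return the first two of them in upper case (assuming the original ordering is preserved):

['THIS IS THE 3RD LINE', 'THIS IS THE 5TH ']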

Question 4
Write a Spark program to do the following tasks:
1. Create an RDD with the data below (simulating customer acquisitions):
o 'Client1:p1,p2,p3'
o 'Client2:p1,p2,p3'
o 'Client3:p3,p4'

2. Write a Spark program to convert the above RDD in the following output:
o 'Client1', 'p2'
o 'Client1', 'p3'
o 'Client2', 'p1'
o 'Client2', 'p2'
o 'Client2', 'p3'
o 'Client3', 'p3'
o 'Client3', 'p4'
1–
rdd = sc.parallelize(['Client1:p2,p3', 'Client2:p1,p2,p3', 'Client3:p3,p4'])
rdd.collect()

2-
rdd2 = rdd.map(lambda line: line.split(":"))
rdd3 = rdd2.map(lambda fields: (fields[0],fields[1]))
rdd4 = rdd3.flatMapValues(lambda p: p.split(","))
rdd4.collect()
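For reference, rdd4.collect() is expected to return the list of (client, product) pairs:

[('Client1', 'p2'), ('Client1', 'p3'), ('Client2', 'p1'), ('Client2', 'p2'), ('Client2', 'p3'), ('Client3', 'p3'), ('Client3', 'p4')]
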
Question 5
Write a Spark algorithm to count the number of distinct words on the text line below (or similar input
you may write).
o "First Line"
o "Now the 2nd line"
o "This is the 3rd line"
o "This is not the final line?"
o "This is the 5th "
o "This is the 6th"
o "Last Line is the 7th"
The output must be a pair RDD with the distinct words and the corresponding number of occurrences.

text = ["First Line", "Now the 2nd line", "This is the 3rd line", "This is not the final line?", "This is the 5th", "This is the 6th",
"Last Line is the 7th"]
rdd = sc.parallelize(text)
rdd2 = rdd.flatMap(lambda line: line.split()) \
.map(lambda word: (word, 1)) \
.reduceByKey(lambda w1,w2: w1+w2)
rdd2.collect()
Question 6
Write a Spark program to do the following tasks:
1. Create a RDD based on the list of the following 4 tuples:
[('Mark',25),('Tom',22),('Mary',20),('Sofia',26)]
2. Create a Dataframe based on the previous RDD with the 2 columns named "Name" and "Age"
3. Display the new DataFrame
4. Create a new DataFrame based on the previous one, adding a new column named "AgePlus"
with the content of Age multiplied by 1.2

5. Write the new DataFrame (with the columns "Name", "Age" and "AgePlus") in dbfs in Delta
format
6. Check that the written DataFrame/file is in dbfs

1 – rdd = sc.parallelize([('Mark',25),('Tom',22),('Mary',20),('Sofia',26)])
2 – df = spark.createDataFrame(rdd).toDF("Name","Age")
3 – display(df)
4 – df2 = df.withColumn("AgePlus", df["Age"]*1.2)
5 – df2.write.format("delta").save("/FileStore/tables/df2")
6 - %fs ls /FileStore/tables/df2
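As a complementary check, the Delta table written in step 5 can be read back into a DataFrame (a sketch using the same path):

df_check = spark.read.format("delta").load("/FileStore/tables/df2")
display(df_check)
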
Question 7
Explain the major differences between Spark SQL, Hive, and Impala. Give examples supporting your
explanation for each of the 3 cases.
• Spark SQL is a distributed in-memory computation engine. It is a Spark module for structured data processing which is
built on top of Spark Core.
o Spark SQL can handle several independent processes in a distributed manner, across thousands of nodes
spread over several physical and virtual clusters.
o It supports the other Spark modules used in applications such as stream processing and machine
learning.
• On the other hand, Hive is a data warehouse software for querying and managing large distributed datasets, built on
top of the Hadoop File System (HDFS).
o Hive is designed for batch processing through the use of MapReduce programming.
• Finally, Impala is a massively parallel processing (MPP) engine developed by Cloudera.
o Contrary to Spark, it supports multi-user environments while having the qualities of Hadoop: it supports columnar
storage, a tree architecture, Apache HBase storage and HDFS.
o It has significantly higher query throughput than Spark SQL and Hive.
o However, in large analytical queries Spark SQL and Hive outperform Impala.
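As illustrative examples (the sales table below is made up): in Spark SQL the query runs inside a Spark session, while the same SQL could be submitted to Hive through the hive/beeline CLI (compiled into batch MapReduce/Tez jobs) and to Impala through impala-shell (executed by Impala daemons for low-latency interactive answers).

spark.sql("SELECT product, SUM(amount) AS total FROM sales GROUP BY product").show()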

Question 8
Write a Spark program to do the following tasks:
1. Create a Graph that depicts the following data and relationships:
o Alice: age=31; Esther: age=35; David: age=34; Bob age=29
o Alice is married to Bob and is friend of Esther
o Esther is married to David and is friend of Alice
o Bob and David are friends
2. Print the edges and vertices of your graph
3. Create a subgraph with only the friend relationships and show the result

1 - from graphframes import *

# Create the vertices


vertices = sqlContext.createDataFrame([
("a", "Alice", 31),
("b", "Esther", 35),
("c", "David", 34),
("d", "Bob", 29)], ["id", "name", "age"])

# Create the edges


edges = sqlContext.createDataFrame([
("a", "d", "married"),
("a", "b", "friend"),
("b", "c", "married"),
("b", "a", "friend"),
("c", "d", "friend")], ["src", "dst", "relationship"])

# Create the graph


g = GraphFrame(vertices, edges)

2–
display(g.vertices)
display(g.edges)

3–
friends = g.edges.filter("relationship = 'friend' ")
friends.show()
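A variant of step 3 that builds an actual subgraph object rather than only the filtered edges (a sketch reusing the objects above; isolated vertices are kept):

friend_edges = g.edges.filter("relationship = 'friend'")
g_friends = GraphFrame(g.vertices, friend_edges)
display(g_friends.vertices)
display(g_friends.edges)
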
Question 9
Write a Spark program to do the following tasks:
1. Create a DataFrame simulating insurance customer data with:
o The columns: ["age","bmi","children","charges","smoker"]
o 4 records with the following values: [ [19,27,0,168,"y"], [18,33,1,177,"n"],
[28,35,2,191,"s"], [32,38,3,208,"n"]]
2. Create a new DataFrame based on the previous one with the values in the smoker column
encoded in 1/0 values
3. Display the new DataFrame without the original "smoker" column

1–
records = [ [19,27,0,168,"y"], [18,33,1,177,"n"], [28,35,2,191,"y"], [32,38,3,208,"n"] ]
df = spark.createDataFrame(records, ["age","bmi","children","charges","smoker"])

2–
from pyspark.ml.feature import StringIndexer
smokerIndexer = StringIndexer(inputCol = "smoker", outputCol = "smokerIndex")
df2 = smokerIndexer.fit(df).transform(df)

3-
df_final = df2.drop("smoker")
display(df_final)
Question 10
1. Write a Spark program to do the following tasks:
o Create a DataFrame simulating insurance customer data, with (or assume you already
have a file with the data and read it from dbfs):
o The columns: ["age","bmi","children","charges","smoker"]
o 4 records with the following values: [ [19,27,0,168,"y"], [18,33,1,177,"n"],
[28,35,2,191,"s"], [32,38,3,208,"n"] ]

2. Write a Spark ML program to perform a multiple linear regression in an attempt to predict
the health insurance charges based on the client's age, bmi, number of children and his/her
smoking habits
3. At the end of your program print also the: coefficients, RMSE and the r2 of your model

1–
records = [ [19,27,0,168,"y"], [18,33,1,177,"n"], [28,35,2,191,"y"], [32,38,3,208,"n"] ]
df = spark.createDataFrame(records, ["age","bmi","children","charges","smoker"])

2–
from pyspark.ml.feature import StringIndexer
from pyspark.ml.feature import VectorAssembler
from pyspark.ml import Pipeline
from pyspark.ml.regression import LinearRegression
from pyspark.ml.evaluation import RegressionEvaluator

# Convert the smoker column to a binary column


smokerIndexer = StringIndexer(inputCol = "smoker", outputCol = "smoker_binary")
df2 = smokerIndexer.fit(df).transform(df)
df3 = df2.drop("smoker")

# Vectorize the data


from pyspark.ml import Pipeline
stages = []
assembler = VectorAssembler(inputCols=["age", "bmi", "children", "smoker_binary"], outputCol="features")
stages += [assembler]
pipeline = Pipeline(stages=stages)
pipelineModel = pipeline.fit(df3)
dataset = pipelineModel.transform(df3)

# Keep relevant columns and rename the column "charges" to "label"


df_final = dataset.select(["features", "charges"]).selectExpr("features as features", "charges as label")

# Split the data into training and test sets (80% train, 20% test)
(trainingData, testData) = df_final.randomSplit([0.8,0.2])
# Create a linear regression object
lr = LinearRegression(maxIter = 10, regParam = 0.3, elasticNetParam = 0.8)
# Chain Linear Regression in a Pipeline
pipeline = Pipeline(stages = [lr])
# Train the Model
model = pipeline.fit(trainingData)
# Make Predictions
predictions = model.transform(testData)

3–
evaluator = RegressionEvaluator(labelCol = "label", predictionCol = "prediction")
print("Coefficients: " + str(model.stages[0].coefficients))
print("RMSE:", evaluator.evaluate(predictions, {evaluator.metricName: "rmse"}))
print("R2:", evaluator.evaluate(predictions, {evaluator.metricName: "r2"}))
Support – Test Analyzing Big Data*

1 – Which of the following is NOT a component of big data architecture?


a) Data storage
b) Data sources
c) Machine Learning
d) Anonymization

2 – Only one of the following sentences is NOT true. Choose it.


a) Workspaces allow you to organize all the work that you are doing on Databricks
b) The objective of data lake is to break data out of silos
c) Clusters are a single computer that you treat as a group of computers
d) The data lake stores data of any type: structured, unstructured, streaming

3 – Which of the following is the most appropriate definition for “Jobs”


a) Are packages or modules that provide additional functionality that you need to solve your
business problems
b) Are structured data that you and your team will use for analysis
c) Are the tool by which you can schedule execution to occur either on an already existing
cluster or a cluster of its own
d) Are third party integrations with the Databricks platform

4 – Only one of the following options is NOT true. Choose it.


a) Notebooks need to be connected to a cluster in order to be able to execute commands
b) Dashboards can be created from notebooks as a way of displaying the output of cells without
the code that generates them
c) Clusters allow you to execute code from apps
d) Workspaces allow you to organize all the work that you are doing on Databricks
(Clusters allow you to execute code from notebooks or libraries on a set of data.)

5 – Only one of the following sentences is correct. Choose it.


a) Clusters allow you to execute code from notebooks or libraries on set of data
b) Dashboards cannot be created from notebooks
c) Tables cannot be stored on the cluster that you're currently using
d) Applications like Tableau are jobs

6 – Only one of the following sentences is correct. Choose it.


The command “%sh” allows you:
a) To display the files of the folder
b) To execute shell code in your notebook
c) To use dbutils filesystem commands
d) To include various types of documentation, including text, images and mathematical formula
and equations

7 – Concerning the characteristics of “Lists” only one of the following options is NOT true. Choose it.
a) Lists are collections of items where each item in the list has an assigned index value
b) Lists consist of values separated by commas
c) A list is mutable (meaning you can change its contents)
d) Lists are enclosed in square brackets [ ] and each item is separated by a comma

8 – Only one of the following sentences is correct. Choose it.


a) A dictionary maps a set of keys to another set of values
b) Tuples are never enclosed in parentheses
c) Dictionaries are immutable
d) The Lambda function is used for creating big and multiline function objects

9 – Only one of the following sentences is NOT true. Choose it.


Several types of data are stored in the following DBFS root locations:
a) /databricks-datasets
b) /databricks-results
c) /FileStore
d) /databricks-finalizations

10 – The example in the annex uses the lambda function. Is the syntax correct?

a) Yes
b) No

11 – Concerning the use of Hadoop, only one of the following sentences is correct. Choose it.
a) A Node is a group of computers working together
b) A Cluster is an individual computer in the cluster
c) A Daemon is a program running on a node
d) With Hadoop we can’t explore the nodes (name or data)

12 – Which of the following is NOT a component of Hadoop data architecture?


a) HDFS
b) MapIncrease
c) YARN
d) Spark

13 – Concerning HDFS, only one of the following sentences is NOT true. Choose it.
a) HDFS is responsible for storing data on the cluster
b) HDFS is a File System written in Java
c) HDFS sits on top of a native file system
d) HDFS provides non-redundant storage

14 – Only one of the following sentences is correct. Choose it.


a) YARN allows multiple data processing engines to run on a single cluster
b) MapReduce is a programming model for processing data in a distributed way on a unique
node
c) A MapReduce drawback is that it is optimized for iterative algorithms
d) A MapReduce drawback is that it is not limited to batch processing

15 – Only one of the following sentences is NOT true. Choose it.


Avro data files:
a) Is a row-based storage format for Hadoop
b) It stores data in a non-binary format
c) It’s an efficient data serialization framework
d) Uses JSON for defining the data schema

16 – Concerning Parquet files, only one of the following sentences is NOT true. Choose it.
a) Parquet is supported in Spark, MapReduce, Hive, Pig, Impala, and others
b) Parquet reduces performance
c) Parquet is a columnar format developed by Cloudera and Twitter
d) Parquet is most efficient when adding many records at once

17 – Which of the following is NOT a Delta Lake key feature?


a) Closed Format
b) Scalable Metadata Handling
c) Unified Batch and Streaming Source and Sink
d) Schema Enforcement and Evolution
18 – Choose the option with the correct command to copy the file foo.txt from local disk to user’s
directory in HDFS
a) $hdfs dfs -put foo.txt foo.txt
b) $hdfs dfs -ls foo.txt foo.txt
c) $hdfs dfs -get foo.txt foo.txt (??)
d) $hdfs dfs -rm foo.txt foo.txt

19 – Concerning Apache Flume, only one of the following sentences is NOT true. Choose it.
a) Apache Flume is a real time data ingestion tool
b) The use of Apache Flume is only restricted to log data aggregation
c) Apache Flume is a distributed, reliable, and available system for efficiently collecting,
aggregating, and moving large amounts of streaming data from many different sources to a
centralized data store
d) Apache Flume is a top level project at the Apache Software Foundation

20 – Can you explain why use Apache Storm?


Apache Storm is a free and open source distributed realtime computation system. Apache Storm
makes it easy to reliably process unbounded streams of data, doing for realtime processing what
Hadoop did for batch processing. Apache Storm is simple, can be used with any programming
language, and is a lot of fun to use!
Apache Storm has many use cases: realtime analytics, online machine learning, continuous
computation, distributed RPC, ETL, and more. Apache Storm is fast: a benchmark clocked it at over a
million tuples processed per second per node. It is scalable, fault-tolerant, guarantees your data will
be processed, and is easy to set up and operate.
