2 – In a Databricks notebook, which magic command is used to access the cluster driver node console?
a) %fs
b) %drive
c) dbutils.fs.mount()
d) %sh
3 – What is an RDD?
a) A Hadoop data format
b) An in-memory dataset
c) An on-disk dataset
d) A dataset both on disk and in memory
4 – What is MapReduce?
a) A programming language
b) A programming model
c) A set of functions for processing big data
d) The original Apache Software Foundation query engine for big data
8 – What is the output object type that results from applying a map() function to an RDD that was
created from a text file with the sc.textFile() method?
a) String
b) Tuple
c) List
d) Dictionary
29 – Based on the figure below, explain the shown algorithm line by line and the expected outcome of
each line. In addition, explain what the expected input content for the algorithm is and where that
object is identified in the code.
30 – Explain the differences between the map() and flatMap() transformations in Spark. In addition,
show an example created by you (with code) and explain the output differences.
• Both map() and flatMap() are transformations that apply a function to the elements of an RDD and
return a new RDD with the transformed elements.
• On the one hand, the map() transformation takes one element and produces exactly one element (a
one-to-one transformation). On the other hand, flatMap() takes one element and produces zero, one
or more elements (a one-to-many transformation).
• In this example, let’s create an RDD from a list with 2 lines of text:
o rdd = sc.parallelize(["First Word", "Second Word"])
• If we perform an upper case transformation using map(), the output will be a list with the two lines
of text with all the words in upper case:
o code: rdd.map(lambda line: line.upper()).collect()
o output: ['FIRST WORD', 'SECOND WORD']
• If we perform an upper case transformation using flatMap(), the output will be a list with all the
individual characters in upper case, as the results are flattened:
o code: rdd.flatMap(lambda line: line.upper()).collect()
o output: ['F', 'I','R','S', 'T',' ','W','O','R','D','S','E','C','O','N','D',' ','W','O','R','D']
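• A more typical one-to-many use of flatMap() is splitting lines into words rather than characters; a
small additional sketch of my own (same RDD as above):
o code: rdd.flatMap(lambda line: line.split()).collect()
o output: ['First', 'Word', 'Second', 'Word']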
31 – Explain the differences between real-time data processing and batch data processing. In
addition, give an example of each data processing type.
• Batch data processing deals with groups of transactions that have already been collected over a period of time. The goal of a batch
processing system is to automatically execute periodic jobs in a batch. It is ideal for large volumes of data/transactions, as it increases
efficiency compared with processing them individually. However, there can be a delay between the collection of the data and getting
the result after the batch process, as it is normally a very time-consuming process. An example of batch processing is the monthly
processing of salaries within a company.
• On the other hand, real-time data processing deals with continuously flowing data in real time. Real-time processing systems need to
be very responsive and active all the time, in order to supply an immediate response at every instant. In these systems, the information
is always up to date. However, the complexity of this process is higher than in batch processing. Examples are radar systems, weather
forecasting or temperature measurement, which normally involve several IoT sensors.
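• In Spark terms, this contrast maps to a one-off batch read versus Structured Streaming. A minimal sketch (the CSV path is hypothetical, and the built-in rate source merely generates test rows):
o batch_df = spark.read.csv("dbfs:/FileStore/tables/sales.csv", header=True)  # batch: read once, process, finish
o stream_df = spark.readStream.format("rate").option("rowsPerSecond", 1).load()  # streaming: rows keep arriving
o query = stream_df.writeStream.format("memory").queryName("rates").start()  # continuously updated in-memory table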
2nd Exam 2021
Question 1
Write a Databricks Notebook program to do the following tasks:
1. Display your Spark session version, master and AppName
2. List the files in dbfs under /FileStore/tables
3. Copy one of the files (you may suggest a non-existing name) to a new version with the name
prefix “New_”
1 – sc
2 – %fs ls /FileStore/tables
3 - dbutils.fs.cp("dbfs:/FileStore/tables/File.csv", "dbfs:/FileStore/tables/New_File.csv", True)
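For task 1, evaluating sc in Databricks prints the context summary; a more explicit equivalent, if preferred:
print(spark.version)
print(spark.sparkContext.master)
print(spark.sparkContext.appName)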
Question 2
Write a Spark program to do the following tasks:
1. Create a Python list of Temperatures in ºF as in: [50, 59.2, 59, 57.2, 53.5, 53.2, 55.4, 51.8,
53.6, 55.4, 54.7]
2. Create an RDD based on the list above
3. Display the main stats of the RDD (like count, mean, stdev etc.)
4. Create a new RDD with the temperatures converted to ºC
5. Display the new RDD
1 – List = [50, 59.2, 59, 57.2, 53.5, 53.2, 55.4, 51.8, 53.6, 55.4, 54.7]
2 – rdd = sc.parallelize(List)
rdd.collect()
3 – rdd.stats()
4 – rdd_Celsius = rdd.map(lambda T: (T-32)*5/9)
5 – rdd_Celsius.collect()
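Note that rdd.stats() returns a StatCounter with count, mean, stdev, max and min in one pass; the same values are also available individually via rdd.count(), rdd.mean() and rdd.stdev().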
Question 3
Write a Spark program to do the following tasks:
1. Create an RDD with the following 7 lines of text: "First Line", "Now the 2nd", "This is the 3rd
line", "This is not the 3rd line?", "This is the 5th ", "This is the 6th", "Last Line"
2. Create an RDD (based on the previous one, with Spark functions) with only the lines that start
with the word "This"
3. Create an RDD (based on the previous one, with Spark functions) without the lines with the word
"not"
4. Create an RDD (based on the previous one, with Spark functions) with the text in capital letters
5. Display only two elements of the resulting RDD
1–
text = ["First Line", "Now the 2nd", "This is the 3rd line", "This is not the 3rd line?", "This is the 5th ", "This is the 6th", "Last
Line"]
rdd = sc.parallelize(text)
rdd.collect()
2–
rdd_2 = rdd.filter(lambda line: line.startswith("This"))
rdd_2.collect()
3–
rdd_3 = rdd.filter(lambda line: "not" not in line)
rdd_3.collect()
4–
rdd_4 = rdd.map(lambda line: line.upper())
rdd_4.collect()
5 – rdd_4.take(2)
Question 4
Write a Spark program to do the following tasks:
1. Create an RDD with the data below (simulating customer acquisitions):
o 'Client1:p2,p3'
o 'Client2:p1,p2,p3'
o 'Client3:p3,p4'
2. Write a Spark program to convert the above RDD into the following output:
o 'Client1', 'p2'
o 'Client1', 'p3'
o 'Client2', 'p1'
o 'Client2', 'p2'
o 'Client2', 'p3'
o 'Client3', 'p3'
o 'Client3', 'p4'
1–
rdd = sc.parallelize(['Client1:p2,p3', 'Client2:p1,p2,p3', 'Client3:p3,p4'])
rdd.collect()
2-
rdd2 = rdd.map(lambda line: line.split(":"))
rdd3 = rdd2.map(lambda fields: (fields[0],fields[1]))
rdd4 = rdd3.flatMapValues(lambda p: p.split(","))
rdd4.collect()
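A compact alternative of my own that does the split and the pairing in one pass (same output):
rdd_alt = rdd.flatMap(lambda line: [(line.split(":")[0], p) for p in line.split(":")[1].split(",")])
rdd_alt.collect()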
Question 5
Write a Spark algorithm to count the number of distinct words in the text lines below (or similar input
you may write).
o "First Line"
o "Now the 2nd line"
o "This is the 3rd line"
o "This is not the final line?"
o "This is the 5th "
o "This is the 6th"
o "Last Line is the 7th"
The output must be a pair RDD with the distinct words and the corresponding number of occurrences.
text = ["First Line", "Now the 2nd line", "This is the 3rd line", "This is not the final line?", "This is the 5th", "This is the 6th",
"Last Line is the 7th"]
rdd = sc.parallelize(text)
rdd2 = rdd.flatMap(lambda line: line.split()) \
.map(lambda word: (word, 1)) \
.reduceByKey(lambda w1,w2: w1+w2)
rdd2.collect()
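Note that this counts "Line" and "line" as different words. If case-insensitive counting is intended, a small variant (punctuation such as "line?" is still kept as part of the word):
rdd2 = rdd.flatMap(lambda line: line.lower().split()) \
.map(lambda word: (word, 1)) \
.reduceByKey(lambda w1,w2: w1+w2)
rdd2.collect()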
Question 6
Write a Spark program to do the following tasks:
1. Create an RDD based on the list of the following 4 tuples:
[('Mark',25),('Tom',22),('Mary',20),('Sofia',26)]
2. Create a Dataframe based on the previous RDD with the 2 columns named "Name" and "Age"
3. Display the new DataFrame
4. Create a new DataFrame based on the previous one, adding a new column named "AgePlus"
with the content of Age multiplied by 1.2
5. Write the new DataFrame (with the columns "Name", "Age" and "AgePlus") in dbfs in Delta
format
6. Check that the written DataFrame/file is in dbfs
1 – rdd = sc.parallelize([('Mark',25),('Tom',22),('Mary',20),('Sofia',26)])
2 – df = spark.createDataFrame(rdd).toDF("Name","Age")
3 – display(df)
4 – df2 = df.withColumn("AgePlus", df["Age"]*1.2)
5 – df2.write.format("delta").save("/FileStore/tables/df2")
6 - %fs ls /FileStore/tables/df2
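To double-check step 6, the saved Delta table can also be read back from the same path:
df_check = spark.read.format("delta").load("/FileStore/tables/df2")
display(df_check)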
Question 7
Explain the major differences between Spark SQL, Hive, and Impala. Give examples supporting your
explanation for each of the 3 cases.
• Spark SQL is a distributed in-memory computation engine. It is a Spark module for structured data processing, built
on top of Spark Core.
o Spark SQL can handle several independent processes in a distributed manner across thousands of nodes,
spread over several physical and virtual machines.
o It works together with other Spark modules used in applications such as stream processing and machine
learning.
• On the other hand, Hive is data warehouse software for querying and managing large distributed datasets, built on
top of the Hadoop Distributed File System (HDFS).
o Hive is designed for batch processing through the use of the MapReduce programming model.
• Finally, Impala is a massively parallel processing (MPP) engine developed by Cloudera.
o Contrary to Spark, it supports a multi-user environment while keeping the qualities of Hadoop: it supports
columnar storage, a tree architecture, Apache HBase storage and HDFS.
o It has significantly higher query throughput than Spark SQL and Hive.
o However, on large analytical queries Spark SQL and Hive outperform Impala.
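• As a supporting example, a minimal Spark SQL query over a toy DataFrame (data invented for illustration); the
equivalent SELECT in Hive would run as a MapReduce batch job, and in Impala it would run on its MPP daemons:
o df = spark.createDataFrame([("Ana", 30), ("Rui", 41)], ["name", "age"])
o df.createOrReplaceTempView("people")
o spark.sql("SELECT name FROM people WHERE age > 35").show()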
Question 8
Write a Spark program to do the following tasks:
1. Create a Graph that depicts the following data and relationships:
o Alice: age=31; Esther: age=35; David: age=34; Bob age=29
o Alice is married to Bob and is friend of Esther
o Esther is married to David and is friend of Alice
o Bob and David are friends
2. Print the edges and vertices of your graph
3. Create a subgraph with only the friend relationships and show the result
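1 – (the answer for this step is missing in the original; a minimal sketch assuming the graphframes package is installed on the cluster)
from graphframes import GraphFrame
vertices = spark.createDataFrame([("Alice", 31), ("Esther", 35), ("David", 34), ("Bob", 29)], ["id", "age"])
edges = spark.createDataFrame([("Alice", "Bob", "married"), ("Alice", "Esther", "friend"), ("Esther", "David", "married"), ("Esther", "Alice", "friend"), ("Bob", "David", "friend")], ["src", "dst", "relationship"])
g = GraphFrame(vertices, edges)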
2–
display(g.vertices)
display(g.edges)
3–
friends = g.edges.filter("relationship = 'friend' ")
friends.show()
Question 9
Write a Spark program to do the following tasks:
1. Create a DataFrame simulating insurance customer data with:
o The columns: ["age","bmi","children","charges","smoker"]
o 4 records with the following values: [ [19,27,0,168,"y"], [18,33,1,177,"n"],
[28,35,2,191,"s"], [32,38,3,208,"n"]]
2. Create a new DataFrame based on the previous one with the values in the smoker column
encoded in 1/0 values
3. Display the new DataFrame without the original "smoker" column
1–
records = [ [19,27,0,168,"y"], [18,33,1,177,"n"], [28,35,2,191,"y"], [32,38,3,208,"n"] ]
df = spark.createDataFrame(records, ["age","bmi","children","charges","smoker"])
2–
from pyspark.ml.feature import StringIndexer
# StringIndexer maps the string labels to numeric indices (0.0/1.0 here, since "smoker" has two values)
smokerIndexer = StringIndexer(inputCol = "smoker", outputCol = "smokerIndex")
df2 = smokerIndexer.fit(df).transform(df)
3-
df_final = df2.drop("smoker")
display(df_final)
Question 10
Write a Spark program to do the following tasks:
1. Create a DataFrame simulating insurance customer data (or assume you already have a file
with the data and read it from dbfs), with:
o The columns: ["age","bmi","children","charges","smoker"]
o 4 records with the following values: [ [19,27,0,168,"y"], [18,33,1,177,"n"],
[28,35,2,191,"s"], [32,38,3,208,"n"] ]
2. Train a Linear Regression model (in a Pipeline) to predict the "charges" column
3. Evaluate the model, displaying its coefficients, RMSE and R2
1–
records = [ [19,27,0,168,"y"], [18,33,1,177,"n"], [28,35,2,191,"y"], [32,38,3,208,"n"] ]
df = spark.createDataFrame(records, ["age","bmi","children","charges","smoker"])
2–
from pyspark.ml.feature import StringIndexer
from pyspark.ml.feature import VectorAssembler
from pyspark.ml import Pipeline
from pyspark.ml.regression import LinearRegression
from pyspark.ml.evaluation import RegressionEvaluator
# Encode the "smoker" column and assemble the feature vector; "charges" is the label
smokerIndexer = StringIndexer(inputCol = "smoker", outputCol = "smokerIndex")
assembler = VectorAssembler(inputCols = ["age","bmi","children","smokerIndex"], outputCol = "features")
# Split the data into training and test sets (80% train, 20% test)
(trainingData, testData) = df.randomSplit([0.8,0.2])
# Create a linear regression object predicting "charges"
lr = LinearRegression(labelCol = "charges", maxIter = 10, regParam = 0.3, elasticNetParam = 0.8)
# Chain indexer, assembler and regression in a Pipeline
pipeline = Pipeline(stages = [smokerIndexer, assembler, lr])
# Train the Model
model = pipeline.fit(trainingData)
# Make Predictions
predictions = model.transform(testData)
3–
evaluator = RegressionEvaluator(labelCol = "charges", predictionCol = "prediction")
# The LinearRegression model is the last stage of the fitted pipeline
print("Coefficients: " + str(model.stages[-1].coefficients))
print("RMSE:", evaluator.evaluate(predictions, {evaluator.metricName: "rmse"}))
print("R2:", evaluator.evaluate(predictions, {evaluator.metricName: "r2"}))
Support Material – Test Analyzing Big Data
7 – Concerning the characteristics of “Lists” only one of the following options is NOT true. Choose it.
a) Lists are collections of items where each item in the list has an assigned index value
b) Lists consist of values separated by commas
c) A list is mutable (meaning you can change its contents)
d) Lists are enclosed in square brackets [ ] and each item is separated by a comma
10 – The annexed example uses the lambda function. Is the syntax correct?
a) Yes
b) No
11 – Concerning the use of Hadoop, only one of the following sentences is correct. Choose it.
a) A Node is a group of computers working together
b) A Cluster is an individual computer in the cluster
c) A Daemon is a program running on a node
d) With Hadoop we can’t explore the nodes (name or data)
13 – Concerning HDFS, only one of the following sentences is NOT true. Choose it.
a) HDFS is responsible for storing data on the cluster
b) HDFS is a File System written in Java
c) HDFS sits on top of a native file system
d) HDFS provides non-redundant storage
16 – Concerning Parquet files, only one of the following sentences is NOT true. Choose it.
a) Parquet is supported in Spark, MapReduce, Hive, Pig, Impala, and others
b) Parquet reduces performance
c) Parquet is a columnar format developed by Cloudera and Twitter
d) Parquet is most efficient when adding many records at once
19 – Concerning Apache Flume, only one of the following sentences is NOT true. Choose it.
a) Apache Flume is a real-time data ingestion tool
b) The use of Apache Flume is restricted to log data aggregation only
c) Apache Flume is a distributed, reliable, and available system for efficiently collecting,
aggregating, and moving large amounts of streaming data from many different sources to a
centralized data store
d) Apache Flume is a top level project at the Apache Software Foundation