
Question 1. Spark Programming

DataFrame

You want to construct a DataFrame using the data from a text file. Assume that each line of the file holds
a set of words separated by spaces. The fields are as follows:

Last Name, First Name, Age, Gender, State of Residence, Education level

First, create a text file with these fields for, say, 10 sample records. Use this sample file as the input to
your program. You want to construct a DataFrame with Last Name, Age, Gender, and State of Residence. Use
the explicit schema construction method to create the DataFrame.

You want to produce a result that shows the number of people who are 21 years or older and from New York
State. Provide the actual runnable code that creates the DataFrame with an appropriately constructed schema,
and write the query using both Spark SQL and the DataFrame API.

You may use any programming language for this. Submit the actual code along with a
screenshot of the results to show that you ran the code on your local machine using Spark.
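
Before running the Spark job, it can help to know what count the sample file should produce. The following is a minimal plain-Python sketch (not Spark code) that applies the same projection and filter. The sample records, the names Person, parse_line and count_adults_in_state, and the use of the abbreviation "NY" for New York State (so that the state stays a single space-separated word) are all assumptions made for illustration.

```python
from typing import NamedTuple


class Person(NamedTuple):
    # The four columns the DataFrame is supposed to keep.
    last_name: str
    age: int
    gender: str
    state: str


def parse_line(line: str) -> Person:
    # Input order: last name, first name, age, gender, state, education level.
    last, _first, age, gender, state, _education = line.split()
    return Person(last, int(age), gender, state)


def count_adults_in_state(lines, state="NY", min_age=21):
    # Same predicate the Spark SQL / DataFrame query should express.
    people = [parse_line(line) for line in lines if line.strip()]
    return sum(1 for p in people if p.age >= min_age and p.state == state)


# Hypothetical sample records (assumption: state is a two-letter code).
SAMPLE = [
    "Smith John 34 M NY Bachelors",
    "Lee Ann 19 F NY HighSchool",
    "Patel Raj 21 M NY Masters",
    "Garcia Maria 45 F CA PhD",
]

print(count_adults_in_state(SAMPLE))  # 2
```

The Spark result for the same input file should match this count. If the file spells the state as "New York", a different delimiter is needed, because the fields are split on spaces.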

Spark RDD

PageRank algorithm and code. Construct all the required RDDs and show their content.

List all the transformations and actions in the code. List all the RDDs with their associated content for the
first iteration, where you compute the page rank of each page. You do not need to make this code
runnable. You can simply illustrate it with example content for the given example.
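
To make "content of each RDD in the first iteration" concrete, here is a small plain-Python sketch that mirrors the usual Spark PageRank steps (join, flatMap, reduceByKey, mapValues) on ordinary dictionaries and lists. The three-page graph, the damping factor of 0.85, and the initial rank of 1.0 are assumptions for illustration; substitute the graph from the example you were given.

```python
from collections import defaultdict

DAMPING = 0.85  # assumed damping factor


def page_rank_iteration(links, ranks):
    # links.join(ranks).flatMap(...): each page splits its current rank
    # equally among the pages it links to.
    contribs = [
        (dest, ranks[src] / len(dests))
        for src, dests in links.items()
        for dest in dests
    ]
    # contribs.reduceByKey(add): sum the contributions per destination page.
    totals = defaultdict(float)
    for dest, contribution in contribs:
        totals[dest] += contribution
    # mapValues: apply the damping formula. Pages with no inbound
    # contribution get a total of 0.0 in this sketch.
    new_ranks = {page: (1 - DAMPING) + DAMPING * totals[page] for page in links}
    return contribs, new_ranks


# Hypothetical example graph: page -> pages it links to.
links = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
ranks = {page: 1.0 for page in links}

contribs, new_ranks = page_rank_iteration(links, ranks)
print(contribs)
print(new_ranks)
```

Here contribs corresponds to the contributions RDD and new_ranks to the ranks RDD after one iteration. In Spark, join, flatMap, reduceByKey and mapValues are transformations, and nothing executes until an action such as collect is called.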

Spark Streams

Provide a code fragment that gets tweets from Twitter using its streaming API and extracts the
hashtags. After extracting the hashtags, it should count the occurrences of the hashtag #cloudcomputing.
Ideally, your code should be ready to run without syntax errors, but you do not need to show the output
or actually run it.
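
The per-tweet logic can be checked independently of the streaming source. The sketch below, in plain Python, shows one way to extract hashtags and count a target tag over a batch of tweet texts; in a streaming job the same functions would be applied to each micro-batch. The function names, the sample tweets, and the choice to match case-insensitively are assumptions, and the connection to Twitter itself is not shown.

```python
import re
from collections import Counter

# A hashtag is taken to be '#' followed by letters, digits or underscores.
HASHTAG_RE = re.compile(r"#\w+")


def extract_hashtags(text):
    # Lower-case so that #CloudComputing and #cloudcomputing count together.
    return [tag.lower() for tag in HASHTAG_RE.findall(text)]


def count_hashtag(tweets, target="#cloudcomputing"):
    # flatMap tweets to hashtags, then count occurrences of each tag.
    counts = Counter(tag for text in tweets for tag in extract_hashtags(text))
    return counts[target.lower()]


# Hypothetical batch of tweet texts.
tweets = [
    "Loving #CloudComputing and #Spark",
    "No tags here",
    "#cloudcomputing #cloudcomputing rocks",
]

print(count_hashtag(tweets))  # 3
```

Because the pattern consumes the whole tag, a longer tag such as #cloudcomputing2024 is counted as a different hashtag, not as a match.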
Question 2. System Design

You are going to design part of a chat messaging service like WhatsApp using AWS services. The
following features need to be supported by your backend platform. You will sketch the supporting frontend
with the assumptions you are making.

1. User can log in from any device using a phone number at setup time (like WhatsApp or WeChat
etc.). User could be offline. User could log in from multiple devices where the app has been
installed.
2. User can see the list of users in the contact list. User can see the status of each user in the
contact list. The following status should be displayed: Online, last active with a timestamp.
3. User can send a message to a user that is in their contact list. It should show the three status
states of the message (message sent, message delivered, message read). You need to have the
appropriate data pipeline.
4. User can send attachments such as image, video, and audio files. You need to support this
requirement with appropriate data pipeline.
5. User can delete the chat content just like in WhatsApp.

You need to list the APIs and describe the backend architecture using AWS services and infrastructure. Your
design should be scalable, event-driven, and asynchronous. It should handle failures carefully to
guarantee reliability.
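
In an asynchronous design with retries, status events for a message (requirement 3) may arrive duplicated or out of order. One possible way to keep the displayed status consistent is to treat it as a forward-only state, as in the minimal Python sketch below. The names MessageStatus and apply_status_event are hypothetical and are not part of any AWS API; how the status is stored and delivered is left to your design.

```python
from enum import IntEnum


class MessageStatus(IntEnum):
    # The three states from requirement 3, ordered by progress.
    SENT = 1
    DELIVERED = 2
    READ = 3


def apply_status_event(current, event):
    # Idempotent, forward-only update: a duplicate or late event can
    # never move a message back to an earlier state.
    return max(current, event)


status = MessageStatus.SENT
# Events arriving out of order and with a duplicate.
for event in (MessageStatus.READ, MessageStatus.DELIVERED, MessageStatus.READ):
    status = apply_status_event(status, event)

print(status.name)  # READ
```

Making the update idempotent means the event consumer can safely reprocess a message after a failure, which fits an at-least-once delivery pipeline.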

Please refer to the YouTube video on WhatsApp (https://www.youtube.com/watch?v=L7LtmfFYjc4)
and/or Chapter 12 of Alex Xu’s System Design Interview book to become familiar with a working design for
a WhatsApp backend.

You need to list in detail all the assumptions you are making, all the APIs required to support the 5
requirements listed above, and the backend architecture. Take time to do this carefully and as completely
as possible.

Question 3. Borg and Kubernetes


We covered Borg and Kubernetes using the reference materials.

Please refer to those to answer this part. You need to complete the first three tutorials from the link below:

https://kubernetes.io/docs/tutorials/kubernetes-basics/ following the description from the tutorial.

You need to complete only the first three. You should see this figure once you open the link above.

You need to submit the last screenshot once you complete each tutorial to prove that you completed it. In
addition, you need to provide a brief summary of what each tutorial accomplished, with an explanation of
the overall steps.
Question 4. Concepts and Papers

D1 [5]. Why did Kafka decide to use a stateless broker? What are the benefits and drawbacks? How
are the drawbacks handled?

D2 [10]. The Borg presentation video by John Wilkes: https://www.youtube.com/watch?v=7MwxA4Fj2l4

Based on this presentation and the Borg paper, describe how the Borg architecture handles
failures of the Borgmaster, the scheduler, or any Borg node.

D3 [5]. What is a SparkContext (sc)? What is it used for (describe with a diagram and
illustration)? How do you submit Spark code to an EMR cluster? Your illustration should be a
step-by-step description (with code fragments, if any) that, if followed, will actually work on
AWS EMR.

D4 [5]. Refer to the Kafka and LinkedIn slide. Illustrate how Kafka is used in the overall
LinkedIn data pipeline. Using the architecture diagram, explain how the friend/user and job
recommendations that you get when you log in to LinkedIn might work.
