
Build A Serverless Real-Time Data Processing Application On AWS

A Project Report Submitted in Partial Fulfillment of the Requirements for the Award of the Degree of Bachelor of Science (Research) in Computer Science with Specialization in Cloud Computing and Big Data

Submitted by

Aniket Vilas Chinchole (R18BS005)


Under the Guidance of
Internal Guide
Prof. Deepa B G

External Guide
Mr. Ramesh
School of Computer Science and Applications
REVA University, Bangalore

SCHOOL OF COMPUTER SCIENCE AND APPLICATIONS


REVA University
Kattigenahalli, Yelahanka, Bangalore – 560 064
January 2022

CERTIFICATE

This is to certify that the Minor project work entitled "Build A Serverless Real-Time Data Processing Application On AWS" submitted to the School of Computer Science and Applications, REVA University in partial fulfilment of the requirements for the award of the Degree of Bachelor of Science (Research) in Computer Science with specialization in Cloud Computing and Big Data in the academic year 2021-2022 is a record of the original work done by Aniket Vilas Chinchole (R18BS005) under my supervision and guidance, and that this Minor project work has not formed the basis for the award of any Degree / Diploma / Associateship / Fellowship or similar title to any candidate of any University.

Place: Bangalore
Date:

Internal Guide Signature:

External Guide Signature:

Signature of the Program Co-ordinator

Signature of the Director

Submitted for the University Examination held on

Internal Examiner External Examiner

DECLARATION

I, Aniket Vilas Chinchole (R18BS005), hereby declare that this Minor project work entitled "Build A Serverless Real-Time Data Processing Application On AWS" was carried out under the guidance of Prof. Deepa B G and Mr. Ramesh, and that this Minor project work has not formed the basis for the award of any Degree / Diploma / Associateship / Fellowship or similar title to any candidate of any University.

Signature of the Candidate

Date

Countersigned by

Signature of the Internal guide Signature of the External guide

ABSTRACT
In this mini project, we will learn how to build a serverless app to process real-time data streams. We
will build infrastructure for a fictional ride-sharing company.
In this case, we will enable operations personnel at the fictional Wild Rydes headquarters to monitor the
health and status of their unicorn fleet. Each unicorn is equipped with a sensor that reports its location
and vital signs. We will use AWS to build applications to process and visualize this data in real-time.
We will use AWS Lambda to process real-time streams, Amazon DynamoDB to persist records in a
NoSQL database, Amazon Kinesis Data Analytics to aggregate data, Amazon Kinesis Data Firehose to
archive the raw data to Amazon S3, and Amazon Athena to run ad-hoc queries against the raw data.

TABLE OF CONTENTS

CHAPTERS

1. INTRODUCTION
1.1 INTRODUCTION TO THE PROJECT
1.2 STATEMENT OF THE PROBLEM
2. SYSTEM ANALYSIS
2.1 EXISTING SYSTEM
2.2 PROPOSED SYSTEM
3. SYSTEM REQUIREMENTS
3.1 HARDWARE REQUIREMENTS
3.2 SOFTWARE REQUIREMENTS
4. SYSTEM DESIGN
4.1 ARCHITECTURE DIAGRAM
4.2 ALGORITHM USED
5. IMPLEMENTATION
6. SNAPSHOTS
7. CONCLUSION
BIBLIOGRAPHY

CHAPTER 1
INTRODUCTION

1.1 INTRODUCTION TO THE PROJECT

Cloud Computing has become very popular due to the multiple benefits it provides and is being
adopted by businesses worldwide. Flexibility to scale up or down as per business needs,
faster and more efficient disaster recovery, subscription-based models that reduce the high cost of
hardware, and flexible working for employees are some of the benefits of the cloud that attract
businesses. Like the cloud, Data Analytics is another crucial area that businesses are
exploring for their growth. The exponential rise in the amount of data available on the
internet is a result of the boom in the usage of social media, mobile apps, IoT devices, sensors
and so on. It has become imperative for organisations to analyse this data to gain insights into
their businesses and take appropriate action.

AWS provides a reliable platform for solving complex problems, on which cost-effective
infrastructure can be built with great ease. AWS offers a wide range of managed
services, including computing, storage, networking, databases, analytics, application services and
many more.

1.2 STATEMENT OF THE PROBLEM

We have analysed multiple software solutions that analyse data collected from the market,
provide information as well as suggestions, and deliver a better customer experience. These
include trading applications providing stock prices, taxi companies providing the locations of
nearby taxis, journey-planning applications providing live updates on different transport media,
and many more.
We have chosen a "server-less" platform (the serverless computing execution model) to build
the real-time data-processing app. The architecture is based on managed services provided by
AWS.

What is "Server-less"?

Serverless is a cloud-based execution model in which the cloud provider dynamically allocates
and runs the servers. It is a consumption-based model in which pricing is directly proportional
to consumption. AWS takes complete ownership of operational responsibilities, eliminating
infrastructure management and offering higher availability and uptime.

CHAPTER 2
SYSTEM ANALYSIS
2.1 EXISTING SYSTEM

This workshop is an interactive exercise that builds infrastructure to collect, process, and persist
data without using servers.
Agenda
• Overview of the workshop scenario, modules, and the services we'll utilize
• Review of workshop prerequisites and tools
• Execution of four modules, each about thirty minutes in length

2.2 PROPOSED SYSTEM

Calling all serverless developers! Wild Rydes (www.wildrydes.com), the world's leading unicorn
transportation startup, needs your help! The company's rydesharing network of unicorns has grown
to thousands, fulfilling hundreds of thousands of passenger rydes each day. Your mission is to collect,
store, process, and analyze data to track the real-time location and health of our unicorns. In this
workshop, learn how to build infrastructure to process data streams in real time using Amazon
Kinesis. Build serverless applications using Amazon Kinesis Analytics to aggregate and summarize
data and use AWS Lambda to store aggregated data in Amazon DynamoDB. Finally, use Amazon
Kinesis Firehose to build a data lake in Amazon S3, and use Amazon Athena to run ad-hoc queries
against it. Requirements: laptop, text editor, AWS account, and AWS Command Line Interface (CLI)
installed and configured.

CHAPTER 3
SYSTEM REQUIREMENTS
3.1 HARDWARE REQUIREMENTS:
• Laptop/Desktop
• Keyboard
• Mouse
• Internet connection (2 Mbps or faster)

3.2 SOFTWARE REQUIREMENTS:


• Active AWS Account.
• Browser (Chrome Recommended)
• AWS Lambda
• Amazon Kinesis
• Amazon S3
• Amazon DynamoDB
• Amazon Cognito
• Amazon Athena
• AWS IAM

CHAPTER 4
SYSTEM DESIGN

4.1 ARCHITECTURE DIAGRAM

[Figure: sensor data flows from the producer into an Amazon Kinesis data stream; Amazon Kinesis Data Analytics aggregates it, an AWS Lambda function persists the aggregates to Amazon DynamoDB, and Amazon Kinesis Data Firehose archives the raw data to Amazon S3 for ad-hoc querying with Amazon Athena.]
4.2 ALGORITHM USED

Overview
Serverless applications don't require you to provision, scale, or manage any servers. They
can be built for nearly any type of application or backend service, and everything
required to run and scale the application with high availability is handled for you.
Serverless architectures can be used for many types of applications. For example, you can
process transaction orders, analyze click streams, clean data, generate metrics, filter logs,
analyze social media, or perform IoT device data telemetry and metering.

We will use AWS to build applications to process and visualize this data in real-time. We
will use AWS Lambda to process real-time streams, Amazon DynamoDB to persist
records in a NoSQL database, Amazon Kinesis Data Analytics to aggregate data, Amazon
Kinesis Data Firehose to archive the raw data to Amazon S3, and Amazon Athena to run
ad-hoc queries against the raw data.

This workshop is broken up into four modules. We must complete each module before
proceeding to the next.

1. Build a data stream


Create a stream in Kinesis and write to and read from the stream to track
Wild Rydes unicorns on the live map. In this module you'll also create an
Amazon Cognito identity pool to grant live map access to your stream.

2. Aggregate data
Build a Kinesis Data Analytics application to read from the stream and
aggregate metrics like unicorn health and distance traveled each minute.

3. Process streaming data


Persist aggregated data from the application to a backend DynamoDB database
and run queries against that data.

4. Store & query data


Use Kinesis Data Firehose to flush the raw sensor data to an S3 bucket
for archival purposes. Using Athena, you'll run SQL queries against the
raw data for ad-hoc analyses.
CHAPTER 5
IMPLEMENTATION

Create an AWS Account:

Use AWS account email address and password to sign in as the AWS account root user to
the IAM console at https://console.aws.amazon.com/iam/.

In the navigation pane of the console, choose Users, and then choose Add user.

• For User name, type Administrator.


• Select the check box next to AWS Management Console access, select Custom
password, and then type the new user's password in the text box. You can optionally
select Require password reset to force the user to create a new password the next time
the user signs in.
• Choose Next: Permissions

Set up AWS Cloud9 IDE:

• Go to the AWS Management Console, select Services then select Cloud9 under
Developer Tools.

• Select Create environment.


• Enter Development into Name and optionally provide a Description.
• Select Next Step.
• You may leave Environment settings at their defaults of launching a new t2.micro EC2
instance which will be paused after 30 minutes of inactivity.
• Select Next step, review your choices, and select Create environment.
• Once the environment is ready, verify in the Cloud9 terminal that you are signed in with
your administrator credentials: aws sts get-caller-identity
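The expected output looks like the following (a sketch; the account ID and ARN here are placeholders):

    ~/environment $ aws sts get-caller-identity
    {
        "UserId": "AIDAEXAMPLEUSERID",
        "Account": "123456789012",
        "Arn": "arn:aws:iam::123456789012:user/Administrator"
    }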

Installation

• Switch to the tab where you have your Cloud9 environment open.


• Download and unpack the command line clients by running the following command in
the Cloud9 terminal:

    curl -s https://dataprocessing.wildrydes.com/client/client.tar | tar -xv
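Assuming the archive matches the one from the original AWS workshop, it unpacks the producer and consumer binaries used throughout this chapter:

    ~/environment $ ls
    consumer  producer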

5.1 Real-time Streaming Data:

5.1.1 Create an Amazon Kinesis stream

• Go to the AWS Management Console, select Services then select Kinesis under Analytics.
• Select Get started if prompted with an introductory screen.

• Select Create data stream.

• Enter wildrydes into Kinesis stream name and 1 into Number of shards, then select
Create Kinesis stream.

• Within 60 seconds, your Kinesis stream will be ACTIVE and ready to store real-time
streaming data.
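Equivalently (a sketch using the AWS CLI, which the workshop already requires to be installed and configured), the stream can be created and verified from the Cloud9 terminal:

    aws kinesis create-stream --stream-name wildrydes --shard-count 1
    aws kinesis describe-stream-summary --stream-name wildrydes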

5.1.2 Produce messages into the stream

• Switch to the tab where you have your Cloud9 environment open.


• In the terminal, run the producer to start emitting sensor data to the stream: ./producer

5.1.3 Read messages from the stream

• Click New Terminal to open a new terminal tab.


• Run the consumer to start reading sensor data from the stream: ./consumer
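Each message on the stream is a JSON object. An illustrative record (the field names follow the original workshop; the values here are invented):

    {
        "Name": "Shadowfax",
        "StatusTime": "2022-01-10 09:17:08.189",
        "Latitude": 42.264444,
        "Longitude": -71.970959,
        "Distance": 175,
        "MagicPoints": 110,
        "HealthPoints": 150
    }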

5.1.4 Create an identity pool for the unicorn dashboard

• Go to the AWS Management Console, select Services then select Cognito under
Security, Identity & Compliance.
• Select Manage Identity Pools.
• Enter wildrydes into Identity pool name.
• Tick the Enable access to unauthenticated identities checkbox.
• Click Create Pool.
• Click Allow which will create authenticated and unauthenticated roles for your identity
pool.
• Select Go to Dashboard.
• Select Edit identity pool in the upper right-hand corner.
• Note the Identity pool ID for use in a later step.

5.1.5 Grant the unauthenticated role access to the stream


• Go to the AWS Management Console, select Services then select IAM under
Security, Identity & Compliance.

• Select Roles in the left-hand navigation.


• Select the Cognito_wildrydesUnauth_Role.
• Select Add inline policy.
• Select Choose a service and select Kinesis.
• Tick the Read and List permissions checkboxes in the Actions section.
• Select Resources to limit the role to the wildrydes stream.
• Select Add ARN next to stream.
• Enter the region you're using in Region (e.g. us-east-1), your Account ID in
Account, and wildrydes in Stream name. To find your twelve-digit AWS
Account ID in the AWS Management Console, select Support in the
navigation bar in the upper-right, and then select Support Center. Your currently
signed-in account ID appears in the upper-right corner below the Support menu.
• Select Add.
• Select Review policy.
• Enter WildrydesDashboardPolicy in Name.
• Select Create policy.
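The resulting inline policy corresponds to a JSON document along these lines (a sketch: the account ID is a placeholder, and the action list is representative of the Read and List permission groups rather than exhaustive):

    {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": [
                    "kinesis:DescribeStream",
                    "kinesis:GetShardIterator",
                    "kinesis:GetRecords",
                    "kinesis:ListShards",
                    "kinesis:ListStreams"
                ],
                "Resource": "arn:aws:kinesis:us-east-1:123456789012:stream/wildrydes"
            }
        ]
    }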

5.1.6 View unicorn status on the dashboard

• Open the Unicorn Dashboard.


• Enter the Cognito Identity Pool ID and select Start.
• Validate that you can see the unicorn on the map.
• Click on the unicorn to see more details from the stream.

5.1.7 Experiment with the producer

• Stop the producer by pressing Control + C and notice the messages stop and the unicorn
disappear from the dashboard after 30 seconds.
• Start the producer again and notice the messages resume and the unicorn reappear.
• Hit the (+) button and click New Terminal to open a new terminal tab.
• Start another instance of the producer in the new tab. Provide a specific unicorn name
and notice data points for both unicorns in the consumer's output: ./producer -name
Bucephalus
• Check the dashboard and verify you see multiple unicorns.

5.2 Aggregate data

5.2.1 Create an Amazon Kinesis stream

• Go to the AWS Management Console, click Services then select Kinesis under
Analytics.
• Select Get started if prompted with an introductory screen.
• Select Create data stream.
• Enter wildrydes-summary into Kinesis stream name and 1 into Number of shards, then
select Create Kinesis stream.
• Within 60 seconds, your Kinesis stream will be ACTIVE and ready to store the real-time
streaming data.

5.2.2 Create an Amazon Kinesis Data Analytics application

• Switch to the tab where you have your Cloud9 environment open.
• Run the producer to start emitting sensor data to the stream.
• Go to the AWS Management Console, click Services then select Kinesis under
Analytics.
• Select Create analytics application.
• Enter wildrydes into the Application name and then select Create application.
• Select Connect streaming data.
• Select wildrydes from Kinesis stream.
• Scroll down and click Discover schema, wait a moment, and ensure that the schema was
properly auto-discovered.
• Select Save and continue.
• Select Go to SQL editor. This will open up an interactive query session where we can
build a query on top of our real-time Amazon Kinesis stream.
• Select Yes, start application. It will take 30-90 seconds for your application to start.
• Copy and paste the following SQL query into the SQL editor:
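The aggregation SQL below is reproduced as a sketch from the original AWS workshop this report follows; it pumps one row per unicorn per minute into an in-application stream:

    CREATE OR REPLACE STREAM "DESTINATION_SQL_STREAM" (
        "Name"            VARCHAR(16),
        "StatusTime"      TIMESTAMP,
        "Distance"        SMALLINT,
        "MinMagicPoints"  SMALLINT,
        "MaxMagicPoints"  SMALLINT,
        "MinHealthPoints" SMALLINT,
        "MaxHealthPoints" SMALLINT
    );

    CREATE OR REPLACE PUMP "STREAM_PUMP" AS
        INSERT INTO "DESTINATION_SQL_STREAM"
            SELECT STREAM "Name", "ROWTIME", SUM("Distance"),
                          MIN("MagicPoints"), MAX("MagicPoints"),
                          MIN("HealthPoints"), MAX("HealthPoints")
            FROM "SOURCE_SQL_STREAM_001"
            GROUP BY FLOOR("SOURCE_SQL_STREAM_001"."ROWTIME" TO MINUTE), "Name";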

• Select Save and run SQL. Each minute, you will see rows arrive containing the
aggregated data. Wait for the rows to arrive.

• Click the Console link.


• Select Connect to a destination.
• Select wildrydes-summary from Kinesis stream.
• Select DESTINATION_SQL_STREAM from In-application stream name.
• Select Save and continue.

5.2.3 Read messages from the stream

• Switch to the tab where you have your Cloud9 environment open.
• Run the consumer to start reading sensor data from the stream: ./consumer -stream
wildrydes-summary
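Each record on the summary stream is a per-minute aggregate for one unicorn, shaped roughly as follows (illustrative values):

    {
        "Name": "Shadowfax",
        "StatusTime": "2022-01-10 09:18:00.000",
        "Distance": 362,
        "MinMagicPoints": 170,
        "MaxMagicPoints": 172,
        "MinHealthPoints": 146,
        "MaxHealthPoints": 149
    }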

5.2.4 Experiment with the producer

• Switch to the tab where you have your Cloud9 environment opened.
• Stop the producer by pressing Control + C and notice the messages stop.
• Start the producer again and notice the messages resume.
• Hit the (+) button and click New Terminal to open a new terminal tab.
• Start another instance of the producer in the new tab. Provide a specific unicorn name
and notice data points for both unicorns in the consumer's output.

5.3 Process streaming data

5.3.1 Create an Amazon DynamoDB table

• Go to the AWS Management Console, choose Services then select DynamoDB under
Database.
• Click Create table.
• Enter UnicornSensorData for the Table name.
• Enter Name for the Partition key and select String for the key type.
• Tick the Add sort key checkbox. Enter StatusTime for the Sort key and select String
for the key type.
• Leave the Use default settings box checked and choose Create.
• Scroll to the Table details section of your new table's properties and note the Amazon
Resource Name (ARN). You will use this in the next step.

5.3.2 Create an IAM role for your Lambda function

• From the AWS Console, click on Services and then select IAM in the Security, Identity
& Compliance section.
• Select Policies from the left navigation and then click Create policy.
• Using the Visual editor, we're going to create an IAM policy to allow our Lambda
function access to the DynamoDB table created in the last section. To begin, select
Service, begin typing DynamoDB in Find a service, and click DynamoDB.
• Select Action, begin typing BatchWriteItem in Filter actions, and tick the
BatchWriteItem checkbox.
• In Region, enter the AWS Region in which you created the DynamoDB table in the
previous section, e.g.: us-east-1.
• In Account, enter your AWS Account ID, which is a twelve-digit number, e.g.
123456789012. To find your AWS Account ID in the AWS Management
Console, click on Support in the navigation bar in the upper-right, and then click
Support Center. Your currently signed-in account ID appears in the upper-right corner
below the Support menu.
• In Table Name, enter UnicornSensorData.
• Select Add.
• Select Review policy.
• Enter WildRydesDynamoDBWritePolicy in the Name field.
• Select Create policy.
• Select Roles from the left navigation and then select Create role.
• Select Next: Permissions.
• Begin typing AWSLambdaKinesisExecutionRole in the Filter text box and check the
box next to that policy.
• Begin typing WildRydesDynamoDBWritePolicy in the Filter text box and check the
box next to that policy.
• Click Next: Review.
• Enter WildRydesStreamProcessorRole for the Role name.
• Click Create role.

5.3.3 Create a Lambda function to process the stream

• Go to the AWS Management Console, choose Services then select Lambda under
Compute.
• Select Create a function.
• Enter WildRydesStreamProcessor in the Name field.
• Select Node.js 6.10 from Runtime.
• Select WildRydesStreamProcessorRole from the Existing role dropdown.
• Select Create function.
• Scroll down to the Function code section.
• Copy and paste the JavaScript code below into the code editor.
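The function below is a sketch based on the Node.js code from the original workshop: it base64-decodes each Kinesis record and batch-writes the resulting items to the DynamoDB table named by the TABLE_NAME environment variable.

    'use strict';

    const AWS = require('aws-sdk');

    const dynamoDB = new AWS.DynamoDB.DocumentClient();
    const tableName = process.env.TABLE_NAME;

    exports.handler = function(event, context, callback) {
        // Decode each Kinesis record and wrap it as a DynamoDB PutRequest
        const requestItems = event.Records.map((record) => {
            const json = Buffer.from(record.kinesis.data, 'base64').toString('ascii');
            return { PutRequest: { Item: JSON.parse(json) } };
        });

        // BatchWriteItem accepts at most 25 items per call, so write in chunks
        const requests = [];
        while (requestItems.length > 0) {
            const batch = { RequestItems: { [tableName]: requestItems.splice(0, 25) } };
            requests.push(dynamoDB.batchWrite(batch).promise());
        }

        Promise.all(requests)
            .then(() => callback(null, `Delivered ${event.Records.length} records`))
            .catch(callback);
    };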

• In the Environment variables section, enter an environment variable with Key
TABLE_NAME and Value UnicornSensorData.
• In the Basic settings section, set the Timeout to 1 minute.
• Scroll up and select Kinesis from the Designer section.
• In the Configure triggers section, select wildrydes-summary from Kinesis Stream.
• Leave Batch size set to 100 and Starting position set to Latest.
• Select Enable trigger.
• Select Add.
• Select Save.

5.3.4 Monitor the Lambda function

• Run the producer to start emitting sensor data to the stream with a unicorn name:
./producer -name Rosinante
• Select the Monitor tab and explore the metrics available to monitor the function. Select
Jump to Logs to explore the function's log output.

5.3.5 Query the DynamoDB table

• Select Services then select DynamoDB in the Database section.


• Select Tables from the left-hand navigation.
• Select UnicornSensorData.
• Select the Items tab. Here you should see each per-minute data point for each unicorn
for which you're running a producer.
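The same data can be queried from the Cloud9 terminal (a sketch, assuming the Rosinante producer from section 5.3.4 is running; the #name alias sidesteps DynamoDB's reserved-word restrictions on attribute names):

    aws dynamodb query \
        --table-name UnicornSensorData \
        --key-condition-expression "#name = :name" \
        --expression-attribute-names '{"#name": "Name"}' \
        --expression-attribute-values '{":name": {"S": "Rosinante"}}'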
5.4 Store & query data

5.4.1 Create an Amazon S3 bucket

• From the AWS Management Console select Services then select S3 under Storage.
• Select + Create bucket.
• Provide a globally unique name for your bucket such as wildrydes-data-yourname.
• Select the region you've been using for your bucket.
• Select Next twice, and then select Create bucket.

5.4.2 Create an Amazon Kinesis Data Firehose delivery stream

• From the AWS Management Console select Services then select Kinesis under
Analytics.
• Select Create delivery stream.
• Enter wildrydes into Delivery stream name.
• Select Kinesis stream as Source and select wildrydes as the source stream.
• Select Next.
• Leave Record transformation and Record format conversion disabled and select Next.
• Select Amazon S3 from Destination.
• Choose the bucket you created in the previous section (i.e. wildrydes-data-johndoe)
from S3 bucket.
• Select Next.
• Enter 60 into Buffer interval under S3 Buffer to set the frequency of S3 deliveries to
once per minute.
• Scroll down to the bottom of the page and select Create new or Choose from IAM role.
In the new tab, click Allow.
• Select Next. Review the delivery stream details and select Create delivery stream.
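To confirm from the terminal that the delivery stream is active (a sketch using the AWS CLI):

    aws firehose describe-delivery-stream --delivery-stream-name wildrydes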


5.4.3 Create an Amazon Athena table

• Select Services then select Athena in the Analytics section.


• If prompted, select Get started and exit the first-run tutorial by hitting the x in the upper
right hand corner of the modal dialog.
• Copy and paste the following SQL statement to create the table. Replace the
YOUR_BUCKET_NAME_HERE placeholder with your bucket name (e.g. wildrydes-
data-johndoe) in the LOCATION clause.
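A sketch of the CREATE TABLE statement as given in the original workshop (replace the placeholder in the LOCATION clause with your bucket name):

    CREATE EXTERNAL TABLE IF NOT EXISTS wildrydes (
        Name STRING,
        StatusTime TIMESTAMP,
        Latitude FLOAT,
        Longitude FLOAT,
        Distance FLOAT,
        HealthPoints INT,
        MagicPoints INT
    )
    ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe'
    LOCATION 's3://YOUR_BUCKET_NAME_HERE/';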
• Select Run Query.
• Verify the table wildrydes was created by ensuring it has been added to the list of tables
in the left navigation.

5.4.4 Explore the batched data files

• Select Services then select S3 in the Storage section.


• Enter the bucket name you created in the first section in the Search for buckets text
input.
• Select the bucket and browse into the date-partitioned folders (Firehose writes objects
under a year/month/day/hour prefix by default).
• Select one of the files and select Download. Open the file with a text editor and explore its
content.

5.4.5 Query the data files

• Select Services then select Athena in the Analytics section.


• Copy and paste the following SQL query:
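For example (illustrative queries; substitute the unicorn name you passed to the producer):

    SELECT * FROM wildrydes;

    SELECT * FROM wildrydes
    WHERE name = 'Rosinante'
    ORDER BY statustime DESC
    LIMIT 10;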

• Select Run Query.

5.5 Clean Up
• Clean Amazon Athena.
• Clean Kinesis Data Firehose.
• Clean S3.
• Clean Lambda.
• Clean DynamoDB.
• Clean IAM.

CHAPTER 6
SNAPSHOTS

6.1 Create an AWS Account:

6.2 Set up AWS Cloud9 IDE:

6.3 Create an Amazon Kinesis stream:

6.4 Produce messages into the stream:

6.5 Create an identity pool for the unicorn dashboard:

6.6 Grant the unauthenticated role access to the stream:

6.7 View unicorn status on the dashboard:

6.8 Create an Amazon Kinesis stream:

6.9 Create an Amazon Kinesis Data Analytics application:

6.10 Create an Amazon DynamoDB table:

6.11 Create an IAM role for your Lambda function:

6.12 Create a Lambda function to process the stream:

6.13 Monitor the Lambda function:

6.14 Query the DynamoDB table:

6.15 Create an Amazon S3 bucket:

6.16 Create an Amazon Kinesis Data Firehose delivery stream:

6.17 Create an Amazon Athena table:

6.18 Explore the batched data files:

6.19 Query the data files:

CHAPTER 7
CONCLUSION
Using AWS services, we were able to create a real-time data processing application based
on a serverless architecture, capable of accepting data through Kinesis data streams,
processing it through Kinesis Data Analytics, triggering a Lambda function and storing the
results in DynamoDB. The architecture can be reused for multiple data types from various
data sources and formats with minor modifications. We used managed services provided by
AWS throughout, which led to zero infrastructure management effort.
This project has helped us build practical expertise with AWS services such as Kinesis,
Lambda, DynamoDB, Athena, S3, and Identity and Access Management, and with serverless
architecture and managed services in general. We also learnt the Go programming language to
build pseudo data producer programs. The AWS CLI helped us connect on-premises
infrastructure with cloud services.

BIBLIOGRAPHY
1. https://aws.amazon.com/getting-started/hands-on/build-serverless-real-time-data-processing-app-lambda-kinesis-s3-dynamodb-cognito-athena/
2. https://github.com/Rajpratik71/Distributed-Systems-Complete/blob/master/Build%20a%20Serverless%20Real-Time%20Data%20Processing%20App.md
3. https://blog.griddynamics.com/how-to-create-a-serverless-real-time-analytics-platform-a-case-study/
