Bachelor of Technology
in
Computer Science & Information Technology
2019-2023
DECLARATION
I hereby declare that the work presented in this project,
entitled “Orchestrate Redshift ETL using AWS Glue and Step
Functions”, in partial fulfillment of the requirements for the award of the degree
of Bachelor of Technology in Computer Science and Information
Technology, is an authentic record of work carried out by me.
Department of CSITPage 1
“CYDEV”
RECOMMENDATION
This is to certify that the work embodied in this project entitled “CyDEV-A
Security Shield”, submitted by Sahil Dubey (0827CO191049), Saransh Jain
(0827CO191051) & Rashida Jawadwala (0827CO191045), is a satisfactory
account of the bona fide work done under the supervision of Prof. Nidhi Nigam and is
recommended towards partial fulfillment for the award of the Bachelor of
Technology in Computer Science & Information Technology degree by Rajiv
Gandhi Proudyogiki Vishwavidhyalaya, Bhopal.
CERTIFICATE
The Project entitled “CyDEV-A Security Shield” submitted by Sahil
Dubey (0827CO191049), Saransh Jain (0827CO191051) & Rashida
Jawadwala (0827CO191045) has been examined and is hereby approved
towards partial fulfillment for the award of Bachelor of Technology in
Computer Engineering, for which it has been submitted. It is understood
that by this approval the undersigned do not necessarily endorse or approve
any statement made, opinion expressed or conclusion drawn therein, but
approve the project only for the purpose for which it has been submitted.
Date: Date:
STUDENT UNDERTAKING
This is to certify that the project entitled “CyDEV-A Security Shield” has
been developed by us under the supervision of Prof. Nidhi Nigam. The
whole responsibility for the work done in this project is ours. The sole intention
of this work is practical learning and research.
We further declare that, to the best of our knowledge, this report does not
contain any part of any work which has been submitted for the award of any
degree either in this University or in any other University / Deemed
University without proper citation, and if any such work is found, we are
liable to explain it.
ACKNOWLEDGEMENT
We thank the almighty Lord for giving us the strength and courage to sail through
the tough times and reach the shore safely.
We owe a debt of sincere gratitude, deep sense of reverence and respect to our guide
Prof. Nidhi Nigam and mentor Prof. Vandana Kate, AITR, Indore for their motivation,
sagacious guidance, constant encouragement, vigilant supervision and valuable critical
appreciation throughout this project work, which helped us to successfully complete the
project on time.
We express profound gratitude and heartfelt thanks to Prof. (Dr.) Shilpa Bhalerao, HOD
CSIT, AITR Indore for her support, suggestions and inspiration in carrying out this
project. We would be failing in our duty if we did not acknowledge the support and guidance
received from Prof. (Dr.) S. C. Sharma, Director, AITR, Indore whenever needed. We
take this opportunity to convey our regards to the Management of Acropolis Institute,
Indore for extending academic and administrative support and providing us with all the necessary
facilities for the project to achieve our objectives.
We are grateful to our parents and family members, who have always loved and supported
us unconditionally.
TABLE OF CONTENTS
DECLARATION
RECOMMENDATION
CERTIFICATE
STUDENT UNDERTAKING
ACKNOWLEDGEMENT
ABSTRACT
CONTENTS
List of Figures
List of Tables
List of Abbreviations
Chapter 1: Introduction
1.1 Overview
1.2 Existing System
1.3 Problem Statement
1.4 Proposed System
1.5 Need and Scope
1.6 Report Organization
Chapter 2: Literature Survey
2.1 Study
2.2 Problem Methodology
2.3 Software Engineering Paradigm
2.4 Software Development Life Cycle
2.5 Technology Methodology
2.6 Hardware Requirements
Chapter 3: Analysis
ABSTRACT
AWS Glue is a fully managed ETL service that can extract data from various
sources, transform it, and load it into a data store, while AWS Step Functions is
a serverless workflow service that enables the coordination and execution of
multiple AWS services. By combining these two services, users can build a
flexible and powerful ETL pipeline for Redshift that can handle large volumes
of data and complex processing requirements.
Overall, this project provides a practical guide for users interested in building
highly scalable and efficient ETL pipelines using AWS Glue and Step Functions,
with a particular focus on the Redshift data store. By leveraging the power and
flexibility of these AWS services, users can drive insights and value from their
data in a reliable and cost-effective manner.
Chapter 1: Introduction
1.1 Overview
In this project report, we discuss how to orchestrate an ETL process using AWS
Glue and AWS Step Functions for an Amazon Redshift database. AWS Glue is a fully managed
ETL (Extract, Transform, Load) service that makes it easy to move data between data
stores. AWS Step Functions is a serverless workflow service that lets you coordinate
distributed applications and microservices using visual workflows. We will create an
ETL process that extracts data from an S3 bucket, transforms it using a Glue job, and
loads it into a Redshift database using a Glue connection.
1.2 Problem Statement
The problem statement for this project is to build an ETL pipeline for extracting data
from various sources, transforming it, and loading it into Amazon Redshift. The pipeline
should be orchestrated using AWS Glue and Step Functions. The data sources could
include various databases, S3 buckets, and other third-party sources. The data will
need to be cleaned, validated, and transformed to meet the requirements of the target
schema in Redshift.
The target Redshift database will need to be created if it does not already exist, and the
schema and tables will need to be defined based on the business requirements. The ETL pipeline
will need to ensure that the data is loaded into the correct tables with the correct data types and
constraints. AWS Glue will be used to create and manage the data catalog and define the ETL
jobs that will transform the data. AWS Glue will also be responsible for executing the ETL
jobs and loading the transformed data into Redshift.
The proposed system involves using AWS Glue and Step Functions to orchestrate ETL processes
for Amazon Redshift. Here's a high-level overview of how the system works:
1. Data sources: Data is collected from various sources like AWS S3, RDS or any other
external source.
2. AWS Glue: AWS Glue is used to create and manage the ETL jobs. AWS Glue
crawlers scan the data sources to infer schema and partitioning and record them in the
Glue Data Catalog. ETL scripts, which Glue can generate from the cataloged schema,
transform the data, and AWS Glue jobs execute these scripts and load the result into
Amazon Redshift.
3. Amazon Redshift: Amazon Redshift is a fully managed data warehouse service. It stores
the transformed data in a columnar format and provides fast querying capabilities.
4. AWS Step Functions: AWS Step Functions is used to orchestrate the ETL workflows. It
provides a visual workflow editor to design and execute workflows as a series of steps.
Each step can invoke AWS Glue jobs to transform and load data.
Step 1: Set up data sources and configure AWS Glue Crawlers to scan them.
Step 2: Create AWS Glue jobs to transform data and load it into Amazon Redshift.
Step 3: Define the ETL workflow using AWS Step Functions. This involves defining the steps in
the workflow and configuring the Glue jobs to be executed at each step.
Step 4: Run the ETL workflow using AWS Step Functions. The workflow executes the Glue jobs
in the order specified and loads the data into Amazon Redshift.
Step 5: Monitor the ETL process and troubleshoot any issues that arise.
Overall, this system provides an efficient and scalable way to orchestrate ETL processes for
Amazon Redshift using AWS Glue and Step Functions.
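Steps 3 and 4 above can be illustrated with a small sketch that builds a Step Functions state machine definition (Amazon States Language) in which each state runs one Glue job and hands over to the next. This is a minimal sketch, not the project's actual workflow: the job names are hypothetical, and the helper `build_sequential_definition` is introduced here only for illustration. The `glue:startJobRun.sync` integration makes each state wait for its Glue job to finish before the next state starts.

```python
import json

# Service-integration ARN that starts a Glue job and waits for it to finish.
GLUE_SYNC_RESOURCE = "arn:aws:states:::glue:startJobRun.sync"


def build_sequential_definition(job_names):
    """Return an Amazon States Language definition that runs Glue jobs in order."""
    if not job_names:
        raise ValueError("at least one Glue job name is required")

    states = {}
    for index, job_name in enumerate(job_names):
        state = {
            "Type": "Task",
            "Resource": GLUE_SYNC_RESOURCE,
            "Parameters": {"JobName": job_name},
        }
        if index + 1 < len(job_names):
            # Transition to the next job once this one succeeds.
            state["Next"] = f"Run_{job_names[index + 1]}"
        else:
            # The last job ends the execution.
            state["End"] = True
        states[f"Run_{job_name}"] = state

    return {
        "Comment": "Run Glue ETL jobs one after another",
        "StartAt": f"Run_{job_names[0]}",
        "States": states,
    }


# Hypothetical job names, used only for illustration.
definition = build_sequential_definition(["transform_sales", "load_sales_to_redshift"])
print(json.dumps(definition, indent=2))
```

The printed JSON could be pasted into the Step Functions console or supplied when creating a state machine through the API; the IAM role and the Glue jobs themselves are assumed to exist already.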
Organizations need to extract, transform, and load (ETL) large volumes of data from
various sources into their data warehouse to make informed business decisions. Amazon
Redshift is a popular data warehousing solution that provides petabyte-scale data
warehousing at a low cost. AWS Glue is a fully managed extract, transform, and load
(ETL) service that makes it easy to move data between data stores. AWS Step Functions
is a serverless workflow service that lets you coordinate AWS services through
visual workflows. The project is needed to bring these AWS
services together so that the ETL process runs in a more efficient and automated way.
Scope:
The scope of the project is to use AWS Glue and Step Functions to create a serverless
ETL pipeline that extracts data from various sources, transforms it, and loads it into
Amazon Redshift. The project will involve the following steps:
1. Design and define the ETL workflow using AWS Step Functions. This will involve
defining the states, transitions, and inputs/outputs of the workflow.
2. Use AWS Glue to extract data from various sources, such as S3, RDS, or other
databases, and transform it using Apache Spark. The transformed data can be stored
in a variety of formats, such as Parquet, ORC, or JSON.
3. Use AWS Glue to load the transformed data into Amazon Redshift. This can be done
using the COPY command; alternatively, Redshift Spectrum can query the data in
place in Amazon S3 without loading it.
4. Monitor the ETL workflow using Amazon CloudWatch and log the execution details.
This will help in identifying any issues or bottlenecks in the ETL pipeline.
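To make the load step in item 3 more concrete, the sketch below assembles a Redshift COPY statement for Parquet files staged in S3. It is only an illustration under stated assumptions: the table name, bucket path, and IAM role ARN are hypothetical, and the helper `build_copy_statement` is not part of any AWS library. The statement would still have to be executed against the cluster by a Glue job or a SQL client.

```python
def build_copy_statement(table, s3_prefix, iam_role_arn):
    """Build a Redshift COPY statement that loads Parquet files from S3."""
    # Reject characters that could break out of the quoted literals.
    for label, value in (
        ("table", table),
        ("s3_prefix", s3_prefix),
        ("iam_role_arn", iam_role_arn),
    ):
        if "'" in value or ";" in value:
            raise ValueError(f"unsafe character in {label}: {value!r}")
    if not s3_prefix.startswith("s3://"):
        raise ValueError("s3_prefix must start with s3://")

    return (
        f"COPY {table}\n"
        f"FROM '{s3_prefix}'\n"
        f"IAM_ROLE '{iam_role_arn}'\n"
        f"FORMAT AS PARQUET;"
    )


# Hypothetical table, bucket, and role, used only for illustration.
statement = build_copy_statement(
    "analytics.sales",
    "s3://example-etl-bucket/transformed/sales/",
    "arn:aws:iam::123456789012:role/ExampleRedshiftCopyRole",
)
print(statement)
```

Loading from a columnar format such as Parquet keeps the transformed output of the Glue job and the input of the COPY command in the same format, which is one reason the formats listed in item 2 are a natural fit.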
● Chapter 1 states the overview of the project with discussing about the
● Chapter 2 states the literature survey i.e. the background details of our system
● Chapter 3 states about the Analysis of the whole system i.e. identification of
● Chapter 4 describes the design of the whole system, including the UML
diagrams, the tools used, the ER diagram, the data flow diagram, and the
data dictionary.
● Chapter 5 states about the whole code of the system, java code ,
● Chapter 6 states the Testing phase of our system all the different testing
2.1 Study
1. Many organisations require customised software applications to handle their
day-to-day activities.
3. People in organisations struggle with large amounts of paperwork, searching for
data, arranging it, and resolving complaints, so a centralised solution is
needed in such a busy environment.
5. The main aim is to satisfy the users of the application and to reduce the time spent on
the manual process of registering complaints,
processing them, and reaching solutions.
● Designing and implementing the ETL process using AWS Glue and Step Functions
● Integrating various data sources and defining the data schema
● Creating data transformation jobs to clean, validate, and enrich the data
● Loading the transformed data into Amazon Redshift
● Monitoring and troubleshooting the ETL pipeline to ensure data accuracy and consistency
The successful completion of this project will enable the organization to have a reliable
and efficient ETL pipeline for Amazon Redshift, which will lead to faster access to
insights and better decision-making capabilities.
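The clean, validate, and enrich tasks listed above can be sketched in plain Python to show the shape of the logic before it is ported to a Glue job. This is an illustrative sketch only: the field names (`order_id`, `amount`, `load_date`) and the validation rules are assumptions, and a real Glue job would apply equivalent logic to Spark data frames rather than to Python dictionaries.

```python
from datetime import date


def clean(record):
    """Trim whitespace from string fields and turn empty strings into None."""
    cleaned = {}
    for key, value in record.items():
        if isinstance(value, str):
            value = value.strip() or None
        cleaned[key] = value
    return cleaned


def validate(record):
    """Return a list of problems; an empty list means the record is valid."""
    problems = []
    if record.get("order_id") is None:
        problems.append("missing order_id")
    try:
        if float(record.get("amount")) < 0:
            problems.append("negative amount")
    except (TypeError, ValueError):
        problems.append("amount is not a number")
    return problems


def enrich(record, load_date):
    """Add the derived columns that the target table is assumed to expect."""
    enriched = dict(record)
    enriched["amount"] = float(record["amount"])
    enriched["load_date"] = load_date.isoformat()
    return enriched


def transform(records, load_date):
    """Split records into rows ready to load and rejected rows with reasons."""
    accepted, rejected = [], []
    for raw in records:
        record = clean(raw)
        problems = validate(record)
        if problems:
            rejected.append({"record": record, "problems": problems})
        else:
            accepted.append(enrich(record, load_date))
    return accepted, rejected


# Hypothetical input rows, used only for illustration.
rows = [
    {"order_id": " A1 ", "amount": "19.90"},
    {"order_id": "", "amount": "5"},
    {"order_id": "A3", "amount": "abc"},
]
accepted, rejected = transform(rows, date(2023, 1, 15))
print(accepted)
print(len(rejected))
```

Keeping rejected rows together with their reasons, instead of dropping them silently, supports the monitoring and troubleshooting task in the same list.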
The project can be divided into several sprints, each focusing on a specific set of tasks.
During each sprint, the team can work on designing, implementing, and testing different
aspects of the ETL pipeline. The Agile approach allows for continuous feedback and
improvement, which is crucial for ensuring that the final product meets the requirements
and expectations of stakeholders.
Using Agile, the team can prioritize tasks based on their importance and complexity, and
ensure that they are completed within the allotted time frame. The Agile approach also
allows for greater collaboration between team members and stakeholders, which can lead
to a better understanding of the project's goals and requirements.
To implement the Agile approach, the project team can use tools such as user stories,
sprint planning, sprint retrospectives, and daily stand-up meetings. These tools can help
the team to stay on track and ensure that the project is progressing according to plan.
In summary, the Agile software engineering paradigm can be used to carry out the
project of orchestrating Redshift ETL using AWS Glue and Step Functions. This
approach allows for flexibility, collaboration, and rapid delivery, which are essential for
the success of the project.
Planning: In this stage, the project team defines the project goals, requirements, and
scope. The team also identifies the resources needed for the project, such as personnel,
hardware, and software.
Analysis: In this stage, the team analyzes the data sources and determines the data
schema for each source. The team also identifies any data quality issues that need to be
addressed.
Design: In this stage, the team designs the ETL pipeline using AWS Glue and Step
Functions. This includes defining the data flow, transformation logic, and error handling
procedures.
Implementation: In this stage, the team implements the ETL pipeline using AWS Glue
and Step Functions. This includes writing code, configuring services, and testing the
pipeline.
Testing: In this stage, the team tests the ETL pipeline to ensure that it meets the project
requirements. This includes functional testing, performance testing, and load testing.
Deployment: In this stage, the team deploys the ETL pipeline to the production
environment. This includes configuring security, monitoring, and backup procedures.
Maintenance: In this stage, the team maintains the ETL pipeline by monitoring its
performance, addressing any issues, and making updates as needed.
To ensure that the project is completed successfully, it is important to follow each stage
of the SDLC and to maintain good communication between team members and
stakeholders. Additionally, the use of automated testing and deployment tools can help to
streamline the SDLC and reduce the risk of errors or delays.
2.5 Technology Methodology
The technology methodology for the project, orchestrating Redshift ETL using AWS Glue and Step
Functions, can be based on the following:
Agile: The project can follow an Agile methodology for software development, as
discussed earlier. This approach allows for iterative and incremental development,
continuous feedback, and rapid delivery.
DevOps: The project can follow a DevOps methodology, which emphasizes collaboration
and automation between development and operations teams. This approach can help to
streamline the ETL pipeline and reduce the time and effort required for deployment and
maintenance.
Infrastructure as Code (IaC): The project can use IaC tools such as AWS CloudFormation
or AWS CDK to define and manage the AWS resources required for the ETL pipeline.
This approach can help to ensure consistency and reproducibility of the infrastructure,
reduce manual errors, and simplify the deployment process.
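As an illustration of the IaC approach, the sketch below builds a small AWS CloudFormation template as a Python dictionary, declaring one Glue job and one Step Functions state machine that runs it. It is a sketch under stated assumptions rather than the project's actual template: the job name, script location, and role ARNs are hypothetical, and the helper `build_template` exists only for this example.

```python
import json


def build_template(job_name, script_location, glue_role_arn, sfn_role_arn):
    """Return a CloudFormation template with a Glue job and a state machine."""
    # State machine definition: run the Glue job and wait for it to finish.
    definition = {
        "StartAt": "RunGlueJob",
        "States": {
            "RunGlueJob": {
                "Type": "Task",
                "Resource": "arn:aws:states:::glue:startJobRun.sync",
                "Parameters": {"JobName": job_name},
                "End": True,
            }
        },
    }
    return {
        "AWSTemplateFormatVersion": "2010-09-09",
        "Description": "Glue job and Step Functions state machine for a Redshift ETL pipeline",
        "Resources": {
            "EtlGlueJob": {
                "Type": "AWS::Glue::Job",
                "Properties": {
                    "Name": job_name,
                    "Role": glue_role_arn,
                    # "glueetl" selects a Spark ETL job.
                    "Command": {"Name": "glueetl", "ScriptLocation": script_location},
                },
            },
            "EtlStateMachine": {
                "Type": "AWS::StepFunctions::StateMachine",
                # The definition refers to the job only by name, so the
                # ordering is stated explicitly.
                "DependsOn": "EtlGlueJob",
                "Properties": {
                    "RoleArn": sfn_role_arn,
                    "DefinitionString": json.dumps(definition),
                },
            },
        },
    }


# Hypothetical names and ARNs, used only for illustration.
template = build_template(
    "transform_sales",
    "s3://example-etl-bucket/scripts/transform_sales.py",
    "arn:aws:iam::123456789012:role/ExampleGlueJobRole",
    "arn:aws:iam::123456789012:role/ExampleStepFunctionsRole",
)
print(json.dumps(template, indent=2))
```

Because the template is plain data, it can be kept in version control and reviewed like any other code, which is where the consistency and reproducibility benefits come from.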
In summary, the technology methodology for the project, orchestrating Redshift ETL using AWS
Glue and Step Functions, can be based on cloud-native, Agile, DevOps, IaC, and
CI/CD methodologies. By following these methodologies, the project team can leverage
the benefits of AWS services and tools to design, implement, and deploy a scalable,
reliable, and cost-effective ETL pipeline for Redshift.
Chapter 3: Analysis
3.1 Identification of System Requirements-
Processor - 1.9 GHz 64-bit (x64) dual-core processor with SSE2 instruction set
to 3.3 GHz 64-bit (x64) dual-core processor with SSE2 instruction set.
Memory - 2 GB to 4 GB of RAM or more.
Network - Greater than 50 KBps.
and sequence multiple AWS services, including AWS Glue, to build and
orchestrate complex workflows for ETL processes.
AWS Glue supports various data sources, including relational databases, flat
files, and NoSQL databases, so the same pipeline design can be adapted to
different source systems.
AWS Glue also supports various data formats, such as CSV, JSON, and
Parquet, and can convert data from one format to another as needed.
AWS Glue can automatically generate ETL code in Python or Scala,
making it easy to use for developers who are familiar with these programming
languages.
AWS Step Functions provides robust error handling and retry logic, which
ensures that the ETL process runs smoothly and reliably.
AWS Glue and Step Functions can integrate seamlessly with other AWS
services, such as Amazon S3, AWS Lambda, and Amazon CloudWatch,
providing a comprehensive and powerful solution for ETL processes.
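The error handling and retry logic mentioned above is expressed in a state machine through the `Retry` and `Catch` fields of a task state. The sketch below shows one way this could look for a Glue job; the job names, state names, and retry settings are assumptions chosen for illustration, and the helper functions are not part of any AWS library.

```python
def glue_task_with_error_handling(job_name, next_state, failure_state):
    """Return a Task state that retries a failed Glue job, then gives up."""
    return {
        "Type": "Task",
        "Resource": "arn:aws:states:::glue:startJobRun.sync",
        "Parameters": {"JobName": job_name},
        "Retry": [
            {
                "ErrorEquals": ["States.TaskFailed"],
                "IntervalSeconds": 60,
                "MaxAttempts": 3,
                "BackoffRate": 2.0,
            }
        ],
        "Catch": [
            {
                # After the retries are exhausted, keep the error details
                # in the state input and move to the failure state.
                "ErrorEquals": ["States.ALL"],
                "ResultPath": "$.error",
                "Next": failure_state,
            }
        ],
        "Next": next_state,
    }


def retry_wait_seconds(retrier):
    """Wait before each retry: the interval grows by the backoff rate."""
    return [
        retrier["IntervalSeconds"] * retrier["BackoffRate"] ** attempt
        for attempt in range(retrier["MaxAttempts"])
    ]


# Hypothetical job and state names, used only for illustration.
definition = {
    "StartAt": "TransformData",
    "States": {
        "TransformData": glue_task_with_error_handling(
            "transform_sales", "LoadToRedshift", "EtlFailed"
        ),
        "LoadToRedshift": glue_task_with_error_handling(
            "load_sales_to_redshift", "EtlSucceeded", "EtlFailed"
        ),
        "EtlSucceeded": {"Type": "Succeed"},
        "EtlFailed": {
            "Type": "Fail",
            "Error": "EtlJobFailed",
            "Cause": "A Glue job failed after all retries",
        },
    },
}

print(retry_wait_seconds(definition["States"]["TransformData"]["Retry"][0]))
```

With these example settings a failing job is retried after 60, 120, and 240 seconds before the workflow moves to the failure state, so transient problems get a chance to clear without manual intervention.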
3.4.2 Financial Feasibility
The financial feasibility of this project will depend on the specific needs and
requirements of the organization, as well as the volume and complexity of the
data being processed. While there are financial costs associated with
using AWS Glue and Step Functions to orchestrate Redshift ETL, the
pay-as-you-go pricing model and the potential cost savings from automation and
optimization can make this project financially feasible for many organizations.
3.4.3 Operational Feasibility
Orchestrating Redshift ETL using AWS Glue and Step Functions is operationally
feasible, but it requires careful planning, execution, and ongoing
management. Organizations should ensure that they have the necessary skills,
resources, and processes in place to operate and maintain the ETL processes
effectively.
Chapter 5: Design
5.1 Introduction to UML
UML (Unified Modeling Language) is a standardized
notation used in software engineering to model and
visualize software systems. It includes a variety of diagrams
that can be used to represent different aspects of the system
being developed. For a project involving orchestrating
Redshift ETL using AWS Glue and Step Functions, UML
can be used to model the system architecture and the
interactions between different components.
Chapter 6: Implementation
6.1 Coding (Main Module)
To orchestrate Redshift ETL using AWS Glue and Step
Functions, you can use a combination of AWS Glue Jobs
and Step Functions state machines, along with some custom
code to manage the orchestration and data flow between the
different components. Here is an example of how this could
be done:
Define the AWS Glue Jobs: Define the AWS Glue jobs that
will be used to extract, transform, and load data into
Redshift. These jobs can be created using the AWS Glue
console or the AWS Glue API, and should be defined to
read data from the appropriate sources and write data to the
appropriate Redshift tables.
AWS Glue jobs. The state machine can be defined using the
AWS Step Functions console or the AWS Step Functions
API, and should be defined to run the AWS Glue jobs in the
appropriate order.
Write the Custom Code: Write the custom code that will be
used to manage the orchestration and data flow between the
different components. This code can be written in any
programming language that is supported by AWS Lambda,
and can be used to perform tasks such as passing data
between AWS Glue jobs and updating the state of the Step
Functions state machine.
Deploy and Test the System: Deploy the AWS Glue jobs,
Step Functions state machine, and custom code using the
appropriate AWS services (such as AWS Lambda and AWS
CloudFormation), and test the system to ensure that it is
working correctly.
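The custom code described above could, for example, be a small AWS Lambda function that prepares the arguments for the next Glue job from the workflow input. The following is a minimal sketch under stated assumptions: the event fields (`bucket`, `prefix`, `run_date`, `job_name`, `target_table`) and the default names are hypothetical and would be defined by the actual workflow.

```python
def lambda_handler(event, context):
    """Turn the workflow input into the arguments for the next Glue job."""
    # Hypothetical input fields, assumed to be supplied by an earlier state.
    bucket = event["bucket"]
    prefix = event["prefix"].strip("/")
    run_date = event["run_date"]

    return {
        "JobName": event.get("job_name", "load_sales_to_redshift"),
        "Arguments": {
            # Glue job arguments are passed as "--name": "value" string pairs.
            "--source_path": f"s3://{bucket}/{prefix}/{run_date}/",
            "--target_table": event.get("target_table", "analytics.sales"),
        },
    }


# Local call with a hypothetical event, used only for illustration.
result = lambda_handler(
    {"bucket": "example-etl-bucket", "prefix": "/raw/sales/", "run_date": "2023-01-15"},
    None,
)
print(result)
```

A following Glue task state could then read these values from its input, for instance with `"JobName.$": "$.JobName"` and `"Arguments.$": "$.Arguments"` in its `Parameters`, so that the data flow between steps stays visible in the state machine definition.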
Chapter 7: Testing
7.1 Testing Objectives-
The testing objectives for the project, orchestrating Redshift ETL using
AWS Glue and Step Functions, would typically
include the following:
1. Verify that data is correctly extracted from the data
sources and transformed by the Glue jobs according
to the defined schema and logic.
2. Confirm that the transformed data is correctly loaded
into the target Redshift tables using COPY
commands or other methods.
3. Test the error handling and retry logic of the ETL
process, including handling of network errors, data
validation errors, and other types of exceptions.
4. Verify that the Step Functions state machine is
correctly orchestrating the execution of the Glue
jobs and handling any dependencies or error
conditions between the jobs.
5. Test the performance and scalability of the ETL
process under different data volumes and processing
loads.
6. Validate that the ETL process is running within the
defined AWS cost and resource usage limits, and
Chapter 8: Conclusion
8.1 Conclusion-
Using AWS Glue and Step Functions to orchestrate Redshift
ETL can provide a robust, scalable, and cost-effective
solution for processing and transforming large volumes of
data in a variety of use cases. By leveraging Glue's managed
ETL service, users can easily create and manage jobs to
extract, transform, and load data from various sources into
Redshift, while Step Functions provides a flexible and
reliable way to coordinate the execution of these jobs and
manage dependencies, retries, and error handling. Overall, by
using AWS Glue and Step Functions, users can build a
highly automated and scalable ETL pipeline that can handle
a wide range of data volumes and processing requirements,
while minimizing operational overhead and cost.
8.2 Future Work-
Potential future work for the project, orchestrating Redshift ETL
using AWS Glue and Step Functions, could include: