
AWS Glue: A Concise Tutorial

AWS Glue is a cloud service that prepares data for analysis through automated extract, transform, and load
(ETL) processes. Glue also supports MySQL, Oracle, Microsoft SQL Server, and PostgreSQL databases
that run on Amazon Elastic Compute Cloud (EC2) instances in an Amazon Virtual Private Cloud.

As described above, AWS Glue is a fully managed ETL service that aims to take the difficulties out of the
ETL process for organizations that want to get more out of their big data. The initial public release of
AWS Glue was in August 2017. Since that date, Amazon has continued to actively release updates for
AWS Glue with new features and functionality. Some of the most recent AWS Glue updates include:

 Support for Python 3.6 in Python shell jobs (June 2019).
 Support for connecting directly to AWS Glue via a virtual private cloud (VPC) endpoint (May 2019).
 Support for real-time, continuous logging for AWS Glue jobs with Apache Spark (May 2019).
 Support for custom CSV classifiers to infer the schema of CSV data (March 2019).

The arrival of AWS Glue fills a hole in Amazon’s cloud data processing services. Previously, AWS had
services for data acquisition, storage, and analysis, yet it was lacking a solution for data transformation.


What is AWS Glue?

 AWS Glue is a fully managed ETL service. This service makes it simple and cost-effective to
categorize your data, clean it, enrich it, and move it swiftly and reliably between various data
stores.
 It comprises components such as a central metadata repository known as the AWS Glue Data
Catalog, an ETL engine that automatically generates Python or Scala code, and a flexible
scheduler that handles dependency resolution, job monitoring, and retries.
 AWS Glue is serverless, which means that there’s no infrastructure to set up or manage.

When Should I Use AWS Glue?


1. To build a data warehouse to organize, cleanse, validate, and format data.

 You can transform as well as move AWS Cloud data into your data store.
 You can also load data from disparate sources into your data warehouse for regular reporting and
analysis.
 By storing it in a data warehouse, you integrate information from different parts of your business
and provide a common source of data for decision making.

2. When you run serverless queries against your Amazon S3 data lake.

 AWS Glue can catalog your Amazon Simple Storage Service (Amazon S3) data, making it
available for querying with Amazon Athena and Amazon Redshift Spectrum.
 With crawlers, your metadata stays in synchronization with the underlying data. Athena and
Redshift Spectrum can directly query your Amazon S3 data lake with the help of the AWS Glue
Data Catalog.
 With AWS Glue, you access as well as analyze data through one unified interface without loading
it into multiple data silos.

3. When you want to create event-driven ETL pipelines

 You can run your ETL jobs as soon as new data becomes available in Amazon S3 by invoking
your AWS Glue ETL jobs from an AWS Lambda function.
 You can also register this new dataset in the AWS Glue Data Catalog as part of your
ETL jobs.

4. To understand your data assets.

 You can store your data using various AWS services and still maintain a unified view of your data
using the AWS Glue Data Catalog.
 View the Data Catalog to quickly search and discover the datasets that you own, and maintain the
relevant metadata in one central repository.
 The Data Catalog also serves as a drop-in replacement for your external Apache Hive Metastore.

Under the hood of AWS Glue is:

 The AWS Glue Data Catalog, a metadata repository that contains references to data sources and
targets that will be part of the ETL process.
 An ETL engine that automatically generates scripts in Python and Scala for use throughout the
ETL process.
 A scheduler that can run jobs and trigger events based on time-based and other criteria.

The purpose of AWS Glue is to facilitate the construction of an enterprise-class data warehouse.
Information can be moved into the data warehouse from a variety of sources, including transactional
databases as well as the Amazon cloud. According to Amazon, there are many possible use cases for AWS
Glue to simplify ETL tasks, including:

 Discovering metadata about your various databases and data stores, and archiving them in the
AWS Glue Data Catalog.
 Creating ETL scripts in order to transform, denormalize, and enrich the data while en route from
source to target.
 Automatically detecting changes in your database schema and adjusting the service in order to
match them.
 Launching ETL jobs based on a particular trigger, schedule, or event.
 Collecting logs, metrics, and KPIs on your ETL operations for monitoring and reporting purposes.
 Handling errors and retrying in order to prevent stalling during the process.
 Scaling resources automatically in order to fit the needs of your current situation.

ETL engine

After data is cataloged, it is searchable and ready for ETL jobs. AWS Glue includes an ETL script
recommendation system to create Python and Spark (PySpark) code, as well as an ETL library to execute
jobs. A developer can write ETL code via the Glue custom library, or write PySpark code via the AWS
Glue Console script editor.

A developer can also import custom PySpark code or libraries. The developer can also upload code for
existing ETL jobs to an S3 bucket, then create a new Glue job to process the code. AWS also provides
sample code for Glue in a GitHub repository.
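To give a feel for what such a script looks like, here is a minimal hand-written sketch of a Glue Spark job; the catalog database and table names (my_database, my_table) are placeholders, not taken from this tutorial:

import sys
from pyspark.context import SparkContext
from awsglue.utils import getResolvedOptions
from awsglue.context import GlueContext
from awsglue.job import Job

# Resolve the job name that the Glue job runner passes in
args = getResolvedOptions(sys.argv, ['JOB_NAME'])

# Initialize the Glue and Spark contexts
glue_context = GlueContext(SparkContext.getOrCreate())
job = Job(glue_context)
job.init(args['JOB_NAME'], args)

# Read a table that a crawler has registered in the Data Catalog
# ('my_database' and 'my_table' are placeholder names)
dyf = glue_context.create_dynamic_frame.from_catalog(database='my_database', table_name='my_table')
print('Record count:', dyf.count())

job.commit()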

AWS Glue Features

Since AWS Glue is fully managed by AWS, deployment and maintenance are simple. Below
are some important features of Glue:

1. Integrated Data Catalog

The Data Catalog is a persistent metadata store for all kinds of data assets in your AWS account. Each
AWS account has one Glue Data Catalog per region. This is the place where multiple disjoint systems can store
their metadata. In turn, they can also use this metadata to query and transform the data. The catalog can
store table definitions, job definitions, and other control information that helps manage the ETL environment
inside Glue.
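As a quick illustration, the Data Catalog can also be inspected programmatically with boto3; this is only a sketch, and the database name below is a placeholder borrowed from the walkthrough later in this tutorial:

import boto3

glue = boto3.client('glue')

# List the databases registered in the account's Data Catalog
for db in glue.get_databases()['DatabaseList']:
    print('Database:', db['Name'])

# List the tables of one database and where their data lives
for table in glue.get_tables(DatabaseName='glue-demo-edureka-db')['TableList']:
    print(table['Name'], '->', table['StorageDescriptor']['Location'])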

2. Automatic Schema Discovery

AWS Glue allows you to set up crawlers that connect to different data sources. A crawler classifies the data,
obtains schema-related information, and automatically stores it in the Data Catalog. ETL jobs can then use this
information to manage ETL operations.
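For example, a crawler can be created and started with boto3 as sketched below; the crawler name, IAM role, database, and S3 path are placeholders borrowed from the walkthrough later in this tutorial:

import boto3

glue = boto3.client('glue')

# Register a crawler that scans an S3 prefix and writes table definitions
# into the given Data Catalog database
glue.create_crawler(
    Name='movies-crawler',
    Role='glue-demo-edureka-iam-role',
    DatabaseName='glue-demo-edureka-db',
    Targets={'S3Targets': [{'Path': 's3://glue-bucket-edureka/read/'}]},
)

# Run the crawler once; the discovered schema appears as a table in the catalog
glue.start_crawler(Name='movies-crawler')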

3. Code Generation

AWS Glue comes with an exceptional feature that can automatically generate code to extract, transform
and load your data. The only input Glue needs is the path or location where the data is stored. From
there, Glue creates ETL scripts by itself to transform, flatten and enrich data. Normally, Scala or Python
code is generated for Apache Spark.

4. Developer Endpoints
This is one of the best features of AWS Glue and helps you interactively develop ETL code. When Glue
automatically generates code for you, you will need to debug, edit and test it. Developer
endpoints provide this environment. Using them, custom readers, writers or transformations can be
created, and these can then be imported into Glue ETL jobs as custom libraries.

5. Flexible Job Scheduler

One of the most important features of Glue is that it can be invoked on a schedule, on demand, or on an
event-trigger basis. You can also simply start multiple jobs in parallel. Using the scheduler, you can
build complex ETL pipelines by specifying dependencies across jobs. AWS Glue ETL retries
jobs if they fail, and it can also automatically filter out bad data. All kinds of inter-job
dependencies are handled by Glue.

AWS Glue jobs can execute on a schedule. A developer can schedule ETL jobs at a minimum of five-
minute intervals. AWS Glue cannot handle streaming data.

If a dev team prefers to orchestrate its workloads, the service allows scheduled, on-demand and job
completion triggers. A scheduled trigger executes jobs at specified intervals, while an on-demand trigger
executes when prompted by the user. With a job completion trigger, single or multiple jobs can execute
when a job finishes. These jobs can trigger at the same time or sequentially, and they can also trigger from
an outside service, such as AWS Lambda.
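As a rough sketch of how such triggers can be defined with boto3 (the trigger names, job names, and schedule below are made-up examples):

import boto3

glue = boto3.client('glue')

# A scheduled trigger that starts a job every hour
glue.create_trigger(
    Name='hourly-etl-trigger',
    Type='SCHEDULED',
    Schedule='cron(0 * * * ? *)',
    Actions=[{'JobName': 'glue-demo-edureka-job'}],
    StartOnCreation=True,
)

# A conditional (job completion) trigger that starts a second job
# only after the first one succeeds
glue.create_trigger(
    Name='downstream-trigger',
    Type='CONDITIONAL',
    Predicate={'Conditions': [{
        'LogicalOperator': 'EQUALS',
        'JobName': 'glue-demo-edureka-job',
        'State': 'SUCCEEDED',
    }]},
    Actions=[{'JobName': 'reporting-job'}],
    StartOnCreation=True,
)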

Benefits

Cost effective

AWS Glue is serverless. There is no infrastructure to provision or manage. AWS Glue handles
provisioning, configuration, and scaling of the resources required to run your ETL jobs on a fully
managed, scale-out Apache Spark environment. You pay only for the resources used while your jobs are
running.

Less hassle

AWS Glue is integrated across a wide range of AWS services, meaning less hassle for you when
onboarding. AWS Glue natively supports data stored in Amazon Aurora and all other Amazon RDS
engines, Amazon Redshift, and Amazon S3, as well as common database engines and databases in your
Virtual Private Cloud (Amazon VPC) running on Amazon EC2.

More power

AWS Glue automates much of the effort in building, maintaining, and running ETL jobs. AWS Glue
crawls your data sources, identifies data formats, and suggests schemas and transformations. AWS Glue
automatically generates the code to execute your data transformations and loading processes.
Use cases

Queries against an Amazon S3 data lake

Data lakes are an increasingly popular way to store and analyze both structured and unstructured data. If
you want to build your own custom Amazon S3 Data Lake, AWS Glue can make all your data
immediately available for analytics without moving the data.

Analyze log data in your data warehouse

Prepare your clickstream or process log data for analytics by cleaning, normalizing, and enriching your
data sets using AWS Glue. AWS Glue generates the schema for your semi-structured data, creates ETL
code to transform, flatten, and enrich your data, and loads your data warehouse on a recurring basis.
Unified view of your data across multiple data stores

You can use the AWS Glue Data Catalog to quickly discover and search across multiple AWS data sets
without moving the data. Once the data is cataloged, it is immediately available for search and query using
Amazon Athena, Amazon EMR, and Amazon Redshift Spectrum.
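For instance, once a table is cataloged you could query it from Athena with boto3; this is only a sketch, and the table, database, and results bucket names are placeholders:

import boto3

athena = boto3.client('athena')

# Run a query against a table that a Glue crawler registered in the Data Catalog
response = athena.start_query_execution(
    QueryString='SELECT movie_title, rating FROM "read" LIMIT 10',
    QueryExecutionContext={'Database': 'glue-demo-edureka-db'},
    ResultConfiguration={'OutputLocation': 's3://my-athena-results/'},
)
print('Query execution id:', response['QueryExecutionId'])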

Event-driven ETL pipelines

AWS Glue can run your ETL jobs based on an event, such as getting a new data set. For example, you can
use an AWS Lambda function to trigger your ETL jobs to run as soon as new data becomes available in
Amazon S3. You can also register this new dataset in the AWS Glue Data Catalog as part of your ETL
jobs.
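A minimal sketch of such a Lambda handler is shown below; the job name and argument keys are assumptions for illustration, and the function would be wired to an S3 ObjectCreated event notification:

import boto3

glue = boto3.client('glue')

def lambda_handler(event, context):
    # Pull the bucket and key of the newly created object from the S3 event
    record = event['Records'][0]
    bucket = record['s3']['bucket']['name']
    key = record['s3']['object']['key']

    # Start the ETL job and pass the new object's location as job arguments
    response = glue.start_job_run(
        JobName='glue-demo-edureka-job',
        Arguments={'--source_bucket': bucket, '--source_key': key},
    )
    return {'JobRunId': response['JobRunId']}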
AWS Glue Use Cases

This section highlights the most common use cases of Glue. You can use Glue with some of the popular
tools and applications listed below:

1. AWS Glue with Athena

In Athena, you can easily use the AWS Glue Data Catalog to create databases and tables, which can later
be queried. Alternatively, you can let AWS Glue crawlers and ETL jobs create the schema and related
catalog objects that Athena then uses.

2. AWS Glue for Non-native JDBC Data Sources

AWS Glue has native connectors to data stores that can be connected to via JDBC. These can
live in AWS or anywhere else in the cloud, as long as they are reachable via an IP address. AWS
Glue natively supports the following data stores: Amazon Redshift and Amazon RDS (Amazon
Aurora, MariaDB, Microsoft SQL Server, MySQL, Oracle, PostgreSQL).
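Inside a Glue Spark job, reading from such a JDBC source can look roughly like the sketch below; the connection type, endpoint, database, table, and credentials are placeholder values:

from pyspark.context import SparkContext
from awsglue.context import GlueContext

glue_context = GlueContext(SparkContext.getOrCreate())

# Read a table from a MySQL database over JDBC into a dynamic frame
orders_dyf = glue_context.create_dynamic_frame.from_options(
    connection_type='mysql',
    connection_options={
        'url': 'jdbc:mysql://my-db.example.com:3306/salesdb',
        'dbtable': 'orders',
        'user': 'etl_user',
        'password': 'example-password',
    },
)
print('Rows read:', orders_dyf.count())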


3. AWS Glue integrated with AWS Data Lake

AWS Glue can be integrated with AWS Data Lake. Further, ETL processes can be run to ingest,
clean, transform and structure data that is important to you.

4. Snowflake with AWS Glue

Snowflake has great plugins that seamlessly gel with AWS Glue. Snowflake data warehouse
customers can manage their programmatic data integration process without worrying about
physically maintaining it or managing any kind of servers and Spark clusters. This allows you to
combine the benefits of Snowflake’s query pushdown and SQL translation into Snowflake with Spark
workloads.

AWS Glue Limitations and Challenges

While there are many noteworthy features of AWS Glue, there are some serious limitations as well.

 In comparison to the other ETL options available today, Glue has only a few pre-built
components. Also, given it is developed by and for the AWS ecosystem, it is not built to match all
kinds of environments.
 Glue works well only with ETL from JDBC and S3 (CSV) data sources. If you are looking to
load data from other cloud applications, file storage services, and so on, Glue will not be able to support them.
 Using Glue, all data is first staged on S3, and there is no option for incremental sync from your
data source. This can be limiting if you are looking to ETL data in real time.
 Glue is a managed AWS service for Apache Spark and not a full-fledged ETL solution. Considerable
extra work is required to optimize PySpark and Scala code for Glue.
 Glue does not give any control over individual table jobs. ETL is applied to the complete
database.
 While Glue supports writing transformations in Scala and Python, it does not provide an
environment to test them. You are forced to run your transformations on parts of
real data, which makes the process slow and painful.
 Glue does not have good support for traditional relational database types of queries. Only SQL-style
queries are supported, and only through somewhat complicated virtual tables.
 The learning curve for Glue is steep. If you are looking to use Glue for your ETL needs, you
will have to ensure that your team includes engineers with strong knowledge of Spark concepts.
 The soft limit for concurrent jobs is only 3, though it can be increased by building a queue
for handling limits. You will have to write your own logic for smart auto-DPU adjustment to the input
data size.

AWS Glue Concepts

You define jobs in AWS Glue to accomplish the work that’s required to extract, transform, and load (ETL)
data from a data source to a data target. You typically perform the following actions:
 Firstly, you define a crawler to populate your AWS Glue Data Catalog with metadata table
definitions. You point your crawler at a data store, and the crawler creates table definitions in the
Data Catalog. In addition to table definitions, the Data Catalog contains other metadata that is
required to define ETL jobs. You use this metadata when you define a job to transform your data.
 AWS Glue can generate a script to transform your data or you can also provide the script in the
AWS Glue console or API.
 You can run your job on-demand, or you can set it up to start when a specified trigger occurs. The
trigger can be a time-based schedule or an event.
 When your job runs, a script extracts data from your data source, transforms the data, and loads it
to your data target. This script runs in an Apache Spark environment in AWS Glue.

AWS Glue Terminology


Data Catalog: The persistent metadata store in AWS Glue. It contains table definitions, job definitions, and
other control information to manage your AWS Glue environment.
Classifier: Determines the schema of your data. AWS Glue provides classifiers for common file types such
as CSV, JSON, AVRO, XML, and others.
Connection: Contains the properties that are required to connect to your data store.
Crawler: A program that connects to a data store (source or target), progresses through a prioritized list
of classifiers to determine the schema for your data, and then creates metadata tables in the
Data Catalog.
Database: A set of associated Data Catalog table definitions organized into a logical group in AWS Glue.
Data Store, Data Source, Data Target: A data store is a repository for persistently storing your data. A data source is a data store
used as input to a process or transform. A data target is a data store that a process or transform
writes to.
Development Endpoint: An environment that you can use to develop and test your AWS Glue ETL scripts.
Job: The business logic that is required to perform ETL work. It is composed of a transformation script,
data sources, and data targets.
Notebook Server: A web-based environment that you can use to run your PySpark statements. PySpark is a
Python dialect for ETL programming.
Script: Code that extracts data from sources, transforms it, and loads it into targets. AWS Glue
generates PySpark or Scala scripts.
Table: It is the metadata definition that represents your data. A table defines the schema of your data.
Transform: The code logic you use to manipulate your data into a different format.
Trigger: Initiates an ETL job. You can define triggers based on a scheduled time or event.
How does AWS Glue work?

Here I am going to demonstrate an example where I will create a transformation script with Python and
Spark. I will also cover some basic Glue concepts such as crawler, database, table, and job.

1. Create a data source for AWS Glue:

Glue can read data from a database or S3 bucket. For example, I have created an S3 bucket called glue-
bucket-edureka. Create two folders from the S3 console and name them read and write. Now create a text file
with the following data and upload it to the read folder of the S3 bucket (a small boto3 upload sketch follows
the sample data).
rank,movie_title,year,rating
1,The Shawshank Redemption,1994,9.2
2,The Godfather,1972,9.2
3,The Godfather: Part II,1974,9.0
4,The Dark Knight,2008,9.0
5,12 Angry Men,1957,8.9
6,Schindler’s List,1993,8.9
7,The Lord of the Rings: The Return of the King,2003,8.9
8,Pulp Fiction,1994,8.9
9,The Lord of the Rings: The Fellowship of the Ring,2001,8.8
10,Fight Club,1999,8.8
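If you prefer to upload the file programmatically instead of through the console, a small boto3 sketch (assuming the sample data is saved locally as movies.csv) looks like this:

import boto3

s3 = boto3.client('s3')

# Upload the sample file into the "read" prefix of the demo bucket
s3.upload_file('movies.csv', 'glue-bucket-edureka', 'read/movies.csv')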

2. Crawl the data source to the data catalog:

In this step, we will create a crawler. The crawler will catalog all files in the specified S3 bucket and
prefix. All the files should have the same schema. In Glue crawler terminology the file format is known as
a classifier. The crawler identifies the most common classifiers automatically, including CSV, JSON and
Parquet. Our sample file is in CSV format and will be recognized automatically.

 In the left panel of the Glue management console click Crawlers.


 Click the blue Add crawler button.
 Give the crawler a name such as glue-demo-edureka-crawler.
 In Add a data store menu choose S3 and select the bucket you created. Drill down to select
the read folder.
 In Choose an IAM role create new. Name the role, for example, glue-demo-edureka-iam-role.
 In Configure the crawler’s output add a database called glue-demo-edureka-db.

When you are back in the list of all crawlers, tick the crawler that you created. Click Run crawler.

3. The crawled metadata in Glue tables:

Once the data has been crawled, the crawler creates a metadata table from it. You find the results from the
Tables section of the Glue console. The database that you created during the crawler setup is just an
arbitrary way of grouping the tables. Glue tables don’t contain the data but only the instructions on how to
access the data.

4. AWS Glue jobs for data transformations:

From the Glue console left panel go to Jobs and click the blue Add job button. Follow these instructions to
create the Glue job:

 Name the job as glue-demo-edureka-job.


 Choose the same IAM role that you created for the crawler. It can read and write to the S3 bucket.
 Type: Spark.
 Glue version: Spark 2.4, Python 3.
 This job runs: A new script to be authored by you.
 Security configuration, script libraries, and job parameters
 Maximum capacity: 2. This is the minimum and costs about $0.15 per run.
 Job timeout: 10. Prevents the job from running longer than expected.
 Click Next and then Save job and edit the script.

5. Editing the Glue script to transform the data with Python and Spark:

Copy the following code to your Glue script editor. Remember to change the bucket name for the
s3_write_path variable. Save the code in the editor and click Run job.

#########################################
### IMPORT LIBRARIES AND SET VARIABLES
#########################################

# Import Python modules
from datetime import datetime

# Import PySpark modules
from pyspark.context import SparkContext
import pyspark.sql.functions as f

# Import Glue modules
from awsglue.utils import getResolvedOptions
from awsglue.context import GlueContext
from awsglue.dynamicframe import DynamicFrame
from awsglue.job import Job

# Initialize contexts and session
spark_context = SparkContext.getOrCreate()
glue_context = GlueContext(spark_context)
session = glue_context.spark_session

# Parameters
glue_db = "glue-demo-edureka-db"
glue_tbl = "read"
s3_write_path = "s3://glue-bucket-edureka/write"

#########################################
### EXTRACT (READ DATA)
#########################################

# Log starting time
dt_start = datetime.now().strftime("%Y-%m-%d %H:%M:%S")
print("Start time:", dt_start)

# Read movie data to a Glue dynamic frame
dynamic_frame_read = glue_context.create_dynamic_frame.from_catalog(database = glue_db, table_name = glue_tbl)

# Convert dynamic frame to data frame to use standard PySpark functions
data_frame = dynamic_frame_read.toDF()

#########################################
### TRANSFORM (MODIFY DATA)
#########################################

# Create a decade column from year
decade_col = f.floor(data_frame["year"]/10)*10
data_frame = data_frame.withColumn("decade", decade_col)

# Group by decade: count movies, get average rating
data_frame_aggregated = data_frame.groupby("decade").agg(
    f.count(f.col("movie_title")).alias("movie_count"),
    f.mean(f.col("rating")).alias("rating_mean"),
)

# Sort by the number of movies per decade
data_frame_aggregated = data_frame_aggregated.orderBy(f.desc("movie_count"))

# Print result table
# Note: show() is an action. Actions force the execution of the data frame plan.
# With big data the slowdown would be significant without caching.
data_frame_aggregated.show(10)

#########################################
### LOAD (WRITE DATA)
#########################################

# Create just 1 partition, because there is so little data
data_frame_aggregated = data_frame_aggregated.repartition(1)

# Convert back to dynamic frame
dynamic_frame_write = DynamicFrame.fromDF(data_frame_aggregated, glue_context, "dynamic_frame_write")

# Write data back to S3
glue_context.write_dynamic_frame.from_options(
    frame = dynamic_frame_write,
    connection_type = "s3",
    connection_options = {
        "path": s3_write_path,
        # Here you could create S3 prefixes according to the values in specified columns
        # "partitionKeys": ["decade"]
    },
    format = "csv"
)

# Log end time
dt_end = datetime.now().strftime("%Y-%m-%d %H:%M:%S")
print("End time:", dt_end)

The detailed explanations are commented in the code. Here is the high-level description:

 Read the movie data from S3


 Get movie count and rating average for each decade
 Write aggregated data back to S3

The execution time with 2 Data Processing Units (DPU) was around 40 seconds. A relatively long duration
is explained by the start-up overhead.

The data transformation script creates summarized movie data. For example, the 2000s decade has 3 movies in
the IMDB top 10 with an average rating of 8.9. You can download the result file from the write folder of your S3
bucket. Another way to investigate the job would be to take a look at the CloudWatch logs.

The data is stored back to S3 as a CSV in the “write” prefix. The number of partitions equals the number
of the output files.

With this, we have come to the end of this article on AWS Glue. I hope you have understood everything
that I have explained here.

Conclusion

For many developers and IT professionals, AWS Glue has successfully helped them reduce the complexity
and manual labor involved in the ETL process since its release in August 2017.

However, the drawbacks of AWS Glue, such as the newness of the service and the difficult learning curve,
mean that it’s not the right choice for every situation. Companies that are looking for a more well-
established, user-friendly, fully managed ETL solution with strong customer support would do well to
check out Xplenty.
Introduction To AWS Glue ETL



The Extract, Transform, Load (ETL) process has been designed specifically for the
purpose of transferring data from its source database to the data warehouse. However,
the challenges and complexities of ETL can make it hard to implement
successfully for all our enterprise data. For this reason, Amazon has introduced AWS
Glue.
AWS Glue is a fully managed ETL (Extract, Transform, and Load) service that makes
it simple and cost-effective to categorize our data, clean it, enrich it, and move it
reliably between various data stores. It consists of a central metadata repository
known as the AWS Glue Data Catalog, an ETL engine that automatically generates
Python code, and a flexible scheduler that handles dependency resolution and job
monitoring. AWS Glue is serverless, which means that there is no infrastructure to set
up or manage.
AWS Glue
AWS Glue is used to prepare data from different sources for
analytics, machine learning, and application development. It reduces the manual
effort by automating jobs such as data integration, data
transformation, and data loading. AWS Glue is a serverless data integration service,
which makes it well suited to data preparation, and the data that has been
prepared is maintained centrally in a catalog, which makes it easy to find and
understand.
How To Use AWS Glue ETL
Follow the steps mentioned below to use AWS Glue ETL.
1. Create and Attach an IAM Role for Your ETL Job
Identity and Access Management (IAM) manages Amazon Web Services (AWS)
users and their access to AWS accounts and services. It controls the level of access a
user has over an AWS account: it creates users, grants permissions, and allows users
to use different features of an AWS account.
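As a hedged sketch, a service role that Glue can assume could be created with boto3 as follows; the role name is a placeholder, and you would still add S3 permissions for your own buckets:

import json
import boto3

iam = boto3.client('iam')

# Trust policy that allows the AWS Glue service to assume the role
trust_policy = {
    'Version': '2012-10-17',
    'Statement': [{
        'Effect': 'Allow',
        'Principal': {'Service': 'glue.amazonaws.com'},
        'Action': 'sts:AssumeRole',
    }],
}

iam.create_role(
    RoleName='glue-demo-etl-role',
    AssumeRolePolicyDocument=json.dumps(trust_policy),
)

# Attach the AWS-managed Glue service policy
iam.attach_role_policy(
    RoleName='glue-demo-etl-role',
    PolicyArn='arn:aws:iam::aws:policy/service-role/AWSGlueServiceRole',
)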
2. Create a crawler
One of AWS Glue’s main jobs is to create a data catalog from the data it collects from
different data sources. A crawler is the program that discovers the data
automatically and indexes the data source so that it can be used by AWS Glue.
3. Create a job
To create a job in AWS Glue, follow the steps mentioned below (a boto3 sketch of steps 3-5 follows this list).
 Open the AWS console, navigate to AWS Glue, and click Create job.
 Make all the configuration required for the job and click Create job.
4. Run your job
 After creating the job, select the job that you want to run and click Run job.
5. Monitor your job
 You can monitor the progress of the job in the AWS Glue console.
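Here is the boto3 sketch referred to above, covering steps 3-5; the job name, role, Glue version, and script location are placeholder values:

import boto3

glue = boto3.client('glue')

# Step 3: create a job that points at a script already uploaded to S3
glue.create_job(
    Name='glue-demo-edureka-job',
    Role='glue-demo-etl-role',
    Command={
        'Name': 'glueetl',
        'ScriptLocation': 's3://glue-bucket-edureka/scripts/transform.py',
        'PythonVersion': '3',
    },
    GlueVersion='2.0',
    NumberOfWorkers=2,
    WorkerType='G.1X',
)

# Step 4: run the job
run = glue.start_job_run(JobName='glue-demo-edureka-job')

# Step 5: monitor the run state (RUNNING, SUCCEEDED, FAILED, ...)
status = glue.get_job_run(JobName='glue-demo-edureka-job', RunId=run['JobRunId'])
print('Job state:', status['JobRun']['JobRunState'])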
Best Practices For AWS Glue ETL
Following are some of the best practices that you can follow while implementing
AWS Glue ETL.
 Data Catalog: Use the Data Catalog as a centralized metadata repository; try to
store all the metadata about your data sources, transformations, and targets there.
 Crawlers: You need to keep your metadata up to date; run crawlers
periodically so that the catalog stays in sync with the underlying data.
 Leverage Dynamic Allocation: Dynamic allocation scales workers and
executors up and down based on the load, which saves a lot of resources.
 Utilize Bulk Loading: Try to use the bulk loading technique, which is more
efficient because it reduces the number of individual file writes and
improves overall performance.
 Monitor and Analyze Job Metrics: With the help of CloudWatch you can
monitor the performance of Glue. You can monitor job metrics such
as execution time, resource utilization, and errors to identify performance
bottlenecks and potential issues.
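Besides the CloudWatch console, basic run-level metrics can be pulled with boto3 as in this small sketch (the job name is a placeholder):

import boto3

glue = boto3.client('glue')

# Review recent runs of a job: state, execution time in seconds, and any error message
for run in glue.get_job_runs(JobName='glue-demo-edureka-job')['JobRuns']:
    print(run['Id'], run['JobRunState'], run.get('ExecutionTime'), run.get('ErrorMessage'))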
Case studies of AWS Glue ETL
Following are some of the organizations and industries that use AWS Glue ETL. To know
how to create an AWS account, refer to the Amazon Web Services (AWS) – Free Tier
Account Set up.
 Media and Entertainment: Media companies produce lots of video
content that needs to be transformed and cataloged efficiently. They
use AWS Glue for the ETL process and to organize the metadata, making it
searchable and accessible for content delivery.
 Retail: Companies in the retail industry with multiple online and offline
sales channels can use AWS Glue for ETL to consolidate and analyze customer
data from various sources and gain more insight into the overall customer experience.
 Healthcare: AWS Glue for ETL was used by a healthcare organisation
with a variety of data sources, including IoT devices and electronic health
records, to combine and analyse patient data. This enhanced patient care by
streamlining data processing for medical research.
 Financial Services: Financial services companies can similarly use AWS Glue
for ETL to consolidate transaction and account data from multiple systems for
reporting and analytics.
 Travel and Hospitality: Travel companies manage data such as
customer reviews and ticket pricing, and they can use AWS Glue for
ETL to centralize and harmonize their data.
Future of AWS Glue ETL
 Enhanced Machine Learning Integration: AWS Glue can integrate with other
AWS services such as SageMaker, and it can automate data preparation and
feature engineering for machine learning models.
 Real-Time Data Processing: AWS Glue can enhance real-time data processing
for applications that require immediate insights from data streams.
 Serverless Architecture Expansion: The serverless architecture of AWS
Glue will keep growing, offering even more precise control over resource
distribution and cost reduction. This will guarantee effective resource
utilisation by enabling users to scale their ETL processes in accordance
with exact requirements.
 Advanced Data Transformation: AWS Glue may introduce features such as
data cleansing, enrichment and analysis to support increasingly complex ETL
requirements.
AWS Glue Architecture
We define jobs in AWS Glue to accomplish the work that is required to extract,
transform and load data from a data source to a data target. So if we talk about the
workflow, the first step is that we define a crawler to populate our AWS Glue Data Catalog
with metadata and table definitions. We point our crawler at a data store, and the
crawler creates table definitions in the Data Catalog. In addition to table definitions, the
Data Catalog contains other metadata that is required to define ETL jobs. We use this
metadata when we define a job to transform our data in the second step. AWS Glue
can generate a script to transform our data, or we can also provide the script in the
AWS Glue console. In the third step, we can run our job on demand or we can set it
up to start when a specified trigger occurs. The trigger can be a time-based schedule or
an event. Finally, when our job runs, a script extracts data from our data source,
transforms the data, and loads it into our target. The script runs in an Apache Spark
environment in AWS Glue.
 Data Catalog: It is the persistent metadata store in AWS Glue. It contains
table definitions, job definitions, etc. AWS Glue has one data catalog per
region.
 Database: It is a set of associated Data Catalog table definitions organized
into a logical group in AWS Glue.
 Crawler: It is a program that connects to our data store (which may be a source or
a target), progresses through a prioritized list of classifiers to determine the
schema for our data, and then creates metadata tables in the Data Catalog.
 Connection: An AWS Glue connection is a Data Catalog object that holds the
information needed to connect to a certain data store.
 Classifier: It determines the schema of our data. AWS Glue provides
classifiers for common file types such as CSV, JSON, etc. It also provides
classifiers for common relational database management systems using a
JDBC connection.
 Data Store: It is a repository for persistently storing our data. Examples
include Amazon S3 buckets and relational databases.
 Data Source: It is a data store that is used as an input to a process or
transform.
 Data Target: It is a data store where the transformed data is written.
 Development Endpoint: It is an environment where we can develop and
test our AWS Glue ETL scripts.
 Job: It is the business logic required to perform the ETL work. It is composed
of a transformation script, data sources, and data targets. Jobs can be
initiated by triggers that can be scheduled or triggered by events.
 Trigger: It initiates an ETL job. We can define triggers based on a
scheduled time or an event.
 Notebook Server: It is a web-based environment that we can use to run our
PySpark statements, which is a Python dialect used for ETL programming.
 Script: It contains the code that extracts data from sources, transforms it, and
loads it into the targets.
 Table: It contains the name of columns, data types, definitions, and other
metadata about a base dataset.
 Transform: We use the code logic to manipulate our data into different
formats using the transform.
Use Cases of AWS Glue
 To build a Data Warehouse to Organize, Cleanse, Validate, and
Format Data: We can transform and move AWS cloud data into our data
store. We can also load data from different sources into our data warehouse
for regular reporting and analysis. By storing it in the warehouse, we
integrate information from different parts of our business and form a
common source of data for decision-making.
 When we run Serverless Queries against our Amazon S3 Data Lake: S3
here means Simple Storage Service. AWS Glue can catalog our Amazon S3
data, making it available for querying
with Amazon Athena and Amazon Redshift Spectrum. With crawlers, our
metadata stays in synchronization with the underlying data, and we
can access and analyze data through one unified interface without
loading it into multiple data silos.
 Creating Event-driven ETL Pipelines: We can run our ETL jobs as soon
as new data becomes available in Amazon S3 by invoking our AWS Glue
ETL jobs from an AWS Lambda function. We can also register this new
data in the AWS Glue Data Catalog as a part of our ETL jobs.
 To Understand our Data Assets: We can store our data using various
AWS services and still maintain a unique, unified view of our data using the
AWS Glue Data Catalog. We can view the Data Catalog to quickly search
and discover the datasets that we own and maintain the relevant metadata in one
central location.
Benefits of AWS Glue
 Less Hassle: AWS Glue is integrated across a wide range of AWS services.
AWS Glue natively supports data stored in Amazon Aurora and
other Amazon Relational Database Service engines, Amazon RedShift and
Amazon S3 along with common database engines and databases in our
virtual private cloud running on Amazon EC2.
 Cost Effective: AWS Glue is serverless. There is no infrastructure to
provision or manage; AWS Glue handles provisioning, configuration, and
scaling of the resources required to run our ETL jobs. We only pay for the
resources that we use while our jobs are running.
 More Power: AWS Glue automates much of the effort in building,
maintaining, and running ETL jobs. It identifies data formats and suggests
schemas and transformations. Glue automatically generates the code to
execute our data transformations and loading processes.
Disadvantages of AWS Glue
 Amount of Work Involved: It is not a full-fledged ETL service. Hence in
order to customize the services as per our requirements, we need
experienced and skillful candidates. And it involves a huge amount of work
to be done as well.
 Platform Compatibility: AWS Glue is specifically made for the AWS
console and its subsidiaries. And hence it isn’t compatible with other
technologies.
 Limited Data Sources: It only supports limited data sources like S3 and
JDBC
 High Skillset Requirement: AWS Glue is a serverless application, and it is
still a new technology. Hence, the skillset required to implement and
operate the AWS Glue is high.
FAQs On AWS Glue
1. AWS Glue Data Catalog
The AWS Glue Data Catalog is a centralized metadata repository that houses information about your data from
multiple data sources. It offers a single interface for
finding, understanding, and managing your data assets. This catalog is used by an
AWS Glue ETL job during execution to understand data properties and guarantee
proper transformation.
2. AWS DataBrew
AWS Glue DataBrew is a visual data preparation service that helps you produce
clean data for analytics and machine learning. You can also create and manage
data preparation workflows through DataBrew's visual development environment.
3. AWS Glue Studio
AWS Glue Studio helps you build visual data integration (extract, transform, load)
jobs without writing code; you can manage them using a drag-and-drop interface.
4. AWS Glue Dynamic Frame
Working with big datasets in AWS Glue is made flexible and effective with the help of
AWS Glue Dynamic Frame, a data representation tool.
5. AWS Glue Connectors
You can connect AWS Glue ETL jobs to a variety of data sources and destinations by
using the pre-built connectors known as AWS Glue Connectors. These connectors
offer a standardised method of interacting with various data sources and formats,
making the process of extracting, transforming, and loading data easier.
6. AWS Glue API
You can automate and manage a number of AWS Glue features through the API, such
as job execution, data catalogs, crawlers, and more.


Welcome back to our advanced AWS Glue tutorial series. In this blog, we’ll delve into AWS Glue’s
powerful ETL (Extract, Transform, Load) job types, providing real-world code
examples that demonstrate the platform’s adaptability and capacity.

1. Python Shell Jobs

Python Shell Jobs in AWS Glue are well suited to lightweight, precise data
processing tasks. Below is a brief example of a
Python Shell Job that modifies a CSV file by adding a new column
and applying data filtering:

import pandas as pd

# Load data from an S3 source (reading s3:// paths with pandas requires the s3fs package)
input_bucket = 'input-bucket'
input_key = 'data/input.csv'
df = pd.read_csv(f's3://{input_bucket}/{input_key}')

# Perform data transformation
df['new_column'] = df['old_column'] * 2
filtered_df = df[df['some_condition'] == True]

# Export the result to S3
output_bucket = 'output-bucket'
output_key = 'data/output.csv'
filtered_df.to_csv(f's3://{output_bucket}/{output_key}', index=False)

2. Apache Spark Jobs

Complex data transformations become simple when using the vast
power of Apache Spark within AWS Glue. Consider a word count
on a text file as an example.

If you want to learn Apache Spark, follow this
link: https://brilliantprogrammer.medium.com/list/apache-pyspark-8181c687929e

from pyspark.context import SparkContext
from pyspark.sql import SparkSession

# Initialize Spark
sc = SparkContext()
spark = SparkSession(sc)

# Load data from an S3 source
input_bucket = 'input-bucket'
input_key = 'data/input.txt'
lines = spark.read.text(f's3://{input_bucket}/{input_key}')

# Perform word count (collects the counts to the driver)
word_counts = lines.rdd.flatMap(lambda line: line[0].split(' ')).countByValue()

# Store the results back in S3 (the built-in open() cannot write to s3:// paths,
# so write them out through Spark instead)
output_bucket = 'output-bucket'
output_key = 'data/word_counts'
counts_df = spark.createDataFrame(list(word_counts.items()), ['word', 'count'])
counts_df.coalesce(1).write.mode('overwrite').csv(f's3://{output_bucket}/{output_key}')

3. Ray Jobs

Ray Jobs in AWS Glue come to the top of the list for compute-intensive
operations that require distribution and parallelization. Consider
processing a series of photos in parallel:

import boto3
import ray

ray.init()

@ray.remote
def process_image(image_path):
    # Placeholder processing logic; replace with real image handling
    processed_image = f'processed contents of {image_path}'.encode()
    return processed_image

# List of image paths
image_paths = ['image1.jpg', 'image2.jpg', 'image3.jpg']

# Process images in parallel
results = ray.get([process_image.remote(path) for path in image_paths])

# Save processed images back to S3 (plain file APIs cannot write to s3:// paths,
# so upload the bytes with boto3 instead)
s3 = boto3.client('s3')
output_bucket = 'output-bucket'
for i, processed_image in enumerate(results):
    output_key = f'processed/image_{i}.jpg'
    s3.put_object(Bucket=output_bucket, Key=output_key, Body=processed_image)

4. Jupyter Notebook Jobs

AWS Glue Jupyter Notebook Jobs provide an interactive and
collaborative environment for data exploration and analysis. Here’s
an example of how to use a Jupyter notebook for data exploration:
# Import necessary libraries
import pandas as pd
import matplotlib.pyplot as plt

# Load and explore data
data = pd.read_csv('data.csv')
data.head()

# Visualize data
plt.scatter(data['X'], data['Y'])
plt.xlabel('X-axis')
plt.ylabel('Y-axis')
plt.title('Scatter Plot')
plt.show()

Choosing the Right Job Type

Selecting the right job type depends on the specific requirements of your ETL tasks:

 Use Python Shell Jobs for lightweight data transformations or quick scripting needs.
 Apache Spark Jobs are suitable for handling large-scale data processing tasks and complex transformations.
 When scalability and parallelization are essential, opt for Ray Jobs to distribute workloads.
 Jupyter Notebook Jobs are perfect for interactive data exploration and documentation.

Best Practices for ETL Jobs

Regardless of the job type you choose, it’s essential to follow best practices:

1. Data Validation: Ensure your data is clean and valid before processing it.

2. Error Handling: Implement robust error handling to handle unexpected issues gracefully (a minimal sketch follows this list).

3. Monitoring: Keep an eye on job execution with AWS Glue’s monitoring features.

4. Testing: Test your ETL jobs thoroughly in development environments before deploying them to production.

5. Documentation: Document your code, transformations, and job dependencies for future reference.
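To make the error-handling point concrete, here is a minimal sketch of a Glue Spark job that fails loudly instead of silently; the catalog names are placeholders and the validation rule is only an example:

import sys
from pyspark.context import SparkContext
from awsglue.utils import getResolvedOptions
from awsglue.context import GlueContext
from awsglue.job import Job

args = getResolvedOptions(sys.argv, ['JOB_NAME'])
glue_context = GlueContext(SparkContext.getOrCreate())
job = Job(glue_context)
job.init(args['JOB_NAME'], args)

try:
    # Placeholder extract step; replace with your real ETL logic
    dyf = glue_context.create_dynamic_frame.from_catalog(database='my_database', table_name='my_table')

    # Simple data validation: refuse to continue on an empty input
    if dyf.count() == 0:
        raise ValueError('Input table is empty - nothing to process')

    job.commit()
except Exception as exc:
    # Surface the failure in the job run logs and mark the run as failed
    print(f'ETL job failed: {exc}')
    raise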
