
AWS Glue

AWS Glue is a serverless ETL (extract, transform and load) service on the AWS cloud. It makes it easy to
prepare data for analytics.

Components of AWS Glue

• Data catalog: The data catalog holds the metadata and the structure of the data.

• Database: A container in the data catalog for the metadata tables of the sources and targets.

• Table: One or more tables are created in the database and used by the source and target.

• Crawler and Classifier: A crawler scans the data source and uses built-in or custom classifiers to
infer its schema. It creates metadata tables in the data catalog, or updates tables already defined there.

• Job: A job is the business logic that carries out an ETL task. The logic is written in Python or
Scala and runs on Apache Spark.

• Trigger: A trigger starts the ETL job execution on demand or at a specific time.

• Development endpoint: It provides a development environment where the ETL job script can be
developed, tested and debugged.
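To make the "at a specific time" case of a trigger concrete, here is a minimal sketch that builds the request for a scheduled trigger. The trigger name and job name are hypothetical; the actual call is kept in a separate function that imports boto3 lazily, on the assumption that boto3 and AWS credentials are available where it runs.

```python
def scheduled_trigger_request(trigger_name, job_name, cron_expression):
    """Build keyword arguments for the Glue CreateTrigger API (scheduled type)."""
    return {
        "Name": trigger_name,
        "Type": "SCHEDULED",
        "Schedule": cron_expression,
        "Actions": [{"JobName": job_name}],
        "StartOnCreation": True,
    }


def create_trigger(request):
    # boto3 is imported lazily so the builder above works without AWS access.
    import boto3

    return boto3.client("glue").create_trigger(**request)


# Hypothetical names; this schedule fires every day at 02:00 UTC.
trigger_request = scheduled_trigger_request(
    "nightly-etl-trigger", "demo-etl-job", "cron(0 2 * * ? *)"
)
```

Calling create_trigger(trigger_request) would register the trigger. An on-demand trigger would use "ON_DEMAND" as the Type and omit the Schedule.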

Create an IAM role to access AWS Glue + Amazon S3

• Open the Amazon IAM console.
• Click on Roles in the left pane, then click on Create Role.
• Choose AWS service in the “Select type of trusted entity” section.
• Choose Glue in the “Choose the service that will use this role” section.
• Choose Glue in the “Select your use case” section.
• Click on Next: Permissions.
• Select AWSGlueServiceRole in the Attach Permissions Policies section. The AWSGlueServiceRole
policy contains permissions to access Glue, CloudWatch, EC2, S3, and IAM.
• Click on Next: Tags. Leave the Add tags section blank and click on Next: Review.
• Provide a name for the service role in the form AWSGlueServiceRole-<role name>. For simplicity, keep
the prefix ‘AWSGlueServiceRole-’ in the role name.
• Click on Create role.
• A role with full access to AWS Glue and limited access to Amazon S3 has now been created.
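The console steps above can also be sketched in code. The trust policy below is what choosing Glue as the trusted service amounts to, and the managed policy ARN is the one for AWSGlueServiceRole. The role suffix is a hypothetical value, and the function that calls IAM assumes boto3 and suitable credentials are available.

```python
import json

# ARN of the AWS managed policy selected in the console steps.
GLUE_MANAGED_POLICY_ARN = "arn:aws:iam::aws:policy/service-role/AWSGlueServiceRole"


def glue_trust_policy():
    """Trust policy that lets the AWS Glue service assume the role."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": {"Service": "glue.amazonaws.com"},
                "Action": "sts:AssumeRole",
            }
        ],
    }


def create_glue_role(role_suffix):
    # boto3 is imported lazily so the policy builder works without AWS access.
    import boto3

    iam = boto3.client("iam")
    role_name = "AWSGlueServiceRole-" + role_suffix
    iam.create_role(
        RoleName=role_name,
        AssumeRolePolicyDocument=json.dumps(glue_trust_policy()),
    )
    iam.attach_role_policy(RoleName=role_name, PolicyArn=GLUE_MANAGED_POLICY_ARN)
    return role_name
```

For example, create_glue_role("demo") would create a role named AWSGlueServiceRole-demo. Because the S3 access granted by the managed policy is limited, the role may need an additional read policy for the bucket that holds the source files.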
Upload source CSV files to Amazon S3:

• On the Amazon S3 console, click on Create bucket to create a bucket where you can store files and folders.
• Enter a bucket name, select a Region and click on Next.
• The remaining configuration settings for creating an S3 bucket are optional. Click Next to create the S3
bucket.
• Create a new folder in the bucket and upload the source CSV files.
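As a scripted alternative to the console upload, the sketch below maps local CSV files to object keys under a folder and uploads them. The bucket, folder and file names are hypothetical, and the upload function assumes boto3 and credentials are available.

```python
from pathlib import PurePosixPath


def s3_keys_for(filenames, folder):
    """Map each local CSV path to the S3 key it would get under `folder`."""
    prefix = folder.strip("/")
    return {
        name: f"{prefix}/{PurePosixPath(name).name}"
        for name in filenames
        if name.lower().endswith(".csv")  # skip anything that is not a CSV
    }


def upload_csv_files(bucket, filenames, folder):
    # boto3 is imported lazily so the key mapping works without AWS access.
    import boto3

    s3 = boto3.client("s3")
    for filename, key in s3_keys_for(filenames, folder).items():
        s3.upload_file(filename, bucket, key)


# Hypothetical local files and target folder.
upload_plan = s3_keys_for(["data/source_1.csv", "data/notes.txt"], "/source/")
```

Keeping each source table in its own folder is a common layout, since a crawler pointed at a folder typically treats the files in it as one table.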

Create a connection for the target database into Amazon Redshift:

• Prerequisite: You must have an existing cluster, database name and database user in Amazon
Redshift.
• In the AWS Glue console, click on Add Connection in the left pane.
• Select the existing Amazon Redshift cluster as the cluster for the connection.
• Enter the database name, username, and password of the cluster and click Next.
• Review the details of the connection and then click Finish.

The Amazon Redshift connection is now created and can be verified with Test Connection.
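The same connection can be described as a request to the Glue API. This is a sketch under assumptions: the cluster host and credentials are placeholders, the connection name and database are borrowed from the script later in this article, and the call itself requires boto3 and credentials.

```python
def redshift_connection_input(name, host, database, username, password, port=5439):
    """Build the ConnectionInput structure for a JDBC connection to Redshift."""
    return {
        "Name": name,
        "ConnectionType": "JDBC",
        "ConnectionProperties": {
            "JDBC_CONNECTION_URL": f"jdbc:redshift://{host}:{port}/{database}",
            "USERNAME": username,
            "PASSWORD": password,
        },
    }


def create_connection(connection_input):
    # boto3 is imported lazily so the builder works without AWS access.
    import boto3

    return boto3.client("glue").create_connection(ConnectionInput=connection_input)


# Placeholder host and credentials, for illustration only.
connection_input = redshift_connection_input(
    "my-redshift-1", "example-cluster.example.com", "dev", "demo_user", "demo_password"
)
```

A cluster inside a VPC usually also needs the subnet and security group details (PhysicalConnectionRequirements), which the console wizard fills in when an existing cluster is selected.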

Adding a crawler to create data catalog using Amazon S3 as a data source

• On the left pane in the AWS Glue console, click on Crawlers -> Add Crawler.
• Enter the crawler name in the dialog box and click Next.
• Choose S3 as the data store from the drop-down list.
• Select the folder where your CSVs are stored in the Include path field.
• Select “Choose an existing IAM role”, select the previously created role name from the drop-down list
of IAM roles and click Next.
• Leave the default option of Run on Demand and click Next.
• Choose an existing database. If you do not have one, click Add Database to create a new database on the fly.
• Table prefixes are optional and left to the user to customize. The crawler creates the tables
automatically when it runs. Click Next.
• Once you are satisfied with the configuration, click on Finish.
• You can now run the crawler by clicking Run it now.

Databases in the left pane lets you verify whether the tables were created automatically by the crawler. Amazon
Athena lets you view the data in the tables.
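The crawler configured above can be sketched as an API request. The crawler name, role name and S3 path are hypothetical; the database name db_demo1 is taken from the script later in this article. Leaving out a schedule corresponds to the Run on Demand option.

```python
def crawler_request(name, role, database, s3_path, table_prefix=""):
    """Build keyword arguments for the Glue CreateCrawler API with one S3 target."""
    return {
        "Name": name,
        "Role": role,
        "DatabaseName": database,
        "Targets": {"S3Targets": [{"Path": s3_path}]},
        "TablePrefix": table_prefix,
    }


def create_and_run_crawler(request):
    # boto3 is imported lazily so the builder works without AWS access.
    import boto3

    glue = boto3.client("glue")
    glue.create_crawler(**request)
    glue.start_crawler(Name=request["Name"])


# Hypothetical names and path; no Schedule key means the crawler runs on demand.
demo_crawler = crawler_request(
    "demo-csv-crawler",
    "AWSGlueServiceRole-demo",
    "db_demo1",
    "s3://example-bucket/source/",
)
```

create_and_run_crawler(demo_crawler) would create the crawler and start it once, which mirrors clicking Finish and then Run it now.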

Add a Job to execute ETL work

• In the left pane, click on Jobs, then click on Add Job.
• Enter a name for the Job and then select the IAM role previously created for AWS Glue.
• Select Spark for the Type and select Python or Scala.
• You can edit the number of DPUs (data processing units) in the Maximum capacity field of the “Security
configuration, script libraries, and job parameters (optional)” section.
• The remaining configuration is optional and the default values work fine.
• Click Next.
• Choose a data source table in the Choose a data source section. You can choose only a single data source.
Click Next.
• Choose a data target table from the list of tables. You can either create new tables or choose an existing one.
• If you haven’t created any target table, select the “Create tables in data target” option.
• Enter a database name that must exist in the target data store. Click Next.
• You can map the columns of the source table to those of the target table. Click Save job and edit script.
• Open the Python script by selecting the recently created job name. Click on Action -> Edit Script.

The left pane shows a visual representation of the ETL process. The right-hand pane shows the script
code, and just below it you can see the logs of the running Job.
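The job settings chosen in the wizard can be sketched as a request to the Glue API. The job name, role, script location and temporary directory are hypothetical values, and the capacity of 10 DPUs is only an illustrative choice. Running the request assumes boto3 and credentials are available.

```python
def job_request(name, role, script_location, temp_dir, connection_name, max_capacity=10.0):
    """Build keyword arguments for the Glue CreateJob API for a Spark ETL job."""
    return {
        "Name": name,
        "Role": role,
        # "glueetl" is the command name for a Spark ETL job.
        "Command": {
            "Name": "glueetl",
            "ScriptLocation": script_location,
            "PythonVersion": "3",
        },
        # Passed to the script and read there via args["TempDir"].
        "DefaultArguments": {"--TempDir": temp_dir},
        "Connections": {"Connections": [connection_name]},
        "MaxCapacity": max_capacity,  # number of DPUs
    }


def create_and_start_job(request):
    # boto3 is imported lazily so the builder works without AWS access.
    import boto3

    glue = boto3.client("glue")
    glue.create_job(**request)
    return glue.start_job_run(JobName=request["Name"])


# Hypothetical names and S3 locations.
demo_job = job_request(
    "demo-etl-job",
    "AWSGlueServiceRole-demo",
    "s3://example-bucket/scripts/demo_etl_job.py",
    "s3://example-bucket/tmp/",
    "my-redshift-1",
)
```

Newer Glue versions generally size jobs with a worker type and a number of workers instead of MaxCapacity, so treat that field as matching the Maximum capacity setting described here rather than as the only option.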

The following script performs the extraction, transformation and loading process on AWS Glue.

import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job

Get the Job name and the temporary directory from the command-line arguments.

args = getResolvedOptions(sys.argv, ['TempDir','JOB_NAME'])
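To show the shape of the input and output of this call, here is a simplified stand-in written in plain Python. It is not the awsglue implementation and only approximates its behavior for "--NAME value" pairs; the argument values are hypothetical.

```python
def resolve_options(argv, option_names):
    """Simplified stand-in: pick the value following each '--NAME' flag in argv."""
    resolved = {}
    for name in option_names:
        flag = "--" + name
        if flag not in argv:
            raise ValueError(f"missing required argument {flag}")
        # The value is the element right after the flag.
        resolved[name] = argv[argv.index(flag) + 1]
    return resolved


# Hypothetical command line, as a job run might pass it to the script.
example_argv = [
    "etl_script.py",
    "--JOB_NAME", "demo-etl-job",
    "--TempDir", "s3://example-bucket/tmp/",
]
example_args = resolve_options(example_argv, ["TempDir", "JOB_NAME"])
```

The result is a dictionary keyed by option name, which is why the script reads args['JOB_NAME'] and args["TempDir"].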


Initialize the GlueContext and SparkContext for the Job.

sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args['JOB_NAME'], args)

Extract the data of the tbl_syn_source_1_csv and tbl_syn_source_2_csv tables from the data catalog. AWS
Glue loads the data as Dynamic Frames.

datasource1 = glueContext.create_dynamic_frame.from_catalog(database =
"db_demo1", table_name = "tbl_syn_source_1_csv", transformation_ctx =
"datasource1")

datasource2 = glueContext.create_dynamic_frame.from_catalog(database =
"db_demo1", table_name = "tbl_syn_source_2_csv", transformation_ctx =
"datasource2")

Now apply a transformation to the source tables. You can join both tables on the statecode column of
tbl_syn_source_1_csv and the code column of tbl_syn_source_2_csv.
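To make the join semantics concrete, the plain-Python sketch below performs an equality join on rows represented as dictionaries. It only illustrates the matching logic and does not use AWS Glue; the population and name columns and all row values are made up for illustration.

```python
def inner_join(rows1, rows2, key1, key2):
    """Equality join: keep row pairs where rows1[key1] == rows2[key2]."""
    # Index the second table by its key for fast lookups.
    index = {}
    for row in rows2:
        index.setdefault(row[key2], []).append(row)

    joined = []
    for left in rows1:
        for right in index.get(left[key1], []):
            # The joined row carries the columns of both sides.
            joined.append({**left, **right})
    return joined


# Made-up sample rows for the two source tables.
source_1 = [
    {"statecode": "CA", "population": 100},
    {"statecode": "ZZ", "population": 5},
]
source_2 = [
    {"code": "CA", "name": "California"},
    {"code": "NY", "name": "New York"},
]
joined_rows = inner_join(source_1, source_2, "statecode", "code")
```

Only the CA rows match, so the ZZ and NY rows are dropped. Rows without a partner are lost in this kind of join, which is where a LEFT or RIGHT JOIN would behave differently.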

Several transformations are available within AWS Glue, such as RenameField, SelectFields, Join, etc.
See https://docs.aws.amazon.com/glue/latest/dg/built-in-transforms.html.

join1 = Join.apply(frame1 = datasource1, frame2 = datasource2, keys1 =
"statecode", keys2 = "code", transformation_ctx = "join1")

Load the joined Dynamic Frame into Amazon Redshift (Database=dev and Schema=sch_demo_1).

datasink1 = glueContext.write_dynamic_frame.from_jdbc_conf(frame = join1,
catalog_connection = "my-redshift-1", connection_options = {"dbtable":
"sch_demo_1.tbl_joined", "database": "dev"}, redshift_tmp_dir =
args["TempDir"], transformation_ctx = "datasink1")

Finally, commit your Job.

job.commit()

Save and execute the Job by clicking on Run Job. The complete script is:

import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job

args = getResolvedOptions(sys.argv, ['TempDir','JOB_NAME'])

## Initialize the GlueContext and SparkContext
sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args['JOB_NAME'], args)

## Read the source data from Amazon S3 using the table definitions in the data catalog
datasource1 = glueContext.create_dynamic_frame.from_catalog(database =
"db_demo1", table_name = "tbl_syn_source_1_csv", transformation_ctx =
"datasource1")

datasource2 = glueContext.create_dynamic_frame.from_catalog(database =
"db_demo1", table_name = "tbl_syn_source_2_csv", transformation_ctx =
"datasource2")

## Apply transformation, join the tables
join1 = Join.apply(frame1 = datasource1, frame2 = datasource2, keys1 =
"statecode", keys2 = "code", transformation_ctx = "join1")

## Write the transformed data into Amazon Redshift
datasink1 = glueContext.write_dynamic_frame.from_jdbc_conf(frame = join1,
catalog_connection = "my-redshift-1", connection_options = {"dbtable":
"sch_demo_1.tbl_joined", "database": "dev"}, redshift_tmp_dir =
args["TempDir"], transformation_ctx = "datasink1")

## Commit the Job
job.commit()

More on transformation with AWS Glue

The built-in transformations of AWS Glue have a few limitations: operations such as UNION, LEFT JOIN
and RIGHT JOIN are not available. To overcome this, you can use Spark.

Convert the AWS Glue Dynamic Frame to a Spark DataFrame, and then apply Spark functions for the various
transformations.

Example: A Union transformation is not available in AWS Glue. You can use the Spark union() function to
perform a union of two tables.

## Import needed to convert a Spark DataFrame back to a Glue Dynamic Frame
from awsglue.dynamicframe import DynamicFrame

## Convert Glue Dynamic frame to Spark DataFrame
spark_data_frame_1 = glue_dynamic_frame_1.toDF()
spark_data_frame_2 = glue_dynamic_frame_2.toDF()

## Apply UNION Transformation on Spark DataFrame
## (union() keeps duplicate rows, so distinct() removes them)
spark_data_frame_union = spark_data_frame_1.union(spark_data_frame_2).distinct()

## Again, convert Spark DataFrame back to Glue Dynamic Frame
glue_dynamic_frame_union = DynamicFrame.fromDF(spark_data_frame_union,
glueContext, "spark_data_frame_union")
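To clarify what the union followed by distinct computes, here is a plain-Python sketch on made-up rows. It does not use Spark; it only mirrors the semantics, in which rows are appended by column position and duplicates are then removed.

```python
def union_all(rows1, rows2):
    """Append the rows of both tables by position; duplicates are kept."""
    return list(rows1) + list(rows2)


def distinct(rows):
    """Drop duplicate rows, keeping the first occurrence of each."""
    seen = set()
    result = []
    for row in rows:
        if row not in seen:
            seen.add(row)
            result.append(row)
    return result


# Made-up rows with the same column layout (code, name) in both tables.
table_1 = [("CA", "California"), ("NY", "New York")]
table_2 = [("NY", "New York"), ("TX", "Texas")]
union_rows = distinct(union_all(table_1, table_2))
```

The NY row appears in both tables but only once in the result. This sketch preserves input order, whereas Spark does not guarantee any row order after distinct(). Both tables must have the same column layout because the rows are matched by position, not by column name.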

GitHub link for source code:

https://gist.github.com/nitinmlvya/ba4626e8ec40dc546119bb14a8349b45
