
Amazon Athena is a serverless query service, so no setup is required. It is not a database service; you pay only for the queries you run. You simply point Athena at your data in S3, define the required schema, and with standard SQL you are good to go.

Amazon Athena scales automatically, executing queries in parallel, which gives fast results even with large datasets and complex queries.

The difference between an RDBMS and Athena is that an RDBMS is used for DML, DCL, DDL and TCL operations on a database, whereas Athena is used mainly for DML (query) operations. Athena supports only the DDL needed to define external tables and does not support user-defined functions; it works with external tables only.

Athena helps you analyze unstructured, semi-structured and structured data that is stored in Amazon S3. Using Athena, you can create dynamic queries for your dataset. Athena also works with AWS Glue, which gives you a better way to store and manage the metadata for your data in S3.

Using AWS CloudFormation and Athena, you can create named queries. A named query allows you to give your query a name and then call it by that name. Athena is also used to fetch data from S3 and load it into different data stores using the Athena JDBC driver, for log storage and analysis, and for data-warehousing events.
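For illustration, a named query can also be created programmatically through the Athena API with boto3 (the AWS SDK for Python). This is a minimal sketch; the name, database and query string below are hypothetical placeholders.

import boto3

# Create an Athena client (assumes credentials and region are already configured)
athena = boto3.client("athena")

# Register a named query that can later be referred to by its name
response = athena.create_named_query(
    Name="daily-user-report",                      # hypothetical name
    Database="testdb",                             # hypothetical database
    QueryString="SELECT * FROM users LIMIT 10;",   # hypothetical query
    Description="Example named query created via the API"
)
print(response["NamedQueryId"])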

Demo – I (Creating Tables In Athena)

Now that you know all about Amazon Athena, let's take a dive into how to query your data stored as .json files in Amazon S3 using Athena.

1: Create multiple JSON files containing entries
2: Store the files in an S3 bucket
3: Create an external table for the files stored in S3
4: Write a query for accessing the data

Let’s understand how to do these tasks one by one.

1. Create the JSON files. (Create the data without using a newline character inside a record; each JSON object should sit on its own line.)
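For example, each file could contain records like the ones below, with every JSON object kept on a single line (the field names here are purely illustrative):

{"id": 1, "name": "John", "city": "London"}
{"id": 2, "name": "Maria", "city": "Berlin"}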

2. Store the files in an S3 bucket. We will access the S3 bucket using the AWS CLI (a scripted equivalent is sketched after these steps):

a. Configure an IAM user
b. Create an S3 bucket
c. Copy the files to the S3 bucket
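If you would rather script these steps, here is a minimal boto3 (AWS SDK for Python) equivalent. The bucket and file names are hypothetical placeholders, and the IAM user credentials are assumed to be configured already (for example with aws configure).

import boto3

s3 = boto3.client("s3")

# Create the bucket (bucket names must be globally unique; outside us-east-1
# you also need CreateBucketConfiguration={"LocationConstraint": "<region>"})
s3.create_bucket(Bucket="athena-demo-bucket-example")

# Upload the JSON files created earlier
for file_name in ["data1.json", "data2.json"]:
    s3.upload_file(file_name, "athena-demo-bucket-example", file_name)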

3. Create an external table in Athena. There are two ways of doing this:

a. Using an AWS Glue crawler
b. Manually

We will create it manually:

a. Click on Create table.
b. Create a new database if you don’t have one, and give the table a name.
c. Give the location of your file and select the type of file you will be working with.
d. Select the architecture of the data in your file. As the entered data is not that complex, we don’t need a partition. Click on “Create Table”. Athena will auto-generate the query for creating the external table and run it (the generated DDL looks roughly like the sketch below).
e. You have your external table ready.
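The exact DDL Athena generates depends on your schema, but for a simple JSON file it looks roughly like the statement in this sketch, which submits it through boto3. The column names, bucket path and query-result output location are hypothetical placeholders.

import boto3

athena = boto3.client("athena")

# Illustrative DDL for a JSON file whose records each sit on one line
ddl = """
CREATE EXTERNAL TABLE IF NOT EXISTS testdb.users (
  id INT,
  name STRING,
  city STRING
)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
LOCATION 's3://athena-demo-bucket-example/'
"""

athena.start_query_execution(
    QueryString=ddl,
    QueryExecutionContext={"Database": "testdb"},
    ResultConfiguration={"OutputLocation": "s3://athena-demo-bucket-example/query-results/"}
)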

4. Finally, we write a query to select all the data from the table (the same can be done programmatically, as sketched below):

a. select * from testdb;
b. Click on Run Query and you have all the information in your table.
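A minimal boto3 sketch of running the same query programmatically and reading back the results; the database name and output location are hypothetical placeholders.

import boto3
import time

athena = boto3.client("athena")

# Start the query; Athena writes the results to the given S3 location
execution = athena.start_query_execution(
    QueryString="SELECT * FROM testdb",
    QueryExecutionContext={"Database": "mydatabase"},   # hypothetical database
    ResultConfiguration={"OutputLocation": "s3://athena-demo-bucket-example/query-results/"}
)
query_id = execution["QueryExecutionId"]

# Poll until the query finishes
while True:
    state = athena.get_query_execution(QueryExecutionId=query_id)["QueryExecution"]["Status"]["State"]
    if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
        break
    time.sleep(1)

# Fetch the first page of results
if state == "SUCCEEDED":
    for row in athena.get_query_results(QueryExecutionId=query_id)["ResultSet"]["Rows"]:
        print(row)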

Demo – II (Comparison Between Amazon Athena And MySQL)
Loading a CSV file into MySQL took around 1 hour, but with Athena it took just 3 minutes to upload the CSV file to S3 and 0.42 seconds to create a table for the same data.

The ETL process is designed specifically for transferring data from a source database into a data warehouse, and this is where AWS Glue comes in.

 AWS Glue is a fully managed ETL service. It comprises components such as a central metadata repository known as the AWS Glue Data Catalog, an ETL engine that automatically generates Python or Scala code, and a flexible scheduler that handles dependency resolution, job monitoring, and retries.
 AWS Glue is serverless, which means there’s no infrastructure to set up or manage.

When Should I Use AWS Glue?


1. To build a data warehouse to organize, cleanse, validate, and format data. 

 You can transform as well as move AWS Cloud data into your data store.
 You can also load data from disparate sources into your data warehouse for regular
reporting and analysis.
 By storing it in a data warehouse, you integrate information from different parts of your
business and provide a common source of data for decision making.

2. When you want to run serverless queries against your Amazon S3 data lake. 

 AWS Glue can catalog your Amazon Simple Storage Service (Amazon S3) data, making
it available for querying with Amazon Athena and Amazon Redshift Spectrum.
 With crawlers, your metadata stays in synchronization with the underlying data. Athena
and Redshift Spectrum can directly query your Amazon S3 data lake with the help of the
AWS Glue Data Catalog.
 With AWS Glue, you access as well as analyze data through one unified interface
without loading it into multiple data silos.

3. When you want to create event-driven ETL pipelines 

 You can run your ETL jobs as soon as new data becomes available in Amazon S3 by invoking your AWS Glue ETL jobs from an AWS Lambda function (see the sketch after this list).
 You can also register this new dataset in the AWS Glue Data Catalog as part of your ETL jobs.
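As a rough sketch of such an event-driven pipeline, a Lambda function subscribed to S3 object-created events could start a Glue job like this (the job name is a hypothetical placeholder).

import boto3

glue = boto3.client("glue")

def lambda_handler(event, context):
    # Triggered by an S3 "object created" event; start the ETL job for the new data
    response = glue.start_job_run(JobName="my-etl-job")  # hypothetical job name
    return response["JobRunId"]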

4.  To understand your data assets. 

 You can store your data using various AWS services and still maintain a unified view of
your data using the AWS Glue Data Catalog.
 View the Data Catalog to quickly search and discover the datasets that you own, and maintain the relevant metadata in one central repository (see the sketch after this list).
 The Data Catalog also serves as a drop-in replacement for your external Apache Hive
Metastore.
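For instance, the Data Catalog can be browsed programmatically to discover the datasets you own. A minimal boto3 sketch (pagination is ignored for brevity):

import boto3

glue = boto3.client("glue")

# List every database in the Data Catalog and the tables it contains
for database in glue.get_databases()["DatabaseList"]:
    db_name = database["Name"]
    for table in glue.get_tables(DatabaseName=db_name)["TableList"]:
        print(db_name, table["Name"])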
AWS Glue Concepts
You define jobs in AWS Glue to accomplish the work that’s required to extract, transform, and
load (ETL) data from a data source to a data target. You typically perform the following actions:

 Firstly, you define a crawler to populate your AWS Glue Data Catalog with metadata table definitions. You point your crawler at a data store, and the crawler creates table definitions in the Data Catalog (a minimal API sketch of this step follows this list). In addition to table definitions, the Data Catalog contains other metadata that is required to define ETL jobs. You use this metadata when you define a job to transform your data.
 AWS Glue can generate a script to transform your data, or you can provide your own script in the AWS Glue console or API.
 You can run your job on demand, or you can set it up to start when a specified trigger occurs. The trigger can be a time-based schedule or an event.
 When your job runs, a script extracts data from your data source, transforms the data, and loads it to your data target. This script runs in an Apache Spark environment in AWS Glue.
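The crawler step can also be driven through the AWS Glue API instead of the console. A minimal boto3 sketch, where the crawler name, IAM role, database and S3 path are hypothetical placeholders:

import boto3

glue = boto3.client("glue")

# Define a crawler that populates the Data Catalog from an S3 path
glue.create_crawler(
    Name="demo-crawler",                                      # hypothetical name
    Role="arn:aws:iam::123456789012:role/glue-demo-role",     # hypothetical role ARN
    DatabaseName="demo-db",                                   # hypothetical catalog database
    Targets={"S3Targets": [{"Path": "s3://demo-bucket/read/"}]}
)

# Run the crawler on demand (it can also be attached to a schedule or trigger)
glue.start_crawler(Name="demo-crawler")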

AWS Glue Terminology

How does AWS Glue work?


Here I am going to demonstrate an example where I will create a transformation script with
Python and Spark. I will also cover some basic Glue concepts such as crawler, database, table,
and job.

1. Create a data source for AWS Glue:

Glue can read data from a database or an S3 bucket. For example, I have created an S3 bucket called glue-bucket-edureka. Create two folders from the S3 console and name them read and write. Now create a text file with the following data and upload it to the read folder of the S3 bucket.

rank,movie_title,year,rating
1,The Shawshank Redemption,1994,9.2
2,The Godfather,1972,9.2
3,The Godfather: Part II,1974,9.0
4,The Dark Knight,2008,9.0
5,12 Angry Men,1957,8.9
6,Schindler’s List,1993,8.9
7,The Lord of the Rings: The Return of the King,2003,8.9
8,Pulp Fiction,1994,8.9
9,The Lord of the Rings: The Fellowship of the Ring,2001,8.8
10,Fight Club,1999,8.8

2. Crawl the data source to the data catalog:

In this step, we will create a crawler. The crawler will catalog all files in the specified S3 bucket and prefix. All the files should have the same schema. In Glue crawler terminology, the file format is known as a classifier. The crawler identifies the most common classifiers automatically, including CSV, JSON and Parquet. Our sample file is in CSV format and will be recognized automatically.

 In the left panel of the Glue management console, click Crawlers.
 Click the blue Add crawler button.
 Give the crawler a name such as glue-demo-edureka-crawler.
 In the Add a data store menu, choose S3 and select the bucket you created. Drill down to select the read folder.
 In Choose an IAM role, create a new role. Name it, for example, glue-demo-edureka-iam-role.
 In Configure the crawler’s output, add a database called glue-demo-edureka-db.

When you are back in the list of all crawlers, tick the crawler that you created. Click Run
crawler.
3. The crawled metadata in Glue tables:

Once the data has been crawled, the crawler creates a metadata table from it. You can find the results in the Tables section of the Glue console. The database that you created during the crawler setup is just an arbitrary way of grouping the tables. Glue tables don’t contain the data, only the instructions on how to access it.
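Because a Glue table holds only metadata, you can inspect it through the API. A minimal boto3 sketch using the database and table names from this demo (treat them as placeholders if your names differ):

import boto3

glue = boto3.client("glue")

# Fetch the table definition the crawler created; note that it holds only
# schema, location and format information, not the data itself
table = glue.get_table(DatabaseName="glue-demo-edureka-db", Name="read")["Table"]
print(table["StorageDescriptor"]["Location"])
print([col["Name"] for col in table["StorageDescriptor"]["Columns"]])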

4. AWS Glue jobs for data transformations:

From the Glue console left panel, go to Jobs and click the blue Add job button (an equivalent API call is sketched after this list). Follow these instructions to create the Glue job:

 Name the job glue-demo-edureka-job.
 Choose the same IAM role that you created for the crawler. It can read and write to the S3 bucket.
 Type: Spark.
 Glue version: Spark 2.4, Python 3.
 This job runs: A new script to be authored by you.
 Security configuration, script libraries, and job parameters
 Maximum capacity: 2. This is the minimum and costs about $0.15 per run.
 Job timeout: 10. This prevents the job from running longer than expected.
 Click Next and then Save job and edit the script.
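For reference, roughly the same job definition can be created through the Glue API. A minimal boto3 sketch, assuming the ETL script will live at a (hypothetical) S3 location:

import boto3

glue = boto3.client("glue")

glue.create_job(
    Name="glue-demo-edureka-job",
    Role="glue-demo-edureka-iam-role",
    Command={
        "Name": "glueetl",                                    # Spark ETL job
        "ScriptLocation": "s3://demo-bucket/scripts/job.py",  # hypothetical script path
        "PythonVersion": "3"
    },
    GlueVersion="1.0",   # Glue 1.0 corresponds to Spark 2.4 / Python 3
    MaxCapacity=2.0,     # 2 DPUs, the minimum for a Spark job
    Timeout=10           # minutes
)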

5. Editing the Glue script to transform the data with Python and Spark:

Copy the following code to your Glue script editor. Remember to change the bucket name for the s3_write_path variable. Save the code in the editor and click Run job.

#########################################
### IMPORT LIBRARIES AND SET VARIABLES
#########################################

#Import python modules
from datetime import datetime

#Import pyspark modules
from pyspark.context import SparkContext
import pyspark.sql.functions as f

#Import glue modules
from awsglue.utils import getResolvedOptions
from awsglue.context import GlueContext
from awsglue.dynamicframe import DynamicFrame
from awsglue.job import Job

#Initialize contexts and session
spark_context = SparkContext.getOrCreate()
glue_context = GlueContext(spark_context)
session = glue_context.spark_session

#Parameters
glue_db = "glue-demo-edureka-db"
glue_tbl = "read"
s3_write_path = "s3://glue-demo-bucket-edureka/write"

#########################################
### EXTRACT (READ DATA)
#########################################

#Log starting time
dt_start = datetime.now().strftime("%Y-%m-%d %H:%M:%S")
print("Start time:", dt_start)

#Read movie data into a Glue dynamic frame
dynamic_frame_read = glue_context.create_dynamic_frame.from_catalog(database = glue_db, table_name = glue_tbl)

#Convert dynamic frame to data frame to use standard pyspark functions
data_frame = dynamic_frame_read.toDF()

#########################################
### TRANSFORM (MODIFY DATA)
#########################################

#Create a decade column from year
decade_col = f.floor(data_frame["year"]/10)*10
data_frame = data_frame.withColumn("decade", decade_col)

#Group by decade: count movies, get average rating
data_frame_aggregated = data_frame.groupby("decade").agg(
    f.count(f.col("movie_title")).alias('movie_count'),
    f.mean(f.col("rating")).alias('rating_mean'),
)

#Sort by the number of movies per decade
data_frame_aggregated = data_frame_aggregated.orderBy(f.desc("movie_count"))

#Print result table
#Note: show() is an action. Actions force the execution of the data frame plan.
#With big data the slowdown would be significant without caching.
data_frame_aggregated.show(10)

#########################################
### LOAD (WRITE DATA)
#########################################

#Create just 1 partition, because there is so little data
data_frame_aggregated = data_frame_aggregated.repartition(1)

#Convert back to dynamic frame
dynamic_frame_write = DynamicFrame.fromDF(data_frame_aggregated, glue_context, "dynamic_frame_write")

#Write data back to S3
glue_context.write_dynamic_frame.from_options(
    frame = dynamic_frame_write,
    connection_type = "s3",
    connection_options = {
        "path": s3_write_path,
        #Here you could create S3 prefixes according to the values in specified columns
        #"partitionKeys": ["decade"]
    },
    format = "csv"
)

#Log end time
dt_end = datetime.now().strftime("%Y-%m-%d %H:%M:%S")
print("End time:", dt_end)


The detailed explanations are commented in the code. Here is the high-level description:

 Read the movie data from S3


 Get movie count and rating average for each decade
 Write aggregated data back to S3

The execution time with 2 Data Processing Units (DPU) was around 40 seconds. The relatively long duration is explained by the start-up overhead.

The data transformation script creates summarized movie data. For example, the 2000s decade has 3 movies in the IMDb top 10 with an average rating of 8.9. You can download the result file from the write folder of your S3 bucket. Another way to investigate the job would be to take a look at the CloudWatch logs.

The data is stored back to S3 as a CSV in the “write” prefix. The number of partitions equals the number of output files.

With this, we have come to the end of this article on AWS Glue. I hope you have understood
everything that I have explained here.
