
We will access the S3 bucket using the AWS CLI (a rough boto3 equivalent of these steps is sketched below):

a. Configure an IAM user.

b. Create an S3 bucket.

c. Copy files to the S3 bucket.
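
The steps above use the AWS CLI; a rough equivalent with boto3 (the AWS SDK for Python) is sketched below. The bucket name, region, and file names are placeholders, not values from the original demo.

import boto3

# Assumes credentials for the configured IAM user are already set up,
# e.g. via `aws configure` or environment variables.
s3 = boto3.client("s3", region_name="us-east-1")

# Create the S3 bucket (regions other than us-east-1 also need a
# CreateBucketConfiguration with a LocationConstraint).
s3.create_bucket(Bucket="my-athena-demo-bucket")

# Copy a local CSV file into the bucket so Athena can query it later.
s3.upload_file("data.csv", "my-athena-demo-bucket", "input/data.csv")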

• Create an External Table in Athena. There are two ways of doing this:

a. Using an AWS Glue Crawler

b. Manually

• We will create it manually:

a. Create the table: create a new database if you don't have one, give the table a name, and give the location of your file.

b. Select the type of file you will be working with.

c. Select the structure of the data in your file.

d. As the entered data is not that complex, we don't need a partition. Click on "Create Table". Athena will auto-generate the query for creating the external table and run it (a boto3 sketch of similar DDL follows below).

e. You have your external table ready.

• We write a query to select all data from the table:

a. select * from testdb;

b. Click on "Run Query" and you have all the information in your table (a programmatic version of this step is sketched below).
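
Outside the console, the same kind of query can be submitted programmatically. The sketch below assumes the hypothetical testdb.movies table from the DDL sketch above and a placeholder output location.

import time
import boto3

athena = boto3.client("athena", region_name="us-east-1")

# Submit the SELECT query (table name and output location are placeholders).
run = athena.start_query_execution(
    QueryString="SELECT * FROM testdb.movies",
    QueryExecutionContext={"Database": "testdb"},
    ResultConfiguration={"OutputLocation": "s3://my-athena-demo-bucket/query-results/"},
)
query_id = run["QueryExecutionId"]

# Poll until the query reaches a terminal state.
while True:
    status = athena.get_query_execution(QueryExecutionId=query_id)
    state = status["QueryExecution"]["Status"]["State"]
    if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
        break
    time.sleep(1)

# Print each row of the result set (the first row holds the column headers).
if state == "SUCCEEDED":
    results = athena.get_query_results(QueryExecutionId=query_id)
    for row in results["ResultSet"]["Rows"]:
        print([col.get("VarCharValue") for col in row["Data"]])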

Demo – II (Comparison Between Amazon Athena and MySQL)

Loading a CSV file into MySQL took around 1 hour, but with Athena it took just 3 minutes to upload the CSV file to S3 and 0.42 seconds to create a table for the same data.

The ETL process is designed specifically to transfer data from its source database into a data warehouse.

• AWS Glue is a fully managed ETL service. It comprises components such as a central metadata repository known as the AWS Glue Data Catalog, an ETL engine that automatically generates Python or Scala code, and a flexible scheduler that handles dependency resolution, job monitoring, and retries.
• AWS Glue is serverless, which means there is no infrastructure to set up or manage.

When Should I Use AWS Glue?


1. To build a data warehouse to organize, cleanse, validate, and format data.

• You can transform as well as move AWS Cloud data into your data store.
• You can also load data from disparate sources into your data warehouse for regular reporting and analysis.
• By storing data in a data warehouse, you integrate information from different parts of your business and provide a common source of data for decision making.

2. To run serverless queries against your Amazon S3 data lake.

• AWS Glue can catalog your Amazon Simple Storage Service (Amazon S3) data, making it available for querying with Amazon Athena and Amazon Redshift Spectrum.
• With crawlers, your metadata stays in sync with the underlying data. Athena and Redshift Spectrum can query your Amazon S3 data lake directly with the help of the AWS Glue Data Catalog.
• With AWS Glue, you access and analyze data through one unified interface without loading it into multiple data silos.

3. To create event-driven ETL pipelines.

• You can run your ETL jobs as soon as new data becomes available in Amazon S3 by invoking your AWS Glue ETL jobs from an AWS Lambda function (a minimal Lambda sketch follows below).
• You can also register this new dataset in the AWS Glue Data Catalog as part of your ETL jobs.
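
A minimal sketch of such an event-driven pipeline, assuming a hypothetical Glue job named my-etl-job and a Lambda function subscribed to S3 object-created events:

import boto3

glue = boto3.client("glue")

def lambda_handler(event, context):
    # Start the Glue ETL job for every object that just landed in S3.
    # The job name and the argument key are placeholders.
    for record in event.get("Records", []):
        bucket = record["s3"]["bucket"]["name"]
        key = record["s3"]["object"]["key"]
        glue.start_job_run(
            JobName="my-etl-job",
            Arguments={"--input_path": f"s3://{bucket}/{key}"},
        )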

4. To understand your data assets.

• You can store your data using various AWS services and still maintain a unified view of your data using the AWS Glue Data Catalog.
• View the Data Catalog to quickly search and discover the datasets that you own, and maintain the relevant metadata in one central repository.
• The Data Catalog also serves as a drop-in replacement for your external Apache Hive Metastore.

AWS Glue Concepts
You define jobs in AWS Glue to accomplish the work that’s required to extract, transform, and
load (ETL) data from a data source to a data target. You typically perform the following actions:

• First, you define a crawler to populate your AWS Glue Data Catalog with metadata table definitions. You point your crawler at a data store, and the crawler creates table definitions in the Data Catalog. In addition to table definitions, the Data Catalog contains other metadata that is required to define ETL jobs. You use this metadata when you define a job to transform your data.
• AWS Glue can generate a script to transform your data, or you can provide your own script through the AWS Glue console or API.
• You can run your job on demand, or you can set it up to start when a specified trigger occurs. The trigger can be a time-based schedule or an event.
• When your job runs, a script extracts data from your data source, transforms the data, and loads it into your data target. This script runs in an Apache Spark environment in AWS Glue (a minimal skeleton of such a script is sketched below).
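
To make the job and script relationship concrete, here is a minimal skeleton of the kind of PySpark script a Glue job runs; the database, table, and output path are placeholders, not part of the original article.

import sys
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job

# Standard boilerplate for a Glue job running in an Apache Spark environment.
args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glue_context = GlueContext(SparkContext())
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

# Extract: read the source table that a crawler registered in the Data Catalog
# (database and table names are hypothetical).
source = glue_context.create_dynamic_frame.from_catalog(
    database="my_database", table_name="my_source_table"
)

# Load: write the data to the target location (transformations would go in between).
glue_context.write_dynamic_frame.from_options(
    frame=source,
    connection_type="s3",
    connection_options={"path": "s3://my-bucket/output/"},
    format="csv",
)

job.commit()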

AWS Glue Terminology

How does AWS Glue work?


Here I am going to demonstrate an example where I will create a transformation script with
Python and Spark. I will also cover some basic Glue concepts such as crawler, database, table,
and job.

1. Create a data source for AWS Glue:

Glue can read data from a database or an S3 bucket. For example, I have created an S3 bucket called glue-bucket-edureka. Create two folders from the S3 console and name them read and write. Now create a text file with the following data and upload it to the read folder of the S3 bucket (a boto3 sketch of this upload follows after the sample data).

rank,movie_title,year,rating
1,The Shawshank Redemption,1994,9.2
2,The Godfather,1972,9.2
3,The Godfather: Part II,1974,9.0
4,The Dark Knight,2008,9.0
5,12 Angry Men,1957,8.9
6,Schindler’s List,1993,8.9
7,The Lord of the Rings: The Return of the King,2003,8.9
8,Pulp Fiction,1994,8.9
9,The Lord of the Rings: The Fellowship of the Ring,2001,8.8
10,Fight Club,1999,8.8
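
A boto3 sketch of this setup is shown below; the folder keys are just prefixes, and the local filename movies.txt is hypothetical.

import boto3

s3 = boto3.client("s3")

# Create the two "folders" (key prefixes) inside the existing bucket.
s3.put_object(Bucket="glue-bucket-edureka", Key="read/")
s3.put_object(Bucket="glue-bucket-edureka", Key="write/")

# Upload the sample movie data shown above into the read folder.
s3.upload_file("movies.txt", "glue-bucket-edureka", "read/movies.txt")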

2. Crawl the data source into the Data Catalog:

In this step, we will create a crawler. The crawler will catalog all files in the specified S3 bucket, as sketched below.
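
The crawler is normally created from the Glue console; the equivalent boto3 calls look roughly like the sketch below, assuming a hypothetical crawler name, database name, and IAM role that can read the bucket.

import boto3

glue = boto3.client("glue")

# Create a crawler that catalogs everything under the read folder.
# The crawler name, database name, and role are placeholders.
glue.create_crawler(
    Name="movies-crawler",
    Role="AWSGlueServiceRole-demo",
    DatabaseName="my_database",
    Targets={"S3Targets": [{"Path": "s3://glue-bucket-edureka/read/"}]},
)

# Run the crawler; when it finishes, the table definitions appear in the Data Catalog.
glue.start_crawler(Name="movies-crawler")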
