
AWS Professional Services:

Big Data and Analytics


Data Lake on AWS

Vijay K
AWS Partner Trainer
vkkasibh@amazon.com
Course Agenda
• Introduction
• Data Lake overview
• Building Data Lakes on AWS
• Lab 01: Creating a Glue Data Crawler
• Lab 02: Modifying Table Schemas
• Working with Amazon Athena
• Lab 03: Querying your Data Lake with Amazon Athena
• Working with AWS Glue
• Lab 04: Transforming Data with AWS Glue
• Wrap Up & Conclusion

2
Introduction
Introducing UnicornNation
UnicornNation is a global entertainment company that provides ticketing, merchandising, and promotion of large concerts and events.

In recent years, they have been collecting data through a number of disparate systems and want to consolidate this data in a modern data architecture.

A workshop was held with the key stakeholders at UnicornNation. They identified three key data sources they would like to consolidate, and have provided the funding and resources to build a Data Lake on AWS.

During the course of this session, you will build a Data Lake on AWS to meet their requirements and gain experience with a number of core AWS services, including Amazon S3, AWS Glue, and Amazon Athena.

4 Visit aws.training
UnicornNation – Target Architecture

On-premises data sources feed the AWS Cloud: the Merchandise Sales CSV data and the Tickit History data land in Amazon S3, where AWS Glue crawlers register them in the AWS Glue Data Catalog. An AWS Glue ETL job converts the Merchandise Sales data to Parquet, and data analysts query everything through Amazon Athena.
5 Visit aws.training
Handouts
Lab Manual
Step-by-step instructions for completing the hands-on labs – will be sent in an email.

Lab Login Details
Login details for the AWS Console and the S3 bucket configuration – will be included in the same email.

Training Content
Will be sent at the end of tomorrow's training.

6 Visit aws.training
Attendance
To get attendance credit for this course, you must attend
sessions on both days.

7 Visit aws.training
Data Lakes on AWS
What is a Data Lake?
A data lake is a centralized repository that allows you to store all your structured and unstructured data at any scale.

You can store your data as-is, without having to first structure the data, and run different types of analytics—from dashboards and visualizations to big data processing, real-time analytics, and machine learning to guide better decisions.

9 Visit aws.training
Legacy Data Architectures Exist as Isolated Data Silos

Hadoop Cluster | Data Warehouse Appliance | SQL Database

10 Visit aws.training
Legacy Data Architectures Are Monolithic

Hadoop cluster: a master node in front of nodes that each combine CPU, memory, and HDFS storage.
Multiple layers of functionality all on a single cluster.

11 Visit aws.training
Enter Data Lake Architectures

Data Lake is a new and increasingly popular architecture to store and analyze massive volumes and heterogeneous types of data.

12 Visit aws.training
Benefits of a Data Lake – Quick Ingest

"How can I collect data quickly from various sources and store it efficiently?"
Quickly ingest data without needing to force it into a pre-defined schema.

13 Visit aws.training
Benefits of a Data Lake – Storage vs Compute

"How can I scale up with the volume of data being generated?"
Separating your storage and compute allows you to scale each component as required.

14 Visit aws.training
Benefits of a Data Lake – Schema on Read

"Is there a way I can apply multiple analytics and processing frameworks to the same data?"
A Data Lake enables ad-hoc analysis by applying schemas on read, not write.

15 Visit aws.training
Comparison of a Data Lake to an Enterprise Data Warehouse

Data Lake (EMR + S3) | Enterprise DW
Complementary to the EDW (not a replacement) | Data lake can be a source for the EDW
No predefined schemas (schema-on-read) | Predefined schemas (schema-on-write)
Structured / semi-structured / unstructured data | (Typically) structured data only
Fast ingestion of new data/content | (Typically) time-consuming to introduce new content
Data science + prediction/advanced analytics + BI use cases | Traditional BI use cases
Data at all levels (raw, transformed, summary, etc.) | (Typically) summary / aggregated level data only
Decoupled storage & computation | Storage & computation tightly coupled
Flexibility in tools / greater processing capabilities / parallelization of processing | Limited flexibility in tools / limited processing capabilities and windows of opportunity
16 Visit aws.training
Building a Data Lake on AWS
Reference architecture

17 Visit aws.training
Components of a Data Lake
(API & UI / Entitlements / Catalogue & Search / Storage)

Data Storage
• High durability
• Stores raw data from input sources
• Support for any type of data
• Low cost

Ingestion
• Streaming
  • Streamed ingest of feed data
  • Provides the ability to consume any dataset as a stream
  • Facilitates low-latency analytics
• Batch
  • AWS Batch / Snowball
  • Storage Gateway
  • sFTP
18 Visit aws.training
S3 for data lake

Durable: designed for 11 9s of durability
Available: designed for 99.99% availability
High performance: multipart upload, Range GETs
Easy to use: simple REST API, AWS SDKs, read-after-create consistency, event notification, lifecycle policies
Scalable: store as much as you need, scale storage and compute independently, no minimum usage commitments
Integrated: Amazon EMR, Amazon Redshift, Amazon DynamoDB
19 Visit aws.training
Secure data lake on Amazon S3

Amazon S3 access points: multi-tenant buckets with dedicated access points; customer permissions from an Amazon Virtual Private Cloud (Amazon VPC)

S3 Block Public Access: applies across AWS accounts and at the Amazon S3 bucket level; specify public permissions using an Access Control List (ACL) or policy; four settings: BlockPublicAcls, IgnorePublicAcls, BlockPublicPolicy, RestrictPublicBuckets

Amazon S3 object tags: access control, lifecycle policies, and analysis; classify data with metadata; use tags to filter objects; define replication policies; populate tags with AWS Lambda functions or S3 Batch Operations

Amazon S3 object lock: immutable Amazon S3 objects; retention management controls; data protection and compliance
20 Visit aws.training
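The four Block Public Access settings listed above can be applied with a single API call. A minimal boto3 sketch (the bucket name is hypothetical):

import boto3

s3 = boto3.client("s3")

# Apply all four S3 Block Public Access settings to a hypothetical data lake bucket.
s3.put_public_access_block(
    Bucket="example-datalake-bucket",  # hypothetical bucket name
    PublicAccessBlockConfiguration={
        "BlockPublicAcls": True,
        "IgnorePublicAcls": True,
        "BlockPublicPolicy": True,
        "RestrictPublicBuckets": True,
    },
)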
Data Ingestion into Amazon S3

• AWS Direct Connect
• AWS Snowball
• ISV Connectors
• Amazon Kinesis Firehose
• S3 Transfer Acceleration
• AWS Storage Gateway

21 Visit aws.training
Components of a Data Lake
Catalogue
• Metadata (technical & business)
• Classification
• Used for summary statistics and data management

Search
• Simplify discoverability of and access to the data

22 Visit aws.training
Data Catalogue – Metadata Index
• Store data about your Amazon S3 storage environment
• Total size & count of objects by prefix, data classification, refresh schedule, object version information
• Amazon S3 events processed by a Lambda function
• DynamoDB metadata tables store the required attributes

Data Collectors (EC2, ECS) → PUT object → S3 bucket → ObjectCreated / ObjectDeleted events → AWS Lambda → PUT item → Metadata Index (DynamoDB)

23 Visit aws.training
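As a rough illustration of the pattern above, a sketch of the Lambda function that maintains the DynamoDB metadata index. The table name, key schema, and attributes are assumptions, and the actual S3 event names are ObjectCreated:* and ObjectRemoved:*.

import boto3

dynamodb = boto3.resource("dynamodb")
table = dynamodb.Table("datalake-metadata-index")  # hypothetical table keyed on (bucket, key)
s3 = boto3.client("s3")

def handler(event, context):
    """Triggered by S3 ObjectCreated / ObjectRemoved notifications; upserts object metadata."""
    for record in event.get("Records", []):
        bucket = record["s3"]["bucket"]["name"]
        key = record["s3"]["object"]["key"]
        if record["eventName"].startswith("ObjectCreated"):
            head = s3.head_object(Bucket=bucket, Key=key)
            table.put_item(Item={
                "bucket": bucket,
                "key": key,
                "size": head["ContentLength"],
                "last_modified": head["LastModified"].isoformat(),
                "version_id": head.get("VersionId", "null"),
            })
        else:  # ObjectRemoved:*
            table.delete_item(Key={"bucket": bucket, "key": key})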
Catalogue & Search Architecture
Metadata Index
(DynamoDB)
PUT object ObjectCreated, PUT item
ObjectDeleted
Data Collectors S3 bucket AWS Lambda
(EC2, ECS) Update Stream

AWS Lambda
Extract Search Fields

Update Index

Amazon Elasticsearch Service

24 Visit aws.training
AWS Glue Data Catalog
Unified metadata repository across Amazon RDS, Amazon Redshift, and Amazon S3 (the lake house)

• Get a single view into data, no matter where it is stored
• Automatically classify data in a central, searchable list
• Track data evolution using schema versioning
• Query data using Amazon Athena or Amazon Redshift Spectrum
• Hive metastore compatible

More about this topic in the AWS Glue session.
25 Visit aws.training
Components of a Data Lake
Entitlements system
• Encryption
• Authentication
• Authorisation
• Chargeback
• Quotas
• Data masking
• Regional restrictions

26 Visit aws.training
Implement the right cloud security controls

Security: Amazon GuardDuty, AWS Shield, AWS Well-Architected Tool, Amazon Macie, Amazon Virtual Private Cloud (Amazon VPC), AWS Organizations

Identity: AWS Identity and Access Management (IAM), AWS Single Sign-On, Amazon Cloud Directory, Amazon Cognito, AWS Directory Service

Encryption: AWS Certificate Manager (ACM), AWS Key Management Service (AWS KMS), AWS CloudHSM, encryption at rest, encryption in transit, bring your own keys, hardware security module (HSM) support

Compliance: AWS Artifact, Amazon Inspector, AWS CloudTrail
27 Visit aws.training
Components of a Data Lake
API & User Interface
API & UI
• Exposes the data lake to customers
Entitlements • Programmatically query catalogue
• Expose search API
Catalogue & Search • Ensures that entitlements are
respected
Storage

28 Visit aws.training
AWS Solution – Data Lake on AWS
• Reference Architecture deployment via CloudFormation
• Configures core services to tag, search, and catalogue datasets
• Deploys a console to search and browse available datasets

http://amzn.to/2nTVjcp

29 Visit aws.training
Data Lake Usage: Processing & Analytics

Real-time: Elasticsearch Service, Kinesis Analytics, Kinesis Streams, Spark Streaming on EMR, Apache Flink on EMR, Apache Storm on EMR, AWS Lambda, Kinesis Streams & Firehose

Batch: EMR (Hadoop, Spark, Presto), Athena (query service), Redshift (data warehouse)

AI & Predictive: Amazon Lex (speech recognition), Amazon Polly (text to speech), Amazon Rekognition, Amazon Machine Learning (predictive analytics)

Transactional & RDBMS: DynamoDB (NoSQL DB), Aurora (relational database)

BI & Data Visualization
30 Visit aws.training
AWS Quick Starts for Data Lakes
Quick Starts are built by AWS solutions architects and
partners to help you deploy popular solutions on AWS,
based on AWS best practices for security and high
availability.

These reference deployments implement key technologies automatically on the AWS Cloud, often with a single click and in less than an hour. You can build your test or production environment in a few steps, and start using it immediately.

https://aws.amazon.com/quickstart/

31 Visit aws.training
Data Lakes: Summary
• Use S3 as the storage repository for your data lake, instead of a
Hadoop cluster or data warehouse
• Decoupled storage and compute is cheaper and more efficient to
operate
• Decoupled storage and compute allow us to evolve to clusterless
architectures like Athena
• Do not build data silos in Hadoop or the Enterprise DW
• Gain flexibility to use all the analytics tools in the ecosystem around S3 & future-proof the architecture

32 Visit aws.training
Summary of AWS Analytics & AI Tools

Data visualization, engagement, and machine learning: AWS Data Exchange, Amazon QuickSight, Amazon Pinpoint, Amazon SageMaker, Amazon Comprehend, Amazon Polly, Amazon Lex, Amazon Rekognition, Amazon Translate

Analytics: Amazon Redshift, Amazon EMR (Spark and Presto), AWS Glue (Spark and Python), Amazon Athena, Amazon Kinesis Data Analytics, Amazon Elasticsearch Service (Amazon ES)

Data lake infrastructure and management: Amazon Simple Storage Service (Amazon S3), Amazon S3 Glacier, AWS Lake Formation, AWS Glue

Data movement: AWS Database Migration Service (AWS DMS) | AWS Snowball | AWS Snowmobile | Amazon Kinesis Data Firehose | Amazon Kinesis Data Streams | Amazon Managed Streaming for Apache Kafka
33 Visit aws.training
Building Data Lakes
Data Lakes start with S3…

…and some Data Engineering


35 Visit aws.training
What is Data Engineering?

A data engineer transforms data into a useful format for analysis, including:
• Business Intelligence
• Machine Learning
• Deep Learning

36 Visit aws.training
Roles and Responsibilities

Overlapping skill areas across Data Engineering, Data Analytics, Data Science, and DevOps: data stewardship, business understanding, operationalisation and reliability, data pipelines and data structures, statistical analysis and feature engineering, software development, extraction / loading / transformation, data enrichment & preparation, model re/training, application integration, aggregation and normalization, data lakes and data marts, visualization and dashboard design, UX/UI, experimentation and tuning, containerization, web services, edge ML, and production environments.
37 Visit aws.training
ROLE | PRIORITIES | PITFALLS WITHOUT THEM
Data Scientist | Makes sense of data, generates and communicates insights to improve or create business processes, creates predictive ML models to support them | Naïve insights and low model yield; missed opportunities to unveil business value
Data Engineer | Builds scalable pipelines, transforms and loads data into structures complete with metadata that can be readily consumed by DS | Value generation is slow, because DS is spending their time doing infrastructure work
Data Product Manager | Manages data as a product. Ensures freshness and consistency of data; understands lineage and compliance needs; treats DS as customers | Sponsors struggle to see business value; projects miss the mark due to insufficient data
DevOps Engineer | Integrating models with applications, automating model delivery and retraining, measuring ongoing model relevance and accuracy | Slow model deployment, poor performance; premature model obsolescence
Data Analysts | Creating engaging visual and narrative journeys for analytical solutions | Low engagement and adoption from end users; lack of executive support
Business Sponsor | Vetting the prioritization and ROI, funding projects, providing ongoing feedback | If project prioritisation isn't made sensibly, DS do not know when to pull the plug
38 Visit aws.training
Design Patterns

• Data Export – Cloud or On-Premise Database


• Log Aggregation – Various Logs from AWS, internal, etc.
• Real-time Data Collection – Streaming data
• Analytics and Reporting – Aggregating, Consolidating,
Prepping Data

39 Visit aws.training
Hydrating Data Lakes

40 Visit aws.training
Many different options…

AWS DMS AWS Snowball AWS Snowmobile AWS Command AWS Management
Line Interface Console

Amazon Kinesis Amazon Kinesis Amazon Managed AWS Glue Third-party Tools
Data Streams Data Firehose Streaming for
Kafka

41 Visit aws.training
Third Party Tools

Amazon S3

42 Visit aws.training
Cataloging your Data Lake

43 Visit aws.training
Discover and Organize Data Transformations

Databases (Amazon RDS), data warehouses (Amazon Redshift), and data lakes (Amazon S3) are described by the Glue Data Catalog. Glue ETL, Amazon Athena, Amazon Redshift Spectrum, and EMR (Hadoop/Spark) use the catalog for ad-hoc investigation and analytics.

44 Visit aws.training
Glue Data Catalog

Manage table metadata through a web interface, the Hive metastore API, Hive SQL, or automatically through crawlers.

Listening to our customers, we've added:
• Search metadata for data discovery
• Connection info – JDBC URLs, credentials
• Classification for identifying and parsing files
• Schema versioning of table metadata

45 Visit aws.training
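For illustration, a small boto3 sketch of the catalog features above (listing tables and free-text metadata search); the database name and search text are hypothetical, and pagination is ignored.

import boto3

glue = boto3.client("glue")

# List tables registered in a hypothetical catalog database.
for tbl in glue.get_tables(DatabaseName="unicorn_nation")["TableList"]:
    print(tbl["Name"], tbl["StorageDescriptor"]["Location"])

# Free-text metadata search across the catalog (data discovery).
for tbl in glue.search_tables(SearchText="sales")["TableList"]:
    print(tbl["DatabaseName"], tbl["Name"])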
Glue Data Catalog – Table Details

Table Description

Table Properties

Table Schema

Table Partitions

46 Visit aws.training
Glue Data Catalog – Table View Partitions
Partitioning Values

Jump to Objects
or
View Properties

S3 Objects Linked

47 Visit aws.training
Glue Data Catalog – Schema Version Control
List table schema versions

Compare schema versions

48 Visit aws.training
Glue Data Catalog – Editing Schema
Edit schemas to add, remove and update columns and types

49 Visit aws.training
Database Connections

Configure a database connection to be used as a source and destination:
• JDBC
• Amazon RDS and Aurora
• Amazon Redshift
• Amazon DocumentDB
• MongoDB
• Kafka
50 Visit aws.training
Crawlers: Auto-Populate Data Catalog

Run crawlers on-demand and on a schedule to discover new data and schema changes.
Serverless – only pay when crawls run.

Automatic schema inference:
• Built-in classifiers
  • Detect file type
  • Extract schema
  • Identify partitions
• Add your own classifiers
  • Grok
  • XML
  • JSON
  • CSV
51 Visit aws.training
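A minimal boto3 sketch of creating and running a crawler like the one built in Lab 01; the crawler, role, database, and bucket names are hypothetical, and the exclusion strings refer to the pattern table shown a little later.

import boto3

glue = boto3.client("glue")

# Define a crawler over a hypothetical merchandise-sales bucket and run it nightly.
glue.create_crawler(
    Name="merchandise-sales-crawler",
    Role="GlueCrawlerRole",                      # existing IAM role with S3 + Glue access
    DatabaseName="unicorn_nation",
    Targets={"S3Targets": [{
        "Path": "s3://example-merchandise-sales/",
        "Exclusions": ["**.tmp", "**.metadata"],  # exclusion patterns (see table below)
    }]},
    Schedule="cron(0 2 * * ? *)",                # optional: 02:00 UTC daily
)
glue.start_crawler(Name="merchandise-sales-crawler")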
Crawlers: Classifiers

An AWS Glue crawler assumes an AWS IAM role and connects to the data stores to be classified:
• JDBC connection – databases: MySQL, MariaDB, PostgreSQL, Amazon Aurora, Oracle, Amazon Redshift
• NoSQL connection – Amazon DynamoDB
• Object connection – Amazon S3

Built-in classifiers: Avro, Parquet, ORC, XML, JSON and JSONPaths, AWS CloudTrail, BSON, logs, delimited files … always growing …

52 Visit aws.training
Detecting partitions

An S3 bucket hierarchy such as month=Nov/date=10/…/file N maps to a table definition with partition columns:

Column | Type
month  | str
date   | str
col 1  | int
col 2  | float

The crawler estimates schema similarity among files at each level (for example, sim=.93, .99, .95) to handle semi-structured logs, schema evolution, and so on.
53 Visit aws.training
Exclusion Patterns

Exclude pattern | Description
*.csv | Matches an Amazon S3 path that represents an object name ending in .csv
*.* | Matches all object names that contain a dot
*.{csv,avro} | Matches object names ending with .csv or .avro
foo.? | Matches object names starting with foo. that are followed by a single-character extension
/myfolder/* | Matches objects in one level of subfolder from myfolder, such as /myfolder/mysource
/myfolder/*/* | Matches objects in two levels of subfolders from myfolder, such as /myfolder/mysource/data
/myfolder/** | Matches objects in all subfolders of myfolder, such as /myfolder/mysource/mydata and /myfolder/mysource/data
Market* | Matches tables in a JDBC database with names that begin with Market, such as Market_us and Market_fr

54 Visit aws.training
Lab 01

Creating a Glue Data Crawler

55
Lab 01: Creating a Glue Data Crawler
The IT team at UnicornNation has exported Merchandise Sales data from their finance system and transferred this data to a single S3 bucket. There are multiple .CSV files in the bucket, each representing a month's worth of data.

In this lab, you are going to create a Glue Data Crawler to crawl
across this bucket. Once you have created the crawler, you are
going to run it to determine the tables that are located in the bucket
and add their definitions to the Glue Data Catalog.
Sales-jan18.csv

Sales-feb18.csv

Sales-mar18.csv

Data Catalog

56 Visit aws.training
Lab 02

Modifying Table Schemas

57
Lab 02: Modifying Table Schemas
The IT team at UnicornNation has extracted historical data from their ticketing system, named "Tickit", which processes the majority of transactions for the company. This data source is known as the Tickit History.
They have stored this data in an S3 bucket and have created
folder/prefixes for each table of data they have exported.
In this lab, we are going to create a Glue Data Crawler to crawl across this bucket. Once we have created the crawler, we are going to run it to determine the tables that are located in the bucket (such as category, events, and listings) and add their definitions to the Glue Data Catalog.

Data Catalog

58 Visit aws.training
Working with Amazon Athena
Amazon Athena is an interactive query service
that makes it easy to analyze data directly from
Amazon S3 using Standard SQL

60 Visit aws.training
Benefits of Athena
• Decouple storage from compute
• Serverless – No infrastructure or resources to manage
• Pay only for data scanned
• Schema on read – Same data, many views
• Secure – IAM for authentication; Encryption at rest & in
transit
• Standard compliant and open storage formats
• Built on powerful community supported OSS solutions

61 Visit aws.training
Familiar Technologies Under the Covers

Apache Hive – used for DDL functionality
• Complex data types
• Multitude of formats
• Supports data partitioning

Presto – used for SQL queries
• In-memory distributed query engine
• ANSI-SQL compatible with extensions

62 Visit aws.training
A Better Model

Old Methodology:
• Analyst asks for a report
• Developer writes code
• Code executes on a shared cluster for several hours
• Analyst reviews report
• Analyst asks for more…

With Amazon Athena:
• Analyst creates table
• Analyst iterates
• Generate final report

Simple, Quick and No Infrastructure or Developer to Manage

63 Visit aws.training
Simple Pricing
• DDL operations – FREE

• SQL operations – FREE

• Query concurrency – FREE

• Data scanned - $5 / TB

• Standard S3 rates for storage, requests, and data transfer apply

64 Visit aws.training
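A quick worked example of the $5/TB figure above; the compression ratio is only an illustrative assumption, and 1 TB is treated as 2^40 bytes for simplicity.

# Athena bills $5 per TB of data scanned (rounded up, with a 10 MB minimum per query).
PRICE_PER_TB = 5.00

def athena_cost(bytes_scanned: float) -> float:
    return bytes_scanned / 1024 ** 4 * PRICE_PER_TB

print(athena_cost(1 * 1024 ** 4))     # scanning 1 TB of raw CSV       -> $5.00
print(athena_cost(0.25 * 1024 ** 4))  # same data as compressed Parquet (~250 GB, assumed) -> ~$1.25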
Cost Monitoring
• Billing console provides spend per account
• Athena APIs are logged in CloudTrail
• Combine CloudTrail logs and the Athena API to derive per-IAM-user cost

65 Visit aws.training
Security and Access Control
• Encryption – SSE, SSE-KMS, CSE-KMS
• Auto detect source bucket KMS key
• Destination bucket may use separate key

• Access Control
• IAM
• S3 ACL
• S3 bucket policies

• Integrated with Glue Data Catalog
  • Database level
  • Table level

66 Visit aws.training
When to Use Athena
• Use Athena for easy access to data in Amazon S3
• Use it alongside EMR, Redshift & Redshift Spectrum to support different use cases
• Athena is free when not in use, so there is no cost in simply defining tables in case you need the data
• Easy to query operational logs, e.g. ALB, ELB, S3, CloudTrail, CloudFront, etc.
67 Visit aws.training
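To make the query flow concrete, a hedged boto3 sketch that submits a query against a catalogued table and reads back the results; the database, table, and results-bucket names are hypothetical.

import time
import boto3

athena = boto3.client("athena")

# Submit a query against a hypothetical catalogued log table.
qid = athena.start_query_execution(
    QueryString="SELECT eventname, count(*) AS cnt FROM cloudtrail_logs "
                "GROUP BY eventname ORDER BY cnt DESC LIMIT 10",
    QueryExecutionContext={"Database": "unicorn_nation"},
    ResultConfiguration={"OutputLocation": "s3://example-athena-results/"},
)["QueryExecutionId"]

# Poll until the query finishes, then fetch the result rows.
while True:
    state = athena.get_query_execution(QueryExecutionId=qid)["QueryExecution"]["Status"]["State"]
    if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
        break
    time.sleep(1)

if state == "SUCCEEDED":
    for row in athena.get_query_results(QueryExecutionId=qid)["ResultSet"]["Rows"]:
        print([col.get("VarCharValue") for col in row["Data"]])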
Lab 03

Querying your Data Lake with Amazon Athena

68
Lab 03: Querying your Data Lake with Amazon Athena
The ticketing team at UnicornNation has heard about the work you
did in setting up the data catalog. They have some data that they
need urgently and need your help in setting up and running these
queries.

In this lab, you are going to use Amazon Athena to create some
queries, which you will then save to make it easy for users to run
and consume.

As part of the lab, you will also be running your own queries to help
users answer some basic questions around ticket sales, customers
and more.

69 Visit aws.training
Working with AWS Glue
AWS Glue
• AWS Glue is a fully managed ETL (extract, transform, and load) service

• Categorize your data, clean it, enrich it and move it reliably between
various data stores

• Once catalogued, your data is immediately searchable and queryable across your data silos

• Simple and cost-effective

• Serverless; runs on a fully managed, scale-out Spark environment

71 Visit aws.training
AWS Glue

AWS Glue provides job scheduling and orchestration together with the AWS Glue Data Catalog, moving data between data stores – from a data source to a data target.
72 Visit aws.training
ETL Engine
Why would AWS get into the ETL space?

73 Visit aws.training
We have lots of ETL Partners

Fivetran

74 Visit aws.training
Customers are still hand-coding ETL

Why do we do so much hand-coding?


• Code is flexible
• Code is powerful
• You can unit test
• You can deploy with other code
• You know your development tools

75 Visit aws.training
Customers are still hand-coding ETL

But there are downsides to hand-coding:
• Hand-coding involves a lot of undifferentiated heavy lifting
• Hand-coding is brittle, error-prone and laborious
• Especially as data formats change and target schemas change

76 Visit aws.training
AWS Glue
• AWS Glue automates the undifferentiated heavy lifting of ETL
• Discover and organize data, regardless of where it lives
• Focus on writing transformations, not handling undifferentiated heavy lifting
• ETL jobs run under a serverless execution model
77 Visit aws.training
AWS Glue

1. Triggers start an ETL job.
2. The job extracts data from the data sources (Amazon S3, Amazon RDS, Amazon Redshift, Amazon DynamoDB, Amazon DocumentDB, JDBC).
3. The job's script transforms the data.
4. The job loads data into the data targets (the same range of stores).
5. The job writes statistics to the AWS Glue Data Catalog.

Everything runs serverless.
78 Visit aws.training
AWS Glue: components

AWS Glue Data Catalog
• Hive metastore compatible with enhanced functionality
• Crawlers automatically extract metadata and create tables
• Integrates with Amazon Athena, Amazon EMR, and many more

Job authoring
• Generate ETL code
• Build on open frameworks – Python/Scala and Apache Spark
• Developer-centric – editing, debugging, sharing

Job execution
• Run jobs on a serverless Spark platform
• Use flexible scheduling, job monitoring, and alerting

Job workflow
• Orchestrate triggers, crawlers, and jobs
• Author and monitor entire flows and integrated alerting

79 Visit aws.training
Job authoring in Glue

You have choices on how to get started:
• Script generated by AWS Glue
• Existing script brought into AWS Glue
• Blank script authored by you

80 Visit aws.training
Choose a Data Source

Once the crawler completes updating the Data Catalog, those tables become sources for Glue ETL.

81 Visit aws.training
Choose a Data Destination
Write output results into an existing table.

Write output results into a new location


• S3 Bucket
• JDBC Connection

82 Visit aws.training
Automatic code generation

Existing columns
in target
Can extend/add
new columns to
target

83 Visit aws.training
Automatic code generation

1. Customize the mappings


2. Glue generates transformation graph and Python/Scala code
3. Specify trigger condition
84 Visit aws.training
Glue ETL scripts are forgiving and flexible

• Human-readable code run on a scalable platform
• Forgiving in the face of failures – handles bad data and crashes
• Flexible: handles complex semi-structured data, and adapts to source schema changes
85 Visit aws.training
Write your own scripts
• Convert to a Spark DataFrame
• Run with custom code and libraries
• Convert back to a Glue DynamicFrame – see the sketch below

86 Visit aws.training
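A minimal sketch of that round trip inside a Glue ETL script, assuming a hypothetical catalog database, table, and column name:

from awsglue.context import GlueContext
from awsglue.dynamicframe import DynamicFrame
from pyspark.context import SparkContext

glue_context = GlueContext(SparkContext.getOrCreate())

dyf = glue_context.create_dynamic_frame.from_catalog(
    database="unicorn_nation", table_name="merchandise_sales_csv")

# 1. Convert to a Spark DataFrame and run arbitrary Spark / custom library code...
df = dyf.toDF()
df = df.filter(df["quantity"] > 0).dropDuplicates()   # hypothetical column

# 2. ...then convert back to a DynamicFrame to keep using Glue transforms and writers.
dyf_clean = DynamicFrame.fromDF(df, glue_context, "dyf_clean")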
Add Custom Modules and Files
• Add external Python modules
• Java JARs required by the script
• Additional files such as configuration, etc. – see the sketch below

87 Visit aws.training
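These extras are passed through Glue's special job parameters. A hedged boto3 sketch (job name, role, script location, and S3 paths are hypothetical):

import boto3

glue = boto3.client("glue")

# Attach extra Python modules, JARs, and configuration files when defining a job.
glue.create_job(
    Name="sales-transform",
    Role="GlueJobRole",
    Command={"Name": "glueetl", "ScriptLocation": "s3://example-scripts/sales_transform.py"},
    DefaultArguments={
        "--extra-py-files": "s3://example-libs/mymodule.zip",
        "--extra-jars": "s3://example-libs/custom-connector.jar",
        "--extra-files": "s3://example-libs/settings.json",
    },
)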
Glue transformations are flexible and adaptive
A semi-structured schema (columns A, B, B, a struct C{X, Y}, and an array D[ ]) becomes a relational schema: a main table with columns A, B, B, C.X, C.Y, FK plus a pivoted table with PK, offset, and value.

• Flatten semi-structured objects with arbitrary complexity into relational tables, on-the-fly.
• Pivot arrays and other collection types into a separate table, generating key/foreign-key values.
• Modify mapping as the source schema changes, and modify the target schemas as needed.

88 Visit aws.training
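The flattening and pivoting described above is exposed as the Relationalize transform. A sketch, assuming a hypothetical nested catalog table and scratch path:

from awsglue.context import GlueContext
from awsglue.transforms import Relationalize
from pyspark.context import SparkContext

glue_context = GlueContext(SparkContext.getOrCreate())
dyf = glue_context.create_dynamic_frame.from_catalog(
    database="unicorn_nation", table_name="tickit_history")   # hypothetical nested table

# Flatten nested structs and pivot arrays into separate tables linked by
# generated key / foreign-key columns.
tables = Relationalize.apply(
    frame=dyf,
    staging_path="s3://example-temp/relationalize/",           # scratch space for the transform
    name="root",
    transformation_ctx="relationalized",
)
print(list(tables.keys()))          # "root" plus one table per pivoted array
root = tables.select("root")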
Glue transformations…

• Pre-built transformations: click and add to your job with simple configuration
• Spigot writes sample data from a DynamicFrame to S3 in JSON format
• Expanding… more transformations to come

89 Visit aws.training
ETL Job Progress and History

Track ETL job progress and inspect logs directly from the console.

Logs are written to CloudWatch for simple access. Errors are automatically extracted and
presented in the Error Logs for easy troubleshooting of jobs.

90 Visit aws.training
Orchestration & resource management
Fully managed, serverless job execution

91 Visit aws.training
Job Bookmarks…

Glue keeps track of data that has already been processed by a previous run of an ETL job. This persisted state information is called a bookmark, and it determines the job's behavior:

Option  | Behavior
Enable  | Pick up from where you left off
Disable | Ignore and process the entire dataset every time
Pause   | Temporarily disable continuation but keep the bookmark

For example, your ETL job might read new partitions in an Amazon S3 file. AWS Glue keeps track of which partitions have successfully been processed by the job to prevent duplicate processing and duplicate data in the job's target data store.

92 Visit aws.training
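The bookmark behavior is chosen per job run. A small boto3 sketch (the job name is hypothetical):

import boto3

glue = boto3.client("glue")

# Select the bookmark option for this run; the script itself must call
# job.init()/job.commit() for bookmark state to be recorded.
glue.start_job_run(
    JobName="sales-transform",
    Arguments={"--job-bookmark-option": "job-bookmark-enable"},  # or job-bookmark-disable / job-bookmark-pause
)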
Job composition and triggers

Compose jobs globally with event-based dependencies
• Easy to reuse and leverage work across organization boundaries

Multiple triggering mechanisms
• Schedule-based: e.g., time of day, weekly
• Event-based: e.g., data availability, job completion
• External sources: e.g., AWS Lambda

Example flow: event-based (Lambda), data-based, and weekly schedule-based triggers feed jobs such as "Marketing: Ad-spend by customer segment" and "Sales: Revenue by customer segment", which roll up into "Central: ROI by customer segment". (See the sketch below.)

93 Visit aws.training
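A hedged boto3 sketch of two of the trigger types mentioned above, using hypothetical job names:

import boto3

glue = boto3.client("glue")

# Schedule-based trigger: run the sales job nightly.
glue.create_trigger(
    Name="nightly-sales",
    Type="SCHEDULED",
    Schedule="cron(0 3 * * ? *)",
    Actions=[{"JobName": "sales-transform"}],
    StartOnCreation=True,
)

# Event-based trigger: run the ROI job only after the sales job succeeds.
glue.create_trigger(
    Name="roi-after-sales",
    Type="CONDITIONAL",
    Predicate={"Conditions": [{
        "LogicalOperator": "EQUALS",
        "JobName": "sales-transform",
        "State": "SUCCEEDED",
    }]},
    Actions=[{"JobName": "roi-by-segment"}],
    StartOnCreation=True,
)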
Serverless job execution

There is no need to provision, configure, or manage servers:
• Warm pools: pre-configured fleets of instances to reduce job start-up time
• Auto-configure VPC and role-based access
• Automatically scale resources to meet SLA and cost objectives
• You pay only for the resources you consume, while you consume them

A warm pool of instances serves jobs running against customer VPCs.

94 Visit aws.training
Developer Endpoints

• Environment to iteratively develop and test ETL scripts on Glue instances
• Develop your script in a notebook and point to an AWS Glue endpoint to test it
• When you are satisfied with the results of your development process, you can create an ETL job that runs your script

A development notebook in the customer VPC connects through an S3 endpoint to the Glue instances and to stores such as Amazon RDS and Amazon Redshift.

95 Visit aws.training


Glue ETL Security Model

• ETL jobs that do not require special handling can be launched without additional configuration

• For jobs requiring restricted access to data, Glue utilizes a VPC endpoint, launching dedicated ENIs into the customer's VPC; these are assigned private IP addresses from the customer's subnet

• For such jobs, Glue also enforces the use of an S3 VPC endpoint

96 Visit aws.training
AWS Lake Formation

97 Visit aws.training
Building a secure data lake
Typical steps to build a secure data lake
1. Set up storage
2. Move data
3. Cleanse, prepare, and catalog data
4. Configure and enforce security and compliance policies
5. Make data available for analytics

Roles involved: the data engineer (ingestion and cleaning), the data security officer (security), and the data analyst (analytics and machine learning).

98 Visit aws.training
Lake Formation for a secure data lake
1. Ingest and organize – automates creating the data lake and data ingestion.
2. Secure and control – sets up fine-grained access control and data governance. To protect data, all access is checked against the set policies.
3. Collaborate and use – search and data discovery using Data Catalog metadata.
4. Monitor and audit – based on data access and governance policies, alert notifications are raised on policy violation and logged.

99 Visit aws.training
AWS Lake Formation builds on AWS Glue
AWS Lake Formation adds blueprints, security, search, collaboration, and monitoring on top of AWS Glue: workflows, the AWS Glue Data Catalog (connections, databases, tables), AWS Glue ETL jobs, and AWS Glue crawlers.

100 Visit aws.training
AWS Lake Formation benefits

Analytics services (Amazon Athena, Amazon QuickSight, Amazon Redshift, Amazon SageMaker, and Amazon EMR) sit on top of AWS Lake Formation (blueprints, ML transforms, Data Catalog, access control), which in turn sits on Amazon S3 data lake storage.

• A comprehensive set of integrated tools enables every user equally.
• Centralized management of fine-grained permissions empowers security officers.
• Simplified ingest and cleaning enables data engineers to build faster.
• Cost-effective, durable storage includes global replication capabilities.

101 Visit aws.training
Lab 04

Transforming Data with AWS Glue

102
Lab 04: Transforming Data with AWS Glue

The IT team at UnicornNation is looking for a way to reduce their AWS spending for this project. After reviewing the file formats they are using, they have decided that for the Merchandise Sales data, they are going to change the file format from CSV to Parquet.

Parquet is a columnar, compressed format and will help them reduce the amount of data that is scanned when using Amazon Athena.

In this lab, you are going to create an S3 bucket for the new Parquet files and then create a Glue Job to convert the existing CSV data files to Parquet.

103 Visit aws.training
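For orientation, a minimal sketch of the kind of Glue job script Lab 04 builds, reading the catalogued CSV table and writing Parquet; the database, table, and bucket names are hypothetical.

import sys
from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glue_context = GlueContext(SparkContext.getOrCreate())
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

# Read the catalogued CSV table...
dyf = glue_context.create_dynamic_frame.from_catalog(
    database="unicorn_nation", table_name="merchandise_sales_csv")

# ...and write it back out as Parquet to a new bucket.
glue_context.write_dynamic_frame.from_options(
    frame=dyf,
    connection_type="s3",
    connection_options={"path": "s3://example-merchandise-sales-parquet/"},
    format="parquet",
)

job.commit()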
Q&A
Class Evaluation

HEADS UP! Please look for the email link to take the class evaluation survey.

105 Visit aws.training
Thank You

Vijay K
AWS Partner Trainer
vkkasibh@amazon.com
