Vijay K
AWS Partner Trainer
vkkasibh@amazon.com
Course Agenda
Introduction
Data Lake Overview
Building Data Lakes on AWS
Lab 01: Creating a Glue Data Crawler
Lab 02: Modifying Table Schemas
Working with Amazon Athena
Lab 03: Querying your Data Lake with Amazon Athena
Working with AWS Glue
Lab 04: Transforming Data with AWS Glue
Wrap Up & Conclusion
Introduction
Introducing UnicornNation
UnicornNation is a global entertainment company that provides
ticketing, merchandising and promotion of large concerts and events.
In recent years, they have been collecting data through a number of
disparate systems and want to consolidate this data in a modern data
architecture.
A workshop was held with the key stakeholders at UnicornNation, and they identified
three key data sources they would like to consolidate. They have provided the
funding and resources to build a Data Lake on AWS.
During the course of this session, you will be building a Data Lake on AWS to meet
their requirements and gain experience with a number of core AWS services,
including Amazon S3, AWS Glue, and Amazon Athena.
4 Visit aws.training
UnicornNation - Target Architecture
(Diagram: on-premises data sources land in the AWS Cloud, where AWS Glue crawlers – Merchandise Sales-CSV, Merchandise Sales-Parquet, and Tickit-History – populate the AWS Glue Data Catalog. AWS Glue ETL transforms the data, and data analysts query it with Amazon Athena.)
Handouts
Lab Manual
Step-by-step instructions for completing the hands-on labs – will be sent in an email
Training Content
Will be sent at the end of tomorrow's training.
Attendance
To get attendance credit for this course, you must attend
sessions on both days.
Data Lakes on AWS
What is a Data Lake?
A data lake is a centralized repository
that allows you to store all your structured
and unstructured data at any scale.
Legacy Data Architectures Exist as Isolated Data Silos
Legacy Data Architectures Are Monolithic
Enter Data Lake Architectures
Benefits of a Data Lake – Quick Ingest
Benefits of a Data Lake – Storage vs Compute
“How can I scale up with the volume of data being generated?”
Separating your storage and compute allows you to scale each component as required.
Benefits of a Data Lake – Schema on Read
“Is there a way I can apply multiple analytics and processing frameworks to the same data?”
A Data Lake enables ad-hoc analysis by applying schemas on read, not on write.
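The schema-on-read idea above can be sketched in a few lines of Python: the same raw record is stored once, and each consumer applies its own schema at read time. This is purely illustrative; the field names are hypothetical.

```python
import csv
import io

# The same raw CSV data, stored once (e.g., as an object in S3).
raw = "2018-01-15,tshirt,2,19.99\n2018-01-16,cap,1,9.99\n"

# Schema applied by a sales-analytics consumer at read time.
sales_schema = ["date", "item", "quantity", "unit_price"]
sales = [dict(zip(sales_schema, row)) for row in csv.reader(io.StringIO(raw))]

# A different consumer reads the same bytes with its own, narrower schema,
# caring only about the first two fields.
inventory = [{"day": row[0], "sku": row[1]} for row in csv.reader(io.StringIO(raw))]

print(sales[0]["unit_price"])   # "19.99"
print(inventory[1]["sku"])      # "cap"
```

Neither consumer required the data to be rewritten – each view exists only at query time.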
Comparison of a Data Lake to an Enterprise Data Warehouse
EMR + S3 (data lake) vs. Enterprise DW:
• Complementary to the EDW (not a replacement); the data lake can be a source for the EDW
• Data lake: flexibility in tools, greater processing capabilities, parallelization of processing
• EDW: limited flexibility in tools, limited processing capabilities and windows of opportunity
Building a Data Lake on AWS
Reference architecture
Components of a Data Lake
(Component layers: API & UI; Entitlements; Catalogue & Search; Storage)

Data Storage
• High durability
• Stores raw data from input sources
• Support for any type of data
• Low cost

Ingestion
• Streaming
  • Streamed ingest of feed data
  • Provides the ability to consume any dataset as a stream
  • Facilitates low-latency analytics
• Batch
  • AWS Batch / Snowball
  • Storage Gateway
  • SFTP
S3 for data lake
Amazon S3
Easy to use: simple REST API; AWS SDKs; read-after-write consistency for new objects; event notifications; lifecycle policies
Scalable: store as much as you need; scale storage and compute independently; no minimum usage commitments
Integrated: Amazon EMR, Amazon Redshift, Amazon DynamoDB
Secure data lake on Amazon S3
Amazon S3 Access Points
• Multi-tenant buckets with dedicated access points
• Customer permissions from an Amazon Virtual Private Cloud (Amazon VPC)

S3 Block Public Access
• Applies across AWS accounts and at the Amazon S3 bucket level
• Blocks public permissions specified using an Access Control List (ACL) or policy
• Four settings: BlockPublicAcls, IgnorePublicAcls, BlockPublicPolicy, RestrictPublicBuckets

Amazon S3 object tags
• Access control, lifecycle policies, and analysis
• Classify data with metadata; use tags to filter objects and define replication policies
• Populate tags with AWS Lambda functions or S3 Batch Operations

Amazon S3 Object Lock
• Immutable Amazon S3 objects
• Retention management controls
• Data protection and compliance
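As a sketch, the four Block Public Access settings map directly onto the S3 API's PublicAccessBlockConfiguration. The helper below only builds the request body (the bucket name is hypothetical); in practice you would pass it to `s3.put_public_access_block`.

```python
def block_public_access_request(bucket: str) -> dict:
    """Build a PutPublicAccessBlock request that enables all four settings."""
    return {
        "Bucket": bucket,
        "PublicAccessBlockConfiguration": {
            "BlockPublicAcls": True,        # reject new public ACLs
            "IgnorePublicAcls": True,       # ignore any existing public ACLs
            "BlockPublicPolicy": True,      # reject new public bucket policies
            "RestrictPublicBuckets": True,  # restrict access to public buckets
        },
    }

request = block_public_access_request("unicornnation-data-lake")
# With boto3 this would be applied as:
#   import boto3
#   boto3.client("s3").put_public_access_block(**request)
```

Enabling all four settings is the conservative default for a data-lake bucket that should never be public.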
Data Ingestion into Amazon S3
Components of a Data Lake
Catalogue
Catalogue
• Metadata (technical & business)
• Used for summary statistics and data classification management

Search
• Simplifies discoverability of and access to the data
Data Catalogue – Metadata Index
• Store data about your Amazon S3 storage environment
• Total size & count of objects by prefix, data classification,
refresh schedule, object version information
• Amazon S3 events processed by Lambda function
• DynamoDB metadata tables store required attributes
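The pattern above – an S3 event triggers a Lambda function that writes object metadata into a DynamoDB index – can be sketched as a pure function that turns one S3 event record into a DynamoDB item. The attribute names here are hypothetical, not part of any AWS schema.

```python
def metadata_item(s3_event_record: dict) -> dict:
    """Turn one S3 event record into a DynamoDB metadata item."""
    bucket = s3_event_record["s3"]["bucket"]["name"]
    obj = s3_event_record["s3"]["object"]
    key = obj["key"]
    return {
        "pk": f"{bucket}/{key}",  # partition key: full object path
        "prefix": key.rsplit("/", 1)[0] if "/" in key else "",
        "size_bytes": obj.get("size", 0),
        "event": s3_event_record.get("eventName", "unknown"),
    }

# A trimmed-down ObjectCreated event, as delivered to Lambda:
record = {
    "eventName": "ObjectCreated:Put",
    "s3": {
        "bucket": {"name": "unicorn-data-lake"},
        "object": {"key": "merch/sales-jan18.csv", "size": 1024},
    },
}
item = metadata_item(record)
# In the real handler this item would be written with
#   boto3.resource("dynamodb").Table("MetadataIndex").put_item(Item=item)
```

Keeping the transformation pure makes the Lambda handler trivially testable without AWS credentials.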
Catalogue & Search Architecture
(Diagram: data collectors on EC2 and ECS PUT objects into an S3 bucket; ObjectCreated and ObjectDeleted events invoke an AWS Lambda function that PUTs items into the metadata index in DynamoDB; a second AWS Lambda function reads the DynamoDB update stream, extracts search fields, and updates the search index.)
AWS Glue Data Catalog
Unified metadata repository
Components of a Data Lake
Entitlements system
• Encryption
• Authentication
• Authorisation
• Chargeback
• Quotas
• Data masking
• Regional restrictions
Implement the right cloud security controls
• Amazon GuardDuty
• AWS Identity and Access Management (IAM)
• AWS Certificate Manager (ACM)
• AWS Artifact
• AWS Shield
• Amazon Inspector
• AWS Single Sign-On
• AWS Key Management Service (AWS KMS)
• AWS CloudHSM
• AWS Well-Architected Tool
• Amazon Cloud Directory
• Amazon Macie
• Amazon Cognito
• Amazon Virtual Private Cloud (Amazon VPC)
• AWS Directory Service
• AWS CloudTrail
• AWS Organizations
• Encryption at rest and in transit
• Bring your own keys, hardware security module (HSM) support
Components of a Data Lake
API & User Interface
• Exposes the data lake to customers
• Programmatically query the catalogue
• Expose a search API
• Ensures that entitlements are respected
AWS Solution - Data Lake on AWS
Reference Architecture deployment
via CloudFormation
http://amzn.to/2nTVjcp
Data Lake Usage Processing & Analytics
Real-time
• Amazon Kinesis Streams & Firehose
• Spark Streaming on EMR
• Apache Flink on EMR
• Apache Storm on EMR
• AWS Lambda
• Kinesis Analytics
• Amazon Elasticsearch Service

Batch
• EMR (Hadoop, Spark, Presto)
• Athena (query service)
• Redshift (data warehouse)
• DynamoDB (NoSQL DB)
• Aurora (relational database)
• Amazon Machine Learning, Rekognition (predictive analytics)
AWS Quick Starts for Data Lakes
Quick Starts are built by AWS solutions architects and
partners to help you deploy popular solutions on AWS,
based on AWS best practices for security and high
availability.
https://aws.amazon.com/quickstart/
Data Lakes: Summary
• Use S3 as the storage repository for your data lake, instead of a
Hadoop cluster or data warehouse
• Decoupled storage and compute is cheaper and more efficient to
operate
• Decoupled storage and compute allow us to evolve to clusterless
architectures like Athena
• Do not build data silos in Hadoop or the Enterprise DW
• Gain flexibility to use all the analytics tools in the ecosystem around
S3 & future proof the architecture
Summary of AWS Analytics & AI Tools
Data visualization, engagement, and machine learning
AWS Data Exchange | Amazon QuickSight | Amazon Pinpoint | Amazon SageMaker | Amazon Comprehend | Amazon Polly | Amazon Lex | Amazon Rekognition | Amazon Translate

Analytics
Amazon Redshift | Amazon EMR (Spark and Presto) | AWS Glue (Spark and Python) | Amazon Athena | Amazon Kinesis Data Analytics | Amazon Elasticsearch Service (Amazon ES)

Data movement
AWS Database Migration Service (AWS DMS) | AWS Snowball | AWS Snowmobile | Amazon Kinesis Data Firehose | Amazon Kinesis Data Streams | Amazon Managed Streaming for Apache Kafka
Building Data Lakes
Data Lakes start with S3…
• Business Intelligence
• Machine Learning
• Deep Learning
Roles and Responsibilities
Role: Data Scientist
Priorities: makes sense of data; generates and communicates insights to improve or create business processes; creates predictive ML models to support them
Pitfalls without them: naïve insights and low model yield; missed opportunities to unveil business value
Design Patterns
Hydrating Data Lakes
Many different options…
AWS DMS | AWS Snowball | AWS Snowmobile | AWS Command Line Interface | AWS Management Console
Amazon Kinesis Data Streams | Amazon Kinesis Data Firehose | Amazon Managed Streaming for Kafka | AWS Glue | Third-party tools
Third Party Tools
Cataloging your Data Lake
Discover and Organize Data Transformations
(Diagram: the AWS Glue Data Catalog describes data held in databases (Amazon RDS), data warehouses (Amazon Redshift), and Amazon S3. Glue ETL, Amazon Athena, Amazon EMR (Hadoop/Spark), and Amazon Redshift all use the catalog for transformations, ad-hoc investigation, and data-warehouse loads.)
Glue Data Catalog
Glue Data Catalog – Table Details
Table Description
Table Properties
Table Schema
Table Partitions
Glue Data Catalog – Table View Partitions
• Partitioning values
• Jump to objects or view properties
• Linked S3 objects
Glue Data Catalog – Schema Version Control
List table schema versions
Glue Data Catalog – Editing Schema
Edit schemas to add, remove and update columns and types
Database Connections
Configure database connections to be used as sources and destinations:
• JDBC
• Amazon Redshift
• Amazon DocumentDB
• MongoDB
• Kafka
Crawlers: Auto-Populate Data Catalog
Run crawlers on-demand and on a schedule to discover new data and schema changes.
Serverless – only pay when crawls run.
Custom classifiers can be written using:
• Grok
• XML
• JSON
• CSV
Crawlers: Classifiers
(Diagram: the AWS Glue crawler assumes an AWS IAM role and reaches data stores through a JDBC connection (databases such as MySQL, MariaDB, PostgreSQL, Amazon Aurora, Oracle, and Amazon Redshift), a NoSQL connection (Amazon DynamoDB), and an object connection (Amazon S3, AWS CloudTrail logs).)

Built-in classifiers include:
• Avro, Parquet, ORC
• XML, JSON and JSONPaths, BSON
• Logs and delimited formats
• …always growing…
Detecting partitions
(Diagram: the crawler walks the S3 folder tree – for example month=Nov/date=10 … date=15 – comparing file schemas at each level with similarity scores (0.93–0.99). It infers one combined table schema with partition columns (month: str, date: str) alongside the data columns (col 1: int, col 2: float).)
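Hive-style partition detection can be illustrated with a small parser that pulls key=value path segments out of an S3 key – a simplified sketch of the partition columns the crawler infers, not Glue's actual algorithm:

```python
def partitions_from_key(key: str) -> dict:
    """Extract Hive-style partition columns (key=value path segments) from an S3 key."""
    parts = {}
    for segment in key.split("/")[:-1]:  # skip the file name itself
        if "=" in segment:
            name, _, value = segment.partition("=")
            parts[name] = value
    return parts

print(partitions_from_key("sales/month=Nov/date=10/file1.csv"))
# {'month': 'Nov', 'date': '10'}
```

Laying data out this way lets query engines prune whole partitions instead of scanning every object.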
Lab 01
Lab 01: Creating a Glue Data Crawler
The IT team at UnicornNation has exported Merchandise Sales data from their finance system and transferred this data to a single S3 bucket. There are multiple .CSV files in the bucket, each representing a month's worth of data.
In this lab, you are going to create a Glue Data Crawler to crawl
across this bucket. Once you have created the crawler, you are
going to run it to determine the tables that are located in the bucket
and add their definitions to the Glue Data Catalog.
(Diagram: Sales-jan18.csv, Sales-feb18.csv, Sales-mar18.csv → Data Catalog)
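As a hedged sketch of what this lab does through the console: with the AWS SDK, creating and starting a crawler comes down to a CreateCrawler call followed by StartCrawler. The helper below only builds the request (the crawler name, role ARN, database, and bucket path are hypothetical); the commented lines show how it would be sent with boto3.

```python
def crawler_request(name: str, role_arn: str, database: str, s3_path: str) -> dict:
    """Build an AWS Glue CreateCrawler request for a single S3 target."""
    return {
        "Name": name,
        "Role": role_arn,           # IAM role the crawler assumes
        "DatabaseName": database,   # catalog database for discovered tables
        "Targets": {"S3Targets": [{"Path": s3_path}]},
    }

request = crawler_request(
    name="merch-sales-crawler",
    role_arn="arn:aws:iam::123456789012:role/GlueCrawlerRole",
    database="unicorn_sales",
    s3_path="s3://unicorn-merch-sales/",
)
# With boto3:
#   glue = boto3.client("glue")
#   glue.create_crawler(**request)
#   glue.start_crawler(Name=request["Name"])
```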
Lab 02
Lab 02: Modifying Table Schemas
The IT team at UnicornNation has extracted historical data from their ticketing system, named “Tickit”, which processes the majority of transactions for the company. This data source is known as the Tickit History.
They have stored this data in an S3 bucket and have created
folder/prefixes for each table of data they have exported.
In this lab, we are going to create a Glue Data Crawler to crawl
across this bucket. Once we have created the crawler, we are going
to run it to determine the tables that are located in the bucket and
add their definitions to the Glue Data Catalog.
(Diagram: category, events, listings → Data Catalog)
Working with Amazon Athena
Amazon Athena is an interactive query service
that makes it easy to analyze data directly from
Amazon S3 using Standard SQL
Benefits of Athena
• Decouple storage from compute
• Serverless – No infrastructure or resources to manage
• Pay only for data scanned
• Schema on read – Same data, many views
• Secure – IAM for authentication; Encryption at rest & in
transit
• Standards-compliant and open storage formats
• Built on powerful community supported OSS solutions
Familiar Technologies Under the Covers
A Better Model
Old methodology:
• Analyst asks for a report
• Developer writes code
• Code executes on a shared cluster for several hours
• Analyst reviews report
• Analyst asks for more…

With Amazon Athena:
• Analyst creates table
• Analyst iterates
• Generate final report
Simple Pricing
• DDL operations – free
• Data scanned – $5 per TB
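Because Athena bills only for bytes scanned, estimating a query's cost is simple arithmetic. This tiny helper is my own illustration (not an AWS API) applying the $5/TB rate quoted above:

```python
PRICE_PER_TB = 5.00  # USD per TB scanned (rate quoted on this slide)

def athena_query_cost(bytes_scanned: int) -> float:
    """Estimate the cost in USD of a query that scanned `bytes_scanned` bytes."""
    tb = bytes_scanned / (1024 ** 4)
    return tb * PRICE_PER_TB

# Scanning 200 GiB costs roughly a dollar:
cost = athena_query_cost(200 * 1024 ** 3)
print(f"${cost:.4f}")   # $0.9766
```

This is also why columnar formats such as Parquet matter: scanning only the needed columns directly reduces the bill.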
Cost Monitoring
• Billing console provides spend per account
• Athena APIs are logged in CloudTrail
• Combine CloudTrail logs with the Athena API to attribute cost per IAM user
Security and Access Control
• Encryption – SSE, SSE-KMS, CSE-KMS
• Auto detect source bucket KMS key
• Destination bucket may use separate key
• Access Control
• IAM
• S3 ACL
• S3 bucket policies
Lab 03: Querying your Data Lake with Amazon Athena
The ticketing team at UnicornNation has heard about the work you
did in setting up the data catalog. They have some data that they
need urgently and need your help in setting up and running these
queries.
In this lab, you are going to use Amazon Athena to create some
queries, which you will then save to make it easy for users to run
and consume.
As part of the lab, you will also be running your own queries to help
users answer some basic questions around ticket sales, customers
and more.
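As a sketch of what the lab automates, an Athena query is a StartQueryExecution call carrying the SQL, a catalog database, and an S3 output location. The helper only builds the parameters; the database, table, and bucket names are hypothetical, and the commented lines show the boto3 call.

```python
def athena_query_params(sql: str, database: str, output_s3: str) -> dict:
    """Build the parameters for an Athena StartQueryExecution call."""
    return {
        "QueryString": sql,
        "QueryExecutionContext": {"Database": database},
        "ResultConfiguration": {"OutputLocation": output_s3},
    }

params = athena_query_params(
    sql="SELECT eventname, SUM(qtysold) AS tickets "
        "FROM tickit_history GROUP BY eventname ORDER BY tickets DESC LIMIT 10;",
    database="unicorn_tickit",
    output_s3="s3://unicorn-athena-results/",
)
# With boto3:
#   athena = boto3.client("athena")
#   execution = athena.start_query_execution(**params)
```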
Working with AWS Glue
AWS Glue
• AWS Glue is a fully managed ETL (extract, transform, and load) service
• Categorize your data, clean it, enrich it and move it reliably between
various data stores
AWS Glue
(Diagram: data moves from a data source through AWS Glue to a data target.)
ETL Engine
Why would AWS get into the ETL space?
We have lots of ETL Partners
Customers are still hand-coding ETL
Customers are still hand-coding ETL
AWS Glue
• AWS Glue automates the undifferentiated heavy lifting of ETL
• Discover and organize data, regardless of where it lives
• Focus on writing transformations, not on handling undifferentiated heavy lifting
• ETL jobs run under a serverless execution model
AWS Glue
Job workflow:
1. Triggers start the ETL job
2. The job extracts data from the data sources (Amazon S3, Amazon RDS, Amazon Redshift, Amazon DynamoDB)
3. The job transforms the data
4. The job loads the data into the data targets (Amazon S3, Amazon RDS, Amazon Redshift, Amazon DynamoDB) and writes run statistics
Job authoring in Glue
You have choices on how to get started:
• Script generated by AWS Glue
• Existing script brought into AWS Glue
• Blank script authored by you
Choose a Data Source
Choose a Data Destination
Write output results into an existing table.
Automatic code generation
• Existing columns in the target
• Can extend/add new columns to the target
Automatic code generation
Add Custom Modules and Files
Add external Python modules
Java JARs required by the script
Additional files such as configuration, etc.
Glue transformations are flexible and adaptive
(Diagram: a semi-structured record with fields A, B, B, nested object C{X, Y}, and array D[ ] is mapped to relational tables linked by generated primary/foreign keys, with array elements stored as offset/value rows.)

Flatten semi-structured objects with arbitrary complexity into relational tables, on the fly.
Pivot arrays and other collection types into a separate table, generating key/foreign-key values.
Modify the mapping as the source schema changes, and modify the target schemas as needed.
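The flatten-and-pivot behaviour described above (what Glue calls relationalizing) can be sketched in plain Python: nested objects become dotted column names, and arrays are pivoted into a child table joined by a generated foreign key. This is a simplified illustration, not Glue's actual implementation.

```python
def relationalize(records: list) -> tuple:
    """Flatten nested dicts into a main table; pivot list values into a child table."""
    main, child = [], []
    for fk, record in enumerate(records):
        row = {"_id": fk}
        for key, value in record.items():
            if isinstance(value, dict):    # flatten nested objects: C -> C.X, C.Y
                for sub, sub_value in value.items():
                    row[f"{key}.{sub}"] = sub_value
            elif isinstance(value, list):  # pivot arrays into a child table row per element
                for offset, item in enumerate(value):
                    child.append({"_fk": fk, "offset": offset, "value": item})
            else:
                row[key] = value
        main.append(row)
    return main, child

main, child = relationalize([{"A": 1, "C": {"X": 2, "Y": 3}, "D": ["a", "b"]}])
print(main)   # [{'_id': 0, 'A': 1, 'C.X': 2, 'C.Y': 3}]
print(child)  # [{'_fk': 0, 'offset': 0, 'value': 'a'}, {'_fk': 0, 'offset': 1, 'value': 'b'}]
```

The generated `_id`/`_fk` pair plays the role of the PK/FK columns shown in the diagram.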
Glue transformations…
ETL Job Progress and History
Track ETL job progress and inspect logs directly from the console.
Logs are written to CloudWatch for simple access. Errors are automatically extracted and
presented in the Error Logs for easy troubleshooting of jobs.
Orchestration & resource management
Fully managed, serverless job execution
Job Bookmarks
Job bookmarks determine how AWS Glue handles data that was already processed in previous job runs.

For example, your ETL job might read new partitions in an Amazon S3 dataset. AWS Glue keeps track of which partitions have successfully been processed by the job to prevent duplicate processing and duplicate data in the job's target data store.
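The bookmark idea can be sketched as a simple processed-set: each run skips inputs recorded by earlier runs, then commits what it processed. This is my own illustration of the behaviour, not Glue's internal bookmark format.

```python
def run_job(partitions: list, bookmark: set) -> list:
    """Process only partitions not yet recorded in the bookmark; commit them after."""
    new = [p for p in partitions if p not in bookmark]
    # ... transform and load `new` here ...
    bookmark.update(new)  # commit: remember what this run processed
    return new

bookmark = set()
first = run_job(["date=10", "date=11"], bookmark)
second = run_job(["date=10", "date=11", "date=12"], bookmark)
print(first)    # ['date=10', 'date=11']
print(second)   # ['date=12']  -- only the new partition is processed
```

Committing the bookmark only after a successful run is what prevents both duplicates and data loss on retries.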
Job composition and triggers
Serverless job execution
Developer Endpoints
• For jobs that run inside a VPC, Glue also enforces the use of an S3 VPC endpoint
AWS Lake Formation
Building a secure data lake
Typical steps to build a secure data lake:
1. Set up storage
2. Move data (ingestion and cleaning)
3. Cleanse, prepare, and catalog data
4. Configure and enforce security and compliance policies
5. Make data available for analytics and machine learning
Lake Formation for a secure data lake
1. Ingest and organize – automates creating the data lake and ingesting data.
2. Secure and control – sets up fine-grained access control and data governance. To protect data, all access is checked against the policies that have been set.
3. Collaborate and use – search and data discovery using Data Catalog metadata.
4. Monitor and audit – based on data access and governance policies, alert notifications are raised on policy violations and logged.
AWS Lake Formation builds on AWS Glue
(Diagram: AWS Lake Formation sits on top of AWS Glue, reusing its connections, databases, and tables; its ETL jobs; and its crawlers.)
AWS Lake Formation benefits
• Centralized management of fine-grained permissions empowers security officers.
• Simplified ingest and cleaning enables data engineers to build faster.
(Diagram: AWS Lake Formation builds on AWS Glue transforms, blueprints, ML transforms, the Data Catalog, and access control.)
Lab 04
Lab 04: Transforming Data with AWS Glue
Q&A
Class Evaluation
Thank You
Vijay K
AWS Partner Trainer
vkkasibh@amazon.com