
AWS Professional Services:

Big Data and Analytics


Data Lake on AWS

Vijay K
AWS Partner Trainer
vkkasibh@amazon.com
Course Agenda
• Introduction
• Data Lake overview
• Building Data Lakes on AWS
• Lab 01: Creating a Glue Data Crawler
• Lab 02: Modifying Table Schemas
• Working with Amazon Athena
• Lab 03: Querying your Data Lake with Amazon Athena
• Working with AWS Glue
• Lab 04: Transforming Data with AWS Glue
• Wrap Up & Conclusion

2
Introduction
Introducing UnicornNation
UnicornNation is a global entertainment company that provides ticketing, merchandising, and promotion of large concerts and events.

In recent years, they have been collecting data through a number of disparate systems and want to consolidate this data in a modern data architecture.

A workshop was held with the key stakeholders at UnicornNation. They identified three key data sources they would like to consolidate, and have provided the funding and resources to build a Data Lake on AWS.

During the course of this session, you will build a Data Lake on AWS to meet their requirements and gain experience with a number of core AWS services, including Amazon S3, AWS Glue, and Amazon Athena.

4 Visit aws.training
UnicornNation – Target Architecture

On-premises data sources feed the AWS Cloud: the Merchandise Sales CSV data and the Tickit History data land in Amazon S3, where AWS Glue crawlers register them in the AWS Glue Data Catalog. An AWS Glue ETL job converts the Merchandise Sales data to Parquet, and data analysts query everything through Amazon Athena.
5 Visit aws.training
Handouts
Lab Manual
Step-by-step instructions for completing the hands-on labs – will be sent in an email.

Lab Login Details
Login details for the AWS Console and the S3 bucket configuration – will be included in the same email.

Training Content
Will be sent at the end of tomorrow's training.

6 Visit aws.training
Attendance
To get attendance credit for this course, you must attend
sessions on both days.

7 Visit aws.training
Data Lakes on AWS
What is a Data Lake?
A data lake is a centralized repository that allows you to store all your structured and unstructured data at any scale.

You can store your data as-is, without having to first structure the data, and run different types of analytics—from dashboards and visualizations to big data processing, real-time analytics, and machine learning to guide better decisions.

9 Visit aws.training
Legacy Data Architectures Exist as Isolated Data Silos

Hadoop Cluster | Data Warehouse Appliance | SQL Database

10 Visit aws.training
Legacy Data Architectures Are Monolithic

Hadoop cluster: a master node in front of nodes that each combine CPU, memory, and HDFS storage.
Multiple layers of functionality all on a single cluster.

11 Visit aws.training
Enter Data Lake Architectures

Data Lake is a new and increasingly popular architecture to store and analyze massive volumes and heterogeneous types of data.

12 Visit aws.training
Benefits of a Data Lake – Quick Ingest

"How can I collect data quickly from various sources and store it efficiently?"
Quickly ingest data without needing to force it into a pre-defined schema.

13 Visit aws.training
Benefits of a Data Lake – Storage vs Compute

"How can I scale up with the volume of data being generated?"
Separating your storage and compute allows you to scale each component as required.

14 Visit aws.training
Benefits of a Data Lake – Schema on Read

"Is there a way I can apply multiple analytics and processing frameworks to the same data?"
A Data Lake enables ad-hoc analysis by applying schemas on read, not write.

15 Visit aws.training
Comparison of a Data Lake to an Enterprise Data Warehouse

Data Lake (EMR + S3) | Enterprise DW
Complementary to the EDW (not a replacement) | Data lake can be a source for the EDW
No predefined schemas (schema-on-read) | Predefined schemas (schema-on-write)
Structured / semi-structured / unstructured data | (Typically) structured data only
Fast ingestion of new data/content | (Typically) time-consuming to introduce new content
Data science + prediction/advanced analytics + BI use cases | Traditional BI use cases
Data at all levels (raw, transformed, summary, etc.) | (Typically) summary / aggregated level data only
Decoupled storage & computation | Storage & computation tightly coupled
Flexibility in tools / greater processing capabilities / parallelization of processing | Limited flexibility in tools / limited processing capabilities and windows of opportunity
16 Visit aws.training
Building a Data Lake on AWS
Reference architecture

17 Visit aws.training
Components of a Data Lake
(API & UI / Entitlements / Catalogue & Search / Storage)

Data Storage
• High durability
• Stores raw data from input sources
• Support for any type of data
• Low cost

Ingestion
• Streaming
  • Streamed ingest of feed data
  • Provides the ability to consume any dataset as a stream
  • Facilitates low-latency analytics
• Batch
  • AWS Batch / Snowball
  • Storage Gateway
  • sFTP
18 Visit aws.training
S3 for data lake

Durable: designed for 11 9s of durability
Available: designed for 99.99% availability
High performance: multipart upload, Range GETs
Easy to use: simple REST API, AWS SDKs, read-after-create consistency, event notification, lifecycle policies
Scalable: store as much as you need, scale storage and compute independently, no minimum usage commitments
Integrated: Amazon EMR, Amazon Redshift, Amazon DynamoDB
19 Visit aws.training
Secure data lake on Amazon S3

Amazon S3 access points: multi-tenant buckets with dedicated access points; customer permissions from an Amazon Virtual Private Cloud (Amazon VPC)

S3 Block Public Access: applies across AWS accounts and at the Amazon S3 bucket level; specify public permissions using an Access Control List (ACL) or policy; four settings: BlockPublicAcls, IgnorePublicAcls, BlockPublicPolicy, RestrictPublicBuckets

Amazon S3 object tags: access control, lifecycle policies, and analysis; classify data with metadata; use tags to filter objects; define replication policies; populate tags with AWS Lambda functions or S3 Batch Operations

Amazon S3 object lock: immutable Amazon S3 objects; retention management controls; data protection and compliance
20 Visit aws.training
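The four Block Public Access settings listed above can be applied with a single API call. A minimal boto3 sketch (the bucket name is hypothetical):

import boto3

s3 = boto3.client("s3")

# Apply all four S3 Block Public Access settings to a hypothetical data lake bucket.
s3.put_public_access_block(
    Bucket="example-datalake-bucket",  # hypothetical bucket name
    PublicAccessBlockConfiguration={
        "BlockPublicAcls": True,
        "IgnorePublicAcls": True,
        "BlockPublicPolicy": True,
        "RestrictPublicBuckets": True,
    },
)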
Data Ingestion into Amazon S3

• AWS Direct Connect
• AWS Snowball
• ISV Connectors
• Amazon Kinesis Firehose
• S3 Transfer Acceleration
• AWS Storage Gateway

21 Visit aws.training
Components of a Data Lake
Catalogue
• Metadata (technical & business)
• Classification
• Used for summary statistics and data management

Search
• Simplify discoverability of and access to the data

22 Visit aws.training
Data Catalogue – Metadata Index
• Store data about your Amazon S3 storage environment
• Total size & count of objects by prefix, data classification, refresh schedule, object version information
• Amazon S3 events processed by a Lambda function
• DynamoDB metadata tables store the required attributes

Data Collectors (EC2, ECS) → PUT object → S3 bucket → ObjectCreated / ObjectDeleted events → AWS Lambda → PUT item → Metadata Index (DynamoDB)

23 Visit aws.training
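As a rough illustration of the pattern above, a sketch of the Lambda function that maintains the DynamoDB metadata index. The table name, key schema, and attributes are assumptions, and the actual S3 event names are ObjectCreated:* and ObjectRemoved:*.

import boto3

dynamodb = boto3.resource("dynamodb")
table = dynamodb.Table("datalake-metadata-index")  # hypothetical table keyed on (bucket, key)
s3 = boto3.client("s3")

def handler(event, context):
    """Triggered by S3 ObjectCreated / ObjectRemoved notifications; upserts object metadata."""
    for record in event.get("Records", []):
        bucket = record["s3"]["bucket"]["name"]
        key = record["s3"]["object"]["key"]
        if record["eventName"].startswith("ObjectCreated"):
            head = s3.head_object(Bucket=bucket, Key=key)
            table.put_item(Item={
                "bucket": bucket,
                "key": key,
                "size": head["ContentLength"],
                "last_modified": head["LastModified"].isoformat(),
                "version_id": head.get("VersionId", "null"),
            })
        else:  # ObjectRemoved:*
            table.delete_item(Key={"bucket": bucket, "key": key})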
Catalogue & Search Architecture
Metadata Index
(DynamoDB)
PUT object ObjectCreated, PUT item
ObjectDeleted
Data Collectors S3 bucket AWS Lambda
(EC2, ECS) Update Stream

AWS Lambda
Extract Search Fields

Update Index

Amazon Elasticsearch Service

24 Visit aws.training
AWS Glue Data Catalog
Unified metadata repository across Amazon RDS, Amazon Redshift, and Amazon S3 (the lake house)

• Get a single view into data, no matter where it is stored
• Automatically classify data in a central, searchable list
• Track data evolution using schema versioning
• Query data using Amazon Athena or Amazon Redshift Spectrum
• Hive metastore compatible

More about this topic in the AWS Glue session.
25 Visit aws.training
Components of a Data Lake
Entitlements system
• Encryption
• Authentication
• Authorisation
• Chargeback
• Quotas
• Data masking
• Regional restrictions

26 Visit aws.training
Implement the right cloud security controls

Security: Amazon GuardDuty, AWS Shield, AWS Well-Architected Tool, Amazon Macie, Amazon Virtual Private Cloud (Amazon VPC), AWS Organizations

Identity: AWS Identity and Access Management (IAM), AWS Single Sign-On, Amazon Cloud Directory, Amazon Cognito, AWS Directory Service

Encryption: AWS Certificate Manager (ACM), AWS Key Management Service (AWS KMS), AWS CloudHSM, encryption at rest, encryption in transit, bring your own keys, hardware security module (HSM) support

Compliance: AWS Artifact, Amazon Inspector, AWS CloudTrail
27 Visit aws.training
Components of a Data Lake
API & User Interface
API & UI
• Exposes the data lake to customers
Entitlements • Programmatically query catalogue
• Expose search API
Catalogue & Search • Ensures that entitlements are
respected
Storage

28 Visit aws.training
AWS Solution – Data Lake on AWS
• Reference Architecture deployment via CloudFormation
• Configures core services to tag, search, and catalogue datasets
• Deploys a console to search and browse available datasets

http://amzn.to/2nTVjcp

29 Visit aws.training
Data Lake Usage: Processing & Analytics

Real-time: Elasticsearch Service, Kinesis Analytics, Kinesis Streams, Spark Streaming on EMR, Apache Flink on EMR, Apache Storm on EMR, AWS Lambda, Kinesis Streams & Firehose

Batch: EMR (Hadoop, Spark, Presto), Athena (query service), Redshift (data warehouse)

AI & Predictive: Amazon Lex (speech recognition), Amazon Polly (text to speech), Amazon Rekognition, Amazon Machine Learning (predictive analytics)

Transactional & RDBMS: DynamoDB (NoSQL DB), Aurora (relational database)

BI & Data Visualization
30 Visit aws.training
AWS Quick Starts for Data Lakes
Quick Starts are built by AWS solutions architects and
partners to help you deploy popular solutions on AWS,
based on AWS best practices for security and high
availability.

These reference deployments implement key technologies automatically on the AWS Cloud, often with a single click and in less than an hour. You can build your test or production environment in a few steps, and start using it immediately.

https://aws.amazon.com/quickstart/

31 Visit aws.training
Data Lakes: Summary
• Use S3 as the storage repository for your data lake, instead of a
Hadoop cluster or data warehouse
• Decoupled storage and compute is cheaper and more efficient to
operate
• Decoupled storage and compute allow us to evolve to clusterless
architectures like Athena
• Do not build data silos in Hadoop or the Enterprise DW
• Gain flexibility to use all the analytics tools in the ecosystem around S3 & future-proof the architecture

32 Visit aws.training
Summary of AWS Analytics & AI Tools

Data visualization, engagement, and machine learning: AWS Data Exchange, Amazon QuickSight, Amazon Pinpoint, Amazon SageMaker, Amazon Comprehend, Amazon Polly, Amazon Lex, Amazon Rekognition, Amazon Translate

Analytics: Amazon Redshift, Amazon EMR (Spark and Presto), AWS Glue (Spark and Python), Amazon Athena, Amazon Kinesis Data Analytics, Amazon Elasticsearch Service (Amazon ES)

Data lake infrastructure and management: Amazon Simple Storage Service (Amazon S3), Amazon S3 Glacier, AWS Lake Formation, AWS Glue

Data movement: AWS Database Migration Service (AWS DMS) | AWS Snowball | AWS Snowmobile | Amazon Kinesis Data Firehose | Amazon Kinesis Data Streams | Amazon Managed Streaming for Apache Kafka
33 Visit aws.training
Building Data Lakes
Data Lakes start with S3…

…and some Data Engineering


35 Visit aws.training
What is Data Engineering?

A data engineer transforms data into a useful format for analysis, including:
• Business Intelligence
• Machine Learning
• Deep Learning

36 Visit aws.training
Roles and Responsibilities

Overlapping skill areas across Data Engineering, Data Analytics, Data Science, and DevOps: data stewardship, business understanding, operationalisation and reliability, data pipelines and data structures, statistical analysis and feature engineering, software development, extraction / loading / transformation, data enrichment & preparation, model re/training, application integration, aggregation and normalization, data lakes and data marts, visualization and dashboard design, UX/UI, experimentation and tuning, containerization, web services, edge ML, and production environments.
37 Visit aws.training
ROLE | PRIORITIES | PITFALLS WITHOUT THEM
Data Scientist | Makes sense of data, generates and communicates insights to improve or create business processes, creates predictive ML models to support them | Naïve insights and low model yield; missed opportunities to unveil business value
Data Engineer | Builds scalable pipelines, transforms and loads data into structures complete with metadata that can be readily consumed by DS | Value generation is slow, because DS is spending their time doing infrastructure work
Data Product Manager | Manages data as a product. Ensures freshness and consistency of data; understands lineage and compliance needs; treats DS as customers | Sponsors struggle to see business value; projects miss the mark due to insufficient data
DevOps Engineer | Integrating models with applications, automating model delivery and retraining, measuring ongoing model relevance and accuracy | Slow model deployment, poor performance; premature model obsolescence
Data Analysts | Creating engaging visual and narrative journeys for analytical solutions | Low engagement and adoption from end users; lack of executive support
Business Sponsor | Vetting the prioritization and ROI, funding projects, providing ongoing feedback | If project prioritisation isn't made sensibly, DS do not know when to pull the plug
38 Visit aws.training
Design Patterns

• Data Export – Cloud or On-Premise Database


• Log Aggregation – Various Logs from AWS, internal, etc.
• Real-time Data Collection – Streaming data
• Analytics and Reporting – Aggregating, Consolidating,
Prepping Data

39 Visit aws.training
Hydrating Data Lakes

40 Visit aws.training
Many different options…

AWS DMS AWS Snowball AWS Snowmobile AWS Command AWS Management
Line Interface Console

Amazon Kinesis Amazon Kinesis Amazon Managed AWS Glue Third-party Tools
Data Streams Data Firehose Streaming for
Kafka

41 Visit aws.training
Third Party Tools

Amazon S3

42 Visit aws.training
Cataloging your Data Lake

43 Visit aws.training
Discover and Organize Data Transformations

Databases (Amazon RDS), data warehouses (Amazon Redshift), and data lakes (Amazon S3) are described by the Glue Data Catalog. Glue ETL, Amazon Athena, Amazon Redshift Spectrum, and EMR (Hadoop/Spark) use the catalog for ad-hoc investigation and analytics.

44 Visit aws.training
Glue Data Catalog

Manage table metadata through a web interface, the Hive metastore API, Hive SQL, or automatically through crawlers.

Listening to our customers, we've added:
• Search metadata for data discovery
• Connection info – JDBC URLs, credentials
• Classification for identifying and parsing files
• Schema versioning of table metadata

45 Visit aws.training
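For illustration, a small boto3 sketch of the catalog features above (listing tables and free-text metadata search); the database name and search text are hypothetical, and pagination is ignored.

import boto3

glue = boto3.client("glue")

# List tables registered in a hypothetical catalog database.
for tbl in glue.get_tables(DatabaseName="unicorn_nation")["TableList"]:
    print(tbl["Name"], tbl["StorageDescriptor"]["Location"])

# Free-text metadata search across the catalog (data discovery).
for tbl in glue.search_tables(SearchText="sales")["TableList"]:
    print(tbl["DatabaseName"], tbl["Name"])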
Glue Data Catalog – Table Details

Table Description

Table Properties

Table Schema

Table Partitions

46 Visit aws.training
Glue Data Catalog – Table View Partitions
Partitioning Values

Jump to Objects
or
View Properties

S3 Objects Linked

47 Visit aws.training
Glue Data Catalog – Schema Version Control
List table schema versions

Compare schema versions

48 Visit aws.training
Glue Data Catalog – Editing Schema
Edit schemas to add, remove and update columns and types

49 Visit aws.training
Database Connections

Configure a database connection to be used as a source and destination:
• JDBC
• Amazon RDS and Aurora
• Amazon Redshift
• Amazon DocumentDB
• MongoDB
• Kafka
50 Visit aws.training
Crawlers: Auto-Populate Data Catalog

Run crawlers on-demand and on a schedule to discover new data and schema changes.
Serverless – only pay when crawls run.

Automatic schema inference:
• Built-in classifiers
  • Detect file type
  • Extract schema
  • Identify partitions
• Add your own classifiers
  • Grok
  • XML
  • JSON
  • CSV
51 Visit aws.training
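A minimal boto3 sketch of creating and running a crawler like the one built in Lab 01; the crawler, role, database, and bucket names are hypothetical, and the exclusion strings refer to the pattern table shown a little later.

import boto3

glue = boto3.client("glue")

# Define a crawler over a hypothetical merchandise-sales bucket and run it nightly.
glue.create_crawler(
    Name="merchandise-sales-crawler",
    Role="GlueCrawlerRole",                      # existing IAM role with S3 + Glue access
    DatabaseName="unicorn_nation",
    Targets={"S3Targets": [{
        "Path": "s3://example-merchandise-sales/",
        "Exclusions": ["**.tmp", "**.metadata"],  # exclusion patterns (see table below)
    }]},
    Schedule="cron(0 2 * * ? *)",                # optional: 02:00 UTC daily
)
glue.start_crawler(Name="merchandise-sales-crawler")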
Crawlers: Classifiers

An AWS Glue crawler assumes an AWS IAM role and connects to the data stores to be classified:
• JDBC connection – databases: MySQL, MariaDB, PostgreSQL, Amazon Aurora, Oracle, Amazon Redshift
• NoSQL connection – Amazon DynamoDB
• Object connection – Amazon S3

Built-in classifiers: Avro, Parquet, ORC, XML, JSON and JSONPaths, AWS CloudTrail, BSON, logs, delimited files … always growing …

52 Visit aws.training
Detecting partitions

An S3 bucket hierarchy such as month=Nov/date=10/…/file N maps to a table definition with partition columns:

Column | Type
month  | str
date   | str
col 1  | int
col 2  | float

The crawler estimates schema similarity among files at each level (for example, sim=.93, .99, .95) to handle semi-structured logs, schema evolution, and so on.
53 Visit aws.training
Exclusion Patterns

Exclude pattern | Description
*.csv | Matches an Amazon S3 path that represents an object name ending in .csv
*.* | Matches all object names that contain a dot
*.{csv,avro} | Matches object names ending with .csv or .avro
foo.? | Matches object names starting with foo. that are followed by a single-character extension
/myfolder/* | Matches objects in one level of subfolder from myfolder, such as /myfolder/mysource
/myfolder/*/* | Matches objects in two levels of subfolders from myfolder, such as /myfolder/mysource/data
/myfolder/** | Matches objects in all subfolders of myfolder, such as /myfolder/mysource/mydata and /myfolder/mysource/data
Market* | Matches tables in a JDBC database with names that begin with Market, such as Market_us and Market_fr

54 Visit aws.training
Lab 01

Creating a Glue Data Crawler

55
Lab 01: Creating a Glue Data Crawler
The IT team at UnicornNation has exported Merchandise Sales data from their finance system and transferred this data to a single S3 bucket. There are multiple .CSV files in the bucket, each representing a month's worth of data.

In this lab, you are going to create a Glue Data Crawler to crawl
across this bucket. Once you have created the crawler, you are
going to run it to determine the tables that are located in the bucket
and add their definitions to the Glue Data Catalog.
Sales-jan18.csv

Sales-feb18.csv

Sales-mar18.csv

Data Catalog

56 Visit aws.training
Lab 02

Modifying Table Schemas

57
Lab 02: Modifying Table Schemas
The IT team at UnicornNation has extracted historical data from their ticketing system, named "Tickit", which processes the majority of transactions for the company. This data source is known as the Tickit History.
They have stored this data in an S3 bucket and have created
folder/prefixes for each table of data they have exported.
In this lab, we are going to create a Glue Data Crawler to crawl across this bucket. Once we have created the crawler, we are going to run it to determine the tables that are located in the bucket (such as category, events, and listings) and add their definitions to the Glue Data Catalog.

Data Catalog

58 Visit aws.training
Working with Amazon Athena
Amazon Athena is an interactive query service
that makes it easy to analyze data directly from
Amazon S3 using Standard SQL

60 Visit aws.training
Benefits of Athena
• Decouple storage from compute
• Serverless – No infrastructure or resources to manage
• Pay only for data scanned
• Schema on read – Same data, many views
• Secure – IAM for authentication; Encryption at rest & in
transit
• Standard compliant and open storage formats
• Built on powerful community supported OSS solutions

61 Visit aws.training
Familiar Technologies Under the Covers

Apache Hive – used for DDL functionality
• Complex data types
• Multitude of formats
• Supports data partitioning

Presto – used for SQL queries
• In-memory distributed query engine
• ANSI-SQL compatible with extensions

62 Visit aws.training
A Better Model

Old Methodology:
• Analyst asks for a report
• Developer writes code
• Code executes on a shared cluster for several hours
• Analyst reviews report
• Analyst asks for more…

With Amazon Athena:
• Analyst creates table
• Analyst iterates
• Generate final report

Simple, Quick and No Infrastructure or Developer to Manage

63 Visit aws.training
Simple Pricing
• DDL operations – FREE

• SQL operations – FREE

• Query concurrency – FREE

• Data scanned - $5 / TB

• Standard S3 rates for storage, requests, and data transfer apply

64 Visit aws.training
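A quick worked example of the $5/TB figure above; the compression ratio is only an illustrative assumption, and 1 TB is treated as 2^40 bytes for simplicity.

# Athena bills $5 per TB of data scanned (rounded up, with a 10 MB minimum per query).
PRICE_PER_TB = 5.00

def athena_cost(bytes_scanned: float) -> float:
    return bytes_scanned / 1024 ** 4 * PRICE_PER_TB

print(athena_cost(1 * 1024 ** 4))     # scanning 1 TB of raw CSV       -> $5.00
print(athena_cost(0.25 * 1024 ** 4))  # same data as compressed Parquet (~250 GB, assumed) -> ~$1.25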
Cost Monitoring
• Billing console provides spend per account
• Athena APIs are logged in CloudTrail
• Combine CloudTrail logs and the Athena API to derive per-IAM-user cost

65 Visit aws.training
Security and Access Control
• Encryption – SSE, SSE-KMS, CSE-KMS
• Auto detect source bucket KMS key
• Destination bucket may use separate key

• Access Control
• IAM
• S3 ACL
• S3 bucket policies

• Integrated with Glue Data Catalog
  • Database level
  • Table level

66 Visit aws.training
When to Use Athena
• Use Athena for easy access to data in Amazon S3
• Use it alongside EMR, Redshift & Redshift Spectrum to support different use cases
• Athena is free when not in use, so there is no cost in simply defining tables in case you need the data
• Easy to query operational logs, e.g. ALB, ELB, S3, CloudTrail, CloudFront, etc.
67 Visit aws.training
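To make the query flow concrete, a hedged boto3 sketch that submits a query against a catalogued table and reads back the results; the database, table, and results-bucket names are hypothetical.

import time
import boto3

athena = boto3.client("athena")

# Submit a query against a hypothetical catalogued log table.
qid = athena.start_query_execution(
    QueryString="SELECT eventname, count(*) AS cnt FROM cloudtrail_logs "
                "GROUP BY eventname ORDER BY cnt DESC LIMIT 10",
    QueryExecutionContext={"Database": "unicorn_nation"},
    ResultConfiguration={"OutputLocation": "s3://example-athena-results/"},
)["QueryExecutionId"]

# Poll until the query finishes, then fetch the result rows.
while True:
    state = athena.get_query_execution(QueryExecutionId=qid)["QueryExecution"]["Status"]["State"]
    if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
        break
    time.sleep(1)

if state == "SUCCEEDED":
    for row in athena.get_query_results(QueryExecutionId=qid)["ResultSet"]["Rows"]:
        print([col.get("VarCharValue") for col in row["Data"]])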
Lab 03

Querying your Data Lake with Amazon Athena

68
Lab 03: Querying your Data Lake with Amazon Athena
The ticketing team at UnicornNation has heard about the work you
did in setting up the data catalog. They have some data that they
need urgently and need your help in setting up and running these
queries.

In this lab, you are going to use Amazon Athena to create some
queries, which you will then save to make it easy for users to run
and consume.

As part of the lab, you will also be running your own queries to help
users answer some basic questions around ticket sales, customers
and more.

69 Visit aws.training
Working with AWS Glue
AWS Glue
• AWS Glue is a fully managed ETL (extract, transform, and load) service

• Categorize your data, clean it, enrich it and move it reliably between
various data stores

• Once catalogued, your data is immediately searchable and queryable across your data silos

• Simple and cost-effective

• Serverless; runs on a fully managed, scale-out Spark environment

71 Visit aws.training
AWS Glue

AWS Glue provides job scheduling and orchestration together with the AWS Glue Data Catalog, moving data between data stores – from a data source to a data target.
72 Visit aws.training
ETL Engine
Why would AWS get into the ETL space?

73 Visit aws.training
We have lots of ETL Partners

Fivetran

74 Visit aws.training
Customers are still hand-coding ETL

Why do we do so much hand-coding?


• Code is flexible
• Code is powerful
• You can unit test
• You can deploy with other code
• You know your development tools

75 Visit aws.training
Customers are still hand-coding ETL

But there are downsides to hand-coding:
• Hand-coding involves a lot of undifferentiated heavy lifting
• Hand-coding is brittle, error-prone and laborious
• Especially as data formats change and target schemas change

76 Visit aws.training
AWS Glue
• AWS Glue automates the undifferentiated heavy lifting of ETL
• Discover and organize data, regardless of where it lives
• Focus on writing transformations, not handling undifferentiated heavy lifting
• ETL jobs run under a serverless execution model
77 Visit aws.training
AWS Glue

1. Triggers start an ETL job.
2. The job extracts data from the data sources (Amazon S3, Amazon RDS, Amazon Redshift, Amazon DynamoDB, Amazon DocumentDB, JDBC).
3. The job's script transforms the data.
4. The job loads data into the data targets (the same range of stores).
5. The job writes statistics to the AWS Glue Data Catalog.

Everything runs serverless.
78 Visit aws.training
AWS Glue: components

AWS Glue Data Catalog
• Hive metastore compatible with enhanced functionality
• Crawlers automatically extract metadata and create tables
• Integrates with Amazon Athena, Amazon EMR, and many more

Job authoring
• Generate ETL code
• Build on open frameworks – Python/Scala and Apache Spark
• Developer-centric – editing, debugging, sharing

Job execution
• Run jobs on a serverless Spark platform
• Use flexible scheduling, job monitoring, and alerting

Job workflow
• Orchestrate triggers, crawlers, and jobs
• Author and monitor entire flows and integrated alerting

79 Visit aws.training
Job authoring in Glue

You have choices on how to get started:
• Script generated by AWS Glue
• Existing script brought into AWS Glue
• Blank script authored by you

80 Visit aws.training
Choose a Data Source

Once the crawler completes updating the Data Catalog, those tables become sources for Glue ETL.

81 Visit aws.training
Choose a Data Destination
Write output results into an existing table.

Write output results into a new location


• S3 Bucket
• JDBC Connection

82 Visit aws.training
Automatic code generation

Existing columns
in target
Can extend/add
new columns to
target

83 Visit aws.training
Automatic code generation

1. Customize the mappings


2. Glue generates transformation graph and Python/Scala code
3. Specify trigger condition
84 Visit aws.training
Glue ETL scripts are forgiving and flexible

• Human-readable code run on a scalable platform
• Forgiving in the face of failures – handles bad data and crashes
• Flexible: handles complex semi-structured data, and adapts to source schema changes
85 Visit aws.training
Write your own scripts
• Convert to a Spark DataFrame
• Run with custom code and libraries
• Convert back to a Glue DynamicFrame – see the sketch below

86 Visit aws.training
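A minimal sketch of that round trip inside a Glue ETL script, assuming a hypothetical catalog database, table, and column name:

from awsglue.context import GlueContext
from awsglue.dynamicframe import DynamicFrame
from pyspark.context import SparkContext

glue_context = GlueContext(SparkContext.getOrCreate())

dyf = glue_context.create_dynamic_frame.from_catalog(
    database="unicorn_nation", table_name="merchandise_sales_csv")

# 1. Convert to a Spark DataFrame and run arbitrary Spark / custom library code...
df = dyf.toDF()
df = df.filter(df["quantity"] > 0).dropDuplicates()   # hypothetical column

# 2. ...then convert back to a DynamicFrame to keep using Glue transforms and writers.
dyf_clean = DynamicFrame.fromDF(df, glue_context, "dyf_clean")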
Add Custom Modules and Files
• Add external Python modules
• Java JARs required by the script
• Additional files such as configuration, etc. – see the sketch below

87 Visit aws.training
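These extras are passed through Glue's special job parameters. A hedged boto3 sketch (job name, role, script location, and S3 paths are hypothetical):

import boto3

glue = boto3.client("glue")

# Attach extra Python modules, JARs, and configuration files when defining a job.
glue.create_job(
    Name="sales-transform",
    Role="GlueJobRole",
    Command={"Name": "glueetl", "ScriptLocation": "s3://example-scripts/sales_transform.py"},
    DefaultArguments={
        "--extra-py-files": "s3://example-libs/mymodule.zip",
        "--extra-jars": "s3://example-libs/custom-connector.jar",
        "--extra-files": "s3://example-libs/settings.json",
    },
)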
Glue transformations are flexible and adaptive
A semi-structured schema (columns A, B, B, a struct C{X, Y}, and an array D[ ]) becomes a relational schema: a main table with columns A, B, B, C.X, C.Y, FK plus a pivoted table with PK, offset, and value.

• Flatten semi-structured objects with arbitrary complexity into relational tables, on-the-fly.
• Pivot arrays and other collection types into a separate table, generating key/foreign-key values.
• Modify mapping as the source schema changes, and modify the target schemas as needed.

88 Visit aws.training
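The flattening and pivoting described above is exposed as the Relationalize transform. A sketch, assuming a hypothetical nested catalog table and scratch path:

from awsglue.context import GlueContext
from awsglue.transforms import Relationalize
from pyspark.context import SparkContext

glue_context = GlueContext(SparkContext.getOrCreate())
dyf = glue_context.create_dynamic_frame.from_catalog(
    database="unicorn_nation", table_name="tickit_history")   # hypothetical nested table

# Flatten nested structs and pivot arrays into separate tables linked by
# generated key / foreign-key columns.
tables = Relationalize.apply(
    frame=dyf,
    staging_path="s3://example-temp/relationalize/",           # scratch space for the transform
    name="root",
    transformation_ctx="relationalized",
)
print(list(tables.keys()))          # "root" plus one table per pivoted array
root = tables.select("root")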
Glue transformations…

• Pre-built transformations: click and add to your job with simple configuration
• Spigot writes sample data from a DynamicFrame to S3 in JSON format
• Expanding… more transformations to come

89 Visit aws.training
ETL Job Progress and History

Track ETL job progress and inspect logs directly from the console.

Logs are written to CloudWatch for simple access. Errors are automatically extracted and
presented in the Error Logs for easy troubleshooting of jobs.

90 Visit aws.training
Orchestration & resource management
Fully managed, serverless job execution

91 Visit aws.training
Job Bookmarks…

Glue keeps track of data that has already been processed by a previous run of an ETL job. This persisted state information is called a bookmark, and it determines the job's behavior:

Option  | Behavior
Enable  | Pick up from where you left off
Disable | Ignore and process the entire dataset every time
Pause   | Temporarily disable continuation but keep the bookmark

For example, your ETL job might read new partitions in an Amazon S3 file. AWS Glue keeps track of which partitions have successfully been processed by the job to prevent duplicate processing and duplicate data in the job's target data store.

92 Visit aws.training
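The bookmark behavior is chosen per job run. A small boto3 sketch (the job name is hypothetical):

import boto3

glue = boto3.client("glue")

# Select the bookmark option for this run; the script itself must call
# job.init()/job.commit() for bookmark state to be recorded.
glue.start_job_run(
    JobName="sales-transform",
    Arguments={"--job-bookmark-option": "job-bookmark-enable"},  # or job-bookmark-disable / job-bookmark-pause
)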
Job composition and triggers

Compose jobs globally with event-based dependencies
• Easy to reuse and leverage work across organization boundaries

Multiple triggering mechanisms
• Schedule-based: e.g., time of day, weekly
• Event-based: e.g., data availability, job completion
• External sources: e.g., AWS Lambda

Example flow: event-based (Lambda), data-based, and weekly schedule-based triggers feed jobs such as "Marketing: Ad-spend by customer segment" and "Sales: Revenue by customer segment", which roll up into "Central: ROI by customer segment". (See the sketch below.)

93 Visit aws.training
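A hedged boto3 sketch of two of the trigger types mentioned above, using hypothetical job names:

import boto3

glue = boto3.client("glue")

# Schedule-based trigger: run the sales job nightly.
glue.create_trigger(
    Name="nightly-sales",
    Type="SCHEDULED",
    Schedule="cron(0 3 * * ? *)",
    Actions=[{"JobName": "sales-transform"}],
    StartOnCreation=True,
)

# Event-based trigger: run the ROI job only after the sales job succeeds.
glue.create_trigger(
    Name="roi-after-sales",
    Type="CONDITIONAL",
    Predicate={"Conditions": [{
        "LogicalOperator": "EQUALS",
        "JobName": "sales-transform",
        "State": "SUCCEEDED",
    }]},
    Actions=[{"JobName": "roi-by-segment"}],
    StartOnCreation=True,
)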
Serverless job execution

There is no need to provision, configure, or manage servers:
• Warm pools: pre-configured fleets of instances to reduce job start-up time
• Auto-configure VPC and role-based access
• Automatically scale resources to meet SLA and cost objectives
• You pay only for the resources you consume, while you consume them

A warm pool of instances serves jobs running against customer VPCs.

94 Visit aws.training
Developer Endpoints

• Environment to iteratively develop and test ETL scripts on Glue instances
• Develop your script in a notebook and point to an AWS Glue endpoint to test it
• When you are satisfied with the results of your development process, you can create an ETL job that runs your script

A development notebook in the customer VPC connects through an S3 endpoint to the Glue instances and to stores such as Amazon RDS and Amazon Redshift.

95 Visit aws.training


Glue ETL Security Model

• ETL jobs that do not require special handling can be launched without additional configuration

• For jobs requiring restricted access to data, Glue utilizes a VPC endpoint, launching dedicated ENIs into the customer's VPC; these are assigned private IP addresses from the customer's subnet

• For such jobs, Glue also enforces the use of an S3 VPC endpoint

96 Visit aws.training
AWS Lake Formation

97 Visit aws.training
Building a secure data lake
Typical steps to build a secure data lake
1. Set up storage
2. Move data
3. Cleanse, prepare, and catalog data
4. Configure and enforce security and compliance policies
5. Make data available for analytics

Roles involved: the data engineer (ingestion and cleaning), the data security officer (security), and the data analyst (analytics and machine learning).

98 Visit aws.training
Lake Formation for a secure data lake
1. Ingest and organize – automates creating the data lake and data ingestion.
2. Secure and control – sets up fine-grained access control and data governance. To protect data, all access is checked against the set policies.
3. Collaborate and use – search and data discovery using Data Catalog metadata.
4. Monitor and audit – based on data access and governance policies, alert notifications are raised on policy violation and logged.

99 Visit aws.training
AWS Lake Formation builds on AWS Glue
AWS Lake Formation adds blueprints, security, search, collaboration, and monitoring on top of AWS Glue: workflows, the AWS Glue Data Catalog (connections, databases, tables), AWS Glue ETL jobs, and AWS Glue crawlers.

100 Visit aws.training
AWS Lake Formation benefits

Analytics services (Amazon Athena, Amazon QuickSight, Amazon Redshift, Amazon SageMaker, and Amazon EMR) sit on top of AWS Lake Formation (blueprints, ML transforms, Data Catalog, access control), which in turn sits on Amazon S3 data lake storage.

• A comprehensive set of integrated tools enables every user equally.
• Centralized management of fine-grained permissions empowers security officers.
• Simplified ingest and cleaning enables data engineers to build faster.
• Cost-effective, durable storage includes global replication capabilities.

101 Visit aws.training
Lab 04

Transforming Data with AWS Glue

102
Lab 04: Transforming Data with AWS Glue

The IT team at UnicornNation is looking for a way to reduce their AWS spending for this project. After reviewing the file formats they are using, they have decided that for the Merchandise Sales data, they are going to change the file format from CSV to Parquet.

Parquet is a columnar, compressed format and will help them reduce the amount of data that is scanned when using Amazon Athena.

In this lab, you are going to create an S3 bucket for the new Parquet files and then create a Glue Job to convert the existing CSV data files to Parquet.

103 Visit aws.training
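For orientation, a minimal sketch of the kind of Glue job script Lab 04 builds, reading the catalogued CSV table and writing Parquet; the database, table, and bucket names are hypothetical.

import sys
from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glue_context = GlueContext(SparkContext.getOrCreate())
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

# Read the catalogued CSV table...
dyf = glue_context.create_dynamic_frame.from_catalog(
    database="unicorn_nation", table_name="merchandise_sales_csv")

# ...and write it back out as Parquet to a new bucket.
glue_context.write_dynamic_frame.from_options(
    frame=dyf,
    connection_type="s3",
    connection_options={"path": "s3://example-merchandise-sales-parquet/"},
    format="parquet",
)

job.commit()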
Q&A
Class Evaluation

HEADS UP! Please look for the email link to take the class evaluation survey.

105 Visit aws.training
Thank You

Vijay K
AWS Partner Trainer
vkkasibh@amazon.com
