AWS Partner: Building Data Lakes
on AWS
Module 0: Course overview
© 2022 Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Course objectives
Course prerequisites
Agenda
Module 1: Introduction to data lakes
Module 2: Data ingestion, cataloging, and preparation
• Lab 1: Set up a simple data lake
Module 3: Data processing and analytics
Module 4: Building a data lake with AWS Lake Formation
• Lab 2: Build a data lake using AWS Lake Formation
Module 5: AWS Lake Formation additional configurations
• Lab 3: Automate data lake creation using AWS Lake Formation blueprints
• Lab 4: Data visualization using Amazon QuickSight
Module 6: Course review and wrap-up
Introduce yourself
• Name
• What do you do day-to-day?
• What do you want to get out of this class?
• What is your experience level with AWS?
• Choose a language: Java, Python, or C#
• Something personal
Log in for access to lab environments
Lab requirements
Thank you
© 2022 Amazon Web Services, Inc. or its affiliates. All rights reserved. This work may not be reproduced or redistributed, in whole or in part, without prior written permission from Amazon
Web Services, Inc. Commercial copying, lending, or selling is prohibited. Corrections, feedback, or other questions? Contact us at https://support.aws.amazon.com/#/contacts/aws-training.
All trademarks are the property of their owners.
AWS Partner: Building Data Lakes
on AWS
Module 1: Introduction to data lakes
Module objectives
Topic A
Why data lakes?
Benefits of data lakes
Data Silos
Challenges with building data lakes
Customer case study:
Customer challenge
Needed to create a constant user feedback loop for game designers at very large scale
• Web-connected gaming platform
• Millions of monthly downloads
• Many millions of concurrent players
• Hundreds of millions registered users
• Available on multiple platforms
• Petabytes of daily player stats
What is a data lake?
A centralized repository for large amounts of structured and
unstructured data to enable direct analytics.
Topic B
Data lake or data warehouse?
What is a data warehouse?
• 3-tiered architecture
• Structured and relational data
• Schema-on-write
• PB (petabyte) scale
Diagram labels: Business intelligence, Data warehouse
Comparison
Characteristic | Data warehouse | Data lake
Price | Fastest query results using local storage | Cost-effective storage based on Amazon S3
Data quality | Highly curated data | Raw data, unstructured data, many formats
Consider this
• What types of data analytics are you using to position your data to
answer them?
Analytics functionality
Pipeline stages: Data sources, Ingestion, Data stores, Catalog and processing, Search and analytics, Visualization
Diagram labels: Databases, IoT, Streaming data, Mobile, Storage, Processes, Queries, Embedded analytics
AWS services build the data lake
Pipeline stages: Data sources, Ingestion, Data stores, Catalog and processing, Search and analytics, Visualization
Diagram labels: Databases, On Prem Databases, Cloud, Object storage, User stats, Mobile, Databases, AWS Snow Family**, Amazon S3, Amazon RDS, AWS Glue, Crawler, AWS Glue Data Catalog, Amazon Elasticsearch Service, Amazon Athena
Example: Event-driven workflow
Diagram labels: Data Sources (Mobile, IoT), Ingestion, Amazon S3, Amazon S3 (Raw data), S3 trigger, Amazon S3 (Processed data), Crawler, AWS Glue Data Catalog
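The event-driven flow on this slide (an S3 trigger on newly arrived raw data, followed by a crawler) could be sketched roughly as below. This is a minimal illustration, not the course's implementation: the `raw/` prefix, the crawler name `raw-data-crawler`, the bucket name, and the injected `start_crawler` callable are all hypothetical. In a real AWS Lambda function the callable would typically wrap the Glue client's `start_crawler` call; it is injected here so the sketch runs without AWS access.

```python
from typing import Callable, Dict, List
from urllib.parse import unquote_plus

RAW_PREFIX = "raw/"                # hypothetical prefix for raw data
CRAWLER_NAME = "raw-data-crawler"  # hypothetical crawler name


def extract_objects(event: Dict) -> List[Dict[str, str]]:
    """Return bucket/key pairs from an S3 event notification payload."""
    objects = []
    for record in event.get("Records", []):
        s3 = record["s3"]
        objects.append({
            "bucket": s3["bucket"]["name"],
            # Object keys arrive URL-encoded in S3 notifications.
            "key": unquote_plus(s3["object"]["key"]),
        })
    return objects


def handler(event: Dict, start_crawler: Callable[[str], None]) -> int:
    """Start the crawler once if the event contains new raw objects."""
    new_raw = [o for o in extract_objects(event) if o["key"].startswith(RAW_PREFIX)]
    if new_raw:
        start_crawler(CRAWLER_NAME)
    return len(new_raw)


# Hypothetical event with a single uploaded object.
sample_event = {
    "Records": [
        {"s3": {"bucket": {"name": "my-data-lake"},
                "object": {"key": "raw/mobile/events+1.json"}}}
    ]
}
started: List[str] = []
count = handler(sample_event, started.append)
```

Injecting the crawler call keeps the event parsing testable on its own, which is usually the part that breaks first (for example, on URL-encoded keys).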
Module review
Knowledge Check 1
What is the most cost-effective storage option for your data lake?
A. Amazon EBS
B. Amazon S3
C. Amazon RDS
D. Amazon Redshift
Knowledge Check 3
Which services are used in the processing layer of a data lake architecture?
(SELECT TWO)
A. AWS Snowball
B. AWS Glue
C. Amazon EMR
D. Amazon QuickSight
Thank You
AWS Partner: Building Data Lakes
on AWS
Module 2: Data ingestion, cataloging, and preparation
Lab 1
Module objectives
Topic A
Data lake storage
Pipeline stages: Data sources, Ingestion, Data stores, Catalog and processing, Search and analytics, Visualization
Diagram labels: Databases, Amazon S3
Amazon S3 – default data lake storage
Storage of datasets
Transformed data Merged and cleansed data Data engineers, data analysts
Data Value
Securing data lake storage
Consider this
Topic B
Data ingestion
Data ingestion
Pipeline stages: Data sources, Ingestion, Data stores, Catalog and processing, Search and analytics, Visualization
Data ingestion sources and methods
Transactional data ingestion
AWS Database Migration Service (AWS DMS)
• Low cost, ease of use, reliable
• Wide variety of DB import tools
• Database consolidation
• Continuous replication
• Change Data Capture (CDC)
AWS Schema Conversion Tool (AWS SCT)
• Converts source to target formats
• Schema, objects, views
• Procedures, functions, and code
Diagram (Migration to Storage): ORCL to Amazon RDS for Oracle through AWS DMS; MySQL to Amazon S3 through AWS DMS; ORCL to Amazon Aurora through AWS SCT and AWS DMS
Files and object ingestion
Diagram labels: Upload (SMB & NFS, Remote compute & storage), AWS Storage Gateway, AWS Direct Connect, AWS Snow Family**, Storage (Amazon S3 Glacier)
Streaming data ingestion
Diagram labels: Log data, Processing and analytics
Consider this
Topic B
Data catalog and processing
Pipeline stages: Data sources, Ingestion, Data stores, Catalog and processing, Search and analytics, Visualization
Diagram labels: Databases, IoT, Streaming data, Mobile, Storage, AWS Glue, AWS Glue Data Catalog, Queries, Embedded analytics
AWS Glue
• AWS Glue Data Catalog
• AWS Glue crawler
• AWS Glue ETL
• AWS Glue Studio
AWS Glue Data Catalog and crawlers
Diagram labels: Sources and ingestion (Databases, Objects, IoT, Mobile), Data stores, AWS Glue (Crawler, AWS Glue Data Catalog), Analytics (Amazon Athena)
AWS Glue Data Catalog: Metadata
Table information
• Name:
• Description:
• Database:
• Classification:
Location and format
• Location:
• Connection:
• Deprecated:
• Last updated:
• Input format:
• Output format:
• Serde serialization lib:
Table properties
• Table properties:
Data schema
Column name | Data Type
marketplace | string
customer_id | bigint
review_id | string
product_id | string
product_price | num
Partition information
Partition name | View
pc_electronics |
musical_instruments |
health_homecare |
Crawler data sources and classifiers
Access type | Data sources
Native client | Amazon S3, Amazon DynamoDB
JDBC | Amazon RDS, Amazon Aurora, MariaDB, Microsoft SQL Server, MySQL, Oracle, PostgreSQL
MongoDB | MongoDB, Amazon DocumentDB
Classifier type | Classifier | Classification string
Built-in | Apache Avro | avro
Built-in | Apache ORC | orc
Built-in | Apache Parquet | parquet
Built-in | JSON | json
Built-in | Binary JSON | bson
Built-in | XML | xml
Custom | |
Topic C
Raw data is not optimized for querying
Data formatting with AWS Glue
Diagram labels: Sources and ingestion (AWS Storage Gateway), AWS Glue workflow and jobs (Raw data, AWS Glue transformation job, Formatted data (Parquet), Crawler, AWS Glue Data Catalog), Analytics (Amazon Redshift, Amazon Athena), Visualization (Amazon QuickSight)
Formatting: Parquet format
Query: SELECT count(*) FROM line item
Compaction | Number of files | Run time (seconds)
Many files | 356 files | 8.4
Single file | 1 file | 2.31
3.6x faster than JSON format
Partitioning data stores
Hive-style partitions - Cost and performance optimization
• Amazon S3 prefix convention
s3://<bucket>/<prefix>/year=2018/month=01/day=20/
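The Hive-style prefix convention above can be generated from a date when writing objects. The following is a minimal sketch; the function name `partition_prefix` and the `logs` base prefix are hypothetical, chosen only for illustration.

```python
from datetime import date


def partition_prefix(base_prefix: str, day: date) -> str:
    """Build a Hive-style key prefix: <prefix>/year=YYYY/month=MM/day=DD/."""
    base = base_prefix.strip("/")
    # Zero-padded month and day keep prefixes sortable as plain strings.
    return f"{base}/year={day.year:04d}/month={day.month:02d}/day={day.day:02d}/"


prefix = partition_prefix("logs", date(2018, 1, 20))
print(prefix)
```

Query engines that understand this layout can skip whole prefixes when a query filters on `year`, `month`, or `day`, which is where the cost and performance benefit comes from.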
Amazon S3 bucketing
The crawler automatically identifies partitions based on Amazon S3 object key prefixes.
Object keys:
• /year=2018/month=01/day=20/data001.csv.gz
• /year=2018/month=01/day=20/data003.csv.gz
Bucket: service-logs/2
• /year=2018/month=01/day=21/data001.csv.gz
• /year=2018/month=01/day=21/data002.csv.gz
Bucket: service-logs/3
• /year=2018/month=01/day=22/data001.csv.gz
• /year=2018/month=01/day=22/data002.csv.gz
• /year=2018/month=01/day=22/data003.csv.gz
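To make the idea of identifying partitions from key prefixes concrete, here is a small sketch that pulls `name=value` segments out of object keys and groups files by partition. It illustrates the concept only and is not the crawler's actual algorithm; the helper name `parse_partitions` is hypothetical, and the keys are a subset of those listed on the slide.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def parse_partitions(key: str) -> Dict[str, str]:
    """Extract name=value path segments from an object key."""
    parts: Dict[str, str] = {}
    # The last segment is the file name, so it is skipped.
    for segment in key.strip("/").split("/")[:-1]:
        if "=" in segment:
            name, value = segment.split("=", 1)
            parts[name] = value
    return parts


keys = [
    "/year=2018/month=01/day=20/data001.csv.gz",
    "/year=2018/month=01/day=21/data001.csv.gz",
    "/year=2018/month=01/day=21/data002.csv.gz",
]

files_by_partition: Dict[Tuple[str, str, str], List[str]] = defaultdict(list)
for key in keys:
    p = parse_partitions(key)
    file_name = key.rsplit("/", 1)[-1]
    files_by_partition[(p["year"], p["month"], p["day"])].append(file_name)
```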
Compression
• Benefits
• Reduced storage requirements
• Reduced I/O reading data from storage
• Faster query processing
Consider this
What are some of the best practices discussed for the following?
• Formatting
• Partitioning
• Compression
Module review
Knowledge Check 1
Which services can be used for data ingestion into your data lake?
(SELECT TWO)
B. Amazon QuickSight
C. Amazon Athena
Thank You
AWS Partner: Building Data Lakes
on AWS
Module 3: Data processing and analytics
Module objectives
Topic A
Data preparation
Data processing and preparation
Pipeline stages: Data sources, Ingestion, Data stores, Catalog and processing, Search and analytics, Visualization
Diagram labels: Databases, IoT, Streaming Data, Mobile, Storage, AWS Glue DataBrew, Queries, Embedded analytics
Data cleansing and preparation
Merge and join datasets
[i.e., raw data bucket]
events: evt_id | date | city | team
event_results: evt_id | name | type | Score
evt_id values shown in both tables: dfg46sd5, 34fww54, iuu44tyur, q34qdwe1
Python: results = pd.merge(event_results, events[['evt_id', 'date', 'team']], on='evt_id')
results: evt_id | date | team | name | type | Score
evt_id values shown in results: dfg46sd5, 34fww54, iuu44tyur, q34qdwe1
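The merge on this slide is an inner join on `evt_id` that keeps only selected columns from `events`. The sketch below reproduces that logic in plain Python so it runs without pandas. The two `evt_id` values come from the slide; every other field value, the extra unmatched row, and the helper name `inner_join` are hypothetical. It also assumes `evt_id` is unique in `events`.

```python
from typing import Dict, List

events = [
    {"evt_id": "dfg46sd5", "date": "2021-01-05", "city": "Seattle", "team": "Red"},
    {"evt_id": "34fww54", "date": "2021-01-06", "city": "Austin", "team": "Blue"},
]
event_results = [
    {"evt_id": "dfg46sd5", "name": "Nikki", "type": "final", "Score": 12},
    {"evt_id": "34fww54", "name": "Liu", "type": "heat", "Score": 9},
    {"evt_id": "zzz00000", "name": "Unmatched", "type": "heat", "Score": 1},
]


def inner_join(left: List[Dict], right: List[Dict], key: str,
               right_columns: List[str]) -> List[Dict]:
    """Inner-join two row lists on `key`, copying only `right_columns`."""
    index = {row[key]: row for row in right}  # assumes unique keys on the right
    joined = []
    for row in left:
        match = index.get(row[key])
        if match is not None:  # rows without a match are dropped
            joined.append({**row, **{c: match[c] for c in right_columns}})
    return joined


results = inner_join(event_results, events, "evt_id", ["date", "team"])
```

The `city` column is left out of the result because it is not in the selected column list, mirroring the column subset passed to the merge on the slide.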
AWS Glue DataBrew
AWS Glue DataBrew console
Topic B
AWS Glue
Extract, transform, and load (ETL)
SQL
How do you develop AWS Glue scripts?
AWS Glue Studio
• Blank graph: add source, target, and transform activities
• Script generated from the graph
Creating jobs in AWS Glue Studio
Transform: ApplyMapping
sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args['JOB_NAME'], args)
DataSource0 = glueContext.create_dynamic_frame…
…
Creating jobs in the AWS Glue console
Configure job properties
• Name: demo_run
• Type: Spark, Spark Streaming, Python shell
• This job runs:
o A proposed script generated by AWS Glue
o An existing script that you provide
o A new script to be authored by you
Job: demo_run (Action, Save, Run Job, Generate diagram)
• Database Name: my_catalog
• Table Name: my_table_csv
• Transform Name: ApplyMapping
• Transform Name: SelectFields
• Tabs: Logs, Schema
Script preview (lines are cut off on the slide):
import sys
from awsglue.transforms impor
from awsglue.utils import get
from pyspark.context import
from awsglue.context import
from awsglue.job import job
…
Adding development endpoints
Steps: Properties, Networking, SSH public key, Review
Properties
• Development endpoint name: demo_run_endpoint
• IAM role: Glue_Admin_RAI
• Security configuration: None
• Worker type: Standard, G.1X, G.2X
• Data processing units (DPUs): 5
• Python library path: s3://bucket/prefix/object
• Dependent jars path: s3://bucket/prefix/object
Endpoint details
• Endpoint name: demo_run_endpoint
• Provisioning status: READY
• Data processing units (DPUs): 5
• Public address: ec2-XXXXXXXX.eu-west-1.com...
• Public key contents: ssh-rsa AAAXXXXXXXXXXXX...
• IAM role: AWSGlueServiceRole-incre…
• SSH to Python REPL: ssh -i <private-key.pem> glue...
• SSH to Scala REPL: ssh -i <private-key.pem> glue...
Example AWS Glue jobs in Python
# Joining tables (orgs, persons, and memberships are DynamicFrames loaded earlier):
from awsglue.transforms import Join
l_history = Join.apply(orgs,
                       Join.apply(persons, memberships, 'id', 'person_id'),
                       'org_id', 'organization_id').drop_fields(['person_id', 'org_id'])
print("Count: ", l_history.count())
l_history.printSchema()
Running AWS Glue jobs
Diagram labels: AWS Glue console, Job script, AWS Glue triggers
• IAM role
• Type: Spark or Python
• Glue version
• Source inputs
• Transforms
• Target outputs
• Job bookmarks
Running AWS Glue jobs
AWS Glue workflows
Fix phone numbers
Topic C
Keeping private data private
Cleansed data (i.e., admin access)
FirstName | LastName | Phone | Email | TaxID
Nikki | Wolf | (123) 555-0190 | nikkiwolf@gmail.com | 987-12-3241
María | García | (198) 555-0121 | mariagarcia@yahoo.com | 555-23-9812
Liu | Jie | (112) 555-0155 | liujie@hotmail.com | 912-43-8965
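One common way to keep columns such as Email and TaxID private for non-admin readers is to mask them before sharing the dataset. The sketch below is an illustration under stated assumptions, not a method taken from the course: the masking rules, the regular expression for the NNN-NN-NNNN pattern, and the function names `mask_tax_id` and `mask_email` are all hypothetical. The record is the first row of the slide's sample table.

```python
import re

# Matches identifiers shaped like 987-12-3241 (assumed format).
TAX_ID = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")


def mask_tax_id(value: str) -> str:
    """Hide all but the last four digits of a NNN-NN-NNNN identifier."""
    return TAX_ID.sub(lambda m: "***-**-" + m.group(0)[-4:], value)


def mask_email(value: str) -> str:
    """Keep the first character of the local part and the full domain."""
    local, _, domain = value.partition("@")
    return (local[:1] + "***@" + domain) if domain else value


record = {
    "FirstName": "Nikki",
    "LastName": "Wolf",
    "Email": "nikkiwolf@gmail.com",
    "TaxID": "987-12-3241",
}
masked = {
    **record,
    "Email": mask_email(record["Email"]),
    "TaxID": mask_tax_id(record["TaxID"]),
}
```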
Using Amazon Macie in data lakes
Custom data identifiers
• Keywords
• <REGEX>
Suppress findings
Review manually
Using roles to define data access
Diagram labels: Databases, IoT, Streaming data, Mobile, Storage, Processes, Queries, Embedded analytics
Monitoring a data lake
Pipeline stages: Data sources, Ingestion, Data stores, Catalog and processing, Search and analytics, Visualization
Diagram labels: Amazon CloudWatch, AWS CloudTrail
Optimizing a data lake
Pipeline stages: Data sources, Ingestion, Data stores, Catalog and processing, Search and analytics, Visualization
Optimizing Amazon S3
• Use Amazon S3 Glacier for infrequently accessed data.
• Follow Amazon S3 best practices for asset naming.
• Transform data into columnar, compressed file formats.
Consider this
Topic D
Amazon Athena
Athena Federated Query
• Query Amazon S3 directly
• Use AWS Lambda serverless compute
• Run query across many data sources
• On-premises or cloud data sources
• Ad-hoc queries on complex data
Diagram labels: Amazon Athena, Amazon S3, On Premises, Amazon ElastiCache for Redis, HBase in Amazon EMR, Amazon DocumentDB, Amazon DynamoDB, Amazon Redshift
Metadata databases and tables in Athena
• Tables and databases within Athena (to run queries) are based on metadata
• Select from AWS Glue Data Catalog or register new datasets with Athena
• Athena uses metadata to process Structured Query Language (SQL)
covid19-db
states_daily_csv
CREATE EXTERNAL TABLE `states_daily_csv`( ... )
ROW FORMAT ...
STORED AS INPUTFORMAT ...
OUTPUTFORMAT ...
LOCATION 's3://bucketname/folder/'
Running queries in Athena
Results
State Positive Date
CA 3039044 20210121
TX 2188643 20210121
FL 1584442 20210121
NY 1285337 20210121
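The slide shows only the results, not the query that produced them. As a rough illustration, a query of the following shape would return rows in that order. This sketch uses Python's built-in `sqlite3` as a local stand-in for Athena, so it shows the SQL only, not Athena itself; the column names `state`, `positive`, and `date` are assumed from the result headers, and the query text is hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE states_daily_csv (state TEXT, positive INTEGER, date INTEGER)"
)
# Rows copied from the results table on the slide, inserted out of order.
conn.executemany(
    "INSERT INTO states_daily_csv VALUES (?, ?, ?)",
    [
        ("NY", 1285337, 20210121),
        ("CA", 3039044, 20210121),
        ("FL", 1584442, 20210121),
        ("TX", 2188643, 20210121),
    ],
)

# Hypothetical query: highest positive counts for one day.
query = """
SELECT state, positive, date
FROM states_daily_csv
WHERE date = 20210121
ORDER BY positive DESC
LIMIT 4
"""
rows = conn.execute(query).fetchall()
for row in rows:
    print(row)
```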
Module review
Knowledge Check 1
During your data preparation stage, the raw data has been enriched to support additional
insights. You need to improve query performance and reduce costs of the final analytics
solution.
A. CSV
B. JSON
C. Apache Parquet
D. Apache ORC
Knowledge Check 2
Your small start-up company is developing a data analytics solution. You need to clean and
normalize large datasets, but you do not have developers with the skill set to write custom
scripts.
Which tool will help you efficiently design and run the data preparation activities?
Knowledge Check 3
A. Analyze data in real time as data comes into the data lake
B. Transform data in real time as data comes into the data lake
C. Analyze data in batches on schedule or on demand
D. Transform data in batches on schedule or on demand
Knowledge Check 4
Your data resides in multiple data stores, including Amazon S3, Amazon RDS, and Amazon
DynamoDB. You need to efficiently query the combined datasets.
Which tool can achieve this, using a single query, without moving data?
AWS Partner: Building Data Lakes
on AWS
Module 4: Building a data lake with
AWS Lake Formation
Lab 2
Module objectives
Topic A
Consider this
Building a data lake, manually
Pipeline stages: Data sources, Ingestion, Data stores, Catalog and processing, Search and analytics, Visualization
Diagram labels: Databases, IoT, Streaming data, Mobile, Storage, Processes, Queries, Embedded analytics
AWS Lake Formation
Pipeline stages: Data sources, Ingestion, Data stores, Catalog and processing, Search and analytics, Visualization
AWS Lake Formation spans Ingestion, Data stores, and Catalog and processing.
Diagram labels: Databases, IoT, Mobile, Blueprints, Amazon S3, Security, Queries, Embedded analytics
Topic B
Three stages of Lake Formation
The AWS Lake Formation dashboard
AWS Lake Formation: Stage 1
Register data lake storage locations
• Specify an IAM role with read and write access
AWS Lake Formation: Stage 2
Create a database in the data lake’s Data Catalog
• Can select path within registered data lake storage to simplify location permissions
Note: Permissions are role/persona based. To perform specific tasks, the data lake administrator user must log in with the correct credentials/permissions.
AWS Lake Formation: Stage 3
Topic C
Data lake security with Lake Formation
Security roles in Lake Formation
Access control methods
Consider this
What permissions apply to these user scenarios? (Columns: Create table, Read)
1 IAM user designated as Lake Formation data lake administrator
3 IAM user in another account that has access to query the Lake Formation catalog
4 Data lake administrator of another account that has been granted access to create tables in this account
5 Member of Amazon QuickSight group that has been configured for federated access
Module review
Knowledge Check 1
Which benefit do you achieve by using AWS Lake Formation to build data
lakes?
Knowledge Check 2
What are the three stages to set up a data lake using AWS Lake Formation?
(SELECT THREE)
B. Create a database
D. Grant permissions
Knowledge Check 3
How does AWS Lake Formation relate to the AWS Glue service? (SELECT
THREE)
B. Job monitoring
Thank You
AWS Partner: Building Data Lakes
on AWS
Module 5: AWS Lake Formation
additional configurations
Labs 3 and 4
Module objectives
Topic A
Consider this
How can we automate and optimize processing jobs and
scripts within AWS Lake Formation?
Blueprints build on AWS Glue
Diagram labels: Blueprints, Workflow, Monitoring, AWS Glue Data Catalog
Creating a workflow
1. Choose blueprint type
• Database snapshot
• Incremental database
• AWS CloudTrail
• Classic Load Balancer logs
• Application Load Balancer logs
2. Configure source and target
• Source connection and data path: <database>/<schema>/<table>
• Target database in the Data Catalog
• Storage location: <s3://bucket/prefix/>
• Format: Parquet/CSV
3. Configure the workflow
• Workflow name
• IAM Role
• Table prefix
• Max capacity: DPUs (Data Processing Units)
• Concurrency: Default 5
Topic B
Access control: Permission checks
Principals: IAM users, roles, Active Directory users
Services: Amazon Athena, Amazon Redshift, Amazon EMR
1. Principal sends a query for “Table1” to a service
2. Service requests access to “Table1” from AWS Lake Formation
3. AWS Lake Formation returns short-term access to “Table1”
4. Service requests object “Table1” (transformed dataset) from Amazon S3
5. Amazon S3 returns object “Table1”
Services authenticate, then retrieve resources directly from Amazon S3.
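The numbered permission check on this slide can be modeled with a toy in-memory simulation. Everything below is hypothetical and exists only to show the order of the steps: the class `ToyLakeFormation`, the function `run_query`, the principal names, and the token scheme are not the real service API, and real short-term credentials are not simple tokens like these.

```python
import secrets
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Set, Tuple


@dataclass
class ToyLakeFormation:
    """Toy stand-in for the permission check, not the real service."""
    grants: Set[Tuple[str, str]] = field(default_factory=set)  # (principal, table)
    issued: Dict[str, str] = field(default_factory=dict)       # token -> table

    def request_access(self, principal: str, table: str) -> Optional[str]:
        # Steps 2 and 3: check the grant, then hand back short-term access.
        if (principal, table) not in self.grants:
            return None
        token = secrets.token_hex(8)
        self.issued[token] = table
        return token


def run_query(principal: str, table: str, lf: ToyLakeFormation,
              storage: Dict[str, List[dict]]) -> List[dict]:
    # Step 1: the principal's query reaches the service (this function).
    token = lf.request_access(principal, table)
    if token is None:
        raise PermissionError(f"{principal} has no grant on {table}")
    # Steps 4 and 5: use the short-term access to read the object directly.
    if lf.issued.get(token) != table:
        raise PermissionError("token does not cover this table")
    return storage[table]


lf = ToyLakeFormation(grants={("analyst", "Table1")})
storage = {"Table1": [{"id": 1}, {"id": 2}]}
rows = run_query("analyst", "Table1", lf, storage)
```

The point of the model is that the service never holds standing access to the data: it obtains access per request, and a principal without a grant is stopped before storage is touched.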
AWS Lake Formation personas
IAM administrator
• Superuser
• Create IAM users and roles
• Have the AdministratorAccess AWS managed policy
Data lake administrator
• Register Amazon S3 locations
• Access the Data Catalog
• Create databases
• Create and run workflows
Data engineer
• Create and run crawlers and workflows
• Grant permissions on the assets they create
Data analyst
• Run queries against the data lake
Workflow role
• Run a workflow on behalf of a user
Security deep dive
Fine-grained access control
Data lake access control
Metadata Data
Consider this
Think about how these personas align with the user
community in your organization.
Lab 3
Automate data lake creation using AWS Lake Formation blueprints
• Automate the data lake setup process with AWS Lake Formation blueprints
Topic C
FindMatches: an ML transformation
• Matching Customers
Linking customer records across different customer databases
• Matching Products
Matching products in your catalog against other product sources
• Improving Fraud Detection
New customer accounts from a previously known fraudulent user
Training the ML transform
Columns: New columns (labeling_set_id = labeling set, label = label), Sample dataset (first_name, last_name, Birthday), match_id
labeling_set_id | label | first_name | last_name | Birthday | match_id
ABC123 | A | John | Doe | 04/01/1980 | 1
ABC123 | B | Jane | Smith | 04/03/1980 | 2
ABC123 | A | Nikki | Wolf | 04/01/1980 | 1
ABC123 | A | Akua | Mansa | 04/01/1980 | 1
DEF345 | A | Richard | Roe | 12/11/1992 | 3
DEF345 | A | Shirley | Rodriguez | 11/12/1992 | 3
DEF345 | B | Paulo | Santos | 12/11/1992 | 4
DEF345 | C | Carlos | Salazar | 05/06/2017 | 5
DEF345 | B | Wang | Xiulan | 12/11/1992 | 4
GHI678 | A | María | García | 1/3/1999 | 6
GHI678 | A | Martha | Rivera | 1/3/1999 | 6
XYZABC | A | Efu | Owusu | 2/5/2001 | 7
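In the labeling table, rows that share both a `labeling_set_id` and a `label` are treated as matching records. The sketch below groups the slide's sample rows that way and numbers the groups in order of first appearance, which happens to reproduce the `match_id` column shown. The sequential numbering is an assumption made for illustration; it is not a claim about how the ML transform assigns identifiers.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# (labeling_set_id, label, first_name, last_name) from the sample dataset.
rows = [
    ("ABC123", "A", "John", "Doe"),
    ("ABC123", "B", "Jane", "Smith"),
    ("ABC123", "A", "Nikki", "Wolf"),
    ("ABC123", "A", "Akua", "Mansa"),
    ("DEF345", "A", "Richard", "Roe"),
    ("DEF345", "A", "Shirley", "Rodriguez"),
    ("DEF345", "B", "Paulo", "Santos"),
    ("DEF345", "C", "Carlos", "Salazar"),
    ("DEF345", "B", "Wang", "Xiulan"),
    ("GHI678", "A", "María", "García"),
    ("GHI678", "A", "Martha", "Rivera"),
    ("XYZABC", "A", "Efu", "Owusu"),
]

# Rows with the same labeling set and label form one match group.
groups: Dict[Tuple[str, str], List[str]] = defaultdict(list)
for labeling_set_id, label, first_name, last_name in rows:
    groups[(labeling_set_id, label)].append(f"{first_name} {last_name}")

# Number the groups in order of first appearance (dicts keep insertion order).
match_ids = {group_key: i for i, group_key in enumerate(groups, start=1)}
```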
Topic D
Visual insights from the data lake
[Diagram: data lake stages shown are data sources (for example, databases), ingestion, data stores, catalog and processing, search and analytics, and visualization.]
What is Amazon QuickSight?
Accessing data with QuickSight
SPICE
Super-fast, Parallel, In-memory Calculation Engine
• Faster processing
• Reduced wait time vs. direct queries
• Reduced cost through reuse
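Data held in SPICE is refreshed by running an ingestion on the dataset. The sketch below, under the assumption that a QuickSight dataset already exists, starts such a refresh through the QuickSight `create_ingestion` API. The function name, account ID, and dataset ID are hypothetical, and the client is injected so the sketch runs without AWS credentials.

```python
import uuid
from typing import Any, Tuple


def start_spice_refresh(
    quicksight_client: Any, account_id: str, dataset_id: str
) -> Tuple[str, str]:
    """Start a SPICE ingestion (refresh) for one dataset.

    quicksight_client is assumed to be boto3.client("quicksight").
    Returns the ingestion ID and the initial ingestion status.
    """
    ingestion_id = str(uuid.uuid4())  # caller-chosen unique ID for this refresh
    response = quicksight_client.create_ingestion(
        AwsAccountId=account_id,
        DataSetId=dataset_id,
        IngestionId=ingestion_id,
    )
    return response["IngestionId"], response["IngestionStatus"]
```

Dashboards that read from SPICE query the in-memory copy instead of the source, so a refresh like this is typically scheduled rather than run per query.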
Embedded analytics
1. Create a dashboard
2. Apply permissions
3. Authenticate your app server
4. Embed via JavaScript SDK
Desktop or mobile
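Step 3 above can be sketched on the app server side as follows. After the server has authenticated the user, it asks QuickSight for a short-lived embed URL and hands that URL to the browser, where the JavaScript SDK loads it. This is a minimal sketch assuming a registered QuickSight user; the function name and all argument values are hypothetical, and the client is injected so the sketch runs without AWS credentials.

```python
from typing import Any


def get_dashboard_embed_url(
    quicksight_client: Any,
    account_id: str,
    user_arn: str,
    dashboard_id: str,
    session_minutes: int = 60,
) -> str:
    """Return a short-lived embed URL for one dashboard and one registered user.

    quicksight_client is assumed to be boto3.client("quicksight").
    """
    response = quicksight_client.generate_embed_url_for_registered_user(
        AwsAccountId=account_id,
        UserArn=user_arn,
        SessionLifetimeInMinutes=session_minutes,
        ExperienceConfiguration={
            "Dashboard": {"InitialDashboardId": dashboard_id}
        },
    )
    return response["EmbedUrl"]
```

Generating the URL on the server keeps AWS credentials out of the browser; the page only ever sees the returned URL.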
Module review
Knowledge Check 1
AWS Lake Formation has a set of suggested personas and IAM permissions.
Which is a required persona?
B. Data engineer
C. Data analyst
D. Business analyst
Knowledge Check 2
Which three types of blueprints does AWS Lake Formation support? (SELECT
THREE)
B. Database snapshot
C. Incremental database
Thank You
AWS Partner: Building Data Lakes
on AWS
Module 6: Course review and wrap-up
What did this course cover?
• How to plan and design a data lake using effective data lake methodologies
• The components and services required to build a data lake on AWS
• How to secure a data lake on AWS using appropriate permissions
• The ways data can be ingested, stored, and transformed in a data lake on AWS
• How to analyze and visualize data stored in a data lake on AWS
• How to automate the deployment of a data lake with AWS Lake Formation
Serverless data lake reference architecture
Data source: Streaming data
Ingestion: Amazon Kinesis Data Firehose
AWS Lake Formation (data lake storage, cataloging, and processing): Landing data → ETL (validate, clean) → Raw data → ETL (normalize, enrich) → Curated data
Consumption: Amazon QuickSight
• Location, location, location!
• Partition for performance!
• Size files for speed!
• Continue compacting!
• The details are in the deltas!
• Align keys with queries!
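One common way to act on "Partition for performance" and "Align keys with queries" is to write objects under Hive-style date partitions, so that queries filtering on date read only the matching prefixes. The sketch below builds such an object key. The function name, prefix, table name, and file name are hypothetical, and the year/month/day layout is an assumed convention rather than one prescribed by the course.

```python
from datetime import date


def partitioned_key(prefix: str, table: str, event_date: date, filename: str) -> str:
    """Build an object key with Hive-style year/month/day partitions."""
    return (
        f"{prefix}/{table}/"
        f"year={event_date:%Y}/month={event_date:%m}/day={event_date:%d}/"
        f"{filename}"
    )


if __name__ == "__main__":
    print(partitioned_key("curated", "orders", date(2022, 3, 7), "part-0000.parquet"))
    # prints curated/orders/year=2022/month=03/day=07/part-0000.parquet
```

The partition columns should mirror the filters that queries actually use; partitioning by date helps little if most queries filter by customer or region instead.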
Learn more…
AWS Skill Builder online learning center
Get started
Use case challenges Exam preparation https://aws.amazon.com/training/digital
Don’t miss these learning opportunities
• Learn with hundreds of free, self-paced digital courses on AWS fundamentals.
• Deepen your technical skills and learn from an accredited AWS instructor.
• Validate your expertise with an industry-recognized credential.
AWS certification
Role-based certifications align to the following roles and levels:
• Roles: Architect, Operations, Developer
• Levels: Professional, Associate, Foundational (Cloud Practitioner)
Specialty certifications align to domain expertise in the following areas:
Course feedback
Thanks for participating!
Corrections, feedback, or other questions?
Contact us at https://support.aws.amazon.com/#/contacts/aws-training.
All trademarks are the property of their owners.