
AWS Partner: Building Data Lakes on AWS
Module 0: Course overview

© 2022 Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Course objectives

In this course, you learn how to:


• Apply data lake methodologies in planning and designing a data lake
• Articulate the components and services required for building an AWS data lake
• Describe how to secure a data lake with appropriate permissions
• Describe how data is ingested, stored, and transformed in a data lake
• Describe how to analyze and visualize data within a data lake

Course prerequisites

We recommend that attendees of this course have previously completed the following AWS courses:

• Data Analytics Fundamentals digital training
• AWS Technical Essentials classroom training

Agenda
Module 1: Introduction to data lakes
Module 2: Data ingestion, cataloging, and preparation
• Lab 1: Set up a simple data lake
Module 3: Data processing and analytics
Module 4: Building a data lake with AWS Lake Formation
• Lab 2: Build a data lake using AWS Lake Formation
Module 5: AWS Lake Formation additional configurations
• Lab 3: Automate data lake creation using AWS Lake Formation blueprints
• Lab 4: Data visualization using Amazon QuickSight
Module 6: Course review and wrap-up

Introduce yourself

• Name
• What do you do day-to-day?
• What do you want to get out of this class?
• What is your experience level with AWS?
• Choose a language: Java, Python, or C#
• Something personal

Log in for access to lab environments

Sign in to AWS Builder Labs.


• Your instructor will provide an access URL.
• Select AWS Partner to use your Partner login.

Lab requirements

• Computer running:
  • Windows
  • macOS
  • Linux: Ubuntu, SUSE, or Red Hat
• Recommended web browser:
  • Google Chrome
  • Mozilla Firefox
  • Microsoft Edge
• Reliable internet connection able to browse the internet using HTTPS
• Register for AWS Builder Labs
• Turn off ad and script blockers

Thank you

© 2022 Amazon Web Services, Inc. or its affiliates. All rights reserved. This work may not be reproduced or redistributed, in whole or in part, without prior written permission from Amazon
Web Services, Inc. Commercial copying, lending, or selling is prohibited. Corrections, feedback, or other questions? Contact us at https://support.aws.amazon.com/#/contacts/aws-training.
All trademarks are the property of their owners.
Module 1: Introduction to data lakes

Module objectives

• Describe the benefits and challenges of data lakes


• Compare data lakes and data warehouses
• Describe the components and architectures of data lakes

Topic A

Why data lakes?

Why data lakes?

Exponential data growth:
• 100 zettabytes by 2022 (100 followed by 21 zeroes)

Unstructured data:
• Text
• Video
• Images
• Audio

Need analytics faster:
• Days
• Hours
• Real time

Benefits of data lakes

(Diagram: data silos)

• Democratization of data access and analysis
• Consistent, scalable infrastructure
• Real-time analytics
• Faster, more efficient insights
• Maximum data value

Analyze more data from more sources in less time.

Challenges with building data lakes

Technical
• Data velocity
• Data scale
• Data formats
• Development tools

Operational
• Data distribution
• Data availability timelines
• Security approach

Data organization
• Cataloging mechanism
• Search

Raw data is often stored with no oversight of the contents.

Customer case study: Customer challenge

Needed to create a constant user feedback loop for game designers at very large scale.

• Web-connected gaming platform
• Millions of monthly downloads
• Many millions of concurrent players
• Hundreds of millions of registered users
• Available on multiple platforms
• Petabytes of daily player stats

What is a data lake?
A centralized repository for large amounts of structured and unstructured data to enable direct analytics.

AWS data lake functions:
• Ingest and store
• Secure and protect
• Catalog and search
• Analytics and insights

Topic B

Data lakes compared to data warehouses

Data lake or data warehouse?

Data lake Data warehouse

What is a data warehouse?

• 3-tiered architecture
• Structured and relational data
• Schema-on-write
• PB (petabyte) scale
• Traditionally, compute and storage are tightly coupled
• Optimized for analytic queries

Diagram: OLTP, ERP, CRM, and LOB sources feed the data warehouse, which serves business intelligence.

Comparison

Data
• Data warehouse: Relational, structured data (databases)
• Data lake: Non-relational (object) and relational data

Schema
• Data warehouse: Schema-on-write
• Data lake: Schema-on-read

Price
• Data warehouse: Fastest query results using local storage
• Data lake: Cost-effective storage based on Amazon S3

Performance
• Data warehouse: Faster query results: table structure
• Data lake: Indexes many sources and formats (less performant)

Data quality
• Data warehouse: Highly curated data
• Data lake: Raw data, unstructured data, many formats

Users
• Data warehouse: Business analysts
• Data lake: Data scientists, data analysts, and business analysts

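The schema row of the comparison can be illustrated with a minimal sketch of schema-on-read. Raw JSON lines are stored untouched, and a schema is applied only when the data is read. The field names, sample records, and the `read_with_schema` helper are hypothetical and chosen only for illustration.

```python
import json

# Raw records land in the lake unmodified; no schema is enforced on write.
RAW_LINES = [
    '{"player_id": "42", "score": "1300", "platform": "mobile"}',
    '{"player_id": "43", "score": "n/a"}',
]

# Hypothetical read-time schema: column name -> converter.
SCHEMA = {"player_id": int, "score": int}


def read_with_schema(lines, schema):
    """Apply the schema while reading; missing or unparseable values become None."""
    rows = []
    for line in lines:
        record = json.loads(line)
        row = {}
        for column, convert in schema.items():
            try:
                row[column] = convert(record[column])
            except (KeyError, ValueError):
                row[column] = None
        rows.append(row)
    return rows


rows = read_with_schema(RAW_LINES, SCHEMA)
print(rows)
```

A schema-on-write system would instead reject or coerce the second record at load time, before it is stored.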
Topic C

Components and architectures

Consider this

• In your organization, what are the most important questions your data needs to answer?
• What types of data analytics are you using to position your data to answer them?

Analytics functionality
Data lake layers, from source to consumer:
• Data sources: databases, objects, IoT, mobile
• Ingestion: database import, streaming data
• Data stores: databases, storage
• Catalog and processing: Data Catalog, processes
• Search and analytics: search, queries
• Visualization: interactive dashboards, embedded analytics
• Security and monitoring: spans all layers

AWS services build the data lake
AWS services by data lake layer:
• Data sources: on-premises databases, cloud databases, mobile, IoT, object storage
• Ingestion: AWS DMS, Amazon Kinesis, AWS Snow Family
• Data stores: Amazon S3, Amazon Redshift, Amazon RDS
• Catalog and processing: AWS Glue, Amazon EMR, AWS Glue Data Catalog
• Search and analytics: Amazon Athena, Amazon Elasticsearch Service
• Visualization: Amazon QuickSight, third party
• Security and monitoring: spans all layers

Example: Basic data lake architecture

Diagram (AWS data lake): data sources (user stats, mobile, databases) land as raw data in Amazon S3. AWS Glue writes transformed data to a second Amazon S3 location. Crawlers on both locations populate the Data Catalog, and Amazon Athena queries the data for analytics.

Example: Event-driven workflow

Diagram: data sources (databases, mobile, IoT) are ingested into Amazon S3 as raw data. An S3 trigger invokes AWS Lambda, which triggers an AWS Glue job that converts CSV to Parquet and writes processed data to Amazon S3. A crawler updates the AWS Glue Data Catalog, and an Amazon EventBridge event rule invokes AWS Lambda to refresh the Amazon QuickSight dataset.

Module review

Knowledge Check 1

A data lake is a centralized repository that enables which operation?

A. Store unstructured data from a single data source

B. Store structured data from any data source

C. Store structured and unstructured data from any source

D. Store structured and unstructured data from a single source


Knowledge Check 2

What is the most cost-effective storage option for your data lake?

A. Amazon EBS

B. Amazon S3

C. Amazon RDS

D. Amazon Redshift
Knowledge Check 3

Which services are used in the processing layer of a data lake architecture?
(SELECT TWO)

A. AWS Snowball

B. AWS Glue

C. Amazon EMR

D. Amazon QuickSight
Thank You

Module 2: Data ingestion, cataloging, and preparation
Lab 1

Module objectives

• Describe the relationship between storage and data ingestion
• Describe AWS Glue crawlers and how they are used to create a data catalog
• Identify data formatting, partitioning, and compression for efficient storage and query

Topic A

Data lake storage

Data lake storage
Diagram: the data lake layers, with the data stores layer highlighted: Amazon S3, Amazon Redshift, and Amazon RDS.

Amazon S3 – default data lake storage

• Decouple storage from compute
• Exabyte-scale object storage
• Provides 99.999999999% durability
• Strong read-after-write consistency
• Cost-optimized, centralized data architecture
• Standardized APIs, serverless architecture

Diagram: compute (AWS Glue, AWS Lambda) is separated from storage (Amazon S3, AWS databases).

Storage of datasets

Data value increases from raw data to published data.

Published data
• Description: Closely governed data. Owned, managed, and maintained datasets
• Audience: Broadly available

Transformed data
• Description: Merged and cleansed data
• Audience: Data engineers, data analysts

Formatted data
• Description: Standard optimized form (e.g., Parquet)
• Audience: Analysts and data scientists

Raw data
• Description: Unmodified, raw source data in its original format
• Audience: Operations admin, infrastructure engineers, data scientists

Securing data lake storage

Encryption in transit
• HTTPS/TLS

Encryption at rest: server-side encryption
• SSE-S3 (Amazon S3 managed keys)
• SSE-KMS (AWS Key Management Service)
• SSE-C (customer-provided keys)

Encryption at rest: client-side encryption
• Encrypt with the AWS Encryption SDK

Consider this

• What would be the advantages of separating storage from compute resources?
• How could you take advantage of this in your own organization?

Topic B

Data ingestion

Data ingestion
Diagram: the data lake layers, with the ingestion layer highlighted: AWS DMS, Amazon Kinesis, and AWS Snow Family.

Data ingestion sources and methods

Transactional data Files and objects Streaming data


• Database reads/writes • Log, sensor geo data • Mobile data, gaming data
• App server/web server • Pictures / video • AWS IoT streams
• Flume, Log4j • Fluentd, Sqoop, Storm

Transactional data ingestion
AWS Database Migration Service (AWS DMS)
• Low cost, ease of use, reliable
• Wide variety of DB import tools
• Database consolidation
• Continuous replication
• Change Data Capture (CDC)

AWS Schema Conversion Tool (AWS SCT)
• Converts source to target formats
• Schema, objects, views
• Procedures, functions, and code

Diagram (migration to storage):
• Oracle to Amazon RDS for Oracle through AWS DMS
• MySQL to Amazon S3 through AWS DMS
• Oracle to Amazon Aurora through AWS SCT and AWS DMS

Files and object ingestion
Diagram: applications and interfaces upload over SMB and NFS through AWS Storage Gateway or over AWS Direct Connect, and remote compute and storage use the AWS Snow Family. The storage targets are Amazon S3 and Amazon S3 Glacier.

Streaming data ingestion
Diagram: log data from Amazon EC2 and Amazon CloudWatch flows into Amazon Kinesis Data Firehose, with AWS Lambda used for data processing. The data is delivered to Amazon S3 as data lake storage and to Amazon Kinesis Data Analytics for processing and analytics.

Consider this

• What types of data sources will you need to ingest into the data lake?

Topic B

Crawl and catalog data

Data catalog and processing
Diagram: the data lake layers, with the catalog and processing layer highlighted: AWS Glue, the crawler, and the AWS Glue Data Catalog.

AWS Glue

Serverless data preparation service: discover, prepare, and combine data for analytics, machine learning, and application development.

Components, implementation, and feature:
• AWS Glue Data Catalog
• AWS Glue crawler
• AWS Glue ETL
• AWS Glue Studio

AWS Glue Data Catalog and crawlers
Diagram: sources and ingestion (databases, objects, IoT, mobile) feed the data stores. An AWS Glue crawler scans the data stores and populates the AWS Glue Data Catalog, which Amazon Athena uses for analytics.

AWS Glue Data Catalog: Metadata
Table metadata fields (callouts: table information, location and format, table properties):
• Name
• Description
• Database
• Classification
• Location
• Connection
• Deprecated
• Last updated
• Input format
• Output format
• Serde serialization lib
• Table properties

Data schema:
Column name      Data type
marketplace      string
customer_id      bigint
review_id        string
product_id       string
product_price    num

Partition information (partition name):
• pc_electronics
• musical_instruments
• health_homecare

Crawler data sources and classifiers
Data sources by access type
• Native client: Amazon S3, Amazon DynamoDB
• JDBC: Amazon RDS, Amazon Aurora, MariaDB, Microsoft SQL Server, MySQL, Oracle, PostgreSQL
• MongoDB: MongoDB, Amazon DocumentDB

Built-in classifiers (classifier type: classification string)
• Apache Avro: avro
• Apache ORC: orc
• Apache Parquet: parquet
• JSON: json
• Binary JSON: bson
• XML: xml

Custom classifiers

Topic C

Data formatting, partitioning, and compression

Raw data is not optimized for querying

Best practice is to optimize for:
• Formatting: optimal file storage format
• Partitioning: dividing large datasets into manageable file sizes
• Compression: optimizing file storage size vs. performance

Data formatting with AWS Glue
Diagram: sources and ingestion (AWS Storage Gateway) deliver raw data. An AWS Glue transformation job, run within an AWS Glue workflow, writes formatted data (Parquet). A crawler updates the AWS Glue Data Catalog. Amazon Redshift and Amazon Athena provide analytics, and Amazon QuickSight provides visualization.

Formatting: Parquet format

Compaction: SELECT count(*) FROM lineitem
Layout        Number of files   Run time (seconds)
Many files    356               8.4
Single file   1                 2.31
Result: 3.6x faster than JSON format

Conversion: SELECT count(*) FROM events
Format             Number of files   Size (GB)   Run time (seconds)
JSON (compacted)   46,182            176         463.33
Parquet file       11,640            213         6.25
Result: 74x faster than JSON format

* Tests run using Athena

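The speedup figures on this slide follow directly from the run times. As a quick check, the short sketch below divides the slower run time by the faster one for each test; the `speedup` helper name is only illustrative.

```python
def speedup(slow_seconds, fast_seconds):
    """Return how many times faster the second run is than the first."""
    return slow_seconds / fast_seconds


# Run times in seconds, taken from the two tests on the slide.
compaction = speedup(8.4, 2.31)       # many files vs. single file
conversion = speedup(463.33, 6.25)    # compacted JSON vs. Parquet

print(f"Compaction: {compaction:.1f}x faster")
print(f"Conversion: {conversion:.0f}x faster")
```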
Row vs column formats
Less data is read in columnar data.

ID    Age   State
123   20    CA
345   25    WA
678   40    FL
999   21    WA

Example: return all IDs in the dataset

Row format: 12 data points read into memory
123 20 CA 345 25 WA 678 40 FL 999 21 WA

Column format: 4 data points read into memory
123 345 678 999

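The difference in data read can be reproduced with a small in-memory sketch. It models a row layout as a list of tuples and a column layout as one list per column, then counts how many values each layout touches to return all IDs. This is only a toy model of the idea; real columnar formats such as Parquet add encoding, compression, and row groups.

```python
# The four records from the slide, stored row by row.
ROWS = [
    (123, 20, "CA"),
    (345, 25, "WA"),
    (678, 40, "FL"),
    (999, 21, "WA"),
]
COLUMN_NAMES = ("ID", "Age", "State")

# The same records stored column by column.
COLUMNS = {name: [row[i] for row in ROWS] for i, name in enumerate(COLUMN_NAMES)}


def ids_from_row_format(rows):
    """Every full row must be read to extract its ID."""
    values_read = 0
    ids = []
    for row in rows:
        values_read += len(row)
        ids.append(row[0])
    return ids, values_read


def ids_from_column_format(columns):
    """Only the ID column is read."""
    ids = list(columns["ID"])
    return ids, len(ids)


row_ids, row_reads = ids_from_row_format(ROWS)
col_ids, col_reads = ids_from_column_format(COLUMNS)
print(row_reads, col_reads)
```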
Partitioning data stores
Hive-style partitions: cost and performance optimization
• Amazon S3 prefix convention
s3://<bucket>/<prefix>/year=2018/month=01/day=20/

Spark partitions: optimal distributed performance
• “Logical chunk of a large distributed dataset”
• You can repartition, or redistribute, data in Spark

AWS Glue API example, partitioned by column “type”:
glue_context.write_dynamic_frame.from_options(
    frame = projectedEvents,
    connection_type = "s3",
    connection_options = {"path": "$outpath", "partitionKeys": ["type"]},
    format = "parquet")

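The Hive-style prefix convention can be sketched in a few lines of plain Python. The helper below builds a `year=/month=/day=` prefix from a date; the function name, bucket name, and prefix are hypothetical and stand in for whatever naming a real data lake uses.

```python
from datetime import date


def hive_partition_prefix(bucket, prefix, day):
    """Build an S3 path that follows the Hive-style key=value convention."""
    return (
        f"s3://{bucket}/{prefix}/"
        f"year={day.year:04d}/month={day.month:02d}/day={day.day:02d}/"
    )


# "example-bucket" and "events" are placeholder names.
path = hive_partition_prefix("example-bucket", "events", date(2018, 1, 20))
print(path)
```

Zero-padding the month and day keeps the prefixes in lexical order, so listing a prefix returns partitions in date order.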
Amazon S3 bucketing
The crawler automatically identifies partitions based on Amazon S3 object key prefixes.

Object keys:

Bucket: service-logs
• /year=2018/month=01/day=20/data001.csv.gz
• /year=2018/month=01/day=20/data002.csv.gz
• /year=2018/month=01/day=20/data003.csv.gz

Bucket: service-logs/2
• /year=2018/month=01/day=21/data001.csv.gz
• /year=2018/month=01/day=21/data002.csv.gz

Bucket: service-logs/3
• /year=2018/month=01/day=22/data001.csv.gz
• /year=2018/month=01/day=22/data002.csv.gz
• /year=2018/month=01/day=22/data003.csv.gz

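To make the idea of identifying partitions from key prefixes concrete, here is a simplified sketch that groups the object keys from this slide by their `key=value` prefix segments. It is an illustration of the principle only, not how the AWS Glue crawler is implemented, and the helper names are hypothetical.

```python
from collections import defaultdict

# Object keys from the slide.
OBJECT_KEYS = [
    "/year=2018/month=01/day=20/data001.csv.gz",
    "/year=2018/month=01/day=20/data002.csv.gz",
    "/year=2018/month=01/day=20/data003.csv.gz",
    "/year=2018/month=01/day=21/data001.csv.gz",
    "/year=2018/month=01/day=21/data002.csv.gz",
    "/year=2018/month=01/day=22/data001.csv.gz",
    "/year=2018/month=01/day=22/data002.csv.gz",
    "/year=2018/month=01/day=22/data003.csv.gz",
]


def partition_of(key):
    """Return the (name, value) pairs found in the prefix of an object key."""
    segments = key.strip("/").split("/")[:-1]  # drop the file name
    return tuple(tuple(s.split("=", 1)) for s in segments if "=" in s)


def group_by_partition(keys):
    """Map each partition to the file names stored under it."""
    groups = defaultdict(list)
    for key in keys:
        groups[partition_of(key)].append(key.rsplit("/", 1)[-1])
    return dict(groups)


partitions = group_by_partition(OBJECT_KEYS)
for partition, files in partitions.items():
    print(dict(partition), len(files))
```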
Compression

• Benefits
  • Reduced storage requirements
  • Reduced I/O reading data from storage
  • Faster query processing

• Compression codecs for AWS Glue output
  • SNAPPY (default)
  • LZO
  • GZIP
  • UNCOMPRESSED

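The storage benefit is easy to observe with Python's standard `gzip` module. The sketch below compresses a small synthetic CSV payload and compares sizes; the sample data is made up, and the ratio it produces says nothing about real datasets, whose compressibility varies widely.

```python
import gzip


def build_sample_csv(rows=1000):
    """Create a repetitive CSV payload (synthetic data for illustration)."""
    header = "evt_id,state,score\n"
    body = "".join(f"{i},WA,{i % 10}\n" for i in range(rows))
    return (header + body).encode("utf-8")


raw = build_sample_csv()
compressed = gzip.compress(raw)

# Compression is lossless: decompressing returns the original bytes.
restored = gzip.decompress(compressed)

print(f"raw: {len(raw)} bytes, gzip: {len(compressed)} bytes")
```

Smaller objects also mean less I/O per query, which is where the faster query processing comes from. The trade-off is the CPU time spent compressing and decompressing.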
Consider this
What are some of the best practices discussed for the following?

• Formatting

• Partitioning

• Compression

Module review

Knowledge Check 1

Which services can be used for data ingestion into your data lake?
(SELECT TWO)

A. Amazon Kinesis Data Firehose

B. Amazon QuickSight

C. Amazon Athena

D. AWS Storage Gateway


Knowledge Check 2

Which service uses continuous data replication with high availability to consolidate databases into a petabyte-scale data warehouse by streaming data to Amazon Redshift and Amazon S3?

A. AWS Storage Gateway

B. AWS Schema Conversion Tool


C. AWS Database Migration Service
D. Amazon Kinesis Data Firehose
Knowledge Check 3

What is the AWS Glue Data Catalog?

A. A fully managed ETL (extract, transform, and load) pipeline service

B. A service to schedule jobs

C. A visual data preparation tool

D. An index to the location, schema, and runtime metrics of your data
Knowledge Check 4

What AWS Glue feature “catalogs” your data?

A. AWS Glue crawler


B. AWS Glue DataBrew

C. AWS Glue Studio

D. AWS Glue Elastic Views


Lab 1
Set up a simple data lake
• Using an AWS Glue crawler, catalog the project dataset
• Query the data using Amazon Athena

Duration: 30 minutes

Diagram: CSV files to the S3 data lake source, then the AWS Glue crawler, then the Data Catalog, then Amazon Athena.

Thank You

Module 3: Data processing and analytics

Module objectives

• Recognize how data processing applies to a data lake


• Use AWS Glue to process data within a data lake
• Describe how to use Amazon Athena to analyze data in a data lake

Topic A

Data preparation

Data processing and preparation
Diagram: the data lake layers, with the catalog and processing layer highlighted: AWS Glue and AWS Glue DataBrew.

Data cleansing and preparation

ID Name Birthday Age Phone Postal code Place Locale


1234 Costa, Rui 18.2.80 37 ***-***-**** 1000 Lisboa B. Cawal
1234 Ana Costa Mar. 1, 84 37 965-555-2123 55555 Lsiboa 55555
1235 Rui Costa 18.2.80 27 963-555-4568 1000 Portugal 98980

Data preparation: cleanse, correct, and regularize:
• Uniqueness
• Representation
• Data anomaly
• Standardization
• Deduplication
• Contraindications
• Missing values
• Incorrect values

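Standardization and deduplication can be sketched with the sample records from this slide. The example below normalizes the two name layouts and the two birthday formats seen in the table, then drops records that become identical. The helper names and the choice of name plus birthday as the duplicate key are assumptions made for illustration.

```python
from datetime import datetime

# Birthday layouts that appear in the sample table.
BIRTHDAY_FORMATS = ("%d.%m.%y", "%b. %d, %y")


def normalize_name(name):
    """Turn 'Last, First' into 'First Last'; leave other names unchanged."""
    if "," in name:
        last, first = (part.strip() for part in name.split(",", 1))
        return f"{first} {last}"
    return name.strip()


def normalize_birthday(text):
    """Return an ISO date string, or None when no known format matches."""
    for fmt in BIRTHDAY_FORMATS:
        try:
            return datetime.strptime(text, fmt).date().isoformat()
        except ValueError:
            continue
    return None


def deduplicate(records):
    """Keep the first record for each (name, birthday) pair."""
    seen = set()
    unique = []
    for record in records:
        key = (normalize_name(record["name"]), normalize_birthday(record["birthday"]))
        if key not in seen:
            seen.add(key)
            unique.append({**record, "name": key[0], "birthday": key[1]})
    return unique


RECORDS = [
    {"id": 1234, "name": "Costa, Rui", "birthday": "18.2.80"},
    {"id": 1234, "name": "Ana Costa", "birthday": "Mar. 1, 84"},
    {"id": 1235, "name": "Rui Costa", "birthday": "18.2.80"},
]
clean = deduplicate(RECORDS)
print(clean)
```

Note that the two remaining records still share ID 1234, which is a uniqueness problem that standardization alone does not resolve.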
Merge and join datasets
Source tables (e.g., in the raw data bucket):
• events: evt_id, date, city, team
• event_results: evt_id, name, type, Score
Both tables contain the evt_id values dfg46sd5, 34fww54, iuu44tyur, and q34qdwe1.

Python: results = pd.merge(event_results, events[['evt_id', 'date', 'team']], on='evt_id')

Merged table (e.g., in the merged data bucket):
• results: evt_id, date, team, name, type, Score, with one row for each of the evt_id values above

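The merge on this slide is an inner join on `evt_id` that keeps only `date` and `team` from the events table. The dependency-free sketch below shows the same operation on lists of dictionaries. All field values are invented placeholders, and `inner_join` is a hypothetical helper, not part of pandas.

```python
# Placeholder rows; only the evt_id values come from the slide.
EVENTS = [
    {"evt_id": "dfg46sd5", "date": "2022-01-05", "city": "Lisbon", "team": "red"},
    {"evt_id": "34fww54", "date": "2022-01-06", "city": "Porto", "team": "blue"},
]
EVENT_RESULTS = [
    {"evt_id": "dfg46sd5", "name": "sprint", "type": "final", "Score": 12},
    {"evt_id": "34fww54", "name": "relay", "type": "heat", "Score": 7},
    {"evt_id": "unmatched", "name": "relay", "type": "final", "Score": 3},
]


def inner_join(left, right, key, right_columns):
    """Join rows that share `key`, copying only `right_columns` from the right side."""
    index = {}
    for row in right:
        index.setdefault(row[key], []).append(row)
    joined = []
    for left_row in left:
        for right_row in index.get(left_row[key], []):
            merged = dict(left_row)
            merged.update({column: right_row[column] for column in right_columns})
            joined.append(merged)
    return joined


results = inner_join(EVENT_RESULTS, EVENTS, "evt_id", ("date", "team"))
print(results)
```

Rows without a match on either side are dropped, which is the default behavior of an inner join.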
Data preparation: 80% of data lake work
Pie chart with segments of 60%, 20%, 10%, 5%, 4%, and 3%.

AWS Glue DataBrew

Data flow: data sources to AWS Glue DataBrew to Amazon S3 to analytics and machine learning.

• Visual data prep environment
• Cleanse and prepare your data
• Over 250 built-in transformations
• Evaluate data quality
• Automate at scale

AWS Glue DataBrew console

Topic B

Data processing with AWS Glue

AWS Glue

Serverless data preparation service: discover, prepare, and combine data for analytics, machine learning, and application development.

Components, implementation, and feature:
• AWS Glue Data Catalog
• AWS Glue crawler
• AWS Glue ETL
• AWS Glue Studio

Extract, transform, and load (ETL)

• Extract: download files from SFTP (raw data)
• Transform: transform, join, and cleanse data (SQL)
• Load: save data in Parquet format (merged data)

Scripts: Python or Apache Spark, running in the AWS Glue engine

How do you develop AWS Glue scripts?
AWS Glue Studio
• Blank graph: add source, target, and transform activities
• Script generated from the graph

AWS Glue Script Editor
• Start with script, generate diagram
• Import script or write your own
• Run script from the editor for testing

AWS Glue Developer Endpoint
• Job script
• Amazon SageMaker
• Jupyter
• Zeppelin

Creating jobs in AWS Glue Studio

Visual graph editor: Data source (Amazon S3 bucket), then Transform (ApplyMapping)

Script editor:

import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job

args = getResolvedOptions(sys.argv, ['JOB_NAME'])

sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args['JOB_NAME'], args)
DataSource0 = glueContext.create_dynamic_frame…

Creating jobs in the AWS Glue console
Configure job properties:
• Name: demo_run
• Type: Spark, Spark Streaming, or Python shell
• This job runs:
  o A proposed script generated by AWS Glue
  o An existing script that you provide
  o A new script to be authored by you

Script editor for job demo_run (Action, Save, Run Job, Generate diagram, Logs, Schema): shows the generated script next to its diagram. The diagram contains a source with database name my_catalog and table name my_table_csv, followed by the ApplyMapping and SelectFields transforms.

Adding development endpoints
Setup steps:
o Properties
o Networking
o SSH public key
o Review

Properties:
• Development endpoint name: demo_run_endpoint
• IAM role: Glue_Admin_RAI
• Security configuration: None
• Worker type: Standard, G.1X, or G.2X
• Data processing units (DPUs): 5
• Python library path: S3://bucket/prefix/object
• Dependent jars path: S3://bucket/prefix/object

Endpoint details:
• Endpoint name: demo_run_endpoint
• Provisioning status: READY
• Data processing units (DPUs): 5
• Public address: ec2-XXXXXXXX.eu-west-1.com...
• Public key contents: ssh-rsa AAAXXXXXXXXXXXX...
• IAM role: AWSGlueServiceRole-incre…
• SSH to Python REPL: ssh -i <private-key.pem> glue...
• SSH to Scala REPL: ssh -i <private-key.pem> glue...

Example AWS Glue jobs in Python

# Joining tables:
l_history = Join.apply(
    orgs,
    Join.apply(persons, memberships, 'id', 'person_id'),
    'org_id', 'organization_id'
).drop_fields(['person_id', 'org_id'])
print("Count: ", l_history.count())
l_history.printSchema()

# Split into multiple files:
glueContext.write_dynamic_frame.from_options(
    frame = l_history,
    connection_type = "s3",
    connection_options = {"path": "s3://glue-sample-target/output-dir/legislator_history"},
    format = "parquet")

# Partition based on data values:
l_history.toDF().write.parquet('s3://glue-sample-target/output-dir/legislator_part', partitionBy=['org_name'])

Running AWS Glue jobs

Run jobs from:
• AWS Glue console
• Job script
• AWS Glue triggers

Job settings:
• IAM role
• Type: Spark or Python
• Glue version
• Source inputs
• Transforms
• Target outputs
• Job bookmarks

Running AWS Glue jobs

Trigger types:
• Scheduler
• Job events (conditional)
• On demand

AWS Glue workflows

Diagram: a schedule or trigger starts the Deduplicate and Fix phone numbers jobs. When both fix and dedup have succeeded, an Update catalog step runs.

Topic C

Data privacy and monitoring

Keeping private data private
Cleansed data (e.g., admin access)
FirstName   LastName   Phone            Email                   TaxID
Nikki       Wolf       (123) 555-0190   nikkiwolf@gmail.com     987-12-3241
María       García     (198) 555-0121   mariagarcia@yahoo.com   555-23-9812
Liu         Jie        (112) 555-0155   liujie@hotmail.com      912-43-8965

AWS Glue ETL transformation job
1. Hide column (drop field LastName)
2. Replace text between positions (TaxID)
3. Replace text between delimiters (Email)

Masked data (customer access)
FirstName   Phone            Email (step 3)          TaxID (step 2)
Nikki       (123) 555-0190   nikkiwolf@xxxxx.com     XXX-XX-3241
María       (198) 555-0121   mariagarcia@xxxxx.com   XXX-XX-9812
Liu         (112) 555-0155   liujie@xxxxx.com        XXX-XX-8965

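The three masking steps can be sketched in plain Python with the first sample record from this slide. This is a local illustration of the transformations, not AWS Glue code, and the function names are hypothetical.

```python
import re


def mask_tax_id(tax_id, visible=4):
    """Step 2: replace digits with X in every position except the last `visible`."""
    head, tail = tax_id[:-visible], tax_id[-visible:]
    return re.sub(r"\d", "X", head) + tail


def mask_email_domain(email):
    """Step 3: replace the text between '@' and the final '.' with x characters."""
    return re.sub(r"(?<=@)[^@]+(?=\.[^.]+$)", "xxxxx", email)


def mask_record(record):
    """Apply all three steps; step 1 drops the LastName field."""
    masked = {key: value for key, value in record.items() if key != "LastName"}
    masked["TaxID"] = mask_tax_id(record["TaxID"])
    masked["Email"] = mask_email_domain(record["Email"])
    return masked


record = {
    "FirstName": "Nikki",
    "LastName": "Wolf",
    "Phone": "(123) 555-0190",
    "Email": "nikkiwolf@gmail.com",
    "TaxID": "987-12-3241",
}
masked = mask_record(record)
print(masked)
```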
Using Amazon Macie in data lakes

Sensitive data identification with Amazon Macie

Managed data identifiers
• Credentials
• Financial info
• Health info
• PII

Custom data identifiers
• Keywords
• <REGEX>

Automation approaches
• Mask data
• Apply business logic
• Suppress findings
• Review manually

Using roles to define data access

Control access to staged data using roles.

Data lake storage stages:
Raw data → Formatted & partitioned data → Merged data → Cleansed data → Masked data → Published data

Roles:
• Administrator
• Data scientist
• Business analyst
• General availability
Monitoring a data lake

Data sources: Databases, Objects, IoT, Mobile
Ingestion: Database import, Streaming data
Data stores: Databases, Storage
Catalog and processing: Data Catalog, Processes
Search and analytics: Search, Queries
Visualization: Interactive dashboards, Embedded analytics

Security and monitoring

Monitoring a data lake

Stages: Data sources, Ingestion, Data stores, Catalog and processing, Search and analytics, Visualization

Security and monitoring
• Amazon CloudWatch: Logs, Rule, Events, Alarm
• AWS CloudTrail

Optimizing a data lake

Stages: Data sources, Ingestion, Data stores, Catalog and processing, Search and analytics, Visualization

Optimizing Amazon S3
• Implement Amazon S3 lifecycle management.
• Use Amazon S3 storage class analysis to develop lifecycle rules.
• Use Amazon S3 Glacier for infrequently accessed data.
• Follow Amazon S3 best practices for asset naming.
• Transform data into columnar, compressed file formats.
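
A lifecycle rule of the kind described above can be sketched as the configuration document that Amazon S3 accepts. The prefix, the day thresholds, and the bucket name are assumptions; in practice the thresholds would come from storage class analysis.

```python
def build_lifecycle_configuration(prefix, ia_after_days=30, glacier_after_days=90):
    """Build a lifecycle configuration that tiers objects under a prefix."""
    if glacier_after_days <= ia_after_days:
        raise ValueError("glacier_after_days must be greater than ia_after_days")
    return {
        "Rules": [
            {
                "ID": f"tier-{prefix.strip('/')}",
                "Filter": {"Prefix": prefix},
                "Status": "Enabled",
                "Transitions": [
                    # Move to infrequent access first, then to archive storage.
                    {"Days": ia_after_days, "StorageClass": "STANDARD_IA"},
                    {"Days": glacier_after_days, "StorageClass": "GLACIER"},
                ],
            }
        ]
    }


def apply_lifecycle(bucket, configuration):
    """Apply the rule. Requires boto3 and AWS credentials; not called here."""
    import boto3

    boto3.client("s3").put_bucket_lifecycle_configuration(
        Bucket=bucket, LifecycleConfiguration=configuration
    )


config = build_lifecycle_configuration("raw/")  # hypothetical prefix
print(config)
```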

Consider this

• What are your data privacy requirements?

• How would you utilize monitoring in your data lake environment?

Topic D

Query data with Amazon Athena

Amazon Athena

Serverless query engine

Benefits
• Query Amazon S3 data directly or through the AWS Glue Data Catalog
• Use SQL based on Presto
• Supports CSV, JSON, ORC, Avro, and Parquet
• Can integrate with Amazon QuickSight
• Support for federated query

Components shown: Amazon S3, Crawler, Data Catalog (metadata), Amazon Athena

Athena Federated Query

• Query Amazon S3 directly
• Use AWS Lambda serverless compute
• Run queries across many data sources
• On-premises or cloud data sources
• Ad-hoc queries on complex data

Data sources shown: Amazon S3, on premises, Amazon ElastiCache for Redis, HBase in Amazon EMR, Amazon DocumentDB, Amazon DynamoDB, Amazon Redshift
Metadata databases and tables in Athena
• Tables and databases within Athena (to run queries) are based on metadata
• Select from AWS Glue Data Catalog or register new datasets with Athena
• Athena uses metadata to process Structured Query Language (SQL)

Database: covid19-db
Table: states_daily_csv

CREATE EXTERNAL TABLE `states_daily_csv`( ... )
ROW FORMAT ...
STORED AS INPUTFORMAT ...
OUTPUTFORMAT ...
LOCATION 's3://bucketname/folder/'

Running queries in Athena

SELECT state, positive, date
FROM "covid19-db"."states_daily_csv"
WHERE date = 20210121 -- Formatted as YYYYMMDD
ORDER BY positive DESC
LIMIT 10;

Results
State  Positive  Date
CA     3039044   20210121
TX     2188643   20210121
FL     1584442   20210121
NY     1285337   20210121
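
The same query can also be submitted programmatically. The sketch below assumes boto3 and an S3 location for query results; the output bucket is hypothetical. The request builder runs locally, while `run_query` needs AWS credentials and is not called here.

```python
TOP_STATES_SQL = """
SELECT state, positive, date
FROM "covid19-db"."states_daily_csv"
WHERE date = 20210121
ORDER BY positive DESC
LIMIT 10
"""


def build_query_request(database, sql, output_location):
    """Build keyword arguments for the Athena StartQueryExecution API."""
    return {
        "QueryString": sql.strip(),
        "QueryExecutionContext": {"Database": database},
        "ResultConfiguration": {"OutputLocation": output_location},
    }


def run_query(request, poll_seconds=1.0):
    """Start the query, wait for it to finish, and return the result rows."""
    import time

    import boto3

    athena = boto3.client("athena")
    execution_id = athena.start_query_execution(**request)["QueryExecutionId"]
    while True:
        execution = athena.get_query_execution(QueryExecutionId=execution_id)
        state = execution["QueryExecution"]["Status"]["State"]
        if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
            break
        time.sleep(poll_seconds)
    if state != "SUCCEEDED":
        raise RuntimeError(f"Query finished in state {state}")
    return athena.get_query_results(QueryExecutionId=execution_id)["ResultSet"]["Rows"]


request = build_query_request(
    "covid19-db",
    TOP_STATES_SQL,
    "s3://example-athena-results/",  # hypothetical results bucket
)
print(request["QueryString"])
```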
Module review

Knowledge Check 1

During your data preparation stage, the raw data has been enriched to support additional
insights. You need to improve query performance and reduce costs of the final analytics
solution.

Which data formats meet these requirements? (SELECT TWO)

A. CSV
B. JSON
C. Apache Parquet
D. Apache ORC
Knowledge Check 2

Your small start-up company is developing a data analytics solution. You need to clean and
normalize large datasets, but you do not have developers with the skill set to write custom
scripts.

Which tool will help you efficiently design and run the data preparation activities?

A. AWS Glue Data Catalog


B. AWS Glue DataBrew
C. Amazon Athena
D. AWS Glue ETL
Knowledge Check 3

In which scenario would you use AWS Glue jobs?

A. Analyze data in real time as data comes into the data lake
B. Transform data in real time as data comes into the data lake
C. Analyze data in batches on schedule or on demand
D. Transform data in batches on schedule or on demand
Knowledge Check 4

Your data resides in multiple data stores, including Amazon S3, Amazon RDS, and Amazon
DynamoDB. You need to efficiently query the combined datasets.

Which tool can achieve this, using a single query, without moving data?

A. Amazon Athena Federated Query


B. Amazon Redshift Query Editor
C. SQL Workbench
D. Amazon Redshift Spectrum
Thank You

© 2022 Amazon Web Services, Inc. or its affiliates. All rights reserved. This work may not be reproduced or redistributed, in whole or in part, without prior written permission from Amazon
Web Services, Inc. Commercial copying, lending, or selling is prohibited. Corrections, feedback, or other questions?
Contact us at https://support.aws.amazon.com/#/contacts/aws-training. All trademarks are the property of their owners.
AWS Partner: Building Data Lakes on AWS
Module 4: Building a data lake with AWS Lake Formation
Lab 2

Module objectives

• Describe the features and benefits of AWS Lake Formation


• Use Lake Formation to create a data lake
• Understand the Lake Formation security model

Topic A

AWS Lake Formation overview

Consider this

• What are the most challenging actions in creating your data lakes thus far?

Building a data lake, manually

Data sources: Databases, Objects, IoT, Mobile
Ingestion: Database import, Streaming data
Data stores: Databases, Storage
Catalog and processing: Data Catalog, Processes
Search and analytics: Search, Queries
Visualization: Interactive dashboards, Embedded analytics

Security and monitoring

AWS Lake Formation

Data sources: Databases, Objects, IoT, Mobile
AWS Lake Formation (ingestion, data stores, catalog and processing): AWS Glue, Data Catalog, ETL, Blueprints, Amazon S3, Security
Search and analytics: Search, Queries
Visualization: Interactive dashboards, Embedded analytics

Security and monitoring

Topic B

Using AWS Lake Formation to create a data lake

Three stages of Lake Formation

Stage 1: Register data lake storage locations
Stage 2: Create a database in the data lake’s Data Catalog
Stage 3: Grant permissions to data lake resources

The AWS Lake Formation dashboard

Data lake setup
Stage 1: Register your Amazon S3 storage
Stage 2: Create a database
Stage 3: Grant permissions

AWS Lake Formation: Stage 1

Stage 1: Register data lake storage locations
• Register an Amazon S3 bucket for data lake storage
• Review permissions on the selected data store
• Specify an IAM role with read and write access

AWS Lake Formation: Stage 2

Stage 2: Create a database in the data lake’s Data Catalog
Persona: Data lake administrator
• Database stores metadata tables in the Data Catalog
• Only database creators and data lake administrators can create databases
• Can select path within registered data lake storage to simplify location permissions

Note: Permissions are role/persona based. To perform specific tasks, the user must log in with the correct credentials/permissions.

AWS Lake Formation: Stage 3

Stage 3: Grant permissions to data lake resources
• Configure data lake access via IAM users or roles
• Table permissions: create, alter, drop, describe
• Understand the effects of cross-account grants
• Use identity federation or Amazon QuickSight user groups
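
The three stages can be sketched as API requests. This is a minimal illustration assuming boto3; every ARN, name, and path is hypothetical. The builders run locally, and `set_up_data_lake` would need AWS credentials and sufficient permissions, so it is not called here.

```python
def register_location_request(bucket_arn, role_arn):
    """Stage 1: register an Amazon S3 location with an IAM role."""
    return {"ResourceArn": bucket_arn, "RoleArn": role_arn}


def create_database_request(name, location_uri):
    """Stage 2: create a database in the Data Catalog."""
    return {"DatabaseInput": {"Name": name, "LocationUri": location_uri}}


def grant_table_request(principal_arn, database, table, permissions):
    """Stage 3: grant table permissions to an IAM user or role."""
    return {
        "Principal": {"DataLakePrincipalIdentifier": principal_arn},
        "Resource": {"Table": {"DatabaseName": database, "Name": table}},
        "Permissions": list(permissions),
    }


def set_up_data_lake(stage1, stage2, stage3):
    """Run the three stages in order. Requires boto3 and AWS credentials."""
    import boto3

    lakeformation = boto3.client("lakeformation")
    glue = boto3.client("glue")
    lakeformation.register_resource(**stage1)
    glue.create_database(**stage2)
    lakeformation.grant_permissions(**stage3)


# Hypothetical values for illustration only.
stage1 = register_location_request(
    "arn:aws:s3:::example-data-lake",
    "arn:aws:iam::111122223333:role/ExampleLakeRole",
)
stage2 = create_database_request("covid19-db", "s3://example-data-lake/covid19/")
stage3 = grant_table_request(
    "arn:aws:iam::111122223333:user/example-analyst",
    "covid19-db",
    "states_daily_csv",
    ["SELECT", "DESCRIBE"],
)
```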

Topic C

AWS Lake Formation personas

Data lake security with Lake Formation

• Secure multiple data sources
• Deny access by default and provide specific access where needed
• Support multiple user roles and teams
• Enable governance

Security roles in Lake Formation

Data lake administrator
• Full read access to resources
• Data location permissions
• Grant/revoke access to resources, including self
• Create databases
• Grant permission to create databases

Database creator
• All database permissions on databases they create
• Permissions on tables they create
• Use console or API to designate database creators

Table creator
• Permissions on tables they create
• Grant permissions on tables they create
• View databases containing the tables they create

Access control methods

Default (Use only IAM access control)
AWS Lake Formation
• IAMAllowedPrincipals group
• Super permission
AWS IAM
• IAM policies control all access
• Data Catalog resources
• Individual Amazon S3 buckets

Alternative configuration
AWS Lake Formation
• Fine-grained access
• Grant limited permissions
• Individual principals
• Individual Data Catalog resources, Amazon S3 locations, and underlying data in those locations
AWS IAM
• Coarse-grained
• Permissions on individual operations
• Access to Amazon S3 locations
• Access only to needed API operations

Consider this

What permissions (Create table, Read) apply to these user scenarios?
1. IAM user designated as Lake Formation data lake administrator
2. Member of an IAM group designated within Lake Formation as database creators
3. IAM user in another account that has access to query the Lake Formation catalog
4. Data lake administrator of another account that has been granted access to create tables in this account
5. Member of Amazon QuickSight group that has been configured for federated access

Module review

Knowledge Check 1

Which benefit do you achieve by using AWS Lake Formation to build data
lakes?

A. Build data lakes quickly

B. Simplify security management

C. Provide self-service access to data

D. All of the above


Knowledge Check 2
What are the three stages to set up a data lake using AWS Lake Formation?
(SELECT THREE)

A. Register the storage location

B. Create a database

C. Populate the database

D. Grant permissions
Knowledge Check 3
How does AWS Lake Formation relate to the AWS Glue service? (SELECT
THREE)

A. ETL code generation

B. Job monitoring

C. Data ingestion and cataloging

D. Simplify security management


Lab 2
Build a data lake using AWS Lake Formation
• Use AWS Lake Formation to create a data lake
• Crawl the data with AWS Glue to create the metadata and table
• Query the data using Amazon Athena
• Transform the data from CSV to Parquet format
Duration: 45 minutes
Diagram: AWS public data (COVID-19) → AWS Lake Formation COVID-19 data lake → Amazon Athena → Data analyst (Analyze)

AWS Partner: Building Data Lakes on AWS
Module 5: AWS Lake Formation additional configurations
Labs 3 and 4

Module objectives

• Automate AWS Lake Formation using blueprints and workflows


• Apply security and access controls to AWS Lake Formation
• Match records with AWS Lake Formation FindMatches
• Visualize data with Amazon QuickSight

Topic A

Using blueprints and workflows

Consider this
How can we automate and optimize processing jobs and scripts within AWS Lake Formation?

• AWS Lake Formation blueprint: Workflow template for common data sources
• AWS Lake Formation workflow: Orchestrate ETL activities in the data lake

Blueprints build on AWS Glue

Blueprints
• Workflow: AWS Glue jobs, AWS Glue crawlers
• Monitoring
• AWS Glue Data Catalog: Connections, databases, tables

Creating a workflow

1. Choose blueprint type
• Database snapshot
• Incremental database
• AWS CloudTrail
• Classic Load Balancer logs
• Application Load Balancer logs

2. Configure source and target
• Source connection and data path: <database>/<schema>/<table>
• Target database in the Data Catalog
• Storage location: <s3://bucket/prefix/>
• Format: Parquet/CSV

3. Configure the workflow
• Workflow name
• IAM Role
• Table prefix
• Max capacity: DPUs (Data Processing Units)
• Concurrency: Default 5

Topic B

Security and access controls

Access control: Permission checks

Principals: IAM users, roles, Active Directory users
Services: Amazon Athena, Amazon Redshift, Amazon EMR

1. Query “Table1” (principal to service)
2. Request access “Table1” (service to AWS Lake Formation)
3. Short-term access “Table1” (AWS Lake Formation to service)
4. Request object “Table1” (service to the transformed dataset in Amazon S3)
5. Return object “Table1” (Amazon S3 to service)

Services authenticate, retrieve resources directly from Amazon S3.

AWS Lake Formation personas

IAM administrator (required)
• Superuser
• Create IAM users and roles
• Have the AdministratorAccess AWS managed policy

Data lake administrator (required)
• Register Amazon S3 locations
• Access the Data Catalog
• Create databases
• Create and run workflows

Data engineer (optional)
• Create and run crawlers and workflows
• Grant permissions on the assets they create

Data analyst (optional)
• Run queries against the data lake

Workflow role (required)
• Run a workflow on behalf of a user

Security deep dive

Principals: IAM users, roles, Active Directory users
Services: Amazon Athena, Amazon Redshift, Amazon EMR

1. Query “Table1” (principal to service)
2. Request access “Table1” (service to AWS Lake Formation)
3. Short-term access “Table1” (AWS Lake Formation to service)
4. Request object “Table1” (service to the transformed dataset in Amazon S3)
5. Return object “Table1” (Amazon S3 to service)

Services authenticate, retrieve resources directly from Amazon S3.

Fine-grained access control

Default
• Lake Formation permissions: Open. Super permission is granted.
• IAM permissions: Fine-grained. IAM policies control access to Data Catalog resources and Amazon S3 buckets.

Recommended
• Lake Formation permissions: Fine-grained. Limited permissions are granted.
• IAM permissions: Coarse-grained. Broader permissions apply. Example: “glue:*” or “glue:Create*”

Note: By default, Lake Formation has the Use only IAM access control settings enabled.
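
A coarse-grained IAM policy of the kind named in the example can be sketched as a policy document. This is an illustration only: the action list reuses the “glue:Create*” pattern from the slide, and the wildcard resource is an assumption that a real policy would normally narrow.

```python
import json


def build_coarse_grained_policy(actions):
    """Build an IAM policy document that allows broad action patterns."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": list(actions),
                # Assumption: all resources; narrow this in a real policy.
                "Resource": "*",
            }
        ],
    }


policy = build_coarse_grained_policy(["glue:Create*"])
policy_json = json.dumps(policy, indent=2)
print(policy_json)
```

With the recommended configuration, a broad policy like this is paired with limited Lake Formation grants, so access is decided by the narrower of the two.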

Data lake access control

Metadata
• Resources: Data Catalog resources
• Actions enabled: Perform CRUD operations on Data Catalog

Data
• Resources: Amazon S3
• Actions enabled: Read and write data to underlying Amazon S3 location

Consider this
Think about how these personas align with the user community in your organization.

• IAM admins (superuser)
• Data lake administrators
• Data engineers
• Data analysts
• Workflow role

Lab 3
Automate data lake creation using AWS Lake Formation blueprints
• Automate the data lake setup process with AWS Lake Formation blueprints

Duration: 45 minutes
Diagram: AWS Lake Formation blueprints → Data lake workflow in AWS Glue (AWS CloudTrail logs → Crawler → User activity data) → Amazon Athena

Topic C

Matching records with AWS Lake Formation FindMatches

FindMatches: an ML transformation

Machine learning custom transforms to cleanse your data

• Matching Customers
Linking customer records across different customer databases
• Matching Products
Matching products in your catalog against other product sources
• Improving Fraud Detection
New customer accounts from a previously known fraudulent user

Training the ML transform

New columns: labeling_set_id (labeling set), label
Sample dataset columns: first_name, last_name, Birthday
Result column: match_id

labeling_set_id  label  first_name  last_name  Birthday    match_id
ABC123           A      John        Doe        04/01/1980  1
ABC123           B      Jane        Smith      04/03/1980  2
ABC123           A      Nikki       Wolf       04/01/1980  1
ABC123           A      Akua        Mansa      04/01/1980  1
DEF345           A      Richard     Roe        12/11/1992  3
DEF345           A      Shirley     Rodriguez  11/12/1992  3
DEF345           B      Paulo       Santos     12/11/1992  4
DEF345           C      Carlos      Salazar    05/06/2017  5
DEF345           B      Wang        Xiulan     12/11/1992  4
GHI678           A      María       García     1/3/1999    6
GHI678           A      Martha      Rivera     1/3/1999    6
XYZABC           A      Efu         Owusu      2/5/2001    7
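
The relationship between the labels and match_id in the table can be reproduced in plain Python: rows that share both a labeling_set_id and a label are treated as the same entity. This sketch shows only how the labels group records. It is not the FindMatches API, and the function name is hypothetical.

```python
# (labeling_set_id, label, first_name, last_name) from the sample dataset.
LABELED_ROWS = [
    ("ABC123", "A", "John", "Doe"),
    ("ABC123", "B", "Jane", "Smith"),
    ("ABC123", "A", "Nikki", "Wolf"),
    ("ABC123", "A", "Akua", "Mansa"),
    ("DEF345", "A", "Richard", "Roe"),
    ("DEF345", "A", "Shirley", "Rodriguez"),
    ("DEF345", "B", "Paulo", "Santos"),
    ("DEF345", "C", "Carlos", "Salazar"),
    ("DEF345", "B", "Wang", "Xiulan"),
    ("GHI678", "A", "María", "García"),
    ("GHI678", "A", "Martha", "Rivera"),
    ("XYZABC", "A", "Efu", "Owusu"),
]


def assign_match_ids(rows):
    """Give each (labeling_set_id, label) pair one id, in order of first appearance."""
    ids = {}
    result = []
    for labeling_set_id, label, *_ in rows:
        key = (labeling_set_id, label)
        ids.setdefault(key, len(ids) + 1)
        result.append(ids[key])
    return result


match_ids = assign_match_ids(LABELED_ROWS)
print(match_ids)
```

The same label in a different labeling set (for example, A in ABC123 and A in DEF345) does not imply a match, which is why the pair, not the label alone, is the grouping key.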

Topic D

Visualizing data with Amazon QuickSight

Visual insights from the data lake

Data sources: Databases, Objects, IoT, Mobile
Ingestion: Database import, Streaming data
Data stores: Databases, Storage
Catalog and processing: Data Catalog, Processes
Search and analytics: Search, Queries
Visualization: Amazon QuickSight, Third Party

Security and monitoring

What is Amazon QuickSight?

• Scalable, serverless business intelligence service


• Embeddable into your existing applications
• Dynamic, machine learning (ML) powered insights

Accessing data with QuickSight

SPICE
Super-fast, Parallel, In-memory Calculation Engine

• Faster processing
• Reduced wait time vs. direct queries
• Reduced cost through reuse

Embedded analytics

Embed visualizations in any desktop or mobile application.

1. Create a dashboard
2. Apply permissions
3. Authenticate your app server
4. Embed via JavaScript SDK

Module review

Knowledge Check 1

AWS Lake Formation has a set of suggested personas and IAM permissions.
Which is a required persona?

A. Data lake administrator

B. Data engineer

C. Data analyst

D. Business analyst
Knowledge Check 2

Which three types of blueprints does AWS Lake Formation support? (SELECT
THREE)

A. ETL code creation and job monitoring

B. Database snapshot

C. Incremental database

D. Log file sources (AWS CloudTrail, ELB/ALB logs)


Knowledge Check 3

Which one of the following is the best description of the capabilities of Amazon QuickSight?

A. Automated configuration service built on AWS Glue

B. Scalable, serverless business intelligence service

C. Fast, simple, cost-effective data warehousing

D. Simple, scalable, and serverless data integration


Lab 4
Data visualization using Amazon QuickSight
• Create data visualization dashboards in Amazon QuickSight

Duration: 45 minutes
Diagram: AWS Lake Formation Movie Info data lake → Amazon Athena → Amazon QuickSight → Data analyst (Visualize)

AWS Partner: Building Data Lakes on AWS
Module 6: Course review and wrap-up

What did this course cover?
• How to plan and design a data lake using effective data lake methodologies
• The components and services required to build a data lake on AWS
• How to secure a data lake on AWS using appropriate permissions
• The ways data can be ingested, stored, and transformed in a data lake on AWS
• How to analyze and visualize data stored in a data lake on AWS
• How to automate the deployment of a data lake with AWS Lake Formation

Serverless data lake reference architecture

Data source and ingestion
• File share: AWS DataSync
• Streaming data: Amazon Kinesis Data Firehose
• Partner data: AWS SFTP
• Operational databases: AWS DMS

AWS Lake Formation: Data lake storage, cataloging, and processing
• AWS Glue Data Catalog
• Landing data → Validate, clean ETL → Raw data → Normalize, enrich ETL → Curated data
• AWS Step Functions

Consumption
• Amazon Athena
• Amazon QuickSight
• Amazon Redshift Spectrum

Security & Monitoring
• Amazon VPC, AWS IAM, AWS KMS, AWS CloudTrail, Amazon CloudWatch, AWS Lake Formation, Amazon SageMaker

Data lake best practices

• Location, location, location!
• Partition for performance!
• Size files for speed!
• Continue compacting!
• The details are in the deltas!
• Align keys with queries!
Learn more…

AWS Skill Builder online learning center

Continue to deepen the skills you need, your way, with 500+ courses and interactive training developed by the experts at AWS.

• Game-based learning
• Self-paced labs
• Use case challenges
• Exam preparation

Get started: https://aws.amazon.com/training/digital
Don’t miss these learning opportunities

• Free Digital Training: Learn with hundreds of free, self-paced digital courses on AWS fundamentals.
• Classroom Training: Deepen your technical skills and learn from an accredited AWS instructor.
• AWS Certification: Validate your expertise with an industry-recognized credential.

AWS certification

Role-based certifications align to the following roles and levels:
• Roles: Architect, Operations, Developer
• Levels: Professional, Associate, Foundational (Cloud Practitioner)

Specialty certifications align to domain expertise in specific areas.

Course feedback

Your feedback is critical to us!


1. Sign in to https://www.aws.training.
2. Choose My Account, and then select Transcript.
3. Choose the Archived tab.
4. Expand the completed AWS Partner: Building Data Lakes on AWS course,
and then choose Evaluate.

Thanks for participating!
Corrections, feedback, or other questions?
Contact us at https://support.aws.amazon.com/#/contacts/aws-training.
All trademarks are the property of their owners.

