
Introduction to Data Lake on AWS

Tuan Vo
Solutions Architect
mintuan@amazon.com
© 2020, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
To Become a Leader, Data is Your Differentiator

“Organizations that successfully generate business value from their data will outperform their peers.”

For Data to Be a Differentiator, Customers Need to Be Able to…

• Capture and store new non-relational data at PB-EB scale in real time

• Use new types of analytics that go beyond batch reporting to incorporate real-time, predictive, voice, and image recognition

• Democratize access to data in a secure and governed way

[Diagram: New types of analytics (Dashboards, Real-time, Predictive, Image and Voice Recognition); New types of data]

Traditionally, Analytics Used to Look Like This

• Relational data

• TBs–PBs scale

• Schema defined prior to data load

• Operational reporting and ad hoc

• Large initial CAPEX + $10K–$50K/TB/Year

[Diagram: sources (OLTP, ERP, CRM, LOB); Data Warehouse; Business Intelligence]

Data Lakes Extend the Traditional Approach
• Relational and non-relational data

• TBs–EBs scale

• Diverse analytical engines

• Low-cost storage & analytics

[Diagram: sources (OLTP, ERP, CRM, LOB, Devices, Web, Sensors, Social); stores (Data Warehouse, Data Lake); consumers (Business Intelligence; Big Data processing, real-time, Machine Learning)]

Data Lakes and Analytics from AWS

• Open and comprehensive
• Secure
• Scalable and durable
• Cost-effective

[Diagram: Data Lake on AWS, surrounded by Machine Learning, Analytics, On-premises Data Movement, and Real-time Data Movement]

Store Data in the Format You Want
Open and comprehensive

• Store data in the format you want:
  • Text files like CSV
  • Columnar like Apache Parquet and Apache ORC
  • Logstash like Grok
  • JSON (simple, nested), Avro
  • And more…

[Diagram: CSV, ORC, Grok, Avro, Parquet, and JSON data alongside Amazon S3, Amazon Glacier, and AWS Glue]

Analyze with the Broadest Set of Analytic Tools
Open and comprehensive

• Analyze data with the broadest selection of analytics tools
  • Data warehousing
  • Interactive SQL queries
  • Big Data processing
  • Real-time analytics
  • Dashboards & Visualizations
  • Machine Learning

• Query in place without moving to a separate analytics system
  • Up to 400% faster with S3 Select and Glacier Select

• Largest ISV ecosystem with built-in integration

• Ensures you can meet existing and future use cases, minimizing risks

Machine Learning: Amazon SageMaker, AWS Deep Learning AMIs, Amazon Rekognition, Amazon Lex, AWS DeepLens, Amazon Comprehend, Amazon Translate, Amazon Transcribe, Amazon Polly

Analytics: Amazon Athena, Amazon EMR, Amazon Redshift, Amazon Elasticsearch Service, Amazon Kinesis, Amazon QuickSight

Diagram base layer: Amazon S3, Amazon Glacier, AWS Glue

Data Lakes from AWS

• Open and comprehensive
• Secure
• Scalable and durable
• Cost-effective

[Diagram: Data Lake on AWS, surrounded by Machine Learning, Analytics, On-premises Data Movement, and Real-time Data Movement]

AWS Provides Highest Levels of Security
Secure
Customers need multiple levels of security, identity and access management, encryption, and compliance to secure their data lake.

Security: Amazon GuardDuty, AWS Shield, AWS WAF, Amazon Macie, VPC

Identity: AWS IAM, AWS SSO, Amazon Cloud Directory, AWS Directory Service, AWS Organizations

Encryption: AWS Certificate Manager, AWS Key Management Service; encryption at rest; encryption in transit; bring your own keys, HSM support

Compliance: AWS Artifact, Amazon Inspector, AWS CloudHSM, Amazon Cognito, AWS CloudTrail
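
To make the "encryption at rest" and "encryption in transit" items concrete, here is a minimal sketch in Python. It builds a bucket policy that denies any request not sent over TLS (using the `aws:SecureTransport` condition key) and the arguments for an upload that requests server-side encryption with a KMS key. The bucket name, object key, and key alias are hypothetical, and the boto3 calls that would apply them are shown only in comments.

```python
import json


def tls_only_bucket_policy(bucket: str) -> dict:
    """Build a bucket policy that denies every S3 request not sent over TLS."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "DenyInsecureTransport",
                "Effect": "Deny",
                "Principal": "*",
                "Action": "s3:*",
                "Resource": [
                    f"arn:aws:s3:::{bucket}",
                    f"arn:aws:s3:::{bucket}/*",
                ],
                "Condition": {"Bool": {"aws:SecureTransport": "false"}},
            }
        ],
    }


def kms_encrypted_put_args(bucket: str, key: str, body: bytes, kms_key_id: str) -> dict:
    """Build put_object arguments that request SSE-KMS encryption at rest."""
    return {
        "Bucket": bucket,
        "Key": key,
        "Body": body,
        "ServerSideEncryption": "aws:kms",
        "SSEKMSKeyId": kms_key_id,
    }


# Hypothetical names, for illustration only.
policy = tls_only_bucket_policy("example-data-lake")
put_args = kms_encrypted_put_args(
    "example-data-lake", "raw/events.csv", b"id,value\n1,42\n", "alias/example-key"
)
print(json.dumps(policy, indent=2))

# With boto3 installed and credentials configured, these could be applied as:
#   s3 = boto3.client("s3")
#   s3.put_bucket_policy(Bucket="example-data-lake", Policy=json.dumps(policy))
#   s3.put_object(**put_args)
```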
Data Lakes from AWS

• Open and comprehensive
• Secure
• Scalable and durable
• Cost-effective

[Diagram: Data Lake on AWS, surrounded by Machine Learning, Analytics, On-premises Data Movement, and Real-time Data Movement]

Any Scale
Scalable and durable

• S3 has trillions of objects and exabytes of data

• Built to store any amount of data

• Run analytic engines at largest scale by spinning up any amount of compute resources in minutes

• Runs on the world’s largest global cloud infrastructure

Unmatched Durability and Availability
Scalable and durable

• Designed to deliver 99.999999999% durability

• Geographic redundancy & automatic replication

• Store data in multiple data centers across 3 AZs in a single region

• Seamlessly replicates data between any regions
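
As a rough illustration of what eleven nines means in practice, the sketch below reads the durability figure as an average annual per-object durability (an interpretation, not something stated on the slide) and computes the expected number of objects lost per year. Exact fractions are used so that floating-point rounding does not distort such a small probability.

```python
from fractions import Fraction

# Eleven nines: 99.999999999% = 1 - 10^-11, read here as the
# average annual durability of a single object (an assumption).
DURABILITY = 1 - Fraction(1, 10**11)


def expected_annual_losses(num_objects: int) -> Fraction:
    """Expected number of objects lost per year, on average."""
    return num_objects * (1 - DURABILITY)


def years_per_lost_object(num_objects: int) -> Fraction:
    """Average number of years between single-object losses."""
    return 1 / expected_annual_losses(num_objects)


objects_stored = 10_000_000
print(f"Expected losses per year: {float(expected_annual_losses(objects_stored))}")
print(f"Years per lost object:    {int(years_per_lost_object(objects_stored))}")
```

Under this reading, storing ten million objects corresponds to an average of one lost object every 10,000 years.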

Data Lakes from AWS

• Open and comprehensive
• Secure
• Scalable and durable
• Lowest cost

[Diagram: Data Lake on AWS, surrounded by Machine Learning, Analytics, On-premises Data Movement, and Real-time Data Movement]

Tiered Storage to Optimize Price/Performance
Lowest Cost

• Tiered storage to optimize price/performance
  • S3 Standard
  • S3 Standard-Infrequent Access
  • S3 One Zone-Infrequent Access
  • Amazon Glacier

• Migrate between tiers based on lifecycle policies

• Store data at $0.023/GB/month with S3

• Store data at $0.004/GB/month with Glacier

[Diagram: S3 Standard, S3 Standard-Infrequent Access, S3 One Zone-IA, Glacier]
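
A short worked example of the two per-GB prices quoted above, as a sketch only. It uses the slide's figures as-is (actual prices vary by Region and over time), assumes 1 TB = 1,000 GB, and ignores request, retrieval, and data-transfer charges. The 100 TB dataset size is hypothetical.

```python
from decimal import Decimal

# Prices as quoted on the slide, in USD per GB per month.
S3_STANDARD_PER_GB_MONTH = Decimal("0.023")
GLACIER_PER_GB_MONTH = Decimal("0.004")


def monthly_storage_cost(size_gb: int, price_per_gb_month: Decimal) -> Decimal:
    """Storage-only monthly cost; excludes request and retrieval fees."""
    return size_gb * price_per_gb_month


dataset_gb = 100 * 1000  # hypothetical 100 TB dataset

s3_cost = monthly_storage_cost(dataset_gb, S3_STANDARD_PER_GB_MONTH)
glacier_cost = monthly_storage_cost(dataset_gb, GLACIER_PER_GB_MONTH)

print(f"S3 Standard: ${s3_cost:,.2f}/month")
print(f"Glacier:     ${glacier_cost:,.2f}/month")
print(f"Difference:  ${s3_cost - glacier_cost:,.2f}/month")
```

At these prices, 100 TB costs $2,300 per month in S3 Standard and $400 per month in Glacier for storage alone.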

Pay Only for the Resources You Use as You Scale
Lowest Cost

• Pay-as-you-go for the resources you consume

• As low as $0.05/GB scanned with Athena

Traditional approach leads to wasted capacity (Traditional: Rigid)
  • Unmet demand: upset players, missed revenue
  • Excess capacity: wasted $$$

AWS approach: pay for the capacity you use (AWS: Elastic)

[Charts: servers/capacity plotted against demand for each approach]
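
The rigid versus elastic comparison can be sketched with a toy calculation. The demand series and the fixed capacity below are hypothetical numbers in arbitrary "server-hours", chosen only to show how unmet demand and excess capacity arise when capacity is fixed, and how pay-as-you-go tracks demand instead.

```python
def rigid_vs_elastic(demand: list[int], fixed_capacity: int) -> dict[str, int]:
    """Compare fixed provisioning with paying only for what is used."""
    unmet = sum(max(d - fixed_capacity, 0) for d in demand)
    excess = sum(max(fixed_capacity - d, 0) for d in demand)
    return {
        "rigid_paid": fixed_capacity * len(demand),  # paid whether used or not
        "rigid_unmet": unmet,                        # demand above capacity
        "rigid_excess": excess,                      # capacity left idle
        "elastic_paid": sum(demand),                 # capacity follows demand
    }


# Hypothetical demand per period, with a peak above the fixed capacity.
demand = [2, 3, 8, 12, 6, 3]
result = rigid_vs_elastic(demand, fixed_capacity=8)

for name, value in result.items():
    print(f"{name}: {value}")
```

In this toy series the rigid setup pays for 48 units, leaves 18 idle, and still misses 4 units of demand at the peak, while the elastic setup pays for the 34 units actually used.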

Lowest Total Cost of Ownership (TCO)
Cost-effective

• Less admin time to manage and support

• No up-front costs: hardware acquisition, installation

• Save on operating costs: data center space, power, cooling

• Business value: cost of delays, risk premium, competitive abilities, governance, etc.

On-premises vs. AWS:
  On-premises: Licensing Fees, Support Costs
  AWS: Subscription Fee, Support Costs

Cost components:
  Server Costs
    Hardware: Server, Rack, Chassis, PDUs, ToR Switches (+Maintenance)
    Software: OS, Virtualization Licenses (+Maintenance)
  Network Costs
    Network Hardware: LAN Switches, Load Balancer, Bandwidth costs
    Software: Network Monitoring
  IT Labor Costs
    Server admin, virtualization admin, storage admin, network admin, support team
  Extras
    Project planning, advisors, legal, contractors, managed services, training, cost of capital

More Data Lakes & Analytics on AWS than Anywhere Else

Reference architecture: Data lake on AWS

Catalog and search: AWS Glue, Amazon DynamoDB, Amazon ES

Access and user interface: Amazon API Gateway, IAM, Amazon Cognito

Data ingestion: AWS Data Exchange, Amazon Kinesis, AWS Direct Connect, AWS DMS, AWS Snowball

Central storage: Amazon S3

Processing and analytics: Machine learning, Amazon QuickSight, Amazon EMR, Amazon Redshift, Amazon Athena

Protect and secure: Amazon CloudWatch, IAM, AWS STS, AWS KMS, AWS CloudTrail


Serverless data lakes and analytics

Sources: Web app data, Amazon RDS, Other databases, On-premises data, Streaming data

Pipeline: Amazon S3 → AWS Glue crawler → AWS Glue Data Catalog → Amazon Athena → Amazon QuickSight

Amazon S3

Amazon S3—Object Storage

Durability, Availability & Scalability: Built for eleven nines of durability; data distributed across 3 physical facilities in an AWS region; automatically replicated to any other AWS region

Security and Compliance: Three different forms of encryption; encrypts data in transit when replicating across regions; log and monitor with CloudTrail; use ML to discover and protect sensitive data with Macie

Query in Place: Run analytics & ML on data lake without data movement; S3 Select can retrieve subset of data, improving analytics performance by 400%

Flexible Management: Classify, report, and visualize data usage trends; objects can be tagged to see storage consumption, cost, and security; build lifecycle policies to automate tiering and retention
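
The lifecycle policies mentioned above can be sketched as a configuration document. The Python below builds one rule that moves objects under a prefix to S3 Standard-IA, then to Glacier, and finally expires them. The prefix, the day thresholds, and the bucket name in the comment are hypothetical; the dictionary follows the shape accepted by boto3's `put_bucket_lifecycle_configuration`, which is shown only in a comment.

```python
def tiering_rule(
    prefix: str,
    ia_after_days: int = 30,
    glacier_after_days: int = 90,
    expire_after_days: int = 365,
) -> dict:
    """Build one lifecycle rule: Standard -> Standard-IA -> Glacier -> expire."""
    if not ia_after_days < glacier_after_days < expire_after_days:
        raise ValueError("thresholds must increase: IA < Glacier < expiration")
    return {
        "ID": f"tier-and-expire-{prefix.strip('/') or 'all'}",
        "Filter": {"Prefix": prefix},
        "Status": "Enabled",
        "Transitions": [
            {"Days": ia_after_days, "StorageClass": "STANDARD_IA"},
            {"Days": glacier_after_days, "StorageClass": "GLACIER"},
        ],
        "Expiration": {"Days": expire_after_days},
    }


def lifecycle_configuration(prefix: str) -> dict:
    """Wrap the rule in the top-level lifecycle configuration document."""
    return {"Rules": [tiering_rule(prefix)]}


config = lifecycle_configuration("raw/")  # hypothetical prefix
print(config)

# With boto3 installed and credentials configured, this could be applied as:
#   boto3.client("s3").put_bucket_lifecycle_configuration(
#       Bucket="example-data-lake", LifecycleConfiguration=config)
```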

Amazon Glacier—Backup and Archive

Durability, Availability & Scalability: Built for eleven nines of durability; data distributed across 3 physical facilities in an AWS region; automatically replicated to any other AWS region

Retrieves data in minutes: Three retrieval options to fit your use case; expedited retrievals with Glacier Select can return data in minutes

Secure: Log and monitor with CloudTrail; Vault Lock enables WORM storage capabilities, helping satisfy compliance requirements

Inexpensive: Lowest cost AWS object storage class, allowing you to archive large amounts of data at a very low cost

AWS Glue

Storing is Not Enough, Data Needs to Be Discoverable

“Dark data are the information assets organizations collect, process, and store during regular business activities, but generally fail to use for other purposes (for example, analytics, business relationships and direct monetizing).”
Gartner IT Glossary, 2018
https://www.gartner.com/it-glossary/dark-data

[Diagram: CRM, ERP, Data warehouse, Mainframe data, Web, Social, Log files, Machine data, Semi-structured, Unstructured]

AWS Glue—Data Catalog
Make data discoverable

• Automatically discovers data and stores schema

• Catalog makes data searchable and available for ETL

• Catalog contains table and job definitions

• Computes statistics to make queries efficient

[Diagram: Glue Data Catalog; Discover data and extract schema; Compliance]

Data Catalog: Crawlers

Crawlers automatically build your Data Catalog and keep it in sync

Automatically discover new data and extract schema definitions

• Detect schema changes and version tables
• Detect Hive-style partitions on Amazon S3

Built-in classifiers for popular types; custom classifiers using Grok expressions

Run ad hoc or on a schedule; serverless – only pay when crawler runs
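
For readers unfamiliar with Hive-style partitions, the layout encodes partition columns as `name=value` folders in the object key. The sketch below is a simplified illustration of that convention in plain Python, not the crawler's actual detection logic; the example keys are hypothetical.

```python
def parse_hive_partitions(key: str) -> dict[str, str]:
    """Extract name=value partition folders from an S3 object key."""
    partitions = {}
    # Skip the last path segment, which is the object (file) name.
    for segment in key.split("/")[:-1]:
        name, separator, value = segment.partition("=")
        if separator and name and value:
            partitions[name] = value
    return partitions


partitioned_key = "logs/year=2020/month=11/day=05/part-0000.parquet"
flat_key = "logs/2020/11/05/part-0000.parquet"

print(parse_hive_partitions(partitioned_key))
print(parse_hive_partitions(flat_key))
```

The first key yields `year`, `month`, and `day` partition values, while the second key carries the same dates but exposes no named partition columns.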

AWS Glue Data Catalog

Bring in metadata from a variety of data sources (Amazon S3, Amazon Redshift, etc.) into a single categorized list that is searchable

Data Catalog: Table details

Table properties
Nested fields

Data statistics

Table schema

Data Catalog: Version control
Compare schema versions; list of table versions

AWS Glue—ETL Service
Make ETL scripting and deployment easy

• Automatically generates ETL code

• Code is customizable with Python and Spark

• Endpoints provided to edit, debug, test code

• Jobs are scheduled or event-based

• Serverless
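
To illustrate "scheduled or event-based" jobs, the sketch below builds the arguments for two Glue triggers: one that starts a job on a cron schedule, and one that starts a job after another job succeeds. The job names, trigger names, and schedule are hypothetical; the dictionaries follow the shape accepted by boto3's Glue `create_trigger`, which is shown only in a comment.

```python
def scheduled_trigger(name: str, job_name: str, cron: str) -> dict:
    """Arguments for a trigger that starts a job on a cron schedule."""
    return {
        "Name": name,
        "Type": "SCHEDULED",
        "Schedule": f"cron({cron})",
        "Actions": [{"JobName": job_name}],
        "StartOnCreation": True,
    }


def on_success_trigger(name: str, upstream_job: str, job_name: str) -> dict:
    """Arguments for a trigger that starts a job after another job succeeds."""
    return {
        "Name": name,
        "Type": "CONDITIONAL",
        "Predicate": {
            "Conditions": [
                {
                    "LogicalOperator": "EQUALS",
                    "JobName": upstream_job,
                    "State": "SUCCEEDED",
                }
            ]
        },
        "Actions": [{"JobName": job_name}],
        "StartOnCreation": True,
    }


# Hypothetical job and trigger names, for illustration only.
nightly = scheduled_trigger("nightly-clean", "clean-raw-logs", "0 2 * * ? *")
chained = on_success_trigger("after-clean", "clean-raw-logs", "build-report-tables")
print(nightly)
print(chained)

# With boto3 installed and credentials configured, these could be created as:
#   glue = boto3.client("glue")
#   glue.create_trigger(**nightly)
#   glue.create_trigger(**chained)
```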

AWS Glue DataBrew (NEW)

Clean and normalize data with a visual interface

250+ built-in transformations without writing code

Profile data to understand data patterns and anomalies

Work on large datasets at scale

Amazon Athena

Amazon Athena
Example Query

Amazon Athena: ETL & Query Use Case

Sources (raw data): AWS service logs, application logs, data sourced from external vendors

Flow: Raw data (S3) → Athena CTAS and INSERT INTO to ETL → Transformed data (S3) → Query data (Athena)

Update table partition in the Glue Data Catalog
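
A minimal sketch of the CTAS step in this flow, assuming a raw log table that is rewritten as partitioned Parquet. All table names, column names, and S3 locations are hypothetical. The helper only builds the SQL text; the optional `submit` function shows how it could be sent to Athena with boto3 and is not called here.

```python
def build_ctas(
    target_table: str,
    source_table: str,
    output_location: str,
    partition_column: str,
) -> str:
    """Build an Athena CTAS statement that writes partitioned Parquet."""
    # In a CTAS query the partition column must come last in the SELECT list.
    return (
        f"CREATE TABLE {target_table}\n"
        f"WITH (\n"
        f"  format = 'PARQUET',\n"
        f"  external_location = '{output_location}',\n"
        f"  partitioned_by = ARRAY['{partition_column}']\n"
        f") AS\n"
        f"SELECT request_id, status, bytes_sent, {partition_column}\n"
        f"FROM {source_table}"
    )


def submit(query: str, database: str, results_location: str) -> str:
    """Send the query to Athena (requires boto3 and AWS credentials)."""
    import boto3  # imported lazily so the sketch runs without boto3 installed

    response = boto3.client("athena").start_query_execution(
        QueryString=query,
        QueryExecutionContext={"Database": database},
        ResultConfiguration={"OutputLocation": results_location},
    )
    return response["QueryExecutionId"]


# Hypothetical names and locations, for illustration only.
ctas_query = build_ctas(
    target_table="logs_parquet",
    source_table="logs_raw",
    output_location="s3://example-data-lake/transformed/logs/",
    partition_column="dt",
)
print(ctas_query)
```

Later loads into the same table could use `INSERT INTO logs_parquet SELECT ...` in the same way, which matches the "CTAS and INSERT INTO" label on the slide.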

Amazon QuickSight

Create Beautiful, Interactive Dashboards

• Add rich interactivity like filters, drill downs, zooming, and more
• Blazing fast navigation
• Accessible on any device
• Data Refresh
• Publish to everyone with a click

ML (Machine Learning) Insights
Cutting-edge ML tools that automatically discover powerful insights for your users.

• Anomaly Detection
• Forecasting
• Bring your own model from Amazon SageMaker
• Auto-generated natural language narratives

*currently in preview
THANK YOU

