You are on page 1of 28

AWS Data, Databases, and Analytics Online Series

Unite Real-Time and Batch Analytics


June, 2020

with AWS Glue

Melody Yang
Data Specialist Solutions Architect, AWS

© 2020, Amazon Web Services, Inc. or its Affiliates.


Table of contents

• Overview data analytics lifecycle

• Introduce AWS Glue

• Unite batch and real-time needs

• Demo

© 2020, Amazon Web Services, Inc. or its Affiliates.


Analytics Lifecyle Overview

© 2020, Amazon Web Services, Inc. or its Affiliates.


Extract Visualize/
Generate Collect Store Transform Analyze
Load Report

© 2020, Amazon Web Services, Inc. or its Affiliates.


Generate

IoT device

Web log
Extract Visualize/
Collect Store Transform Analyze
Load Report
Social
media

Transaction

ERP

© 2020, Amazon Web Services, Inc. or its Affiliates.


Collect

Polling Application

Extract Visualize/
Generate Store Transform Analyze
Load Report
Amazon Kinesis

Amazon Managed
Streaming for Kafka

© 2020, Amazon Web Services, Inc. or its Affiliates.


Collect

AWS DMS

Extract Visualize/
Generate Store Transform Analyze
Amazon S3 Load Report

AWS DataSync

AWS Snowball

© 2020, Amazon Web Services, Inc. or its Affiliates.


Store

Amazon S3

Extract Visualize/
Generate Collect Transform Analyze
Load Report

Amazon
RDS

Database
on EC2

© 2020, Amazon Web Services, Inc. or its Affiliates.


Analyze

Amazon Athena

Extract Visualize/
Generate Collect Store Transform
Load Amazon EMR Report

Amazon Redshift &


Redshift Spectrum

Amazon Kinesis
Analytics

© 2020, Amazon Web Services, Inc. or its Affiliates.


Visualize/
Report

Data
scientists

Data
analysts
Extract
Generate Collect Store Transform Analyze
Load Business
users

Engagement
platforms

Automation/
events

© 2020, Amazon Web Services, Inc. or its Affiliates.


Extract
Transform
Load

Amazon EMR

Visualize/
Generate Collect Store Analyze
Report

AWS
Lambda

Amazon Kinesis
Client Library (KCL)

© 2020, Amazon Web Services, Inc. or its Affiliates.


But….
Batch Layer

( EMR ) Serving
Layer

Speed Layer

( Lambda &
Kinesis )

• Separate processing layers, complex in operation, security controls


• Different data schema management approaches
• Large proportion of ETL is hand coding

© 2020, Amazon Web Services, Inc. or its Affiliates.


Extract
Transform
Load

AWS Glue
Visualize/
Generate Collect Store Analyze
Report

© 2020, Amazon Web Services, Inc. or its Affiliates.


What is AWS Glue

© 2020, Amazon Web Services, Inc. or its Affiliates.


AWS Glue components

Data Catalog Serverless Jobs Orchestration

Discover Develop Deploy


Automatic crawling Apache Spark core Flexible scheduling
Apache Hive Metastore compatible Python and Scala Monitoring and alerting
Integrated with AWS analytic services Auto-generates ETL code External integrations

© 2020, Amazon Web Services, Inc. or its Affiliates.


New: Serverless Streaming ETL with AWS Glue

You can create streaming extract, transform, and load (ETL) jobs
that run continuously, consume data from streaming sources like:

• Amazon Kinesis Data Streams and


• Apache Kafka (including the fully-managed Amazon MSK)

Load transformed results into targets like:

• Amazon S3 data lakes or


• JDBC data stores

© 2020, Amazon Web Services, Inc. or its Affiliates.


Steps to Create Streaming ETL

Streaming In-flight Target


Source Transform
4
Specify job property
Supply own or modified script

5
3 Create ETL AWS Glue
Amazon Kinesis
Streaming Job Output Amazon S3
Streaming

2
Amazon MSK Create catalog table AWS Glue
Data Catalog
JDBC
Create Apache Kafka Data store
1 connection
Apache Kafka
AWS Glue
Connection

© 2020, Amazon Web Services, Inc. or its Affiliates.


Unite Streaming and Batch ETL In AWS Glue
Easier and more cost-effective

© 2020, Amazon Web Services, Inc. or its Affiliates.


Use Case: Consumer Profiling
Clickstream Amazon Kinesis
Data Streams
AWS Glue
Streaming
User ETL
AWS Glue

Athena Consumer Glue ETL


Clickstream Profile Consumer
CTAS
(Batch)
(every few Profile
hours)
(Real Time)
Listing

Amazon S3 Amazon Athena Amazon S3 AWS Glue Amazon Athena

© 2020, Amazon Web Services, Inc. or its Affiliates.


AWS Glue - Uniting Batch and Streaming

• Speed of implementation

• Less code to maintain plus code auto-generate

• Operations team loves the serverless aspect

• Smaller data process footprints

© 2020, Amazon Web Services, Inc. or its Affiliates.


Amazon Athena - Uniting Batch and Streaming

• Query data from Amazon S3 directly


with ANSI SQL
• Use CREATE TABLE AS SELECT (CTAS)

to create new tables using a result of


SELECT query

• Serverless, no infrastructure to manage

• Pay $5/TB scanned by your query

• Workgroups for cost control

© 2020, Amazon Web Services, Inc. or its Affiliates.


Sounds Good In Theory…
What’s It Really Like?

© 2020, Amazon Web Services, Inc. or its Affiliates.


Demo context

Amazon Kinesis
Data Stream AWS Glue

Stream

Streaming ETL
Job

IoT Log
Glue
ETL Job
Data Catalog

Amazon Amazon
Upload Athena QuickSight

Hourly CSV file Amazon S3


Data Lake

© 2020, Amazon Web Services, Inc. or its Affiliates.


How Does Unified ETL in AWS Glue Help You?

• Consolidate data process architecture


• Reduce implementation efforts
• Less overhead in application maintenance
• Easier and more cost-effective, to set up serverless ETL
pipelines
• Accelerates your insights by extending to real-time data
• Help you to focus on business outcomes of analytics.

© 2020, Amazon Web Services, Inc. or its Affiliates.


AWS Training and Certification

Explore tailored Data Build cloud skills with Demonstrate expertise


or Database learning free digital Data training with a Data industry-
paths for customers courses such as “The recognized credential
and partners elements of Data (Data analytics and
Science”, or dive deep Database Specialty AWS
with classroom training Certifications)

https://aws.amazon.com/training/

© 2020, Amazon Web Services, Inc. or its Affiliates.


Visit the Data, Databases, and Analytics
Resource Hub for more resources
Dive deeper with these newly created whitepapers and e-books
to help you uncover new insights and value from your data

• An introduction to cloud databases


• Enter the purpose-built database era
• Harness the power of data
• Creating a modern analytics architecture
• The data-driven enterprise https://tinyurl.com/aws-data-
• … and more! databases-analytics

Visit resource hub »

© 2020, Amazon Web Services, Inc. or its Affiliates.


Thank you for attending
AWS Data, Databases, and Analytics Online Series
We hope you found it interesting! A kind reminder to complete the survey.
Let us know what you thought of today’s event and how we can improve the event
experience for you in the future.

aws-apac-marketing@amazon.com
twitter.com/AWSCloud
facebook.com/AmazonWebServices
youtube.com/user/AmazonWebServices
slideshare.net/AmazonWebServices
twitch.tv/aws

© 2020, Amazon Web Services, Inc. or its Affiliates.


Thank you!

© 2020, Amazon Web Services, Inc. or its Affiliates.

You might also like