20 September 2017
© 2016, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
AWS Glue automates the undifferentiated heavy lifting of ETL
Develop: Generate code to clean, enrich, and reliably move data between various data
sources; you can also use your favorite tools to build ETL jobs
Built-in classifiers for popular types; custom classifiers using Grok expressions
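To make the idea of a Grok-based custom classifier concrete, here is a minimal plain-Python sketch of how a Grok expression can be expanded into a regular expression with named groups. It is not Glue's classifier implementation; the pattern dictionary, the function names `grok_to_regex` and `classify`, and the sample log line are all assumptions made for illustration.

```python
import re

# Hypothetical mini pattern library; real Grok ships many more named patterns.
GROK_PATTERNS = {
    "IP": r"\d{1,3}(?:\.\d{1,3}){3}",
    "WORD": r"\w+",
    "NUMBER": r"-?\d+(?:\.\d+)?",
}

def grok_to_regex(expression):
    """Expand %{NAME:field} references into named regex groups."""
    def expand(match):
        name, field = match.group(1), match.group(2)
        return "(?P<%s>%s)" % (field, GROK_PATTERNS[name])
    return re.compile(re.sub(r"%\{(\w+):(\w+)\}", expand, expression))

def classify(line, expression):
    """Return the extracted fields if the whole line matches, else None."""
    m = grok_to_regex(expression).fullmatch(line)
    return m.groupdict() if m else None

# Illustrative log line and expression (both made up for this sketch).
fields = classify("10.0.0.7 GET 512", "%{IP:client} %{WORD:method} %{NUMBER:bytes}")
```

A line that matches yields a dictionary of named fields, which is roughly the information a classifier needs to propose column names; a line that does not match yields `None`.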
Bring in metadata from a variety of data sources (Amazon S3, Amazon Redshift, etc.) into a single
categorized list that is searchable
Data Catalog: Table details
Table properties
Nested fields
Data statistics
Table schema
Data Catalog: Version control
Compare schema versions
List of table versions
Data Catalog: Detecting partitions
[Figure: partition levels annotated with similarity scores (sim=.93 at month=Nov; sim=.99 at date=10 … sim=.95 at date=15)]
Column   Type
month    str
date     str
col 1    int
col 2    float
Table partitions
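As a rough illustration of how `name=value` path segments such as month=Nov and date=10 can become partition columns, here is a small plain-Python sketch. It only parses the path segments and does not attempt the similarity scoring shown on the slide; the object keys and the helper name `partition_values` are hypothetical.

```python
def partition_values(key):
    """Extract name=value folder segments from an object key."""
    parts = {}
    for segment in key.split("/")[:-1]:  # skip the file name itself
        if "=" in segment:
            name, value = segment.split("=", 1)
            parts[name] = value
    return parts

# Hypothetical object keys laid out like the slide's example.
keys = [
    "logs/month=Nov/date=10/part-0.csv",
    "logs/month=Nov/date=15/part-0.csv",
]

partitions = [partition_values(k) for k in keys]

# Column order follows the folder nesting of the first key.
partition_columns = list(partitions[0])
```

Each distinct combination of values corresponds to one partition of the table, and the partition columns sit alongside the columns read from the files themselves.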
Customizable: Use native PySpark, import custom libraries, and/or leverage Glue’s libraries
Collaborative: share code snippets via GitHub, reuse code across jobs
Job Authoring: Glue Dynamic Frames
[Figure: ResolveChoice() and ApplyMapping() transforms on a dynamic frame with fields A, B, C and X, Y]
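The two transforms named on this slide can be pictured with a plain-Python sketch over a list of dictionaries. This mimics the idea only: it does not use Glue's DynamicFrame API, and the helper names `resolve_choice_cast` and `apply_mapping`, their signatures, and the sample records are assumptions for illustration.

```python
def resolve_choice_cast(records, field, cast):
    """Resolve a mixed-type field by casting every value to one type."""
    return [
        {**r, field: cast(r[field])} if field in r else dict(r)
        for r in records
    ]

def apply_mapping(records, mappings):
    """Keep only mapped fields, renaming and casting each one.

    mappings is a list of (source_name, target_name, cast) tuples.
    """
    return [
        {target: cast(r[source]) for source, target, cast in mappings if source in r}
        for r in records
    ]

# Field B is an int in one record and a string in the other.
records = [{"A": 1, "B": 7}, {"A": 2, "B": "8"}]

resolved = resolve_choice_cast(records, "B", int)
mapped = apply_mapping(resolved, [("A", "X", str), ("B", "Y", float)])
```

Resolving the ambiguous type first means the later mapping step can assume one type per field, which is the ordering the slide's diagram suggests.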
Job authoring: Relationalize() transform
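[Figure: Relationalize() flattens a nested record (A, B, C with sub-fields X and Y, array D[ ]) into a table with columns A, B, C.X, C.Y and a foreign key (FK), plus a second table with columns PK, Offset, Value]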
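A minimal plain-Python sketch of the flattening idea follows. It is not Glue's Relationalize implementation: it handles only one level of nesting, and the function name, the `root` table naming, and the sample record are assumptions. The column names PK, Offset, and Value are taken from the slide.

```python
def relationalize(records, root="root"):
    """Flatten nested dicts into dotted columns and move lists to child tables."""
    tables = {root: []}
    next_key = 0
    for record in records:
        row = {}
        for name, value in record.items():
            if isinstance(value, dict):
                # Nested struct: one dotted column per sub-field.
                for sub, v in value.items():
                    row[name + "." + sub] = v
            elif isinstance(value, list):
                # Array: one child-table row per element, linked by a key.
                child = tables.setdefault(root + "_" + name, [])
                for offset, item in enumerate(value):
                    child.append({"PK": next_key, "Offset": offset, "Value": item})
                row[name] = next_key  # foreign key into the child table
                next_key += 1
            else:
                row[name] = value
        tables[root].append(row)
    return tables

tables = relationalize([{"A": 1, "B": 2, "C": {"X": 3, "Y": 4}, "D": [5, 6]}])
```

The parent row keeps a key in place of the array, so the original nesting can be recovered with a join between the two tables.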
[Figure labels: remote interpreter, interpreter server]
When you are satisfied with the results, you can create an ETL job that runs your code.
Job Authoring: Leveraging the community
Compute instances
There is no need to provision, configure, or
manage servers