
Tiktok_ads_catalog


AWS Glue is a serverless ETL (extract, transform, and load) service on
the AWS cloud. Because there is no infrastructure to provision or keep
running between jobs, Glue is attractive and cost-effective for ETL
pipelines that run infrequently. It makes it easy for customers to
prepare their data for analytics.

Components of AWS Glue


• Data catalog: The data catalog holds the metadata and the
structure of the data.
• Database: A container in the data catalog that groups the
tables for the sources and targets.
• Table: One or more tables are created in the database to
describe the source and target data.
• Crawler and Classifier: A crawler scans the source and uses
built-in or custom classifiers to infer its format and schema.
It creates or updates metadata tables in the data catalog.
• Job: A job is the business logic that carries out an ETL task.
The logic is written as a Python or Scala script and runs
internally on Apache Spark.
• Trigger: A trigger starts the ETL job execution on demand or
at a specific time.

Step-by-step Glue Crawler

Before creating a crawler, we must have some data in our S3 bucket. I
uploaded an example data set called ‘animal.csv’ to my bucket.

Then we can create a crawler…

In this section we will give the data path of S3 bucket…

Choose the respective file…



The crawler needs an IAM role to access our S3 bucket. If we don’t
have one, we must create it from the “create an IAM role” section.
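
For readers who prefer to script this step, here is a minimal sketch of what such a role could look like when created with boto3 instead of the console. The role name, the bucket name, and the inline policy name `crawler-s3-read` are placeholders, not values from this walkthrough. The read-only S3 policy is one reasonable assumption about what the crawler needs.

```python
import json

# Trust policy that lets the AWS Glue service assume the role.
GLUE_TRUST_POLICY = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {"Service": "glue.amazonaws.com"},
            "Action": "sts:AssumeRole",
        }
    ],
}


def s3_read_policy(bucket):
    """Inline policy granting read access to one bucket (name is a placeholder)."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": ["s3:GetObject", "s3:ListBucket"],
                "Resource": [
                    f"arn:aws:s3:::{bucket}",
                    f"arn:aws:s3:::{bucket}/*",
                ],
            }
        ],
    }


def create_crawler_role(role_name, bucket):
    """Create the role, attach the Glue managed policy, add S3 read access."""
    import boto3  # imported lazily so the policy helpers work without boto3

    iam = boto3.client("iam")
    iam.create_role(
        RoleName=role_name,
        AssumeRolePolicyDocument=json.dumps(GLUE_TRUST_POLICY),
    )
    iam.attach_role_policy(
        RoleName=role_name,
        PolicyArn="arn:aws:iam::aws:policy/service-role/AWSGlueServiceRole",
    )
    iam.put_role_policy(
        RoleName=role_name,
        PolicyName="crawler-s3-read",  # hypothetical policy name
        PolicyDocument=json.dumps(s3_read_policy(bucket)),
    )
```

Calling `create_crawler_role("my-glue-crawler-role", "my-bucket")` requires AWS credentials with IAM permissions; the two policy helpers can be inspected without them.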

Then we can adjust our schedule…



The crawler needs a database for its output, so we can add a database
or choose an existing one. Crawler tables are named after the S3 prefix
or folder they were found in. We can add a prefix …
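
The console choices so far (data path, role, schedule, database, table prefix) map onto the parameters of boto3's `create_crawler` call. The sketch below shows that mapping. All concrete values in the usage note are placeholders, and the helper name `build_crawler_request` is made up for illustration.

```python
def build_crawler_request(name, role, database, s3_path,
                          table_prefix="", schedule=None):
    """Assemble keyword arguments for the Glue create_crawler API call."""
    request = {
        "Name": name,
        "Role": role,                      # IAM role name or ARN
        "DatabaseName": database,          # catalog database for the output
        "Targets": {"S3Targets": [{"Path": s3_path}]},
    }
    if table_prefix:
        request["TablePrefix"] = table_prefix
    if schedule:
        # Glue cron syntax, e.g. "cron(0 12 * * ? *)"; omit for on-demand runs.
        request["Schedule"] = schedule
    return request


def create_crawler(request):
    """Send the request to AWS Glue (needs credentials and boto3)."""
    import boto3  # imported lazily so the request builder works without boto3

    boto3.client("glue").create_crawler(**request)
```

For example, `build_crawler_request("animal-crawler", "my-glue-crawler-role", "animal_db", "s3://my-bucket/animals/", table_prefix="raw_")` describes an on-demand crawler whose tables all start with `raw_`.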

Finally, the crawler is created. If you chose the “run on demand”
option, start the crawler yourself; once the run finishes, it will have
created a table under the database you chose.
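
Starting an on-demand crawler and waiting for it can also be scripted. The following is a sketch that assumes a boto3 Glue client is passed in as `glue`. It relies on the crawler state returning to `READY` after a run, and the polling interval and timeout are arbitrary choices.

```python
import time


def run_crawler_and_wait(glue, name, poll_seconds=15, timeout_seconds=900):
    """Start a crawler, poll until it is READY again, return the last status."""
    glue.start_crawler(Name=name)
    deadline = time.monotonic() + timeout_seconds
    while time.monotonic() < deadline:
        crawler = glue.get_crawler(Name=name)["Crawler"]
        # States reported by Glue are READY, RUNNING and STOPPING.
        if crawler["State"] == "READY":
            # LastCrawl.Status is SUCCEEDED, CANCELLED or FAILED.
            return crawler.get("LastCrawl", {}).get("Status")
        time.sleep(poll_seconds)
    raise TimeoutError(f"crawler {name!r} did not finish in {timeout_seconds}s")
```

With real credentials this would be called as `run_crawler_and_wait(boto3.client("glue"), "animal-crawler")`, where the crawler name is a placeholder.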

We can see the schema generated by the crawler by clicking on the
table above.
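
The same schema can be read programmatically from the data catalog. Here is a small sketch built around boto3's `get_table` response shape. The database and table names are whatever you chose in the console, and the helper names are invented for this example.

```python
def describe_columns(table_response):
    """Extract (name, type) pairs for columns and partition keys."""
    table = table_response["Table"]
    columns = [(c["Name"], c["Type"])
               for c in table["StorageDescriptor"]["Columns"]]
    partition_keys = [(k["Name"], k["Type"])
                      for k in table.get("PartitionKeys", [])]
    return columns, partition_keys


def fetch_schema(database, table_name):
    """Look the table up in the Glue data catalog (needs credentials)."""
    import boto3  # imported lazily so describe_columns works without boto3

    response = boto3.client("glue").get_table(
        DatabaseName=database, Name=table_name
    )
    return describe_columns(response)
```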

Partition in Crawler
When an AWS Glue crawler scans Amazon S3 and detects multiple folders in a
bucket, it determines the root of a table in the folder structure and which folders
are partitions of a table. The name of the table is based on the Amazon S3 prefix
or folder name. You provide an Include path that points to the folder level to
crawl. When the majority of schemas at a folder level are similar, the crawler
creates partitions of a table instead of separate tables. To influence the crawler
to create separate tables, add each table's root folder as a separate data store
when you define the crawler.

For example, consider the following Amazon S3 folder structure.

The paths to the four lowest level folders are the following:

s3://sales/year=2019/month=Jan/day=1

s3://sales/year=2019/month=Jan/day=2

s3://sales/year=2019/month=Feb/day=1

s3://sales/year=2019/month=Feb/day=2

Assume that the crawler target is set at the sales bucket, and that all
files in the day=n folders have the same format (for example, JSON, not
encrypted), and have the same or very similar schemas. The crawler will
create a single table with four partitions, with partition keys year,
month, and day.
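
To make the mapping from folder names to partition keys concrete, the sketch below parses the four example paths the way a reader might by hand. It is only an illustration of the `key=value` folder convention, not the crawler's actual algorithm.

```python
from urllib.parse import urlparse


def partition_values(s3_uri):
    """Return the key=value folder segments of an S3 URI as a dict."""
    path = urlparse(s3_uri).path  # part after the bucket name
    pairs = {}
    for segment in path.strip("/").split("/"):
        if "=" in segment:
            key, value = segment.split("=", 1)
            pairs[key] = value
    return pairs


paths = [
    "s3://sales/year=2019/month=Jan/day=1",
    "s3://sales/year=2019/month=Jan/day=2",
    "s3://sales/year=2019/month=Feb/day=1",
    "s3://sales/year=2019/month=Feb/day=2",
]

# One dict per lowest-level folder, i.e. one per partition.
partitions = [partition_values(p) for p in paths]
# The partition key columns, in folder order.
partition_keys = list(partitions[0])
```

Each of the four folders yields a distinct combination of year, month, and day, which is why the crawler reports four partitions over three partition keys.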

In the next example, consider the following Amazon S3 structure:

s3://bucket01/folder1/table1/partition1/file.txt

s3://bucket01/folder1/table1/partition2/file.txt

s3://bucket01/folder1/table1/partition3/file.txt

s3://bucket01/folder1/table2/partition4/file.txt

s3://bucket01/folder1/table2/partition5/file.txt

If the schemas for files under table1 and table2 are similar, and a single data
store is defined in the crawler with Include path s3://bucket01/folder1/,
the crawler creates a single table with two partition key columns. The first
partition key column contains table1 and table2, and the second partition
key column contains partition1 through partition3 for the table1 partition
and partition4 and partition5 for the table2 partition. To create two separate
tables, define the crawler with two data stores. In this example, define the
first Include path as s3://bucket01/folder1/table1/ and the second
as s3://bucket01/folder1/table2.
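
As a sketch of the two-data-store setup, the helper below builds one S3 target per table root. The resulting dictionary has the shape that boto3's `create_crawler` accepts for its `Targets` argument. The helper name is made up for this example.

```python
def separate_table_targets(root, table_folders):
    """Build one S3 data store per table root so each becomes its own table."""
    root = root.rstrip("/")
    return {
        "S3Targets": [{"Path": f"{root}/{folder}/"} for folder in table_folders]
    }


# Two Include paths instead of the single s3://bucket01/folder1/ path.
targets = separate_table_targets("s3://bucket01/folder1/", ["table1", "table2"])
```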

Partitions of our sample data:

Properties
In the table properties we can view the JSON structure of the schema:

This was a quick guide to creating a schema for data stored in S3 using an
AWS Glue crawler.
