Tiktok_ads_catalog
Partition in Crawler
When an AWS Glue crawler scans Amazon S3 and detects multiple folders in a
bucket, it determines the root of a table in the folder structure and which folders
are partitions of a table. The name of the table is based on the Amazon S3 prefix
or folder name. You provide an Include path that points to the folder level to
crawl. When the majority of schemas at a folder level are similar, the crawler
creates partitions of a table instead of separate tables. To influence the crawler
to create separate tables, add each table's root folder as a separate data store
when you define the crawler.
For example, consider sales data organized into folders by year, month, and
day. The paths to the four lowest-level folders are the following:
s3://sales/year=2019/month=Jan/day=1
s3://sales/year=2019/month=Jan/day=2
s3://sales/year=2019/month=Feb/day=1
s3://sales/year=2019/month=Feb/day=2
Assume that the crawler target is set to s3://sales, and that all files in
the day=n folders have the same format (for example, JSON, not encrypted)
and the same or very similar schemas. The crawler then creates a single
table with four partitions, with partition keys year, month, and day.
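The mapping from folder names to partition keys can be sketched in a few lines of Python. This is only an illustration of how the key=value folder names line up with the keys year, month, and day. It is not the crawler's actual algorithm, and the helper name partition_values is an assumption made for this example.

```python
def partition_values(path, include_path):
    """Return the key=value folder levels found below include_path as a dict."""
    prefix = include_path.rstrip("/") + "/"
    if not path.startswith(prefix):
        raise ValueError(f"{path} is not under {include_path}")
    relative = path[len(prefix):]
    values = {}
    for segment in relative.split("/"):
        # Only Hive-style folder names (key=value) are treated as partitions here.
        if "=" in segment:
            key, value = segment.split("=", 1)
            values[key] = value
    return values


paths = [
    "s3://sales/year=2019/month=Jan/day=1",
    "s3://sales/year=2019/month=Jan/day=2",
    "s3://sales/year=2019/month=Feb/day=1",
    "s3://sales/year=2019/month=Feb/day=2",
]

# One dict per lowest-level folder, i.e. one per partition.
partitions = [partition_values(p, "s3://sales") for p in paths]

# The partition keys are the folder-name keys, in folder order.
partition_keys = list(partitions[0])

print(partition_keys)
print(len(partitions))
```

Each of the four folders yields one combination of year, month, and day values, which corresponds to one partition of the single table.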
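As a second example, consider the following Amazon S3 folder structure:
s3://bucket01/folder1/table1/partition1/file.txt
s3://bucket01/folder1/table1/partition2/file.txt
s3://bucket01/folder1/table1/partition3/file.txt
s3://bucket01/folder1/table2/partition4/file.txt
s3://bucket01/folder1/table2/partition5/file.txt
If the schemas of the files under table1 and table2 are similar and the
crawler has a single data store with the Include path s3://bucket01/folder1/,
the crawler creates one table with two partition key columns. To create two
separate tables, define the crawler with two data stores, one with the
Include path s3://bucket01/folder1/table1/ and one with
s3://bucket01/folder1/table2/.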
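To get separate tables from a layout like the bucket01 one, each table's root folder can be passed as its own S3 target when the crawler is defined. The sketch below only builds the request as a plain dictionary, shaped for the boto3 Glue client's create_crawler call. The crawler name, role ARN, and database name are hypothetical placeholders, and build_crawler_request is a helper invented for this example.

```python
def build_crawler_request(name, role_arn, database, table_roots):
    """Build keyword arguments for create_crawler, with one S3 target per table root."""
    return {
        "Name": name,
        "Role": role_arn,
        "DatabaseName": database,
        # Each entry in S3Targets is a separate data store with its own Include path.
        "Targets": {"S3Targets": [{"Path": root} for root in table_roots]},
    }


request = build_crawler_request(
    name="folder1-crawler",  # hypothetical crawler name
    role_arn="arn:aws:iam::123456789012:role/GlueCrawlerRole",  # hypothetical role
    database="example_db",  # hypothetical Data Catalog database
    table_roots=[
        "s3://bucket01/folder1/table1/",
        "s3://bucket01/folder1/table2/",
    ],
)

print(request["Targets"])

# With AWS credentials configured, the request could be submitted like this:
#   import boto3
#   boto3.client("glue").create_crawler(**request)
```

Passing s3://bucket01/folder1/ as the only target instead would leave it to the crawler to decide whether table1 and table2 are partitions of one table.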
Properties
In Properties, we can view the JSON structure of the schema:
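The same schema information can also be read programmatically, since the boto3 Glue client's get_table call returns the table definition as a nested dictionary. The sketch below assumes that response shape and runs against a hand-written sample. The database name, table name, and column names are made up for illustration and are not taken from a real catalog.

```python
def describe_table(response):
    """Extract (name, type) pairs for data columns and partition keys."""
    table = response["Table"]
    columns = [
        (col["Name"], col["Type"])
        for col in table["StorageDescriptor"]["Columns"]
    ]
    partition_keys = [
        (key["Name"], key["Type"])
        for key in table.get("PartitionKeys", [])
    ]
    return {"columns": columns, "partition_keys": partition_keys}


# Hand-written sample in the shape of a get_table response (hypothetical values).
sample_response = {
    "Table": {
        "Name": "sales",
        "StorageDescriptor": {
            "Columns": [
                {"Name": "order_id", "Type": "bigint"},
                {"Name": "amount", "Type": "double"},
            ]
        },
        "PartitionKeys": [
            {"Name": "year", "Type": "string"},
            {"Name": "month", "Type": "string"},
            {"Name": "day", "Type": "string"},
        ],
    }
}

schema = describe_table(sample_response)
print(schema["columns"])
print(schema["partition_keys"])

# Against a real catalog, the response could be fetched like this:
#   import boto3
#   response = boto3.client("glue").get_table(DatabaseName="example_db", Name="sales")
```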
This was a quick guide to creating a schema for data stored in S3 using an
AWS Glue crawler.