
AWS Workshop

Data Mesh Project

Objective

The objective of the project is to implement a data mesh architecture in AWS. A data mesh consists of a centralized catalog: producers publish their data to the catalog, and consumers are granted access to the cataloged data. A data mesh therefore consists of a Producer, a Consumer, and a Central Catalog. In this implementation, each of these entities is housed in a separate AWS account.

Functions:

Producer - responsible for ingesting and transforming data, such as flat files and RDBMS tables, into the producer account from sources such as on-premises systems. This AWS account houses a raw bucket, which is accessible by the ETL user only, and a Glue job that loads the data into the producer refined bucket, partitioned into daily partitions.

Central Catalog - maintains the Glue catalog, grants the producer and consumer access to catalog objects, and maintains and assigns tags to the Glue catalog objects. It grants the producer ETL process access to write, read, and add partitions to the Glue catalog, and grants the consumer account access to read data from the producer account buckets. It also maintains LF-Tag information and assigns the tags to catalog resources.
Consumer - consumers can request access to Glue catalog objects and use Athena / Redshift Spectrum to query the data.

N.B. Data is not replicated to multiple accounts; it is loaded only into the producer account.
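
For illustration, below is a minimal boto3 sketch of the tag-based, cross-account grant the Central Catalog account performs. The account IDs and the "classification" tag key/values are placeholders chosen for this example; the consumer's Lake Formation admin would then re-grant the shared access to its local roles.

```python
import boto3

# Placeholders for this sketch: central catalog and consumer account IDs,
# and an assumed "classification" LF-Tag created in the central account.
CENTRAL_CATALOG_ID = "111111111111"
CONSUMER_ACCOUNT_ID = "222222222222"

lf = boto3.client("lakeformation")

# Grant SELECT (with grant option) to the consumer account on every table/column
# whose LF-Tag expression matches "classification = non-sensitive".
lf.grant_permissions(
    Principal={"DataLakePrincipalIdentifier": CONSUMER_ACCOUNT_ID},
    Resource={
        "LFTagPolicy": {
            "CatalogId": CENTRAL_CATALOG_ID,
            "ResourceType": "TABLE",
            "Expression": [{"TagKey": "classification", "TagValues": ["non-sensitive"]}],
        }
    },
    Permissions=["SELECT"],
    PermissionsWithGrantOption=["SELECT"],
)
```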

Architecture:

Use cases:
1. Column-level tagging – when users log on, they should only have access to the columns they are allowed to access; for example, some users have access to all data including sensitive columns, while others are allowed to access non-sensitive data only. The producer should be able to access its own data, but consumers such as a data analyst, data scientist, or executive user may each have a different set of column accesses, classified as sensitive or non-sensitive. The implementer can choose their own dataset and classify the column access needed (see the boto3 sketch after this list).

2. The Glue job for the producer has restricted access to its own tables and databases only

3. Redshift Spectrum users are allowed access to all data, including sensitive columns

4. Athena users can only access non-sensitive data. This use case is suggested to differentiate it from the Redshift Spectrum use case, where all data is accessible. If you want to use just Athena, you can implement all the data-access use cases for the different users mentioned in point 1 in Athena itself

5. Implement using the console

6. Automate the implementation using boto3, AWS Step Functions, and AWS Lambda
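
For use case 1, column-level entitlements can be expressed with LF-Tags. The boto3 sketch below shows one possible way: it creates a hypothetical `classification` tag in the central account and attaches the sensitive value to a few columns of a hypothetical `users` table. Adapt the database, table, and column names to the dataset you choose.

```python
import boto3

lf = boto3.client("lakeformation")

# Hypothetical tag ontology: a single "classification" key with two values.
lf.create_lf_tag(TagKey="classification", TagValues=["sensitive", "non-sensitive"])

# Attach the tag at column level; database/table/column names are placeholders.
lf.add_lf_tags_to_resource(
    Resource={
        "TableWithColumns": {
            "DatabaseName": "producer_db",
            "Name": "users",
            "ColumnNames": ["firstname", "lastname", "email"],
        }
    },
    LFTags=[{"TagKey": "classification", "TagValues": ["sensitive"]}],
)

# The remaining columns can be tagged non-sensitive the same way, or the tag can be
# set at database/table level and overridden on the sensitive columns.
```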

Pre-requisites:
Knowledge of Lake Formation, Glue, S3, boto3, Lambda, and Step Functions is desired. Some suggested reference materials are as follows:

AWS documentation

https://docs.aws.amazon.com/lake-formation/latest/dg/register-data-lake.html
https://docs.aws.amazon.com/lake-formation/latest/dg/populating-catalog.html
https://docs.aws.amazon.com/lake-formation/latest/dg/managing-permissions.html
https://docs.aws.amazon.com/lake-formation/latest/dg/TBAC-prereqs.html
https://docs.aws.amazon.com/lake-formation/latest/dg/tag-based-access-control.html
https://docs.aws.amazon.com/glue/latest/dg/update-from-job.html

AWS Blogs

https://aws.amazon.com/blogs/big-data/build-a-modern-data-architecture-and-data-mesh-pattern-at-scale-using-aws-lake-formation-tag-based-access-control/
https://aws.amazon.com/blogs/big-data/easily-manage-your-data-lake-at-scale-using-tag-based-access-control-in-aws-lake-formation/
https://catalog.us-east-1.prod.workshops.aws/workshops/fae051e0-0b86-4cae-82f0-60dbdfeb7e2a/en-US/building-datamesh/setup-producer-datalake/querying-datalake/tag-based-access-control

YouTube Videos from AWS:

https://www.youtube.com/watch?v=Aj5T5fcZZr0
https://www.youtube.com/watch?v=5H0_PeK4Uos&t=1384s
https://www.youtube.com/watch?v=YPYODx4Pfdc&t=2357s

Please proceed after you have reviewed at least the blogs. For additional help, refer to the documentation links or videos above.
Steps:

1. Create Producer, Central Catalog, and Consumer accounts in AWS. If these accounts are under the same organization, explicit acceptance of the resource share between the accounts is not required; otherwise, an explicit acceptance step is required in the receiving account.
2. Choose a dataset that will be used for the project. Suggested dataset: one that contains columns like names, for example the "user" data here: https://docs.aws.amazon.com/redshift/latest/gsg/rs-gsg-create-sample-db.html , where both the metadata definition and the data are available.
3. Analyze the metadata and classify each column as sensitive or non-sensitive; this classification will be used later for data access entitlements.
4. Create S3 buckets for the producer in the producer account: one bucket to store the raw data and one refined bucket to store the data that will be used for querying. Ensure the buckets are KMS encrypted. Upload the file chosen in step 2 to the S3 raw bucket.
5. Register the producer refined bucket as a data lake location in Lake Formation in the central catalog account.
6. Create a Lake Formation admin role, a producer LF service role to write to tables, and consumer data roles that have access to tables/databases. Create two consumer roles: one having access to all data and another having access to non-sensitive data only.
7. Create LF-Tags in the central account: PII tags for sensitive and non-sensitive data.
8. Define the database and the table(s) using the metadata from step 3.
9. Associate the LF-Tags with the database / individual columns of the table.
10. Create a resource link in the producer account for the database in the central account.
11. Update the Glue catalog policy, granting the producer and consumer accounts the Glue catalog share (the tag ontology created in the central account is used to grant access to the producer and consumer accounts). Ensure the Glue catalog evaluates access by LF-Tags.
12. Update the Glue catalog resource policy to include the producer account and the use of the Lake Formation tags option (see the catalog resource policy sketch after these steps).
13. Create a Glue job that reads data from the raw bucket, validates the schema against the Glue catalog, and updates the Glue table with the new partition. If the partition already exists, overwrite its data. For partitioning, assume that the full file arrives daily. Transform the data to Parquet format (see the Glue job sketch after these steps).
14. The Glue job should fail if the schema of the incoming file does not match the Glue catalog table schema.
15. Automate the end-to-end pipeline (raw -> refined) using Step Functions. You can build the state machine with the new Workflow Studio feature and use it to orchestrate the entire flow (see the Step Functions sketch after these steps).
16. Ensure that the Glue job service role has the required S3 policy, the KMS key grants enabled, and the Glue catalog access given in step 6.
17. Add an S3 bucket policy on the refined bucket that allows the consumer data roles defined in the central catalog Lake Formation account to read from this bucket (see the bucket policy sketch after these steps).
18. Delegate the LF-Tag grants and the shared PII tag to the consumer data roles.
19. Create a resource link in the consumer account for the shared database.
20. Create an Athena workgroup in the consumer account and query the shared Glue tables in Athena. It should allow viewing of non-sensitive data only (see the Athena sketch after these steps).
21. Create a Redshift cluster if using Redshift Spectrum to query the data. Create external tables from the shared catalog and query the data. It should allow access to all data (see the Redshift Spectrum sketch after these steps).
22. In steps 20 & 21, ensure that the PII tag restrictions on data access are adhered to.
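
For steps 11-12, one possible way to update the Glue catalog resource policy in the central account with boto3 is sketched below. The account IDs, region, and the exact actions granted are placeholders/assumptions; `EnableHybrid='TRUE'` is what lets an IAM-style catalog policy coexist with Lake Formation permissions.

```python
import json
import boto3

CENTRAL_ACCOUNT = "111111111111"                      # placeholder central catalog account
SHARED_ACCOUNTS = ["222222222222", "333333333333"]    # placeholder producer / consumer accounts
REGION = "us-east-1"

glue = boto3.client("glue", region_name=REGION)

# Read-only catalog access for the shared accounts; fine-grained access stays with LF-Tags.
policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {"AWS": [f"arn:aws:iam::{acct}:root" for acct in SHARED_ACCOUNTS]},
            "Action": ["glue:Get*", "glue:BatchGet*"],
            "Resource": [
                f"arn:aws:glue:{REGION}:{CENTRAL_ACCOUNT}:catalog",
                f"arn:aws:glue:{REGION}:{CENTRAL_ACCOUNT}:database/*",
                f"arn:aws:glue:{REGION}:{CENTRAL_ACCOUNT}:table/*/*",
            ],
        }
    ],
}

# EnableHybrid keeps Lake Formation grants in effect alongside this catalog policy.
glue.put_resource_policy(PolicyInJson=json.dumps(policy), EnableHybrid="TRUE")
```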
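
For steps 13-14, a minimal Glue (PySpark) job sketch is shown below. The job argument names (raw_path, refined_path, database, table, load_date) and the daily `load_date` partition column are assumptions for this sketch; your layout and catalog-update options may differ.

```python
import sys
import boto3
from awsglue.utils import getResolvedOptions
from awsglue.context import GlueContext
from awsglue.dynamicframe import DynamicFrame
from pyspark.context import SparkContext
from pyspark.sql.functions import lit

# Hypothetical job arguments; pass them via the job definition or Step Functions.
args = getResolvedOptions(
    sys.argv, ["JOB_NAME", "raw_path", "refined_path", "database", "table", "load_date"]
)

glue_context = GlueContext(SparkContext())
spark = glue_context.spark_session

# Read the full daily file that landed in the raw bucket.
df = spark.read.option("header", "true").csv(args["raw_path"])

# Step 14: fail fast if the incoming columns do not match the Glue catalog table schema.
glue = boto3.client("glue")
catalog_table = glue.get_table(DatabaseName=args["database"], Name=args["table"])["Table"]
catalog_cols = sorted(c["Name"].lower() for c in catalog_table["StorageDescriptor"]["Columns"])
incoming_cols = sorted(c.lower() for c in df.columns)
if incoming_cols != catalog_cols:
    raise ValueError(f"Schema mismatch: file {incoming_cols} vs catalog {catalog_cols}")

# If the daily partition already exists, purge it so a rerun overwrites the data.
partition_path = f'{args["refined_path"]}/load_date={args["load_date"]}/'
glue_context.purge_s3_path(partition_path, {"retentionPeriod": 0})

# Write Parquet into the refined bucket and register the new partition in the catalog.
dyf = DynamicFrame.fromDF(
    df.withColumn("load_date", lit(args["load_date"])), glue_context, "dyf"
)
sink = glue_context.getSink(
    connection_type="s3",
    path=args["refined_path"],
    enableUpdateCatalog=True,
    updateBehavior="UPDATE_IN_DATABASE",
    partitionKeys=["load_date"],
)
sink.setFormat("glueparquet")
sink.setCatalogInfo(catalogDatabase=args["database"], catalogTableName=args["table"])
sink.writeFrame(dyf)
```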
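
For step 15, Workflow Studio in the console is one option; if you prefer to script it, the boto3 sketch below creates a minimal state machine that starts the Glue job synchronously. The job name, role ARN, and state machine name are placeholders.

```python
import json
import boto3

# Placeholder names/ARNs; substitute your Glue job and Step Functions execution role.
definition = {
    "Comment": "Raw-to-refined pipeline (sketch)",
    "StartAt": "RunGlueJob",
    "States": {
        "RunGlueJob": {
            "Type": "Task",
            "Resource": "arn:aws:states:::glue:startJobRun.sync",
            "Parameters": {
                "JobName": "producer-refine-job",
                "Arguments": {"--load_date.$": "$.load_date"},
            },
            "End": True,
        }
    },
}

sfn = boto3.client("stepfunctions")
sfn.create_state_machine(
    name="data-mesh-raw-to-refined",
    definition=json.dumps(definition),
    roleArn="arn:aws:iam::222222222222:role/stepfunctions-pipeline-role",
)
```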
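
For step 17, a sketch of the refined-bucket policy applied with boto3 follows; the bucket name and role ARNs are placeholders standing in for the consumer data roles defined in the central account.

```python
import json
import boto3

REFINED_BUCKET = "producer-refined-bucket"   # placeholder bucket name
CONSUMER_ROLE_ARNS = [
    "arn:aws:iam::111111111111:role/consumer-all-data-role",        # placeholder roles
    "arn:aws:iam::111111111111:role/consumer-non-sensitive-role",
]

# Read-only access for the consumer data roles; everything else stays denied by default.
policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "AllowConsumerDataRolesRead",
            "Effect": "Allow",
            "Principal": {"AWS": CONSUMER_ROLE_ARNS},
            "Action": ["s3:GetObject", "s3:ListBucket"],
            "Resource": [
                f"arn:aws:s3:::{REFINED_BUCKET}",
                f"arn:aws:s3:::{REFINED_BUCKET}/*",
            ],
        }
    ],
}

boto3.client("s3").put_bucket_policy(Bucket=REFINED_BUCKET, Policy=json.dumps(policy))
```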
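
For step 20, the query can be run from the Athena console; the boto3 equivalent is sketched below with placeholder workgroup, resource-link database, table, and output-location names. Run it with the non-sensitive consumer role's credentials so that Lake Formation filters the columns.

```python
import boto3

athena = boto3.client("athena")

# Placeholder workgroup / resource-link database / results bucket.
resp = athena.start_query_execution(
    QueryString="SELECT * FROM users LIMIT 10",
    QueryExecutionContext={"Database": "consumer_resource_link_db"},
    WorkGroup="consumer-workgroup",
    ResultConfiguration={"OutputLocation": "s3://consumer-athena-results/"},
)

# Check the execution state; only the columns the role is entitled to should be returned.
state = athena.get_query_execution(QueryExecutionId=resp["QueryExecutionId"])
print(state["QueryExecution"]["Status"]["State"])
```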
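
For step 21, the external schema pointing at the shared catalog is created in SQL; the sketch below submits it through the Redshift Data API with placeholder cluster, database, user, role, and schema names.

```python
import boto3

# The IAM role must be associated with the cluster and have Lake Formation access
# to the shared (resource-linked) database; all identifiers here are placeholders.
sql = """
CREATE EXTERNAL SCHEMA IF NOT EXISTS datamesh_spectrum
FROM DATA CATALOG
DATABASE 'consumer_resource_link_db'
IAM_ROLE 'arn:aws:iam::111111111111:role/redshift-spectrum-role';
"""

rsd = boto3.client("redshift-data")
rsd.execute_statement(
    ClusterIdentifier="consumer-redshift-cluster",
    Database="dev",
    DbUser="awsuser",
    Sql=sql,
)

# Afterwards, query the shared table through the external schema, e.g.:
#   SELECT * FROM datamesh_spectrum.users LIMIT 10;
```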

Data Security Considerations:


• Encrypt all the S3 buckets using a KMS key.
• Enable an S3 endpoint in your VPC
• Restrict the IAM roles to controlled access to resources
• Restrict access to the S3 buckets to your IAM roles only

Bonus:
Use Step Functions to read an input manifest and create tables / databases in the central account (see the sketch after this list)
Read a manifest file to kick off the daily loads
Create JSON schemas to define the tables
Use Step Functions to create LF-Tags and add them to catalog resources
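
For the bonus item on manifest-driven table creation, a Lambda-style sketch is shown below: it reads a JSON manifest describing a table and creates it in the central Glue catalog. The manifest layout and function names are assumptions for illustration, not a prescribed format.

```python
import json
import boto3

glue = boto3.client("glue")
s3 = boto3.client("s3")

def create_table_from_manifest(bucket, key):
    """Read a JSON manifest (hypothetical layout) and create the Glue table it describes."""
    manifest = json.loads(s3.get_object(Bucket=bucket, Key=key)["Body"].read())
    # Assumed manifest keys: database, table, location, columns=[{name, type}], partitions=[...]
    glue.create_table(
        DatabaseName=manifest["database"],
        TableInput={
            "Name": manifest["table"],
            "TableType": "EXTERNAL_TABLE",
            "PartitionKeys": [
                {"Name": p["name"], "Type": p["type"]} for p in manifest.get("partitions", [])
            ],
            "StorageDescriptor": {
                "Columns": [{"Name": c["name"], "Type": c["type"]} for c in manifest["columns"]],
                "Location": manifest["location"],
                "InputFormat": "org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat",
                "OutputFormat": "org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat",
                "SerdeInfo": {
                    "SerializationLibrary": "org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe"
                },
            },
        },
    )

def lambda_handler(event, context):
    # Invoked from a Step Functions state with the manifest's bucket/key in the input.
    create_table_from_manifest(event["bucket"], event["key"])
    return {"status": "created"}
```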

Infrastructure Automation: Once the above architecture is implemented, create a CloudFormation/Terraform template for deployment of the resources so that users and consumers can be onboarded.
