1. Repo architecture:
- ansible folder:
- deploy_dataflow folder: used for deploying the data pipeline on Google Cloud Compute
- batch_processing folder: contains PySpark scripts to process batch data
- stream_processing folder: contains PyFlink scripts to process streaming data
- config folder: PostgreSQL configuration sent to Debezium
- data_validation folder: uses Great Expectations to validate data
- deltalake folder: contains scripts to write CSV files into Delta Lake format, used for building the lakehouse (see the sketch after this list)
- gcp folder: necessary scripts and dependency JAR files for deploying the data pipeline on Google Cloud Compute
- gke folder: deploys the MinIO service using Kubernetes (k8s)
- jar-files folder: dependency JAR files needed to run Flink and Spark
- models folder: trained models
- PostgreSQL_utils folder: contains functions to interact with PostgreSQL, such as creating tables and inserting data into PostgreSQL
- run_env_airflow folder: environment to run the Airflow service
- utils folder: helper functions
- Note: this repo is implemented on 100 GB of diabetes data stored on Google Cloud Storage
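- As a rough illustration of the deltalake scripts, the sketch below writes a CSV file into Delta Lake format with PySpark; the paths, session configs, and app name are assumptions, not the repo's exact code

from pyspark.sql import SparkSession

# Assumes the Delta Lake Spark dependencies (e.g. from the jar-files folder) are on the classpath
spark = (
    SparkSession.builder
    .appName("csv_to_deltalake")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    .getOrCreate()
)

# Illustrative input and output paths
df = spark.read.csv("data/diabetes.csv", header=True, inferSchema=True)
df.write.format("delta").mode("overwrite").save("lakehouse/diabetes")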
2. Installation
- Tested on Python 3.9.12 (recommended to use a virtual environment such as Conda)
- Install requirements: pip install -r requirement.txt
- Data: diabetes data collected from different sources on the internet
- Docker Engine is required to run the services with Docker Compose
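- A minimal environment setup sketch, assuming Conda is installed (the environment name is illustrative)

conda create -n diabetes-pipeline python=3.9
conda activate diabetes-pipeline
pip install -r requirement.txt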
3. Data pipeline on prem
- How to guide:
docker compose -f docker-compose.yml -f airflow-docker-compose.yaml up -d
- MinIO is used to store the diabetes data
- We can access the MinIO console at port 9001
- In this example, all validation cases pass, so the data is ready to be served
- reload_and_validate.ipynb: reuses the expectation suite to validate new data (a minimal sketch follows)
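- The sketch below reuses a saved expectation suite with the legacy Great Expectations pandas API; the file names and suite path are assumptions, and newer releases expose a different (gx) API

import great_expectations as ge

# Load the new batch of data as a Great Expectations PandasDataset
new_df = ge.read_csv("new_diabetes_data.csv")

# Re-run a previously saved expectation suite against the new batch
results = new_df.validate(expectation_suite="expectations/diabetes_suite.json")
print(results.success)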
3.3. Streaming data source
- Besides the CSV files, there is also a streaming source of diabetes data
- A new sample will be stored in a table in PostgreSQL
- Then Debezium, a change data capture (CDC) connector for PostgreSQL, will scan the table to detect whenever the database has new data
- The detected new sample will be pushed to the defined topic in Kafka
- Any consumer can then read messages from the topic for its own purposes (a minimal consumer sketch is included at the end of this section)
- How to guide:
cd postgresql_utils
./run.sh register_connector ../configs/postgresql-cdc.json
- Create a new table on PostgreSQL
python create_table.py
- Insert data to the table
python insert_table.py
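- As a sketch of the kind of insert insert_table.py performs (using psycopg2; the connection settings, table name, and columns are assumptions):

import psycopg2

# Illustrative connection settings for the local PostgreSQL container
conn = psycopg2.connect(
    host="localhost", port=5432, dbname="diabetes", user="postgres", password="postgres"
)
cur = conn.cursor()

# Insert one new sample; Debezium will detect the change and push it to Kafka
cur.execute(
    "INSERT INTO diabetes (glucose, bmi, age, outcome) VALUES (%s, %s, %s, %s)",
    (120, 27.5, 45, 0),
)
conn.commit()

cur.close()
conn.close()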
- We can access Kafka at port 9021 to check the results
- Then click the Topics tab to see all existing topics on Kafka
- Messages from the sink will be fed into the diabetes service to get predictions
- From these predictions, new data is created
- Note: we have to validate the predictions from the diabetes model to ensure the labels are correct before using that data to train new models
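- A minimal consumer sketch using kafka-python; the topic name and broker address are assumptions, not the repo's actual configuration

import json
from kafka import KafkaConsumer

consumer = KafkaConsumer(
    "diabetes_cdc",                      # hypothetical topic produced by Debezium
    bootstrap_servers="localhost:9092",  # assumed broker address
    auto_offset_reset="earliest",
    value_deserializer=lambda m: json.loads(m.decode("utf-8")),
)

# Each message carries a change event; downstream services such as the diabetes
# prediction service can consume it for their own purposes
for message in consumer:
    print(message.value)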
3.5. Manage pipeline
- Airflow is a service used to manage and schedule the pipeline
- In this repo, Airflow is used to run the training script on the first day of every month (a minimal DAG sketch is at the end of this section)
- How to guide:
- Airflow is accessed at port 8080
- Copy the key and paste it into the Enter JWT placeholder to log in to the MinIO Operator
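- A minimal DAG sketch for the monthly training schedule described above; the dag_id, script path, and operator choice are assumptions, not the repo's actual DAG

from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="monthly_training",
    start_date=datetime(2024, 1, 1),
    schedule_interval="0 0 1 * *",  # midnight on the first day of every month
    catchup=False,
) as dag:
    train_model = BashOperator(
        task_id="train_model",
        bash_command="python /opt/airflow/scripts/train.py",  # illustrative path
    )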