
Data Engineering - DDH overview + standards

Context:
This document provides an overview of the platform from the data engineering perspective. It also establishes the
standards for development activities.

1. Data Engineer required skillset

2. Coding languages
a. Python / PySpark (all data transformation activities)
b. T-SQL (Synapse, SQL Server)
c. PowerShell (infrastructure automation, deployments)
d. C#/.NET (auxiliary platform activities: APIs / Functions)
e. ARM / JSON / YAML
3. Tools
a. IDE / Visual Studio (VS) Code
i. Remote Containers / WSL2 (DATA ENG ONLY - setup: see the New Joiners' Checklist)
ii. Databricks-Connect
iii. Manual code deployment (Databricks, ADF, lake)
iv. Linting
v. Unit Tests
vi. Regression tests
vii. Validation routines + Control DB tables
b. Databricks Notebooks - for investigation/data-debugging purposes only (see the ReadMe file)
c. SSMS (SQL Server Management Studio) - tool for querying data in the DW/Control DB
d. Full Visual Studio
i. SSDT - Data Definition
ii. Azure Function Definitions
e. Azure Storage Explorer - Accessing the Lake resources
f. Azure Portal - Access control, resources status etc.
g. Azure Data Factory - pipeline orchestration
4. Code repositories:
a. Ingestion (Databricks/ADF/Synapse): https://dev.azure.com/diageoinsights/Orion/_git/Orion?version=GBmaster
i. ADF (triggers)
ii. Databricks
iii. Warehouse
iv. Tests
b. Services / Control DB/ API etc: https://dev.azure.com/diageoinsights/Orion/_git/Orion-Services
c. Infrastructure: https://dev.azure.com/diageoinsights/Orion/_git/Orion-Infrastructure
d. ReadMe files / Wiki: https://dev.azure.com/diageoinsights/Orion/_wiki/wikis/Development%20Wiki/90/Development-Wiki
5. Code consistency: Data Engineering pipeline building standards
a. Helpers
b. Hash key creation methods on the target side
c. Schema object life cycle
d. Static vs. dynamic SQL
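The hash-key creation methods above (5.b) are defined in the pipeline building standards; as a minimal sketch of the idea, the snippet below builds a deterministic key by hashing the concatenated business-key columns. The function name, separator, and null handling are illustrative assumptions, not the platform's actual implementation (which would typically use PySpark's `sha2`/`concat_ws` on the target side).

```python
import hashlib

def make_hash_key(*business_keys, separator="||"):
    """Build a deterministic hash key from business-key column values.

    Joins the values with a separator (so ("ab", "c") and ("a", "bc")
    hash differently) and returns the hex SHA-256 digest. Nulls are
    treated as empty strings - an assumption for this sketch.
    """
    raw = separator.join("" if v is None else str(v) for v in business_keys)
    return hashlib.sha256(raw.encode("utf-8")).hexdigest()

# Same inputs always yield the same key, regardless of load run:
key = make_hash_key("GB", "CUST-001", "2023-01-01")
```

In PySpark the equivalent pattern is `sha2(concat_ws("||", *cols), 256)`; the important property in either form is determinism, so re-ingesting the same business keys produces the same surrogate key.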
6. Unit Tests:
a. Local/container pyspark/pytest installation
b. TestExplorer
c. Fixtures
d. Data samples / Inputs / Expected Results
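The unit-test pattern above (data samples, inputs, expected results) can be sketched as below. The transformation (`dedupe_latest`) and its field names are hypothetical examples, not real platform code; the real suite runs under pytest with fixtures and a local/container PySpark installation as listed.

```python
def dedupe_latest(rows):
    """Keep only the most recently loaded row per business key.

    rows: list of dicts with 'key' and 'loaded_at' fields - a stand-in
    for the kind of ingestion transformation a unit test exercises.
    """
    latest = {}
    for row in rows:
        current = latest.get(row["key"])
        if current is None or row["loaded_at"] > current["loaded_at"]:
            latest[row["key"]] = row
    return sorted(latest.values(), key=lambda r: r["key"])

def test_dedupe_latest():
    # Data sample (input) and expected result sit next to the test,
    # per item 6.d above.
    sample = [
        {"key": "A", "loaded_at": "2023-01-01"},
        {"key": "A", "loaded_at": "2023-02-01"},
        {"key": "B", "loaded_at": "2023-01-15"},
    ]
    expected = [
        {"key": "A", "loaded_at": "2023-02-01"},
        {"key": "B", "loaded_at": "2023-01-15"},
    ]
    assert dedupe_latest(sample) == expected
```

Tests written in this shape are discoverable by pytest (and visible in the VS Code Test Explorer mentioned in 6.b); shared setup such as a SparkSession would move into a pytest fixture.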
7. Azure Data Factory: https://adf.azure.com/en-us/home?factory=%2Fsubscriptions%2F3a683d84-be08-4356-bb14-3b62df1bad55%2FresourceGroups%2Fdiageo-analytics-prod-rg-orion%2Fproviders%2FMicrosoft.DataFactory%2Ffactories%2Fdiageo-eun-analytics-prod-adf-ingest-prod01
a. Folder structure
b. Data sets
c. Linked services
d. Triggers
e. Global parameters
f. Deployment with REST
g. BAU ADF (admin purposes): https://adf.azure.com/en-us/authoring?factory=%2Fsubscriptions%2Fcbce95f0-3cdf-467f-852e-1b64bd2105d3%2FresourceGroups%2Fdiageo-analytics-nonprod-rg-orion%2Fproviders%2FMicrosoft.DataFactory%2Ffactories%2Fdiageo-eun-analytics-nonprod-adf-bau-dev01
h. Development in main DEV using git branches; deploy from VS Code; test in a private ADF instance
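"Deployment with REST" (7.f) targets the Azure management API rather than the ADF portal's Publish button. As a sketch, the function below builds the create/update URL for a pipeline; a PUT to it with the pipeline JSON and an Azure AD bearer token deploys the pipeline. The subscription, resource group, and factory come from the ingestion ADF link in section 7; the pipeline name is illustrative, and the platform's actual deployment scripts (e.g. in PowerShell) may wrap this differently.

```python
MANAGEMENT_BASE = "https://management.azure.com"
API_VERSION = "2018-06-01"  # Data Factory management REST API version

def pipeline_put_url(subscription_id, resource_group, factory, pipeline_name):
    """Build the management REST URL for creating/updating an ADF pipeline.

    PUT the pipeline definition JSON to this URL (with an Azure AD
    bearer token in the Authorization header) to deploy it.
    """
    return (
        f"{MANAGEMENT_BASE}/subscriptions/{subscription_id}"
        f"/resourceGroups/{resource_group}"
        f"/providers/Microsoft.DataFactory/factories/{factory}"
        f"/pipelines/{pipeline_name}?api-version={API_VERSION}"
    )

# Ingestion factory from section 7 (pipeline name is a placeholder):
url = pipeline_put_url(
    "3a683d84-be08-4356-bb14-3b62df1bad55",
    "diageo-analytics-prod-rg-orion",
    "diageo-eun-analytics-prod-adf-ingest-prod01",
    "pl_example",
)
```

Datasets, linked services, and triggers (7.b-7.d) deploy through sibling endpoints of the same shape, which is what makes scripted, repo-driven deployment possible instead of publishing by hand.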
8. Azure DevOps Pipelines:
a. Ingestion: https://dev.azure.com/diageoinsights/Orion/_build?definitionId=90
b. Services: https://dev.azure.com/diageoinsights/Orion/_release?_a=releases&view=mine&definitionId=24
c. Infra: https://dev.azure.com/diageoinsights/Orion/_release?_a=releases&view=mine&definitionId=7
d. Config: https://dev.azure.com/diageoinsights/Orion/_build?definitionId=98
e. Private DEV environment: https://dev.azure.com/diageoinsights/Orion/_build?definitionId=226
i. Private Synapse Build: https://dev.azure.com/diageoinsights/Orion/_build?definitionId=95
f. Pull Request Validation: https://dev.azure.com/diageoinsights/Orion/_build?definitionId=101
g. Environment Dim Sync: https://dev.azure.com/diageoinsights/Orion/_build?definitionId=229
h. Unit Tests: https://dev.azure.com/diageoinsights/Orion/_build?definitionId=211
9. Release process:
a. Pull requests https://dev.azure.com/diageoinsights/Orion/_git/Orion/pullrequests?_a=completed
i. Approvals (Viz/ BAU)
ii. Policies: Data Engineering pipeline building standards
b. Regression tests
c. Release standup
i. Release document
ii. Testing in UAT
d. Releasing to PROD / git tagging
