Making a Simple Data Pipeline Part 1: The ETL Pattern

Andrew Doss
4 Feb 2022
CPOL

Schedule Python and SQL scripts to keep your dataset clean and up-to-date in a Postgres database.

This article is a sponsored article. Articles such as these are intended to provide you with information on products and services that we consider useful and of value to developers.

Want to try it yourself? First, sign up for bit.io to get instant access to a free Postgres database. Then clone the GitHub repo and give it a try!

Public and private data sources are plentiful but also problematic:

1. Source data may get updated frequently but require substantial preparation before use.
2. Already-prepared secondary sources may exist but be stale and lack provenance.
3. Multiple data sources with heterogeneous formats may need to be integrated for a particular application.

Fortunately, there is a general computing pattern for mitigating these problems and getting data in the right location and format for use: "Extract, Transform, Load" (ETL). ETL implementations vary in complexity and robustness, ranging from scheduling of simple Python and Postgres scripts on a single machine to industrial-strength compositions of Kubernetes clusters, Apache Airflow, and Spark. Here, we will walk through a simple Python and Postgres implementation that can get you started quickly. We will walk through key code snippets together, and the full implementation and documentation is available in this repo.

Extract, Transform, Load

According to Wikipedia:

Extract, Transform, Load (ETL) is the general procedure of copying data from one or more sources into a destination system which represents the data differently from the source(s) or in a different context than the source(s).

In other words, we use ETL to extract data from one or more sources so we can transform it into another representation that gets loaded to a separate destination.

Rather than give hypothetical examples, we'll jump right into a demonstration with a real problem: keeping a clean dataset of US county-level COVID cases, deaths, and vaccinations up to date in a Postgres database. You can see the end product in our public Postgres database.

A scheduled ETL process helps us get prepared data into a common database where we can further join, transform, and access the data for particular use cases such as analytics and ML.

Extract

The first ETL step is extracting data from one or more sources into a format that we can use in the transform step. We will use pandas for the transform step, so our specific objective is to extract the source data into pandas DataFrames.

We will work with three data sources:

1. The New York Times compilation of daily, county-level COVID cases and deaths (updated multiple times per day)
2. CDC counts of vaccines administered per county (updated daily)
3. US Census Bureau 5-Year American Community Survey estimates of county-level populations (updated annually)

The first two sources are accessible via direct CSV file download URLs.
The census data is available manually through a web app or programmatically through an API. Because the census data is only updated annually, we manually downloaded a CSV file to a local directory using the web app (the file is provided in the repo).

We used the code below to extract these two types of CSV file sources into pandas DataFrames. csv_from_get_request handles URL downloads using the Python requests package, and csv_from_local handles local CSV files.

Python

"""Provides extraction functions.

Currently only supports GET from URL or local file.
"""
import io

import pandas as pd
import requests


def csv_from_get_request(url):
    """Extracts a data text string accessible with a GET request.

    Parameters
    ----------
    url : str
        URL for the extraction endpoint, including any query string

    Returns
    -------
    DataFrame
    """
    r = requests.get(url, timeout=5)
    data = r.content.decode('utf-8')
    df = pd.read_csv(io.StringIO(data), low_memory=False)
    return df


def csv_from_local(path):
    """Extracts a CSV from the local filesystem.

    Parameters
    ----------
    path : str

    Returns
    -------
    DataFrame
    """
    return pd.read_csv(path, low_memory=False)

With the data extracted into DataFrames, we're ready to transform the data.

Transform

In the second step, we transform the pandas DataFrames from the extract step to output new DataFrames for the load step. Data transformation is a broad process that can include handling missing values, enforcing types, filtering to a relevant subset, reshaping tables, computing derived variables, and much more. Compared to the extract and load steps, we are less likely to be able to reuse code for the entire transform step due to the particulars of each data source. However, we certainly can (and should) modularize and reuse common transformation operations where possible.

For this simple implementation, we define a single transformation function for each data source. Each function contains a short pandas script. Below, we show the transformation function for the NYT county-level COVID case and death data. The other data sources are handled similarly.

Python

 1  """Provides optional transform functions for different data sources."""
 2
 3  import pandas as pd
 4
 5
 6  def nyt_cases_counties(df):
 7      """Transforms NYT county-level COVID data"""
 8      # Cast date as datetime
 9      df['date'] = pd.to_datetime(df['date'])
10      # Store FIPS codes as standard 5 digit strings
11      df['fips'] = df['fips'].astype(str).str.extract(r'(.*)\.', expand=False).str.zfill(5)
12      # Drop Puerto Rico due to missing deaths data, cast deaths to int
13      df = df.loc[df['state'] != 'Puerto Rico'].copy()
14      df['deaths'] = df['deaths'].astype(int)
15      return df
16
17  # Script truncated for Medium

We enforce data types on lines 9 and 14, extract standardized FIPS codes on line 11 to support joining on county to the other data sources, and handle missing values by dropping Puerto Rico on line 13.
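To see what this function actually does, here is a quick sanity check on two hand-made rows. This snippet is not from the article's repo: the module name transform and the sample rows are assumptions for illustration only, though the column names follow the NYT CSV schema.

Python

import pandas as pd

from transform import nyt_cases_counties  # assuming the script above is saved as transform.py

# Two made-up rows in the NYT schema; the float FIPS value and the Puerto Rico
# row exercise the zero-padding and row-dropping logic shown above.
raw = pd.DataFrame({
    'date': ['2021-01-01', '2021-01-01'],
    'county': ['Autauga', 'San Juan'],
    'state': ['Alabama', 'Puerto Rico'],
    'fips': [1001.0, 72127.0],
    'cases': [100, 50],
    'deaths': [2.0, None],
})

clean = nyt_cases_counties(raw)
print(clean)
# One Alabama row remains, with 'fips' stored as the 5-digit string '01001',
# 'date' cast to datetime, and 'deaths' cast to int.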
With the data transformed into new DataFrames, we're ready to load to a database.

Load

In the final ETL step, we load the transformed DataFrames into a common destination where they will be ready for analytics, ML, and other use cases.

In this simple implementation, we will use a PostgreSQL database on bit.io. bit.io is the easiest way to instantly create a standards-compliant Postgres database and load your data into one place (and it's free for most hobby-scale use cases). You simply sign up (no credit card required), follow the prompts to "create a repo" (your own private database), then follow the "connecting to bit.io" docs to get a Postgres connection string for your new database.

After signing up, you can create a private PostgreSQL database in seconds and retrieve a connection string for SQLAlchemy.

Note: You can use the following code with any Postgres database, but you will be on your own for database setup and connection.

With our destination established, we're ready to walk through the code for the load step. This step requires more boilerplate than the others to handle interactions with the database. However, unlike the transform step, this code can generally be reused for every pandas-to-Postgres ETL process.

The primary function in the load step is to_table. This function takes in the DataFrame (df) from the transform step, a fully-qualified destination table name (examples in the next section), and a Postgres connection string pg_conn_string.

Lines 18-21 validate the connection string, parse the schema (bit.io "repo") and table from the fully-qualified table name, and create a SQLAlchemy engine. The engine is an object that manages connections to the Postgres database for both custom SQL and the pandas SQL API.

Lines 24-28 check if the table already exists (truncated helper _table_exists). If the table already exists, we use SQLAlchemy to execute _truncate_table (another truncated helper), which clears all existing data from the table to prepare for a fresh load.

Finally, in lines 30-39, we open another SQLAlchemy connection and use the pandas API to load the DataFrame to Postgres with a fast custom insert method _psql_insert_copy.

Python

 1  """Load pandas DataFrames to PostgreSQL on bit.io"""
 2
 3  from sqlalchemy import create_engine
 4
 5
 6  def to_table(df, destination, pg_conn_string):
 7      """Loads a pandas DataFrame to a bit.io database.
 8
 9      Parameters
10      ----------
11      df : pandas.DataFrame
12      destination : str
13          Fully qualified bit.io PostgreSQL table name.
14      pg_conn_string : str
15          A bit.io PostgreSQL connection string including credentials.
16      """
17      # Validation and setup
18      if pg_conn_string is None:
19          raise ValueError("You must specify a PG connection string.")
20      schema, table = destination.split(".")
21      engine = create_engine(pg_conn_string)
22
23      # Check if table exists and set load type accordingly
24      if _table_exists(engine, schema, table):
25          _truncate_table(engine, schema, table)
26          if_exists = 'append'
27      else:
28          if_exists = 'fail'
29
30      with engine.connect() as conn:
31          # 10 minute upload limit
32          conn.execute("SET statement_timeout = 600000;")
33          df.to_sql(
34              table,
35              conn,
36              schema,
37              if_exists=if_exists,
38              index=False,
39              method=_psql_insert_copy)
40
41  # The following helper methods are truncated here for brevity,
42  # but are available on github.com/bitdotioinc/simple-pipeline
43  # _table_exists - returns boolean indicating whether a table already exists
44  # _truncate_table - deletes all data from existing table to prepare for fresh load
45  # _psql_insert_copy - implements a fast pandas -> PostgreSQL insert using COPY FROM CSV command
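As a quick usage sketch (not from the repo), the call below shows how a transformed DataFrame might be handed to to_table. The module name load, the destination, and the connection string are placeholders; the schema.table format for the destination is inferred from the destination.split(".") line above, where the schema is your bit.io repo.

Python

from load import to_table  # assuming the script above is saved as load.py

# Placeholders: substitute your own bit.io repo, table name, and connection string.
PG_CONN_STRING = "<your bit.io Postgres connection string>"
DESTINATION = "my_username/simple_pipeline.nyt_cases_counties"  # parsed as schema and table

# clean_df is the transformed DataFrame from the previous step
to_table(clean_df, DESTINATION, PG_CONN_STRING)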
Note: we overwrite the entire table here instead of using incremental loads for the sake of simplicity and because some of these historical datasets get both updated and appended. Implementing incremental loads would be more efficient at the expense of slightly more complexity.

Putting the pieces together

That's it! We have all three ETL steps down. It's time to put them together as a scheduled process. We walk through those next steps in Making a Simple Data Pipeline Part 2: Automating ETL.

If you'd like to try this out right away, the full implementation of this simple approach, including scheduling, is available in this repo.
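To make the hand-off between the steps concrete before moving on, here is a minimal, hypothetical driver for a single source that chains the three functions shown above. The module names, destination, and connection string are illustrative assumptions (the URL points at the public NYT county-level CSV); the repo's actual orchestration and scheduling, covered in Part 2, go further.

Python

"""Hypothetical one-source driver: extract -> transform -> load."""
from extract import csv_from_get_request
from load import to_table
from transform import nyt_cases_counties

# Illustrative values; swap in your own destination and credentials.
NYT_CASES_URL = ("https://raw.githubusercontent.com/nytimes/"
                 "covid-19-data/master/us-counties.csv")
DESTINATION = "my_username/simple_pipeline.nyt_cases_counties"
PG_CONN_STRING = "<your bit.io Postgres connection string>"

df = csv_from_get_request(NYT_CASES_URL)   # Extract
df = nyt_cases_counties(df)                # Transform
to_table(df, DESTINATION, PG_CONN_STRING)  # Load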
Interested in future Inner Join publications and related bit.io data content? Please consider subscribing to our weekly newsletter.

Appendix

Series overview

This article is part of a four-part series on making a simple, yet effective, ETL pipeline. We minimize the use of ETL tools and frameworks to keep the implementation simple and the focus on fundamental concepts. Each part introduces a new concept along the way to building the full pipeline located in this repo.

1. Part 1: The ETL Pattern
2. Part 2: Automating ETL
3. Part 3: Testing ETL
4. Part 4: CI/CD with GitHub Actions

Additional considerations

This series aims to illustrate the ETL pattern with a simple, usable implementation. To maintain that focus, some details have been left to this appendix.

• Best practices: this series glosses over some important practices for making robust production pipelines: staging tables, incremental loads, containerization/dependency management, event messaging/alerting, error handling, parallel processing, configuration files, data modeling, and more. There are great resources available for learning to add these best practices to your pipelines.

• ETL vs. ELT vs. ETLT: the ETL pattern can have a connotation of one bespoke ETL process loading an exact table for each end use case. In a modern data environment, a lot of transformation work happens post-load inside a data warehouse. This leads to the term "ELT" or the unwieldy "ETLT". Put simply, you may want to keep pre-load transformations light (if at all) to enable iteration on transformations within the data warehouse.

Keep Reading

We've written a whole series on ETL pipelines! Check them out here:

Core Concepts and Key Skills

• Making a Simple Data Pipeline Part 1: The ETL Pattern
• Making a Simple Data Pipeline Part 2: Automating ETL
• Making a Simple Data Pipeline Part 3: Testing ETL
• Making a Simple Data Pipeline Part 4: CI/CD with GitHub Actions

Focus on Automation

• Scheduled Data Ingestion with bit.io and Deepnote
• Cron, Anacron, and Launchd for Data Pipeline Scheduling
• Automating an ETL Pipeline with Windows Task Scheduler

ETL in Action

• Make Your Own Air Quality Logger

License

This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL).

About the Author

Andrew Doss
No biography provided
