 ETL Pipelines and Data Warehouse Development

Project overview:
The purpose of this project is to demonstrate various skills associated with
data engineering projects: in particular, developing ETL pipelines using Airflow,
constructing data warehouses with Redshift databases and S3 data storage, and
defining efficient data models such as a star schema. As an example, I will
perform a deep dive into US immigration, focusing primarily on the types of visas
being issued and the associated visitor profiles. The scope of this project is
limited to the data sources listed below, with data aggregated across several
dimensions such as visa type, gender, port_of_entry, nationality and month.
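
As a rough illustration of aggregating across the dimensions above, here is a minimal sketch using only the Python standard library. The sample rows, the field names and the `aggregate` helper are all hypothetical and are not taken from the actual I94 columns or from the project's pipeline; in the warehouse itself this would more likely be a GROUP BY over a fact table.

```python
from collections import Counter

# Hypothetical sample rows. Field names mirror the dimensions named in the
# overview; they are assumptions, not the real I94 column names.
arrivals = [
    {"visa_type": "B2", "gender": "F", "port_of_entry": "NYC", "nationality": "BRA", "month": 4},
    {"visa_type": "B2", "gender": "M", "port_of_entry": "MIA", "nationality": "BRA", "month": 4},
    {"visa_type": "F1", "gender": "F", "port_of_entry": "NYC", "nationality": "IND", "month": 5},
    {"visa_type": "B1", "gender": "M", "port_of_entry": "LAX", "nationality": "GBR", "month": 4},
]

DIMENSIONS = ("visa_type", "gender", "port_of_entry", "nationality", "month")


def aggregate(rows, dimensions=DIMENSIONS):
    """Count arrivals for each distinct combination of the given dimensions."""
    return Counter(tuple(row[d] for d in dimensions) for row in rows)


# Roll up to a single dimension, or to any subset of dimensions.
by_visa = aggregate(arrivals, ("visa_type",))
by_visa_month = aggregate(arrivals, ("visa_type", "month"))
```

Passing a shorter tuple of dimensions gives a coarser roll-up, which is the same idea as grouping a star-schema fact table by fewer dimension keys.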

Data Description & Sources:

 • I94 Immigration Data: This data comes from the US National Tourism and
Trade Office. Each report contains international visitor arrival statistics by
world regions and select countries (including top 20), type of visa, mode of
transportation, age groups, states visited (first intended address only), and
the top ports of entry (for select countries).
 • World Temperature Data: This dataset came from Kaggle.
 • U.S. City Demographic Data: This dataset contains information about the
demographics of all US cities and census-designated places with a
population greater than or equal to 65,000. The dataset comes from OpenDataSoft.
 • Airport Code Table: This is a simple table of airport codes and
corresponding cities. An airport code may be either the IATA airport
code, a three-letter code used in passenger reservation, ticketing
and baggage-handling systems, or the ICAO airport code, a four-letter
code used by ATC systems and for airports that do not have an IATA
airport code.
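
The three-letter versus four-letter distinction described for the airport code table can be sketched as a small helper. This is only an assumption about how one might tag codes during cleaning: the function name `classify_airport_code` is hypothetical, and the check looks at the shape of the code only, not at whether the code is actually assigned to an airport.

```python
import re

# Shape-only patterns: IATA codes are three letters, ICAO codes are four.
IATA_PATTERN = re.compile(r"[A-Z]{3}")
ICAO_PATTERN = re.compile(r"[A-Z]{4}")


def classify_airport_code(code):
    """Return "IATA", "ICAO" or "unknown" based on the code's length and letters."""
    normalized = code.strip().upper()
    if IATA_PATTERN.fullmatch(normalized):
        return "IATA"
    if ICAO_PATTERN.fullmatch(normalized):
        return "ICAO"
    return "unknown"


kinds = [classify_airport_code(c) for c in ("JFK", "KJFK", " lax ", "12")]
```

Values that match neither shape are reported as "unknown" rather than dropped, so they can be inspected before loading.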