Airflow.
The Battle for the Orchestration Future.
DANIEL BEACH
APR 10, 2023 ∙ PAID
Well, even though this is going to be the truth about Prefect, Mage,
and Airflow, I do wonder, in 2023, if there is any truth left to be
had. The data world is changing, and the pace at which the new tools
are being released is mind-numbing. Every week it’s a new library
like Polars, or the next hot SQL thing like DuckDB. It can be hard to
sort out fact from fiction, and marketing goop from the real deal.
What about arguably one of the most important parts of a data stack?
The orchestration and dependency tooling?
But, if one thing is sure, it’s that as soon as something takes the
crown of glory and sets it upon its head, somewhere behind the throne
the swords and plots are being sharpened. If Rome fell, does Airflow
have a chance?
In today's fast-paced business world, data has become one of the most
valuable assets. With the ever-increasing volume and complexity of
data, it has become essential to have efficient data orchestration
tools that can manage data workflows. Three popular data
orchestration tools are Prefect, Mage, and Airflow. In this article,
we will compare these three tools at a high level, explore the
concepts, and help you choose the right one for your data stack.
Honestly, I don’t care how shiny the UIs are. I’m sure both Prefect
and Mage have better ones than Airflow. That’s ok. We should probably
have better reasons for picking a tool than that it looks like a shiny
quarter you found on the ground.
How do these tools affect the way Data Engineers solve problems?
So, whether they like it or not, I’m going to lump all these tools
into the “data orchestration and pipeline management” group. I don’t
care what the marketing hype is, that is what they are used for.
Also, I’m not an expert, I’m just an average engineer trying to do
average things.
First, let’s review what I would call the “core concepts” of each
tool. Along the way, we’ll explore which tool is the best.
Prefect
Prefect's core concepts revolve around building, managing, and
monitoring data workflows. These concepts are fundamental to
understanding and working with Prefect effectively.
What I found telling about this quote is that I have not found
anything fundamentally different about Prefect than, say, Airflow.
There isn’t anything earth-shattering or special. It’s simply
trying to be Airflow, except slightly better.
It didn’t excite me. I need a good reason to abandon Airflow’s
massive community.
Let’s talk about the core concepts of Prefect. How do they approach
Data Engineering problems?
I combed the website and the docs back and forth. There is a lot of
talk about building high-performance data pipelines, but there isn’t
actually anything “solid” behind those words. I can’t point to feature
x or y, or to a design concept, and say “This is what makes Prefect
different or better than Airflow.”
I’m not sure. I personally fail to see how Prefect blows away Airflow
in any category. Of course, the UI is better and has more features,
although the more complex something becomes, the less easy it is to
use and the learning curve steepens. Maybe it’s better running at
scale? Again, it just isn’t enough for me to want to invest time or
resources into it. Maybe that’s just me.
I feel this is the main, and maybe the only, difference between Prefect
and Airflow. Prefect does away with the DAG abstraction and uses Python
and decorators to do your work.
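To make the decorator idea concrete, here is a minimal sketch in plain Python. The `task` and `flow` decorators below are toy stand-ins written for illustration, not Prefect's implementation (Prefect itself exposes decorators with these names via `from prefect import flow, task`). `RUN_LOG` and the three step functions are made up for the example. The point is only that the order of work falls out of ordinary function calls instead of a declared DAG.

```python
# Toy stand-ins for illustration only; this is NOT Prefect's implementation.
import functools

RUN_LOG = []  # records the order in which tasks finished


def task(fn):
    """Wrap a function so each completed call is recorded in RUN_LOG."""
    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        result = fn(*args, **kwargs)
        RUN_LOG.append(fn.__name__)
        return result
    return wrapper


def flow(fn):
    """Mark a function as the pipeline entry point; here it only resets the log."""
    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        RUN_LOG.clear()
        return fn(*args, **kwargs)
    return wrapper


@task
def extract():
    return [1, 2, 3]


@task
def transform(rows):
    return [r * 10 for r in rows]


@task
def load(rows):
    return sum(rows)


@flow
def pipeline():
    # Dependencies are implied by normal Python calls, not by a declared DAG.
    return load(transform(extract()))
```

Calling `pipeline()` runs the three steps in call order. A real orchestrator would hang scheduling, retries, and logging off those same decorators.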
Think I’m being too harsh on Prefect? Maybe, but as an Airflow user,
I don’t see any compelling reason I would switch to Prefect. It
doesn’t have any earth-shattering features or breakthroughs that
would make me want to switch to it. If I were starting clean, would I
pick it? Probably not. I want something that has a large community
behind it.
Any Data Engineer knows a tool works well for a while, but when it
doesn’t, you need somewhere to turn. Do you have some experience with
Prefect? Do you love it? Drop a comment and let us all know.
Mage
Well, I do like honesty. At least they are being upfront about what
they are trying to do: replace Airflow. Before you tell me I’m an
Airflow lover and will never find anything to replace it, hold your
horses. Airflow does drive me crazy sometimes. Its backend
architecture leaves a lot to be desired, it can be clunky, and it
frequently pukes to the point that it’s easier to spin up a new
environment than to fix it.
Mage has several unique features that make it stand out from other
data orchestration tools. For example, it has a built-in data catalog
that allows you to track your data lineage, and it has a powerful
retry mechanism that can automatically retry failed tasks. Also, it
offers a Notebook UI to develop with, offering quick iteration and
feedback. Very nice!!
Now, these are features that are substantially different from what we
are used to with Airflow. It’s a different approach, and that’s what
matters.
How does Mage approach data pipelines?
a. Transformer
b. Data Loader
c. Data Exporter
d. Sensor
e. Chart
a. Block reusability.
b. Automatic Testing of Blocks.
What are some things that jump out at me about Mage, without diving
into too much detail?
Honestly, if Mage was just another nice UI, with a different take on
developing pipelines than Airflow … I would be skeptical. But it
appears they took a fundamentally different approach, providing
probably not only a better UI, logging, and monitoring … but actually
focusing on the Developer experience and pushing Engineering Best
Practices in a way that provides clear value that Airflow and Prefect
do not.
Apache Airflow.
I’m not going to devote time to this tool. There are reams and
volumes of video and text content produced on the topic. You can
google it. Airflow is boring and popular.
Comparison
Before you accuse me of favoritism, I don’t have any skin in the game
with Airflow, Prefect, or Mage. I’ve never used Prefect or Mage for
anything more than playing around. I’ve used Airflow in a
standalone deployment, via Composer, and via MWAA.
Scalable.
Reliable.
As someone who loves and has used Airflow for years, it isn’t any of
those things. It works. It’s enough. It does the job tolerably well.
Hence people use it.
When you have fans of all tools, how can you get past the arguments
about which features are better, what’s faster, which one does this
better, or that better? It can be hard to compare tools and boil down
the essence of the matter.
Sure, we could install each tool, build a pipeline, and run it.
But what would that really tell us? We know that each tool,
whether Airflow, Prefect, or Mage, is capable of building and running
pipelines for probably the majority of situations and
circumstances.
The answer is clearly Mage.ai. No, I have not been paid to say that,
and I’m saying that as someone who’s been an active Airflow user for
years, via self-hosted deployments, GCP Composer, and MWAA on AWS.
I’ve nothing against Prefect, other than that it just isn’t providing
enough differentiation from Airflow. Maybe it’s just their bad
marketing, who knows? I could be missing something amazing.
Airflow is here to stay for a long time, but cracks start showing at scale.
I will always love Airflow. It’s a unique tool that does a decent job,
has an amazing community, and is simple to learn and use. But
anyone who has used Airflow at scale and doesn’t have a dedicated
team to support it (why should you need one?) knows that you can
easily shoot yourself in the foot, and what was once easy to use
becomes a bear to manage and massage at scale to keep it performant.
It’s also inevitable that newcomers show up and start to pick and
scratch at the corners of Airflow, putting best practices in place
and solving pain points for next-level Airflow users.
Sure, there are plenty of use cases for Airflow at a small scale where
it works like a charm. Should those organizations migrate away?
It probably wouldn’t hurt them to, but it most likely doesn’t make a
ton of sense.
How do you know you should switch from Airflow to either Mage (probably),
or Prefect?
I’m going to ask you a few questions about your Data Engineering
culture and use cases.
Is your Airflow setup running 50+ DAGs and growing?
If the above items are where you want to go, and you care about
excellence in Engineering Culture, then I would suggest checking out
a tool like Mage.ai over Airflow if you get the choice.