When I started working with Airflow two and a half years ago, I was referred to the following article: We’re All Using Airflow Wrong and How to Fix It. It was the main basis for the Airflow setup at my company, and at the time, I really liked it.
As the article states at the beginning:
Tl;dr: only use Kubernetes Operators
I was used to working with Kubernetes and other PaaS systems, and this looked like a perfect setup to avoid vendor lock-in to Airflow libraries and take advantage of the Kubernetes container scheduling.
The longer I work with Airflow, the more convinced I am that this setup is deceiving. It might work for some cases and some companies, but in general, it is a bad idea. In this post, I want to take a look at why Airflow is not (only) an Orchestrator.
The main problems
There are some things to consider about the approach in the cited article.
Technology keeps evolving
The first crucial point to address is that the cited article is, as of today, four years old. That is almost an eternity in the fast-moving data ecosystem. During these years, Airflow has matured as a technology and its adoption has grown. Airflow 2.0 has been released with huge performance improvements, HA deployment capabilities, and the TaskFlow API.
Everything is going to become outdated at some point. We should take the chance to evaluate past assumptions and designs with our new lenses in the present.
Kubernetes Compatible, but not Cloud Native
Another very important thing to consider is that Airflow is not Cloud Native. It is true that we can integrate Airflow with Kubernetes using the KubernetesPodOperator. However, Airflow was not designed to be a Cloud Native solution, and it shows. It is true that newer Airflow releases have improved the experience of deploying Airflow on Kubernetes and the use of the Kubernetes Executor. It is still dangerous to rely too much on the Kubernetes-specific features of Airflow, as they are not first-class citizens. One example is how exceptions are managed by the KubernetesPodOperator. An ImagePullBackOff error will not be shown in the Airflow logs. The same happens when there is a Kubernetes capacity issue. And because the pods are deleted on completion, you have no way to diagnose the issue unless you catch the errors live with kubectl.
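One partial workaround is to keep finished pods around so they can still be inspected after the fact. The sketch below is a hedged illustration, not the setup from the cited article; the task name, namespace, and image are hypothetical, and the exact parameter names depend on your cncf-kubernetes provider version.

```python
# Hedged sketch: keep pods after completion so failures like ImagePullBackOff
# can still be inspected with kubectl once the task has failed.
# Import path and is_delete_operator_pod are from the provider versions of that
# era (newer versions replace it with on_finish_action).
from airflow.providers.cncf.kubernetes.operators.kubernetes_pod import (
    KubernetesPodOperator,
)

debuggable_task = KubernetesPodOperator(
    task_id="my_containerized_job",                      # hypothetical task
    name="my-containerized-job",
    namespace="data-pipelines",                          # hypothetical namespace
    image="registry.example.com/jobs/transform:1.0.0",   # hypothetical image
    get_logs=True,                  # stream container stdout into Airflow logs
    startup_timeout_seconds=300,    # fail faster on scheduling/capacity issues
    is_delete_operator_pod=False,   # leave the pod behind for post-mortem
)
```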
High complexity for local testing
The exclusive use of the KubernetesPodOperator made it very difficult to test jobs end-to-end in Airflow. Our team had to develop an abstraction package to distinguish between local and production environments and return either a KubernetesPodOperator or a DockerOperator. This kind of custom-made package interface kept growing, making the onboarding of new team members difficult, the maintenance challenging, and creating a single point of failure for all the DAGs in our deployment.
Not only that, but the only way to test a DAG end-to-end was to deploy it and test it in the production environment. The problem was that this can affect downstream consumers. The solution we took was to replicate each DAG with a “test” version in the production environment. In practice, this meant duplicating almost all of our DAGs: more code, more configuration, and a polluted Airflow UI.
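To make the first problem concrete, the environment-switching abstraction looked roughly like this. It is a simplified, hypothetical reconstruction, not our actual package; the environment variable, namespace, and function name are made up.

```python
# Hypothetical reconstruction of the kind of factory we had to maintain:
# the same "task" became a DockerOperator locally and a KubernetesPodOperator
# in production, hidden behind a custom interface that every DAG depended on.
import os

from airflow.providers.cncf.kubernetes.operators.kubernetes_pod import (
    KubernetesPodOperator,
)
from airflow.providers.docker.operators.docker import DockerOperator


def containerized_task(task_id, image, command):
    """Return the right operator for the current environment."""
    if os.environ.get("AIRFLOW_ENV", "local") == "production":
        return KubernetesPodOperator(
            task_id=task_id,
            name=task_id.replace("_", "-"),
            namespace="data-pipelines",   # hypothetical namespace
            image=image,
            cmds=command,
        )
    return DockerOperator(
        task_id=task_id,
        image=image,
        command=command,
    )
```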
Too many Docker Images!
We also quickly ran into the problem of creating custom containers for each job in every DAG. It is true that we developed generic Docker images to perform common operations like data loads, data extractions, and so on. But any custom transformation with specific business logic required us to create a new Docker image, which implies: a GitLab repository, CI pipelines, configuration boilerplate, an AWS Container Registry entry... Even the simplest transformation needed all this setup. This means that for a deployment of 100 pipelines you can potentially end up with 100 different Docker images to maintain in 100 different repositories. All of this without taking into account other code such as common libraries or CI templates. It is clear that the complexity and the cognitive load become impossible to manage.
We also ran into very weird design decisions. For example, launching AWS SageMaker or EMR jobs from a custom Docker image running as a KubernetesPodOperator, instead of using the native Airflow operators for this. All of this was a cascade of technical decisions derived from the first statement in the cited article: only use Kubernetes Operators.
What is (actually) Airflow?
So, why all of these problems? What is the cause? Is Airflow the wrong tool? Are we actually using Airflow wrong?
IMHO, the most important issue with the cited article is that it misinterprets what Airflow is as a product.
Airflow is a tool for developers to schedule, execute, and monitor their workflows
While this is true, it is only partially true. And this is the root cause of all of our problems.
A Data Pipeline Scheduler
First of all, Airflow provides scheduling capabilities with a strong focus on data pipelines, not just generic jobs. It also gives a lot of importance to the execution time. In general, the main features of Airflow that distinguish it from other technologies are:
- Sets schedules and implements retries for DAGs, so you can monitor the execution of jobs through time.
- Allows backfilling to correct data from the past, or to catch up to current execution times for new releases.
- Handles secrets/connections and variables needed by the jobs using Jinja2 templates.
In summary: Airflow enables your team to write code that processes data and to deploy it with ease and confidence.
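To make these features concrete, here is a minimal DAG sketch touching each of them: a schedule, retries, catchup/backfilling, and a Jinja-templated command. The DAG id, schedule, and the target_table Variable are hypothetical placeholders.

```python
# Minimal sketch of the features listed above; dag_id, schedule, retries and
# the templated command are hypothetical placeholders, not a recommended setup.
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="daily_sales_load",            # hypothetical pipeline
    start_date=datetime(2022, 1, 1),
    schedule_interval="@daily",           # scheduling through time
    catchup=True,                         # enables backfilling past intervals
    default_args={
        "retries": 3,                     # retries handled by Airflow itself
        "retry_delay": timedelta(minutes=5),
    },
) as dag:
    # Jinja templating gives access to the execution date and to Variables.
    load = BashOperator(
        task_id="load_partition",
        bash_command=(
            "echo 'loading partition {{ ds }} "
            "into {{ var.value.target_table }}'"
        ),
    )
```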
This is the part where I agree with the original article. I think that almost everybody who knows about Airflow would give a similar definition. However, it is missing another part…
A Data Pipeline Framework
Java backend developers might use Spring, Frontend developers might use React.js or Vue.js, Data Scientists might use SkLearn... and pipeline developers use Airflow.
Airflow is a Framework for Data Developers
Airflow provides a huge amount of tools to develop data processing jobs through its operators, both the built-in ones and the third-party provider packages. This allows us to maintain less code and less boilerplate, and to follow the Airflow standards within our team. It makes the most sense when using Airflow as a launcher for external platforms (SageMaker, Databricks, EMR...) or when reusing common operations (like running SQL against a database).
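For example, running SQL against a warehouse takes a provider operator and a connection, with no custom image involved. This is a hedged sketch: the connection id and the SQL file are hypothetical placeholders.

```python
# Hedged sketch: a provider operator replaces a custom container for a common
# operation. The connection id and the .sql file are hypothetical placeholders.
from airflow.providers.postgres.operators.postgres import PostgresOperator

refresh_report = PostgresOperator(
    task_id="refresh_daily_report",
    postgres_conn_id="analytics_db",      # defined once as an Airflow connection
    sql="sql/refresh_daily_report.sql",   # templated .sql file next to the DAGs
)
```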
Whenever we need to write specific business logic for a DAG, we can do it in the DAG definition or in a simple module inside our DAGs package. Remember that DAGs are just Python code! As everything lives in the same repo, we avoid the complexity of navigating dozens of repositories to find the implementation of a job. The unit tests and the integration tests can also live very close to the code, so we can be confident we don’t break anything in our deployments.
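As a sketch of what this looks like (all names here are made up), the business logic can stay in a plain, unit-testable function next to the DAG and be wired in with the TaskFlow API:

```python
# Hedged sketch: business logic as plain Python next to the DAG definition,
# wired in with the TaskFlow API. Function and field names are hypothetical.
from airflow.decorators import task


def clean_sales_records(records):
    # In practice this would live in a small module inside the DAGs package
    # (e.g. dags/transforms/sales.py) so it can be unit-tested without Airflow.
    return [r for r in records if r.get("amount", 0) > 0]


@task
def clean_sales(records):
    # Thin TaskFlow wrapper around the testable function above.
    return clean_sales_records(records)
```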
In other words, treating Airflow like a framework will allow our team to speed up the development, reuse code, and establish consistency in common patterns.
Adhering to a framework can (and will) lead to vendor lock-in. But it helps the team develop more efficiently for the platform they are targeting. In my opinion, it is a trade-off worth taking. Lock-in can be minimized by following good software development practices.
Integrated with Kubernetes (or not)
The next questions to answer are: Do we need Kubernetes? No, we don’t. Can Kubernetes be useful? Yes, of course, but that will depend on the platform-specific details of your company.
The deployment details should be abstracted as much as possible from the development. This way, data engineers can develop and test their pipelines without caring whether they run locally or on a remote deployment. The easiest path to deploy a new data processing pipeline should be to write the code, push it to the repo, and let a CI/CD pipeline deploy it to Airflow.
If your company has a team that maintains a Kubernetes cluster, it might be reasonable to use the Airflow Helm chart to deploy it. But to avoid coupling the development with the deployment, we should avoid using the KubernetesPodOperator. Nowadays there is a more elegant approach: the Airflow Kubernetes Executor. But first, we need to understand the difference between executors and operators in Airflow:
- Operator → the actual business logic execution, the task to be run. For example, SQL queries, plain Python code, S3 data movement, spark-submit...
- Executor → the platform runtime. There are multiple executors available for Airflow: local ones, and remote (production-grade) ones like the Celery Executor, Dask Executor, or Kubernetes Executor.
As the executor is just the runtime, it is very easy to swap executors between local/development and production. In fact, that is the reason this abstraction exists in Airflow in the first place. With the Kubernetes Executor, Airflow will already launch every operator as a Kubernetes Pod in your production environment.
To handle dependencies at runtime, Airflow offers different approaches. The most modern one is the TaskFlow API’s docker and virtualenv decorators, a new addition in Airflow 2.2. They have incredible value, as they are easy to use and transparent to the execution runtime. You can create base versioned Docker images with the common dependencies for your pipelines, while also being flexible enough to have custom environments for specific jobs.
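A minimal sketch of what this looks like follows. The pinned requirement and the image tag are hypothetical, and @task.docker additionally requires the Docker provider to be installed and an image that ships Python.

```python
# Hedged sketch of runtime-isolated tasks via TaskFlow decorators.
# The pinned requirement and the image tag are hypothetical placeholders.
from airflow.decorators import task


@task.virtualenv(requirements=["pandas==1.5.3"], system_site_packages=False)
def transform_in_virtualenv(path: str) -> str:
    # Runs in a freshly built virtualenv; imports must happen inside the task.
    import pandas as pd

    df = pd.read_csv(path)
    out = "/tmp/clean.csv"
    df.dropna().to_csv(out, index=False)
    return out


@task.docker(image="registry.example.com/pipelines/base:1.0.0")
def transform_in_container(path: str) -> str:
    # The function body runs inside the versioned base image that already
    # ships the common dependencies for the team's pipelines.
    return path
```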
I want to use (only) the KubernetesPodOperator
Maybe you already have a setup that uses the KubernetesPodOperator, or maybe you prefer the idea in the cited article over the one explained here. That is alright, but then maybe Airflow should not be the tool of choice for you. There are tools that are more cloud-native, with less maintenance overhead, and with very similar features and scheduling capabilities to Airflow. Some of these tools might be a better choice for your company if you just want to orchestrate containerized jobs. The one I am most familiar with is Argo Workflows, which is cloud-native and fully integrated with Kubernetes.
The takeaway here is that you should use the right tool for the job. Every company is different and applies different criteria to its software design decisions. Just remember to understand the problem each technology is trying to solve in order to make the right decision for your company.
Conclusions
Seeing Airflow as just an Orchestrator/Scheduler will lead you and your team to make the wrong design decisions for your Airflow deployment and DAG development. Modern Airflow can be considered a fully-fledged data development framework and it should be treated as one.
As a general rule, challenge the decisions that require a lot of abstraction and custom code to make the technology work in your company. The developers of these technologies (PaaS or Frameworks) know what they are doing, and they enforce certain patterns for a reason.
Also keep in mind that Airflow is neither the only solution nor necessarily the best one. There might be another technology that fits your company's needs better.
Remember that the industry keeps evolving, so reevaluate your current architecture from time to time. Let the decisions you made in the past be challenged constantly by your peers (and by yourself!). That is the path to improvement.