Rezha Julio

4 Root Causes Hiding Behind One Airflow Error

6 min read

Last week I spent an entire afternoon chasing a single Airflow error that turned out to be four problems stacked on top of each other. Each fix revealed the next failure, like one of those Russian nesting dolls except each one is angrier than the last.

The setup: Apache Airflow running on Kubernetes with the KubernetesExecutor. DAGs mounted via a shared volume. Worker pods spin up per task using a Docker image we build in CI.

The symptom

Tasks that should be running were showing up as queued in the Airflow UI, then immediately flipping to failed. The scheduler logs had this gem:

Executor reports task instance finished (failed) although the task says it's queued.
Was the task killed externally?

No, nothing was killed externally. But okay, let’s figure out what’s going on.

Root cause 1: the Docker image wasn’t rebuilt

I started by checking the worker pod logs. The pods were crashing on startup with:

ImportError: cannot import name 'SlaCallback' from 'plugins.message_callback'

A colleague had recently added an SlaCallback class to message_callback.py in our plugins directory. The scheduler could see it fine because the plugins directory is volume-mounted. But the worker pods don’t use the volume mount for Python imports. They use whatever is baked into the Docker image.

The image hadn’t been rebuilt since the new class was added. The updated file existed on the shared volume, but inside the worker pods Python’s import resolution was picking up the old copy baked into the image’s site-packages.
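
If you want to confirm which copy a pod is actually importing, a quick check from inside the worker pod settles it. A minimal sketch, assuming the plugins.message_callback module name from our setup:

import plugins.message_callback as mc

# The path Python actually resolved the import from (image vs. mount).
print(mc.__file__)
# Does the new class exist in that copy at all?
print(hasattr(mc, "SlaCallback"))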

Fix: rebuild the Docker image with the latest code and push it to our registry.

Done. Easy. Except not really.

Root cause 2: Kubernetes image caching

After pushing the new image, I restarted the Airflow deployment. Same error. Same ImportError. I double-checked the registry. The new image was there, with the correct code. So why were the pods still running the old one?

I ran kubectl describe pod on one of the worker pods and looked at the image digest. It was the old digest. The tag was unchanged (our setup’s equivalent of latest), and the node had already pulled that tag before. Since imagePullPolicy was set to IfNotPresent, Kubernetes saw an image with that tag already cached locally and used it.

This is one of those things you know intellectually but forget in practice. If you push a new image with the same tag, Kubernetes nodes that already pulled the old image will keep using it.

Fix: deploy with a new unique tag. We switched to using the git commit SHA as the image tag, which is what we should have been doing all along. Pods pulled the new image, and the ImportError went away.
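
If you would rather script the digest check than eyeball the kubectl describe output, the official Kubernetes Python client can read it straight off the pod status. A small sketch; the pod name below is a placeholder, and pipelines is the namespace where our worker pods run:

# Requires the official "kubernetes" Python client and a working kubeconfig.
from kubernetes import client, config

config.load_kube_config()
v1 = client.CoreV1Api()

pod = v1.read_namespaced_pod(name="example-worker-pod", namespace="pipelines")
for status in pod.status.container_statuses or []:
    # status.image is the tag you asked for; status.image_id contains the
    # digest that was actually pulled onto the node.
    print(status.image, "->", status.image_id)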

Progress. But now a different error.

Root cause 3: environment variables in the wrong namespace

The import was fixed. SlaCallback loaded correctly. But when an SLA miss actually fired, the callback crashed with:

ValueError: MESSAGE_CALLBACK_ENDPOINT is not set

This env var was defined in a ConfigMap in the pipelines namespace, where our DAG worker pods run. That makes sense for tasks. But SLA callbacks don’t run in worker pods. They run inside the scheduler process, which lives in the airflow namespace.

The scheduler had no idea MESSAGE_CALLBACK_ENDPOINT existed because it was never injected into the scheduler’s environment. The ConfigMap was scoped to the wrong namespace, and the Helm chart for the scheduler didn’t reference it.

Fix: add the env var to the scheduler deployment through our Terraform config. We added it to the extraEnv section in the Helm values for the Airflow scheduler. Applied the change, scheduler restarted, env var present.
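
For what it’s worth, a quick sanity check is to exec into the scheduler pod and look from Python, since the scheduler’s environment is the one that matters for SLA callbacks:

# Run from a Python shell inside the scheduler pod (kubectl exec into it).
import os

print(os.environ.get("MESSAGE_CALLBACK_ENDPOINT", "<missing>"))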

SLA callback ran without crashing. Notification sent. I checked Slack expecting a clean message.

What I got was gibberish.

Root cause 4: iterating over a string instead of objects

The notification message looked something like:

SLA Miss for tasks: m, y, _, t, a, s, k, _, i, d

Individual characters. The task ID had been split into single characters. That’s what happens when you iterate over a string in Python thinking it’s a list.
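
You can reproduce the symptom in one line of Python:

# Iterating a plain string yields its characters, not items.
print(list("my_task_id"))
# ['m', 'y', '_', 't', 'a', 's', 'k', '_', 'i', 'd']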

I opened the callback code. The relevant bit looked roughly like this:

def sla_callback(dag, task_list, **kwargs):
    for task in task_list:
        send_notification(task_id=task.task_id)

Looks fine, right? The problem was that task_list wasn’t always a list of SlaMiss objects. In some code paths, it was a string representation that got passed through. The for task in task_list line was iterating over characters of the string "my_task_id" instead of a list containing a SlaMiss object.

The fix was to properly handle the input:

def sla_callback(dag, task_list, **kwargs):
    if isinstance(task_list, str):
        task_ids = [task_list]
    else:
        task_ids = [t.task_id for t in task_list]
    for task_id in task_ids:
        send_notification(task_id=task_id)

After this fix, the notification came through correctly. Clean task IDs, proper formatting.

Four bugs. One error message.

What I took away from this

One error can hide three more. The original Executor reports task instance finished (failed) message pointed at the first problem, but fixing it just uncovered the next one. I’ve started keeping a scratch note during debugging sessions so I can track whether I’m actually making progress or just peeling layers.

Always check the image digest, not the tag. kubectl describe pod shows you exactly which image digest is running. If you’re reusing tags (please don’t), the digest is the only truth. We now use commit SHAs as tags and set imagePullPolicy: Always for our staging environment.

Namespace boundaries are real walls. It’s easy to forget that the scheduler and worker pods can live in different namespaces with different ConfigMaps and Secrets. If your callback code needs an env var, make sure it’s available where the callback actually executes, not just where the tasks run.

Test with real scenarios. We had unit tests for the SLA callback, but they passed in a proper list of mock SlaMiss objects. The bug only showed up with real SLA miss data flowing through Airflow’s actual callback mechanism. Sometimes you need to trigger the real thing in a staging environment to catch these issues.
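
If I were writing the regression test now, it would cover both shapes of input instead of only the happy path. A sketch, assuming sla_callback and send_notification are module-level names in plugins.message_callback, with SlaMiss stubbed by a simple namespace object:

from types import SimpleNamespace
from unittest import mock

import plugins.message_callback as mc

def test_sla_callback_handles_objects_and_plain_strings():
    with mock.patch.object(mc, "send_notification") as send:
        # Shape 1: the list of SlaMiss-like objects the unit tests already covered.
        mc.sla_callback(dag=None, task_list=[SimpleNamespace(task_id="my_task_id")])
        send.assert_called_with(task_id="my_task_id")

        # Shape 2: the bare string that showed up in production.
        mc.sla_callback(dag=None, task_list="my_task_id")
        send.assert_called_with(task_id="my_task_id")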

The whole thing took about four hours from first error to clean notification. Not my worst debugging session, but definitely the most layered one. Each fix felt like a victory until the next error showed up. At least now the SLA notifications work, and we have commit-SHA image tags as a bonus.

