Runaway complexity
Last week I had to work on a Django app again. Since Python is a very portable language that works on many different platforms, of course I had to work on it in a Docker container, in a Linux VM in Qemu, on an arm64 Mac running macOS. And because the official Docker Desktop app is somewhat annoying, I’ve been giving Lima a try. And because the standard Django development web server doesn’t offer the best debugging experience, I’ve been running an alternative server through django-extensions.
I’ve counted at least 8 distinct software vendors so far in that paragraph. When I hit a bug that completely killed my productivity, it was far from obvious which one to look at. Let’s dive in and see what happened.
Why so complex?
A development web server has three tasks:
1. When something goes wrong, show me more context to help me understand and fix the problem.
2. Watch the source code for changes, and reload the application to speed up each iteration cycle.
3. Otherwise, keep things as similar to the production environment as possible.
These three are absolutely crucial in maintaining productivity. If you don’t get enough context from a crash, you will be stumbling around in the dark. If you need to manually restart the application server, or worse yet, rebuild the Docker container, you will quickly lose focus, become annoyed or distracted. If your local environment diverges too far from how you run things in production, you will eventually hit production-only bugs. None of these things are desirable.
The bug
The runserver_plus command from django-extensions would keep detecting changes in files that did not come from my application. Here’s an excerpt from the logs:
* Detected change in '/usr/local/lib/python3.10/dist-packages/django/contrib/messages/storage/session.py', reloading
* Detected change in '/usr/local/lib/python3.10/dist-packages/django/contrib/messages/storage/base.py', reloading
* Detected change in '/usr/local/lib/python3.10/dist-packages/django/contrib/messages/utils.py', reloading
* Detected change in '/usr/local/lib/python3.10/dist-packages/django/contrib/messages/storage/cookie.py', reloading
This caused the server to remain stuck in a reload loop, which means every time I wanted to request a new page, it would take a few seconds for the server to start responding again, and then it would shortly go into another restart. Like a crash loop, but it does a little bit of work in between, so you can limp around for a while before you decide it’s too annoying and look for a solution.
The investigation
The exact set of files would differ, but what remained consistent was that these were library files. Not only did my text editor never touch them; they were on a separate volume inside the VM, which was not shared with the host OS where I was doing the editing.
However, it took me a moment to connect the dots on that clue, so in the meantime I tried the following:
- Switch to the stock Django development server. That resolved the reload loop, but the stock server doesn’t offer the improved Werkzeug-based debugger, which lets me evaluate code snippets within the context of the traceback and was crucial to hunting down the bug. That left me without an important tool, so I switched back.
- Switch back to the official Docker Desktop app, as Lima turned out to have sub-par support for forwarding file system change events. I was surprised to find the 9P remote filesystem in the stack.
- Ignore the /usr/local hierarchy in the watcher. The --exclude-patterns flag is under-documented (is it a regex? a glob? a prefix?), so it took me a while to get it to work, but then the reloader started picking up random files I didn’t touch that were not in /usr/local.
- Disable the reloader, restart the server manually on every change. I’ve finally felt like I’ve reclaimed a tiny bit of lost productivity, but that necessarily lengthened each edit-restart-test cycle, so this wasn’t a solution.
- I was ready to blame macOS and/or Docker, since this kind of crap is not entirely uncommon. Unfortunately my alternatives were few and far between: Asahi Linux is not exactly ready, and I had little disk space to spare for dual booting, with my boot disk already 80% full. I had an old laptop, but decided it was indeed too old. I could try running the code natively under macOS, but the app increasingly relies on exact versions of software that are not available directly from PyPI, so obtaining everything would be a nightmare. I could perhaps try to do so with Nix, but as Nix is vast, complex, and somewhat under-documented, and my own knowledge of it quite superficial, I wouldn’t just be inflicting more pain upon myself, but also on anyone else working on that code.
Notice that our list of vendors has grown by another four, some of which are now suspects as well.
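The three readings of an exclusion pattern mentioned in the list above (regex, glob, prefix) give different answers for the same input, which is part of why guessing takes so long. Here is a minimal sketch using only the Python standard library. The helper names are made up for illustration, and nothing here claims to be how --exclude-patterns is actually implemented:

```python
import fnmatch
import re

# One of the library paths from the reloader log above.
PATH = "/usr/local/lib/python3.10/dist-packages/django/contrib/messages/utils.py"


# Hypothetical helpers: three plausible ways a tool could interpret a pattern.
def matches_as_glob(path: str, pattern: str) -> bool:
    # fnmatch-style glob: the pattern has to match the whole path.
    return fnmatch.fnmatchcase(path, pattern)


def matches_as_regex(path: str, pattern: str) -> bool:
    # Regular expression, searched anywhere in the path.
    return re.search(pattern, path) is not None


def matches_as_prefix(path: str, pattern: str) -> bool:
    # Plain string prefix, no special characters at all.
    return path.startswith(pattern)


def interpretations(path: str, pattern: str) -> dict:
    """Evaluate one pattern under all three readings."""
    return {
        "glob": matches_as_glob(path, pattern),
        "regex": matches_as_regex(path, pattern),
        "prefix": matches_as_prefix(path, pattern),
    }


for pattern in ("/usr/local", "/usr/local/*"):
    print(pattern, interpretations(PATH, pattern))
```

A bare /usr/local excludes nothing under the glob reading, while /usr/local/* excludes nothing under the prefix reading, so the same flag value can silently do nothing depending on which interpretation the tool uses.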
The culprit
I don’t know why it took me so long to suspect that it was django-extensions that caused the problem; searching their bug tracker did indeed turn up issue 1805. Except the bug wasn’t in django-extensions; it was a bug in Werkzeug, which django-extensions uses directly for the reloader functionality. And Werkzeug itself didn’t do anything wrong; the issue came from the watchdog package, which changed its default behavior by including file open events in the notification stream. That explains it: the files that were triggering the reloads were not being changed, they were being opened.
So we’ve named twelve suspects… And the problem originated in the thirteenth, which was a transitive dependency of a transitive dependency of a “support goodies” package. Quite a game of Cluedo.
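The difference between a file being opened and a file being changed can be checked directly. The sketch below is an illustration only, not how Werkzeug or watchdog detect changes: it reads a throwaway file and compares its modification time before and after. The function name and the temporary file are made up for the example.

```python
import os
import tempfile


def mtime_around_read(path: str) -> tuple[int, int]:
    """Return the modification time (in ns) before and after reading a file."""
    before = os.stat(path).st_mtime_ns
    with open(path, "rb") as handle:
        handle.read()  # open and read, but never write
    after = os.stat(path).st_mtime_ns
    return before, after


# Create a throwaway file standing in for a library module.
with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as tmp:
    tmp.write("VALUE = 1\n")
    demo_path = tmp.name

before, after = mtime_around_read(demo_path)
os.unlink(demo_path)

print("modified by reading:", before != after)
```

Reading leaves the modification time alone, so a watcher that compares modification times would not have reacted to Python importing those Django modules. A watcher that forwards every notification it receives, open events included, would.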
Conclusions?
As I said, I don’t know why I first looked everywhere except the bug tracker of django-extensions. Perhaps because through the crazy mix of arm64 / x86-64, Mac / Linux / Windows, Docker / Compose / Swarm, Qemu / HVM, AWS / Hetzner / on-prem, Debian / Ubuntu / CentOS / macOS / OpenBSD, Terraform / Judo / NixOS, Python / Rust / Go / JS / TS / Swift / Kotlin, Django / Flask / Vue / Angular, Traefik / nginx / ELB, Postgres / sqlite, S3 / MediaStore (+CloudFront), and maybe a couple hundred smaller things that are too small to name, I’ve come to expect the issue to usually be at the boundaries?
If you stare at it long enough, you may notice Werkzeug’s fix is also incomplete, in that it does not secure their code against a similar change in watchdog in the future: should watchdog choose to expose more kinds of inotify events, e.g. IN_ACCESS, Werkzeug still does not whitelist the narrow set that is actually relevant to their use case, so the code remains a ticking bomb for another change at the boundary.
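To make the whitelist point concrete, here is a small sketch contrasting the two filtering styles. It is not Werkzeug’s code. The event type strings follow watchdog’s naming as I understand it and should be treated as assumptions, and the "accessed" type is purely hypothetical, standing in for whatever a future release might start emitting for IN_ACCESS.

```python
# Event kinds that can actually mean "the source changed".
# Names are assumed to follow watchdog's event_type strings.
RELOAD_EVENT_TYPES = frozenset({"created", "modified", "moved", "deleted"})


def should_reload_denylist(event_type: str) -> bool:
    # Deny-list: skip only the event kind known to be noise today.
    return event_type != "opened"


def should_reload_allowlist(event_type: str) -> bool:
    # Allow-list: react only to the narrow set that is relevant to reloading.
    return event_type in RELOAD_EVENT_TYPES


# "accessed" is a hypothetical future event kind, used only for illustration.
for event_type in ("modified", "opened", "accessed"):
    print(
        event_type,
        "denylist:", should_reload_denylist(event_type),
        "allowlist:", should_reload_allowlist(event_type),
    )
```

Both filters agree on today’s events, but only the allow-list keeps ignoring an event kind it has never heard of. The deny-list has to be patched again every time the boundary moves.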
My conclusion is that I would love to go back to programming on a Commodore 64. I was six years old when the C64 marked the first step of my journey into both programming and creating music. The machine is too simple and limited to afford the complexity that leads to these kinds of issues.
Of course I couldn’t use a C64 as my sole computing and development platform nowadays, probably unless I was already retired. But I’d like to write some software for it. To paraphrase ESR, I believe it is worth learning for the experience, which may leave you a better programmer for the rest of your days, even if you never use the C64 a lot.
I don’t think we can throw away 40 years of progress in computing, but I do think that we can reflect upon this runaway complexity and start removing, simplifying things sometimes.