Why I Prefer Simple Data Layouts In Projects (Draft)

Here is an anecdote about a pattern I keep running into.

When building devel.tech, I wanted to architect a central, common table of all the site's items. Different apps would subclass that table (a title and a body), adding customized computed fields and columns when necessary.

By default, querying Post.objects.all() would return upcasted objects, like [<Tip title-name>, <Snippet title-name>] and so on.

I used django-polymorphic/django-polymorphic and had great results. It worked out of the box, even though I had no clue about the performance implications or what it did under the hood. I leaned on this plugin as the backbone of devel.tech's data.

The homepage was Post.objects.all(), and /tips reused the same Django view code, except that it queried Tips.objects.all(). This removed a lot of duplication, and it felt like things were going well.

But there was a bit of an elephant in the room. After all the template parsing, loops, context processors, template tags, and so on, there were over 100 queries when loading the homepage.

Between this and my liberal usage of django-mptt/django-mptt, I had a very flexible data format that relied on machinery that was becoming very taxing. When Django 2.0 came out, these plugins were also late to catch up. That is not encouraging considering how critically devel.tech relied on them.

The system I built ended up working, giving users convenient ways to filter and search posts. I felt accomplished. However, this all came at the price of machinery in the background that ran a lot of duplicated queries to reflect/upcast models. The hierarchical tag system (with 3 different types of taxonomy, each with its own model) was also unwieldy, drastically complicating the code and making it difficult to profile inefficiencies in the queries.

It felt generic and flexible, but when I put it into my common code library, I realized it was just solipsism. Every parent thinks their baby is beautiful, but some just aren't. So even as an experienced coder, I had to admit that trying to make common data opinionated was just me overengineering.

But I couldn't tell early on.

It turns out I overestimated the amount of customization each model needed. In the end, the sole model that got touch-ups ended up being the same as the base model.

After reflection, I realized I didn't really need separate models for tips, snippets, and "features". I could just use Post and add a column for the type of post. No upcasting or downcasting.
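Here is a minimal sketch of that single-table layout, written against the standard library's sqlite3 so it runs without Django. The table name, the post_type column, and the helper functions are all hypothetical; in Django this would roughly be one Post model with a CharField that has choices.

```python
import sqlite3

# Hypothetical schema: one "post" table, with the kind of post stored in a
# plain column instead of one subclass table per kind.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE post (
        id INTEGER PRIMARY KEY,
        post_type TEXT NOT NULL
            CHECK (post_type IN ('tip', 'snippet', 'feature')),
        title TEXT NOT NULL,
        body TEXT NOT NULL
    )
    """
)
conn.executemany(
    "INSERT INTO post (post_type, title, body) VALUES (?, ?, ?)",
    [
        ("tip", "Use virtualenvs", "..."),
        ("snippet", "Flatten a list", "..."),
        ("tip", "Read the changelog", "..."),
    ],
)


def all_posts(conn):
    """Homepage: every post in one query, no per-row type lookups."""
    return conn.execute(
        "SELECT post_type, title FROM post ORDER BY id"
    ).fetchall()


def posts_of_type(conn, post_type):
    """Section pages such as /tips: the same table, filtered on the column."""
    return conn.execute(
        "SELECT post_type, title FROM post WHERE post_type = ? ORDER BY id",
        (post_type,),
    ).fetchall()
```

The homepage and the per-type pages then differ only by a WHERE clause, which is about as transparent as a query can get.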

If I had just kept my data layouts "dumb" / simple from the start, it would have been easier to "grow into" something more complicated and possibly something reusable. But all too often, we fall into the trap of diverting our limited attention from business needs to preparing for scenarios we haven't faced - and in the end, even with experience, we keep misforecasting.

If I could go back and fix my flow, I'd always keep the code simple. Being experienced with Python or "smart" doesn't mean you can't keep it basic and concise. Then, once stuff is deployed and I'm adding new features, I'd think out a well-designed refactor, because by that time I'd have the benefit of knowing for certain what the needs are.

We like to think we're good at predicting design in early stages. In many cases, we aren't. It never hurts to delay moving to an SPA, hierarchical data, and so on unless it's an articulable requirement for your business.

I actually went from django-mptt to django-treebeard with materialized paths, which drastically simplified the database models and made the queries faster. But there was still an issue with 100 queries running on the main page of the site.

The culprit? When I iterated through the posts, each one queried 3 categories, and for some reason, once I added the parent node of a tag (if it existed), the system leapt by 60 queries out of nowhere!

Due to how opaque the system was, it wasn't obvious how to weed out why this was happening. I'd have been better off using the pure Django API and self-referencing foreign keys for parents.
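To make the parent lookup problem concrete, here is a rough sketch of a self-referencing parent column, again using plain sqlite3 rather than Django. The tag table, the sample rows, and the stats counter are assumptions made up for illustration; this is not the code devel.tech ran.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE tag (
        id INTEGER PRIMARY KEY,
        name TEXT NOT NULL,
        parent_id INTEGER REFERENCES tag (id)  -- self-referencing parent
    )
    """
)
conn.executemany(
    "INSERT INTO tag (id, name, parent_id) VALUES (?, ?, ?)",
    [
        (1, "python", None),
        (2, "django", 1),
        (3, "orm", 2),
        (4, "shell", None),
    ],
)

# Counts SELECTs so the cost of each approach is visible.
stats = {"queries": 0}


def query(sql, params=()):
    """Run a SELECT and record that it happened."""
    stats["queries"] += 1
    return conn.execute(sql, params).fetchall()


def parents_one_by_one(tag_ids):
    """One extra query per tag to fetch its parent's name."""
    result = {}
    for tag_id in tag_ids:
        rows = query(
            "SELECT p.name FROM tag t "
            "JOIN tag p ON p.id = t.parent_id WHERE t.id = ?",
            (tag_id,),
        )
        result[tag_id] = rows[0][0] if rows else None
    return result


def parents_in_one_pass():
    """Fetch the whole (small) table once and resolve parents in memory."""
    rows = query("SELECT id, name, parent_id FROM tag")
    names = {tag_id: name for tag_id, name, _ in rows}
    return {tag_id: names.get(parent_id) for tag_id, _, parent_id in rows}
```

Both functions return the same mapping, but the first costs one query per tag while the second costs one query in total. With a schema this plain, the extra queries are easy to see and easy to count.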

When looking at the internals of these systems, what you have is a lot of metaprogramming: a lot of stuff which, no matter how many times you look it over, stumps you when you try to understand what the hell it is doing. The idea of relying on an additional metaprogramming layer that's shoddily maintained compared to Django's own ORM? Delusion.


When Django 2.0 came out, django-polymorphic was breaking on simple stuff that'd only take a conditional in the imports to fix.
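A minimal sketch of the kind of conditional import I mean. The helper name import_first is hypothetical, and the Django paths in the comment are only an example of a module that moved (django.core.urlresolvers was removed in Django 2.0 in favor of django.urls). I'm not claiming that was the exact import that broke django-polymorphic. The runnable part uses only the standard library.

```python
import importlib


def import_first(*candidates):
    """Return the first object that imports cleanly.

    Each candidate is written as "package.module:attribute". Candidates are
    tried in order, so list the newest location first.
    """
    errors = []
    for candidate in candidates:
        module_path, _, attr = candidate.partition(":")
        try:
            module = importlib.import_module(module_path)
            return getattr(module, attr)
        except (ImportError, AttributeError) as exc:
            errors.append(f"{candidate}: {exc}")
    raise ImportError("no candidate could be imported: " + "; ".join(errors))


# In a Django codebase the same idea is usually spelled with try/except:
#
#     try:
#         from django.urls import reverse
#     except ImportError:  # older Django
#         from django.core.urlresolvers import reverse
#
# Standard-library demonstration: the first candidate does not exist, so the
# helper falls back to the second one.
dumps = import_first("no_such_module_xyz:dumps", "json:dumps")
```

The point is that supporting both old and new versions is often a few lines at the import site, not a redesign.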

I began working on it myself, trying to get the CI to go green. That mattered because adding 2.0 support can't come at the expense of the older versions of Django that are still supported.

I ended up finding that someone committing straight to master had broken the tests. Rather than the offending change being reverted, the whole project sat in stasis because a "fix" broke the whole CI.

Then I found that all along there was a PR that was over 2 weeks old, with no response. You can fault me for not looking, but I didn't expect there to be a PR sitting in an essential project while tests were failing. That means all other contributions are stuck until it is fixed.

I didn't need the changes introduced by the PR that caused the bug to begin with. But it still affected me.

I've spent hours combing through snags in the CI just to get to a baseline. Contributors to the project were pushing straight to master without QA'ing contributions.

Then I also found there was already a PR over a month old for getting 2.0 support. It was just stuck there with nobody talking, and by this time (2017-12-06) 2.0 was released. I spent additional hours battling another CI bug caused by someone who pushed directly to master, only to find it was wholly unrelated: https://github.com/django-polymorphic/django-polymorphic/issues/332.

This boils down to not pushing stuff to master and using GitHub's PR system. It cost me as a contributor (turned QA'er) a lot of time I'd rather be spending making devel.tech better.

It's nice to have this package. It's a clean API that "just works". It saves you from producing multiple queries and allows you to do some awesome m2m relations that'd be difficult with proxy models + custom managers.

But when things don't work, and you go into the open source project itself, hours get spent digging through the machinery behind those clean APIs. The machinery, and the thought process behind the tests, can be undocumented.

So, while you'd like to think you can smoothly abstract away engineering overhead using these plugins, the experience made me a bigger sceptic of these types of plugins and how they pan out in the long run.

In order for me to get this plugin working, since I'm not a maintainer at this point, I can't just go ship a release on PyPI. I end up having to maintain a fork of it. For all the headaches involved, this is one more reason why I'd be better off forgoing this extension and doing something in house, where I'm not in a holding pattern.

Someone else's anecdote

Thanks to Russell Keith-Magee (emphasis added):

The issue with hierarchical data in a database is always the cost of querying and traversal. The simple approach that you've described works fine -- but requires either O(N^2) queries to retrieve an N-deep subtree, or requires in-view processing after retrieving the entire tree. The benefit of MPTT and other tree-strategies is that they can cut the number of queries for an arbitrary subtree down to something closer to O(1).

However, the catch with all O(N) calculations is that it depends on the size of N, and the number of times you're going to perform the calculation. If, as you say, know you you will only have 150 items in your tree, and this isn't going to change, and your site isn't going to be under extreme load (so a little extra view processing won't matter too much), the benefit of using MPTT (or a similar strategy) may well be overwhelmed by the engineering cost. You may well be better served using a simpler database model, and maybe looking into caching specific subtrees as a way to mitigate the extra load you'll be putting on the web server.

source: https://groups.google.com/forum/#!msg/django-users/JCQrRL3CzJQ/iK7wmuPBbGkJ

Well put. The engineering costs and other stuff introduced may not be worth it. For the meantime, I'm keeping data models as simple as possible. It's always possible to grow into it later.

Even with the benefit of tools like mptt and treebeard, things started to break down and get very opaque for me when it came to prefetching relations and avoiding repeated queries. This ate up significant time.

Let's reiterate the other thing Keith-Magee said:

and maybe looking into caching specific subtrees as a way to mitigate the extra load you'll be putting on the web server.

I actually was thinking about this while pulling apart my own code and migrating between mptt and treebeard, before I read that post. But since Russell wrote it and mentioned it first, give the credit to him. :P
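As a rough sketch of what caching specific subtrees could look like, assuming the simple layout where each tag row is (id, name, parent_id): the SubtreeCache class, the sample rows, and the loads counter are all hypothetical. In a real Django project the dictionary would more likely be replaced by Django's cache framework.

```python
from collections import defaultdict

# Hypothetical rows as (id, name, parent_id), e.g. loaded with a single query.
ROWS = [
    (1, "python", None),
    (2, "django", 1),
    (3, "orm", 2),
    (4, "templates", 2),
    (5, "shell", None),
]


class SubtreeCache:
    """Cache assembled subtrees by root id; drop everything when tags change."""

    def __init__(self, load_rows):
        self._load_rows = load_rows  # callable returning (id, name, parent_id)
        self._cache = {}
        self.loads = 0  # how many times the rows were actually fetched

    def subtree(self, root_id):
        """Return the nested subtree under root_id, building it at most once."""
        if root_id not in self._cache:
            self._cache[root_id] = self._build(root_id)
        return self._cache[root_id]

    def invalidate(self):
        """Call after any tag is created, moved, or deleted."""
        self._cache.clear()

    def _build(self, root_id):
        self.loads += 1
        names = {}
        children = defaultdict(list)
        for tag_id, name, parent_id in self._load_rows():
            names[tag_id] = name
            children[parent_id].append(tag_id)

        def walk(tag_id):
            return {
                "name": names[tag_id],
                "children": [walk(child) for child in children[tag_id]],
            }

        return walk(root_id)


cache = SubtreeCache(lambda: ROWS)
```

The in-view processing is the walk function. It only runs on a cache miss, so for a tree of 150 items that rarely changes, the extra work is paid once per edit rather than once per page view.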

My theory is to use one model and a manager + model proxy to get the same effect as categorized tags. Simple and pure, no external dependencies. And you get to own your API - so when you're getting bad query performance, you can start to isolate it via your own codebase, and not worry that a dependency is at fault.

At work

Thankfully, in my situation, I'm the founder of this website. I don't face resistance when I make technical decisions. But sometimes these decisions come not from a natural tendency toward premature optimization, but as orders from above. This can actually be worse.

An example I've seen circa 2013-2017 would be a founder's insistence on making stuff an SPA (single page application) the first time around, rather than relying on model-backed validation and a framework's template engine/form validation to get a product out. This almost always leads to development headaches. Building everything in Angular first is the ultimate premature optimization.

Longer to develop. Harder to debug when users hit snags in the data flow. Harder to refactor when "small touchups" actually require changing fundamental assumptions about the data being passed around. Single page applications also make you worry about the state of data on the server side, and when you're onboarding a user for the first time or creating a wizard, those things have to be written on both the server and the client. Just sticking to a server-side web framework would give you the benefit of using its model-backed form validation and so on.

You can always swing back around to it later, once you have feedback from deploying your product. It's easier to build into something more complicated. But when you're faced with colleagues or supervisors who insist on jumping the gun, and it turns out they change their mind later, you're either stuck with a crappy codebase everybody hates (and they'll blame everybody but themselves), or you have to rewrite. All that time and runway would be wasted.

In a lot of cases, at a lot of startups, the complications these "tweaks" introduced didn't help the product, service, business, whatever, despite some manager insisting on them. You're being dragged along for their "vision", or it's just a chore the manager insists on A/B testing and keeps forcing. The emperor's new clothes comes into play here, because it can be hard to talk your boss out of grandiose design decisions and leaps you're not ready for.

For the love of God, if I have one engineering lesson, it's that it's best to keep stuff simple. Ask yourself, "what's the most basic way I could do that?" It's so easy to make "hard-coded" stuff generic, and so taxing to backtrack in the other direction.

How I'm moving forward

The time I've invested ruling out these plugins as the culprit behind performance issues, and keeping them up to date with Django releases and their own fixes, makes me want to build my Django code in such a way that it can be tested in a reproducible build without the plugins.

Have tests that run at a baseline without any high level DB tools (e.g. mptt/treebeard/cte-forest, polymorphic/model-utils).

Even at the cost of initially worse performance, I'd rather have the transparency of knowing why and controlling it than push it away to some project where I have no control.

Create tests and make your code so it can fall back on barebones Python + Standard Django, without the added help of these utilities.

Then, after that, you can introduce stuff like mptt and whatever incrementally, and have tests for that. And keep both sets of tests running. Why? It can be helpful later on to isolate whether a problem lies in

  1. your usage of the Django API purely (between views/templates/plain models)

  2. how you implemented the data models with mptt/treebeard/polymorphic, or

  3. an issue with mptt/treebeard/polymorphic itself
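One way to keep both sets of tests running is to write the assertions once and run them against each implementation. The sketch below is an assumption-heavy illustration using only unittest: ancestors_baseline stands for the plain version, and ancestors_with_paths is a made-up stand-in for a plugin-backed version (here, materialized path strings), not the API of mptt or treebeard.

```python
import unittest


def ancestors_baseline(parents, tag):
    """Plain-Python baseline: follow parent links, no tree library involved."""
    chain = []
    while parents.get(tag) is not None:
        tag = parents[tag]
        chain.append(tag)
    return chain


def ancestors_with_paths(paths, tag):
    """Stand-in for a plugin-backed version, using materialized paths."""
    return list(reversed(paths[tag].split("/")[:-1]))


class AncestorContract:
    """Shared assertions; each subclass plugs in one implementation."""

    def ancestors(self, tag):
        raise NotImplementedError

    def test_root_has_no_ancestors(self):
        self.assertEqual(self.ancestors("python"), [])

    def test_nearest_ancestor_first(self):
        self.assertEqual(self.ancestors("orm"), ["django", "python"])


class BaselineTests(AncestorContract, unittest.TestCase):
    PARENTS = {"python": None, "django": "python", "orm": "django"}

    def ancestors(self, tag):
        return ancestors_baseline(self.PARENTS, tag)


class PathBackedTests(AncestorContract, unittest.TestCase):
    PATHS = {
        "python": "python",
        "django": "python/django",
        "orm": "python/django/orm",
    }

    def ancestors(self, tag):
        return ancestors_with_paths(self.PATHS, tag)
```

Run it with python -m unittest. If BaselineTests fails, the problem is in my own code (case 1). If only PathBackedTests fails, it is in how I wired up the tree layer or in the layer itself (cases 2 and 3).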

A second closing thought

We would probably benefit from going back to the drawing board and improving polymorphism and tree support in Django itself. These types of changes are not for the faint of heart. But we benefit from the wealth of information gleaned from using stuff like polymorphic/treebeard/mptt in the wild. Changes to the internals of Django's ORM that support proxy model m2m relations with custom managers in a slicker way, with fewer hacks, could go a long way in making downstream libraries more maintainable.

In addition, some people may be able to forgo using these libraries completely.