Fedora, Community Health, and the birth of Hatlas
Fedora #
I’ve been quite busy these past several months, contributing to the Fedora project in ways that I never planned.
I initially thought that I would join to help the ops folk modernize a few things and brush up on my Ansible. But I quickly discovered that Fedora Infra is quite a difficult beast to get one’s hands around, even if you have decades of professional infra skills behind you. There are a large number of reasons for this, and none of them are, “someone is an incompetent jerk”, which makes it a difficult problem to solve. (As with most complex systems, emergent feedback loops are tricky.)
But as I started to spend some time in the community, I slowly realized that I was watching a continuous stream of fellow newcomers full of naive excitement join the project, announce themselves, and then quickly bounce off of the huge Infra knowledge cliff and disappear forever.
You’ll always have some of that, but some of those people seemed quite skilled, or at least as skilled as I am. And I had been onboarding myself for weeks at this point, without much to show for it due to the difficulty of the task. To be blunt, if someone such as myself with multiple decades of professional experience is finding the task difficult, then I think it is safe to say it’s objectively difficult.
I began to wonder – how many people have attempted this feat and succeeded? How many have failed in the past month alone? Or year? What was the main reason? And what could we change if we knew those answers? What discomfort would we be willing to bear if we could directly compare the consequences of the status quo and the potential rewards of improvement?
Most importantly, since it’s a pattern I’ve seen many times as a mentor and educator: how many of those people who failed to onboard were early professionals who may have walked away with the conclusion that they were the problem? That they just weren’t smart enough? And that perhaps open source “wasn’t for them”? (Despite this being an objectively very difficult task even for seasoned professionals.)
The thought gives me a pit in my stomach.
The quest begins #
I went looking for that data, hoping to find some quick answers.
What I found instead was a very long history of Fedora leadership begging and pleading for that data as the likely cure for many organizational woes, alongside a scattered history of people who either never found the data they needed, or found only part of it and lacked the access, tools, or infrastructure to reach solid answers to their questions or to preserve any part of their process.
I really hate to see newcomers get unnecessarily discouraged. And I’d also hate to see Fedora wither away or become a purely commercial project, as it fits my needs perfectly and (I believe) represents a massive social good.
I had found my very own can of worms. (TBD whether it will resolve that pit in my stomach.)
Sr. Thingdoer #
I’m not a data scientist by any measure, but I do have a history and a habit of diving deep into the data stores I’ve used as a software engineer & architect. Understanding how a query actually executes in, say, Postgres vs. DynamoDB can dramatically alter which one you pick and how you model the data, not to mention the tradeoffs you must be aware of in order to build something scalable and highly reliable.
I had never heard of “data engineering”, and definitely couldn’t have told you the difference between a data engineer and a data scientist six months ago. As it turns out, what Fedora needed desperately was bucketloads of data engineering – that is, moving data into the right datastores, consolidating it, cleaning it up, and optimizing it for analysis. All the stuff that comes before you can answer any questions. (Or ask any AI to do so.)
Anyone who has been around a while has had an ETL project or 10, but I didn’t know that there was a title and profession dedicated to it, or that there was a whole huge ecosystem of tools old and new to facilitate it. The more itches I had to scratch around fresh, queryable data, the more tools and toys I found to throw at the problems. I also found a few interesting new ideas for datastores that caught my eye as having huge potential for massive capability at low cost, and pretty soon I couldn’t keep my hands off of those git repos…
Look ma, I’m a data engineer!
Let’s build a data lakehouse #
That’s a word I also didn’t know six months ago. Before we had Data Lakehouses we just had Parquet Piles. The Lakehouse pattern seemed to fit our needs naturally, though. We have:
- Large enough data volume (>2 TB) that it doesn’t comfortably fit on most laptops.
- Lots of different data sources that we want to land together so we can blend / augment.
- High sensitivity to infra cost and operational burden.
- Most of our data is append-only, e.g. user activity logs.
- Most of our data is semi-structured JSON, and there are some recent innovations in efficient parquet encoding and query pushdown of JSON data.
After a few months of wrangling our ugliest datasets (including all the favorite hits such as “rows larger than 4 gigabytes”, and “json columns stored as non-queryable text which p.s. also contains null bytes”), I finally had some parquet flowing. And then I needed to put it somewhere, so Apache Iceberg, Apache Polaris, and many other tools in the ecosystem (shoutout to DuckDB!) were suddenly on my plate. And of course, I needed to stick all this stuff somewhere, so I guess let’s build a Kubernetes cluster?
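To give a flavor of that cleanup work, here is a minimal Python sketch of salvaging a JSON column that was stored as raw text and polluted with null bytes. The helper name and the sample row are hypothetical, not the actual Fedora pipeline:

```python
import json


def clean_json_text(raw: str):
    """Strip the null bytes that make a text column unparseable, then parse it
    as JSON. Returns the parsed object, or None if the row is beyond salvage.
    (Hypothetical helper -- the real pipeline's cleanup rules are messier.)
    """
    cleaned = raw.replace("\x00", "")
    try:
        return json.loads(cleaned)
    except json.JSONDecodeError:
        return None


# Example: a "JSON" value as it might arrive from a non-queryable text column.
dirty = '{"user": "alice\x00", "action": "login"}'
print(clean_json_text(dirty))  # {'user': 'alice', 'action': 'login'}
```

Once rows parse cleanly, they can be re-encoded as proper structured columns on the way to Parquet instead of opaque text blobs.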
Yeah … on that last point … you know that moment in interviews where they say, “Tell me about a time you made a mistake”? Wow, is it a pain to run Kubernetes on a single-node bare-metal VPS because you’re footing the bill instead of your boss. I thought I knew Kubernetes deeply, but it turns out that it’s actually built for clouds and clusters (who knew!), and that you will need a lot of time to figure out how to provide yourself with things you take for granted elsewhere, such as persistent volumes and secrets storage. I did try some of the single-node distros, such as k3s and k0s, but ran into problems with both, so good ol’ kubeadm + cilium it was. A month later.
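For a sense of what “provide yourself with persistent volumes” means on a single bare-metal node: without a cloud provider there is no dynamic provisioner, so you pre-create storage by hand. A minimal sketch might look like the following (names, sizes, and paths are purely illustrative, not the actual Hatlas config):

```yaml
# Illustrative only -- not the actual Hatlas configuration.
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: local-manual
provisioner: kubernetes.io/no-provisioner   # no dynamic provisioning on bare metal
volumeBindingMode: WaitForFirstConsumer
---
apiVersion: v1
kind: PersistentVolume
metadata:
  name: data-pv
spec:
  capacity:
    storage: 500Gi
  accessModes: ["ReadWriteOnce"]
  storageClassName: local-manual
  hostPath:
    path: /srv/k8s/data   # a directory on the node's own disk
```

Every claim that binds to this class needs a matching hand-made volume, which is exactly the kind of chore a managed cloud hides from you.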
Kubernetes was a project requirement because someday I want to get all of this stuff upstreamed into Fedora Infra proper, and they run OpenShift there. But I’m still not sure whether that month of investment will have any return.
Hug your local Kubernetes dev.
Let’s build a community #
My work was starting to show some visible results, and pretty soon others around the Fedora community began to express interest. We quickly started to clog up the existing chat rooms with our blathering about database libraries and GIN indexes, so the community managers were “inspired” to help us form our own little working group. And thus was born: the Fedora Data Working Group (FDWG).
Suddenly, my “dev” environment was multi-tenant with security concerns. And it had unpredictable load. And people who were upset when it was broken. So, I had to quickly do what anyone does whenever they find that their dev environment is now production: marketing! And thus was born: Hatlas.
I’ll let you in on a little secret: this Death Star is operational. But it’s also still very WIP, and we have some data governance / etc. to sort out, so it’s currently only open to FDWG members. But it’s up, and we have some members doing real work with it.
Which is great, because I’m not! I now understand why data engineering is separate from data analytics. There simply isn’t time enough to do both. Some people get to have all the fun!!
I’ll let you decide which group that is.