Distributed Data You Can Afford

Distributed cookie jars

Database Server is a Single Point Of Failure (SPOF)

Prior Art

The truth: Even a single database cluster is one big SPOF. Hard to “kill” but still a SPOF. And it is one very expensive SPOF[1] indeed.

You would be forgiven for thinking the illustrious Martin Fowler has spotted that and “solved” it. He did not. I have found good and important architectural solutions published twenty years before Martin. For example, from 1999: “An Architecture for Distributed Enterprise Data Mining” (6 authors). Those were the dark medieval times when those good authors had to use witchcraft like “Java” and “CORBA”.

Fig 1: “A Scenario with Distributed Kensington EJB Servers” — Circa 1999

So, two decades later, on the “architectural level”, Martin, being Martin, has written a long-ish article on the same subject. He also coined an easy-to-sell, catchy term: “Distributed Data Mesh”. And several other catchy terms in that same article, immediately abused by “Big Data Consultants”, aka purveyors of high-quality “Slide Ware”.

Technical Architecture of the modern Solution

But that article is not (all that) bad. Provided you manage to read through it all the way down to the section “The paradigm shift towards a data mesh”. And in there is a sentence:

“Accordingly, the data lake is no longer the centerpiece of the overall architecture.”

I promise you that is not “taken out of context”. One can check and read the article. From there, I have taken just a part of a single diagram. (I hope nobody minds terribly.)

Fig 2: “Data Infra as a Platform”

In there Martin rightly points to the “Data Lake” as the SPOF. I might define a “data lake” as a synonym for a “database cluster”, these days “living in the Cloud”. The whole point of this post is to present an actual 2021 implementation of “data infra as a platform”: available, resilient, simple, and one you can pay for. Mine is actually a Technical Architecture of the same concept.

Fig 3: Distributed Data Infrastructure — DBJ Technical Architecture

It is important to emphasize the Physical Partitioning of such a system.

Fig 4: Physical Partitioning

One physical node can be “anything”: a laptop, a PC, an office server. System requirements are really modest. And the geographical location is irrelevant. So we will obliterate this “data lake” out of existence and offer an “instant replacement”. Better and (almost) free.

The Implementation

Caveat Emptor: I might be confusing you by jumping straight to the final solution. Please read the stuff behind all the links and revisit this post two or three times. That will help. There are also “comments” below.

SQLite + Resilio = An Instant Distributed Data Infrastructure

I know I will be labeled with many labels, but it is as simple as that. Truly larger than the sum of its (two) parts. Take two existing pieces, add them together, and enjoy the synergy. The first half is a modern (actually 10+ years old) data distribution concept, aka P2P, and its mature desktop implementation, aka “Resilio Sync”. The second half is the most used database product on the planet that is also free, aka SQLite. And so light it can run on anything. Truly anything.

The rest is my quick explanation. I can, and will, add more to this article if anybody wants me to.

  • SQLite
    • Free and ubiquitous[3].
    • Full-Featured SQL
    • Light
    • 20+ years of maturity
  • Resilio Sync Home
    • For Personal use but not a Toy
    • From the “original” P2P “makers”.

Thankfully there is some good material explaining well why P2P is the infrastructure solution for data distribution. And now it is (almost) free and easy to install by “anyone”, “anywhere”. I am not affiliated with that company in any way, but allow me to paste in this “Resilio” marketing blurb that actually describes the run-time environment of this solution pretty well:

“…The edge[4] is everywhere people live and work and all the routes between. It is the global infrastructure of business….”

P2P is a key technology for the Distributed Data Infrastructure. And “Resilio” is a fast, resilient, cheap, and readily available implementation of that.

It can use every possible (personal or not) device to which your system has access. The more the merrier. Data “given” to “Resilio” is transparently shared across the whole system, inside which your “organization” exists. You do not need any “enterprise” or “pro” solutions. You do not need “cloud” or “data center”. You just need to know and use the SQLite databases. Shared transparently.
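To make “just use the SQLite databases” a bit more concrete, here is a minimal Python sketch. It assumes a few things that are not stated above: that the synced folder is an ordinary directory on each node, that every node writes only to its own file named after a node id (a convention picked here so that two nodes never write the same file), and a made-up `orders` table. No Resilio API is used or implied; the sync tool only ever sees plain files.

```python
import sqlite3
from pathlib import Path


def open_node_db(shared_dir: Path, node_id: str) -> sqlite3.Connection:
    """Open (or create) this node's own database file in the synced folder.

    Assumption: one file per node, so no two nodes write the same file.
    """
    shared_dir.mkdir(parents=True, exist_ok=True)
    conn = sqlite3.connect(shared_dir / f"{node_id}.sqlite")
    # Hypothetical example table; any schema would do.
    conn.execute(
        "CREATE TABLE IF NOT EXISTS orders ("
        " id INTEGER PRIMARY KEY,"
        " customer TEXT NOT NULL,"
        " amount REAL NOT NULL)"
    )
    return conn


def add_order(conn: sqlite3.Connection, customer: str, amount: float) -> int:
    """Insert one row and return its row id."""
    with conn:  # commits on success, rolls back on an exception
        cur = conn.execute(
            "INSERT INTO orders (customer, amount) VALUES (?, ?)",
            (customer, amount),
        )
    return cur.lastrowid
```

A node would call something like `open_node_db(Path.home() / "Sync" / "orders", "laptop-1")`, where both the folder and the node id are placeholders. Once the connection is closed, what remains is a single `.sqlite` file for the sync layer to distribute.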

SQLite is the “secret juice”. What I am describing here is a solution for one truly distributed database. Not in “real-time” but in “near time”. That reminds us of the cold fact: “Network is not a transparent resource”.
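A hedged sketch of what “near time” can mean for a reader: each node looks at whatever copies of the peers' database files have arrived so far and tolerates files that are missing, half-transferred, or not databases at all. The `*.sqlite` naming and the `orders` table are assumptions made for this illustration only.

```python
import sqlite3
from pathlib import Path


def read_all_orders(shared_dir: Path) -> list[tuple[str, str, float]]:
    """Collect (node, customer, amount) rows from every database file present."""
    rows = []
    for db_file in sorted(shared_dir.glob("*.sqlite")):
        # mode=ro opens the file read-only, so a reader never modifies a peer's data.
        uri = f"{db_file.resolve().as_uri()}?mode=ro"
        conn = sqlite3.connect(uri, uri=True)
        try:
            for customer, amount in conn.execute(
                "SELECT customer, amount FROM orders ORDER BY id"
            ):
                rows.append((db_file.stem, customer, amount))
        except sqlite3.DatabaseError:
            # Incomplete or foreign file: skip it now, pick it up on a later pass.
            continue
        finally:
            conn.close()
    return rows
```

The result is only as fresh as the last completed sync, which is exactly the “near time” trade-off: the reader never waits on the network, it just works with what is already on the local disk.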

The Manageability

The key weak point (gasp!) here is the manageability of a multitude of business-function solutions. “Orchestration” of unlimited SQLite databases (just files for me and you) is done almost “for free” by Resilio. They are all “just” in sync. But “taming” this for a particular business case might be a challenge. I can envisage pretty simple management components (and UIs) built specifically for each business function. The “core of the trick” is to keep it all decoupled. Good, are we done then?
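As a rough idea of how small such a management component could start, here is a sketch that only takes an inventory of the synced folder: which database files are present, whether each passes SQLite's own `PRAGMA quick_check`, and which tables it holds. The folder layout and the `*.sqlite` naming are assumptions for this example, not part of any product.

```python
import sqlite3
from pathlib import Path


def inventory(shared_dir: Path) -> dict[str, dict]:
    """Return a per-file report: health flag and table names."""
    report = {}
    for db_file in sorted(shared_dir.glob("*.sqlite")):
        entry = {"ok": False, "tables": []}
        try:
            # Read-only open: the inventory never changes anything it inspects.
            conn = sqlite3.connect(f"{db_file.resolve().as_uri()}?mode=ro", uri=True)
            try:
                status = conn.execute("PRAGMA quick_check").fetchone()[0]
                entry["ok"] = status == "ok"
                entry["tables"] = [
                    name
                    for (name,) in conn.execute(
                        "SELECT name FROM sqlite_master"
                        " WHERE type = 'table' ORDER BY name"
                    )
                ]
            finally:
                conn.close()
        except sqlite3.DatabaseError:
            pass  # unreadable file: reported with ok=False and no tables
        report[db_file.name] = entry
    return report
```

Because it only reads files, a component like this stays decoupled from whatever writes the databases and from the sync layer itself.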

What Could Possibly Go Wrong™?

The Legacy Strikes Back

Technical debt is a product of the whole of legacy IT, not just software. It is the key problem for all mid to large organizations today. But Legacy is not just a technical issue. There are also legacy people and legacy thinking. Soft skills are required.

Legacy technologies have all been selected, evaluated, and paid for by someone. If that “someone” is still around, you are bound to have “legacy issues”. Legacy data syncing technologies are costing millions. And there are still a lot of people making a good living from them. A gradual socio-economic migration from Legacy Data Systems to almost free (compared to legacy) and obviously better P2P Distributed Data Systems is not going to be easy.

[1] The actual licensing costs of database servers and clusters are one big mystery. I challenge you to find out the real licensing costs online. But we can “give you an idea”: as an example, SQL Server pricing starts at $931.00 as a one-time payment. Although you cannot buy one.

[2] We will let the Masters of Slide Ware sell the idea on top of this simple, resilient, and (almost) ridiculously cheap solution.

[3] “present, appearing, or found everywhere.” (Oxford Languages)

[4] The “edge” is the wrong term here. The whole point is we are going from the “edge” to “everywhere”.
