Distributed Data You Can Afford

A single database server is a single point of failure (SPOF).

Prior Art

The truth: Even a single database cluster is one big SPOF. Hard to “kill” but still a SPOF. And it is one very expensive SPOF[1] indeed.

You would be forgiven for thinking that the illustrious Martin Fowler spotted that and “solved” it. He did not. I have found good and important architectural solutions published some twenty years before Martin. For example, from 1999: “An Architecture for Distributed Enterprise Data Mining” (six authors). Those were the dark medieval times when those good authors had to use witchcraft like “Java” and “CORBA”.

Fig 1: “A Scenario with Distributed Kensington EJB Servers” — Circa 1999 (https://www.researchgate.net/publication/221348843_An_Architecture_for_Distributed_Enterprise_Data_Mining)

And then, two decades later, on the “architectural level”, Martin, being Martin, wrote a long-ish article solving the same issue. He also coined an easy-to-sell, catchy term: “Distributed Data Mesh”. That and several other catchy terms from the same article were immediately abused by “Big Data Consultants”, aka purveyors of high-quality “Slide Ware”.

Technical Architecture of the Modern Solution

But that article is not (all that) bad, provided you manage to read all the way down to “The paradigm shift towards a data mesh”. In there is a sentence:

“Accordingly, the data lake is no longer the centerpiece of the overall architecture.”

I promise you that is not “taken out of context”. One can check and read the article. From there, I have taken just a part of a single diagram. (I hope nobody minds terribly.)

Fig 2: “Data Infra as a Platform”

 

In there, Martin rightly points to the “Data Lake” as the SPOF. I would define a “data lake” as a synonym for a “database cluster”, these days “living in the Cloud”. The whole point of this post is to present the 2021 version of an actual “data infra as a platform” implementation: available, resilient, simple, and one you can pay for. What follows is actually a Technical Architecture of the same concept. But I call it

Distributed Data Infrastructure

Fig 3: Distributed Data Infrastructure — DBJ Technical Architecture sketch

Where is it and how is this data infrastructure “distributed”? It is important to emphasize the Physical Partitioning of such a system.

Fig 4: Physical partitioning

 

One physical node can be “anything”: a laptop, a PC, an office server. System requirements are really modest, and the geographical location is irrelevant. So we will obliterate this “data lake” out of existence and offer an “instant replacement”. Probably better and (almost) free.

The Implementation

Caveat Emptor: I might be confusing you by jumping straight to the final solution. Please read the stuff behind all the links and revisit this post two or three times. That will help. There are also “comments” below. Please “comment”. 

Now, quite unlike any architect, let’s jump straight into the implementation, head first.

SQLite + Resilio = An Instant Distributed Data Infrastructure

I know I will be called many names, but I will boldly claim: it is as simple as that. Truly larger than the sum of its (two) parts. Take two existing pieces (e.g. a laptop and a PC), add them together, and enjoy the synergy. One half of each “node” is a modern (actually 10+ years old) data distribution concept, aka P2P, in its mature desktop implementation, aka “Resilio Sync”. The second half is the most used database product on the planet, which is also free: SQLite. And so light it can run on anything. Truly anything.

The rest is my quick explanation. I can, and will, add more to this article if anybody wants me to.

  • SQLite
    • Free and ubiquitous[3].
    • Full-Featured SQL
    • Light
    • 20+ years of maturity
  • Resilio Sync Home
    • For Personal use but not a Toy
    • From the “original” P2P “makers”.

Thankfully, there is some good material explaining why P2P is the infrastructure solution for data distribution. And now it is (almost) free and easy to install by “anyone”, “anywhere”. I am not affiliated with that company in any way, but allow me to paste in this “Resilio” marketing blurb, which describes the run-time environment of this solution pretty well:

“…The edge[4] is everywhere people live and work and all the routes between. It is the global infrastructure of business….”

P2P is a key technology for Distributed Data Infrastructure. And “Resilio” is a fast, resilient, cheap, and readily available implementation of it.

It can use every possible device (personal or not) to which your system has access. The more the merrier, because each one enlarges the P2P “swarm”. Data “given” to “Resilio” (SQLite db files) is transparently shared across the whole system. You do not need any “enterprise” or “pro” solutions. You do not need a “cloud” or a “data center”. You just need to know and use SQLite databases. Shared transparently.

SQLite is the “secret juice”. What I am describing here is the solution for one truly distributed database. Not in “real-time” but in “near time”. “Near time” is there to remind us of the cold fact: “Network is not a transparent resource”.
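
One way a node might “give” its data to the synced folder, as a minimal Python sketch under my own assumptions rather than a definitive implementation. Each node keeps its working database outside the synced folder and periodically publishes a consistent snapshot into it through SQLite's online backup API (`Connection.backup` in Python's `sqlite3`). The reasoning: a file-level sync tool may copy a database file in the middle of a write. The function name `publish_snapshot`, the `<node>.sqlite` naming, and the folder layout are all hypothetical.

```python
import sqlite3
import tempfile
from pathlib import Path


def publish_snapshot(working_db: Path, synced_dir: Path, node_name: str) -> Path:
    """Copy a consistent snapshot of working_db into the synced folder.

    Assumption: synced_dir is the folder handed to the sync tool, and
    this node is the only writer of <node_name>.sqlite.
    """
    synced_dir.mkdir(parents=True, exist_ok=True)
    target = synced_dir / f"{node_name}.sqlite"
    # Build the copy under a temporary name first. Whether the sync tool
    # skips *.tmp files is an assumption you would need to verify.
    tmp = synced_dir / f"{node_name}.sqlite.tmp"
    if tmp.exists():
        tmp.unlink()

    src = sqlite3.connect(working_db)
    dst = sqlite3.connect(tmp)
    try:
        # Online backup: a transactionally consistent copy, even if
        # other connections are using the working database.
        src.backup(dst)
    finally:
        dst.close()
        src.close()

    # Swap the finished snapshot into place in one step.
    tmp.replace(target)
    return target


if __name__ == "__main__":
    base = Path(tempfile.mkdtemp())
    working = base / "orders_working.sqlite"

    con = sqlite3.connect(working)
    con.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, item TEXT)")
    con.execute("INSERT INTO orders (item) VALUES ('widget')")
    con.commit()
    con.close()

    published = publish_snapshot(working, base / "synced", "laptop")
    print("published:", published)
```

Run on a timer or after each batch of changes, this is exactly the “near time” behaviour: other nodes see the data once the next snapshot has travelled through the swarm, not at the moment of the write.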

The Manageability

The key weak point (gasp!) here is the manageability of a multitude of business function solutions. “Orchestration” of an unlimited number of SQLite databases (just files, for me and you) is done almost “for free” by Resilio. They are all “just” in sync. But “taming” this for a particular business case might be a challenge. I can envisage pretty simple management components (and UIs) built specifically for each business function. The “core of the trick” is to keep it all decoupled. Thus, be careful not to share everything all the time. Good, are we done then?
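
To make “pretty simple management components” a bit more concrete, here is a hedged Python sketch of what one could look like for a single business function. It assumes a hypothetical layout: one synced folder per business function, holding one `<node>.sqlite` file per node, all with the same schema. The function name `query_all_nodes` is mine. Each file is opened read-only through an SQLite URI (`mode=ro`), so the component never writes into a file that another node owns.

```python
import sqlite3
from pathlib import Path


def query_all_nodes(synced_dir: Path, sql: str) -> list:
    """Run one read-only query against every node file in synced_dir.

    Returns rows prefixed with the node name taken from the file name.
    Assumption: every *.sqlite file in the folder shares the same schema.
    """
    rows = []
    for db_file in sorted(synced_dir.glob("*.sqlite")):
        # as_uri() needs an absolute path; mode=ro opens the file read-only.
        uri = db_file.resolve().as_uri() + "?mode=ro"
        con = sqlite3.connect(uri, uri=True)
        try:
            for row in con.execute(sql):
                rows.append((db_file.stem, *row))
        finally:
            con.close()
    return rows
```

Called as `query_all_nodes(Path("synced/orders"), "SELECT item FROM orders")`, with that path being a made-up example, it gives one combined view of a single business function. Keeping one folder per function is one way to stay decoupled and to avoid sharing everything all the time.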

What Could Possibly Go Wrong™?

The Legacy Strikes Back

Technical debt is a product of the whole of legacy IT, not just software. It is the key problem for all mid-size to large organizations today. But legacy is not just a technical issue. There are also legacy people and legacy thinking. Soft skills are required to manage those people.

Legacy technologies have all been selected, evaluated, and paid for by someone. If that “someone” is still around, you are bound to have “legacy issues”. Legacy data syncing technologies cost millions. And there are still a lot of people making a good living from them. A gradual socio-economic migration from Legacy Data Systems to almost free (compared to legacy) and obviously better P2P Distributed Data Systems is not going to be easy.


[1] The actual licensing costs of database servers and clusters are one big mystery. I challenge you to find the real licensing costs online. But we can “give you an idea”: as an example, SQL Server pricing starts at $931.00 as a one-time payment. Although you cannot buy one.

[2] We will let the Masters of Slide Ware sell the idea on top of this simple, resilient, and (almost) ridiculously cheap solution.

[3] “present, appearing, or found everywhere.” (Oxford Languages)

[4] The “edge” is the wrong term here. The whole point is that we are going from the “edge” to “everywhere”.