Polyglot Persistence Brain Dump

Data Al Dente



Relevant Resources

Gave Some Thought to an Initial Prototype

Codename Amsterdam (future ones will be cities whose names start with successive letters of the alphabet: belfast, copenhagen, dublin, edinburgh, florence, grenada, h?, london, munich, york, zurich)

https://whimsical.com/amsterdam-design-NCuDd7P2u61edxTXa1R7gq

This is what I want to implement first and then benchmark some OLTP and OLAP queries on SQLite and DuckDB. I think prototyping this will help me make the problem more concrete and highlight gaps in my knowledge, etc.
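
To make that concrete, here's roughly the harness I'd start with: load the same table into both engines, then time one OLTP-style point lookup and one OLAP-style aggregation against each. This is only a sketch; the `events` table, row count, and queries are made-up placeholders.

```python
import random
import sqlite3
import time

import duckdb  # pip install duckdb

ROWS = 100_000
rows = [(i, random.randint(1, 100), random.random()) for i in range(ROWS)]

def timed(con, label, sql, params=None):
    start = time.perf_counter()
    cur = con.execute(sql) if params is None else con.execute(sql, params)
    cur.fetchall()
    print(f"{label}: {time.perf_counter() - start:.4f}s")

for name, con in [("sqlite", sqlite3.connect(":memory:")), ("duckdb", duckdb.connect())]:
    con.execute("CREATE TABLE events (id INTEGER, category INTEGER, value DOUBLE)")
    con.executemany("INSERT INTO events VALUES (?, ?, ?)", rows)
    # OLTP-ish: single-row point lookup
    timed(con, f"{name} point lookup", "SELECT * FROM events WHERE id = ?", (ROWS // 2,))
    # OLAP-ish: full-scan aggregation with a group by
    timed(con, f"{name} aggregation",
          "SELECT category, avg(value) FROM events GROUP BY category")
```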

Features it should have

Single node or distributed (can run on-premises, in the cloud, etc.)

Intelligently shuffles data around and caches it automagically for you to optimize various parameters; note that the same logical entity may be backed by 1-n physical entities

Can set optimization config and parameters (e.g. cost, performance, etc.); these could be a simplified layer on top of more low-level knobs (e.g. OtterTune-style)

Most config and parameters can be set at various levels (global, cluster, user, schema or namespace, table or entity type, row or entity, column or attribute, query, etc.)
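
A rough sketch of how layered config resolution could work: the most specific scope that sets a knob wins, and everything else falls back toward the global defaults. The scope names and knobs below are invented for illustration.

```python
# Hypothetical layered config: query hints override table settings,
# which override global defaults.
GLOBAL = {"optimize_for": "cost", "replication_factor": 2}
TABLE = {"orders": {"optimize_for": "latency"}}

def resolve(knob, table=None, query_hints=None):
    for scope in (query_hints or {}, TABLE.get(table, {}), GLOBAL):
        if knob in scope:
            return scope[knob]
    raise KeyError(knob)

print(resolve("optimize_for", table="orders"))        # latency (table-level override)
print(resolve("replication_factor", table="orders"))  # 2 (falls back to global)
```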

A bunch of query languages on top of the data (e.g. graph, SQL, document, KV, GraphQL, REST, gRPC, etc.). Presumably the base data is stored in the most flexible graph form and the other languages are just derived from this, though we could do SQL as the base and use recursive CTEs or UDFs or something to do graph queries
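
As a tiny proof of the SQL route: a graph stored as a plain edge table can already be walked with a recursive CTE (table and column names here are made up).

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE edges (src TEXT, dst TEXT)")
con.executemany("INSERT INTO edges VALUES (?, ?)",
                [("a", "b"), ("b", "c"), ("c", "d"), ("x", "y")])

# Reachability from node 'a' via a recursive CTE
reachable = con.execute("""
    WITH RECURSIVE reach(node) AS (
        SELECT 'a'
        UNION
        SELECT e.dst FROM edges e JOIN reach r ON e.src = r.node
    )
    SELECT node FROM reach
""").fetchall()
print(reachable)  # e.g. [('a',), ('b',), ('c',), ('d',)]
```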

Some default datastore(s) come bundled (e.g. SQLite or Postgres, DuckDB), but others can be added or configured

You can do all CRUD SQL operations, e.g. SELECT, INSERT, UPDATE, DELETE

How should schema be handled? A graph schema which can be translated to the other schemas? Or keep it simple?

You can specify the “isolation” level at various scopes

You can specify policies for data locality, replication, durability, etc. (India data stays in India, all data must be written to two disks, this table must exist in both the US and EU, etc.)
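
One possible (entirely invented) shape for those policies: declarative rules that the placement and replication machinery would have to enforce, mirroring the examples above.

```python
# Hypothetical policy layer; scope syntax and field names are made up.
policies = [
    {"scope": "schema:india_customers", "rule": "locality",    "regions": ["ap-south-1"]},
    {"scope": "global",                 "rule": "durability",  "min_disk_copies": 2},
    {"scope": "table:orders",           "rule": "replication", "regions": ["us-east-1", "eu-west-1"]},
]
for p in policies:
    print(p["scope"], "->", p["rule"])
```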

Automagically chooses the best database(s) to read from and write to
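
For the Amsterdam prototype this could start embarrassingly simple, e.g. a heuristic that routes analytical-looking statements to DuckDB and everything else (especially writes) to SQLite. The function below is just a sketch of the interface; real routing would use a cost model, statistics, and the optimization config above.

```python
import re

def route(sql: str) -> str:
    # Crude heuristic: aggregations/analytics go to DuckDB, writes and
    # point queries go to SQLite.
    analytical = re.search(r"\b(GROUP BY|SUM|AVG|COUNT|WINDOW|OVER)\b", sql, re.IGNORECASE)
    writes = re.match(r"\s*(INSERT|UPDATE|DELETE)\b", sql, re.IGNORECASE)
    if writes:
        return "sqlite"
    return "duckdb" if analytical else "sqlite"

print(route("SELECT category, avg(value) FROM events GROUP BY category"))  # duckdb
print(route("SELECT * FROM events WHERE id = 42"))                         # sqlite
```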

Batch and stream processing

If I tried to build this today

  1. Build a converter between all useful query languages (or into some intermediate representation); see the IR sketch after this list. Support Datalog, a graph language, SQL, something Bigtable- or Cassandra-like, Mongo (is Dynamo basically the same as Mongo?), Redis-style key-value, Elasticsearch, Splunk, Spark (e.g. arbitrary code), streaming DBs like Flink, Druid, KSQL, time-series (Prometheus, InfluxDB), array/vector (SciDB, Faiss, Pinecone, TileDB), GIS stuff? bio/chem etc. stuff?
    1. https://arxiv.org/abs/1607.04197
    1. http://vldb.org/pvldb/vol14/p2831-shaikhha.pdf
    1. Datalog as a foundation? Or some graph language? Probably want to just build my own language that has entities, relationships, etc.
    1. Are there multi-model DBs which would make sense to steal things from?
    1. compilers and query languages
  1. Build a query router which chooses the best place(s) to read from or write to
    1. Trino
    1. Calcite
    1. polystore query optimization
  1. Some way to reason about consistency and isolation levels
    1. Timely Dataflow
    1. reads are writes too
    1. see Hermitage above
    1. distributed systems
  1. Config and optimization knobs and parameters
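
For step 1, a very rough sketch of what the intermediate representation could look like: a query as a small tree of entities, filters, and projections, with per-backend emitters (only a SQL emitter shown). Every name here is invented; a real IR would need joins, graph traversals, aggregation, and so on.

```python
from dataclasses import dataclass, field

@dataclass
class Filter:
    attribute: str
    op: str
    value: object

@dataclass
class EntityQuery:
    entity: str                      # logical entity type, e.g. "person"
    filters: list = field(default_factory=list)
    projection: list = field(default_factory=list)

def to_sql(q: EntityQuery) -> str:
    # Emit parameterized SQL from the IR; other emitters (graph, KV, ...)
    # would walk the same tree.
    cols = ", ".join(q.projection) or "*"
    where = " AND ".join(f"{f.attribute} {f.op} ?" for f in q.filters)
    sql = f"SELECT {cols} FROM {q.entity}"
    return f"{sql} WHERE {where}" if where else sql

q = EntityQuery("person", filters=[Filter("age", ">", 30)], projection=["name"])
print(to_sql(q))  # SELECT name FROM person WHERE age > ?
```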

Some Trino Changes

  1. Graph support
  1. Logical catalogs which can have multiple physical catalogs backing them (see the sketch at the end of this section)
    1. Potentially some way to shuffle between catalogs too
    1. Some way to reason about consistency and isolation levels when inserting into logical catalog with multiple physical catalogs
  1. OLTP support
  1. Optimization knobs
    1. Ideally at a bunch of levels
    1. Maybe other knobs too, like where to replicate data, isolation level, etc.
  1. Query language mapping
  1. Make it work well embedded or on a single node
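
For the logical-catalog idea, a hypothetical sketch of the metadata layer: a logical catalog that fans writes out to all of its physical catalogs and picks one of them for reads, where the read choice is a placeholder for something cost-based. Names and fields are invented.

```python
from dataclasses import dataclass

@dataclass
class LogicalCatalog:
    name: str
    physical: list                   # e.g. ["postgres_primary", "iceberg_lake"]
    read_preference: str = "cheapest"  # placeholder knob, unused in this sketch

    def write_targets(self):
        return list(self.physical)   # naive: write synchronously to every backing catalog

    def read_target(self):
        return self.physical[0]      # placeholder for a real cost-based choice

orders = LogicalCatalog("orders", ["postgres_primary", "iceberg_lake"])
print(orders.read_target(), orders.write_targets())
```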