Polyglot Persistence Brain Dump

Data Al Dente

Home | Posts | About Me

Created @January 22, 2023

Updated @January 22, 2023

Outline
Relevant Resources
@May 15, 2023 Gave Some Thought to an Initial Prototype
@January 6, 2023 features it should have
@January 6, 2023 if i tried to build this today
Some Trino Changes

Relevant Resources

F1 Lightning

SAP HANA ATR

Polypheny

“One Size Fits All”: An Idea Whose Time Has Come and Gone - Stonebraker, Çetintemel (2005)
- the commercial world will fracture into a collection of independent database engines, some of which may be unified by a common front-end parser

DB sys arch (Hellerstein, Stonebraker, Hamilton) (2007) https://dsf.berkeley.edu/papers/fntdb07-architecture.pdf
- handful of successful DBMS implementations exist. The community of people involved in designing and implementing database systems is tight: many attended the same schools, worked on the same influential research projects, and collaborated on the same commercial products

Martin Fowler polyglot persistence https://martinfowler.com/bliki/PolyglotPersistence.html

Gcloud blog OLTP OLAP barrier https://cloud.google.com/blog/products/databases/transactional-and-analytical-workloads-unite

Snowflake Unistore, Hybrid Tables

Trino

TileDB

Materialize, Timely and Differential Dataflow

CosmosDB

Singlestore

Jepsen
- Consistency Models

Hermitage

Some popular relevant interesting DBs or ideas: SQLite, DuckDB, Postgres, Iceberg, CrateDB, ElasticSearch, Mongo, Redis, Neo4j, Cassandra, Druid, Spark and Snowpark, S3, zookeeper and etcd, GIS stuff, JanusGraph, Memgraph, Fauna, TypeDB or GRAKN, InfluxDB, Kafka and Redpanda, RDF, triple store, JDBC ODBC REST GraphQL gRPC Thrift Kafka Websocket, SQL sparkl gremlin tinkerpop data log, stream and batch processing, flink

BigDAWG

@May 15, 2023 Gave Some Thought to an Initial Prototype

Codename Amsterdam (future ones will be cities with increasing letter names: belfast, copenhagen, dublin, edinburgh, florence, grenada, h? london, munich, york, zurich

https://whimsical.com/amsterdam-design-NCuDd7P2u61edxTXa1R7gq

this is what i want to implement first and then benchmark some OLTP and OLAP queries on sqlite and duckdb. i think prototyping this will help me make the problem more concrete and highlight gaps in my knowledge, etc.

@January 6, 2023 features it should have

Single Node or Distributed (can run on premises, cloud, etc)

Intelligently shuffles data around and caches it automagically for you to optimize various params, note that the same logical entity may have 1-n physical entities for it

Can set optimization config and parameters (eg cost, perf, etc), these could be a simplified layer on top of more low level knobs (eg ottertune style)

Most config and parameters can be set at various levels (global, cluster, user, schema. or namespace, table or entity type, row or entity, column or attribute, query, etc)

a bunch of query languages on top of the data (eg graph, sql, document, kv, graphql, rest, grpc, etc). presumably the base data is stored in the most flexible graph way and the other languages are just derived from this, though we could do sql and use recursive ctes or udfs or something to do graph

Some default datastore[s] come bundled (eg sqlite or Postgres, duckdb) but then other ones you add or configure

you can do all crud sql operations eg select insert update delete

how should schema be handled? graph schema which can be translated to other schemas? or keep it simple

you can specify the level of “isolation” at various levels

more details on isolation
- https://github.com/ept/hermitage
  The SQL standard tried to define four isolation levels (read uncommitted, read committed, repeatable read and serializable), but its definition is flawed. Several researchers have tried to nail down more precise definitions of weak (i.e. non-serializable) isolation levels. In particular:
  - Peter Bailis, Alan Fekete, Ali Ghodsi, Joseph M. Hellerstein, and Ion Stoica: "Scalable Atomic Visibility with RAMP Transactions" at ACM Transactions on Database Systems, Vol. 41, No. 3, Article 15, Publication date: July 2016.
  - Peter Bailis, Aaron Davidson, Alan Fekete, Ali Ghodsi, Joseph M Hellerstein and Ion Stoica: “Highly Available Transactions: Virtues and Limitations (Extended Version),” at 40th International Conference on Very Large Data Bases (VLDB), September 2014.
  - Alan Fekete, Dimitrios Liarokapis, Elizabeth O'Neil, Patrick O'Neil, and Dennis Shasha: “Making Snapshot Isolation Serializable,” ACM Transactions on Database Systems (TODS), volume 30, number 2, pages 492–528, June 2005. doi:10.1145/1071610.1071615
  - Readings in Database Systems, edited by Joseph M. Hellerstein and Michael Stonebraker, 4th edition, MIT Press, 2005. ISBN: 978-0-262-69314-1
  - Atul Adya, Barbara Liskov, Patrick O’Neil: "Generalized Isolation Level Definitions" Appears in the Proceedings of the IEEE International Conference on Data Engineering, San Diego, CA, March 2000
  - Atul Adya: “Weak Consistency: A Generalized Theory and Optimistic Implementations for Distributed Transactions,” PhD Thesis, Massachusetts Institute of Technology, Cambridge, MA, USA, March 1999.
  - Hal Berenson, Phil Bernstein, Jim Gray, Jim Melton, Elizabeth O'Neil and Patrick O'Neil: “A Critique of ANSI SQL Isolation Levels,” at ACM International Conference on Management of Data (SIGMOD), volume 24, number 2, May 1995. doi:10.1145/568271.223785
  - Jim N Gray, Raymond A Lorie, Gianfranco R Putzolu, and Irving L Traiger: “Granularity of Locks and Degrees of Consistency in a Shared Data Base,” in Modelling in Data Base Management Systems: Proceedings of the IFIP Working Conference on Modelling in Data Base Management Systems, edited by G.M. Nijssen, Elsevier/North Holland Publishing, pages 364–394, 1976.
  This project is based on the formal definition of weak isolation introduced by Adya, as extended by Bailis et al. They mathematically define certain anomalies (or phenomena) which can occur in an unrestricted concurrency model, and define isolation levels as prohibiting or preventing certain anomalies from occurring.

you can specify policies for data locality, replication, durability, etc (India data stays in India, all data must be written to two disks, this table must exist in the us and eu, etc)

automagically chooses the best db to read and write from and to

Batch and stream processing

@January 6, 2023 if i tried to build this today

build a converter between all useful query languages (or into some intermediate representation), support datalog, a graph lang, sql, Bigtable or Cassandra like, mongo (dynamo the same as mongo?), redis, key value, elastic search, splunk, spark (eg arbitrary code), streaming db’s like flink druid ksql, time-series prometheus influx, array vector scidb faiss pinecone tiledb, gis stuff? bio chem etc stuff?
1. https://arxiv.org/abs/1607.04197
1. http://vldb.org/pvldb/vol14/p2831-shaikhha.pdf
1. datalog as foundation? or some graph language? probably want to just build my own language that has entities, relationships, etc
1. are there multi-modal db’s which would make sense to steal things from?
1. compilers and query languages

Build a query router which chooses the best place[s] to read or write
1. trino
1. calcite
1. polystore query optimization

Some way to reason about consistency and isolation levels
1. timely dataflow
1. reads are writes too
1. see hermitage above
1. distributed systems

config and optimization knobs and params

Some Trino Changes

Graph support

Logical catalogs which can have multiple physical catalogs backing them
1. Potentially some way to shuffle between catalogs too
1. Some way to reason about consistency and isolation levels when inserting into logical catalog with multiple physical catalogs

OLTP support

Optimization knobs
1. Ideally at a bunch of levels
1. maybe other knobs too like where to replicate data, isolation, etc

Query language mapping

Make it work well embedded or on a single node