Polyglot Persistence Brain Dump
Created
Updated
Outline
Relevant Resources
- F1 Lightning
- SAP HANA ATR
- Polypheny
- “One Size Fits All”: An Idea Whose Time Has Come and Gone - Stonebraker, Çetintemel (2005)
- the commercial world will fracture into a collection of independent database engines, some of which may be unified by a common front-end parser
- DB sys arch (Hellerstein, Stonebraker, Hamilton) (2007) https://dsf.berkeley.edu/papers/fntdb07-architecture.pdf
- handful of successful DBMS implementations exist. The community of people involved in designing and implementing database systems is tight: many attended the same schools, worked on the same influential research projects, and collaborated on the same commercial products
- Martin Fowler polyglot persistence https://martinfowler.com/bliki/PolyglotPersistence.html
- Gcloud blog OLTP OLAP barrier https://cloud.google.com/blog/products/databases/transactional-and-analytical-workloads-unite
- Snowflake Unistore, Hybrid Tables
- Trino
- TileDB
- Materialize, Timely and Differential Dataflow
- CosmosDB
- Singlestore
- Jepsen
- Some popular relevant interesting DBs or ideas: SQLite, DuckDB, Postgres, Iceberg, CrateDB, ElasticSearch, Mongo, Redis, Neo4j, Cassandra, Druid, Spark and Snowpark, S3, zookeeper and etcd, GIS stuff, JanusGraph, Memgraph, Fauna, TypeDB or GRAKN, InfluxDB, Kafka and Redpanda, RDF, triple store, JDBC ODBC REST GraphQL gRPC Thrift Kafka Websocket, SQL sparkl gremlin tinkerpop data log, stream and batch processing, flink
- BigDAWG
Gave Some Thought to an Initial Prototype
Codename Amsterdam (future ones will be cities with increasing letter names: belfast, copenhagen, dublin, edinburgh, florence, grenada, h? london, munich, york, zurich
this is what i want to implement first and then benchmark some OLTP and OLAP queries on sqlite and duckdb. i think prototyping this will help me make the problem more concrete and highlight gaps in my knowledge, etc.
features it should have
Single Node or Distributed (can run on premises, cloud, etc)
Intelligently shuffles data around and caches it automagically for you to optimize various params, note that the same logical entity may have 1-n physical entities for it
Can set optimization config and parameters (eg cost, perf, etc), these could be a simplified layer on top of more low level knobs (eg ottertune style)
Most config and parameters can be set at various levels (global, cluster, user, schema. or namespace, table or entity type, row or entity, column or attribute, query, etc)
a bunch of query languages on top of the data (eg graph, sql, document, kv, graphql, rest, grpc, etc). presumably the base data is stored in the most flexible graph way and the other languages are just derived from this, though we could do sql and use recursive ctes or udfs or something to do graph
Some default datastore[s] come bundled (eg sqlite or Postgres, duckdb) but then other ones you add or configure
you can do all crud sql operations eg select insert update delete
how should schema be handled? graph schema which can be translated to other schemas? or keep it simple
you can specify the level of “isolation” at various levels
more details on isolation
- https://github.com/ept/hermitage
The SQL standard tried to define four isolation levels (read uncommitted, read committed, repeatable read and serializable), but its definition is flawed. Several researchers have tried to nail down more precise definitions of weak (i.e. non-serializable) isolation levels. In particular:
- Peter Bailis, Alan Fekete, Ali Ghodsi, Joseph M. Hellerstein, and Ion Stoica: "Scalable Atomic Visibility with RAMP Transactions" at ACM Transactions on Database Systems, Vol. 41, No. 3, Article 15, Publication date: July 2016.
- Peter Bailis, Aaron Davidson, Alan Fekete, Ali Ghodsi, Joseph M Hellerstein and Ion Stoica: “Highly Available Transactions: Virtues and Limitations (Extended Version),” at 40th International Conference on Very Large Data Bases (VLDB), September 2014.
- Alan Fekete, Dimitrios Liarokapis, Elizabeth O'Neil, Patrick O'Neil, and Dennis Shasha: “Making Snapshot Isolation Serializable,” ACM Transactions on Database Systems (TODS), volume 30, number 2, pages 492–528, June 2005. doi:10.1145/1071610.1071615
- Readings in Database Systems, edited by Joseph M. Hellerstein and Michael Stonebraker, 4th edition, MIT Press, 2005. ISBN: 978-0-262-69314-1
- Atul Adya, Barbara Liskov, Patrick O’Neil: "Generalized Isolation Level Definitions" Appears in the Proceedings of the IEEE International Conference on Data Engineering, San Diego, CA, March 2000
- Atul Adya: “Weak Consistency: A Generalized Theory and Optimistic Implementations for Distributed Transactions,” PhD Thesis, Massachusetts Institute of Technology, Cambridge, MA, USA, March 1999.
- Hal Berenson, Phil Bernstein, Jim Gray, Jim Melton, Elizabeth O'Neil and Patrick O'Neil: “A Critique of ANSI SQL Isolation Levels,” at ACM International Conference on Management of Data (SIGMOD), volume 24, number 2, May 1995. doi:10.1145/568271.223785
- Jim N Gray, Raymond A Lorie, Gianfranco R Putzolu, and Irving L Traiger: “Granularity of Locks and Degrees of Consistency in a Shared Data Base,” in Modelling in Data Base Management Systems: Proceedings of the IFIP Working Conference on Modelling in Data Base Management Systems, edited by G.M. Nijssen, Elsevier/North Holland Publishing, pages 364–394, 1976.
This project is based on the formal definition of weak isolation introduced by Adya, as extended by Bailis et al. They mathematically define certain anomalies (or phenomena) which can occur in an unrestricted concurrency model, and define isolation levels as prohibiting or preventing certain anomalies from occurring.
- https://github.com/ept/hermitage
you can specify policies for data locality, replication, durability, etc (India data stays in India, all data must be written to two disks, this table must exist in the us and eu, etc)
automagically chooses the best db to read and write from and to
Batch and stream processing
if i tried to build this today
- build a converter between all useful query languages (or into some intermediate representation), support datalog, a graph lang, sql, Bigtable or Cassandra like, mongo (dynamo the same as mongo?), redis, key value, elastic search, splunk, spark (eg arbitrary code), streaming db’s like flink druid ksql, time-series prometheus influx, array vector scidb faiss pinecone tiledb, gis stuff? bio chem etc stuff?
- datalog as foundation? or some graph language? probably want to just build my own language that has entities, relationships, etc
- are there multi-modal db’s which would make sense to steal things from?
- compilers and query languages
- Build a query router which chooses the best place[s] to read or write
- trino
- calcite
- polystore query optimization
- Some way to reason about consistency and isolation levels
- timely dataflow
- reads are writes too
- see hermitage above
- distributed systems
- config and optimization knobs and params
Some Trino Changes
- Graph support
- Logical catalogs which can have multiple physical catalogs backing them
- Potentially some way to shuffle between catalogs too
- Some way to reason about consistency and isolation levels when inserting into logical catalog with multiple physical catalogs
- OLTP support
- Optimization knobs
- Ideally at a bunch of levels
- maybe other knobs too like where to replicate data, isolation, etc
- Query language mapping
- Make it work well embedded or on a single node