Dustin's Datomic FAQ

Please contribute/discuss/correct/argue on r/Clojure so this doc can get better!

Why is Datomic so interesting (beyond some cool features)?

Datomic : Relational database :: Git : subversion/cvs

Datomic uses immutability the same way git does: to distribute and cache your data inside your application, so the queries run in your JVM. This is the opposite of an RDBMS, where the data lives in the database, so to get a consistent worldview you have to send your queries over the network to run where the consistent source of truth is, a network hop away. This property of Datomic is called code/data locality.

This means much less I/O and far fewer network round trips than imperative databases.

It solves a lot of the I/O-related accidental complexity we see today.

This matters a lot because ORM, REST, etc are increasingly seen as failed abstractions, the root cause of which can be traced to pervasive network I/O. People think ORM is about object relational mapping. I say it is not. Any sophomore programmer can code that. ORM is about the batching and caching necessary to make it work in practice.

RICH HICKEY (2012): What goes in the cache? What form does it take? When does it get invalidated? Whose problem are all these questions? Yours! Your problem! Or maybe you're buying some fancy ORM that makes it your problem with another layer on top of your problem. Now you have two problems.

Batching and caching is IMO the most important problem Datomic solves. You just don't need to think about it anymore, and that fundamentally changes the programming model and how far we can abstract.

Idealized caching is how Datomic offers functional programming in the database – database as a value – which RDBMS cannot. So if you prefer functional programming you will probably prefer Datomic. You'll get to program with composable functions as if your database was a regular data structure inside your process (kind of like a really big hashmap), because that's actually pretty much how it works, just like git.
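To make "database as a value" concrete, here is a minimal sketch in plain Clojure. It is a toy model, not Datomic's API: the names conn, names, db-before and db-after are made up for illustration, and the "database" is just an immutable set of [entity attribute value] facts held in an atom.

```clojure
;; Toy model, NOT Datomic's API: the "connection" is an atom holding an
;; immutable set of [entity attribute value] facts.
(def conn (atom #{[1 :person/name "Ada"]
                  [2 :person/name "Grace"]}))

(defn names
  "Pure function of a database value: no I/O, no locks, no cache to manage."
  [db]
  (set (for [[_ a v] db :when (= a :person/name)] v)))

(def db-before @conn)                      ; grab a value, like checking out a commit
(swap! conn conj [3 :person/name "Rich"])  ; a later "transaction"
(def db-after @conn)

(names db-before) ; still answers from the older value
(names db-after)  ; includes the newly added fact
```

Because db-before is an ordinary immutable value, any function can be composed over it without caring whether writes happen concurrently.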

Why Code/Data Locality Matters

Code/data locality is key for real apps. Everyone has seen "Latency Numbers Every Programmer Should Know". If you're going to do complex data analysis, e.g. machine learning, you want your data access to be on the short end of this chart :) When you end up on the long end, as you do with an RDBMS, this is known as the N+1 problem.
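A small simulation of the N+1 problem, as a sketch under stated assumptions: remote-query is a hypothetical stand-in for a network call to a database server, and the data and function names are invented for illustration. The point is only to count round trips.

```clojure
(def round-trips (atom 0))

;; Hypothetical data that lives "on the server".
(def remote-rows
  {:posts   [{:id 1 :author 10} {:id 2 :author 11} {:id 3 :author 10}]
   :authors {10 "Ada" 11 "Grace"}})

(defn remote-query
  "Stand-in for a network call to a database server; counts each call."
  [f]
  (swap! round-trips inc)
  (f remote-rows))

(defn author-names-n+1
  "One query for the posts, then one more query per post."
  []
  (let [posts (remote-query :posts)]
    (mapv (fn [p] (remote-query #(get-in % [:authors (:author p)])))
          posts)))

(defn author-names-local
  "Same traversal when the data is already in process: no round trips."
  [db]
  (mapv (fn [p] (get-in db [:authors (:author p)])) (:posts db)))
```

With three posts, author-names-n+1 makes four round trips, while author-names-local makes none. The traversal code is the same shape in both cases; only the location of the data differs.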

But modern-sized data doesn't fit into memory or onto a single hard drive. Distributed systems necessarily add latency, and to fix that we add caching, which hurts consistency. Datomic's core idea is to provide consistent data to your code, which is the opposite of how most DBMSs make you bring the code into the database.

The datastores of the 2020s will be designed around an immutable log because it permits both strong consistency and horizontal scaling (like git). Once you're both distributed and consistent, the problems today's stores are architected around go away. Your distributed queries can index the immutable log however they like: column-oriented, row-oriented, documents, time-oriented, graphs. Immutability means you can do all of it, as a library in your application process.
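As a rough illustration of "index the log however you like", the sketch below builds two different in-process indexes from the same immutable log. This is plain Clojure, not Datomic internals: the log contents and the function names row-index and avet-index are assumptions made for the example.

```clojure
;; A tiny immutable log of [entity attribute value tx] facts (made-up data).
(def log
  [[1 :person/name "Ada"   100]
   [1 :person/born 1815    100]
   [2 :person/name "Grace" 101]
   [2 :person/born 1906    101]])

(defn row-index
  "Row/document-oriented view: entity -> {attribute value}."
  [log]
  (reduce (fn [m [e a v _tx]] (assoc-in m [e a] v)) {} log))

(defn avet-index
  "Value-sorted view: attribute -> sorted map of value -> #{entities}."
  [log]
  (reduce (fn [m [e a v _tx]]
            (update m a (fnil update (sorted-map)) v (fnil conj #{}) e))
          {} log))

(get (row-index log) 1)                      ; all attributes of entity 1
(keys (get (avet-index log) :person/born))   ; birth years in sorted order
```

Both indexes are derived values. Since the log never changes underneath them, they can be built, cached, and thrown away freely inside the application process.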

Todo: Peer functions – https://forum.datomic.com/t/exact-unordered-match-of-multi-valued-string-field/365/3

Todo: Subqueries – https://forum.datomic.com/t/subquery-examples/345

Location queries are a great example of leveraging code/data locality to add location support to Datomic as a library (jar files, clojure code that runs in your app process): Does Datomic have location queries? (2017)

More discussion of code/data locality

"Declarative"

TOMASZ: I don't agree with what you write about relational databases. SQL is declarative and to me that makes relational databases (mostly) declarative. I understand that Hyperfiddle cannot be built on top of an RDBMS because the abstraction is leaky and you don't get a fully declarative system. That doesn't mean that to users of an RDBMS, the database isn't mostly a declarative system.

DUSTIN: If declarative means "say what you want but don't say how to do it" then SQL is definitely declarative. It's the code around the SQL that has to make decisions around "this API is slow because of too many queries, so I will combine them into a join" and then later "this JOIN is too slow on large tables so I will break it up again and add a cache" that makes the "system as a whole" less declarative over time.

Code/data locality stories are different in Cloud vs Onprem

If your app has two types of load profiles (e.g. app and analytics jobs), the peer model requires you to split your application into different jar files and manually allocate different types of peers. You need a gateway to route traffic to the right peer. In practice this is so hard that most people just spin up more/bigger peers, all homogeneous.

Task Specific Query Groups is how you manage your different types of load now. After you ship, you realize that you have this analytics job that hits hard and competes for resources with your app. You can configure a Query Group for this job with separate resources, separate caching, doesn’t touch the operational characteristics of the rest of your app. And you don’t have to build it yourself (code). You configure it in a dashboard.

Stu says support for instrumenting the classpath of nodes with application jars is a development priority for Cloud.

Will the Peer library eventually be deprecated?

marshall: no. peer will remain an active part of Datomic On-prem

hmaurer: @marshall so you still believe that the peer model has advantages? or would you push new users towards the client model?

marshall: there are definitely use cases where the peer makes more sense. we’re working on options for solving some of those issues in Cloud

marshall: if you’re getting started on a new project, yes i would suggest client, as it provides the most options. i wouldn’t suggest changing an existing application that already uses the peer

val_waeselynck: My way of summarizing it would be that Peers give you more expressive power but also more operational constraints / challenges - you then need to see which of those matter to you

Datomic vs other databases?

Datomic Vs Other Databases – r/Clojure

Datomic vs Facebook's graph datastore. Facebook's datastore is interesting, it is shockingly similar to Datomic (e.g. it is a triple store with writer/reader/storage separation for scaling), so Hyperfiddle might actually work on that (though it would be eventually consistent).

Datomic is strongly consistent

A frequent question from Datomic beginners is whether Datomic (the system) is strongly consistent, given that individual nodes may be stale. (This is weakly worded, help.) "Is Datomic strictly linearized?" is a better way to word this question, because it avoids CAP theorem ambiguities.

Yes, the key is that every query has a time-basis, except d/sync which asks "what time is it at the transactor?" (In git, you ask what time is it at origin with git fetch). See also: Datomic and CAP theorem deep dive and reddit discussion – Dustin Getz (2017)

STU: A Datomic system is defined as one transactor and N readers. Potential transactors coordinate so that only one can write at a time. In CAP parlance, Datomic has a traditional strongly-consistent model. source

At the bottom of the Datomic ACID page linked above, under heading "implications":

Datomic provides strong consistency of the entire database, while requiring only eventual consistency for the majority of actual writes. ... Another way to understand this is to consider the failure mode introduced by an eventually consistent node that is not up-to-date yet. Datomic will always see a correct log pointer, which was placed via conditional put. If some of the tree nodes are not yet visible underneath that pointer, Datomic is consistent but partially unavailable, and will become fully available when eventually happens. https://docs.datomic.com/on-prem/acid.html

IMO, the above official doc is weakly worded: specifically I object to the phrase "eventually consistent node", though the following sentences clarify. All queries have a time-basis, thus all available nodes will always respond with a consistent answer. However, a node may be unable to answer the question without blocking for ~10ms to catch up to the time-basis. Individual nodes are strictly linearized, strongly consistent, sometimes unavailable, full stop.
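The time-basis argument can be sketched with a toy model in plain Clojure. This is not Datomic's implementation or API: transact!, snapshot, balance and query-at are hypothetical names, snapshot loosely plays the role of d/db, and taking a fresh snapshot loosely plays the role of catching up via d/sync.

```clojure
;; Toy model, NOT Datomic's API. One writer appends to a single log;
;; the basis t of a database value is the number of log entries it contains.
(def log (atom []))

(defn transact!
  "Append one [entity attribute value] fact; returns the new basis t."
  [fact]
  (count (swap! log conj fact)))

(defn snapshot
  "An immutable database value as of the log's current end."
  []
  (let [facts @log]
    {:basis-t (count facts) :facts facts}))

(defn balance
  "Latest asserted value in this database value (single-attribute toy)."
  [db]
  (peek (peek (:facts db))))

(defn query-at
  "Answer f from this snapshot only if it has reached basis t;
   otherwise report that the peer must catch up first."
  [db t f]
  (if (< (:basis-t db) t)
    :must-catch-up
    (f db)))

(def t1 (transact! [1 :account/balance 100]))
(def peer-db (snapshot))                       ; this peer is at t1
(def t2 (transact! [1 :account/balance 80]))   ; the writer moves on

(balance peer-db)                 ; consistent answer as of t1
(query-at peer-db t2 balance)     ; peer is behind t2: must catch up
(query-at (snapshot) t2 balance)  ; after catching up, answers as of t2
```

The stale peer never returns a wrong answer for its basis. It either answers consistently as of t1 or declines until it has caught up, which is the "consistent, sometimes unavailable" behavior described above.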

The Myth of Poor Write Scalability

10 billion datoms in a year is 317 datoms per second. Rich told me at the Conj party you'll fail because your log got too big, not because writes can't keep up. So if, as Stu writes, Datomic can process a few thousand transactions per second, at that rate you'll exceed 10 billion datoms in a month. – Datomic and the Myth of Slow Writes
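The arithmetic behind that quote can be checked directly. The first figure follows from the text. The second part is a sketch under assumptions: "a few thousand transactions per second" is taken as 3000, and each transaction is assumed to write 2 datoms (for instance, one asserted fact plus the transaction's own timestamp datom). Both numbers are illustrative, not measurements.

```clojure
(def seconds-per-year (* 365 24 60 60))
(def target-datoms 10e9)

;; 10 billion datoms spread evenly over one year.
(def datoms-per-second (/ target-datoms seconds-per-year))

;; Assumptions for illustration only (not from the source):
(def tx-per-second 3000)
(def datoms-per-tx 2)

;; Days needed to reach 10 billion datoms at the assumed sustained rate.
(def days-to-target
  (/ target-datoms (* tx-per-second datoms-per-tx) 86400.0))

(Math/round datoms-per-second) ; rounds to the 317 quoted above
(< days-to-target 30)          ; under the stated assumptions, less than a month
```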

Stu (2015, Pro): A few points of interest here: The Datomic team currently tests up to 10 billion datoms. At ten billion datoms, the root nodes of Datomic data structures start to be significantly oversized. This puts memory pressure on every process. A ten billion datom database takes a long time (hours) to fully back up or restore. Robin's concern about performance is (a) partially correct: with a bigger database, you will probably spend more time worrying about partitions, caches, etc -- but also (b) partially wrong: nothing about this is particular to Datomic, which is designed to work well with datasets substantially larger than memory. 10 billion datoms is not a hard limit, but for all of the reasons above you should think twice before putting significantly more than 10 billion datoms in a single database. – Stu in the google group (2015)

The Datomic Cloud story for write scalability is significantly stronger, as it basically comes down to one question: how fast can DynamoDB do conditional puts? Apparently the answer is "fast", because Stu and Rich were gushing about how DynamoDB was a feat of engineering.

Multi database queries

Datomic databases are not siloed. A database is a log, and you can merge a bunch of logs together and query them as one. By "you can" I mean this is a first-class, documented feature in Datomic.

Here is a very old Hyperfiddle screenshot. See the colored lines? We use color to denote which database each piece of data came from. Two databases are joined homogeneously in the same query, and that query is efficient!

The catch is that :db.type/ref can only refer inside the same database. Multi-database models will need to use foreign keys instead of actual references. Unlike refs, which Datomic constrains to never dangle, foreign keys can dangle.
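A sketch of what a cross-database join through a foreign key might look like. The Datalog form uses two named sources ($users and $posts) and is shown as data only; on a peer it would be passed to d/q along with two database values. The attribute names (:user/email, :post/author-email and so on) are hypothetical. The plain-Clojure functions below it emulate the same join over in-memory maps, including a dangling foreign key.

```clojure
;; Datalog with two named sources, held as data here. On a peer it would be
;; run as (d/q cross-db-query users-db posts-db). Attribute names are made up.
(def cross-db-query
  '[:find ?name ?title
    :in $users $posts
    :where
    [$users ?u :user/email ?email]
    [$users ?u :user/name ?name]
    [$posts ?p :post/author-email ?email]   ; foreign key, not a ref
    [$posts ?p :post/title ?title]])

;; Plain-Clojure emulation of the same join over two separate "databases".
(def users #{{:user/email "ada@example.com" :user/name "Ada"}})
(def posts #{{:post/author-email "ada@example.com"  :post/title "Notes"}
             {:post/author-email "gone@example.com" :post/title "Orphan"}})

(defn posts-with-authors
  "Join on the shared email value, as the query above does."
  [users posts]
  (set (for [u users, p posts
             :when (= (:user/email u) (:post/author-email p))]
         [(:user/name u) (:post/title p)])))

(defn dangling-titles
  "Posts whose foreign key matches no user. A ref could not end up like this."
  [users posts]
  (let [emails (set (map :user/email users))]
    (set (for [p posts
               :when (not (contains? emails (:post/author-email p)))]
           (:post/title p)))))
```

The "Orphan" post silently drops out of the join, which is the practical cost of using foreign keys across databases: nothing enforces that the referenced user exists.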

Browser peers?

What if the datom cache was in a web browser or iphone?

Stu said at Conj dinner, they considered this when designing Datomic and there are data security issues. (Datoms have to leave the secure peer and go to untrusted third parties). However Stu did confirm in this conversation that it is technically feasible to have a tree-shaped peer hierarchy where down-tree peers populate their cache from filtered views of up-tree peers.

Projects in this space:

Implications to UI programming

Several groups have tried to sync Datomic to Datascript for client side queries, which would allow for highly dynamic UIs.

My present understanding (2018) is this is sort of possible with caveats.

If your research is not listed here, please tell me. I made this document fast and haven't had time to track down all the relevant emails and chat logs. Please help.

Internals

Datomic's sorting/pagination problem

Datomic query doesn't do sorting because Datomic query is about sets.

RICH HICKEY (2012): There is no datalog syntax for that at present. Note that it currently does no more than call take and drop on the result set, which you can do as efficiently in your programming language on a peer.

I understand the general story is to sort in the application, which is first-class performance in peer model.
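A minimal sketch of sorting and paginating in the application, in the spirit of the take and drop that Rich mentions. The result set is modeled as a plain set of tuples, which is the shape a :find query returns; the data and the page function are hypothetical.

```clojure
;; Made-up query results: an unordered set of [name rank] tuples.
(def results #{["c" 3] ["a" 1] ["d" 4] ["b" 2]})

(defn page
  "Sort the result set by sort-key, then return one page as a vector."
  [results sort-key offset limit]
  (->> results
       (sort-by sort-key)
       (drop offset)
       (take limit)
       vec))

(page results first 1 2)   ; second and third tuples when ordered by name
(page results second 0 1)  ; the tuple with the lowest rank
```

On a peer the whole result set is already in process, so this costs a local sort rather than another round trip.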

But Datomic already has the ability to maintain specific value-sorted indexes for us, which is enough for many simple cases. Is it possible to leverage this ordering for datalog queries that already need to access this index, so that the resultset will naturally come out in the right order, in these trivial cases?

Developer tips and tricks

Schema

In clojure.core, the ? suffix is for predicates, not booleans. Thus Datomic boolean attributes don't get a ?.

Idea: assert positives, not negatives. :post/draft is bad, :post/published is good, because the default state is draft and then we assert new facts to change the state.
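Both conventions can be sketched together. The schema map uses standard Datomic schema keys and is shown as data only. The attribute :post/published comes from the text above, while the entity maps and the published? helper are hypothetical.

```clojure
;; Schema as data: a boolean attribute with no ? suffix, phrased as a positive.
(def post-schema
  [{:db/ident       :post/published
    :db/valueType   :db.type/boolean
    :db/cardinality :db.cardinality/one
    :db/doc         "Asserted as true once the post is published."}])

(defn published?
  "A predicate function, so it gets the ? suffix. A post with no
   :post/published fact is a draft by default."
  [post]
  (true? (:post/published post)))

(def draft {:post/title "Hello"})                  ; nothing asserted yet
(def live  (assoc draft :post/published true))     ; new fact changes the state

(published? draft)
(published? live)
```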

Query the log

AKA "omg what did I just do?"

(comment
  ;; Assumes (require '[datomic.api :as d]) and a running transactor.
  (def src (d/connect "datomic:free://datomic:4334/starter-blog-src2"))
  (def $log (d/log src))
  (-> $log keys)                           ; (:db :olookup :root-id :tail)
  (-> $log :db :basisT)                    ; 1413
  ;; First transaction at or after t 1413
  (->> (d/tx-range $log 1413 nil) first)
  )

You can also sort of invert the transaction: https://stackoverflow.com/a/25389808/959627
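The idea behind inverting a transaction can be sketched without a database: flip every assertion into a retraction and vice versa, skipping the facts about the transaction entity itself. This is a sketch in the spirit of the linked answer, not a copy of it. Datoms are modeled as plain [e a v tx added?] vectors, and the ids and the invert-datoms name are made up for illustration.

```clojure
(defn invert-datoms
  "Given a transaction id and its datoms as [e a v tx added?] vectors,
   return transaction data that would undo them."
  [tx-id datoms]
  (vec (for [[e a v _tx added?] datoms
             :when (not= e tx-id)]          ; skip facts about the tx entity itself
         [(if added? :db/retract :db/add) e a v])))

;; Made-up example: a title was changed from "Old" to "New".
(def tx-id 13194139534313)
(def tx-datoms
  [[tx-id :db/txInstant #inst "2018-01-01" tx-id true]
   [17592186045418 :post/title "Old" tx-id false]
   [17592186045418 :post/title "New" tx-id true]])

(invert-datoms tx-id tx-datoms) ; re-asserts "Old" and retracts "New"
```

With a real log, the datoms would come from the :data of the map returned by d/tx-range, and the result would be submitted as a new transaction. History is not erased; the undo is itself recorded.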

Triage

More code/data locality rambling

DUSTIN: Philosophically, can we even consider an architecture based on mutability a good architecture? I say: not anymore. RDBMS was good architecture when CPUs were many orders of magnitude slower. The future is distributed systems, and distributed systems need immutability. Nobody who uses git says "I wish I could swap in subversion". Many of the things git can do are impossible on subversion. We are betting that this will come to pass in databases as well. Hyperfiddle will never run on Neo4J or Postgres or MongoDB. Those systems are centralized and require hand-optimized backend coding to be fast at scale due to N+1 problems, JOIN pain, fit-database-in-memory, destructive schema updates, denormalization etc. Immutability is why the next generation of distributed systems like Datomic can avoid all these problems, which is the key reason Hyperfiddle doesn't need hand-optimized backend code. Hyperfiddle's backend is generic and abstracted: it is a one-size-fits-all backend-as-a-service (which is our business model). If it turns out that in practice this is not true, then we are in trouble.

Cloud has automatic partitions

steveb8n: as I move towards clients and cloud, I should just pull out all partitioning (storage locality) code. Is there a replacement for storage locality or is this just because it doesn’t make that much difference? BTW I’m happy to simplify by ripping it out but wondering if performance will suffer

marshall: Cloud has a built-in partitioning mechanism. If you’re using clients today you’re just using a default single partition anyway (source)

d/datoms for partition

DUSTIN: d/datoms, filtered to a specific partition - Anyone know the best way to do this? Or even a way to discover what the starting entity number is for a given partition?

MARSHALL: you can use entid-at
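Why entid-at helps here can be illustrated with a toy model of entity ids. Loud assumption: the bit layout below (partition id in the high bits, t in the low 42 bits) is inferred from commonly seen On-Prem entity ids such as 17592186045418, and is an implementation detail rather than a documented contract. The starred functions are hypothetical stand-ins written for this sketch, not the real d/entid-at and d/part.

```clojure
;; ASSUMPTION: partition id in the high bits, t in the low 42 bits.
;; Inferred from typical On-Prem ids; an implementation detail, not an API.
(def t-bits 42)

(defn entid-at*
  "Toy stand-in: the smallest entity id in partition part-id at or after t."
  [part-id t]
  (bit-or (bit-shift-left part-id t-bits) t))

(defn part*
  "Toy stand-in: the partition id encoded in an entity id."
  [eid]
  (bit-shift-right eid t-bits))

(defn in-partition
  "Keep only the entity ids that belong to the given partition."
  [part-id eids]
  (filter #(= part-id (part* %)) eids))

;; Made-up ids: one in partition 3, two in partition 4.
(def eids [13194139534313 17592186045418 17592186045419])

(entid-at* 4 0)        ; where partition 4 starts under this layout
(in-partition 4 eids)  ; only the ids that fall in partition 4
```

Under this model, entities in one partition are contiguous in entity-id order. With the real API, the starting id obtained from entid-at could seed an index seek that stops as soon as the partition changes.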

More topics:

Constraints – nothing built into Datomic

Database filters and filter gotchas

d/with

No transaction functions in cloud