
Conversation

@alekcz (Contributor) commented Mar 14, 2025

SUMMARY

The current import-db function does not work at scale: it requires a lot of memory and takes a very long time. Restoring our production instance of 16M datoms took 2 hours and required a huge instance. This is not practical.

This PR changes import-db so that one thread reads the serialized data token by token while another loads the datoms into the store. This reduces the restore time for 16M datoms from 2 hours to 21 seconds.
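
For intuition, here is a minimal sketch of that pipeline shape, assuming a bounded blocking queue between the two threads; read-batch! and write-batch! are hypothetical stand-ins for the token reader and the store writer, not names from this PR:

(import '[java.util.concurrent LinkedBlockingQueue])

;; Minimal producer/consumer sketch: one thread parses the backup,
;; the other writes what it produces into the store.
(defn pipeline [read-batch! write-batch!]
  (let [queue  (LinkedBlockingQueue. 128) ; bounded, so the reader cannot outrun memory
        done   ::done
        reader (future
                 (loop []
                   (if-let [batch (read-batch!)] ; nil signals end of input
                     (do (.put queue batch) (recur))
                     (.put queue done))))
        writer (future
                 (loop []
                   (let [batch (.take queue)]
                     (when-not (= batch done)
                       (write-batch! batch)
                       (recur)))))]
    (fn status []
      {:reading? (not (realized? reader))
       :writing? (not (realized? writer))})))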

This introduces three new options:

(defn import-db [conn path & opts] ...)

filter-schema?:
Ignores schema datoms when importing the db. This is helpful if your schema has evolved over time.
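
For illustration, filtering could amount to dropping datoms whose attribute is a schema attribute; the attribute set below is an assumption for this sketch, not necessarily the exact set the PR uses:

;; Hypothetical filter over exported [e a v tx added?] datom vectors.
(def schema-attrs
  #{:db/ident :db/valueType :db/cardinality :db/unique
    :db/index :db/doc :db/isComponent})

(defn drop-schema-datoms [datoms]
  (remove (fn [[_e a _v _tx _added]] (contains? schema-attrs a)) datoms))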

load-entities?:
Uses load-entities instead of transact. This is much faster. It is also more forgiving when initial datoms do not adhere to the current schema.
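
Roughly, the two ingestion paths look like this; batch is hypothetical, shaped as tx-data for the first call and as raw exported datom vectors for the second:

;; transact runs every batch through full transaction processing,
;; including schema validation:
(d/transact conn {:tx-data batch})

;; load-entities ingests datom vectors directly, skipping most of
;; that per-transaction overhead:
(d/load-entities conn batch)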

sync?:
When set to true, the old behaviour is preserved; when set to false, importing runs asynchronously in two threads and a status function is returned so the progress of the import can be queried:

=> (status)
{:complete? false
 :preprocessed? true
 :total-datoms 1200000
 :remaining 200000}

For backwards compatibility, the default mode is synchronous, uses transact, and does not filter the schema.
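
Concretely, the call shapes look something like this (the backup path is illustrative):

;; Default: synchronous, transact-based, schema included (old behaviour).
(import-db conn "prod.backup.cbor")

;; Fast path: async two-thread import via load-entities, schema filtered out.
(def status
  (import-db conn "prod.backup.cbor"
             {:sync? false :load-entities? true :filter-schema? true}))

;; Poll the returned status function until the import completes.
(while (not (:complete? (status)))
  (Thread/sleep 1000))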

Checks

Bugfix
  • Related issues linked using fixes #633
Feature
  • Implements an existing feature request.
  • Related issues linked using fixes #633
  • Integration tests added
  • Documentation added
  • Architecture Decision Record added
  • Formatting checked

ADDITIONAL INFORMATION

I would appreciate a second pair of eyes on this. Here's how you can run your own test:

(comment
  ;; include
  ;; [io.replikativ/datahike-jdbc "0.3.49"]
  ;; [org.xerial/sqlite-jdbc "3.41.2.2"]
  (require '[stub-server.migrate :as m] :reload)
  (require '[datahike.api :as d])
  (require '[clojure.java.io :as io])
  (require '[datahike-jdbc.core])
  (def dev-db "./temp/sqlite/ingest")
  (io/make-parents dev-db)
  ;; for testing large dbs it's best to use SQLite; the file store may run out of resources (threads or pointers)
  ;; my test: 21 seconds with SQLite and 16M datoms
  (def cfg      {:store {:backend :jdbc :dbtype "sqlite" :dbname dev-db}
                 :keep-history? true
                 :allow-unsafe-config true
                 :store-cache-size 20000
                 :search-cache-size 20000})
  (d/create-database cfg)
  (def conn (d/connect cfg))
  (def status (m/import-db conn "prod.backup.cbor" {:sync? true :filter-schema? false})))

@alekcz (Contributor, Author) commented Mar 14, 2025

@whilo @TimoKramer @awb99 @yflim a review would be greatly appreciated 🙏

@whilo (Member) left a comment

@alekcz see comment.

:total-datoms @datom-count
:remaining (- @datom-count @txn-count)})))

(comment
@whilo (Member):

@alekcz Thank you so much for helping with this! Overall it looks good to me and it is great that this code performs well for you. However, I cannot evaluate this comment block, stub-server is not in scope and the alias d is already loaded. Could you turn this into a test?
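
A rough starting point for such a test, assuming an in-memory store and a small backup fixture checked into test resources (the fixture path and namespace layout are assumptions):

(ns datahike.test.import-db-test
  (:require [clojure.test :refer [deftest is]]
            [datahike.api :as d]
            [datahike.migrate :as m]))

(deftest import-db-restores-datoms
  (let [cfg  {:store {:backend :mem :id "import-db-test"}
              :keep-history? true}
        _    (d/create-database cfg)
        conn (d/connect cfg)]
    (m/import-db conn "test/resources/small.backup.cbor"
                 {:sync? true :filter-schema? false})
    (is (pos? (d/q '[:find (count ?e) . :where [?e ?a ?v]] @conn)))
    (d/delete-database cfg)))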

@alekcz mentioned this pull request Mar 31, 2025
@whilo (Member) commented Apr 11, 2025

I read through this again and I am fine with merging it even without the test, since it is already covered and definitely better than before. I am wondering whether it is right to export the datoms sorted by eid, though; I would have guessed it would be better to sort by tx-id. Also, sorting is slow. @alekcz Do you even use export-db?
