
Conversation

@alekcz (Contributor) commented Mar 14, 2025

SUMMARY

The current import-db function does not work at scale: it requires a lot of memory and takes a very long time. Restoring our production instance of 16M datoms took 2 hours and required a huge instance. This is not practical.

This PR changes import-db so that one thread reads the serialized data token by token while another loads the datoms into the store. This reduces the restore time for 16M datoms from 2 hours to 21 seconds.
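
For intuition, here is a minimal sketch of that pipeline shape, assuming a bounded blocking queue between the two threads; read-batch! and write-batch! are hypothetical stand-ins for the token reader and the store writer, not names from this PR:

(import '[java.util.concurrent LinkedBlockingQueue])

;; Minimal producer/consumer sketch: one thread parses the backup,
;; the other writes what it produces into the store.
(defn pipeline [read-batch! write-batch!]
  (let [queue  (LinkedBlockingQueue. 128) ; bounded, so the reader cannot outrun memory
        done   ::done
        reader (future
                 (loop []
                   (if-let [batch (read-batch!)] ; nil signals end of input
                     (do (.put queue batch) (recur))
                     (.put queue done))))
        writer (future
                 (loop []
                   (let [batch (.take queue)]
                     (when-not (= batch done)
                       (write-batch! batch)
                       (recur)))))]
    (fn status []
      {:reading? (not (realized? reader))
       :writing? (not (realized? writer))})))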

This introduces three new options:

(defn import-db [conn path & opts] ...)

filter-schema?:
Ignores schema datoms when importing the db. This is helpful if your schema has evolved over time.
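
For illustration, filtering could amount to dropping datoms whose attribute is a schema attribute; the attribute set below is an assumption for this sketch, not necessarily the exact set the PR uses:

;; Hypothetical filter over exported [e a v tx added?] datom vectors.
(def schema-attrs
  #{:db/ident :db/valueType :db/cardinality :db/unique
    :db/index :db/doc :db/isComponent})

(defn drop-schema-datoms [datoms]
  (remove (fn [[_e a _v _tx _added]] (contains? schema-attrs a)) datoms))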

load-entities?:
Uses load-entities instead of transact. This is much faster. It is also more forgiving when initial datoms do not adhere to the current schema.
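
Roughly, the two ingestion paths look like this; batch is hypothetical, shaped as tx-data for the first call and as raw exported datom vectors for the second:

;; transact runs every batch through full transaction processing,
;; including schema validation:
(d/transact conn {:tx-data batch})

;; load-entities ingests datom vectors directly, skipping most of
;; that per-transaction overhead:
(d/load-entities conn batch)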

sync?:
When set to true, the old behaviour is preserved; when set to false, importing runs asynchronously in two threads and a status function is returned so the progress of the import can be queried:

=> (status)
{:complete? false
 :preprocessed? true
 :total-datoms 1200000
 :remaining 200000}

For backwards compatibility, the default mode is synchronous, uses transact, and does not filter the schema.
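
Concretely, the call shapes look something like this (the backup path is illustrative):

;; Default: synchronous, transact-based, schema included (old behaviour).
(import-db conn "prod.backup.cbor")

;; Fast path: async two-thread import via load-entities, schema filtered out.
(def status
  (import-db conn "prod.backup.cbor"
             {:sync? false :load-entities? true :filter-schema? true}))

;; Poll the returned status function until the import completes.
(while (not (:complete? (status)))
  (Thread/sleep 1000))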

Checks

Bugfix
  • Related issues linked using fixes #633
Feature
  • Implements an existing feature request.
  • Related issues linked using fixes #633
  • Integration tests added
  • Documentation added
  • Architecture Decision Record added
  • Formatting checked

ADDITIONAL INFORMATION

I would appreciate a second pair of eyes on this. Here's how you can run your own test:

(comment
  ;; include
  ;; [io.replikativ/datahike-jdbc "0.3.49"]
  ;; [org.xerial/sqlite-jdbc "3.41.2.2"]
  (require '[stub-server.migrate :as m] :reload)
  (require '[datahike.api :as d])
  (require '[clojure.java.io :as io])
  (require '[datahike-jdbc.core])
  (def dev-db "./temp/sqlite/ingest")
  (io/make-parents dev-db)
  ;; for testing large dbs it's best to use SQLite; the file store may run out of resources (threads or pointers)
  ;; my test: 21 seconds with SQLite and 16M datoms
  (def cfg      {:store {:backend :jdbc :dbtype "sqlite" :dbname dev-db}
                 :keep-history? true
                 :allow-unsafe-config true
                 :store-cache-size 20000
                 :search-cache-size 20000})
  (d/create-database cfg)
  (def conn (d/connect cfg))
  (def status (m/import-db conn "prod.backup.cbor" {:sync? true :filter-schema? false})))

@alekcz (Contributor, Author) commented Mar 14, 2025

@whilo @TimoKramer @awb99 @yflim a review would be greatly appreciated 🙏

@whilo (Member) left a comment

@alekcz see comment.

:total-datoms @datom-count
:remaining (- @datom-count @txn-count)})))

(comment
@whilo (Member):

@alekcz Thank you so much for helping with this! Overall it looks good to me and it is great that this code performs well for you. However, I cannot evaluate this comment block, stub-server is not in scope and the alias d is already loaded. Could you turn this into a test?
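
A rough starting point for such a test, assuming an in-memory store and a small backup fixture checked into test resources (the fixture path and namespace layout are assumptions):

(ns datahike.test.import-db-test
  (:require [clojure.test :refer [deftest is]]
            [datahike.api :as d]
            [datahike.migrate :as m]))

(deftest import-db-restores-datoms
  (let [cfg  {:store {:backend :mem :id "import-db-test"}
              :keep-history? true}
        _    (d/create-database cfg)
        conn (d/connect cfg)]
    (m/import-db conn "test/resources/small.backup.cbor"
                 {:sync? true :filter-schema? false})
    (is (pos? (d/q '[:find (count ?e) . :where [?e ?a ?v]] @conn)))
    (d/delete-database cfg)))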

@alekcz mentioned this pull request Mar 31, 2025
@whilo (Member) commented Apr 11, 2025

I read through this again and I am fine with merging it even without the test, since it is already covered and definitely better than before. I am wondering whether it is right to export the datoms sorted by eid, though; I would have guessed it would be better to sort by tx-id. Also, sorting is slow. @alekcz Do you even use export-db?
