Irmin's LRU cache is not bounded in memory but by number of elements

Irmin's (index?) LRU cache is bounded by the number of elements it holds, not by its memory consumption. Running an ithacanet node for a while exhibits this behaviour.
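For illustration only (this is not Irmin's code, just a minimal sketch of the failure mode): a cache that evicts once it reaches N entries never looks at how large those entries are, so N tiny values and N multi-megabyte values occupy the same number of slots but retain wildly different amounts of heap.

```ocaml
(* Minimal sketch (not Irmin's implementation): a cache bounded by entry
   count. Eviction is FIFO for brevity; the point is that the bound is on
   the number of entries, not on the bytes they keep alive. *)
module Count_bounded_cache = struct
  type ('k, 'v) t =
    {capacity : int; tbl : ('k, 'v) Hashtbl.t; order : 'k Queue.t}

  let create capacity =
    {capacity; tbl = Hashtbl.create capacity; order = Queue.create ()}

  let add t k v =
    if Hashtbl.length t.tbl >= t.capacity then begin
      (* Evict one entry, regardless of how big the stored value is. *)
      let oldest = Queue.pop t.order in
      Hashtbl.remove t.tbl oldest
    end;
    Hashtbl.replace t.tbl k v;
    Queue.push k t.order

  let find_opt t k = Hashtbl.find_opt t.tbl k
end

let () =
  let cache = Count_bounded_cache.create 1_000 in
  (* 1_000 entries of 1 MiB each: the cache only ever "sees" 1_000 entries,
     the same count as 1_000 tiny values, yet it retains ~1 GiB of heap. *)
  for i = 0 to 999 do
    Count_bounded_cache.add cache i (Bytes.create (1024 * 1024))
  done;
  ignore (Count_bounded_cache.find_opt cache 0)
```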

I can provide the resulting memtrace files of the experiments described below.

How to reproduce:

  • Go to https://lambsonacid.nl/ and download a recent ithacanet snapshot

  • Import the snapshot with a node built from the latest master branch:

    • ./tezos-node config init --data-dir /tmp/ithacanet --network ithacanet
    • ./tezos-node snapshot import <snapshot-file> --data-dir /tmp/ithacanet
  • Patch and run the node with memtrace:

diff --git a/src/bin_node/dune b/src/bin_node/dune
index 4eadd6a12d..de57406b8d 100644
--- a/src/bin_node/dune
+++ b/src/bin_node/dune
@@ -9,6 +9,7 @@
  (package tezos-node)
  (instrumentation (backend bisect_ppx))
  (libraries
+  memtrace
   tezos-base
   tezos-base.unix
   tezos-version

diff --git a/src/bin_node/node_run_command.ml b/src/bin_node/node_run_command.ml
index a9ba9d797f..822acd5164 100644
--- a/src/bin_node/node_run_command.ml
+++ b/src/bin_node/node_run_command.ml
@@ -417,6 +417,7 @@ let init_rpc (config : Node_config_file.t) node =

 let run ?verbosity ?sandbox ?target ~singleprocess ~force_history_mode_switch
     ~prometheus_config (config : Node_config_file.t) =
+  let () = Memtrace.trace_if_requested ~context:"my program" () in
   let open Lwt_tzresult_syntax in
   let* () = Node_data_version.ensure_data_dir config.data_dir in
   (* Main loop *)

=> MEMTRACE=trace.ctf ./tezos-node run --data-dir /tmp/ithacanet --synchronisation-threshold 0

N.b. the --synchronisation-threshold 0 option makes the node run the mempool while bootstrapping the chain, which helps exhibit the bug faster.
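For reference, the Memtrace.trace_if_requested call added in the patch above only starts tracing when the MEMTRACE environment variable names an output file; otherwise it is a no-op. A minimal standalone program instrumented the same way (a sketch, not part of the node; the executable name is arbitrary) looks like this:

```ocaml
(* Standalone sketch of the same instrumentation. Build against the
   memtrace library, then run e.g. MEMTRACE=trace.ctf ./main.exe.
   Without MEMTRACE in the environment, nothing is traced. *)
let () = Memtrace.trace_if_requested ~context:"lru repro" ()

let () =
  (* Allocate something so the resulting trace is not empty. *)
  let blocks = List.init 10_000 (fun _ -> Bytes.create 4_096) in
  Printf.printf "allocated %d blocks\n%!" (List.length blocks)
```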

  • Let it synchronize for about 1 h; the resident memory should grow roughly linearly. Then examine the trace (this requires memtrace-viewer, which somehow is not available past OCaml 4.11.0):

    => The main live allocations originate from the mempool component, which loads a (large) key from disk (i.e. the cache domain committed as a single key); these values are then kept alive in Irmin's cache. Reducing the LRU size on the Irmin store seems to give an upper bound on the memory usage.
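A back-of-the-envelope way to see why the entry count only bounds memory indirectly (the entry sizes and counts below are illustrative assumptions, not measurements from the trace):

```ocaml
(* Rough bound sketch: a count-bounded cache can retain roughly
   lru_size * typical_entry_size bytes of live heap. All numbers here
   are made up for illustration. *)
let approx_cache_mib ~lru_size ~entry_bytes =
  lru_size * entry_bytes / (1024 * 1024)

let () =
  Printf.printf "lru_size=100_000, 10 KiB entries -> ~%d MiB\n"
    (approx_cache_mib ~lru_size:100_000 ~entry_bytes:(10 * 1024));
  Printf.printf "lru_size=1_000,   10 KiB entries -> ~%d MiB\n"
    (approx_cache_mib ~lru_size:1_000 ~entry_bytes:(10 * 1024))
```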

  • Restart the whole procedure, but now include this patch:

diff --git a/src/lib_context/context.ml b/src/lib_context/context.ml
index 55458036ea..7b2315c12b 100644
--- a/src/lib_context/context.ml
+++ b/src/lib_context/context.ml
@@ -500,7 +500,7 @@ let add_predecessor_ops_metadata_hash v hash =

 let init ?patch_context ?(readonly = false) root =
   let index_log_size = Option.value ~default:2_500_000 !index_log_size in
-  Store.Repo.v (Irmin_pack.config ~readonly ~index_log_size root)
+  Store.Repo.v (Irmin_pack.config ~readonly ~index_log_size ~lru_size:1_000 root)
   >|= fun repo -> {path = root; repo; patch_context; readonly}

 let close index = Store.Repo.close index.repo

Within one hour of synchronisation, the tezos-node process seems to reach an upper memory bound and its resident memory remains stable.
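If the reduced LRU size turns out to be the right mitigation, it may be worth making it tunable rather than hard-coding 1_000. A rough sketch of reading an override from the environment (the variable name TEZOS_CONTEXT_LRU_SIZE is hypothetical, not an existing option; the resulting value would be passed as ~lru_size to Irmin_pack.config as in the patch above):

```ocaml
(* Sketch only: pick the LRU size from a hypothetical environment
   variable, falling back to the 1_000 used in the patch above. *)
let lru_size =
  match Option.bind (Sys.getenv_opt "TEZOS_CONTEXT_LRU_SIZE") int_of_string_opt with
  | Some n when n > 0 -> n
  | _ -> 1_000

let () = Printf.printf "using lru_size = %d\n" lru_size
```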
