Irmin's LRU cache is not bounded by memory but by number of elements

Irmin's (index?) LRU cache is not bounded by memory but by number of elements. Running an ithacanet node for a while exhibits this behavior.
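To illustrate the failure mode, here is a minimal, hypothetical count-bounded cache (not Irmin's implementation; eviction is in insertion order for brevity). Because eviction triggers on the number of entries, the total bytes held are unbounded when individual values are large:

```ocaml
(* Hypothetical sketch of a cache capped by entry COUNT: nothing limits
   the total bytes resident, so large values make memory grow freely. *)
module Lru = struct
  type 'a t = {
    capacity : int;                    (* max number of entries *)
    tbl : (string, 'a) Hashtbl.t;
    order : string Queue.t;            (* insertion order, oldest first *)
  }

  let create capacity =
    { capacity; tbl = Hashtbl.create capacity; order = Queue.create () }

  let add t key value =
    if not (Hashtbl.mem t.tbl key) then begin
      if Queue.length t.order >= t.capacity then
        Hashtbl.remove t.tbl (Queue.pop t.order);  (* evict oldest entry *)
      Queue.push key t.order
    end;
    Hashtbl.replace t.tbl key value

  (* Total bytes held: the cache itself never consults this. *)
  let total_bytes t =
    Hashtbl.fold (fun _ v acc -> acc + Bytes.length v) t.tbl 0
end

let () =
  let cache = Lru.create 1_000 in
  (* 100 entries of 1 MiB each stay resident: ~100 MiB, far below the
     1_000-entry capacity, so nothing is ever evicted. *)
  for i = 1 to 100 do
    Lru.add cache (string_of_int i) (Bytes.create (1024 * 1024))
  done;
  Printf.printf "entries=%d bytes=%d\n"
    (Hashtbl.length cache.Lru.tbl) (Lru.total_bytes cache)
```

With large values (such as a whole cache domain committed as a single key), the only effective memory bound is `capacity * max_value_size`, which the cache never measures.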

I can provide the resulting memtrace files of the experiments described below.

How to reproduce:

  • Go to https://lambsonacid.nl/

  • Download and import an ithacanet snapshot using the latest master branch:

    • ./tezos-node config init --data-dir /tmp/ithacanet --network ithacanet
    • ./tezos-node snapshot import --data-dir /tmp/ithacanet
  • Patch and run the node with memtrace:

diff --git a/src/bin_node/dune b/src/bin_node/dune
index 4eadd6a12d..de57406b8d 100644
--- a/src/bin_node/dune
+++ b/src/bin_node/dune
@@ -9,6 +9,7 @@
  (package tezos-node)
  (instrumentation (backend bisect_ppx))
  (libraries
+  memtrace
   tezos-base
   tezos-base.unix
   tezos-version

diff --git a/src/bin_node/node_run_command.ml b/src/bin_node/node_run_command.ml
index a9ba9d797f..822acd5164 100644
--- a/src/bin_node/node_run_command.ml
+++ b/src/bin_node/node_run_command.ml
@@ -417,6 +417,7 @@ let init_rpc (config : Node_config_file.t) node =

 let run ?verbosity ?sandbox ?target ~singleprocess ~force_history_mode_switch
     ~prometheus_config (config : Node_config_file.t) =
+  let () = Memtrace.trace_if_requested ~context:"my program" () in
   let open Lwt_tzresult_syntax in
   let* () = Node_data_version.ensure_data_dir config.data_dir in
   (* Main loop *)

=> MEMTRACE=trace.ctf ./tezos-node run --data-dir /tmp/ithacanet --synchronisation-threshold 0

N.B.: adding the --synchronisation-threshold 0 option makes the node run the mempool while bootstrapping the chain, which helps exhibit the bug faster.

  • Let it synchronize for about 1 hour; the resident memory should grow linearly. Then examine the trace (this requires memtrace-viewer, which somehow is not available past OCaml 4.11.0):

    => The main live allocations originate from the mempool component, which loads a (large) key from disk (i.e., the cache domain committed as a single key); these values are retained in Irmin's cache. Reducing the LRU size on the Irmin store seems to put an upper bound on memory usage.

  • Restart the whole process but now include this patch:

diff --git a/src/lib_context/context.ml b/src/lib_context/context.ml
index 55458036ea..7b2315c12b 100644
--- a/src/lib_context/context.ml
+++ b/src/lib_context/context.ml
@@ -500,7 +500,7 @@ let add_predecessor_ops_metadata_hash v hash =

 let init ?patch_context ?(readonly = false) root =
   let index_log_size = Option.value ~default:2_500_000 !index_log_size in
-  Store.Repo.v (Irmin_pack.config ~readonly ~index_log_size root)
+  Store.Repo.v (Irmin_pack.config ~readonly ~index_log_size ~lru_size:1_000 root)
   >|= fun repo -> {path = root; repo; patch_context; readonly}

 let close index = Store.Repo.close index.repo

In one hour of synchronisation, the tezos-node process seems to reach an upper memory bound and the resident memory remains stable.
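Shrinking `lru_size` only works around the problem by lowering the entry count; the underlying issue remains that eviction ignores value sizes. A sketch of the direction a real fix could take is a cache bounded by an approximate byte budget. This is purely illustrative and not Irmin's API; `weight` and `max_bytes` are hypothetical names, and eviction is again insertion-order for brevity:

```ocaml
(* Hypothetical weight-bounded cache: the caller supplies a size
   estimate per value, and eviction runs until the total weight fits
   under a byte budget rather than an entry count. *)
module Weighted_lru = struct
  type 'a t = {
    max_bytes : int;                   (* illustrative byte budget *)
    weight : 'a -> int;                (* caller-supplied size estimate *)
    tbl : (string, 'a) Hashtbl.t;
    order : string Queue.t;            (* insertion order, oldest first *)
    mutable used : int;                (* current total weight *)
  }

  let create ~max_bytes ~weight =
    { max_bytes; weight;
      tbl = Hashtbl.create 16; order = Queue.create (); used = 0 }

  (* Drop oldest entries until the budget is respected. *)
  let rec evict t =
    if t.used > t.max_bytes && not (Queue.is_empty t.order) then begin
      let k = Queue.pop t.order in
      (match Hashtbl.find_opt t.tbl k with
       | Some v -> t.used <- t.used - t.weight v; Hashtbl.remove t.tbl k
       | None -> ());
      evict t
    end

  let add t key value =
    (match Hashtbl.find_opt t.tbl key with
     | Some old -> t.used <- t.used - t.weight old   (* replacing a key *)
     | None -> Queue.push key t.order);
    Hashtbl.replace t.tbl key value;
    t.used <- t.used + t.weight value;
    evict t
end
```

With such a bound, a handful of very large keys (like the mempool's committed cache domain) would evict earlier entries instead of accumulating, and resident memory would track `max_bytes` regardless of individual value sizes.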