Irmin's LRU cache is bounded by number of elements, not by memory usage
Irmin's (index?) LRU cache is not bounded by memory usage but by number of elements. Running an ithacanet node for a while exhibits this behavior.
I can provide the resulting memtrace files of the experiments described below.
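To illustrate the general problem, here is a toy sketch in plain OCaml (not Irmin's actual implementation): a cache whose eviction is driven only by entry count keeps roughly capacity × average-entry-size bytes live, so a modest element bound still allows gigabytes of resident memory when individual values are large.

```ocaml
(* Toy count-bounded cache (FIFO eviction for brevity; a real LRU also
   tracks recency).  Nothing here looks at the byte size of the values,
   so memory usage is only bounded indirectly, via the entry count. *)
module Toy_cache = struct
  type ('k, 'v) t = {
    capacity : int;
    table : ('k, 'v) Hashtbl.t;
    order : 'k Queue.t; (* oldest key at the front *)
  }

  let create capacity =
    { capacity; table = Hashtbl.create capacity; order = Queue.create () }

  let add t k v =
    if not (Hashtbl.mem t.table k) then Queue.push k t.order;
    Hashtbl.replace t.table k v;
    (* Eviction is driven by entry *count* only. *)
    while Hashtbl.length t.table > t.capacity do
      Hashtbl.remove t.table (Queue.pop t.order)
    done
end

let () =
  let cache = Toy_cache.create 1_000 in
  (* 1_000 entries of ~1 MiB each: a "small" bound of 1_000 elements
     still pins roughly 1 GiB of live data. *)
  for i = 1 to 1_000 do
    Toy_cache.add cache i (Bytes.create (1024 * 1024))
  done;
  Printf.printf "entries kept: %d\n" (Hashtbl.length cache.Toy_cache.table)
```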
How to reproduce:
- Go to https://lambsonacid.nl/
- Download and import an ithacanet snapshot using the latest master branch:
  - ./tezos-node config init --data-dir /tmp/ithacanet --network ithacanet
  - ./tezos-node snapshot import --data-dir /tmp/ithacanet
- Patch and run the node with memtrace (a short standalone sketch of what the memtrace call does follows this step):
diff --git a/src/bin_node/dune b/src/bin_node/dune
index 4eadd6a12d..de57406b8d 100644
--- a/src/bin_node/dune
+++ b/src/bin_node/dune
@@ -9,6 +9,7 @@
(package tezos-node)
(instrumentation (backend bisect_ppx))
(libraries
+ memtrace
tezos-base
tezos-base.unix
tezos-version
diff --git a/src/bin_node/node_run_command.ml b/src/bin_node/node_run_command.ml
index a9ba9d797f..822acd5164 100644
--- a/src/bin_node/node_run_command.ml
+++ b/src/bin_node/node_run_command.ml
@@ -417,6 +417,7 @@ let init_rpc (config : Node_config_file.t) node =
let run ?verbosity ?sandbox ?target ~singleprocess ~force_history_mode_switch
~prometheus_config (config : Node_config_file.t) =
+ let () = Memtrace.trace_if_requested ~context:"my program" () in
let open Lwt_tzresult_syntax in
let* () = Node_data_version.ensure_data_dir config.data_dir in
(* Main loop *)
=> MEMTRACE=trace.ctf ./tezos-node run --data-dir /tmp/ithacanet --synchronisation-threshold 0
N.B. adding the --synchronisation-threshold 0 option makes the node run the mempool while bootstrapping the chain, helping to exhibit the bug faster.
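For reference, here is a minimal standalone sketch of what the memtrace instrumentation added in the patch above does, assuming the stock memtrace opam package (same call and context string as in the patch): when the MEMTRACE environment variable names a file, a CTF trace of sampled allocations is written there; otherwise the call is a no-op.

```ocaml
(* Build against the memtrace library (as the dune patch above adds), then
   run with e.g.  MEMTRACE=trace.ctf ./a.out  to obtain a trace.ctf file. *)
let () =
  Memtrace.trace_if_requested ~context:"my program" ();
  (* Allocate something so the trace has content to look at. *)
  let junk = List.init 100_000 (fun i -> Bytes.create (100 + (i mod 1000))) in
  ignore (List.length junk)
```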
- Let it synchronize for about 1h; the resident memory should grow linearly. Then examine the trace (this requires memtrace-viewer, which somehow is not available past OCaml 4.11.0):
=> The main live allocations originate from the mempool component, which loads a (large) key from disk (i.e. the cache domain committed as a single key) that is then retained in Irmin's cache. Reducing the LRU size of the Irmin store seems to give an upper bound on memory usage.
- Restart the whole process but now include this patch:
diff --git a/src/lib_context/context.ml b/src/lib_context/context.ml
index 55458036ea..7b2315c12b 100644
--- a/src/lib_context/context.ml
+++ b/src/lib_context/context.ml
@@ -500,7 +500,7 @@ let add_predecessor_ops_metadata_hash v hash =
let init ?patch_context ?(readonly = false) root =
let index_log_size = Option.value ~default:2_500_000 !index_log_size in
- Store.Repo.v (Irmin_pack.config ~readonly ~index_log_size root)
+ Store.Repo.v (Irmin_pack.config ~readonly ~index_log_size ~lru_size:1_000 root)
>|= fun repo -> {path = root; repo; patch_context; readonly}
let close index = Store.Repo.close index.repo
After one hour of synchronisation, the tezos-node process seems to reach an upper memory bound and the resident memory remains stable.
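A possible follow-up, sketched here purely hypothetically (the lru_size reference below does not exist in lib_context; the name and default mirror the index_log_size pattern already visible in the patch above): expose the LRU bound as a setting instead of hardcoding 1_000, so it can be tuned without recompiling.

```ocaml
(* Hypothetical sketch, meant to slot into src/lib_context/context.ml next to
   the existing index_log_size reference; not a standalone program. *)
let lru_size : int option ref = ref None

let init ?patch_context ?(readonly = false) root =
  let index_log_size = Option.value ~default:2_500_000 !index_log_size in
  let lru_size = Option.value ~default:1_000 !lru_size in
  Store.Repo.v (Irmin_pack.config ~readonly ~index_log_size ~lru_size root)
  >|= fun repo -> {path = root; repo; patch_context; readonly}
```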