Replace test_injection.py with injection.ml

Arvid Jakobsson requested to merge arvid@py2ml-test_injection.py into master

Context

Translate test_injection.py into its tezt equivalent, injection.ml. This is a straight translation.

Fixes #3630 (closed)

Description

This tests the compilation, injection and activation of a test protocol in a node and the propagation of that protocol to a network.

Similar existing tezt tests

voting.ml does similar things, although as part of a longer voting process. I think a separate test is still warranted here, but at some point the logic shared by the two tests should be refactored.

Flakiness

The Python version was flaky and appeared in recent flake reports (#3685 (closed) for instance). I was unable to ascertain the source of the flakiness. It seems that sometimes a node would start downloading the protocol from the neighbor into which it had been injected, but would not finish before running into a timeout. The test would retry 120 times with a pause of 2 seconds in between, giving the node 4 minutes in total to download the protocol.
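The retry budget described above can be sketched as follows. This is a minimal illustration, not the code from test_injection.py: the helper name wait_until, its parameters and the injectable sleep function are all assumptions made for the example.

```python
import time
from typing import Callable


def wait_until(predicate: Callable[[], bool],
               attempts: int = 120,
               pause: float = 2.0,
               sleep: Callable[[float], None] = time.sleep) -> bool:
    """Poll `predicate` up to `attempts` times.

    Sleeps `pause` seconds after each failed poll and returns True as soon
    as the predicate holds, or False once the attempts are exhausted.
    `wait_until` is a hypothetical helper, not part of the Tezos test suite.
    """
    for _ in range(attempts):
        if predicate():
            return True
        sleep(pause)
    return False


# Worst-case wait with the figures quoted above: 120 * 2 s = 240 s = 4 minutes.
MAX_WAIT_SECONDS = 120 * 2.0
```

In a test, the predicate would be whatever check tells us that the node now knows the injected protocol. If that check never succeeds, the loop gives up after roughly 4 minutes, which is the failure mode described here.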

Here's an instance of a failure: https://gitlab.com/tezos/tezos/-/jobs/2916233472

Here's the log of the failing node:

# /builds/tezos/tezos/tezos-node run --data-dir /tmp/tezos-node.bc53wcgx --no-bootstrap-peers --connections 100 --synchronisation-threshold 0 --private-mode --network sandbox --peer 127.0.0.1:19730 --peer 127.0.0.1:19731 --peer 127.0.0.1:19732 --peer 127.0.0.1:19733 --peer 127.0.0.1:19734 --peer 127.0.0.1:19735 --peer 127.0.0.1:19736 --peer 127.0.0.1:19737 --peer 127.0.0.1:19738 --peer 127.0.0.1:19739 --peer 127.0.0.1:19740 --peer 127.0.0.1:19741 --peer 127.0.0.1:19742 --peer 127.0.0.1:19743 --peer 127.0.0.1:19744 --peer 127.0.0.1:19745 --peer 127.0.0.1:19746 --peer 127.0.0.1:19747 --peer 127.0.0.1:19748 --peer 127.0.0.1:19749 --peer 127.0.0.1:19750 --peer 127.0.0.1:19751 --peer 127.0.0.1:19752 --peer 127.0.0.1:19753 --peer 127.0.0.1:19754 --peer 127.0.0.1:19755 --peer 127.0.0.1:19756 --peer 127.0.0.1:19757 --peer 127.0.0.1:19758 --peer 127.0.0.1:19759 --peer 127.0.0.1:19760 --peer 127.0.0.1:19761 --peer 127.0.0.1:19762 --peer 127.0.0.1:19763 --peer 127.0.0.1:19764 --peer 127.0.0.1:19765 --peer 127.0.0.1:19766 --peer 127.0.0.1:19767 --peer 127.0.0.1:19768 --peer 127.0.0.1:19769 --peer 127.0.0.1:19770 --peer 127.0.0.1:19771 --peer 127.0.0.1:19772 --peer 127.0.0.1:19773 --peer 127.0.0.1:19774
Aug 22 15:14:17.260 - node.config.validation: the node configuration has been successfully validated.
Aug 22 15:14:17.261 - node.main: read identity file (peer_id = idto6m1tVdfgNNPd4w8iPXxHDav8qv)
Aug 22 15:14:17.261 - node.main: starting the Tezos node v0.0+dev (00000000) (chain = TEZOS)
Aug 22 15:14:17.476 - node.main: disabled local peer discovery
Aug 22 15:14:17.478 - node: shell-node initialization: bootstrapping
Aug 22 15:14:17.482 - node: shell-node initialization: p2p_maintain_started
Aug 22 15:14:17.482 - external_block_validator: initialized
Aug 22 15:14:17.746 - external_block_validator: block validator process started with pid 593
Aug 22 15:14:17.928 - node.validator: activate chain NetXdQprcVkpaWU
Aug 22 15:14:17.929 - p2p.maintenance: too few connections (0)
Aug 22 15:14:17.929 - validator.chain: Chain is bootstrapped
Aug 22 15:14:17.929 - validator.chain: Sync_status: sync
Aug 22 15:14:17.929 - node.chain_validator: no prevalidator filter found for protocol
Aug 22 15:14:17.929 - node.chain_validator:   ProtoGenesisGenesisGenesisGenesisGenesisGenesk612im
Aug 22 15:14:17.930 - node.main: starting RPC server on ::ffff:127.0.0.1:18732 (acl = AllowAll) (tls = false)
Aug 22 15:14:17.930 - node.main: the Tezos node is now running
Aug 22 15:14:22.936 - p2p.maintenance: too few connections (2)
Aug 22 15:14:24.977 - node.validator: fetching protocol PrqdDtBrWPcyb6qJ5rwULmrLhm7PQTSgfDmN7Pwnym4SiAu9kvW
Aug 22 15:14:27.940 - p2p.maintenance: too few connections (2)
Aug 22 15:14:32.945 - p2p.maintenance: too few connections (2)
...
Aug 22 15:19:33.059 - p2p.maintenance: too few connections (0)
(/builds/tezos/tezos/tezos-node) TERM: triggering shutdown.
Aug 22 15:19:33.229 - node.main: shutting down the Tezos node
Aug 22 15:19:33.229 - node.validator: shutting down the chain validator NetXdQprcVkpaWU
Aug 22 15:19:33.229 - node.distributed_db.requester: shutting down requester
Aug 22 15:19:33.229 - node.distributed_db.requester: shutting down requester
Aug 22 15:19:33.229 - node.validator: shutting down the block validator
Aug 22 15:19:33.229 - external_block_validator: shutting down
Aug 22 15:19:33.434 - external_block_validator: process terminated normally
Aug 22 15:19:33.435 - p2p: shutting down the p2p's welcome worker...
Aug 22 15:19:33.435 - p2p: shutting down the p2p's network maintenance worker...
Aug 22 15:19:33.435 - p2p: shutting down the p2p connection pool...
Aug 22 15:19:33.436 - p2p: shutting down the p2p connection handler...
Aug 22 15:19:33.436 - p2p: shutting down the p2p scheduler...
Aug 22 15:19:33.436 - node.main: shutting down the RPC server
Aug 22 15:19:33.436 - node.main: bye (exit_code = 127)

As can be seen, the node starts fetching the protocol, but then nothing more happens until the timeout is reached after ~4 minutes.

I could not reproduce the flakiness locally and did not investigate further due to time constraints.

So what have we learned? Not much: if the translation fixes the flakiness, then the improvement comes either from some of the safeguards built into tezt/lib_tezos (more waiting on events, etc.) or from the lack of a timeout. In the latter case, we would learn that the node can be quite slow at downloading protocols.

Notes on the translation

It took roughly 4x the estimated time (a day instead of 2 hours), even though I spent less time on analysis than I'd have liked 🤷

Manually testing the MR

Checklist

  • Document the interface of any function added or modified (see the coding guidelines)
  • Document any change to the user interface, including configuration parameters (see node configuration)
  • Provide automatic testing (see the testing guide).
  • For new features and bug fixes, add an item in the appropriate changelog (docs/protocols/alpha.rst for the protocol and the environment, CHANGES.rst at the root of the repository for everything else).
  • Select suitable reviewers using the Reviewers field below.
  • Select as Assignee the next person who should take action on that MR
Edited by Arvid Jakobsson
