Skip to content

Properly use the ddb fetch mechanism and fix data-race in the mempool

vbot requested to merge nomadic-labs/tezos:vbot@fix-ddb-usage-in-mempool into master

Context

This MR fixes and optimizes mempool's DDB request. In particular, there was a data-race that made the mempool create on_arrived worker's requests more than once per resource which would result in useless computation to determine if this operation was already handled or not. With this MR, we now only create one on_arrived request per resource which, incidentally, makes the already_handled check useless in the on_arrived branch as it was already done (at least once) when we would treat the on_notify request.

This MR also ensures that we add all peers that advertised a resource to the set of "requestable" peers in case the first one fails to provide the resource instead of considering only one peer. This fixes a potential attack vector where the first peer that would advertise could decide to never answer and would result in censoring the resource. In practice, however, the previous data-race made that, while waiting for the on_arrived request to be processed, the on_notify would still call fetch_operation to all new advertising peers but only when the resource was already received.

Here are two profiling reports of the mempool activity during a block lifetime (on my laptop). These results were used to measure improvements and testing that the semantics is sound:

Before patch

head:BLJ1TQdcqF8m .............................................................. 1        13971.915ms  15% +2m43s496.972ms
  handle_unprocessed ........................................................... 6477       227.648ms 100%
    operation classified ....................................................... 853                      
  on_advertise ................................................................. 52           5.064ms 101%
  on_arrived ................................................................... 4117        22.446ms 101%
    is already handled ......................................................... 4117         9.255ms  99%
      already handled .......................................................... 3689                     
      not already handled ...................................................... 428                      
  on_flush ..................................................................... 1            6.466ms 100%
  on_notify .................................................................... 2299       692.823ms 100%
    may_fetch_operation ........................................................ 29723      652.991ms 100%
      already handled .......................................................... 25605                    
      fetching thread .......................................................... 4118       460.851ms 100%
      is already handled ....................................................... 29723      133.019ms 101%
        already handled ........................................................ 25605                    
        not already handled .................................................... 4118

After patch

head:BLavSxsSKWfi .............................................................. 1        15504.976ms  15% +1m19s421.360ms
  handle_unprocessed ........................................................... 2863       251.043ms  97%
    operation classified ....................................................... 858                      
  on_advertise ................................................................. 80           8.067ms 101%
  on_arrived ................................................................... 420          7.981ms 101%
  on_flush ..................................................................... 1            4.291ms 100%
  on_notify .................................................................... 2354       638.081ms 101%
    may_fetch_operation ........................................................ 25072      593.663ms 101%
      already handled .......................................................... 21342                    
      fetching thread .......................................................... 3317       362.849ms 100%
      is already handled ....................................................... 21755      110.184ms 101%
        already handled ........................................................ 21342                    
        not already handled .................................................... 413                      
      not being fetched - not already handled .................................. 413         53.596ms 101%
        fetching thread ........................................................ 413         52.858ms 100%

We notice that the performances are slightly better after the patch but the real gain is the significantly lower amount of worker requests (divided by 3) which could make a less performant machine struggle to process (the worker loops are not profiled yet so the potential gain are not visible). We can also see that the amount of on_arrived requests amounts to exactly the number of expected operations during a block lifetime.

Manually testing the MR

CI

Checklist

  • Document the interface of any function added or modified (see the coding guidelines)
  • Document any change to the user interface, including configuration parameters (see node configuration)
  • Provide automatic testing (see the testing guide).
  • For new features and bug fixes, add an item in the appropriate changelog (docs/protocols/alpha.rst for the protocol and the environment, CHANGES.rst at the root of the repository for everything else).
  • Select suitable reviewers using the Reviewers field below.
  • Select as Assignee the next person who should take action on that MR
Edited by vbot

Merge request reports