Baker waits for endorsements even though they exist in the node
With the Cryptium Labs baker we have seen an issue about once a month, where the baking process logs that it doesn't have enough endorsements and hence waits ~30 seconds to bake its block. Other bakers have seen this issue too, but much less frequently.
Initially, we thought this was due to us not being connected well enough, but running ./tezos-admin-client p2p stats
shows ~100 connected peers. Furthermore, after the first time we saw this issue, we increased our p2p limits to ensure that we are always well connected.
This time I noticed something strange. Previously we simply missed those blocks. This time we didn't miss them, but our blocks ended up on alternative chains. My current assumption is that this problem is really two problems disguised as one. The first problem was that we weren't connected well enough, and hence when the baker waited we simply missed those blocks.
The second problem is that the baker may have a bug where, even though there are endorsements in the node, it thinks that there aren't any and hence waits. In that case the blocks are reorged instead of just missed. When the baker finally bakes a block it includes all the endorsements, as can be seen here. This implies either that the node received all endorsements while the baker was waiting, or that the node had all endorsements available the whole time and only the baker observed otherwise and hence waited. The second explanation is backed by the fact that as soon as both processes are restarted, the baker knows about and has access to enough endorsements. Furthermore, no more blocks are reorged.
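To tell the two explanations above apart for a given delayed block, one could compare when the node received each endorsement with when the baker started waiting. Below is a minimal sketch of that comparison. It assumes the timestamps have already been pulled out of the node and baker logs by hand; the function name, its parameters and the numbers in the example are all made up for illustration and are not part of any Tezos tool.

```python
from typing import Sequence


def classify_wait(arrival_times: Sequence[float],
                  wait_start: float,
                  required: int) -> str:
    """Classify one delayed bake (hypothetical helper, not a Tezos API).

    arrival_times: node-side receive times, in seconds, of the endorsements
                   for the level in question (assumed taken from node logs)
    wait_start:    time at which the baker logged that it was waiting
    required:      number of endorsements the baker wanted before baking
    """
    # Count endorsements the node already held when the baker began to wait.
    present_at_start = sum(1 for t in arrival_times if t <= wait_start)
    if present_at_start >= required:
        # The node had enough endorsements, so the baker's view was wrong.
        return "node-had-them"
    # Otherwise the endorsements only reached the node during the wait.
    return "arrived-late"


# Made-up numbers: every endorsement arrives before the baker starts waiting.
print(classify_wait([1.0] * 32, wait_start=5.0, required=24))
```

If a delayed block comes out as "node-had-them", that points at the baker rather than at connectivity. If it comes out as "arrived-late", the endorsements really did reach the node late and the p2p side is the more likely culprit.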
To cut a long description short, I think that there may be a bug in the baker, where it thinks that it doesn't have access to any endorsements and hence waits, even though in reality it has access to all endorsements when it finally bakes the block.