Investigate the 7:10 DNS resolve failure from thanos-store to memcached
Background
Most of the OOM kills that we are seeing in thanos-store are caused by Memcached timeouts. When looking at the logs we see a consistent burst of `read tcp xxx->xxx: i/o timeout` and `write tcp xxxx->xxx: i/o timeout` messages in our logs.
When this happens we always see memory increase in the thanos-store containers. The last termination reason and time for each container can be checked by running the following script:
```bash
#!/usr/bin/env bash
# Print the last termination reason and finish time for every thanos-store container.
for shard in {0..14}; do
  for replica in {0..1}; do
    pod="thanos-store-$shard-$replica"
    echo "$pod"
    kubectl -n monitoring get po "$pod" -o json \
      | jq '.status.containerStatuses[].lastState.terminated.reason, .status.containerStatuses[].lastState.terminated.finishedAt'
    echo "===="
    echo
  done
done
```
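To tie the timeout bursts to the restarts, something like the sketch below could count the i/o timeout messages in the previously terminated container of each pod. This is only a rough sketch: the grep pattern is taken from the messages quoted above, and it assumes the same pod naming and namespace as the script above, with a single container per pod.

```bash
#!/usr/bin/env bash
# Sketch: count i/o timeout messages in the logs of the previous (terminated)
# thanos-store container for each shard/replica. Assumes the same pod naming
# and namespace as the script above, and a single container per pod.
for shard in {0..14}; do
  for replica in {0..1}; do
    pod="thanos-store-$shard-$replica"
    count=$(kubectl -n monitoring logs "$pod" --previous 2>/dev/null | grep -c "i/o timeout")
    echo "$pod: ${count:-0} i/o timeout lines before last termination"
  done
done
```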
Potential routes to investigate
- Are the Memcached pods available at that time? (see the availability sketch after this list)
- Why are we getting i/o timeouts? Is it DNS resolution or some kind of rate limiting? (see the DNS check sketch after this list)
- Would migrating to a managed Memcached make sense here?
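For the first point, a quick way to check whether the Memcached pods were up (and whether they restarted or had events) around the 7:10 window is to look at their status and recent Pod events. This is a sketch; the `app=memcached` label selector and the namespace are assumptions and may need adjusting to the actual deployment:

```bash
# Sketch: check Memcached pod status and recent Pod events around the incident.
# The label selector (app=memcached) and the namespace are assumptions.
kubectl -n monitoring get pods -l app=memcached -o wide
kubectl -n monitoring get events --field-selector involvedObject.kind=Pod \
  --sort-by=.lastTimestamp | grep -i memcached
```

For the DNS question, one option is to run a throwaway debug pod and resolve the Memcached service name from inside the cluster, ideally in a loop around the time the timeouts occur. The service name below is a placeholder and needs to be replaced with whatever thanos-store is actually configured to dial:

```bash
# Sketch: resolve the Memcached service repeatedly from inside the cluster.
# "memcached.monitoring.svc.cluster.local" is a placeholder service name.
kubectl -n monitoring run dns-debug --rm -it --restart=Never \
  --image=busybox:1.36 -- \
  sh -c 'for i in $(seq 1 10); do nslookup memcached.monitoring.svc.cluster.local; sleep 1; done'
```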