Investigate the 7:10 DNS resolve failure from thanos-store to memcached
Background
Most of the OOM kills that we are seeing in thanos-store are caused by Memcached timeouts. When looking at the logs we see a consistent burst of `read tcp xxx->xxx: i/o timeout` and `write tcp xxxx->xxx: i/o timeout` messages in our logs.
When this happens we always see memory increase in the thanos-store containers. The last termination reason and time for each container can be checked by running the following script:
```bash
#!/usr/bin/env bash
# Print the last termination reason and finish time for every thanos-store container.
for shard in {0..14}; do
  for replica in {0..1}; do
    pod="thanos-store-$shard-$replica"
    echo "$pod"
    kubectl -n monitoring get po "$pod" -o json \
      | jq '.status.containerStatuses[].lastState.terminated.reason, .status.containerStatuses[].lastState.terminated.finishedAt'
    echo "===="
    echo
  done
done
```
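To tie the timeout bursts to the restarts, something like the sketch below could count the i/o timeout messages in the previously terminated container of each pod. This is only a rough sketch: the grep pattern is taken from the messages quoted above, and it assumes the same pod naming and namespace as the script above, with a single container per pod.

```bash
#!/usr/bin/env bash
# Sketch: count i/o timeout messages in the logs of the previous (terminated)
# thanos-store container for each shard/replica. Assumes the same pod naming
# and namespace as the script above, and a single container per pod.
for shard in {0..14}; do
  for replica in {0..1}; do
    pod="thanos-store-$shard-$replica"
    count=$(kubectl -n monitoring logs "$pod" --previous 2>/dev/null | grep -c "i/o timeout")
    echo "$pod: ${count:-0} i/o timeout lines before last termination"
  done
done
```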
Potential routes to investigate
- Are the Memcached pods available at that time? (see the availability sketch after this list)
- Why are we getting i/o timeouts? Is it DNS resolution or some kind of rate limiting? (see the DNS check sketch after this list)
- Would migrating to a managed Memcached make sense here?
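For the first point, a quick way to check whether the Memcached pods were up (and whether they restarted or had events) around the 7:10 window is to look at their status and recent Pod events. This is a sketch; the `app=memcached` label selector and the namespace are assumptions and may need adjusting to the actual deployment:

```bash
# Sketch: check Memcached pod status and recent Pod events around the incident.
# The label selector (app=memcached) and the namespace are assumptions.
kubectl -n monitoring get pods -l app=memcached -o wide
kubectl -n monitoring get events --field-selector involvedObject.kind=Pod \
  --sort-by=.lastTimestamp | grep -i memcached
```

For the DNS question, one option is to run a throwaway debug pod and resolve the Memcached service name from inside the cluster, ideally in a loop around the time the timeouts occur. The service name below is a placeholder and needs to be replaced with whatever thanos-store is actually configured to dial:

```bash
# Sketch: resolve the Memcached service repeatedly from inside the cluster.
# "memcached.monitoring.svc.cluster.local" is a placeholder service name.
kubectl -n monitoring run dns-debug --rm -it --restart=Never \
  --image=busybox:1.36 -- \
  sh -c 'for i in $(seq 1 10); do nslookup memcached.monitoring.svc.cluster.local; sleep 1; done'
```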