Investigate thanos-store problems
This has been promoted to an epic in &648 (closed) since there are multiple action items that we need to take.
Overview
Thanos store:
- OOM kills
- bucket high latency: https://gitlab.slack.com/archives/C12RCNXK5/p1626325085106200
- continuously fails readiness and liveness probes: https://log.gprd.gitlab.net/goto/37e63ceeb47dd9ed91ff6811a2517ce5
Investigations
Memcached
Check if Memcached is best effort on Thanos, meaning if Memcached is unavailable will it store everything in memory? (write)
There seem to be two types of cache:
- in-memory cache. This doesn't seem to be used anywhere in our case since we don't use that configuration.
- Memcached. This is the one we configure.
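For reference, the in-memory variant would be configured through the same `--index-cache.config` flag. A minimal sketch, with illustrative values rather than a recommendation (this is not what we run today):

```yaml
# Hypothetical in-memory index cache config (illustrative values):
"type": "in-memory"
"config":
  "max_size": "2GB"
  "max_item_size": "16MB"
```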
Async
- Background:
- Outcome: Looking at the past week (2021-11-17 - 2021-11-24) we can see that we reached this `async-buffer-limit` (`max_async_buffer_size`) a few times.

Memory pressure might increase here: while items sit in the async queue they are kept in memory, and the Go garbage collector can't free them, which builds up memory pressure.
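To confirm this from metrics, a query along these lines could be used; the exact `reason` label value here is an assumption and should be checked against the label values actually exported:

```shell
# Hypothetical query: write operations skipped because the async buffer was
# full. The reason label value is assumed - inspect the label values on
# thanos_memcached_operation_skipped_total in your environment first.
QUERY='sum(increase(thanos_memcached_operation_skipped_total{reason="max-async-buffer-full", env="gprd"}[7d]))'
# Run it against the Thanos query API, e.g.:
#   curl -sG 'https://thanos.gitlab.net/api/v1/query' --data-urlencode "query=${QUERY}"
echo "$QUERY"
```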
What happens when fetch are unable to happen? (read)
- Background:
  - The entrypoint is `GetMulti`. It doesn't seem like we are batching requests to memcached, since `max_get_multi_batch_size` is set to `0`; not sure if this will help.
At the moment Memcached seems to have only a 4 GB limit, which is quite odd given that we give thanos-store 20 GB.
- Current Memcached configuration in `gprd`:

  ```yaml
  --index-cache.config=
    "config":
      "addresses":
        - "memcached-thanos-index-cache.monitoring.svc.cluster.local:11211"
      "dns_provider_update_interval": "10s"
      "max_async_buffer_size": 10000
      "max_async_concurrency": 20
      "max_get_multi_batch_size": 0
      "max_get_multi_concurrency": 100
      "max_idle_connections": 100
      "max_item_size": "16MB"
      "timeout": "500ms"
    "type": "memcached"
  ```

- Currently `4gb` is set on `gprd` for the max amount of memory used: https://thanos.gitlab.net/graph?g0.expr=memcached_limit_bytes%7Benv%3D%22gprd%22%7D&g0.tab=0&g0.stacked=0&g0.range_input=1h&g0.max_source_resolution=0s&g0.deduplicate=1&g0.partial_response=0&g0.store_matches=%5B%5D
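The limit can also be checked against memcached itself; a sketch, assuming in-cluster network access to the memcached service (the `nc` pipeline is shown as a comment since it needs that access):

```shell
# memcached reports its configured memory limit as limit_maxbytes in `stats`:
#   printf 'stats\r\nquit\r\n' \
#     | nc memcached-thanos-index-cache.monitoring.svc.cluster.local 11211 \
#     | awk '$2 == "limit_maxbytes" {print $3}'
# If the 4gb limit from the metric is accurate, that prints 4 GiB in bytes:
echo $((4 * 1024 * 1024 * 1024))
```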
Check if `max_item_size` is a problem
- Background: When we are writing to memcached there is a metric `thanos_memcached_operation_skipped_total{reason="max-item-size"}`.
- Outcome: Looking at the past week (2021-11-17 - 2021-11-24) it doesn't seem like we hit this in `gprd` at all, so this setting seems to be tuned well for us.
Container memory usage
Find out why there is a difference between Go and container memory usage
There seem to be some discrepancies between `go_memstats_alloc_bytes` and `container_memory_usage_bytes`, for example:
- `go_memstats_alloc_bytes` only reports around 2GB.
- `container_memory_usage_bytes` reports around 16GB, so where is the other 14GB going?
  - Looking at the `container_memory_usage_bytes` definition in https://github.com/google/cadvisor/issues/913 it seems to include inactive memory as well.
Getting profile from pprof on gstg regional cluster
- Get the node name: `NODE=$(kubectl -n monitoring get po thanos-store-2-1 -o json | jq '.spec.nodeName' -r)`
- Get the zone of the node: `ZONE=$(gcloud compute instances list --filter name=$NODE --format="value(zone)")`
- Get the IP address of the pod: `kubectl -n monitoring get po thanos-store-2-1 -o json | jq '.status.podIP' -r`
- SSH into the machine: `gcloud compute ssh $NODE --zone $ZONE`
  - Launch toolbox: `toolbox`
  - Run curl against the IP from the previous step: `curl -L 10.227.22.9:10902/debug/pprof/heap -o heap.pprof`
  - Copy it so it's accessible on the node disk: `cp heap.pprof /media/root/home/$USER/`
- Copy the `heap.pprof` locally to view it with `go tool pprof`: `gcloud compute scp --zone $ZONE $NODE:/home/$USER/heap.pprof /tmp/heap.pprof`
- See the profile: `go tool pprof -http :9090 /tmp/heap.pprof`
Looking at the flame graph we can see that the Go heap size is only around 1-2GB (which matches the metric).
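As an aside, SSHing into the node shouldn't be strictly necessary for pprof; a simpler sketch using `kubectl port-forward` (assuming direct kubectl access to the pod; the command is built as a string here since it needs cluster access to run):

```shell
# Hypothetical shortcut: forward the thanos-store HTTP port locally and pull
# the heap profile straight from /debug/pprof, skipping the node SSH steps.
NS=monitoring
POD=thanos-store-2-1
CMD="kubectl -n $NS port-forward $POD 10902:10902"
echo "$CMD &"
echo "curl -sL localhost:10902/debug/pprof/heap -o /tmp/heap.pprof"
echo "go tool pprof -http :9090 /tmp/heap.pprof"
```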
Getting memory profile of the whole container
We are not able to get a memory profile because Google COS doesn't seem to support this:
steve@gke-gstg-gitlab-gke-default-2-64acc0ef-v0kh /run/containerd $ sudo perf mem record -a sleep 60 --cgroup $CONTAINER_CGROUP
failed: memory events not supported
However, we can get memory stats from the cgroup's `memory.stat` file:
- Get the node name: `NODE=$(kubectl -n monitoring get po thanos-store-2-1 -o json | jq '.spec.nodeName' -r)`
- Get the zone of the node: `ZONE=$(gcloud compute instances list --filter name=$NODE --format="value(zone)")`
- SSH into the machine: `gcloud compute ssh $NODE --zone $ZONE`
  - Launch toolbox and install jq: `toolbox apt-get install jq`
  - Get the container ID: `CONTAINER_ID=$(crictl ps -q --name thanos-store)`
  - Get the memory cgroup: `CGROUP_MEMORY=$(sudo cat /run/containerd/runc/k8s.io/$CONTAINER_ID/state.json | toolbox -q jq -r '.cgroup_paths.memory')`
  - Get current memory: `sudo cat $CGROUP_MEMORY/memory.usage_in_bytes`. This reflects the `container_memory_usage_bytes` metric.
  - Get memory stats: `sudo cat $CGROUP_MEMORY/memory.stat`. To better understand this we can read https://www.kernel.org/doc/Documentation/cgroup-v1/memory.txt (5.2 stat file):

    ```
    cache 4027678720
    rss 4510638080
    rss_huge 318767104
    shmem 0
    mapped_file 3801059328
    dirty 0
    writeback 0
    swap 0
    pgpgin 35534334
    pgpgout 33527379
    pgfault 27764913
    pgmajfault 30855
    inactive_anon 0
    active_anon 4510744576
    inactive_file 1901416448
    active_file 2125975552
    unevictable 0
    hierarchical_memory_limit 16106127360
    hierarchical_memsw_limit 9223372036854771712
    total_cache 4027678720
    total_rss 4510638080
    total_rss_huge 318767104
    total_shmem 0
    total_mapped_file 3801059328
    total_dirty 0
    total_writeback 0
    total_swap 0
    total_pgpgin 35534334
    total_pgpgout 33527379
    total_pgfault 27764913
    total_pgmajfault 30855
    total_inactive_anon 0
    total_active_anon 4510744576
    total_inactive_file 1901416448
    total_active_file 2125975552
    total_unevictable 0
    ```

- The sum of `cache` and `rss` is the actual `memory.usage_in_bytes`, according to 5.5 usage_in_bytes.
Conclusion
So the reason `go_memstats_alloc_bytes` reports less than `container_memory_usage_bytes` is that cache and rss are high. The Go runtime doesn't free up that memory; it keeps it to be reused by future allocations.
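One way to see how much of the gap is memory the Go runtime is holding on to (rather than live heap) is to compare heap-idle against heap-released; a query sketch, where the `job` label value is an assumption:

```shell
# Hypothetical query: heap memory the Go runtime keeps reserved but hasn't
# returned to the OS. Both metrics are standard Go runtime metrics exported
# by the process; the job label is assumed.
QUERY='go_memstats_heap_idle_bytes{job="thanos-store"} - go_memstats_heap_released_bytes{job="thanos-store"}'
echo "$QUERY"
```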
Memory spike events
2021-11-24 7:10 UTC (Thanos was unable to get data `memcached`)
Background
- Logs for `thanos-store-8-1` (highest memory usage) show `read tcp 10.222.70.12:52354->10.222.0.8:11211: i/o timeout`.
- Looking at the other `thanos-store` pods we see the same error: Source
- Memcached timeouts seem to be a known problem in Thanos: https://github.com/thanos-io/thanos/issues/1979
  - Looking at graphs from the suggested queries in https://github.com/thanos-io/thanos/issues/1979#issuecomment-573172023:
    - `rate(thanos_memcached_operations_total{env="gprd"}[1m])` 👉 screenshot: Everything looks normal.
    - `histogram_quantile(1, rate(thanos_memcached_operation_duration_seconds_bucket[1m]))` 👉 screenshot: We see a large spike in duration.
    - `histogram_quantile(0.9, rate(thanos_memcached_operation_duration_seconds_bucket{env="gprd"}[1m]))` 👉 screenshot: We see a spike in duration when memory usage goes up. The slower memcached is, the more memory we use.
    - `sum by (item_type) (rate(thanos_store_index_cache_hits_total[1m]) / rate(thanos_store_index_cache_requests_total[1m]))` 👉 screenshot: Everything looks normal.
  - CPU usage for `memcached` pods seems to be within normal range during the memory spike: thanos 👉 screenshot
- Thanos is using https://github.com/bradfitz/gomemcache as a memcached client, which seems to have a few problems:
- There doesn't seem to be anything useful inside of the traces.
- This seems to be a 24-hour recurring event: https://gitlab.com/gitlab-com/gl-infra/infrastructure/-/issues/13800#note_744584933
Solutions
- Reading https://github.com/thanos-io/thanos/issues/1979#issuecomment-843092954 it seems like setting `max_get_multi_batch_size` might help.
- There is also https://github.com/thanos-io/thanos/pull/4742, which could help with this.
- Is in-memory better?
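If we go with the batching suggestion, the change would be a one-line tweak to the `gprd` config shown earlier; a sketch, where the value `100` is illustrative and untested:

```yaml
# Hypothetical change: cap each GetMulti at 100 keys so large reads are
# split into batches instead of sent as one huge request.
"max_get_multi_batch_size": 100
```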
2021-11-18 16:20 UTC
Around the same time I see `memcache: unexpected response line from "set": "SERVER_ERROR out of memory storing object\r\n"`.
Looking at the total items in memcached we also see a spike around the same time.
We also see a high eviction rate during that time, so Memcached is struggling to keep items in memory and dumps the ones that it can't keep.
2021-11-23 21:38 UTC (root cause not found)
Background
- Go memory heap stayed stable around 2GB.
- OOM killer information from `journalctl -k`:

  ```
  Nov 23 21:38:26 gke-gstg-gitlab-gke-default-2-64acc0ef-v0kh kernel: Tasks state (memory values in pages):
  Nov 23 21:38:26 gke-gstg-gitlab-gke-default-2-64acc0ef-v0kh kernel: [  pid  ]   uid  tgid total_vm      rss pgtables_bytes swapents oom_score_adj name
  Nov 23 21:38:26 gke-gstg-gitlab-gke-default-2-64acc0ef-v0kh kernel: [   8389] 65534  8389      241        1    28672        0          -998 pause
  Nov 23 21:38:26 gke-gstg-gitlab-gke-default-2-64acc0ef-v0kh kernel: [ 793710] 65534 793710 11802571  3902395 86560768        0           490 thanos
  Nov 23 21:38:26 gke-gstg-gitlab-gke-default-2-64acc0ef-v0kh kernel: oom-kill:constraint=CONSTRAINT_MEMCG,nodemask=(null),cpuset=b0839cb5c83fbabadd205a350d76c6a75eef1082b6de97271ea5091e52aa8ac9,mems_allowed=0,oom_memcg=/kubepods/burstable/pod28779d8a-4067-4e52-88ce-becab0ee86c8,task_memcg=/kubepods/burstable/pod28779d8a-4067-4e52-88ce-becab0ee86c8/b0839cb5c83fbabadd205a350d76c6a75eef1082b6de97271ea5091e52aa8ac9,task=thanos,pid=793710,uid=65534
  Nov 23 21:38:26 gke-gstg-gitlab-gke-default-2-64acc0ef-v0kh kernel: Memory cgroup out of memory: Killed process 793710 (thanos) total-vm:47210284kB, anon-rss:15576568kB, file-rss:33012kB, shmem-rss:0kB, UID:65534 pgtables:84532kB oom_score_adj:490
  Nov 23 21:38:27 gke-gstg-gitlab-gke-default-2-64acc0ef-v0kh kernel: oom_reaper: reaped process 793710 (thanos), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
  ```
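As a cross-check of the OOM report: the `rss` column in the tasks table is in 4 KiB pages, and converting it should line up with the kB figures in the "Killed process" line:

```shell
# rss from the tasks table is in 4 KiB pages; converted to kB it should match
# anon-rss + file-rss from the "Killed process" line.
RSS_PAGES=3902395
RSS_KB=$((RSS_PAGES * 4))
echo "$RSS_KB"              # 15609580 kB, ~15.6GB
echo $((15576568 + 33012))  # anon-rss + file-rss = 15609580 kB, matches
```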
Action Items/Suggestions (WIP)
Short Term
- Increase batch size to improve cache performance.
- Decrease the async buffer size.
- Increase memory limits on `gprd` and `gstg`.
Long Term
- Make thanos-store a stateless deployment?