Investigate thanos-store problems
This has been promoted to an epic in &648 (closed) since there are multiple action items that we need to take.
Overview
Thanos store:
- OOM kills
- bucket high latency: https://gitlab.slack.com/archives/C12RCNXK5/p1626325085106200
- continuously fails readiness and liveness probes: https://log.gprd.gitlab.net/goto/37e63ceeb47dd9ed91ff6811a2517ce5
Investigations
Memcached
Check if Memcached is best effort on Thanos, meaning if Memcached is unavailable will it store everything in memory? (write)
There seem to be two types of cache:
- in-memory cache. This doesn't seem to be used anywhere in our case since we don't use that configuration.
- Memcached. This is the one we configure.
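For reference, the in-memory variant would be configured through the same `--index-cache.config` flag. A minimal sketch, with illustrative values rather than a recommendation (this is not what we run today):

```yaml
# Hypothetical in-memory index cache config (illustrative values):
"type": "in-memory"
"config":
  "max_size": "2GB"
  "max_item_size": "16MB"
```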
Async
- Background:
- Outcome: Looking at the past week (2021-11-17 - 2021-11-24) we can see that we reached this `async-buffer-limit` (`max_async_buffer_size`) a few times.

Memory pressure might increase here: while items sit in the async queue they are kept in memory, and the Go garbage collector can't free them, which builds up memory pressure.
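To confirm this from metrics, a query along these lines could be used; the exact `reason` label value here is an assumption and should be checked against the label values actually exported:

```shell
# Hypothetical query: write operations skipped because the async buffer was
# full. The reason label value is assumed - inspect the label values on
# thanos_memcached_operation_skipped_total in your environment first.
QUERY='sum(increase(thanos_memcached_operation_skipped_total{reason="max-async-buffer-full", env="gprd"}[7d]))'
# Run it against the Thanos query API, e.g.:
#   curl -sG 'https://thanos.gitlab.net/api/v1/query' --data-urlencode "query=${QUERY}"
echo "$QUERY"
```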
What happens when fetch are unable to happen? (read)
- Background:
  - The entrypoint is `GetMulti`. It doesn't seem like we are batching requests to memcached, since `max_get_multi_batch_size` is set to `0`; not sure if this will help.
At the moment Memcached seems to have only a 4 GB limit, which is quite odd given that we give thanos-store 20 GB.
- Current Memcached configuration in `gprd`:

  ```yaml
  --index-cache.config=
    "config":
      "addresses":
        - "memcached-thanos-index-cache.monitoring.svc.cluster.local:11211"
      "dns_provider_update_interval": "10s"
      "max_async_buffer_size": 10000
      "max_async_concurrency": 20
      "max_get_multi_batch_size": 0
      "max_get_multi_concurrency": 100
      "max_idle_connections": 100
      "max_item_size": "16MB"
      "timeout": "500ms"
    "type": "memcached"
  ```

- Currently `4gb` is set on `gprd` for the max amount of memory used: https://thanos.gitlab.net/graph?g0.expr=memcached_limit_bytes%7Benv%3D%22gprd%22%7D&g0.tab=0&g0.stacked=0&g0.range_input=1h&g0.max_source_resolution=0s&g0.deduplicate=1&g0.partial_response=0&g0.store_matches=%5B%5D
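The limit can also be checked against memcached itself; a sketch, assuming in-cluster network access to the memcached service (the `nc` pipeline is shown as a comment since it needs that access):

```shell
# memcached reports its configured memory limit as limit_maxbytes in `stats`:
#   printf 'stats\r\nquit\r\n' \
#     | nc memcached-thanos-index-cache.monitoring.svc.cluster.local 11211 \
#     | awk '$2 == "limit_maxbytes" {print $3}'
# If the 4gb limit from the metric is accurate, that prints 4 GiB in bytes:
echo $((4 * 1024 * 1024 * 1024))
```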
Check if `max_item_size` is a problem
- Background: When we are writing to memcached there is a metric `thanos_memcached_operation_skipped_total{reason="max-item-size"}`.
- Outcome: Looking at the past week (2021-11-17 - 2021-11-24) it doesn't seem like we hit this in `gprd` at all, so this setting seems to be tuned well for us.
Container memory usage
Find out why there is a difference between Go and container memory usage
There seem to be some discrepancies between `go_memstats_alloc_bytes` and `container_memory_usage_bytes`, for example:
- `go_memstats_alloc_bytes` only reports around 2GB.
- `container_memory_usage_bytes` reports around 16GB, so where is the other 14GB going?
  - Looking at the `container_memory_usage_bytes` definition in https://github.com/google/cadvisor/issues/913 it seems to include inactive memory as well.
Getting profile from pprof on gstg regional cluster
- Get the node name: `NODE=$(kubectl -n monitoring get po thanos-store-2-1 -o json | jq '.spec.nodeName' -r)`
- Get the zone of the node: `ZONE=$(gcloud compute instances list --filter name=$NODE --format="value(zone)")`
- Get the IP address of the pod: `kubectl -n monitoring get po thanos-store-2-1 -o json | jq '.status.podIP' -r`
- SSH into the machine: `gcloud compute ssh $NODE --zone $ZONE`
  - Launch toolbox: `toolbox`
  - Run curl against the IP from the previous step: `curl -L 10.227.22.9:10902/debug/pprof/heap -o heap.pprof`
  - Copy it so it's accessible on the node disk: `cp heap.pprof /media/root/home/$USER/`
- Copy the `heap.pprof` locally to view it with `go tool pprof`: `gcloud compute scp --zone $ZONE $NODE:/home/$USER/heap.pprof /tmp/heap.pprof`
- See the profile: `go tool pprof -http :9090 /tmp/heap.pprof`
Looking at the flame graph we can see that the Go heap size is only around 1-2GB (which matches the metric).
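As an aside, SSHing into the node shouldn't be strictly necessary for pprof; a simpler sketch using `kubectl port-forward` (assuming direct kubectl access to the pod; the command is built as a string here since it needs cluster access to run):

```shell
# Hypothetical shortcut: forward the thanos-store HTTP port locally and pull
# the heap profile straight from /debug/pprof, skipping the node SSH steps.
NS=monitoring
POD=thanos-store-2-1
CMD="kubectl -n $NS port-forward $POD 10902:10902"
echo "$CMD &"
echo "curl -sL localhost:10902/debug/pprof/heap -o /tmp/heap.pprof"
echo "go tool pprof -http :9090 /tmp/heap.pprof"
```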
Getting memory profile of the whole container
We are not able to get a memory profile because Google COS doesn't seem to support this:
steve@gke-gstg-gitlab-gke-default-2-64acc0ef-v0kh /run/containerd $ sudo perf mem record -a sleep 60 --cgroup $CONTAINER_CGROUP
failed: memory events not supported
However, we can get memory stats from the cgroup's `memory.stat` file:
- Get the node name: `NODE=$(kubectl -n monitoring get po thanos-store-2-1 -o json | jq '.spec.nodeName' -r)`
- Get the zone of the node: `ZONE=$(gcloud compute instances list --filter name=$NODE --format="value(zone)")`
- SSH into the machine: `gcloud compute ssh $NODE --zone $ZONE`
  - Launch toolbox and install jq: `toolbox apt-get install jq`
  - Get the container ID: `CONTAINER_ID=$(crictl ps -q --name thanos-store)`
  - Get the memory cgroup: `CGROUP_MEMORY=$(sudo cat /run/containerd/runc/k8s.io/$CONTAINER_ID/state.json | toolbox -q jq -r '.cgroup_paths.memory')`
  - Get current memory: `sudo cat $CGROUP_MEMORY/memory.usage_in_bytes`. This reflects the `container_memory_usage_bytes` metric.
  - Get memory stats: `sudo cat $CGROUP_MEMORY/memory.stat`. To better understand this we can read https://www.kernel.org/doc/Documentation/cgroup-v1/memory.txt (5.2 stat file):

    ```
    cache 4027678720
    rss 4510638080
    rss_huge 318767104
    shmem 0
    mapped_file 3801059328
    dirty 0
    writeback 0
    swap 0
    pgpgin 35534334
    pgpgout 33527379
    pgfault 27764913
    pgmajfault 30855
    inactive_anon 0
    active_anon 4510744576
    inactive_file 1901416448
    active_file 2125975552
    unevictable 0
    hierarchical_memory_limit 16106127360
    hierarchical_memsw_limit 9223372036854771712
    total_cache 4027678720
    total_rss 4510638080
    total_rss_huge 318767104
    total_shmem 0
    total_mapped_file 3801059328
    total_dirty 0
    total_writeback 0
    total_swap 0
    total_pgpgin 35534334
    total_pgpgout 33527379
    total_pgfault 27764913
    total_pgmajfault 30855
    total_inactive_anon 0
    total_active_anon 4510744576
    total_inactive_file 1901416448
    total_active_file 2125975552
    total_unevictable 0
    ```

- The sum of `cache` and `rss` is the actual `memory.usage_in_bytes`, according to 5.5 usage_in_bytes.
Conclusion
So the reason `go_memstats_alloc_bytes` reports less than `container_memory_usage_bytes` is that cache and rss are high. The Go runtime doesn't free up that memory; it keeps it to be reused by future allocations.
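One way to see how much of the gap is memory the Go runtime is holding on to (rather than live heap) is to compare heap-idle against heap-released; a query sketch, where the `job` label value is an assumption:

```shell
# Hypothetical query: heap memory the Go runtime keeps reserved but hasn't
# returned to the OS. Both metrics are standard Go runtime metrics exported
# by the process; the job label is assumed.
QUERY='go_memstats_heap_idle_bytes{job="thanos-store"} - go_memstats_heap_released_bytes{job="thanos-store"}'
echo "$QUERY"
```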
Memory spike events
2021-11-24 7:10 UTC (Thanos was unable to get data `memcached`)
Background
- Logs for `thanos-store-8-1` (highest memory usage) show `read tcp 10.222.70.12:52354->10.222.0.8:11211: i/o timeout`.
- Looking at the other `thanos-store` pods we see the same error: Source
- Memcached timeouts seem to be a known problem in Thanos: https://github.com/thanos-io/thanos/issues/1979
  - Looking at graphs from the suggested queries in https://github.com/thanos-io/thanos/issues/1979#issuecomment-573172023:
    - `rate(thanos_memcached_operations_total{env="gprd"}[1m])` 👉 screenshot: Everything looks normal.
    - `histogram_quantile(1, rate(thanos_memcached_operation_duration_seconds_bucket[1m]))` 👉 screenshot: We see a large spike in duration.
    - `histogram_quantile(0.9, rate(thanos_memcached_operation_duration_seconds_bucket{env="gprd"}[1m]))` 👉 screenshot: We see a spike in duration when memory usage goes up. The slower memcached is, the more memory we use.
    - `sum by (item_type) (rate(thanos_store_index_cache_hits_total[1m]) / rate(thanos_store_index_cache_requests_total[1m]))` 👉 screenshot: Everything looks normal.
  - CPU usage for `memcached` pods seems to be within normal range during the memory spike: thanos 👉 screenshot
- Thanos is using https://github.com/bradfitz/gomemcache as a memcached client, which seems to have a few problems:
- There doesn't seem to be anything useful inside of the traces.
- This seems to be a 24-hour recurring event: https://gitlab.com/gitlab-com/gl-infra/infrastructure/-/issues/13800#note_744584933
Solutions
- Reading https://github.com/thanos-io/thanos/issues/1979#issuecomment-843092954 it seems like setting `max_get_multi_batch_size` might help.
- There is also https://github.com/thanos-io/thanos/pull/4742, which could help with this.
- Is in-memory better?
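If we go with the batching suggestion, the change would be a one-line tweak to the `gprd` config shown earlier; a sketch, where the value `100` is illustrative and untested:

```yaml
# Hypothetical change: cap each GetMulti at 100 keys so large reads are
# split into batches instead of sent as one huge request.
"max_get_multi_batch_size": 100
```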
2021-11-18 16:20 UTC
Around the same time I see `memcache: unexpected response line from "set": "SERVER_ERROR out of memory storing object\r\n"`.
Looking at the total items in memcached we also see a spike around the same time.
We also see a high eviction rate during that time, so Memcached is struggling to keep items in memory and dumps the ones that it can't keep.
2021-11-23 21:38 UTC (root cause not found)
Background
- Go memory heap stayed stable around 2GB.
- OOM killer information from `journalctl -k`:

  ```
  Nov 23 21:38:26 gke-gstg-gitlab-gke-default-2-64acc0ef-v0kh kernel: Tasks state (memory values in pages):
  Nov 23 21:38:26 gke-gstg-gitlab-gke-default-2-64acc0ef-v0kh kernel: [  pid  ]   uid  tgid total_vm      rss pgtables_bytes swapents oom_score_adj name
  Nov 23 21:38:26 gke-gstg-gitlab-gke-default-2-64acc0ef-v0kh kernel: [   8389] 65534  8389      241        1    28672        0          -998 pause
  Nov 23 21:38:26 gke-gstg-gitlab-gke-default-2-64acc0ef-v0kh kernel: [ 793710] 65534 793710 11802571  3902395 86560768        0           490 thanos
  Nov 23 21:38:26 gke-gstg-gitlab-gke-default-2-64acc0ef-v0kh kernel: oom-kill:constraint=CONSTRAINT_MEMCG,nodemask=(null),cpuset=b0839cb5c83fbabadd205a350d76c6a75eef1082b6de97271ea5091e52aa8ac9,mems_allowed=0,oom_memcg=/kubepods/burstable/pod28779d8a-4067-4e52-88ce-becab0ee86c8,task_memcg=/kubepods/burstable/pod28779d8a-4067-4e52-88ce-becab0ee86c8/b0839cb5c83fbabadd205a350d76c6a75eef1082b6de97271ea5091e52aa8ac9,task=thanos,pid=793710,uid=65534
  Nov 23 21:38:26 gke-gstg-gitlab-gke-default-2-64acc0ef-v0kh kernel: Memory cgroup out of memory: Killed process 793710 (thanos) total-vm:47210284kB, anon-rss:15576568kB, file-rss:33012kB, shmem-rss:0kB, UID:65534 pgtables:84532kB oom_score_adj:490
  Nov 23 21:38:27 gke-gstg-gitlab-gke-default-2-64acc0ef-v0kh kernel: oom_reaper: reaped process 793710 (thanos), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
  ```
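As a cross-check of the OOM report: the `rss` column in the tasks table is in 4 KiB pages, and converting it should line up with the kB figures in the "Killed process" line:

```shell
# rss from the tasks table is in 4 KiB pages; converted to kB it should match
# anon-rss + file-rss from the "Killed process" line.
RSS_PAGES=3902395
RSS_KB=$((RSS_PAGES * 4))
echo "$RSS_KB"              # 15609580 kB, ~15.6GB
echo $((15576568 + 33012))  # anon-rss + file-rss = 15609580 kB, matches
```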
Action Items/Suggestions (WIP)
Short Term
- Increase batch size to improve cache performance.
- Decrease the async buffer size.
- Increase memory limits on `gprd` and `gstg`.
Long Term
- Make thanos-store a stateless deployment?