Investigate thanos-store problems

This has been promoted to an epic in &648 (closed) since there are multiple action items that we need to take.

Overview

Thanos store:

Investigations

Memcached

Check if Memcached is best effort on Thanos, meaning if Memcached is unavailable will it store everything in memory? (write)

There seem to be two types of cache:

  1. In-memory cache. This doesn't seem to be used in our case since we don't configure it.
  2. Memcached. This is the one we configure.
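For reference, the Memcached index-cache configuration passed to thanos-store looks roughly like this. The field names come from Thanos's memcached client config; the address and values here are illustrative, not our production settings:

```yaml
type: MEMCACHED
config:
  addresses: ["memcached.monitoring.svc:11211"]  # illustrative address
  timeout: 500ms
  max_idle_connections: 100
  max_async_concurrency: 20
  max_async_buffer_size: 10000   # the async buffer limit discussed below
  max_get_multi_concurrency: 100
  max_get_multi_batch_size: 0
  max_item_size: 1MiB
```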
Async
  • Background:
    • When using Memcached, all stores are done asynchronously. We can take a look at SetAsync
    • The async buffer size is set to 10000, so it's fairly high. When we reach this limit we simply skip storing the value.
  • Outcome: Looking at the past week (2021-11-17 - 2021-11-24) we can see that we reached this async-buffer-limit a few times:

graph showing spike of hitting async-buffer-limit

Source

Memory pressure might increase here: while items sit in the queue they are held in memory, and the Go garbage collector can't free them, which builds up memory pressure.
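One way to confirm we're hitting the limit is the skipped-operations counter exposed by the memcached client. The metric name below is an assumption based on Thanos's memcached client, and a sample scrape line is embedded so the parsing can be shown without a live pod:

```shell
# Sample /metrics line from thanos-store (illustrative value; in a live
# cluster this would come from `curl thanos-store:10902/metrics`).
metrics='thanos_memcached_operation_skipped_total{name="index-cache",operation="set",reason="max-async-buffer-full"} 1234'

# Count of async writes dropped because the buffer was full.
echo "$metrics" | awk '/max-async-buffer-full/ {print $2}'
```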

What happens when fetch are unable to happen? (read)

At the moment Memcached seems to have only a 4 GB limit, which is quite odd given that we give thanos-store 20 GB.
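Both limits can be verified directly over Memcached's text protocol. The service host/port and the `nc` invocation are assumptions about our setup; sample output is embedded here so the arithmetic can be shown without a live server:

```shell
# In the cluster this output would come from something like:
#   echo stats | nc memcached.monitoring.svc 11211              # limit_maxbytes
#   echo "stats settings" | nc memcached.monitoring.svc 11211   # item_size_max
stats='STAT limit_maxbytes 4294967296
STAT item_size_max 1048576'

echo "$stats" | awk '$2 == "limit_maxbytes" {printf "memory limit: %d GiB\n", $3 / 1073741824}'
echo "$stats" | awk '$2 == "item_size_max"  {printf "max item size: %d MiB\n", $3 / 1048576}'
```

With these (illustrative) values the memory limit comes out to 4 GiB and the max item size to 1 MiB, matching the defaults rather than the 20 GB we give thanos-store.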

Check if `max_item_size` is a problem

screenshot showing there is no increase in max-item-size count

Source

Container memory usage

Find out why there is a difference between Go and container memory usage

There seem to be some discrepancies between go_memstats_alloc_bytes and container_memory_usage_bytes, for example:

Getting profile from pprof on gstg regional cluster

  1. Get the node name: NODE=$(kubectl -n monitoring get po thanos-store-2-1 -o json | jq '.spec.nodeName' -r)
  2. Get the zone of the node: ZONE=$(gcloud compute instances list --filter name=$NODE --format="value(zone)")
  3. Get the IP address of the pod: kubectl -n monitoring get po thanos-store-2-1 -o json | jq '.status.podIP' -r
  4. SSH into the machine: gcloud compute ssh $NODE --zone $ZONE
    1. Launch toolbox: toolbox
    2. Run curl against the IP from step 3: curl -L 10.227.22.9:10902/debug/pprof/heap -o heap.pprof
    3. Copy it so it's accessible on the node disk: cp heap.pprof /media/root/home/$USER/
  5. Copy heap.pprof locally to view it with go tool pprof: gcloud compute scp --zone $ZONE $NODE:/home/$USER/heap.pprof /tmp/heap.pprof
  6. View the profile: go tool pprof -http :9090 /tmp/heap.pprof

Looking at the flame graph we can see that the Go heap size is only around 1-2 GB (which matches the metric).

Getting memory profile of the whole container

We are not able to get a memory profile because Google COS doesn't seem to support this:

steve@gke-gstg-gitlab-gke-default-2-64acc0ef-v0kh /run/containerd $ sudo perf mem record -a sleep 60 --cgroup $CONTAINER_CGROUP
failed: memory events not supported

However, we can get memory stats from the cgroup's memory.stat file:

  1. Get the node name: NODE=$(kubectl -n monitoring get po thanos-store-2-1 -o json | jq '.spec.nodeName' -r)

  2. Get the zone of the node: ZONE=$(gcloud compute instances list --filter name=$NODE --format="value(zone)")

  3. SSH into the machine: gcloud compute ssh $NODE --zone $ZONE

    1. Install jq in toolbox: toolbox apt-get install jq
    2. Get container ID: CONTAINER_ID=$(crictl ps -q --name thanos-store)
    3. Get memory cgroup: CGROUP_MEMORY=$(sudo cat /run/containerd/runc/k8s.io/$CONTAINER_ID/state.json | toolbox -q jq -r '.cgroup_paths.memory')
    4. Get current memory: sudo cat $CGROUP_MEMORY/memory.usage_in_bytes. This reflects the container_memory_usage_bytes metric
    5. Get memory stats: sudo cat $CGROUP_MEMORY/memory.stat. To better understand this we can read https://www.kernel.org/doc/Documentation/cgroup-v1/memory.txt (5.2 stat file)
      cache 4027678720
      rss 4510638080
      rss_huge 318767104
      shmem 0
      mapped_file 3801059328
      dirty 0
      writeback 0
      swap 0
      pgpgin 35534334
      pgpgout 33527379
      pgfault 27764913
      pgmajfault 30855
      inactive_anon 0
      active_anon 4510744576
      inactive_file 1901416448
      active_file 2125975552
      unevictable 0
      hierarchical_memory_limit 16106127360
      hierarchical_memsw_limit 9223372036854771712
      total_cache 4027678720
      total_rss 4510638080
      total_rss_huge 318767104
      total_shmem 0
      total_mapped_file 3801059328
      total_dirty 0
      total_writeback 0
      total_swap 0
      total_pgpgin 35534334
      total_pgpgout 33527379
      total_pgfault 27764913
      total_pgmajfault 30855
      total_inactive_anon 0
      total_active_anon 4510744576
      total_inactive_file 1901416448
      total_active_file 2125975552
      total_unevictable 0

    The sum of cache and rss approximates memory.usage_in_bytes, according to section 5.5 (usage_in_bytes) of that document.
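As a quick sanity check, summing the cache and rss values captured above reproduces the usage figure (a sketch; per section 5.5 of the kernel doc, usage_in_bytes is only an approximation of this sum):

```shell
# Values from the memory.stat dump above.
cache=4027678720
rss=4510638080

# usage_in_bytes should be close to cache + rss (~8.5 GB here, well under
# the 16106127360-byte hierarchical_memory_limit).
echo $((cache + rss))   # → 8538316800
```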

Conclusion

So we know that go_memstats_alloc_bytes reports less than container_memory_usage_bytes because cache and rss are high: the Go runtime doesn't return freed memory to the OS right away, and keeps it for future allocations.

Memory spike events

2021-11-24 7:10 UTC (Thanos was unable to get data `memcached`)

Background

Solutions

2021-11-18 16:20 UTC

Screenshot_2021-11-18_at_17.27.26

Source

Around the same time I see memcache: unexpected response line from "set": "SERVER_ERROR out of memory storing object\r\n"

Screenshot_2021-11-18_at_17.29.20

Source

Looking at the total items in memcached we also see a spike around the same time.

Screenshot_2021-11-18_at_17.32.35

source

We also see a high eviction rate during that time, so Memcached is struggling to keep items in memory and has to dump the ones it can't keep.

Screenshot_2021-11-18_at_17.44.30

Source

2021-11-23 21:38 UTC (root cause not found)

Background

  • Container memory usage shot up from 6.5GB to 16GB

    Source

  • Go memory heap stayed stable around 2GB

    Source

  • OOM killer information: journalctl -k

    Nov 23 21:38:26 gke-gstg-gitlab-gke-default-2-64acc0ef-v0kh kernel: Tasks state (memory values in pages):
    Nov 23 21:38:26 gke-gstg-gitlab-gke-default-2-64acc0ef-v0kh kernel: [  pid  ]   uid  tgid total_vm      rss pgtables_bytes swapents oom_score_adj name
    Nov 23 21:38:26 gke-gstg-gitlab-gke-default-2-64acc0ef-v0kh kernel: [   8389] 65534  8389      241        1    28672        0          -998 pause
    Nov 23 21:38:26 gke-gstg-gitlab-gke-default-2-64acc0ef-v0kh kernel: [ 793710] 65534 793710 11802571  3902395 86560768        0           490 thanos
    Nov 23 21:38:26 gke-gstg-gitlab-gke-default-2-64acc0ef-v0kh kernel: oom-kill:constraint=CONSTRAINT_MEMCG,nodemask=(null),cpuset=b0839cb5c83fbabadd205a350d76c6a75eef1082b6de97271ea5091e52aa8ac9,mems_allowed=0,oom_memcg=/kubepods/burstable/pod28779d8a-4067-4e52-88ce-becab0ee86c8,task_memcg=/kubepods/burstable/pod28779d8a-4067-4e52-88ce-becab0ee86c8/b0839cb5c83fbabadd205a350d76c6a75eef1082b6de97271ea5091e52aa8ac9,task=thanos,pid=793710,uid=65534
    Nov 23 21:38:26 gke-gstg-gitlab-gke-default-2-64acc0ef-v0kh kernel: Memory cgroup out of memory: Killed process 793710 (thanos) total-vm:47210284kB, anon-rss:15576568kB, file-rss:33012kB, shmem-rss:0kB, UID:65534 pgtables:84532kB oom_score_adj:490
    Nov 23 21:38:27 gke-gstg-gitlab-gke-default-2-64acc0ef-v0kh kernel: oom_reaper: reaped process 793710 (thanos), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
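The OOM killer's numbers are internally consistent: total_vm and rss in the tasks table are in 4 KiB pages, while the kill line reports kB. A quick check:

```shell
# rss for the thanos task is in 4 KiB pages; multiply by 4 to get kB.
rss_pages=3902395
echo $((rss_pages * 4))      # → 15609580 kB

# Which matches anon-rss + file-rss from the kill line:
echo $((15576568 + 33012))   # → 15609580 kB (~14.9 GiB)
```

That ~15 GB rss at kill time lines up with the container memory shooting up to 16 GB while the Go heap stayed around 2 GB.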

Action Items/Suggestions (WIP)

Short Term

  1. Increase the batch size to improve cache performance.
  2. Decrease the async buffer size.
  3. Increase memory limits on gprd and gstg.
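The first two items map onto the memcached client config. A sketch of the direction of change (field names from Thanos's config; the values are illustrative, not proposals):

```yaml
config:
  max_get_multi_batch_size: 100   # batch getmulti requests instead of leaving this unbounded
  max_async_buffer_size: 5000     # smaller buffer: drop writes earlier instead of
                                  # holding them in memory under pressure
```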

Long Term

  1. Make thanos-store a stateless deployment?
Edited by Steve Xuereb