logs not available in ELK for a number of indexes

Please note: if the incident relates to sensitive data or is security related, consider labeling this issue with security and marking it confidential.


Summary

A brief summary of what happened. Try to make it as executive-friendly as possible.

Service(s) affected:
Team attribution:
Minutes downtime or degradation:

Timeline

2019-07-12 - 14:00 UTC

  • 14:00 UTC - Noticed that some logs were not up to date in Kibana
  • 15:10 UTC - MR 5: lower the rollover threshold from 150M to 30M docs
  • 18:19 UTC - Ran kill -9 on the stuck pubsubbeat process for workhorse logs

2019-07-13

  • 06:00 UTC - Workhorse logs caught up

https://dashboards.gitlab.net/d/USVj3qHmk/logging?orgId=1&from=now-2d&to=now

https://thanos-query.ops.gitlab.net/graph?g0.range_input=2w&g0.expr=stackdriver_pubsub_subscription_pubsub_googleapis_com_subscription_oldest_unacked_message_age%7Bsubscription_id%3D%22pubsub-rails-inf-gprd-sub%22%7D%20%2F%2060%20%2F%2060&g0.tab=0
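The Thanos link above carries its query URL-encoded, which makes it hard to read in place. A minimal sketch for recovering the expression with the Python standard library follows; the helper name `extract_expr` is hypothetical. Assuming the Stackdriver metric is reported in seconds, the two divisions by 60 convert the age of the oldest unacknowledged message on the rails subscription into hours.

```python
from urllib.parse import urlsplit, parse_qs

# URL copied verbatim from the issue.
THANOS_URL = (
    "https://thanos-query.ops.gitlab.net/graph?g0.range_input=2w"
    "&g0.expr=stackdriver_pubsub_subscription_pubsub_googleapis_com_"
    "subscription_oldest_unacked_message_age%7Bsubscription_id%3D%22"
    "pubsub-rails-inf-gprd-sub%22%7D%20%2F%2060%20%2F%2060&g0.tab=0"
)


def extract_expr(url, panel="g0"):
    """Return the decoded query expression for one graph panel (hypothetical helper)."""
    params = parse_qs(urlsplit(url).query)  # parse_qs percent-decodes the values
    return params[f"{panel}.expr"][0]


expr = extract_expr(THANOS_URL)
print(expr)
```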

Affected indices:

  • rails
  • workhorse
  • (...)

Working notes - two MRs that attempt to make the rollover API a little more active:

  1. https://ops.gitlab.net/gitlab-com/gl-infra/gitlab-restore/esc-tools/merge_requests/5 - change docs count to 30M from 150M
  2. https://ops.gitlab.net/gitlab-com/gl-infra/gitlab-restore/esc-tools/merge_requests/6 - add nginx logs to rollover api
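For context, the docs-count change in MR 5 corresponds to the `max_docs` condition of the Elasticsearch rollover API. The sketch below only builds such a request and does not send it. It is not taken from esc-tools: the host, the alias name `example-rails-alias`, and the helper `build_rollover_request` are assumptions for illustration.

```python
import json
from urllib.request import Request


def build_rollover_request(es_url, alias, max_docs):
    """Build (but do not send) a POST /<alias>/_rollover request.

    The index behind the alias is rolled over only once it holds
    at least `max_docs` documents.
    """
    body = {"conditions": {"max_docs": max_docs}}
    return Request(
        url=f"{es_url.rstrip('/')}/{alias}/_rollover",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Hypothetical host and alias; 30M is the new threshold from MR 5 (was 150M).
req = build_rollover_request("http://localhost:9200", "example-rails-alias", 30_000_000)
print(req.get_method(), req.full_url)
print(req.data.decode("utf-8"))
```

A lower `max_docs` means each index reaches the threshold sooner, so rollovers happen more often and individual indices stay smaller.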
Edited Aug 03, 2020 by 🤖 GitLab Bot 🤖