Beats in k8s created secondary subscriptions

When pubsubbeats were enabled in production in gitlab-com/gl-infra/k8s-workloads/gitlab-helmfiles!213 (merged), they created secondary Pub/Sub subscriptions, resulting in twice the volume of logs being sent to Elastic.

Follow ups:

  • errors in Thanos
    • errors cleared
    • what was the cause
      • fluentd errors in the production GKE cluster, the underlying metric is: fluentd_output_status_num_errors
      • errors in the underlying metric can be seen in this Thanos query
      • multiple fluentd processes were affected, for multiple log streams
      • We currently don't forward logs from fluentd to Elastic, so the only way we can look at the fluentd logs from that time is through Stackdriver. Here's the relevant Stackdriver search
        • there are lots of errors for ES rejecting logs because it failed to index fields due to mapping mismatches. Looking at staging, we've been hitting exactly the same problem there as well: Stackdriver search
          • detailed analysis of a few errors in gstg:
            • json.class in sidekiq logs: the static mapping in ES, as well as the cached index mappings, expects this field to be a string, but the log entry sent from k8s contains a JSON object in that field: "json.class"=>{"class_attributes"=>{"feature_category"=>"incident_management", "urgency"=>"high", "idempotent"=>true}},
            • json.method in git-https: ES expects a text field, but a JSON object is sent: "json.method"=>{},
            • json.err.detail.data in registry: ES expects a JSON object, but a text field is sent: {"data"=>"sha256:3219a255d7d585153e7580bb01c0827dafb8fa89e8501bef893fe4792bc88f3d"}, <-- this is a known issue
          • detailed analysis of a few errors in gprd:
        • there were write rejections on the cluster: Screenshot_from_2020-09-18_15-56-58
        • the fluentd errors were likely caused by the ES cluster refusing bulk requests because it was overloaded; we want to test this theory: gitlab-com/gl-infra/k8s-workloads/gitlab-helmfiles!217 (merged)
  • roll-over after the config change is not happening
  • get rid of subscriptions created today
  • get rid of subscriptions with duplicate in the name
    • one idea is to do the exact same thing again, i.e. tell beats to create new subscriptions with cleaned-up names, let both sets of beats run for a few minutes, then shut down the ones on VMs and remove the old subscriptions. If we decide to go down this route, we should probably do it as separate steps, i.e. 1) migrate beats to k8s, 2) switch the subscription name
  • analyze ES saturation
  • backlog in subscriptions cleared
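As a rough aid for the Thanos and mapping-mismatch follow-ups above, here is a minimal sketch of how the error metric and a conflicting field mapping could be inspected from the command line. The hostnames, the env label, and the index pattern are placeholders/assumptions, not the real internal values; Thanos Query implements the standard Prometheus HTTP API, and ES exposes per-field mappings via the get-field-mapping API.

```shell
# Hypothetical endpoints -- substitute the real internal Thanos / ES hosts.
THANOS="https://thanos.example.gitlab.net"
ES="https://es.example.gitlab.net:9200"

# Per-pod rate of fluentd output errors (the metric named above),
# using the Prometheus-compatible instant-query endpoint.
# The env="gprd" label selector is an assumption.
curl -sG "${THANOS}/api/v1/query" \
  --data-urlencode 'query=sum by (pod) (rate(fluentd_output_status_num_errors{env="gprd"}[5m]))'

# Check what type ES actually has mapped for one of the conflicting
# fields, e.g. json.class; the index pattern here is a guess.
curl -s "${ES}/pubsub-sidekiq-*/_mapping/field/json.class"
```

Comparing the mapped type returned by the second call against the value in the rejected log entry should confirm each mismatch listed above.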
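For the "get rid of subscriptions with duplicate in the name" item, a minimal clean-up sketch with gcloud, assuming the stray subscriptions really are identifiable by "duplicate" in their name; the project id is a placeholder, and the destructive step is left commented out so the list can be reviewed first.

```shell
# Placeholder project id -- substitute the real GCP project.
PROJECT="gitlab-production"

# List subscriptions whose name contains "duplicate", for review.
gcloud pubsub subscriptions list \
  --project="${PROJECT}" \
  --filter='name:duplicate' \
  --format='value(name)'

# After reviewing the list, delete them (commented out on purpose):
# gcloud pubsub subscriptions list --project="${PROJECT}" \
#   --filter='name:duplicate' --format='value(name)' \
#   | xargs -r -n1 gcloud pubsub subscriptions delete --project="${PROJECT}"
```

Deleting a subscription does not delete its topic, so the beats-created primary subscriptions keep receiving logs while the duplicates are removed.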
Edited by Michal Wasilewski