@nicholasklick who on your team should @cindy work with to determine what's most useful to visualize in the Grafana dashboards?
Additionally, I noticed that in readiness#12 (closed) there are no target SLIs. What type of alerting are you expecting to see from the metrics? And if there are no SLIs, does the service require more metrics beyond basic k8s pod monitoring?
FYI, the observability items here need to be applied to staging as well as production. We discussed this in the Configure team meeting and realised that, reading gitlab-org/gitlab#249593 (closed), it wasn't clear that staging observability had not been configured at all.
May I ask why Tracing is currently crossed out? kas honors our standard tracing headers and passes them along with all outbound requests. It also supports consuming them from inbound requests (in the future) and uses the correlation id in logs. So it'd be useful if we had it. I'm not sure if it's blocking for prod rollout or not. If not, we can have that configured later, no worries. I'm actually not sure what it means to "provision" Tracing on the infra/SRE side, so perhaps there is just nothing to do and "it just works".
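For context on what kas does on its side (not a request for extra infra work), the propagation boils down to reading the correlation ID off the inbound request, logging with it, and setting it again on outbound requests. Here is a minimal stdlib sketch of that pattern, not the actual kas code, assuming a header name of X-Request-ID; kas's real header names and wiring may differ:

```go
// Minimal sketch of correlation-ID propagation, NOT the actual kas code.
// Assumption: the propagation header is X-Request-ID.
package main

import (
	"context"
	"crypto/rand"
	"encoding/hex"
	"log"
	"net/http"
)

type correlationKey struct{}

const correlationHeader = "X-Request-ID" // assumed header name

// withCorrelation reads the inbound correlation ID (or generates one),
// stores it on the request context, and logs it.
func withCorrelation(next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		id := r.Header.Get(correlationHeader)
		if id == "" {
			buf := make([]byte, 8)
			_, _ = rand.Read(buf)
			id = hex.EncodeToString(buf)
		}
		log.Printf("correlation_id=%s method=%s path=%s", id, r.Method, r.URL.Path)
		next.ServeHTTP(w, r.WithContext(context.WithValue(r.Context(), correlationKey{}, id)))
	})
}

// propagatingTransport copies the correlation ID from the request context
// onto outbound requests so downstream services can join up the trace.
type propagatingTransport struct{ base http.RoundTripper }

func (t propagatingTransport) RoundTrip(req *http.Request) (*http.Response, error) {
	if id, ok := req.Context().Value(correlationKey{}).(string); ok {
		req = req.Clone(req.Context())
		req.Header.Set(correlationHeader, id)
	}
	return t.base.RoundTrip(req)
}

func main() {
	// Outbound calls made with the inbound request's context carry the ID.
	_ = &http.Client{Transport: propagatingTransport{base: http.DefaultTransport}}
	log.Fatal(http.ListenAndServe(":8080", withCorrelation(http.NotFoundHandler())))
}
```

If "provisioning" Tracing is really just having a backend to send spans to, then there's probably nothing extra needed on the kas side for header propagation itself.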
The failure message seems to indicate the error was due to a timeout, so I don't think the change even went through?
COMBINED OUTPUT: Error: UPGRADE FAILED: an error occurred while rolling back the release. original upgrade error: timed out waiting for the condition: release fluentd-elasticsearch failed: timed out waiting for the condition
ERROR: Job failed: exit code 1
I'm not sure where to get more information on what caused the timeout. @skarbek do you happen to know if there's a specific log that would be helpful in debugging this?
I suspect that the increase in the number of nodes running in production, together with more workloads in the cluster, has increased the number of Pods that must be rotated to pick up the new configuration. At this point, with the timeout bumped, I would advise that we simply try again.
> there are no target SLIs. What type of alerting are you expecting to see from the metrics? And if there are no SLIs, does the service require more metrics beyond basic k8s pod monitoring?
Echoing Anthony's comment earlier, do we know what alerts and dashboards we need in order to move to production? I saw some discussion in Slack but I'm unclear if we reached a decision on this.
@amyphillips I'm trying to add KAS to the metrics catalogue, but I ran into some issues. Adding it will enable us to auto-generate some k8s dashboards and watch request and error rates from the gRPC calls.
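For reference on where those request/error rates come from (a sketch only, I haven't checked how kas actually wires this up): in a Go gRPC service they are typically exported by the standard go-grpc-prometheus server interceptors as grpc_server_handled_total, labelled by service, method and status code, which is the kind of counter the catalogue entry can build rate queries on:

```go
// Sketch of exposing gRPC request/error-rate metrics with go-grpc-prometheus.
// This is illustrative only; kas's actual instrumentation may differ.
package main

import (
	"net"
	"net/http"

	grpc_prometheus "github.com/grpc-ecosystem/go-grpc-prometheus"
	"github.com/prometheus/client_golang/prometheus/promhttp"
	"google.golang.org/grpc"
)

func main() {
	srv := grpc.NewServer(
		// These interceptors record per-method counters such as
		// grpc_server_handled_total, labelled by gRPC code, which is what
		// request- and error-rate panels/SLIs would query.
		grpc.UnaryInterceptor(grpc_prometheus.UnaryServerInterceptor),
		grpc.StreamInterceptor(grpc_prometheus.StreamServerInterceptor),
	)
	// ...register the service's gRPC handlers on srv here...
	grpc_prometheus.Register(srv)

	// Expose the metrics endpoint for Prometheus to scrape.
	mux := http.NewServeMux()
	mux.Handle("/metrics", promhttp.Handler())
	go func() { _ = http.ListenAndServe(":9090", mux) }()

	lis, err := net.Listen("tcp", ":8150")
	if err != nil {
		panic(err)
	}
	_ = srv.Serve(lis)
}
```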
The pubsub-kas-inf-gprd index pattern had no time field set. I added a new index pattern, pubsub-kas-inf-gprd*, with the time field set to json.time. We should probably delete the older pattern; I'm just not sure how it got created, so it may also need to be removed from wherever it's managed automatically.