Upon upgrading both GitLab and Runner to v12.10, job logs (Kubernetes executor) are taking a long time to update in the UI, even though the job has already completed
After upgrading to GitLab 12.10, some job logs are being processed more slowly. These jobs still show as running (because the runner is still writing the job logs) even though they completed long ago (30-40 minutes). This is happening only with some Kubernetes runners with larger logs, not with all jobs.
This was not happening with jobs that do not use the Kubernetes executor.
Upon further investigation, the issue appeared to originate from the GitLab Runner manager pod (which is responsible for spinning up and deleting job pods, and for logging). The current resource limits are not enough for the manager pod to process logs properly. This did not happen before upgrading to 12.10.
Provisioning more resources for the runner manager pod mitigated the issue.
A few things to make clear:
- the size of the job logs didn't change after the upgrade (exactly the same job).
- the Kubernetes nodes use SSD drives
- the pod generates logs properly, but the manager is not able to send them to the GitLab coordinator because of insufficient resources (especially CPU); I/O is normal
- this is happening only on the subset of jobs that generate larger logs (around 12 MB)
The FF_USE_LEGACY_KUBERNETES_EXECUTION_STRATEGY feature flag was used after the issue happened in order to see the job pods' logs and debug the issue. After setting this flag to false and increasing memory and CPU for the manager pod, the issue was fixed.
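For reference, a runner feature flag like this can be toggled per job as a CI/CD variable in .gitlab-ci.yml (a sketch; the job name is hypothetical, and the flag can also be set in the runner's config.toml environment instead):

```yaml
# .gitlab-ci.yml (sketch): toggle the runner feature flag per job.
# "true" falls back to the legacy Kubernetes execution strategy
# (useful while debugging); "false" uses the new strategy.
debug-job:
  variables:
    FF_USE_LEGACY_KUBERNETES_EXECUTION_STRATEGY: "false"
  script:
    - echo "running with the new Kubernetes execution strategy"
```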
The test job gets slower as the manager pod approaches its limits, but nowhere near the slowness seen before extending the manager pod's memory and CPU:
Before the change (when it was slow):
resources:
  limits:
    memory: 1Gi
    cpu: 1
  requests:
    memory: 1Gi
    cpu: 500m
After the change (slowness mitigated, but occasional slowdowns are still seen):
resources:
  limits:
    memory: 2Gi
    cpu: 5
  requests:
    memory: 1Gi
    cpu: 2
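For context, when the runner is deployed with the official gitlab-runner Helm chart, these requests and limits for the manager pod go under the top-level resources key in values.yaml (a sketch using the post-change values; adjust to your actual deployment method):

```yaml
# values.yaml (sketch) for the gitlab-runner Helm chart.
# These requests/limits apply to the runner manager pod itself,
# not to the job pods it spawns for CI jobs.
resources:
  limits:
    memory: 2Gi
    cpu: 5
  requests:
    memory: 1Gi
    cpu: 2
```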
Before upgrading to 12.10, the manager pod worked very well with the initial configuration (with 50-100 test jobs running and no slowness).
According to Grafana (pod monitoring), the manager pod does occasionally consume all 5 CPUs.
GitLab Runner was downgraded to 12.9, but that didn't help; the behavior is the same.
Kubernetes cluster is at v1.15.2 (version was not changed during this whole process).
Engineers from the GitLab Support and Verify teams were not able to reproduce the issue.
One possible step forward in isolating the problem would be to register the Runner with GitLab.com and see whether the slowness still happens for jobs run from there.