Memory leak on GitLab Runner
Hello team,
We have deployed an autoscaled GitLab Runner platform on our Kubernetes cluster, and for some time we have observed constantly increasing memory usage on the runner platform container.
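In case it helps to cross-check the trend outside of our dashboards, here is a minimal sketch of how the container memory can be sampled over time via the Kubernetes metrics API (this assumes metrics-server is installed; the namespace and label selector are placeholders for our actual values):

```python
import time
from kubernetes import client, config

config.load_kube_config()  # use config.load_incluster_config() when running inside the cluster
api = client.CustomObjectsApi()

while True:
    # PodMetrics objects are exposed by metrics-server under metrics.k8s.io/v1beta1
    metrics = api.list_namespaced_custom_object(
        group="metrics.k8s.io",
        version="v1beta1",
        namespace="gitlab-runner",           # placeholder namespace
        plural="pods",
        label_selector="app=gitlab-runner",  # placeholder label selector
    )
    for pod in metrics["items"]:
        for container in pod["containers"]:
            # container["usage"]["memory"] is a quantity string such as "512Mi"
            print(pod["metadata"]["name"], container["name"], container["usage"]["memory"])
    time.sleep(300)  # sample every 5 minutes
```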
For a bit more context, this runner platform handles scheduled jobs, along with regular jobs triggered by commits.
The scheduled jobs are the most intensive ones, possibly triggering more than 10 jobs in parallel.
Those jobs often run into timeouts (due to our system performance, but worth mentioning as context in case it helps the investigation).
The most frequent log patterns from the runner platform container are the following:
Failed to process runner (1081 occurrences)
Job failed: command terminated with exit code 1 (625 occurrences)
Job failed: execution took longer than 1h0m0s seconds (368 occurrences)
Error while executing file based variables removal script (454 occurrences)
Job failed: canceled (54 occurrences)
Apart from that, I filtered out the uninteresting log patterns /Appending trace to coordinator|Submitting job to coordinator|Updating job|Checking for jobs|Job succeeded/.
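For reference, counts like the ones above can be extracted from a dump of the container logs with a small script along these lines (the log file name is a placeholder; the patterns are just the messages listed above):

```python
import re
from collections import Counter

# Noise filter mirroring the regex mentioned above
NOISE = re.compile(
    r"Appending trace to coordinator|Submitting job to coordinator|"
    r"Updating job|Checking for jobs|Job succeeded"
)

# Patterns of interest (substrings/regexes of the messages listed above)
PATTERNS = [
    r"Failed to process runner",
    r"Job failed: command terminated with exit code 1\b",
    r"Job failed: execution took longer than",
    r"Error while executing file based variables removal script",
    r"Job failed: canceled",
]

counts = Counter()
with open("runner.log") as f:  # placeholder: a dump of the runner platform container logs
    for line in f:
        if NOISE.search(line):
            continue
        for pattern in PATTERNS:
            if re.search(pattern, line):
                counts[pattern] += 1
                break

for pattern, count in counts.most_common():
    print(f"{count:6d}  {pattern}")
```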
We also have a few occurrences of Job failed (system failure): pods "runner-XXX-project-XXX-concurrent-XXX" not found (10 to 20 occurrences over the last 2 days) and Error streaming logs gitlab/runner-XXX-project-XXX-concurrent-XXX/helper:/logs-12775-9413290/output.log: command terminated with exit code 143. Retrying... (5 to 10 occurrences over the last 2 days).
If you need any more info, I will be happy to provide it.