Memory leak on GitLab Runner
Hello team,
We have deployed an autoscaled GitLab Runner platform on our Kubernetes cluster, and for some time we have observed constantly increasing memory usage on the runner platform container.
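In case it helps to cross-check the trend outside of our dashboards, here is a minimal sketch of how the container memory can be sampled over time via the Kubernetes metrics API (this assumes metrics-server is installed; the namespace and label selector are placeholders for our actual values):

```python
import time
from kubernetes import client, config

config.load_kube_config()  # use config.load_incluster_config() when running inside the cluster
api = client.CustomObjectsApi()

while True:
    # PodMetrics objects are exposed by metrics-server under metrics.k8s.io/v1beta1
    metrics = api.list_namespaced_custom_object(
        group="metrics.k8s.io",
        version="v1beta1",
        namespace="gitlab-runner",           # placeholder namespace
        plural="pods",
        label_selector="app=gitlab-runner",  # placeholder label selector
    )
    for pod in metrics["items"]:
        for container in pod["containers"]:
            # container["usage"]["memory"] is a quantity string such as "512Mi"
            print(pod["metadata"]["name"], container["name"], container["usage"]["memory"])
    time.sleep(300)  # sample every 5 minutes
```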
For a bit more context, this runner platform handles scheduled jobs, along with regular jobs triggered by commits.
The scheduled jobs are the most intensive ones, possibly triggering more than 10 jobs in parallel.
Those jobs often run into timeouts (due to our system performance, but worth mentioning as context in case it helps the investigation).
The most frequent log patterns from the runner platform container are the following:
Failed to process runner (1081 occurrences)
Job failed: command terminated with exit code 1 (625 occurrences)
Job failed: execution took longer than 1h0m0s seconds (368 occurrences)
Error while executing file based variables removal script (454 occurrences)
Job failed: canceled (54 occurrences)
Apart from that, I filtered out the uninteresting log patterns /Appending trace to coordinator|Submitting job to coordinator|Updating job|Checking for jobs|Job succeeded/.
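For reference, counts like the ones above can be extracted from a dump of the container logs with a small script along these lines (the log file name is a placeholder; the patterns are just the messages listed above):

```python
import re
from collections import Counter

# Noise filter mirroring the regex mentioned above
NOISE = re.compile(
    r"Appending trace to coordinator|Submitting job to coordinator|"
    r"Updating job|Checking for jobs|Job succeeded"
)

# Patterns of interest (substrings/regexes of the messages listed above)
PATTERNS = [
    r"Failed to process runner",
    r"Job failed: command terminated with exit code 1\b",
    r"Job failed: execution took longer than",
    r"Error while executing file based variables removal script",
    r"Job failed: canceled",
]

counts = Counter()
with open("runner.log") as f:  # placeholder: a dump of the runner platform container logs
    for line in f:
        if NOISE.search(line):
            continue
        for pattern in PATTERNS:
            if re.search(pattern, line):
                counts[pattern] += 1
                break

for pattern, count in counts.most_common():
    print(f"{count:6d}  {pattern}")
```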
We also have a few occurrences of Job failed (system failure): pods "runner-XXX-project-XXX-concurrent-XXX" not found (10 to 20 occurrences over the last 2 days) and Error streaming logs gitlab/runner-XXX-project-XXX-concurrent-XXX/helper:/logs-12775-9413290/output.log: command terminated with exit code 143. Retrying... (5 to 10 occurrences over the last 2 days).
If you need any more info, I will be happy to provide it.