Memory leak when jobs are cancelled in the Runner Kubernetes executor
Workaround: revised 2024-02-08
One thing you can do to improve this situation is change your config.toml to set "cleanup_grace_period_seconds" in [runners.kubernetes] to a small value in order to force the memory to be freed more quickly. In 16.8 a zero value for cleanup_grace_period_seconds is ignored, but in 16.9 you'll be able to set the configuration value to zero as well.
Refer to analysis here.
Summary
We are currently experiencing memory leak on our kubernetes shared runner when verbose jobs are canceled : the runner is OOMkilled and pods-jobs are stucked(see related issues).
Reproduced on gitlab-runner 13.9 and 13.11 with 256Mi, 512Mi and 1024Mi limits. (Through Gitlab 13.9.4)
Effects
- Pod-jobs stay stucked : wasted resources and associated cost
- After canceling jobs the gitlab-runner CPU usage explodes from 200m to 600m
- After canceling jobs the gitlab-runner Memory usage explodes from 11Mi to 600Mi and finaly get OOMkilled
How to reproduce
- Configure a kube-executor runner
- Create a project with this gitlab-ci.yml :
crash-test:
image: alpine:latest
script:
- |
for i in $(seq 1 1 500)
do
cat <<EOF
3:45:20,863 DEBUG [org.hibernate.SQL] - select nextval ('hibernate_sequence')
3:45:20,907 DEBUG [org.hibernate.SQL] - select nextval ('hibernate_sequence')
3:45:20,939 DEBUG [org.hibernate.SQL] - insert into Author (firstName, lastName, version, id) values (?, ?, ?, ?)
3:45:20,950 TRACE [org.hibernate.type.descriptor.sql.BasicBinder] - binding parameter [1] as [VARCHAR] - [Thorben]
3:45:20,951 TRACE [org.hibernate.type.descriptor.sql.BasicBinder] - binding parameter [2] as [VARCHAR] - [Janssen]
3:45:20,952 TRACE [org.hibernate.type.descriptor.sql.BasicBinder] - binding parameter [3] as [INTEGER] - [0]
3:45:20,953 TRACE [org.hibernate.type.descriptor.sql.BasicBinder] - binding parameter [4] as [BIGINT] - [1]
3:45:20,960 DEBUG [org.hibernate.SQL] - insert into Author (firstName, lastName, version, id) values (?, ?, ?, ?)
3:45:20,960 TRACE [org.hibernate.type.descriptor.sql.BasicBinder] - binding parameter [1] as [VARCHAR] - [Nicolia]
3:45:20,960 TRACE [org.hibernate.type.descriptor.sql.BasicBinder] - binding parameter [2] as [VARCHAR] - [Parlog]
3:45:20,961 TRACE [org.hibernate.type.descriptor.sql.BasicBinder] - binding parameter [3] as [INTEGER] - [0]
3:45:20,961 TRACE [org.hibernate.type.descriptor.sql.BasicBinder] - binding parameter [4] as [BIGINT] - [2]
3:45:21,488 DEBUG [org.hibernate.SQL] - update Author set firstName=?, lastName=?, version=? where id=? and version=?
3:45:21,494 TRACE [org.hibernate.type.descriptor.sql.BasicBinder] - binding parameter [1] as [VARCHAR] - [Nicolai]
3:45:21,503 TRACE [org.hibernate.type.descriptor.sql.BasicBinder] - binding parameter [2] as [VARCHAR] - [Parlog]
3:45:21,505 TRACE [org.hibernate.type.descriptor.sql.BasicBinder] - binding parameter [3] as [INTEGER] - [1]
3:45:21,509 TRACE [org.hibernate.type.descriptor.sql.BasicBinder] - binding parameter [4] as [BIGINT] - [2]
3:45:21,510 TRACE [org.hibernate.type.descriptor.sql.BasicBinder] - binding parameter [5] as [INTEGER] - [0]
3:45:21,522 DEBUG [org.hibernate.SQL] - select author0_.id as id1_0_, author0_.firstName as firstNam2_0_, author0_.lastName as lastName3_0_, author0_.version as version4_0_ from Author author0_ where author0_.lastName=?
3:45:21,524 TRACE [org.hibernate.type.descriptor.sql.BasicBinder] - binding parameter [1] as [VARCHAR] - [Janssen]
3:45:21,531 TRACE [org.hibernate.type.descriptor.sql.BasicExtractor] - extracted value ([id1_0_] : [BIGINT]) - [1]
3:45:21,541 INFO [org.hibernate.engine.internal.StatisticalLoggingSessionEventListener] - Session Metrics {
71600 nanoseconds spent acquiring 1 JDBC connections;
42200 nanoseconds spent releasing 1 JDBC connections;
5946500 nanoseconds spent preparing 6 JDBC statements;
9801200 nanoseconds spent executing 6 JDBC statements;
0 nanoseconds spent executing 0 JDBC batches;
0 nanoseconds spent performing 0 L2C puts;
0 nanoseconds spent performing 0 L2C hits;
0 nanoseconds spent performing 0 L2C misses;
55887200 nanoseconds spent executing 2 flushes (flushing a total of 4 entities and 0 collections);
27027800 nanoseconds spent executing 1 partial-flushes (flushing a total of 2 entities and 2 collections)
EOF
sleep 0.2
done
parallel: 5
- Trigger this project and after ~1m cancel all jobs from the Gitlab UI. (You may need to do it 2-3 times to reproduce it)
Logs
Runner logs
Kubernetes event
default 3m26s Warning OOMKilling node/ip-192-168-0-10.us-west-1.compute.internal Killed process 1465 (gitlab-runner) total-vm:1161148kB, anon-rss:1041752kB, file-rss:29416kB, shmem-rss:0kB
Related issues
Edited by Darren Eastman