Flaky test: TestRunIntegrationTestsWithFeatureFlag in integration_k8s test suite

The integration_k8s test suite has recently been failing frequently, at random, on the TestRunIntegrationTestsWithFeatureFlag test. An example can be found at https://gitlab.com/gitlab-org/gitlab-runner/-/jobs/1833454016#L850 with the specific failure at https://gitlab.com/gitlab-org/gitlab-runner/-/jobs/1833454016#L1410.

From what I've seen, the failure is the same each time:

--- FAIL: TestRunIntegrationTestsWithFeatureFlag/testKubernetesGarbageCollection_FF_USE_LEGACY_KUBERNETES_EXECUTION_STRATEGY_false/pod_deletion_during_prepare_stage_in_custom_namespace (0.00s)
    kubernetes_integration_test.go:724: 
        	Error Trace:	kubernetes_integration_test.go:724
        	            				kubernetes_integration_test.go:767
        	Error:      	Received unexpected error:
        	            	The POST operation against Namespace could not be completed at this time, please try again.
        	Test:       	TestRunIntegrationTestsWithFeatureFlag/testKubernetesGarbageCollection_FF_USE_LEGACY_KUBERNETES_EXECUTION_STRATEGY_false/pod_deletion_during_prepare_stage_in_custom_namespace

From what I can see, the timeout here is defined by a context and is set to 1 minute. So for some reason the namespace creation operation sometimes takes more than a minute in the test environment, which randomly fails the test, and with it the job and the pipeline.

As a short-term workaround we will add an automatic retry to the integration_k8s test suite. But as it potentially adds up to 30 minutes to the pipeline execution time, we should find out what is causing the random failures and fix the problem properly.
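
For comparison with retrying the whole suite, here is a minimal sketch of a narrower option: retrying only the single failing operation, with a fresh context timeout per attempt. This is not the actual test code and it does not use the Kubernetes client. retryWithTimeout, simulateNamespaceCreate, and errTransient are hypothetical names, and the attempt count and backoff are arbitrary assumptions. It also does not explain why namespace creation is slow, so it would only reduce the cost of the workaround.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// errTransient is a hypothetical stand-in for the "please try again" error
// seen in the log above.
var errTransient = errors.New("operation could not be completed at this time, please try again")

// retryWithTimeout calls op up to maxAttempts times. Every attempt gets its
// own context limited to perAttempt, and the function waits for backoff
// between attempts. It returns the number of attempts made and the final
// error (nil on success).
func retryWithTimeout(
	parent context.Context,
	maxAttempts int,
	perAttempt, backoff time.Duration,
	op func(ctx context.Context) error,
) (int, error) {
	if maxAttempts < 1 {
		maxAttempts = 1
	}

	var err error
	for attempt := 1; attempt <= maxAttempts; attempt++ {
		ctx, cancel := context.WithTimeout(parent, perAttempt)
		err = op(ctx)
		cancel()
		if err == nil {
			return attempt, nil
		}
		if attempt == maxAttempts {
			break
		}

		// Wait before the next attempt, but stop early if the parent
		// context is cancelled.
		select {
		case <-time.After(backoff):
		case <-parent.Done():
			return attempt, parent.Err()
		}
	}
	return maxAttempts, fmt.Errorf("giving up after %d attempts: %w", maxAttempts, err)
}

// simulateNamespaceCreate is a hypothetical stand-in for the real namespace
// creation call: the operation fails `failures` times and then succeeds.
// The 1 minute per-attempt timeout mirrors the value mentioned above; the
// 3 attempts and 10ms backoff are assumptions chosen for the example.
func simulateNamespaceCreate(failures int) (int, error) {
	calls := 0
	op := func(ctx context.Context) error {
		calls++
		if calls <= failures {
			return errTransient
		}
		return nil
	}
	return retryWithTimeout(context.Background(), 3, time.Minute, 10*time.Millisecond, op)
}

func main() {
	attempts, err := simulateNamespaceCreate(2)
	fmt.Println("attempts:", attempts, "err:", err)
}
```

With two simulated failures the third attempt succeeds. With three or more, the function gives up and returns an error that wraps the last failure, so the test would still fail visibly instead of hiding a persistent problem.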