Kubernetes Fault Tolerance Feedback issue
Feedback: GitLab Runner Fault Tolerance Feature
Overview
We've recently implemented fault tolerance for GitLab Runner, allowing Runner Managers to resume running jobs after restarts or failures. This feature helps address orphaned Kubernetes pods and jobs stuck in the "Running" state when Runner Managers restart.
The initial implementation supports:
- Kubernetes executor with attach strategy
- File store for saving job execution state
- Seamless resumption of running jobs after Runner Manager restarts
Feedback Request
We'd like to gather feedback from users who have tested this feature to help guide future improvements. Please share your experiences with:
- General functionality - Did the feature work as expected? Were you able to resume jobs after Runner Manager restarts?
- Configuration - Was the configuration intuitive? Did you encounter any issues setting up the store or other options?
- Deployment scenarios - How did you deploy Runner with fault tolerance? Single instance, multiple instances, Helm chart, Runner Operator, etc.
- Performance impact - Did you notice any performance changes when fault tolerance was enabled?
- Store behavior - How did the File store perform in your environment? Any issues with cleanup, space usage, or job resumption?
- Edge cases - Did you encounter any unexpected behavior or edge cases we should address?
- Feature requests - What additional capabilities would make this feature more valuable to you? (e.g., additional store types, support for other executors)
Usage Information
Please include the following information when providing feedback:
- GitLab Runner version
- Executor configuration (relevant parts of your config.toml)
- Deployment environment (Kubernetes version, cloud provider if applicable)
- Any relevant error messages or logs
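Excerpts from config.toml usually contain the runner's authentication token, so it is worth stripping credentials before pasting them here. The following is a minimal sketch, assuming secrets appear as quoted `key = "value"` lines whose key contains `token`, `password`, or `secret`. The helper name `redact_config` is hypothetical and not part of GitLab Runner.

```python
import re

# Matches lines such as `token = "..."` where the key contains
# "token", "password", or "secret" and the value is a quoted string.
_SECRET_LINE = re.compile(
    r'^(\s*[\w-]*(?:token|password|secret)[\w-]*\s*=\s*)(["\']).*?\2',
    re.IGNORECASE,
)


def redact_config(text: str) -> str:
    """Return the config text with secret-looking values replaced.

    Hypothetical helper for illustration only; review the output
    by hand before sharing it.
    """
    redacted = []
    for line in text.splitlines():
        # Keep the key and the "=" (group 1), replace only the value.
        redacted.append(_SECRET_LINE.sub(r'\1"[REDACTED]"', line))
    return "\n".join(redacted)
```

For example, calling `redact_config(open("config.toml").read())` returns the same text with matching values replaced by `"[REDACTED]"`. The pattern is deliberately broad, so it may also redact harmless keys that happen to contain one of those words.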
Future Development
Based on your feedback, we plan to:
- Consider supporting additional store types (e.g., Redis)
- Potentially expand support to other executors
- Improve handling of edge cases and error scenarios
- Enhance the fault tolerance documentation
Your input is valuable in helping us prioritize these improvements. Thank you for testing this feature!