Recover panics from step-runner gRPC methods using an interceptor
## Summary

Implement panic recovery for step-runner gRPC methods so that the Runner handles panics gracefully, cleans up properly, and reports errors correctly.

## Background

Following the discussion in [!438](https://gitlab.com/gitlab-org/step-runner/-/merge_requests/438#note_3168124201): panics in step functions are now caught, but we still need broader panic recovery at the gRPC level to handle panics in step-runner's own code.

Key points from the discussion:

- gRPC does not have built-in panic recovery like `net/http`'s standard server
- A ready-made interceptor is available: [go-grpc-middleware/recovery](https://github.com/grpc-ecosystem/go-grpc-middleware/tree/main/interceptors/recovery)
- Given decent isolation between requests, a panic should only affect the one client/job that triggered it

## Requirements

1. **Use a gRPC interceptor** to recover panics from all gRPC methods
   - Consider using the [go-grpc-middleware recovery interceptor](https://github.com/grpc-ecosystem/go-grpc-middleware/tree/main/interceptors/recovery)
2. **Ensure the Runner stops and cleans up**
   - If we have the job ID in context, immediately go into cleanup mode
   - Delete any state that could have been created for that job
   - Handle inconsistent state (e.g., the job entry in the syncmap may or may not have been recorded)
3. **Classify the error correctly**
   - Return appropriate gRPC error codes (e.g., `codes.Internal` for panics)
   - Ensure the Runner can differentiate between "step runner exploded" and "a function failed to run"
   - Map to the correct failure reasons so monitoring/alarms are triggered appropriately (e.g., `RunnerSystemFailure` should set off alarms)
4. **Cancel async processes during recovery**
   - Investigate using context cancellation to stop any async processes that need to be stopped
   - Ensure all goroutines associated with the panicked request are properly terminated

## Additional Considerations

- Include the stack trace in panic error messages (`debug.Stack()`)
- Log panics and report them to the observability platform
- Treat these panics as unrecoverable for reconnection purposes
- The session is broken after a panic; subsequent calls such as `Status` after `FollowLogs` panics don't make sense

## Related

- MR: [!438 - Catch panics in StepFunction.Run](https://gitlab.com/gitlab-org/step-runner/-/merge_requests/438)
- Discussion: [note_3168124201](https://gitlab.com/gitlab-org/step-runner/-/merge_requests/438#note_3168124201)