Agentk: Track agentk applier errors and send them to Rails
MR: gitlab-org/cluster-integration/gitlab-agent!1033 (merged)
Description
Currently, if any errors occur while applying the Kubernetes resources in GA4K, we just log and ignore them. We should send this information back to Rails.
How
- Currently, we are listening for events of the applier through a goroutine
- Pass an error channel to this goroutine. It will send errors on this channel as it loops over the events
- Outside this goroutine, somewhere in worker.go, watch for messages on this error channel and keep track of them (maybe like we do for terminated workspaces)
- During the workspace updates call from agentk to Rails, send this information in the Error field of WorkspaceAgentInfo (related - #396882 (closed))
- Once Rails confirms that it has persisted this state in the DB (actual_state=Error), remove it from the local tracker
- For now, we can just log the error message on Rails.
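
The steps above could be sketched roughly as follows. This is a minimal illustration, not the agentk implementation: the names applierError, errorTracker, watch, snapshot and acknowledge are all hypothetical, and the real tracker may look more like the existing one for terminated workspaces.

```go
package main

import (
	"fmt"
	"sync"
)

// applierError is a hypothetical message sent on the error channel by the
// goroutine that loops over applier events.
type applierError struct {
	WorkspaceName string
	Message       string
}

// errorTracker is a hypothetical in-memory tracker. It is guarded by a mutex
// because applier errors arrive asynchronously from another goroutine.
type errorTracker struct {
	mu      sync.Mutex
	pending map[string]string // workspace name -> latest error message
}

func newErrorTracker() *errorTracker {
	return &errorTracker{pending: make(map[string]string)}
}

// watch drains errCh until it is closed, keeping the latest error per workspace.
func (t *errorTracker) watch(errCh <-chan applierError) {
	for e := range errCh {
		t.mu.Lock()
		t.pending[e.WorkspaceName] = e.Message
		t.mu.Unlock()
	}
}

// snapshot returns a copy of the tracked errors, to be attached to the next
// workspace updates call to Rails.
func (t *errorTracker) snapshot() map[string]string {
	t.mu.Lock()
	defer t.mu.Unlock()
	out := make(map[string]string, len(t.pending))
	for name, msg := range t.pending {
		out[name] = msg
	}
	return out
}

// acknowledge removes an entry once Rails confirms it was persisted, but only
// if no newer error replaced the one that was sent.
func (t *errorTracker) acknowledge(name, sentMessage string) {
	t.mu.Lock()
	defer t.mu.Unlock()
	if msg, ok := t.pending[name]; ok && msg == sentMessage {
		delete(t.pending, name)
	}
}

func main() {
	errCh := make(chan applierError)
	tracker := newErrorTracker()
	done := make(chan struct{})
	go func() {
		tracker.watch(errCh)
		close(done)
	}()

	// Stand-in for the applier event loop reporting a failure.
	errCh <- applierError{WorkspaceName: "workspace-1", Message: "apply failed"}
	close(errCh)
	<-done

	sent := tracker.snapshot()
	fmt.Println(len(sent), sent["workspace-1"])
	tracker.acknowledge("workspace-1", sent["workspace-1"])
	fmt.Println(len(tracker.snapshot()))
}
```

Comparing the acknowledged message against the stored one is one way to avoid dropping an error that arrived between sending the update and receiving the confirmation from Rails.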
Questions
- We need to think about what happens if the local tracker holds an error that has not yet been persisted in Rails and agentk restarts.
- What kind of errors does the applier throw? Ideally, we are interested in the errors raised when the Kubernetes resources fail to apply for whatever reason.
- What is the structure of the Error field that we return? Is it just a string? Or do we need to pass some additional structured metadata, error type sub-field, etc?
- Do we need to consider other scenarios when designing the error field structure other than just applier errors? For example, any of the error scenarios described in Robust Error Handling and Logging (&10461 - closed)?
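
To make the structure question concrete, here is one possible shape for a structured error, sketched under assumptions. Only the WorkspaceAgentInfo name and its Error field come from this issue; the sub-fields type and message, the "applier" value, the JSON encoding and the encodeInfo helper are hypothetical.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// errorType is a hypothetical sub-field that would let Rails distinguish
// applier errors from other error scenarios added later.
type errorType string

const errorTypeApplier errorType = "applier"

// workspaceError is a hypothetical structured alternative to a plain string.
type workspaceError struct {
	Type    errorType `json:"type"`
	Message string    `json:"message"`
}

// workspaceAgentInfo mirrors only the parts of WorkspaceAgentInfo relevant
// here. The pointer with omitempty means workspaces without errors send no
// error key at all.
type workspaceAgentInfo struct {
	Name  string          `json:"name"`
	Error *workspaceError `json:"error,omitempty"`
}

// encodeInfo serializes the info as it might be sent to Rails.
func encodeInfo(info workspaceAgentInfo) (string, error) {
	b, err := json.Marshal(info)
	if err != nil {
		return "", err
	}
	return string(b), nil
}

func main() {
	payload, err := encodeInfo(workspaceAgentInfo{
		Name:  "workspace-1",
		Error: &workspaceError{Type: errorTypeApplier, Message: "apply failed"},
	})
	if err != nil {
		panic(err)
	}
	fmt.Println(payload)
	// prints {"name":"workspace-1","error":{"type":"applier","message":"apply failed"}}
}
```

A type sub-field like this would leave room for non-applier errors without changing the shape of the payload, at the cost of Rails having to validate one more field.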
Acceptance Criteria
- All applier errors are tracked in agentk and returned to Rails when available
- Rails is able to persist errors and render them in the UI
Technical Requirements
- Implementation of the tracking logic interleaves well with any combination of full/partial sync cycles
- Since applier errors are received asynchronously, edge cases pertaining to concurrency issues must be evaluated carefully
- There is adequate test coverage in agentk that verifies this functionality
- There is adequate test coverage in Rails that verifies this functionality
Design Requirements
- How should the returned errors be rendered in the UI?
Impact Assessment
- Existing workspaces with applier errors generated before release will not be affected. Only applier errors generated post-deployment will be tracked
- Tracking failures in K8S resources is out of scope for this task
Edited by Hunar Khanna