Technical design for tracking applier errors
MR: No MR
Track applier errors
Overview
- There may be errors at the time of applying k8s resources. At present these are captured within a goroutine but are not reported in Rails. So in case of errors, the user will be misled in the UI as the workspace will be stuck in the last possible stage.
Special considerations
- Implementation of the tracking logic interleaves well with any combination of full/partial sync cycle
- Since applier errors are received asynchronously, the implementation should be able to accommodate edge cases pertaining to concurrency issues
Solution Overview
The following components are affected as a part of this change:
- agentk
- rails
Changes in agentk
An error tracker will be introduced in the reconciler to keep track of errors encountered as a part of the reconciliation loop. The tracker will be equipped to handle errors available in synchronous as well as asynchronous calls
Error Tracker schema
type errorTracker struct {
mx sync.Mutex
store map[errorTrackerKey]operationState
}
type errorTrackerKey struct {
workspace string
namespace string
}
type operationState struct {
version string
errDetails *errorDetails // errDetails is nil in case no errors have been reported so far
}
type errorDetails struct {
errorType ErrorType
message error
}
type ErrorType int
const (
ErrorTypeApplier ErrorType = "applier"
)
The error tracker will be created and bound to the lifecycle of the reconciler. Therefore, for every full sync, both the existing reconciler and the errorTracker
shall be dropped and created from scratch for subsequent loops.
The errorTracker
works similar to the terminationTracker
in that it may require multiple reconciliation cycles to fully report the error details being tracked to rails as well as clean up the state.
The working of the errorTracker can be understood through its three distinct functions
- capturing an error
- reporting an error
- cleanup after successful reporting
Capturing an error
This takes place within the applyWorkspaceChanges
method but it may occur asynchronously as well. This is especially relevant to calls the k8sClient.applier.Run
function where an error channel is returned. In such cases, the tracker may be updated asynchronously.
Some questions to consider:
- can there be cases where multiple errors are available in the
chan error
in a single call tok8sClient.applier.Run
?- For the sake of simplicity, the proposed changes will capture all errors per action published under the channel. Once the channel is closed, these errors will be merged and reported to rails in one call. The alternative implementation to capture/publish errors as they arrive will be slightly trickier but can be considered if absolutely necessary.
Reporting an error
At the start of each reconciliation loop, a snapshot of errorTracker
will be created to collect the latest errors per workspace tracked so far. This is done in a manner similar to how a snapshot of workspaces is prepared using the k8sInformer
. Another benefit of using a snapshot is that it renders handling of some edge cases unnecessary, thereby making the logic simpler. These cases primarily concern scenarios where the errorTracker
may be updated in the background.
The main work of the reporting happens inside generateWorkspaceAgentInfos
The updated logic will behave as follows:
- For each existing workspace, check if either an unpersisted state exists. If yes, collect it for reporting to rails
- Iterate the
terminatingTracker
to collect information on workspaces with a termination progress to report. The data withinterminationTracker
should be independent of the data within the error snapshot - Iterate through existing entries in the snapshot of errors and collect/prepare payload to report to Rails
Note: there may be entries in the error snapshot which may also have unpersisted data to report to Rails. In such a case, both the latest data as well applier errors must be sent to rails in the same request
Cleanup after reporting
The errorTracker may cleanup an entry of a workspace within the applyWorkspaceChanges
in a manner similar to the terminationTracker
. If the response from Rails is successful, an entry in the errorTracker
may be evicted IFF the current version of the entry in the errorTracker
is the same as the version in the snapshot. This safety check ensures that an entry that tracks errors for a newer action is unaffected.
IMPORTANT: Versioning within errorTracker
One notable aspect of the errorTracker is that it only tracks the latest version of errors received per workspace/namespace. In order to understand why this is necessary, consider a hypothetical implementation errorTracker that doesn't have any concept of versioning i.e. it is essentially a map[(workspace, namespace)]errorDetails
. Let's evaluate how this implementation behaves when subject to the following sequence of actions:
- partial sync 1: receive config 1 for workspace A
- partial sync 2: receive config 2 for workspace A
- (async): applying config 1 throws an error which is captured by the error tracker
- (async): config 2 can be applied successfully without an error
- partial sync 3: an error will be reported for a stale operation where instead it should be Running
With versioning, stale writes can be avoided. Rewriting the same scenario with versioning support:
- partial sync 1: receive config 1 for workspace A. Create entry in errorTracker for workspace A with version 1 and
errorDetails
nil - partial sync 2: receive config 2 for workspace A. Update entry in errorTracker for workspace A with version 2 and
errorDetails
nil - (async): applying config 1 throws an error. The errorTracker will ignore writes for workspace A with version 1 as the latest entry has a higher version
- (async): config 2 can be applied successfully without an error. Since there are no errors, entry for workspace A in errorTracker will be removed IFF the version is 2. This protects entries with a higher version from being removed by a different goroutine
- partial sync 3: no error is reported for a stale operation
The above example also works for cases where a stale error for an earlier action may override errors captured for a more recent action. Since the versions will be synchronously created per reconciliation cycle, they can be made to be monotonically increasing either by using an atomic counter or just the timestamp.
Changes in Rails
In case of errors in the request payload, the workspace ActualState
should transition to Error
if the DesiredState of the workspace has NOT changed since the last reconciliation cycle. This implies that the user has not initiated any other operation in between the reconciliation cycles and the error in the payload will correspond to the last action.
However, if the DesiredState
of a workspace has been modified between reconciliation cycles, then it could be misleading to change the ActualState
of the workspace to Error
as the user may interpret this error to have been caused by the latest action. It should be ok for the reconciliation API on Rails to suppress the error details received in such cases and have the agentk carry out the latest instruction and focus on reporting its errors, if any.
Example of the payload received in Rails:
{
"update_type": "partial",
"workspace_agent_infos": [{
"name": "test-workspace",
"namespace": "test-namespace",
"latest_k8s_deployment_info": ..., // may or may not be populated alongside error_details
"termination_progress": ..., // may or may not be populated alongside error_details
"error_details": {
"error_type": "applier",
"error_message": "what went wrong while applying the configs"
}
}]
}
issue
Questions raised in theWe would have to think about what would happen if there is some error in local tracker which has not yet been persisted in Rails and agentk restarted
Yes, this can happen. However, the restart would just result in the configuration being re-applied. If re-applying the resource doesn't create the error then reporting the lost error serves no purpose and the UI should reflect the latest working state. If the applier fails again, then the error will be propagated to rails in the next reconciliation cycle.
What kind of errors does the applier throw? Ideally what we are interested in are the errors when the kubernetes resources failed to apply for due to XYZ reasons. The applier errors pertain to errors when applying kubernetes sources derived from the provided devfile. In other words, we cannot expect the user to always understand what went wrong due to this translation layer as the user may only be aware of the devfile. So upfront, any sort of categorization would rely on being able to accurately classify errors into categories where the user response to an error may be different.
One way would be to figure out all/most types of errors the applier can throw and to then map them to a category of errors. However, after digging into the code, there are way too many possible errors that may be returned by the applier. In addition, some of the errors are parameterized as well, thereby complicating any categorization at the application layer. There is also a maintenance cost associated with having to update these categories of errors with lib upgrades if the error message content changes.
For the first iteration, perhaps the approach can be to just avoid such a classification of errors and limit the scope to just reporting the existence of an applier error. Even with such a limitation, a workspace user will have enough information (erroneous workspace id/name) to reach out to the cluster administrator to aid with troubleshooting.
What is the structure of the Error field that we return? Is it just a string? Or do we need to pass some additional structured metadata, error type sub-field, etc?
Since there is a clear distinction between an Error and Failure, it would make sense to not use Failure for any of these cases. The payload can be of the form:
enum ErrorType {
APPLIER = "applier";
}
{
"error_details": {
"error_type": "applier",
"error_message": "..."
}
}
We can reserve 2 error types for starters: unknown type of error and an applier error. In the future, this can be extended to capture and deal with errors from other places.
Do we need to consider other scenarios when designing the error field structure other than just applier errors? For example, any of the error scenarios described in Robust Error Handling and Logging ( &10461 (closed))?
Perhaps all errors returned by applyWorkspaceChanges
can be tracked and returned to Rails to indicate something going wrong at the time of making changes to kubernetes resources for a particular workspace