Implement full resync of workspace information
Problem
The agent keeps internal state that gets incrementally updated during each reconciliation loop. We worry about this state getting out of sync with Rails/Kubernetes.
The following classes of issues might occur:
- Memory leaks - due to bugs in agent code, we don't clean up the state completely.
- Rails DB failure and backup - if the Rails DB gets recovered from backup, the agent will "think" that Rails already knows about some workspaces, but Rails won't have the up-to-date DB.
- Some external change to the state/config of the Kubernetes cluster is made directly to the cluster, outside of the agentk reconciliation loop logic.
- ...
The requirement for full resync is this:
The agent sends the latest state of the Kubernetes resources it manages[1] to Rails. Rails processes this request and responds with a confirmation that the resource versions were persisted, along with the expected new state.
[1] All workspace deployments that match our filtering criteria (based on labels).
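To make the requirement concrete, here is a minimal sketch of what the exchanged payloads could look like. The Go type and field names (`WorkspaceAgentInfo`, `FullResyncRequest`, etc.) are illustrative assumptions only, not the actual agentk/Rails contract:

```go
// Hypothetical payload shapes for the full resync exchange. Field and type
// names are illustrative only.
package resync

// WorkspaceAgentInfo is the per-workspace state the agent reports to Rails.
type WorkspaceAgentInfo struct {
	Name            string `json:"name"`
	Namespace       string `json:"namespace"`
	ResourceVersion string `json:"resource_version"` // from the workspace Deployment
	ActualState     string `json:"actual_state"`     // e.g. "Running", "Terminated"
}

// FullResyncRequest carries the latest state of every workspace Deployment
// that matches the agent's label filter.
type FullResyncRequest struct {
	Workspaces []WorkspaceAgentInfo `json:"workspaces"`
}

// WorkspaceRailsInfo is Rails' per-workspace answer: the resource version it
// persisted plus the desired state the agent should converge the cluster to.
type WorkspaceRailsInfo struct {
	Name                     string `json:"name"`
	Namespace                string `json:"namespace"`
	PersistedResourceVersion string `json:"persisted_resource_version"`
	DesiredState             string `json:"desired_state"` // create/update/delete intent
}

// FullResyncResponse confirms persistence and lists the expected new state.
type FullResyncResponse struct {
	Workspaces []WorkspaceRailsInfo `json:"workspaces"`
}
```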
Solution
At regular intervals, drop the agent's internal state and start the reconciliation loop from scratch (the same as if the agent had just started).
This should fix all the classes of issues we can foresee.
We also considered creating separate logic for full resync, but reusing the reconciliation loop has the following advantages:
- Less code and cognitive load.
- The same logic used for both types of sync ensures consistency.
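A minimal sketch of this approach, assuming a hypothetical `reconciler` with an in-memory `state` map and a `reconcile` method; the only point it illustrates is that a periodic tick clears the tracked state so the next reconciliation pass behaves as if the agent had just started:

```go
package resync

import (
	"context"
	"time"
)

// reconciler is a simplified, hypothetical stand-in for agentk's internal
// workspace module state.
type reconciler struct {
	fullResyncEvery time.Duration
	state           map[string]string // workspace name -> last reported resource version
}

func (r *reconciler) run(ctx context.Context) {
	ticker := time.NewTicker(r.fullResyncEvery)
	defer ticker.Stop()

	for {
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
			// Drop all tracked state; the next reconciliation pass rebuilds it
			// from the cluster and reports everything to Rails, exactly as it
			// would on a fresh start.
			r.state = make(map[string]string)
			r.reconcile(ctx)
		}
	}
}

// reconcile is the normal reconciliation loop body (list the label-filtered
// Deployments, diff against r.state, send updates to Rails). Omitted here.
func (r *reconciler) reconcile(ctx context.Context) {}
```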
Original description
This issue tracks the work required to implement a `workspace_updates_full_resync` message type. With this message, agentk will update rails with the current state of all the workspaces in the cluster.
Why this is needed
Under normal circumstances, only relevant updates will be shared between rails and agentk via the `workspace_updates` message type. Data specific to unchanged workspaces will be ignored in this message exchange. As time goes on, unforeseen bugs will likely emerge. If these affect/corrupt the internal state of agentk, it may be difficult to find/fix issues without human intervention. The only way to deal with such a case may be to restart agentk, which would interfere with the execution of other modules running on the agentk instance and make for a bad user experience.
A full resync attempts to address this issue by ensuring that the full cluster state is periodically synced with rails. If any data corruption occurs due to a bug or an unhandled error scenario, it will be temporary, as the agentk state will be recreated after a full resync with rails.
Periodic Resync vs On-demand Resync
After investigating Rails can "poke" agentk via GRPC endpoint on KAS (#387090 - closed), it was decided that:
- The primary use case for the poke feature is to be able to trigger a full resync on demand. After a deeper discussion, it seems that there are very few/rare cases that would merit such a capability, especially given the complex nature of the implementation required.
- Additionally, there isn't a strong enough need for the resync to be carried out on demand or immediately either; an eventual reconciliation of state would suffice for most cases, and it could simply be achieved through a full resync that is carried out periodically (perhaps with a much longer interval between cycles).
(source)
This issue is to implement what was agreed on above - a periodic `workspace_updates_full_resync`.
How
- Send the current state of all the workspaces in the k8s cluster.
- Rails will update postgres with the current state and respond with the workspaces to be created/updated/deleted in the k8s cluster and their last resource version that rails has persisted in postgres.
- Returning the persisted resource version back to agentk gives it a confirmation that the updates for that workspace have been successfully processed on rails end.
- This persisted resource version will also help with sending only the latest workspace changes from agentk to rails for the `workspace_updates` message.
- To keep things consistent between agentk and rails, agentk will send this message every time the agent starts/restarts/goes through leader election (after the `prerequisites` message), and after every `x` intervals of sending the `workspace_updates` message.
- While a `workspace_updates_full_resync` is in progress, we should not have a simultaneous `workspace_updates` running, to make sure we are not running into any edge cases.
- Rails will have to expose a separate endpoint for `workspace_updates_full_resync`, as mentioned here.
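A rough sketch of how these steps could fit together on the agentk side, reusing the hypothetical request/response types from the earlier sketch. `railsClient`, `maybeFullResync`, `listWorkspaceDeployments`, and the counter-based trigger are illustrative assumptions, not the actual implementation:

```go
package resync

import (
	"context"
	"sync"
)

// railsClient is a hypothetical stand-in for the KAS/Rails API surface that
// would expose the separate workspace_updates_full_resync endpoint.
type railsClient interface {
	WorkspaceUpdatesFullResync(ctx context.Context, req FullResyncRequest) (FullResyncResponse, error)
}

type syncer struct {
	mu               sync.Mutex // serializes full resync and workspace_updates cycles
	rails            railsClient
	partialSyncCount int               // workspace_updates cycles since the last full resync
	fullResyncEveryN int               // the "x" intervals from the description above
	persisted        map[string]string // workspace name -> resource version persisted by Rails
}

// maybeFullResync runs a full resync when forced (agent start/restart/leader
// election, after the prerequisites message) or after every N partial syncs.
func (s *syncer) maybeFullResync(ctx context.Context, forced bool) error {
	if !forced {
		if s.fullResyncEveryN <= 0 || s.partialSyncCount%s.fullResyncEveryN != 0 {
			return nil
		}
	}

	s.mu.Lock() // a workspace_updates cycle cannot start while this is held
	defer s.mu.Unlock()

	req := FullResyncRequest{Workspaces: s.listWorkspaceDeployments(ctx)}
	resp, err := s.rails.WorkspaceUpdatesFullResync(ctx, req)
	if err != nil {
		return err
	}

	// Record the resource versions Rails confirmed, so subsequent
	// workspace_updates messages only need to carry newer changes. Applying
	// the returned desired state (create/update/delete) to the cluster is
	// left to the normal reconciliation logic.
	if s.persisted == nil {
		s.persisted = make(map[string]string)
	}
	for _, w := range resp.Workspaces {
		s.persisted[w.Name] = w.PersistedResourceVersion
	}
	s.partialSyncCount = 0
	return nil
}

// listWorkspaceDeployments would list the label-filtered workspace
// Deployments from the cluster; omitted in this sketch.
func (s *syncer) listWorkspaceDeployments(ctx context.Context) []WorkspaceAgentInfo {
	return nil
}
```

The mutex here only stands in for whatever mechanism agentk uses to serialize sync cycles; the property that matters is that a `workspace_updates` cycle cannot run while a full resync is in flight.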