Workspace stuck in limbo if namespace directly deleted and workspace terminated from UI right after it
MR: gitlab-org/cluster-integration/gitlab-agent!1080 (merged)
Description
In an edge case where a workspace is terminated by the user and its namespace is deleted directly in Kubernetes at the same time, the workspace state gets stuck in limbo.
Steps to reproduce
- Create a new workspace
- Go to Kubernetes and delete the namespace in which the workspace has been created
- As soon as the namespace is deleted, immediately "Terminate" the workspace from the UI
Example from production
#<RemoteDevelopment::Workspace:0x00007f8383934438
id: 20,
created_at: Thu, 11 May 2023 12:41:37.500885000 UTC +00:00,
updated_at: Fri, 12 May 2023 08:28:57.375446000 UTC +00:00,
user_id: 10764887,
project_id: 45847091,
cluster_agent_id: 58101,
desired_state_updated_at: Fri, 12 May 2023 07:33:44.883484000 UTC +00:00,
responded_to_agent_at: Fri, 12 May 2023 08:28:57.375446000 UTC +00:00,
max_hours_before_termination: 24,
name: "workspace-58101-10764887-ljc9oj",
namespace: "gl-rd-ns-58101-10764887-ljc9oj",
desired_state: "Terminated",
actual_state: "Running",
editor: "webide",
devfile_ref: "main",
devfile_path: ".devfile.yaml",
url:
"https://60001-workspace-58101-10764887-ljc9oj.workspaces.gitlab.dev?folder=%2Fprojects%2Fexample-nodejs-express-app",
deployment_resource_version: "12352481">,
Analysis
The actual_state of the workspace shows as Running, desired_state as Terminated, and responded_to_agent_at is more recently updated than desired_state_updated_at.
- During partial reconciliation, Rails will not send information about the workspace because responded_to_agent_at is more recently updated than desired_state_updated_at.
- During full reconciliation, Rails will send information about the workspace, but there will be no desired_config_to_apply since the desired_state is Terminated.
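The two reconciliation paths described above can be modelled with a minimal sketch. It is not the actual Rails implementation: the Workspace struct and both helper methods are hypothetical stand-ins, and only the two conditions are taken from the analysis. The timestamps are the ones from the production example.

```ruby
require "time"

# Hypothetical, simplified stand-in for RemoteDevelopment::Workspace.
Workspace = Struct.new(
  :name, :desired_state, :actual_state,
  :desired_state_updated_at, :responded_to_agent_at,
  keyword_init: true
)

# Partial reconciliation (assumed shape): a workspace is only sent when its
# desired state changed after Rails last responded to the agent.
def send_in_partial_reconciliation?(workspace)
  workspace.desired_state_updated_at > workspace.responded_to_agent_at
end

# Full reconciliation (assumed shape): the workspace is always sent, but no
# config is produced when the desired state is Terminated.
def config_to_apply_in_full_reconciliation(workspace)
  return nil if workspace.desired_state == "Terminated"

  "config for #{workspace.name}" # placeholder for the real config
end

stuck = Workspace.new(
  name: "workspace-58101-10764887-ljc9oj",
  desired_state: "Terminated",
  actual_state: "Running",
  desired_state_updated_at: Time.parse("2023-05-12 07:33:44 UTC"),
  responded_to_agent_at: Time.parse("2023-05-12 08:28:57 UTC")
)

puts send_in_partial_reconciliation?(stuck)                 # false
puts config_to_apply_in_full_reconciliation(stuck).inspect  # nil
```

Under these assumptions neither path gives agentk anything to act on, so actual_state is never moved away from Running.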
Temporary fix
Since this only occurs if someone manually deletes the namespace and terminates the workspace in Rails at the same time, it is unlikely to happen frequently.
However, if it does happen, the workaround is to create a deployment with the same name as the workspace in the expected namespace. The state then gets resolved automatically.
Long term fix
- Figure out where to add this check. One option is a check in agentk: if it receives a workspace with actual_state=Running, desired_state=Terminated but the workspace does not exist, it should report actual_state=Terminated.
- To catch such edge cases, we need to fuzz test the communication between agentk and Rails.
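The check proposed in the first item could look roughly like the sketch below. It is written in Ruby as a model of the decision only. The WorkspaceInfo struct, the reported_actual_state method, and the existing_names set are hypothetical names, not part of agentk.

```ruby
require "set"

# Hypothetical model of the workspace info that agentk receives from Rails.
WorkspaceInfo = Struct.new(:name, :desired_state, :actual_state, keyword_init: true)

# existing_names: names of the workspaces agentk can still find in the cluster
# (an assumption made for this sketch).
def reported_actual_state(info, existing_names)
  missing = !existing_names.include?(info.name)

  if missing && info.actual_state == "Running" && info.desired_state == "Terminated"
    # Nothing is left to terminate, so report the termination as complete.
    "Terminated"
  else
    info.actual_state
  end
end

info = WorkspaceInfo.new(
  name: "workspace-58101-10764887-ljc9oj",
  desired_state: "Terminated",
  actual_state: "Running"
)

puts reported_actual_state(info, Set.new)               # Terminated
puts reported_actual_state(info, Set.new([info.name]))  # Running
```

The check is deliberately narrow: it only overrides the state for the exact combination described in this issue, so workspaces that still exist keep going through the normal termination path.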
Acceptance Criteria
- Fix the issue by tracking Terminating and Termination progress in the tracker
- Add specs to verify the changes made
Technical Requirements
- Since this is an exclusively backend issue, the technical requirements are the same as the acceptance criteria
Design Requirements
NA
Impact Assessment
After the fix is deployed, it should also correct existing workspaces affected by the bug, updating their status to Terminated.