Workspace stuck in limbo if namespace is deleted directly and workspace is terminated from the UI right after

MR: gitlab-org/cluster-integration/gitlab-agent!1080 (merged)

Description

In an edge case where a workspace is terminated by the user and its namespace is deleted directly in Kubernetes at the same time, the workspace state gets stuck in limbo.

Steps to reproduce

1. Create a new workspace.
2. Go to Kubernetes and delete the namespace in which the workspace was created.
3. As soon as the namespace is deleted, immediately "Terminate" the workspace from the UI.

Example from production

 #<RemoteDevelopment::Workspace:0x00007f8383934438
  id: 20,
  created_at: Thu, 11 May 2023 12:41:37.500885000 UTC +00:00,
  updated_at: Fri, 12 May 2023 08:28:57.375446000 UTC +00:00,
  user_id: 10764887,
  project_id: 45847091,
  cluster_agent_id: 58101,
  desired_state_updated_at: Fri, 12 May 2023 07:33:44.883484000 UTC +00:00,
  responded_to_agent_at: Fri, 12 May 2023 08:28:57.375446000 UTC +00:00,
  max_hours_before_termination: 24,
  name: "workspace-58101-10764887-ljc9oj",
  namespace: "gl-rd-ns-58101-10764887-ljc9oj",
  desired_state: "Terminated",
  actual_state: "Running",
  editor: "webide",
  devfile_ref: "main",
  devfile_path: ".devfile.yaml",
  url:
   "https://60001-workspace-58101-10764887-ljc9oj.workspaces.gitlab.dev?folder=%2Fprojects%2Fexample-nodejs-express-app",
  deployment_resource_version: "12352481">,

Analysis

The workspace has actual_state Running and desired_state Terminated, and responded_to_agent_at (08:28:57 UTC) is more recent than desired_state_updated_at (07:33:44 UTC).

During partial reconciliation, Rails will not send information about the workspace because responded_to_agent_at is more recent than desired_state_updated_at.

During full reconciliation, Rails will send information about the workspace, but there will be no desired_config_to_apply since the desired_state is Terminated. As a result, neither reconciliation path gives the agent anything to act on, and actual_state stays Running.
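
The two conditions above can be sketched as follows. This is a minimal illustration, not the actual Rails implementation: the Workspace struct and both helper methods are hypothetical names, and the rules are simplified from the analysis in this issue. The timestamps are taken from the production example.

```ruby
# Hypothetical, simplified model of a workspace record. Only the fields
# discussed in the analysis are included.
Workspace = Struct.new(
  :desired_state, :actual_state,
  :desired_state_updated_at, :responded_to_agent_at,
  keyword_init: true
)

# Assumed rule for partial reconciliation: a workspace is sent only if its
# desired state changed after Rails last responded to the agent.
def sent_in_partial_reconciliation?(workspace)
  workspace.desired_state_updated_at > workspace.responded_to_agent_at
end

# Assumed rule for full reconciliation: every workspace is sent, but a
# config to apply is included only when the desired state is not Terminated.
def config_to_apply_in_full_reconciliation?(workspace)
  workspace.desired_state != "Terminated"
end

stuck = Workspace.new(
  desired_state: "Terminated",
  actual_state: "Running",
  desired_state_updated_at: Time.utc(2023, 5, 12, 7, 33, 44),
  responded_to_agent_at: Time.utc(2023, 5, 12, 8, 28, 57)
)

puts sent_in_partial_reconciliation?(stuck)         # prints "false"
puts config_to_apply_in_full_reconciliation?(stuck) # prints "false"
```

Both checks come back false for the stuck workspace, which matches the observed behaviour: nothing triggers the agent to report a new actual_state.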

Temporary fix

Since this only occurs if someone manually deletes the namespace and terminates the workspace in Rails at the same time, it is unlikely to occur frequently.

However, if it does happen, the workaround is to create a deployment with the same name as the workspace in the expected namespace. The workspace state should then resolve automatically.

Long term fix

Figure out where to add this check. One option is a check in agentk: if it receives a workspace with actual_state=Running and desired_state=Terminated but the workspace does not exist in the cluster, it should report actual_state=Terminated.
To catch such edge cases, we need fuzz testing of the communication between agentk and Rails.
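
The proposed check could look roughly like the sketch below. It is written in Ruby only to match the example in this issue; agentk itself is written in Go, and the method name and its exists_in_cluster parameter are hypothetical, not part of any existing API.

```ruby
# Hypothetical helper: decides which actual_state the agent should report
# for a workspace it received from Rails.
def actual_state_to_report(desired_state:, actual_state:, exists_in_cluster:)
  # Proposed edge-case check: Rails believes the workspace is still running
  # and wants it terminated, but nothing exists in the cluster any more.
  if desired_state == "Terminated" && actual_state == "Running" && !exists_in_cluster
    "Terminated"
  else
    # Otherwise leave the state unchanged (assumption for this sketch).
    actual_state
  end
end

puts actual_state_to_report(
  desired_state: "Terminated",
  actual_state: "Running",
  exists_in_cluster: false
) # prints "Terminated"
```

With such a check, the stuck workspace would be reported as Terminated on the next full reconciliation, since that is when Rails still sends it to the agent.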

Acceptance Criteria

Fix the issue by tracking Terminating and Termination progress in the tracker
Add specs to verify the changes made

Technical Requirements

Since this is an exclusively backend issue, the technical requirements are the same as the acceptance criteria.

Design Requirements

Impact Assessment

Once deployed, the fix should also correct existing workspaces affected by the bug, updating their status to Terminated.

Edited Aug 28, 2023 by Vishal Tak