Workspace Suspension with Delayed Termination
### Background

By default, users currently set a termination timeout when they create a Workspace. Based on user feedback in https://gitlab.com/gitlab-org/gitlab/-/issues/478887+, we want to change this so that the Workspace is stopped by default and the original timeout governs total termination.

## Related ongoing work

As it stands, when users create a workspace they set the "**Workspace automatically terminates after**" field, which is capped by the agent's `max_hours_before_termination_limit` (currently a hardcoded value of 120 from the user's point of view). Work is in progress in https://gitlab.com/groups/gitlab-org/-/epics/11631+ to make this limit configurable. A configurable limit will let workspace users extend their workspace lifetime for a longer period (currently capped at 1 year for technical reasons).

### Problems with relying on this approach

This approach does not consider the resources that would be consumed if we allow workspaces to exist for extended periods while their creators are not using them. It also does not fit our future goal of implementing a feedback mechanism that detects "user inactivity" and stops (temporarily pauses) the workspace.

## Solution

- Add `max_active_hours_before_stop` to the `remote_development_agent_configs` table. This field is the number of hours since the `desired_state` of a workspace last changed (determined from `desired_state_updated_at`) after which all non-`Stopped` and non-`Terminated` workspaces will be set to `Stopped`.
- Add `max_stopped_hours_before_termination` to the `remote_development_agent_configs` table as an allowance period, after which all `Stopped` workspaces will be set to `Terminated`.

### Validations

We need to ensure that certain invariants are not violated when these new configurations are added or updated.
There could be instances where updating one of these `limit`/`interval` fields breaks another assumption and puts us in a bad state. We have already foreseen these kinds of issues and plan to version the configuration at the time of creation. See: https://gitlab.com/groups/gitlab-org/-/epics/14872+

Assuming versioned configs are in place, the validations we want to enforce are:

- `max_active_hours_before_stop` + `max_stopped_hours_before_termination` < 8760 (the number of hours in one year)
- Both values must be greater than 0.

### Default values

- `max_active_hours_before_stop`: Default to 36 hours (1.5 days), so that a workspace usually stops outside of working hours, assuming it was last restarted during working hours.
- `max_stopped_hours_before_termination`: Default to 1 month, so that the workspace is still available with no data loss even after normal vacation periods. A longer interval here is OK, because a stopped workspace should be consuming minimal resources.

## Workspace Update flow with new fields

```plantuml
@startuml
start
if ( desired_state == "Terminated" ) then (true)
  :no action;
  kill
else (false)
  if ( desired_state != "Stopped" && desired_state_updated_at + max_active_hours_before_stop < current_time ) then (true)
    :set desired_state = "Stopped";
    kill
  else (false)
    if ( desired_state == "Stopped" && desired_state_updated_at + max_stopped_hours_before_termination < current_time ) then (true)
      :set desired_state = "Terminated";
      kill
    else (false)
      :no action;
      kill
    endif
  endif
endif
@enduml
```

In this new scheme, we would add a check for whether enough time has passed to set the workspace to `Stopped`:

```ruby
# `.hours` comes from ActiveSupport, which is available in Rails.
# The workspace is due to stop once the active window has fully elapsed.
if desired_state != "Terminated" && desired_state != "Stopped" &&
    desired_state_updated_at + max_active_hours_before_stop.hours < current_time
  # set desired_state to "Stopped"
end
```

We will also want to know whether enough time has passed to terminate `Stopped` workspaces:

```ruby
# A "Stopped" workspace is terminated once the allowance period has elapsed.
if desired_state == "Stopped" &&
    desired_state_updated_at + max_stopped_hours_before_termination.hours < current_time
  # set desired_state to "Terminated"
end
```

### How do we implement setting these states?

**Using the reconciliation path**: We already have periodic reconciliation intervals in which Rails syncs the updated values in the database to Kubernetes. We could hook into the reconciliation flow and apply these state changes there. I propose adding a new step to [the main ROP reconciliation chain](https://gitlab.com/gitlab-org/gitlab/-/blob/332dff478b47c193ce573fe1a6a9386da1cc6930/ee/lib/remote_development/workspace_operations/reconcile/main.rb#L24-24), ideally before the [`WorkspacesToBeReturnedFinder`](https://gitlab.com/gitlab-org/gitlab/-/blob/332dff478b47c193ce573fe1a6a9386da1cc6930/ee/lib/remote_development/workspace_operations/reconcile/persistence/workspaces_to_be_returned_finder.rb#L49) step. The ordering matters because that step fetches all workspaces for which `with_desired_state_updated_more_recently_than_last_response_to_agent` is true, so the state changes made by the new step need to be in place before it runs.

## Consequence of this approach

- As part of the versioning plans highlighted above, workspace timeout fields would be tied to a config version, so creating a new config version will not affect preexisting workspaces. The admin might therefore have to manually terminate existing workspaces if circumstances change.

## Considerations:

- This intersects with the versioning epic https://gitlab.com/groups/gitlab-org/-/epics/14872+. Is this work urgent, or can it wait until versioning is done? IMO we should have versioning ready before embarking on new work that adds fields like these, to prevent confusion and unforeseen problems. If we can roll out https://gitlab.com/groups/gitlab-org/-/epics/11631+ in the meantime, it could be a drop-in replacement, because users would be able to extend the termination limit.
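
To make the validation rules and the stop-then-terminate checks described in the sections above concrete, here is a minimal plain-Ruby sketch. It is not the actual GitLab implementation: `WorkspaceTimeouts`, `next_desired_state`, and the constants are hypothetical names introduced for illustration, `"Running"` stands in for any state that is neither `Stopped` nor `Terminated`, and the 1-month default is assumed to mean 30 days (720 hours).

```ruby
# Illustrative sketch only. WorkspaceTimeouts, next_desired_state and the
# constants below are hypothetical names, not part of the GitLab codebase.
HOURS_PER_YEAR = 8760
SECONDS_PER_HOUR = 3600

WorkspaceTimeouts = Struct.new(
  :max_active_hours_before_stop,
  :max_stopped_hours_before_termination,
  keyword_init: true
) do
  # Returns the list of violated validation rules (empty when valid).
  def errors
    errs = []
    unless max_active_hours_before_stop.positive?
      errs << "max_active_hours_before_stop must be greater than 0"
    end
    unless max_stopped_hours_before_termination.positive?
      errs << "max_stopped_hours_before_termination must be greater than 0"
    end
    total = max_active_hours_before_stop + max_stopped_hours_before_termination
    errs << "sum of both values must be less than #{HOURS_PER_YEAR}" unless total < HOURS_PER_YEAR
    errs
  end
end

# Returns the desired_state a workspace should have at `current_time`.
def next_desired_state(desired_state:, desired_state_updated_at:, timeouts:, current_time: Time.now)
  # Terminated workspaces never change state again.
  return desired_state if desired_state == "Terminated"

  # Subtracting two Time objects yields seconds as a Float.
  elapsed_hours = (current_time - desired_state_updated_at) / SECONDS_PER_HOUR.to_f

  if desired_state == "Stopped"
    elapsed_hours > timeouts.max_stopped_hours_before_termination ? "Terminated" : desired_state
  else
    elapsed_hours > timeouts.max_active_hours_before_stop ? "Stopped" : desired_state
  end
end

# Usage: 36 hours active, then 720 hours (assumed 30 days) stopped.
timeouts = WorkspaceTimeouts.new(
  max_active_hours_before_stop: 36,
  max_stopped_hours_before_termination: 720
)
now = Time.now
puts next_desired_state(
  desired_state: "Running",
  desired_state_updated_at: now - 37 * SECONDS_PER_HOUR,
  timeouts: timeouts,
  current_time: now
)
# prints "Stopped"
```

Keeping the decision in a pure function that takes `current_time` as an argument makes the transition rules easy to test in isolation. In the real reconciliation chain the same comparison would more likely be expressed as a database query over workspaces.
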
# Implementation Plan

# New system: Set workspace to `Stopped` and `Stopped` -> `Terminated`

| Order | Summary | Issue | MR | Details |
| - | - | - | - | - |
| 1 | Add `max_stopped_hours_before_termination` and `max_active_hours_before_stop` fields to the GitLab Kubernetes agent `agentcfg` | https://gitlab.com/gitlab-org/gitlab/-/issues/489077+ | https://gitlab.com/gitlab-org/cluster-integration/gitlab-agent/-/merge_requests/1825+ | See [Acceptance Criteria section on issue](https://gitlab.com/gitlab-org/gitlab/-/issues/489077#acceptance-criteria) |
| 2a | Add a migration to add the `max_active_hours_before_stop` and `max_stopped_hours_before_termination` fields to the `remote_development_agent_configs` table (or, if versioning is ready at that point, the `versioned_workspaces_agent_configs` table) | https://gitlab.com/gitlab-org/gitlab/-/issues/489078+ | https://gitlab.com/gitlab-org/gitlab/-/merge_requests/166065+ | See [Migrations section on issue](https://gitlab.com/gitlab-org/gitlab/-/issues/489078#migrations) |
| 2b | Add `max_stopped_hours_before_termination` and `max_active_hours_before_stop` fields to the domain | Same as above | Same as above | See [Models section on issue](https://gitlab.com/gitlab-org/gitlab/-/issues/489078#models) |
| 2c | Add logic to set the new fields from the agent config into the database, and add a step in the reconciliation logic chain to stop workspaces and terminate stopped workspaces | Same as above | Same as above | See [Domain Logic section on issue](https://gitlab.com/gitlab-org/gitlab/-/issues/489078#domain-logic) |
| 3 | Add (or decide not to add) a replacement for the "Terminates" column of the Workspaces list, to reflect the new suspension-then-termination logic | TODO: create issue | TODO: create MR | Need UX input on how this should work, or whether we need to replace it at all |

## Deprecating old system

The work to deprecate the old system and remove all references to the user-editable `max_hours_before_termination` has been moved to the epic https://gitlab.com/groups/gitlab-org/-/epics/15251+.