Decide how we will manage workspace restarts due to ongoing reconciliation config changes

MR: Pending

Description

Background

Impact of workspace pod restarts due to desired config changes

  • When the content of this config changes, some types of changes cause the workspace pod to restart, which restarts the workspace. This happens when the new code is rolled out - when the changes are deployed on .com, or when a new release containing the changes is deployed to on-prem.
  • A workspace restart such as this can result in disruptions to the user.

Current situation

  • Until now, we have assumed that we should avoid this category of workspace restarts, and have gone to considerable lengths to minimize them.
  • We have done this by "versioning" the generated config and attempting to always send the same version for a given workspace over its lifetime. The version is persisted in the workspaces.desired_config_generator_version field.
  • However, this approach is currently not intentional, reliable, or sustainable, for several reasons:
    1. We are not clear exactly which types of config changes cause a workspace restart, and which do not.
    2. We have no reliable way to know when a given logic change has resulted in a config change - but this is possible - see Workspaces golden master test to check desired_... (!186342 - merged) • Chad Woolley • 17.11 for a spike implementation of how we could achieve this.
    3. Given that the current max lifetime of a workspace is 1 year, the current approach to versioning may be unsustainable, as it involves maintaining independent custom implementations of the logic for each currently-supported version. If we continue to iterate and evolve the config rapidly, there could be many versions we need to maintain, which would represent a significant source of ongoing work and complexity.
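The spike referenced above (!186342) suggests one way to reliably detect config changes: fingerprint the generated config and fail a test when the fingerprint drifts from a committed "golden" value. A minimal Ruby sketch of that idea follows; the generator and its parameters here are hypothetical stand-ins, not the real Workspaces implementation:

```ruby
require "digest"
require "json"

# Hypothetical stand-in for the real desired-config generator, which in the
# actual codebase is built from many workspace/agent attributes.
def generate_desired_config(workspace_name:, image:)
  {
    "metadata" => { "name" => workspace_name },
    "spec" => { "containers" => [{ "name" => "workspace", "image" => image }] }
  }
end

# Recursively sort hash keys so the digest is stable regardless of insertion order.
def canonicalize(value)
  case value
  when Hash  then value.sort.to_h { |k, v| [k, canonicalize(v)] }
  when Array then value.map { |v| canonicalize(v) }
  else value
  end
end

# Fingerprint the config. A golden-master test would compare this digest
# against a committed value, so any unintended config change fails CI and
# forces a deliberate decision (e.g. a version bump).
def config_digest(config)
  Digest::SHA256.hexdigest(JSON.generate(canonicalize(config)))
end

config = generate_desired_config(workspace_name: "ws-1", image: "example:1.0")
puts config_digest(config)
```

The digest is deterministic for identical input and changes for any semantic change to the generated config, which is exactly the property a golden-master test needs.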

Normal restarts

The other important point to be aware of is that workspace pod restarts can happen at any time, even with no config/resource changes; this is the nature of Kubernetes. Here is a brief AI-generated list of some normal events which can cause pod and workspace restarts. Note that this list _does not_ include avoidable node-/cluster-level events such as node resource (CPU/memory/disk) exhaustion:

List of potential reasons for Kubernetes pod/container restarts
  • Node-level issues:
    • Node failures or reboots
    • Node maintenance or auto-scaling events
  • Cluster-level events:
    • Kubernetes control plane upgrades
    • Network partitions
    • DNS failures
    • Storage issues (PV/PVC problems)
  • Scheduler-driven events:
    • Pod evictions due to node pressure
    • Preemption by higher-priority pods
    • Cluster autoscaler activities
    • Descheduling by disruption controllers
  • Administrative actions:
    • Manual node draining
    • Cluster maintenance
    • Security patches requiring pod rotation
  • Container-level issues:
    • OOMKilled (Out of Memory)
    • Application crashes or panics
    • Probe failures
    • Container runtime issues
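When a restart does occur, Kubernetes records why in the pod's status (status.containerStatuses[].lastState.terminated), which is how several of the container-level causes above (e.g. OOMKilled) can be distinguished after the fact. A small Ruby sketch of reading that structure; the sample payload is illustrative, not from a real cluster:

```ruby
require "json"

# Extract the last-restart reason for each container from a Kubernetes pod
# object, using the standard status.containerStatuses[].lastState.terminated
# fields.
def last_restart_reasons(pod_json)
  pod = JSON.parse(pod_json)
  statuses = pod.dig("status", "containerStatuses") || []
  statuses.filter_map do |cs|
    terminated = cs.dig("lastState", "terminated")
    next unless terminated

    {
      "container" => cs["name"],
      "restartCount" => cs["restartCount"],
      "reason" => terminated["reason"],     # e.g. "OOMKilled", "Error"
      "exitCode" => terminated["exitCode"]  # 137 typically means SIGKILL/OOM
    }
  end
end

# Illustrative payload shaped like `kubectl get pod <name> -o json` output.
sample = <<~JSON
  {
    "status": {
      "containerStatuses": [
        {
          "name": "workspace",
          "restartCount": 2,
          "lastState": {
            "terminated": { "reason": "OOMKilled", "exitCode": 137 }
          }
        }
      ]
    }
  }
JSON

puts last_restart_reasons(sample)
```

This kind of inspection could help an investigation into which restarts are config-driven versus normal Kubernetes behavior.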

Other potentially related problems

Questions / Decisions

IMPORTANT NOTE: The status of Workspace changes (user level settings) are los... (&15769) • Unassigned is very important to these questions, because completing it will greatly mitigate the negative impact of most workspace restarts.

  1. Should we even be attempting to ALWAYS avoid workspace restarts upon deployment of new versions?
    1. As AI summarized when making the list of potential reasons for pod restarts above: "These scenarios highlight why applications deployed in Kubernetes should be designed for resilience, with proper health checks, graceful shutdown handling, and statelessness where possible."
    2. Given that statement, and the fact that it is normal for Kubernetes to restart pods for the many reasons listed above, is it a "misuse" of Kubernetes to assume that users can have long-running (up to one year) workspace containers which are never restarted?
    3. How much effort and complexity is achieving this goal worth? As noted above, it is a complex, detail-oriented, and tedious effort to maintain and lock all supported workspace versions for the max-1-year workspace lifetime. This will involve ongoing effort, and will make the code harder to understand and maintain. For example, we will have to backport any bugfixes to all supported versions.
  2. Or should we instead take the approach that it is acceptable to have restarts (again, assuming that Workspace changes (user level settings) are los... (&15769) • Unassigned is completed) when config changes are deployed?
  3. If we are OK with restarts, to what extent do we need to notify users of this possibility? Can we simply make a blanket statement that a workspace may restart at any time?
    1. Note that the implementation of this is very different for .com vs. on-prem. For on-prem, this can be controlled, and users can be warned to save data before a release installation, perhaps with a reminder in each release's notes. But for .com, our Continuous Deployment means that a config change can be deployed at any time after it is merged to master and passes the pipelines.
  4. How much effort do we want to invest in all of this, in the short term and long term? Especially given that there are high-priority things to work on which are still unstarted, such as Workspace changes (user level settings) are los... (&15769) • Unassigned.
  5. Regardless of the decisions above:
    1. Do we want to have an investigation to know exactly what types of config changes do and do not cause restarts?
    2. Do we want to invest any effort in being able to predict when a config change is going to get merged which will cause workspace restarts?
    3. What impact does Workspace hangs on restart after being stopped (#533807 - closed) • Vishal Tak, Safwan Ahmed • 17.11 have on this (if it turns out to be an actual config-related regression)? Does that justify the effort to track config changes, or are we better off instead investing in adequate E2E test coverage as proposed above?

Acceptance criteria

  • Answer all the questions and decisions above

Implementation plan

TODO: Pending based on questions/discussion.

Edited by Chad Woolley