Improve Workspace Failure Visibility and Troubleshooting (#16787)

**Note:** This epic is structured to emphasize the fundamental user challenge, allowing for open discussion of various approaches to resolve it.

# Problem Statement

Currently, when workspaces fail, users receive no error information or guidance about what went wrong. This epic addresses this gap in the user experience.

The focus of this epic is to address the following use cases:

1. As a developer, when my workspace fails to create, I want to see clear error messages that help me understand and _potentially_ fix the issue myself.
2. As a GitLab admin, when a developer reports a workspace failure, I want to see error messages so that I can quickly identify the root cause and resolve the issue.

# Description

Currently, at various stages of the workspace lifecycle (e.g. creation, restart), users encounter failures or hanging workspaces and receive limited or unhelpful error information. This creates a poor user experience and increases the support burden, since only administrators with Kubernetes cluster access can diagnose the root cause.

The failures typically stem from one of the following:

1. Agent configuration issues
   * This is out of scope and will be handled in a separate effort
2. **Workspace startup (cluster) problems**
   * **Focus for this epic**
3. Workspace setup stage problems
   * We've moved cloning and setup operations out of the init container, so logs are now accessible to end users within their workspace.

Some examples of possible workspace startup failure points:

* No infrastructure in Kubernetes for scheduling
* Cluster doesn't have the right permissions to pull from a private container registry
* Kubernetes cluster not autoscaled, so work cannot be scheduled
* etc.

# Acceptance Criteria

1. Users receive clear, human-readable error messages when a workspace fails
2. Error messages include:
   * What component failed (agent config vs workspace setup)
   * Reason for failure/error
   * _Stretch: Suggested remediation steps where applicable_
3. Error information is accessible through the UI without requiring cluster access

Additional Considerations:

1. Error messages are logged and traceable for both users and administrators
2. Documentation is updated to include common error scenarios and their solutions

## Out Of Scope

##### Agent configuration validation

* Agent issues are going to be handled separately and caught before users reach the stage of starting a workspace

##### Workspaces hanging in starting state

* This may be a more complex issue caused by invalid agent configurations causing orphaned workspaces. We can address this in later iterations.
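
To make acceptance criterion 2 more concrete, here is a minimal sketch of how a workspace startup failure could be turned into a message carrying the failed component, the reason, and an optional remediation hint. It is an illustration for discussion only, not the implementation proposed by this epic. The `WorkspaceError` type, the `KNOWN_REASONS` table, the function names, and all message wording are hypothetical. The keys `Unschedulable`, `ImagePullBackOff`, and `ErrImagePull` are reason strings that Kubernetes reports for pods, but which reasons the agent would actually surface is an assumption.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class WorkspaceError:
    """Hypothetical shape for a user-facing workspace failure message."""
    component: str                      # what failed
    reason: str                         # human-readable reason
    remediation: Optional[str] = None   # stretch goal: suggested fix


# Hypothetical mapping from Kubernetes pod reasons to user-facing messages.
# The wording is illustrative and not taken from GitLab.
KNOWN_REASONS = {
    "Unschedulable": WorkspaceError(
        component="workspace startup (cluster)",
        reason="The cluster has no node available to schedule the workspace.",
        remediation="Ask an administrator to add capacity or enable autoscaling.",
    ),
    "ImagePullBackOff": WorkspaceError(
        component="workspace startup (cluster)",
        reason="The cluster could not pull the workspace container image.",
        remediation="Check that the cluster has permission to pull from the registry.",
    ),
}
# ErrImagePull is the first-attempt form of the same image pull problem.
KNOWN_REASONS["ErrImagePull"] = KNOWN_REASONS["ImagePullBackOff"]


def describe_failure(kubernetes_reason: str) -> WorkspaceError:
    """Return a user-facing error, falling back to a generic message."""
    known = KNOWN_REASONS.get(kubernetes_reason)
    if known is not None:
        return known
    return WorkspaceError(
        component="workspace startup (cluster)",
        reason=f"Workspace failed with unrecognized reason: {kubernetes_reason}",
    )


def format_for_ui(error: WorkspaceError) -> str:
    """Render the error as plain text, omitting remediation when absent."""
    lines = [f"Component: {error.component}", f"Reason: {error.reason}"]
    if error.remediation:
        lines.append(f"Suggested fix: {error.remediation}")
    return "\n".join(lines)
```

Keeping the remediation field optional mirrors its stretch-goal status: an unrecognized reason still produces a component and a reason, so the user never sees an empty failure state.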
