Reduce workspace state transition delay from 'Starting' to 'Running'
Problem
During GDK workspace startup, there's a significant delay (up to 1 minute and 6 seconds) waiting for the workspace state to transition from 'Starting' to 'Running'. This delay blocks user-defined postStart commands and contributes to the overall GDK startup time.
From the timing breakdown:
| Operation | Duration (m:ss) | Notes |
|---|---|---|
| Workspace state wait | 1:06 | Waiting for 'Running' state |
This represents 6.5% of the total 17-minute installation time.
Root Cause
As explained by @cwoolley-gitlab and @vtak:
- The workspace state is based on the reconciliation loop between agentk and kas/GitLab monolith
- agentk sends the current state every 10 seconds by default during reconciliation
- The actual state is mounted as a file from a Kubernetes secret (gl_workspace_reconciled_actual_state.txt)
- Kubernetes takes time to flush the updated secret contents to the visible filesystem in the container
- This delay can be up to a minute depending on cluster load and is not under our direct control
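To make the relative weight of each stage concrete, here is a rough back-of-the-envelope model in Python. It is only a sketch under simplifying assumptions: the three stages are treated as strictly sequential, each contributes at most one full period, and a single reconciliation cycle is enough to report the new state. The class and field names are hypothetical; the numeric defaults are the figures quoted in this issue.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DelayModel:
    """Simplified worst-case model of the 'Starting' -> 'Running' wait, in seconds."""

    reconcile_interval: float = 10.0  # agentk reporting interval (from this issue)
    secret_flush_max: float = 60.0    # "up to a minute" for the mounted secret to update
    poll_interval: float = 5.0        # polling interval of the postStart wait script

    def worst_case(self) -> float:
        # Assumption: each stage can add up to one full period before the
        # next stage observes the change, and the stages do not overlap.
        return self.reconcile_interval + self.secret_flush_max + self.poll_interval

    def breakdown(self) -> dict:
        # Fraction of the worst-case wait attributable to each stage.
        total = self.worst_case()
        return {
            "reconcile": self.reconcile_interval / total,
            "secret_flush": self.secret_flush_max / total,
            "poll": self.poll_interval / total,
        }


if __name__ == "__main__":
    model = DelayModel()
    print(model.worst_case())   # 75.0
    print(model.breakdown())
```

Under these assumptions the secret flush accounts for 80% of the worst case, so tuning only the reconciliation or polling interval would leave most of the delay in place.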
Current Behavior
The gl-sleep-until-container-is-running-command script polls the state file every 5 seconds:
# Workspace state wait example from logs
2025-07-23T14:54:47+00:00: Workspace state is 'Starting' from status file. Blocking remaining postStart events execution for 5 seconds until state is 'Running'...
# ... (repeats multiple times)
2025-07-23T14:55:53+00:00: Workspace state is now 'Running', continuing postStart hook execution.
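The polling behavior shown in these logs can be sketched in Python as follows. This is not the actual gl-sleep-until-container-is-running-command script, only an illustration of the same loop. The function name, the timeout, and the injectable sleep parameter are assumptions added for the example; the directory the state file is mounted in is not stated in this issue, so the caller supplies the full path.

```python
import time
from pathlib import Path
from typing import Callable


def wait_until_running(
    state_file: Path,
    poll_seconds: float = 5.0,        # polling interval described in this issue
    timeout_seconds: float = 300.0,   # hypothetical safety limit, not from the issue
    sleep: Callable[[float], None] = time.sleep,
) -> float:
    """Block until state_file contains 'Running'; return the seconds spent sleeping."""
    waited = 0.0
    while True:
        try:
            state = state_file.read_text().strip()
        except FileNotFoundError:
            state = ""  # file not visible yet; treat it the same as not running
        if state == "Running":
            return waited
        if waited >= timeout_seconds:
            raise TimeoutError(f"state still {state!r} after {waited:.0f}s")
        sleep(poll_seconds)
        waited += poll_seconds
```

Because the loop only observes the file, shortening poll_seconds saves at most one polling period; the wait is dominated by how long the file takes to change.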
Impact
- User Experience: Users see an available "Open Workspace" button but the workspace setup is still blocked internally
- Performance: Adds ~1 minute to total GDK startup time
- Resource Utilization: Workspace resources are idle during this waiting period
Potential Solutions
- Shorten the reconciliation interval: Investigate whether the default 10-second reconciliation interval can be reduced for workspace startup scenarios
- Alternative state communication: Explore mechanisms beyond mounted Kubernetes secrets for faster state communication
- Parallel execution: Consider whether some user-defined postStart commands could run in parallel rather than being completely blocked
- Future improvement: The in-container agentk work mentioned by @cwoolley-gitlab could make this more immediate
Acceptance Criteria
- Reduce the workspace state transition delay from ~1 minute to under 30 seconds
- Maintain the guarantee that user-defined postStart commands only run after workspace is marked as Running
- Document any changes in behavior or expectations
- Measure impact on overall GDK startup time
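One way to check the 30-second target is to derive the wait directly from the postStart logs. The sketch below assumes the log format shown under Current Behavior (an ISO 8601 timestamp followed by a colon and the message); the function and constant names are hypothetical.

```python
from datetime import datetime
from typing import Iterable, Optional

BLOCKING_MARKER = "Workspace state is 'Starting'"
RUNNING_MARKER = "Workspace state is now 'Running'"
TIMESTAMP_LENGTH = 25  # e.g. "2025-07-23T14:54:47+00:00"


def _timestamp(line: str) -> datetime:
    # The timestamp is the fixed-width prefix of each log line.
    return datetime.fromisoformat(line[:TIMESTAMP_LENGTH])


def state_wait_seconds(lines: Iterable[str]) -> Optional[float]:
    """Seconds between the first 'Starting' message and the 'Running' message."""
    first_blocked = None
    for line in lines:
        if BLOCKING_MARKER in line and first_blocked is None:
            first_blocked = _timestamp(line)
        elif RUNNING_MARKER in line and first_blocked is not None:
            return (_timestamp(line) - first_blocked).total_seconds()
    return None  # the log does not contain a complete wait


if __name__ == "__main__":
    sample = [
        "2025-07-23T14:54:47+00:00: Workspace state is 'Starting' from status file. "
        "Blocking remaining postStart events execution for 5 seconds until state is 'Running'...",
        "2025-07-23T14:55:53+00:00: Workspace state is now 'Running', "
        "continuing postStart hook execution.",
    ]
    print(state_wait_seconds(sample))  # 66.0
```

For the log excerpt in this issue the result is 66 seconds, which matches the 1:06 in the timing breakdown and is about 6.5% of a 17-minute (1,020-second) run.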
Related Issues
- Parent epic: &12382 (closed) (GDK - Reduce GDK startup time by 50% in Remote Development workspace)
- Related to workspace postStart command execution flow
Additional Context
This issue was identified during analysis of GDK startup performance in Remote Development workspaces. While the workspace becomes functionally available before the state file is updated, the current design intentionally blocks user-defined postStart commands to provide state guarantees.