
Reduce workspace state transition delay from 'Starting' to 'Running'

Problem

During GDK workspace startup, there's a significant delay (up to 1 minute and 6 seconds) waiting for the workspace state to transition from 'Starting' to 'Running'. This delay blocks user-defined postStart commands and contributes to the overall GDK startup time.

From the timing breakdown:

Operation             Duration  Notes
Workspace state wait  1:06      Waiting for 'Running' state

This represents 6.5% of the total 17-minute installation time.
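As a quick check of that figure, the arithmetic can be reproduced in a few lines. The function name below is made up for illustration; the inputs are the 1:06 wait and the 17-minute total quoted above.

```python
def share_of_total(wait_seconds: float, total_seconds: float) -> float:
    """Return the wait as a fraction of the total duration."""
    return wait_seconds / total_seconds


wait_seconds = 1 * 60 + 6    # 1:06 workspace state wait
total_seconds = 17 * 60      # 17-minute installation

share = share_of_total(wait_seconds, total_seconds)
print(f"{share:.1%}")  # prints "6.5%"
```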

Root Cause

As explained by @cwoolley-gitlab and @vtak:

  1. The workspace state is based on the reconciliation loop between agentk and kas/GitLab monolith
  2. agentk sends the current state every 10 seconds by default during reconciliation
  3. The actual state is mounted as a file from a Kubernetes secret (gl_workspace_reconciled_actual_state.txt)
  4. Kubernetes takes time to flush the updated secret contents to the visible filesystem in the container
  5. This delay can be up to a minute depending on cluster load and is not under our direct control
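The steps above can be read as a rough latency budget. The sketch below is a simplified additive model and not a description of how agentk or the kubelet actually schedule work: it assumes the stages happen one after another and that each one can take its full interval in the worst case. The class and field names are hypothetical. The 10-second and "up to a minute" values come from this issue, and the 5-second value is the polling interval of the script that reads the state file.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StateLatencyBudget:
    """Hypothetical worst-case model of the 'Starting' -> 'Running' delay."""

    reconcile_interval_s: float = 10.0  # agentk default reconciliation interval
    secret_flush_max_s: float = 60.0    # "up to a minute" for the mounted secret
    poll_interval_s: float = 5.0        # polling interval of the state file reader

    def worst_case_s(self) -> float:
        # Assumption: the three stages are strictly sequential and additive.
        return (
            self.reconcile_interval_s
            + self.secret_flush_max_s
            + self.poll_interval_s
        )


default = StateLatencyBudget()
faster_reconcile = StateLatencyBudget(reconcile_interval_s=5.0)

print(default.worst_case_s())           # 75.0
print(faster_reconcile.worst_case_s())  # 70.0
```

Under these assumptions the secret flush dominates the budget, so halving the reconciliation interval only removes 5 seconds from the worst case. This is only as good as the additive model, and real measurements should take precedence over it.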

Current Behavior

The gl-sleep-until-container-is-running-command script polls the state file every 5 seconds:

# Workspace state wait example from logs
2025-07-23T14:54:47+00:00: Workspace state is 'Starting' from status file. Blocking remaining postStart events execution for 5 seconds until state is 'Running'...
# ... (repeats multiple times)
2025-07-23T14:55:53+00:00: Workspace state is now 'Running', continuing postStart hook execution.
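
The polling behavior visible in these logs can be sketched as follows. This is an illustration in Python, not the actual gl-sleep-until-container-is-running-command script. It assumes the state file contains only the state name (for example "Starting" or "Running"), and the function name, timeout, and injectable sleep parameter are hypothetical additions for the example.

```python
import time
from pathlib import Path
from typing import Callable


def wait_until_running(
    state_file: Path,
    poll_interval_s: float = 5.0,   # matches the 5-second interval in the logs
    timeout_s: float = 300.0,       # assumption: the real script may not time out
    sleep: Callable[[float], None] = time.sleep,
) -> float:
    """Block until the state file reads 'Running'; return seconds waited."""
    start = time.monotonic()
    while True:
        try:
            state = state_file.read_text().strip()
        except FileNotFoundError:
            state = ""  # the secret may not be mounted yet
        if state == "Running":
            return time.monotonic() - start
        if time.monotonic() - start >= timeout_s:
            raise TimeoutError(f"state is still {state!r} after {timeout_s}s")
        sleep(poll_interval_s)
```

With a fixed interval, the reader notices the change up to one full interval after the file is updated, so the polling interval adds at most 5 seconds on top of the time it takes the file itself to change.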

Impact

  • User Experience: Users see an available "Open Workspace" button, but workspace setup is still blocked internally
  • Performance: Adds ~1 minute to total GDK startup time
  • Resource Utilization: Workspace resources are idle during this waiting period

Potential Solutions

  1. Reduce reconciliation interval: Investigate whether the default 10-second reconciliation interval can be shortened for workspace startup scenarios

  2. Alternative state communication: Explore mechanisms beyond mounted Kubernetes secrets for faster state communication

  3. Parallel execution: Consider if some user-defined postStart commands could run in parallel rather than being completely blocked

  4. Future improvement: The in-container agentk work mentioned by @cwoolley-gitlab could make this more immediate

Acceptance Criteria

  • Reduce the workspace state transition delay from ~1 minute to under 30 seconds
  • Maintain the guarantee that user-defined postStart commands only run after the workspace is marked as 'Running'
  • Document any changes in behavior or expectations
  • Measure impact on overall GDK startup time
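
One way to measure the delay against the 30-second target is to compute it from the postStart log timestamps. The sketch below assumes each log line starts with an ISO 8601 timestamp followed by ": ", and that the message wording matches the log excerpt in this issue. The function and constant names are hypothetical.

```python
from datetime import datetime
from typing import Iterable, Optional

TARGET_SECONDS = 30.0  # acceptance criterion from this issue

LOG_EXCERPT = [
    "2025-07-23T14:54:47+00:00: Workspace state is 'Starting' from status file. "
    "Blocking remaining postStart events execution for 5 seconds until state is 'Running'...",
    "2025-07-23T14:55:53+00:00: Workspace state is now 'Running', "
    "continuing postStart hook execution.",
]


def state_wait_seconds(log_lines: Iterable[str]) -> float:
    """Seconds between the first 'Blocking' line and the 'now Running' line."""
    first_blocked: Optional[datetime] = None
    for line in log_lines:
        stamp, _, message = line.partition(": ")
        try:
            when = datetime.fromisoformat(stamp)
        except ValueError:
            continue  # skip comments and lines without a leading timestamp
        if first_blocked is None and "Blocking remaining postStart" in message:
            first_blocked = when
        elif first_blocked is not None and "is now 'Running'" in message:
            return (when - first_blocked).total_seconds()
    raise ValueError("no complete Starting -> Running transition found")


waited = state_wait_seconds(LOG_EXCERPT)
print(waited, waited <= TARGET_SECONDS)  # 66.0 False
```

Because the excerpt may not begin at the very first blocked poll, a value computed this way is a lower bound on the real wait.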

Related Issues

  • Parent epic: &12382 (closed) (GDK - Reduce GDK startup time by 50% in Remote Development workspace)
  • Related to workspace postStart command execution flow

Additional Context

This issue was identified during analysis of GDK startup performance in Remote Development workspaces. While the workspace becomes functionally available before the state file is updated, the current design intentionally blocks user-defined postStart commands to provide state guarantees.

Edited by 🤖 GitLab Bot 🤖