
Remote development workspaces setup: Agent reconcile fails with HTTP 500 against server Rails API

Summary

Setting up a Kubernetes cluster for remote development workspaces works up to the point where the agent for Kubernetes can be selected in the Create workspace form. After that, the provisioning icon spins indefinitely and nothing further happens.

The agent pod logs reveal that calls to the Rails API return a 500 error.

This is hard to debug as a user. More verbose error logging is needed to understand what exactly has to be fixed: is it the .devfile.yaml, a problem with the cluster permissions, or a bug in the software (agent for Kubernetes, Rails server)?

Steps to reproduce

  1. Spin up a Google Kubernetes Engine cluster and register a domain in Cloud DNS.
  2. Create a new test group.
  3. Follow the remote development workspaces documentation https://docs.gitlab.com/ee/user/workspace/ to set up the infrastructure. A full walkthrough is in https://gitlab.com/gitlab-de/use-cases/remote-development/agent-kubernetes-gke
    • Ensure that the agent for Kubernetes shows up in a new project in the test group.
  4. After the agent for Kubernetes and gitlab-workspaces-proxy are installed, fork this demo project https://gitlab.com/gitlab-org/remote-development/examples/example-go-http-app into the test group
  5. Navigate to Menu > Your Work > Workspaces and create a new workspace. Search for example-go-http-app. Select the agent for Kubernetes from the dropdown.
  6. Create the workspace. The spinner keeps spinning and the workspace is never provisioned.
  7. Use kubectl to inspect the pod logs for the agent for Kubernetes.
  8. Set the log level to debug in the agent configuration (`observability.logging.level: debug`), then check the logs again.

Example Project

What is the current bug behavior?

No workspace is provisioned. The frontend shows an endless spinner and displays no errors.

The agent for Kubernetes logs give more insight: the agent calls the GitLab.com Rails API /reconcile endpoint, which returns a 500 error.

The agent for Kubernetes does not log any response body that would help debug the error.

The Rails API code also has a "catch all exceptions" block that makes debugging harder, because every failure is reported as a generic 500 error: ee/lib/ee/api/internal/kubernetes.rb

There is no visibility into the chain of possible errors.

What is the expected correct behavior?

  1. The agent for Kubernetes logs why the Rails API fails.
  2. The error message helps identify which part is failing. The Rails API must be calling something that sends a "create pod" event action back to the agent.
  3. Troubleshooting documentation covers these cases and explains how to resolve them.

Relevant logs and/or screenshots

kubectl logs -f -l app.kubernetes.io/name=gitlab-agent -n gitlab-agent-remote-dev-dev
{"level":"info","time":"2023-05-19T18:18:18.787Z","msg":"starting partial update","mod_name":"remote_development","agent_id":60387}
{"level":"debug","time":"2023-05-19T18:18:18.787Z","msg":"Running reconciliation loop","mod_name":"remote_development","agent_id":60387}
{"level":"debug","time":"2023-05-19T18:18:18.787Z","msg":"Making GitLab request","mod_name":"remote_development","agent_id":60387}
{"level":"debug","time":"2023-05-19T18:18:22.900Z","msg":"Made request to the Rails API","mod_name":"remote_development","status_code":500,"request_id":"d8f35d22709df62d56c68304e1142803","duration_in_ms":4112,"agent_id":60387}
{"level":"debug","time":"2023-05-19T18:18:22.900Z","msg":"Reconciliation loop ended","mod_name":"remote_development","agent_id":60387}
{"level":"error","time":"2023-05-19T18:18:22.900Z","msg":"Remote Dev - partial sync cycle ended with error","mod_name":"remote_development","error":"unexpected status code: 500","agent_id":60387}
{"level":"debug","time":"2023-05-19T18:18:28.777Z","msg":"ContainerScanning config is empty, security policies are disabled","mod_name":"starboard_vulnerability","agent_id":60387}
{"level":"info","time":"2023-05-19T18:18:32.901Z","msg":"starting partial update","mod_name":"remote_development","agent_id":60387}
{"level":"debug","time":"2023-05-19T18:18:32.901Z","msg":"Running reconciliation loop","mod_name":"remote_development","agent_id":60387}
{"level":"debug","time":"2023-05-19T18:18:32.901Z","msg":"Making GitLab request","mod_name":"remote_development","agent_id":60387}
{"level":"debug","time":"2023-05-19T18:18:36.905Z","msg":"Made request to the Rails API","mod_name":"remote_development","status_code":500,"request_id":"1976fcdb8631dfe2048b378b9cd9b762","duration_in_ms":4004,"agent_id":60387}
{"level":"debug","time":"2023-05-19T18:18:36.905Z","msg":"Reconciliation loop ended","mod_name":"remote_development","agent_id":60387}
{"level":"error","time":"2023-05-19T18:18:36.905Z","msg":"Remote Dev - partial sync cycle ended with error","mod_name":"remote_development","error":"unexpected status code: 500","agent_id":60387}
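The only detail in these logs that ties a failure to the server side is the request_id on the "Made request to the Rails API" entries. As a stopgap until the agent logs more, here is a minimal Python sketch that pulls those IDs out of the JSON log stream so they can be quoted in a report. The function name and the default module filter are illustrative assumptions, not part of the agent. It also assumes the request_id can be used by someone with access to the server logs to look up the underlying exception.

```python
import json
from typing import Iterable, List


def failed_request_ids(log_lines: Iterable[str],
                       module: str = "remote_development") -> List[str]:
    """Return request_id values of `module` log entries with status_code >= 500."""
    ids = []
    for line in log_lines:
        line = line.strip()
        if not line.startswith("{"):
            continue  # skip blank lines, shell prompts, and other non-JSON output
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            continue  # tolerate truncated or interleaved lines
        if entry.get("mod_name") != module:
            continue  # ignore other agent modules, e.g. starboard_vulnerability
        status = entry.get("status_code")
        if isinstance(status, int) and status >= 500 and "request_id" in entry:
            ids.append(entry["request_id"])
    return ids


if __name__ == "__main__":
    import sys

    # Read agent logs from standard input and print one request_id per line.
    for request_id in failed_request_ids(sys.stdin):
        print(request_id)
```

Saved under a hypothetical name such as failed_requests.py, it can be fed from the kubectl logs command above (without -f) by piping the output into `python3 failed_requests.py`.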


Output of checks

This bug happens on GitLab.com

Results of GitLab environment info


(For installations with omnibus-gitlab package run and paste the output of:
`sudo gitlab-rake gitlab:env:info`)

(For installations from source run and paste the output of:
`sudo -u git -H bundle exec rake gitlab:env:info RAILS_ENV=production`)

Results of GitLab application Check


(For installations with omnibus-gitlab package run and paste the output of:
`sudo gitlab-rake gitlab:check SANITIZE=true`)

(For installations from source run and paste the output of:
`sudo -u git -H bundle exec rake gitlab:check RAILS_ENV=production SANITIZE=true`)

(we will only investigate if the tests are passing)

Possible fixes

Capture more error messages and log the string on the agent side.
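To illustrate the idea, here is a minimal sketch of how a failed response could be turned into a more useful log string. The agent itself is written in Go, so this Python version only shows the shape of the message, not the actual implementation. The function name, the 512-character cap, and the message layout are all assumptions made for the example.

```python
from typing import Optional

MAX_BODY_CHARS = 512  # assumed cap so a large error page cannot flood the logs


def describe_failed_response(status_code: int,
                             body: bytes,
                             request_id: Optional[str] = None,
                             max_chars: int = MAX_BODY_CHARS) -> str:
    """Build an error string that keeps the status code, request ID, and body."""
    # Decode defensively: an error body is not guaranteed to be valid UTF-8.
    text = body.decode("utf-8", errors="replace").strip()
    if len(text) > max_chars:
        text = text[:max_chars] + "...(truncated)"

    parts = [f"unexpected status code: {status_code}"]
    if request_id:
        parts.append(f"request_id={request_id}")
    if text:
        # repr() keeps newlines and quotes on a single log line.
        parts.append(f"body={text!r}")
    return "; ".join(parts)
```

With an empty body this falls back to the message the agent logs today ("unexpected status code: 500"). For the body to be useful, the Rails API would also need to return a specific, non-sensitive message instead of a generic error.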

/cc @vtak @timofurrer @oregand @ericschurter

Edited by Chad Woolley