Remote development workspaces setup: Agent reconcile fails with HTTP 500 against server Rails API
Summary
Setting up a Kubernetes cluster for remote development workspaces works up to the point where the agent for Kubernetes can be selected in the Create workspace form. After that, the provisioning icon keeps spinning and nothing else happens.
The agent pod logs reveal that calls to the Rails API return a 500 error.
This is hard to debug as a user, and more verbose error logging is needed to understand what exactly has to be fixed. Is it the .devfile.yaml, a problem with the cluster permissions, or a bug in the software (agent for Kubernetes, Rails server)?
Steps to reproduce
- Spin up a Google Kubernetes Engine cluster, register a domain in cloud DNS.
- Create a new test group somewhere
- Follow the remote development workspaces documentation https://docs.gitlab.com/ee/user/workspace/ to set up the infrastructure. A full walkthrough is in https://gitlab.com/gitlab-de/use-cases/remote-development/agent-kubernetes-gke
- Ensure that the agent for Kubernetes shows up in a new project in the test group.
- After the agent for Kubernetes and gitlab-workspaces-proxy are installed, fork this demo project https://gitlab.com/gitlab-org/remote-development/examples/example-go-http-app into the test group
- Navigate to Menu > Your Work > Workspaces and create a new workspace. Search for `example-go-http-app`. Select the agent for Kubernetes from the dropdown.
- Create the workspace. The wheel is spinning.
- Use kubectl to inspect the pod logs for the agent for Kubernetes.
- Increase the logging to debug for the agent config, check again.
Example Project
What is the current bug behavior?
No workspace is provisioned. The frontend shows an endless spinner, and no errors.
The Kubernetes cluster agent logs give more insight: the agent calls the GitLab.com Rails API /reconcile endpoint, which returns a 500 error.
The agent for Kubernetes does not log any response body that would help debug the error.
The Rails API code also has a "catch all exceptions" block that makes debugging harder, because every exception is reported as a 500 error. See ee/lib/ee/api/internal/kubernetes.rb
There is no visibility into the chain of possible errors.
What is the expected correct behavior?
- The agent for Kubernetes logs why the Rails API fails.
- The error message helps to identify which parts are failing - the Rails API must be calling something that sends a "create pod" event action back to the agent.
- Troubleshooting documentation covers these cases and explains how to resolve them.
Relevant logs and/or screenshots
kubectl logs -f -l app.kubernetes.io/name=gitlab-agent -n gitlab-agent-remote-dev-dev
{"level":"info","time":"2023-05-19T18:18:18.787Z","msg":"starting partial update","mod_name":"remote_development","agent_id":60387}
{"level":"debug","time":"2023-05-19T18:18:18.787Z","msg":"Running reconciliation loop","mod_name":"remote_development","agent_id":60387}
{"level":"debug","time":"2023-05-19T18:18:18.787Z","msg":"Making GitLab request","mod_name":"remote_development","agent_id":60387}
{"level":"debug","time":"2023-05-19T18:18:22.900Z","msg":"Made request to the Rails API","mod_name":"remote_development","status_code":500,"request_id":"d8f35d22709df62d56c68304e1142803","duration_in_ms":4112,"agent_id":60387}
{"level":"debug","time":"2023-05-19T18:18:22.900Z","msg":"Reconciliation loop ended","mod_name":"remote_development","agent_id":60387}
{"level":"error","time":"2023-05-19T18:18:22.900Z","msg":"Remote Dev - partial sync cycle ended with error","mod_name":"remote_development","error":"unexpected status code: 500","agent_id":60387}
{"level":"debug","time":"2023-05-19T18:18:28.777Z","msg":"ContainerScanning config is empty, security policies are disabled","mod_name":"starboard_vulnerability","agent_id":60387}
{"level":"info","time":"2023-05-19T18:18:32.901Z","msg":"starting partial update","mod_name":"remote_development","agent_id":60387}
{"level":"debug","time":"2023-05-19T18:18:32.901Z","msg":"Running reconciliation loop","mod_name":"remote_development","agent_id":60387}
{"level":"debug","time":"2023-05-19T18:18:32.901Z","msg":"Making GitLab request","mod_name":"remote_development","agent_id":60387}
{"level":"debug","time":"2023-05-19T18:18:36.905Z","msg":"Made request to the Rails API","mod_name":"remote_development","status_code":500,"request_id":"1976fcdb8631dfe2048b378b9cd9b762","duration_in_ms":4004,"agent_id":60387}
{"level":"debug","time":"2023-05-19T18:18:36.905Z","msg":"Reconciliation loop ended","mod_name":"remote_development","agent_id":60387}
{"level":"error","time":"2023-05-19T18:18:36.905Z","msg":"Remote Dev - partial sync cycle ended with error","mod_name":"remote_development","error":"unexpected status code: 500","agent_id":60387}
Output of checks
This bug happens on GitLab.com
Possible fixes
Capture the error details returned by the Rails API and log them on the agent side.
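As an illustration of what this could look like, here is a rough sketch in Go. It is not the agent's actual code: `statusError`, `checkResponse`, and the `maxLoggedBody` limit are hypothetical names, and the test server in `main` only stands in for the reconcile endpoint with an invented response body. The idea is to read a bounded part of the response body on a non-2xx status and include it in the error that gets logged.

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"net/http/httptest"
)

// maxLoggedBody caps how much of an error response is copied into the error.
// The value is an assumption chosen for this sketch.
const maxLoggedBody = 4096

// statusError builds the error for a non-2xx response and appends the body
// text when the server sent one.
func statusError(status int, body []byte) error {
	if len(body) == 0 {
		return fmt.Errorf("unexpected status code: %d", status)
	}
	return fmt.Errorf("unexpected status code: %d, body: %q", status, body)
}

// checkResponse returns nil for 2xx responses and a descriptive error otherwise.
func checkResponse(resp *http.Response) error {
	if resp.StatusCode >= 200 && resp.StatusCode < 300 {
		return nil
	}
	// Read at most maxLoggedBody bytes so a huge error page cannot flood the logs.
	body, err := io.ReadAll(io.LimitReader(resp.Body, maxLoggedBody))
	if err != nil {
		return fmt.Errorf("unexpected status code: %d (reading body failed: %w)", resp.StatusCode, err)
	}
	return statusError(resp.StatusCode, body)
}

func main() {
	// Stand-in server that always answers with a 500 and an invented body.
	srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		http.Error(w, `{"message":"hypothetical error detail"}`, http.StatusInternalServerError)
	}))
	defer srv.Close()

	resp, err := http.Get(srv.URL)
	if err != nil {
		fmt.Println("request failed:", err)
		return
	}
	defer resp.Body.Close()
	fmt.Println(checkResponse(resp))
}
```

This only helps if the Rails API puts something useful into the 500 response body, so the "catch all exceptions" block on the server side would need to pass along a message as well.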