During a recent outage (gitlab-com/gl-infra/production#2381, closed), it was brought up that the GitLab UI shows incorrect information in various places. If the API calls to GitLab fail, we do not indicate a problem with the Container Registry, but instead return nil values. This is a serious problem, as scripts may rely on the ability to find tags to determine whether an object needs to be built, which can lead to unnecessary rebuilds. It appears that our API does not return a valid response to the user when the Container Registry is having a problem.
Intentionally shut down the Container Registry service
Perform various operations against the Container Registry:
Make an API call to find an image
Tell a CI job to pull an image
What is the current bug behavior?
The API will respond noting that the Docker image tag does not exist.
What is the expected correct behavior?
The API should return an error.
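To illustrate why the distinction matters for scripts, here is a minimal sketch of how a CI helper could decide what to do once the API separates "tag not found" from "registry unavailable". The helper name `build_decision` and the status codes are assumptions for illustration only; the status the fixed API would actually return has not been decided in this issue.

```ruby
# Hypothetical helper for a CI script. The mapping below assumes the API
# returns 404 for a missing tag and a 5xx status when the registry is down.
def build_decision(http_status)
  case http_status
  when 200 then :skip_build        # tag exists, nothing to do
  when 404 then :build             # tag is genuinely missing
  when 500..599 then :retry_later  # registry or API problem, do not rebuild
  else :abort                      # unexpected response, needs a human
  end
end

puts build_decision(404)
puts build_decision(503)
```

With the current behavior, both the missing-tag case and the outage case look the same to such a script, which is what triggers the unnecessary rebuilds.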
Proposal
Update the copy when an error due to the registry being unavailable occurs. This impacts both the API and the UI.
User interface
We are having trouble connecting to the Container Registry. Please try refreshing the page. If this error persists, please review the troubleshooting documentation.
API
We are having trouble connecting to the Container Registry. If this error persists, please review the troubleshooting documentation.
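As a rough sketch of what the API side could return, assuming a JSON body with a `message` key and a 503 status (both are assumptions, and neither the method name nor the constant below exists in GitLab):

```ruby
require 'json'

# Hypothetical constant holding the API copy proposed above.
REGISTRY_UNAVAILABLE_MESSAGE =
  'We are having trouble connecting to the Container Registry. ' \
  'If this error persists, please review the troubleshooting documentation.'

# Returns a [status, body] pair. The 503 status is an assumed choice.
def registry_unavailable_response
  [503, JSON.generate({ message: REGISTRY_UNAVAILABLE_MESSAGE })]
end

status, body = registry_unavailable_response
puts status
puts body
```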
This is a bit tricky to reproduce on GDK, to properly do so:
Boot GDK and registry, access the page once
After the first successful access turn off docker app
Refresh to show the error
If too much time passes, the 'connection' to the registry will be lost and a full-page error will be raised: Failed to open TCP connection to 0.0.0.0:5000 (Connection refused - connect(2) for "0.0.0.0" port 5000). The docker app must then be restarted.
We do have a connection_error flag (character_error) that should be set before we even reach the API call, but that flag is somehow not set.
Are there any steps they can take other than refreshing the page? We can send them to troubleshooting docs, but if there are actions they can take without leaving the product, that would be great...
@sselhorn We would like to update the UI copy when there is an error with the Container Registry.
Current UI
My first suggestion is:
### Container Registry error
We are having trouble connecting to the registry, please try refreshing the page. If this error persists, please review the documentation for troubleshooting.
I'm not sure what the standards are for big error messages like this, so I would appreciate your feedback.
@sselhorn I have this design issue for 13.5. I think the only change should be focused on the copy. Would you mind reviewing my suggestion above and we can move this forward?
I think Nico's comment: #227466 (comment 376368291) summarizes when the error occurs. Basically, if the registry is down and you attempt to use it, you will receive this error. Both in the API and in the UI. There are three use cases we need to tackle this issue:
Change the meaning / value of character_error to a more general connection_error that is set to true whenever we can't communicate with the registry.
Adjust the copy / UX a bit so that the error message is more general.
@nmezzopera @trizzi @icamacho Thanks everyone for the explanation. Here is an attempt at editing Ian's UI text. Please feel free to correct/edit as you see fit...
### Container Registry error
We are having trouble connecting to the Container Registry. Please try refreshing the page. If this error persists, please review the troubleshooting documentation.
@icamacho Good point, I think we will likely need someone from the backend to update the API response when the registry is down. So it would be:
User interface
We are having trouble connecting to the Container Registry. Please try refreshing the page. If this error persists, please review the troubleshooting documentation.
API
We are having trouble connecting to the Container Registry. If this error persists, please review the troubleshooting documentation.
@10io can we update the API response for the registry?
The UI message shown in #227466 (comment 376392325) is displayed when @character_error is set. This variable is set when ContainerRegistry::Path::InvalidRegistryPathError is raised. This error is only raised if a container repository path is not valid.
From my understanding, we have two cases:
Can't connect to the registry because the path is invalid. That's the case covered by @character_error.
Can't connect to the registry because of a network issue. This case is not currently handled as far as I know.
can we update the API response for the registry?
@trizzi From the above, we want the same error message for both cases. We will need to update the frontend and the api to handle these.
@10io @trizzi I think if we want to handle registry network errors that are not path errors, we would need a new backend flag to show this. Currently, if the path is correct but the registry is unreachable, we immediately get a 500 and are not even able to reach the frontend.
The backend has to handle the network errors. The most straightforward way to do this is to create a new error class and let the Rails controller catch it. The controller can then present the error to the frontend in whatever form is needed.
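A minimal, framework-free sketch of that idea follows. All class names here (`RegistryConnectionError`, `FakeRegistryClient`, `RepositoriesPage`) are hypothetical stand-ins, not GitLab classes. In a real Rails controller the rescue would more likely live in a `rescue_from` handler or a `rescue` clause in the action.

```ruby
# Hypothetical error class for "the registry cannot be reached".
class RegistryConnectionError < StandardError; end

# Stand-in for the registry client; `reachable:` simulates an outage.
class FakeRegistryClient
  def initialize(reachable:)
    @reachable = reachable
  end

  def tags
    # Simulate the low-level failure seen when the registry is down.
    raise Errno::ECONNREFUSED, 'connect(2) for "0.0.0.0" port 5000' unless @reachable

    %w[latest stable]
  rescue SystemCallError => e
    # Translate the network failure into a single domain-specific error.
    raise RegistryConnectionError, e.message
  end
end

# Stand-in for the controller: it catches the new error and exposes a
# connection_error flag that the view could pass on as a data attribute.
class RepositoriesPage
  def initialize(client)
    @client = client
  end

  def show
    { connection_error: false, tags: @client.tags }
  rescue RegistryConnectionError
    { connection_error: true, tags: [] }
  end
end

p RepositoriesPage.new(FakeRegistryClient.new(reachable: true)).show
p RepositoriesPage.new(FakeRegistryClient.new(reachable: false)).show
```

The point of the separate error class is that a network failure no longer surfaces as a 500, and it stays distinguishable from the invalid-path case already handled by @character_error.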
@10io if we can catch this at the controller level, adding a new data attribute in the view file(s) - project and group - would be the best solution frontend wise.
Marin Jankovski changed title from During Container Registry outage, the UI of GitLab shows incorrect information to When Container Registry is unavailable, the UI of GitLab shows incorrect information
Update: Unfortunately we won't be able to get to this in 13.8. I'm going to move this to the backlog for now, but will try and schedule in subsequent milestones.
@jhampton This issue is on the infradev board. Please work with your PM to get it scheduled for a milestone. Here is the process documentation for reference.
Update: Unfortunately, we will not have the capacity to work on this in 13.11. Based on a recent rebalancing of our roadmap, this will likely need to wait a few months. Moving to 14.3.