During a recent outage: gitlab-com/gl-infra/production#2381 (closed) it was brought up that in various places in the GitLab UI, we will show incorrect information. If the API calls to GitLab fail, we do not indicate a problem with the Container Registry, but instead return nil values. This is a serious problem as scripts may rely on the ability to find tags to determine if an object needs to be built. This can lead to unnecessary rebuilds. It would appear that our API may not be properly returning a valid response to the user when the Container Registry is suffering a problem.
Intentionally shutdown the Container Registry service
Perform various operations against the Container Registry:
Make an API call to find an image
Tell a CI job to pull an image
What is the current bug behavior?
The API will respond noting that the Docker image tag does not exist.
What is the expected correct behavior?
The API should return an error.
Proposal
Update the copy when an error due to the registry being unavailable occurs. This impacts both the API and the UI.
User interface
We are having trouble connecting to the Container Registry. Please try refreshing the page. If this error persists, please review the troubleshooting documentation.
API
We are having trouble connecting to the Container Registry. If this error persists, please review the troubleshooting documentation.
This is a bit tricky to reproduce on GDK, to properly do so:
Boot GDK and registry, access the page once
After the first successful access turn off docker app
Refresh to show the error
If too much time passes the 'connection' to the registry will be lost (and a full-page error will be raied Failed to open TCP connection to 0.0.0.0:5000 (Connection refused - connect(2) for "0.0.0.0" port 5000)) and the docker app must be restarted
we do have a connection_error flag (character_error) that it should be set to on before we even reach calling the API, such flag is somehow not set
Are there any steps they can take other than refreshing the page? We can send them to troubleshooting docs, but if there are actions they can take without leaving the product, that would be great...
@sselhorn We would like to update the UI copy when there is an error with the Container Registry.
Current UI
My first suggestion is:
### Container Registry errorWe are having trouble connecting to the registry, please try refreshing the page. If this error persists, please review the documentation for troubleshooting.
I'm not sure what the standards are for big error messages like this, so I would appreciate your feedback.
@sselhorn I have this design issue for 13.5. I think the only change should be focused on the copy. Would you mind reviewing my suggestion above and we can move this forward?
I think Nico's comment: #227466 (comment 376368291) summarizes when the error occurs. Basically, if the registry is down and you attempt to use it, you will receive this error. Both in the API and in the UI. There are three use cases we need to tackle this issue:
change the meaning /value of character_error to a more general connection_error that is set to true in any case we can't communicate with the registry.
change a bit copy / ux so that the error message is more general
@nmezzopera@trizzi@icamacho Thanks everyone for the explanation. Here is an attempt at editing Ian's UI text. Please feel free to correct/edit as you see fit...
### Container Registry errorWe are having trouble connecting to the Container Registry. Please try refreshing the page. If this error persists, please review the troubleshooting documentation.
@icamacho Good point, I think we will likely need someone from the backend to update the API response when the registry is down. So it would be:
User interface
We are having trouble connecting to the Container Registry. Please try refreshing the page. If this error persists, please review the troubleshooting documentation.
API
We are having trouble connecting to the Container Registry. If this error persists, please review the troubleshooting documentation.
@10io can we update the API response for the registry?
The UI message shown in #227466 (comment 376392325) is displayed when @character_error is set. This variable is set if ContainerRegistry::Path::InvalidRegistryPathError. This error is only thrown if a container repository path is not valid.
From my understanding, we have two cases:
Can't connect to the registry because the path is invalid. That's the case by the @character_error.
Can't connect to the registry because of a network issue. This case is not currently handled as far as I know.
can we update the API response for the registry?
@trizzi From the above, we want the same error message for both cases. We will need to update the frontend and the api to handle these.
@10io@trizzi I think if we want to handle registry network errors which are not path errors we would need a new backend flag to show this, currently if the path is correct but the registry is unreachable we immediately get a 500 and are not even able to reach the frontend
The backend has to handle the network errors. The most straightforward way to do this is to create a new error class and let the rails controller catch it and then, it can present this in whatever form to the frontend.
@10io if we can catch this at the controller level, adding a new data attribute in the view file(s) - project and group - would be the best solution frontend wise.
Marin Jankovskichanged title from During Container Registry outage, the UI of GitLab shows incorrect information to When Container Registry is unavailable, the UI of GitLab shows incorrect information
changed title from During Container Registry outage, the UI of GitLab shows incorrect information to When Container Registry is unavailable, the UI of GitLab shows incorrect information
Update: Unfortunately we won't be able to get to this in 13.8. I'm going to move this to the backlog for now, but will try and schedule in subsequent milestones.
@jhampton This issue is on the infradev board. Please work with your PM to get it scheduled for a milestone. Here is the process documentation for reference.
Update: Unfortunately, we will not have the capacity to work on this in 13.11. Based on a recent rebalancing of our roadmap, this will likely need to wait a few months. Moving to 14.3.
@10io & @nmezzopera - can you please add a weight to this issue for the work we need to do please? If necessary, please feel free to split this issue into ruby and frontend efforts.
An error is already used for one of the cases and the other case should raise a Faraday error. See #227466 (comment 422530078)
I think, it's better for the backend to raise those different errors for the different cases and let clients, such as the frontend, decide if they want to handle those cases separately (eg. There is an issue with the path of your image) or with a single error message (eg. The container registry can't be reached.)
@trizzi since this infradev issue is a year old and well outside our SLO, I've moved it up to %14.4 in support of our focus on reliability this quarter and commitment to address infradev issues.
The container registry pages pull data from 2 sources: rails and the container registry.
The question is: what is better? Should we show a connection error when we know that the container registry is not available or should we do the "best effort" here and show what we have in the rails side (incomplete data)?
Here is what I mean by incomplete data:
We have the images (that's the rails part) but the tags count (container registry part) are missing.
To me, I think it's better to not show incomplete data because we could be in confusing situations.
Look at the screen above, yes, we show the container images list but we also show the Something went wrong while fetching the repository list.. This communicates that the list might be incomplete.
Next, we have the delete buttons available but if the container registry is down, that delete button will not do anything.
Lastly, we can click on the container image to get the details but this will not work at all. The container image details page is mainly pulling data from the container registry, so the page will show errors (no tags at all).
Overall, I think that it's better to "block" the index page with a connection error message when we know for sure that the container registry is down. We could show the rails part but I don't think it's worth it.
The same situation happens in the API side.
@trizzi / @katiemacoy Can you confirm that it's ok to display the connection error message even though we could display "incomplete" data on the UI side?
I have a small concern that the single MR might end up being too big. If that happens, I'll split it into two. One for the UI and the other for the APIs.
We will need a follow up issue on the frontend side but I will discuss that in the MR directly.
I'm having a hard time to get the MR reviewed by a maintainer. I just switched to a new one. This will be the fourth maintainer selected to review the MR.
Having said that, I don't expect many 🏓 with the maintainer. As such, I'm currently confident that this will make it for %14.4
Tim Rizzichanged title from When Container Registry is unavailable, the UI of GitLab shows incorrect information to Better support for connection errors with the Container registry
changed title from When Container Registry is unavailable, the UI of GitLab shows incorrect information to Better support for connection errors with the Container registry
@10io Does that mean that this issue cannot be closed until is done? Or do you feel comfortable closing this and reopening it if needed? (I would vote for closing and reopening if it is needed)