Let's Encrypt integration doesn't scale and does not give any feedback to the user on errors
Problem to solve
Currently, LE certificates "just work" on gitlab.com, and if properly configured will work properly on on-premise installations of GitLab.
The line above is not true anymore. =(
If GitLab is for some reason unable to obtain a Let's Encrypt certificate, we save the acme_order and continue to check it every 15 minutes.
This causes 2 problems:
- Checking every 15 minutes makes us hit the very basic rate limit on obtaining the ACME directory: #35934 (closed). This is invisible to the user and will be retried, but it spams our error tracking system (Sentry). It can also potentially cause problems for users once we accumulate many ACME orders...
- The ACME order expiration time is 7 days by default. This means that we will actually retry certificate creation only after a week, and the user has no interface to retry it earlier. I tried to reduce this time to 2 hours, but it caused us to hit another rate limit: #197978 (closed).
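To make the trade-off between the two problems concrete, here is a rough back-of-the-envelope sketch of the load generated by a single domain that never validates. The function names are made up for illustration, and the sketch assumes one check per 15-minute tick and one fresh order each time the previous one expires, which simplifies what GitLab actually does.

```python
from datetime import timedelta

# Polling interval taken from the description above.
CHECK_INTERVAL = timedelta(minutes=15)


def checks_per_order(order_lifetime: timedelta) -> int:
    """Number of polls made against one order before it expires."""
    return order_lifetime // CHECK_INTERVAL


def new_orders_per_week(order_lifetime: timedelta) -> int:
    """Fresh orders created per week for a domain that never validates
    (assumption: a new order is created as soon as the old one expires)."""
    return timedelta(days=7) // order_lifetime


for lifetime in (timedelta(days=7), timedelta(hours=2)):
    print(lifetime, checks_per_order(lifetime), new_orders_per_week(lifetime))
```

Under these assumptions, a 7-day expiration means 672 checks against a single order, while a 2-hour expiration means only 8 checks per order but up to 84 new orders per week for each failing domain. Neither setting removes the load; it only moves it between the checks and the order creation.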
Kinds of errors
Most of the errors on gitlab.com are failed domain verification challenges. Some of the users might have transferred their domains to other services.
Some other options:
- Generic API error (I can think of only one way of reproducing this: saving the wrong account private key)
- Challenges taking too long to pass - this can be caused by misconfigured pages daemon or firewall settings
Only try to obtain the certificate once.
In case of any errors:
- Mark the domain as failed for Let's Encrypt validation, and do not retry it
- Notify the owner of the project by email (as we currently do for our own validation), similar to https://gitlab.com/gitlab-org/gitlab/blob/60787452eaff915d645d979cdca80156f5c2aacf/app/views/notify/pages_domain_disabled_email.html.haml#L4
- Change the message to "Something went wrong while obtaining Let's Encrypt certificate" in the certificate section, and change its color to warning
- Add a "Retry" button to this error message
- When the "Retry" button is pressed, remove the error state, and change the message back to "GitLab is obtaining a Let's Encrypt SSL certificate for this domain. This process can take some time. Please try again later."
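The steps above amount to a small state machine per domain. Below is a minimal sketch of that flow, not GitLab's actual implementation: `LetsEncryptState`, `PagesDomain`, `obtain_certificate`, and `retry` are hypothetical names, the `request_certificate` callable stands in for the real ACME exchange, and the `notifications` list stands in for the owner email.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Callable, List


class LetsEncryptState(Enum):
    OBTAINING = "obtaining"
    ISSUED = "issued"
    FAILED = "failed"


# User-facing messages quoted from the proposal above.
MESSAGES = {
    LetsEncryptState.OBTAINING: (
        "GitLab is obtaining a Let's Encrypt SSL certificate for this domain. "
        "This process can take some time. Please try again later."
    ),
    LetsEncryptState.FAILED: (
        "Something went wrong while obtaining Let's Encrypt certificate"
    ),
}


@dataclass
class PagesDomain:
    name: str
    state: LetsEncryptState = LetsEncryptState.OBTAINING
    notifications: List[str] = field(default_factory=list)

    def obtain_certificate(self, request_certificate: Callable[[str], None]) -> None:
        """Try exactly once; on any error mark the domain failed and notify."""
        if self.state is not LetsEncryptState.OBTAINING:
            return  # failed (or already issued) domains are not retried automatically
        try:
            request_certificate(self.name)
        except Exception:
            self.state = LetsEncryptState.FAILED
            # Stand-in for the email to the project owner; the raw error is
            # deliberately left out of anything the user sees.
            self.notifications.append(f"Let's Encrypt failed for {self.name}")
        else:
            self.state = LetsEncryptState.ISSUED

    def retry(self) -> None:
        """Handler for the "Retry" button: clear the error state."""
        if self.state is LetsEncryptState.FAILED:
            self.state = LetsEncryptState.OBTAINING

    def message(self) -> str:
        """Text for the certificate section; empty once a certificate is issued."""
        return MESSAGES.get(self.state, "")
```

In this sketch a background worker would call `obtain_certificate` and simply do nothing for domains in the failed state, so the only way back to "obtaining" is the user pressing "Retry". That keeps permanently broken domains from generating periodic ACME traffic.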
As part of this solution, we need to enable the silenced `pages_domain_ssl_renewal` alerts in AlertManager, as discussed in gitlab-com/runbooks!2002 (merged).
Old description
1. I think in the first case we should just show a generic message like `GitLab couldn't obtain Let's Encrypt's certificates, please ask your system administrator to check the logs`, because displaying the actual error might create a security risk.
2. For the second case, I would show something like `Let's Encrypt couldn't verify your domain, please check that your domain is properly served by gitlab pages and ask your system administrator to check pages configuration and logs.`
Note: judging from the https://gitlab.com/gitlab-org/gitlab-ce/issues/66107 example, I would say that displaying the error to the user would be a bad idea, since it can contain private information (it doesn't in that case, but these logs look more like system information).
I would send these kinds of errors to Sentry and the logs to start with.
Extracted from https://gitlab.com/gitlab-org/gitlab-ce/issues/28996