Observability Improvement
Summary
Logs
When gitlab-pages receives a request that cannot be fulfilled because the TLS handshake fails, the logs give us very little information. Example:
2021/06/30 18:53:19 http: TLS handshake error from 34.73.184.90:33570: remote error: tls: bad certificate
The only useful information we have is the source IP and the reason why gitlab-pages cannot complete the request. We have no details about which domain was requested, which prevents us from helping to identify the target of a potential bad actor. The only method currently available is to perform a packet capture, hope that another request comes in, and observe the PROXY protocol data for the domain that was requested by the client.
Metrics
In that same situation, we also receive no data on this request in our metrics. During incident gitlab-com/gl-infra/production#5050 (closed) it was noted that this error was showing up 4500 times per 10 seconds, yet our own dashboard reported no requests during the same time period. This is captured in this thread: gitlab-com/gl-infra/production#5050 (comment 615703006)
Steps to reproduce
- Create a pages project using TLS with an intentionally malformed certificate that would be rejected by the client.
- Observe the logs from gitlab-pages when attempting to reach the site
- Observe the metrics from the gitlab-pages service when attempting to reach the site
What is the expected correct behavior?
Logs
- Log output should be in structured JSON format. Currently only successful requests are properly structured; the message above is not
- Log output should include the requested domain
- This specific log output should be of warning status, since in most cases the failure is no fault of ours: this service doesn't maintain the certificates, which are the responsibility of the end user who configures this feature
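To make the request above concrete, here is a minimal Go sketch of what one such log line could look like. It is not the gitlab-pages implementation: the field names (`remote_addr`, `domain`, and so on), the message text, and the `example.gitlab.io` domain are all hypothetical. It only assumes that the requested domain has been obtained somewhere during the handshake, for example from the `ServerName` field of Go's `tls.ClientHelloInfo`.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// handshakeFailure holds the fields this issue asks for. All field names are
// hypothetical and do not reflect the actual gitlab-pages log schema.
type handshakeFailure struct {
	Level      string `json:"level"`
	Msg        string `json:"msg"`
	RemoteAddr string `json:"remote_addr"`
	Domain     string `json:"domain"`
	Error      string `json:"error"`
}

// formatHandshakeFailure renders one failed TLS handshake as a single JSON
// line at warning level, including the requested domain.
func formatHandshakeFailure(remoteAddr, domain, errText string) (string, error) {
	b, err := json.Marshal(handshakeFailure{
		Level:      "warning",
		Msg:        "TLS handshake error",
		RemoteAddr: remoteAddr,
		Domain:     domain,
		Error:      errText,
	})
	if err != nil {
		return "", err
	}
	return string(b), nil
}

func main() {
	// The domain is a placeholder; the unstructured message in this issue
	// does not contain one.
	line, err := formatHandshakeFailure(
		"34.73.184.90:33570",
		"example.gitlab.io",
		"remote error: tls: bad certificate",
	)
	if err != nil {
		panic(err)
	}
	fmt.Println(line)
}
```

A line in this shape can be filtered by domain or source IP in the log backend, which is what the packet-capture workaround is currently standing in for.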
Metrics
- Metrics should be improved to capture legitimate requests regardless of whether they complete successfully. Perhaps a counter called gitlab_pages_failed_tls_connect or something similar. We can then add this to our existing metric that captures request rate.
Output of checks
This feature request is for GitLab.com
~"devops::release" ~"group::release" Category:Pages