Better logging of errors in structured logs
Problem to solve
Make it easier to troubleshoot and correlate application errors in GitLab.
Currently it's hard to correlate failures to specific error messages in the logs, especially on busy or distributed systems.
Intended users
GitLab Admins
Further details
For most of our customers, the only place to get error details and backtraces for most GitLab errors is the unstructured production.log file. I think it's rare for our customers to include Sentry in their GitLab infrastructure. I've never seen it, and unless we ship it with Omnibus I don't think it's reasonable to assume that people will add it. Also, in my experience, searching Sentry by correlation ID has been unreliable.
I have an example of what this problem looked like today.
- Self-managed customer is getting project info via the GitLab API. When they get to a specific page of results, GitLab returns a 500 error instead of expected results.
- It's hard to search generically for errors and sort through them because the customer runs several nodes that get tons of traffic.
- I suggested that it might be easiest to grab the correlation ID from the response headers of the failed request and then search their logs for it.
- We got the relevant log entries, and I can see the 500 error in the structured log entry. But it doesn't contain any error details.
- I offered a workaround of capturing a GitLab tail while reproducing the issue, with a hosts entry pointing them to a specific node so they know where to capture logs. But this seems more complicated than it needs to be.
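For reference, the correlation ID search described above can be sketched as a small filter over JSON-formatted log lines. This is only an illustration of the workflow, not a GitLab tool: the function name `find_by_correlation_id` is hypothetical, and the `correlation_id` field name is an assumption about how the structured entries are keyed.

```python
import json
from typing import Iterable, Iterator


def find_by_correlation_id(
    lines: Iterable[str],
    correlation_id: str,
    field: str = "correlation_id",  # assumed field name in the structured log
) -> Iterator[dict]:
    """Yield every JSON log entry whose correlation ID matches."""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            # Skip unstructured lines mixed into the file.
            continue
        if isinstance(entry, dict) and entry.get(field) == correlation_id:
            yield entry
```

The same filter would have to be run against the logs from every node, which is why a matching entry that lacks error details is so frustrating: the search succeeds but still does not explain the 500.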
Proposal
Every application error should have the exception class, message, and backtrace included in the structured log entry for that request.
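A minimal sketch of what such an entry could look like, written in Python purely for illustration (GitLab itself is a Ruby application, and this is not its implementation). The function name `build_log_entry` and the `exception.class`, `exception.message`, and `exception.backtrace` field names are hypothetical choices made for this example.

```python
import json
import traceback
from typing import Optional


def build_log_entry(request_fields: dict, exc: Optional[BaseException] = None) -> str:
    """Return one JSON log line, adding exception details when the request failed."""
    entry = dict(request_fields)
    if exc is not None:
        frames = traceback.extract_tb(exc.__traceback__)
        # Hypothetical field names; the point is that all three travel
        # with the same entry that carries the correlation ID.
        entry["exception.class"] = type(exc).__name__
        entry["exception.message"] = str(exc)
        entry["exception.backtrace"] = [
            f"{frame.filename}:{frame.lineno}:in {frame.name}" for frame in frames
        ]
    return json.dumps(entry)
```

With the exception attached to the request's own entry, a single search by correlation ID would return both the 500 status and the reason for it, without tailing logs on a specific node.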