Improve SeatLink Apdex score
Problem
The groupprovision Apdex failure rate for customersdot rails_requests is high. Over the past 7 days the Apdex failure ratio is 47255/93360 (about 50.6%), which is holding back our overall error budget.
Sorting by average in the Grafana ranking, Api::V1::SeatLinksController#create appears to be the largest contributing factor. It usually peaks around 4 AM UTC each day at the -50% mark. For example, I found one request in the CDot production log that took 7s.
Proposal
We should first gather stats on request time to see what the distribution and average look like. Since we don't have log.gitlab.net for CDot, this will take some manual effort.
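As a starting point for the manual stats gathering, here is a minimal Python sketch that summarizes request durations from exported log lines. It assumes the production log is available as JSON lines with `controller`, `action`, and `duration_s` fields; those field names are assumptions and should be adjusted to whatever the CDot log actually emits. The percentiles use the simple nearest-rank method.

```python
import json
import math


def summarize_durations(lines, controller="Api::V1::SeatLinksController", action="create"):
    """Summarize request durations (in seconds) for one controller action.

    Assumes each line is a JSON object with `controller`, `action`, and
    `duration_s` keys. These field names are hypothetical.
    """
    durations = []
    for line in lines:
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip lines that are not valid JSON
        if not isinstance(entry, dict):
            continue
        if entry.get("controller") != controller or entry.get("action") != action:
            continue
        value = entry.get("duration_s")
        if isinstance(value, (int, float)) and not isinstance(value, bool):
            durations.append(float(value))

    if not durations:
        return None

    durations.sort()

    def percentile(p):
        # Nearest-rank percentile: smallest value with at least p% of samples at or below it.
        rank = max(1, math.ceil(p / 100 * len(durations)))
        return durations[rank - 1]

    return {
        "count": len(durations),
        "mean": sum(durations) / len(durations),
        "p50": percentile(50),
        "p95": percentile(95),
        "p99": percentile(99),
        "max": durations[-1],
    }
```

Usage would look like `summarize_durations(open("production.log"))`, where the file name is a placeholder. Comparing `p50` against `p95` and `max` should show whether the problem is a slow typical request or a long tail around the daily peak.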
Engineering-wise, we can look into reducing the time each seat link request takes.
The peak probably means a large number of self-managed instances are concentrated in the same timezone. We may also look into spreading the sync time further apart.
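One way to spread the sync time could be to derive a stable per-instance offset from a hash of an instance identifier, so each instance syncs at a different point in the window instead of at the same wall-clock time. The Python sketch below only illustrates the idea: the function name, the identifier, and the 24-hour window are all hypothetical, and this issue does not establish how or where the sync is currently scheduled.

```python
import hashlib


def sync_offset_seconds(instance_id: str, window_seconds: int = 24 * 60 * 60) -> int:
    """Return a stable offset, in seconds from the start of the window, for one instance.

    `instance_id` is a hypothetical stable identifier for a self-managed instance.
    The same identifier always maps to the same offset, and different identifiers
    are spread roughly uniformly across the window.
    """
    if window_seconds <= 0:
        raise ValueError("window_seconds must be positive")
    digest = hashlib.sha256(instance_id.encode("utf-8")).digest()
    # Use the first 8 bytes of the digest as an unsigned integer.
    return int.from_bytes(digest[:8], "big") % window_seconds
```

Because the offset is deterministic, an instance keeps the same sync slot from day to day, which keeps the behaviour predictable while still flattening the peak.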
Product management-wise, we may have to consider increasing the request threshold (urgency). According to the Rails request Apdex SLI doc, an automated process is not user-facing, so it makes sense for it to have a lower urgency.
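To estimate how much a different threshold would change the numbers, the gathered durations can be replayed against candidate thresholds. The Python sketch below treats a request as successful when its duration is at or below the threshold. That is a simplification, and the threshold values in the usage note are placeholders; the real values per urgency level should be taken from the Rails request Apdex SLI doc.

```python
def success_ratio(durations_s, threshold_s):
    """Fraction of requests whose duration is within the given threshold.

    Simplified model: a request counts as successful if duration <= threshold.
    """
    if not durations_s:
        raise ValueError("durations_s must not be empty")
    within = sum(1 for d in durations_s if d <= threshold_s)
    return within / len(durations_s)
```

For example, with sample durations of `[0.2, 0.8, 1.5, 3.0, 7.0]` seconds, a 1.0s threshold gives a ratio of 0.4 and a 5.0s threshold gives 0.8. Running the same comparison on real SeatLinksController#create durations would show whether a lower urgency alone recovers enough of the error budget or whether the request time also needs to come down.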
Result
Reduce the number of apdex errors.
Next steps (if any)
How will we measure success?
A healthier apdex percentage of less than -5%.