Improve SeatLink Apdex score

Problem

The groupprovision apdex failure rate for customersdot rails_requests is high. Over the past 7 days the apdex failure ratio was 47255/93360 (about 50.6%), which is holding back our overall error budget.

Sorting by average in the Grafana ranking, Api::V1::SeatLinksController#create appears to be the largest contributing factor. It usually peaks around 4 AM UTC each day at about the -50% mark. For example, I found one request in the CDot production log that took 7s.

Proposal

We should first gather stats on request time to see what the distribution and average look like. Since we don't have log.gitlab.net for CDot, this will take some manual effort.
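As a starting point for the manual stats gathering, here is a minimal Python sketch that summarizes request durations from exported log lines. It assumes the CDot production log can be exported as JSON lines with a `path` and a `duration_s` field; both field names, the `seat_links` path fragment, and the function name are hypothetical and would need adjusting to the real log format.

```python
import json
import math
import statistics


def summarize_durations(log_lines, path_fragment="seat_links", field="duration_s"):
    """Summarize request durations for log entries whose path matches.

    Assumes one JSON object per line; `path` and `duration_s` are
    hypothetical field names, not confirmed against the CDot log format.
    """
    durations = []
    for line in log_lines:
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip lines that are not valid JSON
        if not isinstance(entry, dict):
            continue
        if path_fragment in str(entry.get("path", "")) and field in entry:
            durations.append(float(entry[field]))

    if not durations:
        return None

    durations.sort()

    def percentile(p):
        # Nearest-rank percentile over the sorted durations.
        rank = max(1, math.ceil(p / 100 * len(durations)))
        return durations[rank - 1]

    return {
        "count": len(durations),
        "mean": statistics.fmean(durations),
        "p50": percentile(50),
        "p95": percentile(95),
        "max": durations[-1],
    }
```

Running this separately on lines from around the 4 AM UTC peak and on lines from the rest of the day would show whether requests are slow in general or only under the peak load.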

Engineering-wise, we can look into reducing the seat link request time.

The peak probably means a large number of self-managed instances are concentrated in the same timezone. We may also look into spreading the sync time further apart.
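One way to spread the sync times is to give each instance a stable offset within a window instead of a fixed time of day. The sketch below is an illustration only, not how the sync is currently scheduled: the function name, the use of an instance identifier as input, and the 6-hour window are all assumptions.

```python
import hashlib


def sync_offset_seconds(instance_id: str, window_seconds: int = 6 * 3600) -> int:
    """Return a stable per-instance offset in [0, window_seconds).

    Hashing the identifier keeps the offset the same from day to day for a
    given instance, while different instances land at different points in
    the window. The 6-hour default window is a hypothetical value.
    """
    digest = hashlib.sha256(instance_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % window_seconds
```

A deterministic offset is used here rather than a fresh random delay on each run so that a given instance still syncs at a predictable time every day.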

Product management-wise, we may have to consider increasing the request threshold (that is, lowering the urgency). According to the Rails request Apdex SLI doc, an automated process is not user-facing, so it makes sense for it to have a lower urgency.
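To judge how much a different threshold would change the score, the same set of measured durations can be evaluated against several candidate thresholds. This is a minimal sketch, assuming the SLI counts a request as successful when it finishes within the threshold; the function name is hypothetical, and the threshold values in the usage note are illustrative placeholders rather than values taken from the SLI doc.

```python
def apdex_success_ratio(durations, threshold_seconds):
    """Fraction of requests that finished within the threshold.

    Assumes a request counts as satisfied when its duration is less than
    or equal to the threshold; the real SLI may define the boundary
    differently.
    """
    if not durations:
        raise ValueError("durations must not be empty")
    satisfied = sum(1 for d in durations if d <= threshold_seconds)
    return satisfied / len(durations)
```

For example, with sample durations of 0.5s, 0.8s, 2.0s and 7.0s, a 1-second threshold gives a success ratio of 0.5, while a 5-second threshold gives 0.75. A request like the 7s one found in the log would still fail under both.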

Result

Reduce the number of apdex errors.

Next steps (if any)

How will we measure success?

A healthier apdex percentage of less than -5%.

Edited by Mark Chao