Export Prometheus metrics for Mailgun failures

Today I learned that we are receiving Mailgun permanent failure webhooks:

  1. https://docs.gitlab.com/ee/administration/integration/mailgun.html
  2. https://log.gprd.gitlab.net/goto/8b7d18b0-aa36-11ec-bd7b-c108343628c3

It looks to me like the main purpose is to confirm that a member invite was successfully sent: https://gitlab.com/gitlab-org/gitlab/blob/9c8a128ea056ee9170c7a32ad28a65900ec873aa/app/services/members/mailgun/process_webhook_service.rb.

I think we should use this endpoint to track both permanent and temporary failures as metrics. For example, the Mailgun failure graphs for the last month show a significant jump in failures starting on March 17:

[Image: Mailgun failure graph for the last month]

If we had these metrics in real time, we might have been able to detect the rise in errors as it happened.
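A rough sketch of what counting failures from the webhook could look like. This is not how the existing Ruby service works. It assumes the webhook payload carries an `event-data` object with `event`, `severity`, and `reason` fields, and the metric name `mailgun_delivery_failures_total` is hypothetical. Python is used only for illustration.

```python
from collections import Counter

# Hypothetical metric name; not an existing GitLab metric.
METRIC_NAME = "mailgun_delivery_failures_total"

# In-memory counters keyed by (severity, reason).
failures = Counter()


def record_webhook(payload: dict) -> bool:
    """Count one webhook payload; return True if it was a failure event.

    Assumes the payload has an "event-data" object with "event",
    "severity" ("permanent" or "temporary"), and "reason" fields.
    """
    event = payload.get("event-data", {})
    if event.get("event") != "failed":
        return False
    severity = event.get("severity", "unknown")
    reason = event.get("reason", "unknown")
    failures[(severity, reason)] += 1
    return True


def render_metrics() -> str:
    """Render the counters in the Prometheus text exposition format."""
    lines = [
        f"# HELP {METRIC_NAME} Mailgun delivery failures seen via webhook.",
        f"# TYPE {METRIC_NAME} counter",
    ]
    for (severity, reason), count in sorted(failures.items()):
        lines.append(
            f'{METRIC_NAME}{{severity="{severity}",reason="{reason}"}} {count}'
        )
    return "\n".join(lines) + "\n"
```

With a counter labelled by severity, an alert on the rate of increase would have caught a jump like the one on March 17, provided Mailgun is configured to send temporary failure webhooks as well as permanent ones.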

UPDATE: We should probably use the Mailgun API to scrape this data instead, since there are a lot more insights available there.
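For the scraping approach, a minimal sketch of turning an API response into per-severity totals that an exporter could publish. The response shape here is an assumption (a `stats` list whose entries hold `failed.permanent.total` and `failed.temporary.total`) and should be checked against the Mailgun API documentation. The function name `failure_totals` is made up, and the HTTP call itself is left out.

```python
def failure_totals(stats_response: dict) -> dict:
    """Sum permanent and temporary failure totals across all time buckets.

    Assumed (unverified) response shape:
    {"stats": [{"failed": {"permanent": {"total": N},
                           "temporary": {"total": M}}}, ...]}
    """
    totals = {"permanent": 0, "temporary": 0}
    for bucket in stats_response.get("stats", []):
        failed = bucket.get("failed", {})
        for severity in totals:
            totals[severity] += failed.get(severity, {}).get("total", 0)
    return totals
```

Missing buckets or fields count as zero, so a partial response does not break the scrape.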

Edited Mar 23, 2022 by Stan Hu