Propose plan for GCP-AWS VPN monitoring
Purpose
This document outlines a plan for adding monitoring to the VPN tunnels that connect our Google Cloud Platform and Amazon Web Services accounts for macOS runners. Monitoring would help us detect deterioration or misconfiguration of the tunnels, particularly when some tunnels have failed but runners still appear to work normally because at least one tunnel remains functional.
Current state
Currently, we have two VPN connections in our AWS VPN gateway, each with two VPN tunnels. All four tunnels connect to our GCP VPN gateway.
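As a rough illustration of this topology, the sketch below flattens a response shaped like the one returned by the AWS EC2 `describe_vpn_connections` call (where each connection carries a `VgwTelemetry` list with one entry per tunnel) into one row per tunnel. The function name `tunnel_statuses`, the connection IDs, and the IP addresses are hypothetical placeholders, not values from our accounts.

```python
from typing import Dict, List


def tunnel_statuses(response: Dict) -> List[Dict[str, str]]:
    """Flatten a describe_vpn_connections-style response into one row per tunnel."""
    rows = []
    for conn in response.get("VpnConnections", []):
        # Each AWS VPN connection reports its tunnels under "VgwTelemetry".
        for tunnel in conn.get("VgwTelemetry", []):
            rows.append({
                "vpn_connection_id": conn["VpnConnectionId"],
                "outside_ip": tunnel["OutsideIpAddress"],
                "status": tunnel["Status"],  # "UP" or "DOWN"
            })
    return rows


# Hypothetical sample: two connections with two tunnels each (IDs and IPs are made up).
SAMPLE_RESPONSE = {
    "VpnConnections": [
        {"VpnConnectionId": "vpn-aaaa", "VgwTelemetry": [
            {"OutsideIpAddress": "203.0.113.10", "Status": "UP"},
            {"OutsideIpAddress": "203.0.113.11", "Status": "UP"},
        ]},
        {"VpnConnectionId": "vpn-bbbb", "VgwTelemetry": [
            {"OutsideIpAddress": "198.51.100.20", "Status": "UP"},
            {"OutsideIpAddress": "198.51.100.21", "Status": "DOWN"},
        ]},
    ]
}

if __name__ == "__main__":
    for row in tunnel_statuses(SAMPLE_RESPONSE):
        print(row)
```

In a real job the response would come from an AWS client (for example boto3's EC2 client) rather than a hard-coded dictionary; that part is left out here so the sketch runs without credentials.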
Motivation for change
Currently, we have no monitoring for these connections. If we lost all VPN tunnels, it would be evident because jobs would stop completing. Without monitoring, however, we cannot tell when only some of the tunnels are lost. macOS runners will be released to GA at the end of the current iteration, and it would be beneficial to have this monitoring in place.
Requirements
An acceptable solution will give us a reasonably prompt indicator of deterioration or misconfiguration of our VPN tunnels.
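One way to make "deterioration" concrete is to classify the set of tunnel statuses into three states, so that a partial loss is reported differently from a full outage. This is only a minimal sketch: the function name `classify` and the state names `healthy`, `degraded`, and `down` are assumptions chosen for illustration.

```python
from typing import Iterable


def classify(statuses: Iterable[str]) -> str:
    """Summarize tunnel statuses ("UP"/"DOWN") as healthy, degraded, or down."""
    statuses = list(statuses)
    up = sum(1 for s in statuses if s == "UP")
    if not statuses or up == 0:
        # No tunnels reported, or none are up: runners cannot communicate.
        return "down"
    if up == len(statuses):
        return "healthy"
    # At least one tunnel is up and at least one is not: jobs still run,
    # so this is the case that goes unnoticed without monitoring.
    return "degraded"


if __name__ == "__main__":
    print(classify(["UP", "UP", "UP", "DOWN"]))
```

The `degraded` state is the one this proposal is mainly concerned with, since it is the state that does not show up as failed jobs.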
Potential solutions
The state of our VPNs could be polled, and the metrics sent, from gitlab-runner, where we already have Prometheus configured and metrics regularly scraped. I dismissed this solution mainly because:
- Most runner managers do not have a VPN connection to worry about
- Multiple runner managers would be relaying the same status metrics simultaneously
Another solution I considered was to dedicate a machine to gathering the VPN statuses and relaying the metrics to Prometheus. In my opinion, this remains a close second option if we cannot meet the requirements of my chosen solution. The main reason I don't prefer it is that we probably don't need VPN metrics more often than every 30 to 60 minutes, which would make a dedicated machine a disproportionate cost.
Proposed solution
The preferred solution is to create a new project that collects VPN statuses and sends those metrics to Prometheus, and to configure the project with scheduled jobs in GitLab at whatever interval we consider most appropriate. To do this, we need a way to push metrics to Prometheus rather than waiting for Prometheus to scrape them from our machine.
Prometheus supports pushed metrics through Pushgateway, a separate component that accepts pushes from short-lived jobs and exposes them for Prometheus to scrape. However, we need to confirm whether a Pushgateway is available or configured for our Prometheus server. I have started a Slack conversation to find out if this would be possible.
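To give a sense of what the scheduled job would send, here is a minimal sketch that renders tunnel statuses in the Prometheus text exposition format and pushes them to a Pushgateway over HTTP using only the Python standard library. The metric name `vpn_tunnel_up`, the job name, the host `pushgateway.example`, and the helper names are all assumptions for illustration; the only part taken from Pushgateway itself is the `/metrics/job/<job>` push path.

```python
import urllib.request
from urllib.parse import quote


def render_metrics(tunnels):
    """Render one gauge sample per tunnel: 1 if the tunnel is UP, else 0."""
    lines = ["# TYPE vpn_tunnel_up gauge"]
    for t in tunnels:
        value = 1 if t["status"] == "UP" else 0
        # Assumes label values contain no quotes, backslashes, or newlines.
        lines.append(
            'vpn_tunnel_up{vpn_connection_id="%s",outside_ip="%s"} %d'
            % (t["vpn_connection_id"], t["outside_ip"], value)
        )
    # The exposition format requires a trailing newline.
    return "\n".join(lines) + "\n"


def push_url(base_url, job):
    """Build the Pushgateway URL for a job's metric group."""
    return "%s/metrics/job/%s" % (base_url.rstrip("/"), quote(job, safe=""))


def push(base_url, job, body, timeout=10):
    """PUT the rendered metrics, replacing the job's previously pushed group."""
    req = urllib.request.Request(
        push_url(base_url, job), data=body.encode("utf-8"), method="PUT"
    )
    req.add_header("Content-Type", "text/plain; version=0.0.4")
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return resp.status


if __name__ == "__main__":
    sample = [
        {"vpn_connection_id": "vpn-aaaa", "outside_ip": "203.0.113.10", "status": "UP"},
        {"vpn_connection_id": "vpn-bbbb", "outside_ip": "198.51.100.21", "status": "DOWN"},
    ]
    print(render_metrics(sample), end="")
    # push("http://pushgateway.example:9091", "vpn_monitor", render_metrics(sample))
```

Using PUT replaces everything previously pushed for the job, so a tunnel that disappears from the AWS response would not linger as a stale sample. Because Pushgateway keeps the last pushed values until they are replaced or deleted, an alert on the age of the last push would likely be needed as well to notice when the scheduled job itself stops running.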
I also have a proof of concept in the following MR in this project: