Opened a ticket with Azure (#116082514590681) as they are currently limiting us to 20 public IP addresses in the ARM space. This is to get additional public static IP addresses for use by proxy nodes, workers, and the eventual database and Redis nodes.
Thank you for contacting Microsoft Support. My name is Jennifer Phuong. I understand that you would like to increase the static IP limit for subscription c802e1f4-573f-4049-8645-4f735e6411b3.
I have engaged our Subscription Management team who will review the request and assist you further. Thanks for your patience while the team reviews and reaches out.
As requested, we have increased the static public IP limit to 60 for the Microsoft Azure Sponsorship subscription id: c802e1f4-573f-4049-8645-4f735e6411b3 in the EastUS2 region.
They'll be in their own Azure network segment: inside of Azure, but not in the same address space as the Prometheus server. This shouldn't cause an issue though.
This morning we found out that ARM-to-classic traffic accounts for a big portion of our monthly Azure bill (see https://gitlab.com/gitlab-com/infrastructure/issues/1063#note_23345878). This is mainly because workers are in classic while storage is in ARM; since Azure considers the two as separate DCs, we pay as if it were external traffic.
To overcome this we decided to move the git workers to ARM so that they'll live on the same network. But we should also move the load balancers; otherwise we would just be shifting one piece, and external traffic would remain.
Now for the plan. If we move the load balancers first, we'll have a situation where traffic goes from ARM (lb) to classic (worker) and back to ARM (storage). Not ideal.
My plan, then, is to make the git workers jump over the fence so they'll live next to the storage nodes, while keeping them in the HAProxy pool in classic, and then move the load balancers at a later stage. While I'm at it, I can see if Terraform can manage and provision these instances for us.
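If we do go the Terraform route, an ARM git worker could be described roughly like this (a minimal sketch using the `azurerm` provider; the resource group, names, image, and VM size are hypothetical placeholders, not our actual config):

```hcl
# Hypothetical sketch of one Terraform-managed git worker in ARM.
# All names, sizes, and the resource group below are placeholders.
resource "azurerm_virtual_machine" "git_worker" {
  name                  = "git-worker-01"
  location              = "East US 2"
  resource_group_name   = "gitlab-production"
  network_interface_ids = ["${azurerm_network_interface.git_worker.id}"]
  vm_size               = "Standard_DS3_v2"

  storage_image_reference {
    publisher = "Canonical"
    offer     = "UbuntuServer"
    sku       = "16.04-LTS"
    version   = "latest"
  }

  storage_os_disk {
    name              = "git-worker-01-os"
    caching           = "ReadWrite"
    create_option     = "FromImage"
    managed_disk_type = "Standard_LRS"
  }

  os_profile {
    computer_name  = "git-worker-01"
    admin_username = "gitlab"
  }

  os_profile_linux_config {
    disable_password_authentication = true
    ssh_keys {
      path     = "/home/gitlab/.ssh/authorized_keys"
      key_data = "${file("~/.ssh/id_rsa.pub")}"
    }
  }
}
```

The upside of doing this now is that scaling the worker fleet later becomes a count change plus `terraform apply`, instead of hand-provisioning machines.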
Ok - note that when we say "move" what we mean is "create new machines". I like the idea of moving the git workers to ARM first, though I don't think we should drag this out in a multi-step dance. The work between spinning up git workers and spinning up the required number of machines to offset the worker fleet in ARM is minimal, so let's just bite the bullet and do it.
As the person who has built the majority of the ARM infrastructure, I've been able to keep a naming principle and scope guide. I'll document that over in https://gitlab.com/gitlab-com/infrastructure/ and would ask that we stick to it for consistency and to avoid confusion and collisions.
Yes, that's what I meant by jumping the fence: new workers/lbs instances have to be created.
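Concretely, while both generations coexist, the classic HAProxy pool can simply list both sets of workers in the same backend (a sketch; hostnames, addresses, and ports are made up for illustration):

```
backend git_workers
    balance roundrobin
    # existing classic workers
    server git-01     10.0.1.11:22 check
    server git-02     10.0.1.12:22 check
    # new ARM workers added to the same pool during the transition
    server git-arm-01 10.1.1.11:22 check
    server git-arm-02 10.1.1.12:22 check
```

Draining the classic workers is then just removing (or weighting down) their `server` lines, with no change visible to clients.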
The two-step approach is more logical than practical. We can set up everything in ARM and then repoint services one at a time within a short window.
@sytses it fits our idea of separating ephemeral hosts from persistent hosts: ephemeral hosts can be spot instances that may go away at any time, managed by an orchestrator that makes sure we have enough instances for our load.
We are in fact taking the first steps in that direction; you can read more about it in this issue.
Sadly, Azure does not offer spot instances as far as I know, which means we can't use this right away, but we are working on making the fleet much more flexible.
Current status: all load balancers, web, and git nodes are in ARM, and all traffic is hitting the ARM load balancers. Only the API nodes are left to move, and they are being bootstrapped right now along with the new Sidekiq nodes.
The Sidekiq nodes are up and processing data off of the stack.
I've got a calendar event set to restore the database settings to their original values Saturday night, when traffic is lowest and nobody minds a restart.