Opened a ticket with Azure (#116082514590681) as they are currently limiting us to 20 public IP addresses in the ARM space. This is to get additional public static IP addresses for use by proxy nodes, workers, and the eventual database and Redis nodes.
Thank you for contacting Microsoft Support. My name is Jennifer Phuong. I understand that you would like to increase the static IP limit for subscription c802e1f4-573f-4049-8645-4f735e6411b3.
I have engaged our Subscription Management team who will review the request and assist you further. Thanks for your patience while the team reviews and reaches out.
As requested, we have increased the static public IP limit to 60 for the Microsoft Azure Sponsorship subscription id: c802e1f4-573f-4049-8645-4f735e6411b3 in the EastUS2 region.
They'll be in their own Azure network segment: inside of Azure, but not in the same address space as the Prometheus server. This shouldn't cause an issue though.
This morning we found out that ARM-to-classic traffic accounts for a big portion of our monthly Azure bill (see https://gitlab.com/gitlab-com/infrastructure/issues/1063#note_23345878). This is mainly because workers are in classic while storage is in ARM; since Azure considers the two as separate DCs, we pay as if it were external traffic.
To overcome this we decided to move the git workers to ARM so that they'll live on the same network. But we should also move the load balancers; otherwise we would just be shifting one piece, and external traffic would remain.
Now for the plan. If we move the load balancers first, we'll have a situation where traffic goes from ARM (lb) to classic (worker) and back to ARM (storage). Not ideal.
My plan, then, is to make the git workers jump over the fence so they'll live next to the storage nodes, while keeping them in the HAProxy pool in classic, and then move the load balancers at a later stage. While I'm at it, I can see if Terraform can manage and provision these instances for us.
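If we do go the Terraform route, an ARM git worker could be described roughly like this (a minimal sketch using the `azurerm` provider; the resource group, names, image, and VM size are hypothetical placeholders, not our actual config):

```hcl
# Hypothetical sketch of one Terraform-managed git worker in ARM.
# All names, sizes, and the resource group below are placeholders.
resource "azurerm_virtual_machine" "git_worker" {
  name                  = "git-worker-01"
  location              = "East US 2"
  resource_group_name   = "gitlab-production"
  network_interface_ids = ["${azurerm_network_interface.git_worker.id}"]
  vm_size               = "Standard_DS3_v2"

  storage_image_reference {
    publisher = "Canonical"
    offer     = "UbuntuServer"
    sku       = "16.04-LTS"
    version   = "latest"
  }

  storage_os_disk {
    name              = "git-worker-01-os"
    caching           = "ReadWrite"
    create_option     = "FromImage"
    managed_disk_type = "Standard_LRS"
  }

  os_profile {
    computer_name  = "git-worker-01"
    admin_username = "gitlab"
  }

  os_profile_linux_config {
    disable_password_authentication = true
    ssh_keys {
      path     = "/home/gitlab/.ssh/authorized_keys"
      key_data = "${file("~/.ssh/id_rsa.pub")}"
    }
  }
}
```

The upside of doing this now is that scaling the worker fleet later becomes a count change plus `terraform apply`, instead of hand-provisioning machines.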
Ok - note that when we say "move" what we mean is "create new machines". I like the idea of moving the git workers to ARM first, though I don't think we should drag this out in a multi-step dance. The work between spinning up git workers and spinning up the required number of machines to offset the worker fleet in ARM is minimal, so let's just bite the bullet and do it.
As the person who has built the majority of the ARM infrastructure, I've been able to keep a naming principle and scope guide. I'll document that over in https://gitlab.com/gitlab-com/infrastructure/ and would ask that we stick to it for consistency and to avoid confusion and collisions.
Yes, that's what I meant by jumping the fence: new workers/lbs instances have to be created.
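Concretely, while both generations coexist, the classic HAProxy pool can simply list both sets of workers in the same backend (a sketch; hostnames, addresses, and ports are made up for illustration):

```
backend git_workers
    balance roundrobin
    # existing classic workers
    server git-01     10.0.1.11:22 check
    server git-02     10.0.1.12:22 check
    # new ARM workers added to the same pool during the transition
    server git-arm-01 10.1.1.11:22 check
    server git-arm-02 10.1.1.12:22 check
```

Draining the classic workers is then just removing (or weighting down) their `server` lines, with no change visible to clients.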
The two-step approach is more logical than practical. We can set up everything in ARM and then repoint services one at a time within a short window.
@sytses it fits our idea of separating ephemeral hosts from persistent hosts: ephemeral hosts can be spot instances that may go away at any time, managed by an orchestrator that makes sure we have enough instances for our load.
We are in fact taking the first steps in that direction; you can read more about it in this issue.
Sadly, Azure does not offer spot instances as far as I know, which means we can't use this right away, but we are working on making the fleet much more flexible.
Current status: all load balancers, web, and git nodes are in ARM, and all traffic is hitting the ARM load balancers. Only the API nodes are left to move, and they are being bootstrapped right now along with the new Sidekiq nodes.
The Sidekiq nodes are up and processing data off of the stack.
I've got a calendar event set to restore the database settings to their original values Saturday night, when traffic is lowest and nobody minds a restart.