Fix capacity calculation

Arran Walker requested to merge ajwalker/taskscaler-fixes into main

This fixes a problem where we'd hit the "capacity potential below zero" panic (which is designed to panic because it should be an impossible situation).

However, it turns out there were 3 routes to get there:

  • Acquiring capacity and removing the matching reservation were not atomic, so the two could briefly fall out of sync and occasionally produce a miscalculation
  • Out-of-band/unexpected instances could push the instance capacity past max instances
  • Reservations were not applied in the right order: they should chip away at immediate availability first, and only the remainder should chip away at potential capacity

In addition, we fix a case where unavailable capacity wasn't being reported correctly when an instance was being removed.

This situation is quite hard to write a test for, but was easy to detect in a real stress test. We can probably do the same with a mock designed for stress testing, but that can be done in a follow-up.
