Handle Bot status transitions to other states and update sets used for metrics, fix total_n, raise exceptions on errors

Marios Hadjimichael requested to merge marios/fix-botstatus-metrics into master

Currently, any UpdateBotSession request that reports a status other than OK or UNHEALTHY (e.g. BOT_TERMINATING) causes a KeyError: we use that status as a key into the __bots_by_status dictionary used for metrics, but the dictionary only has entries for the statuses we track.
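
The failure can be reproduced outside the service with a few lines. This is only a sketch: the `BotStatus` enum below is a stand-in whose values are assumed to follow the Remote Workers API, and `bots_by_status` and `record_status` are hypothetical names that imitate the lookup shown in the traceback, not the actual service code.

```python
from enum import Enum


class BotStatus(Enum):
    # Stand-in enum; values assumed to match the Remote Workers API.
    BOT_STATUS_UNSPECIFIED = 0
    OK = 1
    UNHEALTHY = 2
    HOST_REBOOTING = 3
    BOT_TERMINATING = 4
    INITIALIZING = 5


# Only the statuses tracked for metrics have an entry.
bots_by_status = {BotStatus.OK: set(), BotStatus.UNHEALTHY: set()}


def record_status(bot_id, bot_status):
    # Indexing with an untracked status raises KeyError.
    if bot_id not in bots_by_status[bot_status]:
        bots_by_status[bot_status].add(bot_id)


record_status("bot-1", BotStatus.OK)

failure = None
try:
    record_status("bot-1", BotStatus.BOT_TERMINATING)
except KeyError as error:
    failure = error  # the key is the untracked status
```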

This MR makes the following changes:

  • Allow all valid BotSession.status options instead of failing for the few we don't have metrics for; only collect metrics for some interesting statuses
  • Collect metrics about BOT_TERMINATING (in addition to OK and UNHEALTHY)
  • Return the actual total number of bots in the metrics instead of a subset of specific states (and the user can still inquire about specific states individually)
  • Raise exceptions when users try to read metrics that don't exist (instead of just returning 0).
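
The behaviour described in the list above could look roughly like the sketch below. It is an illustration under assumptions, not the code in this MR: `BotMetrics`, `TRACKED_STATUSES`, the method names and the choice of `LookupError` are all hypothetical, and the `BotStatus` enum is a stand-in whose values are assumed to follow the Remote Workers API.

```python
from enum import Enum


class BotStatus(Enum):
    # Stand-in enum; values assumed to match the Remote Workers API.
    BOT_STATUS_UNSPECIFIED = 0
    OK = 1
    UNHEALTHY = 2
    HOST_REBOOTING = 3
    BOT_TERMINATING = 4
    INITIALIZING = 5


# Hypothetical: the statuses we collect per-status metrics for.
TRACKED_STATUSES = (BotStatus.OK, BotStatus.UNHEALTHY, BotStatus.BOT_TERMINATING)


class BotMetrics:
    def __init__(self):
        self._bots_by_status = {status: set() for status in TRACKED_STATUSES}
        self._all_bots = set()

    def update(self, bot_id, bot_status):
        # Accept every valid status; ValueError is raised for invalid ones.
        bot_status = BotStatus(bot_status)
        self._all_bots.add(bot_id)
        # Move the bot into its new tracked set (if any), out of the others.
        for status, bots in self._bots_by_status.items():
            if status is bot_status:
                bots.add(bot_id)
            else:
                bots.discard(bot_id)

    def total(self):
        # Every known bot counts, whatever its status.
        return len(self._all_bots)

    def count(self, bot_status):
        # Fail loudly for statuses without metrics instead of returning 0.
        if bot_status not in self._bots_by_status:
            raise LookupError(f"No metrics collected for {bot_status!r}")
        return len(self._bots_by_status[bot_status])


metrics = BotMetrics()
metrics.update("bot-1", BotStatus.OK)
metrics.update("bot-2", BotStatus.HOST_REBOOTING)   # valid but untracked
metrics.update("bot-1", BotStatus.BOT_TERMINATING)  # moves out of OK
```

Keeping a separate set of all bot IDs is one way to make the total independent of which statuses happen to be tracked.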

Some logs of the main problem:

2019-11-20 21:21:21,123:[                        grpc._server][ERROR][ThreadPoolExecutor-0_150]: Exception calling application: <BotStatus.BOT_TERMINATING: 4>
Traceback (most recent call last):
  File "/buildgrid/grpc/_server.py", line 434, in _call_behavior
    response_or_iterator = behavior(argument, context)
  File "/buildgrid/buildgrid/server/_authentication.py", line 104, in __authorize_wrapper
    return behavior(self, request, context)
  File "/buildgrid/buildgrid/server/bots/service.py", line 151, in UpdateBotSession
    if bot_id not in self.__bots_by_status[bot_status]:
KeyError: <BotStatus.BOT_TERMINATING: 4>

NOTE: Expired and closed bot sessions are still not handled in this MR and need to be addressed separately (relevant issue: #228). For example, expired or closed bot sessions will remain accounted for in the last state we saw them in.

Edited by Marios Hadjimichael
