
Log cumulative per-request rusage ("command stats")

Igor requested to merge command-stats into master

Background

On gitlab.com we run quite a large deployment of gitaly. To balance repos across nodes more effectively, we need to understand their actual resource usage.

Rebalancing is currently based mainly on disk space. It would be good to also balance on other resources, such as CPU utilization and I/O.

To do that, we want to correlate the resource usage of git processes with the gitaly RPCs that spawned them.

There is a wonderful pair of syscalls, getrusage(2)/wait4(2), that provides many useful stats about processes. In fact, gitaly already logs these stats for each executed command -- however, it only does so at debug log level.
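To show where these numbers come from, here is a minimal standalone sketch (assuming Linux and a git repository in the working directory; this is not gitaly's actual code): Go's os/exec package reaps the child via wait4, and ProcessState.SysUsage() exposes the resulting rusage.

```go
package main

import (
	"fmt"
	"os/exec"
	"syscall"
	"time"
)

func main() {
	start := time.Now()
	cmd := exec.Command("git", "rev-parse", "HEAD")
	if err := cmd.Run(); err != nil {
		fmt.Println("command failed:", err)
		return
	}

	// On Unix, SysUsage returns the *syscall.Rusage filled in by wait4.
	ru, ok := cmd.ProcessState.SysUsage().(*syscall.Rusage)
	if !ok {
		return
	}
	fmt.Printf("command.real_time_ms=%d command.user_time_ms=%d command.system_time_ms=%d\n",
		time.Since(start).Milliseconds(), ru.Utime.Nano()/1000000, ru.Stime.Nano()/1000000)
	fmt.Printf("command.minflt=%d command.majflt=%d command.inblock=%d command.oublock=%d\n",
		ru.Minflt, ru.Majflt, ru.Inblock, ru.Oublock)
}
```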

Considering that gitaly logs are already one of our largest indices in the Elasticsearch logging cluster, we need to be careful not to increase the log volume too much.

Design

The solution I would like to propose is to aggregate rusage stats on a per-RPC basis. This way, the rate of log lines stays the same, and only the average size of each log record increases a bit.
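For illustration, here is a minimal sketch of what per-RPC aggregation could look like (not the code in this MR; the RPCStats type and its methods are hypothetical): every finished command adds its rusage to a shared accumulator, and the totals are attached as fields to the single per-RPC log line.

```go
package main

import (
	"fmt"
	"sync"
	"syscall"
)

// RPCStats accumulates rusage across all commands spawned while serving one RPC.
type RPCStats struct {
	mu               sync.Mutex
	userTimeMs       int64
	systemTimeMs     int64
	minflt, majflt   int64
	inblock, oublock int64
}

// Record adds the rusage of one finished command to the running totals.
func (s *RPCStats) Record(ru *syscall.Rusage) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.userTimeMs += ru.Utime.Nano() / 1000000
	s.systemTimeMs += ru.Stime.Nano() / 1000000
	s.minflt += int64(ru.Minflt)
	s.majflt += int64(ru.Majflt)
	s.inblock += int64(ru.Inblock)
	s.oublock += int64(ru.Oublock)
}

// Fields returns the aggregated values, ready to be merged into the
// per-RPC log entry under the same keys as the example output below.
func (s *RPCStats) Fields() map[string]interface{} {
	s.mu.Lock()
	defer s.mu.Unlock()
	return map[string]interface{}{
		"command.user_time_ms":   s.userTimeMs,
		"command.system_time_ms": s.systemTimeMs,
		"command.minflt":         s.minflt,
		"command.majflt":         s.majflt,
		"command.inblock":        s.inblock,
		"command.oublock":        s.oublock,
	}
}

func main() {
	var stats RPCStats
	// Pretend two git commands ran while serving this RPC.
	stats.Record(&syscall.Rusage{Minflt: 400})
	stats.Record(&syscall.Rusage{Minflt: 222})
	fmt.Println(stats.Fields())
}
```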

I believe such an increase can be justified by the value we get out of these additional logs.

Implementation

I've tried to implement this while changing as little existing code as possible.

I did make use of mutable context values, which may be frowned upon.
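A minimal, hypothetical sketch of the mutable-context-value pattern (the key type and helper names below are illustrative, not the ones in this MR): the interceptor puts a pointer into the context once per RPC, command code mutates the struct it points to, and the interceptor reads the totals when logging.

```go
package main

import (
	"context"
	"fmt"
)

type statsKey struct{}

// CommandStats is mutated in place by everything running under one RPC.
type CommandStats struct {
	UserTimeMs   int64
	SystemTimeMs int64
}

// InitStats stores a pointer to a fresh CommandStats in the context. Because
// the value is a pointer, later callers can mutate it without needing to
// derive a new context.
func InitStats(ctx context.Context) context.Context {
	return context.WithValue(ctx, statsKey{}, &CommandStats{})
}

// StatsFromContext returns the per-RPC stats, or nil if none were set up.
func StatsFromContext(ctx context.Context) *CommandStats {
	stats, _ := ctx.Value(statsKey{}).(*CommandStats)
	return stats
}

func main() {
	// Done once per RPC, e.g. in a gRPC interceptor.
	ctx := InitStats(context.Background())

	// Each finished command adds its own rusage to the shared struct.
	if stats := StatsFromContext(ctx); stats != nil {
		stats.UserTimeMs += 1
		stats.SystemTimeMs += 3
	}

	// When the RPC finishes, the interceptor logs the accumulated totals.
	fmt.Printf("command.user_time_ms=%d command.system_time_ms=%d\n",
		StatsFromContext(ctx).UserTimeMs, StatsFromContext(ctx).SystemTimeMs)
}
```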

I'm happy to make adjustments to the implementation if needed.

Example output (from running locally via gdk):

2020-07-08_15:52:01.88192 praefect-gitaly-0     : time="2020-07-08T17:52:01+02:00" level=info msg="finished unary call with code OK" command.inblock=0 command.majflt=0 command.minflt=622 command.oublock=0 command.real_time_ms=8 command.system_time_ms=3 command.user_time_ms=1 correlation_id=aKxKWb5YdJ grpc.code=OK grpc.meta.auth_version=v2 grpc.meta.client_name=gitlab-web grpc.meta.deadline_type=regular grpc.method=FindDefaultBranchName grpc.request.deadline="2020-07-08T17:52:11+02:00" grpc.request.fullMethod=/gitaly.RefService/FindDefaultBranchName grpc.request.glProjectPath=group-1/p-100 grpc.request.glRepository=project-102 grpc.request.repoPath=@hashed/37/83/37834f2f25762f23e1f74a531cbe445db73d6765ebe60878a7dfbecd7d4af6e1.git grpc.request.repoStorage=praefect-internal-0 grpc.request.topLevelGroup=@hashed grpc.service=gitaly.RefService grpc.start_time="2020-07-08T17:52:01+02:00" grpc.time_ms=8.505 peer.address= pid=36671 span.kind=server system=grpc

Highlighting added values:

command.inblock=0 command.majflt=0 command.minflt=622 command.oublock=0 command.real_time_ms=8 command.system_time_ms=3 command.user_time_ms=1

Opportunities

Having this data may open up some new opportunities for us:

  • Rebalancing: We can optimize for more utilization targets than just disk space. This may also be important data if we were to evaluate alternative storage architectures.
  • Incident response: This data should make it much easier to track down the source of CPU and I/O saturation events.
  • Rate-limiting: This data could form the basis for user- or repo-based quotas, to protect against resource hogs.

refs https://gitlab.com/gitlab-com/gl-infra/infrastructure/-/issues/7387

