1. 02 Dec, 2021 1 commit
    • Refactor test gRPC client to construct a channel in the subprocess · bbe7a2ff
      Adam Coldrick authored
      The load tests work by forking when they need to send a gRPC request,
      using gipc to gevent-cooperatively wait for the forked process to
      finish. This works around the poor gevent support in gRPC.
      
      Currently the gRPC channel is created in the parent process and reused
      in the fork, which largely works but isn't entirely safe. Notably, the
      forked process will occasionally segfault, and under some conditions
      even the parent process itself can segfault. This is caused by forking
      after the gRPC core has been initialized.
      
      This commit moves the channel construction into the subprocess, which
      means we never start the gRPC core in the main locust process, so we
      never fork it unsafely.
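A minimal sketch of the idea described in this commit, not the actual test client. It uses the standard library's `multiprocessing` with an explicit fork context in place of gipc, and `make_channel` is a hypothetical stand-in for real gRPC channel construction. The point it illustrates is only that the channel is built inside the child process, so the parent never initializes anything that would be unsafe to fork.

```python
import multiprocessing as mp
import os


def make_channel(target):
    # Hypothetical stand-in for real gRPC channel construction.
    # It records the PID so we can see which process built it.
    return {"target": target, "created_in_pid": os.getpid()}


def _send_request(conn, target, request):
    # Runs in the child: the channel is constructed here, after the fork.
    channel = make_channel(target)
    response = {
        "echo": request,
        "target": channel["target"],
        "channel_pid": channel["created_in_pid"],
    }
    conn.send(response)
    conn.close()


def send_in_subprocess(target, request):
    # The "fork" start method is POSIX-only; the real load tests use gipc
    # so that waiting on the child cooperates with gevent.
    ctx = mp.get_context("fork")
    parent_conn, child_conn = ctx.Pipe(duplex=False)
    proc = ctx.Process(target=_send_request, args=(child_conn, target, request))
    proc.start()
    child_conn.close()  # the parent keeps only the receiving end
    response = parent_conn.recv()
    proc.join()
    return response
```

Calling `send_in_subprocess("localhost:50051", "ping")` returns a response whose `channel_pid` differs from the parent's PID, since only the child ever created a channel.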
  2. 19 Nov, 2021 2 commits
  3. 18 Nov, 2021 1 commit
  4. 17 Nov, 2021 6 commits
    • Add a Dockerfile for the Locust tests · ef78cfd2
      Adam Coldrick authored and committed
    • Add tasks for BatchRead/UpdateBlobs and Bytestream Write · 479c642e
      Adam Coldrick authored and committed
    • Update the gRPC client to send requests in a subprocess · 5e4b547f
      Adam Coldrick authored and committed
      gRPC doesn't cooperatively block in the way needed to work smoothly
      with gevent. Whilst this is fine for fast requests up to a point, it
      renders Locust useless for testing requests that respond more slowly.
      
      gRPC has experimental gevent support which can be enabled, however
      this causes the client's performance to degrade to the point that it
      is useless for measuring performance as we scale up the rate of
      requests.
      
      Sending the gRPC request in a subprocess, using gipc to provide
      gevent-compatible pipe-based IPC and blocking, solves this problem.
      The cost is starting up a process for each gRPC request, which sets
      a hard limit on how many concurrent requests we can make. This is
      higher than we're interested in testing for now, so it's not a huge
      problem.
      
      This approach also seems to reduce the stability of the load tests,
      but at least allows us to get some useful performance measurements
      for a full variety of CAS request types.
    • Reuse previously uploaded blobs in FindMissingBlobs test · 2506690b
      Adam Coldrick authored and committed
      This makes the FindMissingBlobs test tend towards just sending
      FindMissingBlobs requests, rather than being a test of extreme-scale
      ByteStream Write more than a FindMissingBlobs test.
    • cmd_execute.py: Simplify --output-file option of bgd execute command · c09ac377
      Jürg Billeter authored
      The `--output-file` option of `bgd execute command` is not intuitive to
      use as it requires a boolean executable argument in addition to the path
      of the output file. This seems to be unnecessarily complex as the
      `OutputFile` message returned by the server already includes an
      `is_executable` field, which we can use to set the correct file
      permissions.
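As a rough illustration of setting permissions from the message itself, the sketch below writes an output file and adds the executable bits only when `is_executable` is set. The `OutputFile` dataclass is a simplified stand-in for the REAPI message, and `write_output` is a hypothetical helper, not the code in `cmd_execute.py`.

```python
import os
import stat
from dataclasses import dataclass


@dataclass
class OutputFile:
    # Simplified stand-in for the REAPI OutputFile message.
    path: str
    is_executable: bool


def write_output(directory, output_file, data):
    # Hypothetical helper: the path is computed per file, so the chmod
    # below always targets the file that was just written.
    path = os.path.join(directory, output_file.path)
    with open(path, "wb") as f:
        f.write(data)
    if output_file.is_executable:
        mode = os.stat(path).st_mode
        os.chmod(path, mode | stat.S_IXUSR | stat.S_IXGRP | stat.S_IXOTH)
    return path
```

With this shape, the caller only has to supply a destination; no separate boolean argument is needed to decide the file mode.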
    • cmd_execute.py: Fix path in chmod call in run_command() · ce45429b
      Jürg Billeter authored
      If there are multiple output files, the path variable from the first
      output file loop may no longer be correct.
  5. 12 Nov, 2021 1 commit
  6. 10 Nov, 2021 1 commit
  7. 09 Nov, 2021 1 commit
    • Pin aiohttp to <3.8 · 679f75d3
      Adam Coldrick authored
      Some of the bgd browser-backend tests are failing with aiohttp 3.8, so
      pin to an older version for now so that CI isn't blocked.
  8. 26 Oct, 2021 6 commits
    • Add a Locust User which makes Write and FindMissingBlobs requests · 2697bba7
      Adam Coldrick authored and committed
      This initial task simulates uploading some blobs using the ByteStream API,
      and then making a FindMissingBlobs request for those blobs plus some extra
      blobs that will likely not be found.
      
      This lets us get some insight into the performance of ByteStream Write and
      FindMissingBlobs under load.
    • Add a generic gRPC-based Locust User · 2f5bbc1d
      Adam Coldrick authored and committed
      This is intended to be used as a base class to handle setting up gRPC
      clients for use in user classes which actually contain test tasks.
    • Add a custom Locust client for basic gRPC requests · 517e7fda
      Adam Coldrick authored and committed
    • Add requirements.txt for load-testing scripts · 3d1a6fad
      Adam Coldrick authored and committed
    • Turn MonitoringBus.__streaming_worker into a subprocess · ed598dc8
      Adam Coldrick authored and committed
      This method is currently a coroutine which consumes from a Janus queue
      containing all the metrics that we send to the monitoring bus. The
      coroutine is run in the main event loop, and therefore competes for
      processor time with the other coroutines in BuildGrid.
      
      These other coroutines are the log writer, the state metrics generator
      (in BuildGrids with at least one Scheduler configured), and the coroutines
      inside the Janus queues used for metrics and logging.
      
      When BuildGrid is producing metrics rapidly, this contention becomes
      increasingly noticeable, and logs start to exhibit latency.
      
      When BuildGrid produces metrics fast enough, __streaming_worker and the
      coroutines inside the related Janus queue effectively monopolise the
      main event loop, leading to a situation where no logs are emitted by
      BuildGrid.
      
      In this situation, the size of the log message queue grows unbounded.
      Assuming the load producing the metrics doesn't go away fast, the
      queue used by __streaming_worker will **also** grow unbounded. Under
      high load, this leads to rapid growth in the memory footprint of
      BuildGrid, and is unrecoverable until the load stops.
      
      The throughput of the Janus queue is also not great, exacerbating this
      issue by slowing down the rate at which BuildGrid can pull metrics out
      of the queue.
      
      This commit updates __streaming_worker to be run in a subprocess, rather
      than as a coroutine. This avoids the event loop contention, and also
      sidesteps any GIL-related performance issues caused by having an
      extremely busy thread running at the same time as we already have
      too many threads handling gRPC requests.
      
      This fixes the problem of log messages getting backed up and never
      written when experiencing high load and also increases the overall
      throughput of metric messages, avoiding the backlog and subsequent
      memory footprint growth and metrics distortion that previously
      occurred at pretty reasonable levels of load.
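The general shape of moving a queue consumer out of the event loop and into its own process can be sketched as follows. This is an assumption-laden illustration using the standard library's `multiprocessing`, not BuildGrid's `MonitoringBus`: the names `streaming_worker` and `publish_all` are hypothetical, and "publishing" is reduced to echoing the record back on a results queue.

```python
import multiprocessing as mp

_SENTINEL = None  # tells the worker there is nothing more to consume


def streaming_worker(records, published):
    # Runs in its own process, so a busy consumer cannot starve the
    # parent's event loop or compete with it for the GIL.
    while True:
        record = records.get()
        if record is _SENTINEL:
            break
        published.put(f"published:{record}")


def publish_all(items):
    # The "fork" start method is POSIX-only; chosen here for simplicity.
    ctx = mp.get_context("fork")
    records, published = ctx.Queue(), ctx.Queue()
    worker = ctx.Process(target=streaming_worker, args=(records, published))
    worker.start()
    for item in items:
        records.put(item)
    records.put(_SENTINEL)
    # Read every result before joining, so the worker is never blocked
    # flushing its queue while the parent waits on it.
    results = [published.get() for _ in items]
    worker.join()
    return results
```

The parent only enqueues records and returns to its other work; draining and sending happen entirely in the worker process.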
    • Fix the module path in the MetricRecord compiled protos · 3b12166c
      Adam Coldrick authored and committed
  9. 21 Oct, 2021 1 commit
  10. 19 Oct, 2021 2 commits
    • Change type of digest_size_bytes column to bigint · ff846a73
      Rohit Kothur authored and committed
    • Only wait for further ByteStream Write requests if there might be some · 1f0960a4
      Adam Coldrick authored
      Currently our ByteStream Write implementation assumes that there will be
      at least two WriteRequests sent. This means that it's impossible to
      handle small (below the gRPC message size limit) blobs being written
      using ByteStream.
      
      The current implementation tries to iterate over the second and later
      requests, which actually blocks waiting for more input from the client
      if there's only been one request so far.
      
      Obviously, if the client has already sent all the data, there won't be
      any more requests and the connection will stay blocked indefinitely.
      
      This commit checks whether `finish_write` is set for each request, and
      skips iterating over any potential later requests if so. It also cleans
      up a bit of unnecessary code duplication.
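The behaviour described above can be sketched with a plain Python iterator standing in for the gRPC request stream. `WriteRequest` here is a simplified stand-in for the ByteStream message, and `collect_write` is a hypothetical helper rather than BuildGrid's implementation. The key point is that the loop stops as soon as `finish_write` is seen, so it never asks the stream for a request that will not arrive.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass
class WriteRequest:
    # Simplified stand-in for the ByteStream WriteRequest message.
    data: bytes
    finish_write: bool = False


def collect_write(requests: Iterable[WriteRequest]) -> bytes:
    buffer = bytearray()
    for request in requests:
        buffer.extend(request.data)
        if request.finish_write:
            # The client has sent everything; pulling another request
            # from the stream here would block on a real connection.
            break
    return bytes(buffer)
```

A blob small enough to fit in a single request is handled by the same loop as a multi-request upload, with no special case for the first message.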
  11. 30 Sep, 2021 1 commit
    • requirements: Pin jsonschema<4.0.0 · 8c153e7f
      Jürg Billeter authored
      jsonschema 4.0.0 removes the legacy mechanism to specify types to
      validators. This currently breaks BuildGrid:
      
            File "buildgrid/_app/settings/parser.py", line 1652, in get_validator
              return BgdValidator(schema, types=types)
          TypeError: __init__() got an unexpected keyword argument 'types'
  12. 28 Sep, 2021 1 commit
  13. 21 Sep, 2021 1 commit
  14. 20 Sep, 2021 1 commit
  15. 16 Sep, 2021 1 commit
  16. 09 Sep, 2021 1 commit
  17. 06 Sep, 2021 1 commit
    • requirements: Pin alembic<1.7 · 19ee3c11
      Jürg Billeter authored
      With alembic 1.7.1 type-check fails with:
      
          buildgrid/server/persistence/sql/alembic/env.py:26: error: Module has no attribute "config"; maybe "configure"?
  18. 23 Aug, 2021 1 commit
  19. 20 Aug, 2021 1 commit
    • Don't use the logger to log log exceptions · 4d417c13
      Adam Coldrick authored and committed
      Currently the log writing coroutine catches Exceptions and logs them.
      However, it logs them using a Python logger, which leads to the log
      being added to the queue.
      
      This means that in cases where the Exception wasn't a transient error,
      or specific to a given log line, we add a line into the logging
      queue for every line we remove.
      
      The implication of this is that if our logging gets broken for some
      reason, the logging queue grows to an unbounded size because we're
      never reducing the number of items in it. Also, there's no visibility
      of this failure because the log exceptions never make it out of the
      queue.
      
      This commit switches to writing directly to stdout when displaying
      a logging error. This means that even if we aren't logging properly,
      we're still draining the queue and avoiding unbounded growth.
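A small sketch of the failure mode and the fix, using a synchronous `queue.Queue` in place of the coroutine and Janus queue that BuildGrid actually uses. `drain_log_queue` and its parameters are hypothetical names. Errors raised while writing a line are printed directly to a stream instead of going back through the logger, so every iteration removes one item from the queue and adds none.

```python
import queue
import sys


def drain_log_queue(log_queue, write_line, error_stream=sys.stdout):
    # Returns the number of lines written successfully.
    written = 0
    while True:
        try:
            line = log_queue.get_nowait()
        except queue.Empty:
            return written
        try:
            write_line(line)
            written += 1
        except Exception as exc:
            # Report the problem outside the logging pipeline, so a
            # persistent failure cannot refill the queue being drained.
            print(f"Failed to write log line: {exc!r}", file=error_stream)
```

Even when `write_line` fails for every line, the queue ends up empty and each failure is visible on the error stream.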
  20. 19 Aug, 2021 1 commit
  21. 18 Aug, 2021 1 commit
  22. 16 Aug, 2021 1 commit
    • Use the correct name for the RequestMetadata header · 817dab5f
      Adam Coldrick authored
      The spec explicitly defines the RequestMetadata trailing header key as
      `build.bazel.remote.execution.v2.requestmetadata-bin`. However, BuildGrid
      currently looks for just the suffix, `requestmetadata-bin`.
      
      This breaks our RequestMetadata persistence for tools that correctly
      follow the REAPI spec on this.
      
      This commit updates the key used by BuildGrid to be the full name as
      defined in the spec, so that we correctly record the RequestMetadata
      set by compliant tools.
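To illustrate the difference between the two keys, the sketch below looks up the header in a sequence of `(key, value)` pairs, which is the shape in which gRPC exposes invocation metadata to a Python servicer. `find_request_metadata` is a hypothetical helper, not BuildGrid's code, and it returns the raw value without decoding the serialized message.

```python
# Full header name as quoted in the commit message above.
REQUEST_METADATA_HEADER = "build.bazel.remote.execution.v2.requestmetadata-bin"


def find_request_metadata(invocation_metadata):
    # invocation_metadata: iterable of (key, value) pairs.
    for key, value in invocation_metadata:
        if key == REQUEST_METADATA_HEADER:
            return value
    return None
```

With this lookup, metadata sent under the bare `requestmetadata-bin` suffix is ignored, while the fully qualified key is found.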
  23. 10 Aug, 2021 4 commits
  24. 02 Aug, 2021 2 commits