Commit 90768d8f authored by Tristan Maat's avatar Tristan Maat

Add architecture skeleton for a results parser

parent aae1c3ff
Pipeline #66291968 passed with stages
in 54 minutes and 35 seconds
"""Parse a set of measurements."""
from typing import Dict, Tuple
from .result import Result
def parse(tests: Dict) -> Dict[Tuple[str, int, str, str], Result]:
"""Parse results from a bst_measure results file.
A bst_measure results file is structured roughly as follows:
tests ->
versions ->
measurements ->
time/mem usage/logs x repetition
We want to end up with a dictionary that looks like this:
{(, test.timestamp, version.tag, version.sha): Result}
  • I think the repository being used also needs to be specified here and also possibly the benchmarking sha as well. Given the local developer workflow it is possible for results to be generated against SHAs and references that may no longer exist on the main repo, both in the context of buildstream and benchmarking. This wouldn't perhaps fit in with the current structure however if a simple snippet was being used to generate a result, perhaps another snippet for the broader configuration?

    Edited by Lachlan
  • I also wonder whether we need to retain the number of iterations carried out as part of the result (sample set size).

  • I'm not sure we need to take special care of the local workflow. These are just datapoints, understanding what they mean is up to the user. If the tag/sha doesn't exist in a remote we consider "main", why do we care? We have a datapoint for it, and the user can view it in whatever front-end we have for this - if they use local data, they'll know and understand that the tag they're pointing at with the name of their local branch is their local result.

    If we really need to, we can adopt the pattern of git remotes, and specify which remote was used for the specific result (with a URL as an identifier). This is a little stronger than git's link (where the same remote can be recorded multiple times with different names), but it's sufficient for what we need. If we do go down that route the remote should be specified in the result that we're parsing here, so we'll need to refactor a little up the stack.

    The sample set size is an excellent candidate for a data point, but well, this should be considered as part of a broader look at what should be recorded. I don't think it's the time for that just yet - let's get a prototype working first!

Please register or sign in to reply
final = {}
for test in tests:
for result in parse_versions(test):
final[result[0:3]] = result[4]
return final
def parse_versions(test: Dict) -> Tuple[str, int, str, str, Result]:
"""Parse test results for all versions of a test."""
for version in test:
yield, test.timestamp, version.tag, version.sha, Result.from_json(version)
  • Repos here as well also benchmark version.

    Edited by Lachlan
  • There is further information which should perhaps be included if we start to scale the benchmarking infrastructure. Currently each test result file includes host information (system memory, processor info, kernel release). It would be nice to at least encapsulate this and possibly expand it to be able to identify which runner machine was used for any given test (to cover for different hardware issues).

  • Test timestamp is also something that isn't currently captured robustly on a "per test iteration" basis. The start time for running tests on a given benchmark file is captured (together with the end time), but again this would need to be extracted from information outside of a given test/test iteration. For logged tests, timestamps can be extracted - but nothing says that a test will be logged or that the times contained in logging will reflect the totality of the test.

    Edited by Lachlan
Please register or sign in to reply
# FIXME: Remove - for testing purposes only!
# Can be invoked as `python -m tools.log_parser.parse <filename>``
if __name__ == '__main__':
import sys
import json
from pprint import pprint
from pathlib import Path
with Path(sys.argv[1]).resolve().open() as results:
"""Result definitions and helper functions."""
import statistics
from typing import Dict
class Result():
"""A test result.
This reports averaged results and their standard deviations over a
number of test runs.
  • So to be clear, the number of tests results refer to the number of repeats that are carried out given one benchmarking operation on a given benchmark test and configuration.

    Edited by Lachlan
Please register or sign in to reply
def __init__(self,
reported_times: [float],
reported_memory_usages: [int]):
"""Create the result object.
reported_times: A list of reported times
reported_memory_usages: A list of reported memory usages
assert len(reported_times) == len(reported_memory_usages),\
  • Also some consideration of errors/failures need to be covered, its probable that assertion is most likely to be valid most of the time, but will not cover a complete failure in one test run when both the reported_time and the reported_memory_usage would be omitted. As such I don't think assert would be useful in this context.

    Edited by Lachlan
  • I think the class is behaving as expected in those cases. An empty result is a valid result. An external entity should assert whether it's misbehaving if it produces an empty result.

    On the other hand, a result where we don't have memory usages but do have reported times is invalid, and hence this should be asserted here.

    This assertion is not intended to prove correctness of the application, but simply to be a developer aid for quicker debugging. If we want to start asserting that test results are correct, we should do this in an appropriate test suite.

    Edited by Tristan Maat
Please register or sign in to reply
"Inconsistent number of repetitions."
# Rather than computing the times here, we keep our values
# around, and compute the averages on access. This makes
# working with this class a bit more flexible, and makes it
# easier to spot what the values mean.
self._real_times = reported_times
self._memory_usages = reported_memory_usages
def real_time(self):
"""Get the average real time (in seconds).
This time *includes* the time we spent running in the
background because the OS swapped us out for something else.
return statistics.mean(self._real_times)
def real_time_stdev(self):
"""Get the standard deviation of our time measurements."""
# We use pstdev here because we're interested in the stdev of
# the runs we performed (so we know how reliable our results
# are), not the potential spread in execution times in
# general.
return statistics.pstdev(self._real_times)
def memory_usage(self):
"""Get the average memory usage (in kb).
This is the maximum resident size (i.e., at the point at which
we used the most memory, how much memory we were using).
return statistics.mean(self._memory_usages)
def memory_usage_stdev(self):
"""Get the standard deviation of our time measurements."""
# See real_time_stdev
return statistics.pstdev(self._memory_usages)
# TODO: Be more specific about what the dict can contain.
def from_json(cls, json: Dict) -> 'Result':
"""Create a result object from json data."""
# times = [repetition.get("time") for repetition in json]
# mems = [repoetition.get("mem") for repetition in json]
# logs = [repetition.get("log") for repetition in json]
# return cls(times, mems, Tree(logs))
# Tree(logs) can also be done as an @property and in this
# class (maybe nicer), though I'd recommend caching the result
# once it's been computed
  • The buildstream reference and sha are also available, but its open for debate as to whether they form part of the result or simply a verification aspect of the result.

    Edited by Lachlan
  • I think we also need to cover the success/failure case - but absolute definition might be difficult. In the first case we should just consider whether the assertion(s)) (although I would like to replace that/those) are satisfied and use that with some provision for the error logs to be captured.

    Edited by Lachlan
  • I believe that we should assert successful execution of the test before we start extracting data. Doing this here removes the nice abstraction we've built - we define what a "result" is here, not all the possible things that could happen to a test run.

Markdown is supported
0% or
You are about to add 0 people to the discussion. Proceed with caution.
Finish editing this message first!
Please register or to comment