Improve dataset schema and use metadata

Problem to solve

Right now in our evaluation dataset, all input fields are stored under a single key, inputs. This makes it hard to select a specific subset of examples to run an evaluation on.

The current workaround is to create dataset splits, but this requires a lot of manual work and is impractical for any reasonably large dataset.

Proposal

LS dataset filtering is only supported on metadata, so we can redesign the schema to take advantage of dataset metadata.

For example, our RCA dataset has the following structure:

{
  "Inputs": {
    "Failure Message": ...,
    "Project Full Path": ...,
    "Trace": ...,
    "Web Url": ...,
    "Job Id": ...,
    "Project Id": ...
  }
}

The /troubleshoot endpoint only needs Job Id as its actual input, so we should change the dataset schema to:

{
  "Inputs": {
    "Job Id": ...
  },
  "Failure Message": ...,
  "Project Full Path": ...,
  "Trace": ...,
  "Web Url": ...,
  "Project Id": ...
}
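
As a rough illustration of the reshaping above, here is a minimal sketch that assumes each dataset record is available as a plain Python dict in the current layout. The helper name `reshape_record`, the constant `ENDPOINT_INPUT_KEYS`, and the sample values are hypothetical and only for illustration; they are not part of any existing tooling.

```python
from typing import Any

# Keys the /troubleshoot endpoint actually consumes, per this proposal.
ENDPOINT_INPUT_KEYS = ("Job Id",)


def reshape_record(
    record: dict[str, Any],
    input_keys: tuple[str, ...] = ENDPOINT_INPUT_KEYS,
) -> dict[str, Any]:
    """Keep only endpoint inputs under "Inputs"; lift every other field out of it."""
    old_inputs = record["Inputs"]
    new_inputs = {key: old_inputs[key] for key in input_keys}
    lifted = {key: value for key, value in old_inputs.items() if key not in input_keys}
    # The original record is left untouched; a new dict is returned.
    return {"Inputs": new_inputs, **lifted}


# Hypothetical sample record in the current layout (values are made up).
old_record = {
    "Inputs": {
        "Failure Message": "example failure",
        "Project Full Path": "group/project-a",
        "Job Id": 1,
        "Project Id": 10,
    }
}
new_record = reshape_record(old_record)
```

A one-off script along these lines could be run over the whole dataset, so the change would not need to be applied by hand.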

The new schema will enable users to quickly filter a subset of failed jobs for a specific project or with a particular failure message.
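
To make the kind of subset selection concrete, here is a minimal local sketch, assuming records already follow the proposed layout and are held in memory as plain Python dicts. It does not use the LS filtering API; the helper `filter_records` and the sample records are hypothetical.

```python
from typing import Any


def filter_records(
    records: list[dict[str, Any]],
    criteria: dict[str, Any],
) -> list[dict[str, Any]]:
    """Return the records whose fields exactly match every key/value in criteria."""
    return [
        record
        for record in records
        if all(key in record and record[key] == value for key, value in criteria.items())
    ]


# Hypothetical records in the proposed layout (values are made up).
records = [
    {"Inputs": {"Job Id": 1}, "Project Full Path": "group/project-a", "Failure Message": "timeout"},
    {"Inputs": {"Job Id": 2}, "Project Full Path": "group/project-b", "Failure Message": "timeout"},
    {"Inputs": {"Job Id": 3}, "Project Full Path": "group/project-a", "Failure Message": "missing file"},
]

# All failed jobs for one project.
project_a_jobs = filter_records(records, {"Project Full Path": "group/project-a"})

# All failed jobs with one failure message, narrowed to the same project.
project_a_timeouts = filter_records(
    records,
    {"Project Full Path": "group/project-a", "Failure Message": "timeout"},
)
```

With the current layout, the same selection would require reaching into the nested inputs of every example, which is what dataset splits were working around.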

Further details

Links / references

Edited Jul 29, 2025 by Hongtao Yang