Improve dataset schema and use metadata
Problem to solve
Right now, all input fields in our evaluation dataset are stored under a single key, inputs. This makes it hard to select a specific subset of examples to run an evaluation on.
The current workaround is to create dataset splits, but that requires too much manual work and is impractical for any reasonably large dataset.
Proposal
We can redesign the schema to take advantage of dataset metadata, since LS dataset filtering is only supported on metadata.
For example, our RCA dataset has the following structure:
{
"Inputs": {
"Failure Message": ...,
"Project Full Path": ...,
"Trace": ...,
"Web Url": ...,
"Job Id": ...,
"Project Id": ...
}
}
The /troubleshoot endpoint only needs the Job ID as input, so we should change the dataset schema to:
{
"Inputs": {
"Job Id": ...
},
"Failure Message": ...,
"Project Full Path": ...,
"Trace": ...,
"Web Url": ...,
"Project Id": ...
}
This will enable users to quickly filter for the subset of failed jobs that belong to a specific project or that share a particular failure message.
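As a rough illustration of the restructuring proposed above, the sketch below moves every field except Job Id out of Inputs and then filters on the lifted fields. It is plain Python over dictionaries and does not call any LS API. The helper names restructure and filter_examples, the INPUT_KEYS constant, and the sample values are all hypothetical and only there for illustration.

```python
from typing import Any

# Assumption: "Job Id" is the only field the /troubleshoot endpoint needs.
INPUT_KEYS = {"Job Id"}


def restructure(example: dict[str, Any]) -> dict[str, Any]:
    """Keep only endpoint inputs under "Inputs"; lift all other fields out."""
    old_inputs = example["Inputs"]
    new_inputs = {k: v for k, v in old_inputs.items() if k in INPUT_KEYS}
    lifted = {k: v for k, v in old_inputs.items() if k not in INPUT_KEYS}
    return {"Inputs": new_inputs, **lifted}


def filter_examples(
    examples: list[dict[str, Any]], criteria: dict[str, Any]
) -> list[dict[str, Any]]:
    """Return examples whose lifted fields match every key/value in criteria."""
    return [
        ex for ex in examples
        if all(ex.get(key) == value for key, value in criteria.items())
    ]


# Made-up sample record in the current schema.
old_example = {
    "Inputs": {
        "Failure Message": "timeout",
        "Project Full Path": "group/project",
        "Trace": "sample trace",
        "Web Url": "https://example.com/job/1",
        "Job Id": 1,
        "Project Id": 7,
    }
}

new_example = restructure(old_example)
subset = filter_examples([new_example], {"Project Id": 7})
```

With the fields lifted out of Inputs, selecting a subset becomes a simple key/value match rather than a manually maintained split.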
Further details
Links / references
Edited by Hongtao Yang