Tracing for dataset reconstitution
Background
Core has a role to play in guiding users towards an effective way of transferring and reconstituting collections of data for agents.
Single-file uploads and downloads are handled by the `core upload` and `core download` tools.
The "most common" interface people are using these days to describe datasets for deep learning is a csv that look something like:
```
filepath1,label1,label2,...
filepath2,label1,label2,...
```
This case is handled by `core upload --csv` and `core collect <tag>`.
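Purely for reference, here is a minimal sketch of how a row in that single-file CSV decomposes into a file path plus its labels. The CSV shape is taken from the example above; the parsing code is illustrative only and is not how `core upload --csv` is actually implemented.

```python
import csv

# Illustrative only -- not the actual `core upload --csv` implementation.
def read_single_file_rows(csv_path):
    """Yield (filepath, labels) pairs from the single-file dataset CSV shown above."""
    with open(csv_path, newline="") as f:
        for row in csv.reader(f):
            yield row[0], row[1:]  # first column is the file, the rest are labels
```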
Problem
The most general (and also most specific) case that I think `core` can help with (and which would also solve #228 and #226) would be tooling, both on the command line and within the agents, for the case where your input CSV looks like:
```
filepath1_1,filepath1_2,...,filepath1_N,label1,label2,...
filepath2_1,filepath2_2,...,filepath2_N,label1,label2,...
```
i.e., there are N columns specifying files to be uploaded, plus labels.
The dataset is assigned an ID and each row is assigned an ID. The first file is posted as the "main" file. Should each remaining file be posted as a reply to that main file (i.e., main<-1, main<-2, main<-3), or consecutively, each as a reply to the previous one (i.e., main<-1<-2<-3)? I don't think the choice particularly matters in terms of retrievability.
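For concreteness, here is a minimal sketch of the two reply layouts, written against a hypothetical client API: `post`, `reply`, the tag names, and the returned message IDs are all placeholders, not the real `core` interface.

```python
# Hypothetical client API -- the real `core` interface may look nothing like this.
def post_row(client, dataset_id, row_id, filepaths, chain=False):
    """Post one row's files: the first as the "main" file, the rest as replies."""
    main_id = client.post(filepaths[0], tags={"dataset": dataset_id, "row": row_id})
    parent_id = main_id
    for path in filepaths[1:]:
        reply_id = client.reply(parent_id, path)
        if chain:
            # main <- 1 <- 2 <- 3: each file replies to the previous one
            parent_id = reply_id
        # otherwise main <- 1, main <- 2, main <- 3: every reply points at main
    return main_id
```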
Labels would be handled the same way they currently are in `core upload --csv`.
Reconstituting a row would involve listening for the "key" dataset tag, grabbing the row ID from the incoming message, and then tracing for all of the replied data. We would not handle the case of wanting only some pieces of the row; it's all or nothing.
Similarly, reconstituting a dataset would involve doing the same thing for all of the rows.
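A minimal sketch of that reconstitution flow, again against a hypothetical client API: `listen`, `trace_replies`, and the message fields are placeholders, not the real interface.

```python
# Hypothetical client API -- names and message fields are placeholders.
def reconstitute_row(client, dataset_tag, row_id):
    """Fetch every file posted for one row: the main file plus all replies."""
    main_msg = client.listen(tag=dataset_tag, row=row_id)  # "key" dataset tag + row ID
    replies = client.trace_replies(main_msg.id)             # all or nothing: grab everything
    return [main_msg.file] + [m.file for m in replies]

def reconstitute_dataset(client, dataset_tag, row_ids):
    """Reconstituting a dataset is just the row case applied to every row."""
    return {row_id: reconstitute_row(client, dataset_tag, row_id) for row_id in row_ids}
```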
Further identification of each field (for example, using column names) could benefit usability, but it adds complexity that I don't think we can generalize very well. Everything we build on the upload side would also need to be maintained in both the CLI and the agent API.
The argument to not take this on at all
- I think that experiments/datasets that need bundled data should manually bundle it into a single file (like a .tar.gz; see the sketch after this list). If such bundles need to be broken up, there should be an agent splitting and reposting them. This pushes structure out of our default libraries and makes agents more independent.
- We've already handled the single-file dataset case at the `core` level. If we try to address the N-file case, I'd argue we're pushing into the "application" layer, which runs counter to what we're trying to restrict ourselves to.
- This opens us up to needing to specify more clearly what a valid dataset CSV is (which I really have no desire to do). E.g., do we need a header row? How should the header be handled? What do we do if no header is present? Essentially it's a lot of ambiguity for not much gain, and whatever solution we land on will likely not cover all of the intended/desired use cases.
- If we develop the solution, we are responsible for maintaining it. Core is targeted at the communication layer, not the application layer, and thus I'm very reluctant to add too much structure and complexity at this layer. Users should build the tools they need.
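As a concrete version of the "manually bundle" alternative from the first bullet, a row's N files can be packed into a single archive and pushed through the existing single-file path. A minimal sketch using Python's standard csv and tarfile modules; the CSV layout matches the N-file example above, and everything else is illustrative.

```python
import csv
import tarfile

def bundle_rows(csv_path, n_files, out_prefix="row"):
    """Pack each row's N files into one .tar.gz so it can go through the
    existing single-file upload path; labels stay with the caller."""
    with open(csv_path, newline="") as f:
        for i, row in enumerate(csv.reader(f)):
            filepaths, labels = row[:n_files], row[n_files:]
            archive = f"{out_prefix}{i}.tar.gz"
            with tarfile.open(archive, "w:gz") as tar:
                for path in filepaths:
                    tar.add(path)
            yield archive, labels
```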