Add "full dataset" as the third argument to all scoring functions
Currently, all our scoring functions (metrics) take two arguments: the predicted values and the true values of the test part of the split (each function receives a whole dataframe and selects the columns it needs).
There is an issue with this in the multi-label setting. With non-stratified cross-validation, some combination of folds may not have all labels present in the test data. When we score such a fold, fewer labels are available than in other folds. For some metrics the number of labels influences the score (e.g., the score is normalized by the number of labels), so the same predictions can score differently depending on the fold.
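A toy illustration of the problem (the metric and all names here are made up, not our actual scorers): a score normalized by the number of labels changes depending on whether the label list is derived from the fold or from the full dataset.

```python
def per_label_error(y_true, y_pred, labels):
    """Count of mismatches, normalized by the number of labels considered."""
    errors = sum(1 for t, p in zip(y_true, y_pred) if t != p)
    return errors / len(labels)

y_true = ["a", "b", "a"]   # this fold's test split happens to contain only a, b
y_pred = ["a", "a", "a"]

fold_labels = sorted(set(y_true) | set(y_pred))  # labels visible in this fold
all_labels = ["a", "b", "c"]                     # labels in the full dataset

print(per_label_error(y_true, y_pred, fold_labels))  # 1 / 2 = 0.5
print(per_label_error(y_true, y_pred, all_labels))   # 1 / 3 ≈ 0.333
```

The same predictions yield different scores purely because label "c" never appeared in this fold's test data.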
A solution could be to add another input, the full dataset, to all scoring functions; a scoring function would use it only to obtain the list of all labels. This would require a change to the standard scoring pipeline, but I do not think people are using custom scoring pipelines.
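A minimal sketch of the proposed signature (all names here are assumptions, not our actual API): the scoring function takes the full dataset as a third argument and uses it only to recover the complete label set, so normalization is consistent across folds.

```python
def accuracy_per_label(predicted, true, full_dataset):
    """Toy multi-label scorer; `full_dataset` supplies the complete label list."""
    # Derive the label list from the full dataset, not from this fold alone.
    all_labels = sorted({lab for row in full_dataset for lab in row["labels"]})
    correct = sum(1 for p, t in zip(predicted, true) if set(p) == set(t))
    # Normalize by the full label count, which is now fold-independent.
    return correct / (len(predicted) * len(all_labels))

full_dataset = [{"labels": ["a"]}, {"labels": ["b", "c"]}, {"labels": ["a", "c"]}]
true = [["a"], ["b", "c"]]   # test fold happens to miss label "c" combinations
predicted = [["a"], ["b"]]

print(accuracy_per_label(predicted, true, full_dataset))  # 1 / (2 * 3) ≈ 0.167
```

Only the label extraction touches `full_dataset`; everything else still operates on the test split, so the change to existing scorers stays small.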