Addition of a multi-class multi-task performance evaluation metric

I would like to extend metrics.py to support evaluation of multi-class, multi-task classification problems. I would most likely just edit the F1MacroMetric class. My current approach would be to evaluate each task separately and then average the per-task F1 scores to get the final score. I can keep looking for a more elegant solution if people feel this is not the most appropriate approach.
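
As a rough illustration of the averaging approach, a minimal sketch might look like the following. It is plain Python and does not use the actual F1MacroMetric interface; the function names `macro_f1` and `multitask_macro_f1`, and the input layout (one sequence of labels per task), are assumptions made for illustration only.

```python
from typing import Hashable, Sequence


def macro_f1(y_true: Sequence[Hashable], y_pred: Sequence[Hashable]) -> float:
    """Macro F1 for a single task: unweighted mean of the per-class F1 scores."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    # Score every class that occurs in either the targets or the predictions.
    labels = set(y_true) | set(y_pred)
    if not labels:
        raise ValueError("cannot compute F1 on empty inputs")
    pairs = list(zip(y_true, y_pred))
    scores = []
    for label in labels:
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        # The denominator is never zero: each label occurs at least once
        # in y_true or y_pred, so tp + fp + fn >= 1.
        scores.append(2 * tp / (2 * tp + fp + fn))
    return sum(scores) / len(scores)


def multitask_macro_f1(
    y_true_tasks: Sequence[Sequence[Hashable]],
    y_pred_tasks: Sequence[Sequence[Hashable]],
) -> float:
    """Evaluate each task separately, then average the per-task macro F1."""
    if len(y_true_tasks) != len(y_pred_tasks) or not y_true_tasks:
        raise ValueError("need the same, non-zero number of tasks in both inputs")
    per_task = [macro_f1(t, p) for t, p in zip(y_true_tasks, y_pred_tasks)]
    # Every task gets equal weight, regardless of its number of classes.
    return sum(per_task) / len(per_task)


# Two tasks with different label sets: one 3-class, one 2-class.
y_true = [[0, 1, 2, 2], ["a", "b", "a", "b"]]
y_pred = [[0, 1, 2, 1], ["a", "b", "a", "b"]]
score = multitask_macro_f1(y_true, y_pred)
```

One consequence of this design is that every task contributes equally to the final score, so a task with few classes counts as much as one with many. Pooling all classes across tasks before averaging would weight the tasks differently.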