This is silly, but probably worth a small discussion.
All our multiclass samples use the following example:
https://github.com/dotnet/machinelearning/blob/fd30559d9f8070d7cb71ba937897a63646d5b0bc/src/Microsoft.ML.SamplesUtils/SamplesDatasetUtils.cs#L500
which has:
// The probabilities of being "AA", "BB", "CC", and "DD".
public float[] Scores;
There is a small problem with that. All our multiclass learners produce a column named Score, not Scores, so we just copy the column from Score to Scores, or reassign it in pigsty.
That doesn't look like a good idea to me; it's more of a workaround.
The question is the following: do we prefer consistency and call the score column Score everywhere, or do we want to be more user friendly and use Scores for multiclass (basically, introduce DefaultColumnNames.MulticlassScore)?
If we want consistency, we had better rename Scores in MulticlassClassificationExample to Score. If we want to be more user friendly, we need to go through a lot of code and replace Score with Scores in all multiclass cases.
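To make the trade-off concrete, here is a rough sketch of the options, not the actual sample code. The class names `PredictionWithScore`, `PredictionWithScores`, and `ScoreColumnSketch` are made up for illustration. The `CopyColumns` call and the `[ColumnName]` attribute are written against the ML.NET 1.x API shape, which is an assumption; the signatures at the commit linked above may differ.

```csharp
using Microsoft.ML;
using Microsoft.ML.Data;

// Consistent naming: the field name matches the "Score" column that
// multiclass trainers produce, so no extra mapping step is needed.
public class PredictionWithScore
{
    public float[] Score;
}

// Plural field name kept in the sample class, but bound to the "Score"
// column through an attribute instead of a column copy.
// (Assumption: the [ColumnName] attribute is available in the version used.)
public class PredictionWithScores
{
    [ColumnName("Score")]
    public float[] Scores;
}

public static class ScoreColumnSketch
{
    // The workaround described above: append a step that copies the
    // trainer's "Score" column into a new "Scores" column.
    public static IEstimator<ITransformer> CopyScoreToScores(MLContext mlContext)
    {
        return mlContext.Transforms.CopyColumns(
            outputColumnName: "Scores",
            inputColumnName: "Score");
    }
}
```

With the first class nothing else changes in the pipeline; the other two keep the plural name at the cost of an attribute or an extra pipeline step.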
I would go with the user friendly solution. There may also be confusion in the multi class case between scores and probabilities. If I am not mistaken, in almost all cases in ML.NET, the multi-class scores are actually probabilities. It could be nice to be able to know if a model produces scores or probabilities.
I would like to hear feedback from @rogancarr and @TomFinley since I value their opinion as well
I like the user-friendly method of keeping it plural (because it's an array) and giving it a clean default. That said, if it's a probability, then we should call it Probabilities and not Scores.
So how do we want to handle this?
If a learner doesn't produce probabilities we call the column Scores, and if it does we call it Probabilities?
In that case we would need two different evaluators to handle it, and I don't think the metrics we produce right now justify two evaluators.
Summoning @TomFinley
I honestly don't really feel like this is too essential, and this may be a case of hypercorrection... there are two things to consider: column names, I feel like there is something to be said for keeping them consistent between the things producing that "type" of item. Let's consider Score. There is some value in having the things produced by the trainer estimators and transformers being consistently named Score. Now, is the pluralization "correct," yeah, kinda. We call them Features, not Feature, and suchlike. So I cannot argue that it is "correct." But is it the "useful" sort of correct, or the "useless" sort of correct? My tendency is to think the useless sort of correct.
To know that a trainer estimator will always produce something called Score... that's useful. That gives the user at least something to go on, especially since in our API as it stands we don't offer the chance to change those names. I think of the people that will be confused by having to detect whether we are producing something called Score or Scores or Probability or Probabilities... and I think of the people that would be confused by the fact that we always consistently produce something called Score, and mysteriously get confused by the lack of plurality. I kind of have more respect for the former than the latter. 😄
There is also the point that even though practically these things are often probabilities for multiclass, this is merely conventional. You could easily imagine a multiclass classifier that just produces something that doesn't sum to 1, and that should be totally fine. Whether it is fine or not is a different question, but I still want this thing named Score.
So I think we should not have work in this area. I think the way things have been for years is actually probably the best way. In either Civ 3 or 4, can't quite remember which, there was a dialog offering a choice, and one of the choices is, "the old ways are best." I think I'd vote for "the old ways are best?" 😄
Closing the issue per the comment above.