Machinelearning: Question about predictor output: Score and PredictedLabel columns

Created on 19 Jun 2018 · 8 Comments · Source: dotnet/machinelearning

The two current tutorials in the docs use different columns to get a predicted value out of the pipeline and into an instance of the user-defined prediction type:

The regression taxi fare tutorial uses the Score column
The binary classification sentiment analysis tutorial uses the PredictedLabel column

How does one know which column to use to populate instances of the prediction type? This is especially unclear given that, in the (binary) classification solution, the Score column is also available (I guess it then contains the probabilities of being in a certain class).

As for the trainer inputs, the rules are more or less clear:

Use the Label column for labels (or specify another column name through the LabelColumn property)
Use the Features column for features (or specify another column name through the FeatureColumn property)

Can the predictor output be set up in a similar way?

Use a column with the same name across all predictors for the predictor output. I guess that might require extending the regression IDataView with a PredictedLabel column that would be a copy of the Score column.
Be able to set the name of the output column. (It seems that PredictedLabelColumnOriginalValueConverter can be used for that; or am I wrong, and that class is intended for use in tandem with the Dictionarizer?)

By the way, even a plain explanation of the Score and PredictedLabel columns here would be appreciated. Then, at least, I could update the docs to make the story clearer.

documentation question up-for-grabs

pkulikov

Most helpful comment

The purpose of each output column in the scored IDataView depends on the learning task. For example, if the task is:

Regression

Label: Original regression value of the example.
Score: Predicted regression value.

Binary Classification

Label: Original Label of the example.
Score: Raw score from the learner (e.g. value before applying sigmoid function to get probability).
Probability: Probability of being in a certain class.
PredictedLabel: Predicted class.
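
As an illustration of how the three binary classification columns relate, here is a minimal Python sketch. It is not ML.NET code: it assumes the probability is obtained by applying a plain logistic sigmoid to the raw score and that the label is decided by a 0.5 threshold. Both are assumptions made for this example; an actual trainer may calibrate scores differently. The function names are hypothetical.

```python
import math


def sigmoid(score: float) -> float:
    """Logistic function, written to avoid overflow for large negative scores."""
    if score >= 0:
        return 1.0 / (1.0 + math.exp(-score))
    e = math.exp(score)
    return e / (1.0 + e)


def binary_outputs(score: float, threshold: float = 0.5) -> dict:
    """Derive Probability and PredictedLabel from a raw Score.

    Assumption for this sketch: a plain sigmoid and a fixed threshold.
    """
    probability = sigmoid(score)
    return {
        "Score": score,
        "Probability": probability,
        "PredictedLabel": probability >= threshold,
    }


result = binary_outputs(2.0)
print(result)
```

Under these assumptions a positive raw score always maps to a probability above 0.5, so thresholding the probability at 0.5 is the same as thresholding the raw score at 0.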

Multi-class Classification

Label: Original Label of the example.
Score: An array whose length equals the number of classes; it contains the probability for each class.
PredictedLabel: Predicted class.
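
The relationship between the multi-class Score array and PredictedLabel can be sketched in Python as follows. This is an illustration only, not ML.NET code: it assumes the predicted class is simply the one with the highest score, and the class names and function name are hypothetical.

```python
def multiclass_outputs(scores, class_names):
    """Pick the class with the highest score.

    Assumption for this sketch: scores[i] is the score of class_names[i].
    """
    if len(scores) != len(class_names):
        raise ValueError("expected one score per class")
    best = max(range(len(scores)), key=lambda i: scores[i])
    return {"Score": list(scores), "PredictedLabel": class_names[best]}


# Hypothetical class names and scores, for illustration only.
prediction = multiclass_outputs([0.1, 0.7, 0.2], ["negative", "neutral", "positive"])
print(prediction["PredictedLabel"])
```

Keeping the scores and the class names in the same order is what makes it possible to report a name next to each score.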

Clustering

Label: Original cluster Id of the example.
Score: An array whose length equals the number of clusters; it contains the squared distance from each cluster centroid.
PredictedLabel: Predicted cluster Id.
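
For clustering, the description above can be sketched in Python like this. It is an illustration, not ML.NET code: it assumes squared Euclidean distance and that the predicted cluster is the one whose centroid is nearest. The zero-based cluster numbering and the function name are choices made for this example only.

```python
def clustering_outputs(point, centroids):
    """Compute the squared distance to every centroid and pick the nearest.

    Assumptions for this sketch: squared Euclidean distance, zero-based ids.
    """
    distances = [
        sum((p - c) ** 2 for p, c in zip(point, centroid))
        for centroid in centroids
    ]
    nearest = min(range(len(distances)), key=lambda i: distances[i])
    return {"Score": distances, "PredictedLabel": nearest}


# Hypothetical 2-D point and centroids, for illustration only.
clustered = clustering_outputs([1.0, 1.0], [[0.0, 0.0], [1.0, 2.0]])
print(clustered)
```

Unlike the classification cases, a smaller value in Score means a better match here, because the values are distances rather than probabilities.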

zeahmed on 22 Jun 2018

👍14

All 8 comments

@zeahmed thank you very much for the answer. I'll keep that information in mind while updating the docs.

Are there any plans to unify the meaning of the columns across the various learning tasks, or to introduce any other changes in this area of ML.NET? Or is that part more or less stable?

pkulikov on 22 Jun 2018

Thank you @zeahmed for the detailed answer.

arafattehsin on 14 Jul 2018

@zeahmed I've been trying to use the _Probability_ column with _EnsembleBinaryClassifier_, but it fails with 'Column 'Probability' not found in the data view'. Is it a bug or by design?

Lanayx on 22 Jul 2018

DRI RESPONSE: I can't find this information in our docs, and it definitely should be there. Therefore I'm marking this as documentation and up-for-grabs.

@Lanayx I'd suggest creating a separate issue and providing us with additional information (a code snippet at least).

Ivanidzo4ka on 19 Oct 2018

@JRAlexander @Ivanidzo4ka is that something that should be documented at https://docs.microsoft.com/en-us/dotnet/machine-learning/ ?

pkulikov on 19 Oct 2018

👍1

Is there a way to get the class label names along with the probabilities that we get from multi-class classification? TryGetScoreLabelNames is part of the legacy code in 0.8.

dilmanous on 29 Dec 2018

👍1

I believe this issue has already been addressed as part of the 1.0 API reference documentation. All trainers now have a sub-section in their remarks called "Input and output columns", where the types and definitions of the input and output columns are clearly explained. E.g.: https://docs.microsoft.com/en-us/dotnet/api/microsoft.ml.trainers.averagedperceptrontrainer?view=ml-dotnet#input-and-output-columns