Machinelearning: Question about predictor output: Score and PredictedLabel columns

Created on 19 Jun 2018 · 8 Comments · Source: dotnet/machinelearning

The two current tutorials in the docs use different columns to get a predicted value out of the pipeline and into an instance of the user-defined prediction type.

How does one know which column to use to populate instances of the prediction type? This is especially unclear because, in the (binary) classification case, the Score column is also available (I guess it then contains the probabilities of being in a certain class).

As for the trainer inputs, the rules are more or less clear:

  • Use the Label column for labels (or specify another column name through the LabelColumn property)
  • Use the Features column for features (or specify another column name through the FeatureColumn property)

Can the predictor output be set up in a similar way:

  • Use a column with the same name across all predictors for the predictor output. I guess that might require extending the regression IDataView with a PredictedLabel column that would be a copy of the Score column.
  • Be able to set the name of the output column. (It seems the PredictedLabelColumnOriginalValueConverter can be used for that; or am I wrong, and that class is intended for use in tandem with the Dictionarizer?)

By the way, even a plain explanation of the Score and PredictedLabel columns here would be appreciated. Then, at least, I could update the docs to make the story clearer.

documentation question up-for-grabs

All 8 comments

The meaning of the output columns in a scored IDataView depends on the learning task. For example:

Regression

  • Label: Original regression value of the example.
  • Score: Predicted regression value.

Binary Classification

  • Label: Original label of the example.
  • Score: Raw score from the learner (e.g. the value before applying the sigmoid function to get a probability).
  • Probability: Probability of the example being in a certain class.
  • PredictedLabel: Predicted class.

Multi-class Classification

  • Label: Original label of the example.
  • Score: An array whose length equals the number of classes; it contains the probability for each class.
  • PredictedLabel: Predicted class.

Clustering

  • Label: Original cluster ID of the example.
  • Score: An array whose length equals the number of clusters; it contains the squared distance from the example to each cluster centroid.
  • PredictedLabel: Predicted cluster ID.
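
To make the relationship between these columns concrete, here is a minimal, language-neutral sketch in Python. It is not ML.NET code and does not reflect how any particular trainer is implemented. The function names, the 0.5 probability threshold, the use of a plain sigmoid as the calibration step, and the 0-based class and cluster indices are all assumptions made for illustration. For regression there is nothing to derive: Score is simply the predicted value.

```python
import math


def sigmoid(raw_score):
    """Map a raw score to a value in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-raw_score))


def binary_output(raw_score, threshold=0.5):
    """Binary classification: Score -> Probability -> PredictedLabel.

    The sigmoid and the 0.5 threshold are illustrative assumptions.
    """
    probability = sigmoid(raw_score)
    return {
        "Score": raw_score,
        "Probability": probability,
        "PredictedLabel": probability > threshold,
    }


def multiclass_output(class_probabilities):
    """Multi-class: Score holds one value per class; the largest one wins."""
    scores = list(class_probabilities)
    predicted = max(range(len(scores)), key=scores.__getitem__)
    return {"Score": scores, "PredictedLabel": predicted}


def clustering_output(point, centroids):
    """Clustering: Score holds squared distances; the nearest centroid wins."""
    distances = [
        sum((p - c) ** 2 for p, c in zip(point, centroid))
        for centroid in centroids
    ]
    predicted = min(range(len(distances)), key=distances.__getitem__)
    return {"Score": distances, "PredictedLabel": predicted}


if __name__ == "__main__":
    print(binary_output(2.0))
    print(multiclass_output([0.1, 0.7, 0.2]))
    print(clustering_output([0.0, 0.0], [[3.0, 4.0], [1.0, 0.0]]))
```

The point of the sketch is that PredictedLabel is always derived from Score (directly, or through Probability in the binary case), which is why a prediction type can bind to either column depending on whether the caller wants the decision or the underlying numbers.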

@zeahmed thank you very much for the answer. I'll keep that information in mind while updating the docs.

Are there any plans to unify the meaning of the columns across the various learning tasks, or to introduce any other changes in this area of ML.NET? Or is that part more or less stable?

Thank you @zeahmed for the detailed answer.

@zeahmed I've been trying to use the _Probability_ column with _EnsembleBinaryClassifier_, but it fails with 'Column 'Probability' not found in the data view'. Is it a bug or by design?

DRI RESPONSE: I can't find this information in our docs, and it definitely should be there. Therefore, I'm marking this as documentation and up-for-grabs.

@Lanayx I'd suggest creating a separate issue and providing us with additional information (a code snippet at least).

@JRAlexander @Ivanidzo4ka is that something that should be at https://docs.microsoft.com/en-us/dotnet/machine-learning/ ?

Is there a way to get the class label names along with the probabilities that we get from multi-class classification? TryGetScoreLabelNames is part of the legacy code in 0.8.
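
Independently of how the names are retrieved from ML.NET (which this thread does not settle), pairing them with the Score array comes down to aligning two sequences by index. The sketch below is plain Python, not an ML.NET API: the `label_scores` helper and the example label names are hypothetical, and it assumes the label names are already available in the same order as the slots of the Score array.

```python
def label_scores(label_names, scores):
    """Pair each class label with its score, highest score first.

    Assumes label_names[i] describes scores[i]; that ordering is an
    assumption of this sketch, not something the thread confirms.
    """
    if len(label_names) != len(scores):
        raise ValueError("label_names and scores must have the same length")
    return sorted(zip(label_names, scores), key=lambda pair: pair[1], reverse=True)


if __name__ == "__main__":
    # Hypothetical label names and scores, for illustration only.
    for name, score in label_scores(["setosa", "versicolor", "virginica"], [0.1, 0.7, 0.2]):
        print(name, score)
```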

I believe this issue has already been addressed by the 1.0 API reference documentation. All trainers now have a sub-section in their remarks called "Input and output columns", where the types and definitions of the input and output columns are clearly explained. E.g.: https://docs.microsoft.com/en-us/dotnet/api/microsoft.ml.trainers.averagedperceptrontrainer?view=ml-dotnet#input-and-output-columns
