Machinelearning: StratificationColumn in CrossValidation and TrainTestSplit

Created on 13 Feb 2019 · 18 comments · Source: dotnet/machinelearning

CrossValidation and TrainTestSplit have a parameter called StratificationColumn that is used to keep rows sharing the same value in that column together in the same split (as discussed in #2487). This isn't actually stratification, so we should rename the parameter.
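To make the distinction concrete, here is a minimal sketch of a group-preserving split. It is written in plain Python only so that it runs without ML.NET; `group_preserving_split` is a hypothetical helper, not an ML.NET or scikit-learn API, and it assumes the test fraction is applied to the number of distinct groups rather than to the number of rows. True stratification would instead try to keep label proportions similar in each split, which this does not attempt.

```python
import random
from collections import defaultdict


def group_preserving_split(rows, key, test_fraction=0.1, seed=None):
    """Split rows into (train, test) so rows sharing rows[i][key] stay together.

    Hypothetical helper for illustration only. test_fraction is applied to
    the number of distinct groups (an assumption of this sketch).
    """
    # Bucket the rows by the value of the grouping column.
    groups = defaultdict(list)
    for row in rows:
        groups[row[key]].append(row)

    # Sort first so the shuffle is reproducible for a given seed.
    keys = sorted(groups)
    random.Random(seed).shuffle(keys)

    # Always send at least one whole group to the test side.
    n_test = max(1, round(len(keys) * test_fraction))
    test_keys = set(keys[:n_test])

    train = [r for k in keys if k not in test_keys for r in groups[k]]
    test = [r for k in keys if k in test_keys for r in groups[k]]
    return train, test


# Ten rows belonging to five users; no user may appear on both sides.
rows = [{"user": u, "x": i} for i, u in enumerate("aabbbccdde")]
train, test = group_preserving_split(rows, key="user", test_fraction=0.4, seed=1)
train_users = {r["user"] for r in train}
test_users = {r["user"] for r in test}
```

Because whole groups move together, the row-level test fraction only approximates the requested value when group sizes differ.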

This is a forked sub-issue from #2487

Related to #1204

API

All 18 comments

Do we have any idea what the new name should be?

@Ivanidzo4ka good question! In the above, I've made a suggestion for "IdColumn".

Sorry, I guess you mentioned it in the other issue; I don't see it here.
IdColumn feels blank and also doesn't reflect its purpose.
Maybe ConsistencyColumn or RetentionColumn?

How about RowGroupPreservationColumn? GroupPreservationColumn? PreservationColumn?

RowSetPreservationColumn? Super explicit, and doesn't use the word "group".

Row Set Preservation Society. That would be a good name for my second album.
GroupPreservationColumn sounds best to me, but it would be nice to ask other people around.

If I heard something was renamed to IdColumn, I would assume it was the Name column.

Is there another industry term for this? We can't be the first.

Closest I see in scikit-learn is GroupShuffleSplit. Perhaps SplitGroup?

https://scikit-learn.org/stable/modules/cross_validation.html#group-shuffle-split

Another route is to rename the Group column to RankingGroup, which then frees up Stratification to move to Group (which seems to be the industry term).

Speaking of renaming. @Dmitry-A was saying earlier today that Name may be better called RowID

public TrainTestData TrainTestSplit(IDataView data, double testFraction = 0.1, string stratificationColumn = null, uint? seed = null)
public CrossValidationResult<CalibratedBinaryClassificationMetrics>[] CrossValidate(IDataView data, IEstimator<ITransformer> estimator, int numFolds = 5, string labelColumn = DefaultColumnNames.Label, string stratificationColumn = null, uint? seed = null)
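For the cross-validation signature, the behavior under discussion can be sketched by assigning a fold per distinct column value instead of per row. This is an illustrative Python sketch under stated assumptions: `assign_folds` is a hypothetical helper, and the round-robin assignment is just one simple choice, not a description of how ML.NET implements CrossValidate.

```python
import random


def assign_folds(group_values, num_folds=5, seed=None):
    """Return one fold index per row; rows with equal group values share a fold.

    Hypothetical helper for illustration only.
    """
    # Shuffle the distinct group values reproducibly, then deal them
    # round-robin into folds.
    distinct = sorted(set(group_values))
    random.Random(seed).shuffle(distinct)
    fold_of = {g: i % num_folds for i, g in enumerate(distinct)}
    return [fold_of[g] for g in group_values]


# Ten rows in six groups, spread over three folds.
groups = ["q1", "q1", "q2", "q3", "q3", "q3", "q4", "q5", "q5", "q6"]
folds = assign_folds(groups, num_folds=3, seed=0)
```

Each fold then serves as the held-out set once, and no group value ever straddles the training and held-out sides of a fold.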

@justinormont what Name are you talking about?

The column purpose of Name, which allows a user to identify the row of data. It's mainly used for debugging as it's printed to the .inst.txt file. It lets you match the input data row to the output score.

I'm unsure we have brought the concept to ML.NET.

Ah, that Name. Do we even expose it anywhere in ML.NET? It's probably part of some commands, but I don't think we do anything with commands right now, since they are all hidden.

Let's keep this discussion on potential names for StratificationColumn. Any other naming issues, please open a separate issue. (Sorry to be strict, but I need to drive this to conclusion.)

So far we have

IdColumn: Too vague
Group: Group and its relatives feel too rank-y to some folks, but it is industry-standard language.
RowGroupPreservationColumn
GroupPreservationColumn
RowSetPreservationColumn
ConsistencyColumn
RetentionColumn

@TomFinley @shauheen @glebuk @yaeldekel Any thoughts?

I renamed it to GroupPreservationColumn in https://github.com/dotnet/machinelearning/pull/2537

By itself, not an acceptable name, unless you somehow clarified the "group" column to mean something else. @justinormont's suggestion of RankingGroup is not my favorite, since we use this in contexts other than ranking (albeit lower-priority ones that haven't yet been migrated to the open source codebase).

Anyway, sklearn gets away with it there because it's very, very clear in context what "group" it's talking about since you're calling GroupShuffleSplit. If you were to just identify something divorced from that context and just call it a "group," then by itself is it clear what it's talking about? Not at all.

This is the problem: what type of "group" is considered relevant is very context dependent. If you can make a case that "group" is used in other contexts to refer to this specifically, I could potentially change my mind. But as far as I can see, the case depends on a 5-character substring of a Python method name taken completely out of the context that made it clear what type of group you were talking about.

Maybe RowGroup column for what we now call a Group column, and SplitGroup or SplittingGroup column for what we call stratification. If we don't have the stomach to rename the "group" column at this time, which I could understand, maybe just call it SplitColumn. That suggests clearly enough to me that this has something to do with when a dataset is split, and I think we can easily explain it.

I like @TomFinley naming suggestions:

  • Group => RowGroup
  • Stratification => SplitGroup (or SplittingGroup/SplitColumn)