Machinelearning: Creation of components through MLContext: advanced options and other feedback

Created on 1 Dec 2018 · 9Comments · Source: dotnet/machinelearning

In our estimators and other similar components often have advanced settings, because, sometimes people have unusual circumstances. At the same time, there is a 95% or 99% scenario for "simple" usage that most people will be happy with. For this reason we have often made a distinction between common and advanced settings, as we see here.

https://github.com/dotnet/machinelearning/blob/cb37c7e7f1e1b29b5608a2755db793c5435d10b1/src/Microsoft.ML.FastTree/TreeTrainersCatalog.cs#L29-L38

There are some possible things that excited feedback:

Echoing feedback seen in #1690, these things where we're making something should have the prefix Create, even in situations where this a catalog where we are always creating. Note: Create preferred to Make.
The worth of ASP.NET style configuration was questioned (seen above as Action<FastTreeRegressionTrainer.Arguments> advancedSettings), e.g., there may not be much purpose in having a delegate. The older style where it just takes the Arguments period was preferred.
Having this Arguments object as a nested class the component being created was viewed as positive, but this would be more idiomatically called Options -- Arguments was a holdover name from when these were exclusively for command line arguments, but for the API this is not a great name. So while keeping the general structure of how they are placed currently, they should probably be renamed to Options.
It is good to have the convenience for the simple arguments, however, if we have both simple and advanced settings, we should not mix them but have instead two distinct constructors/extension methods. (E.g., in the above, we would have two methods, one that took the advanced options.) To do otherwise is to invite confusion about which "wins" if we have the setting set in both.
- Note that phase setting "set in both," which suggests that these settings object should retain the "simpler" settings in them. This reinforces feedback elsewhere as seen here.
If the simple arguments are totally sufficient, then there is no need to expose this Arguments class in hte public API. (For practical reasons relating to command line and entry-point usage, we still need to always have these Arguments objects, but if they serve no purpose for the API the class can simply be made internal.)

/cc @KrzysztofCwalina, @terrajobst , on whose feedback this list is primarily based, and who can correct me and provide clarification in case I misspoke.

API

Source

TomFinley

👍1

Most helpful comment

2- For the types that have the Column or ColumnInfo structure, to pass arguments, and the members of the column have all the members of the Arguments, internalize the arguments classes.

I feel like I'd be more comfortable with this once I see what it looks like. We have had settings on the "options" class be settings that are applied to multiple column mappings in other contexts. E.g., we set hashbits to 20, then apply that to, say, 13 distinct columns, to create a total of 13*(2^20) features, and whatnot, for categorical features. It's not immediately clear to me that this pattern is worthless and ought to be abandoned in favor of a world where there are no global settings on the Options object in the case where that setting is also available on the ColumnInfo object. (Indeed I am having trouble imagining a situation where there are options in ColumnInfothat are not more globally settable inOptions` beyond the names of the columns being mapped.)

So maybe we can make a trial of that, see if we like it, since I'm not sure we will.
I view Options

I agree with renaming Arguments to Options when it is a public class used by the transformer/estimator constructor.

Yes. In both cases, I should say we should be consistent, since whether we are exposing this Options class to the user or not. Of course, we should probably prioritize renaming in those scenarios where we believe we must make public.

However, I don't know if we should rename ColumnInfo to Options, maybe @TomFinley has an opinion about it?

So, the options are options to the estimator itself. There is a single one of these always. Column infos are similar to options in the sense that they control one of the column mappings of the estimator (that is, in the cases where this is something like a one-to-one or many-to-one column mapping), but this makes them very different, does it not? Since they are plural?

I view these Options as things that are always singular, and I view ColumnInfo as things that are part of or complimentary to Options, but control each of the individual column maps. This suggests having separate names for them is appropriate. Is it not so?

I also for the reasons outlined above a little less gungho about removing Options in case the options are covered by CoolumnInfo, since Options at least to this point does not exist to complement ColumnInfo, but rather to specifically overlap it in ways that ColumnInfo can override if it so chooses. At least this has been the pattern in the past, we can develop new patterns if there is a good reason to do so.

(Wether ColumnInfo is the best name, that is a separate discussion, but it should be different from Options, since naming them the same thing would form the expectation that they behave similarly, and they don't.)

TomFinley on 30 Jan 2019

👍2

All 9 comments

One point that @glebuk brought up I wanted to make explicit, since I did not consider it before, is that by doing this, if we decide down the line to add more options to the "regular" options, we are free to do so. If we had the existing arrangement, then in order to maintain the API, if we made more regular default parameters. E.g., if we have this:

Create(int a, int b, int c, Action<Arguments> advancedSettings = null)

We could not have this, since this could break the signature.

Create(int a, int b, int c, int d, Action<Arguments> advancedSettings = null)

And the only way we see of not breaking the signature would be appending it to the end, which is very awkward and strange looking.

Create(int a, int b, int c, Action<Arguments> advancedSettings = null, int d = 1)

But if we had this:

Create(int a, int b, int c)
Create(Arguments advancedSettings)

Then we're free to do this:

Create(int a, int b, int c, int d = 1)
Create(Arguments advancedSettings)

TomFinley on 2 Jan 2019

👍1

marking as Done, and closing this isssue. All learners have been taken care of.

abgoswam on 29 Jan 2019

For the transform estimators (former transforms) the following needs to happen:

1- Internalize the ctors
2- For the types that have the Column or ColumnInfo structure, to pass arguments, and the members of the column have all the members of the Arguments, internalize the arguments classes.
Document every public member on the Column/ColumnInfo classes.
3- For the types that don't have the Column/ColumnInfo, follow the same pattern as the trainers: rename Arguments to Options and document the members.

In the process, see if there are more extensions to be exposed; especially for cases when more advanced arguments need to be supplied.

@TomFinley @abgoswam @artidoro see if the above summary about dealing with transformers make sense.

sfilipi on 30 Jan 2019

Reopening this issue to take care of the transform estimators changes listed above by @sfilipi

abgoswam on 30 Jan 2019

@sfilipi .

For #2 , For the types that have Column or ColumnInfo structure, should we rename them to Options ? That way we will have Options consistently for both learnerEstimators and transformEstimators.

The other items (i.e. 1 and 2 above) looks good.

abgoswam on 30 Jan 2019

I agree with renaming Arguments to Options when it is a public class used by the transformer/estimator constructor.

However, I don't know if we should rename ColumnInfo to Options, maybe @TomFinley has an opinion about it?

artidoro on 30 Jan 2019

2- For the types that have the Column or ColumnInfo structure, to pass arguments, and the members of the column have all the members of the Arguments, internalize the arguments classes.

So maybe we can make a trial of that, see if we like it, since I'm not sure we will.
I view Options

I agree with renaming Arguments to Options when it is a public class used by the transformer/estimator constructor.

However, I don't know if we should rename ColumnInfo to Options, maybe @TomFinley has an opinion about it?

TomFinley on 30 Jan 2019

👍2

To rephrase the above points made my @TomFinley , @sfilipi and @abgoswam we need to:

Internalize constructors of estimators and transformers
Keep ColumnInfo and move it to to the estimators #1760
Rename Arguments -> Options
Internalize Options only when they are not used by public constructor. For all other cases, retain Options as public #1758

Here is the list of the transforms - estimators that need to be covered:

| Transform | Category | Priority | Owner | PR |
| ----------- | ---------- | -------- | ------ | ---------------- |
| Key To Value Transform | Categorical | 0 | artidoro |#2340 |
| Value to Key Transform | Categorical | 0 | artidoro | #2340|
| Concat Transform | Schema manipulation | 0 | artidoro |#2363 |
| ~~Term Transform~~ | Categorical | 0 | artidoro | |
| Image Greyscale Transform | Image | 0 | Senja | |
| Image Loader Transform | Image | 0 | Senja | |
| Image Pixel Extractor Transform | Image | 0 | Senja | |
| Image Resizer Transform | Image | 0 | Senja | |
| Tensorflow Scoring Transform | Image | 0 | Gani | #2312 |
| Min-Max Normalizer | Normalizer | 0 | artidoro |#2363 |
| ScoreTensorFlowModel | TensorFlow | 0 | artidoro | #2367 |
| TensorFlow | TensorFlow | 0 | artidoro | #2367 |
| ONNX | ONNX | 0 | artidoro | #2367 |
| Min-Max Normalizer | Normalizer | 0 | artidoro |#2363 |
| ~~NA Handle Transform~~ | Missing values | 1 | artidoro | |
| NA Indicator Transform | Missing values | 1 | artidoro |#2363 |
| NA Replace Transform | Missing values | 1 | artidoro |#2363 |
| OneHotEncoding Transform | Categorical | 1 | artidoro | #2340|
| OneHotHashEncoding Transform | Categorical | 1 | artidoro |#2364 |
| Hash Transform | Categorical | 1 | artidoro |#2364 |
| Copy Columns Transform | Schema manipulation | 1 | artidoro |#2364 |
| Key To Vector Transform | Categorical | 1 | artidoro | #2364|
| Text Transform | Text processing | 1 | abgoswam |#2394 |
| Word Embeddings Transform | Text processing | 1 | abgoswam |#2393 |
| Character Tokenizer Transform | Text processing | 2 | abgoswam |#2393 |
| Ngram Hash Transform | Text processing | 2 |abgoswam |#2393 |
| Ngram Transform | Text processing | 2 |abgoswam | #2393|
| Stopwords Remover Transform | Text processing | 2 | abgoswam |#2393 |
| Text Normalizer Transform | Text processing | 2 |abgoswam | #2393|
| Word Bag Transform | Text processing | 2 | abgoswam |#2393 |
| Word Hash Bag Transform | Text processing | 2 |abgoswam |#2393 |
| Word Tokenizer Transform | Text processing | 2 | abgoswam | #2393 |
| Latent Dirichlet Allocation Transform | Text processing | 2 | abgoswam |#2393 |
| Convert Transform | Column mapper | 2 | artidoro | #2365|
| Drop Slots Transform | Feature selection | 2 |artidoro |#2365 |
| LogMeanVar Normalizer | Normalizer | 2 |artidoro |#2363 |
| MeanVar Normalizer | Normalizer | 2 |artidoro |#2363 |
| Tree Ensemble Featurization Transform | Projection | 2 |artidoro |#2365 (not ITransformer) |
| ~~NA Filter~~ | Row manipulation | 2 |artidoro |#2365 |
| Count Feature Selection Transform | Feature selection | 2 | artidoro | #2365|
| Mutual Information Feature Selection Transform | Feature selection | 2 |artidoro |#2365 |
| Feature Contribution Calculation Transform | Feature selection | 2 | artidoro |#2365 |
| Permutation Feature Importance Transform | Feature selection | 2 | artidoro | #2365|
| NA Drop Transform | Missing values | 2 | artidoro |#2365 |
| Binning Normalizer | Normalizer | 2 | artidoro | #2363 |
| Global Contrast Normalization Transform | Normalizer | 2 | artidoro | #2366 |
| Principal Component Analysis Transform | Projection | 2 | artidoro | #2366 |
| Random Fourier Features Transform | Projection | 2 | artidoro | #2366 |
| Bootstrap Sample Transform | Row manipulation | 2 |artidoro | #2366 (not ITransformer )|
| Shuffle Transform | Row manipulation | 2 | artidoro | #2366 (not ITransformer )|
| Custom Stopwords Remover Transform | Text processing | 2 | artidoro | #2366 |
| Learner Feature Selection Transform | Feature selection | 3 | artidoro | #2366 (not ITransformer )|
| Lp-Norm Normalizer | Normalizer | 3 | artidoro | #2366 |
| Supervised Binning Normalizer | Normalizer | 3 | artidoro |#2363 |
| Whitening Transform | Normalizer | 3 | artidoro | #2366 |
| Choose Columns Transform | Schema manipulation | 3 | artidoro | #2367 |
| Drop Columns Transform | Schema manipulation | 3 | artidoro | #2367 |
| Generate Number Transform | Schema manipulation | 3 |artidoro | (not ITransformer ) #2367|
| Keep Columns Transform | Schema manipulation | 3 |artidoro | #2367|
| Key To Binary Vector Transform | Categorical | 3 | artidoro | #2367 |
| Term Lookup Transform | Categorical | 3 | artidoro| #2367|
| Label Indicator Transform | Column mapper | 3 |artidoro |(not ITransformer ) |
| Group Transform | Relational operation | 3 |artidoro |(not ITransformer ) !Public #2367 |
| Un-group Transform | Relational operation | 3 | artidoro |(not ITransformer ) !Public #2367 |
| Range Filter | Row manipulation | 3 |artidoro | (not ITransformer ) #2367 |
| Skip Filter | Row manipulation | 3 |artidoro |(not ITransformer ) #2367 |
| Skip and Take Filter | Row manipulation | 3 |artidoro |(not ITransformer ) #2367 |
| Take Filter | Row manipulation | 3 |artidoro | (not ITransformer ) #2367|

artidoro on 30 Jan 2019

👍1

Forgive me my intrusion into this wonderful conversation, but I have question from user, which I'm don't know how to answer.
So we have this issue https://github.com/dotnet/machinelearning/issues/2165 and for now user can just call trival transformer constructors and avoid estimators and Fit method.

number 1 in Senja list states