Machinelearning: There are two transforms with the Friendly Name "Term Transform"

Created on 23 May 2018 · 11 Comments · Source: dotnet/machinelearning

In the list of entry points there are two identical transforms, one with the name field: "Transforms.TextToKeyConverter" and the other with the name field: "Transforms.Dictionarizer".
Their Friendly Name field is the same: "Term Transform".

This will be confusing for systems interfacing with ml.net through the entry points; ml.net should not present the same entry point choice more than once.
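A minimal sketch of how a consuming system might detect this kind of collision. Only the two entry point names and the shared friendly name come from the report above; the list-of-dicts layout and the helper `find_duplicate_friendly_names` are assumptions for illustration, not ml.net's actual catalog format.

```python
from collections import defaultdict

# Hypothetical catalog excerpt: the "Name"/"FriendlyName" layout is assumed.
entry_points = [
    {"Name": "Transforms.TextToKeyConverter", "FriendlyName": "Term Transform"},
    {"Name": "Transforms.Dictionarizer", "FriendlyName": "Term Transform"},
]


def find_duplicate_friendly_names(entries):
    """Return {friendly name: [entry point names]} for names used more than once."""
    by_friendly = defaultdict(list)
    for entry in entries:
        by_friendly[entry["FriendlyName"]].append(entry["Name"])
    # Keep only the friendly names shared by two or more entry points.
    return {name: eps for name, eps in by_friendly.items() if len(eps) > 1}


duplicates = find_duplicate_friendly_names(entry_points)
print(duplicates)
```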

bug good first issue up-for-grabs

All 11 comments

What is the proposed final name? Are we happy with either "Dictionarizer" or "TextToKey"? The first seems a bit funky (at least to my eye), but the second is not descriptive since obviously this can be applied to more than just text. So I prefer the first but only because being basically meaningless is preferable to being flat-out misleading and wrong.

Yeah, I was about to say, "TextToKeyConverter" is a bit wrong as it can take in about anything and convert to Key, but @TomFinley beat me to it.

I don't like either one, really. Do we want to expose the concept of keys? If so, "ToKey" is rather terse and hopefully descriptive. I'd say "AnythingToKey", but I don't know that every type, for instance the Picture type, can be converted to a Key type (and the future will bring more types).

To capture the primary use case, "LabelConverter" may be suitable. Scikit calls this a "LabelEncoder": http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.LabelEncoder.html
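For readers unfamiliar with the operation being named, here is a rough sketch of a value-to-key mapping in plain Python. It is not ml.net's implementation: the helpers `fit_keys` and `to_key` are hypothetical, keys are assigned in order of first appearance (scikit-learn's LabelEncoder sorts the classes instead), and unseen values map to `None` purely as an illustrative choice.

```python
def fit_keys(values):
    """Assign each distinct value an integer key in order of first appearance."""
    mapping = {}
    for value in values:
        if value not in mapping:
            mapping[value] = len(mapping)
    return mapping


def to_key(values, mapping):
    """Replace each value with its key; values not seen during fit become None."""
    return [mapping.get(value) for value in values]


labels = ["cat", "dog", "cat", "bird"]
mapping = fit_keys(labels)
keys = to_key(labels + ["fish"], mapping)
print(keys)  # [0, 1, 0, 2, None]

# The same mapping works for non-text input, which is why "TextToKey" misleads.
numeric_mapping = fit_keys([3.5, 1, 3.5])
```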

I may be wrong, but TensorFlow seems not to have a specialized transform for this, and uses the categorical featurizers:

categorical_column_with_hash_bucket(...): Represents sparse feature where ids are set by hashing.
categorical_column_with_identity(...): A _CategoricalColumn that returns identity values.
categorical_column_with_vocabulary_file(...): A _CategoricalColumn with a vocabulary file.
categorical_column_with_vocabulary_list(...): A _CategoricalColumn with in-memory vocabulary

mmm. ToKey... I actually like that name. ToKey. Super simple, fairly descriptive in five characters. Of course, hash is also a "to key." But it's still better than the original name "term."
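To illustrate why hashing is also a "to key" operation, here is a small sketch that needs no dictionary at all. The helper `hash_to_key` and the bucket count are assumptions for illustration; `zlib.crc32` is used only because it is deterministic across runs, not because ml.net uses it.

```python
import zlib


def hash_to_key(value, num_buckets):
    """Map any value with a stable repr to a bucket index in [0, num_buckets)."""
    data = repr(value).encode("utf-8")
    return zlib.crc32(data) % num_buckets


# No fitting step is needed, but distinct values may collide in one bucket.
bucket_keys = [hash_to_key(v, 8) for v in ["cat", "dog", "cat", "bird"]]
print(bucket_keys)
```

Unlike a dictionary-based mapping, this cannot be inverted to recover the original value, which is one practical difference between the two kinds of key.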

We sort of have to expose them. They're basically the same things as factors in R, and as far as I can tell you simply can't get around the fact that enumerations into sets is a fairly central concept of ML. Of course whether we actually wind up using "key" in the name, I don't know.

LabelEncoder might be fine, indeed term's first name was "auto-label", but I worry somewhat about what will happen when someone uses the text metatransform, inspects the pipeline, and one of the first things there is them "label-encoding" their feature inputs. :)

@TomFinley, if it's ok I'd like to work on this. Was ToKey decided on as the new name for one of the Term Transforms?

Hi @jwood803 I apologize, I just saw this now!!

I'm not sure it was decided. @justinormont likes it (or at least so I presume from the fact that he suggested it), I like it, but other people that might have opinions on this matter have not weighed in -- the ones I can think of are @Zruty0 , @GalOshri , @eerhardt , @KrzysztofCwalina ...

No worries, @TomFinley. 😄

I can go ahead and mess with it to get a PR out and we can go from there if that sounds like a plan.

Hi @jwood803 , mmmmaaaaybe? I would just hate for this to happen, then everyone shoots it down. Some history: we named it term four years ago, and the name stuck, not because anyone liked the name per se or thought it was so great, but because in those intervening four years every time someone suggested a new name for it, someone hated it even worse. (E.g., "dictionarizer".) My prior at this point is to assume everyone will hate any renaming of term.

Let's try this. I'm going to force discussion on the issue by naming my pigsty extension method in #870 to ToKey (while still the underlying classes have Term in their name). If it passes, then we'll consider that @justinormont's suggestion stands, and then the things named Term and whatnot can be renamed.

Sounds great, @TomFinley! I'll be on the lookout for the discussion. Thanks! 😄

Hi @jwood803 I think everyone agrees FYI (or at least, they let the PR go in), so let's possibly count on there being no particular disagreement on the point.

@jwood803, @TomFinley : Correct, I like ToKey for being terse and hopefully descriptive.

@TomFinley @justinormont Awesome! Thanks for the update. I'll start messing with this. Thanks!
