We should make sure the type names in the IDataView subsystem are the names we want to use forever.
See
If all of these types are column types, should they include column in the name? e.g. TextColumnType?
The idea was broached I know, but resolution never reached: should we rename? Or else perhaps make Schema a nested class of something appropriate, so as to give it definition?
We could rename the class to DataViewSchema, or ViewSchema.
We鈥檇 probably want to rename everything then. Row is a pretty generic name. So is ColumnType.
Microsoft.Data.DataView package are unique by adding a prefix to any name deemed too general.DataView, ex. DataViewSchema, DataViewRow, DataViewType.DV, ex. DVSchema, DVRow, DVType.Db prefix pattern in System.Data: DbDataReader, DbConnection, DbColumn, etc.v: DvSchema to follow the same pattern above.ColumnType class, using whatever name we decide for the base ColumnType class. For example, assuming we rename ColumnType => DataViewType, we would have:NumberDataViewTypeTextDataViewTypeBooleanDataViewTypeNumberType to match the type name in System. namespace.Int16 instead of I2. use Single instead of R4.UG property and corresponding struct RowId, use whatever we rename the Row type, followed by Id. Example, DataViewRowId.Schema.Metadata class to Schema.Annotations, and rename the property on Schema.Column from Metadata to Annotations.Builder classes to be nested under the types they build, following the same pattern as System.Collections.Immutable.We may also want to rename the static properties on NumberType:
NumberType.I2
NumberType.U4
NumberType.R4
should all match the .NET type names:
NumberType.Int16
NumberType.UInt32
NumberType.Single
Especially NumberType.UG should be renamed to NumberType.RowId or similar.
Also related is: https://github.com/dotnet/machinelearning/issues/1843
We should make sure the type names in the IDataView subsystem are the names we want to use forever.
See
If all of these types are column types, should they include column in the name? e.g. TextColumnType?
That's one possibility. Before we go that route though, I'd like to explore whether the name "column type" is meaningful -- it was back when the only things did was describe the types of columns in data views, but that hasn't been the case for a while.
The idea was broached I know, but resolution never reached: should we rename? Or else perhaps make Schema a nested class of something appropriate, so as to give it definition?
We could rename the class to DataViewSchema, or ViewSchema.
We鈥檇 probably want to rename everything then. Row is a pretty generic name. So is ColumnType.
Let's explore that. If we embraced this idea of DataViewSchema, then possibly DataViewType and DataViewRow may also be appropriate. The idea though of something named TextDataViewType though wearies me; that is an obnoxious name.
It is also potentially confusing, since we are using "data view" in a way that is distinct from the actual type: that is, pertaining to data-view-interacting objects, as opposed to data-views themselves. Since, these structures are used in contexts other than data views directly (though ultimately, one way or another, they are all feedable back to or from dataviews.)
Also if I look at other libraries it is not clear to me what we should do. Spark clearly gets by just fine with names that would certainly be considered by the .NET team's standards way too generic. That by itself is not really an argument, but the reality is we haven't really heard a complaint yet about ISchema, IRow or its successor Schema or Row. Just theoretical concerns that it might be a problem.
ADO.NET, if I look for similar structures, solved the problem by prefixing everything with Db. This is a violation of .NET naming guidelines, but if the .NET naming guidelines lead us to instead prefix everything with DataView, I might prefer actually something more like Dv. (This is somewhat reminiscent of old types, like DvInt, DvText, and so on.)
So my opinions are not very strong one way or the other. It might be one of these things where it becomes more clear that something is acceptable or unacceptable once a solid proposal is made. I just don't know yet for myself.
But of course as we say, on the subject of types names we've reached the point where it is, speak now or forever hold your peace.
For reference, a dump of the public surface of the DataView nupkg is here
https://gist.github.com/danmosemsft/98b0dd3d33a0283a6fe71e46ff22c929
I have updated the original comment with a proposal. Please take a look and provide any feedback (especially please weigh-in on the options for the prefix for (1)).
/cc @terrajobst @NiklasGustafsson @KrzysztofCwalina @ericstj @stephentoub
Should we meet to discuss this? Back and forth on GitHub might not be the best way to close on this. I'm happy to set aside some time. I think a one-hour meeting should get us close.
Here is a dump of the proposed public surface after the changes above.
https://gist.github.com/eerhardt/eff7e207673e1fde40a5696eb07ba273
I'll book some time to close on this, @terrajobst.
We had a meeting and approved the proposal with the following changes:
Builder classes, rename GetAnnotations and GetSchema to ToAnnotations and ToSchema to follow the immutable collections pattern (and StringBuilder).
Most helpful comment
We may also want to rename the static properties on
NumberType:should all match the .NET type names:
Especially
NumberType.UGshould be renamed toNumberType.RowIdor similar.