Machinelearning: API reference - XML documentation template for transforms

Created on 4 Apr 2019 · 11 Comments · Source: dotnet/machinelearning

The XML documentation for the transforms should contain information about the schema: requirements about the type of the columns to work on, and information about the type of the columns produced.

@shmoradims

documentation

sfilipi

All 11 comments

I'd suggest documenting the GetOutputSchema function; the transformer's XML can then reference GetOutputSchema's documentation.

wschin on 4 Apr 2019

Referencing #3127, as this feels the same or related. Maybe we could track this work under one issue? I think there are going to be a number of subsections that will need to be tracked. And should this go on the transforms, or on the extensions?

singlis on 4 Apr 2019

> I'd suggest documenting the GetOutputSchema function; the transformer's XML can then reference GetOutputSchema's documentation.

That won't work, because GetOutputSchema is on the base class and calls each class's GetOutputSchemaCore, which is internal.
Also, the types of the input columns need to be documented.

sfilipi on 5 Apr 2019

Proposal for the transforms XML template, mirroring the trainers template from issue #3218:

1 - XML on the transform extension method:

Summary - Create an <see cref the estimator>estimator that <Short transform description>

Parameters:
inputColumnName: Name of column to transform. If set to <see langword="null"/>, the value of the <paramref name="outputColumnName"/> will be used as source.
The data type on this column should be <ref required data type.> | The data type on this column can be any type.
outputColumnName: Name of the column resulting from the transformation of <paramref name="inputColumnName"/>.
This column's data type will be <info from GetOutputSchema here>.
Example
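
As a rough illustration of part 1 of the template, the sketch below applies it to a made-up extension method. `ExampleCatalog`, `ExampleScalingEstimator`, and `ExampleScaling` are hypothetical stand-ins invented for this sketch, not ML.NET types, and `System.Single` is only an assumed column type.

```csharp
// Hypothetical stand-ins so the sketch compiles on its own; none of these are ML.NET types.
public sealed class ExampleCatalog { }

public sealed class ExampleScalingEstimator
{
    public string OutputColumnName { get; }
    public string InputColumnName { get; }

    public ExampleScalingEstimator(string outputColumnName, string inputColumnName)
    {
        OutputColumnName = outputColumnName;
        // Mirrors the documented fallback: a null input name means "use the output name as source".
        InputColumnName = inputColumnName ?? outputColumnName;
    }
}

public static class ExampleCatalogExtensions
{
    /// <summary>
    /// Create an <see cref="ExampleScalingEstimator"/>, which scales each value in the input column by a constant.
    /// </summary>
    /// <param name="catalog">The transform's catalog.</param>
    /// <param name="outputColumnName">Name of the column resulting from the transformation of <paramref name="inputColumnName"/>.
    /// This column's data type will be <see cref="System.Single"/>.</param>
    /// <param name="inputColumnName">Name of column to transform. If set to <see langword="null"/>, the value of the
    /// <paramref name="outputColumnName"/> will be used as source.
    /// The data type on this column should be <see cref="System.Single"/>.</param>
    /// <example>
    /// <code>
    /// var estimator = new ExampleCatalog().ExampleScaling("Scaled", "Raw");
    /// </code>
    /// </example>
    public static ExampleScalingEstimator ExampleScaling(
        this ExampleCatalog catalog,
        string outputColumnName,
        string inputColumnName = null)
        => new ExampleScalingEstimator(outputColumnName, inputColumnName);
}
```

The output column comes first in the parameter list here only so that `inputColumnName` can be optional; the template itself does not prescribe an order.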

2 - XML of the Estimator, Transformer

Summary - One line description of what the transform does.
Remarks - More details about the transform and its implementation.
Does it do a pass through the data? Yes/No
Input column type
Output column type.
Additional NuGet: "Link to NuGet" OR None, if everything needed is already included in Microsoft.ML
SeeAlso cref the extension method for an example usage.
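
Part 2 of the template could look roughly like the following on an estimator class. This is a sketch under assumptions: `SampleCatalog`, `SampleMeanImputingEstimator`, and `ReplaceMissingWithMean` are hypothetical names, and the use of `<para>` elements for the Remarks entries is one possible layout, not necessarily the one the team adopted.

```csharp
// Hypothetical types for illustration only; none of these names exist in ML.NET.
public sealed class SampleCatalog { }

/// <summary>
/// Replaces missing values in a column with the mean of that column.
/// </summary>
/// <remarks>
/// <para>More details about the transform and its implementation go here.</para>
/// <para>Does it do a pass through the data? Yes</para>
/// <para>Input column data type: <see cref="System.Single"/></para>
/// <para>Output column data type: <see cref="System.Single"/></para>
/// <para>Additional NuGet: None</para>
/// </remarks>
/// <seealso cref="SampleCatalogExtensions.ReplaceMissingWithMean(SampleCatalog, string, string)"/>
public sealed class SampleMeanImputingEstimator
{
    public string OutputColumnName { get; }
    public string InputColumnName { get; }

    public SampleMeanImputingEstimator(string outputColumnName, string inputColumnName)
    {
        OutputColumnName = outputColumnName;
        // A null input name falls back to the output name, as in the extension-method template.
        InputColumnName = inputColumnName ?? outputColumnName;
    }
}

public static class SampleCatalogExtensions
{
    /// <summary>
    /// Create a <see cref="SampleMeanImputingEstimator"/>, which replaces missing values with the column mean.
    /// </summary>
    public static SampleMeanImputingEstimator ReplaceMissingWithMean(
        this SampleCatalog catalog,
        string outputColumnName,
        string inputColumnName = null)
        => new SampleMeanImputingEstimator(outputColumnName, inputColumnName);
}
```

The `<seealso>` on the class points back at the extension method, so a reader landing on the estimator page can reach the usage example documented there.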

cc @natke @shmoradims @glebuk @singlis

sfilipi on 10 Apr 2019

Looks good @sfilipi. One small comment: for the summary of the extension method, would we want to say: "Create an xxx transform ..."?

natke on 10 Apr 2019

Shouldn't it say "Create an xxx estimator" instead of "transform"?
@sfilipi

artidoro on 10 Apr 2019

> Shouldn't it say "Create an xxx estimator" instead of "transform"?
> @sfilipi

@natke suggested that we reference the components by their functionality: trainers and transforms.
In that specific case we'll keep it as "estimator". I updated the template above to reflect that.

In the Remarks of the actual estimator, we're keeping "transform" as the term.

sfilipi on 10 Apr 2019

Do we need the remarks? Can the summary just say: "Create an "?
Let's add a short description of the transform in the summary, similar to trainers.
Onnx -> ONNX

shmoradims on 10 Apr 2019

Transform priority is in this old issue. Use it as a guide to prioritize the more popular transforms. Also, reuse descriptions from here.

List of transforms to sign up for:

sfilipi on 11 Apr 2019

Other catalogs:

| MLContext.Transforms.Categorical | API Owner | Priority | Status |
| -------------------------------- | ----- | ----- | ----- |
| OneHotEncoding | Najeeb | 0 | Done #3388 |
| OneHotHashEncoding | Najeeb | 1 | Done #3388 |

| MLContext.Transforms.Conversion | API Owner | Priority | Status |
| -------------------------------- | ----- | ------ |----- |
| Hash | Senja | | Done |
| ConvertType | Senja | | Done |
| MapKeyToValue | Senja | | Done |
| MapKeyToVector | Senja | | Done |
| MapValueToKey | Senja | | Done |
| MapKeyToBinaryVector | Senja | | Done |

| MLContext.Transforms.FeatureSelection | API Owner | Priority | Status |
| -------------------------------- | ----- |--- |--- |
| SelectFeaturesBasedOnMutualInformation | Senja (needs a better example to show MI computation, something like this) | | Done |
| SelectFeaturesBasedOnCount | Senja | | Done |

| MLContext.Transforms.Text | API Owner | Priority | Status |
| -------------------------------- | ----- | ---- | ---- |
| FeaturizeText | Senja | | Done #3438 |
| TokenizeCharacters | Artidoro | 2 | Done #3418 |
| NormalizeText | Artidoro | 2 | Done #3418 |
| ExtractWordEmbeddings | Artidoro | | Done #3418 |
| TokenizeWords | Artidoro | 2 | Done #3418 |
| ProduceNgrams | Artidoro | 2 | Done #3418 |
| RemoveDefaultStopWords | Ivan | | Done #3413 |
| RemoveStopWords | Ivan | | Done #3413 |
| ProduceWordBags | Ivan | | Done #3440 |
| ProduceHashedWordBags | Ivan | | Done #3440 |
| ProduceHashedNgrams | Ivan | | Done #3419 |
| LatentDirichletAllocation | Senja | | Done #3442 |

For the Data catalog, the documentation of every API needs to be augmented with suggestions for when one would use that API.

| MLContext.Data | API Owner | Priority | Status |
| -------------------------------- | ----- | ----- |----- |
| LoadFromEnumerable | Najeeb | | Done #3417 |
| CreateEnumerable | Najeeb | The second overload of this API is a P4 scenario. The use case for that overload: a user has a model that preserves slot names for the features. When they load the model, they also get the schema out of the loaded model and pass that schema, together with the TRow type they want to load the data into, to this API. The API then populates the Annotations (formerly metadata) for the feature column. | Done #3417 |
| BootstrapSample | Najeeb | | Done (previously by Rogan) |
| Cache | Najeeb | | Done (previously by Rogan) |
| FilterRowsByColumn | Najeeb | | Done (previously by Rogan) |
| FilterRowsByKeyColumnFraction | Najeeb | | Done (previously by Rogan) |
| FilterRowsByMissingValues | Wei-Sheng | | Done (previously by Rogan) |
| ShuffleRows | Wei-Sheng | | Done (previously by Rogan) |
| SkipRows | Wei-Sheng #3415 | | Done (previously by Rogan) |
| TakeRows | Wei-Sheng #3415 | | Done (previously by Rogan) |

| Other | API Owner | Priority | Status |
| -------------------------------- | ----- | ---- | ---- |
| Permutation Feature Importance | Shahab | | Done by @codemzs |
| MLContext.Model (root) | Shahab | | #3451 |

sfilipi on 11 Apr 2019

Finished 4/21 9:29pm. Great teamwork.

shmoradims on 22 Apr 2019
