The XML documentation for the transforms should contain information about the schema: requirements about the type of the columns to work on, and information about the type of the columns produced.
@shmoradims
I'd suggest to document GetOutputSchema function and transformer's XML can reference GetOutputSchema's document.
Referencing #3127, as this feels like the same or a related issue; maybe we could track this work under one issue? I think there are going to be a number of subsections that will need to be tracked. And should this go on the transforms, or on the extensions?
> I'd suggest to document GetOutputSchema function and transformer's XML can reference GetOutputSchema's document.
It won't work, because GetOutputSchema is on the base class, and is calling GetOutputSchemaCore of every class, which is internal.
Also the types of the columns for the input columns need to be documented.
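To illustrate the point above, here is a minimal sketch of the pattern being described. All type names are hypothetical stand-ins rather than the actual ML.NET classes; only the `GetOutputSchema` / `GetOutputSchemaCore` method names come from this thread.

```csharp
// Hypothetical stand-in for a schema type; NOT the ML.NET schema class.
public sealed class SketchSchema
{
    public string Description { get; }
    public SketchSchema(string description) => Description = description;
}

public abstract class SketchTransformerBase
{
    /// <summary>
    /// Documented once, on the base class, so this text cannot state
    /// the output column types of any particular transform.
    /// </summary>
    public SketchSchema GetOutputSchema(SketchSchema input) => GetOutputSchemaCore(input);

    // Internal: XML docs written here never reach the published API reference.
    internal abstract SketchSchema GetOutputSchemaCore(SketchSchema input);
}

public sealed class SketchTransformer : SketchTransformerBase
{
    // The transform-specific schema logic lives here, out of reach of public docs.
    internal override SketchSchema GetOutputSchemaCore(SketchSchema input)
        => new SketchSchema(input.Description + " (transformed)");
}
```

Under these assumptions, the only place left to describe per-transform column types is the XML on the public estimator, transformer, and extension method.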
Proposal for the transforms XML template, mirroring the trainers template from issue: #3218,
1 - XML on Transform extension method:
<see cref the estimator> estimator that <Short transform description>
Parameters:
inputColumnName: Name of column to transform. If set to <see langword="null"/>, the value of the <paramref name="outputColumnName"/> will be used as source.
The data type on this column should be <ref required data type>. | The data type on this column can be any type.
outputColumnName: Name of the column resulting from the transformation of <paramref name="inputColumnName"/>.
This column's data type will be <info from GetOutputSchema here>.
Example
2 - XML of the Estimator, Transformer
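As a rough illustration of item 1, the template might look like this when filled in for an extension method. `ExampleCatalog`, `ExampleScalingEstimator`, and `Scale` are hypothetical names invented for this sketch, and the `System.Single` column types are placeholders; the summary wording is one possible reading of the template, not a settled convention.

```csharp
// Hypothetical types for illustration only; not part of ML.NET.
public sealed class ExampleCatalog { }

public sealed class ExampleScalingEstimator
{
    public string OutputColumnName { get; }
    public string InputColumnName { get; }

    public ExampleScalingEstimator(string outputColumnName, string inputColumnName)
    {
        OutputColumnName = outputColumnName;
        // Mirrors the documented rule: a null input name falls back to the output name.
        InputColumnName = inputColumnName ?? outputColumnName;
    }
}

public static class ExampleCatalogExtensions
{
    /// <summary>
    /// Create an <see cref="ExampleScalingEstimator"/>, which scales the values of the input column.
    /// </summary>
    /// <param name="catalog">The transform's catalog.</param>
    /// <param name="outputColumnName">Name of the column resulting from the transformation of <paramref name="inputColumnName"/>.
    /// This column's data type will be <see cref="System.Single"/>.</param>
    /// <param name="inputColumnName">Name of column to transform. If set to <see langword="null"/>, the value of the
    /// <paramref name="outputColumnName"/> will be used as source.
    /// The data type on this column should be <see cref="System.Single"/>.</param>
    public static ExampleScalingEstimator Scale(
        this ExampleCatalog catalog,
        string outputColumnName,
        string inputColumnName = null)
        => new ExampleScalingEstimator(outputColumnName, inputColumnName);
}
```

The input type requirement and the output type both sit on the parameter docs, so a reader sees the schema contract without having to open the estimator page.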
cc @natke @shmoradims @glebuk @singlis
Looks good @sfilipi. One small comment: for the summary of the extension method, would we want to say: "Create an xxx transform ..."?
Shouldn't it say "Create an xxx estimator" instead of "transform"?
@sfilipi
> Shouldn't it say "Create an xxx estimator" instead of "transform"?
@sfilipi
@natke suggested that we reference the components by their functionality: trainers and transforms.
In that specific case we'll keep it as estimator; I updated the template above to reflect that.
In the Remarks of the actual estimator we're keeping "transform" as the term.
Transform priority is in this old issue. Use it as a guide to prioritize the more popular transforms. Also, reuse descriptions from here.
List of transforms to sign up for:
| MLContext.Transforms (root) | API Owner | Priority | Status |
| -- | -- | -- | -- |
| CopyColumns | Senja | | Done PrToV1: #3348 |
| Concatenate | Artidoro | 0 | Done |
| DropColumns | Artidoro | 3 | Done |
| SelectColumns | Artidoro | 3 | Done |
| Normalize.MinMax | Scott | | Done #3432 |
| Normalize.MeanVariance | Scott | | Done #3432 |
| Normalize.LogMeanVariance | Scott | | Done #3432 |
| Normalize.Binning | Scott | | Done #3432 |
| Normalize.SupervisedBinning | Scott | | Done #3432 |
| CustomMapping | Artidoro | | Done |
| IndicateMissingValues | Ivan | | Done #3386 |
| ReplaceMissingValues | Ivan | | Done #3386 |
| ConvertToGrayscale | Ivan | | In PR #3376 |
| LoadImages | Ivan | | In PR #3376 |
| ExtractPixels | Ivan | | In PR #3376 |
| ResizeImages | Ivan | | In PR #3376 |
| ConvertToImage | Ivan | | In PR #3376 |
| IidChangePointEstimator | Wei-Sheng | | Done #3444 |
| IidSpikeEstimator | Wei-Sheng | | Done #3444 |
| SsaChangePointEstimator | Wei-Sheng | | Done #3444 |
| SsaSpikeEstimator | Wei-Sheng | | Done #3444 |
| ApplyOnnxModel | Gani | | Done #3387 |
| DnnFeaturizeImage | Senja | | |
| NormalizeGlobalContrast | Artidoro | | Done |
| NormalizeLpNorm | Artidoro | | Done |
| ApproximatedKernelMap | Yael | | Done #3377 |
| CalculateFeatureContribution | Yael | | Done #3384 |
Other catalogs:
| MLContext.Transforms.Categorical | API Owner | Priority | Status |
| -------------------------------- | ----- | ----- | ----- |
| OneHotEncoding | Najeeb | 0 | Done #3388 |
| OneHotHashEncoding | Najeeb | 1 | Done #3388 |
| MLContext.Transforms.Conversion | API Owner | Priority | Status |
| -------------------------------- | ----- | ------ |----- |
| Hash | Senja | | Done |
| ConvertType | Senja | | Done |
| MapKeyToValue | Senja | | Done |
| MapKeyToVector | Senja | | Done |
| MapValueToKey | Senja | | Done |
| MapKeyToBinaryVector | Senja | | Done |
| MLContext.Transforms.FeatureSelection | API Owner | Priority | Status |
| -------------------------------- | ----- |--- |--- |
| SelectFeaturesBasedOnMutualInformation | Senja (needs a better example to show the MI computation, something like this) | | Done |
| SelectFeaturesBasedOnCount | Senja | | Done |
| MLContext.Transforms.Text| API Owner | Priority | Status |
| -------------------------------- | ----- | ---- | ---- |
| FeaturizeText | Senja | | Done #3438 |
| TokenizeCharacters | Artidoro | 2 | Done #3418 |
| NormalizeText | Artidoro | 2 | Done #3418 |
| ExtractWordEmbeddings | Artidoro | | Done #3418 |
| TokenizeWords | Artidoro | 2 | Done #3418 |
| ProduceNgrams | Artidoro | 2 | Done #3418 |
| RemoveDefaultStopWords | Ivan | | Done #3413|
| RemoveStopWords | Ivan | | Done #3413|
| ProduceWordBags | Ivan | | Done #3440|
| ProduceHashedWordBags | Ivan | | Done #3440|
| ProduceHashedNgrams | Ivan | | Done #3419|
| LatentDirichletAllocation | Senja | | Done #3442 |
For the Data catalog, the documentation of every API needs to be augmented with suggestions for when one would use that API.
| MLContext.Data | API Owner | Priority | Status |
| -------------------------------- | ----- | ----- |----- |
| LoadFromEnumerable | Najeeb | | Done #3417 |
| CreateEnumerable | Najeeb | The second overload of this API is a P4 scenario. The use case for that overload: a user has a model that preserves slot names for the features; when they load the model, they also get the schema out of the loaded model and pass that schema, together with the TRow type they want to load the data into, to this API. The API then populates the Annotations (formerly metadata) for the feature column. | Done #3417 |
| BootstrapSample | Najeeb | | Done (previously by Rogan) |
| Cache | Najeeb | | Done (previously by Rogan) |
| FilterRowsByColumn | Najeeb | | Done (previously by Rogan) |
| FilterRowsByKeyColumnFraction | Najeeb | | Done (previously by Rogan) |
| FilterRowsByMissingValues | Wei-Sheng | |Done (previously by Rogan) |
| ShuffleRows | Wei-Sheng | | Done (previously by Rogan)|
| SkipRows | Wei-Sheng #3415 | | Done (previously by Rogan) |
| TakeRows | Wei-Sheng #3415| | Done (previously by Rogan) |
| Other | API Owner | Priority | Status |
| -------------------------------- | ----- | ---- | ---- |
| Permutation Feature Importance | Shahab | | Done by @codemzs |
| MLContext.Model (root)| Shahab | | #3451 |
Finished 4/21 9:29pm. Great team work.