Machinelearning: API reference - Samples for Transforms

Created on 10 Oct 2018 · 6 Comments · Source: dotnet/machinelearning

We need to add samples showing how to use the new transformers and estimators, and then reference those samples from the XML documentation, so that users on docs.microsoft.com can copy and paste a sample and get a head start.

Most of the tests that were added as part of the transformer work are a good starting point for creating a sample.

MLContext Catalogs

| Catalog | Total APIs | Samples Owner | Samples Status / ETA |
| -- | -- | -- | -- |
| MLContext.Transforms (root) | 19 | Senja | Remaining: 4 overrides for the normalizer multicolumn examples |
| MLContext.Transforms.Categorical | 2 | ZeeshanA | Done v1 |
| MLContext.Transforms.Conversion | 6 | Senja | Done v1 |
| MLContext.Transforms.FeatureSelection | 4 | ZeeshanA | Done v1 |
| MLContext.Transforms.TimeSeries | 4 | Senja | Done v1 |
| MLContext.Transforms.Text | 29 | ZeeshanA | Done v1 |
| MLContext.Data | 10 | Senja | Done v1 |
| MLContext.Model (root) | 4 | ZeeshanS | Done v1 |

P0+P1 Public API (extension methods) per Catalog

| MLContext.Transforms (root) | Num Overloads | Documentation | Sample | API Owner |
| -------------------------------- | ------------- | ------------ | ----- | ----- |
| CopyColumns | 2 | Yes | 2 - Can remove dependency on DatasetUtils. | Zeeshan |
| Concatenate | 1 | Yes, needs improvement. | 1 - Can remove dependency on DatasetUtils. | Zeeshan |
| DropColumns | 1 | Yes | 1 - Can remove dependency on DatasetUtils. | Zeeshan |
| SelectColumns | 2 | Yes, needs improvement. | 2 - Can remove dependency on DatasetUtils. | Zeeshan |
| Normalize | 1 | Done. | 1 #3244 | Ivan |
| CustomMapping | 1 | Yes, needs improvement. | Done-v1 #3275 | Artidoro |
| IndicateMissingValues | 2 | | Done-v1 #3275 | Artidoro |
| ReplaceMissingValues | 2 | | Done-v1 #3275 | Artidoro |
| ConvertToGrayscale | 1 | Yes, needs fixes. Example not displaying. | 1 #3165 | Abhishek |
| LoadImages | 1 | Yes, needs fixes. Example not displaying. | 1 #3165 | Abhishek |
| ExtractPixels | 2 | Yes, needs fixes. Example not displaying. | 1 #3165 | Abhishek |
| ResizeImages | 2 | Yes. Example not displaying. | 1 #3165 | Abhishek |
| ConvertToImage | 2 | Yes. | 1 #3165 | Abhishek |
| IidChangePointEstimator | 1 | | 1 - Done | Senja |
| IidSpikeEstimator | 1 | | 1 - Done | Senja |
| SsaChangePointEstimator | 1 | | 1 - Done | Senja |
| SsaSpikeEstimator | 1 | | 1 - Done | Senja |
| ApplyOnnxModel | 3 | DoneV1 | #3349 | Gani |
| DnnFeaturizeImage | 1 | Yes, needs improvement. | 1 - Done | Senja |
| NormalizeGlobalContrast | 1 | Done | 0 #3232 | Ivan |
| NormalizeLpNorm | 1 | Done. | 0 #3232 | Ivan |
| ApproximatedKernelMap | 1 | Done | 0 #3232 | Ivan |
| CalculateFeatureContribution | 1 | Yes, needs improvement. | | Rogan |

| MLContext.Transforms.Categorical | Num Overloads | Documentation | Sample | API Owner |
| -------------------------------- | ------------- | ------------ | ----- | ----- |
| OneHotEncoding | 2 | | 2 #3179 | Abhishek |
| OneHotHashEncoding | 2 | | 2 #3179 | Abhishek |

| MLContext.Transforms.Conversion | Num Overloads | Documentation | Sample | API Owner |
| -------------------------------- | ------------- | ------------ | ----- | ----- |
| Hash | 2 | Can't find the API. | Done | Senja |
| ConvertType | 2 | Yes, needs improvement. | Done | Senja |
| MapKeyToValue | 2 | Yes, needs improvement. | Done | Senja |
| MapKeyToVector | 2 | Yes, needs improvement. | Done | Senja |
| MapValueToKey | 2 | Yes. | Done | Senja |
| MapKeyToBinaryVector | 2 | Yes, needs improvement. | Done | Senja |

| MLContext.Transforms.FeatureSelection | Num Overloads | Documentation | Sample | API Owner |
| -------------------------------- | ------------- | ------------ | ----- | ----- |
| SelectFeaturesBasedOnMutualInformation | 2 | Needs a better example to show the mutual information (MI) computation, something like this. | 2 #3184 | Abhishek |
| SelectFeaturesBasedOnCount | 2 | | 2 #3184 | Abhishek |

| MLContext.Transforms.Text| Num Overloads | Documentation | Sample | API Owner |
| -------------------------------- | ------------- | ------------ | ----- | ----- |
| FeaturizeText | 2 | | #3120 | Zeeshan |
| TokenizeCharacters | 1 | | #3123 |Zeeshan |
| NormalizeText | 1 | | #3133| Zeeshan |
| ExtractWordEmbeddings | 1 | | #3142 | Zeeshan |
| TokenizeWords | 1 | | #3156 | Zeeshan |
| ProduceNgrams | 3 | | #3177 | Zeeshan |
| RemoveDefaultStopWords | 2 | | #3156 | Zeeshan|
| RemoveStopWords | 2 | | #3156 |Zeeshan |
| ProduceWordBags | 3 | | #3183 | Zeeshan |
| ProduceHashedWordBags | 3 | | #3183 | Zeeshan|
| ProduceHashedNgrams | 3 | | #3177 | Zeeshan |
| LatentDirichletAllocation | 2 | |#3191 | Zeeshan |

For the Data catalog, the documentation of every API needs to be augmented with guidance on when to use that API.

| MLContext.Data | Num Overloads | Documentation | Sample | API Owner |
| -------------------------------- | ------------- | ------------ | ----- | ----- |
| LoadFromEnumerable | 1 | Done.| 1 - Done. | Senja |
| CreateEnumerable | 2 | Done. The second overload of this API is a P4 scenario. Its use case: a user has a model that preserves slot names for the features. When they load the model, they also get the schema out of the loaded model and pass that schema, together with the TRow type they want to load the data into, to this API. The API then populates the Annotations (formerly metadata) for the feature column. | 1 | Senja |
| BootstrapSample | 1 | Done. | 1 - Done. | Senja |
| Cache | 1 | Done. | 1 - Done. | Senja |
| FilterRowsByColumn | 1 | Done.| 1 - Done. | Senja |
| FilterRowsByKeyColumnFraction | 1 | Done. | 1 - Done. | Senja |
| FilterRowsByMissingValues | 1 | Done. | 1 - Done. | Senja |
| ShuffleRows | 1 | Done. | 1 - Done. | Senja |
| SkipRows | 1 | Done. | 1 - Done. | Senja |
| TakeRows | 1 | Done. | 1 - Done.| Senja |

| Other | Num Overloads | Documentation | Sample | API Owner |
| -------------------------------- | ------------- | ------------ | ----- | ----- |
| Permutation Feature Importance | 4 | Yes, but needs work | Yes, but needs work | Rogan |

documentation

All 6 comments

List of Trainers:

| BinaryClassification.Trainers | Category | Priority | Owner | Completed PR |
| ------- | -------- | -------- | ------ | ----------- |
| StochasticDualCoordinateAscent | Linear | 0 | Shahab | |
| StochasticGradientDescent | Linear | 0 | Shahab | |
| AveragedPerceptron | Linear | 0 | Shahab | |
| LogisticRegression | Linear | 0 | Shahab | |
| SymbolicStochasticGradientDescent | Linear | 0 | Shahab | |
| FastTree | Tree | 0 |Shahab | |
| FastForest | Tree | 0 | Shahab | |
| LightGbm | Tree | 0 | Shahab | |
| FieldAwareFactorizationMachine | FFM | 0 | Shahab | |
| GeneralizedAdditiveModels | GAM | 1 |Shahab | |
| LinearSupportVectorMachines | Linear | 2 | | |

| Trainer | Category | Priority | Owner | Completed PR |
| ------- | -------- | -------- | ------ | ----------- |
| SDCAMC: Fast Linear Multi-class Classification (SA-SDCA) | Linear | 0 | | |
| SDCAR: Fast Linear Regression (SA-SDCA) | Linear | 0 | | |
| OVA: One-vs-All | Meta | 0 | | |
| FastTreeRegression: FastTree (Boosted Trees) Regression | Tree | 0 | | |
| KMeansPlusPlus: KMeans++ Clustering | Clustering | 0 | | |
| LightGBMMulticlass: LightGBM Multi-class Classifier | Tree | 0 | | |
| LightGBMRegression: LightGBM Regressor | Tree | 0 | | |
| MultiClassLogisticRegression: Multi-class Logistic Regression | Linear | 0 | | |
| OLSLinearRegression: Ordinary Least Squares (Regression) | Linear | 0 | | |
| FastForestRegression: Fast Forest Regression | Tree | 0 | | |
| RegressionGamTrainer: Generalized Additive Model for Regression | GAM | 1 | | |
| OnlineGradientDescent: Stochastic Gradient Descent (Regression) | Linear | 1 | | |
| PoissonRegression: Poisson Regression | Linear | 1 | | |
| PKPD: Pairwise coupling (PKPD) | Meta | 1 | | |
| pcaAnomaly: PCA Anomaly Detector | Projection | 1 | | |
| FastTreeTweedieRegression: FastTree (Boosted Trees) Tweedie Regression | Tree | 1 | | |
| PriorPredictor: Prior Predictor | Baseline | 2 | artidoro |#2510 |
| RandomPredictor: Random Predictor | Baseline | 2 | artidoro |#2510 |
| MultiClassNaiveBayes: Multiclass Naive Bayes | Bayes | 2 | | |
| BinarySGD: Hogwild SGD (binary) | Linear | 2 | | |
| FastTreeRanking: FastTree (Boosted Trees) Ranking | Tree | 2 | | |
| LightGBMRanking: LightGBM Ranking | Tree | 2 | | |

Would the logistic regression one be done with PR #2256? I wonder if others may be done, too. I can try to go through and see if they have XML doc examples.

Hi @jwood803, yes, you took care of LogisticRegression with #2256. Thanks!
This work item is to complete everything: XML docs over the extensions, estimators, etc.

If you have bandwidth and are looking for something to do, you can contribute to the samples and replicate the work you did on Logistic Regression in #2256 for the other binary trainers:

LightGBM
FastTree
AveragedPerceptron
SDCA
LinearSVM
SymSGD

Basically, every extension on the BinaryClassificationCatalog.BinaryClassificationTrainers catalog.

cc @shmoradims @rogancarr FYI.

If you claim those, feel free to update the table with your username.

I moved the trainers to a separate issue: #2522

I talked with @shmoradims and @eerhardt, and as a sub-part of this issue I will work on an example implementation of IDataView.

Verified that everything is documented except the normalizer multicolumn APIs. Tracking that as a separate issue.
