Machine learning: QUESTION: The "pipeline" is immutable; sometimes a chain of Estimators, sometimes a single Estimator. Easy to understand?

Created on 21 Nov 2018 · 10 comments · Source: dotnet/machinelearning

This is just an observation of a possible risk. I'm not saying that my hypothetical proposal below is better, since it is probably less flexible. I just would like to get feedback from the community about our current approach to double-check we're on the right path.

The fact that a "pipeline" is sometimes a chain of estimators but can sometimes be a single estimator could be confusing for developers. For instance:

In this case, "dataProcessPipeline" is a single Estimator of type ValueToKeyMappingEstimator:

var dataProcessPipeline = mlContext.Transforms.Categorical.MapValueToKey("Area", "Label");

In this other case below, "dataProcessPipeline" becomes a chain of estimators of type EstimatorChain<TTrans> as soon as you call the first Append():

var dataProcessPipeline = mlContext.Transforms.Categorical.MapValueToKey("Area", "Label")
                .Append(mlContext.Transforms.Text.FeaturizeText("Title", "TitleFeaturized"))
                .Append(mlContext.Transforms.Text.FeaturizeText("Description", "DescriptionFeaturized"))
                .Append(mlContext.Transforms.Concatenate("Features", "TitleFeaturized", "DescriptionFeaturized"));

Also, the fact that an EstimatorChain pipeline is created from its first element might be confusing.

By comparison, and simplifying a bit: when you create a List or collection in C#, you first create the List (the "box") and then add elements to it. You usually don't create a collection from its first item; you create the "box" first, then add items. But that applies to mutable collections, which is not the same thing! :)

Our current EstimatorChain uses a more advanced pattern based on a fluent API and immutable objects. Since each estimator and estimator chain is immutable, when you append another estimator you are really creating a new estimator chain, and that new pipeline (estimator chain) is what gets returned.
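To illustrate the semantics, here is a tiny self-contained sketch of the immutable-append pattern. The Chain class below is purely hypothetical (it is NOT the real EstimatorChain<T>); it just shows why the return value of Append() must be captured:

```csharp
// Usage: appending never changes the original chain.
var a = new Chain("MapValueToKey");
var b = a.Append("FeaturizeText");   // b is a brand-new chain
Console.WriteLine(a.Count);          // still 1: 'a' was not modified
Console.WriteLine(b.Count);          // 2

// Toy immutable chain, for illustration only.
public sealed class Chain
{
    private readonly IReadOnlyList<string> _steps;

    public Chain(params string[] steps) => _steps = steps;

    // Append does not mutate 'this'; it copies the steps and
    // returns a new Chain containing the extra step.
    public Chain Append(string step)
    {
        var copy = new List<string>(_steps) { step };
        return new Chain(copy.ToArray());
    }

    public int Count => _steps.Count;
}
```

If you call a.Append(...) and discard the result, nothing observable happens — exactly the "forgot to catch the return value" pitfall discussed below.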

QUESTION: Is this pattern clear or confusing for you?

A different approach based on a typical mutable collection could look something like the following (this is NOT how ML.NET currently works and might require different types):

//DataView with dataset
IDataView trainingDataView = textLoader.Read(TrainDataPath);

// Create an "empty" EstimatorChain, the "box", which would be mutable, as it'll grow with items:
var dataProcessPipeline = mlContext.CreateEstimatorChain();

// Add Estimators to the same chain/pipeline
dataProcessPipeline.Append(mlContext.Transforms.CopyColumns("FareAmount", "Label"));
dataProcessPipeline.Append(mlContext.Transforms.Categorical.OneHotEncoding("VendorId", "VendorIdEncoded"));
dataProcessPipeline.Append(mlContext.Transforms.Normalize(inputName: "TripTime", mode: NormalizerMode.MeanVariance));
dataProcessPipeline.Append(mlContext.Transforms.Concatenate("Features", "VendorIdEncoded", "TripTime"));

//... Peek at the data in the DataView, etc. if you want

//Optional - Clone the pipeline with data transformations in case you want to reuse the dataProcessPipeline for parallel executions of additional trainers
var trainingPipeline = dataProcessPipeline.Clone();

//Add trainer to the training pipeline
var sdcaTrainer = mlContext.Regression.Trainers.StochasticDualCoordinateAscent(label: "Label", features: "Features");
trainingPipeline.Append(sdcaTrainer);

//Train the model fitting to the dataSet
var trainedModel = trainingPipeline.Fit(trainingDataView);

In this last code snippet, when you execute dataProcessPipeline.Append(estimator), it really appends the estimator to that same pipeline.

By comparison, as shown below, with our current API, when adding a trainer you have to "catch" the returned pipeline, because the estimator/trainer is added only to the new pipeline being returned, not to the pipeline on which you called Append().

//Add trainer to the training pipeline
var sdcaTrainer = mlContext.Regression.Trainers.StochasticDualCoordinateAscent(label: "Label", features: "Features");
var trainingPipeline = dataProcessPipeline.Append(sdcaTrainer);

//Train the model fitting to the dataSet
var trainedModel = trainingPipeline.Fit(trainingDataView);

That's also why you don't need to clone the pipeline if you want to "fork" it: every time you call Append() you are creating a new pipeline, so you can "fork" at any .Append() call.

In summary, with our current API, .Append() does not append anything to the current pipeline; it creates and returns a new pipeline (EstimatorChain) with that new estimator appended.
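For example, the immutable API lets you fork one data-processing pipeline into two training pipelines without any Clone(). The sketch below reuses the transform and trainer names already shown in this thread, plus PoissonRegression as a hypothetical second trainer; the exact parameter names depend on the ML.NET version:

```csharp
// Shared data-processing prefix (immutable).
var dataProcessPipeline = mlContext.Transforms.CopyColumns("FareAmount", "Label")
    .Append(mlContext.Transforms.Categorical.OneHotEncoding("VendorId", "VendorIdEncoded"))
    .Append(mlContext.Transforms.Concatenate("Features", "VendorIdEncoded", "TripTime"));

// Each Append() below returns a NEW chain; the shared prefix is untouched,
// so both trainers can safely reuse it (even in parallel).
var sdcaPipeline = dataProcessPipeline.Append(
    mlContext.Regression.Trainers.StochasticDualCoordinateAscent(label: "Label", features: "Features"));
var poissonPipeline = dataProcessPipeline.Append(
    mlContext.Regression.Trainers.PoissonRegression(label: "Label", features: "Features"));

var sdcaModel = sdcaPipeline.Fit(trainingDataView);
var poissonModel = poissonPipeline.Fit(trainingDataView);
```

With a mutable chain, the second Append() would have piled a second trainer onto the same pipeline, so the fork would have required an explicit Clone().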

Our current approach is probably more flexible, being based on immutable EstimatorChains, but I'd like to double-check that it is clear for anyone learning the API.

What are your thoughts about it?
Can you provide your feedback? 👍
Thanks,

question


All 10 comments

I will not comment on the user adoption / likability of the API, I would like to point out more of an architectural concern: we expect 'estimators' to be passed to some (potentially lazy) methods that will fit them (maybe repeatedly) at their convenience.

Having estimator pipelines (that are estimators) mutable looks dangerous in this case: it will immediately compromise thread-safety, and introduce a potential for confusion.

I also pinged folks on Gitter to provide more user feedback on this.

I'm +1 on such a pattern. Developers who engage with ML.NET are likely to be familiar with the benefits of immutability. Even if they're not, the pattern is not confusing. My only question would be whether Append is the best name. Might Concat lead to better intuitions?

I do not think it is confusing at all. Even for the List example, nowadays you will most likely find people using collection initialisers, like:

List<Cat> cats = new List<Cat>
{
    new Cat { Name = "Sylvester", Age = 8 },
    new Cat { Name = "Whiskers", Age = 2 },
    new Cat { Name = "Sasha", Age = 14 }
};

which is very similar to the append sample above. Less flexible though. And, as mentioned before, immutability brings plenty of benefits, like thread-safety.

I also very much like the immutable API (I'm mostly doing Clojure, so this is very natural). Forgetting to "catch" the return value of .Append can certainly be a source of bugs. It seems that tagging the method as [Pure] would at least make Visual Studio show a squiggly line if you don't assign the result to anything. But then: the method isn't really pure, is it?

Personally I like the pattern. I agree with @lobrien that the name is probably misleading.

@CESARDELATORRE, it seems like everybody is in favor of immutable pipelines. If you're happy with the results, could you please close the issue?

Right, it sounds great that folks like our current approach, but we have only had four comments so far. It would be good to keep this open for some more time to gather more feedback. :)

That's fine @CESARDELATORRE, but as @Zruty0 already pointed out, we simply cannot have IEstimators be mutable. If we lived in a world where SchemaShape GetOutputSchema(SchemaShape inputSchema) could potentially return different results at different times depending on the state of the estimator, it would be practically impossible to compose chains of estimators. This composability of estimators is a core aspect of this software, and it absolutely relies on immutability to work at all. You'll also note that IEstimators return a transformer type -- mutability would wreak havoc with that as well.
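To make the composability argument concrete: a chain can propagate the output schema of one stage into the next only if each stage's answer is stable. The rough sketch below (hypothetical, simplified types -- NOT the actual ML.NET implementation, which uses SchemaShape) shows the kind of schema propagation that mutability would break:

```csharp
// Hypothetical interface: an estimator that can predict its output schema.
public interface ISchemaEstimator
{
    string[] GetOutputSchema(string[] inputSchema);
}

public sealed class SchemaChain : ISchemaEstimator
{
    private readonly ISchemaEstimator[] _stages;

    public SchemaChain(params ISchemaEstimator[] stages) => _stages = stages;

    // Feed each stage's output schema into the next stage.
    // This composition is valid ONLY because every stage is immutable,
    // so its answer for a given input schema never changes over time.
    public string[] GetOutputSchema(string[] inputSchema)
    {
        var schema = inputSchema;
        foreach (var stage in _stages)
            schema = stage.GetOutputSchema(schema);
        return schema;
    }
}
```

If a stage could mutate between the schema check and Fit(), the precomputed schema would be worthless, which is exactly the thread-safety and confusion risk raised above.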

Note that this reasoning would hold even in the hypothetical world where, say, you'd received one thousand people noisily agreeing with you that having the basis of our data pipelines be objects with constantly mutating state was somehow a less confusing situation. Good software architecture must be a more deliberative process than merely holding loose straw polls.

I agree with that, but the API usage experience can usually be improved; that's why we also ask for feedback.
In any case, we all agree that our current approach is good, and the folks providing feedback in this thread also like it, so let's close the issue. If there's related feedback in the future, we can correlate the issues. 👍

I was creating a generic method like this:
void TrainModel<T>(MLContext mlContext, string trainDataPath, string modelPath)

Enumerating over T with Reflection to build the Column property for my TextReader was trivial.

In addition, my Label was also obtainable with Reflection.

The problem came after:
var dataProcessPipeline = mlContext.Transforms.CopyColumns(label.Name, "Label");

The examples where OneHotEncoding and/or Normalize get chained work well, but in my case, iterating through the columns and appending to my dataProcessPipeline doesn't work with this API.

I guess I may be one of the few devs making 100% dynamic methods compared to dedicated methods for a given model, but wanted to chime in.
