**What I did**
**What happened**
If I execute the pipeline once, e.g. load data from enumerables into a data view and then run the entire transformation chain, transformations plus trainer, everything works fine.
If I execute the pipeline twice, first separately and then as part of the entire transformation chain, it consumes 3 GB of the 16 GB of available RAM, and training hangs indefinitely and never ends.
I fixed this temporarily by setting the MaximumNumberOfIterations option, but I'm not sure it's a good idea...
**What I expect**
I expect training to stop eventually, no matter how many times I execute the pipeline.
Check the comment on the last line in the code below.
The source code is taken from this issue: https://github.com/dotnet/machinelearning/issues/4903
```C#
public IEstimator<ITransformer> GetPipeline(IEnumerable<string> columns)
{
    var pipeline = Context
        .Transforms
        .Conversion
        .MapValueToKey(new[] { new InputOutputColumnPair("Label", "Strategy") })
        .Append(Context.Transforms.Concatenate("Combination", columns.ToArray())) // merge "dynamic" columns into a single column
        .Append(Context.Transforms.NormalizeMinMax(new[] { new InputOutputColumnPair("Features", "Combination") })) // normalize merged columns into Features
        .Append(Context.Transforms.SelectColumns(new string[] { "Label", "Features" })); // remove everything from the data view except the transformed columns
    return pipeline;
}

public IEstimator<ITransformer> GetEstimator()
{
    var options = new SdcaMaximumEntropyMulticlassTrainer.Options
    {
        // MaximumNumberOfIterations = 100 // uncomment this to fix the issue
    };
    var estimator = Context
        .MulticlassClassification
        .Trainers
        .SdcaMaximumEntropy(options)
        .Append(Context.Transforms.Conversion.MapKeyToValue(new[]
        {
            new InputOutputColumnPair("Prediction", "PredictedLabel") // set the trainer to use the Prediction column as output
        }));
    return estimator;
}

public void TrainModel(IEnumerable<string> columns, IEnumerable<Input> items) // Input: the sample's row type
{
    var estimator = GetEstimator();
    var pipeline = GetPipeline(columns);
    var inputs = Context.Data.LoadFromEnumerable(items); // create the data view
    // If I stop execution here, everything is ok
    var model = pipeline.Append(estimator).Fit(inputs); // works fine for the data view loaded from enumerables
    // The data preparation pipeline is part of the transformation chain, so I don't need the next 2 lines,
    // but I don't understand why they cause the issue
    var pipelineModel = pipeline.Fit(inputs);
    var pipelineView = pipelineModel.Transform(inputs); // execute the pipeline before training
    var slowModel = pipeline.Append(estimator).Fit(pipelineView); // use the transformed pipelineView instead of the initial inputs and ... go into an infinite loop ... why?
}
```
@artemiusgreat I couldn't reproduce this error with simple enumerable data containing only a label and a numeric feature vector. I used a simple pipeline with only MapValueToKey and CopyColumns, and the same estimator as you. A few lines of the enumerable you are passing to TrainModel would help me reproduce it.
That said, one thing I see potentially wrong with your code is that in the last line you have pipeline.Append(estimator).Fit(pipelineView). pipelineView already has the operations in pipeline applied, which means that you have dropped all columns except "Label" and "Features". Now, when you do pipeline.Append(estimator), this chain expects a column named "Strategy", all the columns that will be combined into a column named "Combination", and so on.
Admittedly, this should throw a schema mismatch error at the first step, saying column "Strategy" not found. Not sure why this is not the case. If you can give me a sample of the enumerable, I can debug this.
@najeeb-kazmi yes, you're right that it should throw an exception, but the line selecting only the Label and Features columns is irrelevant to the issue and can be commented out for now.
I created a demo project that doesn't reproduce the significant resource consumption but does demonstrate how drastically execution time can increase simply by separating the data preparation pipeline from the trainer. At least, that's the only difference I can see between the two stopwatches.
https://github.com/artemiusgreat/MaxEntropyLoopDemo
Method CreateModel in this file
https://github.com/artemiusgreat/MaxEntropyLoopDemo/blob/master/ModelBuilder.cs
It includes a training data set consisting of 3 records in the Input.csv file.
The first column in the provided data set is Strategy (the Label); its values look like 08, 07, 06.
@artemiusgreat I tried to make the comparison as apples-to-apples as possible. The difference is primarily due to .AppendCacheCheckpoint being at the end of dataPipeline in the slow pipeline.
Caching is helpful before an operation that does multiple passes over the data, like an SdcaMaximumEntropyTrainer. Before discussing the implications of caching, a few preliminaries:
As a baseline, running your code without any changes, I got these times:
RunTime SLOW 00:01:53.55
RunTime FAST 00:00:06.58
I'm not sure why you have "Strategy" in the concat transform in the slow pipeline (Line 93). This is not present in the fast pipeline, so I removed it. This is the label, so it shouldn't be in the model features anyway.
```C#
.Append(mlContext.Transforms.Concatenate("Combination", Selection.Concat(new[] { "Strategy" }).ToArray()))
```
In the slow pipeline, you are fitting the data pipeline, then fitting the data pipeline + the trainer again. The fast pipeline, on the other hand, only fits the data pipeline + trainer once. In ML.NET, IDataView is lazily evaluated, so nothing happens until the output of an operation needs to be consumed. The slow pipeline applies the transformations to the data twice when .Fit is called, first to produce transformedView consumed by .Fit, then the same transformations to transformedView before being passed to the trainer. (This is responsible only for a very small part of the difference, but is relevant to what I talk about next.)
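The double application described above is the same effect you get with deferred LINQ enumeration. Here is a minimal plain-C# sketch, deliberately using LINQ rather than the real ML.NET types, purely as an analogy, that counts how many times a lazy "transform" runs when its output is consumed twice:

```C#
using System;
using System.Linq;

public static class LazyEvalDemo
{
    // Counts how many times the "transform" delegate executes when the lazy
    // sequence is consumed twice — once to build an intermediate view, and
    // once more when the chain is evaluated again.
    public static int CountTransformRuns()
    {
        int runs = 0;
        var transformed = Enumerable.Range(0, 3)
            .Select(x => { runs++; return x * 2; }); // lazy, like an IDataView

        _ = transformed.ToList();                    // first full pass (the intermediate view)
        _ = transformed.Select(x => x + 1).ToList(); // second full pass at "Fit" time
        return runs; // 3 elements x 2 passes = 6
    }

    public static void Main() => Console.WriteLine(CountTransformRuns()); // 6
}
```

Each full consumption of the lazy sequence re-executes the transform, just as each .Fit or .Transform over an unmaterialized IDataView re-runs the upstream transformations.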
With this in mind, let's talk about caching. The slow pipeline has .AppendCacheCheckpoint at the end. Having it there is meaningless in the context of producing transformedView in Line 38. When .Fit is called on transformedView in Line 39, the part of the operation that produces transformedView, to be consumed by .Fit on dataPipeline.Append(trainer), enjoys no benefits of caching. Caching only comes into play when you fit dataPipeline.Append(trainer) in Line 39, as this appended pipeline now includes caching.
So, to make the comparison apples-to-apples, I removed .AppendCacheCheckpoint from the fast pipeline (Line 73). I got the following times:
RunTime SLOW 00:01:49.31
RunTime FAST 00:01:38.20
The difference is due to noise, and the fact that the slow pipeline still fits and applies the dataPipeline twice. If I also change Line 39 to
```C#
var slowModel = trainer.Fit(transformedView);
```
I get
RunTime SLOW 00:01:34.11
RunTime FAST 00:01:34.96
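The payoff of a cache checkpoint before a multi-pass trainer can be sketched with the same plain-C# analogy (again, a sketch, not ML.NET's actual caching implementation): materializing once turns N passes over recomputed data into one computation plus N cheap passes.

```C#
using System;
using System.Collections.Generic;
using System.Linq;

public static class CacheDemo
{
    // Two passes over an uncached lazy sequence: every pass recomputes,
    // the way each iteration of a multi-pass trainer re-runs uncached transforms.
    public static int RunsWithoutCache()
    {
        int runs = 0;
        IEnumerable<int> expensive = Enumerable.Range(0, 4)
            .Select(x => { runs++; return x * x; });
        _ = expensive.Sum(); // pass 1
        _ = expensive.Sum(); // pass 2
        return runs; // 4 elements x 2 passes = 8 recomputations
    }

    // The same two passes after materializing once up front — the analogue of
    // placing a cache checkpoint right before the trainer.
    public static int RunsWithCache()
    {
        int runs = 0;
        List<int> cached = Enumerable.Range(0, 4)
            .Select(x => { runs++; return x * x; })
            .ToList(); // one computation pass, then reuse
        _ = cached.Sum(); // pass 1, no recomputation
        _ = cached.Sum(); // pass 2, no recomputation
        return runs; // 4
    }

    public static void Main()
    {
        Console.WriteLine(RunsWithoutCache()); // 8
        Console.WriteLine(RunsWithCache());    // 4
    }
}
```

With dozens of SDCA iterations instead of two passes, the gap grows accordingly, which is consistent with the minutes-vs-seconds timings above.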
The thing is, in the real project the method that creates the pipeline is the same for all trainers, and only the SdcaMaximumEntropy trainer goes wild.
I agree that the previous example had some confusing lines.
I created a simplified version of the previous example and used copy-paste to make sure I'm comparing the same code apples-to-apples.
It still has an issue: once I move the data preparation pipeline to a separate method, GetPipeline, caching is ignored.
https://github.com/artemiusgreat/MaxEntropyLoopDemo/blob/master/ModelBuilder.cs#L38
May I ask you to pull this repo one more time and check whether you can run both methods WITH cache?
**Slow code**
```C#
public static void CreateSlowModel(IDataView baseView)
{
    // If I move data pipeline creation to a separate method, the model becomes slow.
    // Replace this line with the body of GetPipeline and the slow model becomes fast.
    var dataPipeline = GetPipeline();
    var trainer = GetEstimator();
    dataPipeline.Append(trainer).Fit(baseView);
}

public static IEstimator<ITransformer> GetPipeline()
{
    return mlContext
        .Transforms
        .Conversion
        .MapValueToKey("Label", "Strategy")
        .Append(mlContext.Transforms.Concatenate("Combination", Selection.ToArray()))
        .Append(mlContext.Transforms.NormalizeMinMax(new[] { new InputOutputColumnPair("Features", "Combination") }))
        .AppendCacheCheckpoint(mlContext);
}
```
**Results**
RunTime FAST 00:00:07.52
RunTime SLOW 00:02:36.56
**Now merge the two methods into one**
```C#
public static void CreateSlowModel(IDataView baseView)
{
    var dataPipeline = mlContext
        .Transforms
        .Conversion
        .MapValueToKey("Label", "Strategy")
        .Append(mlContext.Transforms.Concatenate("Combination", Selection.ToArray()))
        .Append(mlContext.Transforms.NormalizeMinMax(new[] { new InputOutputColumnPair("Features", "Combination") }))
        .AppendCacheCheckpoint(mlContext);
    var trainer = GetEstimator();
    dataPipeline.Append(trainer).Fit(baseView);
}
```
**Results**
RunTime FAST 00:00:07.34
RunTime SLOW 00:00:07.63
@artemiusgreat This is happening because GetPipeline() returns an IEstimator<ITransformer>, so when you call .Append on dataPipeline in the slow pipeline, it goes to this method:
https://github.com/dotnet/machinelearning/blob/290da8222d4ebcf5c9c4fa134d27151d4ee69364/src/Microsoft.ML.Data/DataLoadSave/EstimatorExtensions.cs#L46-L49
Here, start is dataPipeline, which is an EstimatorChain<ITransformer>, and estimator is trainer, which is also an EstimatorChain<ITransformer>. This method then calls .Append twice on an empty EstimatorChain<ITransformer>, using this method:
https://github.com/dotnet/machinelearning/blob/290da8222d4ebcf5c9c4fa134d27151d4ee69364/src/Microsoft.ML.Data/DataLoadSave/EstimatorChain.cs#L87-L88
This returns a new EstimatorChain<ITransformer> whose non-public property _estimators is an IEstimator<ITransformer>[] of length 2, with two elements of type EstimatorChain<ITransformer>, the first one being dataPipeline and the second one being trainer. At the same time, the corresponding non-public property _needCacheAfter is a bool[] of length 2 with both elements being false. This is where you are losing the caching.
On the other hand, when you create the dataPipeline in the slow pipeline by copying the code of GetPipeline(), it does not get cast to IEstimator<ITransformer> but remains an EstimatorChain<ITransformer>. So, when you call .Append on this, it directly goes to the second method I linked above:
https://github.com/dotnet/machinelearning/blob/290da8222d4ebcf5c9c4fa134d27151d4ee69364/src/Microsoft.ML.Data/DataLoadSave/EstimatorChain.cs#L87-L88
Here, trainer, which is an EstimatorChain<ITransformer>, gets appended to the _estimators property of dataPipeline. What you get then is an EstimatorChain<ITransformer> whose _estimators is an IEstimator<ITransformer>[] of length 4, with the first three being the three estimators in dataPipeline, and the last being the trainer, which is an EstimatorChain<ITransformer>. The corresponding _needCacheAfter is a bool[] of length 4 with the third element (corresponding to the normalizer, i.e. the step right before the trainer) being true and the rest false. This is why you get caching when you do it like this.
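The underlying reason the declared type matters is that C# resolves instance methods and extension methods at compile time against the variable's static type. A minimal stand-in, where IEstimator and EstimatorChain are simplified stubs rather than the real ML.NET classes, demonstrates which Append gets picked:

```C#
using System;

// Simplified stubs, not the real ML.NET types: just enough to show that
// which Append runs is decided at compile time from the static type.
public interface IEstimator { }

public class EstimatorChain : IEstimator
{
    // Instance method: chosen when the static type is EstimatorChain,
    // analogous to EstimatorChain<TLast>.Append, which preserves _needCacheAfter.
    public string Append(IEstimator next) => "instance Append: cache flag kept";
}

public static class EstimatorExtensions
{
    // Extension method: the only candidate when the static type is IEstimator,
    // analogous to the extension that wraps both sides in a fresh chain
    // and drops the original cache flags.
    public static string Append(this IEstimator start, IEstimator next)
        => "extension Append: cache flag lost";
}

public static class AppendDemo
{
    public static string ViaConcrete()
    {
        EstimatorChain pipeline = new EstimatorChain(); // like building the chain inline
        return pipeline.Append(new EstimatorChain());
    }

    public static string ViaInterface()
    {
        IEstimator pipeline = new EstimatorChain(); // like a method returning IEstimator<ITransformer>
        return pipeline.Append(new EstimatorChain());
    }

    public static void Main()
    {
        Console.WriteLine(ViaConcrete());
        Console.WriteLine(ViaInterface());
    }
}
```

A variable typed as the interface can only see the extension method, so returning (or casting back to) the concrete chain type restores the instance Append that keeps the cache flags.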
So, coming to your problem, you can do one of the following:

1. Append the cache checkpoint explicitly when combining the two pipelines:

```C#
dataPipeline.AppendCacheCheckpoint(mlContext).Append(trainer2).Fit(baseView);
```

This gives you an EstimatorChain of length 2 containing two EstimatorChain objects, same as the first situation above, with the exception that _needCacheAfter will have the first element, corresponding to the dataPipeline, set to true.

2. Change the return type of GetPipeline() to EstimatorChain<NormalizingTransformer> (Line 44):

```C#
public static EstimatorChain<NormalizingTransformer> GetPipeline()
```

3. Cast dataPipeline to EstimatorChain<NormalizingTransformer> in Line 41:

```C#
(dataPipeline as EstimatorChain<NormalizingTransformer>).Append(trainer2).Fit(baseView);
```

For the last two options you will need a using Microsoft.ML.Transforms statement. This will give you the same behavior as in the second situation I described above, i.e. an EstimatorChain of length 4, with _needCacheAfter having the third element (corresponding to the normalizer) set to true.

Thank you for the investigation of this case.
The suggested change does fix the issue, both in the demo and in the real project.
So it wasn't an infinite loop, just excessive resource consumption and slow execution without the cache.
With the cache it works pretty fast.
**Final code**
I added the cache in the method that combines the data pipeline with the trainer.
```C#
public void GetPredictor(IEnumerable<string> columns, IDataView inputs)
{
    var estimator = GetEstimator();
    var pipeline = GetPipeline(columns);
    var estimatorModel = pipeline.AppendCacheCheckpoint(Context).Append(estimator).Fit(inputs);
}
```