Machinelearning: AutoML Regression Experiment fails after 67 iterations

Created on 2 Mar 2020 · 16 Comments · Source: dotnet/machinelearning

Hi,

When running a Regression Experiment, AutoML systematically fails after 67 iterations, raising the exception "All instances skipped due to missing features". From other issues, I got the idea that the SmacSweeper could be the cause. The stack trace suggests this too:

in Microsoft.ML.Trainers.FastTree.DataConverter.MemImpl.MakeBoundariesAndCheckLabels(Int64& missingInstances, Int64& totalInstances)
   in Microsoft.ML.Trainers.FastTree.DataConverter.MemImpl..ctor(RoleMappedData data, IHost host, Double[][] binUpperBounds, Single maxLabel, Boolean dummy, Boolean noFlocks, PredictionKind kind, Int32[] categoricalFeatureIndices, Boolean categoricalSplit)
   in Microsoft.ML.Trainers.FastTree.DataConverter.Create(RoleMappedData data, IHost host, Int32 maxBins, Single maxLabel, Boolean diskTranspose, Boolean noFlocks, Int32 minDocsPerLeaf, PredictionKind kind, IParallelTraining parallelTraining, Int32[] categoricalFeatureIndices, Boolean categoricalSplit)
   in Microsoft.ML.Trainers.FastTree.ExamplesToFastTreeBins.FindBinsAndReturnDataset(RoleMappedData data, PredictionKind kind, IParallelTraining parallelTraining, Int32[] categoricalFeaturIndices, Boolean categoricalSplit)
   in Microsoft.ML.Trainers.FastTree.FastTreeTrainerBase`3.ConvertData(RoleMappedData trainData)
   in Microsoft.ML.Trainers.FastTree.FastForestRegressionTrainer.TrainModelCore(TrainContext context)
   in Microsoft.ML.Trainers.TrainerEstimatorBase`2.TrainTransformer(IDataView trainSet, IDataView validationSet, IPredictor initPredictor)
   in Microsoft.ML.AutoML.SmacSweeper.FitModel(IEnumerable`1 previousRuns)
   in Microsoft.ML.AutoML.SmacSweeper.ProposeSweeps(Int32 maxSweeps, IEnumerable`1 previousRuns)
   in Microsoft.ML.AutoML.PipelineSuggester.SampleHyperparameters(MLContext context, SuggestedTrainer trainer, IEnumerable`1 history, Boolean isMaximizingMetric)
   in Microsoft.ML.AutoML.PipelineSuggester.GetNextInferredPipeline(MLContext context, IEnumerable`1 history, DatasetColumnInfo[] columns, TaskKind task, Boolean isMaximizingMetric, CacheBeforeTrainer cacheBeforeTrainer, IEnumerable`1 trainerWhitelist)
   in Microsoft.ML.AutoML.Experiment`2.Execute()
   in Microsoft.ML.AutoML.ExperimentBase`2.Execute(ColumnInformation columnInfo, DatasetColumnInfo[] columns, IEstimator`1 preFeaturizer, IProgress`1 progressHandler, IRunner`1 runner)
   in Microsoft.ML.AutoML.ExperimentBase`2.Execute(IDataView trainData, ColumnInformation columnInformation, IEstimator`1 preFeaturizer, IProgress`1 progressHandler)

However, unlike the other issues, I'm running a console application, I'm loading data from a database with no missing values, and I hopefully have the right NuGet dependencies:

  • Microsoft.ML.AutoML and Microsoft.ML.Recommender: 0.16.0
  • Microsoft.ML and all the other ML packages: 1.4.0

I understand that the problem might be caused by some of the third-party libraries ML depends on, but isn't it at least possible to ignore the exception thrown by a single trainer without compromising the whole regression experiment? I would like to be able to access the BestRun object and choose the best of the first 67 runs without having to look back at the CacheDirectory.
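One way to keep the best run seen so far, even if `Execute` later throws, might be to pass a progress handler to the experiment. The following is only a minimal sketch: `BestRunTracker` and `SafeExperiment` are hypothetical helper names (not part of ML.NET), R-squared is assumed to be the metric being maximized, and it assumes the `Execute` overload that accepts a `progressHandler` argument.

```csharp
using System;
using Microsoft.ML;
using Microsoft.ML.AutoML;
using Microsoft.ML.Data;

// Hypothetical helper (not part of ML.NET): remembers the best completed run.
public sealed class BestRunTracker : IProgress<RunDetail<RegressionMetrics>>
{
    public RunDetail<RegressionMetrics> Best { get; private set; }

    public void Report(RunDetail<RegressionMetrics> run)
    {
        // Runs whose trainer failed carry an exception and no metrics; skip them.
        if (run.Exception != null || run.ValidationMetrics == null)
            return;

        // Assumption: the experiment maximizes R-squared.
        if (Best == null || run.ValidationMetrics.RSquared > Best.ValidationMetrics.RSquared)
            Best = run;
    }
}

// Hypothetical wrapper: falls back to the tracked run if the experiment aborts.
public static class SafeExperiment
{
    public static RunDetail<RegressionMetrics> Run(RegressionExperiment experiment, IDataView data)
    {
        var tracker = new BestRunTracker();
        try
        {
            return experiment.Execute(data, progressHandler: tracker).BestRun;
        }
        catch (Exception ex)
        {
            Console.Error.WriteLine($"Experiment aborted: {ex.Message}");
            // May be null if no run completed before the failure.
            return tracker.Best;
        }
    }
}
```

Implementing `IProgress<T>` directly, rather than using `Progress<T>`, keeps the callback synchronous with the code that reports each run.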

If necessary, I can generate a csv with all the data used for training.

Thanks

AutoML.NET P2 bug

All 16 comments

Hi @francescomazzurco , please send along a .csv example with which we can reproduce this issue.

Hi @mstfbl, I'm now creating a small working example along with the .csv, but I am encountering difficulties in reproducing the issue. I'll dig into it and give you updates by the end of the day

Ok, I found the problem. I could reproduce the exception only on one of our computers, so I finally realised that the issue is related to culture, even when data is loaded from memory and there is no parsing. In the project I attached, data is parsed and loaded using invariant culture. Then, a non-English culture is set just before running the experiment.

```csharp
var mlContext = new MLContext();
List<Model> models = ReadCsv(@"data\data.csv");
var dataView = BuildDataView(mlContext, models);

var experimentSettings = new RegressionExperimentSettings
{
    MaxExperimentTimeInSeconds = 600,
    CacheDirectory = new DirectoryInfo(@".\cache"),
};
var experiment = mlContext.Auto().CreateRegressionExperiment(experimentSettings);

// Data has already been parsed using invariant culture
CultureInfo.DefaultThreadCurrentCulture = CultureInfo.CreateSpecificCulture("it-IT");
var bestRun = experiment.Execute(dataView).BestRun;
```

The exception is thrown after the 67th iteration.
TestML.zip

I have since seen other issues related to culture. I'm not sure whether they report the same problem, but if they do, feel free to close this issue. Thanks

@francescomazzurco: This should be fixed in the next release (v1.5.0-preview2). A fix was added in January to use the invariant culture when sweeping parameter values: https://github.com/dotnet/machinelearning/pull/4635/.
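To illustrate the general kind of failure a culture-sensitive round trip can cause, here is a small standalone sketch. It is not the SmacSweeper code, and the class name `CultureRoundTrip` is hypothetical. It formats a value with the it-IT culture and then parses the text back with the invariant culture.

```csharp
using System;
using System.Globalization;

public static class CultureRoundTrip
{
    public static void Main()
    {
        double value = 0.5;
        var italian = CultureInfo.CreateSpecificCulture("it-IT");

        // it-IT formats the number with a decimal comma.
        string italianText = value.ToString(italian);

        // The invariant culture treats the comma as a group separator,
        // so the parsed value no longer equals the original.
        double mismatched = double.Parse(italianText, CultureInfo.InvariantCulture);

        // Using the same culture for both steps round-trips correctly.
        string invariantText = value.ToString(CultureInfo.InvariantCulture);
        double matched = double.Parse(invariantText, CultureInfo.InvariantCulture);

        Console.WriteLine($"it-IT text: {italianText}, parsed as invariant: {mismatched}");
        Console.WriteLine($"invariant text: {invariantText}, parsed as invariant: {matched}");
    }
}
```

Formatting and parsing with one explicit culture on both sides avoids depending on whatever culture the current thread happens to have.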

You can test against the nightly NuGet feed by adding https://pkgs.dev.azure.com/dnceng/public/_packaging/MachineLearning/nuget/v3/index.json as a NuGet source in Visual Studio. Feed details: https://dev.azure.com/dnceng/public/_packaging?_a=connect&feed=MachineLearning.

Hi @justinormont, thanks for your reply.
I tested against the nightly build. No exception is thrown anymore, but the regression experiment hangs forever and does not complete the 68th training run. Nothing happens even after MaxExperimentTimeInSeconds has elapsed (I expected the experiment to abort after that time).
Interestingly, this behaviour only occurs when setting a non-English culture, so it seems that culture still affects the SmacSweeper.

I published the working example here: https://github.com/francescomazzurco/TestML

@LittleLittleCloud: Do you have time to investigate?

I will take a look

Hi, I am Diego S., from Italy.
I have the same issue.
CreateBinaryClassificationExperiment works,
but CreateRegressionExperiment fails.
It works only if I set
CultureInfo.DefaultThreadCurrentCulture = CultureInfo.CreateSpecificCulture("en-EN");
The data is good, with no nulls.
So I think it is a bug.
I get the data from a database.
Package: ML.AutoML 0.16
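A workaround along these lines could be scoped to the experiment alone, so the rest of the application keeps its own culture. The sketch below uses a hypothetical helper, `CultureScope` (not part of ML.NET or .NET), and assumes that the invariant culture avoids the problem in the same way an English culture does.

```csharp
using System;
using System.Globalization;

// Hypothetical helper: runs an action under a given culture, then restores the previous one.
public static class CultureScope
{
    public static T Run<T>(CultureInfo culture, Func<T> action)
    {
        var previousCurrent = CultureInfo.CurrentCulture;
        var previousDefault = CultureInfo.DefaultThreadCurrentCulture;
        try
        {
            // Covers the calling thread and any threads created without an explicit culture.
            CultureInfo.CurrentCulture = culture;
            CultureInfo.DefaultThreadCurrentCulture = culture;
            return action();
        }
        finally
        {
            CultureInfo.CurrentCulture = previousCurrent;
            CultureInfo.DefaultThreadCurrentCulture = previousDefault;
        }
    }
}
```

With the snippet from earlier in the thread, usage might look like `var bestRun = CultureScope.Run(CultureInfo.InvariantCulture, () => experiment.Execute(dataView).BestRun);`.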

Quick update: I just tested against v.0.17.1 and the bug is still there. Same behavior: the 68th iteration hangs forever and never completes.

@francescomazzurco: I believe this is fixed now. It will be available in the next release. Or you can run against the nightly build, as outlined above.

@justinormont I just tested against v.0.17.3-29420-1 from October 20th, but the bug is still there. I see there are newer builds, but I am not able to install them because NuGet cannot find the package _MlNetMklDepsCode_.

@francescomazzurco: You'll need a nightly build or release after 2020-10-30 as the fix went in then.

@harishsk: Any guess why the nightly won't install for @francescomazzurco?

@justinormont @francescomazzurco

As part of moving into arcade, we've published some nugets that have a bug: they require the MlNetMklDepsCode nuget to work. We're working on fixing it. Those nugets should be ignored for the time being.

Also, there have been some problems with publishing nugets from master (which are the ones required by @francescomazzurco), so I believe no nuget has been published correctly from master since October 20th. I therefore don't think any public nuget includes the October 30 change Justin is referring to. This problem was on the Azure DevOps side and should be fixed now. I'll run a manual build to publish nugets from the master branch, and hopefully it will work. I'll update this thread with info about that. Thanks.

There are some problems with our nuget publishing pipeline. Working on that now, I'll update this thread once the nuget is published.

The nugets have just been published to the public feed.
@francescomazzurco, please try version 0.17.3-29530-4 from the feed. It should work now.
Thanks.

I was able to successfully install the most recent build from today ( 0.17.3-29602-5 ) which indeed solves the bug. Feel free to close the issue. Thanks for the support
