machinelearning: Image Classification Infinite Training Loop

Created on 30 Jul 2020 · 12 Comments · Source: dotnet/machinelearning

System information

  • Windows 10 Enterprise
  • .NET 4.7.03190
  • Visual Studio Professional 2019 16.6.2

Issue

  • What did you do?
    I am attempting to run image classification training for the first time, with about 5,000 images in 5 sub-folders for tagging as required. My PC does not have a dedicated GPU, if that matters; the training uses my CPU cores at 100% during the "bottleneck" phase.

  • What happened?
    The training runs through a loop that repeats over and over. In the longest run, the loop repeated more than 12 times over 3 hours before I cancelled it. I will attach the output log. I tried creating an entirely new solution and got the same behavior. I also tried reducing the training set to 2,000 images in 2 sub-folders, but that showed the same behavior.

  • What did you expect?
    I expected the training to complete after one loop through the images, since this is what the documentation seems to say.

Source code / logs

image classifier log

P1 bug

All 12 comments

To add to Jesse's explanation of the issue, the ML.NET documentation here says that the bottleneck phase should only happen once, not 12+ times.

Hi @jessewinkler,

It's hard to say where the issue is coming from with the logs alone. Would you be able to provide your code?

@jessewinkler was this model trained using Model Builder?

There is no code. Yes, it's just using Model Builder on an empty C# console project: click image classification, select the data folder, click train. Thanks

With Model Builder, when the number of images is below a certain threshold, cross-validation is used to help build a more robust model. When using the API directly, it is correct that the bottleneck phase is a one-time calculation whose values can be cached for later training runs. With cross-validation, however, the bottleneck phase is performed multiple times (once per fold), because each fold trains on a different sample of the data.
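The fold-by-fold behavior described above can be sketched as follows. This is a conceptual Python sketch, not ML.NET code; all names here are illustrative stand-ins:

```python
# Conceptual sketch: why the "bottleneck" phase repeats under k-fold
# cross-validation. compute_bottleneck_features is a stand-in for the
# expensive CNN featurization pass.

def compute_bottleneck_features(images):
    """Placeholder for the one-time-per-dataset featurization pass."""
    return [hash(img) % 1000 for img in images]

def k_fold_cross_validate(images, k=10):
    fold_size = len(images) // k
    runs = 0
    for fold in range(k):
        start, stop = fold * fold_size, (fold + 1) * fold_size
        validation = images[start:stop]
        training = images[:start] + images[stop:]
        # Each fold trains on a different subset, so cached bottleneck
        # values can't be reused and the phase runs again here:
        compute_bottleneck_features(training)
        runs += 1
    return runs

# With 10-fold CV the bottleneck phase runs 10 times, matching the
# repeated "bottleneck" loops reported in the log.
print(k_fold_cross_validate([f"img{i}.jpg" for i in range(5000)]))  # → 10
```

Each fold's training subset omits a different tenth of the images, which is also consistent with the per-fold image counts varying slightly in the attached log.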

@justinormont feel free to correct me or provide additional feedback.

@luisquintanilla: AutoML does automatically use cross-validation when the number of examples is low. This is 10-fold CV, and the user is reporting 12 runs. The CV may contribute to it, though by itself it doesn't account for the loop.

Interesting. Looking at the log file, the total count in the bottleneck phases varies a bit around the 4,400 to 4,600 mark, which would imply the folds were different rather than identical each time. I will set it up to run overnight and see if that tells us anything new.

@luisquintanilla Could you provide documentation on how much data is needed to avoid cross validation? The dataset Jesse is using has 5 classes with 900-1,000 images each.

@rebecca-burwei : 15k rows of data (aka images):
https://github.com/dotnet/machinelearning/blob/6bae29fc342bf192a36a69484d62db8d6266f8df/src/Microsoft.ML.AutoML/API/ExperimentBase.cs#L112-L115

Cross-validation returns more useful metrics than a single train/validate split, as it averages across 10 folds, in turn giving less noisy metrics. This is important for highly skewed datasets, smaller datasets (as in this case), and when the metrics are close to 0.0/1.0, where noise can otherwise dominate and drown out the signal.

Upsampling to hit the 15k
Do note you'll leak information between the cross-validation folds if you replicate the images to hit the 15k limit. Because Model Builder doesn't expose a SamplingKeyColumn, duplicate images will end up in both the training split and the validation split, making your metrics artificially high. You can try it, but your metrics will no longer be meaningful: they would measure only how well the model memorizes the images instead of its ability to generalize.
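The leakage mechanism can be illustrated with a small sketch. Here `group_folds_leak` plays the role a SamplingKeyColumn would in the ML.NET API, keeping all copies of a source image on the same side of each split; the code and names are illustrative, not ML.NET internals:

```python
# Sketch of cross-validation leakage from duplicated images.
import random

def has_leakage(items, k=5):
    """Naive k-fold over duplicated items: does any source image appear
    on both the training and validation side of some fold?"""
    random.seed(0)
    shuffled = items[:]
    random.shuffle(shuffled)
    fold_size = len(shuffled) // k
    for fold in range(k):
        start, stop = fold * fold_size, (fold + 1) * fold_size
        val = set(shuffled[start:stop])
        train = set(shuffled[:start] + shuffled[stop:])
        if val & train:  # same source image on both sides
            return True
    return False

originals = [f"img{i}" for i in range(100)]
upsampled = originals * 3           # replicate images to inflate row count
print(has_leakage(upsampled))       # → True: copies straddle train/validation

def group_folds_leak(items, k=5):
    """Group-aware splitting: assign each unique source image to exactly
    one fold, so no image can straddle train and validation."""
    folds = [set() for _ in range(k)]
    for i, item in enumerate(sorted(set(items))):
        folds[i % k].add(item)
    for fold in range(k):
        train = set().union(*(f for i, f in enumerate(folds) if i != fold))
        if folds[fold] & train:
            return True
    return False

print(group_folds_leak(upsampled))  # → False: folds are disjoint by image
```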

Custom dataset splits
In the future, Model Builder _may_ support custom splits of your dataset, where you provide the Training, Validation, and Test datasets. At which point, the training will use those splits instead of using cross validation for small datasets.

You can use the AutoML API directly and provide your own custom dataset split.

Simply follow the AutoML Multi-class Classification Sample -- https://github.com/dotnet/machinelearning-samples/tree/master/samples/csharp/getting-started/MulticlassClassification_AutoML

Setup your Train/Validate/Test dataset as:

Label,filepath
Cat,folder/file1.jpg
Dog,folder/file2.jpg
Cat,folder/file3.jpg
Cat,folder/file4.jpg
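As a sketch, a label file in that format can be generated from a folder-per-class layout like the one described in this issue. The helper below is hypothetical, assuming one sub-folder per label containing that label's images:

```python
# Sketch: build a Label,filepath file from a folder-per-class layout,
# e.g. data/Cat/*.jpg, data/Dog/*.jpg. Paths and names are illustrative.
import csv
import os

def write_label_file(root, out_path):
    with open(out_path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["Label", "filepath"])
        for label in sorted(os.listdir(root)):       # one sub-folder per class
            class_dir = os.path.join(root, label)
            if not os.path.isdir(class_dir):         # skip stray files
                continue
            for name in sorted(os.listdir(class_dir)):
                writer.writerow([label, os.path.join(class_dir, name)])
```

You would run this separately on your training, validation, and test folders to produce the three pre-split datasets the AutoML API accepts.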

I reran overnight on the larger set of 5000, and it did complete after exactly 10 iterations. If there was a bug here, it didn't reproduce.

Thanks @justinormont ! Appreciate the thoroughness of your explanation. :)

If I were to upsample using image augmentation (i.e., add images to my dataset by performing random rotations, translations, skews, etc), would I run into the same leakage problem? In other words, does Model Builder also perform image augmentation?

@rebecca-burwei : Quite welcome.

The remote Azure training of Model Builder uses image augmentations. The local training/refitting of the image model does not, though the original model was likely pre-trained with augmentations.

If using Model Builder, you would run into the same leakage by manually upsampling with image augmentation; you could still do it, though it would make the validation metrics artificially high and could theoretically interfere with early stopping. You can score a separate hold-out test dataset after the training process to get correct metrics.

You can call the AutoML API directly as it allows for pre-split datasets (training, validation, test). To avoid the leakage, you can create your own training dataset splits, upsample the training dataset images with augmentations, then call AutoML API directly using these splits.
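The leakage-free workflow described above amounts to "split first, then augment only the training split". A minimal sketch, where `augment` is a hypothetical stand-in for the rotations/translations/skews mentioned earlier:

```python
# Sketch: split the dataset before augmenting, so no augmented copy of a
# validation or test image can leak into training. Names are illustrative.
import random

def split(items, train_frac=0.8, val_frac=0.1, seed=42):
    """Shuffle once, then carve out disjoint train/validation/test slices."""
    rng = random.Random(seed)
    shuffled = items[:]
    rng.shuffle(shuffled)
    n_train = int(len(shuffled) * train_frac)
    n_val = int(len(shuffled) * val_frac)
    return (shuffled[:n_train],
            shuffled[n_train:n_train + n_val],
            shuffled[n_train + n_val:])

def augment(image_path):
    # Hypothetical: would produce rotated/translated/skewed variants.
    return [f"{image_path}#rot", f"{image_path}#skew"]

images = [f"img{i}.jpg" for i in range(1000)]
train, val, test = split(images)
# Upsample the training split only; val and test stay untouched.
train_upsampled = train + [aug for img in train for aug in augment(img)]
```

The upsampled training split and the untouched validation/test splits can then be passed to the AutoML API as the pre-split datasets described above.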
