Machinelearning: Models built with SDCA trainers and a seeded MLContext get different accuracy values

Created on 13 Oct 2020 · 4 Comments · Source: dotnet/machinelearning

System information

  • Windows 10:
  • .NET 5 RC1 & RC2:

Issue

  • I build a model using an MLContext with the seed parameter set, and I use SdcaMaximumEntropy or SdcaNonCalibrated.
  • The accuracy fluctuates with every build.
  • I expect the accuracy to be the same. With other trainers such as LightGbm, the accuracy is identical on every build.

Source code / logs

You can find the notebook here: https://github.com/dcostea/SmartFireAlarm/blob/master/SmartFireAlarm/Jupyter/sample.ipynb
I have extracted here the code:

#r "nuget:Microsoft.ML,1.5.2"
#r "nuget:Microsoft.ML.LightGBM,1.5.2"
using Microsoft.ML;
using Microsoft.ML.Trainers.LightGbm;
using Microsoft.ML.Data;

MLContext mlContext = new MLContext(seed: 123);

const string TRAIN_DATASET_PATH = "./sensors_data_train.csv";
IDataView trainingData = mlContext.Data.LoadFromTextFile<ModelInput>(
    path: TRAIN_DATASET_PATH,
    hasHeader: true,
    separatorChar: ',');

const string TEST_DATASET_PATH = "./sensors_data_test.csv";
IDataView testingData = mlContext.Data.LoadFromTextFile<ModelInput>(
    path: TEST_DATASET_PATH,
    hasHeader: true,
    separatorChar: ',');

var featureColumns = new string[] { "Temperature", "Luminosity", "Infrared", "Distance" };

var trainingPipeline = mlContext.Transforms.Conversion.MapValueToKey("Label")
    .Append(mlContext.Transforms.Concatenate("Features", featureColumns))
    .Append(mlContext.MulticlassClassification.Trainers.SdcaMaximumEntropy("Label", "Features"))
    .Append(mlContext.Transforms.Conversion.MapKeyToValue("PredictedLabel"));

var model = trainingPipeline.Fit(trainingData);

var predictions = model.Transform(testingData);
var metrics = mlContext.MulticlassClassification.Evaluate(predictions, "Label", "Score", "PredictedLabel");

Labels: Awaiting User Input, bug, classification, perf

All 4 comments

SDCA still has a certain degree of non-determinism even after the seed is set, due to things like multi-threading. You can reduce it by setting NumberOfThreads to 1 in the trainer's Options.
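For readers who have not used the Options overload before, here is a minimal sketch of what setting NumberOfThreads to 1 might look like in a notebook cell. The `Row` class, the `X1`/`X2` columns and the synthetic data are hypothetical stand-ins so the cell runs without the CSV files from the issue. Only the `Options` object and the `SdcaMaximumEntropy(options)` call are the point.

```csharp
#r "nuget:Microsoft.ML,1.5.2"
using System;
using System.Linq;
using Microsoft.ML;
using Microsoft.ML.Trainers; // namespace of SdcaMaximumEntropyMulticlassTrainer

// Hypothetical row type, used only to keep this sketch self-contained.
public class Row
{
    public string Label { get; set; }
    public float X1 { get; set; }
    public float X2 { get; set; }
}

// Synthetic three-class data (an assumption, not the data from the issue).
var rng = new Random(0);
var rows = Enumerable.Range(0, 300).Select(i => new Row
{
    Label = (i % 3).ToString(),
    X1 = (i % 3) + (float)rng.NextDouble(),
    X2 = (float)rng.NextDouble()
}).ToList();

var mlContext = new MLContext(seed: 123);
IDataView data = mlContext.Data.LoadFromEnumerable(rows);

// Single-threaded SDCA: removes thread scheduling as a source of variation.
var options = new SdcaMaximumEntropyMulticlassTrainer.Options
{
    LabelColumnName = "Label",
    FeatureColumnName = "Features",
    NumberOfThreads = 1
};

var pipeline = mlContext.Transforms.Conversion.MapValueToKey("Label")
    .Append(mlContext.Transforms.Concatenate("Features", "X1", "X2"))
    .Append(mlContext.MulticlassClassification.Trainers.SdcaMaximumEntropy(options));

var model = pipeline.Fit(data);

// Evaluated on the training data purely to show the call; not a real benchmark.
var metrics = mlContext.MulticlassClassification.Evaluate(model.Transform(data));
Console.WriteLine($"MicroAccuracy: {metrics.MicroAccuracy:F4}");
```

Running with one thread will likely make training slower on large datasets, so this trades speed for more repeatable results.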

Hi @dcostea ,

Has the answer above on setting NumberOfThreads in SdcaMaximumEntropyMulticlassTrainer.Options solved your issue? If so, please feel free to close this issue. If not, please confirm whether the same problem persists or a different issue is now occurring. Thanks!

I was looking forward to verifying the solution. I had to deliver a talk this evening and I have only just become free. Give me a few minutes to try it.

I can see a good improvement but, as anticipated, there is still a little non-determinism.
By default, with multiple threads, the accuracy used to fluctuate by up to 4-5 percent.

.Append(mlContext.MulticlassClassification.Trainers.SdcaMaximumEntropy(new SdcaMaximumEntropyMulticlassTrainer.Options { NumberOfThreads = 1 }))

These are the measurements obtained with the above code:
MicroAcc 94.66 95.33 94.66 94.66 95.33 94.66 94.66 94.66 95.33
MacroAcc 95.06 96.06 95.06 95.06 96.06 95.06 95.06 95.06 96.06

As you can see, MicroAcc fluctuates by less than 1 percent and MacroAcc fluctuates by 1 percent.
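The fluctuation can be checked directly from the numbers above. The following is a small sketch that takes the difference between the largest and smallest value in each row. `Spread` is a hypothetical helper name, and the arrays are copied from the measurements listed in this comment.

```csharp
using System;
using System.Linq;

// Values copied from the measurements reported above (percent).
double[] microAcc = { 94.66, 95.33, 94.66, 94.66, 95.33, 94.66, 94.66, 94.66, 95.33 };
double[] macroAcc = { 95.06, 96.06, 95.06, 95.06, 96.06, 95.06, 95.06, 95.06, 96.06 };

// Spread = largest minus smallest value across the runs.
double Spread(double[] values) => values.Max() - values.Min();

Console.WriteLine($"MicroAcc spread: {Spread(microAcc):F2}");
Console.WriteLine($"MacroAcc spread: {Spread(macroAcc):F2}");
```

This gives a spread of about 0.67 percentage points for MicroAcc and about 1.00 for MacroAcc, compared with the 4-5 percent seen with the default multi-threaded setting.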

I will close the issue. Thank you for the improvement tip!

cc @harishsk @mstfbl
