Machinelearning: Models built with SDCA trainers and a seeded MLContext get different accuracy values

Created on 13 Oct 2020 · 4 Comments · Source: dotnet/machinelearning

System information

OS: Windows 10
.NET version: .NET 5 RC1 and RC2

Issue

I build a model using an MLContext with the seed parameter set, and I use the SdcaMaximumEntropy or SdcaNonCalibrated trainer.
The accuracy fluctuates with every build.
I expect the accuracy to be the same. With other trainers such as LightGbm, the accuracy is consistent: it is the same with every build.

Source code / logs

You can find the notebook here: https://github.com/dcostea/SmartFireAlarm/blob/master/SmartFireAlarm/Jupyter/sample.ipynb
I have extracted the code here:

#r "nuget:Microsoft.ML,1.5.2"
#r "nuget:Microsoft.ML.LightGBM,1.5.2"
using Microsoft.ML;
using Microsoft.ML.Trainers.LightGbm;
using Microsoft.ML.Data;

MLContext mlContext = new MLContext(seed: 123);

const string TRAIN_DATASET_PATH = "./sensors_data_train.csv";
IDataView trainingData = mlContext.Data.LoadFromTextFile<ModelInput>(
    path: TRAIN_DATASET_PATH,
    hasHeader: true,
    separatorChar: ',');

const string TEST_DATASET_PATH = "./sensors_data_test.csv";
IDataView testingData = mlContext.Data.LoadFromTextFile<ModelInput>(
    path: TEST_DATASET_PATH,
    hasHeader: true,
    separatorChar: ',');

var featureColumns = new string[] { "Temperature", "Luminosity", "Infrared", "Distance" };

var trainingPipeline = mlContext.Transforms.Conversion.MapValueToKey("Label")
    .Append(mlContext.Transforms.Concatenate("Features", featureColumns))
    .Append(mlContext.MulticlassClassification.Trainers.SdcaMaximumEntropy("Label", "Features"))
    .Append(mlContext.Transforms.Conversion.MapKeyToValue("PredictedLabel"));

var model = trainingPipeline.Fit(trainingData);

var predictions = model.Transform(testingData);
var metrics = mlContext.MulticlassClassification.Evaluate(predictions, "Label", "Score", "PredictedLabel");

Labels: Awaiting User Input, bug, classification, perf

dcostea

All 4 comments

SDCA retains some non-determinism even after the seed is set, due to factors such as multi-threading. You can reduce it by setting NumberOfThreads to 1 in the trainer's Options.

harishsk on 19 Oct 2020
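
The suggestion above might look roughly like the following in a .NET Interactive cell. This is only a sketch under stated assumptions: the SensorRow class, the two feature columns, and the synthetic rows are hypothetical stand-ins for the ModelInput class and CSV files from the linked notebook, and the Options type needs the Microsoft.ML.Trainers namespace.

```csharp
#r "nuget:Microsoft.ML,1.5.2"
using System;
using System.Collections.Generic;
using Microsoft.ML;
using Microsoft.ML.Trainers;

// Hypothetical input schema; the real ModelInput is defined in the notebook.
public class SensorRow
{
    public float Temperature { get; set; }
    public float Luminosity { get; set; }
    public string Label { get; set; }
}

var mlContext = new MLContext(seed: 123);

// Tiny synthetic data set, for illustration only.
var rows = new List<SensorRow>();
for (int i = 0; i < 60; i++)
{
    bool fire = i % 2 == 0;
    rows.Add(new SensorRow
    {
        Temperature = (fire ? 80f : 20f) + i % 5,
        Luminosity = fire ? 0.9f : 0.1f,
        Label = fire ? "Fire" : "Normal"
    });
}
IDataView data = mlContext.Data.LoadFromEnumerable(rows);

// Restrict SDCA to a single thread to reduce run-to-run variation.
var options = new SdcaMaximumEntropyMulticlassTrainer.Options
{
    LabelColumnName = "Label",
    FeatureColumnName = "Features",
    NumberOfThreads = 1
};

var pipeline = mlContext.Transforms.Conversion.MapValueToKey("Label")
    .Append(mlContext.Transforms.Concatenate("Features", "Temperature", "Luminosity"))
    .Append(mlContext.MulticlassClassification.Trainers.SdcaMaximumEntropy(options));

var model = pipeline.Fit(data);

// Evaluated on the training rows only to keep the sketch short.
var metrics = mlContext.MulticlassClassification.Evaluate(model.Transform(data), "Label");
Console.WriteLine($"MicroAccuracy: {metrics.MicroAccuracy:F4}");
```

Single-threaded training is usually slower, so this trades training speed for more repeatable metrics.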

Hi @dcostea ,

Has the answer above about setting NumberOfThreads in SdcaMaximumEntropyMulticlassTrainer.Options solved your issue? If so, please feel free to close this issue. If not, please confirm whether a different issue is now occurring or the same problem persists. Thanks!

mstfbl on 19 Oct 2020

I was looking forward to verifying the solution. I had to deliver a talk this evening and I have only just become free. Give me a few minutes to try it.

dcostea on 19 Oct 2020

I can see a good improvement but, as anticipated, some non-determinism remains.
With the default multi-threaded setting, the accuracy used to fluctuate by up to 4-5 percent.

.Append(mlContext.MulticlassClassification.Trainers.SdcaMaximumEntropy(new SdcaMaximumEntropyMulticlassTrainer.Options { NumberOfThreads = 1 }))

These are the measurements obtained with the above code:
MicroAcc 94.66 95.33 94.66 94.66 95.33 94.66 94.66 94.66 95.33
MacroAcc 95.06 96.06 95.06 95.06 96.06 95.06 95.06 95.06 96.06

As you can see, MicroAcc fluctuates by less than 1 percent and MacroAcc fluctuates by 1 percent.
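
As a quick check of those figures, the spread (highest run minus lowest run) can be computed directly from the nine measurements listed above. This is a small illustrative sketch; the Spread helper is a name introduced here and is not part of ML.NET.

```csharp
using System;
using System.Globalization;
using System.Linq;

// Accuracy values (percent) copied from the measurements above.
double[] microAcc = { 94.66, 95.33, 94.66, 94.66, 95.33, 94.66, 94.66, 94.66, 95.33 };
double[] macroAcc = { 95.06, 96.06, 95.06, 95.06, 96.06, 95.06, 95.06, 95.06, 96.06 };

// Hypothetical helper: spread = highest run minus lowest run, in percentage points.
double Spread(double[] runs) => runs.Max() - runs.Min();

string micro = Spread(microAcc).ToString("F2", CultureInfo.InvariantCulture);
string macro = Spread(macroAcc).ToString("F2", CultureInfo.InvariantCulture);

Console.WriteLine($"MicroAcc spread: {micro}"); // prints "MicroAcc spread: 0.67"
Console.WriteLine($"MacroAcc spread: {macro}"); // prints "MacroAcc spread: 1.00"
```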

I will close the issue. Thank you for the improvement tip!

cc @harishsk @mstfbl

dcostea on 19 Oct 2020
