Machinelearning: Multidimensional Vectors causing AutoML to throw null reference exception

Created on 22 Oct 2020 · 4 Comments · Source: dotnet/machinelearning

System information

  • OS version/distro: Win10
  • .NET Version (eg., dotnet --info): 3.1.7

Issue

  • What did you do?
    Tried to use a two-dimensional float array as a vector
  • What happened?
    Got a null reference exception when trying to run an experiment
  • What did you expect?
    The experiment to run

The documentation seems to indicate my setup is correct: https://docs.microsoft.com/en-us/dotnet/api/microsoft.ml.data.vectortypeattribute.-ctor?view=ml-dotnet#Microsoft_ML_Data_VectorTypeAttribute__ctor_System_Int32___

To fix my multidimensional vector problem, I've tried:

  • Adding/Removing the float[,] initializers in InputData
  • Specifying the exact size with [VectorType(3,60)]
  • Leaving the [VectorType] attribute off altogether and using autoschema to set it.
  • Leaving the [VectorType] attribute off altogether and not using autoschema, to let ML.NET figure it out on its own
  • Adding just [VectorType()], although the docs say that is for single-dimensional arrays.

Source code / logs

Here is a minimal reproduction of the issue:

class Program
{
    static void Main(string[] args)
    {
        var mlContext = new MLContext();

        // create schema for multidimensional vector
        var autoSchema = SchemaDefinition.Create(typeof(InputData));
        var col = autoSchema[1];
        col.ColumnType = new VectorDataViewType(NumberDataViewType.Single, 3, 60);

        // fabricate some data
        var trainingData = new List<InputData>();
        var inputData = new InputData();
        inputData.MultiDimensional = new float[20,20];
        // GetLength gives the element count per dimension, so every cell is filled
        for (int i = 0; i < inputData.MultiDimensional.GetLength(0); i++)
        {
            for (int j = 0; j < inputData.MultiDimensional.GetLength(1); j++)
            {
                inputData.MultiDimensional[i,j] = 5; // doesn't matter
            }
        }
        trainingData.Add(inputData);

        // setup a data view
        IDataView trainingDataView = mlContext.Data.LoadFromEnumerable<InputData>(trainingData, autoSchema);

        // preview it (goes BOOM)
        var preview = trainingDataView.Preview();

        // run the experiment
        var settings = new BinaryExperimentSettings();
        settings.MaxExperimentTimeInSeconds = 60;
        ExperimentResult<BinaryClassificationMetrics> experimentResult = mlContext.Auto()
            .CreateBinaryClassificationExperiment(settings)
            .Execute(trainingDataView);
    }
}

public class InputData
{
    public bool Label { get; set; }
    public float[,] MultiDimensional { get; set; }
}


loadsave question

All 4 comments

Hi @bsambrone,

Thank you for letting us know about this issue. I've successfully replicated the error, and I'm investigating now. I'll get back to you once I figure out the source of the error. The issue stems from the way the data in IDataView trainingDataView is saved and/or loaded for .Preview(), so it's not an AutoML Experiment bug. Thanks.

Hi @mstfbl, are you aware of any workarounds for multidimensional vectors that I can use for training with AutoML until the fix is ready?

Possibly related: if I flatten my multidimensional vectors into one giant single-dimensional vector, how does that affect training? If the impact is minimal apart from time or memory usage, I can go down that road, provided the underlying algorithms don't care how many dimensions the vector has and ultimately just expect a big ol' matrix of numbers (is that a true statement?).

Hi @bsambrone,

After further investigation, I can see that InputData is set up incorrectly. ML.NET does not allow multidimensional arrays.

From the documentation you've linked:

Notice that this attribute is expected to be added to one dimensional arrays, and it shouldn't be added to multidimensional arrays...

In addition, please take a look at this comment, which explains how to set up the schema without multidimensional arrays. You need to use a flat array to represent the prediction values. Here is an example based on your sample code:

public static void Main(string[] args)
{
    var mlContext = new MLContext();

    // fabricate some data
    var trainingData = new List<InputData>();
    // one shared generator for all rows
    var random = new System.Random();
    for (int i = 0; i < 10; i++)
    {
        var inputData = new InputData();
        // 400 values = 20 x 20, matching [VectorType(20, 20)] on Predictions
        inputData.Predictions = new float[400];
        inputData.Label = i % 2 == 0;
        for (int j = 0; j < inputData.Predictions.Length; j++)
        {
            inputData.Predictions[j] = (float)random.NextDouble();
        }
        trainingData.Add(inputData);
    }

    // setup a data view
    IDataView trainingDataView = mlContext.Data.LoadFromEnumerable<InputData>(trainingData);

    // preview it
    var preview = trainingDataView.Preview();

    // run the experiment
    var settings = new BinaryExperimentSettings();
    settings.MaxExperimentTimeInSeconds = 60;
    ExperimentResult<BinaryClassificationMetrics> experimentResult = mlContext.Auto()
        .CreateBinaryClassificationExperiment(settings)
        .Execute(trainingDataView);
}

public class InputData
{
    public bool Label { get; set; }

    [VectorType(20, 20)]
    public float[] Predictions { get; set; }
}
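
For data that already lives in a float[,], the flat array can be produced with a small helper. The following is a minimal sketch, not part of ML.NET: the ArrayFlattening class and its Flatten method are hypothetical names introduced here for illustration, and it assumes row-major order, so element [i, j] lands at index i * columns + j.

```csharp
using System;

// Hypothetical helper for illustration; not an ML.NET API.
public static class ArrayFlattening
{
    // Copies a rectangular float[,] into a new float[] in row-major order,
    // so source[i, j] is stored at index i * columns + j.
    public static float[] Flatten(float[,] source)
    {
        if (source == null)
        {
            throw new ArgumentNullException(nameof(source));
        }

        int rows = source.GetLength(0);
        int columns = source.GetLength(1);
        var flat = new float[rows * columns];

        for (int i = 0; i < rows; i++)
        {
            for (int j = 0; j < columns; j++)
            {
                flat[i * columns + j] = source[i, j];
            }
        }

        return flat;
    }
}
```

For a 20 x 20 input, Flatten returns 400 values, and the value at [2, 3] ends up at index 2 * 20 + 3 = 43. The result could then be assigned to a float[] property such as Predictions above.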

I'm closing this issue as it has been resolved. Please feel free to ask any additional questions. Thanks.

Hi @mstfbl, thanks for the update!

I tried running the modified code you provided, and I am getting an exception during the experiment run:
Exception has occurred: CLR/System.InvalidOperationException
An unhandled exception of type 'System.InvalidOperationException' occurred in Microsoft.ML.AutoML.dll: 'Training failed with the exception: System.ArgumentNullException: Value cannot be null. (Parameter 'items')

Is there something I need to change for it to run? I see that you are setting the labels as well as the data, so it's unclear to me what is null. I also gather that you set the length of Predictions to 400 because the VectorType is (20, 20), so 20 x 20 = 400. Let me know if I inferred that incorrectly.

Also, to make sure I understand what is happening with the single vector vs. multidimensional: is it fair to say that, at the end of the day, ML.NET (or machine learning in general) operates on matrices of data and doesn't care how many "columns" they are broken up into? Is that why I see non-AutoML examples taking every single column (except for the label) and mashing them into a single features column? I'm new to ML, so I wanted to double-check that I got this concept right.
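
For context on the "single features column" pattern mentioned above, here is a minimal sketch of what those non-AutoML examples typically do. It assumes the Microsoft.ML NuGet package is referenced; the SensorRow type, its A and B columns, and the ConcatenateSketch class are hypothetical names made up for illustration. It only shows the concatenation step and is not a fix for the exception reported in this thread.

```csharp
using System.Collections.Generic;
using Microsoft.ML;
using Microsoft.ML.Data;

// Hypothetical row type for illustration; not from the thread.
public class SensorRow
{
    public bool Label { get; set; }
    public float A { get; set; }
    public float B { get; set; }
}

public static class ConcatenateSketch
{
    // Builds a tiny data view and appends a "Features" vector column
    // made of the scalar columns A and B.
    public static IDataView BuildFeatures()
    {
        var mlContext = new MLContext();

        var rows = new List<SensorRow>
        {
            new SensorRow { Label = true,  A = 1f, B = 2f },
            new SensorRow { Label = false, A = 3f, B = 4f },
        };
        IDataView data = mlContext.Data.LoadFromEnumerable(rows);

        // Concatenate combines the named input columns into one vector column.
        var pipeline = mlContext.Transforms.Concatenate(
            "Features", nameof(SensorRow.A), nameof(SensorRow.B));

        return pipeline.Fit(data).Transform(data);
    }
}
```

In this sketch the resulting "Features" column is a vector of two Single values per row, which is the flat shape that trainers are usually pointed at.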
