Machinelearning: Schema mismatch for feature column 'Features': expected Vector<Single>, got VarVector<Single>

Created on 27 Mar 2020 · 3 Comments · Source: dotnet/machinelearning

System information

  • OS version/distro: Windows 10
  • .NET Version (eg., dotnet --info): 3.1

Issue

  • What did you do? Train a model from a JSON file
  • What happened? Schema mismatch for feature column 'Features': expected Vector<Single>, got VarVector<Single>

Source code / logs

Here is the code to train the model:
```C#
private static ITransformer trainWithJson(MLContext mlContext)
{
    using (StreamReader r = new StreamReader("datasetCrewCleaned.json"))
    {
        string json = r.ReadToEnd();
        List<FilmModel> items = JsonConvert.DeserializeObject<List<FilmModel>>(json);
        List<FilmModel> itemsTrain = new List<FilmModel>();
        for (int i = 0; i < items.Count; i++)
        {
            if (i > 24000)
            {
                break;
            }
            itemsTrain.Add(items[i]);
        }
        var data = mlContext.Data.LoadFromEnumerable(itemsTrain);
        var pipeline = mlContext.Transforms.CopyColumns(outputColumnName: "Label", inputColumnName: nameof(FilmModel.BoxOffice))
            .Append(mlContext.Transforms.Categorical.OneHotEncoding(outputColumnName: "NameEncoded", inputColumnName: nameof(FilmModel.Name)))
            .Append(mlContext.Transforms.Categorical.OneHotEncoding(outputColumnName: "DurationEncoded", inputColumnName: nameof(FilmModel.Duration)))
            .Append(mlContext.Transforms.Categorical.OneHotEncoding(outputColumnName: "ClassificationEncoded", inputColumnName: nameof(FilmModel.Classification)))
            .Append(mlContext.Transforms.Categorical.OneHotEncoding(outputColumnName: "GenreEncoded", inputColumnName: nameof(FilmModel.Genre)))
            .Append(mlContext.Transforms.Categorical.OneHotEncoding(outputColumnName: "StudioEncoded", inputColumnName: nameof(FilmModel.Studio)))
            .Append(mlContext.Transforms.Categorical.OneHotEncoding(outputColumnName: "BudgetEncoded", inputColumnName: nameof(FilmModel.Budget)))
            .Append(mlContext.Transforms.Categorical.OneHotEncoding(outputColumnName: "ReleaseDateEncoded", inputColumnName: nameof(FilmModel.ReleaseDate)))
            .Append(mlContext.Transforms.Categorical.OneHotEncoding(outputColumnName: "CrewEncoded", inputColumnName: nameof(FilmModel.Crew)))
            .Append(mlContext.Transforms.Categorical.OneHotEncoding(outputColumnName: "ActorsEncoded", inputColumnName: nameof(FilmModel.Actors)))
            .Append(mlContext.Transforms.Concatenate("Features", "NameEncoded", "ClassificationEncoded", "DurationEncoded", "GenreEncoded", "StudioEncoded", "BudgetEncoded", "ReleaseDateEncoded", "CrewEncoded", "ActorsEncoded"))
            .Append(mlContext.Regression.Trainers.LbfgsPoissonRegression());
        var model = pipeline.Fit(data);
        return model;
    }
}
```

And here is the class I'm using to deserialize the JSON file:

```C#
public class FilmModel
{
    public FilmModel(float boxOffice, string[] crew, string[] actors, string releaseDate = null, string name = null, string classification = null, string duration = null, string genre = null, string studio = null, string budget = null)
    {
        Name = name;
        Classification = classification;
        Duration = duration;
        Genre = genre;
        Studio = studio;
        Budget = budget;
        BoxOffice = boxOffice;
        ReleaseDate = releaseDate;
        Crew = crew;
        Actors = actors;
    }

    public string Name { get; set; }
    public string Classification { get; set; }
    public string Duration { get; set; }
    public string Genre { get; set; }
    public string Studio { get; set; }
    public string Budget { get; set; }
    public float BoxOffice { get; set; }
    public string ReleaseDate { get; set; }
    public string[] Actors { get; set; }
    public string[] Crew { get; set; }
}
```

The Actors and Crew arrays do not have the same length in every row. For instance, one movie could have 4 crew members and another 8. I think that's the problem, but I do not know how to fix it.
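The effect of differing array lengths can be illustrated without ML.NET. The sketch below is a plain C# simulation, not the library's implementation: it assumes each array element is one-hot encoded over a fixed vocabulary and that the per-element vectors are concatenated, so a row's width is the number of elements times the vocabulary size. `OneHotShapes`, `EncodeRow`, and the sample vocabulary are hypothetical names used only for illustration.

```C#
using System;

static class OneHotShapes
{
    // Hypothetical stand-in for per-element one-hot encoding followed by concatenation.
    public static float[] EncodeRow(string[] values, string[] vocabulary)
    {
        var row = new float[values.Length * vocabulary.Length];
        for (int i = 0; i < values.Length; i++)
        {
            int slot = Array.IndexOf(vocabulary, values[i]);
            if (slot >= 0)
            {
                // Element i occupies its own block of vocabulary.Length slots.
                row[i * vocabulary.Length + slot] = 1f;
            }
        }
        return row;
    }
}
```

With a vocabulary of `{ "A", "B", "C", "D" }`, `EncodeRow(new[] { "A", "B", "C" }, vocabulary)` has length 12 while `EncodeRow(new[] { "A", "D" }, vocabulary)` has length 8, so under this assumption the rows cannot share one fixed-size vector type.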

Most helpful comment

Hi @lionelquirynen,

Here is a sample of how you can get this to work. When you apply OneHotEncoding to a vector column, it does what it's supposed to, but it creates a separate encoded vector for each element of the input vector. The result is a vector of variable length, because its length depends on the number of elements in the original input vector. Each of the nested vectors, however, has a fixed size. What you need to do (with a custom transform) is map the encoded values from all of the nested vectors into a single vector. In addition, because you won't know the dimensions of the nested vectors ahead of time, use SchemaDefinition to set the size of the combined vector equal to the size of one nested vector.

If you run this program, you'll see that it trains the model with no error. Keep in mind that it handles only one column. In your case, I'd wrap the Action in a method that takes the name or dimensions of the column whose nested vectors I want to combine and returns the Action, so you can reuse it for each column that produces vectors of variable size.

```C#
using System;
using System.Linq;
using Microsoft.ML;
using Microsoft.ML.Data;

class Program
{
    static void Main(string[] args)
    {
        var originalData = new DataInput[]
        {
            new DataInput { Actors = new string[] { "A", "B", "C" } },
            new DataInput { Actors = new string[] { "A" } },
            new DataInput { Actors = new string[] { "C" } },
            new DataInput { Actors = new string[] { "A", "D" } }
        };

        MLContext ctx = new MLContext();

        IDataView dv = ctx.Data.LoadFromEnumerable(originalData);

        var dataPrep = ctx.Transforms.Categorical.OneHotEncoding("EncodedActors", "Actors");

        // Apply One-Hot Encoding
        // This generates a Vector with dimensions (n,4). The first dimension is unknown
        // since it depends on the number of values in the Actors column for each row.
        // The second dimension is the number of unique values in the Actors column:
        // A,B,C,D give it a length of 4. The result of the transformation is therefore a column
        // called EncodedActors, a vector that contains n nested vectors of size 4.
        // Keep in mind that the values are stored as a flat, 1-D array.
        IDataView transformedData = dataPrep.Fit(dv).Transform(dv);

        // Get size of nested vectors
        var encodedVectorType = transformedData.Schema["EncodedActors"].Type as VectorDataViewType;
        var encodedVectorDimensions = encodedVectorType.Dimensions[1];

        // Custom Transform
        Action<TransformedInput, TransformedOutput> customTransform = (rowIn, rowOut) =>
        {
            // Holds the combined values of all nested vectors
            float[] unifiedEncoding = new float[encodedVectorDimensions];

            // Get indices for one-hot encoding
            var indices =
                rowIn
                .EncodedActors
                .Select((x, i) => new { x, i }) // Pair each value of the flat array with its index
                .Where(x => x.x == 1) // Keep only the positions that hold a 1
                .Select(x => x.i); // Return the indices

            foreach (var idx in indices)
            {
                // Map the flat index to its position within a nested vector
                var mappedIdx = idx % encodedVectorDimensions;
                unifiedEncoding[mappedIdx] = 1;
            }

            // Assign the unified values to FinalEncoding column
            rowOut.FinalEncoding = unifiedEncoding;
            rowOut.Label = 3.0F; // Hard coded because for now all that's needed is the Final Encoding
        };

        // Set the type of FinalEncoding column to a Vector of size equal to that of nested vectors.
        var outputSchemaDefinition = SchemaDefinition.Create(typeof(TransformedOutput));
        outputSchemaDefinition["FinalEncoding"].ColumnType = new VectorDataViewType(NumberDataViewType.Single, encodedVectorDimensions);

        var trainingPipeline = ctx.Transforms.CustomMapping(customTransform, null, outputSchemaDefinition: outputSchemaDefinition)
            .Append(ctx.Regression.Trainers.LbfgsPoissonRegression(featureColumnName: "FinalEncoding"));

        var trainResultDv = trainingPipeline.Fit(transformedData).Transform(transformedData);

        var preview = trainResultDv.Preview();
    }
}

class DataInput
{
    [VectorType]
    public string[] Actors { get; set; }
}

class TransformedInput
{
    [VectorType]
    public float[] EncodedActors { get; set; }
}

class TransformedOutput
{
    public float[] FinalEncoding { get; set; }
    public float Label { get; set; }
}
```
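The suggestion to wrap the combining logic in a reusable method could look something like the sketch below. It keeps only the index arithmetic from the sample, so it runs without ML.NET. `OneHotCollapser`, `Collapse`, and `ForSlotCount` are hypothetical names, and the input is assumed to be the flat concatenation of one-hot vectors that all have length `slotCount`.

```C#
using System;

static class OneHotCollapser
{
    // Folds N concatenated one-hot vectors of length slotCount into a single
    // vector of length slotCount that has a 1 wherever any nested vector had one.
    public static float[] Collapse(float[] flat, int slotCount)
    {
        if (slotCount <= 0)
            throw new ArgumentOutOfRangeException(nameof(slotCount));
        if (flat.Length % slotCount != 0)
            throw new ArgumentException("Length must be a multiple of slotCount.", nameof(flat));

        var unified = new float[slotCount];
        for (int i = 0; i < flat.Length; i++)
        {
            if (flat[i] == 1f)
            {
                // i % slotCount is the position inside the nested vector.
                unified[i % slotCount] = 1f;
            }
        }
        return unified;
    }

    // Returns a collapser bound to one column's slot count, so the same logic
    // can be reused for Actors, Crew, and any other variable-length column.
    public static Func<float[], float[]> ForSlotCount(int slotCount) =>
        flat => Collapse(flat, slotCount);
}
```

For example, assuming the slots are ordered A, B, C, D, the row `{"A","D"}` is encoded as `{ 1, 0, 0, 0, 0, 0, 0, 1 }`, and `OneHotCollapser.ForSlotCount(4)` maps it to `{ 1, 0, 0, 1 }`. Inside a custom mapping action, the body would then reduce to one call per column, with the slot count read from that column's schema as in the sample.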

All 3 comments

Thanks a lot for the explanation and the comment! I appreciate it a lot!

@luisquintanilla Thanks for the explanation.
@lionelquirynen I hope that answers your question.
