machinelearning: VectorType attribute with dynamic dimension

Created on 16 May 2018 · 20 comments · Source: dotnet/machinelearning

The VectorType attribute can be added to an array-valued field when the length of the array is known at compile time, e.g.,

        public class Data
        {
            [ColumnName("Features")]
            [VectorType(2)]
            public float[] Features;

            [ColumnName("Label")]
            public bool Label;
        }

However, what if the length of the array is only known at run time?

For example, given IEnumerable<Data> data where the length of the Features array is given by int numFeatures, I need to be able to pass numFeatures to CollectionDataSource.Create(data) somehow, and remove the static 2 argument to the VectorType attribute on the Features field.

question

Most helpful comment

@Anaschouihdi

Create a schema definition and pass it as the 2nd parameter in the LoadFromEnumerable method:

var schemaDef = SchemaDefinition.Create(typeof(Data));
schemaDef["Features"].ColumnType = new VectorDataViewType(NumberDataViewType.Single, 5);
var trainingDataView = mlContext.Data.LoadFromEnumerable(dataArray, schemaDef);

All 20 comments

Can you please elaborate on your IEnumerable<Data> example a bit more? Usually, the feature vector size is known and fixed for all examples in the dataset. I would like to understand your scenario in more detail.

Features may be extracted from a data source in a way that is specified at runtime. A simple example is the last N values from a time series, where N is configurable at runtime.
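To make the scenario concrete, last-N windowing can be done entirely outside ML.NET before the data is loaded. This is a minimal sketch; the names are illustrative and not part of any ML.NET API:

```csharp
using System.Collections.Generic;

public static class Windowing
{
    // Build overlapping windows of the last n values of a series.
    // n is only known at runtime, so the float[] length is dynamic.
    public static IEnumerable<float[]> LastNWindows(IReadOnlyList<float> series, int n)
    {
        for (int end = n; end <= series.Count; end++)
        {
            var window = new float[n];
            for (int j = 0; j < n; j++)
                window[j] = series[end - n + j];
            yield return window;
        }
    }
}
```

Each yielded array would become the Features field of a row, which is exactly the case where the vector size cannot be a compile-time attribute argument.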

Thanks for the explanation. Yes, this could be done through a transform in the pipeline (call it WindowTransform, say) which turns the last N values of column(s) into a feature vector. This transform is not currently available in ML.NET.

Thanks, however that particular transform was just an example.

In general, the transforms applied to the source data to extract features may be parameterised in countless ways. ML.NET should not attempt to implement every conceivable transform; it should instead accept pre-processed arrays of features as input.

Thanks @mjmckp for the suggestion. ML.NET does support pre-processed arrays of features as input. Your concern is only about setting the VectorType dimension at runtime. I am adding @TomFinley and @Ivanidzo4ka in case they have more info in this regard.

That's right, thanks a lot @zeahmed

Any update on this?

Support for this would also be very useful for our scenario. We are trying to integrate ML.NET into our framework, which generates training data with feature vectors as float[] arrays. The number of features is not known a priori and can vary across runs.

Given a set of float[] feature vectors and labels, we would like to be able to instantiate a LearningPipeline and train a model. However, as various stages of the pipeline rely on the VectorTypeAttribute to determine the input schema, we are currently unable to do this. Internally, the ML.NET framework supports passing explicit schema definitions, such as:

ComponentCreation.CreateDataView<TRow>(this IHostEnvironment env, IList<TRow> data, SchemaDefinition schemaDefinition)
ComponentCreation.CreatePredictionEngine<TSrc, TDst>(this IHostEnvironment env, Stream modelStream, bool ignoreMissingColumns, SchemaDefinition inputSchemaDefinition, SchemaDefinition outputSchemaDefinition)

Could this be surfaced in the pipeline APIs to support variable feature vector dimensions?

Sorry for the delay.
We can definitely let you set the vector size at runtime.
So far I can see two ways to do that.
The first is to let you pass a dictionary of field/property names (properties are not yet supported, but we are working on this) and the dimensions for each.

            var vectorSizes = new Dictionary<string, int[]>();
            vectorSizes.Add("Features", new int[1] { 2 });
            pipeline.Add(CollectionDataSource.Create(data, vectorSizes));

something like this.
Another option is to inspect the first element in your collection and infer the vector sizes from it.

pipeline.Add(CollectionDataSource.Create(data, inferVectorSizesFromCollection:true));

The downside of the second approach is that it needs to start two enumerators, which in some cases, like SQL data extraction, can be quite costly.

Does this sound reasonable to you?

Either way sounds fine to me, thanks a lot

Thanks for looking into this. While either would work, the first approach is more explicit and may provide users more flexibility (e.g. specifying a vector size that is smaller than the underlying collection).

Hello Guys,

I am new to ML.NET and I don't really get your solution. I am using version 0.11.0 and am trying to keep the following architecture:

class Data
{
    public string ID { get; set; }

    [VectorType(5)] // I do not know if the data will contain 5 or more features
    public float[] Features { get; set; }
}

InputData row = new InputData { AssetID = Data[0, i + 1].ToString(), Features = features };

var context = new MLContext();
var dataView = context.Data.LoadFromEnumerable(dataArray);
string featuresColumnName = "Features";
var pipeline = context.Transforms.Concatenate(featuresColumnName, "Features")
    .Append(context.Clustering.Trainers.KMeans(featuresColumnName, clustersCount: NumberClusters));

var model = pipeline.Fit(dataView);

Could you help me ?

@Anaschouihdi

Create a schema definition and pass it as the 2nd parameter in the LoadFromEnumerable method:

var schemaDef = SchemaDefinition.Create(typeof(Data));
schemaDef["Features"].ColumnType = new VectorDataViewType(NumberDataViewType.Single, 5);
var trainingDataView = mlContext.Data.LoadFromEnumerable(dataArray, schemaDef);

That's right, thanks a lot @drake7707

One thing I forgot to mention is that you'll also need to pass the same schema definition as an additional parameter inputSchemaDefinition in the prediction engine:

var predEngine = mlContext.Model.CreatePredictionEngine<IrisData, IrisPrediction>(trainedModel, inputSchemaDefinition: schemaDef);

@drake7707 Thank you very much for your answer. It works perfectly fine !

Closing this since it can be achieved by passing an input schema and overriding the column type with the dimensions at runtime.
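Pulling the accepted answer together, a minimal end-to-end sketch might look as follows (class and variable names such as dataArray and trainedModel are illustrative placeholders; APIs are ML.NET 1.x):

```csharp
using Microsoft.ML;
using Microsoft.ML.Data;

public class Data
{
    // No compile-time VectorType size; the schema definition supplies it.
    public float[] Features { get; set; }
    public bool Label { get; set; }
}

public class Prediction
{
    public float Score { get; set; }
}

// ...
int numFeatures = 5; // known only at runtime

var mlContext = new MLContext();

// Override the column type with the runtime dimension.
var schemaDef = SchemaDefinition.Create(typeof(Data));
schemaDef["Features"].ColumnType =
    new VectorDataViewType(NumberDataViewType.Single, numFeatures);

IDataView trainingData = mlContext.Data.LoadFromEnumerable(dataArray, schemaDef);

// ... build a pipeline and call Fit(trainingData) to get trainedModel ...

// The same schema definition must be passed when creating the prediction engine.
var predEngine = mlContext.Model.CreatePredictionEngine<Data, Prediction>(
    trainedModel, inputSchemaDefinition: schemaDef);
```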

Hi, can someone please post an example of passing an input schema and overriding the column type with the dimensions at runtime?

I tried the following in VB.NET, but get an error: "ballhist is a class type and cannot be used as an expression."

var schemaDef = SchemaDefinition.Create(typeof(BallHist));
schemaDef["Features"].ColumnType = new VectorDataViewType(NumberDataViewType.Single, 5);
var trainingDataView = mlContext.Data.LoadFromEnumerable(dataArray, schemaDef);

My class is as follows:
Public Class BallHist
Public Sequence As Single

    <LoadColumn(1)>
    <ColumnName("Day")>
    Public Day As Single

    <LoadColumn(2)>
    <ColumnName("Month")>
    Public Month As Single

    <LoadColumn(3)>
    <ColumnName("Year")>
    Public Year As Single

    <LoadColumn(4)>
    <VectorType(9)> ' want this to be dynamic at runtime.
    <ColumnName("PreviousBalls")>
    Public PreviousBalls As Single()


    <LoadColumn(5)>
    <ColumnName("BallNo")>
    Public BallNo As Single


End Class

Thank you in advance.

For reference, the code in VB.NET is:

Dim featureDimension As Integer = Data(0).PreviousBalls.Count - 1
Dim schemaDef = SchemaDefinition.Create(GetType(BallHist))
schemaDef("PreviousBalls").ColumnType = New VectorDataViewType(NumberDataViewType.Single, featureDimension)
trainData = mlContext.Data.LoadFromEnumerable((GetTrainDataBallHist(records, NumRecordsForTrain)), schemaDef)

Hello Guys,

But how do you do this when the size of the vector changes for each row?
For instance, if I have a dataset with 10,000 movies, each movie contains two arrays of strings, one for the actors and the other for the crew members. For each movie, the numbers of actors and crew members are not the same. How do I handle this?

Best regards,

Lionel Quirynen
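This last question goes unanswered in the thread, but for what it's worth: in ML.NET, an array field declared without a VectorType size maps to a variable-size vector column, and a hashing ("bag") encoding can then turn it into fixed-size features that trainers can consume. A hedged sketch, with the API shape assumed from ML.NET 1.x (verify against your version):

```csharp
using Microsoft.ML;

public class Movie
{
    public string Title { get; set; }

    // No fixed dimension: each row may hold a different number of actors.
    public string[] Actors { get; set; }
}

// ...
var mlContext = new MLContext();

// One-hot hash encoding aggregates the variable-length Actors column
// into a fixed-size feature vector, regardless of each row's length.
var pipeline = mlContext.Transforms.Categorical.OneHotHashEncoding(
    "ActorFeatures", "Actors");
```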
