Machine learning: Concatenating a range of columns in the data class into the "Features" column leads to an exception being thrown

Created on 5 Jul 2018 · 11 comments · Source: dotnet/machinelearning

Hello!

I often have CSV files with more than 50 float columns, so it's not feasible to specify each of them individually. I've failed to load them in one shot using a range/sweep specifier. To test things out in smaller scale, I used the Iris example because it ends with 4 float columns.

Here's the data class, I only added 2 lines at the end:

    public class IrisData
    {
        [Column("0")]
        public float Label;

        [Column("1")]
        public float SepalLength;

        [Column("2")]
        public float SepalWidth;

        [Column("3")]
        public float PetalLength;

        [Column("4")]
        public float PetalWidth;

        [Column("1-*", name: "Features")] // New
        public float[] Features; // New
    }

Here's the simplified pipeline, I only commented out the normal way with ColumnConcatenator:

            var pipeline = new LearningPipeline();
            pipeline.Add(new TextLoader(DataPath).CreateFrom<IrisData>(useHeader: true, separator: '\t'));
            //pipeline.Add(new ColumnConcatenator("Features",
            //                                    "SepalLength",
            //                                    "SepalWidth",
            //                                    "PetalLength",
            //                                    "PetalWidth"));
            pipeline.Add(new KMeansPlusPlusClusterer() { K = 3 });
            var model = pipeline.Train<IrisData, ClusterPrediction>();

So it worked when I loaded each column individually and then concatenated them in the pipeline, as the sample code does. But it always throws an exception when I use the code above:

    System.Reflection.TargetInvocationException: 'Exception has been thrown by the target of an invocation.'
    Inner Exception:
    InvalidOperationException: Column 'Features' is a vector of variable size, which is not supported for normalizers

Please help! Thank you!

=============================================================

System information

  • OS version/distro: Windows 10
  • .NET Version (eg., dotnet --info): .NET Framework 4.7.1

Issue

  • What did you do?: trying to load a CSV's multiple float columns by specifying a range in the data class's declaration, for example: "1-4"
  • What happened?: I got an exception about the Features column's size.
  • What did you expect?: that concatenating columns by specifying a range would work the same as adding a ColumnConcatenator to the pipeline.


All 11 comments

Hi,
Just a hunch here: I think that the issue might be that you are trying to reuse the Features column. The first parameter in a ColumnConcatenator is the destination column. Maybe try changing your "Features" column on your IrisData model to be called MiscFeatures or something along those lines (update the ColumnName attribute and the field name), and update the ColumnConcatenator to take in "MiscFeatures" as well.

Now I'm not certain how that will work since you'll be concatenating two different types (most columns are float, the last one is a vector), and hopefully #535 will get an answer to further explain how we could go about concatenating these together.

@hellothere33 , thanks for your question.

The problem you are having is due to the fact that we have different kinds of vector-valued columns: fixed-size and variable-size. For fixed-size vectors, all examples are expected to have the same, specified number of elements in the corresponding column, whereas for variable-size vectors the number of elements in the field may differ from example to example.

When we build the data pipeline, we inspect the class (IrisData) and infer the data type of each column. The type of Features is float[], so we assume it's a variable-size vector of floats.

On the other hand, when you concatenate 4 features using ColumnConcatenator, the resulting column will be a fixed-size vector of floats (size 4).

Most prediction pipelines do not accept variable-size vectors, and hence the error message you are seeing.

Thankfully, in your particular scenario you can give a hint to the pipeline with the vector size:

        public class IrisData
        {
            [Column("0")]
            public float Label;

            [Column("1-*", name: "Features")] // Name and location
            [VectorType(4)] // Vector size
            public float[] Features; 
        }

If the vector size is only known at runtime, the VectorType attribute approach will not work. In this case you can still give us a hint of the column type, by utilizing the SchemaDefinition class.
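To make the SchemaDefinition route more concrete, here is a minimal sketch. It assumes the Microsoft.ML.Runtime.Api surface of that era (SchemaDefinition.Create, VectorType, NumberType.R4); the exact indexer and property names may differ slightly between versions, so treat this as an illustration of the pattern rather than copy-paste code:

```csharp
// Sketch, assuming the ML.NET 0.x Microsoft.ML.Runtime.Api surface.
// Start from the schema inferred from IrisData, then override the
// Features column to be a fixed-size vector whose length is only
// known at runtime.
int featureCount = 4; // e.g. determined by parsing the CSV header

var schemaDef = SchemaDefinition.Create(typeof(IrisData));
schemaDef["Features"].ColumnType =
    new VectorType(NumberType.R4, featureCount);

// schemaDef would then be passed to a SchemaDefinition-aware
// component instead of relying on attribute-based inference alone.
```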

We definitely should have this covered in the documentation somewhere. If it isn't, then this issue should be about the missing documentation.

Thank you @dan-drews and @Zruty0 for your suggestions!

Using VectorType(4) does indeed work. Since my actual model has a varying number of columns (80+), which is why I used Iris only as a simpler proxy example, I'm going to try the SchemaDefinition approach as suggested.

Would you like me to close this issue, or to change its title to be about missing documentation? Let me know.

I have discovered that the API reference doesn't have documentation for VectorTypeAttribute, which looks like an issue with our xmldoc tooling.

@sfilipi , would you mind taking over this issue?

Separately, I think we would benefit from a longer doc describing exactly the mechanism behind schema comprehension that we currently do: how attributes are handled, why SchemaDefinition is sometimes needed, what are the limitations. I created a separate issue for this #554 .

Opened an issue in the docs infrastructure to track:
https://github.com/dotnet/docs/issues/6542

I couldn't find examples of SchemaDefinition in the samples repo though, and there are no tests in the main repo using it either, or maybe I just didn't find them. Could you point me in the right direction?
Thank you!

I am in the process of writing a doc on 'schema comprehension', and it'll include examples of SchemaDefinition use. I'll get back to you in 0-1 days.

@hellothere33 , I have written the schema comprehension doc, which contains an example of handling runtime-known column sizes.

Unfortunately, in the process I discovered that SchemaDefinition-aware constructors didn't make it into the LearningPipeline helper API, so you won't be able to easily tweak your code to make it work, like I hoped.

As we are now contemplating replacing LearningPipeline with something more versatile, I don't think we should be adding new capabilities to it.

Would you be able to put a compile-time 'upper bound' on the number of features you have? You mentioned 80+, so maybe 100 would suffice? This way you can still take advantage of the compile-time [VectorType] attribute.
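As an illustration of the upper-bound workaround, a sketch of the data class follows. The class name WideData and the bound of 100 are assumptions for a file with 80+ feature columns; the pattern otherwise mirrors the IrisData example above:

```csharp
// Sketch of the compile-time upper-bound workaround (ML.NET 0.x API).
public class WideData
{
    [Column("0")]
    public float Label;

    // Declare a fixed size at least as large as the number of feature
    // columns the file will ever contain. 100 is an assumed upper
    // bound for an 80+ column file.
    [Column("1-*", name: "Features")]
    [VectorType(100)]
    public float[] Features;
}
```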

Thanks for the doc @Zruty0!

Yes, I've been using a hard-coded upper-bound array length in combination with [VectorType]. It works. So even if it's not the cleanest, at least I'm not blocked.

I suppose we should close the issue then. Feel free to reopen if you feel otherwise.

Yes, thanks everyone!
