Hello!
I often have CSV files with more than 50 float columns, so it's not feasible to specify each of them individually. I've failed to load them in one shot using a range/sweep specifier. To test things out at a smaller scale, I used the Iris example, because it ends with 4 float columns.
Here's the data class; I only added the last 2 lines:
public class IrisData
{
    [Column("0")]
    public float Label;

    [Column("1")]
    public float SepalLength;

    [Column("2")]
    public float SepalWidth;

    [Column("3")]
    public float PetalLength;

    [Column("4")]
    public float PetalWidth;

    [Column("1-*", name: "Features")] // New
    public float[] Features;          // New
}
Here's the simplified pipeline; I only commented out the usual approach with ColumnConcatenator:
var pipeline = new LearningPipeline();
pipeline.Add(new TextLoader(DataPath).CreateFrom<IrisData>(useHeader: true, separator: '\t'));
//pipeline.Add(new ColumnConcatenator("Features",
//    "SepalLength",
//    "SepalWidth",
//    "PetalLength",
//    "PetalWidth"));
pipeline.Add(new KMeansPlusPlusClusterer() { K = 3 });
var model = pipeline.Train<IrisData, ClusterPrediction>();
It works when I load each column individually and then concatenate them in the pipeline, as the sample code does. But it always throws an exception when I use my code above:
System.Reflection.TargetInvocationException: 'Exception has been thrown by the target of an invocation.'
Inner Exception:
InvalidOperationException: Column 'Features' is a vector of variable size, which is not supported for normalizers
Please help! Thank you!
=============================================================
Hi,
Just a hunch here: I think the issue might be that you are trying to reuse the Features column. The first parameter of a ColumnConcatenator is the destination column. Maybe try renaming your "Features" column on the IrisData model to MiscFeatures or something along those lines (update both the name in the Column attribute and the field name), and update the ColumnConcatenator to take in "MiscFeatures" as well.
Now, I'm not certain how that will work, since you'll be concatenating 2 different types (most columns are float, the last one is a vector), and hopefully #535 will get an answer that further explains how we could go about concatenating these together.
@hellothere33 , thanks for your question.
The problem you are having is due to the fact that we have two different kinds of vector-valued columns: fixed-size and variable-size. For fixed-size vectors, all examples are expected to have the same, specified number of elements in the corresponding column, whereas for variable-size vectors the number of elements may differ from example to example.
When we build the data pipeline, we inspect the class (IrisData) and infer the data type of each column. The type of Features is float[], so we assume it's a variable-size vector of floats.
On the other hand, when you concatenate 4 features using ColumnConcatenator, the resulting column will be a fixed-size vector of floats (size 4).
Most prediction pipelines do not accept variable-size vectors, and hence the error message you are seeing.
Thankfully, in your particular scenario you can give the pipeline a hint about the vector size:
public class IrisData
{
    [Column("0")]
    public float Label;

    [Column("1-*", name: "Features")] // Name and location
    [VectorType(4)]                   // Vector size
    public float[] Features;
}
If the vector size is only known at runtime, the VectorType attribute approach will not work. In that case, you can still give us a hint about the column type by using the SchemaDefinition class.
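As a rough sketch of what that could look like with the lower-level API (the exact member names here are assumptions and may differ between ML.NET versions; treat this as illustrative, not verified):

```csharp
// Hypothetical sketch: start from the schema inferred from the class
// attributes, then override the "Features" column with a vector size
// that is only known at runtime (e.g. read from the CSV header).
int featureCount = DetermineFeatureCountAtRuntime(); // made-up helper

var schemaDef = SchemaDefinition.Create(typeof(IrisData));
schemaDef["Features"].ColumnType = new VectorType(NumberType.R4, featureCount);
// The schema definition would then be passed to a SchemaDefinition-aware
// data view constructor.
```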
We definitely should have this covered in the documentation somewhere. If it isn't, then this issue should be about the missing documentation.
Thank you @dan-drews and @Zruty0 for your suggestions!
Using VectorType(4) does indeed work. Since my actual data has a varying number of columns (80+), which is why I used Iris as a simpler proxy example, I'm going to try the SchemaDefinition approach as suggested.
Would you like me to close this issue, or to change its title to be about missing documentation? Let me know.
I have discovered that the API reference doesn't have documentation for VectorTypeAttribute, which looks like an issue with our xmldoc tooling.
@sfilipi , would you mind taking over this issue?
Separately, I think we would benefit from a longer doc describing exactly the mechanism behind schema comprehension that we currently do: how attributes are handled, why SchemaDefinition is sometimes needed, what are the limitations. I created a separate issue for this #554 .
Opened an issue in the docs infrastructure to track:
https://github.com/dotnet/docs/issues/6542
I couldn't find examples of SchemaDefinition in the samples repo though, and there are no tests in the main repo using it either, or maybe I just didn't find them. Could you point me in the right direction?
Thank you!
I am in the process of writing a doc on 'schema comprehension', and it'll include examples of SchemaDefinition use. I'll get back to you in 0-1 days.
@hellothere33 , I have written the schema comprehension doc, which contains an example of handling runtime-known column sizes.
Unfortunately, in the process I discovered that SchemaDefinition-aware constructors didn't make it into the LearningPipeline helper API, so you won't be able to easily tweak your code to make it work, like I hoped.
As we are now contemplating replacing LearningPipeline with something more versatile, I don't think we should be adding new capabilities to it.
Would you be able to set a compile-time 'upper bound' on the number of features you have? You mentioned 80+, so maybe 100 would suffice? This way you can still take advantage of the compile-time [VectorType] attribute.
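For illustration, the upper-bound workaround might look like this (a sketch only; the class name and the bound of 100 are made-up examples):

```csharp
public class MyData
{
    [Column("0")]
    public float Label;

    // Compile-time upper bound on the number of feature columns.
    // Actual files may have fewer columns (e.g. 80+); 100 leaves headroom.
    [Column("1-*", name: "Features")]
    [VectorType(100)]
    public float[] Features;
}
```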
Thanks for the doc @Zruty0!
Yes, I've been using a hard-coded upper-bound array length in combination with [VectorType]. It works. So even if it's not the cleanest solution, at least I'm not blocked.
I suppose we should close the issue then. Feel free to reopen if you feel otherwise.
Yes, thanks everyone!