I couldn't find the channel to submit a request, so I apologise if this isn't the right place.
I have a web portal where customers upload CSV files for training and prediction with a multiclass classification algorithm. However, ML.NET requires a concrete class for the input model, with properties hard-coded to the columns of the CSV.
Could you enhance the LoadFromText method so that it accepts an argument for the label header name and then just reads the features from the remaining headers of the CSV file, without having to create a class? That way nothing is hard-coded and it can work with any CSV file my customers upload.
Hi, @iluveu28 .
The problem with the feature you're requesting is that the TextLoader not only needs to know how many columns there are and what the label of each column is, but also what type each column should be and whether multiple columns should be treated together as a vector. Furthermore, depending on your scenario, you'd still need to change the pipeline that comes after loading the data to handle any differences in your users' input.
Still, there are two ways you can try to accomplish what you want, although there's no plan to include the feature you've requested directly in TextLoader.
1. Besides using a ModelInput class, TextLoader also works by taking a TextLoader.Options object, which in turn can specify the column information needed to load the CSV file. This API can be found here. Inside the TextLoader.Options object you can add an array of TextLoader.Column objects; for example, the first column could be a float named "Label", with "Features" being a float vector made up of columns 1-6. In your case, I'd suggest creating your own logic to read your users' CSV files and determine the TextLoader.Column objects to use.
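A minimal sketch of what such a TextLoader.Options object could look like (the file name, column names, and index range below are illustrative, matching the example described above):

```C#
// Sketch: column 0 is a float "Label", and "Features" is a float
// vector made up of columns 1-6. Adjust names/indices to your data.
var mlContext = new MLContext();
var loader = mlContext.Data.CreateTextLoader(new TextLoader.Options
{
    Separators = new[] { ',' },
    HasHeader = true,
    Columns = new[]
    {
        new TextLoader.Column("Label", DataKind.Single, 0),
        new TextLoader.Column("Features", DataKind.Single, 1, 6)
    }
});
IDataView data = loader.Load("data.csv");
```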
2. You can use AutoML.NET's InferColumns() (docs found here), to which you pass your dataset path, and it infers the column information of your dataset. This method returns a ColumnInferenceResults object, which in turn contains a TextLoader.Options object with the column names and types inferred from the dataset. That object can be passed to the API I described in point 1. I think this is the option you're asking for, but take into account that inferring column information may not always work correctly, as it's not a trivial problem, so I'd still recommend writing your own logic, tailored to your use case, to generate the TextLoader.Options object.

Please let us know if this answers your question. Thanks.
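A minimal sketch of using InferColumns() this way (the dataset path and label column name are placeholders):

```C#
// Sketch: infer the column schema from the file, then build a
// TextLoader from the inferred TextLoader.Options.
var mlContext = new MLContext();
ColumnInferenceResults inference =
    mlContext.Auto().InferColumns("data.csv", labelColumnName: "Label", separatorChar: ',');
var loader = mlContext.Data.CreateTextLoader(inference.TextLoaderOptions);
IDataView data = loader.Load("data.csv");
```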
First of all, thanks a lot @antoniovs1029 for your swift reply.
This might work for training. However, the prediction web API requires a TData, so how can I add it to the pool without specifying the input model class?
```C#
// ModelInput/ModelOutput are placeholders for the actual model classes;
// the original generic arguments were lost in rendering.
services.AddPredictionEnginePool<ModelInput, ModelOutput>()
    .FromFile(modelName: modelName, filePath: $"MLModels/{modelName}.zip", watchForChanges: true);
```
Thereafter, in the controller or service layer, I need to handle the DI, where the TData is required as well:
```C#
// ModelInput/ModelOutput are placeholders for the actual model classes
private readonly PredictionEnginePool<ModelInput, ModelOutput> _predictionEnginePool;
```
Lastly, I'm doing the below to get the test data for prediction, so how should I do this without the input model?

```C#
// ModelInput and the arguments are placeholders; the original call was truncated
var testDataList = _mlContext.Data.CreateEnumerable<ModelInput>(dataView, reuseRowObject: false).ToList();
```
So, PredictionEngines need a defined input and output class, since they were created precisely to map C# input objects to output objects. Because of this, I don't think PredictionEngine is the tool you want for your use case, as you can't know in advance the schema of the data your users will input.
Although we usually recommend PredictionEnginePool for web apps, you can actually use other ML.NET APIs in web apps. In particular, you could use the typical ML.NET scenario where you have input and output IDataViews: simply use the TextLoader.Options I talked about earlier to load your data, and then use, for example, .GetColumn<float>("Score") on your output dataview to get your predictions. So it would be something like this:
```C#
var traindataPath = "train.csv";
var testdataPath = "test.csv";
var mlContext = new MLContext();

// Instead of defining the TextLoader columns manually,
// you might use the AutoML.NET InferColumns() I mentioned earlier
var loader = mlContext.Data.CreateTextLoader(new[]
{
    // ... your columns here
},
hasHeader: true,
separatorChar: ',');

var training_data = loader.Load(traindataPath);
var test_data = loader.Load(testdataPath);

var learningPipeline = /* ... your ML.NET pipeline here */;
var model = learningPipeline.Fit(training_data);
var output = model.Transform(test_data);

// "scores" will contain a float array with the score values
// given to each row of the test_data IDataView
var scores = output.GetColumn<float>("Score").ToArray();
```
No need to use CreateEnumerable or PredictionEngine if you already have your CSV files and your TextLoader.Options object.
Note that if your test_data only has 1 row (i.e. you only want to have one prediction at a time), then in that case PredictionEngine would perform faster than the code above.
I'd also like to mention the DataFrame project (link to tutorial). It lets you load your data into memory without specifying a schema, inferring column information similarly to what AutoML does (as mentioned earlier), and manipulate your data much like pandas DataFrames in Python.
I'm not sure you'd benefit from using DataFrames (perhaps AutoML.NET's InferColumns, which I mentioned earlier, would be closer to what you're looking for), since it loads the data into memory, and its main benefit, manual data exploration, wouldn't be useful to you.
But I wanted to mention it in case someone else stumbles upon this issue and might find DataFrames useful.
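For reference, loading a CSV with the DataFrame API could look like the sketch below (the file name is a placeholder):

```C#
using Microsoft.Data.Analysis;

// Sketch: LoadCsv infers column types from the file, no schema class needed.
DataFrame df = DataFrame.LoadCsv("data.csv");

// A DataFrame also implements IDataView, so it can feed an ML.NET
// pipeline directly after loading.
Console.WriteLine(df.Info());
```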
@antoniovs1029
Would it impact performance or cause any thread safety issues when not using the prediction engine pool in API?
I tried InferColumns and it's working fine for training. However, prediction gives the error below when I call the Transform method:

"Cannot map column (name: Bucket, type: Key) in data to the user-defined type, System.String. (Parameter 'column')"
Snippets of my prediction code:

```C#
var testData = loader.Load(testDataPath);
ITransformer mlModel = _mlContext.Model.Load(modelPath, out var modelInputSchema);
var output = mlModel.Transform(testData);
var predictions = output.GetColumn<string>(labelColumnName);
```
Note that predict and train are separate APIs, so I plan to save the inferred column results to the DB as a JSON string so that the predict API can retrieve them later and reconstruct the column definitions; for the purpose of this POC, I hard-coded the columns using an array. During training, the label is correctly inferred as DataKind.String, but in predict, when I debug and inspect the outputSchema of the output variable in the code above, there's somehow an additional UInt32 field for the KeyValues of the label, so I think this might be the issue. Please advise how I should obtain the string prediction results/label.
Also, the test data CSV file is posted to the API as a Stream. The text loader expects a file path as an argument, which means I'd have to write the stream to disk first, which incurs IO costs. When using the prediction engine pool, I convert the stream into a List and pass it into the engine, which is more efficient. Is there a way I can do this with the text loader, or something else that works for my use case?
A bigger code snippet below to give you a better idea of what I'm doing in predict. The reason I do inference in training but not in predict is that in predict the label column is empty, so it is always inferred as a Single rather than a String. Another question I have: there may be additional irrelevant columns in the test data CSV, added by users for remark purposes, so how should I tell the loader to ignore those and only consider the columns I specified?
```C#
string[] inputColNames = { "RegisterName", "RegisterValue", "ValidFlag", "ValidFlagValue" };
string labelColumnName = "Bucket";

TextLoader.Column[] columns = new TextLoader.Column[inputColNames.Length + 1];
int colCount = 0;
foreach (string colName in inputColNames)
{
    if (colName == "ValidFlagValue")
    {
        columns[colCount] = new TextLoader.Column(colName, DataKind.Single, colCount);
    }
    else
    {
        columns[colCount] = new TextLoader.Column(colName, DataKind.String, colCount);
    }
    colCount++;
}
columns[colCount] = new TextLoader.Column(labelColumnName, DataKind.String, colCount);

var options = new TextLoader.Options
{
    Separators = new[] { ',' },
    HasHeader = true,
    Columns = columns,
    AllowQuoting = true,
    AllowSparse = true
};

// Pass the options object so AllowQuoting/AllowSparse actually take effect
var loader = _mlContext.Data.CreateTextLoader(options);

var testData = loader.Load(testDataPath);
ITransformer mlModel = _mlContext.Model.Load(modelPath, out var modelInputSchema);
var output = mlModel.Transform(testData);
var predictions = output.GetColumn<string>(labelColumnName);
```
> Would it impact performance or cause any thread safety issues when not using the prediction engine pool in the API?
As I mentioned before, if you only run prediction on one sample at a time, PredictionEngine is faster than the transform/get-column snippet I posted; but if you run predictions on multiple samples at once, PredictionEngine might even be slower. On the other hand, PredictionEngine itself isn't thread safe, in that you shouldn't share the same PredictionEngine object across different threads. That's why PredictionEnginePool exists: to provide a PredictionEngine object (all sharing the same underlying model) whenever a thread requests one. But since PredictionEngines aren't used in the transform/get-column snippet, this isn't a problem there.
> I tried InferColumns and it's working fine for training. However, prediction gives the error below when I call the Transform method:
>
> "Cannot map column (name: Bucket, type: Key) in data to the user-defined type, System.String. (Parameter 'column')"
>
> [...]
>
> During training, the label is correctly inferred as DataKind.String, but in predict, when I debug and inspect the outputSchema of the output variable in the code above, there's somehow an additional UInt32 field for the KeyValues of the label, so I think this might be the issue.
Can you provide a full stack trace? I think the error isn't thrown while calling .Transform(), but rather when calling .GetColumn<string>(labelColumn).
What your model is probably doing is taking labelColumn as a string and then applying a ValueToKeyTransformer (or maybe a one-hot mapper) to map it into a "Key" type; so when you try to retrieve the column using GetColumn<string>(labelColumn), it fails because the type of the column isn't string but Key. I believe changing it to GetColumn<uint>(labelColumn) would fix that, if that's the problem. This is expected behavior, by the way, not a bug.
Also, I suspect you're making a mistake here. labelColumn contains the labels that were read from your input data file; ML.NET models usually output their predictions in a new column called "PredictedLabel". You can debug your scenario and check whether such a column exists in the outputSchema. It depends on your model, but I'd expect the output to be in "PredictedLabel", not in your labelColumn. The PredictedLabel column might be of Key or String type, depending on your model; if it's a Key type, I'd recommend adding a KeyToValueTransformer at the end of your model to actually get the string prediction.
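Appending such a key-to-value mapping could look like the sketch below (learningPipeline and trainingData stand in for whatever pipeline and data you already have):

```C#
// Sketch: map the key-typed "PredictedLabel" column back to the
// original label text by appending a key-to-value conversion.
var pipeline = learningPipeline
    .Append(mlContext.Transforms.Conversion.MapKeyToValue("PredictedLabel"));
var model = pipeline.Fit(trainingData);
```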
> Also, the test data CSV file is posted to the API as a Stream. The text loader expects a file path as an argument, which means I'd have to write the stream to disk first, which incurs IO costs. When using the prediction engine pool, I convert the stream into a List and pass it into the engine, which is more efficient. Is there a way I can do this with the text loader, or something else that works for my use case?
So TextLoader doesn't have an API that accepts Stream objects directly. But if you create a TextLoader using any of the CreateTextLoader() APIs (link), then you can call textLoader.Load(IMultiStreamSource source) (link), which accepts an IMultiStreamSource as input.
I believe that if you implement your own class implementing IMultiStreamSource, wrapping your stream, you would be able to pass that custom class to the TextLoader.
Unfortunately, there's no documentation on how to do this, and I've personally never done it, but I think it should work. You can take our MultiFileSource code (link) as an example of how to implement IMultiStreamSource; MultiFileSource is what TextLoader typically uses to read files from disk. As you can see in the implementation, the Open() method creates a new Stream object using the string path previously provided by the user.
In your case, I think you'd only need to return a copy of your stream in your implementation of Open(). It's important that you return a copy, since training a model typically requires several passes over the input data (ML.NET doesn't store the dataset in memory; it streams over it multiple times). Also, in your case, you'd probably want to ignore the index parameter of Open().
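A sketch of such a wrapper, assuming the request stream fits in memory (the class name is hypothetical):

```C#
using System.IO;
using Microsoft.ML.Data;

// Sketch: buffer the incoming stream once into a byte array, then hand
// out a fresh readable MemoryStream on every Open() call, so the
// TextLoader can make the multiple passes training requires.
public sealed class InMemoryStreamSource : IMultiStreamSource
{
    private readonly byte[] _buffer;

    public InMemoryStreamSource(Stream source)
    {
        using var ms = new MemoryStream();
        source.CopyTo(ms);
        _buffer = ms.ToArray();
    }

    public int Count => 1;                                  // a single logical "file"
    public string GetPathOrNull(int index) => null;         // no backing path on disk
    public Stream Open(int index) => new MemoryStream(_buffer, writable: false);
    public TextReader OpenTextReader(int index) => new StreamReader(Open(index));
}

// Usage: var data = loader.Load(new InMemoryStreamSource(requestStream));
```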
> there may be additional irrelevant columns in the test data CSV added by the users for remark purposes so how should I tell the loader to ignore those and only worry about the columns I specified?
The TextLoader will only load the columns specified in the TextLoader.Options object I mentioned the other day, so as long as you describe the relevant columns through that object, you don't need to worry about anything else. If you use the infer-columns API, it returns a TextLoader.Options object with the column information, and you'd then need to write your own logic to remove any unwanted columns before passing the TextLoader.Options to the TextLoader.
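A sketch of filtering the inferred columns, assuming a hypothetical unwanted column named "Remarks" (path and label name are placeholders):

```C#
using System.Linq;

// Sketch: drop unwanted columns from the inferred TextLoader.Options
// before building the loader. Column indices are preserved, so the
// remaining columns still line up with the CSV.
var inference = mlContext.Auto().InferColumns("data.csv", labelColumnName: "Bucket");
var options = inference.TextLoaderOptions;
options.Columns = options.Columns
    .Where(c => c.Name != "Remarks")   // "Remarks" is a hypothetical column name
    .ToArray();
var loader = mlContext.Data.CreateTextLoader(options);
```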
@antoniovs1029
Thanks a lot for the clear explanations on the prediction engine. I do training in bulk so the transform is best.
As for predict, changing it to UInt32 fixed it. However, the returned values look like a list of indexes/keys:

```C#
var output = mlModel.Transform(testData);
```
So I inspected the outputSchema, and I do see both the string and key versions of PredictedLabel in the transformed output variable above, but I'm still struggling to obtain the actual prediction strings. It'd be great if GetColumn would just return the string output instead.
Can you share a screenshot of your output DataView schema showing the PredictedLabel string and key columns you mention, and the code you're using for .GetColumn()?
Typically an ML.NET scorer will output the PredictedLabel as a key, so users need to add a KeyToValueTransformer after the scorer so that it can map keys back to label "strings". Adding the KTVT creates a new PredictedLabel column in the schema, which is now a "string" column. But since you say that you can see on your outputSchema that there are both string and key columns for PredictedLabel, then I'd think that you already have a KeyToValueTransformer on your model.
To add a KeyToValueTransformer to your model you could have used the .MapKeyToValue() API:
https://github.com/dotnet/machinelearning/blob/4a30bf5e92fba4bb3c2a4e37c304927a9009ffa7/docs/samples/Microsoft.ML.Samples/Dynamic/Transforms/Conversion/KeyToValueToKey.cs#L83-L93
NOTE: KTVT actually doesn't create a string column, but rather a text column of type ReadOnlyMemory<char> (or VBuffer<ReadOnlyMemory<char>> for a vector column), where it stores the text. So I made a mistake before when I said you could use .GetColumn<string>(); as you can see in the snippet I pointed to, you'd actually need .GetColumn<VBuffer<ReadOnlyMemory<char>>>() there (or .GetColumn<ReadOnlyMemory<char>>() for a scalar column).
@antoniovs1029
I took a screenshot of the watch window below; see indexes 14-15:

So I changed the GetColumn return type to ReadOnlyMemory<char> and it works:
```C#
var testData = loader.Load(testDataPath);
ITransformer mlModel = _mlContext.Model.Load(modelPath, out var modelInputSchema);
var output = mlModel.Transform(testData);
var predictions = output.GetColumn<UInt32>(labelColumnName);
var predictedLabel = output.GetColumn<ReadOnlyMemory<char>>("PredictedLabel");
```
I actually work for Intel in Malaysia as a Software Architect. On behalf of Intel, I'd like to extend my deepest gratitude for your guidance in getting this to work, you're a life saver! Hit me up if you ever come to visit Malaysia and I will buy you a nice meal! :)
Many thanks for the invite 😄 I will, if I ever go to Malaysia.
I'll close this issue now, since the original and follow up questions are now resolved.
@antoniovs1029
Need help again.
I have another project where the training data has 27 columns. Somehow the InferColumns ignored all the numerical columns.
```C#
var columnInference = _mlContext.Auto().InferColumns(trainingFilePath, labelColumnName);
var textLoader = _mlContext.Data.CreateTextLoader(columnInference.TextLoaderOptions);
var trainingDataView = textLoader.Load(trainingFilePath);
```
When I inspect the columnInference variable, I see only the string columns. Why?
Also, when I inspect the trainingDataView variable, I see an additional vector-type column called Features.
Although training was successful and the model was saved, the prediction is not working: it complains that the Features column is missing.
