Machine learning: Working with streaming and larger-than-memory data sets

Created on 9 May 2018 · 11 comments · Source: dotnet/machinelearning

Where does ML.NET stand with regard to streaming support and large data sets? Is it designed to work only with static data sets that fit into main memory all at once?

Are capabilities similar to https://github.com/BlueMountainCapital/Deedle.BigDemo/blob/master/README.md planned?

When evaluating an approach for incorporating ML into an application architecture, it is important for decision makers to understand what is going to be possible and what the limits are with regard to working with large and/or unbounded data sets. It would be good to have a clear statement about this in the README and roadmap.

I'm not sure a successful ML library is possible without a very efficient and flexible (streaming, big-data) data frame implementation like the one discussed in https://github.com/dotnet/corefx/issues/26845
It would be very interesting to understand ml.net team's take on that.

question

Most helpful comment

@voltcode thanks for the question. As we are still a bit early in the release, we don't have all the technical documentation published yet, so let me give you a quick answer. The answer is an emphatic "yes!" The ML.NET framework was built from the ground up with efficient streaming as a core design principle. The key is ML.NET's core data structure, IDataView. _IDataView_ provides a schematized, immutable, and cursorable view into the data. This allows algorithms to stream data as needed, similar to the way IEnumerable&lt;T&gt; is used, but significantly more efficiently.
However, this does not mean that every possible ML pipeline can scale infinitely with the number of training examples.
Some components (learners or transforms) may choose to load the entire dataset into memory to simplify their implementation; for example, FastTreeBinaryClassifier needs to do that.
In addition, some components, such as various neural network learners, need to process data in batches; IDataView provides that capability as well, simply as a different usage pattern.
We will update the documents in the near future to reflect this.
Thanks for the suggestion!

Gleb
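The cursorable, pull-based view described above can be illustrated with a conceptual sketch. This is plain Python, not ML.NET code; the function names (`row_cursor`, `normalize`) and the `price` column are illustrative only. The point is that, as with an IDataView cursor, only one row is ever materialized at a time, so the source can be arbitrarily large.

```python
# Conceptual sketch (plain Python, not ML.NET API): a pull-based "cursor"
# over a data source, analogous to how an IDataView cursor streams rows on
# demand instead of loading the whole dataset into memory.

def row_cursor(rows):
    """Yield one row at a time, like opening a cursor on a data view."""
    for row in rows:
        yield row

def normalize(cursor, column):
    """A streaming transform: rewrites one column, row by row."""
    for row in cursor:
        row = dict(row)
        row[column] = row[column] / 100.0
        yield row

# The source could be a file or database reader of arbitrary size; only one
# row exists in memory at a time because generators are lazy.
source = ({"price": p} for p in range(1_000_000))
pipeline = normalize(row_cursor(source), "price")

first = next(pipeline)
print(first)  # {'price': 0.0}
```

The same shape carries over to `IEnumerable<T>`-style iteration in .NET: the consumer pulls, and each stage produces just enough to satisfy the request.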

All 11 comments


@glebuk Is there a way (or at least an envisioned abstraction) to easily join existing IDataViews? I understand that one could create a CompoundDataView manually, but it would be very convenient to have such an abstraction already available at the core.

@glebuk from a streaming perspective, efficient handling of large streams would require a backpressure mechanism to be available. Is that on the roadmap too? I don't see it being part of IDataView (not saying that it should be, but you mentioned that it was designed with streaming in mind). How would streaming be handled in the case of an ML.NET pipeline distributed across a cluster? Is distribution going to be possible at all with ML.NET?

@voltcode: To add to this, on the topic of scalability, I use ML.NET's underlying framework to run TB-scale datasets on a single node. These are text (NLP) datasets read from disk. Our roadmap lists distributed training, which refers to going beyond a single node.

@voltcode By join do you mean vertically (append rows) or horizontally (like an inner join or zip join)? Could you explain what you mean by CompoundDataView?
IDV usually starts with a 'rectangularized' dataset. Getting data into this rectangular form is usually expected to be done outside of this framework with tools such as Pendleton, Spark, or SQL.
Since the IDataView is schematized, you can have a single data view carry a variety of independent features. Various pieces of the ML pipeline may choose which columns to operate on. For example, if you have a product catalog, you can have a single IDV with prices, text, reviews, and images. You can have different transforms modify different columns. The final step would then combine all the resulting features into a single vector column and pass it all to a learner - you have just created a model that reasons about a product based on text, image, and numeric features.
Multiple independent IDVs are used for more complex scenarios, such as recommender systems, where you have independent datasets such as users and items.
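The product-catalog example above can be sketched conceptually. This is plain Python, not ML.NET API; the featurizer names and the `price`/`review` columns are illustrative. It shows independent transforms each reading only the columns they declare, with a final step concatenating the results into the single feature vector a learner would consume.

```python
# Conceptual sketch (not ML.NET API): independent per-column transforms,
# followed by a concatenation step that builds one feature vector per row.

def text_featurizer(row):
    # Toy text feature: word count of the review column.
    return [float(len(row["review"].split()))]

def numeric_featurizer(row):
    # Numeric columns can pass through directly.
    return [row["price"]]

def concat_features(row, featurizers):
    """Combine per-column features into a single vector for the learner."""
    vector = []
    for f in featurizers:
        vector.extend(f(row))
    return vector

row = {"price": 9.99, "review": "works great out of the box"}
features = concat_features(row, [text_featurizer, numeric_featurizer])
print(features)  # [6.0, 9.99]
```

Because each featurizer only touches its own columns, they can be developed, swapped, and composed independently, which is the property the schematized IDataView design is after.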

@glebuk By join I mean inner join and similar. By CompoundDataView I meant rolling out an implementation that composes internally out of two or more IDataViews but is presented on the outside as a single view.

Since this library allows building pipelines, it is natural (perhaps I wish for too much, but I see a great opportunity for .NET here in terms of catching up with the Java ecosystem :) ) that some of the available building blocks allow for combining more than one input.

In terms of rectangularization, it should be fine, as long as it allows for virtualization of the underlying data view (i.e., not loading the entire set into memory) and for unbounded input (streams). As you already mentioned, that was indeed incorporated into the design and will be available in some form. I'm looking forward to examining working examples of those features, with EF or something else providing data directly into the framework in a lazy manner.

We do have a HashJoinTransform for that purpose. Of course, general join functionality for two large datasets is a problem better left to an RDBMS like SQL Server.
If you are interested in using HashJoinTransform, beware that it is not available in our simplified LearningPipeline API. In order to do joins, you will have to leave the world of the simple LearningPipeline and enter the world of data-flow DAGs. You will need to use an Experiment class and do something like here
As far as creating a streaming data view, the framework provides several classes that might make it easier, such as StreamingDataView&lt;TRow&gt;. Note that those are in the Runtime namespace, meaning that the API is not final.

Also, with regard to "backpressure": IDataView operates on a pull model. The data is lazy and is only materialized if used by the final consumer. Imagine that you take a loader and string together a bunch of transforms. At that point no data processing occurs. Then, once you add a learner, you call "Train". As training executes, the last item in the pipeline, the trainer, requests data. As data is requested, it is produced in a chained fashion by the loader and processed by all the items in the pipeline.
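The pull model described above can be demonstrated in a conceptual sketch (plain Python, not ML.NET code; the `loader`/`transform` names are illustrative). Building the pipeline does no work at all; rows are produced only when the final consumer pulls them, which is why a separate backpressure mechanism is unnecessary: the consumer sets the pace.

```python
# Conceptual sketch of the pull model: constructing the pipeline is free;
# work happens only when the final consumer iterates.

log = []  # records the order in which work actually happens

def loader():
    for i in range(3):
        log.append(f"load {i}")
        yield i

def transform(cursor):
    for x in cursor:
        log.append(f"transform {x}")
        yield x * 2

pipeline = transform(loader())   # nothing has run yet
assert log == []                 # building the DAG did no work

consumed = list(pipeline)        # the consumer pulls; work happens now
print(consumed)                  # [0, 2, 4]
print(log[:2])                   # ['load 0', 'transform 0']
```

Note the interleaving in `log`: each row flows through the whole chain before the next row is loaded, so intermediate results never accumulate between stages.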

@voltcode, were all your questions answered? If so, it might be time to close the issue.

@markusweimer yes they were. It would be great to have an example or two that demonstrate more advanced uses of IDataView, but I suppose the information in this thread is a good enough starting point.

Thanks for the confirmation. Closing the issue for now.
