Machinelearning: Support for loading data from SQL server

Created on 10 May 2018  路  11Comments  路  Source: dotnet/machinelearning

Somewhat related to #96 but more specific, it would be nice to have a loader to stream data out of a SQL Server table or view and into this framework for training.

enhancement

Most helpful comment

I did not say EF ;-) - I heard some of the authors of this package worked on EF before, but please please don't bind it to EF, or sql server or any other specific tech. Or course you want to get data into the learning system without going over to csv. But the data is usually already flattened anyway, so no need for a complex OR mapping tool - especially since the target data structure is Immutable tensor

All 11 comments

It seems SQL server is on the road map https://github.com/dotnet/machinelearning/blob/master/ROADMAP.md

Additional Data source support
Data from SQL Databases, such as SQL Server
Data located on the cloud

you want to bond the ML lib to a specific server? really?

@forki IDataView might work with arbitrary types, looking at https://github.com/dotnet/machinelearning/issues/96#issuecomment-387794016. I'm not sure yet how these cursors work, but naturally having a provider for every library and source type is untenable. There might be support for ingesting data from sources like IEnumerable<T>, which would support any source quite easily. At least coming?

I'm not sure if this is the same for everyone, but for me it's a bit puzzling to see there are specific loaders for some sources, like files with CSV data, which gives the impression the idea is to bond data sources (as in providers and format of the data) quite closely to the library. Then there is discussion about these specific loaders and in-memory loaders. This suggests the idea indeed is to couple source loading and format handling together and push to a specifc pipeline. For instance, it looks to me the CSV loader input/output doesn't work in all locales and it's not easy to work around that. I also might find more performant and robust ways to load the data in any case.

@forki I am still having trouble understanding how the internals of this library works, but I don't believe it is controversial to suggest that there are more efficient/performant ways of training on data from a database than plugging in EF Core, transforming to IEnumerable (if the feature lands), then transforming again into an IDataView - or worse, exporting GBs/TBs of data to CSV just to load them back into this library.

If this fits better as an independent provider plugged in via NuGet, so be it. It would be nice to have at least one example of a SQL loader that OSS devs could use as a template to develop their own providers for SQLite, Mongo, etc. I would not expect this library to bundle every provider under the sun or carry a lot of potentially unused dependencies (e.g. EF Core).

I did not say EF ;-) - I heard some of the authors of this package worked on EF before, but please please don't bind it to EF, or sql server or any other specific tech. Or course you want to get data into the learning system without going over to csv. But the data is usually already flattened anyway, so no need for a complex OR mapping tool - especially since the target data structure is Immutable tensor

Just bumping this.

Since SQL data source is such a common ask, we will implement a version of the SQL-backed data view (and a SQL-backed IDataReader). From what I understand, we'll probably use the primitives in System.Data for this, and, therefore, it'll be available as a separate NuGet (so that ML.NET core doesn't depend on System.Data.

Then again, nothing prevents anyone from implementing IDataView with any backing storage: NoSQL databases, magnetic tape readers or anything :)

Referencing https://github.com/dotnet/machinelearning/issues/1130 as is abouth the same issue. @Zruty0, just a note that there are inconsistencies between what various ADO.NET providers support from System.Data. asynchrony etc. and you might want to consider some more advanced things like streaming (not always properly supported). The linked issue has a link to one implementation that solves most of the problems if you want to see concrete code (other libraries have the same).

@singlis I think you looked into loading data from Sql Server, what was result of your investigation?

Hello, sorry for the late reply.

We currently integrate with IEnumerable and there is no binding to any specific database technology.
https://github.com/dotnet/machinelearning/blob/617f0f6bc3b0bcbf8e0af86594a706406c274de0/src/Microsoft.ML.Data/DataLoadSave/DataOperationsCatalog.cs#L77

There is work in progress sample in the machine-learning samples. Note the sample does use EF:
https://github.com/dotnet/machinelearning-samples/pull/295

I have not done any investigation yet into how this works with large datasets. This is something that we want to get an idea for what EF does -- but the end result will be fine tuning EF because at the end of the day ML.NET is working from an IEnumerable. How the data is pulled from the database and cached is a behavior for Entity Framework.

I hope this helps -- since SQL support is on the road map, I am closing this issue. If there is more to discuss, please re-open.

Hopefully this isn't in bad taste, but I have a written a small post on getting data in Azure SQL and then using that to create a model. Just in case an example without EF is wanted.

Basically, what @singlis mentioned, that the LoadFromEnumerable is the magic that is needed for SQL data.

Was this page helpful?
0 / 5 - 0 ratings