Machinelearning: Please expose ParquetDataLoader and implement ParquetWriter

Created on 26 Feb 2020 · 2 Comments · Source: dotnet/machinelearning

Apache Parquet is a popular data format in the industry. It is used in ScikitLearn, Spark and many other ML and big-data related tools. Currently ParquetReader is an internal class that is not exposed to end users.
Feature Request:

Please provide ParquetReader to ML.NET users
Please implement ParquetWriter for ML.NET users.
Please update the Parquet.NET version used by the framework to the latest release, as it offers bug fixes and perf improvements.

This would significantly simplify integration with various other ML and ETL processes.
This would also provide a good industry-standard data interop between ML.NET and other ML and data tools, as an alternative to CSVs.

P2 enhancement

GKrivosheev-rms

❤5

All 2 comments

Hi @GKrivosheev-rms, thanks for the suggestion :).

yaeldekel on 27 Feb 2020

👍3

If I might add a suggestion tied to this: Parquet supports many kinds of compression. The big ones are Snappy (often the default in Python packages) and gzip, but others are gaining traction as well because of file-size concerns. Some others that are supported by pyarrow are brotli, lz4, and zstd.

(forgive the link to my own blog)