Machinelearning: Please expose ParquetDataLoader and implement ParquetWriter

Created on 26 Feb 2020  ·  2 Comments  ·  Source: dotnet/machinelearning

Apache Parquet is a popular data format in the industry. It is used in ScikitLearn, Spark, and many other ML and big-data tools. Currently, ParquetReader is an internal class that is not exposed to end users.
Feature Request:

  • Please expose ParquetReader to ML.NET users.
  • Please implement a ParquetWriter for ML.NET users.
  • Please update the Parquet.NET version used by the framework to the latest, as it offers bug fixes and performance improvements.

This would significantly simplify integration with various other ML and ETL processes.
This would also provide a good industry-standard data interop between ML.NET and other ML and data tools, as an alternative to CSVs.

P2 enhancement

All 2 comments

Hi @GKrivosheev-rms , thanks for the suggestion :).

If I might add a suggestion tied to this: Parquet supports many kinds of compression. The big ones are Snappy (often the default in Python packages) and gzip, but others are gaining traction as well for space and/or size reasons. Some others that are supported by pyarrow are brotli, lz4, and zstd.

(forgive the link to my own blog)
