Machinelearning: Please expose ParquetDataLoader and implement ParquetWriter

Created on 26 Feb 2020  ·  2 Comments  ·  Source: dotnet/machinelearning

Apache Parquet is a popular data format in the industry. It is used in scikit-learn, Spark, and many other ML and big-data tools. Currently ParquetReader is an internal class that is not exposed to end users.
Feature Request:

  • Please expose ParquetReader to ML.NET users.
  • Please implement a ParquetWriter for ML.NET users.
  • Update the Parquet.NET version used by the framework to the latest, as it offers bug fixes and performance improvements.

This would significantly simplify integration with various other ML and ETL processes.
This would also provide a good industry-standard data interop between ML.NET and other ML and data tools, as an alternative to CSVs.

P2 enhancement

All 2 comments

Hi @GKrivosheev-rms, thanks for the suggestion :).

If I might add a suggestion tied to this: Parquet supports many kinds of compression. The big ones are snappy (often the default in Python packages) and gzip, but others are gaining traction as well for space and/or size reasons. Some others that are supported by pyarrow are brotli, lz4, and zstd.

(forgive the link to my own blog)
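
To make the compression point concrete, here is a minimal Python sketch that writes the same table once per codec with pyarrow and reports the resulting file sizes. It is only an illustration of what interop with other tools looks like today, not anything ML.NET provides. The function name `parquet_sizes_by_codec`, the file names, and the sample data are all made up for this example, and it assumes the installed pyarrow build ships every codec listed.

```python
import os
import tempfile

# Codec names taken from the comment above, spelled the way pyarrow accepts them.
CODECS = ("snappy", "gzip", "brotli", "lz4", "zstd")


def parquet_sizes_by_codec(table, out_dir, codecs=CODECS):
    """Write `table` once per codec and return {codec: file size in bytes}."""
    # Imported lazily so this module still loads where pyarrow is not installed.
    import pyarrow.parquet as pq

    sizes = {}
    for codec in codecs:
        # Hypothetical file naming, one file per codec.
        path = os.path.join(out_dir, f"sample_{codec}.parquet")
        pq.write_table(table, path, compression=codec)
        sizes[codec] = os.path.getsize(path)
    return sizes


if __name__ == "__main__":
    try:
        import pyarrow as pa
    except ImportError:
        print("pyarrow is not installed; nothing to demonstrate")
    else:
        # Hypothetical sample data: a repetitive column compresses well.
        table = pa.table({"id": list(range(10_000)), "label": ["a", "b"] * 5_000})
        with tempfile.TemporaryDirectory() as tmp:
            for codec, size in parquet_sizes_by_codec(table, tmp).items():
                print(f"{codec}: {size} bytes")
```

The actual sizes depend on the data and the pyarrow version, so none are quoted here. A reader or writer exposed by ML.NET would need to handle whichever of these codecs the underlying Parquet library supports in order to exchange files with such tools.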
