Visidata: Parquet file type support?

Created on 2 Feb 2018 · 8 Comments · Source: saulpw/visidata

Hi Saul,
would you consider adding Parquet support to VisiData? In your experience, is it hard to add a new file format?

mficek

👍2

Most helpful comment

As of 4c913f61cc05359305180cf277d18978a7201829, you can use vd -f pandas data.foo and it will call pandas.read_foo(). This should work for Parquet files too, as long as they have a .parquet extension. Let me know how well this works for you.

This will be in v1.4, released within the next week.
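
The extension-to-reader mapping described above can be sketched roughly as follows. This is only an illustration of the idea, not VisiData's actual loader code; the helper names reader_name and load_with_pandas are hypothetical.

```python
from pathlib import Path


def reader_name(path):
    """Return the pandas function name implied by the file extension.

    Hypothetical helper: "data.parquet" maps to "read_parquet".
    """
    ext = Path(path).suffix.lstrip(".").lower()
    if not ext:
        raise ValueError(f"{path!r} has no extension to dispatch on")
    return "read_" + ext


def load_with_pandas(path, **kwargs):
    """Look up pandas.read_<ext> by name and call it on the file."""
    import pandas as pd  # imported lazily so pandas stays an optional dependency

    reader = getattr(pd, reader_name(path), None)
    if reader is None:
        raise ValueError(f"pandas has no {reader_name(path)}() function")
    return reader(path, **kwargs)
```

A lookup by name like this only finds a reader when the extension matches a pandas function name. For example, pandas reads Stata .dta files with read_stata, so this sketch would not find a reader for data.dta.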

saulpw on 14 Sep 2018

❤3

All 8 comments

It's not hard to support a new format, if there is a Python3 module that supports it and the data can be loaded entirely into memory. See How to create a loader for VisiData. It would be great to have support for small Parquet datasets like this.

But I believe Parquet datasets can get rather large. I am planning on making VisiData support datasets that can't be loaded entirely into local memory (like SQL databases, terabyte .tsv files, Parquet, etc). What's your use case for Parquet, and how large are the datasets that you would be connecting to?

saulpw on 2 Feb 2018

I'm working with files of approx. 12 GB in memory, but very often I save intermediate results, or even the final results of my workflow, in Parquet format simply because it's so fast.
I'm used to loading Parquet files with pandas or the pyarrow.parquet package. As far as I know, it doesn't support iterative reading, just whole files (optionally only selected columns) at once. It uses threads (the number of threads is a parameter), and from my experience it's blazingly fast: gigabytes in a matter of seconds, hundreds of MB almost instantly.
I can try creating a Parquet loader for VisiData, if you agree that loading the data into memory all at once (with pyarrow) and then converting it into rows is the way to go...
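
To make the proposal concrete, here is a minimal sketch of "load everything with pyarrow, then convert to rows". The function names columns_to_rows and load_parquet_rows are hypothetical, and this is not an actual VisiData loader; it only assumes pyarrow.parquet.read_table and Table.to_pydict.

```python
def columns_to_rows(columns):
    """Turn {'name': [values, ...]} into (column names, list of row tuples)."""
    names = list(columns)
    rows = list(zip(*(columns[name] for name in names)))
    return names, rows


def load_parquet_rows(path, columns=None, use_threads=True):
    """Read a whole Parquet file into memory and return it row by row.

    `columns` optionally restricts the read to a subset of column names.
    """
    import pyarrow.parquet as pq  # imported lazily; pyarrow must be installed

    table = pq.read_table(path, columns=columns, use_threads=use_threads)
    return columns_to_rows(table.to_pydict())
```

Converting to Python objects this way costs extra memory on top of the Arrow table, so it only suits files that comfortably fit in RAM.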

mficek on 3 Feb 2018

Hm, I think integrating with pandas is a much faster way. Let me think about this for a bit.

saulpw on 5 Feb 2018

Hi Saul,

Thanks for your amazing work on VisiData! Personally, I wanted support for .dta files, which Pandas currently supports. When you say that integrating with Pandas is a much faster way, does that mean this change would add all the file formats supported by Pandas?

cuchoi on 4 Mar 2018

Hi @cuchoi, yes, creating a Pandas loader should support all the file formats that Pandas supports (although that may be a bit more work to integrate with Pandas overall). I've been reluctant to take on the additional dependencies of Pandas, but it might happen in v1.2 (v1.1 is being released in the next couple of days).

saulpw on 4 Mar 2018

It's working quite well, but some files raise an error, depending on how they were serialized:

  File "/usr/local/lib/python3.7/site-packages/pyarrow/parquet.py", line 253, in read
    use_threads=use_threads)
  File "pyarrow/_parquet.pyx", line 1136, in pyarrow._parquet.ParquetReader.read_all
  File "pyarrow/error.pxi", line 86, in pyarrow.lib.check_status
pyarrow.lib.ArrowNotImplementedError: Reading lists of structs from Parquet files not yet supported: categories: list<element: struct<@type: string, level: int64, localId: string, localName: string, name: string>>