Hi Saul,
Would you consider adding Parquet support to VisiData? From your experience, is it hard to add a new file format?
It's not hard to support a new format if there is a Python 3 module that supports it and the data can be loaded entirely into memory. See "How to create a loader for VisiData". It would be great to have support for small Parquet datasets like this.
But I believe Parquet datasets can get rather large. I am planning on making VisiData support datasets that can't be loaded entirely into local memory (like SQL databases, terabyte .tsv files, Parquet, etc). What's your use case for Parquet, and how large are the datasets that you would be connecting to?
I'm working with files of approx. 12 GB in memory, but very often I save intermediate results, or even the final results of my workflow, in Parquet format simply because it's so fast.
I'm used to loading Parquet files with pandas or the pyarrow.parquet package. As far as I know, it doesn't support iterative reading, just whole files (optionally only selected columns) at once. It uses threads (the number of threads is a parameter) and in my experience it's blazingly fast: gigabytes in a matter of seconds, hundreds of MB almost instantly.
I can try to check the option of creating a parquet loader for VisiData, if you agree that loading the data into memory at once (with pyarrow) and then converting them into rows is the way to go...
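A minimal sketch of the approach proposed above: read the whole file with pyarrow, then turn its columns into rows. The function names load_parquet_rows and columns_to_rows are hypothetical and are not part of VisiData or pyarrow; the only pyarrow calls assumed are pyarrow.parquet.read_table() and Table.to_pydict().

```python
def columns_to_rows(columns):
    """Turn {'a': [1, 2], 'b': ['x', 'y']} into (['a', 'b'], [(1, 'x'), (2, 'y')])."""
    names = list(columns)
    # zip(*...) transposes the per-column lists into per-row tuples
    rows = list(zip(*(columns[name] for name in names)))
    return names, rows


def load_parquet_rows(path, columns=None, use_threads=True):
    """Read an entire Parquet file into memory and return (column names, rows).

    Hypothetical helper: `columns` optionally restricts which columns are read.
    """
    import pyarrow.parquet as pq  # imported lazily so the helper above works without pyarrow

    table = pq.read_table(path, columns=columns, use_threads=use_threads)
    return columns_to_rows(table.to_pydict())
```

Everything is materialized as Python objects here, so this only suits files that fit comfortably in memory.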
Hm, I think integrating with pandas is a much faster way. Let me think about this for a bit.
Hi Saul,
Thanks for your amazing work on VisiData! Personally, I would like support for .dta files, which pandas currently supports. When you say that integrating with pandas is a much faster way, does that mean this change would add all the file formats supported by pandas?
Hi @cuchoi, yes, creating a pandas loader should support all the file formats that pandas supports (although it may be a bit more work to integrate with pandas overall). I've been reluctant to take on the additional dependencies of pandas, but it might happen in v1.2 (v1.1 is being released in the next couple of days).
As of 4c913f61cc05359305180cf277d18978a7201829, you can use vd -f pandas data.foo and it will call pandas.read_foo(). This should work for Parquet files too, as long as they have a .parquet extension. Let me know how well this works for you.
This will be in v1.4, released within the next week.
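The dispatch described above (a file named data.foo ends up in pandas.read_foo()) can be sketched roughly as follows. This is an illustration under assumptions, not VisiData's actual code; reader_name and load_with_pandas are hypothetical names.

```python
import importlib
from pathlib import Path


def reader_name(path):
    """Map 'data.parquet' to 'read_parquet', based only on the file extension."""
    ext = Path(path).suffix.lstrip(".").lower()
    if not ext:
        raise ValueError(f"{path!r} has no extension to dispatch on")
    return "read_" + ext


def load_with_pandas(path):
    """Call pandas.read_<ext>(path) and return whatever that reader returns."""
    pd = importlib.import_module("pandas")  # lazy import: pandas is an optional dependency
    name = reader_name(path)
    reader = getattr(pd, name, None)
    if reader is None:
        raise ValueError(f"pandas has no {name}() function")
    return reader(path)
```

Note that a purely extension-based lookup like this would not find a reader for .dta files, because the pandas function for Stata files is read_stata(), not read_dta().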
It works quite well, but some files raise an error, depending on how they were serialized:
File "/usr/local/lib/python3.7/site-packages/pyarrow/parquet.py", line 253, in read
use_threads=use_threads)
File "pyarrow/_parquet.pyx", line 1136, in pyarrow._parquet.ParquetReader.read_all
File "pyarrow/error.pxi", line 86, in pyarrow.lib.check_status
pyarrow.lib.ArrowNotImplementedError: Reading lists of structs from Parquet files not yet supported: categories: list<element: struct<@type: string, level: int64, localId: string, localName: string, name: string>>
@mycaule Does it work with a simple Python script? If so, it should work in VisiData; if not, then that's what has to be fixed for it to work.
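One way to run that check outside VisiData is a small helper that calls a reader on the file and reports any failure. This is only a sketch; check_reader is a hypothetical name, and the reader passed in is whatever function you want to test.

```python
import traceback


def check_reader(reader, path):
    """Call reader(path); return None on success, or the formatted traceback on failure."""
    try:
        reader(path)
    except Exception:
        return traceback.format_exc()
    return None
```

For example, check_reader(pandas.read_parquet, "data.parquet") returns None if pandas can read the file on its own; otherwise the returned traceback shows whether the error comes from pandas or pyarrow rather than from VisiData.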