Visidata: [wishlist] Autodetect file extensions only supported by pandas?

Created on 5 Feb 2020 · 14 comments · Source: saulpw/visidata

Currently, the method for interacting with files differs, depending on the format, e.g.:

vd file.tsv
vd -f pandas file.feather

One can easily alias the latter to something like vp, but choosing the right command still takes a little mental effort each time visidata is called.

Is there any reason that the main visidata ui cannot attempt to detect the file format, at least for some common formats that are only supported by pandas (e.g. parquet and feather), and use the pandas loader for these formats?

You could still leave the -f pandas as an option, if there are likely to be ambiguous file extensions where the choice should be left up to the user.
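
To make the request concrete, here is a minimal sketch of extension-based loader selection with an explicit override. The mapping `PANDAS_ONLY_EXTENSIONS` and the function `pick_loader` are hypothetical names used only for illustration; this is not how VisiData itself resolves loaders.

```python
from pathlib import Path

# Hypothetical mapping: extensions assumed to be readable only via pandas.
PANDAS_ONLY_EXTENSIONS = {'.feather', '.parquet'}


def pick_loader(filename, forced=None):
    """Return the name of the loader to use for `filename`.

    `forced` plays the role of the -f option: when given, it always wins.
    """
    if forced:
        return forced
    suffix = Path(filename).suffix.lower()
    if suffix in PANDAS_ONLY_EXTENSIONS:
        return 'pandas'
    # Otherwise fall back to the extension itself (e.g. 'tsv').
    return suffix.lstrip('.')


print(pick_loader('file.tsv'))
print(pick_loader('file.feather'))
print(pick_loader('file.tsv', forced='pandas'))
```

Keeping the forced argument as the first check means an explicit -f choice is never overridden by autodetection.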

khughitt

👀1

Most helpful comment

Which version of the pandas API are we going to use? (From my limited experience of doing QA for VisiData, the pandas API is relatively volatile. This is understandable (it is an ongoing-maturing library), but it means we are going to have to make a decision with which version to support.)

I was reading this issue and wanted to chime in on this point: Pandas 1.0.0 was just released recently and as part of that finally adopted a versioning policy which should make the API much less volatile.

tsibley on 14 Feb 2020

❤1 👍1

All 14 comments

Interesting idea @khughitt! It does seem like a convenience win to make pandas the default loader when it's the only option. Another possibility is declaring your own defaults in ~/.visidatarc:

vd.filetype('feather', PandasSheet)
vd.filetype('parquet', PandasSheet)

That's perhaps a middle of the road option between default native support and needing the -f option.

Since ~/.visidatarc settings trump the defaults, you'd also be able to override ambiguous file types like JSON or CSV if you find yourself always reaching for pandas instead.

ajkerrigan on 6 Feb 2020

❤2

@ajkerrigan Great suggestion!

khughitt on 7 Feb 2020

@ajkerrigan Do you know if this should work in the current stable version of visidata? I tried it out, but it doesn't look like "filetype" is defined for visidata.vdtui.VisiData in this version.

I tried installing the latest version from the development branch the other day, but couldn't get visidata and vsh to work together, so I reverted to stable.

khughitt on 7 Feb 2020

The stable branch indeed has a different api!

Could you please try:

open_feather = open_pandas

and if that does not work:

def open_feather(p):
    return PandasSheet(p.name, source=p, filetype='feather')

The pandas code is in visidata/loaders/_pandas.py, and the examples I am looking at are for stata/ .dta, which we did enable autodetection for.

Would you mind sharing what issues you were having with vsh and visidata working together?

anjakefala on 7 Feb 2020

❤1

@anjakefala Thanks - the first approach did the trick! Tested with both feather and parquet and works fine.

Do you happen to know if there is any appropriate way to query the version of the API from the config? I didn't see anything like vd.__version__.

I could just use try...except blocks, but figured it's better to be explicit if possible.

I'll try and reproduce the issues I was having the other day with the dev branch and report it if I'm able to. The only reason I didn't before, tbh, was that I wasn't 100% sure I was installing things correctly 😅

khughitt on 7 Feb 2020

@anjakefala Ha! Sure enough, the issues with the dev branch were all on my end.. I forgot that I had a system-wide stable installation of visidata, and it was conflicting with my pip --user-installed dev version..

I was thrown off by the NameError: name 'vdtui' is not defined error message and didn't even think to check for multiple installations.. Also makes me realize my $PATH order needs to be fixed..
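
For anyone hitting a similar conflict, a quick way to see which installation Python actually picks up is to ask the import system where a module would be loaded from. This is a generic sketch using the standard library; `module_origin` is a hypothetical helper name, not part of VisiData.

```python
import importlib.util


def module_origin(name):
    """Return the file a top-level module would be imported from, or None if not found."""
    spec = importlib.util.find_spec(name)
    return spec.origin if spec is not None else None


# Prints the path of the visidata package that `import visidata` would use,
# or None if no installation is visible to this interpreter.
print(module_origin('visidata'))
```

If the printed path points at the system-wide site-packages rather than the user directory, the stable installation is shadowing the dev one.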

khughitt on 7 Feb 2020

Oy! @khughitt. Python environments are not fun. =(

By the way, I would stick with stable, if your primary use-case is pandas! There are a few develop-pandas issues on the board. Unless, you want to help out by taking a look. =D

Do you happen to know if there is any appropriate way to query the version of the API from the config? I didn't see anything like vd.__version__.

I do happen to know!

from visidata import __version__

An example is in visidata/motd.py ^^.
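
As a rough sketch of how that version string could drive a decision in ~/.visidatarc: the helper below turns a version string into a comparable tuple. The name `version_tuple` and the `(2, 0)` cutoff are assumptions for illustration; the thread does not state at which version the API changed.

```python
import re


def version_tuple(version):
    """Extract the leading numeric components of a version string as a tuple of ints."""
    match = re.search(r'\d+(?:\.\d+)*', version)
    if not match:
        return ()
    return tuple(int(part) for part in match.group(0).split('.'))


# In ~/.visidatarc the string would come from: from visidata import __version__
example_version = '1.5.2'

# Hypothetical cutoff, chosen only to show the comparison.
ASSUMED_NEW_API_VERSION = (2, 0)

uses_old_api = version_tuple(example_version) < ASSUMED_NEW_API_VERSION
print(uses_old_api)
```

Comparing tuples avoids the string-comparison trap where '1.10' sorts before '1.9'.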

anjakefala on 7 Feb 2020

❤2

Good call @anjakefala !

@khughitt if you're looking for a way to decide which API to use at runtime, it _might_ be easier to check for the features you need rather than keeping track of versions. I'm not sure, but worth a shot:

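if callable(getattr(vd, 'filetype', None)):
    # develop branch: register the loader through vd.filetype
    vd.filetype('feather', PandasSheet)
else:
    # stable/v1.5 logic: alias the pandas opener, as suggested earlier in this thread
    open_feather = open_pandas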

On the topic of pandas, I wonder if we'd be able to tag team some of those issues @khughitt and/or @anjakefala ? I'm an infant when it comes to pandas, but slowly getting familiar with visidata's code. If you're up for taking a crack at some pandas issues as a team, we might be able to get somewhere together.

ajkerrigan on 7 Feb 2020

👍1

@ajkerrigan Sure thing! I probably can't dedicate too much time at the moment, and I still need to spend some time getting more familiar with visidata's API, first, but I'll try and start flagging issues as I come across them / submitting PR's when I'm able to.

Currently I use vd for a mix of files: tsv, feather, and parquet. I used to just use gzipped-tsv files for everything for consistency / support, but especially recently as I'm re-using files more in pipelines and/or parsing them in realtime for shiny applications, I've been shifting more to feather/parquet..

Some motivation: https://github.com/khughitt/benchmark-compression

khughitt on 7 Feb 2020

👍1

@ajkerrigan, that is really clever! Saul was considering reverting back to the open_ framework, and this would ensure that the same .visidatarc continues to work if we do revert.

Thanks so much for sharing your workflow @khughitt. =)

I would definitely like to tag team on approaching them. I am also uncomfortable with pandas.

I am going to outline the collection of thoughts I have on the matter. No one is under any commitment or expectation to act on this. =) It might be a helpful guide for understanding the issues.

As far as I understand, the current major issue with the branch is that when someone got selection working for the pandas loader, the change that selection required broke editing in pandas. I think the pandas loader will probably need both to be a usable loader.

This was the change in question: https://github.com/saulpw/visidata/blob/develop/visidata/loaders/_pandas.py#L71

The major questions will be:

What is the essence of the source for a pandas loader, and how do we best reconcile that with VisiData?
Which VisiData reference documentation needs to exist so that others can work on a VisiData loader asynchronously?
If we do have to choose between supporting selection or editing, which do we choose? If selection gets dropped, what does that mean for commands that require selection (e.g. fill-nulls)? This will be a big question for Saul.
Which version of the pandas API are we going to use? (From my limited experience of doing QA for VisiData, the pandas API is relatively volatile. This is understandable (it is an ongoing-maturing library), but it means we are going to have to make a decision with which version to support.)

anjakefala on 8 Feb 2020

👍1

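Which version of the pandas API are we going to use? (From my limited experience of doing QA for VisiData, the pandas API is relatively volatile. This is understandable (it is an ongoing-maturing library), but it means we are going to have to make a decision with which version to support.)

I was reading this issue and wanted to chime in on this point: Pandas 1.0.0 was just released recently and as part of that finally adopted a versioning policy which should make the API much less volatile.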

tsibley on 14 Feb 2020

❤1 👍1

Thanks, @tsibley! I retract my comment. I had no idea they had shipped 1.0. =D Good for them!

anjakefala on 16 Feb 2020

What if we add this to the pandas loader?

for ft in 'feather gbq orc parquet pickle sas stata'.split():
    globals()['open_'+ft] = lambda p,ft=ft: PandasSheet(p.name, source=p, filetype=ft)

Are there commonly used extensions for the above filetypes? Would this suffice to satisfy this issue?

saulpw on 29 Jun 2020

👍1

That seems like a helpful change. Maybe see if open_<ft> already exists first though, so it's more of a fallback mechanism that will play nice if/when standalone loaders come in for those types?
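
A minimal sketch of that fallback idea, written against a plain dict so it runs on its own. The `PandasSheet` class here is only a stand-in for VisiData's real class, and `register_pandas_fallbacks` is a hypothetical helper name; inside the loader module the namespace would be `globals()`.

```python
from types import SimpleNamespace


class PandasSheet:
    """Stand-in for VisiData's PandasSheet so the sketch is self-contained."""

    def __init__(self, name, source=None, filetype=None):
        self.name = name
        self.source = source
        self.filetype = filetype


def register_pandas_fallbacks(namespace, filetypes):
    """Add open_<ft> openers to `namespace` only where none exists yet."""
    registered = []
    for ft in filetypes:
        key = 'open_' + ft
        if key in namespace:
            continue  # a standalone loader already handles this type
        # ft=ft binds the current value; without it every lambda would see the last ft.
        namespace[key] = lambda p, ft=ft: PandasSheet(p.name, source=p, filetype=ft)
        registered.append(ft)
    return registered


# Pretend a native stata loader is already defined.
namespace = {'open_stata': 'native stata loader'}
added = register_pandas_fallbacks(namespace, 'feather parquet stata'.split())
sheet = namespace['open_parquet'](SimpleNamespace(name='data'))
print(added, sheet.filetype)
```

Skipping existing names keeps the pandas openers a pure fallback, so a dedicated loader added later takes precedence without touching this loop.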

ajkerrigan on 29 Jun 2020
