Currently, the method for interacting with files differs, depending on the format, e.g.:
vd file.tsv
vd -f pandas file.feather
One can easily alias the latter to something like vp, but it still requires a little mental effort each time visidata is called to choose the right command.
Is there any reason that the main visidata ui cannot attempt to detect the file format, at least for some common formats that are only supported by pandas (e.g. parquet and feather), and use the pandas loader for these formats?
You could still leave the -f pandas as an option, if there are likely to be ambiguous file extensions where the choice should be left up to the user.
Interesting idea @khughitt! It does seem like a convenience win to make pandas the default loader when it's the only option. Another possibility is declaring your own defaults in ~/.visidatarc:
vd.filetype('feather', PandasSheet)
vd.filetype('parquet', PandasSheet)
That's perhaps a middle of the road option between default native support and needing the -f option.
Since ~/.visidatarc settings trump the defaults, you'd also be able to override ambiguous file types like JSON or CSV if you find yourself always reaching for pandas instead.
@ajkerrigan Great suggestion!
@ajkerrigan Do you know if this should work in the current stable version of visidata? I tried it out, but it doesn't look like "filetype" is defined for visidata.vdtui.VisiData in this version.
I tried installing the latest version from the development branch the other day, but couldn't get visidata and vsh to work together, so I reverted to stable.
The stable branch indeed has a different api!
Could you please try:
open_feather = open_pandas
and if that does not work:
def open_feather(p):
    return PandasSheet(p.name, source=p, filetype='feather')
The pandas code is in visidata/loaders/_pandas.py, and the examples I am looking at are for stata/ .dta, which we did enable autodetection for.
Would you mind sharing what issues you were having with vsh and visidata working together?
@anjakefala Thanks - the first approach did the trick! Tested with both feather and parquet and works fine.
Do you happen to know if there is any appropriate way to query the version of the API from the config? I didn't see anything like vd.__version__.
I could just use try...except blocks, but figured it's better to be explicit if possible.
I'll try and reproduce the issues I was having the other day with the dev branch and report it if I'm able to. The only reason I didn't before, tbh, was that I wasn't 100% sure I was installing things correctly :sweat_smile:
@anjakefala Ha! Sure enough, the issues with the dev branch were all on my end.. I forgot that I had a system-wide stable installation of visidata, and it was conflicting with my pip --user-installed dev version..
I was thrown off by the NameError: name 'vdtui' is not defined error message and didn't even think to check for multiple installations.. Also makes me realize my $PATH order needs to be fixed..
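For anyone hitting the same kind of conflict, here is a rough sketch of how one might check which copy of a package and which executable the current environment actually resolves. The helper name locate_install is made up for illustration; it uses only the Python standard library and is not part of visidata.

```python
import importlib.util
import shutil


def locate_install(module_name, command_name):
    """Report the module file and the executable that this environment resolves.

    locate_install is a hypothetical helper, not a visidata function.
    """
    try:
        spec = importlib.util.find_spec(module_name)
    except (ImportError, ValueError):
        spec = None
    return {
        # File the interpreter would import, or None if the module is not found
        'module_path': spec.origin if spec is not None else None,
        # First matching command on $PATH, or None if there is none
        'executable': shutil.which(command_name),
    }


print(locate_install('visidata', 'vd'))
```

If the two paths point into different installations (say, a system-wide one and a pip --user one), that mismatch is a likely suspect.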
Oy! @khughitt. Python environments are not fun. =(
By the way, I would stick with stable if your primary use-case is pandas! There are a few develop-pandas issues on the board. Unless you want to help out by taking a look. =D
Do you happen to know if there is any appropriate way to query the version of the API from the config? I didn't see anything like vd.__version__.
I do happen to know!
from visidata import __version__
An example is in visidata/motd.py ^^.
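If a version comparison is really what's wanted in the config, something along these lines might work. This is only a sketch: version_tuple is a hypothetical helper, and it assumes __version__ is a dot-separated string whose leading components are numeric, which may not hold for every release or dev build.

```python
import re


def version_tuple(version):
    """Turn '1.5.2' into (1, 5, 2), stopping at the first non-numeric component.

    version_tuple is a hypothetical helper, not part of visidata.
    """
    parts = []
    for piece in version.split('.'):
        match = re.match(r'\d+', piece)
        if match is None:
            break
        parts.append(int(match.group()))
    return tuple(parts)


try:
    from visidata import __version__ as vd_version
except ImportError:
    vd_version = None  # visidata is not importable in this environment

if vd_version is not None:
    print(version_tuple(vd_version))
```

Comparing tuples such as version_tuple(vd_version) >= (2,) avoids the pitfalls of comparing version strings directly.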
Good call @anjakefala !
@khughitt if you're looking for a way to decide which API to use at runtime, it _might_ be easier to check for the features you need rather than keeping track of versions. I'm not sure, but worth a shot:
if callable(getattr(vd, 'filetype', None)):
    vd.filetype('feather', PandasSheet)
else:
    # stable/v1.5 logic (the alias suggested earlier in this thread)
    open_feather = open_pandas
On the topic of pandas, I wonder if we'd be able to tag team some of those issues @khughitt and/or @anjakefala ? I'm an infant when it comes to pandas, but slowly getting familiar with visidata's code. If you're up for taking a crack at some pandas issues as a team, we might be able to get somewhere together.
@ajkerrigan Sure thing! I probably can't dedicate too much time at the moment, and I still need to spend some time getting more familiar with visidata's API, first, but I'll try and start flagging issues as I come across them / submitting PR's when I'm able to.
Currently I use vd for a mix of files: tsv, feather, and parquet. I used to just use gzipped-tsv files for everything for consistency / support, but especially recently as I'm re-using files more in pipelines and/or parsing them in realtime for shiny applications, I've been shifting more to feather/parquet..
Some motivation: https://github.com/khughitt/benchmark-compression
@ajkerrigan, that is really clever! Saul was considering reverting back to the open_ framework, and this would ensure that the .visidatarc continues to work if we do revert.
Thanks so much for sharing your workflow @khughitt. =)
I would definitely like to tag team on approaching them. I am also uncomfortable with pandas.
I am going to outline the collection of thoughts I have on the matter. No one is under any commitment or expectation to act on this. =) It might be a helpful guide for understanding the issues.
As far as I understand, the current major issue with the branch is that when someone got selection working for the pandas loader, the change that this required broke editing in pandas. I think the pandas loader will probably need both to be usable.
This was the change in question: https://github.com/saulpw/visidata/blob/develop/visidata/loaders/_pandas.py#L71
The major questions will be:
Which version of the pandas API are we going to use? (From my limited experience of doing QA for VisiData, the pandas API is relatively volatile. This is understandable (it is a still-maturing library), but it means we are going to have to decide which version to support.)
I was reading this issue and wanted to chime in on this point: Pandas 1.0.0 was just released recently and as part of that finally adopted a versioning policy which should make the API much less volatile.
Thanks, @tsibley! I retract my comment. I had no idea they had shipped 1.0. =D Good for them!
What if we add this to the pandas loader?
for ft in 'feather gbq orc parquet pickle sas stata'.split():
    globals()['open_'+ft] = lambda p,ft=ft: PandasSheet(p.name, source=p, filetype=ft)
Are there commonly used extensions for the above filetypes? Would this suffice to satisfy this issue?
That seems like a helpful change. Maybe see if open_<ft> already exists first though, so it's more of a fallback mechanism that will play nice if/when standalone loaders come in for those types?
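A rough sketch of what that fallback check could look like. To keep it runnable on its own, PandasSheet here is a stand-in class, the loaders dict stands in for the loader module's globals(), and register_pandas_fallbacks is a made-up name; none of this is the actual visidata code.

```python
from types import SimpleNamespace


class PandasSheet:
    """Stand-in for VisiData's PandasSheet so the sketch runs by itself."""
    def __init__(self, name, source=None, filetype=None):
        self.name = name
        self.source = source
        self.filetype = filetype


def open_stata(p):
    """Pretend a standalone stata loader already exists."""
    return 'standalone stata loader'


def register_pandas_fallbacks(namespace, filetypes):
    """Add open_<ft> to namespace only where no such name exists yet."""
    added = []
    for ft in filetypes:
        name = 'open_' + ft
        if name in namespace:
            continue  # an existing loader wins over the pandas fallback
        # ft=ft binds the current value; without it every lambda would see the last ft
        namespace[name] = lambda p, ft=ft: PandasSheet(p.name, source=p, filetype=ft)
        added.append(name)
    return added


# In the real loader module this dict would presumably be globals()
loaders = {'open_stata': open_stata}
added = register_pandas_fallbacks(loaders, 'feather gbq orc parquet pickle sas stata'.split())

path = SimpleNamespace(name='example')  # minimal stand-in for a visidata Path
sheet = loaders['open_feather'](path)
print(added, sheet.filetype)
```

With this shape, a dedicated loader added later for any of these types would simply take precedence, and the pandas version would stay out of its way.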