When running pandas in AWS, the following works perfectly fine:
pd.read_csv("s3://mybucket/data.csv")
But running the following does not:
pd.read_csv("hdfs:///tmp/data.csv")
It would be a good user experience to allow the hdfs:// scheme too, similar to how http, ftp, s3, and file are valid schemes right now.
Based on the limited bit I know, dealing with authentication and all that can be a rabbit hole.
If you want to put together a prototype based around http://hdfs3.readthedocs.io/en/latest/, I think we'd add it.
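For what it's worth, a prototype read path with hdfs3 might look roughly like this (host, port, and path are placeholders, not a proposed pandas API):
import pandas as pd
from hdfs3 import HDFileSystem

# connect to the namenode; host/port are placeholders for a real cluster
hdfs = HDFileSystem(host="namenode", port=8020)

# hand the open file object straight to read_csv
with hdfs.open("/tmp/data.csv", "rb") as f:
    df = pd.read_csv(f)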
How should this be implemented? Should there also be a read_hdfs like the read_s3?
I believe we can use hdfs (similar to s3fs) and/or pyarrow for this; it would be similar to the way we do s3 at the moment.
So, here is a quick comparison:
These seem to be the actively maintained options (each had its latest release in 2017).
As pandas already has a pyarrow engine for parquet, it looks like having pyarrow with the native libhdfs would be the most universal option.
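For illustration, a minimal sketch of that route with pyarrow's libhdfs-backed filesystem (host and port are placeholders, and this assumes libhdfs and the usual Hadoop environment variables are available):
import pandas as pd
import pyarrow as pa

# connect through the native libhdfs driver; host/port are placeholders
fs = pa.hdfs.connect(host="namenode", port=8020)

with fs.open("/tmp/data.csv", "rb") as f:
    df = pd.read_csv(f)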
This would be a great feature to have in Pandas. Is it still being worked on?
@sergei3000 you are welcome to submit a PR; pandas is an all-volunteer effort
Hi @jreback, I want to work on this PR.
The changes proposed in the previous PR are no longer relevant since pandas is using fsspec.
Do you have any suggestions on how to start?
I am not sure what the appropriate library to read from hdfs is nowadays, so that needs to be figured out. Note that pyarrow does support this, I believe, so that's an option as well.
Otherwise this would be similar to how we implement other readers, e.g. gcs.
Tests can be done in a similar manner to here: https://github.com/pandas-dev/pandas/blob/master/pandas/tests/io/test_gcs.py
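A rough sketch of what such a mocked test could look like, loosely following the approach in test_gcs.py (the MockHDFSFileSystem class and test name are hypothetical, not existing pandas code):
from io import BytesIO

import fsspec
import pandas as pd


def test_read_csv_hdfs():
    df = pd.DataFrame({"a": [1, 2], "b": [3, 4]})
    buffer = BytesIO()
    buffer.close = lambda: True  # keep the buffer alive across open() calls

    class MockHDFSFileSystem(fsspec.AbstractFileSystem):
        def _open(self, path, mode="rb", **kwargs):
            buffer.seek(0)
            return buffer

    # route hdfs:// URLs to the mock so no real cluster is needed
    fsspec.register_implementation("hdfs", MockHDFSFileSystem, clobber=True)

    df.to_csv("hdfs://tmp/data.csv", index=False)
    result = pd.read_csv("hdfs://tmp/data.csv")
    pd.testing.assert_frame_equal(result, df)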
This is how I managed to read from hdfs:
import os
import pandas as pd
import pydoop.hdfs as hd
os.environ['HADOOP_CONF_DIR'] = "/usr/hdp/2.6.4.0-91/hadoop/conf"
with hd.open("/share/bla/bla/bla/filename.csv") as f:
    df = pd.read_csv(f)
pandas is using fsspec for reading s3 / gcs files, and fsspec also supports reading from hdfs via pyarrow.
Here you can find some code samples of using pyarrow written by Wes McKinney.
Hey @jreback
I just checked reading / writing hdfs files (csv and parquet) with pandas, and as I guessed it works fine. It probably works because the project started to use fsspec for reading s3 / gcs files.
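Presumably the check amounted to something like the following (paths are placeholders; this assumes pyarrow is installed and the local Hadoop configuration points at a reachable cluster):
import pandas as pd

# fsspec dispatches hdfs:// URLs to its pyarrow-backed HDFS filesystem,
# picking up the namenode from the local Hadoop configuration
df = pd.read_csv("hdfs:///tmp/data.csv")
df.to_parquet("hdfs:///tmp/data.parquet")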
So I will continue working on adding tests.
Note that pyarrow's HDFS interface will be deprecated at some point. I guess the "legacy" interface will be around a while, but fsspec will need to have its shim rewritten against the newer filesystem that pyarrow provides, once it's stable. Hopefully, this shouldn't affect users.
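For reference, the newer pyarrow filesystem API that the shim would eventually target looks roughly like this (host and port are placeholders):
import pandas as pd
from pyarrow import fs

# new-style filesystem replacing the legacy pyarrow.hdfs interface
hdfs = fs.HadoopFileSystem("namenode", 8020)

with hdfs.open_input_stream("/tmp/data.csv") as f:
    df = pd.read_csv(f)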
Hi @jreback
Should the test_hdfs be able to run in a separate docker (similar to the dask docker) or in the pandas docker (which would require additional installations)?
We don't run any containers as part of the CI.
This is just mocked, which I think is fine.
If we really wanted full testing we would need to set up a new azure job for this (not against it, but a bit overkill).
That said, if you want to, go ahead.
Note that dask does test its read_csv from HDFS: https://github.com/dask/dask/blob/master/dask/bytes/tests/test_hdfs.py#L131