Pandas: Feature to read csv from hdfs:// URL

Created on 9 Nov 2017 · 20 comments · Source: pandas-dev/pandas

When running pandas in AWS, the following works perfectly fine:

pd.read_csv("s3://mybucket/data.csv")

But running the following does not:

pd.read_csv("hdfs:///tmp/data.csv")

It would be a good user experience to allow the hdfs:// scheme as well, similar to how http, ftp, s3, and file are valid schemes right now.

Labels: API Design, Enhancement, IO CSV, IO Network

All 20 comments

Based on the limited bit I know, dealing with authentication and all that can be a rabbit hole.

If you want to put together a prototype based around http://hdfs3.readthedocs.io/en/latest/, I think we'd add it.
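For reference, a minimal sketch of what such a prototype could look like using hdfs3; the host, port, and path are placeholders, and it assumes libhdfs3 is installed and the cluster is reachable without Kerberos:

import pandas as pd
from hdfs3 import HDFileSystem

# Connect to the HDFS namenode (placeholder host/port).
hdfs = HDFileSystem(host="namenode", port=8020)

# hdfs3 file objects are file-like, so they can be passed straight to read_csv.
with hdfs.open("/tmp/data.csv", "rb") as f:
    df = pd.read_csv(f)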

How should this be implemented? Should there also be a read_hdfs like the read_s3?

I believe we can use either hdfs (similar to s3fs) and/or pyarrow for this; it would be similar to the way we do s3 at the moment.

So, here is a quick comparison:

  • hdfs: Package for connecting to WebHDFS and HttpFS, which are REST protocols for accessing HDFS data
  • hdfs3: Wrapper around the library libhdfs3, which needs to be installed independently
  • pyarrow: Supports both engines: the native libhdfs and the separately installed libhdfs3
  • cyhdfs: Cython wrapper for the native libhdfs

These seem to be the actively maintained options (each had its latest release in 2017).

As pandas already has a pyarrow engine for parquet, it looks like having pyarrow with the native libhdfs would be the most universal option.
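For illustration, a rough sketch of what reading a CSV through pyarrow's (since-deprecated legacy) HDFS interface looks like; the host, port, and path are placeholders, and the environment (JAVA_HOME, HADOOP_HOME, CLASSPATH) is assumed to be set up for the native libhdfs driver:

import pandas as pd
import pyarrow as pa

# Connect via the native libhdfs driver (placeholder host/port).
fs = pa.hdfs.connect(host="namenode", port=8020)

# The returned file handle is file-like, so read_csv can consume it directly.
with fs.open("/tmp/data.csv", "rb") as f:
    df = pd.read_csv(f)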

This would be a great feature to have in Pandas. Is it still being worked on?

@sergei3000 you are welcome to submit a PR; pandas is an all-volunteer effort

Hi @jreback, I want to work on this PR.
The changes proposed in the previous PR are no longer relevant since pandas is using fsspec.
Do you have any suggestions on how to start?

I am not sure what the appropriate library for reading from hdfs
is nowadays, so that needs to be figured out. Note that pyarrow, I believe, does support this, so that's an option as well.

Otherwise this would be similar to how we implement other readers, e.g. gcs.

This is how I managed to read from hdfs:

import os

import pandas as pd
import pydoop.hdfs as hd

# Point pydoop at the cluster's Hadoop configuration directory.
os.environ['HADOOP_CONF_DIR'] = "/usr/hdp/2.6.4.0-91/hadoop/conf"

# pydoop.hdfs.open returns a file-like object, so read_csv can consume it directly.
with hd.open("/share/bla/bla/bla/filename.csv") as f:
    df = pd.read_csv(f)
Pandas is using fsspec for reading s3 / gcs files, and fsspec also supports reading from hdfs via pyarrow.

Here you can find some code samples for using pyarrow, written by Wes McKinney:

https://wesmckinney.com/blog/python-hdfs-interfaces/

Hey @jreback
I just checked reading / writing hdfs files (csv and parquet) with pandas, and as I guessed it works fine. It probably works because the project started to use fsspec for reading s3 / gcs files.
So I will continue working on adding tests.
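For context, this is roughly what the end-user call looks like once fsspec (with pyarrow installed) handles the hdfs:// scheme; the host, port, and paths are placeholders, and a properly configured Hadoop client environment is assumed:

import pandas as pd

# fsspec resolves the hdfs:// scheme through pyarrow's Hadoop filesystem,
# so read_csv/to_csv work on HDFS paths just like on s3:// or gcs:// paths.
df = pd.read_csv("hdfs://namenode:8020/tmp/data.csv")
df.to_csv("hdfs://namenode:8020/tmp/data_out.csv", index=False)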

Note that pyarrow's HDFS interface will be deprecated at some point. I guess the "legacy" interface will be around for a while, but fsspec will need to have its shim rewritten against the newer filesystem API that pyarrow provides, once it is stable. Hopefully this shouldn't affect users.

Hi @jreback
Should test_hdfs run in a separate Docker container (similar to the dask one) or in the pandas Docker container (which would require additional installations)?

We don't run any containers as part of the CI.

This is just mocked, which I think is fine.

If we really want full testing, then we would need to set up a new Azure job for this (not against it, but a bit overkill).

That said, if you want to, go ahead.

Note that dask does test its read_csv from HDFS: https://github.com/dask/dask/blob/master/dask/bytes/tests/test_hdfs.py#L131
