Pandas: Feature to read csv from hdfs:// URL

Created on 9 Nov 2017 · 20 comments · Source: pandas-dev/pandas

When running pandas in AWS, the following works perfectly fine:

pd.read_csv("s3://mybucket/data.csv")

But running the following does not:

pd.read_csv("hdfs:///tmp/data.csv")

It would be a good user experience to allow the hdfs:// scheme as well, similar to how http, ftp, s3, and file are valid schemes right now.

Labels: API Design, Enhancement, IO CSV, IO Network

All 20 comments

Based on the limited bit I know, dealing with authentication and all that can be a rabbit hole.

If you want to put together a prototype based around http://hdfs3.readthedocs.io/en/latest/, I think we'd add it.
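For reference, a minimal sketch of what such a prototype could look like using hdfs3; the host, port, and path are placeholders, and it assumes libhdfs3 is installed and the cluster is reachable without Kerberos:

import pandas as pd
from hdfs3 import HDFileSystem

# Connect to the HDFS namenode (placeholder host/port).
hdfs = HDFileSystem(host="namenode", port=8020)

# hdfs3 file objects are file-like, so they can be passed straight to read_csv.
with hdfs.open("/tmp/data.csv", "rb") as f:
    df = pd.read_csv(f)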

How should this be implemented? Should there also be a read_hdfs like the read_s3?

I believe we can use either hdfs (similar to s3fs) and/or pyarrow for this; it would be similar to the way we do s3 at the moment.

So, here is a quick comparison:

  • hdfs: Package for connecting to WebHDFS and HttpFS, which are REST protocols for accessing HDFS data
  • hdfs3: Wrapper around the library libhdfs3, which needs to be installed independently
  • pyarrow: Supports both engines: the native libhdfs and the separately installed libhdfs3
  • cyhdfs: Cython wrapper for the native libhdfs

These seem to be the actively maintained options (each had its latest release in 2017).

As pandas already has a pyarrow engine for parquet, it looks like having pyarrow with the native libhdfs would be the most universal option.
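For illustration, a rough sketch of what reading a CSV through pyarrow's (since-deprecated legacy) HDFS interface looks like; the host, port, and path are placeholders, and the environment (JAVA_HOME, HADOOP_HOME, CLASSPATH) is assumed to be set up for the native libhdfs driver:

import pandas as pd
import pyarrow as pa

# Connect via the native libhdfs driver (placeholder host/port).
fs = pa.hdfs.connect(host="namenode", port=8020)

# The returned file handle is file-like, so read_csv can consume it directly.
with fs.open("/tmp/data.csv", "rb") as f:
    df = pd.read_csv(f)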

This would be a great feature to have in Pandas. Is it still being worked on?

@sergei3000 you are welcome to submit a PR; pandas is an all-volunteer effort

Hi @jreback, I want to work on this PR.
The changes proposed in the previous PR are no longer relevant since pandas is using fsspec.
Do you have any suggestions on how to start?

I am not sure what the appropriate library for reading from hdfs
is nowadays, so that needs to be figured out. Note that pyarrow, I believe, does support this, so that's an option as well.

Otherwise this would be similar to how we implement other readers, e.g. gcs.

This is how I managed to read from hdfs:

import os

import pandas as pd
import pydoop.hdfs as hd

# Point pydoop at the cluster's Hadoop configuration directory.
os.environ['HADOOP_CONF_DIR'] = "/usr/hdp/2.6.4.0-91/hadoop/conf"

# pydoop.hdfs.open returns a file-like object, so read_csv can consume it directly.
with hd.open("/share/bla/bla/bla/filename.csv") as f:
    df = pd.read_csv(f)
Pandas is using fsspec for reading s3 / gcs files, and fsspec also supports reading from hdfs via pyarrow.

Here you can find some code samples for using pyarrow, written by Wes McKinney:

https://wesmckinney.com/blog/python-hdfs-interfaces/

Hey @jreback
I just checked reading / writing hdfs files (csv and parquet) with pandas, and as I guessed it works fine. It probably works because the project started to use fsspec for reading s3 / gcs files.
So I will continue working on adding tests.
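For context, this is roughly what the end-user call looks like once fsspec (with pyarrow installed) handles the hdfs:// scheme; the host, port, and paths are placeholders, and a properly configured Hadoop client environment is assumed:

import pandas as pd

# fsspec resolves the hdfs:// scheme through pyarrow's Hadoop filesystem,
# so read_csv/to_csv work on HDFS paths just like on s3:// or gcs:// paths.
df = pd.read_csv("hdfs://namenode:8020/tmp/data.csv")
df.to_csv("hdfs://namenode:8020/tmp/data_out.csv", index=False)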

Note that pyarrow's HDFS interface will be deprecated at some point. I guess the "legacy" interface will be around for a while, but fsspec will need to have its shim rewritten against the newer filesystem API that pyarrow provides, once it is stable. Hopefully this shouldn't affect users.

Hi @jreback
Should test_hdfs run in a separate Docker container (similar to the dask one) or in the pandas Docker container (which would require additional installations)?

We don't run any containers as part of the CI.

This is just mocked, which I think is fine.

If we really want full testing, then we would need to set up a new Azure job for this (not against it, but a bit overkill).

That said, if you want to, go ahead.

Note that dask does test its read_csv from HDFS: https://github.com/dask/dask/blob/master/dask/bytes/tests/test_hdfs.py#L131
