When integrating pandas CSV reading from S3 for local development (in Docker) with LocalStack or Minio containers, we need to be able to define a custom host as well as a custom port.
PR #12198 introduces the AWS_S3_HOST environment variable; I propose adding an AWS_S3_PORT variable as well. Something like:
import os
import boto

s3_host = os.environ.get('AWS_S3_HOST', 's3.amazonaws.com')
s3_port = os.environ.get('AWS_S3_PORT')
# boto expects an integer port, so convert the variable when it is set
s3_port = int(s3_port) if s3_port is not None else None
try:
    conn = boto.connect_s3(host=s3_host, port=s3_port)
except boto.exception.NoAuthHandlerFound:
    # no credentials available, fall back to an anonymous connection
    conn = boto.connect_s3(host=s3_host, port=s3_port, anon=True)
This would allow defining something like the following in docker-compose.yml, using Minio to serve the CSV files from a local S3 during development and AWS S3 in production:
environment:
  - AWS_ACCESS_KEY_ID=supersecret
  - AWS_SECRET_ACCESS_KEY=supersecret
  - AWS_S3_HOST=s3local
  - AWS_S3_PORT=9000
  - S3_USE_SIGV4=True
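For illustration, here is a minimal sketch of what the calling code could look like if the proposed AWS_S3_PORT variable were supported; the bucket and file names are placeholders:

import pandas as pd

# with AWS_S3_HOST/AWS_S3_PORT pointing at the Minio container, the
# application code stays the same in development and production
df = pd.read_csv('s3://my-bucket/data.csv')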
This is only applicable to pandas 0.18.X and 0.19.X, since 0.20.X uses s3fs. I would be willing to submit a PR for this.
We don't offer backports for any version before the last major one (0.20).
For the record, I ended up using a workaround with s3fs along with the change introduced in https://github.com/dask/s3fs/pull/69:
import pandas as pd
from s3fs.core import S3FileSystem

# point s3fs at the local Minio endpoint instead of the default AWS one
client_kwargs = {'endpoint_url': 'http://s3:9000'}
s3 = S3FileSystem(anon=False, client_kwargs=client_kwargs)
# pandas cannot infer compression from a file object, so pass it explicitly
df = pd.read_csv(s3.open('s3://bucket/file.csv.gz', mode='rb'), compression='gzip')
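The endpoint_url here ('http://s3:9000') is the service name and port of the Minio container from docker-compose.yml; for production the client_kwargs can simply be dropped so that s3fs talks to the default AWS endpoint.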