Kedro: Cannot read csv in chunks with pandas

Created on 4 Nov 2020 · 5 Comments · Source: quantumblacklabs/kedro

Description

Cannot read a CSV in chunks through the Kedro data catalog.

df = pd.read_csv(csv, chunksize=1000)
df.get_chunk()

Steps to Reproduce

train_dataset:
  type: pandas.CSVDataSet
  filepath: 'mycsv.csv'
  load_args:
    chunksize: 50000

df = catalog.load("train_dataset")
df.get_chunk()

ValueError: I/O operation on closed file.

Expected Result

I should be able to loop over the reader.

Actual Result

ValueError: I/O operation on closed file.

Your Environment

Include as many relevant details about the environment in which you experienced the bug:

  • Kedro version used (pip show kedro or kedro -V):
    kedro: 0.16.6
  • Python version used (python -V):
    3.7.5
  • Operating system and version:
    Ubuntu
Bug Report

All 5 comments

It's been a while since I have used chunksize. If I remember correctly, it returns an iterator of DataFrames (a pandas TextFileReader) rather than a single DataFrame.

chunks = catalog.load("train_dataset")

for chunk in chunks:
    # chunk is a DataFrame; do what you need with it
    process(chunk)  # process is a placeholder for your own function
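
For reference, here is a minimal sketch of the same loop using plain pandas, outside the Kedro catalog. The in-memory CSV and the variable names are illustrative assumptions, not taken from the original issue. It shows that each item produced by the reader is an ordinary DataFrame with at most chunksize rows.

```python
import io

import pandas as pd

# Small in-memory CSV standing in for a real file (illustrative data only).
csv_text = "a,b\n" + "\n".join(f"{i},{i * 2}" for i in range(10))

# With chunksize set, read_csv returns an iterator of DataFrames
# instead of a single DataFrame.
reader = pd.read_csv(io.StringIO(csv_text), chunksize=4)

chunk_sizes = []
total_b = 0
for chunk in reader:
    # Each chunk is an ordinary DataFrame with at most `chunksize` rows.
    chunk_sizes.append(len(chunk))
    total_b += int(chunk["b"].sum())

print(chunk_sizes, total_b)
```

The source stays open for the whole loop here, which is the condition the catalog-based load appears not to meet.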

@WaylonWalker Thanks for jumping in. I have read your blog about Kedro before, and it helped me understand some concepts better.

When I iterate over it, it throws an error saying the file is already closed.

I was able to replicate this. I set up a pipeline with a CSV and a catalog entry just as you did, and I run into the same error whether I use kedro run or catalog.load. I am not able to replicate the issue when loading with pandas directly, even if I use fsspec like pandas.CSVDataSet does. Someone with a deeper understanding of the internals may need to take a look.

I posted my replica of the issue here https://github.com/WaylonWalker/kedro_chunked.

> I have read your blog about Kedro before, and it helped me understand some concepts better.

That is awesome, and potentially motivating to keep making more content!

@WaylonWalker I did the same thing to check whether fsspec is the problem, and it does not seem to be.
catalog.load() first calls fsspec and then also calls the transformer. I suspect the transformer tries to read that generator and closes it.

I haven't dug into the transformer yet, so it would be great if someone with more knowledge could jump in.
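
The thread does not establish what closes the file, so the following is only a sketch of a possible workaround under that uncertainty, not a fix to Kedro itself. It uses plain pandas and a hypothetical helper, iter_csv_chunks, that opens the file inside a generator so the handle stays open until the last chunk has been consumed. The temporary file and its contents are made up for illustration.

```python
import os
import tempfile

import pandas as pd


def iter_csv_chunks(path, chunksize):
    """Yield DataFrame chunks while keeping the file open during iteration.

    Hypothetical helper for illustration; it is not part of Kedro or pandas.
    """
    # The `with` block lives inside the generator, so the handle is closed
    # only after the last chunk is consumed (or the generator is closed).
    with open(path, newline="") as handle:
        for chunk in pd.read_csv(handle, chunksize=chunksize):
            yield chunk


# Usage with a throwaway file (illustrative data only).
tmp_dir = tempfile.mkdtemp()
csv_path = os.path.join(tmp_dir, "mycsv.csv")
with open(csv_path, "w", newline="") as out:
    out.write("x\n" + "\n".join(str(i) for i in range(7)) + "\n")

row_counts = [len(chunk) for chunk in iter_csv_chunks(csv_path, chunksize=3)]
print(row_counts)
```

A node could call such a helper with the file path instead of loading the chunked reader through the catalog, at the cost of bypassing the catalog for that dataset.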

I'm facing the same issue. Does anyone have updates on this problem?