Pandas: read_csv fails with `TypeError: object cannot be converted to an IntegerDtype` yet succeeds when reading chunks

Created on 28 Feb 2019 · 17 comments · Source: pandas-dev/pandas

Code Sample, a copy-pastable example if possible

Download this file upload.txt

import pandas as pd

# the file is attached to the GitHub issue
filename = "upload.txt"
# this field is coded on 64 bits, so 'UInt64' looks like the right dtype
column = "tcp.options.mptcp.sendkey"

with open(filename) as fd:

    print("READ CHUNK BY CHUNK")

    res = pd.read_csv(
            fd,
            comment='#',
            sep='|',
            dtype={column: 'UInt64' },
            usecols=[column],
            chunksize=1
    )
    for chunk in res:
        print(chunk)



    fd.seek(0) # rewind

    print("READ THE WHOLE FILE AT ONCE ")
    res = pd.read_csv(
            fd,
            comment='#',
            sep='|',
            usecols=[column],
            dtype={"tcp.options.mptcp.sendkey": 'UInt64' }
    )
    print(res)





If I read in chunks, read_csv succeeds; if I try to read the whole column at once, I get:

Traceback (most recent call last):
  File "test2.py", line 34, in <module>
    dtype={"tcp.options.mptcp.sendkey": 'UInt64' }
  File "/nix/store/mhiszrb8cpicjkzgraq796asj2sxpjch-python3.7-pandas-0.24.1/lib/python3.7/site-packages/pandas/io/parsers.py", line 702, in parser_f
    return _read(filepath_or_buffer, kwds)
  File "/nix/store/mhiszrb8cpicjkzgraq796asj2sxpjch-python3.7-pandas-0.24.1/lib/python3.7/site-packages/pandas/io/parsers.py", line 435, in _read
    data = parser.read(nrows)
  File "/nix/store/mhiszrb8cpicjkzgraq796asj2sxpjch-python3.7-pandas-0.24.1/lib/python3.7/site-packages/pandas/io/parsers.py", line 1139, in read
    ret = self._engine.read(nrows)
  File "/nix/store/mhiszrb8cpicjkzgraq796asj2sxpjch-python3.7-pandas-0.24.1/lib/python3.7/site-packages/pandas/io/parsers.py", line 1995, in read
    data = self._reader.read(nrows)
  File "pandas/_libs/parsers.pyx", line 900, in pandas._libs.parsers.TextReader.read
  File "pandas/_libs/parsers.pyx", line 915, in pandas._libs.parsers.TextReader._read_low_memory
  File "pandas/_libs/parsers.pyx", line 992, in pandas._libs.parsers.TextReader._read_rows
  File "pandas/_libs/parsers.pyx", line 1124, in pandas._libs.parsers.TextReader._convert_column_data
  File "pandas/_libs/parsers.pyx", line 1155, in pandas._libs.parsers.TextReader._convert_tokens
  File "pandas/_libs/parsers.pyx", line 1235, in pandas._libs.parsers.TextReader._convert_with_dtype
  File "/nix/store/mhiszrb8cpicjkzgraq796asj2sxpjch-python3.7-pandas-0.24.1/lib/python3.7/site-packages/pandas/core/arrays/integer.py", line 308, in _from_sequence_of_strings
    return cls._from_sequence(scalars, dtype, copy)
  File "/nix/store/mhiszrb8cpicjkzgraq796asj2sxpjch-python3.7-pandas-0.24.1/lib/python3.7/site-packages/pandas/core/arrays/integer.py", line 303, in _from_sequence
    return integer_array(scalars, dtype=dtype, copy=copy)
  File "/nix/store/mhiszrb8cpicjkzgraq796asj2sxpjch-python3.7-pandas-0.24.1/lib/python3.7/site-packages/pandas/core/arrays/integer.py", line 111, in integer_array
    values, mask = coerce_to_array(values, dtype=dtype, copy=copy)
  File "/nix/store/mhiszrb8cpicjkzgraq796asj2sxpjch-python3.7-pandas-0.24.1/lib/python3.7/site-packages/pandas/core/arrays/integer.py", line 188, in coerce_to_array
    values.dtype))
TypeError: object cannot be converted to an IntegerDtype


Expected Output

I would like the call to read_csv to succeed without having to read in chunks (which seems to have other side effects as well).
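
A possible workaround in the meantime (a minimal sketch, assuming the values are plain decimal integers and that post-processing the column is acceptable): read the column as plain strings, then build the nullable integer column from Python ints, so full 64-bit keys keep their precision.

import pandas as pd

# Workaround sketch: read the column as strings, then convert through Python
# ints so that full 64-bit key values survive (a float detour would lose precision).
column = "tcp.options.mptcp.sendkey"
df = pd.read_csv("upload.txt", comment='#', sep='|', usecols=[column], dtype=str)
df[column] = pd.array(
    [None if pd.isna(v) else int(v) for v in df[column]],
    dtype="UInt64",
)
print(df.dtypes)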

Output of pd.show_versions()


I am using v0.23.4 with a patch from master to fix some other bug.
commit: None
python: 3.7.2.final.0
python-bits: 64
OS: Linux
OS-release: 4.19.0
machine: x86_64
processor:
byteorder: little
LC_ALL: None
LANG: fr_FR.UTF-8
LOCALE: fr_FR.UTF-8

pandas: 0+unknown
pytest: None
pip: 18.1
setuptools: 40.6.3
Cython: None
numpy: 1.16.0
scipy: 1.2.0
pyarrow: None
xarray: None
IPython: None
sphinx: None
patsy: None
dateutil: 2.7.5
pytz: 2018.7
blosc: None
bottleneck: 1.2.1
tables: 3.4.4
numexpr: 2.6.9
feather: None
matplotlib: 3.0.2
openpyxl: 2.5.12
xlrd: 1.1.0
xlwt: 1.3.0
xlsxwriter: None
lxml.etree: 4.2.6
bs4: 4.6.3
html5lib: 1.0.1
sqlalchemy: 1.2.14
pymysql: None
psycopg2: None
jinja2: None
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None
gcsfs: None

Labels: Bug, ExtensionArray, IO CSV

All 17 comments

Have you been able to narrow down the cause? Possibly start reading the first n rows, and then bisect from there, to see what line causes the failure?

That's part of the difficulty: depending on the chunk size, the exception is raised or not. With a size of one it succeeds; anything bigger and the read fails, and I don't get why.

I suspect a specific value in the CSV is causing that. I'd recommend trying with different values of nrows to see what that value is.
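
A rough way to do that (a sketch using the file and column names from the report; a linear scan rather than a true bisection):

import pandas as pd

# Grow nrows until read_csv raises, to narrow down the first offending row.
column = "tcp.options.mptcp.sendkey"
for n in range(1, 200):  # upper bound picked arbitrarily
    try:
        pd.read_csv("upload.txt", comment='#', sep='|',
                    usecols=[column], dtype={column: 'UInt64'}, nrows=n)
    except TypeError:
        print("conversion first fails when reading %d rows" % n)
        break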


Also, if you are able to share a file that can reproduce the issue, that would be great.

Sorry, I definitely had uploaded it but I may have messed up somewhere; it ended up not being visible. Anyway, I've put the file in the first post (upload.txt, but it's really a CSV). I think it's a bug because, reading line by line, no value appears to be a problem. The .csv file is generated, so there should be no errors in the values either.

When I tried to use your code to read the file, most of the values in the column showed up as missing, which might be the reason it's not reading as 'UInt64'. Reading it with the default format and/or as a string works.

I actually updated to pandas 0.24.1 because it supports empty rows via 'UInt64' (why else would it work when reading line by line?). 'UInt64' also works for other columns with empty values; there are just some columns for which it doesn't, and I can't fathom why.
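
One way to check whether a particular value is to blame (a sketch; it reads the column as plain strings and flags anything non-missing that does not parse as a number):

import pandas as pd

# Precision loss from to_numeric does not matter here; we only look at which
# entries coerce to NaN.
column = "tcp.options.mptcp.sendkey"
vals = pd.read_csv("upload.txt", comment='#', sep='|',
                   usecols=[column], dtype=str)[column]
coerced = pd.to_numeric(vals, errors='coerce')
print(vals[vals.notna() & coerced.isna()])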

Have you had a chance to debug this @teto?

I am not sure what else I can do, I've provided the data file and a standalone example.
If it reads several items it fails; if just one at a time, it works. It seems like a bug to me, and pandas is too complex for a casual user like me to just dive in and fix it.

Gotcha, hopefully someone has time to take a look, but you may be the expert here as this is fairly new.

cc @kprestel who implemented EA support for read_csv.

I'll be able to take a look at this tonight hopefully.

Sorry that I have no time to properly debug this, but I hope I can contribute a little bit of knowledge.

I'm running into the same problem as the OP when I read one of the sheets of an .xlsx file (pandas 0.24.2).
There are NaN values, but since pandas 0.24 that should work when doing .astype(pd.Int16Dtype()), right?

This gave the same problem as the OP:

df_sheet.age = df_sheet.age.astype(pd.Int16Dtype())

However, ugly as it is, this seemed to work for me:

df_sheet.age = df_sheet.age.astype('float')  # first convert to float before int
df_sheet.age = df_sheet.age.astype(pd.Int16Dtype())

I just ran into this - it looks much more general than a read_csv problem to me.

>>> pd.Series(["1", "2", "3"]).astype(pd.Int64Dtype())
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/lukestanbra/dev/venv/lib/python3.6/site-packages/pandas/core/generic.py", line 5698, in astype
    new_data = self._data.astype(dtype=dtype, copy=copy, errors=errors)
  File "/Users/lukestanbra/dev/venv/lib/python3.6/site-packages/pandas/core/internals/managers.py", line 582, in astype
    return self.apply("astype", dtype=dtype, copy=copy, errors=errors)
  File "/Users/lukestanbra/dev/venv/lib/python3.6/site-packages/pandas/core/internals/managers.py", line 442, in apply
    applied = getattr(b, f)(**kwargs)
  File "/Users/lukestanbra/dev/venv/lib/python3.6/site-packages/pandas/core/internals/blocks.py", line 625, in astype
    values = astype_nansafe(vals1d, dtype, copy=True)
  File "/Users/lukestanbra/dev/venv/lib/python3.6/site-packages/pandas/core/dtypes/cast.py", line 821, in astype_nansafe
    return dtype.construct_array_type()._from_sequence(arr, dtype=dtype, copy=copy)
  File "/Users/lukestanbra/dev/venv/lib/python3.6/site-packages/pandas/core/arrays/integer.py", line 354, in _from_sequence
    return integer_array(scalars, dtype=dtype, copy=copy)
  File "/Users/lukestanbra/dev/venv/lib/python3.6/site-packages/pandas/core/arrays/integer.py", line 135, in integer_array
    values, mask = coerce_to_array(values, dtype=dtype, copy=copy)
  File "/Users/lukestanbra/dev/venv/lib/python3.6/site-packages/pandas/core/arrays/integer.py", line 218, in coerce_to_array
    raise TypeError(f"{values.dtype} cannot be converted to an IntegerDtype")
TypeError: object cannot be converted to an IntegerDtype

I would expect that this should just work? As @NumesSanguis says above, converting via float does work, e.g.

>>> pd.Series(["1", "2", "3"]).astype(float).astype(pd.Int64Dtype())
0    1
1    2
2    3
dtype: Int64

This is using

>>> pd.__version__
'1.0.3'

@TomAugspurger - do you think a new issue needs to be opened for this?

I thought we already had an issue for that (possibly search for "strictness of _from_sequence"), but I may be wrong.


OK - that's good to know. It gets a bit too deep into the internals for me to follow, but it was interesting to see how you all talk about this kind of thing. If anyone else stumbles across this, the relevant issues are #33254, #32586 and #33607.

@NumesSanguis

Any ideas for a workaround if the integer (18 digits) is too big for float64?

@dekiesel
Sorry, I don't know.
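
One option that skips float64 entirely (a minimal sketch with a made-up series; it assumes the values are numeric strings or missing) is to go through Python ints, which have arbitrary precision:

import pandas as pd

# Convert big numeric strings to a nullable integer dtype without the
# float64 round-trip (the series here is made up for illustration).
s = pd.Series(["123456789012345678", None], name="big_id")
converted = pd.array([None if pd.isna(v) else int(v) for v in s], dtype="Int64")
print(pd.Series(converted, name=s.name))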
