Download this file upload.txt
import pandas as pd

# The file is attached to the GitHub issue.
filename = "upload.txt"
# This field is coded on 64 bits, so 'UInt64' looks perfect.
column = "tcp.options.mptcp.sendkey"

with open(filename) as fd:
    print("READ CHUNK BY CHUNK")
    res = pd.read_csv(
        fd,
        comment='#',
        sep='|',
        dtype={column: 'UInt64'},
        usecols=[column],
        chunksize=1,
    )
    for chunk in res:
        print(chunk)

    fd.seek(0)  # rewind
    print("READ THE WHOLE FILE AT ONCE")
    res = pd.read_csv(
        fd,
        comment='#',
        sep='|',
        usecols=[column],
        dtype={column: 'UInt64'},
    )
    print(res)
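Until this is fixed, one workaround sketch is to read the offending column as plain strings and build the nullable array by hand. The file contents below are made up to stand in for upload.txt (a pipe-separated file whose key column has missing entries, which is what triggers the error); going through Python ints instead of float64 keeps full precision for values close to 2**64.

```python
import io

import pandas as pd

# Hypothetical stand-in for upload.txt: a pipe-separated file whose
# key column has missing entries.
csv_text = """frame|tcp.options.mptcp.sendkey
1|18446744073709551615
2|
3|42
"""

column = "tcp.options.mptcp.sendkey"

# Read the column as strings, then construct the nullable UInt64 array
# manually; missing fields come back as NaN and are mapped to pd.NA.
df = pd.read_csv(io.StringIO(csv_text), sep="|", dtype={column: str})
df[column] = pd.array(
    [int(v) if pd.notna(v) else pd.NA for v in df[column]],
    dtype="UInt64",
)
print(df[column].dtype)  # UInt64
```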
If I read in chunks, read_csv succeeds; if I try to read the column all at once, I get:
Traceback (most recent call last):
File "test2.py", line 34, in <module>
dtype={"tcp.options.mptcp.sendkey": 'UInt64' }
File "/nix/store/mhiszrb8cpicjkzgraq796asj2sxpjch-python3.7-pandas-0.24.1/lib/python3.7/site-packages/pandas/io/parsers.py", line 702, in parser_f
return _read(filepath_or_buffer, kwds)
File "/nix/store/mhiszrb8cpicjkzgraq796asj2sxpjch-python3.7-pandas-0.24.1/lib/python3.7/site-packages/pandas/io/parsers.py", line 435, in _read
data = parser.read(nrows)
File "/nix/store/mhiszrb8cpicjkzgraq796asj2sxpjch-python3.7-pandas-0.24.1/lib/python3.7/site-packages/pandas/io/parsers.py", line 1139, in read
ret = self._engine.read(nrows)
File "/nix/store/mhiszrb8cpicjkzgraq796asj2sxpjch-python3.7-pandas-0.24.1/lib/python3.7/site-packages/pandas/io/parsers.py", line 1995, in read
data = self._reader.read(nrows)
File "pandas/_libs/parsers.pyx", line 900, in pandas._libs.parsers.TextReader.read
File "pandas/_libs/parsers.pyx", line 915, in pandas._libs.parsers.TextReader._read_low_memory
File "pandas/_libs/parsers.pyx", line 992, in pandas._libs.parsers.TextReader._read_rows
File "pandas/_libs/parsers.pyx", line 1124, in pandas._libs.parsers.TextReader._convert_column_data
File "pandas/_libs/parsers.pyx", line 1155, in pandas._libs.parsers.TextReader._convert_tokens
File "pandas/_libs/parsers.pyx", line 1235, in pandas._libs.parsers.TextReader._convert_with_dtype
File "/nix/store/mhiszrb8cpicjkzgraq796asj2sxpjch-python3.7-pandas-0.24.1/lib/python3.7/site-packages/pandas/core/arrays/integer.py", line 308, in _from_sequence_of_strings
return cls._from_sequence(scalars, dtype, copy)
File "/nix/store/mhiszrb8cpicjkzgraq796asj2sxpjch-python3.7-pandas-0.24.1/lib/python3.7/site-packages/pandas/core/arrays/integer.py", line 303, in _from_sequence
return integer_array(scalars, dtype=dtype, copy=copy)
File "/nix/store/mhiszrb8cpicjkzgraq796asj2sxpjch-python3.7-pandas-0.24.1/lib/python3.7/site-packages/pandas/core/arrays/integer.py", line 111, in integer_array
values, mask = coerce_to_array(values, dtype=dtype, copy=copy)
File "/nix/store/mhiszrb8cpicjkzgraq796asj2sxpjch-python3.7-pandas-0.24.1/lib/python3.7/site-packages/pandas/core/arrays/integer.py", line 188, in coerce_to_array
values.dtype))
TypeError: object cannot be converted to an IntegerDtype
I would like the call to read_csv to succeed without having to read in chunks (which seems to have other side effects as well).
pd.show_versions()
I am using v0.23.4 with a patch from master to fix some other bug.
commit: None
python: 3.7.2.final.0
python-bits: 64
OS: Linux
OS-release: 4.19.0
machine: x86_64
processor:
byteorder: little
LC_ALL: None
LANG: fr_FR.UTF-8
LOCALE: fr_FR.UTF-8
pandas: 0+unknown
pytest: None
pip: 18.1
setuptools: 40.6.3
Cython: None
numpy: 1.16.0
scipy: 1.2.0
pyarrow: None
xarray: None
IPython: None
sphinx: None
patsy: None
dateutil: 2.7.5
pytz: 2018.7
blosc: None
bottleneck: 1.2.1
tables: 3.4.4
numexpr: 2.6.9
feather: None
matplotlib: 3.0.2
openpyxl: 2.5.12
xlrd: 1.1.0
xlwt: 1.3.0
xlsxwriter: None
lxml.etree: 4.2.6
bs4: 4.6.3
html5lib: 1.0.1
sqlalchemy: 1.2.14
pymysql: None
psycopg2: None
jinja2: None
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None
gcsfs: None
Have you been able to narrow down the cause? Possibly start reading the first n rows, and then bisect from there, to see what line causes the failure?
That's part of the difficulty: depending on the chunk size, the exception is raised or not. With a size of one it succeeds; any bigger and the read fails, and I don't get why.
I suspect a specific value in the CSV is causing that. I'd recommend trying with different values of nrows to see what that value is.
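The bisection suggested here can be sketched as a small helper. `find_first_failure` and its `read` callable are illustrative names, not pandas API; in practice `read` would be something like `lambda n: pd.read_csv(path, sep="|", usecols=[column], dtype={column: "UInt64"}, nrows=n)`.

```python
def find_first_failure(read, max_rows):
    """Binary-search the smallest nrows for which read(nrows) raises.

    `read` is any callable that parses the first `nrows` rows.
    Returns max_rows + 1 if no prefix of the file fails.
    """
    lo, hi = 1, max_rows + 1
    while lo < hi:
        mid = (lo + hi) // 2
        try:
            read(mid)
            lo = mid + 1   # first mid rows parse fine; look later
        except TypeError:
            hi = mid       # the offending row is within the first mid
    return lo
```

The first row number returned is where a problematic value (or a chunk boundary effect) first shows up, which narrows down what to inspect in the file.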
Also, if you are able to share a file that can reproduce the issue, that would be great.
Sorry, I definitely had uploaded it, but I may have messed up somewhere and it ended up not being visible. Anyway, I've put the file in the first post (upload.txt, but it's really a CSV). I think it's a bug because, reading line by line, no value appears to be a problem. The .csv file is generated, so there should be no errors in the values either.
When I tried to use your code to read the file, most of the values in the column showed up as missing, which might be why it's not reading as 'UInt64'. Reading it with the default dtype and/or as string works.
I actually updated to pandas 0.24.1 because it supports empty rows via 'UInt64' (else why would it work when reading line by line?). 'UInt64' also works for other columns with empty values; there are just some columns for which it doesn't, and I can't fathom why.
Have you had a chance to debug this @teto?
I am not sure what else I can do, I've provided the data file and a standalone example.
If it reads several items, it fails; if just one at a time, it works. It seems like a bug to me, and pandas is too complex for a casual user like me to just dive in and fix it.
Gotcha, hopefully someone has time to take a look, but you may be the expert here as this is fairly new.
cc @kprestel who implemented EA support for read_csv.
I'll be able to take a look at this tonight hopefully.
Sorry that I have no time to properly debug this, but I hope I can contribute a little bit of knowledge.
I'm running into the same problem as OP when I read one of the sheets of a .xlsx file (pandas 0.24.2).
There are NaN values, but from pandas 0.24 that should work when doing .astype(pd.Int16Dtype()), right?
This gave the same problem as OP:
df_sheet.age = df_sheet.age.astype(pd.Int16Dtype())
It's ugly, but this seemed to work for me:
df_sheet.age = df_sheet.age.astype('float') # first convert to float before int
df_sheet.age = df_sheet.age.astype(pd.Int16Dtype())
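A self-contained version of this workaround, with made-up values: the object Series below simulates what read_excel returns for a numeric column containing blanks (floats plus NaN), and the two-step cast through float produces the nullable integer dtype.

```python
import numpy as np
import pandas as pd

# Simulated read_excel output: an object column of floats with NaN
# ("ages" is an illustrative name). Casting to float first, then to
# the nullable Int16 dtype, sidesteps the object-to-IntegerDtype error.
ages = pd.Series([25.0, np.nan, 31.0], dtype="object")
ages = ages.astype("float").astype(pd.Int16Dtype())
print(ages.dtype)  # Int16
```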
I just ran into this - it looks much more general than a read_csv problem to me.
>>> pd.Series(["1", "2", "3"]).astype(pd.Int64Dtype())
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/Users/lukestanbra/dev/venv/lib/python3.6/site-packages/pandas/core/generic.py", line 5698, in astype
new_data = self._data.astype(dtype=dtype, copy=copy, errors=errors)
File "/Users/lukestanbra/dev/venv/lib/python3.6/site-packages/pandas/core/internals/managers.py", line 582, in astype
return self.apply("astype", dtype=dtype, copy=copy, errors=errors)
File "/Users/lukestanbra/dev/venv/lib/python3.6/site-packages/pandas/core/internals/managers.py", line 442, in apply
applied = getattr(b, f)(**kwargs)
File "/Users/lukestanbra/dev/venv/lib/python3.6/site-packages/pandas/core/internals/blocks.py", line 625, in astype
values = astype_nansafe(vals1d, dtype, copy=True)
File "/Users/lukestanbra/dev/venv/lib/python3.6/site-packages/pandas/core/dtypes/cast.py", line 821, in astype_nansafe
return dtype.construct_array_type()._from_sequence(arr, dtype=dtype, copy=copy)
File "/Users/lukestanbra/dev/venv/lib/python3.6/site-packages/pandas/core/arrays/integer.py", line 354, in _from_sequence
return integer_array(scalars, dtype=dtype, copy=copy)
File "/Users/lukestanbra/dev/venv/lib/python3.6/site-packages/pandas/core/arrays/integer.py", line 135, in integer_array
values, mask = coerce_to_array(values, dtype=dtype, copy=copy)
File "/Users/lukestanbra/dev/venv/lib/python3.6/site-packages/pandas/core/arrays/integer.py", line 218, in coerce_to_array
raise TypeError(f"{values.dtype} cannot be converted to an IntegerDtype")
TypeError: object cannot be converted to an IntegerDtype
I would expect that this should just work? As @NumesSanguis says above, converting via float does work, e.g.
>>> pd.Series(["1", "2", "3"]).astype(float).astype(pd.Int64Dtype())
0 1
1 2
2 3
dtype: Int64
This is using
>>> pd.__version__
'1.0.3'
@TomAugspurger - do you think a new issue needs to be opened for this?
I thought we already had an issue for that (possibly search for "strictness of _from_sequence"), but I may be wrong.
@NumesSanguis
Any ideas for a workaround if the integer (18 places) is too big for float64?
@dekiesel
Sorry, I don't know.
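One possible approach for the 18-digit question above, sketched with made-up values: float64 has only 53 mantissa bits, so round-tripping such values through float corrupts the last digits, but going through Python ints preserves them.

```python
import pandas as pd

# Made-up 18-digit values; None stands in for a missing entry.
raw = pd.Series(["123456789012345678", None, "987654321098765432"])

# Build the nullable Int64 array via Python ints, never touching float64.
big = pd.array(
    [int(v) if pd.notna(v) else pd.NA for v in raw],
    dtype="Int64",
)
print(big[0])  # 123456789012345678
```

For comparison, `int(float("123456789012345678"))` comes back as a slightly different number, which is exactly the precision loss the float detour causes.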