It's possible to write out a CSV file using the default settings (i.e., `line_terminator='\n'` in `DataFrame.to_csv`) that can't be read back in using the default settings (i.e., `lineterminator=None`).
This problem arises when there's a text column that doesn't get quotes around it (e.g., because it contains no commas) but does contain a carriage return. `read_csv` by default treats both `'\n'` and `'\r'` as line terminators, so extra rows appear. See the example below.
```python
In [12]: pd.DataFrame({"text": ["foo\rbar"]}).to_csv("foo.csv", index=False)

In [13]: pd.read_csv("foo.csv")
Out[13]:
  text
0  foo
1  bar
```
It might be best to have the default line terminator for `read_csv` be `'\n'`. That might reduce usability (e.g., it's nice to be able to load CSVs from UNIX or Windows using the defaults), so maybe it'd be good just to add something to the documentation. Or, maybe a warning could be printed or an exception raised if a CSV file appears to use multiple types of line endings, which shouldn't happen.
(IIRC this can also cause the default parsing engine to make python segfault in some cases.)
This seems like a very particular special case to me. Normally, having `\r` *inside* a string is very odd, to say the least.
Yeah, it's definitely a rare issue, but I thought it worth documenting in case somebody else runs into it or thinks of a good solution that doesn't impair usability for more typical cases.
OK, you can add a small note in the io.rst / csv section that shows the issues w.r.t. `'\r'` (try to be as general as possible, though).
Wouldn't it be wise for `to_csv` to emit quotes whenever the data contains either `\r` or `\n`? By default, Python 3's `open` will convert `\r` to `\n` ("universal newlines"), so omitting quotes around `\r` seems a bit dangerous.
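A minimal sketch of that `open` behavior (the temporary file is just for illustration):

```python
import tempfile

# Write a bare '\r' with newline='' so no translation happens on the way out.
with tempfile.NamedTemporaryFile("w", newline="", suffix=".txt", delete=False) as f:
    f.write("foo\rbar")
    path = f.name

# Reading back with the default newline=None enables universal newlines,
# so the '\r' comes back as '\n'.
with open(path) as f:
    print(repr(f.read()))  # 'foo\nbar'
```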
I agree; this points to a bug in `to_csv`. `to_csv` should quote special characters that occur within fields to conform with the CSV standard. From RFC 4180, which documents the CSV format (https://tools.ietf.org/html/rfc4180):
> 6. Fields containing line breaks (CRLF), double quotes, and commas
>    should be enclosed in double-quotes. For example:
>
>        "aaa","b CRLF
>        bb","ccc" CRLF
>        zzz,yyy,xxx
I ran into this bug today--using python3, to_csv did not put quotes around string with \r in it.
I ran into this issue today as well. While I was able to resolve it by explicitly setting the `lineterminator` in `read_csv()` to `\n`, that's a bit cumbersome.
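For reference, a minimal sketch of that workaround (assuming the default C parser engine, which is the one that accepts `lineterminator`):

```python
import io

import pandas as pd

# A CSV payload like the one to_csv produces: an unquoted field
# containing a bare carriage return.
csv_data = "text\nfoo\rbar\n"

# With the defaults, read_csv treats '\r' as a row break too,
# so the single record is split into two rows.
print(pd.read_csv(io.StringIO(csv_data)))

# Pinning lineterminator='\n' keeps 'foo\rbar' together as one field.
print(pd.read_csv(io.StringIO(csv_data), lineterminator="\n"))
```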
I agree with others that this is almost certainly a bug in `to_csv()`. Pulling a bit more from the CSV standard @ameasure shared (emphasis added):
The ABNF grammar appears as follows:

```
file = [header CRLF] record *(CRLF record) [CRLF]
header = name *(COMMA name)
record = field *(COMMA field)
name = field
field = (escaped / non-escaped)
escaped = DQUOTE *(TEXTDATA / COMMA / CR / LF / 2DQUOTE) DQUOTE
non-escaped = *TEXTDATA
```
If I'm interpreting it correctly, carriage returns should definitely be escaped.
I ran into this bug today as well. We were analyzing online reviews, and many reviews span multiple lines but do not use commas. I support the amendment of `to_csv`.
well a pull request from the community would be a way to resolve this.
The `csv` library defaults do quote carriage returns with `quoting=0` (i.e., `csv.QUOTE_MINIMAL`).
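That difference can be seen directly with the `csv` module (a minimal sketch):

```python
import csv
import io

# With the csv module's default lineterminator ('\r\n'), QUOTE_MINIMAL
# treats '\r' as special and quotes the field.
buf = io.StringIO()
csv.writer(buf, quoting=csv.QUOTE_MINIMAL).writerow(["foo\rbar"])
print(repr(buf.getvalue()))  # '"foo\rbar"\r\n'

# With lineterminator='\n' (what to_csv passes), '\r' is no longer part
# of the terminator, so the field goes out unquoted.
buf = io.StringIO()
csv.writer(buf, lineterminator="\n").writerow(["foo\rbar"])
print(repr(buf.getvalue()))  # 'foo\rbar\n'
```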
The issue (as pointed out above) is that the `to_csv` method sets the `line_terminator` to `\n` by default [1]. Instead, could it not be set to `None` in both the `to_csv` method and the underlying `CSVFormatter` object [2] as the default, letting the `csv` module choose the `line_terminator` value?
A final change would be to not initialize the writer with `line_terminator=None`. A possible filter when creating the `writer_kwargs` dict? [3]
I'm not sure how this would work in Unix environments where the default line terminator is `\n` (as I don't have access to one to test).
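A minimal sketch of that filtering idea (the `writer_kwargs` name follows the comment above; the dict contents here are hypothetical, not pandas code):

```python
import csv
import io

# Hypothetical kwargs as the formatter might assemble them; a None
# line terminator is dropped so csv.writer falls back to its own
# default of '\r\n', under which a '\r' inside a field gets quoted.
writer_kwargs = {"lineterminator": None, "quoting": csv.QUOTE_MINIMAL}
writer_kwargs = {k: v for k, v in writer_kwargs.items() if v is not None}

buf = io.StringIO()
csv.writer(buf, **writer_kwargs).writerow(["foo\rbar"])
print(repr(buf.getvalue()))  # '"foo\rbar"\r\n'
```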