Pandas: Saving CSV with backslashed-escaping is not idempotent.

Created on 31 Aug 2016 · 18Comments · Source: pandas-dev/pandas

@pdbaines and I noticed this bug.

I want Pandas to write a CSV file so that all field data is backslash escaped if the character has a special interpretation (e.g. quotes or backslashes themselves). If a quote is backslashed, it is treated as field data, rather than a special character. This is not the behavior that I am seeing.

Consider the following data frame:

df = pd.DataFrame({"text": ["""Hello! Please "help" me. I cannot quote a csv.\\"""], "zoo": ["1"]})
df.to_csv("out.csv", index=False, quoting=csv.QUOTE_NONNUMERIC, encoding="utf-8", escapechar='\\', doublequote=False)

When written to a file, it looks something like this:

"text","zoo"
"Hello! Please \"help\" me. I cannot quote a csv.\","1"

The quotes are properly escaped in Please "help" me, but oddly, the end-quote of the field is backslashed, but the start-quote of the field is not back-slashed.

If I read the data frame in again using exactly the same parameters,

df2 = pd.read_csv("out.csv", quoting=csv.QUOTE_NONNUMERIC, encoding="utf-8", escapechar='\\', doublequote=False)

I get a data frame with both fields concatenated into the first field and the second field is NaN.

$ print(df2)
                                                text  zoo
0  Hello! Please "help" me. I cannot quote a csv....  NaN

If I instead, do the following:

df3 = pd.DataFrame({"text": ["""Hello! Please "help" me. I cannot quote a csv.\\\""""], "zoo": ["1"]})
df3.to_csv("outB.csv", index=False, quoting=csv.QUOTE_NONNUMERIC, encoding="utf-8", escapechar='\\', doublequote=False)
df4 = pd.read_csv("outB.csv", quoting=csv.QUOTE_NONNUMERIC, encoding="utf-8", escapechar='\\', doublequote=False)

I instead get a file with an odd-number of unescaped quote characters:

"text","zoo"
"Hello! Please \"help\" me. I cannot quote a csv.\\"","1"

and some unescaped quote characters are treated as data.

output of `pd.show_versions()`

INSTALLED VERSIONS
------------------
commit: None
python: 3.5.1.final.0
python-bits: 64
OS: Linux
OS-release: 3.13.0-39-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8

pandas: 0.18.0
nose: 1.3.7
pip: 8.1.1
setuptools: 20.3
Cython: 0.23.4
numpy: 1.10.4
scipy: 0.17.0
statsmodels: 0.6.1
xarray: None
IPython: 4.1.2
sphinx: 1.3.5
patsy: 0.4.0
dateutil: 2.5.1
pytz: 2016.2
blosc: None
bottleneck: 1.0.0
tables: 3.2.2
numexpr: 2.5
matplotlib: 1.5.1
openpyxl: 2.3.2
xlrd: 0.9.4
xlwt: 1.0.0
xlsxwriter: 0.8.4
lxml: 3.6.0
bs4: 4.4.1
html5lib: None
httplib2: None
apiclient: None
sqlalchemy: 1.0.12
pymysql: None
psycopg2: None
jinja2: 2.8
boto: 2.39.0

IO CSV

Source

deads

Most helpful comment

@gfyoung
Please explain why it is desired behavior to not be able to save arbitrary data to a Pandas DataFrame cell and read it back in the same.

deads on 2 Sep 2016

👍2

All 18 comments

I think this is a dupe of #12922 and likely fixed here: https://github.com/pydata/pandas/commit/d814f433940031029f1ddf0d9abdecdf4ad31dac

@gfyoung

jreback on 31 Aug 2016

>>> import csv
>>> from pandas import DataFrame
>>> df = DataFrame({"text": ["""Hello! Please "help" me. I cannot quote a csv.\\"""],
                    "zoo": ["1"]})
>>> df
                                              text zoo
0  Hello! Please "help" me. I cannot quote a csv.\   1

Why did you put an backslash character there at the end? Let's remove it:

>>> df = DataFrame({"text": ["""Hello! Please "help" me. I cannot quote a csv."""],
                    "zoo": ["1"]})
>>> df
                                              text zoo
0  Hello! Please "help" me. I cannot quote a csv.   1
>>> print(df.to_csv(index=False, quoting=csv.QUOTE_NONNUMERIC, encoding="utf-8",
                    escapechar='\\', doublequote=False)))
"text","zoo"
"Hello! Please \"help\" me. I cannot quote a csv.","1"

You get that backslash at the end because you put it there.

@jreback : This is not a bug and can be closed.

gfyoung on 1 Sep 2016

👎1

@gfyoung: I put a backslash there to show that backslash escaping does not work for all possible inputs. It seems very reasonable for an implementation to simply backslash the backslash as is the case with C, Java, Python shell, JSON, and string literals as well as the output of Python str.__repr__.

A lack of idempotency could be a security concern as it could affect the availability and integrity of an application.

deads on 2 Sep 2016

If a quote is backslashed, it is treated as field data, rather than a special character.

The // is interpreted as field data. It is not escaping that last quotation mark. My point is that you are misinterpreting your output as bugged when in fact it is behaving as is.

gfyoung on 2 Sep 2016

@gfyoung
Please explain why it is desired behavior to not be able to save arbitrary data to a Pandas DataFrame cell and read it back in the same.

deads on 2 Sep 2016

👍2

@deads I am not convinced that your example _should_ be lossless. csv is a pretty lossy format, esp with all of the options you have selected. Can you do this example with the python csv reader losslessly?

jreback on 2 Sep 2016

This behavior is present in the csv module https://gist.github.com/wesm/7763d396ae25c9fd5b27588da27015e4 . From first principles seems like the offending backslash should be escaped. If I manually edit the file to be

"a"
"Hello! Please \"help\" me. I cannot quote a csv.\\"

then read_csv returns the original input

I fiddled with R and it doesn't seem to do much better

> df <- data.frame(a=c("Hello! Please \"help\" me. I cannot quote a csv.\\"))> write.table(df, sep=',', qmethod='e', row.names=F)
"a"
"Hello! Please \"help\" me. I cannot quote a csv.\"

Another example of CSV not being a high fidelity data interchange tool =|

Also, I do not think it is fair to say

You get that backslash at the end because you put it there.
@jreback : This is not a bug and can be closed.

to someone reporting behavior that looks like a bug. This would seem buggy to me if I ran across it in production (presuming this came out of some kind of real world use). "Just change your input" is easy to say until the data in question is machine-generated (and may contain backslashes).

wesm on 5 Sep 2016

@wesm : Your comment is a little presumptive because I did not realize that that was his point with the extraneous backslash. Before jumping to conclusions as you did about me "brushing off" this problem as a cop-out, I would suggest that you read the original issue.

With regards to your point about the bug, I am not surprised that this issue persists with csv because the entire to_csv method is essentially a giant wrapper around Python's csv module. I suspect then that the issue in pandas is really a manifestation of a bug in the csv module that should be pursued with the Python developers first.

gfyoung on 6 Sep 2016

I seem to have problems with quoting and escaping, too. Has anything happened since 2k16?

black-snow on 10 Dec 2017

@black-snow : I don't believe anything has changed with this issue unfortunately. Given that the escaping and writing is handled by Python csv at the very end, if we were to work around this idempotent issue, I think we would have to do some hacky data adjustment before writing.

Have a look at the examples and see if they still persist today. Then also post your example code, and we can have a look.

gfyoung on 11 Dec 2017

Thanks for the quick reply @gfyoung ! I've already fixed my issue. Apparently pandas needs to be told that quotes inside a quoted field are escaped with a backslash.

black-snow on 11 Dec 2017

Awesome! Mind sharing your code-sample for reference?

gfyoung on 11 Dec 2017

Sure.

There's no real csv standard but I'm used to certain defaults i. e. delimiter is ,, quote char is " and escape char is \ (e. g. from PHP). Pandas doesn't seem to use the backslash as the escape character by default so I had to add it.

X = pd.read_csv(args.file, header=None, index_col=False, escapechar='\\').as_matrix()

Thought this issue would be related but apparently it's not.

black-snow on 12 Dec 2017

@black-snow : You'll need to provide us the file (if possible). Can't run that code-sample as is. :smile:

gfyoung on 12 Dec 2017

I cannot, sadly, it's business internal stuff. But without the escapechar pandas should already fail on this (not tested):

"first column","second column"
"first columne with \"escaped\" quotes","second column"

black-snow on 12 Dec 2017

But without the escapechar pandas should already fail on this (not tested)

Confirmation would be nice.

gfyoung on 12 Dec 2017

I'm also having this problem. Setting escapechar to backsplash doesn't fix it.

pd.DataFrame({'bar': ['test test \\', 'test'], 'foo': ['aa', 'bb']}).to_csv('~/test.csv', quoting=csv.QUOTE_NONNUMERIC, doublequote=False, escapechar="\\")

This results in the following code in the file:

"","bar","foo" 0,"test test \","aa" 1,"test","bb"
Now, if I'm trying to read the same file using backslash as escape char I get erroneous result:
pd.read_csv('~/test.csv', escapechar='\\').iloc[0]

bar = test test ",aa"
foo = NaN

I believe setting "" as escape character should result in "" being escaped by ""