Data.table: fread should un-escape escaped quotes in fields

Created on 9 Apr 2015  路  7Comments  路  Source: Rdatatable/data.table

According to the docs:

A quoted field must start with quote and end with a quote that is also immediately followed by sep or \n. Thus, unescaped quotes may be present in a quoted field (...,2,"Joe, "Bloggs"",3.14,...) as well as escaped quotes (...,2,"Joe \",Bloggs\"",3.14,...).

The following csv should be supported by fread.

library(data.table)
dt <- data.table(a = 1:2, b = c('f(c("a","b"))','sum(1,2)'))

dt
#    a             b
#1: 1 f(c("a","b"))
#2: 2      sum(1,2)

write.table(dt,"tbl1.csv",sep=",",na="",col.names=TRUE,row.names=FALSE,qmethod="escape")
write.table(dt,"tbl2.csv",sep=",",na="",col.names=TRUE,row.names=FALSE,qmethod="double")

system("cat tbl1.csv")
# "a","b"
#1,"f(c(\"a\",\"b\"))"
#2,"sum(1,2)"
system("cat tbl2.csv")
# "a","b"
#1,"f(c(""a"",""b""))"
#2,"sum(1,2)"

# output NA
fread("tbl1.csv",sep=",")
# Error in fread("tbl1.csv", sep = ",") : 
#   Expected sep (',') but new line, EOF (or other non printing character) ends field 1 when detecting types (   first): 2,"sum(1,2)"
# In addition: Warning message:
#   In fread("tbl1.csv", sep = ",") :
#   Starting data input on line 2 and discarded previous non-empty line: "a","b"

# incorrect output
fread("tbl2.csv",sep=",")
# a                 b
#1: 1 f(c(""a"",""b""))
#2: 2          sum(1,2)

# incorrect output
read.table("tbl1.csv",sep=",",header=TRUE)
#   a                 b
#1 1 f(c(\\a\\,\\b\\))
#2 2          sum(1,2)

# correct output
read.table("tbl2.csv",sep=",",header=TRUE)
# a               b
#1 1 f(c("a","b"))
#2 2      sum(1,2)

Findings:
as of now, only writing using qmethod="double" and read.table correctly supports write-read such kind of data.

Latest dev data.table, my locale, etc:

> sessionInfo()
# R version 3.1.3 (2015-03-09)
# Platform: x86_64-pc-linux-gnu (64-bit)
# Running under: Ubuntu 14.04.2 LTS
# 
# locale:
#  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C               LC_TIME=en_DK.UTF-8        LC_COLLATE=en_US.UTF-8     LC_MONETARY=en_US.UTF-8    LC_MESSAGES=C             
#  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                  LC_ADDRESS=C               LC_TELEPHONE=C             LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
# 
# attached base packages:
# [1] stats     graphics  grDevices utils     datasets  methods   base     
# 
# other attached packages:
# [1] data.table_1.9.5
# 
# loaded via a namespace (and not attached):
#  [1] bitops_1.0-6   chron_2.3-45   devtools_1.7.0 evaluate_0.5.5 formatR_1.0    httr_0.6.1     knitr_1.8      RCurl_1.95-4.5 stringr_0.6.2  tools_3.1.3   
bug fread

All 7 comments

PS: Remember to also test and credit #1299.

MOVED from #1299
I do as well have problems with the way fread currently handles quotes and tried to find a solution (don't look at my fork - it's sh*t).
Here are my thoughts on it:

There are two more or less separate issues:

Quoting

Some strings contain control characters (wrt a csv file) like \n, \l or sep (eg. ,).
To allow unambiguous parsing of files containing such characters such strings must be quoted, i.e. surrounded by a quote character(typically ").
Quote charters, however, may be part of a sting as well: He said: "What'up?" or {"id": 34, "character":"\""}
There are two major strategies to escape them:

  • Double: all quote characters inside the string to be quoted get repeated once. e.g. the above would become "He said: ""What'up?""" and "{""id"": 34, ""character"":""\""""}"'
  • Escape character: all quotes inside the string to be quoted get escaped by a preceding escape character (usually \). With this method the escape character must be escaped as well if it was part of the string before (usually now using the Double method). e.g. the above strings would become "He said: \"What'up?\"" and "{\"id\": 34, \"character\":\"\\\"\"}"

The former is easier to implement and uses less extra storage. The latter is the standard for write.table. However write.table fails to implement the second part and does not escape \ if it was part of the string. This leads to problems e.g. in the following case (in the 7th line, the quoted region never ends):

> a = "\\write.table \"killer\"\\"; a
[1] "\\write.table \"killer\"\\"
> cat(a)
\write.table "killer"\
> write.table(a)
"x"
"1" "\write.table \"killer\"\"
> write.table(a, qmethod = "double")
"x"
"1" "\write.table ""killer""\"

Note that quoting always comes with a quote escape method. If the quoted string did not contain any quoting characters, using non escaping quoting would be equivalent to using the double escaping method. If It did also not contain the escape character, escape character escaping quoting would be equivalent as well. Often it is possible to infer the quote escape method used.

I would want fread to correctly parse csv files if the user specifies the used quote method on a per file basis. Where qmethod could be either something like:

  • "notQuoted" allowing to read files that contain fields with arbitrarily many " but no control characters (e.g. this real world example)
  • "quotedDoubleEscaped"
  • "quotedEscapeCharEscaped"
    It would be cool if one could specify the quote and the escape character.
    Further I would want fread to deduce the quote method if possible.

Unquoting

i.e. the removal of the leading and trailing quote charactars from the string and the unescaping of the string

fread should do this only if all fields of one column are exactly one quoted region (necessarily but not sufficiently begin and end with a quote) or if sep2 was implemented if the same holds for all sub-fields of a column.
There should be options to specify unquoting behavior on a per column basis.
At least fread should provide easy ways to do unquoting afterwards.

Summary

Reading files with escaped quotes (as specified by the user) is a must. fread should try to deduce the quote escaping method used.
It would be cool if fread would allow for unquoting and even cooler if it would try to guess what should be unquoted.

For ease of writing/reading these words sound much more confident than I am about them. I don't want to nag but to help.
best,
Jan

PS: some example of currently improperly handled data:

> a = '"gene names" "protein names"\nABL,JTK7 "Abelson murine leukemia viral oncogene homolog 1","Abelson tyrosine-protein kinase 1"'; cat(a)
"gene names" "protein names"
ABL,JTK7 "Abelson murine leukemia viral oncogene homolog 1","Abelson tyrosine-protein kinase 1"
> fread(a)
   gene names                                                                        protein names
1:   ABL,JTK7 Abelson murine leukemia viral oncogene homolog 1","Abelson tyrosine-protein kinase 1
> fread(a,quote = "")
Empty data.table (0 rows) of 3 cols: ABL,JTK7 "Abelson murine leukemia viral oncogene homolog 1","Abelson tyrosine-protein kinase 1"
Warning message:
In fread(a, quote = "") :
  Starting data input on line 2 and discarded previous non-empty line: "gene names" "protein names"

I think my problem should be added to this, as it relates to allowing escapes and a feature request in the Master task for fread bugs. I'm using the 1.10.4-3 of data.table.

When exporting using the default fwrite settings, quotes inside of quotes are doubled to escape. Here is a line of the csv, showing that in the name 'Oller, Thomas "Rico"', the quotes are doubled to escape it and also balanced correctly :

AD04,19992000,"Oller, Thomas ""Rico""",A,Oller,Thomas,Oller,,,Assemblymember,Assemblymember,REP,Y,LEG_ESI,2006-12-15 12:43:44,N

However, when using fread, these are not handled well. The result is double quotes:

Oller, Thomas ""Rico"" 

Its only when I use the read.csv that this is read correctly as:

Oller, Thomas "Rico"

Note that this problem persists when forcing quotes, with quote=TRUE in fwrite. I also get strange triple slashes when setting fwrite to quote=TRUE, qmethod='escape'. In fact, I haven't found a way to simply import and export using the fread/fwrite combination! I only can import it back using data.table(read.csv(...)). Since i haven't found a simple solution using data.table to a simple task, this feels like a big bug!

Escapes are defined this standard, RFC 4180:
https://tools.ietf.org/html/rfc4180#page-6

Here is a one line example of the issue:
fread('AD04,19992000,"Oller, Thomas ""Rico""",A,Oller,Thomas,Oller,,,Assemblymember,Assemblymember,REP,Y,LEG_ESI,2006-12-15 12:43:44,N\n')

The example shows that fread does not handle escapes the way the standard defines them, as the quotes around "Rico" are not escaped:

     V1       V2                     V3 V4    V5     V6    V7 V8 V9            V10            V11 V12 V13     V14
1: AD04 19992000 Oller, Thomas ""Rico""  A Oller Thomas Oller NA NA Assemblymember Assemblymember REP   Y LEG_ESI
                   V15 V16
1: 2006-12-15 12:43:44   N

If fread does not comply with RFC 4180 for .csv, then why not escalate that to a 'Bug'?

Just got bit by this... a bit unfortunate that fwrite default become hard to invert...

Escaping " using "" (qmethod="double") is the only option for tables exported in BigQuery.

Was this page helpful?
0 / 5 - 0 ratings

Related issues

andschar picture andschar  路  3Comments

DavidArenburg picture DavidArenburg  路  3Comments

arunsrinivasan picture arunsrinivasan  路  3Comments

alex46015 picture alex46015  路  3Comments

lux5 picture lux5  路  3Comments