Hello,
I might have run into an issue ...
I think readtable / writetable doesn't escape backslashes as expected.
\\ should be rendered as \ like for a single string in the REPL?
julia> using DataFrames
julia> dat = DataFrame()
0x0 DataFrames.DataFrame
julia> dat[:LatexLabel] = ["{\$ \\lambda_1 \$}","{\$ \lambda_2 \$}"]
2-element Array{ASCIIString,1}:
"{\$ \\lambda_1 \$}"
"{\$ lambda_2 \$}"
julia> println(dat)
2x1 DataFrames.DataFrame
| Row | LatexLabel |
|-----|-------------------|
| 1 | "{\$ \\lambda_1 \$}" |
| 2 | "{\$ lambda_2 \$}" |
julia> println("{\$ \\lambda_1 \$}")
{$ \lambda_1 $}
julia> println("{\$ \lambda_2 \$}")
{$ lambda_2 $}
Note that the single backslash in front of lambda_2 is treated as an escape and disappears when printing, while the double backslash does not. When printing dat, the single backslash in front of \$ does not disappear, and the double backslash \\ is not understood as an escaped single backslash \.
writetable("test.csv",dat)
Gives the following test.csv file:
"LatexLabel"
"{$ \\lambda_1 $}"
"{$ lambda_2 $}"
This is clearly different from the writedlm in Base :
julia> ar = convert(Array,dat)
2x1 Array{ASCIIString,2}:
"{\$ \\lambda_1 \$}"
"{\$ lambda_2 \$}"
julia> writedlm("test2.csv",ar)
Gives the following test2.csv file:
{$ \lambda_1 $}
{$ lambda_2 $}
Don't you think that writetable should behave like Base writedlm ?
Thanks,
Lionel
PS : I'm trying to write a dataframe in a .csv file from Julia, that will be reused for generating a bar chart in Latex. So I want to include my labels in a Latex string in my first column.
Please avoid reporting an issue twice, or at least provide links to the other thread (on julia-users). As you found out, write seems to exhibit the same behavior as writetable, which likely indicates the issue is in Base Julia, not in DataFrames. Regarding the printing issue, I suspect it might be related to the previous one.
If you don't get any reply on julia-users in a few days, file an issue in Julia, and link to this one. Thanks!
Sorry, I did search for other issues before posting here and I might have missed something. Would you mind pointing to the issue you're speaking about so I could refer to it here ?
If I follow you, writetable is based on Base.write, which exhibits a different behavior than writedlm ?
@lionpeloux Sorry, I was referring to an e-mail you sent this morning to julia-users, but apparently you've deleted it since then (which sounds weird, since anyway people have received it and may reply to it). Could you explain how it is related to, or different from, the current issue?
@nalimilan Sure. The matter I'm pointing to is different from the post you're referring to.
What I'm saying is that the DataFrames readtable() and writetable() functions don't escape characters properly the way Base.readdlm() and Base.writedlm() do. This is problematic when you want to use $ and \ characters in your strings (in my case to feed LaTeX with a .csv for plotting).
For instance, when you take this string {\$ \\lambda \$} it prints like this in the REPL :
julia> println("{\$ \\lambda \$}")
{$ \lambda $}
Which is what you expect. If you put this same string into an array and write this array to a .csv file with writedlm() you'll get a text file with {$ \lambda $} in it and it's exactly what you want. And it is coherent with the println() behavior.
But if you use DataFrames.writetable() you'll have a .csv file with {$ \\lambda $}. Thus, you can't get rid of the double \\.
Thanks. But I'm referring to another post by you, the one from this morning in the thread "Re: Adding backslashes to a string fails". In that message, you mentioned that write doesn't behave properly either. Is that actually the case, or did you discover something new since then? That's highly relevant to debug the present issue.
No, that's why I deleted my msg ... I realized I misunderstood the problem. Here are some new tests :
write()
s1 = "{\$ \\lambda \$}"
s2 = "{\$ \lambda \$}"
println(s1)
f1 = open("test1.txt","w")
write(f1,s1)
close(f1)
println(s2)
f2 = open("test2.txt","w")
write(f2,s2)
close(f2)
You get exactly the same strings in the REPL and in the txt files. So write() seems ok to me. The results are (respectively for s1 and s2) :
s1 : {$ \lambda $}
s2 : {$ lambda $}
And when I read back the strings in the txt files, I get my initial inputs where \ and $ are properly escaped with a backslash :
f1 = open("test1.txt","r")
s1 = readall(f1)
close(f1)
f2 = open("test2.txt","r")
s2 = readall(f2)
close(f2)
So write() does the job properly.
writedlm()
readdlm() and writedlm() behave exactly the same as read() and write():
a = ["{\$ \\lambda \$}","{\$ \lambda \$}"]
f = open("test.txt","w")
writedlm(f,a,',')
close(f)
f = open("test.txt","r")
a = readdlm(f,',')
close(f)
This results in a text file with the following data (which is properly escaped back when read):
{$ \lambda $}
{$ lambda $}
writetable()
The matter is in the DataFrames package, as you can see with this example:
using DataFrames
df = DataFrame()
df[:S] = ["{\$ \\lambda \$}","{\$ \lambda \$}"]
writetable("test.txt",df)
Which gives you the following content in the txt file :
"S"
"{$ \\lambda $}"
"{$ lambda $}"
The single backslashes are properly escaped, but the double \\ is not processed : it should be understood as a single escaped backslash and written \ in the txt file.
When you read back in the file the $ are properly escaped and the double backslash remains \\ :
julia> using DataFrames
julia> df = readtable("test.txt")
2x1 DataFrames.DataFrame
| Row | S |
|-----|-----------------|
| 1 | "{\$ \\lambda \$}" |
| 2 | "{\$ lambda \$}" |
I hope this is now more clear !
I'm having the same issue.
I can't get exactly one backslash when writing a file with writetable(). I get either none or two.
In fact, it seems that I only get useful output for odd numbers of backslashes:
"\(a" => "(a"
"\\\(a" => "\\(a"
"\\\\\(a" => "\\\\(a"
where the output has one fewer backslash than the input.
Whereas for even numbers, no backslash is removed:
"\\(a" => "\\(a"
"\\\\(a" => "\\\\(a"
"\\\\\\(a" => "\\\\\\(a"
Whereas the expected behaviour (at least println()'s behaviour) is:
"\(a" => "(a"
"\\(a" => "\(a"
"\\\(a" => "\\(a"
"\\\\(a" => "\\(a"
"\\\\\(a" => "\\(a"
"\\\\\\(a" => "\\\(a"
Could you please have a look at this?
With "\(a", the \ is just quoting the (, and since there is no need for a ( to be quoted, it disappears. The string is just 2 bytes, the ( and the a.
With "\\(a", you have 3 bytes, the backslash \, the (, and the a.
What you are seeing when you have an uneven number of backslashes is simply the unnecessary one before the ( being eaten.
Note: some modern languages, such as Swift, give an error if you use a \ to quote a character that doesn't require quoting, to prevent this sort of confusion.
@ScottPJones I do not expect "\(a" to work, but I did expect "\\(a" to work, which doesn't.
I expect the behaviour of escaping character to be consistent across tools I use in Julia.
So what I would expect is this:
"\(a" => "(a"
"\\(a" => "\(a"
"\\\(a" => "\(a"
"\\\\(a" => "\\(a"
"\\\\\(a" => "\\(a"
"\\\\\\(a" => "\\\(a"
Which I currently do not get with writetable().
The strings with an uneven number of \ are not really properly quoted.
If you want a \ in the output, you must have two of them in the string literal.
It just happens that there is no quote needed for (, so it is not quoted on output or display.
I corrected the "\\\(a" => "\(a"
So, yes, it should work as you say, but that is not the case with writetable().
And I just realised that GitHub is also processing the backslash escapes, so I also corrected the parts that did not have code quotes.
The example above for writetable seems correct to me: if there is a single backslash, it will get displayed as \\ in a string literal.
@ScottPJones The issue is to have the content on the file (so when we access with other software than Julia), that is written with a single backslash.
If I write "\$" in Julia string, the $ sign will be escaped and in the text file it will be written "$".
If we read back the file with "$", the backslash will be added back in order to have a valid Julia string.
=> the $ symbol is correctly escaped.
Now, if we want to do the same with the backslash symbol, it does not work. The backslash symbol is not escaped (never).
So the problem is that the backslash symbol itself is never escaped, whereas the other symbols are. Either nothing should be escaped, or everything (including the backslash). Having everything but the backslash escaped is an issue.
When you read back a file with "$", there is no backslash "added". It is simply displayed with a backslash when output as a quoted string.
All the above examples have the same flaw. The strings (such as "\lambda" or "\(") do NOT contain any backslashes at all, because in both cases they incorrectly quote a character that does not need to be quoted. No backslash is going out to the file, so when it is read back in, and you don't see any backslashes, that is totally correct behavior.
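The difference between what a string contains and how it is displayed as a quoted literal can be checked directly. This is only a minimal sketch, assuming Julia 1.x (the versions discussed in this thread may display things slightly differently):

```julia
# Assumes Julia 1.x. The literal "\$" holds a single character, the dollar sign.
s = "\$"

println(sizeof(s))   # prints 1
println(s)           # prints $
println(repr(s))     # prints "\$"  (the quoted, re-readable form adds the backslash)
```

The backslash only appears in the quoted representation; it is never part of the data written by print or write.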
@ScottPJones I base the read back case on @lionpeloux 's example.
Can you see that it is currently not possible to write things like "\(" into the text file with writetable(), whatever you write in Julia?
Whereas these are things that work correctly with write().
You can try to use the CSV.jl package instead.
Thanks very much @nalimilan , this works as expected with this CSV.jl package.
It gives the exact expected result.
@ScottPJones The needed functionality is to write a single backslash into the text file in order to reuse the text file for post processing with other tools (for instance LaTeX).
The CSV.jl package provides this functionality but writetable() doesn't.
OK. There are two totally distinct issues here:
1) The correct syntax for representing a string literal in Julia that has a single backslash character.
2) The format used for quoting characters in the output file.
As far as 1), Julia's string literal format is very similar to, but not identical to C's string literal format (which means that it's rather dated). (differences include \a meaning \x07, $ needing to be quoted since it is used for interpolation)
Also, there are important differences between what is accepted in a Julia string literal, and what is produced when a string value is escaped.
If you want to produce a single backslash character, using a Julia string literal, you must quote it with another backslash character, i.e. the string "\\" is only one byte long, it only has a single backslash. (In v0.6, you can also use the raw string macro, i.e. raw"foo\bar", but there seems to be a problem, it's not completely "raw", in that a \ before a " is treated as an escape character for the ")
That is why I said that all the examples were invalid, because "\l" or "\(" or "\f" are NOT 2 character strings - they are the same as: "l", "(", and "\x0c" (the last one being a form-feed control character). None of the strings contain the backslash character "\x5c".
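Point 1) can be illustrated with a short sketch. It assumes Julia 1.x, where, as far as I know, an unnecessary escape such as "\(" or "\l" is rejected by the parser as an invalid escape sequence instead of being silently dropped as in the versions discussed here, so only valid literals are used:

```julia
# Assumes Julia 1.x. Both literals below contain exactly one backslash (0x5c).
s = "\\(a"       # escaped backslash, then '(' and 'a'
r = raw"\(a"     # raw string literal: the backslash is taken literally

println(s)              # prints \(a
println(sizeof(s))      # prints 3
println(s == r)         # prints true
println(sizeof("\\"))   # prints 1
println(sizeof("\f"))   # prints 1  (a form feed, 0x0c, not a backslash)
```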
As far as 2) C, C++, C#, JavaScript, Java, JSON, Python, Swift and Julia's string literal formats are all very similar, but they are not interchangeable.
readtable and writetable use Julia's string literal escape and unescape code, and should not be used for any sort of interchange of data except with other Julia programs (or you write Julia compatible string escape / unescape functions in the other language). Backslashes must always be quoted (with another backslash) in Julia's string format.
If you wish to interchange data with other programs, you must be clear on both the file format and the character set encoding (which frequently will be ISO-8859-1, Windows 1252, or UTF-8).
CSV.jl uses the format from RFC-4180 (see https://www.ietf.org/rfc/rfc4180.txt), which does not have any escape characters, and embedded double-quote characters are represented by two double-quotes. writedlm/readdlm basically also use the RFC-4180 format (but allow you to set the delimiter)
JSON.jl uses the JSON standard for escaping characters, similar (but not identical) to Julia.
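To make the RFC-4180 quoting rule mentioned above concrete, here is a rough sketch. The helper name rfc4180_field is hypothetical (it is not part of CSV.jl, DataFrames, or Base), and it only handles a single field, assuming Julia 1.x:

```julia
# Hypothetical helper for illustration only: quote one field the RFC-4180 way.
# A field is wrapped in double quotes only if it contains the delimiter, a
# double quote, or a line break; embedded double quotes are doubled.
# The backslash has no special meaning and is written as-is.
function rfc4180_field(s::AbstractString; delim::Char = ',')
    needs_quotes = any(c -> c == delim || c == '"' || c == '\n' || c == '\r', s)
    needs_quotes || return String(s)
    return "\"" * replace(s, "\"" => "\"\"") * "\""
end

println(rfc4180_field("{\$ \\lambda \$}"))  # prints {$ \lambda $}
println(rfc4180_field("a\"b"))              # prints "a""b"
println(rfc4180_field("x,y"))               # prints "x,y"
```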
CSV.jl actually allows for setting custom delim, quotechar, and escapechar through keyword arguments, with defaults being ,, ", and \, respectively. Docs are here
@quinnj Does CSV.jl really by default use an \ instead of doubling the "?
That would be incompatible with CSV readers/writers that expect RFC-4180.
I'd checked writecsv, and it does double the quote character, as expected.
In my experience, \ is more common as the escape character, but that's also why it's configurable.
@ScottPJones
If you want to produce a single backslash character, using a Julia string literal, you must quote it with another backslash character, i.e. the string "\\" is only one byte long, it only has a single backslash.
Yes, but if you ask writetable to write this in a text file, inside the file you get two characters, thus two bytes are written, not one.
|Julia|println()|writetable()|CSV.jl|write()|
|------|----------|--------------|--------|--------|
|\\|\|\\|\|\|
|\$|$|$|$|$|
So what I do not get, is why is \$ => $ but \\ not \, in the text file produced by writetable() ?
writetable outputs strings using print_escaped (or escape_string, depending on version).
$ has to be quoted in Julia string literals, because otherwise it indicates string interpolation, however, print_escaped does not quote $ (although print_unescaped accepts either a \ followed by a $ or a $ by itself, returning a single $ in both cases)
I'm not sure why the $ is not also escaped on output via print_escaped, that means that if you copy/paste a string written by print_escaped that has a $ in it, into the REPL or some Julia code, you'll get an error.
That's why the backslashes get doubled on output by writetable, because they need to be able to be read back by print_unescaped.
Documentation: escape_string does "general escaping of traditional C and Unicode escape sequences." It is distinct from Julia's string literal escaping.
So it round-trips just fine, but it probably won't play nicely with other CSV readers. If you want better compatibility with other CSV readers, then use CSV.jl. Given that changes here will break that round-tripping with previously saved files, I'm not sure this is worth an upheaval.
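A quick sketch of that round trip, calling Base's escape_string and unescape_string directly. It assumes Julia 1.x; whether writetable goes through exactly these functions depends on the DataFrames version, as noted above:

```julia
# Assumes Julia 1.x; escape_string and unescape_string are Base functions.
s = "{\$ \\lambda \$}"        # in memory: {$ \lambda $}  (one backslash)

escaped = escape_string(s)    # the text an escaping writer would put in the file
println(escaped)              # prints {$ \\lambda $}

# The backslash is doubled, while `$` is left alone by default.
@assert count(c -> c == '\\', s) == 1
@assert count(c -> c == '\\', escaped) == 2

# Unescaping restores the original string, so the round trip is lossless.
@assert unescape_string(escaped) == s
```

The doubled backslash in the file is exactly what a non-Julia reader, such as LaTeX, would then see.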
So, as I understand, you see writetable() only as a means to save the table from Julia and read it back with readtable() only.
I personally saw it as a means to write down the data in text file form, and as such expected to be able to choose what I print in the file (for instance printing a single \).
In fact, I don't see the point of having all the options in these two functions (separators, headers, etc.) if the purpose is only to save and read back.
It would have looked much more coherent to me to have a very raw function (that we cannot parametrise, which maximises the chances of being able to read back the file with the same function), and to redirect people toward CSV.jl for CSV-like files.
Besides the handling of $ for interpolation in Julia's string literals, escape_string matches Julia's string literal format, but it really is incompatible with "traditional C" escape sequences.
C, C++, Java, JavaScript, and Python all support the following \b \f \n \r \t.
C, C++ and Python additionally support \a and \v (as does Julia).
Only Julia seems to support \e (for escape), so I think if they really wanted to do "traditional C and Unicode escapes", and be more compatible, only the first five would be output escaped that way (the others should be escaped as hex sequences, like the other control characters).
I really think this issue should be closed, because 1) writetable is doing exactly what it was written to do, i.e. output strings using Julia's C-like literal format, and read them back in correctly via readtable after the round-trip. 2) if other more portable formats such as CSV are needed, then writedlm, writecsv, and CSV.jl work just fine.
@ScottPJones
1) writetable is doing exactly what it was written to do, i.e. output strings using Julia's C-like literal format, and read them back in correctly via readtable after the round-trip.
2) if other more portable formats such as CSV are needed, then writedlm, writecsv, and CSV.jl work just fine.
So maybe it would be nice to adapt the documentation to reflect these two points in order for people to know which tools they should use to fulfil their needs?
So, as I understand, you see writetable() only as a means to save the table from Julia and read it back with readtable() only.
Yes, and even then, you need to make sure you've got the right character encoding.
I personally saw it as a means to write down the data in text file form, and as such expected to be able to choose what I print in the file (for instance printing a single \).
No, because then it would not be able to make a round-trip.
In fact, I don't see the point of having all the options in these two functions (separators, headers, etc.) if the purpose is only to save and read back.
I think this was somebody's attempt to handle CSV files (and variants, hence all the options), however, I don't think it's really very useful at this point, since there are much better options available with the work @quinnj (and others) have done with CSV.jl and JSON.jl.
Note: of all the options for transporting data between disparate systems and languages, the only one mentioned here that has a clear standard is JSON, which is what I would strongly recommend.
(JSON even specifies that the file must be Unicode, not some random encoding like you frequently get with CSV files)
So maybe it would be nice to adapt the documentation to reflect these two points in order for people to know which tools they should use to fulfil their needs?
Yes, I do believe the documentation should be updated, hopefully somebody with an understanding of the issues (and time!) will be able to step up to the plate.
No, because then it would not be able to make a round-trip.
This just depends on how you code it. Can't write() and read() do this round trip?
This just depends on how you code it. Can't write() and read() do this round trip?
write and read will just output a sequence of bytes, but you are trying to write out something with some structure, such as rows and columns, and different types of values (i.e. strings that can contain any legal character).
You can't write something out in an ambiguous format, and then expect to read it back in and get the same data.
If you wrote out a backslash without quoting it (by outputting two backslashes), then when you read it back in, you would not be able to tell if the backslash meant it was a backslash, or that it was the leading character of an escape sequence.
@JonWel The plan is to deprecate writetable in favor of CSV.jl, that's why there's not a lot of interest in improving it. See https://discourse.julialang.org/t/announcement-dataframes-0-9-0-planned-for-february/266.
@ScottPJones
You can't write something out in an ambiguous format
If I write \n in a text editor, I have two bytes 01011100 and 01101110 in UTF-8. This is different from the character \n, which is written as the single byte 00001010.
So I don't see where you see ambiguity, it's just the usage of UTF8.
Why should we write \( with 3 bytes in a text file: 01011100 01011100 00101000 where it is clearly defined with two bytes 01011100 00101000? And more particularly, why should we read correctly the two bytes, but write back 3 bytes afterwards, which when read again are stored in memory as two bytes by Julia? (you can check that "\\(".data gives 2 bytes, not 3)
@JonWel You are still missing the point about using some escaping/quotation scheme in a file.
If you use a character, such as a tab or comma, to separate out certain fields in a file, then you have two options: 1) you simply disallow being able to use those characters ever in data (which would probably not be acceptable) or 2) you use some mechanism for quoting those characters.
In the "standard" CSV (RFC-4180), any field that has the delimiter, for example, a comma, must use double-quotes around the string. That means that if you want to represent a double-quote in a value, you also have a problem, so to handle that, two double-quote characters are output, and when read back in, the doubling is removed.
The same is true when using an escape character, such as \. If you did not quote the escape character itself in some way, you would not be able to use it within a field.
If you just output a single \, then you'd have to be careful that the following character in the string is not a character that is part of an escape sequence, and that the input code would actually leave the single \ alone. That just moves the problem a bit further along, because then you'd still have certain sequences that could not be allowed in a field.
So: If you write out "\\" * "f" to a file (i.e. 0x5c 0x66), what would you expect to get back?
That's why it's critical to quote the escape character.
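The ambiguity can be sketched with Base's unescape_string standing in for a reader that interprets backslash escapes. That stand-in is an assumption for illustration; readtable's actual code path may differ. Assumes Julia 1.x:

```julia
# Two different in-memory strings:
a = "\\f"    # 2 bytes: backslash (0x5c) followed by 'f' (0x66)
b = "\f"     # 1 byte: form feed (0x0c)

# With the escape character itself quoted, their file representations differ...
@assert escape_string(a) == "\\\\f"   # file text: \\f
@assert escape_string(b) == "\\f"     # file text: \f

# ...and each one reads back unambiguously.
@assert unescape_string(escape_string(a)) == a
@assert unescape_string(escape_string(b)) == b

# If `a` were written without quoting, the file text \f would come back
# as a form feed, i.e. as `b` instead of `a`.
@assert unescape_string(a) == b
```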
@ScottPJones
If you write out "\\" * "f" to a file (i.e. 0x5c 0x66), what would you expect to get back?
If I have a string stored in memory as 0x5c 0x66, I expect the CSV to contain \f, and have 0x5c 0x66 back in memory when reloading the file.
If I have a char 0x0c, I expect to have a strange character that GitHub can't display written in my CSV file, and 0x0c back in memory.
Instead of this, I get \\f and \f in the file. If I try to read the 0x0c with readtable(), I get a NA value.
I think the difference is that you want to write code into the text file, more or less in the same way we type it on the Julia side, whereas I consider that we are writing UTF-8, so we can directly write the same UTF-8 characters as in memory. Of course, we also need the formatting, i.e. commas and line breaks in the right places (and we can also play a bit with double quotes).
And as I understand the RFC-4180 you speak about, it should be possible to write the 0x0a char into the CSV file if it is enclosed in double quotes.
So in RFC-4180, having the three bytes 0x22 0x0a 0x22 is a valid field, but instead of using this, you write 0x22 0x5c 0x6e 0x22.
So I agree that if I want the string a"b, the text file will have to contain "a""b" as field because the double quote has a special meaning in the formatting.
But the backslash does not have this special meaning, so it is not necessary to play with backslash escaping.
So, to me, this file would comply with RFC-4180:
col1,col2,col3
line1,1,2
\(\lambda\),3,4
λ,5,6
\$,7,8
\n,9,10
"
",11,12
\f,13,14
,15,16
And it's perfectly possible to read and write it with CSV.jl:
"col1","col2","col3"
"line1",1,2
"\(\lambda\)",3,4
"λ",5,6
"\$",7,8
"\n",9,10
"
",11,12
"\f",13,14
"",15,16
(note that 0x0c is present before 15 but can't be displayed by GitHub, and as readtable can't read it, it puts NA instead)
Here we see that CSV.jl added double quotes around all strings (why not), but all the characters are preserved as they were => I have a good round trip (the only difference is that all string fields receive quotes).
If I do the same with readtable and writetable, I get this back:
"col1","col2","col3"
"line1",1,2
"\\(\\lambda\\)",3,4
"λ",5,6
"\\$",7,8
"\\n",9,10
"\n",11,12
"\\f",13,14
NA,15,16
Which has quite a lot of differences.
If I read the output of CSV.jl and write it back with readtable and writetable, I get:
"col1","col2","col3"
"line1",1,2
"\\(\\lambda\\)",3,4
"λ",5,6
"\\$",7,8
"\\n",9,10
"\n",11,12
"\\f",13,14
"\f",15,16
I find the results from CSV.jl more consistent.
I think the difference is that you seem to have the idea to write code in the text file in a more or less similar manner as the one we type when writing in the Julia side, whereas I consider we are writing in UTF8
This has absolutely nothing to do with the character set encoding. It has to do with the difference between the representation of a string (or sequence of bytes) as a literal or in a file, and the string itself. writetable and readtable happen to use a particular way of representing that string in a file, based on the Julia string literal escaping/unescaping rules (with the exception of the $ character). RFC-4180 uses a much more primitive method (i.e. put anything that has any special characters within quotes, and if you need to embed a quote, double it).
So in RFC-4180, having the three bytes 0x22 0x0a 0x22 is a valid field, but instead of using this, you write 0x22 0x5c 0x6e 0x22.
Using a backslash as an escape is not at all part of RFC-4180. Some things like MySQL don't follow RFC-4180, and do use the backslash to escape certain characters. There are very many variants of CSV or TSV files, with many incompatibilities.
I'd just avoid CSV like the plague, but if you must receive data already in the CSV format, then definitely use CSV.jl, I think @quinnj has handled quite a few of the variants out there
(the problem then being knowing just what the correct settings are for a particular file you need to read)
So, to me, this file would comply with RFC-4180:
No, that first example seems to be correct, according to RFC-4180.
I'll look at that NA you got using readtable (but later, it's almost 2am here!)
In my experience, \ is more common as the escape character, but that's also why it's configurable.
@quinnj My experience has been very much otherwise, and I'd always heard that the by far greatest producer of CSV files in the world has been Microsoft products, such as Excel.
I've verified that all of the CSV and TSV output formats (Common, Windows, MS-DOS, Tab) from Excel do not do any escaping with backslash (quoting follows RFC-4180).
Closing since the behavior in this package is unlikely to change and the preferred solution is CSV.jl.