Jq: Support for CSV-formatted strings

Created on 16 Apr 2018 · 35 comments · Source: stedolan/jq

Examples which I should expect to work:

$ echo '"1,2,3"' | jq -R "@csv"
jq: error (at <stdin>:1): string ("\"1,2,3\"") cannot be csv-formatted, only array
$ echo '"1,2,3"' | jq -r "@csv"
jq: error (at <stdin>:1): string ("1,2,3") cannot be csv-formatted, only array
$ echo '"1,2,3"' | jq -r "@tsv"
jq: error (at <stdin>:1): string ("1,2,3") cannot be tsv-formatted, only array

This works fine:

$ echo '[1,2,3]' | jq -r "@csv"
1,2,3
$ echo '[1,2,3]' | jq -r "@tsv"
1   2   3

However, the idea is to convert CSV from the string, not from the array.

My version:

$ jq --version
jq-1.5
Labels: feature request, jq function library, support

Most helpful comment

@nicowilliams but if fast is what we're going for then why not just code the CSV parser in C?

Take a look at this FSM implementation written in pure ANSI C:

I designed the code to be portable and easy to put in front of the lexer, as a pipe, or so I think.

  • The program can be compiled for standalone use, but with the main function
    removed, the exported function csv2json can be used as a filter between two FILE*.
  • The output consists of line-delimited JSON arrays.
  • The error reporting in the function csv2json is exhaustive and free of any IO context (it does not assume stderr is available), for easy integration.
  • All the tests defined in "test.csv" pass.
  • On my old laptop, a 3-megabyte CSV file is parsed in 0.101 seconds.

JJOR

All 35 comments

  1. Please read the jq documentation for @csv, the input of which must be a flat array.
  2. Consider this variant of your first try:

    echo '"1,2,3"' | jq -r 'split(",") | @csv'
    "1","2","3"

  3. For your reference, usage questions are best asked at stackoverflow.com with the jq tag: https://stackoverflow.com/questions/tagged/jq

Does it mean the feature won't be implemented?

Does it mean the feature won't be implemented?

I'm not sure what "feature" you mean, but as illustrated, it's trivially easy to convert a JSON string with embedded commas into valid CSV, and if you want the numbers in the JSON string to be recognized as such, you could modify the filter to: split(",") | map(tonumber? // .) | @csv.

In any case, to the best of my knowledge, there is no outstanding "enhancement request" under consideration for further changes to either @csv or @tsv.

This doesn't work as expected for me, since split also splits at quoted commas:

$ echo '1,"2a,2b",3' | jq -Rr '.|split(",")|@tsv'
1   "2a 2b" 3
$ echo '1,"2a,2b",3' | jq -Rr '.|split(",")|@csv'
"1","""2a","2b""","3"
$ echo '1,"2a,2b",3' | jq -Rr '.|split(",")|@text'
["1","\"2a","2b\"","3"]

So I know it takes more than just splitting on commas. I've already looked at scan()/capture() as potential tools, but the regex would be too complex.

I'm aware this workaround works:

$ echo '[1,"2a,2b",3]' | jq -r "@tsv"
1   2a,2b   3

but then the input is not proper CSV as expected, since brackets need to be added, so the input requires additional external processing. So I believe string processing could work in a similar way, supporting strings without the [ and ] brackets, like:

$ echo '"1,\"2a,2b\",3"' | jq -r "@tsv"
jq: error (at <stdin>:1): string ("1,\"2a,2b\...) cannot be tsv-formatted, only array

If I understand correctly, filters such as @csv and @tsv expect input in this specific format, the limitation being that 'the input must be an array'. Other filters seem to accept plain input, but these specific ones do not support the string format, so this feature request is about supporting strings as described; but I may misunderstand what 'input' means here.

I understand jq is not a proper converter, but I think it would be a good feature to support input CSV in a string format as well. Related GH-645.

Another use case: converting CSV to TSV. A lot of CLI tools can do this simple conversion, so why not make jq better?
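For reference, outside jq this conversion is a few lines in any language with an RFC 4180-style CSV reader. A Python sketch using the standard csv module (illustrative only, not part of jq):

```python
import csv
import io

def csv_to_tsv(text: str) -> str:
    """Convert CSV text to TSV; commas inside quoted fields survive."""
    out = io.StringIO()
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    for row in csv.reader(io.StringIO(text)):
        writer.writerow(row)
    return out.getvalue()

converted = csv_to_tsv('1,"2a,2b",3\n')  # one row: 1 / 2a,2b / 3
```

Note that the quoted field keeps its embedded comma, which is exactly what the naive split(",") approaches above lose.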

  1. Currently, the best way to get jq to read a CSV file is to use EITHER a CSV-to-TSV or CSV-to-JSON program that can pipe into jq, or perhaps to use a wrapper such as cq. For further details, see the jq FAQ: Can jq process CSV or TSV files?

  2. There has been some discussion about adding pre-filters, but given the backlog, such additions may be a long time in coming to fruition, especially given the predisposition of the jq maintainers to avoid new command-line options and bloat. That is, jq seeks to dovetail with other tools if available, which is the case with CSV, especially since jq can handle TSV input natively.

Having just hit a use case for this, I'm going to note the results of my brief research:

  • jq alone cannot really be used; many Stack Overflow examples use split(",") which doesn't really work if the data contains escaped commas, see e.g. https://github.com/stedolan/jq/issues/1650#issuecomment-381942184.
  • cq as suggested in the FAQ is relatively hard to get to, it's part of jqt which seems to be a small experimental project that doesn't even have binary releases on GitHub. The author states "jqt is developed under the Fedora Linux distribution, and a lot of portability issues are expected at this stage of development.".
  • Googling "csv to json command line" yields a couple of npm packages and quick'n'dirty scripts around the web. Unlike with jq itself, one has to try several options, evaluate them, etc.

So I don't think that CSV scenario has, currently, a similarly good solution as e.g. YAML (which is very-well covered by yq).
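In the meantime, a minimal CSV-to-JSON shim to sit in front of jq is only a few lines of Python. A sketch using only the standard library (the script and function names are made up, not one of the tools surveyed above):

```python
import csv
import json
import sys

def csv_to_ndjson(lines) -> str:
    """Turn CSV records into line-delimited JSON arrays, i.e. the shape
    jq's @csv produces and a hypothetical fromcsv would consume."""
    return "\n".join(json.dumps(row) for row in csv.reader(lines))

if __name__ == "__main__":
    # e.g.:  python csv2json.py < data.csv | jq '.[0]'
    print(csv_to_ndjson(sys.stdin))
```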

I'm unclear as to what you're asking for.

Is it that you want to be able to read CSV/TSV into arrays, or CSV/TSV and pass it through?

I do think it'd be nice if jq supported more input formats than raw and JSON. Ideally I'd like jq to support CSV/TSV as inputs (with and without value quoting), and maybe even YAML and XML. But if this isn't what you're asking for then never mind.

@nicowilliams Yes, process CSV with jq, or as you say, generally more input formats.

OK, thanks. We would definitely not use @ syntax for that -- that syntax has to be exclusively about output because otherwise we'd have confusing behavior, I think.

We'd have fromcsv and fromtsv builtins. In fact, you can code these in jq right now:

$ jq -cn '[range(5)]|@csv|split(",")'
["0","1","2","3","4"]

but that doesn't take into account quoting of string values. To handle that would require something more like a reduction-based parser:

    reduce (explode[]|[.]|implode) as $char ({result:[""],inquoted:false,nextisbackslashquoted:false};
        (.result|length|. -1) as $idx |
        if .inquoted and (.nextisbackslashquoted|not) and $char != "\"" then
            .result[$idx] += $char
        elif .inquoted and (.nextisbackslashquoted|not) and $char == "\"" then
            .inquoted = false # should remember that next $char must be comma (or no next $char cause EOL)
        elif .nextisbackslashquoted then
            .result[$idx] += $char
        elif .inquoted then
            .result[$idx] += $char
        elif $char == "," then
            .result += [""]
        elif .result[$idx]|length == 0 and $char == "\"" then
            .inquoted = true
        else
            .result[$idx] += $char
        end) | .result;

That needs a bit more work, but it does work:

$ jq -Rcn '
def fromcsv:
    reduce (explode[]|[.]|implode) as $char ({result:[""],inquoted:false,nextisbackslashquoted:false};
        (.result|length|. -1) as $idx |
        if .inquoted and (.nextisbackslashquoted|not) and $char != "\"" then
            .result[$idx] += $char
        elif .inquoted and (.nextisbackslashquoted|not) and $char == "\"" then
            .inquoted = false # should remember that next $char must be comma (or no next $char cause EOL)
        elif .nextisbackslashquoted then
            .result[$idx] += $char
        elif .inquoted then
            .result[$idx] += $char
        elif $char == "," then
            .result += [""]
        elif .result[$idx]|length == 0 and $char == "\"" then
            .inquoted = true
        else
            .result[$idx] += $char
        end) | .result; "\"a,b\",c,d e f,\"g,h i,j\""|fromcsv'
["a,b","c","d e f","g,h i,j"]
$ 

That needs a bit more work, but it does work

Yes, it works well, except perhaps for the handling of embedded carriage-returns. It would at least be desirable for fromcsv to be able to act as the inverse of @csv.

Consider therefore:

$ jq -rn '["a\nb","c"] | @csv'
"a
b","c"

In my view, the above is correct, but that is largely immaterial, because whether it's correct or not, we see that there's an invertibility issue:

$ jq -rn '["a\nb","c"] | @csv' | jq -Rcf fromcsv.jq
["a"]
["b\"","c"]

@pkoppstein Yes, that's one of the things I meant by "it needs a bit more work".

@nicowilliams - Yes, I guessed as much, but there is so much confusion/disagreement/inconsistency about this particular point, I thought it would be worthwhile illustrating how well-behaved @csv actually is.

@kenorb, @pkoppstein Here you go:

$ jq -cn '
def fromcsv:
    def initialize: {result:[""],ready:false,inquoted:false,nextisbackslashquoted:false};
    foreach (if endswith("\n") then . else . + "\n" end | explode[]|[.]|implode) as $char (initialize;
        (.result|length|. -1) as $idx |
        if .ready then initialize else . end |
        if .inquoted and (.nextisbackslashquoted|not) and $char != "\"" then
            .result[$idx] += $char
        elif .inquoted and (.nextisbackslashquoted|not) and $char == "\"" then
            .inquoted = false # should remember that next $char must be comma (or no next $char cause EOL)
        elif .nextisbackslashquoted then
            .result[$idx] += $char
        elif .inquoted then
            .result[$idx] += $char
        elif $char == "," then
            .result += [""]
        elif $char == "\n" then
            .ready = true
        elif .result[$idx]|length == 0 and $char == "\"" then
            .inquoted = true
        else
            .result[$idx] += $char
        end;
        if .ready then .result else empty end); "\"a,\nb\",c,d e f,\"g,h i,j\"\n\"a,z\"\n"|fromcsv'
["a,\nb","c","d e f","g,h i,j"]
["a,z"]
$ 

The foreach adds a newline to the input if it doesn't already have it, then iterates over every character in the input. The update expression does all the parsing. The extract expression extracts complete outputs.

You can check that fromcsv consumes the outputs of @csv. Mind you, I've still not thought it entirely through, so there may well be a bug in there.

This pattern can be extended to parse things like XML too if you like (but, wow, that might be a lot of work!).
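For readers more comfortable outside jq, the same init/update/extract state machine translates directly into an explicit loop. A simplified Python sketch (quoted fields and doubled quotes are handled, but not every edge case the jq versions above wrestle with):

```python
def fromcsv(text: str):
    """Character-by-character CSV parser in the same init/update/extract
    shape as the foreach version above (simplified sketch)."""
    if not text.endswith("\n"):
        text += "\n"                      # same newline normalization
    result, field = [], ""                # init: record and field accumulators
    inquoted = lastwasquote = False
    for char in text:                     # update: one transition per char
        if inquoted:
            if char == '"':
                inquoted, lastwasquote = False, True
            else:
                field += char
        elif lastwasquote and char == '"':
            field += '"'                  # doubled quote: literal "
            inquoted, lastwasquote = True, False
        elif char == ",":
            result.append(field)
            field, lastwasquote = "", False
        elif char == "\n":
            result.append(field)
            yield result                  # extract: record complete
            result, field, lastwasquote = [], "", False
        elif char == '"' and field == "":
            inquoted = True
        else:
            field += char
            lastwasquote = False
```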

Here's troublesome.csv, for which the invocation:

jq -rR 'include fromcsv; fromcsv|@csv'

should (I think we're agreed) be idempotent:

"C1","C2","C3","C4","C5"
"1"," ","5,9,13,17"
"esctab","\t","10","14","18"
"tab"," ","11","15","19"
"4","8","12","16","20"
"split
line"
"foo\\nbar"

With the most recent version, the "split" is lost entirely.

@pkoppstein It works fine if you add -s to the jq command-line! Can you figure out why? :)

Anyways, that reminds me we probably want a fromcsv/1 as well:

def fromcsv_initialize_helper:
    {result:[""],ready:false,inquoted:false,nextisbackslashquoted:false};
def fromcsv_update_helper($char):
    (.result|length|. -1) as $idx |
    if .ready then fromcsv_initialize_helper else . end |
    if .inquoted and (.nextisbackslashquoted|not) and $char != "\"" then
        .result[$idx] += $char
    elif .inquoted and (.nextisbackslashquoted|not) and $char == "\"" then
        .inquoted = false # should remember that next $char must be comma (or no next $char cause EOL)
    elif .nextisbackslashquoted then
        .result[$idx] += $char
    elif .inquoted then
        .result[$idx] += $char
    elif $char == "," then
        .result += [""]
    elif $char == "\n" then
        .ready = true
    elif .result[$idx]|length == 0 and $char == "\"" then
        .inquoted = true
    else
        .result[$idx] += $char
    end;
def fromcsv(in):
    foreach (in | if endswith("\n") then . else . + "\n" end | explode[]|[.]|implode) as $char (
        fromcsv_initialize_helper;
        fromcsv_update_helper($char);
        if .ready then .result else empty end);
def fromcsv:
    foreach (if endswith("\n") then . else . + "\n" end | explode[]|[.]|implode) as $char (
        fromcsv_initialize_helper;
        fromcsv_update_helper($char);
        if .ready then .result else empty end);

And here's how to use it:

$ jq -Rr -L /tmp 'include "fromcsv"; fromcsv(.//empty,inputs) | @csv' <<EOF
> "C1","C2","C3","C4","C5"
> "1"," ","5,9,13,17"
> "esctab","\t","10","14","18"
> "tab"," ","11","15","19"
> "4","8","12","16","20"
> "split
> line"
> "foo\\nbar"
> EOF
"C1","C2","C3","C4","C5"
"1"," ","5,9,13,17"
"esctab","\t","10","14","18"
"tab"," ","11","15","19"
"4","8","12","16","20"
"split
line"
"foo\nbar"
$ 
$ cat > /tmp/in <<EOF
"C1","C2","C3","C4","C5"
"1"," ","5,9,13,17"
"esctab","\t","10","14","18"
"tab"," ","11","15","19"
"4","8","12","16","20"
"split
line"
"foo\\nbar"
EOF
$ diff -u /tmp/in <(jq -Rr -L /tmp 'include "fromcsv"; fromcsv(.,inputs) | @csv' < /tmp/in)
$ echo $?
0
$ 

Did you catch that? I passed .//empty,inputs to fromcsv/1. That allows me not to worry about whether -n was used on the command-line. I.e., if I forgot to add -n this works because the first input, the one read by jq itself, gets prepended to the stream from inputs, and if I did remember to add -n then the first input is null and then the //empty causes it to be ignored.

Please beat up on this a bit more, then we can include this in the next release.

I prefer (./"")[] to explode[]|[.]|implode but I've not profiled the code.

@nicowilliams wrote:

It works fine if you add -s to the jq command-line! Can you figure out why? :)

Yes, but if one uses -s in conjunction with this second version, it will fail with ordinary CSV, such as:

C1,C2,C3,C4,C5
1,"""5,9,13,17"""
esctab,\t,10,14,18
tab,    ,11,15,19
4,8,12,16,20

Anyway, yes, having a stream-oriented version is the way to go! Fantastic! Alas, version 3 also fails with "ordinary CSV" as above....

@fadado wrote:

I prefer (./"")[] to explode[]|[.]|implode but I've not profiled the code.

Alas, using / is a little faster but requires way more memory:

$ /usr/bin/time -lp jq 'tostring|explode[]|[.]|implode| empty' jeopardy.json
user        31.83
sys          0.56
 984571904  maximum resident set size
    288772  page reclaims
     15149  involuntary context switches


$ /usr/bin/time -lp jq 'tostring | (./"")[] | empty' jeopardy.json
user        25.06
sys          2.97
4708450304  maximum resident set size
   1578353  page reclaims
     14215  involuntary context switches

We do need an explodes that streams... Anyways, as to @pkoppstein's "ordinary CSV" (quotes in original)... well, it ain't very ordinary. I specifically thought of and decided to ignore improper quoting, mostly because a) it was more work than I wanted to do, b) I can't quite figure out the correct way to parse something like this "a"b",c, or worse: a"b,c or similar. One option is to preserve quotes in the middle without treating them specially, so that both "a"b",c and a"b,c get parsed as ["a\"b","c"]. This requires adding some state and state update logic:

def fromcsv(in):
    def initialize:
        {result:[""],
         ready:false,
         inquoted:false,
         nextisbackslashquoted:false,
         firstwasquote:false,
         lastwasquote:false};
    def update($char):
        (.result|length|. -1) as $idx
      | if .ready then initialize else . end
      | if .inquoted and (.nextisbackslashquoted|not) and $char != "\"" then
            .result[$idx] += $char
          | .lastwasquote = false
        elif .inquoted and (.nextisbackslashquoted|not) and $char == "\"" then
            .inquoted = false            # should remember that next $char must be comma (or no next $char cause EOL)
          | .lastwasquote = true
        elif (.inquoted|not) and (.nextisbackslashquoted|not) and $idx > 0 and
             .lastwasquote and $char != "," and $char != "\n" then
            .result[$idx] += $char
          | .inquoted = .firstwasquote   # should remember that next $char must be comma (or no next $char cause EOL)
          | .lastwasquote = false
        elif .nextisbackslashquoted then
            .result[$idx] += $char
          | .lastwasquote = false
        elif .inquoted then
            .result[$idx] += $char
          | .lastwasquote = false
        elif $char == "," then
            .result += [""]
          | .inquoted = false
          | .lastwasquote = false
          | .firstwasquote = false
        elif $char == "\n" then
            .ready = true
        elif (.result[$idx]|length) == 0 and $char == "\"" then
            .firstwasquote = true
          | .inquoted = true
        else
            .result[$idx] += $char
        end;
    foreach (in | if endswith("\n") then . else . + "\n" end | explode[]|[.]|implode) as $char (
        initialize;
        update($char);
        if .ready then .result else empty end);
def fromcsv: fromcsv(.);

@nicowilliams - Yes, I also thought of the name explodes :-)

By "ordinary CSV" I mainly meant:

(a) there are no embedded raw newlines;
(b) fields need not always be quoted;
(c) values with any interior double-quotation marks must be quoted and the embedded quotes must be doubled.

I did not mean that fromcsv should handle pseudo-CSV, i.e. CSV that does not conform to any of the major "standards".

In particular, I agree that fromcsv should not be expected to tolerate improper quoting, such as the examples you give. (By contrast, the CSV value """5,9,13,17""" encodes the string "5,9,13,17".)
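That doubling rule is easy to check against any RFC 4180-style reader; for example, with Python's standard csv module as a reference implementation:

```python
import csv
import io

def decode_field(encoded: str) -> str:
    """Decode one RFC 4180-style CSV field (csv module as reference)."""
    return next(csv.reader(io.StringIO(encoded)))[0]

# The interior doubled quotes each decode to one double-quote character:
assert decode_field('"""5,9,13,17"""') == '"5,9,13,17"'
assert decode_field('"a""b"') == 'a"b'
```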

@pkoppstein Oh? There is a standard for CSV? Do you have a link to a specification?

RFC 4180 is not a standard, FYI. There's a few W3C documents on the subject, such as https://www.w3.org/TR/2016/NOTE-tabular-data-primer-20160225/ and https://www.w3.org/TR/2015/REC-tabular-data-model-20151217/ -- I'm not sure as to the status of those two.

Here's a blog post about this mess: https://chriswarrick.com/blog/2017/04/07/csv-is-not-a-standard/

Anyways, feel free to improve my fromcsv/1. I'm not going to spend any more time on it.

@nicowilliams - I was careful to put "standard" in quotation marks. (Maybe I should have written """standard""" :-)

FWIW, it's been my view that CSV is such a mess, and jq's existing support for TSV and @csv is so good (especially in conjunction with third-party tools), that providing a full CSV parser should be viewed as a very low-priority goal, certainly not worthy of someone with your talents!

@pkoppstein Ah yes, I missed that.

Oh, so in CSV escaping is only of double quotes, and only by doubling them?

Aight, one more refinement. Honestly, this code does not look pretty to me. There has got to be a better way to write this...

def fromcsv(in; $allow_backslash_escaping):
    def initialize:
        {result:[""],
         ready:false,
         inquoted:false,
         nextisbackslashquoted:false,
         firstwasquote:false,
         lastwasquote:false};
    def update($char):
        (.result|length|. -1) as $idx
      | if .ready then initialize else . end
      | if $char != "\"" and $char != "," and
           $char != "\n" and $char != "\\" then # " close double quotes for vim
            # Not a special char, just add it now
            .result[$idx] += $char
          | .lastwasquote = false
        elif (.nextisbackslashquoted|not) and $allow_backslash_escaping and
             $char == "\\" then # " close double quotes for vim
            # Backslash escaping next $char
            .nextisbackslashquoted = true
          | .lastwasquote = true
        elif .lastwasquote and $char == "\"" then
            # double-quote-quoted double-quote
            .result[$idx] += $char
          | .lastwasquote = false
        elif .inquoted and (.lastwasquote|not) and $char == "\"" then
            # double-quote either ending a value or escaping the next
            # $char that should be a double-quote; next char will tell
            # us which it is
            .lastwasquote = true
        elif .inquoted and .lastwasquote and $char == "," then
            # comma-ending a field
            .result += [""]
          | .inquoted = false
          | .lastwasquote = false
          | .firstwasquote = false
        elif .inquoted and .lastwasquote and $char == "\n" then
            # newline-ending a field
            .ready = true
        elif .inquoted or .nextisbackslashquoted then
            .result[$idx] += $char
          | .lastwasquote = false
          | .nextisbackslashquoted = false
        elif $char == "," then
            .result += [""]
          | .inquoted = false
          | .lastwasquote = false
          | .firstwasquote = false
        elif $char == "\n" then
            .ready = true
        elif (.result[$idx]|length) == 0 and $char == "\"" then
            .firstwasquote = true
          | .lastwasquote = false
          | .inquoted = true
        else
            .result[$idx] += $char
        end;
    foreach (in | if endswith("\n") then . else . + "\n" end | explode[]|[.]|implode) as $char (
        initialize;
        update($char);
        if .ready then .result else empty end);
def fromcsv(stream): fromcsv(stream; false);
def fromcsv: fromcsv(.);

[@nicowilliams - This is not a request or suggestion that you spend more time on this!]

The latest version works splendidly except when non-quoted fields are encountered. E.g. for the "ordinary" CSV presented above, the result has additional empty or null fields at the beginning of the follow-on records:

"C1","C2","C3","C4","C5"
"",,,,"1","""5,9,13,17"""
"",,,,,"esctab","\t","10","14","18"
"",,,,,,,,,"tab","  ","11","15","19"
"",,,,,,,,,,,,,"4","8","12","16","20"

Honestly, this code does not look pretty to me. There has got to be a better way to write this...

Yes: mimic the ABNF in https://tools.ietf.org/html/rfc4180:

# From: https://tools.ietf.org/html/rfc4180
#
# file = [header CRLF] record *(CRLF record) [CRLF]
# header = name *(COMMA name)
# record = field *(COMMA field)
# name = field
# field = (escaped / non-escaped)
# escaped = DQUOTE *(TEXTDATA / COMMA / CR / LF / 2DQUOTE) DQUOTE
# non-escaped = *TEXTDATA
# COMMA = %x2C
# CR = %x0D
# DQUOTE =  %x22
# LF = %x0A
# CRLF = CR LF
# TEXTDATA =  %x20-21 / %x23-2B / %x2D-7E

def fromcsv: # . is a record
    def csv($str; $len):
        # combinators (primitive: `|`, `,`, `recurse`: sequence, alternate and Kleene*)
        def plus(scanner): scanner | recurse(scanner); # Kleene+
        def optional(scanner): first(scanner , .);
        # scanners: position -> position
        def char(test): select(. < $len and ($str[.:.+1] | test)) | .+1; 
        def many($alphabet): select(. < $len) | last(plus(char(inside($alphabet)))) // empty;
        def CR: char(. == "\r");
        def LF: char(. == "\n");
        def COMMA: char(. == ",");
        def DQUOTE: char(. == "\"");
        def DQUOTE2: DQUOTE | DQUOTE;
        def TEXTDATA: char(.=="\t" or .>=" " and .!="," and .!="\"" and .!="\u007F");
        def SPACE: optional(many(" \t"));
        def non_escaped: recurse(TEXTDATA);
        def escaped: DQUOTE | recurse(first(TEXTDATA , COMMA , CR , LF , DQUOTE2)) | DQUOTE;
        # Parse fields and records
        def field: . as $i | first((escaped|[true,$i,.]) , (non_escaped|[false,$i,.]));
        def record:
            def r:
                field as [$e,$i,$j]
                | if $e then $str[$i+1:$j-1] else $str[$i:$j] end
                , ($j | SPACE | COMMA | SPACE | r);
            0|r
        ;
        # Collect record fields
        [ record ]
    ;
    rtrimstr("\r\n") as $str | csv($str; $str|length)
;

def fromcsv(stream): # stream of records
    stream | fromcsv
;

Considerations:

  • This is an example of _jq_'s capabilities for implementing backtracking parsers!
  • Input is assumed to be compliant with the RFC record production.
  • As a bonus, I accept optional space around commas.
  • Space at the beginning/end of records is not accepted.
  • The management of records with embedded \n or \r inside quoted fields is the fromcsv caller's responsibility.
  • I ignored any efficiency considerations.

Happy hacking!

@fadado Nice! Reminds me of how one does parsing in Icon. In Icon there's a "scanning context" that consists of a string and start and end offsets into the string, and parsing functions basically modify the current scanning context as they go, on backtracking undoing their side-effect of modifying the current scanning context. The current scanning context is akin to a dynamic global in a LISP, so that you can push new scanning contexts.

EDIT: What @fadado is doing is to setup things so that $str and $len and . define the current scanning context, with . as the current offset into $str. The parse finishes when . == $len. Backtracking works because if you end up outputting an earlier value of . then you've rewound so an alternative can be used.

@fadado's code isn't quite right though, as I'm sure some might have noticed. I think making it perfect will require changing . to be a bit more complex than just an offset into $str...

@fadado Here's an improved version of yours:

# From: https://tools.ietf.org/html/rfc4180
#
# file = [header CRLF] record *(CRLF record) [CRLF]
# header = name *(COMMA name)
# record = field *(COMMA field)
# name = field
# field = (escaped / non-escaped)
# escaped = DQUOTE *(TEXTDATA / COMMA / CR / LF / 2DQUOTE) DQUOTE
# non-escaped = *TEXTDATA
# COMMA = %x2C
# CR = %x0D
# DQUOTE =  %x22
# LF = %x0A
# CRLF = CR LF
# TEXTDATA =  %x20-21 / %x23-2B / %x2D-7E

def fromcsv(stream): # . is a record
    # We keep track of the string and its length via the $str and $len
    # argument bindings of the csv/2 local function below.
    #
    # We keep "mutable" state as the input to the csv/2 local function
    # and all its local functions.
    #
    # This is very similar to how string parsing is done in Icon, which
    # has a "string scanning context" that consists of a string and an
    # offset into that string.
    #
    # Backtracking is accomplished by using `empty` and ultimately
    # causing a different `.` to be output (or none at all).
    #
    def init($str): # Setup / update parse state:
        {o:0,             # offset of current/next char
         u:"",            # accumulator of current field's content
         r:[],            # accumulator of current record's fields
         start:-1,        # start of current field (including dquote)
         end:-1,          # end   of current field
         wantmore:false,  # true if we need more input for split dquoted-string
         s:$str,          # copy of $str used for wantmore processing
        };

    # Setup string scanning context and parse it:
    def csv($str; $len):
        def fieldstart: .start=.o;
        def fieldend: .end=.o;
        def update_escaped: (.u = .u + $str[.ddquote:.o - 1]) | (.ddquote = .o);
        def finish_escaped:  .u = .u + $str[.ddquote:.end];
        def extract_non_escaped: (.u=$str[.start:.end]);
        def wantmore: .wantmore = true;
        # Combinators (primitive: `|`, `,`, `recurse`: sequence, alternate and Kleene*)
        def plus(scanner): scanner | recurse(scanner); # Kleene+
        def optional(scanner): first(scanner , .);
        # Scanners: current offset -> new offset
        def char(test): select(.o < $len and ($str[.o:.o+1] | test)) | (.o+=1);
        def many(test): select(.o < $len) | last(plus(char(test))) // empty;
        # Terminals
        def CR: char(. == "\r");
        def LF: char(. == "\n");
        def COMMA: char(. == ",");
        def DQUOTE: char(. == "\"");
        def DQUOTE2:
            DQUOTE | DQUOTE
          | update_escaped;
        def TEXTDATA: many((.=="\t" or .>=" ") and . != "," and . != "\"" and . != "\u007F");
        # Rules
        def escaped:      # ABNF: escaped = DQUOTE *(TEXTDATA / COMMA / CR / LF / 2DQUOTE) DQUOTE
            fieldstart
          | DQUOTE
          | (.ddquote=.o)
          | last(plus(first(TEXTDATA , COMMA , CR , LF , DQUOTE2)))
          | fieldend
          | ((DQUOTE | finish_escaped) // wantmore);
        def non_escaped:  # ABNF: non-escaped = *TEXTDATA
            fieldstart
          | TEXTDATA
          | fieldend
          | extract_non_escaped;
        # Parse fields and records
        def field:        # ABNF: field = (escaped / non-escaped)
            first(escaped, non_escaped);
        #def record: field | last(plus(COMMA | field));
        def record:       # ABNF: record = field *(COMMA field)
            # This one is complicated because we need to handle
            # .wantmore == true.  It can probably be simplified though.
            def r:
                if .wantmore then .wantmore = false else .  end
              | field
              | if .wantmore then .
                else
                    . as $i
                  | .r += [.u]
                  | . as $i
                  | ((.start=-1 | .end=-1 | COMMA | r) // $i)
                end;
            r;
        # Collect record fields
        record ;
    foreach stream as $str (
        init("");
        if .wantmore then       # Append new input to prev and restart parse
            .s += $str
          | .o = .start
        else init($str) end     # Reset state for new parse
      | csv(.s; .s|length);     # Parse
        select(.wantmore|not).r # Output complete parses
    );

def fromcsv: fromcsv(.);
$ jq -Rrcn -L . 'include "fromcsv"; fromcsv("\"a\n","\"\"b\"\"\",c","\"a\n\"\"b\"\"\",c","a,b c,d")'
["a\n\"b\"","c"]
["a\n\"b\"","c"]
["a","b c","d"]
$ jq -Rrcn -L . 'include "fromcsv"; fromcsv("\"a\n","\"\"b\"\"\",c","\"a\n\"\"b\"\"\",c","a,b c,d")|@csv'
"a
""b""","c"
"a
""b""","c"
"a","b c","d"
$ 

@fadado - Thank you so much for your inspiration.

Over at https://gist.github.com/pkoppstein/bbbbdf7489c8c515680beb1c75fa59f2
I've posted a parser that is similar to @fadado's but with an emphasis on efficiency (see "Performance" below), some error reporting (record number and context), and the "proper" handling of various edge cases. The parser is inspired by published PEG grammars, and can largely be viewed as a PEG parser.

The main filters are fromcsv and fromcsvfile: the former handles raw NEWLINEs within quoted fields on a per-string basis. If a CSV file has quoted fields with embedded raw NEWLINEs, then for it to be processed properly, it must either be "slurped" or fed to fromcsvfile.

Some other noteworthy points:

  1. The parser uses jq's support for regular expressions, for speed and simplicity.

  2. Various complexities of CSV are handled, including:

    • CRLF and CR are allowed as end-of-record markers;
    • empty fields can be represented by adjacent double quotes,
      e.g. a,"","""",d => ["a","","\"","d"]
    • lines such as a,,,b and ,,, are handled
    • a, ,b is interpreted as a," ",b
    • quoted fields can contain raw newlines
  3. Spaces around quoted fields are ignored.

Examples:

Invocation: jq -nRc 'include "fromcsvfile"; fromcsvfile' <<< TEXT

  1. Input need not be rectangular

$'a,b\nc'
=>
["a","b"]
["c"]

  2. Null fields can be specified in several ways

$'a,"",b,,c\n,,'

=>

["a","","b","","c"]
["","",""]

  3. Empty lines are ignored

$'a,b\n\nc,d'

=>

["a","b"]
["c","d"]

  4. Edge cases

a
=>
["a"]

$'a\n'
=>
["a"]

Performance:

There is a sample CSV file at http://support.spatialkey.com/spatialkey-sample-csv-data/
called Sacramentorealestatetransactions.csv
It uses '\r' as the line delimiter so I ran tr '\r' '\n' to convert it for testing purposes here.

IN=Sacramentorealestatetransactions.csv

/usr/bin/time -lp jq -nRc 'include "fromcsvfile"; fromcsvfile' $IN > OUT 
user         0.77
sys          0.00
   2473984  maximum resident set size
/usr/bin/time -lp jq -Rc 'include "fromcsvfile"; fromcsv'  $IN > OUT
user         0.73
sys          0.00
   2322432  maximum resident set size
# @nicowilliams' parser as posted earlier in this thread
/usr/bin/time -lp jq -nRc 'include "nw-fromcsv"; fromcsv(inputs)' $IN > OUT
user         1.33
sys          0.00
   2379776  maximum resident set size

Yes, @fadado's approach consumes a fair bit of stack. My original approach is very efficient, but also very ugly. I'll look at your gist tomorrow @pkoppstein.

@nicowilliams - Yes, I’d appreciate it if you could review the gist, especially with respect to (in decreasing order of importance)

  1. correctness (i.e., does it achieve what we want?)
  2. efficiency of the handling of newlines within fields (i.e., avoiding the need to slurp globally, while retaining the PEG orientation of the string-parser)
  3. elegance (i.e. minimizing departure from the aforementioned PEG orientation)
  4. ease of understanding that the implementation is “correct”, or simplicity of a proof of “correctness”.

Ultimately, I think that, after removing all references to PEG in particular, the above criteria in the order shown should be considered in deciding whatever goes into jq. For me, in the present context, adherence to any particular formal grammar or formalism is less important than practicality. With respect to 1 and 2 above, I have quite a large number of “specification-by-example” test cases that I am working on with a view to being more automated.

Thanks.

@pkoppstein Regarding the use of regular expressions, that's probably efficient as to jq code, but if fast is what we're going for then why not just code the CSV parser in C?

The fastest jq code not using regexps is still going to be a foreach-based state machine. How to make that pretty is still of interest to me. Eventually we should collect all of these techniques (@fadado's, yours, and the state machine approach) into a nice wiki page on how to parse with jq.

Regarding state machines, it might be nice to do something like:

    def STATE_A_P: ...; # true if we're in state A
    def STATE_A_NEXT: ...; # consume current char and change state
    ...
    foreach ... as $in (init;
        if STATE_A_P then STATE_A_NEXT
        elif STATE_B_P then STATE_B_NEXT
        ...;
        EXTRACT_F)
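That state-machine shape can be made concrete in ordinary code; here is a rough sketch in Python (rather than jq, purely for illustration) of a character-at-a-time CSV field splitter with explicit states mirroring the STATE_*_P / STATE_*_NEXT structure above. It handles quoting and escaped quotes but deliberately ignores CR/CRLF handling:

```python
# States for the character-at-a-time CSV machine.
FIELD, QUOTED, QUOTE_IN_QUOTED = range(3)

def fromcsv(line):
    """Split one CSV record into a list of field strings."""
    state, field, row = FIELD, [], []
    for c in line:
        if state == FIELD:
            if c == '"' and not field:
                state = QUOTED              # opening quote of a quoted field
            elif c == ',':
                row.append(''.join(field))  # unquoted field ends
                field = []
            else:
                field.append(c)
        elif state == QUOTED:
            if c == '"':
                state = QUOTE_IN_QUOTED     # closing quote or first half of ""
            else:
                field.append(c)
        else:  # QUOTE_IN_QUOTED
            if c == '"':
                field.append('"')           # "" inside quotes is a literal quote
                state = QUOTED
            elif c == ',':
                row.append(''.join(field))  # quoted field ended
                field = []
                state = FIELD
    row.append(''.join(field))
    return row
```

The same dispatch would translate into a jq foreach over `splits("")` with the state carried in the accumulator.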

I find @fadado's approach extremely pleasing, naturally, though yeah, it's not the most efficient.

Oh, and it'd be nice to have something of a switch syntax. With varargs (which I have in a branch) we could build a trivial jq-coded switch that has condition expressions on even arguments and actions on odd arguments. Nicer still would be syntax that allows one to mix destructuring with predicates so that if a predicate is true then the bindings established in it are available to the actions.

@nicowilliams - Please note that two relatively minor problems with the gist have been attended to, so that now it "passes" all the tests defined in the "test.csv" file at
https://gist.github.com/pkoppstein/a5364d07dc52d99163d6390df6375fd1

By design, most of the lines in test.csv explore edge cases. The results that I would argue are what we want are in the "test.csv.json" file at https://gist.github.com/pkoppstein/e962076c25e75af201e958bdff92fdbf

Not exactly by design, but not surprisingly, the most recent version (relative to this note) of your EBNF parser "fails" most of these tests in the sense of producing output at variance with test.csv.json. In some cases, though, the "failure" is indubitable.

@nicowilliams but if fast is what we're going for then why not just code the CSV parser in C?

Take a look at this FSM implementation written in pure ANSI C:

I designed the code to be portable and to be easy to place before the lexer, as a pipe, or so I believe.

  • The program can be compiled for standalone use; with the main function removed,
    the exported function csv2json can be used as a filter between two FILE*.
  • The output consists of line-delimited JSON arrays.
  • The error reporting in the function csv2json is exhaustive and free of any IO context (it does not assume stderr is available). This is for easy integration.
  • All the tests defined in "test.csv" pass.
  • On my old laptop, a 3-megabyte CSV file is parsed in 0.101 seconds.

JJOR
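The overall shape of that pipeline (CSV in, line-delimited JSON arrays out, ready for jq to consume one value at a time) can be sketched in a few lines of Python; this is only an illustration of the interface the C filter provides, not the FSM itself:

```python
import csv
import io
import json

def csv2json(infile, outfile):
    # Emit one JSON array per CSV record, i.e. line-delimited JSON,
    # which jq can then read one value at a time.
    for record in csv.reader(infile):
        outfile.write(json.dumps(record) + '\n')

# In a real pipeline this would be csv2json(sys.stdin, sys.stdout),
# used as `csv2json < data.csv | jq ...`; here StringIO stands in.
demo_out = io.StringIO()
csv2json(io.StringIO('a,b\n1,"x,y"\n'), demo_out)
```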

