Right now, the standard practice for reading an array from jq into a shell script is to use raw output and split on newlines.
However, JSON strings can contain literal newlines; this makes such parsing error-prone.
NUL-delimited output, which would let IFS= read -r -d '' string read exactly one C string unambiguously, would resolve this.
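As a minimal sketch of that consumer-side pattern (pure bash, no jq involved -- just a NUL-delimited byte stream):

```shell
#!/usr/bin/env bash
# Each `IFS= read -r -d ''` call consumes exactly one NUL-terminated
# record, so embedded newlines survive intact.
printf 'first\nrecord\0second record\0' |
while IFS= read -r -d '' item; do
  printf 'item: <<<%s>>>\n' "$item"
done
```

The first record prints across two lines, demonstrating that the embedded newline was not treated as a delimiter.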
@charles-dyfis-net is it not simpler in this case to keep newline escaping, instead of using raw output? This keeps a single item per line, which is easier to loop over in a shell script:
[
"LF\nLF",
"TAB\tTAB",
"FF\fFF"
]
.[]
$ jq '.[]' input.json
"LF\nLF"
"TAB\tTAB"
"FF\fFF"
Otherwise, you can actually add a character of your choice at the end of each line, directly from your jq filter:
.[]
| ( . + "\u0000")
$ jq '.[] | ( . + "\u0000")' input.json
"LF\nLF\u0000"
"TAB\tTAB\u0000"
"FF\fFF\u0000"
$ jq -r '.[] | ( . + "\u0000")' input.json | xxd
0000000: 4c46 0a4c 4600 0a54 4142 0954 4142 000a LF.LF..TAB.TAB..
0000010: 4646 0c46 4600 0a FF.FF..
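One wrinkle with the -r variant above: each record is followed by NUL and then a newline (the 00 0a pairs in the xxd dump), so a NUL-delimited reader sees a stray leading newline on every record after the first. Switching to -j, which suppresses the per-output newline, yields clean NUL-delimited records. A sketch, assuming input.json holds the sample array from above:

```shell
#!/usr/bin/env bash
# Recreate the sample input file from above.
printf '%s' '["LF\nLF","TAB\tTAB","FF\fFF"]' > input.json

# -j suppresses the trailing newline after each output, so NUL alone
# delimits records; read -d '' then recovers each string exactly.
jq -j '.[] | (. + "\u0000")' input.json |
while IFS= read -r -d '' s; do
  printf 'got: %q\n' "$s"
done
```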
Thank you -- I actually have a few StackOverflow answers I'm going to want to amend in light of the patterns suggested in this ticket.
That said, this still would be a desirable feature to have.
Newline escaping requires the consumer's code to perform unescaping -- while printf '%b' is POSIX-defined, it's hardly a common idiom, and without extensions such as bash's printf -v, command substitutions used to invoke it are themselves side-effecting, stripping trailing newlines. Moreover, lack of such unescaping is only visible/obvious in the error case, whereas reading a NUL-delimited stream as a line-delimited stream (or the inverse) is an easily-detected corner case. Finally, whereas common tools (xargs -0, sort -z, etc.) can deal with NUL-delimited streams, very few correctly grok "newline-delimited text, but with the specific correct set of escape sequences".
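To make the printf '%b' point concrete -- a sketch showing that the POSIX idiom does the unescaping, but a command substitution wrapped around it strips trailing newlines as a side effect, while bash's printf -v extension preserves them:

```shell
#!/usr/bin/env bash
escaped='line1\nline2\n'   # hypothetical backslash-escaped text (quotes already stripped)

# POSIX: printf '%b' expands the backslash escapes, but capturing the
# result via command substitution strips the trailing newline.
via_subst=$(printf '%b' "$escaped")

# bash extension: printf -v stores the result verbatim, newline included.
printf -v via_v '%b' "$escaped"

[ "$via_subst" = $'line1\nline2' ]    # trailing newline lost
[ "$via_v"     = $'line1\nline2\n' ]  # trailing newline preserved
```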
The patterns given here are helpful: though \x00\x0a is a bit harder to process on the consumer side than just \x00 (for purposes of xargs -0 &c), it's certainly better than where we were without them.
@charles-dyfis-net
If you use -j instead of -r then it won't output the trailing newline (\u000a) characters.
JSON (at least RFC 7159 JSON) does not permit unescaped ASCII control characters (U+0000 through U+001F), a range which includes the newline/linefeed character. jq neither accepts nor outputs JSON strings containing raw newlines.
I'm not sure how you've come across this as an issue. Can you show me a use case for this?
@wtlangford, gladly.
Consider the following contrived example:
#!/usr/bin/env bash
input_json='[{"value": "I am\na multiline\nvalue\twith a tab"}, {"value": "I am a second value"}]'
while IFS= read -r item; do
printf 'Shell script interpreted item as: %q\n' "$item"
printf '...as a literal: <<<%s>>>\n' "$item"
done < <(jq -r '.[] | .value' <<<"$input_json")
...where the intended output is (something equivalent to -- not all ksh-derivative shells implement printf %q in exactly the same way):
Shell script interpreted item as: $'I am\na multiline\nvalue\twith a tab'
...as a literal: <<<I am
a multiline
value with a tab>>>
Shell script interpreted item as: I\ am\ a\ second\ value
...as a literal: <<<I am a second value>>>
Instead, as given above, the actual output is:
Shell script interpreted item as: I\ am
...as a literal: <<<I am>>>
Shell script interpreted item as: a\ multiline
...as a literal: <<<a multiline>>>
Shell script interpreted item as: $'value\twith a tab'
...as a literal: <<<value with a tab>>>
Shell script interpreted item as: I\ am\ a\ second\ value
...as a literal: <<<I am a second value>>>
Now, to fix this, we can use NUL delimiters. That would modify our expression to be something like the following:
#!/usr/bin/env bash
input_json='[{"value": "I am\na multiline\nvalue\twith a tab"}, {"value": "I am a second value"}]'
while IFS= read -r -d '' item; do
printf 'Shell script interpreted item as: %q\n' "$item"
printf '...as a literal: <<<%s>>>\n' "$item"
done < <(jq -j '.[] | .value | (. + "\u0000")' <<<"$input_json")
...and it does in fact work exactly as desired. The only problem is that it requires the user to use some idioms that aren't completely obvious unless they read this ticket. :)
Ah. I see, you're using the raw output mode. It does, as you've found,
output unescaped newline characters, as it outputs the value of the json
strings and not the strings themselves. :)
I see your use case now. I'm not strictly averse to adding a new flag, but
at the same time, we try not to add new flags to the binary. I'd
definitely like to see some form of this added to the wiki, though.
@charles-dyfis-net you could also keep the list of values encoded as JSON, then use jq again within the loop to decode each JSON value into a raw string:
#!/bin/sh
{
jq '.[] | .value' << INPUT_JSON
[
{"value": "I am\na multiline\nvalue\twith a tab"},
{"value": "I am a second value"}
]
INPUT_JSON
} | {
while read -r jsonString
do
printf 'JSON Value: <<<%s>>>\n' "$jsonString"
printf 'Text Value: <<<%s>>>\n' "$( jq -r -n "$jsonString")"
done
}
JSON Value: <<<"I am\na multiline\nvalue\twith a tab">>>
Text Value: <<<I am
a multiline
value with a tab>>>
JSON Value: <<<"I am a second value">>>
Text Value: <<<I am a second value>>>
The conversion from JSON to text is done by jq -r -n "$jsonString": with the -n flag, the JSON string is passed as the filter itself (a string literal evaluates to itself), and the -r flag prints the resulting string raw.
@eric-brechemier, noted, though that's considerably less efficient than a single jq run.
I think I'm entirely happy with @wtlangford's suggestion of treating this as a doc enhancement rather than a software enhancement -- now it's just a question of whether and when I have the time to assign this to myself and generate a wiki edit incorporating the many suggestions given here. :)
@wtlangford without adding a new flag, you could repurpose the -j flag to accept an optional argument:
-j # join with empty character
--join-output='\u0000' # join with NUL
It seems to me that the matter of enhancing jq to support "joining with NUL" is of rather low priority, and certainly much lower than several other issues (notably the release of jq 1.6).
In any case, I suspect that most users who actually have the need to join with NUL can simply use the idiom:
jq -c ..... | tr '\n' '\0'
That is, I suspect that most such users are working in an environment that has tr.
If using tr is not an option, then chances are that using the -c option in some other way, perhaps in conjunction with jq's support for @tsv and/or "\u0000", will suffice to solve the problem at hand.
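A sketch of that tr idiom, under the assumption that the consumer is content to receive JSON-encoded values: jq -c keeps embedded newlines escaped as \n within each one-line JSON text, so rewriting the delimiter newlines to NUL is unambiguous.

```shell
#!/usr/bin/env bash
# jq -c emits one JSON text per line; tr rewrites the delimiters to NUL.
printf '%s' '[{"value":"a\nb"},{"value":"c"}]' |
jq -c '.[].value' | tr '\n' '\0' |
while IFS= read -r -d '' json; do
  # Each record is still JSON-encoded; decoding it is a separate step.
  printf 'JSON record: %s\n' "$json"
done
```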
Rather than expending the very limited resources available on supporting NUL-as-delimiter, I believe it would be far better to enhance support for the application/json-seq MIME type. Specifically, it should be easy to use jq to accept a JSON stream as input but produce json-seq as output (and vice versa), but currently the --seq option does not provide the flexibility to make this convenient.
(Note: To convert a stream of JSON texts to json-seq, one could use the form: jq -n --seq --slurpfile in <(STREAM) '$in[]' )
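A concrete run of that form, assuming a small inline stream (RFC 7464 json-seq prefixes each JSON text with an RS byte, 0x1e):

```shell
#!/usr/bin/env bash
# Convert a stream of two JSON texts to application/json-seq output;
# xxd makes the 0x1e separators visible.
jq -n --seq --slurpfile in <(printf '1\n"two"\n') '$in[]' | xxd
```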
@pkoppstein, tr does not address the use case given in the sample code above, wherein there is a need to distinguish between literal newlines and delimiter newlines, and conflating the two (as by converting _all_ newlines to delimiters) will cause the very ambiguity this feature (by selecting a delimiter not allowed in JSON strings even in escaped form) is intended to address.
@charles-dyfis-net - My point is that one can use jq -c (without the -r option) to insert the NULs, and then later on in the processing convert to "raw output" if that is really needed.
@pkoppstein, ...so what you have then is essentially the same proposal offered by @eric-brechemier of using multiple passes, with the same performance overhead -- which is to say, the need to invoke a separate instance of jq for each item of output to be processed to convert into ultimate raw form.
@charles-dyfis-net - My comments were mainly directed to the question of whether joining with NUL is really needed, not to the example which you yourself described as contrived.
For non-contrived problems, I suspect your concerns about efficiency are probably misplaced. Consider, for example, pipelines of the form:
while read -r line ; do MUNGE <<< "$line" | jq WHATEVER ; done < <(jq -c HEAVYLIFTING)
In realistic scenarios, the additional cost associated with the inner invocations of jq will almost certainly be relatively small, perhaps even to the point of insignificance if reasonable care is taken with the details.
The real issue here is probably https://github.com/stedolan/jq/issues/147
> Rather than expending the very limited resources available on supporting NUL-as-delimiter, I believe it would be far better to enhance support for the application/json-seq MIME type. Specifically, it should be easy to use jq to accept a JSON stream as input but produce json-seq as output (and vice versa), but currently the --seq option does not provide the flexibility to make this convenient.
@pkoppstein are you referring to this?
@eric-brechemier - That does seem to be related.
So, yeah, a -0 would actually be nice.
Yes please... pretty, pretty please!
I was thinking of working on this (it looks pretty simple), which option do people want?
Personally I think I would prefer the first one.
I ended up implementing the first option, but I'll be happy to change the PR to the other option if people prefer that.
Thanks! I suggested the second option to address the reluctance to introduce a new flag.
But using -0 directly would make the usage simpler.