Right now, the standard practice for reading an array from jq into a shell script is to use raw output and split on newlines.
However, JSON strings can contain literal newlines; this makes such parsing error-prone.
NUL-delimited output, which would let IFS= read -r -d '' string read exactly one C string unambiguously, would resolve this.
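As a minimal sketch of that consumer-side pattern (pure bash, no jq involved -- just a NUL-delimited byte stream):

```shell
#!/usr/bin/env bash
# Each `IFS= read -r -d ''` call consumes exactly one NUL-terminated
# record, so embedded newlines survive intact.
printf 'first\nrecord\0second record\0' |
while IFS= read -r -d '' item; do
  printf 'item: <<<%s>>>\n' "$item"
done
```

The first record prints across two lines, demonstrating that the embedded newline was not treated as a delimiter.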
@charles-dyfis-net is it not simpler in this case to keep newline escaping, instead of using raw output? This keeps a single item per line, which is easier to loop over in a shell script:
[
"LF\nLF",
"TAB\tTAB",
"FF\fFF"
]
.[]
$ jq '.[]' input.json
"LF\nLF"
"TAB\tTAB"
"FF\fFF"
Otherwise, you can actually add a character of your choice at the end of each line, directly from your jq filter:
.[]
| ( . + "\u0000")
$ jq '.[] | ( . + "\u0000")' input.json
"LF\nLF\u0000"
"TAB\tTAB\u0000"
"FF\fFF\u0000"
$ jq -r '.[] | ( . + "\u0000")' input.json | xxd
0000000: 4c46 0a4c 4600 0a54 4142 0954 4142 000a LF.LF..TAB.TAB..
0000010: 4646 0c46 4600 0a FF.FF..
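One wrinkle with the -r variant above: each record is followed by NUL and then a newline (the 00 0a pairs in the xxd dump), so a NUL-delimited reader sees a stray leading newline on every record after the first. Switching to -j, which suppresses the per-output newline, yields clean NUL-delimited records. A sketch, assuming input.json holds the sample array from above:

```shell
#!/usr/bin/env bash
# Recreate the sample input file from above.
printf '%s' '["LF\nLF","TAB\tTAB","FF\fFF"]' > input.json

# -j suppresses the trailing newline after each output, so NUL alone
# delimits records; read -d '' then recovers each string exactly.
jq -j '.[] | (. + "\u0000")' input.json |
while IFS= read -r -d '' s; do
  printf 'got: %q\n' "$s"
done
```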
Thank you -- I actually have a few StackOverflow answers I'm going to want to amend in light of the patterns suggested in this ticket.
That said, this still would be a desirable feature to have.
Newline escaping requires the consumer's code to perform unescaping -- while printf '%b' is POSIX-defined, it's hardly a common idiom, and without extensions such as bash's printf -v, command substitutions used to invoke it are themselves side-effecting, stripping trailing newlines. Moreover, lack of such unescaping is only visible/obvious in the error case, whereas reading a NUL-delimited stream as a line-delimited stream (or the inverse) is an easily-detected corner case. Finally, whereas common tools (xargs -0, sort -z, etc.) can deal with NUL-delimited streams, very few correctly grok "newline-delimited text, but with the specific correct set of escape sequences".
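To make the printf '%b' point concrete -- a sketch showing that the POSIX idiom does the unescaping, but a command substitution wrapped around it strips trailing newlines as a side effect, while bash's printf -v extension preserves them:

```shell
#!/usr/bin/env bash
escaped='line1\nline2\n'   # hypothetical backslash-escaped text (quotes already stripped)

# POSIX: printf '%b' expands the backslash escapes, but capturing the
# result via command substitution strips the trailing newline.
via_subst=$(printf '%b' "$escaped")

# bash extension: printf -v stores the result verbatim, newline included.
printf -v via_v '%b' "$escaped"

[ "$via_subst" = $'line1\nline2' ]    # trailing newline lost
[ "$via_v"     = $'line1\nline2\n' ]  # trailing newline preserved
```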
The patterns given here are helpful: though \x00\x0a is a bit harder to process on the consumer side than just \x00 (for purposes of xargs -0 &c), it's certainly better than where we were without them.
@charles-dyfis-net
If you use -j instead of -r then it won't output the trailing newline (\u000a) characters.
JSON (at least RFC 7159 JSON) does not permit unescaped ASCII control characters (U+0000 through U+001F), a range which includes the newline/linefeed character. jq neither accepts nor outputs JSON strings containing raw newlines.
I'm not sure how you've come across this as an issue. Can you show me a use case for this?
@wtlangford, gladly.
Consider the following contrived example:
#!/usr/bin/env bash
input_json='[{"value": "I am\na multiline\nvalue\twith a tab"}, {"value": "I am a second value"}]'
while IFS= read -r item; do
printf 'Shell script interpreted item as: %q\n' "$item"
printf '...as a literal: <<<%s>>>\n' "$item"
done < <(jq -r '.[] | .value' <<<"$input_json")
...where the intended output is (something equivalent to -- not all ksh-derivative shells implement printf %q in exactly the same way):
Shell script interpreted item as: $'I am\na multiline\nvalue\twith a tab'
...as a literal: <<<I am
a multiline
value with a tab>>>
Shell script interpreted item as: I\ am\ a\ second\ value
...as a literal: <<<I am a second value>>>
Instead, as given above, the actual output is:
Shell script interpreted item as: I\ am
...as a literal: <<<I am>>>
Shell script interpreted item as: a\ multiline
...as a literal: <<<a multiline>>>
Shell script interpreted item as: $'value\twith a tab'
...as a literal: <<<value with a tab>>>
Shell script interpreted item as: I\ am\ a\ second\ value
...as a literal: <<<I am a second value>>>
Now, to fix this, we can use NUL delimiters. That would modify our expression to be something like the following:
#!/usr/bin/env bash
input_json='[{"value": "I am\na multiline\nvalue\twith a tab"}, {"value": "I am a second value"}]'
while IFS= read -r -d '' item; do
printf 'Shell script interpreted item as: %q\n' "$item"
printf '...as a literal: <<<%s>>>\n' "$item"
done < <(jq -j '.[] | .value | (. + "\u0000")' <<<"$input_json")
...and it does in fact work exactly as desired. The only problem is that it requires the user to use some idioms that aren't completely obvious unless they read this ticket. :)
Ah. I see, you're using the raw output mode. It does, as you've found,
output unescaped newline characters, as it outputs the value of the json
strings and not the strings themselves. :)
I see your use case now. I'm not strictly averse to adding a new flag, but
at the same time, we try not to add new flags to the binary. I'd
definitely like to see some form of this added to the wiki, though.
@charles-dyfis-net you could also keep the list of values encoded as JSON, then use jq again within the loop to decode each JSON value into a raw string:
#!/bin/sh
{
jq '.[] | .value' << INPUT_JSON
[
{"value": "I am\na multiline\nvalue\twith a tab"},
{"value": "I am a second value"}
]
INPUT_JSON
} | {
while read -r jsonString
do
printf 'JSON Value: <<<%s>>>\n' "$jsonString"
printf 'Text Value: <<<%s>>>\n' "$( jq -r -n "$jsonString")"
done
}
JSON Value: <<<"I am\na multiline\nvalue\twith a tab">>>
Text Value: <<<I am
a multiline
value with a tab>>>
JSON Value: <<<"I am a second value">>>
Text Value: <<<I am a second value>>>
The conversion from JSON to text is done by jq -r -n "$jsonString": with the -n flag, the JSON string is passed as the filter itself (a string literal evaluates to itself), and the -r flag prints the resulting string raw.
@eric-brechemier, noted, though that's considerably less efficient than a single jq run.
I think I'm entirely happy with @wtlangford's suggestion of treating this as a doc enhancement rather than a software enhancement -- now it's just a question of whether and when I have the time to assign this to myself and generate a wiki edit incorporating the many suggestions given here. :)
@wtlangford without adding a new flag, you could repurpose the -j flag to accept an optional argument:
-j # join with empty character
--join-output='\u0000' # join with NUL
It seems to me that the matter of enhancing jq to support "joining with NUL" is of rather low priority, and certainly much lower than several other issues (notably the release of jq 1.6).
In any case, I suspect that most users who actually have the need to join with NUL can simply use the idiom:
jq -c ..... | tr '\n' '\0'
That is, I suspect that most such users are working in an environment that has tr.
If using tr is not an option, then chances are that using the -c option in some other way, perhaps in conjunction with jq's support for @tsv and/or "\u0000", will suffice to solve the problem at hand.
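A sketch of that tr idiom, under the assumption that the consumer is content to receive JSON-encoded values: jq -c keeps embedded newlines escaped as \n within each one-line JSON text, so rewriting the delimiter newlines to NUL is unambiguous.

```shell
#!/usr/bin/env bash
# jq -c emits one JSON text per line; tr rewrites the delimiters to NUL.
printf '%s' '[{"value":"a\nb"},{"value":"c"}]' |
jq -c '.[].value' | tr '\n' '\0' |
while IFS= read -r -d '' json; do
  # Each record is still JSON-encoded; decoding it is a separate step.
  printf 'JSON record: %s\n' "$json"
done
```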
Rather than expending the very limited resources available on supporting NUL-as-delimiter, I believe it would be far better to enhance support for the application/json-seq MIME type. Specifically, it should be easy to use jq to accept a JSON stream as input but produce json-seq as output (and vice versa), but currently the --seq option does not provide the flexibility to make this convenient.
(Note: To convert a stream of JSON texts to json-seq, one could use the form: jq -n --seq --slurpfile in <(STREAM) '$in[]' )
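A concrete run of that form, assuming a small inline stream (RFC 7464 json-seq prefixes each JSON text with an RS byte, 0x1e):

```shell
#!/usr/bin/env bash
# Convert a stream of two JSON texts to application/json-seq output;
# xxd makes the 0x1e separators visible.
jq -n --seq --slurpfile in <(printf '1\n"two"\n') '$in[]' | xxd
```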
@pkoppstein, tr does not address the use case given in the sample code above, wherein there is a need to distinguish between literal newlines and delimiter newlines, and conflating the two (as by converting _all_ newlines to delimiters) will cause the very ambiguity this feature (by selecting a delimiter not allowed in JSON strings even in escaped form) is intended to address.
@charles-dyfis-net - My point is that one can use jq -c (without the -r option) to insert the NULs, and then later on in the processing convert to "raw output" if that is really needed.
@pkoppstein, ...so what you have then is essentially the same proposal offered by @eric-brechemier of using multiple passes, with the same performance overhead -- which is to say, the need to invoke a separate instance of jq for each item of output to be processed to convert into ultimate raw form.
@charles-dyfis-net - My comments were mainly directed to the question of whether joining with NUL is really needed, not to the example which you yourself described as contrived.
For non-contrived problems, I suspect your concerns about efficiency are probably misplaced. Consider, for example, pipelines of the form:
while read -r line ; do MUNGE <<< "$line" | jq WHATEVER ; done < <(jq -c HEAVYLIFTING)
In realistic scenarios, the additional cost associated with the inner invocations of jq will almost certainly be relatively small, perhaps even to the point of insignificance if reasonable care is taken with the details.
The real issue here is probably https://github.com/stedolan/jq/issues/147
> Rather than expending the very limited resources available on supporting NUL-as-delimiter, I believe it would be far better to enhance support for the application/json-seq MIME type. Specifically, it should be easy to use jq to accept a JSON stream as input but produce json-seq as output (and vice versa), but currently the --seq option does not provide the flexibility to make this convenient.
@pkoppstein are you referring to this?
@eric-brechemier - That does seem to be related.
So, yeah, a -0 would actually be nice.
Yes please... pretty, pretty please!
I was thinking of working on this (it looks pretty simple), which option do people want?
Personally I think I would prefer the first one.
I ended up implementing the first option, but I'll be happy to change the PR to the other option if people prefer that.
Thanks! I suggested the second option to address the reluctance to introduce a new flag.
But using -0 directly would make the usage simpler.