It would be useful to have some cryptographic hash functions as builtins (for example MD5 and the SHA-1 and SHA-2 families), and perhaps some HMAC functions as well.
The main use-cases for such a feature would be for example:
jq is used as a pre-processor for loading some JSON object streams into a document-oriented database like CouchDB, MongoDB, etc.;
group_by(.key | sha1 | .[:2]);
I have quickly hacked a proof-of-concept based on the latest release (1.5): https://github.com/cipriancraciun/jq/tree/patches/sha1 , see also the diff at the link below:
https://github.com/stedolan/jq/compare/a5b5cbefb83935ce95ec62b9cadc8ec73026d33a...cipriancraciun:patches/sha1
If this is deemed useful I could provide the implementation for all these based on OpenSSL or GnuTLS.
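The group_by(.key | sha1 | .[:2]) use-case above amounts to bucketing records by a short prefix of a hash of their key. Since jq has no sha1 builtin yet, here is a minimal sketch of the same idea in Python; the helper names sha1_prefix and group_by_prefix and the sample records are made up for illustration and are not part of the proof-of-concept.

```python
import hashlib
from itertools import groupby


def sha1_prefix(key: str, length: int = 2) -> str:
    """Return the first `length` hex characters of the SHA-1 of `key`."""
    return hashlib.sha1(key.encode("utf-8")).hexdigest()[:length]


def group_by_prefix(records, length: int = 2):
    """Group records by the SHA-1 prefix of their "key" field.

    Like jq's group_by, the groups come back ordered by the grouping value.
    """
    def bucket(record):
        return sha1_prefix(record["key"], length)

    # groupby only merges adjacent items, so sort by the bucket first.
    ordered = sorted(records, key=bucket)
    return [list(group) for _, group in groupby(ordered, key=bucket)]


if __name__ == "__main__":
    # Hypothetical sample data.
    records = [{"key": "abc"}, {"key": "def"}, {"key": "abc"}, {"key": "xyz"}]
    for group in group_by_prefix(records):
        print(sha1_prefix(group[0]["key"]), group)
```

With two hex characters there are at most 256 buckets, which is roughly what one would want for sharding a load into a document store.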
I like the idea. @nicowilliams- what say you?
Unfortunately, it would mean either compiling in openssl/etc as a hard dependency or making it an optional feature during compilation. Either works, I suppose.
Of course, we could decide we finally want executable library loading (dlopen, LoadLibrary), and implement that. But dlopen and friends make me very sad.
I was looking to do this today. I have an array of objects that I wanted to fingerprint. Something like:
$ echo '[{"x": {"a": "a"}}, {"x": {"b": 3}}, {"x": {"c": "c"}}]' | \
jq '.[] |= . + {xhashed: .x | tostring}'
[
{
"x": {
"a": "a"
},
"xhashed": "{\"a\":\"a\"}"
},
{
"x": {
"b": 3
},
"xhashed": "{\"b\":3}"
},
{
"x": {
"c": "c"
},
"xhashed": "{\"c\":\"c\"}"
}
]
where tostring would be replaced by sha1 or sha256.
One thing to keep in mind is that you would need to canonicalize the object representation into some standard way before hashing. https://github.com/substack/json-stable-stringify comes to mind.
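To illustrate why canonicalization matters, here is a small Python sketch (not jq code). It assumes that "canonical" simply means compact output with sorted keys and ASCII escapes; a real canonical form would also have to pin down number formatting and similar details. The names canonical_json and fingerprint are made up for this example.

```python
import hashlib
import json


def canonical_json(value) -> str:
    """Serialize with sorted keys and no optional whitespace.

    Assumption: this is "canonical enough" for fingerprinting; it does not
    address number formatting or Unicode normalization.
    """
    return json.dumps(value, sort_keys=True, separators=(",", ":"), ensure_ascii=True)


def fingerprint(value) -> str:
    """SHA-256 of the canonical serialization, as a hex string."""
    return hashlib.sha256(canonical_json(value).encode("ascii")).hexdigest()


# The same object, written with two different key orders.
a = json.loads('{"a": 1, "b": 2}')
b = json.loads('{"b": 2, "a": 1}')

# Hashing the plain serialization depends on key order...
naive_a = hashlib.sha256(json.dumps(a).encode("utf-8")).hexdigest()
naive_b = hashlib.sha256(json.dumps(b).encode("utf-8")).hexdigest()

if __name__ == "__main__":
    print("naive hashes equal:    ", naive_a == naive_b)
    # ...while hashing the canonical form does not.
    print("canonical hashes equal:", fingerprint(a) == fingerprint(b))
```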
If we're worried about pulling in dependencies, we could just use a micro-lib from clibs or ccan such as https://github.com/jb55/sha256.c
I also noticed that when you call tostring on an object, even with the --sort-keys (-S) option, it still produces a string with unsorted keys. Bug?
Since I've opened this topic I've created a new branch with new "extensions", like one can see in the following examples and tests:
https://github.com/cipriancraciun/jq/tree/patches/extensions/src/_extensions/_examples
https://github.com/cipriancraciun/jq/tree/patches/extensions/src/_extensions/_tests
The branch is at:
https://github.com/cipriancraciun/jq/tree/patches/extensions
Now to answer @jb55's question: my crypto functions (MD5 and SHA family) come in two variants as seen in:
https://github.com/cipriancraciun/jq/blob/patches/extensions/src/_extensions/jqe_crypto_builtins.h
https://github.com/cipriancraciun/jq/blob/patches/extensions/src/_extensions/jqe_crypto.c
The crypto_sha256 function accepts any JSON value, calls jv_dump_string (which in the end calls jv_dump_term) with JV_PRINT_ASCII | JV_PRINT_SORTED, and then hashes the result. Thus, if jv_dump_term is implemented correctly, it should "canonize" the input.
The crypto_sha256_ll function, in contrast, takes an extra argument which says whether the value is expected to be a string already, or whether it should be transformed into a string like the previous function does.
See the following example:
jq '{
value : .,
value_hash : . | crypto_sha256,
tostring : . | tostring,
tostring_hash : . | tostring | crypto_sha256_ll (false),
}' <<'EOS'
{"a" : 1, "b" : 2}
{"b" : 2, "a" : 1}
"a"
["a"]
EOS
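For readers without the extensions branch at hand, the difference between the two variants can be imitated in Python. This is only a sketch of my reading of the description above: sha256_value and sha256_ll are hypothetical stand-ins for crypto_sha256 and crypto_sha256_ll, the boolean is assumed to mean "serialize first", and json.dumps is assumed to approximate jv_dump_string with JV_PRINT_ASCII | JV_PRINT_SORTED.

```python
import hashlib
import json


def dump_sorted(value) -> str:
    """Rough stand-in for a sorted, ASCII-only, compact JSON dump."""
    return json.dumps(value, sort_keys=True, separators=(",", ":"), ensure_ascii=True)


def tostring(value) -> str:
    """Rough stand-in for jq's tostring: strings pass through unchanged,
    everything else is dumped compactly WITHOUT sorting keys."""
    if isinstance(value, str):
        return value
    return json.dumps(value, separators=(",", ":"), ensure_ascii=False)


def sha256_value(value) -> str:
    """Hash any JSON value via its sorted dump (the "usual" variant)."""
    return hashlib.sha256(dump_sorted(value).encode("ascii")).hexdigest()


def sha256_ll(value, serialize: bool) -> str:
    """"Low-level" variant: with serialize=False the input must already be a
    string and its raw UTF-8 bytes are hashed."""
    if serialize:
        return sha256_value(value)
    if not isinstance(value, str):
        raise TypeError("expected a string when serialize is False")
    return hashlib.sha256(value.encode("utf-8")).hexdigest()


if __name__ == "__main__":
    inputs = ['{"a" : 1, "b" : 2}', '{"b" : 2, "a" : 1}', '"a"', '["a"]']
    for text in inputs:
        value = json.loads(text)
        print({
            "value": value,
            "value_hash": sha256_value(value),
            "tostring": tostring(value),
            "tostring_hash": sha256_ll(tostring(value), False),
        })
```

Under these assumptions the two objects get the same value_hash but different tostring_hash values, and the string "a" hashes differently in the two columns because one hashes the quoted JSON text and the other the raw character.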
jq allows a single name to be used for different defs with different arities. Why not just md5 and sha256?
Btw, the prefix . | can be dropped in each case.
> Why not just md5 and sha256?
First, I didn't want to "trash" the namespace, because perhaps some day these functions will be introduced in jq itself.
And why the two different named functions (i.e. crypto_md5 and crypto_md5_ll)? Mainly because I followed this pattern in the other "extensions", where there was a big difference between the "usual" and "low-level" ones.
> Btw, the prefix . | can be dropped in each case.
I know, but I always like to include it for the sake of completeness, because (from a functional-language point of view) it is equivalent to function(argument). Otherwise it looks to me as if field : function initializes the field with the function itself or with some constant.
Yes, we should do this. Questions:
what hash (and MAC) function implementations should we use?
One obvious answer is: use the code from various IETF RFCs. This won't be very optimized (at all, really), and may have timing side channels to worry about. (Speaking of timing side channels, we should have a constant-time string comparison function. In any case, we should not recommend jq for cryptographic security applications.)
should we have an option to use OpenSSL?
Probably, but we should also have an option to use statically-linked, in-tree implementations wherever possible, and not by including an entire copy of OpenSSL in-tree as we did for Oniguruma.
Also, we're going to need a base64 decoder. And, really, we need a binary "type" -- basically pretending that binary data is actually an array of small numbers (0..255, naturally). We don't want to be base64 coding all the time.
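As an illustration of the MAC and constant-time comparison points raised above, here is a small Python sketch using only the standard library. It is not a proposal for the jq implementation; the function names sign and verify and the sample key are made up for the example.

```python
import hashlib
import hmac


def sign(key: bytes, message: bytes) -> str:
    """Return the HMAC-SHA256 tag of `message` under `key`, as hex."""
    return hmac.new(key, message, hashlib.sha256).hexdigest()


def verify(key: bytes, message: bytes, tag: str) -> bool:
    """Check a tag with a comparison designed to reduce timing side channels.

    A plain == on strings can return as soon as the first differing character
    is found, which may leak how much of the tag was correct.
    """
    return hmac.compare_digest(sign(key, message), tag)


if __name__ == "__main__":
    key = b"example-key"  # hypothetical key, for illustration only
    message = b'{"a":1,"b":2}'
    tag = sign(key, message)
    print(verify(key, message, tag))
    print(verify(key, b'{"a":1,"b":3}', tag))
```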
For base64 decoder see #47
This works for me, but for my use case I need base64decode + md5 (calculating fingerprints of public ssh keys).
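The base64decode + md5 combination mentioned here can be sketched in Python as follows. It assumes the usual "type base64-blob comment" layout of a public key line and the colon-separated hex form of the fingerprint; md5_fingerprint is a made-up name, and the blob in the usage example is dummy data rather than a real key.

```python
import base64
import hashlib


def md5_fingerprint(public_key_line: str) -> str:
    """Turn "ssh-rsa AAAA... comment" into a colon-separated MD5 fingerprint."""
    fields = public_key_line.split()
    if len(fields) < 2:
        raise ValueError("expected '<type> <base64-blob> [comment]'")
    blob = base64.b64decode(fields[1])  # the base64decode step
    digest = hashlib.md5(blob).digest()  # the md5 step, on raw bytes
    # Iterating over bytes yields integers in 0..255, one per byte.
    return ":".join(f"{byte:02x}" for byte in digest)


if __name__ == "__main__":
    # Dummy blob for illustration only; NOT a real SSH key.
    dummy_blob = base64.b64encode(b"not a real key").decode("ascii")
    print(md5_fingerprint(f"ssh-rsa {dummy_blob} user@example"))
```

Note that the hash is computed over the decoded bytes, not over the base64 text, which is why a decoder (or some binary representation) is needed before hashing.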
This looks like a great idea, is it still planned?