Given this file
{
  "releases": [
    {
      "date": "1998-05-12"
    },
    {
      "date": "1997-05-12"
    },
    {
      "date": "1999-05-12"
    }
  ]
}
I would like to run the date values through the date command, for example:
$ date +%s -d 1998-05-12
894949200
so that the final result is
{
  "releases": [
    {
      "date": 894949200
    },
    {
      "date": 863413200
    },
    {
      "date": 926485200
    }
  ]
}
Is something like this possible?
Not currently possible, was definitely thinking of adding something along these lines.
It'd be cool to be able to add functions written in C. Since builtins written in C have trivial function prototypes (they return a jv and take a fixed number [between 1 and 5] of arguments), it should be quite simple to use dlopen()/dlsym() (or win32's LoadLibrary() equivalents). The hard part for object code plugins is the need for them to use the jv_* functions from libjq, which would require passing plugin functions a pointer to a table of jv_* functions.
Ideologically, is this ok for the jq language? Few builtins in jq have side-effects. Are side-effects required to be backtrackable? No, I think they're not (earlier I thought they were).
I definitely want this to happen.
I think it might be possible to use gcc -Wl,--export-dynamic or similar to allow dlopened libraries to load symbols from the main jq executable - that way, plugins that used jv_foo wouldn't have to have a definition handy and wouldn't need a table of function pointers.
I envisage having an import foo statement in jq code which would search the user's jq path for either libjq-foo.so or foo.jq, and use either dlopen or jq_parse_library to get at the contents. Does that sound sane?
Well... it's complicated. If there were two versions of libjq in the process then the plugins for one would get the jv_* from the wrong version of libjq and all hell breaks loose quickly. If there were a portable way to get a dl handle for the calling libjq then it could pass that to the plugins, but alas, that's not portable. SQLite3 handles this about as portably as can be done, roughly like this: each loadable extension is handed a pointer to a struct of API function pointers, and the extension calls back into the host only through that table.
I highly recommend this approach.
But yes, modulo DLL hell prevention measures it's sane.
Urrrrrrrgh. That really seems like exactly the thing the dynamic linker is supposed to do.
I accept that dynamic linkers are generally broken and that we may have to do as you say to avoid DLL hell. I'm just pining for a sane linker.
I may be missing something fundamental here. In what situation are there two incompatible versions of libjq in the same process where we're not _already_ screwed?
ikr.
But the Unix RTLDs weren't that smart initially. Some of them are quite good now (Solaris' in particular), but the improvements haven't spread universally. Options like RTLD_GROUP and RTLD_FIRST and so on, really need to become universal. Also, I wish the GNU linker crowd would adopt Solaris' direct binding (-B direct)...
I may be missing something fundamental here. In what situation are there two incompatible versions of libjq in the same process where we're not already screwed?
Hopefully never. But in real life this happens all the time in apps that use, e.g., OpenSSL and which also use, e.g., libpam. Or if you have multiple nss (name service switch) modules that use different versions of OpenSSL, libldap, libsasl2, libgss, ..., when nscd is not running.
I can definitely see libjq being used in all sorts of networking libraries.
I've wanted to write an open source generic plugin system, one that could use advanced RTLDs or fallback on the SQLite3 scheme without the developer of the plugin interface (or plugin) having to know about the details.
I am also envious of solaris' seemingly working dynamic linker. I comfort myself in the secure knowledge that no dynamic linker actually works properly and I'm just not familiar enough with solaris' to know the manner in which it breaks :)
I am scared by the thought of someone using libjq in a low-level network library. I'm reasonably happy with the jq language, but less so with the API. API/ABI breaks will likely be frequent over the next while.
The generic plugin system would be nice. It makes me sad that it would involve so much work.
Oh, I forgot to mention that the SQLite3 struct thing includes an ABI version number first, so it's easy to fail safe.
Regarding ABI breaks and apps that use libjq: that's what the shared object versioning is for.
Re: source backwards-incompatible changes: those are easy to discover (the compiler errors out).
It'll all work out.
(I've used the Solaris RTLD extensively. It's really quite good. There's some really good docs on it and the link-editor, and then there's some great blog entries by the Solaris engineering linker aliens, as we call them.)
I'm looking for the equivalent of an old-style ETL tool in the JSON world. All the regular ETL tools (Pentaho, Talend, Orange, Knime, etc.) are painful to use with JSON, and overkill for what I need. I just need to do some format translations on JSON values, like constructing date strings from separate fields, or converting "monetary shorthand" into numbers, or breaking up fully-qualified stock tickers into exchange and ticker. Simple stuff. Except that I don't want to extend jq or write a C library or do much heavy coding to accomplish it -- I don't consider myself a developer but I _do_ use scripting languages to prepare and analyze data.
Jq has the potential to become _the_ ETL tool for JSON, if it can get this feature right. In the best of all possible worlds, I would be able to write transformations in my favorite scripting language, point jq to my "library" of transformations, and then just cobble together jq command lines like this:
jq --transforms myscript.js 'def chgvalue(f): _change_it(f); map(chgvalue(.[0].field_that_must_be_changed))'
where the _change_it() function was found in the myscript.js file. Transformation functions should be able to take in a keypair list and return a keypair list that may have additional keypairs, in whatever data container paradigm the scripting language supports. Transforms written in C/C++ are faster, of course, and yes, you'd have to include a script-running engine (but since it's JSON I figure you can probably do JavaScript already?), but I really think this elevates jq immensely. As JSON becomes more and more ubiquitous, transformation tools for non-developers are going to become important.
What's "ETL"?
Extract-Transform-Load -- it's a class of software found commonly in enterprises. The best example is probably Informatica's PowerCenter platform, but there are open source alternatives (Knime, Orange) and freemium alternatives (Pentaho, Talend, Rapid-Miner). Also, scripting languages like R and Python get used heavily for ETL, but the value of these bigger platforms is that they provide a lot of enterprise-specific features that a simple script approach lacks, like high availability, failover, auditing, compliance verification, data provenance, and managed workflow.
The basic gist behind ETL is that you have data in source A and you want to get it into sink B, but A and B have different formats and/or different expectations of what shape the data should be, so you need to extract it from A, transform the data, and load it into B. ETL, as a method, is required when the systems that produce A and consume B cannot be changed, for whatever reason -- it's for when you have the "square peg and round hole" problem and you need to solve it by changing the peg, not the hole. Simple data reformatting is the low, low end of the ETL spectrum of features... the real serious stuff addresses problems when you have 1:n or n:m data reshaping issues, or pivoting or classification. Yeah, you can do all this in R or Python or C++... but an ETL platform is going to make your life a lot easier.
ETL becomes important for the JSON universe as soon as you want to send data from a modern web-based data source (that produces JSON) into a legacy application that knows nothing about JSON. Yes, you could update the legacy app to read JSON...but often that is a Hard Problem. It's easier to just transform the JSON data into whatever form the legacy system expects.
Another thing that comes up is that a lot of the more sophisticated transformations can actually be done better on the JSON side of the story, rather than inside the ETL tool -- so I might really want a "TEL" or "ELT" process. Moving the "T" part of the story outside of the data movement and trivial reformatting and reshaping tasks is an ongoing debate in the ETL world. [...actually, that's exactly what I'm up to: I'm using Elasticsearch to do some categorization and similarity testing and I need to get my data back into my legacy system -- it comes out of Elasticsearch as JSON and it needs to go back into my system as a CSV file].
See the handles branch of my github clone of jq. This is coming.
@svnpenn Thanks :)
@teknomath I'm already using jq as an ETL, much as I've used XSLT in that fashion before (only jq makes me much happier than XSLT). You might want to try out the features in https://github.com/nicowilliams/jq/tree/handles . I'm working towards adding a proper library system, including dlopen()ing C extensions -- I think that will help make jq incredibly powerful.
Hi, what's the current status of this issue (enhancement)? I have little to say about C plugins, but since most standard Unix programs (sed, awk, etc.) are filters, and jq operates on filters, they should definitely coexist well.
Meanwhile, for tasks such as the one mentioned at the beginning of this issue, I'm using jq to extract relevant values, passing through relevant filters, and then using jq to assemble back, which is a huge pain:
jq -r '.releases[].date' | parallel -k date -d {} +%s | jq -R '{"date": .}' | jq -s '{"release": .}'
(And this gets even more complicated when other key/value pairs are present in the array and need to be preserved.)
Off and on I end up doing work with big piles of unstructured data that I need to make sense of, and I traditionally have used giant shell pipelines to sort out the needles in the haystacks. Lately I have being doing more and more with jq, primarily because it is much less error prone due to its clean (but occasionally surprising) semantics. This feature request is one of the most frequent reasons I have to "drop out" of jq to process some data.
In the traditional unix-pipeline-awk world, you would use awk's system() call with a string shell command constructed from the record you are processing (e.g. system("date -d @" $2)). Based on my (possibly idiosyncratic) usage of jq, I think this would fit reasonably well into the jq world:
jq ' .date = system("date -d @" + (.epoch | tostring)) '
Note that the json string argument will need to be converted to raw, and in the example we will probably want the output to be converted to a json string. But the result could have been json. I expect the user will want some control over how the input/output conversions are handled.
If date returned more than one result (although I don't think it will) this also seems to work fine.
So system() will work fine for date, which processes a single date at a time, but what if you want to NFKC normalize some strings in a few million json records? In awk, you would solve this using a "coprocess" to which you can both read and write. The syntax is awkward, and derived from the Korn shell's |& operator.
awk '
BEGIN {
    uconv = "uconv -b 1 -f utf-8 -t utf-8 -x \"::nfkc;\"";
    PROCINFO[uconv, "PTY"] = 1;
}
{
    print $2 |& uconv;
    uconv |& getline normalized;
    ...
}
END {
    close(uconv);
}
'
In these situations, deadlock is a possibility and that is why the example asks awk to use a pty for the coprocess. I left out the usual stdbuf incantations that try to force the coprocess to work unbuffered.
The low level reading and writing is quite flexible, but not so clean. If you constrain it to be a filter that takes one input and produces one output, it looks more like a jq filter and seems to fit in nicely, although as a filter it will not produce an interesting json output unless the coprocess does.
I have problems like that of the OP all the time where I need to get data out of one system, transform it, and then somehow join it back in with the original data. With the newish input it is _much_ easier to do this sort of thing with jq. Here is a reasonably clean way to do what the OP wants (admittedly, my tolerance for "reasonable" and "clean" in these matters may not be representative):
#!/bin/bash
# need objects on single line for subsequent paste to work
# paste is effectively joining on implicit key = line_number
cat releases.json \
  | jq -c '.' >r-c.json

# convert contained dates to pipe separated string in jq
# convert to space separated epochs in awk
# paste json objects, corresponding epochs onto single line
# read the epochs after each object with input
# note that the test for type not string also removes nulls
cat r-c.json \
  | jq -r '
      if (.|type)=="object" and (.releases|type)=="array"
      then [ .releases[].date? ]
           | map(if (.|type)!="string" then empty else . end)
           | reduce .[] as $d (""; . + $d + "|")
      else "" end
    ' \
  | gawk -F\| '{
      for (i = 1; i < NF; i++) {
        "date +%s -d " $(i) | getline d;
        printf d " ";
      }
      printf "\n";
    }' \
  | paste -d\ r-c.json - \
  | jq '
      if (.|type)=="object" and (.releases|type)=="array"
      then .releases = ( .releases
        | map(if (.|type)=="object" and (.date|type)=="string"
              then .date = input else . end) )
      else . end
    '
Yes, I resort to these kinds of tricks too. A shell-out should probably be a high priority. I may even work on it this coming weekend, we'll see.
Well, while I do think that a "shell-out" feature would help many users of jq, my point was actually that "input" solved essentially ALL my problems (although not in the most terse or elegant way). And it is true you need to be a relatively sophisticated user to do so.
Yes, input and inputs solved and/or help work around a number of problems. I'm quite happy about how input and inputs turned out.
@nicowilliams I should mention I do not think this feature is required anymore in this specific case. I have not had time to check, but I believe the new JQ date commands can be used to fix my issue here. However others might still be interested in generic JQ system command.
@svnpenn Right, for datetime-related tasks a shell-out is not needed. You'll note that we added the sorts of things we needed that were relatively easy to add :)
A shell-out wouldn't be so hard to code, but first we needed to work out a privilege management model that would work for that and I/O in general.
I see two shell-out forms: CMD | popen and CMD | popen(inputs_for_cmd). The former would read from the command (it would map to popen() with "r"), and the latter would write to the command (it would map to popen() with "w"). EDIT: The first requires relatively little new infrastructure in jq (C-coded generators). The latter requires dealing with file handles as well.
EDIT: Fix typo.
See #1005.
I'm converting base64 encoded binary SHA256 hashes back into hexadecimal representation. |@base64d breaks when the decoded value is not a valid UTF-8 string.
I want to be able to "shell-out" to execute base64 -d | xxd -p -c32 | tr -cd '[:alnum:]\n' on a field. The output of that pipeline is an hexadecimal encoded hash that is a valid json value.
At the moment I have an awk hack to do it.
@mterron have you considered using a proper programming language? i know it
might be daunting, but i might be able to help if you have some sample data.
here are some links:
I have, but jq does 99.9% of what I want, so why bother? I hacked that awk thing in 10 minutes; it'd take me ten times as long to do it in Python or Ruby, and that also adds a huge dependency framework to my pipeline that I'd rather not have.
Example json input (after a lot of manipulation with jq):
{
  "component": "1000hz-bootstrap-validator",
  "version": "0.10.2",
  "hashes": [
    {
      "file": "validator.js",
      "base64": "sha256-eXmycr1Eg/2vbFxVvM6avBWolBl8B6VkTRV5/9B9kto="
    },
    {
      "file": "validator.min.js",
      "base64": "sha256-mbv8R/8RTicMdfYPxNwD4QVAvNPO6Ht+ZDW9EK0gNHM="
    }
  ]
}
and json output:
{
  "component": "1000hz-bootstrap-validator",
  "version": "0.10.2",
  "hashes": [
    {
      "file": "validator.js",
      "base64": "sha256-eXmycr1Eg/2vbFxVvM6avBWolBl8B6VkTRV5/9B9kto=",
      "sha256": "7979b272bd4483fdaf6c5c55bcce9abc15a894197c07a5644d1579ffd07d92da"
    },
    {
      "file": "validator.min.js",
      "base64": "sha256-mbv8R/8RTicMdfYPxNwD4QVAvNPO6Ht+ZDW9EK0gNHM=",
      "sha256": "99bbfc47ff114e270c75f60fc4dc03e10540bcd3cee87b7e6435bd10ad203473"
    }
  ]
}
$ echo sha256-eXmycr1Eg/2vbFxVvM6avBWolBl8B6VkTRV5/9B9kto= | base64 -d
□□base64: invalid input
You need to process the string first; sub() is nice for that.
$ echo "eXmycr1Eg/2vbFxVvM6avBWolBl8B6VkTRV5/9B9kto=" | base64 -d | xxd -p -c32
7979b272bd4483fdaf6c5c55bcce9abc15a894197c07a5644d1579ffd07d92da
Sorry, I see you are moving the goalposts. I have dealt with that before and I have a no-tolerance policy. Good luck.
Meaning? I just provided a real life use case for the shell-out feature. What I want to do can't be done with jq without a shell-out feature.
It is an escape hatch.
I'm not sure what you are trying to say tbh.
This conversation is getting a little unnecessarily heated, so let's please remain polite and civil, shall we?
On topic, though: @mterron The ability to shell out is something we're working on, and there are some branches floating around with the capability. They're a little buggy at the moment, and may not actually have direct support for shelling out yet (I'd have to check), but they contain the necessary groundwork for us to support it.
if you want something constructive, look at https://github.com/stedolan/jq/pull/1005
although I have to admit I am frustrated, as that pull has already been linked repeatedly in this thread.
@cup please stop.
@nicowilliams what is your problem? I linked to a pull that actually accomplishes (from personally testing) what he is asking for.