Dvc: Provide interoperability for JS, Java

Created on 17 Sep 2020  ·  20Comments  ·  Source: iterative/dvc

As we came close to implementing our first official plugin for VS Code we need to find a way to call Python from JS code.

Repeating an excellent summary by @hediet:

Option 1: It builds the data and modifies the files itself (everything in JS)

_Contra:_ Would require to reimplement DVC in JS.

Option 2: It calls into DVC internal Python code
Either the VS Code extension ships the python code + runtime, or it expects python libraries in some global paths.
_Contra:_ This is very fragile

Option 3: It uses the DVC command line (like dvc push or dvc commit -m “demo”)
_Contra_: It is not developer friendly to build command line strings manually.
_Contra_: The DVC process must be started for each data query / action. Can be slow as we know (>1s sometimes).

Option 4: It launches some dvc --jsonRpc and uses json RPC to access DVC functionality (https://www.jsonrpc.org/specification).
_Pro_: Offers a standardized API that can be consumed by many programming languages.
_Note_: This is how language services can be implemented for VS Code in any programming language
_Pro_: DVC can send notifications to VS Code in real time
_Pro_: State can be shared across multiple requests for faster response time
_Contra_: if hundreds of megabytes are sent for a single request or response, this blocks the entire communication channel.

@iterative/engineering WDYT?

@Suor @hediet could you please move your arguments into this ticket?

feature request research

Most helpful comment

I took a look at some of the existing VSCode source control extensions (git, hg, svn) and it looked to me like they all just wrap their respective source control CLI programs. Is it really necessary for us to maintain another API here?

All 20 comments

Sounds like # 4 is the superior choice. It's one con can be avoided, I think:

if hundreds of megabytes are sent for a single request or response, this blocks the entire communication channel.

The plugin can just send path strings to data files/dirs, and the underlying DVC process could open those files directly.

I think, eventually we will need to do all of the options above (except 2?). The plugin might need to understand ".dvc" and "dvc.yaml" files to provide better completion (eg: params completion) or some "hover" actions with plain JS. The same could be said, for example, to show DAGs, file diffs, metrics/params diff.

For hefty tasks, it should just use the DVC command line, but it'll only provide higher-level access. Let's say, if we want to show Repo's file structure and provide few options such as "get_url" and a bunch of other options at the same time, we might need jsonRPC. Or, make multiple small plumbing commands like git (which will be akin to jsonRPC/REST way).

I'm OK with using JSON RPC. We open a high port on localhost, and talk to it. I don't know about windows, but on linux, this is incredibly efficient. And it has the potential for usage in any app that can talk HTTP.

I would also like plain old JSON and REST, which gives us a lot of options in terms of transport.

The comms overhead doesn't seem like a big issue to me. We can cross that bridge when we come to it, with multiple techniques (from least to most complex):

  • Use pagination -- The user can't see the whole thing at every point, and a client can pre-fetch the next page for an instant pagination or "infinite scroll" experience.
  • Use threads -- python threads have the GIL problem, but I suspect a lot of this app will be IO-bound. If DVC isn't thread-safe we can use multiprocessing.Pool
  • Use a sync protocol -- git and dvc are both structured like merkle trees. These are really easy to sync especially when round-trips aren't a big issue (we're using localhost!). Each node in the tree can be placed in a separate localStorage key, and denormalized to optimize lookups. For data which hasn't been committed to git/DVC a special string CURRENT can be used.

We open a high port on localhost, and talk to it.

This is not even required - you could use stdin/stdout as transport channel!

Also, JSON RPC is much simpler than REST. There is no content-encoding, method semantic, query parameter parsing, urls etc. involved. You just send some json object with a method, a body and a request id and get back some json object with a matching request id.

This is not even required - you could use stdin/stdout as transport channel!

Yup, that seems good as well!

Also, regarding helpers for DVC files, a language server can work on other editors as well (like vim with vim-ale). So it might be worth it to separate the two.

I think RPC is def. more appropriate than REST in the case of dvc anyway, as we're talking about operations (e.g. all the commands) and not resources. DVC's resources are all files and directories (pipelines in dvc.yaml files, metrics/plots files and images, even tracked data), so no need to pass them around via HTTP 🙂 — and we have Git and sync commands to share and get all these resources remotely.

Does VSCode provide any API to get what files changed in the repository? Or, maybe we could package "watchman" or something similar in the extension itself? This with dvc daemon might make it quite responsive.

I took a look at some of the existing VSCode source control extensions (git, hg, svn) and it looked to me like they all just wrap their respective source control CLI programs. Is it really necessary for us to maintain another API here?

@pmrowla but this or this is not nice code to maintain - especially if you plan to support other IDEs in different programming languages later.

But of course it depends. If there are only a fews commands and if the output is structured already, using the dvc command line as is might work just fine.

In my opinion, CLI tools that are not specifically meant for programmatic consumption are hard to use for such. They are either great for human use (with all the fancy CLI UI) or great for automation, but usually not both. Also, every non-normative output (like a user-friendly way of reporting progress) of the CLI tool will become normative as eventually some code will depend on it.

Does VSCode provide any API to get what files changed in the repository?

Yes it has.

Maybe try an small MVP just wrapping DVC commands and move onto RPC if it's worth it.

So my take on this.

Accessing files directly

This is ok with me if we talk about reading. Won't add much complexity to implementation since whatever you read from dvc.yaml you will get the same from rpc or whatever and your code needs to understand that.

So advantages:

  • fast
  • no need to implement any commands or handles mostly sending files as is

Things that I don't like here:

  • need to look all over repo to collect .dvc and dvc.yaml files
  • wouldn't want making plugin read another versions of that files from git history (hope this is not needed)

Also this can't be used to perform any dvc thing, like pull/push/checkout.

DVC cli

We already have an interface here, whose compatibilty and stability we maintain. Building command line arguments is no more complex than sending them via JSON. However, parsing output might be tricky, i.e. tables, these ones will need a json option.

On it being slow. For most commands like push/pull/retro this doesn't matter. And we only get 1s+ when we screw up loading all the remote dependencies on startup, I think we have some tests for it now. Beside startup speed will be the same as rpc.

So pluses:

  • reuse existing API

Minuses:

  • need to add --json flag to many output commands
  • some startup cost on every call

JSON RPC

We can try to make it resemble our exiting non-public Repo API to make it more managable (btw REST won't allow that, which makes it inferior). However, our Repo api is not really stable, is not really public and also object oriented. We will need to bridge all these gaps, but still we get a new interface, which we will need to maintain.

State between calls can't be shared unfortunately because dvc files might change under the hood. Unless we lock the repo while server is running, which will make dvc cli not usable in the meantime.

Pluses:

  • no startup cost

Minuses:

  • need to maintain a separate API
  • need to manage server from plugin
  • some new deps

Startup cost might play if we use many info only commands, like show stages or outputs. Simply reading files might bridge it.

Deps if we use sync: wsgiref (stdlib) or werkzeug for webserver, json-rpc.
Deps if we go async: aiohttp, aiohttp-rpc.

I don't see the JSON RPC idea that good. Enforces you to install DVC as a service. Not sure also how it's going to operate with relative paths. Also one should have to write almost the same code than in the parser CLI parser since all the parameters has to be sent via RPC.

Contra: if hundreds of megabytes are sent for a single request or response, this blocks the entire communication channel.

If we were definitely going towards rpc I would consider protobuf to alleviate this. But I don't think thats the case since the messages should be like the cli ones.

Contra: It is not developer friendly to build command line strings manually.

I think that witting a parser is quite easy especially with a proper parser generator like peg.js or antlr4. I would go that way.
If I had to translate an XML to another XML I would use XSLT and not traverse the whole XML in a complex code. The same here. A defined grammar with the parser generator is ending with a tiny file.

Enforces you to install DVC as a service.

I don't understand this argument: The VS Code extension would instantiate and maintain the dvc process. DVC must be installed in a way that the extension can access it anyways.

Also one should have to write almost the same code than in the parser CLI parser

You don't need to parse anything, merely JSON objects need to be converted into your target structure.

Not sure also how it's going to operate with relative paths.

Either setting a cwd per request or just using normalized, absolute paths.
VS Code works with normalized URIs that are basically absolute file paths.

I would consider protobuf to alleviate this

I would strongly discourage the use of protobuf. It brings a lot of complexity into all parties.

I think that witting a parser is quite easy especially with a proper parser generator like peg.js or antlr4.

This sounds like a very fragile approach, especially if the CLI output is optimized for human readabilty. In this case, the output sometimes is not even context free (like indented tables), so good luck parsing that with peg. Also, defining these grammars is a very time consuming task and they are probably not very forward compatible.


I think we can agree that JSON RPC is the most comfortable way suggested so far to access DVC functionality from IDE extensions (like for VS Code, Sublime Text or PyCharm). Just compare these code snippets:

// If properly done, the args and the result is typed
const list = await dvcRpcChannel.request("status", { showUnchanged: settings.showUnchanged });
for (const item of list) { console.log(item.fullPath, item.status); }

vs

const args = ["status"];
if (settings.showUnchanged) { args.push("--showUnchanged"); }
const result = await execAsync("dvc", args);
const modifiedFullPaths = result.matchAll(/modified:\t(.*)/);
const unchangedFullPaths = result.matchAll(/unchanged:\t(.*)/);

I think we can also agree that JSON RPC would also require the most work for DVC to offer its functionality to consumers.

I think we should try to find the sweet spot between these two sides.
This depends on many factors. As I don't know the size of DVCs API surface and only know what works well in VS Code, I cannot really help here.

If you don't go with the json rpc approach, you should at least consider adding a json output flag. If you are going forward with the CLI+JSON option, it would be awesome if you could export or craft a machine-readable documentation of your CLI (including all supported arguments and the JSON output type) so that at least command builders can be generated.

Maybe it would be helpful to have a precise document of what the first iteration of the VS Code DVC extension should offer and then to see which APIs from DVC are needed for that and how they can be used.

This sounds like a very fragile approach,

Could you please elaborate?

```// If properly done, the args and the result is typed
const list = await dvcRpcChannel.request("status", { showUnchanged: settings.showUnchanged });
for (const item of list) { console.log(item.fullPath, item.status); }
vs

const args = ["status"];
if (settings.showUnchanged) { args.push("--showUnchanged"); }
const result = await execAsync("dvc", args);
const modifiedFullPaths = result.matchAll(/modified:\t(.)/);
const unchangedFullPaths = result.matchAll(/unchanged:\t(.
)/);

dvc has --json-output as a parameter to receive that, a json output. There is no need to massage the output.

const args = ["status"];
if (settings.showUnchanged) { args.push("--showUnchanged"); }
const result = await execAsync("dvc", args);
const modifiedFullPaths = result.matchAll(/modified:\t(.)/);
const unchangedFullPaths = result.matchAll(/unchanged:\t(.
)/);
```

Again this is very poor. Please, not to be compared with a parser generator. I would never do it this way.

Also, defining these grammars is a very time consuming task and they are probably not very forward compatible.

I don't think is a time consuming task. Actually in my experience has been pretty fast for my needs.

I don't understand this argument: The VS Code extension would instantiate and maintain the dvc process. DVC must be installed in a way that the extension can access it anyways.

At the end is a service or a daemon right? If not is a kind of magic I still don't know 😸

I would strongly discourage the use of protobuf. It brings a lot of complexity into all parties.

I said just in case size would be an issue. What kind if issues did yoou have with gRPC is a RPC also as you like.

This sounds like a very fragile approach

What about dvc service crashing like many extensions in XCode (eslint comes to my mind) would the user have to restart the IDE? I never use those extensions because of that actually.

dvc has --json-output as a parameter to receive that, a json output.

In this case, if all commands understand --json-output and output their result as json, why would you need parser generators?

Could you please elaborate?

If using --json-output, this discussion is not needed. In my experience, parsing non-json output from CLI tools that are meant to be used by humans using grammars/regexs/... is fragile, as the output is designed to be read by humans, not by computers. Outputs for humans are usually subject to change. Strict grammars that parse the entire output are even more fragile than regular expressions as they break with the sligthest change of the output.

I would never do it this way.

Then the argument that VS Code also just executes CLI commands for git integration is not valid, as they do it this way.

At the end is a service or a daemon right? If not is a kind of magic I still don't know 😸

It would be a long running command line process without any visible window. The extension would start a dvc process on startup and kill it on shutdown. If the DVC process crashes, it would be restarted. No user interaction required.

It would be a long running command line process without any visible window. The extension would start a dvc process on startup and kill it on shutdown. If the DVC process crashes, it would be restarted. No user interaction required.

How do you do RPC? dvc is going to run as a daemon or service listening for TCP connections, right? That sounds to me like a service or daemon and of course without any visible window.

If the DVC process crashes, it would be restarted

How? it's not that trivial, take into account user permissions to restart services depending on the OS. Might be possible or might be not... we should consider that friction also.

In this case, if all commands understand --json-output and output their result as json, why would you need parser generators?

To properly generate the inputs if needed. I give a simple solution VS hand made input generation.

Then the argument that VS Code also just executes CLI commands for git integration is not valid, as they do it this way.

That does not means they are doing it correctly (not incorrectly either). I put the example also of XSLT since many people traverses the XML with code VS a simple XSLT transformation.

In my experience, parsing non-json output from CLI tools that are meant to be used by humans using grammars/regexs/... is fragile, as the output is designed to be read by humans, not by computers.

Totally agree! DVC has --show-json that is also good for other tools integration. The grammar is to just be able to generate the inputs with a very concise and simple code. No hand made loops and decisions. Might be some commands that do not support --show-json

files - whatever you read from dvc.yaml you will get the same from rpc or whatever

I don't see this @Suor. Most DVC commands combine and process info from command line, metafiles, and file system significantly, no? The "files" approach sounds like re-implementing DVC in JS to me.

cli
startup speed will be the same as rpc
some startup cost on every call

^ That's contradicting. But if the argument is that the startup time is acceptable (not enough of a reason to implement RPC) then I agree.

  • In general though, it sounds like a major pro of CLI is to avoid implementing RPC. Is that a valid arg? I actually wonder which one is harder to maintain.

sounds like a very fragile approach... the output sometimes is not even context free... at least consider adding a json output flag

+1 @hediet. I also think that a loosely coupled middleware to wrap the CLI (has to be written in JS no?) sounds possibly fragile.

DVC has --show-json

Only 5 commands have that option @DavidGOrtega

craft a machine-readable documentation of your CLI

If going the CLI+JSON way gets us there (something like http://docopt.org/), then I like that option more 🙂

RPC
State between calls can't be shared because dvc files might change under the hood

Good point. But is it really a problem? I'm guessing the same applies to any plugin like Git: you run a command on the IDE and another one via terminal and they don't know about each other, just use the latest working dir state. This is actually a feature in DVC: parallel repros.

is a service or a daemon right? What about dvc service crashing like many extensions

It would be included in the regular installation, and the plugin would manage it in the background. Assuming that works well, then no extra work (or awareness even) is required from users.


Summarizing, I think that if RPC is the "right" way to go (which isn't clear yet), and helps us refactor core classes (e.g. the Repo "API") so that different interfaces can adapt to them (CLI, RPC, etc.), then it's worth trying.

find the sweet spot between these two

Yup, I also vote for wrapping the CLI for a basic MVP (as mentioned before) + a few RPC endpoints for more complex operations — letting us fully commit later depending on that experience.

Another matter to consider for either route is how will we keep the plugin version in sync with the core DVC version. DVC evolves rapidly and while in theory no minor changes should break things + efforts are made to keep backward compatibility, in reality it my happen that a DVC update breaks the plugin, an RPC endpoint, or the CLI middleware.

Integration tests, I guess? Which alternative makes it easier to manage this aspect?

Thanks everyone for the discussion! It looks like just regular --show-json CLI might be a good way to start. Just a few comments/questions:

need to maintain a separate API

@Suor if it's written with some introspection you don't need a separate API- it should be more or less auto-generated layer on top of the same API that we use for CLI

need to manage server from plugin

this doesn't sound like a big deal, but what I'm worried about here is that it might complicate some edge cases like delivering updates (e.g. DVC package itself). But also have to agree that it is indeed another moving part in all this integration.

wouldn't want making plugin read another versions of that files from git history (hope this is not needed)

I guess, we will need this. Also, extracting experiments (even the current list) is dependent on running internal git. So, I would not rely on parsing files- we need either RPC/REST or --show-json.


Questions:

  1. How will VS Code detect changes in the repository (file changed, someone runs an experiment, someone changed one of the dvc.yaml files)? Since we have multiple places for different webviews (files, DAGs, experiments) it might become a bottleneck if the only option is to run a few CLI instances every few seconds?

  2. @hediet have you seen/had experience with codespaces already? How online version of an extension is different? E.g. will there be some benefits of using one approach vs another?

  3. For jsonRPC- what happens if multiple users run it or one user runs multiple projects? Will we need a special mechanism to run DVC services on some random port (I guess we'll use regular RPC over TCP?) Can it by chance conflict with some other service running.

  4. Overall- are there some example implemented in one or another? Can we check existing extensions?

if it's written with some introspection you don't need a separate API- it should be more or less auto-generated layer on top of the same API that we use for CLI

@shcheklein CLI returns text, no way it may be wrapped to return json. Not sure whether it is even returning text rather than printing it from random nested code. So we will need to add some intermediate level abstraction and then wrap it both with CLI and JSON. The CLI wrapping will be manual, JSON part might be more automatic.

Also, some of the functions might be completely different.

@jorgeorpinel Beside startup (logical pause) speed will be the same as rpc. I.e. we still need to collect stages do the work.

On reading files. DVC might be performing some heavy things, but plugin might require basic info like a list of stages or outputs, reading files for that seems ok. The alternative is asking for the same info in the same format from dvc, so no more understanding needed than necessary and no reimplementing dvc.

Was this page helpful?
0 / 5 - 0 ratings