Dvc: Plugins for semantic changes tracking in dependencies

Created on 4 Feb 2019 · 17 comments · Source: iterative/dvc

Problem

DVC reproduces a command if its dependencies have changed. Today we support many general types of dependencies:

  1. Files in major cloud storages like S3, GCS, SSH, and others: dvc run -d azure://path/to/myblob train.py ....
  2. Local data and code files as dependencies: dvc run -d train.py -d images/ train.py ...

However, there are a number of more specialized dependencies which cannot be validated by DVC.

Problem examples:

  1. Tables in a database. Usually, a custom query is needed to check whether a table or its objects have changed.
  2. A semantic check in a local data or code file. For example #1572: check if a method mycode() was changed in class MyClass in a python file train.py.

Possible solution

A custom plugin (code) might be executed to check whether a dependency changed. A plugin could be any command that returns 0 if repro is not needed.

Solution examples:

  1. Run a script check_db.sh to validate whether a table changed and, if it did, execute the DB dump script. Command example: dvc run -d db_dump.sh -p check_db.sh -o clients.csv db_dump.sh clients.csv. Note the new plugin option -p.
  2. dvc run -d train.py -p "python check_method_change.py MyClass.mycode change_timestamp" -d change_timestamp -o clients.csv train.py, where check_method_change.py checks for code changes and returns 0 if the method changed.

UPDATE: Please note that the script check_method_change.py might still be our responsibility, and we should implement it (probably outside of DVC core).
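As a sketch of what such a check_method_change.py could do, one option is to hash only the method's source segment using the standard ast module, ignoring everything else in the file. The function name and hashing scheme below are illustrative assumptions, not part of DVC:

```python
import ast
import hashlib


def method_source_hash(source: str, class_name: str, method_name: str) -> str:
    """Hash only the source segment of class_name.method_name.

    Changes anywhere else in the file (other methods, module-level code,
    reordering) do not affect the result.
    """
    tree = ast.parse(source)
    for node in ast.walk(tree):
        if isinstance(node, ast.ClassDef) and node.name == class_name:
            for item in node.body:
                if isinstance(item, ast.FunctionDef) and item.name == method_name:
                    segment = ast.get_source_segment(source, item)
                    return hashlib.md5(segment.encode()).hexdigest()
    raise LookupError(f"{class_name}.{method_name} not found")
```

A wrapper script would then compare this hash against one stored in a state file (e.g. the change_timestamp file from the example above) and exit with the appropriate code.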

enhancement p3-nice-to-have


All 17 comments

I like it, @dmpetrov, especially for working with databases in a flexible way!

Maybe, instead of using the exit code, we can track the output (for example psql -c "select count(1) from mytable") and re-run the command if the output changed (e.g. count incremented 999 -> 1000). Note that psql -c could fail for different reasons (e.g. connectivity issues), and returning an exit code denoting a failure would reproduce the stage, possibly causing unwanted effects.

There are several dvc commands that verify whether a dependency changed (repro, status, checkout, etc.); if the check command takes time to run, it will slow down dvc in general.

I would prefer to sit on it and think about other solutions for supporting databases.

I may be short-sighted, but I'm not seeing any advantages to maintaining a feature like that besides the integration with databases :see_no_evil:


Another possible name could be dynamic dependencies.
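The output-tracking idea above could be prototyped today with a small wrapper and a state file: run the probe command, compare its output with the last recorded output, and only reproduce on a change. A minimal sketch; the state-file convention and function name are made up for illustration, not DVC API:

```python
import hashlib
from pathlib import Path


def output_changed(output: str, state_file: Path) -> bool:
    """Return True (repro needed) iff `output` differs from the last recorded run.

    Records the new output's digest on change, so the next call with the
    same output returns False.
    """
    digest = hashlib.md5(output.encode()).hexdigest()
    previous = state_file.read_text().strip() if state_file.exists() else None
    if digest == previous:
        return False
    state_file.write_text(digest + "\n")
    return True
```

Here `output` would be whatever the probe prints, e.g. the stdout of psql -c "select count(1) from mytable". If the probe itself exits non-zero, the wrapper can surface that as an error instead of treating it as "changed", which addresses the connectivity-failure concern.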

@mroutis totally agree! The solution can benefit a lot from this flexibility - if we save the outputs in dvc files then it will save users from having to write additional status files.

On the one hand, integration with databases is a super important scenario. On the other hand, @mroutis brought up a great point about repro, status, checkout. I can imagine that each of these commands would require a new option --no-semantic-dependency-checks. We should think carefully before introducing this feature.

I am currently evaluating DVC for use in our ML workflow. Databases play a role, as we have images as input for which metadata needs to be stored. DVC works great for experimentation when adding a dataset directly (thanks a lot!), but in the end I want to store data independently of DVC and without duplication.

I first thought of adding a S3 or GCP directory as external dependency (https://dvc.org/doc/user-guide/external-dependencies), but it seemed to not be geared towards supporting directories (which are expensive to find changes in). At least all my attempts failed and the documentation only shows it for files.

I am new to DVC, but could the database problem be worked around by having a "stage" in which the count query is saved to a local file tracked by DVC, and somehow forcefully executing it even though the script has not changed? So, like the --force option but only for a certain type of dependency.
As this script would be cheap to execute, it would not make much difference if nothing has changed upstream in the database. Does this make sense?

Hi @fmannhardt !

> I first thought of adding a S3 or GCP directory as external dependency (https://dvc.org/doc/user-guide/external-dependencies), but it seemed to not be geared towards supporting directories (which are expensive to find changes in). At least all my attempts failed and the documentation only shows it for files.

Directories on both S3 and GCP _can_ be supported as external dependencies/outputs, we just didn't get to implementing needed calls for those two types of remotes. For example, we already support
ssh directories. https://github.com/iterative/dvc/issues/1654

> I am new to DVC, but could the database problem be worked around by having a "stage" in which the count query is saved to a local file which is tracked by DVC and somehow forcefully execute this even though the script has not changed? So like the --force option but only for certain type of dependency.
> As this script would be cheap to execute this would not make a lot of difference if nothing has changed upstream in the database. Does this make any sense?

Sorry, I don't quite understand your scenario and your proposed solution. Could you please elaborate?

> Directories on both S3 and GCP _can_ be supported as external dependencies/outputs, we just didn't get to implementing needed calls for those two types of remotes. For example, we already support ssh directories. #1654

Cool. Would be great to see this.

> Sorry, I don't quite understand your scenario and your proposed solution. Could you please elaborate?

The scenario is to have the images (or the image URIs) in a database, to be queried and used for different training sets. From my understanding, when I have a script querying the DB as a stage in a pipeline, DVC would keep track of changes to the SQL query and re-execute the stage when I change the query. But it would not re-execute when additional data (images) was added to the DB through some other (non-tracked) channel. How should it know without executing the query again?

What I thought of as a workaround is similar to what is proposed here: have a query providing some cheap metadata that can be tuned to the desired level of robustness, e.g. the total count of rows in an append-only DB would be enough. But differently from what I read here, this query would be executed in a standard DVC stage that writes the result to a file tracked by DVC as an output. If this output changed (detected with the standard MD5 mechanism), everything downstream would need to be re-run. Otherwise, everything is assumed to be up to date.

Of course, this should only be done upon request from the user, to keep results reproducible for previous executions of the pipeline. I saw the --force parameter, but this would re-run everything, and the --single-item parameter, but this would not run the remainder of the pipeline. Assuming count_query.dvc is a cheap query to identify updates and experiment.dvc the expensive training,
maybe a workaround would be to have dvc repro --force --single-item count_query.dvc followed by dvc repro experiment.dvc?

What I was proposing is to somehow automate this by marking count_query.dvc as a cheap operation which is always re-run when dvc repro is run with some kind of force-update-cheap-operations flag. I hope this makes it clear? As I said, I am new to DVC, so maybe there are some mistakes in my line of thought.

@fmannhardt Thanks for the explanation! :slightly_smiling_face:

> Maybe a workaround would be to have dvc repro --force --single-item count_query.dvc followed by dvc repro experiment.dvc?

Yes, I think so.

> What I was proposing is to somehow automate this by marking count_query.dvc as cheap operation which is always re-run when dvc repro is run with some kind of force-update-cheap-operations flag. I hope this makes it clear? As I said, I am new to DVC so maybe there are some mistakes in my line of thoughts.

We have so-called "callback" stages, which don't have dependencies and run every time you run `dvc repro` (e.g. `dvc run -o foo 'echo foo > foo'`). Maybe that would be suitable for your scenario? Those don't have any special option to turn them on and off, but if their execution is cheap, maybe it would be ok to run them every time?

> We have so-called "callback" stages, which don't have dependencies and run every time you run `dvc repro` (e.g. `dvc run -o foo 'echo foo > foo'`). Maybe that would be suitable for your scenario? Those don't have any special option to turn them on and off, but if their execution is cheap, maybe it would be ok to run them every time?

I think this feature would do the trick. Thanks!

A user asked about this use case today on Discord. Specifically, about DVC understanding Python imports inside commands fed to dvc run: if a.py imports b.py (both being project source code, not libraries) and a.py is tracked by a stage file, but then only b.py changes, dvc repro would not recognize that it needs to rebuild the cache.

So besides implementing the plugins or middleware that Dmitry mentioned, what about out-of-the-box support for certain programming languages like Python, C++, etc.? In the case above, DVC would autodetect that a.py is a Python file, examine its import statements, and automatically register the imported files (found in the workspace) as dependencies in the stage file.

@jorgeorpinel yes, it is a bit different use case - Python file dependencies are not the same as the dependencies on Python functions from the initial message.

The file-dependency use case should be easier to implement, I guess. Package systems already do this kind of dependency tracking, and I hope their ideas (or code) can be reused in DVC.

Yes, it's a bit different but related. I can open a separate issue if you prefer.

I'm not talking about packages or libraries though - in that case you could kind of hack it now by having requirements.txt as a dependency, for example (in Python). I'm talking about inter-dependencies between source code files in the project, i.e. when your stage is spread across several source code files, but only one is executable and marked with dvc run -d. A solution is to just mark all the other files as dependencies, but there could potentially be many of these files, inside recursive directory structures (e.g. when developing an ML library).

Also note I'm not just talking about Python code but multiple languages. I guess Python would be a first obvious platform to include such a feature for, since our core code is also Python.

Hi everyone. I think I have an idea about how to implement this for Python (and many other languages, actually):

We can manually compile the Python entry point like this:

python -m compileall script_to_run.py

and add script_to_run.pyc as a dependency of the subsequent stages. The Python interpreter doesn't recompile .py files whose sources haven't changed, which is exactly what we need in this case.

This also works with C/C++: we just need to use the compiled artifact as a dependency.

In the case of databases, I think we could take advantage of information_schema.tables; AFAIK there should be information about the last update time there. This brings us back to timestamps instead of hashing, but at least that's something.

So all the DVC plugin should do is automatically compile the entry point and redirect code dependencies to that binary. We could add some kind of flag, like --auto-dependencies, which would switch this behavior on.
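The compileall idea above is easy to verify: compileall rewrites the .pyc only when the source size/mtime recorded in the .pyc header no longer matches the .py file. A self-contained check (illustrative only, using a throwaway script in a temp directory):

```python
import compileall
import importlib.util
import os
import tempfile
import time


def pyc_stable_without_source_change() -> bool:
    """Compile a throwaway script twice; True if the second run left the .pyc alone."""
    with tempfile.TemporaryDirectory() as tmp:
        src = os.path.join(tmp, "script_to_run.py")
        with open(src, "w") as f:
            f.write("print('hello')\n")
        assert compileall.compile_file(src, quiet=1)
        pyc = importlib.util.cache_from_source(src)  # __pycache__/script_to_run.*.pyc
        first = os.stat(pyc).st_mtime_ns
        time.sleep(0.05)
        # Source unchanged, so compileall should skip recompilation entirely.
        assert compileall.compile_file(src, quiet=1)
        return os.stat(pyc).st_mtime_ns == first
```

One caveat for hash-based tools like DVC: by default the .pyc header embeds the source's mtime, so merely touching the .py file regenerates the .pyc; hash-based pycs (PEP 552) avoid that.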

@anotherbugmaster sounds like a good option to automatically detect all the changes in the dependencies recursively and it will probably avoid rerunning stuff if I changed only a comment or whitespace in the script?

It's not a solution for:

> A semantic check in a local data or code file. For example #1572: check if a method mycode() was changed in class MyClass in a python file train.py.

as far as I can tell.

Yeah, seems like I misunderstood the issue here.

The approach would be useful anyway, if only we had a way to split up a source file into symbols, which in turn would be hashed.

> The approach would be useful anyway, if only we had a way to split up a source file into symbols, which in turn would be hashed.

@anotherbugmaster for sure. It can be a part of a solution.
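Splitting a source file into hashable symbols is mostly an ast exercise in Python (3.8+ for ast.get_source_segment). A sketch of one way to do it; everything here is illustrative, not DVC API:

```python
import ast
import hashlib


def symbol_hashes(source: str) -> dict:
    """Map each top-level function/class name to an md5 of its source segment."""
    hashes = {}
    for node in ast.parse(source).body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            segment = ast.get_source_segment(source, node)
            hashes[node.name] = hashlib.md5(segment.encode()).hexdigest()
    return hashes
```

Comparing two such maps tells you exactly which symbols changed, so a stage depending only on MyClass.mycode would not re-run when an unrelated helper changes.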

Over in #2378, we are discussing a similar issue, and I have hacked up support for database status checking by monkey-patching a custom remote (with associated output and dependency support) into DVC: https://github.com/iterative/dvc/issues/2378#issuecomment-597254218

As I've thought more on this issue, I've become increasingly persuaded that external dependencies with custom remote schemas are one of the more elegant ways to deal with this family of issues, in particular because they do not require adding any new syntax or concepts to DVC stage files - they just need the ability to dispatch URLs with a custom scheme to an appropriate class, function, or command.
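The dispatch described above can be as simple as a scheme-to-handler table keyed on the URL scheme. A hypothetical sketch; the `mydb://` scheme, the registry, and the handler signature are made up for illustration:

```python
from urllib.parse import urlparse

# Hypothetical registry: scheme -> callable that returns a "checksum" for the URL.
HANDLERS = {}


def register(scheme):
    def wrap(fn):
        HANDLERS[scheme] = fn
        return fn
    return wrap


@register("mydb")
def mydb_checksum(url):
    # A real handler would query the database named in the URL;
    # here we just echo the parsed location as a stand-in.
    parsed = urlparse(url)
    return f"{parsed.netloc}{parsed.path}"


def checksum_for(url: str) -> str:
    """Dispatch a dependency URL to the handler registered for its scheme."""
    scheme = urlparse(url).scheme
    try:
        handler = HANDLERS[scheme]
    except KeyError:
        raise ValueError(f"no handler registered for scheme {scheme!r}")
    return handler(url)
```

The appeal is that a stage file only ever stores a URL and a checksum, exactly as it does for S3 or SSH dependencies today.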

@dmpetrov Is there any update on how to use DVC to track a database e.g. a mongodb collection?

@jtlz2 No updates for now :slightly_frowning_face:
