Thanks your work on the dvc project. I've found it very useful!
I have a single file make_dataset.py
that compiles data from disparate database sources and usually multiple views within each database. For each database, I bundle data extraction tasks from the various views as methods under a single class. This organizes my code base in a logical manner and allows the primary methods in each class to share common data cleaning operations that are particular to certain databases or views (the common data cleaning operations being housed in a single method at the top of the class).
I can then call make_dataset.py
from _cmd_ specifying options of the datasource I want to build, e.g., python make_dataset.py --build_source1
or python make_dataset.py --build_source2
. I feed these respective commands to dvc run
specifying the relevant dependencies, one of them of course being make_dataset.py. However, the option --build_source1
does not depend on the entire contents of make_dataset.py, only the contents within the specific class or method it is referencing.
Feature request: Is it possible to add an option to dvc enabling a user to specify a class instance (or the like) within a file as a dependency instead of the entire contents of a file?
It seems this could make certain workflows more efficient. For example, if I want to dvc status
to determine if the output of python make_dataset.py --build_source1
needs to be rebuilt, dvc
will tell me a rebuild is required based on content changes in make_dataset.py even if those content changes were specific only to the class associated with python make_dataset.py --build_source2
.
I suppose I could unbundle all the methods and classes into separate files or simply tolerate the lack of specificity within my pipeline, but that seems less than ideal in my case. There, of course, could be complexities I am overlooking in implementing such an option or other solutions that I'm not considering. I appreciate your thoughts either way.
Thanks again!
Hello, @brbarkley ! :wave:
It may be possible to add "line" boundries within a file, for example:
dvc run -d file.py:10-20
(or a similar syntax) to express that you only want to depend on that range (10th to 20th line) of file.py
.
I'm afraid that adding constraints by classes (dvc run -d file.py:MyClass
) would break compatibility with other languages (or at least, favor the use of Python);
DVC tries to be _language-agnostic_, _SCM agnostic (not relying only on Git)_, even _Storage agnostic (use S3, SSH, or whatever protocol you want)_.
I don't want to underestimate your project, but I think the problem could be addressed with a different design on your modules.
For example, having the following file structure:
dataset_builder
├── __init__.py
├── __main__.py
├── first.py
└── second.py
# dataset_builder/__main__.py
import sys
from . import build
if __name__ == "__main__":
source = sys.argv[1]
build(source)
# dataset_builder/__init__.py
def build(source):
if source == "first":
from .first import make_dataset
else:
from .second import make_dataset
make_dataset(source)
# dataset_builder/first.py
def make_dataset(source):
print("Making first dataset")
# dataset_builder/second.py
def make_dataset(source):
print("Making second dataset")
You can run dvc run -d dataset_builder/first.py 'python -m dataset_builder first'
Thanks for your thoughtfulness :stuck_out_tongue: let me know if it helps
@brbarkley thank you for the feature request!
As @mroutis mentioned - DVC is a language-agnostic tool and we should NOT analyse semantic of dependencies directly.
However, we got a few requests in the past for semantic tracking in databases for scenarios like dump a table if the number of rows was changed.
We see an opportunity to generalize this semantic check for dependencies case. So, I opened FR #1577 for the DVC plugins. Once it is implemented we can develop a small ad-hoc script check_method_change.py
(see #1577) and close the current issue.
@mroutis and @efiop what are your thoughts on this solution?
Closing this issue in favor of https://github.com/iterative/dvc/issues/1577
@mroutis please note that the script check_method_change.py
might be still our responsibility (outside of DVC). And it is purely related to this FR, not #1577.
@dmpetrov we can add it as a ticbox in #1577, no need to keep a separate issue for now.
@mroutis @dmpetrov @efiop thanks!
I think #1577 could offer a reasonable solution. I would also find a database check useful in addition to the specifics of my original FR (#1572).
@mroutis I will consider your suggestion re my file structure. Could offer greater flexibility in the near term. However, I would hesitate to build a semantic check based on a range of lines in a file since the lines in which a current method resides might not always be where it resides in the future (as edits are made to the file, etc.).
Most helpful comment
@mroutis @dmpetrov @efiop thanks!
I think #1577 could offer a reasonable solution. I would also find a database check useful in addition to the specifics of my original FR (#1572).
@mroutis I will consider your suggestion re my file structure. Could offer greater flexibility in the near term. However, I would hesitate to build a semantic check based on a range of lines in a file since the lines in which a current method resides might not always be where it resides in the future (as edits are made to the file, etc.).