Please treat this as a naive question rather than a bug report or feature request.
Is there a "write" equivalent of dvc.api.read()? We want to push an in-memory object directly to the remote store via the Python API, without first saving it to local storage and running a bash command.
Thanks!
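For context, the read path referred to above can be sketched as follows. This is only a minimal sketch, assuming the tracked object was serialized with the standard-library `pickle`. `load_tracked_object` and its `read` parameter are hypothetical names introduced here; `read` exists only so the function can be exercised without a DVC repo. For files written with joblib, `joblib.load(io.BytesIO(data))` would take the place of `pickle.load`.

```python
import io
import pickle


def load_tracked_object(path, repo=None, rev=None, read=None):
    """Load a pickled, DVC-tracked object straight into memory.

    `read` defaults to dvc.api.read; it is injectable (hypothetical
    parameter) so the function can be tried without DVC installed.
    """
    if read is None:
        import dvc.api  # imported lazily; only needed for the real call
        read = dvc.api.read
    # mode="rb" makes the reader return raw bytes instead of decoded text
    data = read(path, repo=repo, rev=rev, mode="rb")
    return pickle.load(io.BytesIO(data))
```

Called as `load_tracked_object("models/model_expt_x.pkl", rev="main")` from inside the project repo, this would fetch the file contents from the cache or remote without an explicit `dvc pull`; the path and revision here are placeholders.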
Hi @BikashShaw !
There is no native interface for that in dvc itself right now :( It looks like this could be an API extension of the straight-to-remote feature planned for the CLI (https://github.com/iterative/dvc/issues/4520), but that would only cover pushing the data and generating the local *.dvc file. You would still need to push the git changes somehow. Could you elaborate on your scenario, please?
Hi @efiop thanks for getting back on this. I am working with @BikashShaw on this, so let me further elaborate on the use case.
The snippet below is the code we use for tracking our ML models with dvc/s3: it writes a model to disk, runs dvc add, git-adds the *.dvc and .gitignore files, commits, pushes the model to s3, and removes the model from disk. It allows us to do all this without ever leaving the jupyter notebook.
While we are making do with this (somewhat hacky) workflow for models, we don't really want to do the same with large data files, because it uses the disk as an intermediary. So we are looking for something more elegant in the spirit of dvc.api.read(), say a dvc.api.write(), which would push a data/model object in RAM to s3, leaving behind the dvc metadata in git and not using the disk as an intermediary.
import git
import joblib
import os
import subprocess

def get_git_root(path):
    git_repo = git.Repo(path, search_parent_directories=True)
    git_root = git_repo.git.rev_parse("--show-toplevel")
    return git_root

def commit_model_with_msg(model_info,
                          path = "path/to/somewhere/in/models/dir/of/project/repo",
                          name = "model_expt_x",
                          commit_msg = "Adding model to vc"
                          ):
    """Start tracking a model using dvc/s3 and git

    Parameters
    ----------
    model_info (dict)
        A `dict` containing keys `'pipeline'`,`'features'`, and `'explainer'`.
    path (str)
        A path under `models/` dir of the project where the model's `.dvc` metadata will live
    name (str)
        A unique identifier for the model
    commit_msg (str)
        The message used for the git commit

    Returns
    -------
    None
    """
    # create directory if needed
    directory = os.path.join(get_git_root(os.getcwd()),path)
    if not os.path.exists(directory):
        os.makedirs(directory)
    fname = os.path.join(get_git_root(os.getcwd()),path,name)+".joblib"
    # dump model to disk
    joblib.dump(model_info, fname)
    # dvc add
    process = subprocess.Popen(["dvc", "add", fname],
                               stdout=subprocess.PIPE,
                               stderr=subprocess.PIPE)
    stdout, stderr = process.communicate()
    print("dvc add... \n", stdout.decode(),stderr.decode())
    # git add: reuse the file list from the "git add ..." hint printed by `dvc add`
    try:
        process = subprocess.Popen(["git", "add"]+stdout.decode().split("git add")[1].split(),
                                   stdout=subprocess.PIPE,
                                   stderr=subprocess.PIPE)
        stdout, stderr = process.communicate()
        print("git add... \n", stdout.decode(),stderr.decode())
    except IndexError:
        # no "git add" hint in the output, so there is nothing new to track
        print("Model name already under vc...no changes to the repo")
        return
    # git commit
    process = subprocess.Popen(["git", "commit", "-m", commit_msg],
                               stdout=subprocess.PIPE,
                               stderr=subprocess.PIPE)
    stdout, stderr = process.communicate()
    print("git commit... \n",stdout.decode(),stderr.decode())
    # dvc push
    process = subprocess.Popen(["dvc", "push"],
                               stdout=subprocess.PIPE,
                               stderr=subprocess.PIPE)
    stdout, stderr = process.communicate()
    print("dvc push... \n",stdout.decode(),stderr.decode())
    # remove the model from disk
    process = subprocess.Popen(["rm", fname],
                               stdout=subprocess.PIPE,
                               stderr=subprocess.PIPE)
    stdout, stderr = process.communicate()
    print("rm... \n",stdout.decode(),stderr.decode())
Thanks for helping us!
@jayant91089 Thanks for the example code! Makes sense! Ok, I definitely see this as the API counterpart of https://github.com/iterative/dvc/issues/4520. Most likely, all the internals needed for your feature request will be implemented in that ticket. Does the current workflow work fine for you for now? If it does, I would advise sticking with it until https://github.com/iterative/dvc/issues/4520 is implemented.
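Until the straight-to-remote work lands, the disk-based workflow above could be tightened a little. The sketch below is one possible variant, not an official DVC API: `build_steps` and `track_and_push` are hypothetical helper names. It assumes `dvc add` writes `<file>.dvc` and updates `.gitignore` in the same directory as the file, so those paths can be passed to `git add` directly instead of being parsed from the command output. It also removes the file in a `finally` block, so a failed step does not leave the model on disk.

```python
import os
import subprocess


def build_steps(fname, commit_msg):
    """Return the dvc/git commands for tracking `fname`, in execution order."""
    directory = os.path.dirname(fname)
    return [
        ["dvc", "add", fname],
        # Assumption: `dvc add` creates <fname>.dvc and touches .gitignore
        # in the same directory as the tracked file.
        ["git", "add", fname + ".dvc", os.path.join(directory, ".gitignore")],
        ["git", "commit", "-m", commit_msg],
        # Push only this target rather than everything in the workspace.
        ["dvc", "push", fname + ".dvc"],
    ]


def track_and_push(fname, commit_msg, run=subprocess.run):
    """Run every step, stopping at the first failure; always clean up `fname`.

    `run` is injectable (hypothetical parameter) so the flow can be
    dry-run without dvc or git being available.
    """
    try:
        for cmd in build_steps(fname, commit_msg):
            # check=True raises CalledProcessError on a non-zero exit code
            run(cmd, check=True, capture_output=True, text=True)
    finally:
        # Remove the on-disk copy whether or not the steps succeeded.
        if os.path.exists(fname):
            os.remove(fname)
```

With `check=True`, a `git commit` that has nothing to commit raises `subprocess.CalledProcessError`, so a caller that wants the "already under vc" behaviour of the original snippet would need to catch that exception.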