Dvc: How to push in-memory object directly to remote store?

Created on 9 Dec 2020 · 3 Comments · Source: iterative/dvc

Please consider this as a naive question rather than any bug or improvement report.

Is there a "write" equivalent of dvc.api.read()? We want to push an in-memory object directly to the remote store via the Python API, without first saving it to local storage and running a shell command.

Thanks!

awaiting response question

All 3 comments

Hi @BikashShaw !

There is no native interface for that right now in dvc itself :( It seems like this could be an API extension of the straight-to-remote feature for the CLI (https://github.com/iterative/dvc/issues/4520), but that would only solve pushing the data and generating the local *.dvc file. You would still need to push the git changes somehow. Could you elaborate on your scenario, please?

Hi @efiop, thanks for getting back to us on this. I am working with @BikashShaw on this, so let me elaborate on the use case.
The snippet below is the code we use to track our ML models with dvc/s3: it writes a model to disk, runs dvc add on it, git-adds and commits the *.dvc and .gitignore files, pushes the model to s3, and removes the model from disk. It lets us do all this without ever leaving the Jupyter notebook.

While we are making do with this (somewhat hacky) workflow for models, we don't really want to do the same with large data files, since it uses the disk as an intermediary. So we are looking for something more elegant, analogous to dvc.api.read(): say, a dvc.api.write() that would push a data/model object in RAM to s3, leaving behind dvc metadata in git, without using the disk as an intermediary.

import git
import joblib
import os
import subprocess

def get_git_root(path):
    git_repo = git.Repo(path, search_parent_directories=True)
    git_root = git_repo.git.rev_parse("--show-toplevel")
    return git_root

def commit_model_with_msg(model_info,
                       path = "path/to/somewhere/in/models/dir/of/project/repo",
                       name = "model_expt_x",
                       commit_msg = "Adding model to vc"
                      ):
    """Start tracking a model using dvc/s3 and git 

    Parameters
    ----------
    model_info (dict)
        A `dict` containing keys `'pipeline'`,`'features'`, and `'explainer'`.
    path (str)
        A path under `models/` dir of the project where the model's `.dvc` metadata will live
    name (str)
        A unique identifier for the model
    commit_msg (str)
        The message for the git commit that records the `.dvc` metadata

    Returns
    -------
    None

    """
    # create directory if needed
    directory = os.path.join(get_git_root(os.getcwd()), path)
    os.makedirs(directory, exist_ok=True)
    fname = os.path.join(directory, name) + ".joblib"

    # dump model to disk
    joblib.dump(model_info, fname)

    # dvc add
    process = subprocess.Popen(["dvc", "add", fname],
                     stdout=subprocess.PIPE, 
                     stderr=subprocess.PIPE)
    stdout, stderr = process.communicate()
    print("dvc add... \n", stdout.decode(),stderr.decode())

    # git add: reuse the file names from the "git add ..." hint in the `dvc add` output
    try:
        process = subprocess.Popen(["git", "add"]+stdout.decode().split("git add")[1].split(),
                         stdout=subprocess.PIPE, 
                         stderr=subprocess.PIPE)
        stdout, stderr = process.communicate()
        print("git add... \n", stdout.decode(),stderr.decode())

    except IndexError:
        print("Model name already under vc...no changes to the repo")
        return 

    # git commit
    process = subprocess.Popen(["git", "commit", "-m", commit_msg],
                     stdout=subprocess.PIPE, 
                     stderr=subprocess.PIPE)
    stdout, stderr = process.communicate()
    print("git commit... \n",stdout.decode(),stderr.decode())

    # dvc push 
    process = subprocess.Popen(["dvc", "push"],
                     stdout=subprocess.PIPE, 
                     stderr=subprocess.PIPE)
    stdout, stderr = process.communicate()
    print("dvc push... \n",stdout.decode(),stderr.decode())

    # remove the local copy of the model (os.remove avoids shelling out to `rm`)
    os.remove(fname)
    print("rm... \n", fname)

Thanks for helping us!
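
As an aside on the snippet above: each step repeats the same Popen/communicate/print pattern and never checks exit codes, so a failed `dvc push` would still be followed by deleting the local file. Below is a minimal sketch of one way to fold that pattern into a helper, assuming Python 3.7+. `run_step` is a hypothetical name for illustration, not part of dvc or GitPython, and the demo uses the Python interpreter as a stand-in command so that it runs without dvc installed.

```python
import subprocess
import sys


def run_step(label, cmd):
    """Run one command, print its output, and raise if it exits non-zero."""
    result = subprocess.run(cmd, capture_output=True, text=True)
    print(f"{label}...\n{result.stdout}{result.stderr}")
    # Raises subprocess.CalledProcessError when the command failed,
    # so later steps (such as deleting the local file) are not reached.
    result.check_returncode()
    return result.stdout


# Demo: the Python interpreter stands in for `dvc` / `git` here.
out = run_step("echo", [sys.executable, "-c", "print('ok')"])
```

With a helper like this, the body of `commit_model_with_msg` could call `run_step("dvc add", ["dvc", "add", fname])`, `run_step("dvc push", ["dvc", "push"])`, and so on, and any failing step would stop the sequence early.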

@jayant91089 Thanks for the example code! Makes sense! OK, I definitely see this as https://github.com/iterative/dvc/issues/4520 but for the API. Most likely, all the internals needed for your feature request will be implemented in that ticket. Does the current workflow work fine for you for now? If it does, I would advise sticking with it until https://github.com/iterative/dvc/issues/4520 is implemented.
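
For readers wondering what the in-memory half of a hypothetical dvc.api.write() would involve, here is a minimal sketch, not dvc's implementation. It pickles an object into a `BytesIO` buffer and computes the MD5 digest and size of the resulting bytes, the kind of values dvc records for a tracked file in its `.dvc` metadata. `serialize_in_memory` is a made-up name, and `pickle` is used instead of joblib only to keep the example standard-library-only. Nothing here uploads data or writes a `.dvc` file.

```python
import hashlib
import io
import pickle


def serialize_in_memory(obj):
    """Pickle obj into a BytesIO buffer; return (buffer, md5_hex, size)."""
    buffer = io.BytesIO()
    pickle.dump(obj, buffer)
    data = buffer.getvalue()
    digest = hashlib.md5(data).hexdigest()
    buffer.seek(0)  # rewind so a consumer can read from the start
    return buffer, digest, len(data)


# A stand-in for the `model_info` dict from the snippet above.
model_info = {"pipeline": None, "features": ["a", "b"], "explainer": None}
buffer, digest, size = serialize_in_memory(model_info)

# Round trip: the object can be restored without touching the disk.
restored = pickle.load(buffer)
```

The buffer is file-like, so an upload client that accepts file objects could stream from it directly. Uploading that way on its own would bypass dvc's tracking, though, which is why the git-side metadata problem mentioned earlier in the thread still applies.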
