DVC: Opening Pickle Files Changes Hash

Created on 2 Oct 2020 · 2 comments · Source: iterative/dvc

Hi all, I have been having an issue with the hashes generated for a pickled model that I track with DVC. I originally asked about this in this message on Discord; I will summarize the issue below:

In my project I have a DAG that looks like this:

stages:
  download_data:
    cmd: python src/download_data.py
    deps:
    - src/download_data.py
    outs:
    - data/raw_data.csv
  transform_data:
    cmd: python src/transform_data.py
    deps:
    - data/raw_data.csv
    - src/transform_data.py
    outs:
    - data/data_transformer.pkl
    - data/modeling_data.csv
  train_model:
    cmd: python src/train_model.py
    deps:
    - data/modeling_data.csv
    outs:
    - data/model.pkl
  evaluate_model:
    cmd: python src/evaluate_model.py
    deps:
    - data/model.pkl
    - data/modeling_data.csv
    outs:
    - data/model_plot.png:
        cache: false
    metrics:
    - data/model_metrics.json

Now I make a change to my database, which will change the data that the download_data step downloads.

To have DVC run through the entire pipeline, I run dvc repro -f. But I get a curious result when I look at the dvc.lock file afterwards: in the diff, the hashes for my model don't line up:

[Screenshot: git diff of dvc.lock showing two different hashes for model.pkl]

The new model.pkl file in the train_model step has a different hash than the model.pkl file in the evaluate_model step. However, all the other files have consistent hashes across the steps, e.g. raw_data.csv and modeling_data.csv. What is going on? __Why would the model.pkl file have two different hashes across the two steps?__

I talked with @pared about it on Discord, and he asked whether my evaluate_model step was changing my model.pkl file at all; I told him that it shouldn't be. The way I am saving and loading the pickle file is pretty basic. In train_model I dump the model with:

# Save the trained model.
with open("data/model.pkl", "wb") as f:
    pickle.dump(model, f)

And in evaluate_model I load it with:

# Load the model.
with open("data/model.pkl", "rb") as f:
    model = pickle.load(f)

I don't touch the file anywhere else in those two scripts.

I don't really know what is going on. I was able to confirm the hashes in the screenshot above: I ran dvc repro -f train_model and got a model.pkl hash of 59f89b3e627db1deecc85911690699fa, then ran dvc repro -f evaluate_model and got a model.pkl hash of 2d06aa6a6d3b5e98ae4f0c9de3f0294c (both computed in a separate script using hashlib, similar to how it is done in DVC). This may not even be a DVC issue, it may just be a pickle issue, but I am completely lost as to why it is happening.
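For reference, a minimal sketch of that kind of check, assuming DVC computes a plain MD5 over the file contents for single-file outputs (the file_md5 helper name here is illustrative, not from DVC's API):

import hashlib

def file_md5(path, chunk_size=1024 * 1024):
    # Stream the file in chunks so large models don't need to fit in memory.
    md5 = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            md5.update(chunk)
    return md5.hexdigest()

print(file_md5("data/model.pkl"))

Running this after each stage is enough to confirm whether the bytes on disk actually changed, independently of what dvc.lock reports.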

__DVC version__

(base) pfigl@lxm0224 [.../dvc_demo/my_model] $ dvc version
DVC version: 1.8.1 (conda)
---------------------------------
Platform: Python 3.8.3 on Linux-3.10.0-1127.19.1.el7.x86_64-x86_64-with-glibc2.10
Supports: azure, gdrive, gs, hdfs, http, https, s3, ssh, oss
Cache types: hardlink, symlink
Repo: dvc, git

__DVC config__

(base) pfigl@lxm0224 [.../dvc_demo/my_model] $ cat .dvc/config 
[core]
    remote = localremote
['remote "localremote"']
    url = /home/pfigl/GitHub/dvc_demo/dvc_remote


All 2 comments

I figured out the issue. For reference, here are the train_model.py and evaluate_model.py files that run in the train_model and evaluate_model stages in DVC, respectively.
train_model.py:

import pickle

import pandas as pd
import numpy as np
from scipy import optimize
from sklearn.base import BaseEstimator, RegressorMixin

# Load the modeling data set.
df = pd.read_csv("data/modeling_data.csv")

# Create a little class to do scipy curve fitting.
class CurveFit(BaseEstimator, RegressorMixin):
    # The fitting function.
    curve = lambda x, m, b: m * x + b

    def __init__(self, m=None, b=None):
        self.m = m
        self.b = b

    def fit(self, X, y):
        optimal_params, cov = optimize.curve_fit(CurveFit.curve, X, y)
        self.m = optimal_params[0]
        self.b = optimal_params[1]
        return self

    def predict(self, X):
        return CurveFit.curve(X, self.m, self.b)

# Create and fit the model object.
model = CurveFit()
model.fit(df.x, df.y)

# Save the trained model.
with open("data/model.pkl", "wb") as f:
    pickle.dump(model, f)

evaluate_model.py:

import pickle
import json

import matplotlib.pyplot as plt
import pandas as pd
import numpy as np

from train_model import CurveFit

# Load the modeling data.
df = pd.read_csv("data/modeling_data.csv")

# Load the model.
with open("data/model.pkl", "rb") as f:
    model = pickle.load(f)

# Evaluate the model.
r_squared = model.score(df.x, df.y)

# Create metrics file for dvc to track.
metrics_file = {"R Squared": r_squared}
with open("data/model_metrics.json", "w") as f:
    json.dump(metrics_file, f)

# Create a plot of our model on top of our data
plt.scatter(df.x, df.y)
x_pred = np.linspace(df.x.min(), df.x.max(), num=1000)
plt.plot(x_pred, model.predict(x_pred), color="C1")
plt.ylabel("y")
plt.xlabel("x")
plt.savefig("data/model_plot.png")

The key to the issue is the line from train_model import CurveFit in evaluate_model.py, where I import the CurveFit class defined in train_model.py so that I can properly unpickle my model. The problem is that all of the module-level code in train_model.py, including pickle.dump(model, f), runs just by importing the file, so the evaluate_model stage silently re-fits the model and rewrites data/model.pkl. This can be fixed by putting the legwork of the train_model script inside an if __name__ == "__main__": guard. The corrected version is shown below:

import pickle

import pandas as pd
import numpy as np
from scipy import optimize
from sklearn.base import BaseEstimator, RegressorMixin

# Create a little class to do scipy curve fitting.
class CurveFit(BaseEstimator, RegressorMixin):
    # The fitting function.
    curve = lambda x, m, b: m * x + b

    def __init__(self, m=None, b=None):
        self.m = m
        self.b = b

    def fit(self, X, y):
        optimal_params, cov = optimize.curve_fit(CurveFit.curve, X, y)
        self.m = optimal_params[0]
        self.b = optimal_params[1]
        return self

    def predict(self, X):
        return CurveFit.curve(X, self.m, self.b)

if __name__ == "__main__":
    # Load the modeling data set.
    df = pd.read_csv("data/modeling_data.csv")

    # Create and fit the model object.
    model = CurveFit()
    model.fit(df.x, df.y)

    # Save the trained model.
    with open("data/model.pkl", "wb") as f:
        pickle.dump(model, f)

This way, pickle.dump(model, f) won't be called when I import the CurveFit class. I suppose you can close this issue, as it is not related to DVC.
However, if any smart Python experts out there know __why__ this produces a different pickle file, I would very much like to know. I went through the trouble of using git reset HEAD --hard followed by dvc checkout -f so that I could rerun my dvc repro pipeline many times to troubleshoot this. When the train_model stage runs as a script, it always produces a pickle file with the same hash; that does not seem to be the case when the same code runs because the script was imported as a module. If anyone knows why, I would love to hear it. Thanks!
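A likely explanation, for anyone who finds this later: pickle records the fully qualified name of an object's class, module path included, in the pickle stream. When train_model.py runs as a script, CurveFit.__module__ is "__main__", so the model is pickled as __main__.CurveFit; when evaluate_model.py imports train_model and the module-level dump re-runs, the class's module is "train_model", so it is pickled as train_model.CurveFit. The embedded module name differs between the two streams, and so does the hash. A minimal sketch, assuming a hypothetical file named sketch.py that is not part of this project:

import pickle

class CurveFit:
    pass

# pickle embeds the defining module of the class in the stream.
print(CurveFit.__module__)       # "__main__" when run as a script,
                                 # "sketch" when imported as a module
print(pickle.dumps(CurveFit()))  # the module name appears in the raw bytes

Running python sketch.py prints a stream containing b"__main__", while importing sketch from another script and dumping there produces one containing b"sketch". The two model.pkl hashes in this issue differ in exactly this way: one file was written by the train_model stage itself (module __main__), the other was overwritten during the evaluate_model stage by the import of train_model (module train_model).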

@pfigliozzi Great catch! Glad you've found the cause. :tada:
