Hi all, I have been having an issue with the hashes generated for a pickled model that I track with DVC. I originally asked about this issue in this message on Discord. I will summarize the issue below:
In my project I have a DAG that looks like this:
```yaml
stages:
  download_data:
    cmd: python src/download_data.py
    deps:
    - src/download_data.py
    outs:
    - data/raw_data.csv
  transform_data:
    cmd: python src/transform_data.py
    deps:
    - data/raw_data.csv
    - src/transform_data.py
    outs:
    - data/data_transformer.pkl
    - data/modeling_data.csv
  train_model:
    cmd: python src/train_model.py
    deps:
    - data/modeling_data.csv
    outs:
    - data/model.pkl
  evaluate_model:
    cmd: python src/evaluate_model.py
    deps:
    - data/model.pkl
    - data/modeling_data.csv
    outs:
    - data/model_plot.png:
        cache: false
    metrics:
    - data/model_metrics.json
```
And I make a change to my database (which will change the data that is downloaded) in the download_data step.
So, to have DVC run through the entire pipeline, I run dvc repro -f. But then I get a curious result when I look at the dvc.lock file after running.
When I look at the diff of the dvc.lock file, the hashes for my model don't line up:

[screenshot: dvc.lock diff showing two different hashes for data/model.pkl]

The new model.pkl file in the train_model step has a different hash than the model.pkl file in the evaluate_model step. However, all the other files have consistent hashes between the different steps, e.g. raw_data.csv and modeling_data.csv. What is going on? __Why would the model.pkl file have two different hashes across the two steps?__
I talked with @pared about it on Discord, and he asked whether my evaluate_model step was changing my model.pkl file at all; I told him that it shouldn't be. The way I am saving and loading the pickle file is pretty basic. In train_model I dump the model with:
```python
# Save the trained model.
with open("data/model.pkl", "wb") as f:
    pickle.dump(model, f)
```
And in evaluate_model I load it with:
```python
# Load the model.
with open("data/model.pkl", "rb") as f:
    model = pickle.load(f)
```
I don't touch the file in any other lines in those two steps.
I don't really know what is going on. I was able to confirm the hashes in the screenshot above: I ran dvc repro -f train_model and got a hash of 59f89b3e627db1deecc85911690699fa for model.pkl, and then ran dvc repro -f evaluate_model and got a hash of 2d06aa6a6d3b5e98ae4f0c9de3f0294c (this was done in a separate script using hashlib, similar to how it is done in DVC). This may not even be a DVC issue and may just be a pickle issue, but I am completely lost as to why it is happening.
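For anyone who wants to reproduce that check, a minimal sketch of the hashlib script (MD5, which is what DVC records in dvc.lock; the path is assumed) could look like this:

```python
# Minimal sketch of the standalone hash check: DVC stores MD5 checksums in
# dvc.lock, so an MD5 of model.pkl should match the value recorded there.
import hashlib

md5 = hashlib.md5()
with open("data/model.pkl", "rb") as f:
    for chunk in iter(lambda: f.read(1 << 20), b""):
        md5.update(chunk)
print(md5.hexdigest())
```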
__DVC version__
```
(base) pfigl@lxm0224 [.../dvc_demo/my_model] $ dvc version
DVC version: 1.8.1 (conda)
---------------------------------
Platform: Python 3.8.3 on Linux-3.10.0-1127.19.1.el7.x86_64-x86_64-with-glibc2.10
Supports: azure, gdrive, gs, hdfs, http, https, s3, ssh, oss
Cache types: hardlink, symlink
Repo: dvc, git
```
__DVC config__
```
(base) pfigl@lxm0224 [.../dvc_demo/my_model] $ cat .dvc/config
[core]
    remote = localremote
['remote "localremote"']
    url = /home/pfigl/GitHub/dvc_demo/dvc_remote
```
I figured out the issue. For reference, here are my train_model.py and evaluate_model.py files, which run for the train_model and evaluate_model stages in DVC respectively.
train_model.py:
```python
import pickle
import pandas as pd
import numpy as np
from scipy import optimize
from sklearn.base import BaseEstimator, RegressorMixin

# Load the modeling data set.
df = pd.read_csv("data/modeling_data.csv")

# Create a little class to do scipy curve fitting.
class CurveFit(BaseEstimator, RegressorMixin):
    # The fitting function.
    curve = lambda x, m, b: m * x + b

    def __init__(self, m=None, b=None):
        self.m = m
        self.b = b

    def fit(self, X, y):
        optimal_params, cov = optimize.curve_fit(CurveFit.curve, X, y)
        self.m = optimal_params[0]
        self.b = optimal_params[1]
        return self

    def predict(self, X):
        return CurveFit.curve(X, self.m, self.b)

# Create and fit the model object.
model = CurveFit()
model.fit(df.x, df.y)

# Save the trained model.
with open("data/model.pkl", "wb") as f:
    pickle.dump(model, f)
```
evaluate_model.py:
```python
import pickle
import json
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
from train_model import CurveFit

# Load the modeling data.
df = pd.read_csv("data/modeling_data.csv")

# Load the model.
with open("data/model.pkl", "rb") as f:
    model = pickle.load(f)

# Evaluate the model.
r_squared = model.score(df.x, df.y)

# Create metrics file for dvc to track.
metrics_file = {"R Squared": r_squared}
with open("data/model_metrics.json", "w") as f:
    json.dump(metrics_file, f)

# Create a plot of our model on top of our data.
plt.scatter(df.x, df.y)
x_pred = np.linspace(df.x.min(), df.x.max(), num=1000)
plt.plot(x_pred, model.predict(x_pred), color="C1")
plt.ylabel("y")
plt.xlabel("x")
plt.savefig("data/model_plot.png")
```
The key to the issue is the 6th line of evaluate_model.py, from train_model import CurveFit, where I import the CurveFit class that I wrote in train_model.py so that the model can be unpickled properly. The problem is that pickle.dump(model, f) in train_model.py will run just by importing the train_model module. This can be fixed by putting the legwork of the train_model script inside an if __name__ == "__main__": block. The corrected version is shown below:
```python
import pickle
import pandas as pd
import numpy as np
from scipy import optimize
from sklearn.base import BaseEstimator, RegressorMixin

# Create a little class to do scipy curve fitting.
class CurveFit(BaseEstimator, RegressorMixin):
    # The fitting function.
    curve = lambda x, m, b: m * x + b

    def __init__(self, m=None, b=None):
        self.m = m
        self.b = b

    def fit(self, X, y):
        optimal_params, cov = optimize.curve_fit(CurveFit.curve, X, y)
        self.m = optimal_params[0]
        self.b = optimal_params[1]
        return self

    def predict(self, X):
        return CurveFit.curve(X, self.m, self.b)

if __name__ == "__main__":
    # Load the modeling data set.
    df = pd.read_csv("data/modeling_data.csv")

    # Create and fit the model object.
    model = CurveFit()
    model.fit(df.x, df.y)

    # Save the trained model.
    with open("data/model.pkl", "wb") as f:
        pickle.dump(model, f)
```
This way, the pickle.dump(model, f) won't be called when I import the CurveFit class. I suppose you can close this issue as it is not related to DVC.
However, if any smart Python experts out there know __why__ this produces a different pickle file, I would very much like to know. I went through the trouble of using git reset HEAD --hard followed by dvc checkout -f so that I could rerun my dvc repro pipeline many times to troubleshoot this. When I run the train_model stage of the pipeline directly, it always produces a pickle file with the same hash; however, that does not seem to be the case when the script runs as a side effect of being imported as a module. If anyone knows why, I would love to hear it. Thanks!
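One way to investigate (just a sketch; the file names below are placeholders for copies of model.pkl saved after each stage) would be to disassemble both pickles with pickletools and diff the opcode listings, which should show exactly where the bytes diverge, e.g. whether the class is recorded as __main__.CurveFit in one file and as train_model.CurveFit in the other:

```python
# Sketch: dump the pickle opcodes for the two versions of model.pkl and
# compare them. The paths are hypothetical copies saved after running
# `dvc repro -f train_model` and after evaluate_model re-imported
# train_model and re-wrote the file.
import pickletools

for path in ("model_after_train_stage.pkl", "model_after_evaluate_stage.pkl"):
    print(f"--- {path} ---")
    with open(path, "rb") as f:
        pickletools.dis(f.read())
```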
@pfigliozzi Great catch! Glad you've found the cause. :tada: