For some GeoJSON data I'm pulling, a cache hit is as slow as a cache miss. I can tell from a print statement that the cache does hit. I also wrote a custom hash function based on the file's modification time to ensure the data isn't changing, and I'm loading the file from disk. The URL it came from is shown below.
Steps to reproduce the bug:
curl https://raw.githubusercontent.com/plotly/datasets/master/geojson-counties-fips.json --output counties.json
virtualenv --python 3.8 venv && source venv/bin/activate && pip install streamlit
app.py
import json
import os
import time
import streamlit as st
COUNTIES_GEOJSON_URL = "https://raw.githubusercontent.com/plotly/datasets/master/geojson-counties-fips.json"

class FileReference:
    def __init__(self, filename):
        self.filename = filename

def hash_file_reference(file_reference):
    filename = file_reference.filename
    print(os.path.getmtime(filename))
    return (filename, os.path.getmtime(filename))

@st.cache(hash_funcs={FileReference: hash_file_reference})
def load_data(file_reference):
    print('cache miss -> load_data()')
    with open(file_reference.filename, 'r') as fp:
        data = json.load(fp)
    return data

def main():
    st.title("Title")
    st.write('Loading Data')
    t0 = time.time()
    counties_geojson = load_data(FileReference('counties.json'))
    t_elapsed = time.time() - t0
    st.write(f"Elapsed time: {t_elapsed}")
    value = st.selectbox("Select a value", ['a', 'b', 'c'])
    st.write(value)

main()
On my machine, the reported elapsed time on first call (cache miss) is 15-20 seconds.
I expect the second call (cache hit) to report a significantly reduced elapsed time, under 5 seconds for sure.
Instead, the reported elapsed time on the second call is 10-20 seconds.
This doesn't look like a regression. I installed 0.69 and saw the same behavior.
Hi @aagnone3 👋
The short answer is to change your code to the following:
import json
import os
import time
import streamlit as st

@st.cache(allow_output_mutation=True)
def load_data(file_name):
    print('cache miss -> load_data()')
    with open(file_name, 'r') as fp:
        data = json.load(fp)
    return data

def main():
    st.write('Loading Data')
    t0 = time.time()
    counties_geojson = load_data('counties.json')
    t_elapsed = time.time() - t0
    st.write(f"Elapsed time: {t_elapsed}")
    value = st.selectbox("Select a value", ['a', 'b', 'c'])
    st.write(value)

main()
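Note that the snippet above no longer keys the cache on the file's modification time, as the original hash_file_reference did. One way to keep that behavior is to pass the modification time as an extra argument so that it becomes part of the cache key. The sketch below shows the idea in plain Python, with functools.lru_cache standing in for st.cache (an assumption made so the example runs without Streamlit). The helper name _load_json is hypothetical.

```python
import functools
import json
import os


@functools.lru_cache(maxsize=4)
def _load_json(file_name, mtime):
    # mtime is not used in the body. It only takes part in the cache key,
    # so a changed file yields a new key and forces a fresh load.
    with open(file_name, "r") as fp:
        return json.load(fp)


def load_data(file_name):
    # Look up the current modification time on every call (cheap),
    # and let the cache decide whether the file must be parsed again.
    return _load_json(file_name, os.path.getmtime(file_name))
```

With st.cache the same pattern should carry over by adding an mtime parameter to load_data, since function arguments are part of the cache key.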
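The kind-of-but-not-really explanation is this: by default, st.cache hashes the function's return value on every call, including cache hits, to check that it has not been mutated. allow_output_mutation=True disables the hashing of the function output, which in this case is slow.
Alternatively, pandas.read_json() might be an option for you.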
The hashing of a dataframe is faster than the hashing of a json blob so you won't need allow_output_mutation in that case.
import pandas as pd
import streamlit as st

@st.cache
def load_data(file_name):
    print('cache miss -> load_data()')
    return pd.read_json(file_name)
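To make the explanation above more concrete: a mutation check on a large nested dict has to visit every key and value, so its cost grows with the size of the data. The sketch below is not Streamlit's actual hasher. It is a rough stand-in, with hypothetical helpers hash_nested and digest, run on synthetic data shaped loosely like a GeoJSON FeatureCollection (not the real counties file).

```python
import hashlib
import time


def hash_nested(obj, hasher):
    """Feed every key and value of a JSON-like object into hasher."""
    if isinstance(obj, dict):
        hasher.update(b"{")
        for key, value in obj.items():
            hash_nested(key, hasher)
            hash_nested(value, hasher)
        hasher.update(b"}")
    elif isinstance(obj, list):
        hasher.update(b"[")
        for item in obj:
            hash_nested(item, hasher)
        hasher.update(b"]")
    else:
        # Leaf value (str, int, float, bool, None).
        hasher.update(repr(obj).encode("utf-8"))
        hasher.update(b";")


def digest(obj):
    hasher = hashlib.sha256()
    hash_nested(obj, hasher)
    return hasher.hexdigest()


# Synthetic data: 2000 features with 50 coordinate pairs each.
features = [
    {
        "type": "Feature",
        "id": str(i),
        "geometry": {
            "type": "Polygon",
            "coordinates": [[[float(i), float(j)] for j in range(50)]],
        },
    }
    for i in range(2000)
]
data = {"type": "FeatureCollection", "features": features}

t0 = time.perf_counter()
before = digest(data)
elapsed = time.perf_counter() - t0
print(f"hashing took {elapsed:.3f} s")

# A single changed leaf is only detectable by walking the whole structure.
data["features"][0]["id"] = "changed"
print("mutation detected:", digest(data) != before)
```

The timing depends on the machine, but the point is that the walk touches every coordinate on each check, which is the work that allow_output_mutation=True skips.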
Good deal! This solved my issue.