Streamlit: A cache hit is almost as slow as a cache miss

Created on 17 Nov 2020 · 3 comments · Source: streamlit/streamlit

Summary

For some GeoJSON data I'm loading, a cache hit is almost as slow as a cache miss. I can tell that the cache does hit, based on a print statement in the cached function. I also added a custom hash based on the file's modification time to make sure the data isn't changing, and I'm loading the file from disk. The URL it came from is shown below.

Steps to reproduce

What are the steps we should take to reproduce the bug:

1. curl https://raw.githubusercontent.com/plotly/datasets/master/geojson-counties-fips.json --output counties.json
2. (optional) virtualenv --python 3.8 venv && source venv/bin/activate && pip install streamlit
3. Put this script into app.py

import json
import os
import time

import streamlit as st

COUNTIES_GEOJSON_URL = "https://raw.githubusercontent.com/plotly/datasets/master/geojson-counties-fips.json"


class FileReference:
    def __init__(self, filename):
        self.filename = filename

def hash_file_reference(file_reference):
    filename = file_reference.filename
    print(os.path.getmtime(filename))
    return (filename, os.path.getmtime(filename))


@st.cache(hash_funcs={FileReference: hash_file_reference})
def load_data(file_reference):
    print('cache miss -> load_data()')
    with open(file_reference.filename, 'r') as fp:
        data = json.load(fp)
    return data


def main():
    st.title("Title")

    st.write('Loading Data')
    t0 = time.time()
    counties_geojson = load_data(FileReference('counties.json'))
    t_elapsed = time.time() - t0
    st.write(f"Elapsed time: {t_elapsed}")

    value = st.selectbox("Select a value", ['a', 'b', 'c'])
    st.write(value)


main()

Expected behavior:

On my machine, the reported elapsed time on first call (cache miss) is 15-20 seconds.

I expect the second call (cache hit) to report a significantly reduced elapsed time, surely under 5 seconds.

Actual behavior:

The reported elapsed time is 10-20 seconds.

Is this a regression?

Looks like no. I installed 0.69 and saw the same behavior.

Debug info

Streamlit version: 0.71.0
Python version: 3.8.5
Used general environment as well as the documented virtualenv
OS version: macOS Catalina 10.15.6.
Browser version: Version 86.0.4240.198 (Official Build) (x86_64)

Additional information

question

aagnone3

All 3 comments

Hi @aagnone3 👋

The short answer is to change your code to the following:

import json
import time

import streamlit as st

@st.cache(allow_output_mutation=True)
def load_data(file_name):
    print('cache miss -> load_data()')
    with open(file_name, 'r') as fp:
        data = json.load(fp)
    return data


def main():
    st.write('Loading Data')
    t0 = time.time()
    counties_geojson = load_data('counties.json')
    t_elapsed = time.time() - t0
    st.write(f"Elapsed time: {t_elapsed}")

    value = st.selectbox("Select a value", ['a', 'b', 'c'])
    st.write(value)


main()

The kind-of-but-not-really explanation is:

allow_output_mutation=True disables the hashing of the function output, which in this case is slow. The output is still cached; by default Streamlit also hashes the cached return value to check whether it was mutated, and that is the step being skipped.
Since you're allowing output mutation, you either want to guarantee that you don't actually mutate the output, or, if you need to modify it, (deep) copy the output before handling it.
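
To illustrate the deep-copy advice, here is a minimal sketch. It uses functools.lru_cache purely as a stand-in for a cache that hands back the stored object by reference; it is not Streamlit's cache, and the RAW string is made-up sample data.

```python
import copy
import functools
import json


# Stand-in for a cache that returns the stored object by reference
# (an assumption for illustration; this is not Streamlit's cache).
@functools.lru_cache(maxsize=None)
def load_data(raw_json):
    return json.loads(raw_json)


# Made-up sample data shaped loosely like a GeoJSON document.
RAW = '{"type": "FeatureCollection", "features": [{"id": "01001"}]}'

cached = load_data(RAW)

# Safe: modify a deep copy, so the cached object stays untouched.
working = copy.deepcopy(cached)
working["features"].append({"id": "extra"})
unchanged_after_copy = len(load_data(RAW)["features"])

# Unsafe: mutating the returned object changes what later callers receive.
load_data(RAW)["features"].append({"id": "oops"})
changed_after_mutation = len(load_data(RAW)["features"])
```

The deep copy costs time and memory on a large blob, so it only makes sense if you actually need to modify the data.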

jrhone on 17 Nov 2020

Alternatively, pandas.read_json() might be an option for you.

Hashing a DataFrame is faster than hashing a JSON blob, so you won't need allow_output_mutation in that case.

import pandas as pd
import streamlit as st

@st.cache
def load_data(file_name):
    print('cache miss -> load_data()')
    return pd.read_json(file_name)
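
To make the "hashing the output is the slow part" point concrete, here is a toy model of a cache that re-hashes the stored value on every hit to detect mutation. ToyCache, make_blob, and the pickle-plus-SHA-256 digest are all hypothetical choices for illustration; this is not how Streamlit implements its hashing.

```python
import hashlib
import pickle


class ToyCache:
    """Toy cache that can re-hash the stored output on every hit."""

    def __init__(self, check_output):
        self.check_output = check_output
        self.store = {}        # key -> (value, digest of value or None)
        self.bytes_hashed = 0  # total bytes fed to the hash so far

    def _digest(self, value):
        blob = pickle.dumps(value)
        self.bytes_hashed += len(blob)
        return hashlib.sha256(blob).hexdigest()

    def get(self, key, compute):
        if key in self.store:
            value, digest = self.store[key]
            # The mutation check walks the whole output again on each hit.
            if self.check_output and self._digest(value) != digest:
                raise RuntimeError("cached value was mutated")
            return value
        value = compute()
        digest = self._digest(value) if self.check_output else None
        self.store[key] = (value, digest)
        return value


def make_blob():
    # Made-up nested data standing in for a large JSON document.
    return {"features": [{"id": i, "coords": [i, i + 1]} for i in range(1000)]}


checked = ToyCache(check_output=True)
unchecked = ToyCache(check_output=False)

checked.get("counties", make_blob)   # miss: compute and hash once
cost_after_miss = checked.bytes_hashed
checked.get("counties", make_blob)   # hit: hashes the whole value again
cost_after_hit = checked.bytes_hashed

unchecked.get("counties", make_blob)
unchecked.get("counties", make_blob)  # hit: no hashing at all
```

In this model, a hit with the check enabled does as much hashing work as the miss did, while the unchecked cache does none, which matches the symptom in the original report.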

jrhone on 17 Nov 2020

Good deal! This solved my issue.

aagnone3 on 17 Nov 2020
