Streamlit: A cache hit is almost as slow as a cache miss

Created on 17 Nov 2020  ·  3 Comments  ·  Source: streamlit/streamlit

Summary

For some GeoJSON data I'm loading, a cache hit is about as slow as a cache miss. A print statement inside the cached function confirms that the cache does hit. I also wrote a custom hash function based on the file's modification time to ensure the data isn't changing, and I'm loading the file from disk. The URL it came from is shown below.

Steps to reproduce

What are the steps we should take to reproduce the bug:

  1. curl https://raw.githubusercontent.com/plotly/datasets/master/geojson-counties-fips.json --output counties.json
  2. (optional) virtualenv --python 3.8 venv && source venv/bin/activate && pip install streamlit
  3. Put this script into app.py
import json
import os
import time

import streamlit as st

COUNTIES_GEOJSON_URL = "https://raw.githubusercontent.com/plotly/datasets/master/geojson-counties-fips.json"


class FileReference:
    def __init__(self, filename):
        self.filename = filename

def hash_file_reference(file_reference):
    filename = file_reference.filename
    print(os.path.getmtime(filename))
    return (filename, os.path.getmtime(filename))


@st.cache(hash_funcs={FileReference: hash_file_reference})
def load_data(file_reference):
    print('cache miss -> load_data()')
    with open(file_reference.filename, 'r') as fp:
        data = json.load(fp)
    return data


def main():
    st.title("Title")

    st.write('Loading Data')
    t0 = time.time()
    counties_geojson = load_data(FileReference('counties.json'))
    t_elapsed = time.time() - t0
    st.write(f"Elapsed time: {t_elapsed}")

    value = st.selectbox("Select a value", ['a', 'b', 'c'])
    st.write(value)


main()

Expected behavior:

On my machine, the reported elapsed time on first call (cache miss) is 15-20 seconds.

I expect the second call (cache hit) to report a significantly reduced elapsed time, surely under 5 seconds.

Actual behavior:

The reported elapsed time is 10-20 seconds.

Is this a regression?

Looks like no. I installed 0.69 and saw the same behavior.

Debug info

  • Streamlit version: 0.71.0
  • Python version: 3.8.5
  • Used general environment as well as the documented virtualenv
  • OS version: macOS Catalina 10.15.6.
  • Browser version: Version 86.0.4240.198 (Official Build) (x86_64)

Additional information


All 3 comments

Hi @aagnone3 👋

The short answer is to change your code to the following.

import json
import os
import time

import streamlit as st

@st.cache(allow_output_mutation=True)
def load_data(file_name):
    print('cache miss -> load_data()')
    with open(file_name, 'r') as fp:
        data = json.load(fp)
    return data


def main():
    st.write('Loading Data')
    t0 = time.time()
    counties_geojson = load_data('counties.json')
    t_elapsed = time.time() - t0
    st.write(f"Elapsed time: {t_elapsed}")

    value = st.selectbox("Select a value", ['a', 'b', 'c'])
    st.write(value)


main()

The kind-of-but-not-really explanation is:

  • allow_output_mutation=True disables the hashing of the function's output. By default, Streamlit hashes the cached return value on every cache hit to check whether it has been mutated, and for a large nested JSON object that hashing is slow.
  • Since you're allowing output mutation, you either want to guarantee that you don't actually mutate the output or, if you need to modify it, (deep) copy the output before handling it.
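
The copy-before-mutating advice can be sketched as follows. This is a minimal illustration that does not import Streamlit: the module-level `_CACHED` dict and the zero-argument `load_data()` are hypothetical stand-ins for the object held in the cache and for the cached function, and `tag_features` is an invented helper name.

```python
import copy

# Hypothetical stand-in for the object Streamlit holds in its cache.
# With allow_output_mutation=True, a cache hit hands back this same
# object by reference, so in-place edits would persist across reruns.
_CACHED = {
    "type": "FeatureCollection",
    "features": [{"id": "01001", "properties": {}}],
}


def load_data():
    """Stand-in for the cached load_data(); returns the shared object."""
    return _CACHED


def tag_features(geojson, label):
    """Return a labelled deep copy, leaving the shared object untouched."""
    working = copy.deepcopy(geojson)
    for feature in working["features"]:
        feature["properties"]["label"] = label
    return working


tagged = tag_features(load_data(), "demo")
print(tagged["features"][0]["properties"])       # the copy carries the label
print(load_data()["features"][0]["properties"])  # the shared object does not
```

Note that copy.deepcopy walks the whole structure too, so on a file this size it is only worth doing at the points where the data really has to be modified.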

Alternatively, pandas.read_json() might be an option for you.

Hashing a DataFrame is faster than hashing a JSON blob, so you won't need allow_output_mutation in that case.

import pandas as pd
import streamlit as st

@st.cache
def load_data(file_name):
    print('cache miss -> load_data()')
    return pd.read_json(file_name)

Good deal! This solved my issue.
