Spacy: spacy model on dataflow

Created on 28 Jan 2019 · 4 Comments · Source: explosion/spaCy

I am trying to use spacy==2.0.18 on Google Dataflow, but I can't figure out how to get the models downloaded on all the workers.
It would be so awesome if I could just run pip install spacy[en], but that does not exist.

How to reproduce the problem

dockerfile:

FROM python:2.7.12

ADD requirements.txt .
RUN pip install -r requirements.txt

RUN python -m spacy download en_core_web_sm

ADD setup.py .
ADD Makefile .
ADD main.py .
ADD src src

CMD ["make", "run_locally"]

requirements:

google-cloud-dataflow==2.5.0
requests==2.19.1
h5py==2.8.0
spacy==2.0.18
textacy==0.6.2
python-json-logger==0.1.9
dill==0.2.7.1
numpy==1.16.0

makefile:

run_locally:
    @echo 'download spacy model'
    python -m spacy download en_core_web_sm
    @echo 'downloaded spacy model'
    --project '${GOOGLE_CLOUD_PROJECT}' \
    --region europe-west1 \
    --runner DataflowRunner \
    --input '${INPUT_BUCKET}/sentence_input/export_*.json' \
    --output '${OUTPUT_BUCKET}/sentences/export_*.json' \
    --lower_date_limit '${LOWER_DATE_BOUNDARY}' \
    --upper_date_limit '${UPPER_DATE_BOUNDARY}' \
    --disk_size_gb 50 \
    --staging_location '${INPUT_BUCKET}/dataflow_staging' \
    --temp_location '${INPUT_BUCKET}/temp' \
    --job_name sentence-to-nouns-and-phrases \
    --setup_file ./setup.py \
    --save_main_session \
    --max_num_workers '${MAX_DATAFLOW_WORKERS}' \
    --machine_type 'n1-highmem-4' \
    --worker_machine_type 'n1-highmem-4' \
    --num_workers 5

get_model snippet

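import spacy


def get_model():
    # Cache the loaded model on the function object so it is loaded once per process.
    try:
        get_model.english_nlp
    except AttributeError:
        get_model.english_nlp = spacy.load("en_core_web_sm")
    return get_model.english_nlp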

output:

Can't find model 'en_core_web_sm'. It doesn't seem to be a shortcut link, a Python package or a valid path to a data directory.

Or, if I use this snippet to download the model inside the worker at runtime:

def get_model():
    try:
        get_model.english_nlp
    except AttributeError:
        # spacy_download is presumably spacy.cli.download imported under an alias.
        logging.warning("downloading spacy model")
        spacy_download('en_core_web_sm')
        logging.warning("downloaded spacy model")

        get_model.english_nlp = spacy.load("en_core_web_sm")
    return get_model.english_nlp

I get:

IOError: [E053] Could not read meta.json from /usr/local/lib/python2.7/dist-packages/spacy/data/en_core_web_sm/meta.json
Labels: docs, help wanted, models
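
Both get_model variants above cache the loaded model on the function object. A minimal sketch of the same load-once idea as a reusable helper is below. The name make_lazy_loader is hypothetical, and the helper only controls when loading happens. It does not make a missing model available on the worker.

```python
def make_lazy_loader(load_fn):
    """Return a zero-argument function that calls load_fn once and caches the result."""
    cache = {}

    def get():
        # Only the first call pays the loading cost; later calls reuse the cached object.
        if "model" not in cache:
            cache["model"] = load_fn()
        return cache["model"]

    return get


# Hypothetical usage; needs spaCy and an installed model on the worker:
# import spacy
# get_model = make_lazy_loader(lambda: spacy.load("en_core_web_sm"))
```

Passing the loader in as an argument keeps the caching logic testable without spaCy installed.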

All 4 comments

This seems to be solved by using a setup.py file that specifies which files to include from the raw model package:

cat setup.py
...

setup(
    name="frankenstein_conquers_the_world",
    version="0.0.1",
    description="nothing here",
    install_requires=REQUIRED_PACKAGES,
    packages=find_packages(),
    package_data={
        'src.en_core_web_sm': ['tokenizer', 'meta.json', 'accuracy.json'],
        'src.en_core_web_sm.ner': ['cfg','lower_model','moves','tok2vec_model','upper_model'],
        'src.en_core_web_sm.parser': ['cfg','lower_model','moves','tok2vec_model','upper_model'],
        'src.en_core_web_sm.tagger': ['cfg','model','tag_map'],
        'src.en_core_web_sm.vocab': ['key2row','lexemes.bin','strings.json','vectors']
        },
)
...

This is coupled with manually downloading en_core_web_sm-2.0.0, unpacking the first inner layer, and storing it under your package; in this case, src is the package.
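
Rather than listing every model file by hand, the package_data mapping could probably be generated by walking the unpacked model directory. This is a rough sketch, not something tested on Dataflow. The function name build_package_data is hypothetical, and it assumes the model was unpacked under src/en_core_web_sm as described above.

```python
import os


def build_package_data(package_root, package_name):
    """Map each dotted sub-package name to the non-Python files it contains."""
    package_data = {}
    for dirpath, dirnames, filenames in os.walk(package_root):
        # Skip bytecode caches and keep the traversal order stable.
        dirnames[:] = sorted(d for d in dirnames if d != "__pycache__")
        rel = os.path.relpath(dirpath, package_root)
        if rel == ".":
            dotted = package_name
        else:
            dotted = package_name + "." + rel.replace(os.sep, ".")
        data_files = sorted(
            f for f in filenames if not f.endswith((".py", ".pyc"))
        )
        if data_files:
            package_data[dotted] = data_files
    return package_data


# Hypothetical usage inside setup.py:
# package_data = build_package_data("src/en_core_web_sm", "src.en_core_web_sm")
```

The result has the same shape as the hand-written dictionary in the setup.py above, so it can be passed straight to setup(package_data=...).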

Even though I found a solution, I think your docs should do a better job of explaining how to bundle models with Python packages. This is especially needed for Dataflow and PyPI.

Update:
This only works if every level of the model directory has an __init__.py file, so I created those and it worked.
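
Creating the __init__.py files can be scripted. A minimal sketch follows, assuming the model directory sits somewhere under your package; the function name ensure_init_files is made up for illustration.

```python
import os


def ensure_init_files(model_root):
    """Create an empty __init__.py in model_root and in every directory below it.

    Returns the list of files that were newly created.
    """
    created = []
    for dirpath, dirnames, filenames in os.walk(model_root):
        # Do not descend into bytecode cache directories.
        dirnames[:] = [d for d in dirnames if d != "__pycache__"]
        if "__init__.py" not in filenames:
            init_path = os.path.join(dirpath, "__init__.py")
            open(init_path, "a").close()
            created.append(init_path)
    return created


# Hypothetical usage, run once before building the package:
# ensure_init_files("src/en_core_web_sm")
```

With these files in place, find_packages() should pick up every level of the model directory as a sub-package, which is what the package_data keys in setup.py refer to.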

Thanks for updating with your solution!

This does seem quite hacky and I wonder if there's maybe a better solution 🤔 I'm no expert on Dataflow unfortunately, but I'll label this help wanted, so maybe someone else has an idea? We'd definitely like to include a recommendation for this in the docs.

"It would be so awesome if I could just have pip install spacy[en] but that does not exist."

I agree, but I don't think we can make this work unless we host our own PyPI server. In order to download a model, we need to resolve the compatibility – and the compatibility table should live outside spaCy, to allow shipping new models without having to update the core library.

You should be able to do pip install pointing to the URLs of the model on the https://github.com/explosion/spacy-models releases page. You could add the models into a requirements.txt, so that you can just do pip install -r requirements.txt
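
As a rough illustration of the URL approach, the sketch below builds the line you would put in requirements.txt. The URL layout is an assumption based on how archives on the spacy-models releases page are named, so check it against the actual release you need. The helper name model_requirement is hypothetical.

```python
# Assumed base URL for release downloads of the explosion/spacy-models repository.
RELEASES_URL = "https://github.com/explosion/spacy-models/releases/download"


def model_requirement(name, version):
    """Return a requirements.txt line that points pip at a model release archive."""
    release = "{}-{}".format(name, version)
    return "{}/{}/{}.tar.gz".format(RELEASES_URL, release, release)


print(model_requirement("en_core_web_sm", "2.0.0"))
```

A model installed this way is a regular Python package, so spacy.load("en_core_web_sm") should be able to find it by package name without a separate download step.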
