I am trying to use spacy==2.0.18 on Google Dataflow, but I can't figure out how to get the models downloaded on all the workers.
It would be so awesome if I could just run pip install spacy[en], but that does not exist.
Dockerfile:
FROM python:2.7.12
ADD requirements.txt .
RUN pip install -r requirements.txt
RUN python -m spacy download en_core_web_sm
ADD setup.py .
ADD Makefile .
ADD main.py .
ADD src src
CMD ["make", "run_locally"]
requirements:
google-cloud-dataflow==2.5.0
requests==2.19.1
h5py==2.8.0
spacy==2.0.18
textacy==0.6.2
python-json-logger==0.1.9
dill==0.2.7.1
numpy==1.16.0
Makefile:
run_locally:
	@echo 'download spacy model'
	python -m spacy download en_core_web_sm
	@echo 'downloaded spacy model'
--project '${GOOGLE_CLOUD_PROJECT}' \
--region europe-west1 \
--runner DataflowRunner \
--input '${INPUT_BUCKET}/sentence_input/export_*.json' \
--output '${OUTPUT_BUCKET}/sentences/export_*.json' \
--lower_date_limit '${LOWER_DATE_BOUNDARY}' \
--upper_date_limit '${UPPER_DATE_BOUNDARY}' \
--disk_size_gb 50 \
--staging_location '${INPUT_BUCKET}/dataflow_staging' \
--temp_location '${INPUT_BUCKET}/temp' \
--job_name sentence-to-nouns-and-phrases \
--setup_file ./setup.py \
--save_main_session \
--max_num_workers '${MAX_DATAFLOW_WORKERS}' \
--machine_type 'n1-highmem-4' \
--worker_machine_type 'n1-highmem-4' \
--num_workers 5
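get_model snippet:
import spacy

def get_model():
    # Cache the loaded model on the function object so each worker loads it once.
    try:
        get_model.english_nlp
    except AttributeError:
        get_model.english_nlp = spacy.load("en_core_web_sm")
    return get_model.english_nlp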
output:
Can't find model 'en_core_web_sm'. It doesn't seem to be a shortcut link, a Python package or a valid path to a data directory.
Or, if I download the model inside the worker at runtime:
import logging
import spacy
from spacy.cli import download as spacy_download  # assumed origin of the spacy_download alias

def get_model():
    try:
        get_model.english_nlp
    except AttributeError:
        logging.warning("downloading spacy model")
        spacy_download('en_core_web_sm')
        logging.warning("downloaded spacy model")
        get_model.english_nlp = spacy.load("en_core_web_sm")
    return get_model.english_nlp
I get:
IOError: [E053] Could not read meta.json from /usr/local/lib/python2.7/dist-packages/spacy/data/en_core_web_sm/meta.json
This seems to be solved by using a setup.py file that lists the files to include from the raw model package:
cat setup.py
...
setup(
    name="frankenstein_conquers_the_world",
    version="0.0.1",
    description="nothing here",
    install_requires=REQUIRED_PACKAGES,
    packages=find_packages(),
    package_data={
        'src.en_core_web_sm': ['tokenizer', 'meta.json', 'accuracy.json'],
        'src.en_core_web_sm.ner': ['cfg', 'lower_model', 'moves', 'tok2vec_model', 'upper_model'],
        'src.en_core_web_sm.parser': ['cfg', 'lower_model', 'moves', 'tok2vec_model', 'upper_model'],
        'src.en_core_web_sm.tagger': ['cfg', 'model', 'tag_map'],
        'src.en_core_web_sm.vocab': ['key2row', 'lexemes.bin', 'strings.json', 'vectors']
    },
)
...
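For anyone who would rather not maintain the package_data mapping by hand, here is a minimal sketch that builds it by walking the unpacked model directory. The helper name collect_package_data is hypothetical (it is not part of spaCy or setuptools), and it assumes the model was unpacked under src/en_core_web_sm as described in this thread.

```python
import os


def collect_package_data(package_root, package_name):
    """Map each dotted sub-package under package_root to the data files it holds.

    Hypothetical helper: package_root is the directory on disk and
    package_name is the dotted name setuptools should use for it.
    """
    package_data = {}
    for dirpath, _dirnames, filenames in os.walk(package_root):
        # Skip Python sources and bytecode; setuptools picks up modules itself.
        data_files = sorted(
            f for f in filenames if not f.endswith((".py", ".pyc"))
        )
        if not data_files:
            continue
        rel = os.path.relpath(dirpath, package_root)
        if rel == ".":
            dotted = package_name
        else:
            dotted = ".".join([package_name] + rel.split(os.sep))
        package_data[dotted] = data_files
    return package_data
```

In setup.py this could then be passed as package_data=collect_package_data("src/en_core_web_sm", "src.en_core_web_sm"), which should produce a mapping of the same shape as the hand-written one above.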
This is coupled with a manual download of en_core_web_sm-2.0.0: unpack the first inner layer of the archive and store it under your package (in this case, src is the package).
Even though I found a solution, I think your docs should do a better job of explaining how to bundle models with Python packages. This is especially needed for Dataflow and PyPI.
Update:
This only works if every level of the model directory has an __init__.py file, so I just created those and it worked.
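Creating the __init__.py files can be scripted. The following is only a sketch: ensure_init_files is a hypothetical helper name, and it assumes that empty __init__.py files are enough for find_packages() to pick up each level of the model directory.

```python
import os


def ensure_init_files(model_root):
    """Create an empty __init__.py in model_root and in every directory below it.

    Returns the list of files that were newly created.
    """
    created = []
    for dirpath, _dirnames, _filenames in os.walk(model_root):
        init_path = os.path.join(dirpath, "__init__.py")
        if not os.path.exists(init_path):
            # An empty file marks the directory as a regular Python package.
            open(init_path, "a").close()
            created.append(init_path)
    return created
```

Running ensure_init_files("src/en_core_web_sm") once before building the package should cover the model root and the ner, parser, tagger and vocab directories; running it again creates nothing new.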
Thanks for updating with your solution!
This does seem quite hacky and I wonder if there's maybe a better solution 🤔 I'm no expert on Dataflow unfortunately, but I'll label this help wanted, so maybe someone else has an idea? We'd definitely like to include a recommendation for this in the docs.
> It would be so awesome if i could just have pip install spacy[en] but that does not exist.
I agree, but I don't think we can make this work unless we host our own PyPI server. In order to download a model, we need to resolve the compatibility, and the compatibility table should live outside spaCy so that we can ship new models without having to update the core library.
You should be able to pip install the models by pointing to their URLs on the https://github.com/explosion/spacy-models releases page. You could add the models to a requirements.txt, so that you can just do pip install -r requirements.txt.
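As a rough illustration of that suggestion, the sketch below builds the line you would add to requirements.txt for a given model. The helper name model_requirement is hypothetical, and the URL layout (a release tag of the form name-version containing a name-version.tar.gz archive) is an assumption based on how the spacy-models releases page is organised; check the releases page for the exact link.

```python
# Base URL of the release downloads in the spacy-models repository.
RELEASES_URL = "https://github.com/explosion/spacy-models/releases/download"


def model_requirement(name, version):
    """Return a requirements.txt line pointing pip at a model's release archive.

    Assumes the release tag and the archive are both named "<name>-<version>".
    """
    tag = "{}-{}".format(name, version)
    return "{}/{}/{}.tar.gz".format(RELEASES_URL, tag, tag)
```

For the model in this thread, model_requirement("en_core_web_sm", "2.0.0") gives a URL ending in en_core_web_sm-2.0.0/en_core_web_sm-2.0.0.tar.gz. With that line in requirements.txt, the model is installed as a regular Python package alongside spaCy, so spacy.load("en_core_web_sm") should find it without a separate download step.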