Allennlp: Cannot train language_model: RuntimeError: Tensors must be CUDA and dense

Created on 3 Sep 2020 · 6Comments · Source: allenai/allennlp

Checklist

[x] I have verified that the issue exists against the master branch of AllenNLP.
[x] I have read the relevant section in the contribution guide on reporting bugs.
[x] I have checked the issues list for similar or identical bug reports.
[x] I have checked the pull requests list for existing proposed fixes.
[x] I have checked the CHANGELOG and the commit log to find out if the bug was already fixed in the master branch.
[x] I have included in the "Description" section below a traceback from any exceptions related to this bug.
[x] I have included in the "Related issues or possible duplicates" section beloew all related issues and possible duplicate issues (If there are none, check this box anyway).
[x] I have included in the "Environment" section below the name of the operating system and Python version that I was using when I discovered this bug.
[x] I have included in the "Environment" section below the output of pip freeze.
[x] I have included in the "Steps to reproduce" section below a minimally reproducible example.

Description

Python traceback:

0 | 2020-09-03 12:39:41,240 - INFO - allennlp.training.trainer - Beginning training.
0 | 2020-09-03 12:39:41,240 - INFO - allennlp.training.trainer - Epoch 0/9
0 | 2020-09-03 12:39:41,241 - INFO - allennlp.training.trainer - Worker 0 memory usage MB: 2984.032
0 | 2020-09-03 12:39:41,241 - INFO - allennlp.training.trainer - Worker 1 memory usage MB: 2985.344
0 | 2020-09-03 12:39:41,241 - INFO - allennlp.training.trainer - Worker 2 memory usage MB: 2988.028
0 | 2020-09-03 12:39:41,241 - INFO - allennlp.training.trainer - Worker 3 memory usage MB: 2984.496
0 | 2020-09-03 12:39:41,241 - INFO - allennlp.training.trainer - Worker 4 memory usage MB: 2989.064
0 | 2020-09-03 12:39:41,241 - INFO - allennlp.training.trainer - Worker 5 memory usage MB: 2987.996
0 | 2020-09-03 12:39:41,241 - INFO - allennlp.training.trainer - Worker 6 memory usage MB: 2988.592
0 | 2020-09-03 12:39:41,241 - INFO - allennlp.training.trainer - Worker 7 memory usage MB: 2991.332
reading instances: 0it [00:00, ?it/s]0 | 2020-09-03 12:39:41,601 - INFO - allennlp.training.trainer - GPU 0 memory usage MB: 1614
0 | 2020-09-03 12:39:41,601 - INFO - allennlp.training.trainer - GPU 1 memory usage MB: 1614
0 | 2020-09-03 12:39:41,601 - INFO - allennlp.training.trainer - GPU 2 memory usage MB: 1614
0 | 2020-09-03 12:39:41,601 - INFO - allennlp.training.trainer - GPU 3 memory usage MB: 1614
0 | 2020-09-03 12:39:41,601 - INFO - allennlp.training.trainer - GPU 4 memory usage MB: 1614
0 | 2020-09-03 12:39:41,601 - INFO - allennlp.training.trainer - GPU 5 memory usage MB: 1614
0 | 2020-09-03 12:39:41,601 - INFO - allennlp.training.trainer - GPU 6 memory usage MB: 1614
0 | 2020-09-03 12:39:41,601 - INFO - allennlp.training.trainer - GPU 7 memory usage MB: 1614
reading instances: 0it [00:00, ?it/s]0 | 2020-09-03 12:39:41,603 - INFO - allennlp.training.trainer - Training
reading instances: 0it [00:00, ?it/s]0 | 2020-09-03 12:39:41,605 - INFO - src.models.language_model_conll_reader - Loading data from data/processed/no_bi_tags/new_model1_train.txt
reading instances: 0it [00:00, ?it/s]/app/local/anaconda3/envs/torch1.6_allennlp1.1/lib/python3.7/site-packages/allennlp/training/metrics/average.py:36: UserWarning: To copy construct from a tensor, it is recommended to use sourceTensor.clone().detach() or sourceTensor.clone().detach().requires_grad_(True), rather than torch.tensor(sourceTensor).
  total_value = torch.tensor(_total_value).to(device)
reading instances: 63it [00:00, 91.52it/s]
/app/local/anaconda3/envs/torch1.6_allennlp1.1/lib/python3.7/site-packages/allennlp/training/metrics/average.py:36: UserWarning: To copy construct from a tensor, it is recommended to use sourceTensor.clone().detach() or sourceTensor.clone().detach().requires_grad_(True), rather than torch.tensor(sourceTensor).
  total_value = torch.tensor(_total_value).to(device)
reading instances: 63it [00:00, 85.48it/s]
/app/local/anaconda3/envs/torch1.6_allennlp1.1/lib/python3.7/site-packages/allennlp/training/metrics/average.py:36: UserWarning: To copy construct from a tensor, it is recommended to use sourceTensor.clone().detach() or sourceTensor.clone().detach().requires_grad_(True), rather than torch.tensor(sourceTensor).
  total_value = torch.tensor(_total_value).to(device)
0it [00:00, ?it/s]
reading instances: 63it [00:00, 83.76it/s]
/app/local/anaconda3/envs/torch1.6_allennlp1.1/lib/python3.7/site-packages/allennlp/training/metrics/average.py:36: UserWarning: To copy construct from a tensor, it is recommended to use sourceTensor.clone().detach() or sourceTensor.clone().detach().requires_grad_(True), rather than torch.tensor(sourceTensor).
  total_value = torch.tensor(_total_value).to(device)
reading instances: 63it [00:00, 83.20it/s]
/app/local/anaconda3/envs/torch1.6_allennlp1.1/lib/python3.7/site-packages/allennlp/training/metrics/average.py:36: UserWarning: To copy construct from a tensor, it is recommended to use sourceTensor.clone().detach() or sourceTensor.clone().detach().requires_grad_(True), rather than torch.tensor(sourceTensor).
  total_value = torch.tensor(_total_value).to(device)
reading instances: 63it [00:00, 81.84it/s]
/app/local/anaconda3/envs/torch1.6_allennlp1.1/lib/python3.7/site-packages/allennlp/training/metrics/average.py:36: UserWarning: To copy construct from a tensor, it is recommended to use sourceTensor.clone().detach() or sourceTensor.clone().detach().requires_grad_(True), rather than torch.tensor(sourceTensor).
  total_value = torch.tensor(_total_value).to(device)
/app/local/anaconda3/envs/torch1.6_allennlp1.1/lib/python3.7/site-packages/allennlp/training/metrics/average.py:36: UserWarning: To copy construct from a tensor, it is recommended to use sourceTensor.clone().detach() or sourceTensor.clone().detach().requires_grad_(True), rather than torch.tensor(sourceTensor).
  total_value = torch.tensor(_total_value).to(device)
reading instances: 63it [00:00, 79.57it/s]
reading instances: 63it [00:00, 80.02it/s]
/app/local/anaconda3/envs/torch1.6_allennlp1.1/lib/python3.7/site-packages/allennlp/training/metrics/average.py:36: UserWarning: To copy construct from a tensor, it is recommended to use sourceTensor.clone().detach() or sourceTensor.clone().detach().requires_grad_(True), rather than torch.tensor(sourceTensor).
  total_value = torch.tensor(_total_value).to(device)
reading instances: 63it [00:00, 79.14it/s]
Traceback (most recent call last):
  File "/app/local/anaconda3/envs/torch1.6_allennlp1.1/bin/allennlp", line 8, in <module>
    sys.exit(run())
  File "/app/local/anaconda3/envs/torch1.6_allennlp1.1/lib/python3.7/site-packages/allennlp/__main__.py", line 34, in run
    main(prog="allennlp")
  File "/app/local/anaconda3/envs/torch1.6_allennlp1.1/lib/python3.7/site-packages/allennlp/commands/__init__.py", line 92, in main
    args.func(args)
  File "/app/local/anaconda3/envs/torch1.6_allennlp1.1/lib/python3.7/site-packages/allennlp/commands/train.py", line 118, in train_model_from_args
    file_friendly_logging=args.file_friendly_logging,
  File "/app/local/anaconda3/envs/torch1.6_allennlp1.1/lib/python3.7/site-packages/allennlp/commands/train.py", line 177, in train_model_from_file
    file_friendly_logging=file_friendly_logging,
  File "/app/local/anaconda3/envs/torch1.6_allennlp1.1/lib/python3.7/site-packages/allennlp/commands/train.py", line 304, in train_model
    nprocs=num_procs,
  File "/app/local/anaconda3/envs/torch1.6_allennlp1.1/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 200, in spawn
    return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
  File "/app/local/anaconda3/envs/torch1.6_allennlp1.1/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 158, in start_processes
    while not context.join():
  File "/app/local/anaconda3/envs/torch1.6_allennlp1.1/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 119, in join
    raise Exception(msg)
Exception:

-- Process 2 terminated with the following error:
Traceback (most recent call last):
  File "/app/local/anaconda3/envs/torch1.6_allennlp1.1/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 20, in _wrap
    fn(i, *args)
  File "/app/local/anaconda3/envs/torch1.6_allennlp1.1/lib/python3.7/site-packages/allennlp/commands/train.py", line 439, in _train_worker
    metrics = train_loop.run()
  File "/app/local/anaconda3/envs/torch1.6_allennlp1.1/lib/python3.7/site-packages/allennlp/commands/train.py", line 501, in run
    return self.trainer.train()
  File "/app/local/anaconda3/envs/torch1.6_allennlp1.1/lib/python3.7/site-packages/allennlp/training/trainer.py", line 867, in train
    train_metrics = self._train_epoch(epoch)
  File "/app/local/anaconda3/envs/torch1.6_allennlp1.1/lib/python3.7/site-packages/allennlp/training/trainer.py", line 589, in _train_epoch
    batch_outputs = self.batch_outputs(batch, for_training=True)
  File "/app/local/anaconda3/envs/torch1.6_allennlp1.1/lib/python3.7/site-packages/allennlp/training/trainer.py", line 479, in batch_outputs
    output_dict = self._pytorch_model(**batch)
  File "/app/local/anaconda3/envs/torch1.6_allennlp1.1/lib/python3.7/site-packages/torch/nn/modules/module.py", line 722, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/app/local/anaconda3/envs/torch1.6_allennlp1.1/lib/python3.7/site-packages/torch/nn/parallel/distributed.py", line 511, in forward
    output = self.module(*inputs[0], **kwargs[0])
  File "/app/local/anaconda3/envs/torch1.6_allennlp1.1/lib/python3.7/site-packages/torch/nn/modules/module.py", line 722, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/app/local/anaconda3/envs/torch1.6_allennlp1.1/lib/python3.7/site-packages/allennlp_models/lm/models/language_model.py", line 297, in forward
    self._perplexity(average_loss)
  File "/app/local/anaconda3/envs/torch1.6_allennlp1.1/lib/python3.7/site-packages/allennlp/training/metrics/average.py", line 37, in __call__
    dist.all_reduce(count, op=dist.ReduceOp.SUM)
  File "/app/local/anaconda3/envs/torch1.6_allennlp1.1/lib/python3.7/site-packages/torch/distributed/distributed_c10d.py", line 936, in all_reduce
    work = _default_pg.allreduce([tensor], opts)
RuntimeError: Tensors must be CUDA and dense

Related issues or possible duplicates

allenai/allennlp/issues/4554

Environment

OS: Red Hat Enterprise Linux Server release 7.6 (Maipo), Linux 3.10.0-957.12.2.el7.x86_64 #1 SMP Fri Apr 19 21:09:07 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux

Python version: 3.7.8 3.7.8 | packaged by conda-forge | (default, Jul 31 2020, 02:25:08)
Also: cudatoolkit 9.2, installed via conda

Output of pip freeze:

alabaster==0.7.12
alembic==1.4.2
allennlp==1.1.0rc4
allennlp-models==1.1.0rc4
appdirs==1.4.3
argon2-cffi @ file:///home/conda/feedstock_root/build_artifacts/argon2-cffi_1596629848788/work
astroid @ file:///home/conda/feedstock_root/build_artifacts/astroid_1591645081755/work
attrs @ file:///home/conda/feedstock_root/build_artifacts/attrs_1597959372343/work
Babel==2.8.0
backcall @ file:///home/conda/feedstock_root/build_artifacts/backcall_1592338393461/work
backports.functools-lru-cache==1.6.1
beautifulsoup4 @ file:///home/conda/feedstock_root/build_artifacts/beautifulsoup4_1597679909012/work
black @ file:///home/conda/feedstock_root/build_artifacts/black-recipe_1596181673569/work
bleach @ file:///home/conda/feedstock_root/build_artifacts/bleach_1588608214987/work
blis==0.4.1
boto==2.49.0
boto3 @ file:///home/conda/feedstock_root/build_artifacts/boto3_1599083817494/work
botocore @ file:///home/conda/feedstock_root/build_artifacts/botocore_1599079795233/work
brotlipy==0.7.0
bz2file==0.98
cachetools @ file:///home/conda/feedstock_root/build_artifacts/cachetools_1593420445823/work
catalogue==1.0.0
certifi==2020.6.20
cffi @ file:///home/conda/feedstock_root/build_artifacts/cffi_1595805535531/work
chardet==3.0.4
click==7.1.2
cliff @ file:///home/conda/feedstock_root/build_artifacts/cliff_1596473693634/work
cmaes @ file:///home/conda/feedstock_root/build_artifacts/cmaes_1596014311252/work
cmd2 @ file:///home/conda/feedstock_root/build_artifacts/cmd2_1591809284986/work
colorama==0.4.3
colorlog==4.2.1
conllu==4.0
cryptography @ file:///home/conda/feedstock_root/build_artifacts/cryptography_1598621005849/work
css-html-js-minify @ file:///home/conda/feedstock_root/build_artifacts/css-html-js-minify_1589201803800/work
cycler==0.10.0
cymem==2.0.3
Cython @ file:///home/conda/feedstock_root/build_artifacts/cython_1594253470382/work
decorator==4.4.2
defusedxml==0.6.0
docutils==0.15.2
en-core-web-sm @ file:///home/conda/feedstock_root/build_artifacts/spacy-model-en_core_web_1595437534769/work/sm
entrypoints==0.3
filelock==3.0.12
flake8 @ file:///home/conda/feedstock_root/build_artifacts/flake8_1594584774608/work
ftfy==5.8
future==0.18.2
gensim @ file:///home/conda/feedstock_root/build_artifacts/gensim_1589441213562/work
google-api-core @ file:///home/conda/feedstock_root/build_artifacts/google-api-core-split_1597353416133/work
google-auth @ file:///home/conda/feedstock_root/build_artifacts/google-auth_1598972371532/work
google-cloud-core @ file:///home/conda/feedstock_root/build_artifacts/google-cloud-core_1596721852962/work
google-cloud-storage @ file:///home/conda/feedstock_root/build_artifacts/google-cloud-storage_1598557481535/work
google-crc32c @ file:///home/conda/feedstock_root/build_artifacts/google-crc32c_1597069930917/work
google-resumable-media @ file:///home/conda/feedstock_root/build_artifacts/google-resumable-media_1598452053521/work
googleapis-common-protos==1.51.0
grpcio @ file:///home/conda/feedstock_root/build_artifacts/grpcio_1596715635580/work
h5py==2.10.0
idna @ file:///home/conda/feedstock_root/build_artifacts/idna_1593328102638/work
imagesize==1.2.0
importlib-metadata @ file:///home/conda/feedstock_root/build_artifacts/importlib-metadata_1593211369179/work
iniconfig==1.0.1
ipykernel @ file:///home/conda/feedstock_root/build_artifacts/ipykernel_1595446871027/work/dist/ipykernel-5.3.4-py3-none-any.whl
ipython @ file:///home/conda/feedstock_root/build_artifacts/ipython_1598749946943/work
ipython-genutils==0.2.0
isort @ file:///home/conda/feedstock_root/build_artifacts/isort_1599134253980/work
jedi @ file:///home/conda/feedstock_root/build_artifacts/jedi_1595018882455/work
Jinja2==2.11.2
jmespath @ file:///home/conda/feedstock_root/build_artifacts/jmespath_1589369830981/work
joblib @ file:///home/conda/feedstock_root/build_artifacts/joblib_1593624380152/work
json5 @ file:///home/conda/feedstock_root/build_artifacts/json5_1591810480056/work
jsonnet==0.16.0
jsonpickle==1.4.1
jsonschema==3.2.0
jupyter-client @ file:///home/conda/feedstock_root/build_artifacts/jupyter_client_1598486169312/work
jupyter-core==4.6.3
jupyterlab==2.2.6
jupyterlab-server @ file:///home/conda/feedstock_root/build_artifacts/jupyterlab_server_1593951277307/work
kiwisolver==1.2.0
lazy-object-proxy==1.4.3
livereload @ file:///home/conda/feedstock_root/build_artifacts/livereload_1598114753789/work
lunr==0.5.8
lxml @ file:///home/conda/feedstock_root/build_artifacts/lxml_1594322698782/work
Mako @ file:///home/conda/feedstock_root/build_artifacts/mako_1595925083607/work
Markdown @ file:///home/conda/feedstock_root/build_artifacts/markdown_1589366472132/work
MarkupSafe==1.1.1
matplotlib @ file:///home/conda/feedstock_root/build_artifacts/matplotlib-base_1597952254444/work
mccabe==0.6.1
mistune==0.8.4
mkdocs @ file:///home/conda/feedstock_root/build_artifacts/mkdocs_1591017186129/work
mkdocs-material @ file:///home/conda/feedstock_root/build_artifacts/mkdocs-material_1598971568401/work
more-itertools==8.5.0
murmurhash==1.0.2
mypy @ file:///home/conda/feedstock_root/build_artifacts/mypy_1592923020520/work
mypy-extensions==0.4.3
nbconvert==5.6.1
nbformat @ file:///home/conda/feedstock_root/build_artifacts/nbformat_1594060262917/work
networkx==2.5
nltk==3.4.4
notebook @ file:///home/conda/feedstock_root/build_artifacts/notebook_1597285190767/work
numpy @ file:///home/conda/feedstock_root/build_artifacts/numpy_1597938346492/work
olefile==0.46
optuna @ file:///home/conda/feedstock_root/build_artifacts/optuna_1598323699360/work
overrides==3.1.0
packaging @ file:///home/conda/feedstock_root/build_artifacts/packaging_1589925210001/work
pandas @ file:///home/conda/feedstock_root/build_artifacts/pandas_1598294454723/work
pandocfilters==1.4.2
parso @ file:///home/conda/feedstock_root/build_artifacts/parso_1595548966091/work
pathspec==0.8.0
patsy==0.5.1
pbr @ file:///home/conda/feedstock_root/build_artifacts/pbr_1598996708473/work
pexpect==4.8.0
pickleshare==0.7.5
Pillow @ file:///home/conda/feedstock_root/build_artifacts/pillow_1594213010297/work
plac==1.1.3
pluggy==0.13.1
preshed==3.0.2
prettytable==0.7.2
prometheus-client @ file:///home/conda/feedstock_root/build_artifacts/prometheus_client_1590412252446/work
prompt-toolkit @ file:///home/conda/feedstock_root/build_artifacts/prompt-toolkit_1598885455507/work
protobuf==3.13.0
psutil @ file:///home/conda/feedstock_root/build_artifacts/psutil_1594826921622/work
ptyprocess==0.6.0
py==1.9.0
py-rouge==1.1
pyasn1==0.4.8
pyasn1-modules==0.2.7
pycodestyle @ file:///home/conda/feedstock_root/build_artifacts/pycodestyle_1589305246696/work
pycorenlp==0.3.0
pycparser @ file:///home/conda/feedstock_root/build_artifacts/pycparser_1593275161868/work
pyflakes==2.2.0
Pygments==2.6.1
pylint @ file:///home/conda/feedstock_root/build_artifacts/pylint_1598117058668/work
pymdown-extensions @ file:///home/conda/feedstock_root/build_artifacts/pymdown-extensions_1597166028984/work
Pyment==0.3.3
pyneuroner==1.0.8
pyOpenSSL==19.1.0
pyparsing==2.4.7
pyperclip @ file:///home/conda/feedstock_root/build_artifacts/pyperclip_1591810382257/work
pyrsistent==0.16.0
PySocks==1.7.1
pytest==6.0.1
python-dateutil==2.8.1
python-editor==1.0.4
python-magic==0.4.18
python-slugify @ file:///home/conda/feedstock_root/build_artifacts/python-slugify_1593573453419/work
pytz==2020.1
PyYAML==5.3.1
pyzmq==19.0.2
regex @ file:///home/conda/feedstock_root/build_artifacts/regex_1594799371287/work
requests @ file:///home/conda/feedstock_root/build_artifacts/requests_1592425495151/work
rsa @ file:///home/conda/feedstock_root/build_artifacts/rsa_1591996208734/work
s3cmd==2.1.0
s3transfer==0.3.3
sacremoses==0.0.43
scikit-learn @ file:///home/conda/feedstock_root/build_artifacts/scikit-learn_1596546074663/work
scipy @ file:///home/conda/feedstock_root/build_artifacts/scipy_1595583586868/work
seaborn @ file:///home/conda/feedstock_root/build_artifacts/seaborn-base_1591878760859/work
Send2Trash==1.5.0
sentencepiece==0.1.91
six @ file:///home/conda/feedstock_root/build_artifacts/six_1590081179328/work
slugify==0.0.1
smart-open @ file:///home/conda/feedstock_root/build_artifacts/smart_open_1598572939086/work
snowballstemmer==2.0.0
soupsieve @ file:///home/conda/feedstock_root/build_artifacts/soupsieve_1597680516047/work
spacy==2.3.2
Sphinx @ file:///home/conda/feedstock_root/build_artifacts/sphinx_1597405755328/work
sphinx-material @ file:///home/conda/feedstock_root/build_artifacts/sphinx-material_1597663680729/work
sphinxcontrib-applehelp==1.0.2
sphinxcontrib-devhelp==1.0.2
sphinxcontrib-htmlhelp==1.0.3
sphinxcontrib-jsmath==1.0.1
sphinxcontrib-qthelp==1.0.3
sphinxcontrib-serializinghtml==1.1.4
SQLAlchemy @ file:///home/conda/feedstock_root/build_artifacts/sqlalchemy_1597701920245/work
srsly==1.0.2
statsmodels @ file:///home/conda/feedstock_root/build_artifacts/statsmodels_1598551025620/work
stevedore @ file:///home/conda/feedstock_root/build_artifacts/stevedore_1598982656343/work
tensorboardX==2.1
teradata==15.10.0.21
terminado==0.8.3
testpath==0.4.4
text-unidecode==1.3
thinc==7.4.1
threadpoolctl @ file:///tmp/tmp79xdzxkt/threadpoolctl-2.1.0-py3-none-any.whl
tokenizers==0.8.1rc1
toml @ file:///home/conda/feedstock_root/build_artifacts/toml_1589469402899/work
torch==1.6.0
torchvision==0.7.0
tornado==6.0.4
tqdm @ file:///home/conda/feedstock_root/build_artifacts/tqdm_1596476591553/work
traitlets @ file:///home/conda/feedstock_root/build_artifacts/traitlets_1598976315411/work
transformers==3.0.2
typed-ast==1.4.1
typing-extensions @ file:///home/conda/feedstock_root/build_artifacts/typing_extensions_1588470653596/work
Unidecode==1.1.1
urllib3 @ file:///home/conda/feedstock_root/build_artifacts/urllib3_1595434816409/work
wasabi==0.8.0
wcwidth @ file:///home/conda/feedstock_root/build_artifacts/wcwidth_1595859607677/work
webencodings==0.5.1
word2number==1.1
wrapt==1.11.2
XlsxWriter @ file:///home/conda/feedstock_root/build_artifacts/xlsxwriter_1597362230404/work
zipp==3.1.0

Steps to reproduce

Example source:

allennlp train --force --include-package src --serialization-dir tmp/elmo_lm_tran1 configs/bidirectional_language_model_part_notes.jsonnet

Contents of `configs/bidirectional_language_model_part_notes.jsonnet`:

{
  dataset_reader: {
    type: 'language_model_conll_reader',
    //"tokenizer": {
    //    "type": "just_spaces"
    //},
    token_indexers: {
      tokens: {
        type: 'single_id',
      },
      token_characters: {
        type: 'elmo_characters',
      },
    },
    //"max_sequence_length": 400,
    //"start_tokens": ["<S>"],
    //"end_tokens": ["</S>"]
  },
  train_data_path: 'data/processed/no_bi_tags/new_model1_train.txt',
  model: {
    type: 'language_model',
    bidirectional: true,
    //"num_samples": 8192,
    sparse_embeddings: true,
    text_field_embedder: {
      //"allow_unmatched_keys": true,
      token_embedders: {
        //tokens: {
        //          type: 'embedding',
        //          pretrained_file: 'models/word2vec_newtoken1.txt',
        //          embedding_dim: 50,
        //          trainable: false
        //      },
        tokens: {
          type: 'empty',
        },
        token_characters: {
          type: 'character_encoding',
          embedding: {
            num_embeddings: 262,
            embedding_dim: 16,
          },
          encoder: {
            type: 'cnn-highway',
            activation: 'relu',
            embedding_dim: 16,
            filters: [
              [1, 32],
              [2, 32],
              [3, 64],
              [4, 128],
              [5, 256],
              [6, 512],
              [7, 1024],
            ],
            num_highway: 2,
            projection_dim: 512,
            projection_location: 'after_highway',
            do_layer_norm: true,
          },
        },
      },
    },
    dropout: 0.1,
    contextualizer: {
      type: 'bidirectional_language_model_transformer',
      input_dim: 512,
      hidden_dim: 2048,
      num_layers: 6,
      dropout: 0.1,
      input_dropout: 0.1,
    },
  },
  data_loader: {
    batch_size: 64,
  },
  distributed: {
    cuda_devices: std.range(0, 8 - 1),
  },
  trainer: {
    num_epochs: 10,
    optimizer: {
      type: 'dense_sparse_adam',
    },
    // TODO(brendanr): Needed with transformer too?
    // "grad_norm": 10.0,
    learning_rate_scheduler: {
      type: 'noam',
      // See https://github.com/allenai/calypso/blob/master/calypso/train.py#L401
      model_size: 512,
      // See https://github.com/allenai/calypso/blob/master/bin/train_transformer_lm1b.py#L51.
      // Adjusted based on our sample size relative to Calypso's.
      warmup_steps: 6000,
    },
    //"should_log_learning_rate": true
  },
}

The `dataset_reader` and the data itself in the `train_data_path` are custom code and custom data, respectively. I imagine that part is not the issue, since we are able to build a vocabulary and attempt to train. Should be reproducible with any data and dataset_reader.

bug

Source

tpanza

All 6 comments

If I fallback to an older environment with these versions:

allennlp==1.0.0
allennlp-models==1.0.0

and

cudatoolkit               9.2                           0
pytorch                   1.5.1           py3.7_cuda9.2.148_cudnn7.6.3_0
python                    3.7.7           hcf32534_0_cpython

and use the same configs/bidirectional_language_model_part_notes.jsonnet provided above, the language model does train

tpanza on 3 Sep 2020

If, in my PyTorch 1.6 and AllenNLP 1.10RC4 environment, I try setting this environment variable:

export CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7

and then try the same command with same jsonnet config file:

allennlp train --force --include-package src --serialization-dir tmp/elmo_lm_tran1 configs/bidirectional_language_model_part_notes.jsonnet

I still see the same RuntimeError: Tensors must be CUDA and dense error.

Output of nvidia-smi:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 410.48                 Driver Version: 410.48                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla V100-SXM2...  On   | 00000000:15:00.0 Off |                    0 |
| N/A   36C    P0    45W / 300W |      0MiB / 32480MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Tesla V100-SXM2...  On   | 00000000:16:00.0 Off |                    0 |
| N/A   36C    P0    45W / 300W |      0MiB / 32480MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   2  Tesla V100-SXM2...  On   | 00000000:3A:00.0 Off |                    0 |
| N/A   34C    P0    44W / 300W |      0MiB / 32480MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   3  Tesla V100-SXM2...  On   | 00000000:3B:00.0 Off |                    0 |
| N/A   37C    P0    43W / 300W |      0MiB / 32480MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   4  Tesla V100-SXM2...  On   | 00000000:89:00.0 Off |                    0 |
| N/A   34C    P0    44W / 300W |      0MiB / 32480MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   5  Tesla V100-SXM2...  On   | 00000000:8A:00.0 Off |                    0 |
| N/A   36C    P0    44W / 300W |      0MiB / 32480MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   6  Tesla V100-SXM2...  On   | 00000000:B2:00.0 Off |                    0 |
| N/A   34C    P0    44W / 300W |      0MiB / 32480MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   7  Tesla V100-SXM2...  On   | 00000000:B3:00.0 Off |                    0 |
| N/A   37C    P0    43W / 300W |      0MiB / 32480MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

tpanza on 3 Sep 2020

Thanks @tpanza, I think I found the issue. PR coming soon.

epwalsh on 3 Sep 2020

❤1

Hi @epwalsh I have updated to 1.1.0 that includes the fix from #4624. Looks like I am now getting a different error, this time with the perplexity metric:

0 | 2020-09-16 16:30:05,371 - INFO - allennlp.training.trainer - Beginning training.
0 | 2020-09-16 16:30:05,371 - INFO - allennlp.training.trainer - Epoch 0/9
0 | 2020-09-16 16:30:05,372 - INFO - allennlp.training.trainer - Worker 0 memory usage MB: 2990.82
0 | 2020-09-16 16:30:05,372 - INFO - allennlp.training.trainer - Worker 1 memory usage MB: 2993.016
0 | 2020-09-16 16:30:05,372 - INFO - allennlp.training.trainer - Worker 2 memory usage MB: 2990.024
0 | 2020-09-16 16:30:05,373 - INFO - allennlp.training.trainer - Worker 3 memory usage MB: 2988.112
0 | 2020-09-16 16:30:05,373 - INFO - allennlp.training.trainer - Worker 4 memory usage MB: 2988.212
0 | 2020-09-16 16:30:05,373 - INFO - allennlp.training.trainer - Worker 5 memory usage MB: 2990.156
0 | 2020-09-16 16:30:05,373 - INFO - allennlp.training.trainer - Worker 6 memory usage MB: 2986.692
0 | 2020-09-16 16:30:05,373 - INFO - allennlp.training.trainer - Worker 7 memory usage MB: 2997.112
reading instances: 0it [00:00, ?it/s]0 | 2020-09-16 16:30:05,751 - INFO - allennlp.training.trainer - GPU 0 memory usage MB: 1614
0 | 2020-09-16 16:30:05,751 - INFO - allennlp.training.trainer - GPU 1 memory usage MB: 1614
0 | 2020-09-16 16:30:05,751 - INFO - allennlp.training.trainer - GPU 2 memory usage MB: 1614
0 | 2020-09-16 16:30:05,751 - INFO - allennlp.training.trainer - GPU 3 memory usage MB: 1614
0 | 2020-09-16 16:30:05,751 - INFO - allennlp.training.trainer - GPU 4 memory usage MB: 1614
0 | 2020-09-16 16:30:05,751 - INFO - allennlp.training.trainer - GPU 5 memory usage MB: 1614
0 | 2020-09-16 16:30:05,751 - INFO - allennlp.training.trainer - GPU 6 memory usage MB: 1614
0 | 2020-09-16 16:30:05,751 - INFO - allennlp.training.trainer - GPU 7 memory usage MB: 1614
reading instances: 0it [00:00, ?it/s]0 | 2020-09-16 16:30:05,753 - INFO - allennlp.training.trainer - Training
0it [00:00, ?it/s] 0it [00:00, ?it/s]0 | 2020-09-16 16:30:05,755 - INFO - src.models.language_model_conll_reader - Loading data from data/processed/no_bi_tags/new_model1_train.txt
reading instances: 0it [00:00, ?it/s]/app/local/anaconda3/envs/torch1.6_allennlp1.1/lib/python3.7/site-packages/allennlp/training/metrics/average.py:36: UserWarning: To copy construct from a tensor, it is recommended to use sourceTensor.clone().detach() or sourceTensor.clone().detach().requires_grad_(True), rather than torch.tensor(sourceTensor).
  total_value = torch.tensor(_total_value).to(device)
/app/local/anaconda3/envs/torch1.6_allennlp1.1/lib/python3.7/site-packages/allennlp/training/metrics/average.py:36: UserWarning: To copy construct from a tensor, it is recommended to use sourceTensor.clone().detach() or sourceTensor.clone().detach().requires_grad_(True), rather than torch.tensor(sourceTensor).
  total_value = torch.tensor(_total_value).to(device)
/app/local/anaconda3/envs/torch1.6_allennlp1.1/lib/python3.7/site-packages/allennlp/training/metrics/average.py:36: UserWarning: To copy construct from a tensor, it is recommended to use sourceTensor.clone().detach() or sourceTensor.clone().detach().requires_grad_(True), rather than torch.tensor(sourceTensor).
  total_value = torch.tensor(_total_value).to(device)
/app/local/anaconda3/envs/torch1.6_allennlp1.1/lib/python3.7/site-packages/allennlp/training/metrics/average.py:36: UserWarning: To copy construct from a tensor, it is recommended to use sourceTensor.clone().detach() or sourceTensor.clone().detach().requires_grad_(True), rather than torch.tensor(sourceTensor).
  total_value = torch.tensor(_total_value).to(device)
/app/local/anaconda3/envs/torch1.6_allennlp1.1/lib/python3.7/site-packages/allennlp/training/metrics/average.py:36: UserWarning: To copy construct from a tensor, it is recommended to use sourceTensor.clone().detach() or sourceTensor.clone().detach().requires_grad_(True), rather than torch.tensor(sourceTensor).
  total_value = torch.tensor(_total_value).to(device)
/app/local/anaconda3/envs/torch1.6_allennlp1.1/lib/python3.7/site-packages/allennlp/training/metrics/average.py:36: UserWarning: To copy construct from a tensor, it is recommended to use sourceTensor.clone().detach() or sourceTensor.clone().detach().requires_grad_(True), rather than torch.tensor(sourceTensor).
  total_value = torch.tensor(_total_value).to(device)
/app/local/anaconda3/envs/torch1.6_allennlp1.1/lib/python3.7/site-packages/allennlp/training/metrics/average.py:36: UserWarning: To copy construct from a tensor, it is recommended to use sourceTensor.clone().detach() or sourceTensor.clone().detach().requires_grad_(True), rather than torch.tensor(sourceTensor).
  total_value = torch.tensor(_total_value).to(device)
/app/local/anaconda3/envs/torch1.6_allennlp1.1/lib/python3.7/site-packages/allennlp/training/metrics/average.py:36: UserWarning: To copy construct from a tensor, it is recommended to use sourceTensor.clone().detach() or sourceTensor.clone().detach().requires_grad_(True), rather than torch.tensor(sourceTensor).
  total_value = torch.tensor(_total_value).to(device)
0it [00:01, ?it/s]
reading instances: 63it [00:01, 36.85it/s]
reading instances: 63it [00:01, 36.71it/s]
reading instances: 63it [00:01, 36.74it/s]
reading instances: 63it [00:01, 36.71it/s]
reading instances: 63it [00:01, 36.50it/s]
reading instances: 63it [00:01, 36.63it/s]
reading instances: 63it [00:01, 36.57it/s]
reading instances: 63it [00:01, 36.60it/s]
Traceback (most recent call last):
  File "/app/local/anaconda3/envs/torch1.6_allennlp1.1/bin/allennlp", line 8, in <module>
    sys.exit(run())
  File "/app/local/anaconda3/envs/torch1.6_allennlp1.1/lib/python3.7/site-packages/allennlp/__main__.py", line 34, in run
    main(prog="allennlp")
  File "/app/local/anaconda3/envs/torch1.6_allennlp1.1/lib/python3.7/site-packages/allennlp/commands/__init__.py", line 92, in main
    args.func(args)
  File "/app/local/anaconda3/envs/torch1.6_allennlp1.1/lib/python3.7/site-packages/allennlp/commands/train.py", line 118, in train_model_from_args
    file_friendly_logging=args.file_friendly_logging,
  File "/app/local/anaconda3/envs/torch1.6_allennlp1.1/lib/python3.7/site-packages/allennlp/commands/train.py", line 177, in train_model_from_file
    file_friendly_logging=file_friendly_logging,
  File "/app/local/anaconda3/envs/torch1.6_allennlp1.1/lib/python3.7/site-packages/allennlp/commands/train.py", line 304, in train_model
    nprocs=num_procs,
  File "/app/local/anaconda3/envs/torch1.6_allennlp1.1/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 200, in spawn
    return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
  File "/app/local/anaconda3/envs/torch1.6_allennlp1.1/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 158, in start_processes
    while not context.join():
  File "/app/local/anaconda3/envs/torch1.6_allennlp1.1/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 119, in join
    raise Exception(msg)
Exception:

-- Process 5 terminated with the following error:
Traceback (most recent call last):
  File "/app/local/anaconda3/envs/torch1.6_allennlp1.1/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 20, in _wrap
    fn(i, *args)
  File "/app/local/anaconda3/envs/torch1.6_allennlp1.1/lib/python3.7/site-packages/allennlp/commands/train.py", line 443, in _train_worker
    metrics = train_loop.run()
  File "/app/local/anaconda3/envs/torch1.6_allennlp1.1/lib/python3.7/site-packages/allennlp/commands/train.py", line 505, in run
    return self.trainer.train()
  File "/app/local/anaconda3/envs/torch1.6_allennlp1.1/lib/python3.7/site-packages/allennlp/training/trainer.py", line 867, in train
    train_metrics = self._train_epoch(epoch)
  File "/app/local/anaconda3/envs/torch1.6_allennlp1.1/lib/python3.7/site-packages/allennlp/training/trainer.py", line 657, in _train_epoch
    cuda_device=self.cuda_device,
  File "/app/local/anaconda3/envs/torch1.6_allennlp1.1/lib/python3.7/site-packages/allennlp/training/util.py", line 294, in get_metrics
    metrics = model.get_metrics(reset=reset)
  File "/app/local/anaconda3/envs/torch1.6_allennlp1.1/lib/python3.7/site-packages/allennlp_models/lm/models/language_model.py", line 325, in get_metrics
    return {"perplexity": self._perplexity.get_metric(reset=reset)}
  File "/app/local/anaconda3/envs/torch1.6_allennlp1.1/lib/python3.7/site-packages/allennlp/training/metrics/perplexity.py", line 32, in get_metric
    perplexity = float(torch.exp(average_loss))
TypeError: exp(): argument 'input' (position 1) must be Tensor, not float

tpanza on 17 Sep 2020

Nevermind... looks like this was fixed by #4632 . I've applied that patch on top of the 1.1.0 code, and the language model was able to train.

tpanza on 17 Sep 2020

👍1

It looks like this is also an issue with categorical accuracy in basic_classifier. It seems to be a very easy fix, just change line 98 or allennlp.training.metrics.categorical_accuracy from:

_total_count = torch.tensor(gold_labels.numel())

to:

_total_count = torch.tensor(gold_labels.numel(), device=gold_labels.device)

jacobdanovitch on 2 Oct 2020

Was this page helpful?

0 / 5 - 0 ratings

Related issues

How to continue training the model after it is out of patience?

MeiqiGuo · 4Comments

Add start and end tokens to source sentence in seq2seq DatasetReader

maksym-del · 4Comments

Multiple prediction spans from machine comprehension (BiDAF)?

Andy-jqa · 3Comments

Updates to the docs

DeNeutoy · 4Comments

How to convert a model to ONNX format?

onetonfoot · 3Comments