master branch of AllenNLP.pip freeze.
Python traceback:
0 | 2020-09-03 12:39:41,240 - INFO - allennlp.training.trainer - Beginning training.
0 | 2020-09-03 12:39:41,240 - INFO - allennlp.training.trainer - Epoch 0/9
0 | 2020-09-03 12:39:41,241 - INFO - allennlp.training.trainer - Worker 0 memory usage MB: 2984.032
0 | 2020-09-03 12:39:41,241 - INFO - allennlp.training.trainer - Worker 1 memory usage MB: 2985.344
0 | 2020-09-03 12:39:41,241 - INFO - allennlp.training.trainer - Worker 2 memory usage MB: 2988.028
0 | 2020-09-03 12:39:41,241 - INFO - allennlp.training.trainer - Worker 3 memory usage MB: 2984.496
0 | 2020-09-03 12:39:41,241 - INFO - allennlp.training.trainer - Worker 4 memory usage MB: 2989.064
0 | 2020-09-03 12:39:41,241 - INFO - allennlp.training.trainer - Worker 5 memory usage MB: 2987.996
0 | 2020-09-03 12:39:41,241 - INFO - allennlp.training.trainer - Worker 6 memory usage MB: 2988.592
0 | 2020-09-03 12:39:41,241 - INFO - allennlp.training.trainer - Worker 7 memory usage MB: 2991.332
reading instances: 0it [00:00, ?it/s]0 | 2020-09-03 12:39:41,601 - INFO - allennlp.training.trainer - GPU 0 memory usage MB: 1614
0 | 2020-09-03 12:39:41,601 - INFO - allennlp.training.trainer - GPU 1 memory usage MB: 1614
0 | 2020-09-03 12:39:41,601 - INFO - allennlp.training.trainer - GPU 2 memory usage MB: 1614
0 | 2020-09-03 12:39:41,601 - INFO - allennlp.training.trainer - GPU 3 memory usage MB: 1614
0 | 2020-09-03 12:39:41,601 - INFO - allennlp.training.trainer - GPU 4 memory usage MB: 1614
0 | 2020-09-03 12:39:41,601 - INFO - allennlp.training.trainer - GPU 5 memory usage MB: 1614
0 | 2020-09-03 12:39:41,601 - INFO - allennlp.training.trainer - GPU 6 memory usage MB: 1614
0 | 2020-09-03 12:39:41,601 - INFO - allennlp.training.trainer - GPU 7 memory usage MB: 1614
reading instances: 0it [00:00, ?it/s]0 | 2020-09-03 12:39:41,603 - INFO - allennlp.training.trainer - Training
reading instances: 0it [00:00, ?it/s]0 | 2020-09-03 12:39:41,605 - INFO - src.models.language_model_conll_reader - Loading data from data/processed/no_bi_tags/new_model1_train.txt
reading instances: 0it [00:00, ?it/s]/app/local/anaconda3/envs/torch1.6_allennlp1.1/lib/python3.7/site-packages/allennlp/training/metrics/average.py:36: UserWarning: To copy construct from a tensor, it is recommended to use sourceTensor.clone().detach() or sourceTensor.clone().detach().requires_grad_(True), rather than torch.tensor(sourceTensor).
total_value = torch.tensor(_total_value).to(device)
reading instances: 63it [00:00, 91.52it/s]
/app/local/anaconda3/envs/torch1.6_allennlp1.1/lib/python3.7/site-packages/allennlp/training/metrics/average.py:36: UserWarning: To copy construct from a tensor, it is recommended to use sourceTensor.clone().detach() or sourceTensor.clone().detach().requires_grad_(True), rather than torch.tensor(sourceTensor).
total_value = torch.tensor(_total_value).to(device)
reading instances: 63it [00:00, 85.48it/s]
/app/local/anaconda3/envs/torch1.6_allennlp1.1/lib/python3.7/site-packages/allennlp/training/metrics/average.py:36: UserWarning: To copy construct from a tensor, it is recommended to use sourceTensor.clone().detach() or sourceTensor.clone().detach().requires_grad_(True), rather than torch.tensor(sourceTensor).
total_value = torch.tensor(_total_value).to(device)
0it [00:00, ?it/s]
reading instances: 63it [00:00, 83.76it/s]
/app/local/anaconda3/envs/torch1.6_allennlp1.1/lib/python3.7/site-packages/allennlp/training/metrics/average.py:36: UserWarning: To copy construct from a tensor, it is recommended to use sourceTensor.clone().detach() or sourceTensor.clone().detach().requires_grad_(True), rather than torch.tensor(sourceTensor).
total_value = torch.tensor(_total_value).to(device)
reading instances: 63it [00:00, 83.20it/s]
/app/local/anaconda3/envs/torch1.6_allennlp1.1/lib/python3.7/site-packages/allennlp/training/metrics/average.py:36: UserWarning: To copy construct from a tensor, it is recommended to use sourceTensor.clone().detach() or sourceTensor.clone().detach().requires_grad_(True), rather than torch.tensor(sourceTensor).
total_value = torch.tensor(_total_value).to(device)
reading instances: 63it [00:00, 81.84it/s]
/app/local/anaconda3/envs/torch1.6_allennlp1.1/lib/python3.7/site-packages/allennlp/training/metrics/average.py:36: UserWarning: To copy construct from a tensor, it is recommended to use sourceTensor.clone().detach() or sourceTensor.clone().detach().requires_grad_(True), rather than torch.tensor(sourceTensor).
total_value = torch.tensor(_total_value).to(device)
/app/local/anaconda3/envs/torch1.6_allennlp1.1/lib/python3.7/site-packages/allennlp/training/metrics/average.py:36: UserWarning: To copy construct from a tensor, it is recommended to use sourceTensor.clone().detach() or sourceTensor.clone().detach().requires_grad_(True), rather than torch.tensor(sourceTensor).
total_value = torch.tensor(_total_value).to(device)
reading instances: 63it [00:00, 79.57it/s]
reading instances: 63it [00:00, 80.02it/s]
/app/local/anaconda3/envs/torch1.6_allennlp1.1/lib/python3.7/site-packages/allennlp/training/metrics/average.py:36: UserWarning: To copy construct from a tensor, it is recommended to use sourceTensor.clone().detach() or sourceTensor.clone().detach().requires_grad_(True), rather than torch.tensor(sourceTensor).
total_value = torch.tensor(_total_value).to(device)
reading instances: 63it [00:00, 79.14it/s]
Traceback (most recent call last):
File "/app/local/anaconda3/envs/torch1.6_allennlp1.1/bin/allennlp", line 8, in <module>
sys.exit(run())
File "/app/local/anaconda3/envs/torch1.6_allennlp1.1/lib/python3.7/site-packages/allennlp/__main__.py", line 34, in run
main(prog="allennlp")
File "/app/local/anaconda3/envs/torch1.6_allennlp1.1/lib/python3.7/site-packages/allennlp/commands/__init__.py", line 92, in main
args.func(args)
File "/app/local/anaconda3/envs/torch1.6_allennlp1.1/lib/python3.7/site-packages/allennlp/commands/train.py", line 118, in train_model_from_args
file_friendly_logging=args.file_friendly_logging,
File "/app/local/anaconda3/envs/torch1.6_allennlp1.1/lib/python3.7/site-packages/allennlp/commands/train.py", line 177, in train_model_from_file
file_friendly_logging=file_friendly_logging,
File "/app/local/anaconda3/envs/torch1.6_allennlp1.1/lib/python3.7/site-packages/allennlp/commands/train.py", line 304, in train_model
nprocs=num_procs,
File "/app/local/anaconda3/envs/torch1.6_allennlp1.1/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 200, in spawn
return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
File "/app/local/anaconda3/envs/torch1.6_allennlp1.1/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 158, in start_processes
while not context.join():
File "/app/local/anaconda3/envs/torch1.6_allennlp1.1/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 119, in join
raise Exception(msg)
Exception:
-- Process 2 terminated with the following error:
Traceback (most recent call last):
File "/app/local/anaconda3/envs/torch1.6_allennlp1.1/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 20, in _wrap
fn(i, *args)
File "/app/local/anaconda3/envs/torch1.6_allennlp1.1/lib/python3.7/site-packages/allennlp/commands/train.py", line 439, in _train_worker
metrics = train_loop.run()
File "/app/local/anaconda3/envs/torch1.6_allennlp1.1/lib/python3.7/site-packages/allennlp/commands/train.py", line 501, in run
return self.trainer.train()
File "/app/local/anaconda3/envs/torch1.6_allennlp1.1/lib/python3.7/site-packages/allennlp/training/trainer.py", line 867, in train
train_metrics = self._train_epoch(epoch)
File "/app/local/anaconda3/envs/torch1.6_allennlp1.1/lib/python3.7/site-packages/allennlp/training/trainer.py", line 589, in _train_epoch
batch_outputs = self.batch_outputs(batch, for_training=True)
File "/app/local/anaconda3/envs/torch1.6_allennlp1.1/lib/python3.7/site-packages/allennlp/training/trainer.py", line 479, in batch_outputs
output_dict = self._pytorch_model(**batch)
File "/app/local/anaconda3/envs/torch1.6_allennlp1.1/lib/python3.7/site-packages/torch/nn/modules/module.py", line 722, in _call_impl
result = self.forward(*input, **kwargs)
File "/app/local/anaconda3/envs/torch1.6_allennlp1.1/lib/python3.7/site-packages/torch/nn/parallel/distributed.py", line 511, in forward
output = self.module(*inputs[0], **kwargs[0])
File "/app/local/anaconda3/envs/torch1.6_allennlp1.1/lib/python3.7/site-packages/torch/nn/modules/module.py", line 722, in _call_impl
result = self.forward(*input, **kwargs)
File "/app/local/anaconda3/envs/torch1.6_allennlp1.1/lib/python3.7/site-packages/allennlp_models/lm/models/language_model.py", line 297, in forward
self._perplexity(average_loss)
File "/app/local/anaconda3/envs/torch1.6_allennlp1.1/lib/python3.7/site-packages/allennlp/training/metrics/average.py", line 37, in __call__
dist.all_reduce(count, op=dist.ReduceOp.SUM)
File "/app/local/anaconda3/envs/torch1.6_allennlp1.1/lib/python3.7/site-packages/torch/distributed/distributed_c10d.py", line 936, in all_reduce
work = _default_pg.allreduce([tensor], opts)
RuntimeError: Tensors must be CUDA and dense
OS: Red Hat Enterprise Linux Server release 7.6 (Maipo), Linux 3.10.0-957.12.2.el7.x86_64 #1 SMP Fri Apr 19 21:09:07 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux
Python version: 3.7.8 3.7.8 | packaged by conda-forge | (default, Jul 31 2020, 02:25:08)
Also: cudatoolkit 9.2, installed via conda
Output of
pip freeze:
alabaster==0.7.12
alembic==1.4.2
allennlp==1.1.0rc4
allennlp-models==1.1.0rc4
appdirs==1.4.3
argon2-cffi @ file:///home/conda/feedstock_root/build_artifacts/argon2-cffi_1596629848788/work
astroid @ file:///home/conda/feedstock_root/build_artifacts/astroid_1591645081755/work
attrs @ file:///home/conda/feedstock_root/build_artifacts/attrs_1597959372343/work
Babel==2.8.0
backcall @ file:///home/conda/feedstock_root/build_artifacts/backcall_1592338393461/work
backports.functools-lru-cache==1.6.1
beautifulsoup4 @ file:///home/conda/feedstock_root/build_artifacts/beautifulsoup4_1597679909012/work
black @ file:///home/conda/feedstock_root/build_artifacts/black-recipe_1596181673569/work
bleach @ file:///home/conda/feedstock_root/build_artifacts/bleach_1588608214987/work
blis==0.4.1
boto==2.49.0
boto3 @ file:///home/conda/feedstock_root/build_artifacts/boto3_1599083817494/work
botocore @ file:///home/conda/feedstock_root/build_artifacts/botocore_1599079795233/work
brotlipy==0.7.0
bz2file==0.98
cachetools @ file:///home/conda/feedstock_root/build_artifacts/cachetools_1593420445823/work
catalogue==1.0.0
certifi==2020.6.20
cffi @ file:///home/conda/feedstock_root/build_artifacts/cffi_1595805535531/work
chardet==3.0.4
click==7.1.2
cliff @ file:///home/conda/feedstock_root/build_artifacts/cliff_1596473693634/work
cmaes @ file:///home/conda/feedstock_root/build_artifacts/cmaes_1596014311252/work
cmd2 @ file:///home/conda/feedstock_root/build_artifacts/cmd2_1591809284986/work
colorama==0.4.3
colorlog==4.2.1
conllu==4.0
cryptography @ file:///home/conda/feedstock_root/build_artifacts/cryptography_1598621005849/work
css-html-js-minify @ file:///home/conda/feedstock_root/build_artifacts/css-html-js-minify_1589201803800/work
cycler==0.10.0
cymem==2.0.3
Cython @ file:///home/conda/feedstock_root/build_artifacts/cython_1594253470382/work
decorator==4.4.2
defusedxml==0.6.0
docutils==0.15.2
en-core-web-sm @ file:///home/conda/feedstock_root/build_artifacts/spacy-model-en_core_web_1595437534769/work/sm
entrypoints==0.3
filelock==3.0.12
flake8 @ file:///home/conda/feedstock_root/build_artifacts/flake8_1594584774608/work
ftfy==5.8
future==0.18.2
gensim @ file:///home/conda/feedstock_root/build_artifacts/gensim_1589441213562/work
google-api-core @ file:///home/conda/feedstock_root/build_artifacts/google-api-core-split_1597353416133/work
google-auth @ file:///home/conda/feedstock_root/build_artifacts/google-auth_1598972371532/work
google-cloud-core @ file:///home/conda/feedstock_root/build_artifacts/google-cloud-core_1596721852962/work
google-cloud-storage @ file:///home/conda/feedstock_root/build_artifacts/google-cloud-storage_1598557481535/work
google-crc32c @ file:///home/conda/feedstock_root/build_artifacts/google-crc32c_1597069930917/work
google-resumable-media @ file:///home/conda/feedstock_root/build_artifacts/google-resumable-media_1598452053521/work
googleapis-common-protos==1.51.0
grpcio @ file:///home/conda/feedstock_root/build_artifacts/grpcio_1596715635580/work
h5py==2.10.0
idna @ file:///home/conda/feedstock_root/build_artifacts/idna_1593328102638/work
imagesize==1.2.0
importlib-metadata @ file:///home/conda/feedstock_root/build_artifacts/importlib-metadata_1593211369179/work
iniconfig==1.0.1
ipykernel @ file:///home/conda/feedstock_root/build_artifacts/ipykernel_1595446871027/work/dist/ipykernel-5.3.4-py3-none-any.whl
ipython @ file:///home/conda/feedstock_root/build_artifacts/ipython_1598749946943/work
ipython-genutils==0.2.0
isort @ file:///home/conda/feedstock_root/build_artifacts/isort_1599134253980/work
jedi @ file:///home/conda/feedstock_root/build_artifacts/jedi_1595018882455/work
Jinja2==2.11.2
jmespath @ file:///home/conda/feedstock_root/build_artifacts/jmespath_1589369830981/work
joblib @ file:///home/conda/feedstock_root/build_artifacts/joblib_1593624380152/work
json5 @ file:///home/conda/feedstock_root/build_artifacts/json5_1591810480056/work
jsonnet==0.16.0
jsonpickle==1.4.1
jsonschema==3.2.0
jupyter-client @ file:///home/conda/feedstock_root/build_artifacts/jupyter_client_1598486169312/work
jupyter-core==4.6.3
jupyterlab==2.2.6
jupyterlab-server @ file:///home/conda/feedstock_root/build_artifacts/jupyterlab_server_1593951277307/work
kiwisolver==1.2.0
lazy-object-proxy==1.4.3
livereload @ file:///home/conda/feedstock_root/build_artifacts/livereload_1598114753789/work
lunr==0.5.8
lxml @ file:///home/conda/feedstock_root/build_artifacts/lxml_1594322698782/work
Mako @ file:///home/conda/feedstock_root/build_artifacts/mako_1595925083607/work
Markdown @ file:///home/conda/feedstock_root/build_artifacts/markdown_1589366472132/work
MarkupSafe==1.1.1
matplotlib @ file:///home/conda/feedstock_root/build_artifacts/matplotlib-base_1597952254444/work
mccabe==0.6.1
mistune==0.8.4
mkdocs @ file:///home/conda/feedstock_root/build_artifacts/mkdocs_1591017186129/work
mkdocs-material @ file:///home/conda/feedstock_root/build_artifacts/mkdocs-material_1598971568401/work
more-itertools==8.5.0
murmurhash==1.0.2
mypy @ file:///home/conda/feedstock_root/build_artifacts/mypy_1592923020520/work
mypy-extensions==0.4.3
nbconvert==5.6.1
nbformat @ file:///home/conda/feedstock_root/build_artifacts/nbformat_1594060262917/work
networkx==2.5
nltk==3.4.4
notebook @ file:///home/conda/feedstock_root/build_artifacts/notebook_1597285190767/work
numpy @ file:///home/conda/feedstock_root/build_artifacts/numpy_1597938346492/work
olefile==0.46
optuna @ file:///home/conda/feedstock_root/build_artifacts/optuna_1598323699360/work
overrides==3.1.0
packaging @ file:///home/conda/feedstock_root/build_artifacts/packaging_1589925210001/work
pandas @ file:///home/conda/feedstock_root/build_artifacts/pandas_1598294454723/work
pandocfilters==1.4.2
parso @ file:///home/conda/feedstock_root/build_artifacts/parso_1595548966091/work
pathspec==0.8.0
patsy==0.5.1
pbr @ file:///home/conda/feedstock_root/build_artifacts/pbr_1598996708473/work
pexpect==4.8.0
pickleshare==0.7.5
Pillow @ file:///home/conda/feedstock_root/build_artifacts/pillow_1594213010297/work
plac==1.1.3
pluggy==0.13.1
preshed==3.0.2
prettytable==0.7.2
prometheus-client @ file:///home/conda/feedstock_root/build_artifacts/prometheus_client_1590412252446/work
prompt-toolkit @ file:///home/conda/feedstock_root/build_artifacts/prompt-toolkit_1598885455507/work
protobuf==3.13.0
psutil @ file:///home/conda/feedstock_root/build_artifacts/psutil_1594826921622/work
ptyprocess==0.6.0
py==1.9.0
py-rouge==1.1
pyasn1==0.4.8
pyasn1-modules==0.2.7
pycodestyle @ file:///home/conda/feedstock_root/build_artifacts/pycodestyle_1589305246696/work
pycorenlp==0.3.0
pycparser @ file:///home/conda/feedstock_root/build_artifacts/pycparser_1593275161868/work
pyflakes==2.2.0
Pygments==2.6.1
pylint @ file:///home/conda/feedstock_root/build_artifacts/pylint_1598117058668/work
pymdown-extensions @ file:///home/conda/feedstock_root/build_artifacts/pymdown-extensions_1597166028984/work
Pyment==0.3.3
pyneuroner==1.0.8
pyOpenSSL==19.1.0
pyparsing==2.4.7
pyperclip @ file:///home/conda/feedstock_root/build_artifacts/pyperclip_1591810382257/work
pyrsistent==0.16.0
PySocks==1.7.1
pytest==6.0.1
python-dateutil==2.8.1
python-editor==1.0.4
python-magic==0.4.18
python-slugify @ file:///home/conda/feedstock_root/build_artifacts/python-slugify_1593573453419/work
pytz==2020.1
PyYAML==5.3.1
pyzmq==19.0.2
regex @ file:///home/conda/feedstock_root/build_artifacts/regex_1594799371287/work
requests @ file:///home/conda/feedstock_root/build_artifacts/requests_1592425495151/work
rsa @ file:///home/conda/feedstock_root/build_artifacts/rsa_1591996208734/work
s3cmd==2.1.0
s3transfer==0.3.3
sacremoses==0.0.43
scikit-learn @ file:///home/conda/feedstock_root/build_artifacts/scikit-learn_1596546074663/work
scipy @ file:///home/conda/feedstock_root/build_artifacts/scipy_1595583586868/work
seaborn @ file:///home/conda/feedstock_root/build_artifacts/seaborn-base_1591878760859/work
Send2Trash==1.5.0
sentencepiece==0.1.91
six @ file:///home/conda/feedstock_root/build_artifacts/six_1590081179328/work
slugify==0.0.1
smart-open @ file:///home/conda/feedstock_root/build_artifacts/smart_open_1598572939086/work
snowballstemmer==2.0.0
soupsieve @ file:///home/conda/feedstock_root/build_artifacts/soupsieve_1597680516047/work
spacy==2.3.2
Sphinx @ file:///home/conda/feedstock_root/build_artifacts/sphinx_1597405755328/work
sphinx-material @ file:///home/conda/feedstock_root/build_artifacts/sphinx-material_1597663680729/work
sphinxcontrib-applehelp==1.0.2
sphinxcontrib-devhelp==1.0.2
sphinxcontrib-htmlhelp==1.0.3
sphinxcontrib-jsmath==1.0.1
sphinxcontrib-qthelp==1.0.3
sphinxcontrib-serializinghtml==1.1.4
SQLAlchemy @ file:///home/conda/feedstock_root/build_artifacts/sqlalchemy_1597701920245/work
srsly==1.0.2
statsmodels @ file:///home/conda/feedstock_root/build_artifacts/statsmodels_1598551025620/work
stevedore @ file:///home/conda/feedstock_root/build_artifacts/stevedore_1598982656343/work
tensorboardX==2.1
teradata==15.10.0.21
terminado==0.8.3
testpath==0.4.4
text-unidecode==1.3
thinc==7.4.1
threadpoolctl @ file:///tmp/tmp79xdzxkt/threadpoolctl-2.1.0-py3-none-any.whl
tokenizers==0.8.1rc1
toml @ file:///home/conda/feedstock_root/build_artifacts/toml_1589469402899/work
torch==1.6.0
torchvision==0.7.0
tornado==6.0.4
tqdm @ file:///home/conda/feedstock_root/build_artifacts/tqdm_1596476591553/work
traitlets @ file:///home/conda/feedstock_root/build_artifacts/traitlets_1598976315411/work
transformers==3.0.2
typed-ast==1.4.1
typing-extensions @ file:///home/conda/feedstock_root/build_artifacts/typing_extensions_1588470653596/work
Unidecode==1.1.1
urllib3 @ file:///home/conda/feedstock_root/build_artifacts/urllib3_1595434816409/work
wasabi==0.8.0
wcwidth @ file:///home/conda/feedstock_root/build_artifacts/wcwidth_1595859607677/work
webencodings==0.5.1
word2number==1.1
wrapt==1.11.2
XlsxWriter @ file:///home/conda/feedstock_root/build_artifacts/xlsxwriter_1597362230404/work
zipp==3.1.0
Example source:
allennlp train --force --include-package src --serialization-dir tmp/elmo_lm_tran1 configs/bidirectional_language_model_part_notes.jsonnet
Contents of `configs/bidirectional_language_model_part_notes.jsonnet`:
{
dataset_reader: {
type: 'language_model_conll_reader',
//"tokenizer": {
// "type": "just_spaces"
//},
token_indexers: {
tokens: {
type: 'single_id',
},
token_characters: {
type: 'elmo_characters',
},
},
//"max_sequence_length": 400,
//"start_tokens": ["<S>"],
//"end_tokens": ["</S>"]
},
train_data_path: 'data/processed/no_bi_tags/new_model1_train.txt',
model: {
type: 'language_model',
bidirectional: true,
//"num_samples": 8192,
sparse_embeddings: true,
text_field_embedder: {
//"allow_unmatched_keys": true,
token_embedders: {
//tokens: {
// type: 'embedding',
// pretrained_file: 'models/word2vec_newtoken1.txt',
// embedding_dim: 50,
// trainable: false
// },
tokens: {
type: 'empty',
},
token_characters: {
type: 'character_encoding',
embedding: {
num_embeddings: 262,
embedding_dim: 16,
},
encoder: {
type: 'cnn-highway',
activation: 'relu',
embedding_dim: 16,
filters: [
[1, 32],
[2, 32],
[3, 64],
[4, 128],
[5, 256],
[6, 512],
[7, 1024],
],
num_highway: 2,
projection_dim: 512,
projection_location: 'after_highway',
do_layer_norm: true,
},
},
},
},
dropout: 0.1,
contextualizer: {
type: 'bidirectional_language_model_transformer',
input_dim: 512,
hidden_dim: 2048,
num_layers: 6,
dropout: 0.1,
input_dropout: 0.1,
},
},
data_loader: {
batch_size: 64,
},
distributed: {
cuda_devices: std.range(0, 8 - 1),
},
trainer: {
num_epochs: 10,
optimizer: {
type: 'dense_sparse_adam',
},
// TODO(brendanr): Needed with transformer too?
// "grad_norm": 10.0,
learning_rate_scheduler: {
type: 'noam',
// See https://github.com/allenai/calypso/blob/master/calypso/train.py#L401
model_size: 512,
// See https://github.com/allenai/calypso/blob/master/bin/train_transformer_lm1b.py#L51.
// Adjusted based on our sample size relative to Calypso's.
warmup_steps: 6000,
},
//"should_log_learning_rate": true
},
}
The `dataset_reader` and the data itself in the `train_data_path` are custom code and custom data, respectively. I imagine that part is not the issue, since we are able to build a vocabulary and attempt to train. Should be reproducible with any data and dataset_reader.
If I fallback to an older environment with these versions:
allennlp==1.0.0
allennlp-models==1.0.0
and
cudatoolkit 9.2 0
pytorch 1.5.1 py3.7_cuda9.2.148_cudnn7.6.3_0
python 3.7.7 hcf32534_0_cpython
and use the same configs/bidirectional_language_model_part_notes.jsonnet provided above, the language model does train
If, in my PyTorch 1.6 and AllenNLP 1.10RC4 environment, I try setting this environment variable:
export CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
and then try the same command with same jsonnet config file:
allennlp train --force --include-package src --serialization-dir tmp/elmo_lm_tran1 configs/bidirectional_language_model_part_notes.jsonnet
I still see the same RuntimeError: Tensors must be CUDA and dense error.
Output of nvidia-smi:
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 410.48 Driver Version: 410.48 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 Tesla V100-SXM2... On | 00000000:15:00.0 Off | 0 |
| N/A 36C P0 45W / 300W | 0MiB / 32480MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 1 Tesla V100-SXM2... On | 00000000:16:00.0 Off | 0 |
| N/A 36C P0 45W / 300W | 0MiB / 32480MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 2 Tesla V100-SXM2... On | 00000000:3A:00.0 Off | 0 |
| N/A 34C P0 44W / 300W | 0MiB / 32480MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 3 Tesla V100-SXM2... On | 00000000:3B:00.0 Off | 0 |
| N/A 37C P0 43W / 300W | 0MiB / 32480MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 4 Tesla V100-SXM2... On | 00000000:89:00.0 Off | 0 |
| N/A 34C P0 44W / 300W | 0MiB / 32480MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 5 Tesla V100-SXM2... On | 00000000:8A:00.0 Off | 0 |
| N/A 36C P0 44W / 300W | 0MiB / 32480MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 6 Tesla V100-SXM2... On | 00000000:B2:00.0 Off | 0 |
| N/A 34C P0 44W / 300W | 0MiB / 32480MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 7 Tesla V100-SXM2... On | 00000000:B3:00.0 Off | 0 |
| N/A 37C P0 43W / 300W | 0MiB / 32480MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| No running processes found |
+-----------------------------------------------------------------------------+
Thanks @tpanza, I think I found the issue. PR coming soon.
Hi @epwalsh I have updated to 1.1.0 that includes the fix from #4624. Looks like I am now getting a different error, this time with the perplexity metric:
0 | 2020-09-16 16:30:05,371 - INFO - allennlp.training.trainer - Beginning training.
0 | 2020-09-16 16:30:05,371 - INFO - allennlp.training.trainer - Epoch 0/9
0 | 2020-09-16 16:30:05,372 - INFO - allennlp.training.trainer - Worker 0 memory usage MB: 2990.82
0 | 2020-09-16 16:30:05,372 - INFO - allennlp.training.trainer - Worker 1 memory usage MB: 2993.016
0 | 2020-09-16 16:30:05,372 - INFO - allennlp.training.trainer - Worker 2 memory usage MB: 2990.024
0 | 2020-09-16 16:30:05,373 - INFO - allennlp.training.trainer - Worker 3 memory usage MB: 2988.112
0 | 2020-09-16 16:30:05,373 - INFO - allennlp.training.trainer - Worker 4 memory usage MB: 2988.212
0 | 2020-09-16 16:30:05,373 - INFO - allennlp.training.trainer - Worker 5 memory usage MB: 2990.156
0 | 2020-09-16 16:30:05,373 - INFO - allennlp.training.trainer - Worker 6 memory usage MB: 2986.692
0 | 2020-09-16 16:30:05,373 - INFO - allennlp.training.trainer - Worker 7 memory usage MB: 2997.112
reading instances: 0it [00:00, ?it/s]0 | 2020-09-16 16:30:05,751 - INFO - allennlp.training.trainer - GPU 0 memory usage MB: 1614
0 | 2020-09-16 16:30:05,751 - INFO - allennlp.training.trainer - GPU 1 memory usage MB: 1614
0 | 2020-09-16 16:30:05,751 - INFO - allennlp.training.trainer - GPU 2 memory usage MB: 1614
0 | 2020-09-16 16:30:05,751 - INFO - allennlp.training.trainer - GPU 3 memory usage MB: 1614
0 | 2020-09-16 16:30:05,751 - INFO - allennlp.training.trainer - GPU 4 memory usage MB: 1614
0 | 2020-09-16 16:30:05,751 - INFO - allennlp.training.trainer - GPU 5 memory usage MB: 1614
0 | 2020-09-16 16:30:05,751 - INFO - allennlp.training.trainer - GPU 6 memory usage MB: 1614
0 | 2020-09-16 16:30:05,751 - INFO - allennlp.training.trainer - GPU 7 memory usage MB: 1614
reading instances: 0it [00:00, ?it/s]0 | 2020-09-16 16:30:05,753 - INFO - allennlp.training.trainer - Training
0it [00:00, ?it/s] 0it [00:00, ?it/s]0 | 2020-09-16 16:30:05,755 - INFO - src.models.language_model_conll_reader - Loading data from data/processed/no_bi_tags/new_model1_train.txt
reading instances: 0it [00:00, ?it/s]/app/local/anaconda3/envs/torch1.6_allennlp1.1/lib/python3.7/site-packages/allennlp/training/metrics/average.py:36: UserWarning: To copy construct from a tensor, it is recommended to use sourceTensor.clone().detach() or sourceTensor.clone().detach().requires_grad_(True), rather than torch.tensor(sourceTensor).
total_value = torch.tensor(_total_value).to(device)
/app/local/anaconda3/envs/torch1.6_allennlp1.1/lib/python3.7/site-packages/allennlp/training/metrics/average.py:36: UserWarning: To copy construct from a tensor, it is recommended to use sourceTensor.clone().detach() or sourceTensor.clone().detach().requires_grad_(True), rather than torch.tensor(sourceTensor).
total_value = torch.tensor(_total_value).to(device)
/app/local/anaconda3/envs/torch1.6_allennlp1.1/lib/python3.7/site-packages/allennlp/training/metrics/average.py:36: UserWarning: To copy construct from a tensor, it is recommended to use sourceTensor.clone().detach() or sourceTensor.clone().detach().requires_grad_(True), rather than torch.tensor(sourceTensor).
total_value = torch.tensor(_total_value).to(device)
/app/local/anaconda3/envs/torch1.6_allennlp1.1/lib/python3.7/site-packages/allennlp/training/metrics/average.py:36: UserWarning: To copy construct from a tensor, it is recommended to use sourceTensor.clone().detach() or sourceTensor.clone().detach().requires_grad_(True), rather than torch.tensor(sourceTensor).
total_value = torch.tensor(_total_value).to(device)
/app/local/anaconda3/envs/torch1.6_allennlp1.1/lib/python3.7/site-packages/allennlp/training/metrics/average.py:36: UserWarning: To copy construct from a tensor, it is recommended to use sourceTensor.clone().detach() or sourceTensor.clone().detach().requires_grad_(True), rather than torch.tensor(sourceTensor).
total_value = torch.tensor(_total_value).to(device)
/app/local/anaconda3/envs/torch1.6_allennlp1.1/lib/python3.7/site-packages/allennlp/training/metrics/average.py:36: UserWarning: To copy construct from a tensor, it is recommended to use sourceTensor.clone().detach() or sourceTensor.clone().detach().requires_grad_(True), rather than torch.tensor(sourceTensor).
total_value = torch.tensor(_total_value).to(device)
/app/local/anaconda3/envs/torch1.6_allennlp1.1/lib/python3.7/site-packages/allennlp/training/metrics/average.py:36: UserWarning: To copy construct from a tensor, it is recommended to use sourceTensor.clone().detach() or sourceTensor.clone().detach().requires_grad_(True), rather than torch.tensor(sourceTensor).
total_value = torch.tensor(_total_value).to(device)
/app/local/anaconda3/envs/torch1.6_allennlp1.1/lib/python3.7/site-packages/allennlp/training/metrics/average.py:36: UserWarning: To copy construct from a tensor, it is recommended to use sourceTensor.clone().detach() or sourceTensor.clone().detach().requires_grad_(True), rather than torch.tensor(sourceTensor).
total_value = torch.tensor(_total_value).to(device)
0it [00:01, ?it/s]
reading instances: 63it [00:01, 36.85it/s]
reading instances: 63it [00:01, 36.71it/s]
reading instances: 63it [00:01, 36.74it/s]
reading instances: 63it [00:01, 36.71it/s]
reading instances: 63it [00:01, 36.50it/s]
reading instances: 63it [00:01, 36.63it/s]
reading instances: 63it [00:01, 36.57it/s]
reading instances: 63it [00:01, 36.60it/s]
Traceback (most recent call last):
File "/app/local/anaconda3/envs/torch1.6_allennlp1.1/bin/allennlp", line 8, in <module>
sys.exit(run())
File "/app/local/anaconda3/envs/torch1.6_allennlp1.1/lib/python3.7/site-packages/allennlp/__main__.py", line 34, in run
main(prog="allennlp")
File "/app/local/anaconda3/envs/torch1.6_allennlp1.1/lib/python3.7/site-packages/allennlp/commands/__init__.py", line 92, in main
args.func(args)
File "/app/local/anaconda3/envs/torch1.6_allennlp1.1/lib/python3.7/site-packages/allennlp/commands/train.py", line 118, in train_model_from_args
file_friendly_logging=args.file_friendly_logging,
File "/app/local/anaconda3/envs/torch1.6_allennlp1.1/lib/python3.7/site-packages/allennlp/commands/train.py", line 177, in train_model_from_file
file_friendly_logging=file_friendly_logging,
File "/app/local/anaconda3/envs/torch1.6_allennlp1.1/lib/python3.7/site-packages/allennlp/commands/train.py", line 304, in train_model
nprocs=num_procs,
File "/app/local/anaconda3/envs/torch1.6_allennlp1.1/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 200, in spawn
return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
File "/app/local/anaconda3/envs/torch1.6_allennlp1.1/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 158, in start_processes
while not context.join():
File "/app/local/anaconda3/envs/torch1.6_allennlp1.1/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 119, in join
raise Exception(msg)
Exception:
-- Process 5 terminated with the following error:
Traceback (most recent call last):
File "/app/local/anaconda3/envs/torch1.6_allennlp1.1/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 20, in _wrap
fn(i, *args)
File "/app/local/anaconda3/envs/torch1.6_allennlp1.1/lib/python3.7/site-packages/allennlp/commands/train.py", line 443, in _train_worker
metrics = train_loop.run()
File "/app/local/anaconda3/envs/torch1.6_allennlp1.1/lib/python3.7/site-packages/allennlp/commands/train.py", line 505, in run
return self.trainer.train()
File "/app/local/anaconda3/envs/torch1.6_allennlp1.1/lib/python3.7/site-packages/allennlp/training/trainer.py", line 867, in train
train_metrics = self._train_epoch(epoch)
File "/app/local/anaconda3/envs/torch1.6_allennlp1.1/lib/python3.7/site-packages/allennlp/training/trainer.py", line 657, in _train_epoch
cuda_device=self.cuda_device,
File "/app/local/anaconda3/envs/torch1.6_allennlp1.1/lib/python3.7/site-packages/allennlp/training/util.py", line 294, in get_metrics
metrics = model.get_metrics(reset=reset)
File "/app/local/anaconda3/envs/torch1.6_allennlp1.1/lib/python3.7/site-packages/allennlp_models/lm/models/language_model.py", line 325, in get_metrics
return {"perplexity": self._perplexity.get_metric(reset=reset)}
File "/app/local/anaconda3/envs/torch1.6_allennlp1.1/lib/python3.7/site-packages/allennlp/training/metrics/perplexity.py", line 32, in get_metric
perplexity = float(torch.exp(average_loss))
TypeError: exp(): argument 'input' (position 1) must be Tensor, not float
Nevermind... looks like this was fixed by #4632 . I've applied that patch on top of the 1.1.0 code, and the language model was able to train.
It looks like this is also an issue with categorical accuracy in basic_classifier. It seems to be a very easy fix, just change line 98 or allennlp.training.metrics.categorical_accuracy from:
_total_count = torch.tensor(gold_labels.numel())
to:
_total_count = torch.tensor(gold_labels.numel(), device=gold_labels.device)