Hi, when I trained a new Levenshtein Transformer on my own dataset (preprocessed by process.py with '--joined-dictionary') on two GPUs, I got the error below. Could you help me find the reason? Thanks very much.
File "train.py", line 343, in
cli_main()
File "train.py", line 335, in cli_main
nprocs=args.distributed_world_size,
File "/root/anaconda3/envs/len_transform/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 171, in spawn
while not spawn_context.join():
File "/root/anaconda3/envs/len_transform/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 107, in join
(error_index, name)
Exception: process 1 terminated with signal SIGSEGV
And when I trained the model on 3 GPUs, the error is:
170848, 512
Traceback (most recent call last):
File "train.py", line 343, in
cli_main()
File "train.py", line 335, in cli_main
nprocs=args.distributed_world_size,
File "/root/anaconda3/envs/len_transform/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 171, in spawn
while not spawn_context.join():
File "/root/anaconda3/envs/len_transform/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 107, in join
(error_index, name)
Exception: process 2 terminated with signal SIGSEGV
Traceback (most recent call last):
File "
File "/root/anaconda3/envs/len_transform/lib/python3.6/multiprocessing/spawn.py", line 105, in spawn_main
exitcode = _main(fd)
File "/root/anaconda3/envs/len_transform/lib/python3.6/multiprocessing/spawn.py", line 115, in _main
self = reduction.pickle.load(from_parent)
_pickle.UnpicklingError: pickle data was truncated
Oh, thank you.
Similar issues:
- #1308
- #1305
Hi @sdll, issues #1308 and #1305 do not solve the problem.
This should be fixed now, can you please try again?
Hi, thanks for your reply. I downloaded the latest version of fairseq and preprocessed the data with '--joined-dictionary', but when I train a new model, the problem is:
Traceback (most recent call last):
File "train.py", line 337, in
cli_main()
File "train.py", line 329, in cli_main
nprocs=args.distributed_world_size,
File "/root/anaconda3/envs/len_transform/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 171, in spawn
while not spawn_context.join():
File "/root/anaconda3/envs/len_transform/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 107, in join
(error_index, name)
Exception: process 3 terminated with signal SIGSEGV
Could you help me?
Seems to be a multi-GPU error? Can you try running on a single GPU for debugging?
Hi, I have tried on a single GPU: when the model finishes loading the training data, the program exits without printing anything, and I can see the GPU is not being used.
@xiaoshengjun just following up. Is the problem still there in the recent refactored codebase?
@MultiPath this still occurs in the recent codebase.
It happens even without apex, with a smaller number of max tokens, and with a higher gcc version and nvcc installed.
I installed fairseq via python setup.py build_ext --inplace.
Even on a single GPU, it is a segmentation fault.
Please recheck.
I just ran the example command in the README and it works fine.
Can you confirm that libnat is built properly? Please try running the following:
$ python -c 'from fairseq import libnat; print(libnat.suggested_ed2_path([[1, 2, 3, 4]], [[1, 3, 4, 5, 6]], 0))'
You should see something like: [[[0], [0], [0], [0], [5, 6], [0, 1, 0, 0]]]
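For context, suggested_ed2_path derives edit operations between token sequences from a Levenshtein-style alignment. A rough pure-Python sketch of the underlying edit-distance computation (an illustration only, not libnat's actual C++ implementation):

```python
def edit_distance(src, tgt):
    """Classic Levenshtein DP: minimum insertions/deletions/substitutions
    needed to turn src into tgt. Illustrative only -- libnat's compiled
    code computes richer per-position edit operations, not just the cost."""
    m, n = len(src), len(tgt)
    # dp[i][j] = distance between src[:i] and tgt[:j]
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i
    for j in range(n + 1):
        dp[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if src[i - 1] == tgt[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,          # delete src[i-1]
                           dp[i][j - 1] + 1,          # insert tgt[j-1]
                           dp[i - 1][j - 1] + cost)   # match/substitute
    return dp[m][n]

print(edit_distance([1, 2, 3, 4], [1, 3, 4, 5, 6]))  # -> 3
```

If the libnat command above segfaults instead of printing the expected list, the compiled extension itself is broken (e.g. miscompiled by an incompatible compiler), not the algorithm.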
Thanks for your reply. I ran the command you gave above and got:
from fairseq import libnat
print(libnat.suggested_ed2_path([[1, 2, 3, 4]], [[1, 3, 4, 5, 6]], 0))
Segmentation fault
So is something wrong in my environment?
@xiaoshengjun just following up. Is the problem still there in the recent refactored codebase?
Yes, I also have this problem.
Yes, libnat has not been built properly. Can you please run python setup.py build_ext --inplace and share the output here?
Hi, the output is:
which: no nvcc in (/root/anaconda3/envs/fq09py12/bin:/root/anaconda3/condabin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/opt/ibutils/bin:/usr/bin:/usr/local/bin:/usr/libexec/git-core:/root/bin)
running build_ext
/root/anaconda3/envs/fq09py12/lib/python3.6/site-packages/torch/utils/cpp_extension.py:196: UserWarning:
!! WARNING !!
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
Your compiler (g++ 4.8.5) may be ABI-incompatible with PyTorch!
Please use a compiler that is ABI-compatible with GCC 4.9 and above.
See https://gcc.gnu.org/onlinedocs/libstdc++/manual/abi.html.
See https://gist.github.com/goldsborough/d466f43e8ffc948ff92de7486c5216d6
for instructions on how to install GCC 4.9 or higher.
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
!! WARNING !!
warnings.warn(ABI_INCOMPATIBILITY_WARNING.format(compiler))
skipping 'fairseq/data/data_utils_fast.cpp' Cython extension (up-to-date)
skipping 'fairseq/data/token_block_utils_fast.cpp' Cython extension (up-to-date)
copying build/lib.linux-x86_64-3.6/fairseq/libbleu.cpython-36m-x86_64-linux-gnu.so -> fairseq
copying build/lib.linux-x86_64-3.6/fairseq/data/data_utils_fast.cpython-36m-x86_64-linux-gnu.so -> fairseq/data
copying build/lib.linux-x86_64-3.6/fairseq/data/token_block_utils_fast.cpython-36m-x86_64-linux-gnu.so -> fairseq/data
copying build/lib.linux-x86_64-3.6/fairseq/libnat.cpython-36m-x86_64-linux-gnu.so -> fairseq
Please follow the instructions in the warning message :)
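The warning above comes from PyTorch's cpp_extension module, which flags compilers older than GCC 4.9 as potentially ABI-incompatible. A small sketch for checking the compiler version before building (the parsing helper is my own, not part of fairseq or PyTorch):

```python
import re
import subprocess

MIN_GCC = (4, 9)  # minimum ABI-compatible version per the warning above

def parse_gcc_version(version_output):
    """Extract (major, minor) from `g++ --version` output, e.g.
    'g++ (GCC) 4.8.5 ...' -> (4, 8). Returns None if no version found."""
    m = re.search(r"(\d+)\.(\d+)\.\d+", version_output)
    return (int(m.group(1)), int(m.group(2))) if m else None

def check_compiler():
    # universal_newlines for Python 3.6 compatibility (text= is 3.7+)
    out = subprocess.run(["g++", "--version"], stdout=subprocess.PIPE,
                         universal_newlines=True).stdout
    ver = parse_gcc_version(out)
    if ver is None or ver < MIN_GCC:
        print("g++ %s may be ABI-incompatible with PyTorch; need >= %s"
              % (ver, (MIN_GCC,)))
    else:
        print("g++ %s looks OK" % (ver,))

print(parse_gcc_version("g++ (GCC) 4.8.5 20150623"))  # -> (4, 8)
```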
My output:
running build_ext
skipping 'fairseq/data/data_utils_fast.cpp' Cython extension (up-to-date)
skipping 'fairseq/data/token_block_utils_fast.cpp' Cython extension (up-to-date)
copying build/lib.linux-x86_64-3.7/fairseq/libbleu.cpython-37m-x86_64-linux-gnu.so -> fairseq
copying build/lib.linux-x86_64-3.7/fairseq/data/data_utils_fast.cpython-37m-x86_64-linux-gnu.so -> fairseq/data
copying build/lib.linux-x86_64-3.7/fairseq/data/token_block_utils_fast.cpython-37m-x86_64-linux-gnu.so -> fairseq/data
copying build/lib.linux-x86_64-3.7/fairseq/libnat.cpython-37m-x86_64-linux-gnu.so -> fairseq
And the snippet python -c 'from fairseq import libnat; print(libnat.suggested_ed2_path([[1, 2, 3, 4]], [[1, 3, 4, 5, 6]], 0))' gives a segmentation fault.
I recloned the repo and tried installing, and it gives the following error:
Complete output (12 lines):
running develop
running egg_info
writing fairseq.egg-info/PKG-INFO
writing dependency_links to fairseq.egg-info/dependency_links.txt
writing entry points to fairseq.egg-info/entry_points.txt
writing requirements to fairseq.egg-info/requires.txt
writing top-level names to fairseq.egg-info/top_level.txt
reading manifest file 'fairseq.egg-info/SOURCES.txt'
writing manifest file 'fairseq.egg-info/SOURCES.txt'
running build_ext
cythoning fairseq/data/data_utils_fast.pyx to fairseq/data/data_utils_fast.cpp
error: /home/workspace/nat/fairseq/fairseq/data/data_utils_fast.pyx
Steps:
Can you install gcc from the main channel instead? Something like:
conda install gcc_linux-64 gxx_linux-64
There should be versions for other platforms/architectures too.
Here's what I just ran and it works as expected:
The problem was solved after I updated the version of gcc, thanks.
Solved, thanks.
@myleott when I run the scaling-NMT translation example (https://github.com/pytorch/fairseq/blame/master/examples/scaling_nmt/README.md#L40) in the above conda environment with --encoder-layerdrop 0.2, GPU utilisation goes to 100% and training does not start (it is stuck), but with --ddp-backend no_c10d it works fine.
Whereas without --encoder/decoder-layerdrop it works with or without the ddp-backend flag.
Is this intended behaviour?
Yes, no_c10d is required when some of the model parameters are not used in the forward pass, as is the case with LayerDrop.
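To illustrate (a pure-Python sketch, not fairseq's implementation): LayerDrop randomly skips whole layers on each forward pass, so a skipped layer's parameters receive no gradient that step. The default c10d DDP reducer expects a gradient for every registered parameter and can hang waiting for them, while no_c10d tolerates unused parameters.

```python
import random

def layerdrop_forward(x, layers, p_drop, rng):
    """Apply each layer unless it is dropped with probability p_drop.
    Returns the output plus the set of layer indices actually used;
    in real DDP training, skipped layers' parameters get no gradient."""
    used = set()
    for i, layer in enumerate(layers):
        if rng.random() >= p_drop:  # keep this layer
            x = layer(x)
            used.add(i)
    return x, used

# Toy "layers" standing in for transformer blocks
layers = [lambda v: v + 1, lambda v: v * 2, lambda v: v - 3]
rng = random.Random(0)
out, used = layerdrop_forward(10, layers, p_drop=0.5, rng=rng)
skipped = set(range(len(layers))) - used
print(out, sorted(used), sorted(skipped))
```

With this seed, layer 2 happens to be skipped, so its (hypothetical) parameters would be unused in that step, which is exactly the situation the c10d backend does not handle.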
Can you install gcc from the main channel instead? Something like:
conda install gcc_linux-64 gxx_linux-64
There should be versions for other platforms/architectures too.
Here's what I just ran and it works as expected:
- conda create -n nat python=3.7 && conda activate nat
- git clone https://github.com/pytorch/fairseq.git && cd fairseq
- conda install gcc_linux-64 gxx_linux-64
- pip install torch && pip install cython
- python setup.py build_ext --inplace
- python -c 'from fairseq import libnat; print(libnat.suggested_ed2_path([[1, 2, 3, 4]], [[1, 3, 4, 5, 6]], 0))'
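As an extra sanity check after step 5, one can confirm the in-place build actually copied the compiled extensions into the source tree (a hypothetical helper, not part of fairseq):

```python
import glob
import os

def find_built_extensions(pkg_dir):
    """List compiled extension modules (*.so) under a package directory;
    a successful in-place build should have copied libnat/libbleu here."""
    pattern = os.path.join(pkg_dir, "**", "*.so")
    return sorted(os.path.basename(p)
                  for p in glob.glob(pattern, recursive=True))

# e.g. find_built_extensions("fairseq") should include something like
# 'libnat.cpython-37m-x86_64-linux-gnu.so' after a successful build
```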
Great, it works!