Hi, when I trained a new Levenshtein Transformer on my own dataset (preprocessed by process.py with '--joined-dictionary') on two GPUs, I got the error below. Could you help me find the reason? Thanks very much.
File "train.py", line 343, in
cli_main()
File "train.py", line 335, in cli_main
nprocs=args.distributed_world_size,
File "/root/anaconda3/envs/len_transform/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 171, in spawn
while not spawn_context.join():
File "/root/anaconda3/envs/len_transform/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 107, in join
(error_index, name)
Exception: process 1 terminated with signal SIGSEGV
And when I trained the model on 3 GPUs, the error is:
170848, 512
Traceback (most recent call last):
File "train.py", line 343, in
cli_main()
File "train.py", line 335, in cli_main
nprocs=args.distributed_world_size,
File "/root/anaconda3/envs/len_transform/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 171, in spawn
while not spawn_context.join():
File "/root/anaconda3/envs/len_transform/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 107, in join
(error_index, name)
Exception: process 2 terminated with signal SIGSEGV
Traceback (most recent call last):
File "
File "/root/anaconda3/envs/len_transform/lib/python3.6/multiprocessing/spawn.py", line 105, in spawn_main
exitcode = _main(fd)
File "/root/anaconda3/envs/len_transform/lib/python3.6/multiprocessing/spawn.py", line 115, in _main
self = reduction.pickle.load(from_parent)
_pickle.UnpicklingError: pickle data was truncated
Oh, thank you.
Similar issues:
- #1308
- #1305
Hi @sdll, issues #1308 and #1305 do not solve the problem.
This should be fixed now, can you please try again?
Hi, thanks for your reply. I downloaded the latest version of fairseq and preprocessed the data with '--joined-dictionary', but when I train a new model, the problem is:
Traceback (most recent call last):
File "train.py", line 337, in
cli_main()
File "train.py", line 329, in cli_main
nprocs=args.distributed_world_size,
File "/root/anaconda3/envs/len_transform/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 171, in spawn
while not spawn_context.join():
File "/root/anaconda3/envs/len_transform/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 107, in join
(error_index, name)
Exception: process 3 terminated with signal SIGSEGV
Could you help me?
Seems to be a multi-GPU error? Can you try running on a single GPU for debugging?
Hi, I have tried on a single GPU: when the model finishes loading the training data, the program exits without printing anything, and I can see the GPU is not being used.
@xiaoshengjun just following up. Is the problem still there in the recent refactored codebase?
@MultiPath this still occurs in the recent codebase.
It happens even without apex, with a smaller number of max tokens, and with a higher gcc version and nvcc installed.
I installed fairseq via python setup.py build_ext --inplace.
Even on a single GPU, it is a segmentation fault.
Please recheck.
I just ran the example command in the README and it works fine.
Can you confirm that libnat is built properly? Please try running the following:
$ python -c 'from fairseq import libnat; print(libnat.suggested_ed2_path([[1, 2, 3, 4]], [[1, 3, 4, 5, 6]], 0))'
You should see something like: [[[0], [0], [0], [0], [5, 6], [0, 1, 0, 0]]]
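For context, suggested_ed2_path derives edit operations between token sequences from a Levenshtein-style alignment. A rough pure-Python sketch of the underlying edit-distance computation (an illustration only, not libnat's actual C++ implementation):

```python
def edit_distance(src, tgt):
    """Classic Levenshtein DP: minimum insertions/deletions/substitutions
    needed to turn src into tgt. Illustrative only -- libnat's compiled
    code computes richer per-position edit operations, not just the cost."""
    m, n = len(src), len(tgt)
    # dp[i][j] = distance between src[:i] and tgt[:j]
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i
    for j in range(n + 1):
        dp[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if src[i - 1] == tgt[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,          # delete src[i-1]
                           dp[i][j - 1] + 1,          # insert tgt[j-1]
                           dp[i - 1][j - 1] + cost)   # match/substitute
    return dp[m][n]

print(edit_distance([1, 2, 3, 4], [1, 3, 4, 5, 6]))  # -> 3
```

If the libnat command above segfaults instead of printing the expected list, the compiled extension itself is broken (e.g. miscompiled by an incompatible compiler), not the algorithm.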
Thanks for your reply. I ran the command you gave above and got:
from fairseq import libnat
print(libnat.suggested_ed2_path([[1, 2, 3, 4]], [[1, 3, 4, 5, 6]], 0))
Segmentation fault
So is something wrong in my environment?
@xiaoshengjun just following up. Is the problem still there in the recent refactored codebase?
Yes, I also have this problem.
Yes, libnat has not been built properly. Can you please run python setup.py build_ext --inplace and share the output here?
Hi, the output is:
which: no nvcc in (/root/anaconda3/envs/fq09py12/bin:/root/anaconda3/condabin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/opt/ibutils/bin:/usr/bin:/usr/local/bin:/usr/libexec/git-core:/root/bin)
running build_ext
/root/anaconda3/envs/fq09py12/lib/python3.6/site-packages/torch/utils/cpp_extension.py:196: UserWarning:
!! WARNING !!
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
Your compiler (g++ 4.8.5) may be ABI-incompatible with PyTorch!
Please use a compiler that is ABI-compatible with GCC 4.9 and above.
See https://gcc.gnu.org/onlinedocs/libstdc++/manual/abi.html.
See https://gist.github.com/goldsborough/d466f43e8ffc948ff92de7486c5216d6
for instructions on how to install GCC 4.9 or higher.
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
!! WARNING !!
warnings.warn(ABI_INCOMPATIBILITY_WARNING.format(compiler))
skipping 'fairseq/data/data_utils_fast.cpp' Cython extension (up-to-date)
skipping 'fairseq/data/token_block_utils_fast.cpp' Cython extension (up-to-date)
copying build/lib.linux-x86_64-3.6/fairseq/libbleu.cpython-36m-x86_64-linux-gnu.so -> fairseq
copying build/lib.linux-x86_64-3.6/fairseq/data/data_utils_fast.cpython-36m-x86_64-linux-gnu.so -> fairseq/data
copying build/lib.linux-x86_64-3.6/fairseq/data/token_block_utils_fast.cpython-36m-x86_64-linux-gnu.so -> fairseq/data
copying build/lib.linux-x86_64-3.6/fairseq/libnat.cpython-36m-x86_64-linux-gnu.so -> fairseq
Please follow the instructions in the warning message :)
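The warning above comes from PyTorch's cpp_extension module, which flags compilers older than GCC 4.9 as potentially ABI-incompatible. A small sketch for checking the compiler version before building (the parsing helper is my own, not part of fairseq or PyTorch):

```python
import re
import subprocess

MIN_GCC = (4, 9)  # minimum ABI-compatible version per the warning above

def parse_gcc_version(version_output):
    """Extract (major, minor) from `g++ --version` output, e.g.
    'g++ (GCC) 4.8.5 ...' -> (4, 8). Returns None if no version found."""
    m = re.search(r"(\d+)\.(\d+)\.\d+", version_output)
    return (int(m.group(1)), int(m.group(2))) if m else None

def check_compiler():
    # universal_newlines for Python 3.6 compatibility (text= is 3.7+)
    out = subprocess.run(["g++", "--version"], stdout=subprocess.PIPE,
                         universal_newlines=True).stdout
    ver = parse_gcc_version(out)
    if ver is None or ver < MIN_GCC:
        print("g++ %s may be ABI-incompatible with PyTorch; need >= %s"
              % (ver, (MIN_GCC,)))
    else:
        print("g++ %s looks OK" % (ver,))

print(parse_gcc_version("g++ (GCC) 4.8.5 20150623"))  # -> (4, 8)
```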
My output:
running build_ext
skipping 'fairseq/data/data_utils_fast.cpp' Cython extension (up-to-date)
skipping 'fairseq/data/token_block_utils_fast.cpp' Cython extension (up-to-date)
copying build/lib.linux-x86_64-3.7/fairseq/libbleu.cpython-37m-x86_64-linux-gnu.so -> fairseq
copying build/lib.linux-x86_64-3.7/fairseq/data/data_utils_fast.cpython-37m-x86_64-linux-gnu.so -> fairseq/data
copying build/lib.linux-x86_64-3.7/fairseq/data/token_block_utils_fast.cpython-37m-x86_64-linux-gnu.so -> fairseq/data
copying build/lib.linux-x86_64-3.7/fairseq/libnat.cpython-37m-x86_64-linux-gnu.so -> fairseq
And the snippet python -c 'from fairseq import libnat; print(libnat.suggested_ed2_path([[1, 2, 3, 4]], [[1, 3, 4, 5, 6]], 0))' gives a segmentation fault.
I recloned the repo and tried installing, and it gives the following error:
Complete output (12 lines):
running develop
running egg_info
writing fairseq.egg-info/PKG-INFO
writing dependency_links to fairseq.egg-info/dependency_links.txt
writing entry points to fairseq.egg-info/entry_points.txt
writing requirements to fairseq.egg-info/requires.txt
writing top-level names to fairseq.egg-info/top_level.txt
reading manifest file 'fairseq.egg-info/SOURCES.txt'
writing manifest file 'fairseq.egg-info/SOURCES.txt'
running build_ext
cythoning fairseq/data/data_utils_fast.pyx to fairseq/data/data_utils_fast.cpp
error: /home/workspace/nat/fairseq/fairseq/data/data_utils_fast.pyx
Steps:
Can you install gcc from the main channel instead? Something like:
conda install gcc_linux-64 gxx_linux-64
There should be versions for other platforms/architectures too.
Here's what I just ran and it works as expected:
The problem was solved after I updated the version of gcc, thanks.
Solved, thanks.
@myleott when I run the scaling-NMT translation example (https://github.com/pytorch/fairseq/blame/master/examples/scaling_nmt/README.md#L40) in the above conda environment with --encoder-layerdrop 0.2, GPU utilisation goes to 100% and training does not start (it is stuck), but with --ddp-backend no_c10d it works fine.
Whereas without --encoder/decoder-layerdrop it works with or without the ddp-backend flag.
Is this intended behaviour?
Yes, no_c10d is required when some of the model parameters are not used in the forward pass, as is the case with LayerDrop.
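To illustrate (a pure-Python sketch, not fairseq's implementation): LayerDrop randomly skips whole layers on each forward pass, so a skipped layer's parameters receive no gradient that step. The default c10d DDP reducer expects a gradient for every registered parameter and can hang waiting for them, while no_c10d tolerates unused parameters.

```python
import random

def layerdrop_forward(x, layers, p_drop, rng):
    """Apply each layer unless it is dropped with probability p_drop.
    Returns the output plus the set of layer indices actually used;
    in real DDP training, skipped layers' parameters get no gradient."""
    used = set()
    for i, layer in enumerate(layers):
        if rng.random() >= p_drop:  # keep this layer
            x = layer(x)
            used.add(i)
    return x, used

# Toy "layers" standing in for transformer blocks
layers = [lambda v: v + 1, lambda v: v * 2, lambda v: v - 3]
rng = random.Random(0)
out, used = layerdrop_forward(10, layers, p_drop=0.5, rng=rng)
skipped = set(range(len(layers))) - used
print(out, sorted(used), sorted(skipped))
```

With this seed, layer 2 happens to be skipped, so its (hypothetical) parameters would be unused in that step, which is exactly the situation the c10d backend does not handle.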
Can you install gcc from the main channel instead? Something like:
conda install gcc_linux-64 gxx_linux-64
There should be versions for other platforms/architectures too.
Here's what I just ran and it works as expected:
- conda create -n nat python=3.7 && conda activate nat
- git clone https://github.com/pytorch/fairseq.git && cd fairseq
- conda install gcc_linux-64 gxx_linux-64
- pip install torch && pip install cython
- python setup.py build_ext --inplace
- python -c 'from fairseq import libnat; print(libnat.suggested_ed2_path([[1, 2, 3, 4]], [[1, 3, 4, 5, 6]], 0))'
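As an extra sanity check after step 5, one can confirm the in-place build actually copied the compiled extensions into the source tree (a hypothetical helper, not part of fairseq):

```python
import glob
import os

def find_built_extensions(pkg_dir):
    """List compiled extension modules (*.so) under a package directory;
    a successful in-place build should have copied libnat/libbleu here."""
    pattern = os.path.join(pkg_dir, "**", "*.so")
    return sorted(os.path.basename(p)
                  for p in glob.glob(pattern, recursive=True))

# e.g. find_built_extensions("fairseq") should include something like
# 'libnat.cpython-37m-x86_64-linux-gnu.so' after a successful build
```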
Great, it works!