Pytorch-lightning: ValueError: bad value(s) in fds_to_keep when attempting DDP

Created on 21 Nov 2019 · 31 comments · Source: PyTorchLightning/pytorch-lightning

I can't get DDP working without getting the following error:

Traceback (most recent call last):
  File "train.py", line 86, in <module>
    main(config)
  File "train.py", line 41, in main
    trainer.fit(model)
  File "/home/ubuntu/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/pytorch_lightning/trainer/trainer.py", line 343, in fit
    mp.spawn(self.ddp_train, nprocs=self.num_gpus, args=(model,))
  File "/home/ubuntu/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 162, in spawn
    process.start()
  File "/home/ubuntu/anaconda3/envs/pytorch_p36/lib/python3.6/multiprocessing/process.py", line 105, in start
    self._popen = self._Popen(self)
  File "/home/ubuntu/anaconda3/envs/pytorch_p36/lib/python3.6/multiprocessing/context.py", line 284, in _Popen
    return Popen(process_obj)
  File "/home/ubuntu/anaconda3/envs/pytorch_p36/lib/python3.6/multiprocessing/popen_spawn_posix.py", line 32, in __init__
    super().__init__(process_obj)
  File "/home/ubuntu/anaconda3/envs/pytorch_p36/lib/python3.6/multiprocessing/popen_fork.py", line 19, in __init__
    self._launch(process_obj)
  File "/home/ubuntu/anaconda3/envs/pytorch_p36/lib/python3.6/multiprocessing/popen_spawn_posix.py", line 59, in _launch
    cmd, self._fds)
  File "/home/ubuntu/anaconda3/envs/pytorch_p36/lib/python3.6/multiprocessing/util.py", line 417, in spawnv_passfds
    False, False, None)
ValueError: bad value(s) in fds_to_keep

What I have tried and that didn't work:

  • Python 3.7
  • Python 3.6
  • pytorch 1.1.0
  • pytorch 1.2.0
  • Downgrading scikit-learn, which according to search results had helped in unrelated projects
  • Lightning 0.5.3.2
  • Lightning master version
  • Lightning 0.5.2.1
  • CUDA 9.0
  • CUDA 9.2
  • CUDA 10.0
  • Tried removing visdom from the project

The error occurs on two servers I tried on, one with 4 Titan X cards and one with 8 Tesla V100 running Ubuntu 18.04.3 LTS.

I suspect that something in my model is triggering it and would appreciate ideas, though I cannot share the source code. The model works in dp and single-GPU mode.

bug / fix information needed

All 31 comments

I suspect you're right, it probably is your model. mp.spawn is most likely the culprit. One thing you can try is changing how the processes are started via set_start_method. https://docs.python.org/3.5/library/multiprocessing.html#contexts-and-start-methods
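
A minimal sketch of that suggestion, assuming the start method is set once in the main module before the Trainer is built ("fork", "spawn" and "forkserver" are the options on Linux):

    import multiprocessing as mp

    if __name__ == "__main__":
        # must run before any worker processes are created
        mp.set_start_method("forkserver", force=True)
        # ... build the LightningModule and Trainer, then trainer.fit(model)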

If that doesn't work, I have a change that allows you to bypass mp.spawn and use torch.distributed.launch for DDP, which I'd like to push upstream once I have enough time. It works by running a separate process for each GPU, which bypasses some limitations of the Python interpreter.

@jeffling should we add a hook for the mp.spawn call so ppl can do whatever they want?

@williamFalcon Absolutely, I think that would be the right thing to do.

This is the relevant part of the patched version of the fit function that we are running, in case it helps anyone else reading this.

        elif self.use_ddp:
            if self.is_slurm_managing_tasks:
                task = int(os.environ["SLURM_LOCALID"])
                self.ddp_train(task, model)
            else:
                # This is the custom part - we disabled mp.spawn.
                assert "LOCAL_RANK" in os.environ, (
                    "LOCAL_RANK not found in os.environ. "
                    "Make sure you're running with the torch.distributed.launch utility."
                )
                rank = int(os.environ["LOCAL_RANK"])
                if rank > 0:
                    self.show_progress_bar = False
                self.ddp_train(rank, model)

We could have a callback that allows people to override everything under use_ddp.

I also think that using a process per GPU has some real benefits, and it would be good for the framework to support it out of the box, so we can allow for callback override _and_ also support this use case in the default callback :)

Just in case people want to use the snippet before we support this officially:

You need to set Trainer.gpus to your world_size, and Trainer.distributed_backend to ddp.
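
For reference, a minimal sketch of that Trainer configuration, using the argument names from the Lightning versions discussed in this thread (4 is a placeholder for your world size, i.e. one process per GPU):

    from pytorch_lightning import Trainer

    trainer = Trainer(
        gpus=4,                      # world_size
        distributed_backend="ddp",
    )
    trainer.fit(model)               # model is your LightningModule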

In your module, you need the following overrides as well:

    def configure_ddp(self, model, device_ids):
        """
        Configure to use a single GPU set on local rank.

        Must return model.
        :param model:
        :param device_ids:
        :return: DDP wrapped model
        """
        device_id = f"cuda:{os.environ['LOCAL_RANK']}"

        model = LightningDistributedDataParallel(
            model,
            device_ids=[device_id],
            output_device=device_id,
            find_unused_parameters=True,
        )

        return model

    def init_ddp_connection(self, proc_rank, world_size):
        """
        Connect all procs in the world using the env:// init
        Use the first node as the root address
        """

        import torch.distributed as dist

        dist.init_process_group("nccl", init_method="env://")

        # Explicitly setting seed to make sure that models created in two processes
        # start from same random weights and biases.
        # TODO(jeffling): I'm pretty sure we need to set other seeds as well?
        print(f"Setting torch manual seed to {FIXED_SEED} for DDP.")
        torch.manual_seed(FIXED_SEED)

@jeffling any tips on bypassing mp.spawn and using PyTorch's torch.distributed.launch?

@armancohan You can use the code snippets and instructions in my two comments above, and then just use the launch utility.
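
For concreteness, a rough example of such a launch (train.py and the process count are placeholders; depending on your PyTorch version you may also need the --use_env flag so that LOCAL_RANK is exported as an environment variable rather than passed to the script as a --local_rank argument):

    python -m torch.distributed.launch --nproc_per_node=4 train.py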

Thanks @jeffling, this was very helpful.

Just to make sure I understand, @jeffling - currently pytorch-lightning does NOT work with 'ddp' out of the box, because the framework doesn't propagate the seed value set in the initial process to the spawned processes (so each process in the distributed training may end up with a different seed).
The only way to overcome this at the moment is to touch the library code directly and set the seed value manually.
Is the above correct? I'm asking because I'm seeing very different results when training my GAN network with ddp versus on a single GPU, and this might be the reason.

@kwanUm You can set the seed value inside your lightning module as well.

There are a few things that lightning can't or doesn't currently do, but a lot of these are dependent on your model and data (a rough sketch of the first two follows the list):

  • making sure seed values are correct (this is something lightning could help with but doesn't yet)
  • the distributed sampler
  • depending on your model, sync batch norm
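
A rough sketch of handling the first two points yourself inside the LightningModule (SEED, self._train_dataset and hparams.batch_size are placeholders for your own values):

    import random

    import numpy as np
    import pytorch_lightning as pl
    import torch
    from torch.utils.data import DataLoader
    from torch.utils.data.distributed import DistributedSampler

    SEED = 1234  # placeholder

    class MyModule(pl.LightningModule):
        def __init__(self, hparams):
            super().__init__()
            # seed before any layers are created, so every DDP process
            # starts from identical random weights
            random.seed(SEED)
            np.random.seed(SEED)
            torch.manual_seed(SEED)
            torch.cuda.manual_seed_all(SEED)
            self.hparams = hparams
            # ... build layers here ...

        def train_dataloader(self):
            # shard the data across processes instead of letting each
            # process iterate over the full dataset
            sampler = DistributedSampler(self._train_dataset)
            return DataLoader(
                self._train_dataset,
                batch_size=self.hparams.batch_size,
                sampler=sampler,
            )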

I have the same issue, can't make ddp work.

The solution proposed by @jeffling has no effect on my machine.

Any idea would be highly appreciated.

Same here. First of all, I couldn't understand the reason. It would be much appreciated if someone could explain the meaning of the error.

@Colanim @kyoungrok0517 could you give details about your machines and environments?

@Borda Here's my environment

PyTorch version: 1.4.0
Is debug build: No
CUDA used to build PyTorch: 10.1

OS: Ubuntu 18.04.4 LTS
GCC version: (Ubuntu 7.5.0-3ubuntu1~18.04) 7.5.0
CMake version: Could not collect

Python version: 3.7
Is CUDA available: Yes
CUDA runtime version: 10.1.243
GPU models and configuration:
GPU 0: TITAN V
GPU 1: TITAN V

Nvidia driver version: 440.64
cuDNN version: /usr/lib/x86_64-linux-gnu/libcudnn.so.7.6.4

Versions of relevant libraries:
[pip] numpy==1.17.2
[pip] numpydoc==0.9.1
[pip] pytorch-lightning==0.7.1
[pip] torch==1.4.0
[pip] torchtext==0.5.0
[pip] torchvision==0.5.0
[conda] blas                      1.0                         mkl
[conda] mkl                       2019.4                      243
[conda] mkl-service               2.3.0            py37he904b0f_0
[conda] mkl_fft                   1.0.14           py37ha843d7b_0
[conda] mkl_random                1.1.0            py37hd6b4f25_0
[conda] pytorch                   1.4.0           py3.7_cuda10.1.243_cudnn7.6.3_0    pytorch
[conda] pytorch-lightning         0.7.1                    pypi_0    pypi
[conda] torchtext                 0.5.0                    pypi_0    pypi
[conda] torchvision               0.5.0                py37_cu101    pytorch

@jeffling may you check?

I'm experiencing the same problem in any multi-gpu environment, not only with two Titan Vs.

@kyoungrok0517 @Colanim Just as a sanity check, do you have the same issues with running any model with lightning? Also, exactly which arguments are you passing in?

No. The model was running well on multi-GPU at an early stage, but at some point the problem started occurring. Since the model structure stayed the same, I suspect the error has something to do with the data loading step, but I can't be certain. I attach some code snippets that might cause the problem:

# imports assumed by the snippets below
import json
from collections import Counter
from pathlib import Path

import pandas as pd
import zarr
from torch.utils.data import Dataset
from torchtext.vocab import Vocab

# data loading (pandas, parquet)
class News20Dataset(Dataset):
    def __init__(self, data_path, tokenizer):
        super().__init__()
        self.data = pd.read_parquet(data_path)
        self.tokenizer = tokenizer

    def __len__(self):
        return len(self.data)

    def __getitem__(self, index):
        row = self.data.iloc[index]
        text_ids = self.tokenizer.encode(row.text)
        return {"text": text_ids, "target": row.label}

# data loading (numpy, zarr)
class TripleDataset(Dataset):
    def __init__(self, data_path, tokenizer):
        super().__init__()
        self.data = zarr.open(data_path, "r")
        self.tokenizer = tokenizer

# vector loading (torchtext.vocab)
def _get_bow_vocab(self):
    VOCAB_PATH = Path(root_dir) / "../../vocab/vocab.json"  # root_dir is defined elsewhere
    VECTORS = "fasttext.en.300d"
    MIN_FREQ = 10
    MAX_SIZE = 100000
    with open(VOCAB_PATH, "r", encoding="utf-8") as f:
        vocab_counts = Counter(json.load(f))

    return Vocab(
        vocab_counts,
        vectors=VECTORS,
        min_freq=MIN_FREQ,
        max_size=MAX_SIZE
    )

Here are my dataloaders

def _get_dataloader(self, dataset, test=False):
    batch_size = self.hparams.batch_size if not test else 10000
    num_workers = int(cpu_count() / 4) or 1
    return DataLoader(
        dataset, batch_size=batch_size, num_workers=num_workers, pin_memory=True,
    )

def train_dataloader(self):
    return self._get_dataloader(self._train_dataset)

def val_dataloader(self):
    return self._get_dataloader(self._val_dataset)

def test_dataloader(self):
    return self._get_dataloader(self._test_dataset, test=True)


This is my Trainer

    if hparams.profile:
        profiler = AdvancedProfiler()
    else:
        profiler = None

    # train
    # trainer = Trainer.from_argparse_args(hparams)
    trainer = Trainer(
        logger=tt_logger,
        default_save_path=root_dir,
        max_nb_epochs=hparams.max_nb_epochs,
        gpus=hparams.gpus,
        distributed_backend=hparams.distributed_backend,
        fast_dev_run=hparams.fast_dev_run,
        amp_level=hparams.amp_level,
        precision=hparams.precision,
        early_stop_callback=early_stop_callback,
        benchmark=True,
        profiler=profiler,
    )
    trainer.fit(model)

FWIW: Have you recently updated Ubuntu? I just started experiencing this in the last hour - and I am using a local fork that has not changed in a few weeks - so it doesn't seem likely that it's lightning. Will add more if I learn more.

Update: I do not believe this is pytorch-lightning. I have recently minted models that are virtually identical and do NOT show this problem. It's not clear what is causing it ... almost certainly file-related, as the specific error is a multiprocessing/POSIX complaint about file descriptors that do not have appropriate values.

Well, maybe or maybe not. Then the problem could be a CUDA or apex compatibility issue. To add, I see the problem only when I use DDP. Using DP shows no problem.

I have resolved this in my environment. Writing it here in case it helps. During the construction of the model, and before creating the trainer, etc., I store the .data property of a Parameter:

self.class_p        = nn.Parameter(torch.Tensor(np.ones(self.data.num_classes) * np.log(1.0)), requires_grad=True)
self.class_p_t      = self.class_p.data

I do NOT actually do anything with self.class_p_t - I assigned it in anticipation of logging these values to monitor the progress of self-learned curriculum-learning class weights. I speculated there was some issue relating to autograd - but that's just a guess. It generated ~100 file descriptors, which is what appeared to choke the fds_to_keep check in the multiprocessing module. Eliminating this assignment ended the problem - finally!

For the record: Ubuntu 19.10, Python 3.7.5, PyTorch 1.4, venv.

Hmm... are you suggesting it's a bug in pytorch? Is it expected behavior to generate many file descriptors if we extract the .data property from tensors? For temporary storage?? I couldn't understand...

Sorry - I do not know if it's a bug in pytorch or not. I shared my results in case it helps.

I am not sure under what circumstances the .data field is meant to be accessed - if ever. I don't even know if it's just revealing a bug in the latest Ubuntu release. I just know that removing that line made the problem go away. As I tried to say, I was merely speculating that this was due to some behind-the-scenes memory-sharing approach in python/pytorch multiprocessing - for which shared-memory files could be a useful mechanism for sharing large tensors. Again, this is ONLY speculation.

I did post a question on pytorch github.

The people at pytorch did look at a fragment that reproduces this problem - and they had some insight into what is happening and why. I need to look at some lightning internals to make sure their diagnosis applies to what we are seeing.

I have verified what causes the problem in my model and what will fix it. The problem is my naively assigning a parameter to another variable. This new reference to the parameter does not get moved to the correct gpu when pytorch-lightning copies it with model.cuda(gpu_idx) in ddp_train(). The reference is in another process space when ddp is used, and so creates the multiprocessing fault noted at the head of this issue.

This is NOT a ptl bug. This is the result of the naive assignment of the parameter to another variable:

        # nn.Parameter() ensures pytorch knows about this - and will move it new gpu when required
        self.class_p        = nn.Parameter(torch.Tensor(np.ones(self.data.num_classes) * np.log(1.0)), requires_grad=True)

        # causes a crash if using ddp: self.class_p_t refers to the original process space, not the one it has been moved to by ddp
        self.class_p_t      = self.class_p.data

self.class_p is known to pytorch as a parameter, and so is copied to the correct gpu. But the reference to it in self.class_p_t is not known to pytorch as a parameter, and so this reference is not updated when the model is copied. To fix this simply, do a deep copy instead of the naive assignment. The self.class_p_t is still not moved to the gpu, but it is now within the process space of each ddp model:

        # this now works
        self.class_p        = nn.Parameter(torch.Tensor(np.ones(self.data.num_classes) * np.log(1.0)), requires_grad=True)
        self.class_p_t      = copy.deepcopy(self.class_p.data)

Hope this helps ...

thanks!

@williamFalcon mrshenli's comments in pytorch do raise a question for me - he points out that something similar could happen when the model is passed as an arg to a ddp process. I think ptl is probably okay because the model.cuda(gpu_idx) must effectively make a deep copy - but just food for thought as I have not been able to confirm exactly what model.cuda(gpu_idx) does.

Also, he inadvertently partially demonstrates something I have been meaning to try for bringing a model back to the spawning process from ddp - that is, to use the special way in which pytorch handles tensors/models on queues. I suspect that if we used a queue() to pass the model to the process on gpus[0], the model's parameters might be automatically resolved back to the cpu - and thus the trained model would be available without any special effort. I will try to get to this in the next week ...

Not sure if I'm late to the game, but this might be a hint to the origin of the problem.
Using pytorch-lightning 0.8.4 on Ubuntu 18.04.4 LTS, I get the error when specifying specific gpus (e.g. --gpus 0,1,2,3), while I do not get the error if I specify the number of gpus to use (e.g. --gpus 4).
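
For reference, the two ways of specifying GPUs that this distinguishes, written out as Trainer arguments (the device indices are just examples):

    from pytorch_lightning import Trainer

    # selecting specific devices - the case that triggers the error above
    trainer = Trainer(gpus=[0, 1, 2, 3], distributed_backend="ddp")

    # specifying only the number of devices - the case that works
    trainer = Trainer(gpus=4, distributed_backend="ddp")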

@LukasHedegaard pls update to 0.9 :]

Unfortunately, I get the same results after updating to pytorch-lightning 0.9

