Sagemaker-python-sdk: PyTorch: increasing --shm-size to allow multiprocessing data loaders

Created on 16 Jul 2019 · 12 Comments · Source: aws/sagemaker-python-sdk

Reference: MLFW-2582

System Information

Framework: PyTorch
Framework Version: 1.1.0
Python Version: py3
CPU or GPU: Both
Python SDK Version: 1.32.0
Are you using a custom image: Yes, N/A

Describe the problem

When using a data loader with multiprocessing in PyTorch (set num_workers > 0), the following error comes up:

algo-1-xxsq9_1  |   File "/usr/local/lib/python3.6/dist-packages/torch/utils/data/dataloader.py", line 519, in _try_get_batch
algo-1-xxsq9_1  |     raise RuntimeError('DataLoader worker (pid(s) {}) exited unexpectedly'.format(pids_str))
algo-1-xxsq9_1  | RuntimeError: DataLoader worker (pid(s) 76) exited unexpectedly
tmpitprkm4o_algo-1-xxsq9_1 exited with code 1

This is because /dev/shm is only 64 MB by default in a Docker container. The fix seems to be simply passing --shm-size with a higher value to docker run, but that option isn't available when using SageMaker.
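To confirm which limit a given container actually has, the size of /dev/shm can be checked from inside the training script. A minimal sketch using only the standard library; the helper name shm_size_bytes is made up for illustration and is not part of PyTorch or the SageMaker SDK:

```python
import shutil


def shm_size_bytes(path="/dev/shm"):
    """Return the total size in bytes of the filesystem at `path`.

    Returns None if the path does not exist or cannot be inspected.
    """
    try:
        return shutil.disk_usage(path).total
    except OSError:
        return None


if __name__ == "__main__":
    total = shm_size_bytes()
    if total is None:
        print("/dev/shm not found")
    else:
        # Report in MiB so the value is easy to compare with the Docker default.
        print("/dev/shm total: {:.0f} MiB".format(total / 2**20))
```

Calling this at the top of the entry point makes it easy to tell whether a worker crash coincides with a small shared-memory mount.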

I've extended the PyTorch container and added a bunch of custom packages and settings in the Dockerfile with no problem, but I can't set runtime flags/args meant to be passed to docker run. (Note: docker build has a --shm-size argument, but that is NOT related to the /dev/shm size of the final container.)

This becomes a huge bottleneck since training is very slow with num_workers = 0. Can't we just increase the default shared memory? Or provide an easier way for users to set it using the sagemaker sdk?

Exact command to reproduce:
Any example using DataLoader with num_workers > 0.

feature request

ksanjeevan

👍7

All 12 comments

Hello @ksanjeevan,

Thanks for bringing this issue to our attention.

Let me reach out to the team that owns the training platform.

ChoiByungWook on 16 Jul 2019

👍1

Are you using local mode training?

If you're not using local mode, then Docker containers running in SageMaker training do NOT in fact use the 64 MB default shm-size; we adjust it depending on the instance type.

You didn't mention the instance type you're using; which is it? And what is the maximum amount of shared memory your algorithm is meant to use?

ishaaq on 5 Sep 2019

Yeah this was using local mode, using remote seems to be fine.

ksanjeevan on 5 Sep 2019

thanks for the clarification! So it seems that we would need to find a way to expose shm-size as an option that would then get written into the docker-compose.yml file that is used for local mode. I'll open up an item in our internal backlog, which we're always reprioritizing based on feedback.

some potentially helpful links for anyone wanting to take a stab at this:

code where the yaml file is written
docker compose shm-size option - note that this would also include upgrading the version we use for the file

laurenyu on 6 Sep 2019

👍2

Hi @laurenyu, yes thanks! In an ideal world, we could pass some kind of run_hyperparameters dictionary so that we can add any flag for docker run, but having shm-size is good enough.

ksanjeevan on 9 Sep 2019

just wondering if this issue is resolved. This feature will really make debugging much easier.

bill10 on 19 Feb 2020

👍8 😕1

This is how you can monkey patch the sagemaker SDK to enable multiprocessing in local mode:

Example SageMaker SDK location:
/home/ec2-user/anaconda3/envs/mxnet_p36/lib/python3.6/site-packages/sagemaker/local/image.py:

def _create_docker_host(self, host, environment, optml_subdirs, command, volumes):
    """
    Args:
        host:
        environment:
        optml_subdirs:
        command:
        volumes:
    """
    optml_volumes = self._build_optml_volumes(host, optml_subdirs)
    optml_volumes.extend(volumes)

    # Added block (use 95% of total memory)
    from psutil import virtual_memory
    mem = virtual_memory()
    # mem.total is in bytes; convert to whole gibibytes (at least 1)
    shm_size = "{}gb".format(max(1, int(mem.total / 1024**3 * 0.95)))

    host_config = {
        "image": self.image,
        "stdin_open": True,
        "tty": True,
        "volumes": [v.map for v in optml_volumes],
        "environment": environment,
        "command": command,
        "networks": {"sagemaker-local": {"aliases": [host]}},
        "shm_size": shm_size # Added line
    }
...

@laurenyu any chance we could get a PR with an option like this?
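For anyone adapting a patch like this, the size calculation can be kept in a small pure function so it is easy to check without psutil or Docker. This is only a sketch under assumptions: the name shm_size_from_total, the 64 MiB floor (chosen to match the Docker default), and the choice to emit a value in "m" units are all illustrative and not part of the SageMaker SDK:

```python
def shm_size_from_total(total_bytes, fraction=0.95):
    """Format a Docker-style size string covering `fraction` of `total_bytes`.

    The result is expressed in whole mebibytes with an "m" suffix and is
    never smaller than 64m (an assumed floor matching the Docker default).
    """
    if total_bytes <= 0:
        raise ValueError("total_bytes must be positive")
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")
    mebibytes = int(total_bytes * fraction) // 2**20
    return "{}m".format(max(mebibytes, 64))


if __name__ == "__main__":
    # Example: half of an 8 GiB host.
    print(shm_size_from_total(8 * 2**30, fraction=0.5))
```

In the patched function, the returned string would take the place of shm_size, with mem.total passed in as total_bytes. A smaller fraction than 0.95 may be safer on machines that also need memory for the host processes.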

austinmw on 24 Mar 2020

👍1

Any updates? The feature would indeed be useful for debugging locally.

VladVin on 2 Nov 2020

👍2

As a workaround I found a solution to change the default parameters of the Docker daemon. May be suitable for those who have rights to change them:

Open this file in your editor:

sudo vim /etc/docker/daemon.json

Add the option "default-shm-size": "13G" as mentioned in the Docker docs. You can specify another value; I set 13 GB because I have 16 GB of RAM on my server.

Restart the Docker daemon:

sudo systemctl restart docker
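If daemon.json already holds other settings, editing it by hand risks dropping them or leaving invalid JSON. The edit can also be scripted; a minimal sketch, assuming the file is plain JSON and that the script is run with enough privileges to write it. The function name set_default_shm_size is hypothetical, and the daemon still has to be restarted afterwards:

```python
import json
from pathlib import Path


def set_default_shm_size(config_path, size="13G"):
    """Set "default-shm-size" in a Docker daemon.json, keeping other keys.

    Creates the file if it is missing or empty. Returns the resulting config.
    """
    path = Path(config_path)
    config = {}
    if path.exists():
        text = path.read_text()
        if text.strip():
            config = json.loads(text)
    config["default-shm-size"] = size
    path.write_text(json.dumps(config, indent=2) + "\n")
    return config


if __name__ == "__main__":
    # Demonstrated on a scratch file; point it at /etc/docker/daemon.json
    # (as root) to apply the change for real.
    print(set_default_shm_size("daemon.example.json"))
```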

VladVin on 3 Nov 2020

👍6

As a workaround I found a solution to change the default parameters of the Docker daemon. May be suitable for those who have rights to change them:

Open this file in your editor:
sudo vim /etc/docker/daemon.json
Add option "default-shm-size": "13G" as mentioned in the Docker docs. You can specify another value, I just set 13Gb as I have 16Gb of RAM on my server.

Restart Docker daemon:
sudo systemctl restart docker

Thanks! This is the only valid solution to this issue in my case. Finally found your solution after a few hours of searching.

ivan-saptech on 19 Mar 2021
