Reference: MLFW-2582
When using a data loader with multiprocessing in PyTorch (set num_workers > 0), the following error comes up:
algo-1-xxsq9_1 | File "/usr/local/lib/python3.6/dist-packages/torch/utils/data/dataloader.py", line 519, in _try_get_batch
algo-1-xxsq9_1 | raise RuntimeError('DataLoader worker (pid(s) {}) exited unexpectedly'.format(pids_str))
algo-1-xxsq9_1 | RuntimeError: DataLoader worker (pid(s) 76) exited unexpectedly
tmpitprkm4o_algo-1-xxsq9_1 exited with code 1
This is because /dev/shm is only 64 MB by default. The solution seems to be simply passing --shm-size with a higher value to docker run, but that option isn't available when using SageMaker.
I've extended the PyTorch container and added a bunch of custom packages and settings in the Dockerfile with no problem, but I can't set runtime flags/args meant to be passed to docker run. (Note: docker build has a --shm-size arg, but that is NOT related to the /dev/shm size of the final container.)
This becomes a huge bottleneck since training is very slow with num_workers = 0. Can't we just increase the default shared memory? Or provide an easier way for users to set it using the sagemaker sdk?
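To confirm that the shared memory limit is the culprit, it can help to check how large /dev/shm actually is from inside the running container. The following is a minimal sketch using only the standard library; the helper name shm_report is made up for illustration and is not part of PyTorch or the SageMaker SDK.

```python
import os
import shutil


def shm_report(path="/dev/shm"):
    """Return (total_mib, free_mib) for the filesystem mounted at `path`."""
    usage = shutil.disk_usage(path)
    mib = 1024 ** 2
    return usage.total // mib, usage.free // mib


# Only report when /dev/shm exists (it does on typical Linux containers).
if os.path.isdir("/dev/shm"):
    total, free = shm_report()
    print("/dev/shm: {} MiB total, {} MiB free".format(total, free))
```

Running this at the start of the training script shows whether the container was started with the 64 MB Docker default or with something larger.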
Hello @ksanjeevan,
Thanks for bringing this issue to our attention.
Let me reach out to the team that owns the training platform.
Are you using local mode training?
If you're not using local mode, then Docker containers running in SageMaker training do NOT use the 64 MB default shm-size; we adjust it depending on the instance type.
You didn't mention the instance type you're using. Which is it? And what is the maximum amount of shared memory your algorithm is meant to use?
Yeah this was using local mode, using remote seems to be fine.
Thanks for the clarification! So it seems that we would need to find a way to expose shm-size as an option that would then get written into the docker-compose.yml file that is used for local mode. I'll open up an item in our internal backlog, which we're always reprioritizing based on feedback.
Some potentially helpful links for anyone wanting to take a stab at this:
Hi @laurenyu, yes thanks! In an ideal world, we could pass some kind of run_hyperparameters dictionary so that we can add any flag for docker run, but having shm-size is good enough.
Just wondering whether this issue has been resolved. This feature would really make debugging much easier.
This is how you can monkey patch the sagemaker SDK to enable multiprocessing in local mode:
Example SageMaker SDK location:
/home/ec2-user/anaconda3/envs/mxnet_p36/lib/python3.6/site-packages/sagemaker/local/image.py:
def _create_docker_host(self, host, environment, optml_subdirs, command, volumes):
    """
    Args:
        host:
        environment:
        optml_subdirs:
        command:
        volumes:
    """
    optml_volumes = self._build_optml_volumes(host, optml_subdirs)
    optml_volumes.extend(volumes)

    # Added block: give /dev/shm 95% of total host memory, expressed in MB
    from psutil import virtual_memory
    mem = virtual_memory()
    shm_size = "{}mb".format(int(mem.total * 0.95) // (1024 ** 2))

    host_config = {
        "image": self.image,
        "stdin_open": True,
        "tty": True,
        "volumes": [v.map for v in optml_volumes],
        "environment": environment,
        "command": command,
        "networks": {"sagemaker-local": {"aliases": [host]}},
        "shm_size": shm_size,  # Added line
    }
    ...
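If editing the installed SDK file in place is not an option, the same effect could probably be had by wrapping the method at runtime instead. The sketch below is only an illustration under assumptions: with_shm_size and fake_create_host are hypothetical names, and it assumes that _create_docker_host returns the host_config dict and that the class holding it in sagemaker/local/image.py is _SageMakerContainer. Check both against your SDK version before relying on it.

```python
import functools


def with_shm_size(create_host, shm_size):
    """Wrap a function that returns a docker-compose service dict so that
    the returned dict also carries a "shm_size" entry."""

    @functools.wraps(create_host)
    def wrapper(*args, **kwargs):
        host_config = create_host(*args, **kwargs)
        host_config["shm_size"] = shm_size
        return host_config

    return wrapper


# Stand-in for the SDK method, used here only to demonstrate the wrapper.
def fake_create_host(host):
    return {
        "image": "my-image",
        "command": "train",
        "networks": {"sagemaker-local": {"aliases": [host]}},
    }


patched = with_shm_size(fake_create_host, "2gb")
print(patched("algo-1"))

# Hypothetical application to the SDK (class name is an assumption):
#
#   from sagemaker.local import image
#   image._SageMakerContainer._create_docker_host = with_shm_size(
#       image._SageMakerContainer._create_docker_host, "2gb")
```

The patch would have to run before the local estimator starts training, and it may break on SDK upgrades since it touches a private method.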
@laurenyu any chance we could get a PR with an option like this?
Any updates? The feature would indeed be useful for debugging locally.
As a workaround, I found a solution: change the default parameters of the Docker daemon. This may be suitable for those who have the rights to change them:
- Open this file in your editor:
sudo vim /etc/docker/daemon.json
- Add the option
"default-shm-size": "13G"
as mentioned in the Docker docs. You can specify another value; I just set 13 GB as I have 16 GB of RAM on my server.
- Restart the Docker daemon:
sudo systemctl restart docker
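For anyone who prefers to script the edit rather than open the file by hand, here is a minimal sketch that adds the option while keeping any settings already present in daemon.json. The function name set_default_shm_size is made up for illustration; writing to /etc/docker/daemon.json needs root, and the daemon still has to be restarted afterwards.

```python
import json
import os


def set_default_shm_size(path, size):
    """Set "default-shm-size" in a Docker daemon.json, keeping other keys."""
    config = {}
    # Load the existing settings if the file is present and non-empty.
    if os.path.exists(path) and os.path.getsize(path) > 0:
        with open(path) as f:
            config = json.load(f)
    config["default-shm-size"] = size
    with open(path, "w") as f:
        json.dump(config, f, indent=2)
        f.write("\n")
    return config


# Example call (run as root, then restart Docker):
#   set_default_shm_size("/etc/docker/daemon.json", "13G")
```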
Thanks! This is the only valid solution to this issue in my case. Finally found your solution after a few hours of searching.
@ivan-saptech will we be able to deploy the same into SageMaker, and will the SageMaker endpoint get the same shm-size configuration?
CC - @VladVin
This is happening in SageMaker Studio as well. Is there a way to adjust the Studio settings?