Detectron2: Can't train with multi gpu

Created on 10 Jun 2020 · 3 Comments · Source: facebookresearch/detectron2

Instructions To Reproduce the Issue:

My code:

import os

import detectron2
from detectron2.config import get_cfg
from detectron2.data import MetadataCatalog
from detectron2.engine import DefaultPredictor, DefaultTrainer, default_argument_parser, default_setup, hooks, launch

from detectron2.data.datasets import register_coco_instances
register_coco_instances("doc_train", {}, "data.json", "./data")

cfg = get_cfg()
cfg.merge_from_file("./d2/configs/COCO-Detection/faster_rcnn_X_101_32x8d_FPN_3x.yaml")
cfg.DATASETS.TRAIN = ("doc_train",)
cfg.DATASETS.TEST = ()
cfg.DATALOADER.NUM_WORKERS = 4
cfg.SOLVER.IMS_PER_BATCH = 4
cfg.SOLVER.BASE_LR = 0.00025
cfg.SOLVER.MAX_ITER = 125500
cfg.MODEL.ROI_HEADS.BATCH_SIZE_PER_IMAGE = 64
cfg.MODEL.ROI_HEADS.NUM_CLASSES = 4
os.makedirs(cfg.OUTPUT_DIR, exist_ok=True)
trainer = DefaultTrainer(cfg)
trainer.resume_or_load(resume=False)

launch(trainer.train(), num_gpus_per_machine=2)

When I ran this script, I got an error:
RuntimeError: CUDA error: out of memory
When I reduced the batch size to 64, the script ran, but only one GPU was used (checked with nvidia-smi).

Expected behavior:

Training should run on multiple GPUs.

Environment:

GPU: 2x GeForce RTX 2080 Ti/PCIe/SSE2
Driver Version: 435.21
CUDA Version: 10.1
OS: Kubuntu

Thank you so much.

ChungNPH

All 3 comments

Reduce IMS_PER_BATCH.

ppwwyyxx on 10 Jun 2020
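
For context on the suggestion above: in detectron2, SOLVER.IMS_PER_BATCH is the total number of images per iteration across all GPUs, not a per-GPU value. The sketch below is plain Python arithmetic, not detectron2 code, and the helper name images_per_gpu is made up for illustration. It shows why a run that ends up on a single GPU has to fit the whole batch there, which may contribute to the out-of-memory error reported in the question.

```python
def images_per_gpu(ims_per_batch: int, num_gpus: int) -> int:
    """Return how many images each GPU processes per iteration.

    Hypothetical helper for illustration only. It assumes the total batch
    is split evenly across GPUs and rejects splits that are not even.
    """
    if num_gpus < 1:
        raise ValueError("num_gpus must be at least 1")
    if ims_per_batch % num_gpus != 0:
        raise ValueError("ims_per_batch must be divisible by num_gpus")
    return ims_per_batch // num_gpus


# Values from the question: IMS_PER_BATCH = 4 on a machine with 2 GPUs.
print(images_per_gpu(4, 2))  # 2 images per GPU when both GPUs take part
print(images_per_gpu(4, 1))  # 4 images on one GPU when only one process runs
```

If training really runs in a single process, lowering IMS_PER_BATCH reduces the load on that one GPU, but it does not by itself make the second GPU participate.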

I get this error as well. Reducing IMS_PER_BATCH didn't solve the memory issue. Additionally, even after calling launch(trainer.train(), num_gpus_per_machine=4), it only utilizes one GPU. Any updates here?
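
A likely cause of the single-GPU behaviour, offered as a sketch rather than a confirmed fix: Python evaluates trainer.train() before launch is ever called, so the whole training run happens in the current process on one GPU, and launch only receives its return value. launch expects a function object as its first argument and calls that function once in each worker process. The version below reuses the paths and settings from the question. The wrapper names main and run_training, and the choice of dist_url="auto", are assumptions made for this example.

```python
import os


def main():
    # detectron2 is imported inside the function only so this sketch can be
    # loaded on a machine without detectron2; top-level imports work as well.
    from detectron2.config import get_cfg
    from detectron2.data.datasets import register_coco_instances
    from detectron2.engine import DefaultTrainer

    # Register the dataset inside main() so that every worker process sees it.
    register_coco_instances("doc_train", {}, "data.json", "./data")

    cfg = get_cfg()
    cfg.merge_from_file("./d2/configs/COCO-Detection/faster_rcnn_X_101_32x8d_FPN_3x.yaml")
    cfg.DATASETS.TRAIN = ("doc_train",)
    cfg.DATASETS.TEST = ()
    cfg.DATALOADER.NUM_WORKERS = 4
    cfg.SOLVER.IMS_PER_BATCH = 4  # total across all GPUs, not per GPU
    cfg.SOLVER.BASE_LR = 0.00025
    cfg.SOLVER.MAX_ITER = 125500
    cfg.MODEL.ROI_HEADS.BATCH_SIZE_PER_IMAGE = 64
    cfg.MODEL.ROI_HEADS.NUM_CLASSES = 4
    os.makedirs(cfg.OUTPUT_DIR, exist_ok=True)

    # Build the trainer inside the worker, after the process group exists.
    trainer = DefaultTrainer(cfg)
    trainer.resume_or_load(resume=False)
    return trainer.train()


def run_training(num_gpus=2):
    from detectron2.engine import launch

    # Pass the function object `main`, not the result of calling main().
    launch(main, num_gpus_per_machine=num_gpus, dist_url="auto")
```

To use it, call run_training(2) from under an `if __name__ == "__main__":` guard at the bottom of the script, since launch starts one worker process per GPU. With both GPUs taking part, nvidia-smi should show two busy devices, each handling half of IMS_PER_BATCH.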