I am trying to train TridentNet on a custom dataset. This is the config I used:
```python
import os

from detectron2.checkpoint import DetectionCheckpointer
from detectron2.config import get_cfg
from detectron2.engine import DefaultTrainer
from detectron2.modeling import build_model
from projects.TridentNet.tridentnet import add_tridentnet_config

# project_root and the "train"/"val" datasets are assumed to be defined and
# registered elsewhere in the script.
cfg = get_cfg()
add_tridentnet_config(cfg)
cfg.merge_from_file(project_root + "/projects/TridentNet/configs/tridentnet_fast_R_50_C4_3x.yaml")
cfg.DATASETS.TRAIN = ("train",)
# Note: OUTPUT_DIR is used as a directory below, so this creates a directory
# named "model_final.pth".
cfg.OUTPUT_DIR = os.path.join(cfg.OUTPUT_DIR, "model_final.pth")
cfg.DATASETS.TEST = ("val",)
cfg.DATALOADER.NUM_WORKERS = 2
cfg.MODEL.WEIGHTS = "detectron2://ImageNetPretrained/MSRA/R-50.pkl"
cfg.SOLVER.IMS_PER_BATCH = 2
cfg.SOLVER.BASE_LR = 0.001
cfg.SOLVER.MAX_ITER = 200000
cfg.MODEL.ROI_HEADS.BATCH_SIZE_PER_IMAGE = 128
cfg.MODEL.ROI_HEADS.NUM_CLASSES = 1
cfg.TEST.EVAL_PERIOD = 200
cfg.SOLVER.CHECKPOINT_PERIOD = 600
cfg.SOLVER.MOMENTUM = 0.87

os.makedirs(cfg.OUTPUT_DIR, exist_ok=True)
trainer = DefaultTrainer(cfg)
trainer.resume_or_load(resume=False)
trainer.train()
```
I got the following error:

You're loading an ImageNet pre-trained model (because that's what's written in the config file), and an ImageNet pre-trained model contains classification layers that are not used by the detection model. So it's expected.
Which is the ImageNet pre-trained model for detection?
that's what's written in the config file
Did you mean that `cfg.MODEL.WEIGHTS = "detectron2://ImageNetPretrained/MSRA/R-50.pkl"` contains a classification layer? How do I solve this error? Will you show which part of the config file mentions the classification layer?
Yes.
It's expected, which means it's not an error.
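To illustrate why the "not used by the model" messages are harmless, here is a minimal sketch of the kind of key comparison a checkpoint loader performs. It is not detectron2's actual loading code: `compare_keys` and the simplified parameter names are made up for illustration, loosely modeled on the names in the log.

```python
def compare_keys(model_keys, checkpoint_keys):
    """Report which parameter names exist on only one side."""
    model_keys = set(model_keys)
    checkpoint_keys = set(checkpoint_keys)
    return {
        # In the model but absent from the checkpoint: these keep their
        # initial values and are learned during training.
        "missing_in_checkpoint": sorted(model_keys - checkpoint_keys),
        # In the checkpoint but absent from the model: these are skipped.
        "unused_by_model": sorted(checkpoint_keys - model_keys),
    }


# Hypothetical, simplified names: a classification checkpoint versus a
# detection model that shares only the backbone with it.
checkpoint = ["backbone.conv1.weight", "fc1000.weight", "fc1000.bias"]
model = [
    "backbone.conv1.weight",
    "roi_heads.box_predictor.cls_score.weight",
    "roi_heads.box_predictor.cls_score.bias",
]

report = compare_keys(model, checkpoint)
print(report["unused_by_model"])        # the ImageNet classifier weights
print(report["missing_in_checkpoint"])  # the detection head weights
```

Both lists being non-empty is the normal situation when a detector is initialized from a classification backbone, which is why these messages are informational rather than errors.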
But training failed at iteration 0 itself. You can see it in the question.
Please provide full logs. I can't see what the error is in the screenshot.
> proposal_generator.anchor_generator.cell_anchors.0
proposal_generator.rpn_head.anchor_deltas.{bias, weight}
proposal_generator.rpn_head.conv.{bias, weight}
proposal_generator.rpn_head.objectness_logits.{bias, weight}
roi_heads.box_predictor.bbox_pred.{bias, weight}
roi_heads.box_predictor.cls_score.{bias, weight}
[02/10 15:36:04 d2.checkpoint.c2_model_loading]: The checkpoint contains parameters not used by the model:
fc1000_b
fc1000_w
conv1_b
[02/10 15:36:04 d2.engine.train_loop]: Starting training from iteration 0
[02/10 15:36:04 d2.engine.hooks]: Total training time: 0:00:00 (0:00:00 on hooks)
Registering val image
100%|██████████| 9958/9958 [01:20<00:00, 124.41it/s]
9958 Images registered successfully.
[02/10 15:37:25 d2.data.build]: Distribution of instances among all 1 categories:
| category | #instances |
|:----------:|:-------------|
| person | 57016 |
| | |
WARNING [02/10 15:37:25 d2.engine.defaults]: No evaluator found. Use `DefaultTrainer.test(evaluators=)`, or implement its `build_evaluator` method.
Traceback (most recent call last):
File "tridentnet_custom_train.py", line 96, in <module>
trainer.train()
File "/mnt/Data_common/PPE_Violation_Detection_Samjith/MPC_model/detectron2/detectron2/engine/defaults.py", line 373, in train
super().train(self.start_iter, self.max_iter)
File "/mnt/Data_common/PPE_Violation_Detection_Samjith/MPC_model/detectron2/detectron2/engine/train_loop.py", line 132, in train
self.run_step()
File "/mnt/Data_common/PPE_Violation_Detection_Samjith/MPC_model/detectron2/detectron2/engine/train_loop.py", line 212, in run_step
loss_dict = self.model(data)
File "/opt/anaconda3/envs/d2_train/lib/python3.8/site-packages/torch/nn/modules/module.py", line 532, in __call__
result = self.forward(*input, **kwargs)
File "/mnt/Data_common/PPE_Violation_Detection_Samjith/MPC_model/detectron2/detectron2/modeling/meta_arch/rcnn.py", line 129, in forward
_, detector_losses = self.roi_heads(images, features, proposals, gt_instances)
File "/opt/anaconda3/envs/d2_train/lib/python3.8/site-packages/torch/nn/modules/module.py", line 532, in __call__
result = self.forward(*input, **kwargs)
File "/mnt/Data_common/PPE_Violation_Detection_Samjith/MPC_model/detectron2/projects/TridentNet/tridentnet/trident_rcnn.py", line 66, in forward
pred_instances, losses = super().forward(images, features, proposals, all_targets)
File "/mnt/Data_common/PPE_Violation_Detection_Samjith/MPC_model/detectron2/detectron2/modeling/roi_heads/roi_heads.py", line 392, in forward
box_features = self._shared_roi_transform(
File "/mnt/Data_common/PPE_Violation_Detection_Samjith/MPC_model/detectron2/detectron2/modeling/roi_heads/roi_heads.py", line 378, in _shared_roi_transform
x = self.pooler(features, boxes)
File "/opt/anaconda3/envs/d2_train/lib/python3.8/site-packages/torch/nn/modules/module.py", line 532, in __call__
result = self.forward(*input, **kwargs)
File "/mnt/Data_common/PPE_Violation_Detection_Samjith/MPC_model/detectron2/detectron2/modeling/poolers.py", line 215, in forward
return self.level_poolers[0](x[0], pooler_fmt_boxes)
File "/opt/anaconda3/envs/d2_train/lib/python3.8/site-packages/torch/nn/modules/module.py", line 532, in __call__
result = self.forward(*input, **kwargs)
File "/mnt/Data_common/PPE_Violation_Detection_Samjith/MPC_model/detectron2/detectron2/layers/roi_align.py", line 94, in forward
return roi_align(
File "/mnt/Data_common/PPE_Violation_Detection_Samjith/MPC_model/detectron2/detectron2/layers/roi_align.py", line 19, in forward
output = _C.roi_align_forward(
RuntimeError: CUDA error: invalid device function (ROIAlign_forward_cuda at /mnt/Data_common/PPE_Violation_Detection_Samjith/MPC_model/detectron2/detectron2/layers/csrc/ROIAlign/ROIAlign_cuda.cu:361)
frame #0: c10::Error::Error(c10::SourceLocation, std::string const&) + 0x47 (0x7f4ac3239627 in /opt/anaconda3/envs/d2_train/lib/python3.8/site-packages/torch/lib/libc10.so)
frame #1: detectron2::ROIAlign_forward_cuda(at::Tensor const&, at::Tensor const&, float, int, int, int, bool) + 0xa24 (0x7f4aa8c8c770 in /mnt/Data_common/PPE_Violation_Detection_Samjith/MPC_model/detectron2/detectron2/_C.cpython-38-x86_64-linux-gnu.so)
frame #2: detectron2::ROIAlign_forward(at::Tensor const&, at::Tensor const&, float, int, int, int, bool) + 0xb6 (0x7f4aa8c09fc6 in /mnt/Data_common/PPE_Violation_Detection_Samjith/MPC_model/detectron2/detectron2/_C.cpython-38-x86_64-linux-gnu.so)
frame #3: <unknown function> + 0x543a9 (0x7f4aa8c1a3a9 in /mnt/Data_common/PPE_Violation_Detection_Samjith/MPC_model/detectron2/detectron2/_C.cpython-38-x86_64-linux-gnu.so)
frame #4: <unknown function> + 0x5039e (0x7f4aa8c1639e in /mnt/Data_common/PPE_Violation_Detection_Samjith/MPC_model/detectron2/detectron2/_C.cpython-38-x86_64-linux-gnu.so)
<omitting python frames> frame #10: THPFunction_apply(_object*, _object*) + 0xb2f (0x7f4af51c0d1f in /opt/anaconda3/envs/d2_train/lib/python3.8/site-packages/torch/lib/libtorch_python.so)
[1] 11845 segmentation fault (core dumped) python tridentnet_custom_train.py
(d2_train)
Your issue is answered in https://github.com/facebookresearch/detectron2/blob/master/INSTALL.md#common-installation-issues already.
If you need help, please also include environment information following the issue template.
I have already trained a RetinaNet model using detectron2. I got the above error when I tried other models.
The output of `python -m detectron2.utils.collect_env`:
$ python -m detectron2.utils.collect_env
sys.platform linux
Python 3.8.1 (default, Jan 8 2020, 22:29:32) [GCC 7.3.0]
numpy 1.18.1
detectron2 0.1 @/mnt/Data_common/PPE_Violation_Detection_Samjith/MPC_model/detectron2/detectron2
detectron2 compiler GCC 7.4
detectron2 CUDA compiler 10.0
detectron2 arch flags sm_61
DETECTRON2_ENV_MODULE
PyTorch 1.4.0 @/opt/anaconda3/envs/d2_train/lib/python3.8/site-packages/torch
PyTorch debug build False
CUDA available True
GPU 0,1 GeForce GTX 1080
CUDA_HOME /usr/local/cuda
NVCC Cuda compilation tools, release 10.0, V10.0.130
Pillow 6.2.2
torchvision 0.5.0 @/opt/anaconda3/envs/d2_train/lib/python3.8/site-packages/torchvision
torchvision arch flags sm_35, sm_50, sm_60, sm_70, sm_75
cv2 4.2.0
PyTorch built with:
I have found the following:
detectron2 CUDA compiler 10.0
CUDA_HOME /usr/local/cuda
PyTorch built with:
- CUDA Runtime 10.1
The detectron2 CUDA compiler is 10.0, but PyTorch was built with CUDA 10.1. Should I rebuild detectron2, or should I install CUDA 10.0 and rebuild PyTorch with CUDA 10.0?
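The comparison above can also be done mechanically. This is a rough sketch, not part of detectron2: `cuda_versions` is a hypothetical helper, and its regular expressions assume the `collect_env` output uses exactly the line formats pasted in this thread, which may differ in other versions.

```python
import re


def cuda_versions(env_text):
    """Extract (detectron2 compiler CUDA, PyTorch runtime CUDA) from collect_env text."""
    compiler = re.search(r"detectron2 CUDA compiler\s+(\d+\.\d+)", env_text)
    runtime = re.search(r"CUDA Runtime\s+(\d+\.\d+)", env_text)
    return (
        compiler.group(1) if compiler else None,
        runtime.group(1) if runtime else None,
    )


# The relevant lines from the environment report posted above.
env_text = """
detectron2 CUDA compiler 10.0
CUDA_HOME /usr/local/cuda
PyTorch built with:
- CUDA Runtime 10.1
"""

compiler_cuda, runtime_cuda = cuda_versions(env_text)
versions_match = compiler_cuda is not None and compiler_cuda == runtime_cuda
print(compiler_cuda, runtime_cuda, versions_match)  # 10.0 10.1 False
```

On a live system, `torch.version.cuda` reports the CUDA version PyTorch was built with, so it can stand in for the parsed runtime value.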
Your issue is answered in https://github.com/facebookresearch/detectron2/blob/master/INSTALL.md#common-installation-issues already.
If you need help, please also include environment information following the issue template.
I have installed the detectron2 build for CUDA 10.1 by using the following command:
pip install detectron2 -f \
https://dl.fbaipublicfiles.com/detectron2/wheels/cu101/index.html