Maskrcnn-benchmark: How to record mAP for every epoch on the validation dataset

Created on 16 Jan 2019 · 10 Comments · Source: facebookresearch/maskrcnn-benchmark

❓ Questions and Help

Hi,

I have two questions about storing training/validation/test accuracies and losses:

1 - Where in the log file can I find mAP values for the training, validation, and testing datasets?
2 - I have implemented the workflow from issue #171, where I call the inference function in engine.trainer.do_train. Where should I modify the code (in the "inference" function or in the "coco_evaluation" file) to save mAP values and losses at every epoch?

nprasad2021

Most helpful comment

I found the solution to the problem. It seems that inference() puts the model into evaluation mode by calling model.eval(). Once training resumes, the model therefore returns predictions (a list) instead of a dict of losses. All we have to do is put the model back into training mode with model.train() before resuming the training process.

muzammil360 on 15 Mar 2019
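
A minimal sketch of the fix described in this comment. It does not use the real detection model: `ToyDetector` and the `evaluation_mode` helper are hypothetical stand-ins written for illustration, and only the `eval()` / `train()` calls and the `training` flag mirror what a PyTorch module provides. The point is that whatever runs validation should restore training mode afterwards, so the next forward pass returns a loss dict again.

```python
from contextlib import contextmanager


class ToyDetector:
    """Hypothetical stand-in for a detection model.

    Returns a dict of losses in training mode and a list of
    predictions in evaluation mode, like the model in this thread.
    """

    def __init__(self):
        self.training = True

    def train(self, mode=True):
        self.training = mode
        return self

    def eval(self):
        return self.train(False)

    def __call__(self, images, targets=None):
        if self.training:
            # Placeholder loss values, for illustration only.
            return {"loss_classifier": 0.5, "loss_box_reg": 0.25}
        return [{"boxes": [], "scores": []} for _ in images]


@contextmanager
def evaluation_mode(model):
    """Switch to eval mode, then restore the previous mode on exit."""
    was_training = model.training
    model.eval()
    try:
        yield model
    finally:
        model.train(was_training)


model = ToyDetector()
with evaluation_mode(model) as m:
    predictions = m(["image"])               # a list, as during inference
loss_dict = model(["image"], targets=[{}])   # a dict again, so .values() works
losses = sum(loss for loss in loss_dict.values())
```

In do_train(), simply calling model.train() right after the validation call should have the same effect as the context manager above.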


All 10 comments

Hi,

1 - We don't compute the mAP during training. You could compute it for the training set, but it would take quite some time.
2 - You can modify this function to store the mAP / losses in whatever format you want: https://github.com/facebookresearch/maskrcnn-benchmark/blob/d28845e112de36781b2b5f7217a34b2b62de8d2f/maskrcnn_benchmark/data/datasets/evaluation/coco/coco_eval.py#L63

Let me know if you have further questions.

fmassa on 16 Jan 2019
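
One possible way to "store the mAP / losses in the format that you want" is to keep a small JSON file keyed by iteration. The sketch below is only an illustration under that assumption: `record_metrics`, the file layout, and the metric names passed in are hypothetical and not part of maskrcnn-benchmark.

```python
import json
import os


def record_metrics(log_path, iteration, metrics):
    """Merge the metrics for one iteration into a JSON file.

    Returns the full history so the caller can plot or inspect it.
    """
    history = {}
    if os.path.exists(log_path):
        with open(log_path) as f:
            history = json.load(f)
    # JSON object keys are always strings, so store the iteration as one.
    history[str(iteration)] = metrics
    with open(log_path, "w") as f:
        json.dump(history, f, indent=2, sort_keys=True)
    return history
```

A call such as record_metrics(path, iteration, {"AP": ap, "AP50": ap50}) could then be made wherever the evaluation results are available.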

Thanks! I've added the following function "val" to maskrcnn_benchmark.engine.trainer:

def val(cfg, model, distributed=False):
    if distributed:
        model = model.module
    torch.cuda.empty_cache()  # TODO check if it helps
    iou_types = ("bbox",)
    if cfg.MODEL.MASK_ON:
        iou_types = iou_types + ("segm",)
    output_folders = [None] * (len(cfg.DATASETS.TEST) + len(cfg.DATASETS.TRAIN))
    dataset_names = cfg.DATASETS.TEST + cfg.DATASETS.TRAIN
    if cfg.OUTPUT_DIR:
        for idx, dataset_name in enumerate(dataset_names):
            print(dataset_name)
            output_folder = os.path.join(cfg.OUTPUT_DIR, "inference", dataset_name)
            mkdir(output_folder)
            output_folders[idx] = output_folder
    data_loaders_val = make_data_loader(cfg, is_train=False, is_distributed=distributed)
    output_tuple = {}
    for output_folder, dataset_name, data_loader_val in zip(output_folders, dataset_names, data_loaders_val):
        (dataset_name)
        result = inference(
            model,
            data_loader_val,
            dataset_name=dataset_name,
            iou_types=iou_types,
            box_only=cfg.MODEL.RPN_ONLY,
            device=cfg.MODEL.DEVICE,
            expected_results=cfg.TEST.EXPECTED_RESULTS,
            expected_results_sigma_tol=cfg.TEST.EXPECTED_RESULTS_SIGMA_TOL,
            output_folder=output_folder,
        )[0].results['bbox']
        output_tuple[dataset_name] = {}
        output_tuple[dataset_name]['AP'] = result['AP'].item()
        output_tuple[dataset_name]['AP50'] = result['AP50'].item()

    return output_tuple

I also added the following to the function do_train():

if iteration % checkpoint_period == 0:
    checkpointer.save("model_{:07d}".format(iteration), **arguments)
    print("ENTER VALIDATION CALCULATIONS")
    output[iteration] = val(cfg, model, distributed)

I am getting the following error:

Traceback (most recent call last):
  File "tools/train_net.py", line 184, in <module>
    main()
  File "tools/train_net.py", line 177, in main
    model = train(cfg, args.local_rank, args.distributed)
  File "tools/train_net.py", line 77, in train
    distributed
  File "/data/home/nprasad/Documents/github/maskrcnn-benchmark/maskrcnn_benchmark/engine/trainer.py", line 113, in do_train
    losses = sum(loss for loss in loss_dict.values())
AttributeError: 'list' object has no attribute 'values'

Do you have any idea what might have caused this error?
Alternatively, I could create a separate script to evaluate all models saved by the checkpointer, similar to tools/test_net.py. How do I select a specific model path for this script to use in evaluation?

I know that line 61 of test_net.py builds the model:
model = build_detection_model(cfg)
How do I specify a model saved at an earlier iteration?

nprasad2021 on 17 Jan 2019

> How do I specify a model saved at an earlier iteration?

Just use tools/test_net.py and set MODEL.WEIGHT to the checkpoint you want. I believe that will be much easier, and you won't need to handle the multi-processing that comes with distributed training.

fmassa on 17 Jan 2019
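
As a rough illustration of this suggestion, the sketch below builds the command line for evaluating one checkpoint. The helper `build_test_command` is hypothetical, and the paths in the usage note are placeholders; it assumes tools/test_net.py accepts `--config-file` followed by trailing `KEY VALUE` config overrides, as the adapted script later in this thread does.

```python
import sys


def build_test_command(config_file, checkpoint_path, extra_opts=()):
    """Return the argv list for evaluating one checkpoint with test_net.py.

    extra_opts is a flat sequence of further KEY, VALUE config overrides.
    """
    return [
        sys.executable,
        "tools/test_net.py",
        "--config-file",
        config_file,
        "MODEL.WEIGHT",
        checkpoint_path,
        *extra_opts,
    ]
```

The resulting list can be passed to subprocess.run(cmd, check=True) from the repository root, once per checkpoint, for example build_test_command("path/to/config.yaml", "path/to/checkpoint.pth").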

Thanks! The following script, adapted from test_net.py, performs the appropriate calculations:

from maskrcnn_benchmark.utils.env import setup_environment  # noqa F401 isort:skip

import argparse
import os, pickle, sys
print(sys.path)

from os import listdir
from os.path import isfile, join

import torch
from maskrcnn_benchmark.config import cfg
from maskrcnn_benchmark.data import make_data_loader
from maskrcnn_benchmark.engine.inference import inference
from maskrcnn_benchmark.modeling.detector import build_detection_model
from maskrcnn_benchmark.utils.checkpoint import DetectronCheckpointer
from maskrcnn_benchmark.utils.collect_env import collect_env_info
from maskrcnn_benchmark.utils.comm import synchronize, get_rank
from maskrcnn_benchmark.utils.logger import setup_logger
from maskrcnn_benchmark.utils.miscellaneous import mkdir
from maskrcnn_benchmark.engine.plotMaps import plot

def inf(args, cfg):

    num_gpus = int(os.environ["WORLD_SIZE"]) if "WORLD_SIZE" in os.environ else 1
    distributed = num_gpus > 1

    if distributed:
        torch.cuda.set_device(args.local_rank)
        torch.distributed.init_process_group(
            backend="nccl", init_method="env://"
        )

    save_dir = os.path.join(cfg.OUTPUT_DIR, "testInf")
    mkdir(save_dir)
    logger = setup_logger("maskrcnn_benchmark", save_dir, get_rank())
    logger.info("Using {} GPUs".format(num_gpus))
    logger.info(cfg)

    logger.info("Collecting env info (might take some time)")
    logger.info("\n" + collect_env_info())
    print(cfg.MODEL.WEIGHT)
    model = build_detection_model(cfg)
    model.to(cfg.MODEL.DEVICE)

    output_dir = cfg.OUTPUT_DIR
    checkpointer = DetectronCheckpointer(cfg, model, save_dir=output_dir)
    _ = checkpointer.load(cfg.MODEL.WEIGHT)

    iou_types = ("bbox",)
    if cfg.MODEL.MASK_ON:
        iou_types = iou_types + ("segm",)
    output_folders = [None] * len(cfg.DATASETS.TEST) 
    dataset_names = cfg.DATASETS.TEST
    print("Dataset Names", dataset_names)
    if cfg.OUTPUT_DIR:
        for idx, dataset_name in enumerate(dataset_names):
            output_folder = os.path.join(cfg.OUTPUT_DIR, "inference", dataset_name)
            mkdir(output_folder)
            output_folders[idx] = output_folder
    data_loaders_val = make_data_loader(cfg, is_train=False, is_distributed=distributed)
    output_tuple = {}
    for output_folder, dataset_name, data_loader_val in zip(output_folders, dataset_names, data_loaders_val):
        r = inference(
            model,
            data_loader_val,
            dataset_name=dataset_name,
            iou_types=iou_types,
            box_only=cfg.MODEL.RPN_ONLY,
            device=cfg.MODEL.DEVICE,
            expected_results=cfg.TEST.EXPECTED_RESULTS,
            expected_results_sigma_tol=cfg.TEST.EXPECTED_RESULTS_SIGMA_TOL,
            output_folder=output_folder,
        )[0].results['bbox']

        output_tuple[dataset_name] = {}
        output_tuple[dataset_name]['AP'] = r['AP'].item()
        output_tuple[dataset_name]['AP50'] = r['AP50'].item()

        synchronize()
    return output_tuple

def recordResults(args, cfg):
    homeDir = "/home/nprasad/Documents/github/maskrcnn-benchmark"
    model_paths = [cfg.MODEL.WEIGHT] + get_model_paths(join(homeDir, cfg.OUTPUT_DIR))
    output = {}
    for path in model_paths:
        cfg.MODEL.WEIGHT = path
        if "final" in path:
            ite = cfg.SOLVER.MAX_ITER
        elif "no" in path:
            ite = 0
        else:
            ite = int(path.split("_")[1].split(".")[0])
        output[ite] = inf(args, cfg)
    plot(output, cfg)

def get_model_paths(directory):
    onlyfiles = [f for f in listdir(directory) if isfile(join(directory, f))]
    return [join(directory, file) for file in onlyfiles if ".pth" in file]

def main():
    parser = argparse.ArgumentParser(description="PyTorch Object Detection Inference")
    parser.add_argument(
        "--config-file",
        default="/home/nprasad/Documents/github/maskrcnn-benchmark/configs/heads.yaml",
        metavar="FILE",
        help="path to config file",
    )
    parser.add_argument("--local_rank", type=int, default=0)
    parser.add_argument(
        "opts",
        help="Modify config options using the command-line",
        default=None,
        nargs=argparse.REMAINDER,
    )

    args = parser.parse_args()

    cfg.merge_from_file(args.config_file)
    cfg.merge_from_list(args.opts)
    recordResults(args, cfg)

if __name__ == "__main__":
    main()
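
A side note on the iteration parsing in recordResults(): `path.split("_")[1]` splits the whole path, so it raises a ValueError as soon as any directory name contains an underscore. A more defensive sketch, assuming checkpoints are named `model_<iteration>.pth` and `model_final.pth` as the trainer snippet and the "final" check above suggest, parses only the file name. The helper name `iteration_from_checkpoint` is hypothetical.

```python
import os
import re

# Matches file names such as "model_0000050.pth".
_ITER_PATTERN = re.compile(r"model_(\d+)\.pth")


def iteration_from_checkpoint(path, max_iter):
    """Return the iteration encoded in a checkpoint file name.

    "model_final.pth" maps to max_iter; unrecognized names return None
    so the caller can decide how to treat them (e.g. as iteration 0).
    """
    name = os.path.basename(path)
    if name == "model_final.pth":
        return max_iter
    match = _ITER_PATTERN.fullmatch(name)
    if match:
        return int(match.group(1))
    return None
```

Returning None for unknown names, rather than guessing, keeps pretrained weight files from being silently plotted at the wrong iteration.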

In the function recordResults(), MODEL.WEIGHT is modified for each checkpoint, and the accuracies are then plotted. However, the plot shows no change in accuracy over time: training AP50 is 100% from the start, and validation and testing accuracy stay constant throughout. During training, the losses do converge towards 0 on the training set. What do you think could be wrong?

nprasad2021 on 17 Jan 2019

After training the model for different numbers of iterations and then running the script on each result, the accuracies do differ:

Train Model for 100 iterations
Test - Acc. on Training is .95

Train Model for 5000 iterations
Test - Acc on Training is 1.00

However, running the script and evaluating accuracy at each checkpoint yields exactly the same accuracy for iterations 50, 100, 150, etc.

Therefore, I think there is a problem with the way the config is initialized. Does changing the model weight actually take effect after the model has been initialized?

nprasad2021 on 17 Jan 2019

It seems that changing cfg.MODEL.WEIGHT systematically in the script from the comment above does not change which model is actually retrieved by the model build script. Is this correct?

nprasad2021 on 17 Jan 2019

@nprasad2021 you are probably setting OUTPUT_DIR to the path where you trained your model. In that case, you'll always be picking up the last trained checkpoint.

This is by design, so that restarting jobs is easy to support.

So I'd recommend changing the OUTPUT_DIR to a different folder, and passing the MODEL.WEIGHT to be the path to your checkpoints.

fmassa on 18 Jan 2019
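
To make this behaviour concrete, here is a simplified sketch of the resolution logic as described in the comment above. It is not the library's actual code: `resolve_checkpoint` is a hypothetical helper, and it assumes the checkpointer records the most recent checkpoint path in a marker file inside its save directory (the file name `last_checkpoint` is an assumption here).

```python
import os


def resolve_checkpoint(save_dir, requested_weight):
    """Return the checkpoint that would actually be loaded.

    If save_dir already holds a marker from a previous training run,
    that checkpoint wins and requested_weight is ignored; this is what
    makes restarting jobs easy, and what hides MODEL.WEIGHT changes.
    """
    marker = os.path.join(save_dir, "last_checkpoint")
    if os.path.exists(marker):
        with open(marker) as f:
            return f.read().strip()
    return requested_weight
```

Under this assumption, pointing OUTPUT_DIR at a fresh folder means no marker exists, so the path given in MODEL.WEIGHT is the one that gets loaded.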

Thanks, this works!

nprasad2021 on 18 Jan 2019

@nprasad2021, were you able to resolve the "AttributeError: 'list' object has no attribute 'values'" error? I am trying to do the same thing as you, and I get exactly the same error.

I would prefer not to evaluate all the model files later.

muzammil360 on 15 Mar 2019
