Fairseq: [Feature Proposal] Add image captioning example

Created on 26 May 2019 · 5 comments · Source: pytorch/fairseq

Hi fairseq team!

As mentioned in issues #90, #313, and #475, there are plenty of places where vision and language intersect (e.g., image captioning, VQA). I have written an image captioning example based on this excellent fairseq toolkit, and I would like to know whether there is a plan to add an image captioning / text recognition example.

My implementation is in my text-recognition branch; the current structure is just a CRNN with a CTCLoss criterion.
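For illustration, here is a minimal sketch of what a CRNN trained with nn.CTCLoss can look like in PyTorch (the architecture, tensor shapes, and vocabulary size are illustrative assumptions, not the branch's actual code):

```python
import torch
import torch.nn as nn

# Minimal illustrative CRNN: conv features -> BiLSTM -> per-timestep class scores.
# Shapes and vocabulary size are assumptions for this sketch only.
class TinyCRNN(nn.Module):
    def __init__(self, num_classes=37):          # e.g. 36 symbols + 1 CTC blank
        super().__init__()
        self.cnn = nn.Sequential(
            nn.Conv2d(1, 64, 3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),                      # halve H and W
            nn.Conv2d(64, 128, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d((1, None)),      # collapse height, keep width as time axis
        )
        self.rnn = nn.LSTM(128, 128, bidirectional=True)
        self.fc = nn.Linear(256, num_classes)

    def forward(self, x):                         # x: (N, 1, H, W)
        f = self.cnn(x).squeeze(2)                # (N, 128, W')
        f = f.permute(2, 0, 1)                    # (T, N, 128) with T = W'
        out, _ = self.rnn(f)                      # (T, N, 256)
        return self.fc(out)                       # (T, N, num_classes)

crnn = TinyCRNN()
ctc = nn.CTCLoss(blank=0, zero_infinity=True)

images = torch.randn(4, 1, 32, 100)               # dummy batch of grayscale line images
targets = torch.randint(1, 37, (4, 10))            # dummy label sequences (0 is the blank)
target_lengths = torch.full((4,), 10, dtype=torch.long)

log_probs = crnn(images).log_softmax(-1)           # (T, N, C), as required by nn.CTCLoss
input_lengths = torch.full((4,), log_probs.size(0), dtype=torch.long)
loss = ctc(log_probs, targets, input_lengths, target_lengths)
```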

My next step is to add an attention module and a transformer module to the image captioning task, built on fairseq's official implementation modules.

All 5 comments

I'm not aware of any plans to add an image captioning or text recognition example. Feel free to submit a PR!

Together with @cstub, I started to work on an image captioning extension for fairseq. It's still at an early stage, but you can already train transformer-based image captioning models. There's also a simple demo and a pre-trained model available. More details are in the project's README.

I am also interested in combining vision and language.

Both of you re-implemented a CNN-based encoder to get an intermediate feature representation from images: image_captioning_encoder (@zhiqwang) or inception_v3 (@krasserm). This is then fed into fairseq's LSTM- or Transformer-based decoder to generate the captions.

Instead of re-implementing the feature extractor, I am wondering whether anyone has tried to combine fairseq with a Faster-RCNN object-based feature extractor from torchvision or detectron2.
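For concreteness, a rough sketch of how region features might be pulled out of torchvision's Faster R-CNN by reusing its internal sub-modules (the backbone/rpn/roi_heads layout is a torchvision implementation detail and may change between versions, and the shapes here are assumptions, so treat this as a sketch rather than a tested recipe):

```python
import torch
import torchvision

# Sketch only: extract pooled region ("object") features from torchvision's
# Faster R-CNN by calling its internal modules directly (requires torchvision >= 0.13
# for the `weights` argument). detectron2 would need a different extraction path.
model = torchvision.models.detection.fasterrcnn_resnet50_fpn(weights="DEFAULT").eval()

@torch.no_grad()
def region_features(img):                     # img: (3, H, W) float tensor in [0, 1]
    images, _ = model.transform([img])        # resize / normalize into an ImageList
    feats = model.backbone(images.tensors)    # FPN feature maps
    proposals, _ = model.rpn(images, feats)   # region proposals
    pooled = model.roi_heads.box_roi_pool(feats, proposals, images.image_sizes)
    return model.roi_heads.box_head(pooled)   # (num_regions, 1024) feature vectors

tokens = region_features(torch.rand(3, 480, 640))
# `tokens` could then serve as the visual "token" sequence fed to a captioning encoder/decoder.
```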

Hi @adrelino, I am also interested in connecting fairseq and detectron2; perhaps we can collaborate on this?

@adrelino in fairseq-image-captioning you can already use pre-computed features extracted with a Faster-RCNN for training transformer-based captioning models via the command line option --features obj. These are pre-computed for the MS-COCO dataset and split into the Karpathy train, validation, and test sets.

At the moment, I only use these pre-computed features. A later version of fairseq-image-captioning will use a Faster-RCNN directly, and implementations from torchvision or detectron2 are good candidates. This will also allow fine-tuning the object detector together with the image captioning model (which will probably require a larger dataset than MS-COCO). Happy to collaborate on that or accept pull requests.

At the moment, I'm implementing Self-critical Sequence Training for Image Captioning and already have promising results. It took me a while to implement because I had to rewrite the sequence generator so that it can also be used during training, i.e., it supports back-propagation (which the default sequence generator in fairseq does not). It should be on GitHub soon. Update Feb 25, 2020: self-critical sequence training is now implemented.
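For context, the core of self-critical sequence training is a REINFORCE-style objective that uses the reward of the greedily decoded caption as a baseline for captions sampled from the model; this is also why the generator has to stay differentiable with respect to the sampled tokens' log-probabilities. A stripped-down sketch of that loss (reward values, shapes, and masking are simplified assumptions, not the repository's actual code):

```python
import torch

def scst_loss(sample_log_probs, sample_reward, greedy_reward, pad_mask):
    """
    sample_log_probs: (N, T) token log-probabilities of captions sampled from the model
    sample_reward:    (N,)   e.g. CIDEr score of each sampled caption
    greedy_reward:    (N,)   e.g. CIDEr score of the greedy (baseline) caption
    pad_mask:         (N, T) 1.0 for real tokens, 0.0 for padding
    """
    advantage = (sample_reward - greedy_reward).unsqueeze(1)      # (N, 1)
    # REINFORCE with a self-critical baseline: push up captions that beat
    # the greedy decode, push down those that fall short.
    return -(advantage * sample_log_probs * pad_mask).sum() / pad_mask.sum()
```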

Afterwards, I initially planned to implement M2: Meshed-Memory Transformer for Image Captioning, which requires some extensions to the transformer implementation in fairseq, but I'm also open to giving a Faster-RCNN implementation higher priority if you are interested in a collaboration.

Regarding

This is then fed into fairseq's LSTM- or Transformer-based decoder to generate the captions.

fairseq-image-captioning also supports feeding extracted features into a transformer encoder for self-attention on visual "tokens" and then feeding the encoder output into a transformer decoder. Using a transformer encoder can be enabled with the --arch default-captioning-arch command line option.
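Conceptually, this means running self-attention over the projected visual features before the caption decoder attends to them via cross-attention. A bare-bones sketch with plain PyTorch modules (dimensions, layer counts, and the 1024-d input are assumptions; the actual --arch default-captioning-arch model is built from fairseq's own transformer classes, and the causal mask and vocabulary projection are omitted here):

```python
import torch
import torch.nn as nn

d_model = 512

# Project pre-computed visual features (e.g. 1024-d region features) to the model dimension.
visual_proj = nn.Linear(1024, d_model)

# Self-attention over the visual "tokens" (the transformer encoder) ...
encoder = nn.TransformerEncoder(nn.TransformerEncoderLayer(d_model, nhead=8), num_layers=3)
# ... whose output the caption decoder attends to via cross-attention.
decoder = nn.TransformerDecoder(nn.TransformerDecoderLayer(d_model, nhead=8), num_layers=3)

regions = torch.randn(36, 1, 1024)            # (num_regions, batch, feat_dim) dummy features
caption_emb = torch.randn(20, 1, d_model)     # embedded (shifted) target caption tokens

memory = encoder(visual_proj(regions))        # (num_regions, batch, d_model)
decoded = decoder(caption_emb, memory)        # (caption_len, batch, d_model) -> vocab projection
```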
