Hi fairseq team!
As mentioned in issues #90, #313 and #475, there are plenty of places where vision and language intersect (e.g., image captioning, VQA). I have written an image captioning example based on this excellent fairseq toolkit, and I would like to know whether there is a plan to add an image captioning / text recognition example.
My implementation is in my text-recognition branch; the current structure is only a CRNN with a CTCLoss criterion.
My next plan is to add an attention module and a transformer module for the image captioning task, based on fairseq's official modules.
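Roughly, the criterion part boils down to something like the following sketch, which just uses torch.nn.CTCLoss directly (the tensor shapes and the blank index are illustrative assumptions, not the exact code in the branch):

```python
import torch.nn as nn

# Sketch of a CTC criterion for CRNN text recognition (shapes are illustrative):
#   log_probs: (T, N, C) log-softmax outputs of the CRNN over the character vocabulary
#   targets:   (N, S) padded target character indices
ctc = nn.CTCLoss(blank=0, zero_infinity=True)  # index 0 assumed to be the blank symbol

def ctc_criterion(log_probs, targets, input_lengths, target_lengths):
    # CTC marginalizes over all alignments between the T output frames
    # and the (shorter) target character sequence.
    return ctc(log_probs, targets, input_lengths, target_lengths)
```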
I'm not aware of any plans to add an image captioning or text recognition example. Feel free to submit a PR!
Together with @cstub, I started to work on an image captioning extension for fairseq. Still early-access but you can already train transformer-based image captioning models. There's also a simple demo and a pre-trained model available. More details in the project's README.
I am also interested in combining vision and language.
Both of you re-implemented a CNN-based encoder to get an intermediate feature representation from images: image_captioning_encoder (@zhiqwang) or inception_v3 (@krasserm). This is then fed into fairseq's LSTM/Transformer-based decoder to generate the captions.
Instead of re-implementing the feature extractor, I am wondering if anyone has ever tried to combine fairseq with a Faster-RCNN object-based feature extractor from torchvision or detectron2?
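For torchvision, I was thinking of something along these lines: just a sketch of pulling pooled region features out of the pre-trained detector, with the exact hook-up into fairseq still open (names and the 1024-dim output are assumptions based on the default box head):

```python
import torch
import torchvision

# Pre-trained Faster-RCNN from torchvision; we only want its pooled region features.
detector = torchvision.models.detection.fasterrcnn_resnet50_fpn(pretrained=True).eval()

@torch.no_grad()
def extract_region_features(image):
    """image: 3xHxW float tensor in [0, 1]; returns (num_boxes, 1024) region features."""
    images, _ = detector.transform([image], None)           # resize / normalize
    features = detector.backbone(images.tensors)            # FPN feature maps
    proposals, _ = detector.rpn(images, features, None)     # region proposals
    pooled = detector.roi_heads.box_roi_pool(features, proposals, images.image_sizes)
    return detector.roi_heads.box_head(pooled)              # per-region feature vectors
```

These region features could then be fed to a fairseq encoder/decoder in place of the CNN grid features.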
Hi @adrelino, I am also interested in connecting fairseq and detectron2; maybe we can cooperate on this?
@adrelino in fairseq-image-captioning you can already use pre-computed features extracted with a Faster-RCNN for training transformer-based captioning models by using the command line option --features obj. These are pre-computed for the MS-COCO dataset and split into Karpathy train, validation and test sets.
At the moment, I only use these pre-computed features. A later version of fairseq-image-captioning will then use a Faster-RCNN directly and implementations from torchvision or detectron2 are good candidates. This will also allow fine-tuning the object detector together with the image captioning model (which will probably require a larger dataset than MS-COCO). Happy to collaborate on that or accept pull requests.
At the moment, I'm implementing Self-critical Sequence Training for Image Captioning and already have promising results. It took me a while to implement because I had to re-write the sequence generator so that it can also be used during training, i.e. it supports back-propagation (which the default sequence generator in fairseq does not). It should be on GitHub soon. Update Feb 25, 2020: self-critical sequence training is now implemented.
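The core of the training step is the standard self-critical (REINFORCE-with-baseline) objective; leaving out the caption sampling and the CIDEr scoring, it looks roughly like this sketch (names are illustrative, not the repo's API):

```python
import torch

def scst_loss(sample_logprobs, sample_rewards, greedy_rewards):
    """Self-critical sequence training loss (Rennie et al., 2017), sketched.

    sample_logprobs: (batch,) summed log-probs of sampled captions, produced by
                     the back-propagation-capable sequence generator.
    sample_rewards:  (batch,) e.g. CIDEr scores of the sampled captions.
    greedy_rewards:  (batch,) CIDEr scores of greedily decoded captions (baseline).
    """
    advantage = sample_rewards - greedy_rewards  # reward relative to the greedy baseline
    # REINFORCE with baseline: increase the log-prob of samples that beat the baseline.
    return -(advantage.detach() * sample_logprobs).mean()
```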
Afterwards, I initially planned to implement M2: Meshed-Memory Transformer for Image Captioning, which requires some extensions to the transformer implementation in fairseq, but I'm also open to giving a Faster-RCNN implementation higher priority if you are interested in a collaboration.
Regarding
This is then fed into fairseq's LSTM/Transformer-based decoder to generate the captions.
fairseq-image-captioning also supports feeding extracted features into a transformer encoder for self-attention on visual "tokens" and then feeding the encoder output into a transformer decoder. Using a transformer encoder can be enabled with the --arch default-captioning-arch command line option.
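Conceptually, that encoder does something like the following plain-PyTorch sketch (illustrative class name and dimensions, not the actual fairseq module):

```python
import torch.nn as nn

class VisualTokenEncoder(nn.Module):
    """Self-attention over region features before the caption decoder (sketch)."""

    def __init__(self, feat_dim=2048, model_dim=512, nhead=8, num_layers=3):
        super().__init__()
        self.proj = nn.Linear(feat_dim, model_dim)  # map region features to model dim
        layer = nn.TransformerEncoderLayer(d_model=model_dim, nhead=nhead)
        self.encoder = nn.TransformerEncoder(layer, num_layers=num_layers)

    def forward(self, region_feats):
        # region_feats: (num_regions, batch, feat_dim), one visual "token" per region
        x = self.proj(region_feats)
        return self.encoder(x)  # attended visual tokens, consumed by the transformer decoder
```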