Vision: Implementing image reading functions

Created on 29 Jul 2019  路  30Comments  路  Source: pytorch/vision

Proposal

We must have a function that reads images and converts it to a tensor in order to implement C++ datasets. I have written the said function here with OpenCV. But using OpenCV just to read images is an overkill.

I suggest that we implement functions that read a specific format (for example a function that reads png images using libpng). We can cover the formats that are used in the datasets that will be implemented.

enhancement help wanted io needs discussion

Most helpful comment

what should be the format that we will return from the image loading?
Should it be CxHxW or HxWxC? - match image library format by default (e.g. PIL) with option to specify the format

Should it be uint8 in 0-255 or float in 0-1? This should be uint8. Eventually we'll support quantized/low precision training that won't be float. Also, type conversions are ideally abstracted away / automatically done only when needed.

What formats do we want to support? JPEG / PNG / others? - by default support as many formats as possible, but allow for specific formats to be specified as an optimization.

All 30 comments

I think this goes in the right direction.

A few things we should also discuss:

  • what should be the format that we will return from the image loading?

    • Should it be CxHxW or HxWxC?



      • The images are natively stored in HxWxC, but the default format we use in PyTorch is CxHxW



    • Should it be uint8 in 0-255 or float in 0-1?



      • we had plenty of complaints about weird handling of to_tensor and to_pil_image in many cases, we should try as much as possible to be consistent.



    • Can we make it work on both Windows and Linux / OSX?

    • if yes, then it would be a candidate for potentially replacing PIL in the future

  • should we have a separate function for reading and decoding?

    • we might want to have in-memory encoded images as a bytes string, and just decode it when needed

    • this means we might want to have a decode_image function, similar to TF, which is also supported by PIL

  • What formats do we want to support? JPEG / PNG / others?
  • What about images encoded with other data types, like int16 or float?

Here is what I think:

  • Input image: I think that it's better if we use 0-1 float values in our tensors because they are eventually turned into float when training and such. This way other data types like int16 and float can be turned into float. Also I have currently implemented CxHxW. What is the point of keeping HxWxC?
  • Separating reading and decoding: I haven't used tensorflow so I don't know what are the benefits of separating reading and decoding of images.
  • Image formats: In the link that you provided tensorflow supports BMP, GIF, JPEG, PNG. I think that for now we should support formats that are enough for reading the image of datasets that are implemented.

I gave it a second thought, and I'm leaning towards the following:

Image format

float 0-1 vs original format

There have been a lot of feature requests from users about extending support for to_tensor and to_pil_image to handle particular edge cases. Those edge cases exist because of the implicit assumption that float should be in 0-1. But this breaks some things when people want to use uint16 images, which have larger range of values.

The original reason why we convert the images to 0-1 format is because we used to feed them directly as is to the network (with sigmoid / tanh activations), so having the image be in 0-1 would help training.

But since a while, we have been performing an image normalization step which consists of dividing by the std of the values. This means that, if we use such normalization, it doesn't matter if the float tensor is in 0-1 or 0-255, all it matters is that the normalization values are compatible with how the images are loaded.

This makes me think that we might want to have the images be returned the original range, be it 0-255 or larger for larger types.
I think we should also return the data in the standard type they have been stored (uint8 for most cases), because it uses less space and many operations can be efficiently performed on uint8 types.

CHW x HWC

I'm not 100% clear on this one. Maybe stick with HWC and let a transform perform the conversion?

reading and decoding

The benefits of separating reading and decoding is that we have more flexibility.
If the API only supports passing a file path, then we can't decode a bytestream that lives (encoded) in memory. I think this is an important requirement.

Image Formats

Agree that we should support the formats that are necessary for reading the images of the datasets that we implement.

I'm far from an expert but balancing the desire

  • to have something that is "standard" (no dependency on file format, backend library, ...) as output by default,
  • to be efficient (i.e. have something that is ideally "zero copy" after returning from the reading library), not lose information in preprocessing.

The cleanest way perhaps would be having a "one size fits all" recommended image reading function that returns a fixed dtype / layout. And then maybe "raw_read_someformat". A "raw" option could be another way, but I must admit I cringe thinking about the return dtype depending on an option.

I would probably be OK with non-contiguous layouts to be returned if it reduces the copying going on - as Francisco said, we usually have a normalization step and that would typically deal with non-contiguous inputs.

So adding it all together:

  • Supporting the image formats of our datasets.
  • Having a simple function that reads files into memory.
  • Having a function that reads raw image from memory and keeps type and size and HWC. Also having a function that reads all the formats desired and returns a fixed type and size but actually using the raw functions to read the datasets.

@fmassa @t-vi does this make sense?
I can start working on this after all of this is 100%.

OK. So these are the results and I have started implementing. Closing this.

what should be the format that we will return from the image loading?
Should it be CxHxW or HxWxC? - match image library format by default (e.g. PIL) with option to specify the format

Should it be uint8 in 0-255 or float in 0-1? This should be uint8. Eventually we'll support quantized/low precision training that won't be float. Also, type conversions are ideally abstracted away / automatically done only when needed.

What formats do we want to support? JPEG / PNG / others? - by default support as many formats as possible, but allow for specific formats to be specified as an optimization.

I'm reopening this issue until we have the PRs merged in torchvision.

And I agree with @jguerin-fb comments, thanks for the feedback!

Hi everyone,

I am the author of ticket #1595 that seems to as for a subset of what you are discussing here and therefore is redundant.

I mostly agree with @fmassa and @jguerin-fb comments on the issue regarding separating decode and read functions and keeping the native format.

I would also suggest that we have the layout format of the read/decode match the pytorch default . So I suggest we keep HxWxC as the output, here is why :

  • It will ease the learning curve for beginners getting started with torchvision.
  • This will allow for any downstream operation to be written in standard torch and help us with #1375.
  • Not depending of the image library format might ease the path to production for models built with torchvision especially once we make progress with #1375. The data pipeline that feed's the models once they are in production might be built in another programming language ( for efficiency, cost, legacy reasons etc...). Binding the input format to a specific python library would not help data engineering teams put models to production.

What do you think ?

One last node, having a separate decode function would also help with production use-cases especially if it's torchscriptable.

Thanks

EDIT: I would like to help moving this issue forward. As I am new to the project. Can any of you give some pointers on where to start ?

Since this is reopened I should mention the related PR. This is the PR for this discussion that implements png format. It has been a long time since I worked on TorchVision.

Hi @r-zenine

@ShahriarSS has indeed sent a PR with PNG decoding, which he pointed out in the comment (thanks again @ShahriarSS for the PR).

I didn't have enough bandwidth to properly review the PR and get it merged, also because I wanted to get some extra feedback on people working on production to decide on convention.

I agree with everything @jguerin-fb said.

Next steps

Merge PNG PR from @ShahriarSS

I think a first step would be to build on top of @ShahriarSS PR, but instead adding the decoding function as an operator exposed via torchscript, like the video reading operator and others.

We would also need to check how to package the libpng so that it's visible to users building from source, or just downloading from pip / conda. This is something we will need to figure out.

Then, we would need a full set of tests comparing the result of the decoded image with PIL, to check that we don't regress in any aspect.

Add JPEG support

I think a natural next step would be to add support for JPEG as well. It's a widely used format and without it we can't do much. We should use libjpeg (and maybe link against libjpegturbo).

uint8 interpolation on HWC

PyTorch only currently supports interpolation with CHW, float tensors. In order to properly use HWC uint8 for data augmentation / transformation, we need to add support for it (maybe via memory_format). This is something that should probably go in PyTorch at some point, but could be kept in torchvision initially.

@r-zenine @jguerin-fb thoughts?

NHWC is now the recommended format for cuDNN performance (https://docs.nvidia.com/deeplearning/sdk/pdf/Deep-Learning-Performance-Guide.pdf). Regardless, I agree there should be a memory_format option. For example, NCHW used to be the optimal format for cuDNN and was the default for Caffe2.

I also agree that decoding should be its own operator. For example, it's not uncommon to be given encoded image bytes in a request. Moreover, in addition to having specific libraries for jpeg and png, I think we should have it optional to include something like imagemagik to handle the long tail of formats.

Hi Everyone,
Sounds like a good plan.

@fmassa , @jguerin-fb I could start working on the jpeg support while taking into account your remarks regarding the png PR.

I propose the following :

  • Link against libjpeg-turbo. I am not an expert in any way so correct me if I am wrong. I read that it leverages SIMD Ops which might result in better performance. I could also link against libjpeg your call !
  • I would keep reading and decoding separate and maybe provide a helper function to do both.
  • I would provide unit test to compare with PIL.
  • For the first implementation, I should preserve the libjpeg format.

I can dedicate some time to this every morning.

@fmassa , @jguerin-fb Any thoughts ?

Hi @r-zenine ,

I think this is fine, but I would start with the PNG support, so finishing on top of the PR from @ShahriarSS would be best. We would need to make some tests to ensure that no memory leaks are present, and figure out how to get the binaries shipped with libpng.

Maybe initially just implement a function to read from a memory (via a Tensor), and then we can add other functions afterwards that handles opening the file etc

Hi @fmassa,

I'll do that. @ShahriarSS can I add to your PR ?

I don't know what adding means in this context but right now if the PR gets reviewed I won't have the time to finish it (I'm very busy right now) so feel free to build on top of it. Let me know if you have any problems or maybe I can review the PR (if @fmassa agrees).

@ShahriarSS That's what I meant. Thanks.

Sure, we will be reviewing the PR, and if @ShahriarSS could also help reviewing it it would be great.

One thing that was missing in the original PR which is absolutely needed is python bindings and tests (correctness and memory tests).

Hi @fmassa,

I just read the PR and I think I will build on top of it. Just to summarize, here is my understanding of the main items to do ( ordered by priority ) :

  1. Generate Python Bindings for the read_png function.
  2. Write a test suite in python to compare with PIL and fix implementation if necessary.
  3. Once implementation is compliant with PIL. Check code to ensure memory safety.
  4. Figure out a way to ship libpng with the binaries.
  5. Add helpers functions to read files etc...

Did I miss anything ?

Thanks

@r-zenine this sounds good to me.
In order to generate the python bindings, you can have a look at https://github.com/pytorch/vision/blob/2e8bcf8ba48ea94e30a362e1a57e5e3de8dbd2d0/torchvision/csrc/vision.cpp#L42-L50
you can either directly add support for it in the torchvision ops namespace, or create your own registry, like we did for video in https://github.com/pytorch/vision/blob/2e8bcf8ba48ea94e30a362e1a57e5e3de8dbd2d0/torchvision/csrc/cpu/video_reader/VideoReader.cpp#L492-L500

I'd recommend trying to add it to the torchvision ops registry, instead of creating another registry.

Also, I would change the function signature of read_png to receive a torch::Tensor as input, where the tensor contain the raw bytes, instead of passing a void*.

Let me know if you have questions.

Hi @fmassa ,
The libpng project seems to be using cmake ( see here ).

We could add it as a sub-module and use the same strategy that is used in torch the ship libpng with the package.

What do you think ?

Hi everyone,

Just a quick update message.

I had some time to work on the issue this week-end. The code is available here.

I've build upon @ShahriarSS initial code. I changed the decode_png function to take a Tensor and I've added a read_png_from_file function. I registered the OP and wrote python bindings in torchvision.ops.

I need to write the unit tests. Once it's done, I'll submit the code for an initial review.

I am not happy with error handling and the quality of my error messages. Right now, if anything goes wrong I return an empty tensor with no additional information that could help the user debug. Am I allowed to used exceptions on the C++ ops code, I have not seen that used in the other ops in torchvision?

Thanks

Hi @r-zenine

We could add it as a sub-module and use the same strategy that is used in torch the ship libpng with the package

Adding libpng as a submodule is one option, but I would prefer if we could avoid it if possible (or else the users will have to build our own version of libpng). My thinking was to maybe use an approach similar to what Pillow do (for example https://pillow.readthedocs.io/en/latest/installation.html#building-from-source , I haven't looked in details yet).

Right now, if anything goes wrong I return an empty tensor with no additional information that could help the user debug. Am I allowed to used exceptions on the C++ ops code, I have not seen that used in the other ops in torchvision?

Yes, you can use torch exceptions, which gets properly translated into Python, see https://github.com/pytorch/vision/blob/227027d5abc8eacb110c93b5b5c2f4ea5dd401d6/torchvision/csrc/cpu/nms_cpu.cpp#L8-L11 for example (although AT_ASSERTM is deprecated in favor of TORCH_CHECK)

I need to write the unit tests. Once it's done, I'll submit the code for an initial review

Awesome, looking forward to the PR!

Hi @fmassa,

Thanks for the reply. I'll finish the PR, ask for an initial review then look at how Pillow is integrating libpng.

Have a good day!

Thanks a lot @r-zenine !

Hi everyone,

I am struggling a bit with the CI pipeline ( for Conda and windows ) on the Pr #1632.
Could anyone help me figure out how to fix it ?

Thanks

@r-zenine I'll have a look now

@fmassa Thanks !

@fmassa can we close this too ?

Yes. We have implemented JPEG and PNG support, so I believe we can be closing this. For requests for other formats we can track it in different issues.

Was this page helpful?
0 / 5 - 0 ratings

Related issues

bodokaiser picture bodokaiser  路  3Comments

martinarjovsky picture martinarjovsky  路  4Comments

chinglamchoi picture chinglamchoi  路  3Comments

varagrawal picture varagrawal  路  3Comments

Linardos picture Linardos  路  4Comments