I was wondering whether there is any planned feature to improve data loading performance. The standard PyTorch Dataset and DataLoader implementations rely on random I/O, which can be a bottleneck for image classification with small inputs, even with an M.2 SSD. It's possible to convert the dataset into a binary data file such as LMDB, but I was wondering whether there is a standard, or perhaps better, way in PyTorch, ideally supported out of the box by a high-level training framework like Ignite.
Or is manually moving the data into RAM the best option, provided it fits into memory?
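For data that does fit into memory, one way to keep it in RAM is a thin caching wrapper around the dataset. The following is only a minimal sketch: `CachedDataset` is a hypothetical name, not part of PyTorch or Ignite, and it assumes the wrapped object is a map-style dataset (supports `len()` and integer indexing) whose items are deterministic, so any random augmentation would have to be applied after the cache.

```python
class CachedDataset:
    """Hypothetical wrapper: keeps every item of a map-style dataset in RAM
    after it has been loaded once."""

    def __init__(self, dataset):
        self.dataset = dataset
        self._cache = {}  # index -> already loaded item

    def __len__(self):
        return len(self.dataset)

    def __getitem__(self, index):
        if index not in self._cache:
            # First access: pay the disk read and decode cost once.
            self._cache[index] = self.dataset[index]
        return self._cache[index]
```

Note that with multi-process DataLoader workers each worker process holds its own copy of the dataset object, so a cache filled inside a worker is not shared with the other workers or with the main process.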
@pkdogcom we haven't planned any features for that. If you have more details, we can discuss what could be introduced into Ignite.
@pkdogcom actually, NVIDIA/DALI may be interesting to test to see whether it improves dataflow performance.
@vfdev-5 Thanks for the information. In a second test with more data processing threads, I was able to cut the data pre-processing time substantially, which means that with an M.2 SSD the CPU can be the bottleneck rather than just the I/O.
I did some quick research on DALI, and it seems to support LMDB/TFRecord-style data loading but not PyTorch-style random access to image files. Although I believe a PyTorch training pipeline could still benefit from DALI's data augmentation support, there seems to be a non-trivial amount of work involved in finding the right recipe for integrating DALI with a PyTorch pipeline, especially with Ignite.
I'll keep an eye out for other solutions and will let you know if I find anything useful.
It turns out that with a fast disk (such as an M.2 SSD), the bottleneck of the data flow will most likely be decoding images on the CPU when using the default PyTorch pil_loader. A quick and effective fix is to use a faster image decoder in place of pil_loader. I've tried jpeg4py, which is a wrapper around libjpeg-turbo, and in my experiments I was able to reduce the data pre-processing overhead from 0.6s per batch, on top of the 0.2s model forward/backward time, to 0s: it is now completely hidden by the model processing time.
FYI, here is the code for my image loader:
import imghdr
import logging

import jpeg4py as jpeg
from PIL import Image


def fast_img_loader(path):
    with open(path.encode('utf-8'), 'rb') as f:
        # Use (wrapper of) libjpeg-turbo for faster JPEG decode
        try:
            if imghdr.what(f) == 'jpeg':  # Test image format by file prefix
                img = jpeg.JPEG(f).decode()  # Decode as 'RGB' by default
                return Image.fromarray(img)
        except Exception as e:
            logging.warning('Failed to decode image {} as jpeg: {}'.format(path, e))
        # Fall back to PIL image loader for non-JPEG images or in case of exception
        f.seek(0)  # Rewind in case the failed JPEG decode moved the read position
        img = Image.open(f)
        return img.convert('RGB')
Of course, you will need to install libjpeg-turbo and jpeg4py (with pip).
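One caveat for newer Python versions: the imghdr module was deprecated in Python 3.11 and removed in 3.13, so the format test in the loader needs a replacement there. A minimal sketch of one option is to compare the file prefix against the JPEG signature directly. The helper name `looks_like_jpeg` is hypothetical, and this check is looser than what imghdr did: it only confirms that the stream starts like a JPEG, not that it decodes.

```python
import io

# JPEG streams begin with the SOI marker (FF D8) followed by FF, the first
# byte of the next marker.
JPEG_MAGIC = b"\xff\xd8\xff"


def looks_like_jpeg(f):
    """Return True if the binary file object `f` starts with the JPEG signature.

    The read position of `f` is restored before returning.
    """
    pos = f.tell()
    header = f.read(len(JPEG_MAGIC))
    f.seek(pos)
    return header == JPEG_MAGIC


# Usage with in-memory data standing in for real files:
jpeg_like = io.BytesIO(b"\xff\xd8\xff\xe0" + b"\x00" * 8)
png_like = io.BytesIO(b"\x89PNG\r\n\x1a\n")
print(looks_like_jpeg(jpeg_like), looks_like_jpeg(png_like))
```

In the loader, `imghdr.what(f) == 'jpeg'` could then be swapped for `looks_like_jpeg(f)`, with the PIL fallback still catching anything that fails to decode.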
@pkdogcom yes, it's true that torchvision backed by Pillow is not the fastest option for data loading/processing. Have you tried Pillow-SIMD or OpenCV? OpenCV should use libjpeg-turbo internally, I think.
I've been using Pillow-SIMD and OpenCV, and libjpeg-turbo still gives a big performance boost. I will need to check whether OpenCV can be compiled with libjpeg-turbo.
It seems that from 3.4.2 onward OpenCV defaults to libjpeg-turbo instead of libjpeg. So with newer versions of OpenCV it might be easier to rely on OpenCV (with most image libraries enabled at compile time) as an efficient all-round image loading library.
@pkdogcom I'll close this issue. Feel free to reopen it if we can improve this from the ignite side.
Sure. What about implementing an OpenCV/libjpeg-turbo image loader (perhaps converting to PIL inside the loader for compatibility with downstream processing) and making users more aware of this issue?
@pkdogcom IMO the dataflow stuff (data reading, augmentations, batching, etc.) is out of scope for ignite, as there are plenty of libraries that handle parts of it. I plan to provide a contrib handler soon for some basic time profiling: batch creation, time spent in handlers, and time spent in the processing function. In the notes for this future handler we can mention these accelerated ways of reading images in case of a dataflow bottleneck. Another thing that may improve the dataflow is some sort of in-memory caching of loaded data.
Agreed. I think having some of these best practices either implemented or mentioned in the contrib module should be enough.
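Until such a handler exists, the split between batch creation time and processing time can be measured by hand. The following is a rough sketch in plain Python, not the planned Ignite handler: `profile_loop` is a hypothetical helper, `batches` can be any iterable (for example a DataLoader), and `process_fn` stands in for the training step.

```python
import time


def profile_loop(batches, process_fn):
    """Run `process_fn(batch)` over `batches` and return a tuple of
    (seconds spent waiting for batches, seconds spent in process_fn, batch count).
    """
    data_time = 0.0
    process_time = 0.0
    count = 0
    iterator = iter(batches)
    while True:
        t0 = time.perf_counter()
        try:
            batch = next(iterator)  # blocks while the loader prepares a batch
        except StopIteration:
            break
        t1 = time.perf_counter()
        process_fn(batch)
        t2 = time.perf_counter()
        data_time += t1 - t0
        process_time += t2 - t1
        count += 1
    return data_time, process_time, count
```

If the waiting time is close to zero, data loading is already hidden behind the model; if it dominates, the loader is the bottleneck. When the step runs on a GPU, CUDA calls return asynchronously, so `process_fn` would need to synchronize (for example with `torch.cuda.synchronize()`) for the split to be meaningful.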
FWIW you can build Pillow-SIMD against libjpeg-turbo and this greatly improves its performance without having to abandon torchvision: https://docs.fast.ai/performance.html#installation