Add a flag to the train script allowing the user to recache labels.
I've made a script to convert my labels to the coco correct format, the first time i've lunched the script i made mistakes and the labels were wrong.
After fixing them and displaying them correctly using an external script i relaunched the train but the labels were still wrong.
Turns out that even if the label files are changed the script will still load the old cached .npy and obviously the output is going to be incorrect.
At the moment the only solution is to manually delete the *.npy file and relaunch train.py
Maybe calling train.py --force-recache or something like this could be a solution to this potential problem. Or at least add to the documentation that this error can occur.
Tested on Windows, don't know if this issue happens on Linux too (i suppose it does).
Hello @lorenzomammana, thank you for your interest in our work! Please visit our Custom Training Tutorial to get started, and see our Jupyter Notebook , Docker Image, and Google Cloud Quickstart Guide for example environments.
If this is a bug report, please provide screenshots and minimum viable code to reproduce your issue, otherwise we can not help you.
If this is a custom model or data training question, please note that Ultralytics does not provide free personal support. As a leader in vision ML and AI, we do offer professional consulting, from simple expert advice up to delivery of fully customized, end-to-end production solutions for our clients, such as:
For more information please visit https://www.ultralytics.com.
@lorenzomammana yes, this is true. The labels are saved in *.npy for faster loading in the future, but if you change the underlying *.txt file labels it will not recache. There is no easy way to detect changes without loading all of the text files. Simply deleting the *.npy file will fix the issue as you've found.
@lorenzomammana or maybe there might be a way to check for necessary recaching. If we had a function that could quickly read the total bytes in a directory, then save that byte-count in the *.npy, a difference there could trigger a recaching?
import os
dir = '../coco/labels/val2017/*.*'
sum(os.path.getsize(f) for f in glob.glob(dir) if os.path.isfile(f))
I had exactly the same idea honestly! I can't test your code atm but it does look right to me, are you going to publish the code or do i need to make a pull request?
@lorenzomammana the entire label caching and loading region of the code needs a rework really. It would be ideal to have the *.npy files be stored dictionaries rather than lists, so that labels could be loaded explicitly like labels['image_134'] rather than labels[0], which may be more failure prone since it is not actually checking the filename but dependent on order, and then the dictionary could include the shapes information as well, which is currently saved in a seperate *.shapes files, which is again order based.
I might try to do this later on this week, it's a low priority item but would be good to have for the future, as users have run into issues in all the sections above (changing labels while not recaching, shapes files issues, and also having extra images not used during training (i.e. mixing dataset images in a single folder but only calling some with a train.txt), which is not possible now but the dictionary npy would solve I think).
EDIT: If you'd like to try to do this that would be great. I'd recommend creating a new function in datasets.py to handle this unified *.npy functionality, with integrated change-detection as we mentioned above. Unifying the npy and shapes files would be nice also, one less complication and one less file out there confusing people.
I think this is a good solution for change detection. Once the single npy file is produced for a dataset, subsequent trainings will scan the images and labels, sum up their bytes into a "hash" value which is stored in the actual npy file as well. If the values are equal, the npy is used to load the labels and shapes. If they are different, the npy is deleted and the caching process is repeated.
def get_hash(files):
# Returns a single hash value of a list of files
return sum(os.path.getsize(f) for f in files) # faster
# return sum(os.path.getsize(f) for f in files if os.path.isfile(f)) # slower, more robust
# Get hash
h = get_hash(self.label_files + self.img_files)
@lorenzomammana I just pushed a major update https://github.com/ultralytics/yolov5/commit/520f5de6f0febf30dd3545793f528c3478c18299 that should resolve all label caching issues. Labels and images are scanned together the first time a dataset is trained, and cached into a labels.cache file now which contains a dictionary of image names, shapes, labels and a unique hash for the cached dataset. Images are scanned for corruption now also as part of the process. All subsequent trainings retrieve the current hash value of the dataset, and if it matches the cached value, then the cached file is used, else the entire dataset is recached and a new file saved. Please try out the changes and let me know if this resolves your issue!
I'm going to make the big merge from your latest updates to my fork this morning and see if everything is working properly!
All right switching between datasets will force label recaching, if the .cache is missing file is created and loaded properly.
Changing a single digit in a label file will alter the hash and force recaching, seems like it's working perfectly.
Tested on Ubuntu 18.04, single tesla V100 with multiple datasets, single dataset, mosaic and multi scale.
@lorenzomammana great!! Thanks for checking.
The new caching process also verifies each image with PIL img.verify() to help weed out any problems, which is a nice addition. I want to add additional checks, such as rejecting images based on size constraints, too small or too large, etc. Are there any other checks you think should be run on images or labels during caching?
@glenn-jocher At the moment i'm not thinking to any particular checks for images except the size constraint as you mentioned.
For the labels I'm not actually sure what happens if the label width or height exceeds the size of the image, I'm sure that some bounding boxes in my dataset actually exceed the image size (x_center and y_center still in [0, 1]) and they are not detected as errors, i'm also quite confident that this shouldn't be a problem for yolo regression but maybe it's worth mention.
@lorenzomammana ok. size checks have been integrated now into caching. Will remove TODO label and close now as original issue appears resolved!