When training a text classifier or sequence labeler, the current implementation loads the entire training data set into memory. However, many community members have reported that this requires too much memory when the training data set is very large, as is often the case in text classification (see #457 and #426 for instance).
Task: Develop a solution that does not read everything into memory. The PyTorch DataLoader API might be a good candidate.
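One possible shape for such a solution is sketched below. Everything here is an assumption for illustration, not Flair's actual corpus API: the class name, the file name train.tsv, and the one-example-per-line label/text TSV format are all hypothetical. The idea is a torch.utils.data.IterableDataset that yields one example at a time, so the DataLoader only ever holds a batch in memory.

```python
# Minimal sketch (hypothetical file format, not Flair's API): stream a
# large labeled corpus from disk with a PyTorch IterableDataset.
from torch.utils.data import DataLoader, IterableDataset

class StreamingCorpus(IterableDataset):
    """Yields (label, text) pairs one line at a time from a TSV file."""

    def __init__(self, path):
        self.path = path  # hypothetical file: one "label<TAB>text" pair per line

    def __iter__(self):
        with open(self.path, encoding="utf-8") as f:
            for line in f:
                label, text = line.rstrip("\n").split("\t", 1)
                yield label, text

# Only batch_size examples are materialized at a time.
loader = DataLoader(StreamingCorpus("train.tsv"), batch_size=32)
for labels, texts in loader:
    pass  # forward pass / training step goes here
```

One caveat with this pattern: with num_workers > 0, a plain IterableDataset like this would be replayed once per worker, so per-worker sharding (e.g. via torch.utils.data.get_worker_info()) would be needed.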
In Gensim, training is streamed: sentences can be a generator that reads input data from disk on the fly, without loading the entire corpus into RAM.
This approach may be worth considering here.
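For illustration, here is a corpus object in the spirit of Gensim's streaming pattern; the file name and the one-tokenized-sentence-per-line format are assumptions, not anything specified in this issue:

```python
# Sketch of Gensim-style streaming: a re-iterable corpus object, so each
# training epoch re-reads the file instead of caching it in RAM.
class StreamedSentences:
    def __init__(self, path):
        self.path = path  # hypothetical file: one tokenized sentence per line

    def __iter__(self):
        with open(self.path, encoding="utf-8") as f:
            for line in f:
                yield line.split()

# Gensim accepts any such iterable of token lists, e.g.:
# gensim.models.Word2Vec(sentences=StreamedSentences("corpus.txt"))
```

A class (rather than a bare generator) is used so the corpus can be iterated more than once, which multi-epoch training requires.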
Yes, that's a good idea!
Stream compressors like gzip or bzip2 are recommended to save space, resulting in .jsonl.gz or .jsonl.bz2 files.
This might help, as we are looking for a similar feature in Flair.
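As a minimal sketch of consuming such files (the file name train.jsonl.gz is hypothetical), Python's standard gzip module can decompress and parse the stream one record at a time:

```python
# Sketch: iterate over a compressed .jsonl.gz file record by record,
# without decompressing it to disk or loading it fully into memory.
import gzip
import json

def read_jsonl_gz(path):
    # "rt" opens the gzip stream in text mode, so we can read line by line
    with gzip.open(path, "rt", encoding="utf-8") as f:
        for line in f:
            yield json.loads(line)

for record in read_jsonl_gz("train.jsonl.gz"):  # hypothetical file name
    pass  # handle one example at a time
```

The same pattern works for .jsonl.bz2 files via the standard bz2 module's bz2.open.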