Flair: Iterating data fetcher for large training data sets

Created on 5 Feb 2019 · 3 comments · Source: flairNLP/flair

When training a text classifier or sequence labeler, the current implementation loads the entire training data set into memory. However, many community members have reported that this requires too much memory when the training data set is very large, as is often the case in text classification (see #457 and #426 for instance).

Task: Develop a solution that does not read everything into memory. The PyTorch DataLoader API might be a good candidate for a solution.
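A minimal sketch of what such a solution could look like, assuming PyTorch's `IterableDataset`/`DataLoader` pair is used: the dataset reads one sentence per line from disk on demand instead of materializing the whole corpus. The `StreamingCorpus` class and file format here are hypothetical, not Flair's actual implementation.

```python
import os
import tempfile
from torch.utils.data import DataLoader, IterableDataset

class StreamingCorpus(IterableDataset):
    """Hypothetical streaming corpus: yields one sentence (line) at a time
    from disk, so memory use is independent of corpus size."""

    def __init__(self, path):
        self.path = path

    def __iter__(self):
        with open(self.path, encoding="utf-8") as f:
            for line in f:
                line = line.strip()
                if line:
                    yield line

# Write a tiny demo corpus to disk.
tmp = tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False)
tmp.write("first sentence\nsecond sentence\nthird sentence\n")
tmp.close()

# DataLoader batches the streamed items without ever loading the full file.
loader = DataLoader(StreamingCorpus(tmp.name), batch_size=2)
batches = [list(b) for b in loader]
os.unlink(tmp.name)
print(batches)  # [['first sentence', 'second sentence'], ['third sentence']]
```

In a real integration, `__iter__` would parse each line into a Flair `Sentence` before yielding it; the batching and shuffling-buffer details are left to the `DataLoader`.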

feature


All 3 comments

In Gensim, the training is streamed, meaning sentences can be a generator, reading input data from disk on-the-fly, without loading the entire corpus into RAM.

This may be considered.
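The Gensim pattern described above can be sketched in a few lines: instead of a plain generator (which is exhausted after one pass), an iterable object re-opens the file in `__iter__`, so every training epoch streams the corpus from disk again. The class name and file layout are illustrative assumptions.

```python
import os
import tempfile

class SentenceStream:
    """Gensim-style restartable stream: re-opens the file on each pass,
    so multiple epochs work without loading the corpus into RAM."""

    def __init__(self, path):
        self.path = path

    def __iter__(self):
        with open(self.path, encoding="utf-8") as f:
            for line in f:
                yield line.split()  # tokenise lazily, one sentence at a time

# Tiny demo corpus on disk.
tmp = tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False)
tmp.write("the quick fox\njumps over\n")
tmp.close()

stream = SentenceStream(tmp.name)
epoch1 = list(stream)
epoch2 = list(stream)  # a bare generator would already be exhausted here
os.unlink(tmp.name)
print(epoch1 == epoch2)  # True: the stream is restartable
```

The key design choice is implementing `__iter__` on a class rather than passing a generator directly, since trainers typically need to iterate the data more than once.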

Yes that's a good idea!

An interesting observation I found on JSONlines.org:

Stream compressors like gzip or bzip2 are recommended for saving space, resulting in .jsonl.gz or .jsonl.bz2 files.

Text editing programs call the first line of a text file "line 1". The first value in a JSON Lines file should also be called "value 1".

This might help, as we are looking for a similar feature in Flair.
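Combining the two suggestions above, a compressed JSON Lines file can also be streamed record by record: `gzip.open` in text mode yields lines lazily, so memory use stays flat regardless of file size. The field names (`text`, `label`) are placeholder assumptions, not a Flair format.

```python
import gzip
import json
import os
import tempfile

# Write a tiny .jsonl.gz demo file (hypothetical record schema).
path = os.path.join(tempfile.mkdtemp(), "corpus.jsonl.gz")
with gzip.open(path, "wt", encoding="utf-8") as f:
    f.write(json.dumps({"text": "hello", "label": "greeting"}) + "\n")
    f.write(json.dumps({"text": "bye", "label": "farewell"}) + "\n")

def stream_jsonl_gz(p):
    """Yield one decoded record at a time; decompression is streamed,
    so the whole file is never held in memory."""
    with gzip.open(p, "rt", encoding="utf-8") as f:
        for line in f:
            yield json.loads(line)

labels = [rec["label"] for rec in stream_jsonl_gz(path)]
print(labels)  # ['greeting', 'farewell']
```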
