When training a text classifier or sequence labeler, the current implementation loads the entire training data set into memory. However, many community members have reported that this requires too much memory when the training data set is very large, as is often the case in text classification (see #457 and #426 for instance).
Task: Develop a solution that does not read everything into memory. The PyTorch DataLoader API might be a good candidate.
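One possible shape for such a solution is sketched below. Everything here is an assumption for illustration, not Flair's actual corpus API: the class name, the file name train.tsv, and the one-example-per-line label/text TSV format are all hypothetical. The idea is a torch.utils.data.IterableDataset that yields one example at a time, so the DataLoader only ever holds a batch in memory.

```python
# Minimal sketch (hypothetical file format, not Flair's API): stream a
# large labeled corpus from disk with a PyTorch IterableDataset.
from torch.utils.data import DataLoader, IterableDataset

class StreamingCorpus(IterableDataset):
    """Yields (label, text) pairs one line at a time from a TSV file."""

    def __init__(self, path):
        self.path = path  # hypothetical file: one "label<TAB>text" pair per line

    def __iter__(self):
        with open(self.path, encoding="utf-8") as f:
            for line in f:
                label, text = line.rstrip("\n").split("\t", 1)
                yield label, text

# Only batch_size examples are materialized at a time.
loader = DataLoader(StreamingCorpus("train.tsv"), batch_size=32)
for labels, texts in loader:
    pass  # forward pass / training step goes here
```

One caveat with this pattern: with num_workers > 0, a plain IterableDataset like this would be replayed once per worker, so per-worker sharding (e.g. via torch.utils.data.get_worker_info()) would be needed.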
In Gensim, training is streamed: sentences can be a generator that reads input data from disk on the fly, without loading the entire corpus into RAM.
This approach may be worth considering here.
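For illustration, here is a corpus object in the spirit of Gensim's streaming pattern; the file name and the one-tokenized-sentence-per-line format are assumptions, not anything specified in this issue:

```python
# Sketch of Gensim-style streaming: a re-iterable corpus object, so each
# training epoch re-reads the file instead of caching it in RAM.
class StreamedSentences:
    def __init__(self, path):
        self.path = path  # hypothetical file: one tokenized sentence per line

    def __iter__(self):
        with open(self.path, encoding="utf-8") as f:
            for line in f:
                yield line.split()

# Gensim accepts any such iterable of token lists, e.g.:
# gensim.models.Word2Vec(sentences=StreamedSentences("corpus.txt"))
```

A class (rather than a bare generator) is used so the corpus can be iterated more than once, which multi-epoch training requires.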
Yes, that's a good idea!
Stream compressors like gzip or bzip2 are recommended to save space, resulting in .jsonl.gz or .jsonl.bz2 files.
This might help, as we are looking for a similar feature in Flair.
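As a minimal sketch of consuming such files (the file name train.jsonl.gz is hypothetical), Python's standard gzip module can decompress and parse the stream one record at a time:

```python
# Sketch: iterate over a compressed .jsonl.gz file record by record,
# without decompressing it to disk or loading it fully into memory.
import gzip
import json

def read_jsonl_gz(path):
    # "rt" opens the gzip stream in text mode, so we can read line by line
    with gzip.open(path, "rt", encoding="utf-8") as f:
        for line in f:
            yield json.loads(line)

for record in read_jsonl_gz("train.jsonl.gz"):  # hypothetical file name
    pass  # handle one example at a time
```

The same pattern works for .jsonl.bz2 files via the standard bz2 module's bz2.open.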