Flair: ClassificationCorpus not using all data provided

Created on 28 May 2020 · 2 Comments · Source: flairNLP/flair

Describe the bug
ClassificationCorpus does not use all the data provided in the three txt files (train.txt, dev.txt and test.txt). I suspect an encoding issue; however, pandas reads all the data when given the same encoding (utf-16) as a parameter.

To Reproduce
This colab notebook.

Expected behavior
I expect the ClassificationCorpus object to use all the data provided. Oddly, when I read the txt files as pandas DataFrames, all the data is read, but when I read them as ClassificationCorpus objects, some of it is not used, as the difference in total length shows (see notebook).

Environment (please complete the following information):

  • Google Colab
  • Flair: installed from source
bug

All 2 comments

@NielsRogge thanks for reporting this and sharing the steps to reproduce! @marcelmmm can you take a look at this?

The problem is that in the data file the label and the text are separated by a tab, not a space.
For example, one line looks like this: __label__1\t100st Prunus padus 60/90\n
with a \t between __label__1 and 100st. If the text of a line consists of only one word, the line contains no spaces at all, so it is not processed. As a result, all lines whose text is a single word are omitted.

It should work if you replace the tabs with spaces.
Since this is not a sensible restriction, I will add handling for tab-separated labels to the code.
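The omission can be reproduced in isolation with a few lines of Python. This is a simplified illustration and not Flair's actual parsing code: the helper `has_space` is a hypothetical stand-in for any check that requires a space in the line, and the second and third sample lines are made up for the demonstration.

```python
def has_space(line: str) -> bool:
    """Hypothetical stand-in for a check that skips lines without a space."""
    return " " in line.rstrip("\n")


lines = [
    "__label__1\t100st Prunus padus 60/90\n",  # tab after label, spaces inside the text
    "__label__1\tPrunus\n",                    # tab after label, one-word text (made up)
    "__label__1 Prunus\n",                     # space after label, one-word text (made up)
]

# Only lines that contain at least one space survive the check.
kept = [line for line in lines if has_space(line)]
dropped = [line for line in lines if not has_space(line)]
```

Here only the tab-separated line with one-word text ends up in `dropped`, which matches the pattern described: multi-word lines survive because of the spaces inside the text, not because of the separator after the label.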
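Until tab handling is available, the workaround could be sketched as below. The function names `tab_to_space` and `convert_file` are hypothetical, the utf-16 default is taken from the report, and the sketch assumes the first tab on each line is the one separating the label from the text.

```python
from pathlib import Path


def tab_to_space(line: str) -> str:
    """Replace only the first tab, assumed to separate the label from the text."""
    return line.replace("\t", " ", 1)


def convert_file(src: str, dst: str, encoding: str = "utf-16") -> None:
    """Write a copy of src to dst with the label/text tab replaced on every line."""
    text = Path(src).read_text(encoding=encoding)
    converted = "\n".join(tab_to_space(line) for line in text.split("\n"))
    Path(dst).write_text(converted, encoding=encoding)
```

Running `convert_file` on train.txt, dev.txt and test.txt into a separate folder, and pointing the corpus at that folder, leaves the original files untouched.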
