Flair: About NER corpus

Created on 16 May 2019 · 3 Comments · Source: flairNLP/flair

Hello,
I want to train my own NER model using the French WikiNER dataset as a starting point (augmented with another dataset). The documentation here suggests that an IOB2 scheme is used to tag entities.
However, loading the French WikiNER corpus via

from flair.data_fetcher import NLPTaskDataFetcher, NLPTask
corpus = NLPTaskDataFetcher.load_corpus(NLPTask.WIKINER_FRENCH)

suggests that the training corpus is actually annotated with the plain IOB scheme, where entities start with I- rather than B-. Indeed, running
head -n 100 /Users/xxxx/.flair/datasets/wikiner_french/aij-wikiner-fr-wp3.train | tail -n 39
gives the following (truncated) output:

Il      PRO:PER O
a       VER:pres        O
formé   VER:pper        O
toute   PRO:IND O
une     DET:ART O
génération      NOM     O
de      PRP     O
linguistes      NOM     O
français        ADJ     O
,       PUN     O
parmi   PRP     O
lesquels        PRO:REL O
Emile   NAM     I-PER
Benveniste      NAM     I-PER
,       PUN     O
Marcel  NAM     I-PER
Cohen   NAM     I-PER
,       PUN     O
Georges NAM     I-PER
Dumézil NAM     I-PER
,       PUN     O
André   NAM     I-PER
Martinet        NAM     I-PER
,       PUN     O
Aurélien        NAM     I-PER
Sauvageot       NAM     I-PER
,       PUN     O
Lucien  NAM     I-PER
Tesnière        NAM     I-PER
,       PUN     O
Joseph  NAM     I-PER
Vendryes        NAM     I-PER
.       SENT    O

Am I missing something? Is there some internal IOB to IOB2 conversion done under the hood before the actual training, or should I convert all my datasets to the IOB format in all cases?

question

aschmu

All 3 comments

Hello @aschmu, yes: we always auto-convert the data to IOB2 when reading it with the NLPTaskDataFetcher, so you don't have to do anything. If you use the pre-set autoloader for WIKINER, we even convert automatically to the BIOES format.
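
To make the IOB to IOB2 conversion mentioned above concrete, here is a minimal standalone sketch in Python. It is not flair's actual implementation, and the function name iob1_to_iob2 is made up for this illustration. It assumes tags are either "O" or of the form "I-TYPE" / "B-TYPE", as in the WikiNER excerpt from the question.

```python
def iob1_to_iob2(tags):
    """Convert a list of IOB1 tags to IOB2, where every entity starts with B-.

    Hypothetical helper for illustration only, not part of flair.
    """
    converted = []
    prev_type = None  # entity type of the previous token, None when outside an entity
    for tag in tags:
        if tag == "O":
            converted.append(tag)
            prev_type = None
            continue
        prefix, entity_type = tag.split("-", 1)
        # An I- tag that does not continue an entity of the same type
        # is the first token of a new entity, so it becomes B-.
        if prefix == "I" and prev_type != entity_type:
            converted.append("B-" + entity_type)
        else:
            converted.append(tag)
        prev_type = entity_type
    return converted


# Tags for "lesquels Emile Benveniste , Marcel Cohen" from the excerpt above
tags = ["O", "I-PER", "I-PER", "O", "I-PER", "I-PER"]
print(iob1_to_iob2(tags))
# → ['O', 'B-PER', 'I-PER', 'O', 'B-PER', 'I-PER']
```

Existing B- tags are left untouched, so running the function on data that is already in IOB2 returns it unchanged.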

If you want IOB2 instead, you can change this behavior by loading the corpus yourself and removing the tag_to_biloes argument from the code below.

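from flair.data_fetcher import NLPTaskDataFetcher

# Folder containing the WikiNER column files
data_folder = 'path/to/wikiner/data'
# Column layout: token text, POS tag, NER tag
columns = {0: "text", 1: "pos", 2: "ner"}

# tag_to_biloes="ner" converts the NER column to BIOES; omit it to keep IOB2
corpus = NLPTaskDataFetcher.load_column_corpus(
    data_folder,
    columns,
    tag_to_biloes="ner"
)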
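
For reference, the difference between IOB2 and BIOES tagging can be sketched as follows. This is only an illustration of the two schemes under the assumption that the input is well-formed IOB2; the function name iob2_to_bioes is hypothetical and this is not the code flair runs internally.

```python
def iob2_to_bioes(tags):
    """Convert a list of well-formed IOB2 tags to BIOES.

    Hypothetical helper for illustration only, not part of flair.
    S- marks a single-token entity, E- marks the last token of a longer one.
    """
    converted = []
    for i, tag in enumerate(tags):
        if tag == "O":
            converted.append(tag)
            continue
        prefix, entity_type = tag.split("-", 1)
        # The entity continues only if the next tag is I- with the same type.
        next_tag = tags[i + 1] if i + 1 < len(tags) else "O"
        continues = next_tag == "I-" + entity_type
        if prefix == "B":
            converted.append(("B-" if continues else "S-") + entity_type)
        else:  # prefix == "I"
            converted.append(("I-" if continues else "E-") + entity_type)
    return converted


print(iob2_to_bioes(["O", "B-PER", "I-PER", "O", "B-LOC"]))
# → ['O', 'B-PER', 'E-PER', 'O', 'S-LOC']
```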

alanakbik on 16 May 2019

@alanakbik Thanks for the clarification!

aschmu on 22 May 2019

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.