Flair: About NER corpus

Created on 16 May 2019  ·  3 Comments  ·  Source: flairNLP/flair

Hello,
I want to train my own NER model using the French WikiNER dataset as a starting point (augmented with another dataset). The documentation here suggests that an IOB2 scheme is used to tag entities.
However, loading the French WikiNER corpus via

corpus = NLPTaskDataFetcher.load_corpus(NLPTask.WIKINER_FRENCH)

suggests that an IOB scheme should be used to annotate the training corpus. Indeed, running
head -n 100 /Users/xxxx/.flair/datasets/wikiner_french/aij-wikiner-fr-wp3.train | tail -n 39
gives the following (truncated) output.

Il      PRO:PER O
a       VER:pres        O
formé   VER:pper        O
toute   PRO:IND O
une     DET:ART O
génération      NOM     O
de      PRP     O
linguistes      NOM     O
français        ADJ     O
,       PUN     O
parmi   PRP     O
lesquels        PRO:REL O
Emile   NAM     I-PER
Benveniste      NAM     I-PER
,       PUN     O
Marcel  NAM     I-PER
Cohen   NAM     I-PER
,       PUN     O
Georges NAM     I-PER
Dumézil NAM     I-PER
,       PUN     O
André   NAM     I-PER
Martinet        NAM     I-PER
,       PUN     O
Aurélien        NAM     I-PER
Sauvageot       NAM     I-PER
,       PUN     O
Lucien  NAM     I-PER
Tesnière        NAM     I-PER
,       PUN     O
Joseph  NAM     I-PER
Vendryes        NAM     I-PER
.       SENT    O

Am I missing something? Is there some internal IOB to IOB2 conversion that is done under the hood before actual training, or should I convert all my datasets to the IOB format in all cases?
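
To make the difference concrete: in IOB (IOB1) every token of an entity is tagged I- unless it directly follows another entity of the same type, whereas IOB2 marks the first token of every entity with B-. Below is a minimal sketch of the conversion being asked about. The helper name iob1_to_iob2 is hypothetical and this is not Flair's own code, only an illustration of the two schemes.

```python
def iob1_to_iob2(tags):
    """Convert the tags of one sentence from IOB1 to IOB2.

    Hypothetical helper for illustration; not part of Flair.
    """
    converted = []
    previous = "O"
    for tag in tags:
        if tag.startswith("I-"):
            # An I- tag opens a new entity if the previous tag was O
            # or belonged to a different entity type.
            if previous == "O" or previous[2:] != tag[2:]:
                converted.append("B-" + tag[2:])
            else:
                converted.append(tag)
        else:
            # O tags and existing B- tags are kept unchanged.
            converted.append(tag)
        previous = tag
    return converted


# Tags shaped like the excerpt above: ", Emile Benveniste , Marcel Cohen ,"
tags = ["O", "I-PER", "I-PER", "O", "I-PER", "I-PER", "O"]
print(iob1_to_iob2(tags))
# ['O', 'B-PER', 'I-PER', 'O', 'B-PER', 'I-PER', 'O']
```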

question

All 3 comments

Hello @aschmu, yes: we always auto-convert the data to IOB2 when reading with the NLPTaskDataFetcher, so you don't have to do anything. If you use the pre-set autoloader for WIKINER, we even automatically convert to the BIOES format.

If you want IOB2 instead, you can change this behavior by removing the tag_to_biloes argument from the code below.

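from flair.data_fetcher import NLPTaskDataFetcher

# folder containing the WikiNER column files
data_folder = 'path/to/wikiner/data'
# column layout of the files: token, POS tag, NER tag
columns = {0: "text", 1: "pos", 2: "ner"}

corpus = NLPTaskDataFetcher.load_column_corpus(
    data_folder,
    columns,
    tag_to_biloes="ner"  # remove this argument to keep IOB2 tags
)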
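
For reference, here is a rough sketch of what an IOB2 to BIOES conversion does to a tag sequence: single-token entities get S-, and the last token of a longer entity gets E-. The function name iob2_to_bioes is hypothetical and the code is only an illustration of the scheme, not the conversion Flair runs internally.

```python
def iob2_to_bioes(tags):
    """Convert the tags of one sentence from IOB2 to BIOES.

    Hypothetical helper for illustration; not Flair's implementation.
    """
    converted = []
    for i, tag in enumerate(tags):
        if tag == "O":
            converted.append(tag)
            continue
        prefix, label = tag.split("-", 1)
        # The entity continues only if the next tag is I- with the same label.
        next_tag = tags[i + 1] if i + 1 < len(tags) else "O"
        continues = next_tag == "I-" + label
        if prefix == "B":
            converted.append(("B-" if continues else "S-") + label)
        else:  # prefix == "I"
            converted.append(("I-" if continues else "E-") + label)
    return converted


print(iob2_to_bioes(["O", "B-PER", "I-PER", "O", "B-LOC", "O"]))
# ['O', 'B-PER', 'E-PER', 'O', 'S-LOC', 'O']
```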

@alanakbik Thanks for the clarification!

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
