Hello,
I want to train my own NER model using the French WikiNER dataset as a starting point (augmented with another dataset). The documentation here suggests that an IOB2 scheme is used to tag entities.
However, loading the French WikiNER corpus via
corpus = NLPTaskDataFetcher.load_corpus(NLPTask.WIKINER_FRENCH)
suggests that the training corpus is annotated with an IOB scheme instead. Indeed,
head -n 100 /Users/xxxx/.flair/datasets/wikiner_french/aij-wikiner-fr-wp3.train | tail -n 39
gives the following (truncated) output:
Il PRO:PER O
a VER:pres O
formé VER:pper O
toute PRO:IND O
une DET:ART O
génération NOM O
de PRP O
linguistes NOM O
français ADJ O
, PUN O
parmi PRP O
lesquels PRO:REL O
Emile NAM I-PER
Benveniste NAM I-PER
, PUN O
Marcel NAM I-PER
Cohen NAM I-PER
, PUN O
Georges NAM I-PER
Dumézil NAM I-PER
, PUN O
André NAM I-PER
Martinet NAM I-PER
, PUN O
Aurélien NAM I-PER
Sauvageot NAM I-PER
, PUN O
Lucien NAM I-PER
Tesnière NAM I-PER
, PUN O
Joseph NAM I-PER
Vendryes NAM I-PER
. SENT O
Am I missing something? Is there some internal IOB-to-IOB2 conversion done under the hood before the actual training, or should I convert all my datasets to the IOB format in all cases?
Hello @aschmu, yes, we always auto-convert the data to IOB2 when reading with the NLPTaskDataFetcher, so you don't have to do anything. If you use the preset autoloader for WIKINER, we even automatically convert to the BIOES format.
If you want IOB2 instead, you can change this behavior by removing the tag_to_biloes argument from the code below.
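from flair.data_fetcher import NLPTaskDataFetcher

data_folder = 'path/to/wikiner/data'
# column layout of the WikiNER files: token, POS tag, NER tag
columns = {0: "text", 1: "pos", 2: "ner"}
corpus = NLPTaskDataFetcher.load_column_corpus(
    data_folder,
    columns,
    tag_to_biloes="ner"  # drop this argument to keep IOB2 tags
)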
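For illustration, the two conversions mentioned above (IOB to IOB2, then IOB2 to BIOES) can be sketched in plain Python. This is only a minimal sketch and not flair's actual implementation: the function names iob1_to_iob2 and iob2_to_bioes are made up for this example, and it assumes well-formed tags of the form O, B-X or I-X.

```python
def iob1_to_iob2(tags):
    """Hypothetical helper: make every entity start with a B- tag (IOB2)."""
    converted = []
    prev_label = None  # entity label of the previous token, None for "O"
    for tag in tags:
        if tag == "O":
            converted.append(tag)
            prev_label = None
            continue
        prefix, label = tag.split("-", 1)
        # In IOB1 an entity may open with I-; IOB2 requires B- there.
        if prefix == "I" and prev_label != label:
            converted.append("B-" + label)
        else:
            converted.append(tag)
        prev_label = label
    return converted


def iob2_to_bioes(tags):
    """Hypothetical helper: mark entity ends (E-) and single tokens (S-)."""
    converted = []
    for i, tag in enumerate(tags):
        if tag == "O":
            converted.append(tag)
            continue
        prefix, label = tag.split("-", 1)
        next_tag = tags[i + 1] if i + 1 < len(tags) else "O"
        # The entity continues only if the next tag is I- with the same label.
        continues = next_tag == "I-" + label
        if prefix == "B":
            converted.append(("B-" if continues else "S-") + label)
        else:
            converted.append(("I-" if continues else "E-") + label)
    return converted


# Tags for "lesquels Emile Benveniste , Marcel Cohen" from the excerpt above.
iob1 = ["O", "I-PER", "I-PER", "O", "I-PER", "I-PER"]
iob2 = iob1_to_iob2(iob1)
bioes = iob2_to_bioes(iob2)
print(iob2)   # ['O', 'B-PER', 'I-PER', 'O', 'B-PER', 'I-PER']
print(bioes)  # ['O', 'B-PER', 'E-PER', 'O', 'B-PER', 'E-PER']
```

Since the I-PER tags in the WikiNER file only ever open an entity after an O tag in this excerpt, the IOB2 step simply turns the first I-PER of each name into B-PER.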
@alanakbik Thanks for the clarification!