Flair: Creating a ColumnCorpus without .txt files

Created on 9 Apr 2021 · 2 Comments · Source: flairNLP/flair

Good afternoon,

I have been using Flair for a while now and am very happy with its ease of use and performance.
However, I was wondering: is it possible to create a ColumnCorpus without the need for a train.txt, dev.txt, and test.txt? I load my data from a Pandas DataFrame, and it feels quite awkward to write the data to files first only to load it back right after.

I have searched around, but haven't been able to find a way to do this. If this is just not possible, could anyone explain why this choice was made?

Thanks in advance, and have a nice day!

question

Most helpful comment

Hi @fabero ,

I have used something like this in the past.

from flair.datasets import SentenceDataset
from flair.data import Corpus, Sentence

def get_flair_dataset_from_dataframe(data, text_col, label_col):
    # Build one labeled Sentence per DataFrame row
    sentences = []
    for text, label in zip(data[text_col], data[label_col]):
        sentence = Sentence(text)
        # Label values are strings; str() guards against numeric label columns
        sentence.add_label('class', str(label))
        sentences.append(sentence)
    return SentenceDataset(sentences)

# train_df, val_df and test_df are pandas DataFrames with one example per row
train_dataset = get_flair_dataset_from_dataframe(train_df, "text_column", "label_column")
dev_dataset = get_flair_dataset_from_dataframe(val_df, "text_column", "label_column")
test_dataset = get_flair_dataset_from_dataframe(test_df, "text_column", "label_column")

# All three splits are provided, so nothing needs to be sampled from train
corpus = Corpus(train=train_dataset, dev=dev_dataset, test=test_dataset, name="my_corpus", sample_missing_splits=False)
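
Note that the snippet above attaches one label per sentence, which fits text classification. A ColumnCorpus normally holds token-level tags, so if your DataFrame has one row per token, the rows first need to be grouped into sentences. Here is a minimal sketch of that grouping step in plain Python. The column names sentence_id, token and tag and the sample rows are assumptions for illustration only, and rows of the same sentence are assumed to be adjacent and in order.

```python
from itertools import groupby
from operator import itemgetter


def group_rows_into_sentences(rows, id_key="sentence_id", token_key="token", tag_key="tag"):
    """Group flat token rows into one list of (token, tag) pairs per sentence.

    Assumes rows of the same sentence are adjacent and already in token order.
    """
    sentences = []
    for _, group in groupby(rows, key=itemgetter(id_key)):
        sentences.append([(row[token_key], row[tag_key]) for row in group])
    return sentences


# Hypothetical rows, e.g. what df.to_dict("records") returns for a token-level DataFrame
rows = [
    {"sentence_id": 0, "token": "George", "tag": "B-PER"},
    {"sentence_id": 0, "token": "lives", "tag": "O"},
    {"sentence_id": 1, "token": "Berlin", "tag": "B-LOC"},
]

token_tag_sentences = group_rows_into_sentences(rows)
```

Each inner list can then be turned into a Sentence and tagged token by token with whatever labeling call your Flair version provides, and the resulting sentences wrapped in a SentenceDataset exactly as above.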

Hope this helps!

Hi @kishaloyhalder ,

That's great! Thanks a lot!
