It isn't clear to me what the right way to use a custom DataLoader is. The docs state that the initial data loading has to happen in the prepare_data function, while the transformations that happen in memory belong in the train_dataloader function. The structure of a Dataset class is clear to me, but I cannot see how to translate this into a PyTorch Lightning class.
I'm using a custom Dataset for an NLP transformer; specifically, I don't see how to write the __getitem__ and __init__ methods of the Dataset in this setup.
The code below contains the LightningModule part; I don't know how to structure the prepare_data and train_dataloader parts.
class DisasterTweetModel(LightningModule):
    def __init__(self, roberta_path, fold_i=None, train_len=None):
        super().__init__()
        self.fold_i = fold_i
        self.train_len = train_len

        roberta_config = RobertaConfig.from_pretrained(roberta_path)
        roberta_config.output_hidden_states = True
        self.roberta = RobertaModel.from_pretrained(roberta_path, config=roberta_config)
        self.dropout = nn.Dropout(0.1)
        self.linear = nn.Linear(roberta_config.hidden_size * 2, 1)
        torch.nn.init.normal_(self.linear.weight, std=0.02)

        self.am_tloss = AverageMeter()
        self.am_vloss = AverageMeter()
    def forward(self, input_ids, attn_mask, token_type_ids):
        _, _, out = self.roberta(
            input_ids=input_ids,
            attention_mask=attn_mask,
            token_type_ids=token_type_ids
        )
        out = torch.cat([out[-1], out[-2]], dim=2)
        out = torch.mean(out, 1)
        out = self.linear(self.dropout(out))
        out = torch.squeeze(out)
        return out
    def prepare_data(self):
        # save data to disk? could this be the creation of the fold folders?
        # stuff here is done once at the very beginning of training,
        # before any distributed training starts
        # download stuff
        # save to disk
        train_df = pd.read_csv(GCONF.path/'train.csv')
    def train_dataloader(self, df, tokenizer, max_length, is_testing=False):
        self.df = df
        self.tokenizer = tokenizer
        self.max_length = max_length
        self.is_testing = is_testing
        # process the training data (e.g. removing spaces or other modifications?)
        # should return a DataLoader of the transformed dataset, including the batch size
        # data transforms
        # dataset creation
        # return a DataLoader
I know that having a separate Dataset class would do the trick, and that I could then instantiate it inside the k-fold cross-validation for loop, but I feel like that is not how it should be done if I want to harness the advantages that come with PyTorch Lightning (I might be wrong here). It would look something like this:
for fold_i, (train_ix, valid_ix) in enumerate(skf.split(train_df, train_df['target'].values)):
    train_ds = RobertaDataset(train_df.iloc[train_ix], tokenizer, max_length, is_testing=False)
    train_dl = DataLoader(train_ds, batch_size=batch_size, shuffle=True)
(The full code is in my Kaggle kernel.)
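For reference, the kind of Dataset I have in mind looks roughly like this (a simplified sketch; the 'text'/'target' columns and the exact tokenizer arguments are from my own setup and may differ):

import torch
from torch.utils.data import Dataset

class RobertaDataset(Dataset):
    def __init__(self, df, tokenizer, max_length, is_testing=False):
        self.df = df.reset_index(drop=True)
        self.tokenizer = tokenizer
        self.max_length = max_length
        self.is_testing = is_testing

    def __len__(self):
        return len(self.df)

    def __getitem__(self, idx):
        # tokenize one tweet and return the tensors the model's forward() expects
        encoded = self.tokenizer.encode_plus(
            self.df.loc[idx, 'text'],
            max_length=self.max_length,
            padding='max_length',
            truncation=True,
            return_tensors='pt',
        )
        item = {
            'input_ids': encoded['input_ids'].squeeze(0),
            'attention_mask': encoded['attention_mask'].squeeze(0),
        }
        if not self.is_testing:
            item['target'] = torch.tensor(self.df.loc[idx, 'target'], dtype=torch.float)
        return item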
You can pass DataLoaders into trainer.fit directly. I usually set up my CV like this:
for fold, (train_idx, valid_idx) in enumerate(kfold.split(train_df)):
    train_loader = create_dataloader(train_df.iloc[train_idx])
    valid_loader = create_dataloader(train_df.iloc[valid_idx])

    model = DisasterTweetModel(args)
    trainer = pl.Trainer()
    trainer.fit(model, train_dataloader=train_loader, val_dataloaders=valid_loader)
See https://pytorch-lightning.readthedocs.io/en/latest/new-project.html#datasets
Is this the correct PyTorch Lightning way to do it? I somehow feel like this is code that should live inside the LightningModule rather than in a separate DataLoader helper. Is the create_dataloader function a specific function of PyTorch Lightning?
Thanks a lot
The create_dataloader function is a homebrew helper function.
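It could be as simple as something like this (a sketch; RobertaDataset, tokenizer, max_length and batch_size are whatever you already have in your code):

def create_dataloader(df, is_testing=False):
    # wrap the dataframe slice in your existing Dataset and hand back a plain DataLoader
    dataset = RobertaDataset(df, tokenizer, max_length, is_testing=is_testing)
    return DataLoader(dataset, batch_size=batch_size, shuffle=not is_testing, num_workers=4)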
The other way I have done it is by passing the data as additional arguments to your PL module:
for fold, (train_idx, valid_idx) in enumerate(kfold.split(train_df)):
    model = DisasterTweetModel(args, train_data=train_df.iloc[train_idx], valid_data=train_df.iloc[valid_idx])
    trainer = pl.Trainer()
    trainer.fit(model)
And then your train_dataloader function would look like this (and val_dataloader would look the same):
def train_dataloader(self):
    dataset = DisasterTweetDataset(self.train_data)
    return DataLoader(dataset, num_workers=4, batch_size=self.batch_size, shuffle=True)
The disadvantage of this is that you'll have to take care that the datasets don't cause issues with the saving of your hyperparameters.
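One way to handle that (assuming a Lightning version where save_hyperparameters accepts an ignore argument) is to explicitly keep the dataframes out of the saved hyperparameters, e.g.:

def __init__(self, args, train_data=None, valid_data=None):
    super().__init__()
    self.train_data = train_data
    self.valid_data = valid_data
    # keep the dataframes out of the logged/checkpointed hparams
    self.save_hyperparameters(ignore=['train_data', 'valid_data'])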
I guess it depends on preference. Both are allowed according to the docs, with the first method better suited for "production" models
Thanks a lot for the in-depth explanation.
we are going to move to DataModules...
cc: @nateraw
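Roughly, a DataModule groups prepare_data, setup and the dataloaders into one reusable object. A sketch using the names from this thread (RobertaDataset and the usual hyperparameters are assumed):

import pytorch_lightning as pl
from torch.utils.data import DataLoader

class DisasterTweetDataModule(pl.LightningDataModule):
    def __init__(self, train_df, valid_df, tokenizer, max_length, batch_size):
        super().__init__()
        self.train_df = train_df
        self.valid_df = valid_df
        self.tokenizer = tokenizer
        self.max_length = max_length
        self.batch_size = batch_size

    def setup(self, stage=None):
        # in-memory transforms and Dataset creation happen here
        self.train_ds = RobertaDataset(self.train_df, self.tokenizer, self.max_length)
        self.valid_ds = RobertaDataset(self.valid_df, self.tokenizer, self.max_length)

    def train_dataloader(self):
        return DataLoader(self.train_ds, batch_size=self.batch_size, shuffle=True)

    def val_dataloader(self):
        return DataLoader(self.valid_ds, batch_size=self.batch_size)

# usage: trainer.fit(model, datamodule=DisasterTweetDataModule(...))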