It isn't clear to me what the right way to use a custom DataLoader is. The docs state that the initial data loading has to happen in the prepare_data function, while the transformations that happen in memory belong in the train_dataloader function. The structure of a Dataset class is clear to me, but I cannot see how to translate this into a PyTorch Lightning class.
I'm using a custom Dataset for an NLP transformer; specifically, I don't see how to write the __getitem__ and __init__ methods of the Dataset in this setup.
The code below contains the LightningModule part; I don't know how to structure the prepare_data and train_dataloader parts.
class DisasterTweetModel(LightningModule):
    def __init__(self, roberta_path, fold_i=None, train_len=None):
        super().__init__()
        self.fold_i = fold_i
        self.train_len = train_len

        roberta_config = RobertaConfig.from_pretrained(roberta_path)
        roberta_config.output_hidden_states = True
        self.roberta = RobertaModel.from_pretrained(roberta_path, config=roberta_config)
        self.dropout = nn.Dropout(0.1)
        self.linear = nn.Linear(roberta_config.hidden_size * 2, 1)
        torch.nn.init.normal_(self.linear.weight, std=0.02)

        self.am_tloss = AverageMeter()
        self.am_vloss = AverageMeter()
    def forward(self, input_ids, attn_mask, token_type_ids):
        _, _, out = self.roberta(
            input_ids=input_ids,
            attention_mask=attn_mask,
            token_type_ids=token_type_ids
        )
        out = torch.cat([out[-1], out[-2]], dim=2)
        out = torch.mean(out, 1)
        out = self.linear(self.dropout(out))
        out = torch.squeeze(out)
        return out
    def prepare_data(self):
        # save data to disk? could this be the creation of the fold folders?
        # stuff here is done once at the very beginning of training,
        # before any distributed training starts
        # download stuff
        # save to disk
        train_df = pd.read_csv(GCONF.path/'train.csv')
    def train_dataloader(self, df, tokenizer, max_length, is_testing=False):
        self.df = df
        self.tokenizer = tokenizer
        self.max_length = max_length
        self.is_testing = is_testing
        # process the training data (e.g. removing spaces or other modifications?)
        # should return a DataLoader of the transformed dataset, including the batch size
        # data transforms
        # dataset creation
        # return a DataLoader
I know that having a separate Dataset class would do the trick, and that I could then instantiate it inside the k-fold cross-validation for loop, but I feel like that is not how it should be done if I want to harness the advantages that come with PyTorch Lightning (I might be wrong here). It would look something like this:
for fold_i, (train_ix, valid_ix) in enumerate(skf.split(train_df, train_df['target'].values)):
    train_ds = RobertaDataset(train_df.iloc[train_ix], tokenizer, max_length, is_testing=False)
    train_dl = DataLoader(train_ds, batch_size=batch_size, shuffle=True)
(The full code is in my Kaggle kernel.)
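For reference, the kind of Dataset I have in mind looks roughly like this (a simplified sketch; the 'text'/'target' columns and the exact tokenizer arguments are from my own setup and may differ):

import torch
from torch.utils.data import Dataset

class RobertaDataset(Dataset):
    def __init__(self, df, tokenizer, max_length, is_testing=False):
        self.df = df.reset_index(drop=True)
        self.tokenizer = tokenizer
        self.max_length = max_length
        self.is_testing = is_testing

    def __len__(self):
        return len(self.df)

    def __getitem__(self, idx):
        # tokenize one tweet and return the tensors the model's forward() expects
        encoded = self.tokenizer.encode_plus(
            self.df.loc[idx, 'text'],
            max_length=self.max_length,
            padding='max_length',
            truncation=True,
            return_tensors='pt',
        )
        item = {
            'input_ids': encoded['input_ids'].squeeze(0),
            'attention_mask': encoded['attention_mask'].squeeze(0),
        }
        if not self.is_testing:
            item['target'] = torch.tensor(self.df.loc[idx, 'target'], dtype=torch.float)
        return item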
You can pass DataLoaders into trainer.fit directly. I usually set up my CV like this:
for fold, (train_idx, valid_idx) in enumerate(kfold.split(train_df)):
    train_loader = create_dataloader(train_df.iloc[train_idx])
    valid_loader = create_dataloader(train_df.iloc[valid_idx])

    model = DisasterTweetModel(args)
    trainer = pl.Trainer()
    trainer.fit(model, train_dataloader=train_loader, val_dataloaders=valid_loader)
See https://pytorch-lightning.readthedocs.io/en/latest/new-project.html#datasets
Is this the correct PyTorch Lightning way to do it? I somehow feel like this is code that should live inside the LightningModule rather than in a separate DataLoader helper. Is the create_dataloader function a specific function of PyTorch Lightning?
Thanks a lot
The create_dataloader function is a homebrew helper function.
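It could be as simple as something like this (a sketch; RobertaDataset, tokenizer, max_length and batch_size are whatever you already have in your code):

def create_dataloader(df, is_testing=False):
    # wrap the dataframe slice in your existing Dataset and hand back a plain DataLoader
    dataset = RobertaDataset(df, tokenizer, max_length, is_testing=is_testing)
    return DataLoader(dataset, batch_size=batch_size, shuffle=not is_testing, num_workers=4)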
The other way I have done it is by passing the data as additional arguments to your PL module:
for fold, (train_idx, valid_idx) in enumerate(kfold.split(train_df)):
    model = DisasterTweetModel(args, train_data=train_df.iloc[train_idx], valid_data=train_df.iloc[valid_idx])
    trainer = pl.Trainer()
    trainer.fit(model)
And then your train_dataloader function would look like this (and val_dataloader would look the same):
def train_dataloader(self):
    dataset = DisasterTweetDataset(self.train_data)
    return DataLoader(dataset, num_workers=4, batch_size=self.batch_size, shuffle=True)
The disadvantage of this is that you'll have to take care that the datasets don't cause issues with the saving of your hyperparameters.
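One way to handle that (assuming a Lightning version where save_hyperparameters accepts an ignore argument) is to explicitly keep the dataframes out of the saved hyperparameters, e.g.:

def __init__(self, args, train_data=None, valid_data=None):
    super().__init__()
    self.train_data = train_data
    self.valid_data = valid_data
    # keep the dataframes out of the logged/checkpointed hparams
    self.save_hyperparameters(ignore=['train_data', 'valid_data'])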
I guess it depends on preference. Both are allowed according to the docs, with the first method better suited for "production" models
Thanks a lot for the in-depth explanation.
we are going to move to DataModules...
cc: @nateraw
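Roughly, a DataModule groups prepare_data, setup and the dataloaders into one reusable object. A sketch using the names from this thread (RobertaDataset and the usual hyperparameters are assumed):

import pytorch_lightning as pl
from torch.utils.data import DataLoader

class DisasterTweetDataModule(pl.LightningDataModule):
    def __init__(self, train_df, valid_df, tokenizer, max_length, batch_size):
        super().__init__()
        self.train_df = train_df
        self.valid_df = valid_df
        self.tokenizer = tokenizer
        self.max_length = max_length
        self.batch_size = batch_size

    def setup(self, stage=None):
        # in-memory transforms and Dataset creation happen here
        self.train_ds = RobertaDataset(self.train_df, self.tokenizer, self.max_length)
        self.valid_ds = RobertaDataset(self.valid_df, self.tokenizer, self.max_length)

    def train_dataloader(self):
        return DataLoader(self.train_ds, batch_size=self.batch_size, shuffle=True)

    def val_dataloader(self):
        return DataLoader(self.valid_ds, batch_size=self.batch_size)

# usage: trainer.fit(model, datamodule=DisasterTweetDataModule(...))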