Transformers: Training data format

Created on 22 Jul 2020  ·  29 Comments  ·  Source: huggingface/transformers

I have text on which I want to fine-tune the GPT-2 model for text autocompletion. The sentences in my text are separated by newlines; is there any format I should follow? When I trained on the data as-is with the default training parameters, it did not give me proper results. After splitting, I have nearly 25k sentences for training. Please suggest. The training data looks like this:
[screenshot of the training data omitted]

wontfix

All 29 comments

from transformers import AutoModelWithLMHead, AutoTokenizer

# Load the fine-tuned model and tokenizer from the training output directory.
tokenizer = AutoTokenizer.from_pretrained("language-modeling/output/")
model = AutoModelWithLMHead.from_pretrained("language-modeling/output/")

# Tokenize the partial query and generate a completion.
input_text = "organic che"
features = tokenizer([input_text], return_tensors='pt')

output = model.generate(input_ids=features['input_ids'],
                        attention_mask=features['attention_mask'])

tokenizer.decode(output[0])

I want to make a query autocomplete; these are the user queries, separated by newlines.

@patil-suraj

Should I add some special token at the start and end of every search query?

As far as I can see, your dataset format is correct. You also don't need to add any special tokens; the tokenizer adds them by default.

When I add --line_by_line, I get this error:
You are attempting to pad samples but the tokenizer you are using (GPT2Tokenizer) does not have one.

I want to make a text autocomplete. Am I using the correct model? Do I have sufficient training sentences? Should I add --line_by_line while training? Please help!
@patil-suraj

Hi @vyaslkv, you can use GPT-2 for autocomplete; as for the number of training examples, you will need to experiment.

pinging @sgugger for the error.

LineByLineDataset is not really suitable for GPT-2: you should concatenate your texts with the separation token and feed chunks of the model's context size (I can't remember off the top of my head if it's 512 or 1024, but it should be in the config of the model). As the error message says, GPT-2 does not have a padding token.

@sgugger, can you explain a bit which token to use and how the code would look in that case? Sorry if I am asking too much; or can you give me some reference I could use?

Thanks for responding

The separation token will automatically be added by the tokenizer. The rest is just standard Python: concatenate all your lists of tokens into a big numpy array, then reshape it to something x model_len, where something is the number of "sequences" (they'll actually span several lines of your dataset) you can build from your dataset. You can then iterate over the rows of that array as a dataset.
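The concatenate-and-reshape step described above can be sketched as follows (a minimal illustration with toy token ids and a tiny context size; GPT-2's actual context size is in the model config, and model_len = 4 is only to keep the demo small):

```python
import numpy as np

# Assume `tokenized_lines` is a list of token-id lists, one per line of the
# training file, e.g. produced by tokenizer.encode(line) for each line.
tokenized_lines = [[10, 11, 12], [13, 14], [15, 16, 17, 18, 19]]
model_len = 4  # stand-in for the model's context size (e.g. 1024 for GPT-2)

# Concatenate everything into one long array of token ids.
all_ids = np.concatenate([np.array(ids) for ids in tokenized_lines])

# Drop the remainder so the array reshapes cleanly, then build
# (num_sequences, model_len) rows to iterate over as a dataset.
num_sequences = len(all_ids) // model_len
chunks = all_ids[: num_sequences * model_len].reshape(num_sequences, model_len)
print(chunks.shape)  # (2, 4)
```

Note that each row can span several lines of the original file, which is exactly what the comment above describes.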

What changes do I need to make in this command?

python run_language_modeling.py \
    --output_dir=output \
    --model_type=gpt2 \
    --model_name_or_path=gpt2 \
    --do_train \
    --train_data_file=$TRAIN_FILE \
    --do_eval \
    --eval_data_file=$TEST_FILE

The script will do this automatically for you if you don't add the line-by-line flag (except that the sentences are separated by newlines rather than the special token). You can try replacing the newlines with "<|endoftext|>".
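Replacing the newlines with GPT-2's end-of-text token is a one-line preprocessing step (a sketch on an in-memory sample; in practice you would read from your training file and write the result back out):

```python
# Sample of the training file: one search query per line.
raw = "organic chemistry\nvegetative reproduction of agave\n"

# Strip blank lines and join the queries with GPT-2's end-of-text token,
# so the training script sees query boundaries instead of plain newlines.
queries = [line.strip() for line in raw.splitlines() if line.strip()]
text = "<|endoftext|>".join(queries)

print(text)
```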

Cool, thanks @sgugger. Just to clarify: if I put "<|endoftext|>" in place of the newlines, I don't need to make any other changes, right?

Normally, no.

Thanks @sgugger, thanks a ton, really, for helping so quickly.

@sgugger @patil-suraj I trained with the format you shared, but it is generating irrelevant text that is not from the training data I gave. What am I missing here?

from transformers import AutoModelWithLMHead, AutoTokenizer

# Load the fine-tuned model and tokenizer from the training output directory.
tokenizer = AutoTokenizer.from_pretrained("output1/")
model = AutoModelWithLMHead.from_pretrained("output1/")
input_ids = tokenizer.encode('Vegetative reproduction of Agave', return_tensors='pt')

# Beam search with num_return_sequences > 1 to get several candidates.
beam_outputs = model.generate(
    input_ids,
    max_length=50,
    num_beams=10,
    no_repeat_ngram_size=2,
    num_return_sequences=10,
    early_stopping=True
)

# Now we have 10 output sequences.
print("Output:\n" + 100 * '-')
for i, beam_output in enumerate(beam_outputs):
    print("{}: {}".format(i, tokenizer.decode(beam_output, skip_special_tokens=False)))

I want to generate text autocompletions from the training data.

@sgugger, can you please help?

Hi @vyaslkv, I think the best place to ask this question is the HF forums; someone who has already worked on a similar task can answer it better. Although @sgugger might have some answers :)

@patil-suraj Thanks, I will put my question there as well.

@sgugger @patil-suraj no one has responded on the forum 😔

@patil-suraj I didn't get any response can you please help

Hi @vyaslkv, I'll see if anyone I know has worked on a similar problem and get back to you.

@patil-suraj Thanks

@patil-suraj ?

Hello @patil-suraj, have we found anything related to that?

Thanks!!

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
