Transformers: Training data format

Created on 22 Jul 2020  ·  29 Comments  ·  Source: huggingface/transformers

I have text on which I want to fine-tune the GPT-2 model for text autocompletion. The sentences in my text are separated by newlines; is there any format I should follow? When I trained on the data as-is with the default training parameters, it did not give me proper results. After splitting, I have nearly 25k sentences for training. Please suggest. The training data looks like this:
[screenshot of the training data omitted]

wontfix

All 29 comments

from transformers import AutoModelWithLMHead, AutoTokenizer

# Load the fine-tuned model and tokenizer from the training output directory.
tokenizer = AutoTokenizer.from_pretrained("language-modeling/output/")
model = AutoModelWithLMHead.from_pretrained("language-modeling/output/")

# Tokenize the partial query and generate a completion.
input_text = "organic che"
features = tokenizer([input_text], return_tensors='pt')

output = model.generate(input_ids=features['input_ids'],
                        attention_mask=features['attention_mask'])

tokenizer.decode(output[0])

I want to make a query autocomplete; these are the user queries, separated by newlines.

@patil-suraj

Should I add some special token at the start and end of every search query?

As far as I can see, your dataset format is correct. You also don't need to add any special tokens; the tokenizer adds them by default.

When I add --line_by_line, I get this error:
You are attempting to pad samples but the tokenizer you are using (GPT2Tokenizer) does not have one.

I want to make a text autocomplete. Am I using the correct model? Do I have sufficient training sentences? Should I add --line_by_line while training? Please help!
@patil-suraj

Hi @vyaslkv, you can use GPT-2 for autocomplete; as for the number of training examples, you will need to experiment.

pinging @sgugger for the error.

LineByLineDataset is not really suitable for GPT-2: you should concatenate your texts with the separation token and feed chunks of the model's context size (I can't remember off the top of my head if it's 512 or 1024, but it should be in the config of the model). As the error message says, GPT-2 does not have a padding token.

@sgugger, can you explain a bit which token to use and how the code would look in that case? Sorry if I am asking too much; or can you give me some reference I could use?

Thanks for responding

The separation token will automatically be added by the tokenizer. The rest is just standard Python: concatenate all your lists of tokens into a big numpy array, then reshape it to something x model_len, where something is the number of "sequences" (they'll actually span several lines of your dataset) you can build from your dataset. You can then iterate over the rows of that array as a dataset.
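The concatenate-and-reshape step described above can be sketched as follows (a minimal illustration with toy token ids and a tiny context size; GPT-2's actual context size is in the model config, and model_len = 4 is only to keep the demo small):

```python
import numpy as np

# Assume `tokenized_lines` is a list of token-id lists, one per line of the
# training file, e.g. produced by tokenizer.encode(line) for each line.
tokenized_lines = [[10, 11, 12], [13, 14], [15, 16, 17, 18, 19]]
model_len = 4  # stand-in for the model's context size (e.g. 1024 for GPT-2)

# Concatenate everything into one long array of token ids.
all_ids = np.concatenate([np.array(ids) for ids in tokenized_lines])

# Drop the remainder so the array reshapes cleanly, then build
# (num_sequences, model_len) rows to iterate over as a dataset.
num_sequences = len(all_ids) // model_len
chunks = all_ids[: num_sequences * model_len].reshape(num_sequences, model_len)
print(chunks.shape)  # (2, 4)
```

Note that each row can span several lines of the original file, which is exactly what the comment above describes.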

What changes do I need to make in this command?

python run_language_modeling.py \
    --output_dir=output \
    --model_type=gpt2 \
    --model_name_or_path=gpt2 \
    --do_train \
    --train_data_file=$TRAIN_FILE \
    --do_eval \
    --eval_data_file=$TEST_FILE

The script will do this automatically for you if you don't add the line-by-line flag (except that the sentences are separated by newlines rather than the special token). You can try replacing the newlines with "<|endoftext|>".
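Replacing the newlines with GPT-2's end-of-text token is a one-line preprocessing step (a sketch on an in-memory sample; in practice you would read from your training file and write the result back out):

```python
# Sample of the training file: one search query per line.
raw = "organic chemistry\nvegetative reproduction of agave\n"

# Strip blank lines and join the queries with GPT-2's end-of-text token,
# so the training script sees query boundaries instead of plain newlines.
queries = [line.strip() for line in raw.splitlines() if line.strip()]
text = "<|endoftext|>".join(queries)

print(text)
```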

Cool, thanks @sgugger. Just to clarify: if I put "<|endoftext|>" in place of the newlines, I don't need to make any other changes, right?

Normally, no.

Thanks @sgugger, thanks a ton, really, for helping so quickly.

@sgugger @patil-suraj I trained with the format you shared, but it is generating irrelevant text that is not from the training data I gave. What am I missing here?

from transformers import AutoModelWithLMHead, AutoTokenizer

# Load the fine-tuned model and tokenizer from the training output directory.
tokenizer = AutoTokenizer.from_pretrained("output1/")
model = AutoModelWithLMHead.from_pretrained("output1/")
input_ids = tokenizer.encode('Vegetative reproduction of Agave', return_tensors='pt')

# Beam search with num_return_sequences > 1 to get several candidates.
beam_outputs = model.generate(
    input_ids,
    max_length=50,
    num_beams=10,
    no_repeat_ngram_size=2,
    num_return_sequences=10,
    early_stopping=True
)

# Now we have 10 output sequences.
print("Output:\n" + 100 * '-')
for i, beam_output in enumerate(beam_outputs):
    print("{}: {}".format(i, tokenizer.decode(beam_output, skip_special_tokens=False)))

I want to generate text autocompletions from the training data.

@sgugger, can you please help?

Hi @vyaslkv, I think the best place to ask this question is the HF forums; someone who has already worked on a similar task can answer it better. Although @sgugger might have some answers :)

@patil-suraj Thanks, I will put my question there as well.

@sgugger @patil-suraj no one has responded on the forum 😔

@patil-suraj I didn't get any response can you please help

Hi @vyaslkv, I'll see if anyone I know has worked on a similar problem and get back to you.

@patil-suraj Thanks

@patil-suraj ?

Hello @patil-suraj, have we found anything related to that?

Thanks!!

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
