I'd like to ask a few questions about text generated from a flair language model (forward). The training data consists of 13.8 million lines of text whose domain varies from news to keyboard conversation data (most of it is keyboard conversation data). The validation set and the test set contain 140k lines of sentences each. I've already trained for nearly a week, and the perplexity on the validation set has reached nearly 3. I found that most of the generated short sentences are meaningful, but for long sentences, the words are meaningful at the phrase level and not at the sentence level (I mean the generated phrases are not meaningfully connected). Is this normal, or does my language model training need something else? I'm using a hidden size of 2048, a mini-batch size of 100, and a patience of 50.
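For context on the perplexity number above: flair reports character-level perplexity, which is the exponential of the average per-character cross-entropy loss. A minimal plain-Python sketch of that relationship (illustrative only, not flair's actual code):

````python
import math

def perplexity(char_log_probs):
    """Perplexity from per-character log-probabilities (natural log):
    exp of the mean negative log-likelihood."""
    nll = -sum(char_log_probs) / len(char_log_probs)
    return math.exp(nll)

# A model that assigns each character probability ~1/3 on average has
# perplexity ~3: it is as uncertain as a uniform choice between
# 3 characters at every step, which is already quite sharp for a
# character-level model.
probs = [1 / 3] * 10
print(perplexity([math.log(p) for p in probs]))  # ~3.0
````

So a validation perplexity of 3 means the model is, on average, choosing between roughly 3 plausible next characters.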
Can you post some examples? What value did you use for the sequence_length parameter when training the LM?
The sequence length is 250, as suggested by the tutorials. The text below was generated after training on one split of the data finished. The language I'm currently training on is Burmese.
````
nပြည်ကအသိရှိတယ် nနှစ်ယောက်စလုံးထက်အများကြီးပိုကြမ်းတယ်အလုပ်မပိတ်ရင်လဲ nအမေချက်ကျွေးတာလား nဆေးရုံစာတွေပြရမှာ nဘာတွေလိုအပ်လာရင်သုံးမယ်ပေါ့ nခမလေးကျူရှင်သွားတော့မယ်နော် nအင်တာနက်ကုန်တာဘယ်လိုလုပ်ရမှာလဲဟ nဖူးစားကြည်ပါချစ်နေတယ်လေ nရေကစားဖို့လာခေါ်ရင်လဲမလိုက်တော့ဘူးလေ nတကယ်ပဲညနက်နေပီပဲ့ nအိပ်လိုက်ဖို့လို့ nရီချင်မိတယ်အရူးလေး nဒီမှာတော့ကားခကောက်တဲ့ထဲ nတခုခုလုပ်ပေးမယ် nအိမ်ထဲကနေအိမ်အပြင်ကိုမထွက်ဘူး nအဲ့တော့သူတစ်ပိုင်းသေနေတဲ့အကွက်တွေနဲ့ nသူမနှစ်သက်တာမှန်သမျှတွေကိုခွက်တင်ပြီးအရေးမယူသင့်ဘူးဖျော့ဖျောင့်ရှောင်လိုက်ရင်းနဲ့တဖြည်းဖြည်းဝေးကွာသွားတာ nနေမကောင်းပါဘူးဆိုဆေးသောက်လို့ပြောပီးလက်ခဏခဏဆေးရတယ်လေ nအလုပ်ဆင်းနေလားဖုန်းလာတယ် nအေးမြတ်ခိုင်သာအခုလိုဖြစ်လာပါတယ် nဘာတွေသောက်ရမယ်ဆိုတာသူများတွေလိုဂရုစိုက်မူမရှိဘူး nမြန်မာပြည်ကနဂိုကတည်းကမကောင်းတာ nဘာတေခါးခါးမှမသိဘူး nပြန်လာမှကြိုးစားဆက်လုပ်ပါ nစကားတစ်ခွန်းပြောမိရင်မညိုလက်အောက်ခံကြီးပဲ nအဓိကကသူတို့ကအဲ့လောက်ပြောပြောနေတားလေအမနဲ့မဆိုင်ပါဘူး nဒီလိုမျိုးကိစ္စအတွက်နစ်နာမှုတစ်ခုလုပ်ရပ်များနဲ့စောင့်ကြည့်ရမှာဖြစ်ပါတယ် nငါ့အရူးမလိုင်းခိုးတက် nငါနဲ့အရင်လိုစကားတွေပြောနေတာမက'
````
For this sentence,
````
သူမနှစ်သက်တာမှန်သမျှတွေကိုခွက်တင်ပြီးအရေးမယူသင့်ဘူးဖျော့ဖျောင့်ရှောင်လိုက်ရင်းနဲ့တဖြည်းဖြည်းဝေးကွာသွားတာ
````
This means "all the things she liked are placed on a cup and it should not be punished and was gone gradually while preventing it lightly." (I translated it for you because Google Translate is still bad at translating informal written Burmese.)
IMO this is the quality of this language model.
@djstrong , is it bad? Do I need to retrain it? Which parts of the training do I need to change?
I mean, the quality of generated text using any "Flair" (character-based LSTM) language model won't be better.
Did the learning rate already anneal during training? It should anneal first to 5 (if you started with 20) and then to 1.25. After annealing, you typically see an improvement in perplexity.
If this already happened, I agree with @djstrong that this is as good as the language model will be. Generally, RNNs struggle to learn complex information that spans long sequences, so it is normal to see text that makes more sense on the phrase level than on the sentence level.
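The annealing behavior described above can be sketched in plain Python. This is a simplified stand-in for the plateau-based scheduler flair uses, assuming the anneal factor of 0.25 implied by the 20 → 5 → 1.25 sequence:

````python
def anneal_schedule(initial_lr, val_losses, patience, factor=0.25):
    """Simulate patience-based annealing: multiply the learning rate by
    `factor` whenever the best validation loss has not improved for more
    than `patience` consecutive evaluations."""
    lr = initial_lr
    best = float('inf')
    bad_steps = 0
    history = [lr]
    for loss in val_losses:
        if loss < best:
            best = loss
            bad_steps = 0
        else:
            bad_steps += 1
            if bad_steps > patience:
                lr *= factor
                history.append(lr)
                bad_steps = 0
    return history

# With patience=2, a long plateau in validation loss takes the rate
# through 20 -> 5 -> 1.25 -> 0.3125, matching the values in this thread.
losses = [5.0, 4.0] + [4.0] * 9
print(anneal_schedule(20.0, losses, patience=2))
````

A larger patience simply means more non-improving evaluations are tolerated before each division, so the same schedule plays out more slowly.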
@djstrong , ah, I see. So this is one of the characteristics of a character-based LSTM language model.
@alanakbik , yes. The learning rate already annealed to 5, then 1.25, then 0.3125, and is now 0.0781. I also found that the perplexity on the validation set hasn't improved from 3.03 since the learning rate annealed to 0.3125. Should I stop or continue the training process? I've already trained for 8 days. I'd also like to know one more thing: how can I test the quality of the trained flair model? I mean using visualization of words or something similar.
You can stop training if the learning rate is that small. How long did it take to anneal the first time? If that went too fast, you could think about increasing the patience.
I haven't experimented with visualizations; generally I evaluate the embeddings in downstream tasks such as NER to see if they improve results.
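Besides downstream evaluation, a cheap qualitative check is to sample text at different temperatures: low temperatures make the model commit to its most probable continuations, while high temperatures expose how flat its distribution is. Recent flair versions expose generation on the trained `LanguageModel` directly; the sketch below only illustrates the temperature-sampling idea in plain Python over a made-up character distribution, and is not flair's code:

````python
import math
import random

def sample_char(char_probs, temperature=1.0, rng=random):
    """Sample a character after rescaling log-probabilities by temperature.
    temperature -> 0 approaches argmax; temperature > 1 flattens the
    distribution."""
    chars = list(char_probs)
    logits = [math.log(char_probs[c]) / temperature for c in chars]
    m = max(logits)
    weights = [math.exp(l - m) for l in logits]  # numerically stable softmax
    total = sum(weights)
    r = rng.random() * total
    acc = 0.0
    for c, w in zip(chars, weights):
        acc += w
        if r <= acc:
            return c
    return chars[-1]

# Toy distribution: at near-zero temperature, sampling almost surely
# returns the most probable character.
dist = {'a': 0.7, 'b': 0.2, 'c': 0.1}
print(sample_char(dist, temperature=0.01))
````

If generations only look coherent at very low temperatures, the model's distribution over long-range continuations is weak, which matches the phrase-level-but-not-sentence-level behavior reported above.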
@alanakbik , thanks for your kind answer. It took nearly 3 days to anneal the first time. Is that fast? I split the nearly 14 million lines into 9 splits, and each split took nearly 15 minutes. I have already stopped the forward training because it didn't improve for two days; the perplexity on the test set reached 3.02 for forward training. Now I'm using a patience of 30 for the backward training.
That sounds good for this corpus size.