Here, I'd like to collect some ideas for features that we would like to see in the next version of Flair.
Ideas:
- The flair.nn.Model interface needs to be simplified (fewer methods) and generalized in such a way that implementing this interface will immediately enable training using the ModelTrainer class (see also #474).
- We currently use segtok for tokenization, but maybe we can include other tokenizers (#394), perhaps even our own trained over the UD corpora.

Any other ideas? Please let us know!
A side note: In March, a few of us will be out of office for vacations, so development will likely slow down a bit. But come April, we'll start working full steam on 0.5 :)
Currently, the transfer learning is purely feature based. Do you consider adding fine-tuning based transfer learning for both sequence tagging and text classification? I think it would be a great addition.
_Disclaimers_: I'm pretty new to Flair, so this is probably at least somewhat misinformed. Also, this might be a bit of a tall order, especially for 0.5, since it alters the API significantly. Hopefully it's not a waste of your time!
Flair is a great library in that it provides a uniform interface for disparate NLP models in one place. So far I see two main weaknesses in the library in this respect:
SequenceTagger, for example, has a .predict method and Embeddings have .embed methods. My suggestion is to refactor the model class hierarchy so that there is a new universal (abstract) base class for the whole library. It would include uniform interfaces with methods like:
- .predict - incremental prediction
- .train - incremental training
- .train_predict - incremental training and prediction
- .batch - training and/or prediction for a batch of files
- .fetch - fetch a pretrained model from the web
- .load - deserialize a model
- .save - serialize a model

Doing this would make it easier to combine the models and abstract over them. What I would love to see is that in the future I could instantiate a FlairModel with my choice of language model(s), tagger(s), embedder(s) and classifier(s), suitably stacked and linked. This class would help me seamlessly combine all of these into a single NLP model, the complete lifecycle of which I could control using a combined interface. Thus calling flairmodel.train(sentence) would train all the taggers, embeddings, and other models I put inside the model simultaneously.
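A rough sketch of how such a universal base class could look (purely illustrative; FlairModel and these method names are the proposal above, not existing Flair API):

```python
from abc import ABC, abstractmethod
from pathlib import Path
from typing import List, Union


class FlairModel(ABC):
    """Hypothetical universal base class for taggers, embedders and classifiers."""

    @abstractmethod
    def predict(self, sentences: List["Sentence"]) -> List["Sentence"]:
        """Incremental prediction."""

    @abstractmethod
    def train(self, sentences: List["Sentence"]) -> None:
        """Incremental training."""

    def train_predict(self, sentences: List["Sentence"]) -> List["Sentence"]:
        """Incremental training followed by prediction."""
        self.train(sentences)
        return self.predict(sentences)

    @classmethod
    @abstractmethod
    def fetch(cls, name: str) -> "FlairModel":
        """Fetch a pretrained model from the web."""

    @classmethod
    @abstractmethod
    def load(cls, path: Union[str, Path]) -> "FlairModel":
        """Deserialize a model from disk."""

    @abstractmethod
    def save(self, path: Union[str, Path]) -> None:
        """Serialize the model to disk."""
```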
I plan to release language models (trained on Wikipedia dumps) for 16 languages (no, fa, ar, id, pl, da, hi, nl, eu, sl, he, hr, fi, bg, cs and sv). LMs are already trained, but I have to check their performance (at least on Universal Dependencies) first.
Support for OpenAI GPT embeddings is implemented; I'll prepare a PR in the next few days.
I think I found a way to use XLM embeddings in flair, but integrating it involves a couple of challenging tasks (library import in Python, license...).
@gccome @ixxie @stefan-it thanks for the input!
Fine-tuning is definitely a feature to add. It would also be great to clean up the whole serialization process and make it more robust to changes between versions etc.
Also agree on making everything more modular so researchers can stack components, though I am not sure if all modules need exactly the same interface since they have different functions. It may end up being more intuitive for users if embeddings have a .embed() method and models a .predict() method since they both do different things, but this is something we'll need to figure out.
@stefan-it really looking forward to your next PRs! :)
@alanakbik I understand the concerns; since this is a longer discussion, I created a separate issue for this purpose: https://github.com/zalandoresearch/flair/issues/567
I vote on:
Refactor data loading methods
This is clearly preferred.
We are currently trying to train some NER models based on string embeddings. When stored in the cache (which doesn't help to speed up the process), the forward and backward string embeddings take nearly 35 GB each. A total of 70 GB would be somewhat too much for our memory...
One new feature could be the integration of magnitude embeddings. It is a "fast, lightweight tool for utilizing and processing embeddings". Very different trained embeddings, from GloVe to ELMo, are already available. It could bring some homogenization to the different embedding classes. Let me know if it seems like a good idea!
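For reference, a minimal usage sketch of the magnitude API (the .magnitude file name below is a placeholder for any converted embedding file):

```python
from pymagnitude import Magnitude

# load a pre-converted .magnitude embedding file (memory-mapped, so it loads lazily)
vectors = Magnitude("glove.6B.100d.magnitude")

print(vectors.dim)           # embedding dimensionality
print(vectors.query("cat"))  # vector for a single word; OOV words get a deterministic vector
```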
PS: You can now pin this issue if you want.
@mauryaland hey this is a great idea - magnitude has really come a long way. Way back in the first version of Flair we looked at magnitude but there were some reasons for why we eventually went with gensim (can't remember what exactly, something with regards to speed and serialization I think). From a quick glance at your links, I really think we should look at integrating magnitude for v0.5.
I agree, please work on
Refactor data loading methods
as much as possible. I'm waiting to deploy and open-source a state-of-the-art Sentence Compression model, but literally no one else has a good framework for doing binary PoS tagging for word-level extractive compression. Flair would be preferred, except that I can't load my dataset into VRAM.
Ok! Yes, the data loader is definitely the priority for v0.5!
A side note: In March, a few of us will be out of office for vacations, so development will likely slow down a bit. But come April, we'll start working full steam on 0.5 :)
Same here :) I'll be on vacation until 13th of March (maybe we could set this in our GitHub status here)
Please, see pull request #595 for implementation of "Refactor data loading methods"
@mfojtak wow this is great, thanks! :) We'll check it out!
Concerning the tokenization part, it could be a good idea to use the freshly released Python Stanford NLP library. We can use only the tokenizer by disabling other components such as lemmatization. With it, we can easily choose the language and get really good tokenization. If you are OK with it, I can try to implement it.
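A minimal sketch of how that could look, assuming the stanfordnlp models for the target language have been downloaded beforehand (e.g. via stanfordnlp.download('fr')):

```python
import stanfordnlp

# build a pipeline that runs only the tokenizer, disabling the other processors
nlp = stanfordnlp.Pipeline(lang="fr", processors="tokenize")

doc = nlp("Le chat dort. Le chien aboie.")
for sentence in doc.sentences:
    print([token.text for token in sentence.tokens])
```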
Another feature that could be nice for the SequenceTagger is the possibility to add metadata so that the entity recognizer respects existing entity spans and adjusts its predictions around them. This could be a one-hot encoding fed into the CRF, or another approach.
Let me know if you are interested in those ideas!
Hello @mauryaland, both ideas sound really interesting. I just checked the Python Stanford NLP library and it seems fairly lightweight in terms of dependent libraries, so including it should be possible! We'd very much appreciate an implementation that includes their multilingual tokenization!
For the second idea it would be great if this could be implemented as another Embeddings class. I.e. some sort of Gazetteer embedder that encodes this metadata as one-hot embeddings at the word level. This way, we would have a nice way of including gazetteer knowledge (such as known entity spans) in any task, not just the SequenceTagger but also TextClassification and TextRegression etc.
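A minimal sketch of what such a gazetteer embedder could look like, assuming the current TokenEmbeddings interface (the class name and the single-dimension one-hot scheme are just illustrative):

```python
import torch
from typing import List

from flair.data import Sentence
from flair.embeddings import TokenEmbeddings


class GazetteerEmbeddings(TokenEmbeddings):
    """Illustrative embedder: one-hot encodes whether a token appears in a gazetteer."""

    def __init__(self, gazetteer: set):
        self.name = "gazetteer"
        self.gazetteer = gazetteer
        self.__embedding_length = 1
        self.static_embeddings = True
        super().__init__()

    @property
    def embedding_length(self) -> int:
        return self.__embedding_length

    def _add_embeddings_internal(self, sentences: List[Sentence]) -> List[Sentence]:
        for sentence in sentences:
            for token in sentence:
                in_gazetteer = 1.0 if token.text in self.gazetteer else 0.0
                token.set_embedding(self.name, torch.tensor([in_gazetteer]))
        return sentences
```

Such an embedder could then be stacked with other embeddings via StackedEmbeddings and used in any downstream model.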
For both, we'd very much appreciate your contribution!
@alanakbik I've seen a lot of issues and work on evaluation metrics over several months here. What do you think about using the scikit-learn precision_recall_fscore_support method as a kind of addition (or replacement)?
@stefan-it yes absolutely agree - we could probably use scikit learn instead for all evaluation metrics except the span F1 measure for which we would need to add back in the CoNLL-03 script.
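For illustration, the scikit-learn call would look something like this (toy labels, micro-averaging as one possible choice):

```python
from sklearn.metrics import precision_recall_fscore_support

# toy token-level gold and predicted labels
y_true = ["PER", "LOC", "O", "LOC", "O"]
y_pred = ["PER", "O", "O", "LOC", "O"]

precision, recall, f1, _ = precision_recall_fscore_support(
    y_true, y_pred, average="micro"
)
print(f"P={precision:.2f} R={recall:.2f} F1={f1:.2f}")
```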
What about supporting Python 3.7? It is supported by most NLP packages now.
I'm becoming more familiar with Flair now and am really impressed with it. My next task is to investigate productionisation options, and in this vein I plan to look at torch.jit (https://pytorch.org/docs/stable/jit.html) - not sure if that's something that's on the project's roadmap?
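For anyone curious, a generic TorchScript sketch (a toy module, not an actual Flair model; whether Flair models trace cleanly is exactly what would need investigating):

```python
import torch

# toy stand-in model; a real investigation would trace an actual Flair model instead
model = torch.nn.Linear(10, 2).eval()
example_input = torch.randn(1, 10)

# tracing records the operations for the example input and yields a serializable module
traced = torch.jit.trace(model, example_input)
traced.save("model_jit.pt")

# the saved module can be reloaded, even without the original Python class definitions
loaded = torch.jit.load("model_jit.pt")
print(loaded(example_input))
```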
Hello @alanakbik, I was wondering about the possibility of using a tokenizer other than segtok (from spaCy to stanfordnlp). What do you think of outsourcing this part, for example by making it possible to pass a list containing the tokens as an argument? This solution could be very convenient in my opinion.
An idea of implementation:
```python
class Sentence:
    """
    A Sentence is a list of Tokens and is used to represent a sentence or text fragment.
    """

    def __init__(
        self,
        text: str = None,
        use_tokenizer: bool = False,
        tokens_from_list: List[str] = None,
        labels: Union[List[Label], List[str]] = None,
    ):
        super(Sentence, self).__init__()

        self.tokens: List[Token] = []
        self.labels: List[Label] = []
        if labels is not None:
            self.add_labels(labels)
        self._embeddings: Dict = {}

        # if text is passed, instantiate sentence with tokens (words)
        if text is not None:
            # tokenize the text first if option selected
            if use_tokenizer:
                # use a list of tokens obtained by any tokenizer of your choice
                if tokens_from_list is not None and len(tokens_from_list) > 0:
                    tokens = tokens_from_list.copy()
                else:
                    # use segtok for tokenization
                    tokens = []
                    sentences = split_single(text)
                    for sentence in sentences:
                        contractions = split_contractions(word_tokenizer(sentence))
                        tokens.extend(contractions)
```
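Usage of the proposed constructor would then look like this (hypothetical, since the tokens_from_list parameter does not exist yet):

```python
from flair.data import Sentence

# tokens produced by any external tokenizer of your choice
external_tokens = ["The", "grass", "is", "green", "."]

sentence = Sentence(
    text="The grass is green.",
    use_tokenizer=True,
    tokens_from_list=external_tokens,
)
```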
@leo-gan I think many users are already running Flair with Python 3.7. Our requirement is only 3.6+ so that should include newer versions. Or do you mean specific features of Python 3.7+?
@emrul better productization options would be great. It is on our mind but not part of a specific roadmap, so if you could share your findings on torch jit that would be great :) We also want to look into PyText in this context.
@mauryaland yes, tokenization is definitely a feature we want to work on more. Right now, the way to use an external tokenizer is to take its list of string tokens and make a whitespace-tokenized string out of it when calling the constructor, like this:

```python
sentence = Sentence(' '.join(tokens_from_list))
```
The way you posted might be more convenient so we'll definitely look into integrating an option like this!
@alanakbik Thanks for the tip for using an external tokenizer. Could I propose a PR with the code I suggested in order to launch the work on this topic?
@mauryaland Would be great :) I think we can then add several tokenizers (and even kind of "benchmark" them)!
@stefan-it yes absolutely agree - we could probably use scikit learn instead for all evaluation metrics except the span F1 measure for which we would need to add back in the CoNLL-03 script.
@alanakbik Even with the latest master there's an F1-score vs. accuracy calculation bug for PoS tagging tasks; I will open an issue for that. I think we should really rely on the CoNLL-03 script (with, of course, correct conversion to the IOB tagging scheme).
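For reference, a minimal sketch of such a tagging scheme conversion (BIOES to IOB2, assuming tags of the form 'S-PER'; the real conversion for the CoNLL-03 script would need to match its exact expectations):

```python
def bioes_to_iob2(tags):
    """Convert a BIOES tag sequence (e.g. 'S-PER') to the IOB2 scheme."""
    converted = []
    for tag in tags:
        if tag == "O":
            converted.append(tag)
        else:
            prefix, label = tag.split("-", 1)
            # single-token spans become B-, span-final tokens become I-
            prefix = {"S": "B", "E": "I", "B": "B", "I": "I"}[prefix]
            converted.append(f"{prefix}-{label}")
    return converted


print(bioes_to_iob2(["S-PER", "O", "B-LOC", "E-LOC"]))
# ['B-PER', 'O', 'B-LOC', 'I-LOC']
```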
I vote for:
Hello,
I have been working with the library for a while now, I also want to add some requests. I am willing to support the implementation side.
[ ] Apparently, flair does not show training accuracy unless we specify the monitor_train variable, which processes the whole training dataset once more. However, we could still calculate accuracies at each batch during the training loop and show the final accuracy at the end.
[ ] Although some changes and commits have been made, for large datasets GPU memory still grows from 2GB to ~10GB (SequenceTagger model). I think this needs to be inspected carefully and fixed, since it prevents training another model on the same GPU.
[ ] GPUs are already under-utilized due to the overhead of processing data, so I believe multi-GPU support should not be a primary concern until we find a solution. I propose C++ extensions for the loss and initial processing parts; this tutorial might be helpful to achieve it.
[ ] Currently flair does not support distributed hyper-parameter tuning. I found a library called Ray whose submodule Tune provides a very strong and easy interface for that. This tutorial could be a good start for adding that feature to flair. I ran the code once and, if I am not mistaken, all training runs occur in separate threads, which allows distributed and multi-GPU hyper-parameter tuning.
[ ] Python files like datasets and embeddings in the library are way too large to include more code in them. My proposal is that we create sub-packages with folder names embeddings and datasets and move all related classes and methods into separate files, like document_embeddings, word_embeddings, or conll03, wikiner etc.
[ ] flair has word and document embeddings but does not have sentence embeddings for tasks such as sentiment analysis; why don't we add that as well?
Hello @myaldiz agree on pretty much all points and help would be very much appreciated. I especially agree that GPU underutilization is a major concern that has priority over multi-GPU (multi-GPU currently only really makes sense in the LanguageModelTrainer where GPU-utilization is always 100%). Also a good point on monitoring training accuracy - currently, the training data is evaluated at the end of an epoch (i.e. after everything has been learnt), while your proposal would always take the current state and evaluate on the current batch, yielding different numbers. But this would be more resource-effective so that would be good from my side.
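To make the difference concrete, here is a toy sketch of batch-level accuracy tracking inside a training loop (not actual Flair trainer code; batches are stand-in tensors):

```python
import torch


def run_epoch(batches):
    """Accumulate a running accuracy over (predicted, gold) label tensors per batch."""
    correct, seen = 0, 0
    for predictions, gold in batches:
        # ... forward pass, loss computation, backward and optimizer step go here ...
        correct += (predictions == gold).sum().item()
        seen += gold.numel()
    return correct / seen


# toy data: two batches of (predicted labels, gold labels)
batches = [
    (torch.tensor([1, 0, 1]), torch.tensor([1, 1, 1])),
    (torch.tensor([0, 0]), torch.tensor([0, 0])),
]
print(f"running training accuracy: {run_epoch(batches):.2f}")  # 4/5 = 0.80
```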
We are planning to change the package structure especially with regards to embeddings - we are in fact in the process of adding embeddings for other modalities besides text (such as images) so the package structure needs to grow with this. This would be a bigger and potentially breaking change that we want to do for a major version upgrade.
Document and sentence embeddings for us are currently the same thing. They simply mean embeddings for a text that could be short or long.
@alanakbik I am curious: there are multiple algorithms proposed for hyperparameter tuning with Tune, but does any of them have a specific advantage over the others?
I was thinking of using flair's existing hyper-opt part and integrating Tune into it in a new Python file for now. If you have any ideas about the others, I'd appreciate your comments.
Tune indeed seems like the best choice, since it wraps other libs: HyperOpt, Nevergrad, Scikit-Optimize, Ax. I don't know what the best hyper-parameter optimization method is (I haven't found a standard benchmark; let me know if you find one), so we can leave the choice of the specific optimization algorithm/lib to the user. There are some newer libs missing from Tune (e.g. Dragonfly), but that's a matter of adding the lib to Tune.
Today I tried to implement Ray Tune with hyperopt and integrate it into the code using the previous methods, so that there is no difference from the user's perspective for distributed hyper-parameter tuning. I noticed two fundamental issues with the trainer class and the nn.Model class.
First of all, when using Ray Tune you inherit from a class called Trainable and have to implement some of its methods, like the following:
```python
import os

import torch
import torch.nn.functional as F
from ray.tune import Trainable


class TrainMNIST(Trainable):
    def _setup(self, config):
        # this is used instead of __init__
        # we can initialize our trainer here
        some_init_code()

    def _train_iteration(self):
        self.model.train()
        for batch_idx, (data, target) in enumerate(self.train_loader):
            self.optimizer.zero_grad()
            output = self.model(data)
            loss = F.nll_loss(output, target)
            loss.backward()
            self.optimizer.step()

    def _test(self):
        self.model.eval()
        test_loss = 0
        correct = 0
        with torch.no_grad():
            for batch_idx, (data, target) in enumerate(self.test_loader):
                output = self.model(data)
                # sum up batch loss
                test_loss += F.nll_loss(output, target, reduction="sum").item()
                # get the index of the max log-probability
                pred = output.argmax(dim=1, keepdim=True)
                correct += pred.eq(target.data.view_as(pred)).long().cpu().sum()
        test_loss = test_loss / len(self.test_loader.dataset)
        accuracy = correct.item() / len(self.test_loader.dataset)
        return {"mean_loss": test_loss, "mean_accuracy": accuracy}

    def _train(self):
        self._train_iteration()
        return self._test()

    def _save(self, checkpoint_dir):
        checkpoint_path = os.path.join(checkpoint_dir, "model.pth")
        torch.save(self.model.state_dict(), checkpoint_path)
        return checkpoint_path

    def _restore(self, checkpoint_path):
        self.model.load_state_dict(torch.load(checkpoint_path))
```
To use this structure we need a method for training only a single epoch. One problem is that trainer.train() takes and initializes too many variables that should be initialized in the __init__ method. In my opinion, things such as the learning rate should be initialized in __init__.
Another problem is that we load and save checkpoints in the nn.Model methods, which actually include parameters that have nothing to do with the model; I think saving checkpoints should be done by the trainer, not the model class. And while restoring, we need to restore variables like the epoch, so that training does not start from epoch=0 each time.
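A minimal sketch of trainer-side checkpointing (the trainer attribute names are assumptions, not the current API):

```python
import torch


# hypothetical helpers: the trainer (not the model) owns the checkpoint, so training
# state like the current epoch and learning rate can be restored as well
def save_checkpoint(trainer, path: str):
    torch.save(
        {
            "model_state_dict": trainer.model.state_dict(),
            "optimizer_state_dict": trainer.optimizer.state_dict(),
            "epoch": trainer.epoch,                  # assumed trainer attribute
            "learning_rate": trainer.learning_rate,  # assumed trainer attribute
        },
        path,
    )


def restore_checkpoint(trainer, path: str):
    checkpoint = torch.load(path)
    trainer.model.load_state_dict(checkpoint["model_state_dict"])
    trainer.optimizer.load_state_dict(checkpoint["optimizer_state_dict"])
    trainer.epoch = checkpoint["epoch"]  # resume here instead of restarting at epoch 0
    trainer.learning_rate = checkpoint["learning_rate"]
```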
Here is an example of how a DistributedParamSelector could look:
```python
from abc import abstractmethod
from pathlib import Path
from typing import Union

import ray
import torch
from ray import tune
from ray.tune.schedulers import HyperBandScheduler

import flair.nn
from flair.data import Corpus
from flair.hyperparameter.param_selection import OptimizationValue, SearchSpace
from flair.training_utils import EvaluationMetric


class DistributedParamSelector(object):
    def __init__(
        self,
        redis_address: str,
        corpus: Corpus,
        base_path: Union[str, Path],
        max_epochs: int,
        evaluation_metric: EvaluationMetric,
        training_runs: int,
        optimization_value: OptimizationValue,
        use_gpu=torch.cuda.is_available(),
    ):
        self.corpus = corpus
        self.max_epochs = max_epochs
        self.base_path = base_path
        self.evaluation_metric = evaluation_metric
        self.training_runs = training_runs
        self.optimization_value = optimization_value
        self.use_gpu = use_gpu

        ray.init(redis_address=redis_address)
        self.hb_scheduler = HyperBandScheduler(
            time_attr="training_iteration", metric="mean_loss", mode="min"
        )

        # Config dictionary for Tune Param Selector
        config, args = dict(), dict()
        self.config = config
        config["args"] = args
        args["cuda"] = self.use_gpu
        args["corpus"] = corpus

    def optimize(self, space: SearchSpace, max_evals=100, random_seed=1):
        config = self.config
        args = config["args"]
        config.update(space.search_space)
        args["max_evals"] = max_evals
        args["seed"] = random_seed
        args["_set_up_model"] = self._set_up_model

        # TuneParamSelector is the Trainable subclass (analogous to TrainMNIST above),
        # not shown here
        tune.run(
            TuneParamSelector,
            scheduler=self.hb_scheduler,
            **{
                "stop": {"training_iteration": self.max_epochs},
                "resources_per_trial": {
                    "cpu": 3,
                    "gpu": 1 if self.use_gpu else 0,
                },
                "num_samples": 20,
                "checkpoint_at_end": True,
                "config": config,
            }
        )

    @abstractmethod
    def _set_up_model(self, params: dict) -> flair.nn.Model:
        pass
```
@myaldiz these are good points - I agree that saving and restoring checkpoints is not something that should belong to the flair.nn.Model, but rather to the ModelTrainer. I am not entirely sure why we put checkpointing into flair.nn.Model - I think we wanted to avoid any explicit mention of SequenceTagger, TextClassifier etc. in the ModelTrainer so that the trainer operates over the flair.nn.Model interface only. But this effect can probably also be achieved if we move checkpoint management entirely to the ModelTrainer. So I'd be very much open to refactoring this.
The same goes for the current hyperparameter selection interface: I'm generally open to making larger changes here, especially since, as I understand it, Tune subsumes Hyperopt. Also, I am not so sure our current integration of Hyperopt is very user-friendly. So do not feel obliged to keep to this structure.
From my side it would also be OK to move more parameters from .train() to the ModelTrainer's __init__. Would this address the difficulty of the Tune integration, or are greater changes to the Model interface required?
@alanakbik I think it would make everything easier if a method like trainer.train_epoch() existed that requires no parameters and uses only variables instantiated with the object. And for trainer.train(),
- num_epochs
- monitor_train
- monitor_test
- checkpoint
- save_final_model

could be the only parameters required.
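As a hypothetical sketch of that structure (not the current API):

```python
class ModelTrainerSketch:
    """Hypothetical slimmed-down trainer: hyperparameters live in __init__."""

    def __init__(self, model, corpus, learning_rate: float = 0.1, batch_size: int = 32):
        self.model = model
        self.corpus = corpus
        self.learning_rate = learning_rate
        self.batch_size = batch_size
        self.epoch = 0

    def train_epoch(self):
        """Train a single epoch using only state held on the object (Tune-friendly)."""
        # ... one pass over self.corpus.train with self.learning_rate ...
        self.epoch += 1

    def train(
        self,
        num_epochs: int,
        monitor_train: bool = False,
        monitor_test: bool = False,
        checkpoint: bool = False,
        save_final_model: bool = True,
    ):
        for _ in range(num_epochs):
            self.train_epoch()
```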
Should we make PRs for both issues? Maybe after those are fixed, I can move on to the distributed hyper-param thing. And as far as I understand, you have no objections to completely replacing the hyperparameter tuning pipeline with Ray Tune?
@myaldiz sure that works from my side! Yes, no objections from my side since I think making hyperparameter tuning easily usable is more important than keeping to the old interfaces.
MultiGPU would be useful. What do you mean low utilization? I am running TextClassification models, and my GPUs are being utilized.
What is the current recommended procedure to go from here? I have to do hyperparameter tuning now. I would like this to be distributed. One GPU is too slow, especially the cheaper ones.
Should I try this Ray Tune thing? Or should I just divide the hyperparameter tuning into two halves, and run one half on GPU 0 and the other on GPU 1? This isn't really scalable. I want to be able to get my training done during business hours, so I don't have to worry about paying for an idle GPU because it finished outside of business hours. 8 GPUs for 1 hr each cost pretty much the same as 1 GPU for 8 hours. Except the faster way means it finishes before you leave the office, and you can turn off the instance overnight.
This is even worse on the weekend. Your choices on the weekend:
1) Don't run anything on GPUs all weekend, wasting time.
2) Have enough automated that you can keep the GPUs busy all weekend. Harder to engineer.
3) Let the GPUs go idle, have company waste money on idle GPUs.
4) Take time out of your weekend to perform unpaid overtime, to VPN in just so you can turn off GPUs.
If I can't cost-effectively use 2 older GPUs, they aren't going to give me access to 8 state-of-the-art ones, that's for sure.
I'm gonna have to figure out the hyper parameter tuning one way or the other, so any advice here would be appreciated. Thank you very much.
At that time I had not tested the TextClassification models, but for the SequenceTagger model, high-end GPUs are under-utilized; utilization is better for TextClassification models.
About the tuning procedure: you can specify the number of GPUs required for a model to be trained in Ray Tune, and it pretty much automates everything about experimentation for you. Although those 2 GPUs are slow, they will help you finish tuning faster, since experiments run in parallel and do not need to finish sequentially if you configure them accordingly.
Hey, this pytorch-lightning repo is really interesting:
https://github.com/williamFalcon/pytorch-lightning
Lightning is a very lightweight wrapper on PyTorch. It defers core training and validation logic to you and automates the rest. It guarantees correct, modern best practices for the core training logic.
It provides mixed precision, multi-GPU training and a bunch of other awesome features. I've just read a post on Twitter that pytorch-lightning will be integrated into the PyTorch ecosystem very soon!
Oha interesting - will definitely check it out.
Thanks for the advice. If I need more GPUs, they'd probably let me have more. I just wanted to prove responsibility and usefulness for the 2 GPU use case before asking for more.
For now, I just ran two separate hyperopt runs on different GPUs (I made the cuda_device selectable by env variable) and it worked.
I might get to train my own embeddings this week, I hope so. If not this week, then soon for sure.
I also have new Flair Embeddings for the next release: a better LM for Basque (more training data and longer training) and a new LM for Tamil :)
Hi all, I work on https://github.com/ray-project/ray and maintain Tune. I stumbled upon this discussion just browsing github and was wondering if I could help.
RE: distributed hyperparameter tuning - Is there anything that we can do from the Tune side to help support your workload? I'd be happy to extend.
(FYI, not sure if this was a confusion, but you can easily do distributed data-parallel with distributed hyperparameter search in Tune.)
Hi @richardliaw we haven't yet looked into this, but hyperparameter selection is something that we really want to get better support for in the framework so thanks for reaching out! I'm sure once we get started with integrating Tune there'll be lots of questions from our side :)
It would be great if we could also support attention-based architectures, like the one proposed in the recent EMNLP paper "Hierarchically-Refined Label Attention Network for Sequence Labeling":
CRF has been used as a powerful model for statistical sequence labeling. For better representing label sequences, we investigate a hierarchically-refined label attention network, which explicitly leverages label embeddings and captures potential long-term label dependency by giving each word incrementally refined label distributions with hierarchical attention.
That is really interesting, and a PyTorch implementation can be found in this repository by @Nealcly.
For one of the next releases I can provide the following (new) Flair Embeddings:
Greek (already trained), Estonian (already trained), Irish (already trained), Hungarian (currently training) and Romanian (scheduled after Hungarian).
I obtained specially prepared/collected corpora from the Leipzig Corpora Collection for these languages :)
Awesome! :)
@alanakbik One question: could you add/upload WordEmbeddings for el, et, ga and hu? They're currently missing and I would really like to run experiments with these word embeddings later :)
@stefan-it sure, will do!
@stefan-it embeddings are added, both X-crawl and X-wiki should work now for these languages!
Thanks @alanakbik, I totally forgot the word embeddings for Tamil (ta) 😅 Could you also add them? I'm currently doing PoS tagging experiments for these languages (with the newly trained Flair Embeddings) and the results are looking great!
@stefan-it done! Look forward to the POS results! :)
Please add coreference resolution and dependency parsing, I cannot use this wonderful library without them...
You could use stanfordnlp, which leverages CoreNLP, please!
Hi @gccome and all,
For fine-tuning based transfer learning, I built an experimental library called AdaptNLP that takes a ULMFiT approach to incorporating the latest transformer model architectures, like BERT, GPT2, and ALBERT, with Flair's prediction heads and trainers.
The library is located here: https://github.com/Novetta/adaptnlp and it's built atop Flair. It also provides some other interesting NLP applications I've been working on or thought were very cool/useful, so feel free to try it out. The library is in its early release stages, so please feel free to raise any issues or feature requests in the repo.
Flair has been an awesome dependency and a joy to work with, so please let me know if there's any way AdaptNLP could be useful for Flair's development as well!
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
We can close this since 0.5 is out (and 0.5.1 coming soon), but the remaining points are still on our list, especially multi-task learning.