While trying to build a sentiment classifier using the sentiment140 corpus link, I encountered an error.
After (successfully?) loading the corpus, flair throws an 'BrokenPipeError' during the creation of the label dictionary.
I am running Python3.6 and flair0.4.3 within windows10.
The code:
from flair.data import Corpus
from flair.datasets import CSVClassificationCorpus
data_folder = './twitterData'
column_name_map = {5: "text", 0: "label_topic"}
corpus: Corpus = CSVClassificationCorpus(data_folder,
column_name_map,
skip_header=False,
delimiter=',',
)
label_dict = corpus.make_label_dictionary()
The output:
2019-09-20 10:14:09,107 Reading data from twitterData
2019-09-20 10:14:09,107 Train: twitterDatatraining.1600000.processed.noemoticon.csv
2019-09-20 10:14:09,107 Dev: None
2019-09-20 10:14:09,107 Test: twitterDatatestdata.manual.2009.06.14.csv
2019-09-20 10:14:18,221 Computing label dictionary. Progress:
Traceback (most recent call last):File "
", line 14, in
label_dict = corpus.make_label_dictionary()File "C:Anaconda3libsite-packagesflairdata.py", line 948, in make_label_dictionary
for batch in Tqdm.tqdm(iter(loader)):File "C:Anaconda3libsite-packagestorchutilsdatadataloader.py", line 278, in __iter__
return _MultiProcessingDataLoaderIter(self)File "C:Anaconda3libsite-packagestorchutilsdatadataloader.py", line 682, in __init__
w.start()File "C:Anaconda3libmultiprocessingprocess.py", line 112, in start
self._popen = self._Popen(self)File "C:Anaconda3libmultiprocessingcontext.py", line 223, in _Popen
return _default_context.get_context().Process._Popen(process_obj)File "C:Anaconda3libmultiprocessingcontext.py", line 322, in _Popen
return Popen(process_obj)File "C:Anaconda3libmultiprocessingpopen_spawn_win32.py", line 65, in __init__
reduction.dump(process_obj, to_child)File "C:Anaconda3libmultiprocessingreduction.py", line 60, in dump
ForkingPickler(file, protocol).dump(obj)BrokenPipeError: [Errno 32] Broken pipe
In linux with python 3.6 and the latest flair, it works well.
Maybe you can try to clone the latest flair from github.
In linux with python 3.6 and the latest flair, it works well.
Maybe you can try to clone the latest flair from github.
The problem resides in the fact that Windows multiprocessing uses spawn instead of fork (there is no fork mechanism in Windows) and that seems to be incompatible with the current implementation of the dataloader. You can test it yourself in Linux, if you force multiprocessing to use spawn it will fail too:
To reproduce in Linux add this inside your __main__ guard.
from torch.multiprocessing import Pool, Process, set_start_method
try:
set_start_method('spawn')
except RuntimeError:
pass
Thanks for reporting this. @mrbungie I am trying to reproduce the error my ubuntu setup, but everything seems to work, including with setting the start method to 'spawn'. Could you paste a minimal code example to reproduce the error?
I found this doc and i think it might be the issue and help solve it:
https://pytorch.org/docs/stable/notes/windows.html#multiprocessing-error-without-if-clause-protection
`import torch
def main()
for i, data in enumerate(dataloader):
# do something here
if __name__ == '__main__':
main()`
Im not sure where to add the code though.
If anybody can help, Thank You
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
Most helpful comment
The problem resides in the fact that Windows multiprocessing uses spawn instead of fork (there is no fork mechanism in Windows) and that seems to be incompatible with the current implementation of the dataloader. You can test it yourself in Linux, if you force multiprocessing to use spawn it will fail too:
To reproduce in Linux add this inside your __main__ guard.