fakeDataset = downloader.load('fake-news')
fails with the above error on Windows machines running 64-bit Python 3.7 with gensim 3.8.1.
To reproduce, run
fakeDataset = downloader.load('fake-news')
on a machine with the above configuration.
Windows-10-10.0.18362
Python 3.7.6
NumPy 1.18.1
SciPy 1.4.1
gensim 3.8.1
FAST_VERSION 0
I zipped the data directory from a Linux machine and gave it to a student to unzip on their Windows machine. Re-executing the code above failed with the same error, suggesting the problem is not in downloading but in loading the downloaded data. Perhaps there is a bug in unzipping the archive with Python?
It looks like the problem is at least in __init__.py in the fake-news data folder.
Specifically, on this line:
csv.field_size_limit(sys.maxsize)
This is where we get OverflowError: Python int too large to convert to C long
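The overflow is easy to demonstrate: on a 64-bit build, sys.maxsize is 2**63 - 1, but the csv module stores the limit in a C long, which is only 32 bits on Windows, so anything above 2**31 - 1 overflows there. A minimal sketch (on Linux/macOS the large value is accepted, which is why the original line works on those platforms):

```python
import csv
import sys

# On a 64-bit build sys.maxsize is 2**63 - 1, far beyond a 32-bit C long.
print(sys.maxsize)   # 9223372036854775807 on 64-bit builds
print(2**31 - 1)     # 2147483647, the largest C long on Windows

# csv.field_size_limit(new) returns the previous limit (131072 by default).
# On Windows, passing sys.maxsize instead raises:
#   OverflowError: Python int too large to convert to C long
old = csv.field_size_limit(2**31 - 1)
print(old)                     # 131072 in a fresh interpreter
print(csv.field_size_limit())  # 2147483647
```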
Attempting a dirty fix based on a StackOverflow answer:
# csv.field_size_limit(sys.maxsize)
maxInt = sys.maxsize
while True:
    try:
        csv.field_size_limit(maxInt)
        break
    except OverflowError:
        maxInt = int(maxInt / 10)
Waiting to hear from the student whether it worked.
Had to set up a Windows env to test this. The above approach did not work (hangs).
The code below works; do you want me to do a PR?
import os
import csv
import sys

from smart_open import smart_open
from gensim.downloader import base_dir

# csv.field_size_limit(sys.maxsize)  # original line: overflows on Windows
csv.field_size_limit(2147483647)  # 2**31 - 1, the largest C long on Windows

class Dataset(object):
    def __init__(self, fn):
        self.fn = fn

    def __iter__(self):
        with smart_open(self.fn, 'rb') as infile:
            if sys.version_info[0] == 2:
                reader = csv.DictReader(infile, delimiter=",", quotechar='"')
            else:
                reader = csv.DictReader(
                    (line.decode("utf-8") for line in infile),
                    delimiter=",", quotechar='"',
                )
            for row in reader:
                yield dict(row)

def load_data():
    path = os.path.join(base_dir, 'fake-news', 'fake-news.gz')
    return Dataset(path)
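The decode-then-DictReader pattern in __iter__ can be exercised without the dataset itself; a minimal sketch with an in-memory byte stream standing in for the smart_open(fn, 'rb') file handle:

```python
import csv
import io

# Stand-in for the downloaded file: binary lines, as smart_open(fn, 'rb') yields.
raw = io.BytesIO(b'id,title\n1,"Example, with a comma"\n')

# Same Python 3 branch as Dataset.__iter__: decode each line, then parse.
reader = csv.DictReader((line.decode("utf-8") for line in raw),
                        delimiter=",", quotechar='"')
rows = [dict(row) for row in reader]
print(rows)  # [{'id': '1', 'title': 'Example, with a comma'}]
```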
If you want to test it, here is my conda environment file; save it as environment.yml and run conda env create -f environment.yml:
name: iis4011
channels:
- pytorch
- conda-forge
- defaults
dependencies:
- attrs=19.3.0=py_0
- backcall=0.1.0=py_0
- blas=1.0=mkl
- bleach=3.1.4=pyh9f0ad1d_0
- ca-certificates=2020.1.1=0
- certifi=2019.11.28=py37_1
- cffi=1.14.0=py37ha419a9e_0
- chardet=3.0.4=py37hc8dfbb8_1006
- colorama=0.4.3=py_0
- cpuonly=1.0=0
- cryptography=2.8=py37hb32ad35_1
- decorator=4.4.2=py_0
- defusedxml=0.6.0=py_0
- entrypoints=0.3=py37hc8dfbb8_1001
- freetype=2.9.1=ha9979f8_1
- icc_rt=2019.0.0=h0cc432a_1
- idna=2.9=py_1
- importlib-metadata=1.6.0=py37hc8dfbb8_0
- importlib_metadata=1.6.0=0
- intel-openmp=2020.0=166
- ipykernel=5.2.0=py37h5ca1d4c_1
- ipython=7.13.0=py37hc8dfbb8_2
- ipython_genutils=0.2.0=py_1
- jedi=0.15.2=py37_0
- jinja2=2.11.1=py_0
- jpeg=9b=hb83a4c4_2
- json5=0.9.0=py_0
- jsonschema=3.2.0=py37hc8dfbb8_1
- jupyter_client=6.1.2=py_0
- jupyter_core=4.6.3=py37hc8dfbb8_1
- jupyterlab=1.2.6=pyhf63ae98_0
- jupyterlab_server=1.1.0=py_0
- libpng=1.6.37=h2a8f88b_0
- libsodium=1.0.17=h2fa13f4_0
- libtiff=4.1.0=h56a325e_0
- m2w64-gcc-libgfortran=5.3.0=6
- m2w64-gcc-libs=5.3.0=7
- m2w64-gcc-libs-core=5.3.0=7
- m2w64-gmp=6.1.0=2
- m2w64-libwinpthread-git=5.0.0.4634.697f757=2
- markupsafe=1.1.1=py37h8055547_1
- mistune=0.8.4=py37hfa6e2cd_1000
- mkl=2020.0=166
- mkl-service=2.3.0=py37hb782905_0
- mkl_fft=1.0.15=py37h14836fe_0
- mkl_random=1.1.0=py37h675688f_0
- msys2-conda-epoch=20160418=1
- nbconvert=5.6.1=py37_0
- nbformat=5.0.4=py_0
- ninja=1.9.0=py37h74a9793_0
- nodejs=13.10.1=0
- notebook=6.0.3=py37_0
- numpy-base=1.18.1=py37hc3f5095_1
- olefile=0.46=py37_0
- openssl=1.1.1f=he774522_0
- pandoc=2.9.2=0
- pandocfilters=1.4.2=py_1
- parso=0.6.2=py_0
- pickleshare=0.7.5=py37hc8dfbb8_1001
- pillow=7.0.0=py37hcc1f983_0
- pip=20.0.2=py37_1
- prometheus_client=0.7.1=py_0
- prompt-toolkit=3.0.5=py_0
- ptvsd=4.3.2=py37hfa6e2cd_1
- pycparser=2.20=py_0
- pygments=2.6.1=py_0
- pyopenssl=19.1.0=py_1
- pyrsistent=0.16.0=py37h8055547_0
- pysocks=1.7.1=py37hc8dfbb8_1
- python=3.7.7=h60c2a47_0_cpython
- python-dateutil=2.8.1=py_0
- python_abi=3.7=1_cp37m
- pytorch=1.4.0=py3.7_cpu_0
- pywin32=227=py37hfa6e2cd_0
- pywinpty=0.5.7=py37_0
- pyzmq=19.0.0=py37h8c16cda_1
- requests=2.23.0=pyh8c360ce_2
- send2trash=1.5.0=py_0
- setuptools=46.1.3=py37_0
- six=1.14.0=py_1
- sqlite=3.31.1=he774522_0
- terminado=0.8.3=py37hc8dfbb8_1
- testpath=0.4.4=py_0
- tk=8.6.8=hfa6e2cd_0
- torchvision=0.5.0=py37_cpu
- tornado=6.0.4=py37hfa6e2cd_0
- traitlets=4.3.3=py37hc8dfbb8_1
- urllib3=1.25.7=py37hc8dfbb8_1
- vc=14.1=h0510ff6_4
- vs2015_runtime=14.16.27012=hf0eaf9b_1
- wcwidth=0.1.9=pyh9f0ad1d_0
- webencodings=0.5.1=py_1
- wheel=0.34.2=py37_0
- win_inet_pton=1.1.0=py37_0
- wincertstore=0.2=py37_0
- winpty=0.4.3=4
- xeus=0.23.10=h1ad3211_0
- xeus-python=0.6.13=py37h5b9e2c8_1
- xz=5.2.4=h2fa13f4_4
- zeromq=4.3.2=h6538335_2
- zipp=3.1.0=py_0
- zlib=1.2.11=h62dcd97_3
- zstd=1.3.7=h508b16e_0
- pip:
- atomicwrites==1.3.0
- blis==0.4.1
- boto3==1.12.35
- botocore==1.15.35
- cachetools==4.0.0
- catalogue==1.0.0
- cycler==0.10.0
- cymem==2.0.3
- docutils==0.15.2
- funcy==1.14
- future==0.18.2
- gensim==3.8.1
- google-api-core==1.16.0
- google-auth==1.13.1
- google-cloud-core==1.3.0
- google-cloud-storage==1.27.0
- google-resumable-media==0.5.0
- googleapis-common-protos==1.51.0
- jmespath==0.9.5
- joblib==0.14.1
- kiwisolver==1.2.0
- matplotlib==3.2.1
- more-itertools==8.2.0
- murmurhash==1.0.2
- nltk==3.4.5
- numexpr==2.7.1
- numpy==1.18.2
- packaging==20.3
- pandas==1.0.3
- plac==1.1.3
- pluggy==0.13.1
- preshed==3.0.2
- protobuf==3.11.3
- py==1.8.1
- pyasn1==0.4.8
- pyasn1-modules==0.2.8
- pyldavis==2.1.2
- pyparsing==2.4.6
- pytest==5.4.1
- pytz==2019.3
- rsa==4.0
- s3transfer==0.3.3
- scikit-learn==0.22.2.post1
- scipy==1.4.1
- smart-open==1.10.0
- spacy==2.2.4
- srsly==1.0.2
- thinc==7.4.0
- tqdm==4.45.0
- wasabi==0.6.0
prefix: C:\Users\Andrew Olney\.conda\envs\iis4011
It'd help to include the full error stack for any error, to better identify the files and exact lines involved.
I can't find any fake-news directory in this project's source control (nor in gensim-data) – is the problem code actually in some other project/source-tree?
This seems to be one of the not-version-controlled, hard-to-browse, hard-to-review active code files that's dynamically downloaded & run in a manner I consider highly unwise (per #2283). How would someone contribute a PR against such a file?
@gojomo The only way I know to get these files is to use the downloader successfully. When that happens, it places a folder hierarchy in the user's home directory, under gensim-data (screenshot below). The file in question is __init__.py. The code I pasted above represents the entire contents of that file, with the commented line being the original line where the error was thrown, and the line below it being the new line that fixes the problem. Just to be completely clear, I'm referring to
#csv.field_size_limit(sys.maxsize)
csv.field_size_limit(2147483647)
I can send a full stack trace if needed, but the problem is that sys.maxsize exceeds the maximum C long on Windows builds of Python. I'm sure there are cleaner ways to solve the problem than what I have above, but it works.

The __init__.py file is here:
https://github.com/RaRe-Technologies/gensim-data/releases/tag/fake-news
CC @chaitaliSaini @menshikh-iv, the authors of this code. I see no comment in the code, so I'm not sure what that csv.field_size_limit(sys.maxsize) is about – why is it there?
@gojomo as discussed ad nauseum, the dataset releases are immutable by design and you cannot open a PR against them. You can release a new, updated version.
On Linux at least, the default field size limit appears to be 131072:
import csv
print(csv.field_size_limit())
My guess is that some of the documents in fake-news are longer than that, which is why the author raised the limit in the first place. However, I haven't experimented with different sizes; the value I gave above works, but it hasn't been tuned for memory, if that's a concern.
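That guess is easy to confirm in principle: when a field exceeds the limit, the reader raises csv.Error rather than silently truncating, so a too-small limit would break iteration over long articles. A sketch with an artificially low limit:

```python
import csv
import io

old = csv.field_size_limit(16)  # set a tiny limit; returns the previous value
try:
    list(csv.reader(io.StringIO("id,text\n1," + "x" * 100)))
except csv.Error as e:
    print("raised:", e)  # field larger than field limit (16)
finally:
    csv.field_size_limit(old)  # restore the previous limit
```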
@gojomo as discussed ad nauseum, the dataset releases are immutable by design and you cannot open a PR against them. You can release a new, updated version.
But this isn't a problem in a dataset, it's a problem in executed source code – which could and should be in version control. Why should we be running code in project users' installations that isn't under version-control, hasn't been reviewed, & can't receive fix PRs (from either users or contributors)?
For reproducibility reasons. The same dataset should result in the same output, forever, bugs included. And that implies the same code too. That was the design anyway.
In this particular case, changing csv.field_size_limit() may result in changed results (I assume, haven't checked).
I agree that it was confusing to me to trace this problem back to code that was packaged with the data and not under version control.
That said, I would suggest opening another issue specifically for that suggested redesign. A separate issue would preserve the discussion and also invite comments from other users/developers.
I'm definitely open to that. It will need a strong open source contributor though – to make sure the redesign is an actual improvement :)
For reproducibility reasons. The same dataset should result in the same output, forever, bugs included. And that implies the same code too. That was the design anyway.
In this particular case, changing csv.field_size_limit() may result in changed results (I assume, haven't checked).
Exactly - if you want reproducibility, you'd want more things under version control, not less, so you can see what was delivered at any one time – ideally as part of a named version!
Here, if someone changed the __init__.py 'asset' served by https://github.com/RaRe-Technologies/gensim-data/releases/download/fake-news/__init__.py – who could notice that change? If for 2 hours, or 2 days, it was changed to a malicious file, then changed back to something innocent, who would notice? Are you getting notifications of every asset change there? Is there a persistent log somewhere? I can't find one, but I'd feel somewhat better if one existed, so that you and I could know what code someone who ran downloader.load('fake-news') on April 1 actually executed, compared to some other date. Right now I don't see how that's possible – reproducibility is neither tracked nor achievable under the current practice.
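One mitigation that doesn't require a redesign: pin a digest of each downloaded asset in version control and verify it before executing. A minimal sketch (the usage path and PINNED_DIGEST are hypothetical, not part of gensim's actual downloader):

```python
import hashlib

def sha256_of(path):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(8192), b""):
            h.update(chunk)
    return h.hexdigest()

# Hypothetical usage: compare against a digest pinned in version control
# before importing/executing the downloaded __init__.py:
#   assert sha256_of(init_py_path) == PINNED_DIGEST
```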