Gensim: New binary corpus format

Created on 7 Nov 2017 · 23 comments · Source: RaRe-Technologies/gensim

Based on https://github.com/RaRe-Technologies/gensim/pull/1679#issuecomment-341686324

New corpus in binary format:

Should be significantly faster at reading/writing than the classic MmCorpus; benchmark results are very important in this case.

Labels: difficulty easy, feature

All 23 comments

Let me add that a crucial part of this task is measuring and optimizing the read/write performance of this new corpus serialization format. Its main purpose is to be efficient and fast (otherwise we might as well continue using MmCorpus, which is a standard, well-understood and well-supported format).

Another idea: can we stream directly from/to the three CSC arrays? That would be the easiest solution, and also the most flexible (CSC/CSR are standard sparse matrix formats).

@piskvorky do you mean the scipy sparse implementation of matrices, or maybe a "custom" solution?

Nothing custom, I mean standard CSC/CSR (=three dense arrays). These three dense arrays would be stored to disk streamed (save), and later read streamed (load). See also matutils.corpus2csc.
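
For illustration, a minimal sketch of that "three dense arrays" view, using the existing matutils.corpus2csc helper (this is just scipy's standard CSC, not a proposed on-disk layout):

```python
from gensim import matutils

# a tiny bag-of-words corpus: each document is a list of (term_id, value) pairs
corpus = [[(0, 1.0), (2, 3.0)], [(1, 2.0)], [(0, 1.0), (1, 1.0), (2, 1.0)]]

csc = matutils.corpus2csc(corpus, num_terms=3)  # scipy.sparse.csc_matrix, shape (terms, docs)

# the whole matrix is just these three dense arrays, which could be written
# to disk one after another (save) and later read back / memory-mapped (load)
print(csc.data)     # all non-zero values, column by column
print(csc.indices)  # the term id (row) of each value
print(csc.indptr)   # where each document (column) starts within data/indices
```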

If no one else is working on this issue, please assign it to me.

@abhijeetchauhan if you are interested - feel free to submit a PR.

I have to code this in Cython, right?

@abhijeetchauhan Cython/C/C++, pick any.
Potentially this can be done in pure Python, but it's not easy to make it fast enough.

@abhijeetchauhan if it helps and you are still interested, I have some code for exploring a few different options here: https://github.com/arlenk/gensim/tree/binary_corpus

I tested out the following:

  • the python standard library's "struct" module.
    Pros: no Cython code, the struct module is supported and well tested
    Cons: binary file is not portable (hardware-specific endianness, packing, etc.)
  • structs using Cython and fread
    Pros: faster
    Cons: binary file is not portable, manual memory management, only works with local, non-zipped files
  • structs using Cython and fread, but returning the results in a Cython iterator class instead of a list of tuples
    Pros: same as above
    Cons: same as above, plus the result is an iterator, so any code that depends on a list will break (though such code should probably work with an iterator anyway)
  • a "read only" version of structs using Cython. This version reads the data but does not return any python objects. It's there to determine the upper bound on speed using Cython and fread/structs
  • python's array module (a minimal sketch follows this list)
    Pros: array module is already part of the standard library, no Cython code
    Cons: binary file is still not portable
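
A hedged, minimal sketch of the array-module option above (illustrative layout only, not the exact code in the linked branch): one document's term ids and values are dumped as raw machine-native binary and read back without any string parsing.

```python
from array import array

doc = [(0, 1.0), (5, 2.5), (7, 0.5)]  # (termid, value) pairs for one document

with open("doc.bin", "wb") as f:
    array("i", [len(doc)]).tofile(f)                     # number of pairs
    array("i", [termid for termid, _ in doc]).tofile(f)  # term ids
    array("f", [value for _, value in doc]).tofile(f)    # values (float32)

with open("doc.bin", "rb") as f:
    n = array("i"); n.fromfile(f, 1)
    termids = array("i"); termids.fromfile(f, n[0])
    values = array("f"); values.fromfile(f, n[0])
    print(list(zip(termids, values)))

# the "not portable" con: 'i' and 'f' use the machine's native size and endianness
```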

There are a couple of scripts in the scripts directory for benchmarking the different options. You can run them like:

python gensim/scripts/create_corpus.py text8 .
python gensim/scripts/benchmark_corpus.py text8 .

python gensim/scripts/create_corpus.py nytimes .
python gensim/scripts/benchmark_corpus.py nytimes .

(note the nytimes corpus will take up almost 2 gigs of disk space total)

The execution times on my machine are:

| | num_docs | num_terms | corpus class | time (seconds) |
|---------|:--------:|:---------:|:--------------------------------:|:--------------:|
| text8 | 2k | 250k | gensim 3.3 | 63 |
| | | | gensim 3.4 | 6.3 |
| | | | python structs | 12.0 |
| | | | cython structs | 3.4 |
| | | | cython structs (custom iterator) | 2.9 |
| | | | cython structs (read only) | 0.1 |
| | | | python array | 3.4 |
| nytimes | | | gensim 3.3 | 1078 |
| | | | gensim 3.4 | 124 |
| | | | python structs | 212 |
| | | | cython structs | 67 |
| | | | cython structs (custom iterator) | 54 |
| | | | cython structs (read only) | 2.4 |
| | | | python array | 62.4 |

I included a comparison of the "old" (gensim 3.3) MmReader code as well, just to highlight that the code in version 3.4 is already almost an order of magnitude faster, so the extra code complexity and lack of support for non-local (or compressed) files may not be worth it unless corpus processing time is really a bottleneck.

From prior profiling, corpus IO consumes about 10-15% of the time for fitting an LDAModel, though I am not as familiar with other models in gensim.

Does this follow the standard CSC format mentioned above?

Now that we have MmCorpus optimized, I'm -1 on introducing a new custom format, unless it adds additional capabilities, such as mimicking the standard CSC layout using streaming and the large dense arrays being directly mmap-able.

@piskvorky as I understand it - no

I'm +1 for this change because it can give us a significant performance boost, ~2-4x (even in comparison with the optimized MmCorpus)

I see no serious difference between a custom format & mmap of the 3 dense numpy arrays, but it's a good idea to add this variant to the benchmark.

UPD: I'm also worried about Windows - does mmap work fine on Windows?

Yes.

Advantages of using a standard format include easier interoperability, ability to use standard read/write tools, wider applicability and numpy mmap.

I don't see much advantage in creating and maintaining a new custom format for only a 2-4x speed-up (of an already fast reader). The worrying part is the writer, which currently cannot write compressed files with MmWriter. That's because it needs to seek back to the beginning to write the header after completing a full corpus pass. This is very inconvenient: you have to choose to either do two corpus passes (slow), or give up on streamed compression (big file size). See also #21.
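
To make the seek-back problem concrete, here is a simplified sketch of the pattern (not MmWriter's actual code): the Matrix Market header needs num_docs/num_terms/num_nnz, which are only known after a full pass, so the writer leaves a placeholder and later seeks back to patch it - something a gzip/bz2 output stream cannot do.

```python
def write_mm(fname, corpus, num_terms):
    with open(fname, "w") as f:
        f.write("%%MatrixMarket matrix coordinate real general\n")
        header_pos = f.tell()
        f.write(" " * 50 + "\n")  # placeholder: the stats are unknown until the end
        num_docs = num_nnz = 0
        for docno, doc in enumerate(corpus, start=1):  # single streamed pass
            num_docs = docno
            for termid, value in doc:
                f.write("%i %i %s\n" % (docno, termid + 1, value))
                num_nnz += 1
        f.seek(header_pos)  # impossible on a compressed stream -> two passes or no compression
        f.write("%i %i %i" % (num_docs, num_terms, num_nnz))
```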

We want all implementations to be streamed (iterables/generators), of course.

@piskvorky do you know how to write an iterable to disk as a numpy array in a "streaming manner"? I've never heard of that.

I'm +1 for the standard format, but only if this works with approximately the same performance as custom solutions.

For me, the priorities are read (speed) > write (speed) > file size. I don't think compression is so important now (it's 2018, HDDs are really cheap, SSDs are cheap too; the 34% file-size saving mentioned in #21 is not worth it IMO). Although sometimes compression improves performance (when the disk is slow).

> Does this follow the standard CSC format mentioned above?

It does not use scipy.sparse.save_npz and load_npz, as I could not figure out an easy way to stream with those. Under the hood, it looks like scipy just saves a sparse matrix as three numpy arrays (one with the data, and two with the index information locating the non-zero values).

The formats I used here are similar. For each document, we just store the docid (int), the number of elements (int), and then pairs of (termid (int), value (float)). All the data is stored as raw binary data, so there's no parsing of strings like the current mmreader. The final file size is actually about 30% smaller than the scipy sparse files, though that's probably because I am using floats instead of doubles.
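
A minimal sketch of that per-document record (an assumed, simplified layout, not the exact code from the branch): docid, number of entries, then raw (termid, value) pairs, so reading back involves no string parsing.

```python
import struct

def write_doc(f, docid, doc):
    f.write(struct.pack("ii", docid, len(doc)))    # docid, number of entries
    for termid, value in doc:
        f.write(struct.pack("if", termid, value))  # raw (int, float32) pair

def read_doc(f):
    docid, n = struct.unpack("ii", f.read(8))
    return docid, [struct.unpack("if", f.read(8)) for _ in range(n)]

with open("corpus.bin", "wb") as f:
    write_doc(f, 0, [(1, 0.5), (7, 2.0)])

with open("corpus.bin", "rb") as f:
    print(read_doc(f))  # (0, [(1, 0.5), (7, 2.0)])

# native "ii"/"if" packing is what makes the file hardware-specific (the con listed earlier)
```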

I can take another look at the scipy save_npz/load_npz functions, but based on the "read only" results above, I don't think the file I/O is the bottleneck. Rather, most of the time is spent in creating the return objects (tuples) for python. I may see if there's a way to use a memoryview to avoid creating all the tuples.
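
One possible direction for that memoryview idea (an assumption on my part, not code from the branch): view the raw (termid, value) bytes of a document through a structured numpy dtype, so no per-pair Python tuples need to be built at all.

```python
import numpy as np

pair_dtype = np.dtype([("termid", "<i4"), ("value", "<f4")])

# stand-in for the raw bytes of one document read from disk
raw = np.array([(1, 0.5), (7, 2.0)], dtype=pair_dtype).tobytes()

pairs = np.frombuffer(raw, dtype=pair_dtype)  # zero-copy view, no tuples created
print(pairs["termid"], pairs["value"])
```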

> For me, the priorities are read (speed) > write (speed) > file size. I don't think compression is so important now (it's 2018, HDDs are really cheap, SSDs are cheap too; the 34% file-size saving mentioned in #21 is not worth it IMO). Although sometimes compression improves performance (when the disk is slow).

I ran an updated benchmark of fitting an LDA model with the nytimes corpus (300k documents, so large but not gigantic) and reading the corpus only takes 3% of the total time. Do you know if there are other use cases I can try where the corpus is the bottleneck?

@arlenk you can try other topic models (like AuthorTopicModel or LdaSeqModel), but I don't think the results will be very different.

ping @persiyanov, please describe your results here too.

I don't think we can use save_npz/load_npz, as these do not support mmap.

AFAIK, the numpy format (which can be memory-mapped) is more or less raw bytes, with a small header at the beginning.
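
For example (a small illustration, not tied to any proposed format): a .npy file is a short header followed by the raw array bytes, and numpy can map it back lazily.

```python
import numpy as np

data = np.arange(1_000_000, dtype=np.float32)
np.save("data.npy", data)                    # header + raw bytes

mapped = np.load("data.npy", mmap_mode="r")  # returns a numpy.memmap, nothing read eagerly
print(mapped[123456])                        # pages are faulted in on first access
```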

@arlenk @menshikh-iv

@arlenk thanks for the good benchmark. I used it to experiment a bit.

I rewrote the python array version in Cython. It gave a negligible boost and isn't worth attention. I also compared this Cython array version with your Cython struct version on reading the corpus only (ran 5 times, then averaged and calculated the std):

benchmark results (nytimes corpus)
--------------------------------------------------------------------------------

cython struct (read only)               : 1.13 +- 0.31 seconds
cython array (read only)                : 3.48 +- 0.35 seconds

I agree that the _corpus reading phase_ is actually quite fast. So we need to optimize the time spent on the _returned objects_.

Also, I think it's worth trying np.memmap on a stored CSC/CSR matrix.

@arlenk @menshikh-iv @piskvorky

I've tested a version based on np.memmap and scipy.sparse.csc_matrix. Here is the implementation: https://gist.github.com/persiyanov/5aed5165d7945c176a0f557a473ef848. I believe the save_corpus part could be optimized, but saving time is not very important right now. The main idea is to read the data, indices, indptr dense arrays from disk and iterate over their memmapped versions (obtained with np.memmap).
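
Roughly, the idea looks like this (a simplified sketch of the approach, not the gist's exact code; the file names and dtypes here are assumptions):

```python
import numpy as np

class MemmapCorpus:
    """Iterate over documents stored as raw CSC arrays on disk."""

    def __init__(self, prefix):
        # the three CSC arrays, exposed lazily via np.memmap
        self.data = np.memmap(prefix + ".data.bin", dtype=np.float32, mode="r")
        self.indices = np.memmap(prefix + ".indices.bin", dtype=np.int32, mode="r")
        self.indptr = np.memmap(prefix + ".indptr.bin", dtype=np.int64, mode="r")

    def __iter__(self):
        for docno in range(len(self.indptr) - 1):
            start, end = self.indptr[docno], self.indptr[docno + 1]
            # building these per-document lists of Python tuples is where most time goes
            yield list(zip(self.indices[start:end], self.data[start:end]))
```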

The results are not promising, but maybe I haven't optimized my code well. Please take a look at the gist; maybe I missed something or made some stupid mistakes. I used @arlenk's benchmarking code and compared my python memmap version with the python struct, python array and current gensim 3.4 MmReader versions.

Results on _text8_ corpus:

benchmark results
--------------------------------------------------------------------------------

gensim 3.4 MmReader                     : 5.89 +- 0.76 seconds
python memmap                           : 26.43 +- 1.33 seconds
python struct                           : 9.11 +- 1.14 seconds
python array                            : 1.35 +- 0.07 seconds

Results on _nytimes_ corpus:

benchmark results
--------------------------------------------------------------------------------

gensim 3.4 MmReader                     : 118.70 +- 7.03 seconds
python memmap                           : 427.20 +- 23.09 seconds
python struct                           : 132.17 +- 17.28 seconds
python array                            : 27.01 +- 1.12 seconds

p.s. Tested on my MacBook Pro 2015, Core i5 2.6 GHz, 8 GB RAM

@persiyanov thank you, looks significantly worse :(

The most interesting part is __iter__ (as always): https://gist.github.com/persiyanov/5aed5165d7945c176a0f557a473ef848#file-python_memmap-py-L129

@arlenk do you have any idea why this solution is so slow?

> Results on nytimes corpus:
>
> benchmark results
>
> gensim 3.4 MmReader : 118.70 +- 7.03 seconds
> python memmap : 427.20 +- 23.09 seconds
> python struct : 132.17 +- 17.28 seconds
> python array : 27.01 +- 1.12 seconds

@persiyanov, any chance you have the benchmark results for the "cython struct" version (not the read-only one)? Your results for python array already look much better than mine -- I wonder if that's because I'm running my tests on older hardware (a 2010 iMac with a non-SSD hard drive).

> @persiyanov thank you, looks significantly worse :(
>
> The most interesting part is __iter__ (as always): https://gist.github.com/persiyanov/5aed5165d7945c176a0f557a473ef848#file-python_memmap-py-L129
>
> @arlenk do you have any idea why this solution is so slow?

I'll try taking a look. I'm guessing there is still quite a bit of overhead in numpy mmap.

There is no overhead in mmap -- that's the whole point. The disk memory gets mapped into virtual memory instantly, and loaded into physical RAM by the OS when a page fault occurs (~ on first access).

Mmap bypasses OS file I/O (no buffering etc), which may be related to the performance.
