Gensim: New binary corpus format

Created on 7 Nov 2017 · 23 comments · Source: RaRe-Technologies/gensim

Based on https://github.com/RaRe-Technologies/gensim/pull/1679#issuecomment-341686324

New corpus in binary format:

Should be significantly faster at reading/writing than the classic MmCorpus; benchmark results are very important in this case.

Labels: difficulty easy, feature

All 23 comments

Let me add that a crucial part of this task is measuring and optimizing the read/write performance of this new corpus serialization format. Its main purpose is to be efficient and fast (otherwise we might as well continue using MmCorpus, which is a standard, well-understood and well-supported format).

Another idea: can we stream directly from/to the three CSC arrays? That would be the easiest solution, and also the most flexible (CSC/CSR are standard sparse matrix formats).

@piskvorky do you mean the scipy sparse implementation of matrices, or maybe a "custom" solution?

Nothing custom, I mean standard CSC/CSR (=three dense arrays). These three dense arrays would be stored to disk streamed (save), and later read streamed (load). See also matutils.corpus2csc.
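
For illustration, a minimal sketch of that "three dense arrays" view, using the existing matutils.corpus2csc helper (this is just scipy's standard CSC, not a proposed on-disk layout):

```python
from gensim import matutils

# a tiny bag-of-words corpus: each document is a list of (term_id, value) pairs
corpus = [[(0, 1.0), (2, 3.0)], [(1, 2.0)], [(0, 1.0), (1, 1.0), (2, 1.0)]]

csc = matutils.corpus2csc(corpus, num_terms=3)  # scipy.sparse.csc_matrix, shape (terms, docs)

# the whole matrix is just these three dense arrays, which could be written
# to disk one after another (save) and later read back / memory-mapped (load)
print(csc.data)     # all non-zero values, column by column
print(csc.indices)  # the term id (row) of each value
print(csc.indptr)   # where each document (column) starts within data/indices
```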

If no one else is working on this issue, please assign it to me.

@abhijeetchauhan if you are interested - feel free to submit a PR.

I have to code this in Cython, right?

@abhijeetchauhan Cython/C/C++, pick any.
Potentially this can be done in pure Python, but it's not easy to make it fast enough.

@abhijeetchauhan if it helps and you are still interested, I have some code for exploring a few different options here: https://github.com/arlenk/gensim/tree/binary_corpus

I tested out the following:

  • the python standard library's "struct" module.
    Pros: no Cython code, the struct module is supported and well tested
    Cons: binary file is not portable (hardware-specific endianness, packing, etc.)
  • structs using Cython and fread
    Pros: faster
    Cons: binary file is not portable, manual memory management, only works with local, non-zipped files
  • structs using Cython and fread, but returning the results in a Cython iterator class instead of a list of tuples
    Pros: same as above
    Cons: same as above, plus the result is an iterator, so any code that depends on a list will break (though such code should probably work with an iterator anyway)
  • a "read only" version of structs using Cython. This version reads the data but does not return any python objects. It's there to determine the upper bound on speed using Cython and fread/structs
  • python's array module (a minimal sketch follows this list)
    Pros: array module is already part of the standard library, no Cython code
    Cons: binary file is still not portable
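
A hedged, minimal sketch of the array-module option above (illustrative layout only, not the exact code in the linked branch): one document's term ids and values are dumped as raw machine-native binary and read back without any string parsing.

```python
from array import array

doc = [(0, 1.0), (5, 2.5), (7, 0.5)]  # (termid, value) pairs for one document

with open("doc.bin", "wb") as f:
    array("i", [len(doc)]).tofile(f)                     # number of pairs
    array("i", [termid for termid, _ in doc]).tofile(f)  # term ids
    array("f", [value for _, value in doc]).tofile(f)    # values (float32)

with open("doc.bin", "rb") as f:
    n = array("i"); n.fromfile(f, 1)
    termids = array("i"); termids.fromfile(f, n[0])
    values = array("f"); values.fromfile(f, n[0])
    print(list(zip(termids, values)))

# the "not portable" con: 'i' and 'f' use the machine's native size and endianness
```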

There are a couple of scripts in the scripts directory for benchmarking the different options. You can run them like:

python gensim/scripts/create_corpus.py text8 .
python gensim/scripts/benchmark_corpus.py text8 .

python gensim/scripts/create_corpus.py nytimes .
python gensim/scripts/benchmark_corpus.py nytimes .

(note the nytimes corpus will take up almost 2 gigs of disk space total)

The execution times on my machine are:

| | num_docs | num_terms | corpus class | time (seconds) |
|---------|:--------:|:---------:|:--------------------------------:|:--------------:|
| text8 | 2k | 250k | gensim 3.3 | 63 |
| | | | gensim 3.4 | 6.3 |
| | | | python structs | 12.0 |
| | | | cython structs | 3.4 |
| | | | cython structs (custom iterator) | 2.9 |
| | | | cython structs (read only) | 0.1 |
| | | | python array | 3.4 |
| nytimes | | | gensim 3.3 | 1078 |
| | | | gensim 3.4 | 124 |
| | | | python structs | 212 |
| | | | cython structs | 67 |
| | | | cython structs (custom iterator) | 54 |
| | | | cython structs (read only) | 2.4 |
| | | | python array | 62.4 |

I included a comparison of the "old" (gensim 3.3) MmReader code as well, just to highlight that the code in version 3.4 is already almost an order of magnitude faster, so the extra code complexity and lack of support for non-local (or compressed) files may not be worth it unless corpus processing time is really a bottleneck.

From prior profiling, corpus IO consumes about 10-15% of the time for fitting an LDAModel, though I am not as familiar with other models in gensim.

Does this follow the standard CSC format mentioned above?

Now that we have MmCorpus optimized, I'm -1 on introducing a new custom format, unless it adds additional capabilities, such as mimicking the standard CSC layout using streaming and the large dense arrays being directly mmap-able.

@piskvorky as I understand it - no

I'm +1 for this change because it can give us a significant performance boost, ~2-4x (even in comparison with the optimized MmCorpus)

I see no serious difference between a custom format & mmap of the 3 dense numpy arrays, but it's a good idea to add this variant to the benchmark.

UPD: I'm also worried about Windows - does mmap work fine on Windows?

Yes.

Advantages of using a standard format include easier interoperability, ability to use standard read/write tools, wider applicability and numpy mmap.

I don't see much advantage in creating and maintaining a new custom format for only a 2-4x speed-up (of an already fast reader). The worrying part is the writer, which currently cannot write compressed files with MmWriter. That's because it needs to seek back to the beginning to write the header after completing a full corpus pass. This is very inconvenient: you have to choose to either do two corpus passes (slow), or give up on streamed compression (big file size). See also #21.
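
To make the seek-back problem concrete, here is a simplified sketch of the pattern (not MmWriter's actual code): the Matrix Market header needs num_docs/num_terms/num_nnz, which are only known after a full pass, so the writer leaves a placeholder and later seeks back to patch it - something a gzip/bz2 output stream cannot do.

```python
def write_mm(fname, corpus, num_terms):
    with open(fname, "w") as f:
        f.write("%%MatrixMarket matrix coordinate real general\n")
        header_pos = f.tell()
        f.write(" " * 50 + "\n")  # placeholder: the stats are unknown until the end
        num_docs = num_nnz = 0
        for docno, doc in enumerate(corpus, start=1):  # single streamed pass
            num_docs = docno
            for termid, value in doc:
                f.write("%i %i %s\n" % (docno, termid + 1, value))
                num_nnz += 1
        f.seek(header_pos)  # impossible on a compressed stream -> two passes or no compression
        f.write("%i %i %i" % (num_docs, num_terms, num_nnz))
```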

We want all implementations to be streamed (iterables/generators), of course.

@piskvorky do you know how to write an iterable to disk as a numpy array in a "streaming manner"? I've never heard of that.

I'm +1 for the standard format, but only if this works with approximately the same performance as custom solutions.

For me, the priorities are read (speed) > write (speed) > file size. I don't think compression is so important now (it's 2018, HDDs are really cheap, SSDs are cheap too; the 34% file-size saving mentioned in #21 is not worth it IMO). Although sometimes compression improves performance (when the disk is slow).

> Does this follow the standard CSC format mentioned above?

It does not use scipy.sparse.save_npz and load_npz, as I could not figure out an easy way to stream with those. Under the hood, it looks like scipy just saves a sparse matrix as three numpy arrays (one with the data, and two with the index information locating the non-zero values).

The formats I used here are similar. For each document, we just store the docid (int), the number of elements (int), and then pairs of (termid (int), value (float)). All the data is stored as raw binary data, so there's no parsing of strings like the current mmreader. The final file size is actually about 30% smaller than the scipy sparse files, though that's probably because I am using floats instead of doubles.
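
A minimal sketch of that per-document record (an assumed, simplified layout, not the exact code from the branch): docid, number of entries, then raw (termid, value) pairs, so reading back involves no string parsing.

```python
import struct

def write_doc(f, docid, doc):
    f.write(struct.pack("ii", docid, len(doc)))    # docid, number of entries
    for termid, value in doc:
        f.write(struct.pack("if", termid, value))  # raw (int, float32) pair

def read_doc(f):
    docid, n = struct.unpack("ii", f.read(8))
    return docid, [struct.unpack("if", f.read(8)) for _ in range(n)]

with open("corpus.bin", "wb") as f:
    write_doc(f, 0, [(1, 0.5), (7, 2.0)])

with open("corpus.bin", "rb") as f:
    print(read_doc(f))  # (0, [(1, 0.5), (7, 2.0)])

# native "ii"/"if" packing is what makes the file hardware-specific (the con listed earlier)
```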

I can take another look at the scipy save_npz/load_npz functions, but based on the "read only" results above, I don't think the file I/O is the bottleneck. Rather, most of the time is spent in creating the return objects (tuples) for python. I may see if there's a way to use a memoryview to avoid creating all the tuples.
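
One possible direction for that memoryview idea (an assumption on my part, not code from the branch): view the raw (termid, value) bytes of a document through a structured numpy dtype, so no per-pair Python tuples need to be built at all.

```python
import numpy as np

pair_dtype = np.dtype([("termid", "<i4"), ("value", "<f4")])

# stand-in for the raw bytes of one document read from disk
raw = np.array([(1, 0.5), (7, 2.0)], dtype=pair_dtype).tobytes()

pairs = np.frombuffer(raw, dtype=pair_dtype)  # zero-copy view, no tuples created
print(pairs["termid"], pairs["value"])
```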

> For me, the priorities are read (speed) > write (speed) > file size. I don't think compression is so important now (it's 2018, HDDs are really cheap, SSDs are cheap too; the 34% file-size saving mentioned in #21 is not worth it IMO). Although sometimes compression improves performance (when the disk is slow).

I ran an updated benchmark of fitting an LDA model with the nytimes corpus (300k documents, so large but not gigantic) and reading the corpus only takes 3% of the total time. Do you know if there are other use cases I can try where the corpus is the bottleneck?

@arlenk you can try other topic models (like AuthorTopicModel or LdaSeqModel), but I don't think the results will be very different.

ping @persiyanov, please describe your results here too.

I don't think we can use save_npz/load_npz, as these do not support mmap.

AFAIK, the numpy format (which can be memory-mapped) is more or less raw bytes, with a small header at the beginning.
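
For example (a small illustration, not tied to any proposed format): a .npy file is a short header followed by the raw array bytes, and numpy can map it back lazily.

```python
import numpy as np

data = np.arange(1_000_000, dtype=np.float32)
np.save("data.npy", data)                    # header + raw bytes

mapped = np.load("data.npy", mmap_mode="r")  # returns a numpy.memmap, nothing read eagerly
print(mapped[123456])                        # pages are faulted in on first access
```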

@arlenk @menshikh-iv

@arlenk thanks for the good benchmark. I used it to experiment a bit.

I rewrote the python array version in Cython. It gave a negligible boost and isn't worth attention. I also compared this Cython array version with your Cython struct version on reading the corpus only (ran 5 times, then averaged and calculated the std):

benchmark results (nytimes corpus)
--------------------------------------------------------------------------------

cython struct (read only)               : 1.13 +- 0.31 seconds
cython array (read only)                : 3.48 +- 0.35 seconds

I agree that the _corpus reading phase_ is actually quite fast. So we need to optimize the time spent on the _returned objects_.

Also, I think it's worth trying np.memmap on a stored CSC/CSR matrix.

@arlenk @menshikh-iv @piskvorky

I've tested a version based on np.memmap and scipy.sparse.csc_matrix. Here is the implementation: https://gist.github.com/persiyanov/5aed5165d7945c176a0f557a473ef848. I believe the save_corpus part could be optimized, but saving time is not very important right now. The main idea is to read the data, indices, indptr dense arrays from disk and iterate over their memmapped versions (obtained with np.memmap).
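
Roughly, the idea looks like this (a simplified sketch of the approach, not the gist's exact code; the file names and dtypes here are assumptions):

```python
import numpy as np

class MemmapCorpus:
    """Iterate over documents stored as raw CSC arrays on disk."""

    def __init__(self, prefix):
        # the three CSC arrays, exposed lazily via np.memmap
        self.data = np.memmap(prefix + ".data.bin", dtype=np.float32, mode="r")
        self.indices = np.memmap(prefix + ".indices.bin", dtype=np.int32, mode="r")
        self.indptr = np.memmap(prefix + ".indptr.bin", dtype=np.int64, mode="r")

    def __iter__(self):
        for docno in range(len(self.indptr) - 1):
            start, end = self.indptr[docno], self.indptr[docno + 1]
            # building these per-document lists of Python tuples is where most time goes
            yield list(zip(self.indices[start:end], self.data[start:end]))
```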

The results are not promising, but maybe I haven't optimized my code well. Please take a look at the gist; maybe I missed something or made some stupid mistakes. I used @arlenk's benchmarking code and compared my python memmap version with the python struct, python array and current gensim 3.4 MmReader versions.

Results on _text8_ corpus:

benchmark results
--------------------------------------------------------------------------------

gensim 3.4 MmReader                     : 5.89 +- 0.76 seconds
python memmap                           : 26.43 +- 1.33 seconds
python struct                           : 9.11 +- 1.14 seconds
python array                            : 1.35 +- 0.07 seconds

Results on _nytimes_ corpus:

benchmark results
--------------------------------------------------------------------------------

gensim 3.4 MmReader                     : 118.70 +- 7.03 seconds
python memmap                           : 427.20 +- 23.09 seconds
python struct                           : 132.17 +- 17.28 seconds
python array                            : 27.01 +- 1.12 seconds

p.s. Tested on my MacBook Pro 2015, Core i5 2.6 GHz, 8 GB RAM

@persiyanov thank you, looks significantly worse :(

The most interesting part is __iter__ (as always): https://gist.github.com/persiyanov/5aed5165d7945c176a0f557a473ef848#file-python_memmap-py-L129

@arlenk do you have any idea why this solution is so slow?

> Results on nytimes corpus:
>
> benchmark results
>
> gensim 3.4 MmReader : 118.70 +- 7.03 seconds
> python memmap : 427.20 +- 23.09 seconds
> python struct : 132.17 +- 17.28 seconds
> python array : 27.01 +- 1.12 seconds

@persiyanov, any chance you have the benchmark results for the "cython struct" version (not the read-only one)? Your results for python array already look much better than mine -- I wonder if that's because I'm running my tests on older hardware (a 2010 iMac with a non-SSD hard drive).

> @persiyanov thank you, looks significantly worse :(
>
> The most interesting part is __iter__ (as always): https://gist.github.com/persiyanov/5aed5165d7945c176a0f557a473ef848#file-python_memmap-py-L129
>
> @arlenk do you have any idea why this solution is so slow?

I'll try taking a look. I'm guessing there is still quite a bit of overhead in numpy mmap.

There is no overhead in mmap -- that's the whole point. The disk memory gets mapped into virtual memory instantly, and loaded into physical RAM by the OS when a page fault occurs (~ on first access).

Mmap bypasses OS file I/O (no buffering etc), which may be related to the performance.
