JabRef 3.7dev--snapshot--2016-09-16--master--629c3f0
windows 10 10.0 amd64
Java 1.8.0_101
With the current master branch (master--629c3f0) there is a significant decrease in performance when using the "search" feature of JabRef compared to a previous master branch (master--4913665).
Before the update, entering a search term led to an instantaneous search and nearly instantaneous results. Now I have to wait up to 3 seconds every time I enter or modify a search term. As I use the search feature quite extensively, this is quite annoying.
The database is quite large (>8000 entries), but this shouldn't be the problem, as it worked before the update.
Unfortunately, I forgot to save the previous installer. Could you provide me with the older master--4913665 build?
Ok, my current workaround is to use the latest release version (JabRef 3.6), which does not exhibit these performance problems.
I cannot confirm this on my ~2200 entry database. There, search returns instant results. However, my database is also considerably smaller than yours. Also, I am not up to date with the recent changes in search. @koppor?
Also, @JabRef/developers What is the largest test file we have for testing performance?
@bartsch-dev Can you have a look?
@lenhard
If needed, I could send you an older (and smaller) copy of my database (still >6000 entries). It has the same issues as my current database, although JabRef seems a bit faster with the smaller one. Still, if I open this database with JabRef 3.6 and use the search feature, I get instant results. If I open the same database with JabRef 3.7dev--snapshot--2016-09-16--master--629c3f0, I have to wait a couple of seconds for search results (and you can also hear the CPU fan spinning up, indicating that more processing power is needed; this is not the case with JabRef 3.6).
If I send you the database, however, I would ask you not to share it or make it publicly available, as it may contain sensitive information (preferably, you would delete the file once it is no longer needed).
@AEgit ok, thanks! You can send it to joerg.[email protected] Then I will see how search works with my machine and delete your file afterwards.
It would be cool if one day we had a huge file for testing JabRef. Maybe we could go for auto-generated entries, but then again, randomized data will never be representative of real data.
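A tiny throwaway generator could produce such a file. This is only a sketch; the file name, field values, and entry counts are all made up for illustration:

    import java.io.PrintWriter;

    public class GenerateTestBib {
        public static void main(String[] args) throws Exception {
            // Write 10,000 synthetic @article entries to huge.bib
            try (PrintWriter out = new PrintWriter("huge.bib", "UTF-8")) {
                for (int i = 0; i < 10000; i++) {
                    out.printf("@article{key%d,%n  author = {Author %d},%n"
                            + "  title = {Title %d},%n  journal = {Journal %d},%n"
                            + "  year = {%d}%n}%n%n", i, i, i, i % 100, 1900 + (i % 117));
                }
            }
        }
    }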
I can confirm the performance problems with a DB of 6000 entries.
@lenhard there is one big file of mine in the admin repo.
@AEgit You don't need to send your file. But thanks for the offer.
Thanks! I played around with the file a little bit and also noticed a delay, albeit less than a second, which would not annoy me. But that is very person-dependent and also, of course, machine-dependent.
If performance has been better before, we should try to at least keep it at that level.
For the record, according to @AEgit's observation, the decrease was introduced some time between today and August 31st (the first comment mentioned that everything was still fine with commit 49136652bc71b72bf4f64f4189320e3e6f05f58e).
The performance benchmarks confirm the problem:
27.06.2016 (no idea which build):
Benchmark Mode Cnt Score Error Units
Benchmarks.htmlToLatexConversion thrpt 20 3675.747 ± 93.344 ops/s
Benchmarks.inferBibDatabaseMode thrpt 20 16084.549 ± 715.908 ops/s
Benchmarks.keywordGroupContains thrpt 20 5970.994 ± 478.213 ops/s
Benchmarks.keywordGroupContainsWord thrpt 20 12199628.518 ± 584096.935 ops/s
Benchmarks.latexToHTMLConversion thrpt 20 88616.053 ± 1520.290 ops/s
Benchmarks.latexToUnicodeConversion thrpt 20 88126.629 ± 1262.394 ops/s
Benchmarks.parse thrpt 20 35.918 ± 0.877 ops/s
Benchmarks.search thrpt 20 192.584 ± 5.798 ops/s
Benchmarks.write thrpt 20 71.246 ± 2.905 ops/s
current master:
Benchmark Mode Cnt Score Error Units
Benchmarks.htmlToLatexConversion thrpt 20 4672.927 ± 75.992 ops/s
Benchmarks.inferBibDatabaseMode thrpt 20 16192.387 ± 420.196 ops/s
Benchmarks.keywordGroupContains thrpt 20 8927.462 ± 359.622 ops/s
Benchmarks.keywordGroupContainsWord thrpt 20 12317497.180 ± 321137.238 ops/s
Benchmarks.latexToHTMLConversion thrpt 20 90283.510 ± 2727.009 ops/s
Benchmarks.latexToUnicodeConversion thrpt 20 88338.103 ± 1459.373 ops/s
Benchmarks.parse thrpt 20 29.766 ± 0.337 ops/s
Benchmarks.search thrpt 20 30.106 ± 0.637 ops/s
Benchmarks.write thrpt 20 58.239 ± 0.563 ops/s
So the search score went down from nearly 200 to 30 (note that the parse and write scores also decreased slightly... not good).
@Everyone:
Thank you for your quick replies. Yes, I prefer to avoid sending the file. Maybe once I have the time to go through the database and delete everything that shouldn't be public, I could make it available (but currently I don't have the time). As pointed out by @lenhard, the changes that led to the performance drop must have happened between the following two builds:
JabRef 3.7dev--snapshot--2016-09-16--master--4913665
JabRef 3.7dev--snapshot--2016-09-16--master--629c3f0
I made some quick tests, and the entry search in the SearchWorker is about 8 times slower since 543c176200ec793d1272fc65fd763492ebd3105a than before.
https://github.com/JabRef/jabref/blob/master/src/main/java/net/sf/jabref/gui/search/SearchWorker.java#L43
To be precise, it is the collect part of the lambda.
Making the stream a parallelStream makes it significantly better, but still worse than before (around 2 times slower).
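For reference, a minimal sketch of the step in question; BibEntry, BibDatabase, and SearchQuery#isMatch are assumed names here, not verbatim JabRef code:

    import java.util.List;
    import java.util.stream.Collectors;

    static List<BibEntry> findMatches(BibDatabase database, SearchQuery searchQuery) {
        return database.getEntries()
                .parallelStream()              // was .stream(); parallelizing roughly halves the slowdown
                .filter(searchQuery::isMatch)  // the expensive per-entry check
                .collect(Collectors.toList()); // the terminal collect drives the filtering, so the cost shows up here
    }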
@bartsch-dev You beat me by 17 seconds with suggesting parallelization ;) Well, 2 times slower is better than 8 times slower.
Anyway, if lambda expressions cause the delay, we should probably think about re-implementing performance-critical operations in the old style. @bartsch-dev, can you reimplement that fragment in an iterative style and see how it performs?
Something like this (pseudocode; entry and query types assumed):

    List<BibEntry> matches = new LinkedList<>();
    for (BibEntry entry : database.getEntries()) {
        if (searchQuery.matches(entry)) {
            matches.add(entry);
        }
    }
Take care to use a LinkedList, since its add() runs in O(1) (constant time).
It's not really the lambda that causes the problem.
If I insert the same lambda one commit earlier (and I did not change it in any remarkable way), it performs much better.
Well, the search() benchmark just uses the SearchQuery directly (without the SearchWorker). And even there the performance is 8 times slower.
Could it be related to the Unicode stuff?
Just to keep everyone updated: I've tried the most recent snapshot (JabRef 3.7dev--snapshot--2016-09-27--master--b8d1c2e), but it is still clearly slower than previous versions. I will fall back again to the latest release version JabRef 3.6.
I played around a little bit and got the following benchmark results (note the score of the parallel search, which is much higher than the search score before the performance issue):
Benchmark Mode Cnt Score Error Units
Benchmarks.htmlToLatexConversion thrpt 20 5017.331 ± 173.848 ops/s
Benchmarks.inferBibDatabaseMode thrpt 20 21474.468 ± 542.497 ops/s
Benchmarks.keywordGroupContains thrpt 20 12074.752 ± 1167.932 ops/s
Benchmarks.keywordGroupContainsWord thrpt 20 12975699.817 ± 295384.417 ops/s
Benchmarks.latexToHTMLConversion thrpt 20 100085.982 ± 11105.766 ops/s
Benchmarks.latexToUnicodeConversion thrpt 20 105671.959 ± 5261.769 ops/s
Benchmarks.parallelSearch thrpt 20 525.198 ± 42.728 ops/s
Benchmarks.parse thrpt 20 34.981 ± 1.116 ops/s
Benchmarks.removeLatexCommands thrpt 20 816118.612 ± 33977.863 ops/s
Benchmarks.search thrpt 20 320.303 ± 7.338 ops/s
Benchmarks.write thrpt 20 69.749 ± 1.551 ops/s
The problem is that in the PR where I changed the search bar, I changed the ContainBasedSearchRule too.
Before, the string about to be searched was stripped of all LaTeX commands, so {\"a} was recognized as "a" rather than "ä".
Removing all LaTeX commands is about 8 times quicker than converting them to Unicode (see the benchmark above and the toy example below).
Refs #518
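To make the difference concrete, here is a toy example with hand-rolled transformations (not the actual formatter classes):

    public class LatexSearchExample {
        public static void main(String[] args) {
            String raw = "K{\\\"o}nig";                          // BibTeX source: K{\"o}nig
            String stripped = raw.replaceAll("\\\\\"|[{}]", ""); // old rule: "Konig" (fast, but a search for "ö" misses)
            String converted = raw.replace("{\\\"o}", "ö");      // new rule: "König" (correct, but much slower in the real converter)
            System.out.println(stripped + " / " + converted);
        }
    }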
Hmmm, indeed, your benchmark results indicate that the parallel search is much faster than the initial one. I guess the change to the ContainBasedSearchRule, however, has such a high impact on performance (you mention that the old behaviour was about 8 times quicker) that the speedup from the parallel search cannot overcome the overall problem, i.e., the search is still much slower than it was previously.
As far as I understand, the search as currently implemented works as it SHOULD, since LaTeX commands are now interpreted correctly (while before they were not). This correct behaviour, however, causes a major performance issue for large databases (which contain many LaTeX commands that need to be converted to Unicode? What about databases that don't contain much that needs converting?).
If it is not possible to solve the problem in another way: would it be possible to implement a switch that lets the user activate either a "fast search" (which does not convert the LaTeX commands) or a "slow/correct search" (which does convert them, with the associated performance drop)? I guess it is just a workaround, but it might be a solution.
First of all, it is a good thing that we have found the component that is causing the trouble. Good job, @bartsch-dev!
It will probably make no difference whether the database contains LaTeX characters or not; I think it is the searching for LaTeX in LatexToUnicode that is causing the problem. There is some potential for performance optimization in the code of this class, for instance by using pre-compiled regular expressions instead of the replaceAll statements, but I doubt this will really solve the problem. I am also against providing a configuration option for turning this off (sorry @AEgit), since workarounds have a tendency to turn into permanent solutions. Either we do it right, or we leave it, but no more hacks that make future feature implementation harder.
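For illustration, the pre-compiled-pattern idea looks like this (a generic sketch, not the actual LatexToUnicode code):

    import java.util.regex.Pattern;

    public class PatternSketch {
        private static final Pattern BRACES = Pattern.compile("[{}]"); // compiled once, reused forever

        static String slow(String s) {
            return s.replaceAll("[{}]", "");          // recompiles the regex on every call
        }

        static String fast(String s) {
            return BRACES.matcher(s).replaceAll(""); // reuses the compiled pattern
        }
    }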
I think we should discuss this in a devcall one day. But in general, it looks like this is the only long-term solution we have. What I do not really understand is why people keep LaTeX-encoded entries in their databases at all, as all modern text-writing programs support UTF-8.
I mean, we already have a cleanup operation for removing LaTeX.
IMHO there are two obvious solutions to speed things up:
1. Store an additional LaTeX-free copy of each field alongside the original, so search never converts on the fly.
2. Keep only the LaTeX-free version in memory and convert back to LaTeX only when saving the database.
I just tried the first option; the performance gain is enormous (https://github.com/bartsch-dev/jabref/commit/7a9740a758146f3af5016a42af73b5ba93298cda):
Benchmark Mode Cnt Score Error Units
Benchmarks.parallelSearch thrpt 20 1200.462 ± 125.105 ops/s
Benchmarks.search thrpt 20 650.013 ± 53.531 ops/s
Option 1 goes in the direction of "store latex free version" #518.
What I don't understand is why it actually works for the benchmark. As I understand it, the benchmark asks every entry about its fields exactly once, so caching shouldn't affect the benchmark.
Since I'm not a big fan of caching (it makes development harder in general, since you often have to change two things), I would propose improving the performance of the LaTeX-to-Unicode formatter instead. I'll have a look at it over the weekend if nobody wants to do it before then (@bartsch-dev: ?)
The LaTeX-free field gets stored as soon as the field gets set (thus the database needs about double the space it did before), not when it gets accessed.
The benchmark only measures the time the search takes, not the time the database needs to initialize.
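A hedged sketch of that eager storage; the field maps and the converter call are assumptions for illustration, not JabRef's actual BibEntry API:

    import java.util.HashMap;
    import java.util.Map;

    public class EntrySketch {
        private final Map<String, String> fields = new HashMap<>();
        private final Map<String, String> latexFreeFields = new HashMap<>();

        public void setField(String name, String value) {
            fields.put(name, value);
            latexFreeFields.put(name, convert(value)); // conversion paid once, when the field is set
        }

        public String getLatexFreeField(String name) {
            return latexFreeFields.get(name);          // search reads this; no conversion on the fly
        }

        private static String convert(String s) {
            return s.replace("{\\\"o}", "ö");          // placeholder for the real LaTeX-to-Unicode converter
        }
    }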
I'm away over the long weekend. You can try your luck (I opt for option 2: where do we need the strings with LaTeX commands anyway, except when saving the database to disk, which only matters if the user entered the special characters in LaTeX style?).
Regarding caching: there is no point in improving performance if JabRef dies for large databases due to memory usage. @AEgit, what is the current memory footprint for your 8000+ entry database? Will it be a problem for your environment if JabRef takes up twice of what it currently does?
On Windows 10, JabRef currently uses ~1.4 GB of RAM for a database with nearly 9,000 entries (and several hundred groups). Doubling the memory footprint would be no issue for me, as I'm running this on a machine with 16 GB of RAM.
@AEgit Just a question: Do you have the 64-Bit version of Java installed?
Yes I have.
I note that I pointed this out as a probable reason quite some time ago. ;-)
@Siedlerchr There are for sure still use cases for non-UTF-8 databases; IEEE is just one.
Plain LaTeX is important for many users, for instance when publishing scientific papers. Nearly no publisher supports biblatex, and bibtex8 still seems to be a hack and is 20 years old by now. Sure, switching to biblatex is recommended, but it is not practical.
I doubt that caching the UTF-8 string ("latex-free version") takes twice the memory, because I assume that 80% of the strings will be the same. The latex-free version can use String.intern() to reduce memory consumption drastically (see also http://stackoverflow.com/q/10578984/873282). We can also add -XX:+UseG1GC -XX:+UseStringDeduplication as JVM arguments if we don't want to use intern(). See also http://stackoverflow.com/a/27953779/873282.
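A one-line sketch of the intern() idea, with convert() standing in for the real converter:

    String latexFree = convert(rawValue).intern(); // identical results share one instance in the JVM string pool

The JVM-level alternative is to launch JabRef with java -XX:+UseG1GC -XX:+UseStringDeduplication; note that string deduplication requires the G1 collector and Java 8u20 or newer.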
@oscargus Indeed, your intuition paved the way :-)
@koppor Sure, it does not take twice the memory. That is just the worst-case upper bound, which we should take into account in a design decision anyway.
Summarizing the discussion, I think we have a consensus that something will be done. We currently have the following options:
- rewrite the LatexToUnicode converter so that it is fast enough to run on the fly
- store a LaTeX-free version of each field in the BibEntry when the field is set
- cache the LaTeX-free version in the BibEntry on first access

Let us decide which way to go at the next devcall!
I think rewriting the converters would make sense anyway. I have just touched them, and there are at least three(?) places where the conversion map is checked, which seems quite redundant. I haven't really studied it in depth, but it may be possible to write a faster converter from scratch with a selectable conversion map (to handle LaTeX to {Unicode, HTML, XML, RTF}), although unification and speed may be contradictory...
A first step might be to profile the LatexToUnicode converter to see where the actual bottleneck is.
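In the style of the existing JMH suite, such a profiling benchmark could look like the following sketch; the formatter class name and its format(String) method are assumed here:

    import org.openjdk.jmh.annotations.Benchmark;
    import org.openjdk.jmh.annotations.Scope;
    import org.openjdk.jmh.annotations.State;

    @State(Scope.Thread)
    public class LatexToUnicodeBench {
        private final LatexToUnicodeFormatter formatter = new LatexToUnicodeFormatter();
        private final String input = "{\\\"a}uthor, {\\'e}ditor and $\\alpha$ mixed with plain text";

        @Benchmark
        public String latexToUnicode() {
            return formatter.format(input); // returning the result prevents dead-code elimination
        }
    }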
It is not my place to say this, as I'm not an active developer: but would it be possible for you to go for the quickest of the solutions mentioned, so that I can switch to the new developer version? Currently I can't, and because of that I also cannot use Google Scholar (see recent posts here: https://github.com/JabRef/jabref/issues/1886), which is quite essential to my work (much of it has no DOI but is found by Google Scholar).
Again, this is just a suggestion, and apologies if I'm overstepping a boundary as a passive user who doesn't help in actively developing this wonderful piece of software.
@AEgit if you desperately want a development version with quicker search, then you can get one right now here: http://builds.jabref.org/latex-free-fields/
This should be somewhat faster already. I'll do a few more minor improvements tomorrow, and I have not yet tested how much faster this actually is or how much memory it consumes. But it would be interesting to hear from you how search performs anyway. The first search run should still be pretty slow, but subsequent ones should be faster.
@oscargus I am doing a few minor optimizations in LatexToUnicode, but a rewrite would probably be the most reasonable option. I have to say that I barely understand what that algorithm does...
JabRef 3.7dev--snapshot--2016-09-29--latex-free-fields--93e665e
windows 10 10.0 amd64
Java 1.8.0_101
@lenhard : Thank you so much!!!
This is indeed much faster - it feels as if it were even faster than before the issue (maybe due to the parallel optimizations?). Google Scholar works as well: great!
The memory footprint seems to be nearly the same: I see a maximum increase of about 200 MB (JabRef now uses 1.6 GB instead of 1.4 GB), which is negligible on my machine. I think it also uses more CPU power - at least initially I heard my fans spinning up. Again, no issue for me, as I'm running this on a quite powerful Core i7 laptop.
So far I did not run into any (major) bugs.
Thank you very much!
Ok, small bug spotted in JabRef 3.7dev--snapshot--2016-09-29--latex-free-fields--93e665e
It's no longer possible to assign an article to a group by drag'n'drop. If I try that, nothing happens. Instead, I have to manually enter the group's name in the "Groups" field: then it works.
(I didn't know whether I should open a new ticket, as you mentioned that you would do a few minor improvements tomorrow anyway.)
@AEgit That is great to hear :-) I will do some more testing tomorrow, but it sounds like we are getting this issue solved. It is also good that the additional memory needed seems to be acceptable; if so, then that is certainly due to the suggestions by @koppor.
The groups bug you mention should be unrelated, and I will only touch the LaTeX converter tomorrow, so please open a new issue.
Another small bug in JabRef 3.7dev--snapshot--2016-09-29--latex-free-fields--93e665e
The DOI resolver no longer appears in the web search interface.
The DOI resolver is now in BibTeX -> New entry
@mlep
Thank you! Hmmm, would it be possible to re-add the DOI fetcher to the web search? Now I have to make at least one additional click (or use one additional shortcut) to enter the DOI and get the BibTeX data, whereas before I could just enter the DOI in the web search pane that was already open.
Would it help if pressing CTRL+N focused the input field?
Hmmm, yes, I was already using CTRL+N, and focusing the input field would help - but it's not as convenient as it was before. At least I think so; maybe I just need to get used to it. But changing the focus to the input field would definitely help.
@zellerdev
Another small bug I came across in JabRef 3.7dev--snapshot--2016-09-29--latex-free-fields--93e665e: the entry preview is not correctly displayed after performing a search. I think a similar issue has been reported before.
Indeed, similar (but not identical) problems to the one mentioned above appeared in:
https://github.com/JabRef/jabref/issues/1760
https://github.com/JabRef/jabref/issues/1903
(I can create a new bug report if this is not related to the changes made so far in search.)
I think we should add a side panel for creating entries based on a DOI etc.
As far as I can tell, the small bug mentioned for JabRef 3.7dev--snapshot--2016-09-29--latex-free-fields--93e665e (entry preview not correctly displayed after having performed a search - see posts above) appears to be fixed in JabRef 3.7dev--snapshot--2016-09-30--latex-free-fields--0ee1d17.
UPDATE: No, I was too quick - the bug still appears (but it is a bit more difficult to reproduce; I will open a new ticket).
The new ticket can be found here:
https://github.com/JabRef/jabref/issues/2105
This is fixed in #2102. Discussion regarding related issues that have been discovered in this context should be carried on in separate threads.
Thanks to everyone for this quick fix!
Bugs found in this context can be found here:
https://github.com/JabRef/jabref/issues/2098
https://github.com/JabRef/jabref/issues/2105