JabRef 3.7dev--snapshot--2016-09-16--master--629c3f0
windows 10 10.0 amd64
Java 1.8.0_101
With the current master branch (master--629c3f0) there is a significant decrease in performance when using the "search" feature of JabRef compared to a previous master branch (master--4913665).
Before the update, entering a search term led to an instantaneous search and nearly instantaneous results. Now I have to wait up to 3 seconds every time I enter or modify a search term. As I use the search feature quite extensively, this is quite annoying.
The database is quite large (>8000 entries), but this shouldn't be the problem, as it worked before the update.
Unfortunately, I forgot to save the previous installer. Could you provide me with the older master--4913665 build?
Ok, my current workaround is to use the latest release version (JabRef 3.6), which does not exhibit these performance problems.
I cannot confirm this on my ~2200 entry database. There, search returns instant results. However, my database is also considerably smaller than yours. Also, I am not up to date with the recent changes in search. @koppor?
Also, @JabRef/developers What is the largest test file we have for testing performance?
@bartsch-dev Can you have a look?
@lenhard
If needed, I could send you an older (and smaller) copy of my database (still >6000 entries). It has the same issues as my current database, although JabRef seems a bit faster with the smaller one. Still, if I open this database with JabRef 3.6 and use the search feature, I get instant results. If I open the same database with JabRef 3.7dev--snapshot--2016-09-16--master--629c3f0, I have to wait a couple of seconds for search results (and you can also hear the CPU fan spinning up, indicating that more processing power is needed; this is not the case with JabRef 3.6).
If I send you the database, however, I would ask you not to share it or make it publicly available, as it may contain sensitive information (preferably, you would delete the file once it is no longer needed).
@AEgit ok, thanks! You can send it to joerg.[email protected] Then I will see how search works with my machine and delete your file afterwards.
It would be cool if one day we had a huge file for testing JabRef. Maybe we could go for auto-generated entries, but then again, randomized data will never be representative of real data.
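A tiny throwaway generator could produce such a file. This is only a sketch; the file name, field values, and entry counts are all made up for illustration:

    import java.io.PrintWriter;

    public class GenerateTestBib {
        public static void main(String[] args) throws Exception {
            // Write 10,000 synthetic @article entries to huge.bib
            try (PrintWriter out = new PrintWriter("huge.bib", "UTF-8")) {
                for (int i = 0; i < 10000; i++) {
                    out.printf("@article{key%d,%n  author = {Author %d},%n"
                            + "  title = {Title %d},%n  journal = {Journal %d},%n"
                            + "  year = {%d}%n}%n%n", i, i, i, i % 100, 1900 + (i % 117));
                }
            }
        }
    }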
I can confirm the performance problems with a DB of 6000 entries.
@lenhard there is one big file of mine in the admin repo.
@AEgit You don't need to send your file. But thanks for the offer.
Thanks! I played around with the file a little bit and also noticed a delay, albeit less than a second, which would not annoy me. But that is very person-dependent and also, of course, machine-dependent.
If performance has been better before, we should try to at least keep it at that level.
For the record, according to @AEgit's observation, the decrease was introduced some time between today and August 31st (the first comment mentioned that everything was still fine with commit 49136652bc71b72bf4f64f4189320e3e6f05f58e).
The performance benchmarks confirm the problem:
27.06.2016 (no idea which build):
Benchmark Mode Cnt Score Error Units
Benchmarks.htmlToLatexConversion thrpt 20 3675.747 ± 93.344 ops/s
Benchmarks.inferBibDatabaseMode thrpt 20 16084.549 ± 715.908 ops/s
Benchmarks.keywordGroupContains thrpt 20 5970.994 ± 478.213 ops/s
Benchmarks.keywordGroupContainsWord thrpt 20 12199628.518 ± 584096.935 ops/s
Benchmarks.latexToHTMLConversion thrpt 20 88616.053 ± 1520.290 ops/s
Benchmarks.latexToUnicodeConversion thrpt 20 88126.629 ± 1262.394 ops/s
Benchmarks.parse thrpt 20 35.918 ± 0.877 ops/s
Benchmarks.search thrpt 20 192.584 ± 5.798 ops/s
Benchmarks.write thrpt 20 71.246 ± 2.905 ops/s
current master:
Benchmark Mode Cnt Score Error Units
Benchmarks.htmlToLatexConversion thrpt 20 4672.927 ± 75.992 ops/s
Benchmarks.inferBibDatabaseMode thrpt 20 16192.387 ± 420.196 ops/s
Benchmarks.keywordGroupContains thrpt 20 8927.462 ± 359.622 ops/s
Benchmarks.keywordGroupContainsWord thrpt 20 12317497.180 ± 321137.238 ops/s
Benchmarks.latexToHTMLConversion thrpt 20 90283.510 ± 2727.009 ops/s
Benchmarks.latexToUnicodeConversion thrpt 20 88338.103 ± 1459.373 ops/s
Benchmarks.parse thrpt 20 29.766 ± 0.337 ops/s
Benchmarks.search thrpt 20 30.106 ± 0.637 ops/s
Benchmarks.write thrpt 20 58.239 ± 0.563 ops/s
So the search score went down from nearly 200 to 30 (note that the parse and write scores also decreased slightly... not good).
@Everyone:
Thank you for your quick replies. Yes, I prefer to avoid sending the file. Maybe once I have the time to go through the database and delete everything that shouldn't be public, I could make it available (but currently I don't have the time). As pointed out by @lenhard, the changes that led to the performance drop must have happened between the following two builds:
JabRef 3.7dev--snapshot--2016-09-16--master--4913665
JabRef 3.7dev--snapshot--2016-09-16--master--629c3f0
I made some quick tests, and the entry search in the SearchWorker is about 8 times slower since 543c176200ec793d1272fc65fd763492ebd3105a than before.
https://github.com/JabRef/jabref/blob/master/src/main/java/net/sf/jabref/gui/search/SearchWorker.java#L43
To be precise, it is the collect part of the lambda.
Making the stream a parallelStream makes it significantly better, but still worse than before (around 2 times slower).
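For reference, a minimal sketch of the step in question; BibEntry, BibDatabase, and SearchQuery#isMatch are assumed names here, not verbatim JabRef code:

    import java.util.List;
    import java.util.stream.Collectors;

    static List<BibEntry> findMatches(BibDatabase database, SearchQuery searchQuery) {
        return database.getEntries()
                .parallelStream()              // was .stream(); parallelizing roughly halves the slowdown
                .filter(searchQuery::isMatch)  // the expensive per-entry check
                .collect(Collectors.toList()); // the terminal collect drives the filtering, so the cost shows up here
    }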
@bartsch-dev You beat me by 17 seconds with suggesting parallelization ;) Well, 2 times slower is better than 8 times slower.
Anyway, if lambda expressions cause the delay, we should probably think about re-implementing performance-critical operations in the old style. @bartsch-dev, can you reimplement that fragment in an iterative style and see how it performs?
Something like this (pseudocode; entry and query types assumed):

    List<BibEntry> matches = new LinkedList<>();
    for (BibEntry entry : database.getEntries()) {
        if (searchQuery.matches(entry)) {
            matches.add(entry);
        }
    }
Take care to use a LinkedList, since its add() runs in O(1) (constant time).
It's not really the lambda that causes the problem.
If I insert the same lambda one commit earlier (and I did not change it in any remarkable way), it performs much better.
Well, the search() benchmark just uses the SearchQuery directly (without the SearchWorker). And even there the performance is 8 times slower.
Could it be related to the Unicode stuff?
Just to keep everyone updated: I've tried the most recent snapshot (JabRef 3.7dev--snapshot--2016-09-27--master--b8d1c2e), but it is still clearly slower than previous versions. I will fall back again to the latest release version JabRef 3.6.
I played around a little bit and got the following benchmark results (note the score of the parallel search, which is much higher than the search score before the performance issue):
Benchmark Mode Cnt Score Error Units
Benchmarks.htmlToLatexConversion thrpt 20 5017.331 ± 173.848 ops/s
Benchmarks.inferBibDatabaseMode thrpt 20 21474.468 ± 542.497 ops/s
Benchmarks.keywordGroupContains thrpt 20 12074.752 ± 1167.932 ops/s
Benchmarks.keywordGroupContainsWord thrpt 20 12975699.817 ± 295384.417 ops/s
Benchmarks.latexToHTMLConversion thrpt 20 100085.982 ± 11105.766 ops/s
Benchmarks.latexToUnicodeConversion thrpt 20 105671.959 ± 5261.769 ops/s
Benchmarks.parallelSearch thrpt 20 525.198 ± 42.728 ops/s
Benchmarks.parse thrpt 20 34.981 ± 1.116 ops/s
Benchmarks.removeLatexCommands thrpt 20 816118.612 ± 33977.863 ops/s
Benchmarks.search thrpt 20 320.303 ± 7.338 ops/s
Benchmarks.write thrpt 20 69.749 ± 1.551 ops/s
The problem is that in the PR where I changed the search bar, I changed the ContainBasedSearchRule too.
Before, the string about to be searched was stripped of all LaTeX commands, so {\"a} was recognized as "a" rather than "ä".
Removing all LaTeX commands is about 8 times quicker than converting them to Unicode (see the benchmark above and the toy example below).
Refs #518
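To make the difference concrete, here is a toy example with hand-rolled transformations (not the actual formatter classes):

    public class LatexSearchExample {
        public static void main(String[] args) {
            String raw = "K{\\\"o}nig";                          // BibTeX source: K{\"o}nig
            String stripped = raw.replaceAll("\\\\\"|[{}]", ""); // old rule: "Konig" (fast, but a search for "ö" misses)
            String converted = raw.replace("{\\\"o}", "ö");      // new rule: "König" (correct, but much slower in the real converter)
            System.out.println(stripped + " / " + converted);
        }
    }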
Hmmm, indeed, your benchmark results indicate that the parallel search is much faster than the initial one. I guess the change to the ContainBasedSearchRule, however, has such a high impact on performance (you mention that the old behaviour was about 8 times quicker) that the speedup from the parallel search cannot overcome the overall problem, i.e., the search is still much slower than it was previously.
As far as I understand, the search as currently implemented works as it SHOULD, since LaTeX commands are now interpreted correctly (while before they were not). This correct behaviour, however, causes a major performance issue for large databases (which contain many LaTeX commands that need to be converted to Unicode? What about databases that don't contain much that needs converting?).
If it is not possible to solve the problem in another way: would it be possible to implement a switch that lets the user activate either a "fast search" (which does not convert the LaTeX commands) or a "slow/correct search" (which does convert them, with the associated performance drop)? I guess it is just a workaround, but it might be a solution.
First of all, it is a good thing that we have found the component that is causing the trouble. Good job, @bartsch-dev!
It will probably make no difference whether the database contains LaTeX characters or not; I think it is the searching for LaTeX in LatexToUnicode that is causing the problem. There is some potential for performance optimization in the code of this class, for instance by using pre-compiled regular expressions instead of the replaceAll statements, but I doubt this will really solve the problem. I am also against providing a configuration option for turning this off (sorry @AEgit), since workarounds have a tendency to turn into permanent solutions. Either we do it right, or we leave it, but no more hacks that make future feature implementation harder.
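For illustration, the pre-compiled-pattern idea looks like this (a generic sketch, not the actual LatexToUnicode code):

    import java.util.regex.Pattern;

    public class PatternSketch {
        private static final Pattern BRACES = Pattern.compile("[{}]"); // compiled once, reused forever

        static String slow(String s) {
            return s.replaceAll("[{}]", "");          // recompiles the regex on every call
        }

        static String fast(String s) {
            return BRACES.matcher(s).replaceAll(""); // reuses the compiled pattern
        }
    }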
I think we should discuss this in a devcall one day. But in general, it looks like this is the only long-term solution we have. What I do not really understand is why people keep LaTeX-encoded entries in their databases at all, as all modern text-writing programs support UTF-8.
I mean, we already have a cleanup operation for removing LaTeX.
IMHO there are two obvious solutions to speed things up:
1. Store an additional LaTeX-free copy of each field alongside the original, so search never converts on the fly.
2. Keep only the LaTeX-free version in memory and convert back to LaTeX only when saving the database.
I just tried the first option; the performance gain is enormous (https://github.com/bartsch-dev/jabref/commit/7a9740a758146f3af5016a42af73b5ba93298cda):
Benchmark Mode Cnt Score Error Units
Benchmarks.parallelSearch thrpt 20 1200.462 ± 125.105 ops/s
Benchmarks.search thrpt 20 650.013 ± 53.531 ops/s
Option 1 goes in the direction of "store latex free version" #518.
What I don't understand is why it actually works for the benchmark. As I understand it, the benchmark asks every entry about its fields exactly once, so caching shouldn't affect the benchmark.
Since I'm not a big fan of caching (it makes development harder in general, since you often have to change two things), I would propose improving the performance of the LaTeX-to-Unicode formatter instead. I'll have a look at it over the weekend if nobody wants to do it before then (@bartsch-dev: ?)
The LaTeX-free field gets stored as soon as the field gets set (thus the database needs about double the space it did before), not when it gets accessed.
The benchmark only measures the time the search takes, not the time the database needs to initialize.
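A hedged sketch of that eager storage; the field maps and the converter call are assumptions for illustration, not JabRef's actual BibEntry API:

    import java.util.HashMap;
    import java.util.Map;

    public class EntrySketch {
        private final Map<String, String> fields = new HashMap<>();
        private final Map<String, String> latexFreeFields = new HashMap<>();

        public void setField(String name, String value) {
            fields.put(name, value);
            latexFreeFields.put(name, convert(value)); // conversion paid once, when the field is set
        }

        public String getLatexFreeField(String name) {
            return latexFreeFields.get(name);          // search reads this; no conversion on the fly
        }

        private static String convert(String s) {
            return s.replace("{\\\"o}", "ö");          // placeholder for the real LaTeX-to-Unicode converter
        }
    }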
I'm away over the long weekend. You can try your luck (I opt for option 2: where do we need the strings with LaTeX commands anyway, except when saving the database to disk, which only matters if the user entered the special characters in LaTeX style?).
Regarding caching: there is no point in improving performance if JabRef dies for large databases due to memory usage. @AEgit, what is the current memory footprint for your 8000+ entry database? Will it be a problem for your environment if JabRef takes up twice of what it currently does?
On Windows 10, JabRef currently uses ~1.4 GB of RAM for a database with nearly 9,000 entries (and several hundred groups). Doubling the memory footprint would be no issue for me, as I'm running this on a machine with 16 GB of RAM.
@AEgit Just a question: Do you have the 64-Bit version of Java installed?
Yes I have.
I note that I pointed this out as a probable reason quite some time ago. ;-)
@Siedlerchr There are for sure still use cases for non-UTF-8 databases; IEEE is just one.
Plain LaTeX is important for many users, for instance when publishing scientific papers. Nearly no publisher supports biblatex, and bibtex8 still seems to be a hack and is 20 years old by now. Sure, switching to biblatex is recommended, but it is not practical.
I doubt that caching the UTF-8 string ("latex-free version") takes twice the memory, because I assume that 80% of the strings will be the same. The latex-free version can use String.intern() to reduce memory consumption drastically (see also http://stackoverflow.com/q/10578984/873282). We can also add -XX:+UseG1GC -XX:+UseStringDeduplication as JVM arguments if we don't want to use intern(). See also http://stackoverflow.com/a/27953779/873282.
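A one-line sketch of the intern() idea, with convert() standing in for the real converter:

    String latexFree = convert(rawValue).intern(); // identical results share one instance in the JVM string pool

The JVM-level alternative is to launch JabRef with java -XX:+UseG1GC -XX:+UseStringDeduplication; note that string deduplication requires the G1 collector and Java 8u20 or newer.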
@oscargus Indeed, your intuition paved the way :-)
@koppor Sure, it does not take twice the memory. That is just the worst-case upper bound, which we should take into account in a design decision anyway.
Summarizing the discussion, I think we have a consensus that something will be done. We currently have the following options:
- rewrite the LatexToUnicode converter so that it is fast enough to run on the fly
- store a LaTeX-free version of each field in the BibEntry when the field is set
- cache the LaTeX-free version in the BibEntry on first access

Let us decide which way to go at the next devcall!
I think rewriting the converters would make sense anyway. I have just touched them, and there are at least three(?) places where the conversion map is checked, which seems quite redundant. I haven't really studied it in depth, but it may be possible to write a faster converter from scratch with a selectable conversion map (to handle LaTeX to {Unicode, HTML, XML, RTF}), although unification and speed may be contradictory...
A first step might be to profile the LatexToUnicode converter to see where the actual bottleneck is.
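In the style of the existing JMH suite, such a profiling benchmark could look like the following sketch; the formatter class name and its format(String) method are assumed here:

    import org.openjdk.jmh.annotations.Benchmark;
    import org.openjdk.jmh.annotations.Scope;
    import org.openjdk.jmh.annotations.State;

    @State(Scope.Thread)
    public class LatexToUnicodeBench {
        private final LatexToUnicodeFormatter formatter = new LatexToUnicodeFormatter();
        private final String input = "{\\\"a}uthor, {\\'e}ditor and $\\alpha$ mixed with plain text";

        @Benchmark
        public String latexToUnicode() {
            return formatter.format(input); // returning the result prevents dead-code elimination
        }
    }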
It is not my place to say this, as I'm not an active developer: but would it be possible for you to go for the quickest of the solutions mentioned, so that I can switch to the new developer version? Currently I can't, and because of that I also cannot use Google Scholar (see recent posts here: https://github.com/JabRef/jabref/issues/1886), which is quite essential to my work (much of it has no DOI but is found by Google Scholar).
Again, this is just a suggestion, and apologies if I'm overstepping a boundary as a passive user who doesn't help in actively developing this wonderful piece of software.
@AEgit if you desperately want a development version with quicker search, then you can get one right now here: http://builds.jabref.org/latex-free-fields/
This should be somewhat faster already. I'll do a few more minor improvements tomorrow, and I have not yet tested how much faster this actually is or how much memory it consumes. But it would be interesting to hear from you how search performs anyway. The first search run should still be pretty slow, but subsequent ones should be faster.
@oscargus I am doing a few minor optimizations in LatexToUnicode, but a rewrite would probably be the most reasonable option. I have to say that I barely understand what that algorithm does...
JabRef 3.7dev--snapshot--2016-09-29--latex-free-fields--93e665e
windows 10 10.0 amd64
Java 1.8.0_101
@lenhard : Thank you so much!!!
This is indeed much faster - it feels as if it were even faster than before the issue (maybe due to the parallel optimizations?). Google Scholar works as well: great!
The memory footprint seems to be nearly the same: I see a maximum increase of about 200 MB (JabRef now uses 1.6 GB instead of 1.4 GB), which is negligible on my machine. I think it also uses more CPU power - at least initially I heard my fans spinning up. Again, no issue for me, as I'm running this on a quite powerful Core i7 laptop.
So far I did not run into any (major) bugs.
Thank you very much!
Ok, small bug spotted in JabRef 3.7dev--snapshot--2016-09-29--latex-free-fields--93e665e
It's no longer possible to assign an article to a group by drag'n'drop. If I try that, nothing happens. Instead, I have to manually enter the group's name in the "Groups" field: then it works.
(I didn't know whether I should open a new ticket, as you mentioned that you would do a few minor improvements tomorrow anyway.)
@AEgit That is great to hear :-) I will do some more testing tomorrow, but it sounds like we are getting this issue solved. It is also good that the additional memory needed seems to be acceptable; if so, then that is certainly due to the suggestions by @koppor.
The groups bug you mention should be unrelated, and I will only touch the LaTeX converter tomorrow, so please open a new issue.
Another small bug in JabRef 3.7dev--snapshot--2016-09-29--latex-free-fields--93e665e
The DOI resolver no longer appears in the web search interface.
The DOI resolver is now in BibTeX -> New entry
@mlep
Thank you! Hmmm, would it be possible to re-add the DOI fetcher to the web search? Now I have to make at least one additional click (or use one additional shortcut) to enter the DOI and get the BibTeX data, whereas before I could just enter the DOI in the web search pane that was already open.
Would it help if pressing CTRL+N focused the input field?
Hmmm, yes, I was already using CTRL+N, and focusing the input field would help - but it's not as convenient as it was before. At least I think so; maybe I just need to get used to it. But changing the focus to the input field would definitely help.
@zellerdev
Another small bug I came across in JabRef 3.7dev--snapshot--2016-09-29--latex-free-fields--93e665e: the entry preview is not correctly displayed after performing a search. I think a similar issue has been reported before.
Indeed, similar (but not identical) problems to the one mentioned above appeared in:
https://github.com/JabRef/jabref/issues/1760
https://github.com/JabRef/jabref/issues/1903
(I can create a new bug report if this is not related to the changes made so far in search.)
I think we should add a side panel for creating entries based on a DOI etc.
As far as I can tell, the small bug mentioned for JabRef 3.7dev--snapshot--2016-09-29--latex-free-fields--93e665e (entry preview not correctly displayed after having performed a search - see posts above) appears to be fixed in JabRef 3.7dev--snapshot--2016-09-30--latex-free-fields--0ee1d17.
UPDATE: No, I was too quick - the bug still appears (but it is a bit more difficult to reproduce; I will open a new ticket).
The new ticket can be found here:
https://github.com/JabRef/jabref/issues/2105
This is fixed in #2102. Discussion regarding related issues that have been discovered in this context should be carried on in separate threads.
Thanks to everyone for this quick fix!
Bugs found in this context can be found here:
https://github.com/JabRef/jabref/issues/2098
https://github.com/JabRef/jabref/issues/2105