Antlr4: [Java] Lexer is >10x slower if input is not in ASCII range

Created on 18 Jan 2017 · 31 Comments · Source: antlr/antlr4

Hi,

Currently the ANTLR lexer runtime uses an edge cache which is limited to character code 127.

Java version:

public static final int MAX_DFA_EDGE = 127; // forces unicode to stay in ATN

This default causes a big slowdown if the parsed document contains character codes bigger than 127, which is common for a lot of Latin scripts (from my tests, up to 3x slowdown for Turkish text, >10x slowdown if most of the characters are out of range). I believe this should be a problem for Polish, German, and Scandinavian texts as well.

If this range were extended to 240 (Latin-1), or even better to 368 (Latin Extended-A), fast lexing of mixed-content documents would cover a lot more languages.

I believe the price of increasing this cache size a little is worth the speed gains.

Labels: performance, lexers, improvement

All 31 comments

Increasing the size to 240 would result in a 452-byte increase in the size of every lexer DFA state instance, while the majority of inputs being fed to ANTLR never exercise input values in that range. If you need to increase the size of the edge tables as part of a particular application, you can extend LexerATNSimulator and override the getExistingTargetState and addDFAEdge methods, replacing MAX_DFA_EDGE with the constant value of your choice. You can then configure your Lexer instance to use the custom ATN simulator:

MyLexer lexer = new MyLexer(input);
// Note: the arguments to the constructor match the values seen in the generated MyLexer constructor
lexer.setInterpreter(new CustomLexerATNSimulator(...));
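
For anyone who wants a concrete starting point, here is a minimal sketch of what such a subclass could look like (the class name matches the snippet above, the 368 limit is only an example, and the method bodies approximate rather than reproduce the reference implementation):

import org.antlr.v4.runtime.Lexer;
import org.antlr.v4.runtime.atn.ATN;
import org.antlr.v4.runtime.atn.LexerATNSimulator;
import org.antlr.v4.runtime.atn.PredictionContextCache;
import org.antlr.v4.runtime.dfa.DFA;
import org.antlr.v4.runtime.dfa.DFAState;

public class CustomLexerATNSimulator extends LexerATNSimulator {
    // Cache DFA edges up to Latin Extended-A instead of the default MAX_DFA_EDGE (127).
    private static final int CUSTOM_MAX_DFA_EDGE = 368;

    public CustomLexerATNSimulator(Lexer recog, ATN atn, DFA[] decisionToDFA,
                                   PredictionContextCache sharedContextCache) {
        super(recog, atn, decisionToDFA, sharedContextCache);
    }

    @Override
    protected DFAState getExistingTargetState(DFAState s, int t) {
        // Same lookup as the reference implementation, but over the widened range;
        // the extra length check guards against edge arrays allocated at the old size.
        if (s.edges == null || t < MIN_DFA_EDGE || t > CUSTOM_MAX_DFA_EDGE
                || t - MIN_DFA_EDGE >= s.edges.length) {
            return null;
        }
        return s.edges[t - MIN_DFA_EDGE];
    }

    @Override
    protected void addDFAEdge(DFAState p, int t, DFAState q) {
        if (t < MIN_DFA_EDGE || t > CUSTOM_MAX_DFA_EDGE) {
            return; // outside the cached range; the ATN keeps handling these symbols
        }
        synchronized (p) {
            if (p.edges == null) {
                p.edges = new DFAState[CUSTOM_MAX_DFA_EDGE - MIN_DFA_EDGE + 1];
            }
            if (t - MIN_DFA_EDGE < p.edges.length) {
                p.edges[t - MIN_DFA_EDGE] = q;
            }
        }
    }
}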

OK, it seems like we could do it the way you described, thanks.

Could we consider making DFAState.edges a HashMap or something similar instead of an array?

The only operations we need are get and put, so an unordered map would probably be quite performant.

@bhamiltoncx I'm not sure what the answer here is. For reference, I use an abstract base class with a get and set method as part of a memory optimization in my fork. The performance overhead of my ArrayEdgeMap.get (https://github.com/tunnelvisionlabs/antlr4/blob/master/runtime/Java/src/org/antlr/v4/runtime/dfa/ArrayEdgeMap.java#L49-L56) compared to the lookup in the reference implementation (https://github.com/antlr/antlr4/blob/master/runtime/Java/src/org/antlr/v4/runtime/atn/LexerATNSimulator.java#L249-L253) causes a surprisingly large gap in the lexer performance between the two (more than 10% in overall lexer performance). A HashMap would introduce substantially more complex code on this hot path.

I'd guess the overhead in HashMap would mainly be the boxing and unboxing of int keys to Integer objects.

I have some experience using more performant alternatives, so if you share your perf test methodology, I can try a few Unicode perf tests with various alternatives.

@bhamiltoncx I typically test with TestPerformance. In this case you'd be interested in running over large inputs with RUN_PARSER and COMPUTE_CHECKSUM set to false.

I'd also be looking at YourKit for precise information about the memory overhead from the change.

We have some Map and Set classes for primitive keys and values (simple linear probing). Feel free to use them. Not sure if they are suitable for you, but they are quite fast and memory efficient.

https://github.com/ahmetaa/zemberek-nlp/tree/master/core/src/main/java/zemberek/core/collections

Such as UIntMap or UIntIntMap etc.
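
For readers who haven't seen this style of collection, here is a rough sketch of the linear-probing, primitive-int-keyed idea behind such classes (this is not the actual zemberek code and the class name is made up; it is specialized to DFAState values just to show how Integer boxing is avoided on the hot path):

import java.util.Arrays;
import org.antlr.v4.runtime.dfa.DFAState;

final class UIntToDFAStateMapSketch {
    // Keys must be non-negative (mirroring the "UInt" naming); -1 marks an empty slot.
    private static final int EMPTY_KEY = -1;
    private int[] keys = new int[16];              // power-of-two capacity for cheap masking
    private DFAState[] values = new DFAState[16];
    private int size;

    UIntToDFAStateMapSketch() {
        Arrays.fill(keys, EMPTY_KEY);
    }

    DFAState get(int key) {
        int mask = keys.length - 1;
        for (int i = key & mask; ; i = (i + 1) & mask) {
            if (keys[i] == key) return values[i];
            if (keys[i] == EMPTY_KEY) return null; // reached an empty slot: key is absent
        }
    }

    void put(int key, DFAState value) {
        if (size * 2 >= keys.length) grow();       // keep the load factor below 0.5
        int mask = keys.length - 1;
        for (int i = key & mask; ; i = (i + 1) & mask) {
            if (keys[i] == key) { values[i] = value; return; } // overwrite existing entry
            if (keys[i] == EMPTY_KEY) {
                keys[i] = key;
                values[i] = value;
                size++;
                return;
            }
        }
    }

    private void grow() {
        int[] oldKeys = keys;
        DFAState[] oldValues = values;
        keys = new int[oldKeys.length * 2];
        values = new DFAState[oldKeys.length * 2];
        Arrays.fill(keys, EMPTY_KEY);
        size = 0;
        for (int i = 0; i < oldKeys.length; i++) {
            if (oldKeys[i] != EMPTY_KEY) put(oldKeys[i], oldValues[i]);
        }
    }
}

Because both arrays hold primitives or plain references, a get never allocates or boxes, which is the main reason such maps can compete with a raw array on the lexer's hot path.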

For Java, we'll want to try a few approaches and build micro benchmarks for each. Array access in Java is an order of magnitude faster than a map, due to object boxing/unboxing.

On Wed, Apr 19, 2017 at 12:27 AM Mike Lischke notifications@github.com wrote:

Side note: in the C++ target I moved to a map already, not so much for performance reasons but for RAM usage reasons. With large and complex grammars (especially with complex expression rules) I saw 600MB and more RAM used just for ATN/DFA. This dropped to less than 300MB when I switched to a map. So I believe that the combined step of using a map and increasing the max DFA edge value will, if anything, have a positive overall effect.

Makes sense, maybe using what @ahmetaa offered would help here?

I use a dynamic map approach in my optimized fork as well (custom map type with no boxing designed to store these edges in a concurrent execution environment), also for its memory improvements. For inputs that rarely have code points above 0x7F, there is a clear performance difference between the two (the lexer in this repository is consistently faster) that I have not been able to resolve to date.

FWIW, there could be a potential hybrid solution: if the input symbol is within a certain range (0-126), use a lookup table (or tables, e.g. for the ranges 0-15 or 32-126); otherwise, fall back to a fast int-to-state hashmap with linear probing (something like @ahmetaa suggested).
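
A rough sketch of that hybrid (hypothetical names; java.util.HashMap stands in for a primitive int-keyed map purely to keep the example self-contained, since the point of this thread is that a boxing-free map would be preferable):

import java.util.HashMap;
import java.util.Map;
import org.antlr.v4.runtime.dfa.DFAState;

final class HybridEdgeCache {
    private static final int ASCII_LIMIT = 127;    // symbols 0..126 are served from the array
    private final DFAState[] asciiEdges = new DFAState[ASCII_LIMIT];
    private final Map<Integer, DFAState> wideEdges = new HashMap<>();

    DFAState get(int symbol) {
        if (symbol >= 0 && symbol < ASCII_LIMIT) {
            return asciiEdges[symbol];             // O(1) array hit for ASCII input
        }
        return wideEdges.get(symbol);              // map fallback for everything else
    }

    void put(int symbol, DFAState target) {
        if (symbol >= 0 && symbol < ASCII_LIMIT) {
            asciiEdges[symbol] = target;
        } else {
            wideEdges.put(symbol, target);
        }
    }
}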

I can test it as well if you could point me to your benchmark code. Or I can just run it with some ASCII text and some Turkish, German, etc. text to see the difference.

@sharwell @mike-lischke @bhamiltoncx @ahmetaa
Ok. Here is what I did and my findings:

I hacked UIntMap into ANTLR 4 (from @ahmetaa).
Modified DFAState and LexerATNSimulator.java to use this instead of the lookup table.
Ran tests on a 50MB Turkish text with ~8.8% non-ASCII characters (91.2% ASCII).
Tested different initial hashmap sizes (8, 16, 32).

Here are the numbers etc:
https://gist.github.com/mdakin/1e84b4eeb2e5e5aec33fd5d809f4ced2

The summary is that the HashMap version is >10x faster (10 million tokens per second vs 900K) and possibly uses 3-10x less memory to represent the edges. I only tested the lexer because that is what I am interested in.

This is a super hacky code change, but note that apart from the hashmap classes the actual change is very small (the exposed edges array and the special -1 index caused some trouble, but no biggie):
https://github.com/mdakin/antlr4/commit/3d13dbe534afa30a4675bd72e8a94f4c539ae076

Next I will test the new version with full-ASCII texts. I think it will still have comparable performance with lower memory use, but we will see.

One of my current hypotheses is that edges outside of the currently supported DFA range typically fall into exactly one of the following three categories:

  1. The edge is an EOF
  2. The symbol matches (unpredicated), and the target of the transition is independent of the input symbol
  3. The symbol fails to match (unpredicated)

We can probably ignore EOF safely and talk about the remaining edges. If all of these edges either match or do not match, then a single additional field would be sufficient to cover the entire remaining input space.

Note: This doesn't address all grammars containing transitions outside of the supported range (including your benchmark sample), but it does address all grammars with transitions only in the supported range, even if the input contains symbols outside the range.
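
Purely as an illustration of the "single additional field" idea (this is not the real DFAState; it just shows the shape of the change under the hypothesis above):

import org.antlr.v4.runtime.dfa.DFAState;

final class EdgeCacheWithCatchAll {
    static final int MAX_DFA_EDGE = 127;
    DFAState[] edges = new DFAState[MAX_DFA_EDGE + 1]; // per-symbol cache for 0..127, as today
    DFAState outOfRangeEdge;                           // shared cached outcome for every symbol > 127

    DFAState lookup(int symbol) {
        if (symbol >= 0 && symbol <= MAX_DFA_EDGE) {
            return edges[symbol];
        }
        return outOfRangeEdge; // null means "not cached yet, consult the ATN"
    }
}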

@sharwell
My understanding of how the ANTLR runtime works is still very limited; however, I had a closer look at the relevant bits of the code, and I want to share a few observations on this matter and propose a solution.

The current way of handling edges in DFAState objects is quite problematic:
a) The "edges" to other states are not encapsulated properly; it is a raw array and it is managed by other classes. This makes it very fragile, prone to accidents, and difficult to replace with a better abstraction.

b) Using an array of states in DFAState is quite wasteful (as a cache in the lexer's case, for transitions on characters 0-127). Even if the DFA state object has a single edge, it creates this array with 128 references (512 bytes).

c) As mentioned in other messages in this issue, the performance of the current way of handling DFAState transitions on characters >127 is terrible. I guess if my test grammar were for Japanese or Arabic instead of Turkish, the performance would be 100x worse, not 10x.

d) Having a special -1 index, with the array values shifted, further complicates the logic.

My solution is to remove the existing array that represents edges and replace it with a fast, compact hashmap. I believe the performance penalty is minuscule for lexer grammars that deal only with ASCII characters, and its memory footprint is much more compact, ~60-80 bytes vs 512 bytes. There are also no special -1 or other cases, so it will be simpler to read as well.

I already have a clean int-map implementation, but it will probably take some time to replace the existing structure. I also added synchronization; it is still 10x faster for a 90%-10% mix of ASCII/non-ASCII input for my test grammar.

You can see the current diff on my branch here:
https://github.com/antlr/antlr4/compare/master...mdakin:master

Another note: maybe instead of a single DFAState class, it would have been better to have different implementations for the lexer and the parser to represent states. But there is probably a reason for this; as I said, I don't know the theory.

I can replace DFAState edges with my map and change all uses, but it will probably take some time. I can create a separate issue for this. What do you think?

These are exactly the thoughts I had when I examined these details in the C++ target, where I replaced the array with a hash map already. Memory consumption dropped from >600MB to around 300MB in one use case with this map in place. So it's a big difference. I also removed the need for a negative special index and the shifting.

@mike-lischke
Cool! I will take a look at the C++ version to see what other things to consider while replacing the current implementation.

On another note, I think a special map would almost certainly be faster and more compact than std stuff.
My map for Java version: https://github.com/antlr/antlr4/blob/6fdb0a2960d77a8051af3609bdc9217cde74de48/runtime/Java/src/org/antlr/v4/runtime/dfa/SimpleIntMap.java

@mdakin Keep in mind my comments from https://github.com/antlr/antlr4/issues/1613#issuecomment-283552105. Many of the issues you mentioned were solved as part of my work there, but so far the decision was to not merge these changes back into the reference implementation.

@sharwell
I see. If we somehow solve the performance issues, do you think we can merge it into the reference implementation?
How much degradation is acceptable? Is a 5-10% performance loss, in exchange for >10x performance in the >127 cases plus the memory gain, OK? If so, I believe I can do this.

@mdakin The bar for merging into the reference implementation is more about the balance between simplicity and the number of affected users. For example, there are changes in my fork that make a 10,000:1 difference (improvement) in memory usage or speed for some real-world use cases. However, if only 1 or 2 users are impacted by this, it's not worth the added complexity in the reference implementation, since it makes it much harder for entry-level researchers to understand the ALL(*) algorithm.

@sharwell Maybe I am missing the point, but if the issue is about code complexity, the proposed solution here ( https://github.com/antlr/antlr4/compare/master...mdakin:master ) seems cleaner and safer to me. Most developers do not need to understand the inner workings of a special hash table implementation as long as they know it is a hash table of ints to DFAStates.

@sharwell
As @ahmetaa mentioned, I think the proposed change makes the runtime code saner and uses less memory for all users, and if you point me to a benchmark that represents the general usage to beat, I am game for that as well.

@mdakin It's not me you have to convince - I have no decision-making ability here. I was just letting you know what to look for when making the proposal to @parrt.

@sharwell I see, who decides on things like this? I can just throw a PR and see if it sticks, but I don't want to waste my or anyone else's time.

OK, I see that you updated the comment. Thanks Sam, I will try to reach @parrt (hopefully with more numbers and code next week).

Interesting discussion. It'd be nice to at least have an option for tuning between memory and speed for different use cases.

@sharwell @parrt
I made a quick test to see how much performance drops if we move to a hashmap.
I converted my Turkish lexer grammar to work only on ASCII input, converted the 50MB Turkish text file to contain only ASCII characters, and tokenized this full-ASCII text.

The initial numbers (the results fluctuate a bit; also, if I run it in a loop, the JVM seems to improve things a bit):
Hashmap version : ~12M tokens/s
Array lookup version (BASE): ~15M tokens/s

My initial guess of the hashmap version being 5-10% slower than the lookup seems a bit too optimistic; it is more like ~20-25% slower.

For non-ASCII input, of course, the hashmap version is way faster. I did not measure the memory gains for the hashmap version yet.

ok, good to know

@parrt, @sharwell
Sorry for spamming, but I realized I missed a few optimizations in the hashmap versions; now they are very close, ~5-7%. Next I will try different grammars and see how they fare. Hopefully I will be back with a clean PR and more benchmark results in a few weeks. Thanks.

OK. Note that I often read PRs and then do my own thing. Also, I won't have ANTLR time until mid-to-late Oct 2017.
