Antlr4: [Java] Lexer is >10x slower if input is not in ASCII range

Created on 18 Jan 2017 · 31 Comments · Source: antlr/antlr4

Hi,

Currently the ANTLR lexer runtime uses an edge cache which is limited to character code 127.

Java version:

public static final int MAX_DFA_EDGE = 127; // forces unicode to stay in ATN

This default causes a big slowdown if the parsed document contains character codes bigger than 127, which is common for a lot of Latin scripts (from my tests, up to 3x slowdown for Turkish text, >10x slowdown if most of the characters are out of range). I believe this should be a problem for Polish, German, and Scandinavian texts as well.

If this range were extended to 240 (Latin-1), or even better to 368 (Latin Extended-A), fast lexing of mixed-content documents would cover a lot more languages.

I believe the price of increasing this cache size a little is worth the speed gains.

Labels: performance, lexers, improvement

All 31 comments

Increasing the size to 240 would result in a 452-byte increase in the size of every lexer DFA state instance, while the majority of inputs being fed to ANTLR never exercise input values in that range. If you need to increase the size of the edge tables as part of a particular application, you can extend LexerATNSimulator and override the getExistingTargetState and addDFAEdge methods, replacing MAX_DFA_EDGE with the constant value of your choice. You can then configure your Lexer instance to use the custom ATN simulator:

MyLexer lexer = new MyLexer(input);
// Note: the arguments to the constructor match the values seen in the generated MyLexer constructor
lexer.setInterpreter(new CustomLexerATNSimulator(...));
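
For anyone who wants a concrete starting point, here is a minimal sketch of what such a subclass could look like (the class name matches the snippet above, the 368 limit is only an example, and the method bodies approximate rather than reproduce the reference implementation):

import org.antlr.v4.runtime.Lexer;
import org.antlr.v4.runtime.atn.ATN;
import org.antlr.v4.runtime.atn.LexerATNSimulator;
import org.antlr.v4.runtime.atn.PredictionContextCache;
import org.antlr.v4.runtime.dfa.DFA;
import org.antlr.v4.runtime.dfa.DFAState;

public class CustomLexerATNSimulator extends LexerATNSimulator {
    // Cache DFA edges up to Latin Extended-A instead of the default MAX_DFA_EDGE (127).
    private static final int CUSTOM_MAX_DFA_EDGE = 368;

    public CustomLexerATNSimulator(Lexer recog, ATN atn, DFA[] decisionToDFA,
                                   PredictionContextCache sharedContextCache) {
        super(recog, atn, decisionToDFA, sharedContextCache);
    }

    @Override
    protected DFAState getExistingTargetState(DFAState s, int t) {
        // Same lookup as the reference implementation, but over the widened range;
        // the extra length check guards against edge arrays allocated at the old size.
        if (s.edges == null || t < MIN_DFA_EDGE || t > CUSTOM_MAX_DFA_EDGE
                || t - MIN_DFA_EDGE >= s.edges.length) {
            return null;
        }
        return s.edges[t - MIN_DFA_EDGE];
    }

    @Override
    protected void addDFAEdge(DFAState p, int t, DFAState q) {
        if (t < MIN_DFA_EDGE || t > CUSTOM_MAX_DFA_EDGE) {
            return; // outside the cached range; the ATN keeps handling these symbols
        }
        synchronized (p) {
            if (p.edges == null) {
                p.edges = new DFAState[CUSTOM_MAX_DFA_EDGE - MIN_DFA_EDGE + 1];
            }
            if (t - MIN_DFA_EDGE < p.edges.length) {
                p.edges[t - MIN_DFA_EDGE] = q;
            }
        }
    }
}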

OK, it seems like we could do it the way you described, thanks.

Could we consider making DFAState.edges a HashMap or something similar instead of an array?

The only operations we need are get and put, so an unordered map would probably be quite performant.

@bhamiltoncx I'm not sure what the answer here is. For reference, I use an abstract base class with a get and set method as part of a memory optimization in my fork. The performance overhead of my ArrayEdgeMap.get (https://github.com/tunnelvisionlabs/antlr4/blob/master/runtime/Java/src/org/antlr/v4/runtime/dfa/ArrayEdgeMap.java#L49-L56) compared to the lookup in the reference implementation (https://github.com/antlr/antlr4/blob/master/runtime/Java/src/org/antlr/v4/runtime/atn/LexerATNSimulator.java#L249-L253) causes a surprisingly large gap in the lexer performance between the two (more than 10% in overall lexer performance). A HashMap would introduce substantially more complex code on this hot path.

I'd guess the overhead in HashMap would mainly be the boxing and unboxing of int keys to Integer objects.

I have some experience using more performant alternatives, so if you share your perf test methodology, I can try a few Unicode perf tests with various alternatives.

@bhamiltoncx I typically test with TestPerformance. In this case you'd be interested in running over large inputs with RUN_PARSER and COMPUTE_CHECKSUM set to false.

I'd also be looking at YourKit for precise information about the memory overhead from the change.

We have some Map and Set classes for primitive keys and values (simple linear probing). Feel free to use them. Not sure if they are suitable for you, but they are quite fast and memory efficient.

https://github.com/ahmetaa/zemberek-nlp/tree/master/core/src/main/java/zemberek/core/collections

Such as UIntMap or UIntIntMap etc.
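
For readers who haven't seen this style of collection, here is a rough sketch of the linear-probing, primitive-int-keyed idea behind such classes (this is not the actual zemberek code and the class name is made up; it is specialized to DFAState values just to show how Integer boxing is avoided on the hot path):

import java.util.Arrays;
import org.antlr.v4.runtime.dfa.DFAState;

final class UIntToDFAStateMapSketch {
    // Keys must be non-negative (mirroring the "UInt" naming); -1 marks an empty slot.
    private static final int EMPTY_KEY = -1;
    private int[] keys = new int[16];              // power-of-two capacity for cheap masking
    private DFAState[] values = new DFAState[16];
    private int size;

    UIntToDFAStateMapSketch() {
        Arrays.fill(keys, EMPTY_KEY);
    }

    DFAState get(int key) {
        int mask = keys.length - 1;
        for (int i = key & mask; ; i = (i + 1) & mask) {
            if (keys[i] == key) return values[i];
            if (keys[i] == EMPTY_KEY) return null; // reached an empty slot: key is absent
        }
    }

    void put(int key, DFAState value) {
        if (size * 2 >= keys.length) grow();       // keep the load factor below 0.5
        int mask = keys.length - 1;
        for (int i = key & mask; ; i = (i + 1) & mask) {
            if (keys[i] == key) { values[i] = value; return; } // overwrite existing entry
            if (keys[i] == EMPTY_KEY) {
                keys[i] = key;
                values[i] = value;
                size++;
                return;
            }
        }
    }

    private void grow() {
        int[] oldKeys = keys;
        DFAState[] oldValues = values;
        keys = new int[oldKeys.length * 2];
        values = new DFAState[oldKeys.length * 2];
        Arrays.fill(keys, EMPTY_KEY);
        size = 0;
        for (int i = 0; i < oldKeys.length; i++) {
            if (oldKeys[i] != EMPTY_KEY) put(oldKeys[i], oldValues[i]);
        }
    }
}

Because both arrays hold primitives or plain references, a get never allocates or boxes, which is the main reason such maps can compete with a raw array on the lexer's hot path.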

For Java, we'll want to try a few approaches and build micro benchmarks for each. Array access in Java is an order of magnitude faster than a map, due to object boxing/unboxing.

On Wed, Apr 19, 2017 at 12:27 AM Mike Lischke notifications@github.com wrote:

Side note: in the C++ target I moved to a map already, not so much for performance reasons but for RAM usage reasons. With large and complex grammars (especially with complex expression rules) I saw 600MB and more RAM used just for ATN/DFA. This dropped to less than 300MB when I switched to a map. So I believe that the combined step of using a map and increasing the max DFA edge value will, if anything, have a positive overall effect.

Makes sense, maybe using what @ahmetaa offered would help here?

I use a dynamic map approach in my optimized fork as well (custom map type with no boxing designed to store these edges in a concurrent execution environment), also for its memory improvements. For inputs that rarely have code points above 0x7F, there is a clear performance difference between the two (the lexer in this repository is consistently faster) that I have not been able to resolve to date.

FWIW, there could be a potential hybrid solution: if the input symbol is within a certain range (0-126), use a lookup table (or tables, e.g. for the ranges 0-15 or 32-126); otherwise, fall back to a fast int-to-state hashmap with linear probing (something like @ahmetaa suggested).
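
A rough sketch of that hybrid (hypothetical names; java.util.HashMap stands in for a primitive int-keyed map purely to keep the example self-contained, since the point of this thread is that a boxing-free map would be preferable):

import java.util.HashMap;
import java.util.Map;
import org.antlr.v4.runtime.dfa.DFAState;

final class HybridEdgeCache {
    private static final int ASCII_LIMIT = 127;    // symbols 0..126 are served from the array
    private final DFAState[] asciiEdges = new DFAState[ASCII_LIMIT];
    private final Map<Integer, DFAState> wideEdges = new HashMap<>();

    DFAState get(int symbol) {
        if (symbol >= 0 && symbol < ASCII_LIMIT) {
            return asciiEdges[symbol];             // O(1) array hit for ASCII input
        }
        return wideEdges.get(symbol);              // map fallback for everything else
    }

    void put(int symbol, DFAState target) {
        if (symbol >= 0 && symbol < ASCII_LIMIT) {
            asciiEdges[symbol] = target;
        } else {
            wideEdges.put(symbol, target);
        }
    }
}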

I can test it as well if you could point me to your benchmark code. Or I can just run it with some ASCII text and some Turkish, German, etc. text to see the difference.

@sharwell @mike-lischke @bhamiltoncx @ahmetaa
Ok. Here is what I did and my findings:

I hacked UIntMap into ANTLR 4 (from @ahmetaa).
Modified DFAState and LexerATNSimulator.java to use this instead of the lookup table.
Ran tests on a 50MB Turkish text with ~8.8% non-ASCII characters (91.2% ASCII).
Tested different initial hashmap sizes (8, 16, 32).

Here are the numbers etc:
https://gist.github.com/mdakin/1e84b4eeb2e5e5aec33fd5d809f4ced2

The summary is that the HashMap version is >10x faster (10 million tokens per second vs 900K) and possibly uses 3-10x less memory to represent the edges. I only tested the lexer because that is what I am interested in.

This is a super hacky code change, but note that apart from the hashmap classes the actual change is very small (the exposed edges array and the special -1 index caused some trouble, but no biggie):
https://github.com/mdakin/antlr4/commit/3d13dbe534afa30a4675bd72e8a94f4c539ae076

Next I will test the new version with full-ASCII texts. I think it will still have comparable performance with lower memory use, but we will see.

One of my current hypotheses is that edges outside of the currently supported DFA range typically fall into exactly one of the following three categories:

  1. The edge is an EOF
  2. The symbol matches (unpredicated), and the target of the transition is independent of the input symbol
  3. The symbol fails to match (unpredicated)

We can probably ignore EOF safely and talk about the remaining edges. If all of these edges either match or do not match, then a single additional field would be sufficient to cover the entire remaining input space.

Note: This doesn't address all grammars containing transitions outside of the supported range (including your benchmark sample), but it does address all grammars with transitions only in the supported range, even if the input contains symbols outside the range.
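
Purely as an illustration of the "single additional field" idea (this is not the real DFAState; it just shows the shape of the change under the hypothesis above):

import org.antlr.v4.runtime.dfa.DFAState;

final class EdgeCacheWithCatchAll {
    static final int MAX_DFA_EDGE = 127;
    DFAState[] edges = new DFAState[MAX_DFA_EDGE + 1]; // per-symbol cache for 0..127, as today
    DFAState outOfRangeEdge;                           // shared cached outcome for every symbol > 127

    DFAState lookup(int symbol) {
        if (symbol >= 0 && symbol <= MAX_DFA_EDGE) {
            return edges[symbol];
        }
        return outOfRangeEdge; // null means "not cached yet, consult the ATN"
    }
}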

@sharwell
My understanding of how the ANTLR runtime works is still very limited; however, I had a closer look at the relevant bits of the code, and I want to share a few observations on this matter and propose a solution.

The current way of handling edges in DFAState objects is quite problematic:
a) The "edges" to other states are not encapsulated properly; it is a raw array and it is managed by other classes. This makes it very fragile, prone to accidents, and difficult to replace with a better abstraction.

b) Using an array of states in DFAState is quite wasteful (as a cache in the lexer's case, for transitions on characters 0-127). Even if the DFA state object has a single edge, it creates this array with 128 references (512 bytes).

c) As mentioned in other messages in this issue, the performance of the current way of handling DFAState transitions on characters >127 is terrible. I guess if my test grammar were for Japanese or Arabic instead of Turkish, the performance would be 100x worse, not 10x.

d) Having a special -1 index, with the array values shifted, further complicates the logic.

My solution is to remove the existing array that represents edges and replace it with a fast, compact hashmap. I believe the performance penalty is minuscule for lexer grammars that deal only with ASCII characters, and its memory footprint is much more compact, ~60-80 bytes vs 512 bytes. There are also no special -1 or other cases, so it will be simpler to read as well.

I already have a clean int-map implementation, but it will probably take some time to replace the existing structure. I also added synchronization; it is still 10x faster for a 90%-10% mix of ASCII/non-ASCII input for my test grammar.

You can see the current diff on my branch here:
https://github.com/antlr/antlr4/compare/master...mdakin:master

Another note: maybe instead of a single DFAState class, it would have been better to have different implementations for the lexer and the parser to represent states. But there is probably a reason for this; as I said, I don't know the theory.

I can replace DFAState edges with my map and change all uses, but it will probably take some time. I can create a separate issue for this. What do you think?

These are exactly the thoughts I had when I examined these details in the C++ target, where I replaced the array with a hash map already. Memory consumption dropped from >600MB to around 300MB in one use case with this map in place. So it's a big difference. I also removed the need for a negative special index and the shifting.

@mike-lischke
Cool! I will take a look at the C++ version to see what other things to consider while replacing the current implementation.

On another note, I think a special map would almost certainly be faster and more compact than std stuff.
My map for Java version: https://github.com/antlr/antlr4/blob/6fdb0a2960d77a8051af3609bdc9217cde74de48/runtime/Java/src/org/antlr/v4/runtime/dfa/SimpleIntMap.java

@mdakin Keep in mind my comments from https://github.com/antlr/antlr4/issues/1613#issuecomment-283552105. Many of the issues you mentioned were solved as part of my work there, but so far the decision was to not merge these changes back into the reference implementation.

@sharwell
I see. If we somehow solve the performance issues, do you think we can merge it into the reference implementation?
How much degradation is acceptable? Is a 5-10% performance loss, in exchange for >10x performance in the >127 cases plus the memory gain, OK? If so, I believe I can do this.

@mdakin The bar for merging into the reference implementation is more about the balance between simplicity and the number of affected users. For example, there are changes in my fork that make a 10,000:1 difference (improvement) in memory usage or speed for some real-world use cases. However, if only 1 or 2 users are impacted by this, it's not worth the added complexity in the reference implementation, since it makes it much harder for entry-level researchers to understand the ALL(*) algorithm.

@sharwell Maybe I am missing the point, but if the issue is about code complexity, the proposed solution here ( https://github.com/antlr/antlr4/compare/master...mdakin:master ) seems cleaner and safer to me. Most developers do not need to understand the inner workings of a special hash table implementation as long as they know it is a hash table of ints to DFAStates.

@sharwell
As @ahmetaa mentioned, I think the proposed change makes the runtime code saner and uses less memory for all users, and if you point me to a benchmark that represents the general usage to beat, I am game for that as well.

@mdakin It's not me you have to convince - I have no decision-making ability here. I was just letting you know what to look for when making the proposal to @parrt.

@sharwell I see, who decides on things like this? I can just throw a PR and see if it sticks, but I don't want to waste my or anyone else's time.

OK, I see that you updated the comment. Thanks Sam, I will try to reach @parrt (hopefully with more numbers and code next week).

Interesting discussion. It'd be nice to at least have an option for tuning between memory and speed for different use cases.

@sharwell @parrt
I made a quick test to see how much performance drops if we move to a hashmap.
I converted my Turkish lexer grammar to work only on ASCII input, converted the 50MB Turkish text file to contain only ASCII characters, and tokenized this full-ASCII text.

The initial numbers (the results fluctuate a bit; also, if I run it in a loop, the JVM seems to improve things a bit):
Hashmap version : ~12M tokens/s
Array lookup version (BASE): ~15M tokens/s

My initial guess of the hashmap version being 5-10% slower than the lookup seems a bit too optimistic; it is more like ~20-25% slower.

For non-ASCII input, of course, the hashmap version is way faster. I did not measure the memory gains for the hashmap version yet.

ok, good to know

@parrt, @sharwell
Sorry for spamming, but I realized I missed a few optimizations in the hashmap versions; now they are very close, ~5-7%. Next I will try different grammars and see how they fare. Hopefully I will be back with a clean PR and more benchmark results in a few weeks. Thanks.

OK. Note that I often read PRs and then do my own thing. Also, I won't have ANTLR time until mid-to-late Oct 2017.
