Antlr4: Performance tips

Created on 21 Dec 2016 · 25 comments · Source: antlr/antlr4

We are planning to replace our hand-written parser with ANTLR4 in production for parsing our graph database language spec. The ANTLR grammar looks elegant and precise.
Our language spec is a variant of GraphQL. The user queries that we will have to parse look like:

// List of directors with whom Tom Hanks has worked
{
  debug(_xid_: m.0bxtg) {
    type.object.name.en
    film.actor.film {
      film.performance.film {
        film.film.directed_by {
          type.object.name.en
        }
      }
    }
  }
}

We started benchmarking from the simplest subset grammar.
Benchmarks :

dgraph/antlr4go/graphqlpm$ gotb -test.run=XXX -benchtime=2s -v
BenchmarkQuery/q1-4                 5000       1221967 ns/op
BenchmarkQuery/q2-4                 2000       1710850 ns/op
BenchmarkQuery/q3-4                 3000       1230518 ns/op
BenchmarkQuery/q4-4                 3000       1568564 ns/op
BenchmarkQuery/q5-4                 2000       1803392 ns/op
PASS
ok      github.com/dgraph-io/dgraph/antlr4go/graphqlpm    22.179s

We expected these numbers to be under 0.05 ms. They are currently around 1.5 ms.

Here are comparisons between the handwritten parser and the ANTLR Go parser over practical queries:
Benchmarks :

query$ gotb -test.run=XXX -benchtime=2s -v
BenchmarkQueryParse/spielberg:handwitten:-4               100000         46918 ns/op
BenchmarkQueryParse/spielberg:antlr:-4                      2000       1741364 ns/op
BenchmarkQueryParse/tomhanks:handwitten:-4                100000         25982 ns/op
BenchmarkQueryParse/tomhanks:antlr:-4                       2000       1654579 ns/op
BenchmarkQueryParse/nestedquery:handwritten:-4             30000         73053 ns/op
BenchmarkQueryParse/nestedquery:antlr:-4                    1000       3385005 ns/op
PASS
ok      github.com/dgraph-io/dgraph/query    21.922s

Antlr4 is around 40x slower.

I have also tried SLL parsing; it did not help either.

Q: I did not get any SLL parsing failures over multiple inputs. For what kind of query and grammar can SLL fail, forcing a fallback to full LL? I am asking this to confirm whether I am doing something wrong.

Can I get some performance tips?
Also, how can we benchmark the lexer and parser separately?

Labels: go, question


All 25 comments

GitHub is not a place for performance and other questions.

Can I get some performance tips?

Try to reduce the number of rules. For example, you can use the token 'query' directly in the operationDefinition rule instead of the additional operationType rule:

operationType
   : 'query'
   ;

But I'm not sure that you'll get a significant speedup with such small optimizations. At first glance, this grammar looks good.

Also, how can we benchmark the lexer and parser separately?

Within Java you can use the following code:

String code = readFile(args[0]);
ANTLRInputStream codeStream = new ANTLRInputStream(code);
SeparatedLexer lexer = new SeparatedLexer(codeStream);

// Start lexer benchmark
List<? extends Token> tokens = lexer.getAllTokens();
// End lexer benchmark

ListTokenSource tokensSource = new ListTokenSource(tokens);
CommonTokenStream tokensStream = new CommonTokenStream(tokensSource);
SeparatedParser parser = new SeparatedParser(tokensStream);

// Start parser benchmark
ParserRuleContext ast = parser.rule1();
// End parser benchmark

String stringTree = ast.toStringTree(parser);
System.out.print("Tree " + stringTree);

Apologies for asking a performance question; I was not aware of that policy.
Is the ANTLR Google group the only place to ask these kinds of questions?

And thanks for giving an example of benchmarking the lexer and parser separately.

with only lexing :

func runAntlrParser(q string, b *testing.B) {
    b.ResetTimer()
    for i := 0; i < b.N; i++ {
        input := antlr.NewInputStream(q)
        lexer := parser.NewGraphQLPMLexer(input)
        // for only lexer benchmark
        t := lexer.NextToken()
        for t.GetTokenType() != antlr.TokenEOF {
            t = lexer.NextToken()
        }
    }
}

I found that lexing is taking most of the time.

Benchmark with only the lexer for ANTLR:

/query$ gotb -test.run=XXX -v  -benchtime=5s
BenchmarkQueryParse/spielberg:handwitten:-4               200000         45724 ns/op
BenchmarkQueryParse/spielberg:antlr:-4                      5000       1468218 ns/op
BenchmarkQueryParse/tomhanks:handwitten:-4                500000         28649 ns/op
BenchmarkQueryParse/tomhanks:antlr:-4                       5000       1538988 ns/op
BenchmarkQueryParse/nestedquery:handwritten:-4            100000         80210 ns/op
BenchmarkQueryParse/nestedquery:antlr:-4                    5000       3029668 ns/op
PASS
ok      github.com/dgraph-io/dgraph/query   63.546s

With both lexer and parser:

ashishnegi@ashish:~/work/golang/src/github.com/dgraph-io/dgraph/query$ gotb -test.run=XXX -v  -benchtime=5s
BenchmarkQueryParse/spielberg:handwitten:-4               300000         47772 ns/op
BenchmarkQueryParse/spielberg:antlr:-4                      3000       1868297 ns/op
BenchmarkQueryParse/tomhanks:handwitten:-4                500000         27980 ns/op
BenchmarkQueryParse/tomhanks:antlr:-4                       5000       1616518 ns/op
BenchmarkQueryParse/nestedquery:handwritten:-4            100000         74961 ns/op
BenchmarkQueryParse/nestedquery:antlr:-4                    2000       3312977 ns/op
PASS
ok      github.com/dgraph-io/dgraph/query   58.056s

I think it is a well-known issue that lexing takes most of the time.
Where can I read more about bringing down the lexing time?

@KvanTTT I personally think that it is totally appropriate to ask this type of performance question here. This does not look like the type of question that could be answered by reading a FAQ. Ashish has reproducible code sitting in a public repository.
Performance is a bug like any other.

@KvanTTT I agree with @millergarym 100% here. Not only are performance issues bugs; Google Groups is a useless piece of crap in which it's impossible not to lose the formatting of the information @ashishnegi so meticulously put into tables.

So either accept GitHub issues like these, or use something that's not utterly broken as a forum (Discourse works)

@ashishnegi Can you try the following to rule out some more things?

  1. Add an explicit EOF to the end of your entry rule (you should always have this):

    document
       : definition EOF
       ;
    
  2. Add the following rule to the very end of your grammar to make sure your lexer handles all input successfully:

    ErrorChar
       : .
       ;
    

@sharwell thanks for looking into this.
Here are the benchmarks after applying the changes.

lexing only benchmarks for antlr4 golang target
;; after adding EOF and ErrorChar
/query$ gotb -test.run=XXX -v  -benchtime=5s
BenchmarkQueryParse/spielberg:handwitten:-4               200000         55461 ns/op
BenchmarkQueryParse/spielberg:antlr:-4                      3000       1887333 ns/op
BenchmarkQueryParse/tomhanks:handwitten:-4                300000         32664 ns/op
BenchmarkQueryParse/tomhanks:antlr:-4                       5000       2058863 ns/op
BenchmarkQueryParse/nestedquery:handwritten:-4            100000         74588 ns/op
BenchmarkQueryParse/nestedquery:antlr:-4                    3000       3573973 ns/op
PASS
ok      github.com/dgraph-io/dgraph/query   57.434s

Here is the result of running grun GraphQLPM document -trace -diagnostics -tokens -SLL -tree on java target if that can help.

I "benchmarked" one query nestedquery (which is taking most of the time) with Cpp target and the numbers are different :

Only lexing on nestedquery query for 1 million iterations => average time is ~42 microseconds.
Same query in golang target takes on average ~3.3 milliseconds for lexing.

/graphqlpmcpp$ time ./graphqlexe 

real    0m42.380s
user    0m42.368s
sys 0m0.000s

and with lexing and parsing : ~107 microseconds on average over 1 million iterations.

/graphqlpmcpp$ time ./graphqlexe 

real    1m46.194s
user    1m46.172s
sys 0m0.004s

If someone could proofread the Cpp benchmark, it would be a useful double check.
Is it possible that the golang target does not have caching support for lexers and parsers yet?
Actually, it has caching.

It could be as simple as a hash function for ATNConfigSet that is not optimal for our usage. @sharwell had to put in a murmur hash to get java speed.
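For reference, here is a self-contained sketch of the MurmurHash3 (x86, 32-bit) algorithm mentioned above. This is an illustrative stand-alone version operating on byte slices; the actual runtime fix hashes ATNConfigSet fields incrementally rather than serialized bytes.

```go
package main

import (
	"encoding/binary"
	"fmt"
	"math/bits"
)

// murmur3 is a stand-alone MurmurHash3 (x86, 32-bit) implementation.
func murmur3(data []byte, seed uint32) uint32 {
	const c1, c2 = 0xcc9e2d51, 0x1b873593
	h := seed
	n := len(data)
	// Body: mix 4-byte chunks.
	for i := 0; i+4 <= n; i += 4 {
		k := binary.LittleEndian.Uint32(data[i:])
		k *= c1
		k = bits.RotateLeft32(k, 15)
		k *= c2
		h ^= k
		h = bits.RotateLeft32(h, 13)
		h = h*5 + 0xe6546b64
	}
	// Tail: up to 3 remaining bytes.
	var k uint32
	switch tail := data[n&^3:]; len(tail) {
	case 3:
		k ^= uint32(tail[2]) << 16
		fallthrough
	case 2:
		k ^= uint32(tail[1]) << 8
		fallthrough
	case 1:
		k ^= uint32(tail[0])
		k *= c1
		k = bits.RotateLeft32(k, 15)
		k *= c2
		h ^= k
	}
	// Finalization: avalanche the bits.
	h ^= uint32(n)
	h ^= h >> 16
	h *= 0x85ebca6b
	h ^= h >> 13
	h *= 0xc2b2ae35
	h ^= h >> 16
	return h
}

func main() {
	fmt.Println(murmur3(nil, 0)) // 0: empty input with seed 0 hashes to 0
}
```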

@parrt That would explain a case of slow behavior on the first pass. If the lexer remains slow on the exact same input after the first pass, it suggests that edges are getting dropped from the DFA somehow. Of course this always happens in cases where a semantic predicate is crossed, but there are no semantic predicates in the example grammar.

Hmm...yeah, I'd have to compare the go/java runtime code.

@sharwell (cc: @parrt, @pboyer ),
Sam, I really need to pay more attention to your comments. Your comment above only made sense to me this morning. That was after spending more time than I should have improving the performance by 50% with murmur hash and then hitting a wall.

Before going into what I found, here is the result.
old = current antlr4 Go runtime (no murmur hash)
new = minor change to the generated lexer (still no murmur)

benchmark            old ns/op     new ns/op     delta
BenchmarkLexer-4     99298         3188          -96.79%

benchmark            old allocs     new allocs     delta
BenchmarkLexer-4     457            22             -95.19%

benchmark            old bytes     new bytes     delta
BenchmarkLexer-4     16132         1200          -92.56%

The Go runtime is a naive port of the Java runtime. It is a good start, but in many places it leaves a bit to be desired.

The hashing code was an example of this. The port basically used String() string (the equivalent of Java's String toString()) and then hashed the resulting string. The result was a large number of memory allocations.
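To illustrate the difference (the type and field names below are invented, not the runtime's actual code): hashing through a formatted string allocates on every call, while mixing the fields directly does not.

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// config stands in for an ATN configuration; the fields are illustrative.
type config struct{ state, alt int }

// stringHash mimics the ported code: render the value to a string,
// then hash the string. The formatting allocates on every invocation.
func stringHash(c config) uint32 {
	h := fnv.New32a()
	fmt.Fprintf(h, "%d/%d", c.state, c.alt) // allocates a temporary string
	return h.Sum32()
}

// fieldHash mixes the fields directly with FNV-1a, allocation-free.
// (The actual fix used murmur; FNV just keeps this sketch short.)
func fieldHash(c config) uint32 {
	const offset, prime = 2166136261, 16777619
	h := uint32(offset)
	h = (h ^ uint32(c.state)) * prime
	h = (h ^ uint32(c.alt)) * prime
	return h
}

func main() {
	c := config{state: 7, alt: 2}
	fmt.Printf("string: %08x  field: %08x\n", stringHash(c), fieldHash(c))
}
```

Running both under `go test -bench . -benchmem` would show the allocation gap directly.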

The "real" issue in the lexer is that static fields in the Java runtime are ported to non-static fields in Go, so initialization is done per new lexer. This is somewhat understandable, as Go doesn't have an explicit static keyword. To achieve static semantics in Go, the fields need to live at the package level and be initialized in a func init() { .. } (the equivalent of a Java static initializer).

This will be a minor change in the template. I'll tidy up my murmur hash code, make this change, and hopefully turn it into a PR next week.
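The change described above can be sketched like this (all names are invented for illustration; the real fix lives in the code-generation template):

```go
package main

import "fmt"

// Before the fix: every NewLexer call rebuilt its ATN/DFA tables.
// After: the tables live at package level and are built exactly once,
// mirroring Java's static-initializer semantics.

type atnTables struct{ serialized []int }

func buildTables() *atnTables {
	t := &atnTables{serialized: make([]int, 1024)}
	for i := range t.serialized {
		t.serialized[i] = i // stands in for expensive deserialization
	}
	return t
}

// Package-level variable, initialized once at program start
// (an explicit func init() { ... } would work the same way).
var sharedTables = buildTables()

type Lexer struct{ tables *atnTables }

// NewLexer now just takes a reference to the shared tables,
// paying no per-instance construction cost.
func NewLexer() *Lexer { return &Lexer{tables: sharedTables} }

func main() {
	a, b := NewLexer(), NewLexer()
	fmt.Println(a.tables == b.tables) // true: one shared table, zero rebuilds
}
```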

It would be nice to look for all similar issues in the template and runtime. I'll probably open a new issue for this, consolidating this and https://github.com/antlr/antlr4/issues/1705.

Cheers
Gary

Uh, wow!


@millergarym Do you have this in a branch? How can I help? I'll have some free time this weekend and it would be good to take a look.

@pboyer Just saw your message. I don't have time right now, so I actually haven't tested this. It should work, as it is the same changes I made to the generated code.
https://github.com/wxio/antlr4/tree/static2packagevar

@ashishnegi can you try latest HEAD? Just incorporated Go speed improvements.

@parrt I benchmarked again on master of antlr4 and with antlr4-4.7.1-*.jar.

➜  query git:(8ccf5cb) ✗ got -v -bench=QueryParse -run=XXXX    
BenchmarkQueryParse/spielberg:handwitten:-4                50000         51960 ns/op
BenchmarkQueryParse/spielberg:antlr:-4                     10000        236041 ns/op
BenchmarkQueryParse/tomhanks:handwitten:-4                 50000         32912 ns/op
BenchmarkQueryParse/tomhanks:antlr:-4                      10000        148723 ns/op
BenchmarkQueryParse/nestedquery:handwritten:-4             20000         86670 ns/op
BenchmarkQueryParse/nestedquery:antlr:-4                    5000        406756 ns/op
PASS
ok      github.com/dgraph-io/dgraph/query   13.449s

➜  query git:(8ccf5cb) ✗ which antlr4
antlr4: aliased to java -Xmx500M -cp "/usr/local/lib/antlr4-4.7.1-SNAPSHOT-complete.jar:$CLASSPATH" org.antlr.v4.Tool

The numbers are now within about 5x. This is definitely a very good improvement. :+1:

Do I need to try any other branch?

It seems that the Go runtime is still 1~2x slower than Java.

I did some tests with the C grammar and a macro-expanded source file from the Lua project (~200KB). Both the Go version and the Java version just walk the C file.

Noticeably, the Go version spent over 40% of its time on memory management. This is probably related to the binary-trees benchmark: the Go runtime eagerly scans all objects and pointers to free some garbage, but barely achieves anything in this case. Turning off GC is no good either. Maybe some kind of memory pool? [[1](https://blog.kowalczyk.info/article/u5o7/Speeding-up-Go-and-C-with-custom-allocators.html), [2](https://software.intel.com/en-us/blogs/2014/05/10/debugging-performance-issues-in-go-programs)]
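A minimal sketch of the memory-pool idea from [1], using the standard library's sync.Pool; the buffer type here is invented for illustration, nothing like it exists in the ANTLR runtime:

```go
package main

import (
	"fmt"
	"sync"
)

// tokenBuf stands in for a transient, allocation-heavy object that a
// lexer might otherwise create and discard on every run.
type tokenBuf struct{ data []rune }

// bufPool hands out reusable buffers instead of letting each parse
// churn the garbage collector with fresh allocations.
var bufPool = sync.Pool{
	New: func() interface{} { return &tokenBuf{data: make([]rune, 0, 64)} },
}

// lexOnce "lexes" the input into a pooled buffer and returns the
// number of runes seen, returning the buffer to the pool afterwards.
func lexOnce(input string) int {
	buf := bufPool.Get().(*tokenBuf)
	buf.data = buf.data[:0] // reset before reuse
	for _, r := range input {
		buf.data = append(buf.data, r)
	}
	n := len(buf.data)
	bufPool.Put(buf) // recycle instead of discarding
	return n
}

func main() {
	fmt.Println(lexOnce("query { name }")) // 14
}
```

Whether pooling actually helps here would need profiling, since sync.Pool contents can still be reclaimed between GC cycles.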

Among the 10 most expensive functions, 7 are related to GC:

(pprof) top10
4260ms of 6680ms total (63.77%)
Dropped 134 nodes (cum <= 33.40ms)
Showing top 10 nodes out of 137 (cum >= 2790ms)
      flat  flat%   sum%        cum   cum%
    1300ms 19.46% 19.46%     2360ms 35.33%  runtime.scanobject
     960ms 14.37% 33.83%      980ms 14.67%  github.com/antlr/antlr4/runtime/Go/antlr.(*BaseSingletonPredictionContext).hash
     580ms  8.68% 42.51%      580ms  8.68%  runtime.heapBitsForObject
     500ms  7.49% 50.00%      500ms  7.49%  runtime.greyobject
     260ms  3.89% 53.89%      410ms  6.14%  runtime.mallocgc
     150ms  2.25% 56.14%     1560ms 23.35%  github.com/antlr/antlr4/runtime/Go/antlr.(*ParserATNSimulator).closureWork
     150ms  2.25% 58.38%      150ms  2.25%  runtime.casgstatus
     130ms  1.95% 60.33%      380ms  5.69%  runtime.mapassign
     120ms  1.80% 62.13%      120ms  1.80%  runtime.heapBitsSetType
     110ms  1.65% 63.77%     2790ms 41.77%  github.com/antlr/antlr4/runtime/Go/antlr.(*ParserATNSimulator).execATN

In the cumulative list, GC work accounts for over 40% of execution time:

(pprof) top5 -cum
0.13s of 6.68s total ( 1.95%)
Dropped 134 nodes (cum <= 0.03s)
Showing top 5 nodes out of 137 (cum >= 2.69s)
      flat  flat%   sum%        cum   cum%
         0     0%     0%      2.81s 42.07%  github.com/antlr/antlr4/runtime/Go/antlr.(*ParserATNSimulator).AdaptivePredict
     0.11s  1.65%  1.65%      2.79s 41.77%  github.com/antlr/antlr4/runtime/Go/antlr.(*ParserATNSimulator).execATN
     0.02s   0.3%  1.95%      2.79s 41.77%  runtime.systemstack
         0     0%  1.95%      2.76s 41.32%  runtime.startTheWorldWithSema
         0     0%  1.95%      2.69s 40.27%  _/D_/Lab/ANTLR/perftest/go.(*CParser).ExternalDeclaration

@wdscxsj can you please provide some more details?

  • What version of ANTLR?
  • How did you generate the profiles (e.g. benchmark, one-off run, etc.)?
  • Is your testing setup in a public repo, so it can be repeated / looked into?

If possible, can you please include the profile listing for the hot spots, e.g.:

list AdaptivePredict
list execATN

I'm not sure if list BaseSingletonPredictionContext.*hash will work. Otherwise, use list hash and just copy out the BaseSingletonPredictionContext method.

@millergarym Thanks for your reply! Please see this repo: https://github.com/wdscxsj/antlrperfcomp.

@wdscxsj a couple of things to note.

  • Since the parse is only run once, the assumption is the use case is some sort of code analysis and not a long-running job.
  • 1-2x? 1x would mean Java time = Go time?
  • Only running the test once (i.e. not in a benchmark) is of little or no use for long-running processes: no JIT in the case of Java, and no caching of the DFA in either case.

What use case are you testing for?

@millergarym Thanks again for the quick response. A little background may be helpful.

I'm using ANTLR in a private project that does some on-and-off code analysis and conversion. The original Java implementation is to be ported to Go.

We (my colleagues and I) found that the Go version was consistently slower than the original Java one. For example, for a middle-sized input, the lexing + parsing time (but dominantly lexing, as observed by ashishnegi) was 10.0s in Java and 27.9s in Go (or 125.2s for Go with ANTLR 4.6). Profiling indicates that the Go version spends too much time managing memory. The memory consumption is significantly lower, but we just want it to run as fast.

And by "1~2x slower", I meant "2 to 3 times as slow" as Java.

Have you tried comparing both runtimes' performance? Don't you think Go should run at roughly the same speed as Java (if not a bit faster)?

@wdscxsj if Go is 2-3x slower on a single run, then my guess would be that it would be even slower in a benchmark, because of the JIT. I agree: in general, if Go is slower than Java, it is likely an implementation issue. (See http://benchmarksgame.alioth.debian.org/u64q/go.html)

Can you please add the antlr jars (complete and source) you used to the repo? When I do another round of performance work in the future, this will be a valuable baseline.

Cheers
Gary

@millergarym The repo has been updated. Thanks.

@millergarym I suppose this issue is (at least partially) caused by Go's GC design: https://github.com/golang/go/issues/23044. There is discussion about adding a GOGC_MIN or debug.SetMaxHeap().

By the way, @sharwell's optimized Java version is lightning fast. Thank you very much!
