A test task runs over 310,000 independent test executions. In 6.14.3 this completes in about 2.5 minutes. In 7.0.0-beta6 this slows down to a crawl at 200k tests, appearing to GC thrash as small batches of tests complete.
In the past, I've seen this occur due to reporting keeping parameters in-memory. The project uses a workaround to replace test parameters with their toString version, as these are too heavy for a large run. I suspect there is some similar per-test memory grab by the reporter.
The next step is probably to take a heap dump and determine what is held. Before running too far into this, do you know what changes were made that might hint at what to look for?
./gradlew :caffeine:strongKeysAndStrongValuesAsyncCaffeineTest
@ben-manes - Thanks for taking the time to report this issue. Between 6.14.3 to 7.0.0-beta6 a lot of changes have gone in. So it would be great if you could help do one of the following
It appears to be due to holding every test method invoked in a ConcurrentLinkedQueue (you might consider a slimmer JCTools' queue). The TestNGMethod and TestResult did get fatter between releases.
There are also many, many strings for reporting every test method. I reduced my CI time by 10 minutes by interning the strings. I suspect you could do the same as method names, etc are identical for a parameterized test. When returning to the prior commit, turning on G1's
-XX:+UseStringDeduplication delays the GC thrashing from ~200k to ~300k. Unfortunately, that also causes a lot of InterruptedException to be thrown, so I can't turn it on until another testing library is patched.



I was able to greatly reduce the string storage by using a short string for the context parameter if no failures had occurred. This way when a failure happens, I can obtain details but that won't be the common case. I also intern'd the TestName, which together brought it down to 4mb instead of 400mb.
The ConcurrentLinkedQueue$Node and the TestResult objects they hold are still the dominant cost. I am surprised this is all held in memory. Why not asynchronously write them to disk and reindex when writing the report? It seems very wasteful to hold it all in memory and doing so incurs more GC cycles. Something like log4j2's disruptor design for async logging might work well to avoid an I/O penalty.

This runs in 2 minutes on 6.14.3 and 4 minutes with 7.0.0. This is likely due to a lot more full GCs required, which causes thread interrupts. Thanks to fixes in https://github.com/awaitility/awaitility/issues/134, the tests all pass fine.
The slow down is concerning and would block upgrading. I have optimized my memory usage significantly, and the cause seems to be due to TestNG. The larger TestNGMethod and TestResult classes, plus storing them in-memory using a ConcurrentLinkedQueue are the bloat in my heap dumps. These are areas that I cannot optimize in my code except to split the test suites apart even further (as millions overall).
My proposed fix would be to have the results asynchronously streamed out of memory to a file and then process the file for the report. This would drop memory usage to significantly speed up large test suites (by seconds), but might make the common case of few tests a little slower (by millis).
@ben-manes - Thanks for adding that additional context. Yes the reporting piece of TestNG has started bloating in memory and perhaps would need some revamp or a different approach.
My proposed fix would be to have the results asynchronously streamed out of memory to a file and then process the file for the report.
That was something that I was thinking too, but the problem is that TestNG also exposes some objects which are a bit more complex such as the java.lang.reflect.Method for e.g.,
But please do share any ideas that you may have in terms of an approach of how to go about doing this. We can vet it out together and then have this addressed.
the problem is that TestNG also exposes some objects which are a bit more complex such as the java.lang.reflect.Method
For instances, like the test method / class / etc, can you use a cache and store the fully-qualified name or other unique identifier? The number of test instance methods will be small in comparison, so you can keep a Map<String, Method> and similar to resolve against when deserializing. That way all usages still get the exact instance. Cases like the instantiated test class can be qualified with an additional identifier, so the exact object instance is resolved. If think all of this can be worked around, but I agree it becomes less pretty.
When optimizing memory usage again, I noticed that Gradle uses a custom testng reporter. This does not extend TestListenerAdapter and instead is much less memory hungry thanks to capturing the minimal fields it needs and using synchronized for cheaper data structures. The performance seems fine, so perhaps CLQ is overkill?
The only TestListenerAdapter that I see in a heap dump comes SuiteRunner and appears to be unused. This is forcibly added during initialization, but that listener is never inspected afterwards. It appears to be entirely unnecessary?
The actual TextReporter is may be added later, but is not done so by Gradle as disabled. For those using it and others, I think you could rewrite them to not extend TestListenerAdapter and instead incrementally build the output like TestNGTestResultProcessorAdapter does, thereby allowing the heavy-weight objects to be discarded. Ideally those that write textual data could stream to disk, e.g. TextReporter prints the buffer at the end and could simply display a temp file instead.
My latest build optimizations to reduce TestNG's heavy-weight reporting seems to save 20 minutes due to not choking on full garbage collections (a 22% speedup). I believe all of my low-level hacks could be removed if TestNG would more aggressively discard its heavy result objects instead of unnecessarily keeping them throughout the execution. This would improve performance for large builds significantly, and shouldn't negatively impact small suites either.
I tried to reflectively clear the queues inside of this useless TestListenerAdapter instance. That dropped the CLQ#Node instance count by half, but did not reduce the memory usage. 
I then reflectively captured TestRunner#m_invokedMethods and clear it. This seems to only be utilized by SuiteRunner#getExcludedMethods() for JUnitReportReporter, SuiteHTMLReporter, and IgnoredMethodsPanel. None of these are used in my build. That halved the CLQ#Node instances again, but did not drop memory.
This lead to SuiteRunner#invokedMethods which is similarly unused. By reflectively clearing out those results, the CLQ#Node dropped from my heap dump and the execution ran successfully.
While it dropped memory, it didn't decrease as much as desired because the TestResults are still being held by a ConcurrentHashMap. This is from ResultMap inside of TestRunner. Again these are only useful for some reporters and I can clean the results manually.
This finally dropped the memory usage and didn't affect the test reports from Gradle or Eclipse. While certainly ugly, it does prove that TestNG is holding onto the test results unnecessarily for builds, and could avoid that unless explicitly needed by a reporter. Those reporters could also be much more efficient by not requiring the full results stay in-memory, but rather capture only the details they care about (and stream them to disk if possible).
It would be really great if this memory behavior could be improved so that I could remove my narly hacks 馃槈.
(See CacheValidationListener for all of these hacks...)
In today's round of optimizations, I was able to drop a test suite task from 315MB (8.6M objects) down to 22MB (0.5M objects). The noticeable pauses from a full GC also disappeared.
There are roughly 30 test suites of equal size (140k tests per suite) in order to reduce the report size, as a single test task failed due to the memory overhead. Previously I had already optimized by aggressively removing test parameters and expected exceptions if successful and shortening + dedup'ing test names in the report. The GC is also tuned.
4.2M tests executing in parallel on TravisCI with 512mb heap now takes ~1 hour.
@ben-manes - Based on the details that you shared here as observations, I went ahead and created a PR https://github.com/cbeust/testng/pull/2390 that attempts at reducing the footprint.
Would it be possible for you to checkout this branch, build TestNG locally and try it out without the tweaks you added to caffeine ? That way I can know for sure if the changes are taking us forward in the right direction.
Thanks!
It would be nice if TestNG could be sourced from jitpack to simplify testing a commit. I had to build a local jar and include its dependencies, rather than merely swap coordinates.
I wrote a simple stress tester for you. It begins showing GC pauses at ~1M tests and eventually runs out of memory ~1.3M. This occurs for both 7.3 and your branch. It appears that CLQ is improved (but remains), but those captured in the CHM still cause an issue. Note my descriptions for fixing TestRunner and SuiteRunner, or look at my horrid code 馃槃.
// terminal 1
./gradlew cleanTest test
// terminal 2
jmap -dump:live,format=b,file=/Users/ben/Downloads/1.hprof "$(pgrep -f GradleWorkerMain | head -n 1)"
Then open the heap dump, e.g. VisualVM.
Across runs this will be at different heap times, but offers enough to eyeball where problems are and iterate.


It would be nice if TestNG could be sourced from jitpack to simplify testing a commit.
What do you mean? You should be able to find the build from @krmahadevan's branch here:
https://jitpack.io/#krmahadevan/testng/fix_reduce_memory_footprint_in_reports-SNAPSHOT
Yes, but it fails to produce a detectable artifact. Go to branches and see the log.
fyi, you can use the environment variable JITPACK to detect and then skip incompatible steps, e.g. this case of failing to produce a jar because it was unable to sign the archive.
I ran the stress test at 256mb heap against master, after this was closed. I still see the memory problems, so it isn't quite fixed. However maybe a new issue is appropriate since this was is getting quite verbose. I'd recommend using the testng-stress project to optimize with, since it nicely shows where there are excessive allocations.

@ben-manes - Please try using #2392 changeset. It should bring down the memory foot print further down.
The biggest of the problems is the ITestResult object which embeds a reference to ITestNGMethod object which internally embeds a reference to ConstructorOrMethod object which embeds within it a reference to Constructor (or) Method.
I did notice the duplications of Method since Class#getDeclaredMethods copies it, so there are many duplicates. It wasn't a big problem, but there are lots of little wasteful allocations that can be removed.
The new code still has the memory problem because there are many places that hold onto all of the ITestResult objects. The ITestNGMethod did drop, so that's great.


@ben-manes - In continuation with my effort to drive down the memory foot print, I have raised another PR https://github.com/cbeust/testng/pull/2393 which removes references to ITestNGMethod array from TestRunner. This should further drive down the memory foot print. Can you please help check ? I think now the only thing that would be left is to figure out how to handle the reporting aspect of TestNG (which I guess is where the ITestResult implementation is going to slim down)
@kamilszymanski I don't see ITestNGMethod in VisualVM anymore. The TestResult instances still remain and causes heap exhaustion. From my prior analysis it would be fixed if you can remove the holds in 3 and 4.



@ben-manes I guess you meant @krmahadevan
oops, sorry for fat fingering the auto-complete!
@ben-manes
SuiteRunner#invokedMethods [todo]
I have fixed this ask as well, as part of the PR https://github.com/cbeust/testng/pull/2394
Can you please try this branch ? [ Note: my [previous PR](https://github.com/cbeust/testng/pull/2393) has not yet been merged ]
@krmahadevan It did seem a little better by performing more tests in the stresser. Since there are still places that retain all of them, e.g. ResultMap, unfortunately you have to fix all before we feel the impact.
If you feel that you have to wait on me and get blocked, you can perform my steps:
testng-stresser and run the project.maxHeapSize = '256m' locally to fail faster.gradlew cleanTest test until it slows downjmap -dump:live,format=b,file=/Users/ben/Downloads/1.hprof "$(pgrep -f GradleWorkerMain | head -n 1)"
@ben-manes - For now, this is the max I can do in terms of fixing this problem (atleast to some extent) because for the other areas such as ResultMap I think its a bit more involved effort which will need some serious refactoring of the codebase, because its part of exposed API interfaces and there's no way we can get rid of them without some re-engineering.
And the only reason why I was coming back to you, asking for feedback was because, I felt you could give me a quicker feedback in terms of the memory improvements, than me (since you already know what to look for and how to look for etc.,)
I will be honest here, this is the first time I am using these tools for reviewing memory footprints (please bear with me if that sounds a bit lame or childish excuse), so there's a learning curve involved for me and it means i spend more time. Trying to juggle between work priorities and testng bugs (i love working on testng coz of the learnings it gives me and that's why i keep coming back to it over and over again) and so i dont find enough time for these investigations. Hope that adds the context
All said and done, it would be really helpful for me if you could please share me some tutorial/video/blog post/book or anything else, that I can refer to so that i also start becoming acquainted with all these stuff and become a better coder.
Oh of course I don't mind. I'm sorry if you took that nudge negatively as if I do. I felt bad thinking you were making changes in the dark or might wait on me, as I know how important quick iterations are during refactorings and to stay in the zone. I'm happy to iterate with you on this, even if its delayed if requiring a major version bump, as you'll make usages better and it's your charitable labor.
This is an invasive change and I'm not sure if you'll be comfortable without a major bump, as it might impact user-facing APIs with their custom integrations. Or maybe not, you're the expert here :)
Using a Profiler seems scary and awkward at first (just like learning SQL), but it pays for itself and isn't that hard! The JVM team develops VisualVM and Java Mission Control. The former is older and can open a heap dump, while the latter also offers Flight Recorder (JFR) as a low overhead profiler that can run in production. They are unbundled in recent JDKs so you have to install them separately (brew cask install visualvm, brew cask install jmc). I'm not as familiar with JMC but its pretty well regarded. YourKit and JProfiler are excellent commercial profilers that have been around for many years, highly regarded, and generously free for open-source projects. They are all worth installing and playing with. The JVM tool jmap lets you dump a live heap rather than needing to attach the profiler against the running process, which was more convenient in this case.
When you take a heap dump it generates an hprof file. You might have seen this at work when a memory leak causes an OutOfMemoryError (where -XX:+HeapDumpOnOutOfMemoryError kindly dumps it prior to termination). The profilers can open the dump and provide insights like instance counts, memory dominators, duplicate strings, etc. In most cases when analyzing a leak, the problems are immediately obvious and the calculated object path shows where its held. Once you know the dominators and major allocation spots, it's back to the code to determine the fix. When you have a patch, if you can reproduce the failure then you iterate and can verify in a new heap dump the success.
For example I was curious to see how JUnit 5 handles this stress test (as I've been a happy TestNG user for over a decade). It turns out to run the simple allocator tests slower and fail with fewer executions (~200k vs ~600k). The YourKit snapshot of the heap shows that LinkedHashSet in TestIdentifier and TestTemplateInvocationTestDescriptor are the dominators, and from the source code they appear to be their version of TestResult. Since their results are heavier objects, the stress test fails with fewer executions.
Learning a profiler just takes an issue to track down and exploration. It's not hard, just new and requires a little knowledge to set up. Afterwards you'll quickly track down the problem and maybe use the profiler for only 5% of the overall effort. The value is that it quickly uncovers the root problem, which would otherwise be a guessing game (where you are almost always wrong). So it pays off, but using the tool requires no real expertise and just an issue to be motivated by.
Perhaps @cbeust has further insights here. Really I think you'll be happy once you take the jump and learn it. Like SQL, its seems scary at first and then pays off handsomely throughout your career.

Most helpful comment
In today's round of optimizations, I was able to drop a test suite task from 315MB (8.6M objects) down to 22MB (0.5M objects). The noticeable pauses from a full GC also disappeared.
There are roughly 30 test suites of equal size (140k tests per suite) in order to reduce the report size, as a single test task failed due to the memory overhead. Previously I had already optimized by aggressively removing test parameters and expected exceptions if successful and shortening + dedup'ing test names in the report. The GC is also tuned.
4.2M tests executing in parallel on TravisCI with 512mb heap now takes ~1 hour.