Users are experiencing instability with Reader, including a significant number of crashes. This issue describes some new efforts we're going to pursue to determine the source or sources of this instability, and solicits feedback from others on the best path forward.
We've basically been trying to manually create conditions on local dev, UAT, and prod that cause a crash and use browser developer tools to reconstruct what happened. This approach doesn't seem to be working very well for a few reasons:
I'll try this afternoon on UAT:
https://addyosmani.com/blog/taming-the-unicorn-easing-javascript-memory-profiling-in-devtools/
Open DevTools > Profiles (in newer versions of Chrome, heap snapshots live in the Memory panel)
Take a heap snapshot
Perform an action in your app that you think is causing leaks (ex: browse through 5-6 docs, zoom, add comments, add tags, or just open the document list page)
Take a second heap snapshot
Repeat the same stuff once more
Take a final heap snapshot
Select the most recent snapshot taken
At the bottom of the window, find the drop-down that says "All objects" and switch it to "Objects allocated between snapshots 1 and 2". (You can do the same for 2 and 3 if needed.)
This view lists objects allocated between those snapshots that are still hanging around, i.e. potential leaks. Select one to see what is keeping it alive in its retaining tree.
If this memory leak is correlated with particular variables or patterns of user behavior, we want to be able to narrow that down. Going forward, here are some questions we could ask users to build a framework for reasoning about crash reports.
We don't currently have monitoring that reports on crashes. We might look into Sentry internals to confirm why, but it's likely that if a crash happens without window.onerror firing, Sentry cannot report it. We only have a few dozen Board users right now, so we could instruct them to start Chrome with the --enable-precise-memory-info flag, which makes Chrome report precise (rather than coarsely rounded) JS heap sizes at window.performance.memory:
{
totalJSHeapSize: 29400000,
usedJSHeapSize: 15200000,
jsHeapSizeLimit: 1530000000
}
Then we'd add monitoring of the JS heap size (Google Analytics? Sentry? Prometheus?) to watch how it grows. That would give us a sense of how much memory users have available while using Reader, show which users are experiencing crashes and when, and potentially let us correlate that with logs (possibly a long slog). Note that this captures only JS heap memory, not memory allocated directly to DOM nodes or allocated elsewhere (https://developers.google.com/web/tools/chrome-devtools/memory-problems/?utm_source=dcc&utm_medium=redirect&utm_campaign=2016q3)
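A minimal TypeScript sketch of the client side of this monitoring. It assumes Chrome's non-standard performance.memory (other browsers don't expose it, so the sampler just skips them), and `report` is a hypothetical hook where we'd call whichever backend we pick (Google Analytics, Sentry, or Prometheus). The one-minute default interval is an arbitrary choice, not something we've tuned.

```typescript
// Shape of Chrome's non-standard performance.memory object. It isn't part of the
// standard DOM typings, so we declare it ourselves.
interface HeapSample {
  totalJSHeapSize: number;
  usedJSHeapSize: number;
  jsHeapSizeLimit: number;
}

// Reads performance.memory if the runtime exposes it; returns null otherwise.
function readHeap(): HeapSample | null {
  const perf = globalThis.performance as unknown as { memory?: HeapSample } | undefined;
  const mem = perf?.memory;
  if (!mem) return null;
  // Copy into a plain object before handing it to a reporter.
  return {
    totalJSHeapSize: mem.totalJSHeapSize,
    usedJSHeapSize: mem.usedJSHeapSize,
    jsHeapSizeLimit: mem.jsHeapSizeLimit,
  };
}

// Fraction of the heap limit currently in use (0 if the limit is unknown).
function heapUsageRatio(s: HeapSample): number {
  return s.jsHeapSizeLimit > 0 ? s.usedJSHeapSize / s.jsHeapSizeLimit : 0;
}

// Samples the heap every `intervalMs` and passes each sample to `report`
// (a hypothetical hook). Returns a function that stops sampling.
function startHeapMonitor(
  report: (sample: HeapSample, ratio: number) => void,
  intervalMs: number = 60_000,
): () => void {
  const timer = setInterval(() => {
    const sample = readHeap();
    if (sample) report(sample, heapUsageRatio(sample));
  }, intervalMs);
  return () => clearInterval(timer);
}
```

With the sample values above, heapUsageRatio comes out to about 0.0099, i.e. roughly 1% of the heap limit. Reporting the ratio alongside the raw byte counts would make machines with different heap limits easier to compare.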
We've never tested PDF.js in isolation. We could create a local environment to test whether PDF.js leaks memory or DOM nodes: just open, render, and close documents in a loop, over and over.
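A minimal sketch of that loop in TypeScript. The `LoadedDoc` and `RenderablePage` interfaces are our own assumptions describing a thin adapter around PDF.js, not the library's published typings; in a browser harness, `open` would wrap `pdfjsLib.getDocument(url).promise`, and `render` would draw to a canvas via `page.getViewport({ scale: 1 })` and `page.render({ canvasContext, viewport }).promise` (exact signatures vary between PDF.js versions). Injecting `open` also lets the loop run against fakes.

```typescript
// Adapter around a PDF.js page (assumed shape, not PDF.js's own typings).
interface RenderablePage {
  render(): Promise<void>; // in a real harness: page.render({ canvasContext, viewport }).promise
  cleanup(): void;         // in a real harness: PDFPageProxy.cleanup()
}

// Adapter around a PDF.js document (assumed shape).
interface LoadedDoc {
  numPages: number;
  getPage(pageNumber: number): Promise<RenderablePage>;
  destroy(): Promise<void>; // in a real harness: PDFDocumentProxy.destroy()
}

// Opens, renders every page of, and destroys each document in `urls`,
// repeated `iterations` times. Returns the total number of pages rendered.
// Watch heap size and DOM node count in DevTools while it runs: steady growth
// across iterations would point at a leak in PDF.js or in our wrapper.
async function stressPdfLoop(
  open: (url: string) => Promise<LoadedDoc>,
  urls: string[],
  iterations: number,
  onIteration?: (iteration: number) => void,
): Promise<number> {
  let pagesRendered = 0;
  for (let i = 0; i < iterations; i++) {
    for (const url of urls) {
      const doc = await open(url);
      try {
        // PDF.js page numbers are 1-based.
        for (let p = 1; p <= doc.numPages; p++) {
          const page = await doc.getPage(p);
          await page.render();
          page.cleanup();
          pagesRendered++;
        }
      } finally {
        // Always close the document, even if rendering throws.
        await doc.destroy();
      }
    }
    onIteration?.(i);
  }
  return pagesRendered;
}
```

Taking a heap snapshot from the onIteration callback's vantage point (or simply pausing there) makes it easy to compare memory after iteration 1 versus iteration N.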
We're actively looking for critiques of the current ideas, and for new ideas on how to move forward, if anyone has thoughts!
Questions:
Can we reproduce this at all?
Do we know of users who have experienced this?
How do we know this is a memory leak issue?
Is Chrome crash reporting turned on for our users?
It is possible to read Chrome crash dumps. However, it is non-trivial to set up and requires some experience reading C/C++ and assembly. I would do this as a last resort.
http://www.chromium.org/developers/decoding-crash-dumps
@askldjd
Can we reproduce this at all?
I have caused Reader to crash on a low-spec (4GB RAM, 5+ year old) computer sitting on the podium in Chobani by navigating through 30+ documents in rapid succession. When I behave more like a normal user, I haven't seen a crash.
Do we know of users who have experienced this?
@abbyraskinUSDS receives reports through the Feedback app and through email, and has been responding to those users.
How do we know this is a memory leak issue?
We don't, necessarily. We have observed Reader using huge amounts of memory (5GB+) via Activity Monitor. @mdbenjam expects Reader to keep only 3 documents in memory at a time, and those docs have a 60MB file-size max; that is roughly 3 × 60MB = 180MB of source files, far below the 5GB+ we've seen. So it does seem we are using unexpectedly high amounts of memory somehow. However, it doesn't necessarily follow that memory use creeps up over time; in fact, GC does seem to correct for our high memory usage (even when it gets really high, GC eventually takes care of it).
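@mdbenjam's expectation that Reader keeps only 3 documents in memory at a time can be expressed as a bounded, least-recently-used cache that destroys whatever it evicts. The TypeScript sketch below is hypothetical: we haven't checked how Reader actually implements this, and it only assumes documents expose a `destroy()` method, which PDF.js's `PDFDocumentProxy` does. Writing the invariant down this explicitly would also make it easy to assert in tests.

```typescript
// Anything with a destroy() method; PDF.js's PDFDocumentProxy has one that returns a Promise.
interface Destroyable {
  destroy(): Promise<void> | void;
}

// Keeps at most `capacity` documents alive and destroys the least recently used
// one when a new document pushes the cache over the limit.
class BoundedDocCache<D extends Destroyable> {
  // Map iterates in insertion order, so the first key is always the least recently used.
  private readonly docs = new Map<string, D>();
  private readonly capacity: number;

  constructor(capacity = 3) {
    this.capacity = capacity;
  }

  get size(): number {
    return this.docs.size;
  }

  get(id: string): D | undefined {
    const doc = this.docs.get(id);
    if (doc !== undefined) {
      // Re-insert to mark it as most recently used.
      this.docs.delete(id);
      this.docs.set(id, doc);
    }
    return doc;
  }

  put(id: string, doc: D): void {
    const previous = this.docs.get(id);
    this.docs.delete(id);
    this.docs.set(id, doc);
    if (previous !== undefined && previous !== doc) this.release(previous);
    while (this.docs.size > this.capacity) {
      const oldestId = this.docs.keys().next().value as string;
      const oldest = this.docs.get(oldestId) as D;
      this.docs.delete(oldestId);
      this.release(oldest);
    }
  }

  private release(doc: D): void {
    // Fire-and-forget teardown; log failures instead of leaving an unhandled rejection.
    Promise.resolve(doc.destroy()).catch((err) => console.warn("destroy() failed", err));
  }
}
```

If memory still climbs while a cache like this reports size 3, the growth is coming from somewhere other than retained documents (for example rendered canvases or DOM nodes).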
Is Chrome crash reporting turned on for our users?
Not sure.
Another consideration is that there's a chance my refactoring efforts will magically clean things up. Some behavior will change: we will be rendering fewer pages at once, which might help, and we are now using React lifecycle events to clean up PDFs, which seems more reliable than my old code. I'm hoping to merge the feature branch into master by the end of the week. There's no guarantee that it fixes the problem, so it's still worthwhile to investigate.
Until then, I think adding anything that helps us diagnose the problem is probably the best path: either seeing whether we can get Sentry errors for browser crashes, or monitoring memory usage.
sentry errors for browser crashes
I don't think that's possible. This is a native code crash in the C/C++ layer, and the interpreter crashes along with the tab container process. There is no chance for the JS engine to run, because this is not an exception that can be caught in the managed JS world.
If you look into Chrome crash reporting, you'll notice that the report sent to Google actually contains a minidump file. Interpreting the dump requires a debugger (e.g. WinDbg/gdb) with debug symbols.
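Since JS can't observe its own crash, one commonly used workaround (not something tried in this thread) is to detect the crash after the fact: set a flag in localStorage on load, clear it on pagehide, and treat a flag that is still present on the next load as an unclean exit. A minimal TypeScript sketch; the key name is hypothetical, and the storage is passed in so the logic can be exercised without a browser.

```typescript
// Minimal subset of the Web Storage API; window.localStorage satisfies it.
interface SessionStore {
  getItem(key: string): string | null;
  setItem(key: string, value: string): void;
  removeItem(key: string): void;
}

// Hypothetical key name; anything unique to Reader would do.
const OPEN_SESSION_KEY = "reader.sessionOpen";

// Call once on page load. Returns the start timestamp of the previous session
// if it never recorded a clean exit (tab crash, killed browser, machine crash),
// or null otherwise. Then marks the current session as open.
function detectUncleanExit(store: SessionStore, now: number = Date.now()): string | null {
  const previous = store.getItem(OPEN_SESSION_KEY);
  store.setItem(OPEN_SESSION_KEY, String(now));
  return previous;
}

// Call from a "pagehide" listener to record a clean exit.
function markCleanExit(store: SessionStore): void {
  store.removeItem(OPEN_SESSION_KEY);
}
```

In Reader this might be wired up as `const crashedAt = detectUncleanExit(window.localStorage)` on startup, reporting a non-null result through Sentry (e.g. its captureMessage call), plus `window.addEventListener("pagehide", () => markCleanExit(window.localStorage))`. Two caveats: localStorage is shared by all tabs on the same origin, so a second open Reader tab would look like a crash unless the key is made per-tab, and pagehide isn't guaranteed to fire on every close, so the counts would be approximate.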
Tested in UAT. Opened appeal 1075466 in an incognito tab and flipped rapidly back and forth to the end of all 65 documents and back to the first, for 1-2 minutes.
Memory went up to 1GB while I was flipping, then continued to climb slowly with no user input until it hit around 8GB. I force quit Chrome at that point.

It's really hard to diagnose this issue without actually being able to replicate it consistently. I agree with Mark; he has been doing a pretty big refactor of our pdf.js and page rendering logic. It might be wise to wait and see whether the problems persist after the refactor.
But we do need a better way to diagnose this problem and actually capture some data from the users.
I'm not sure how useful monitoring the heap size would be if it only captures JavaScript memory and nothing else.
PDF.js tests will have to wait until Mark finishes the refactor.
Yeah — since @mdbenjam advised that he's about to refactor a ton of the code, I think we should wait to continue investigating until that's merged in. No sense trying to hit a moving target on this.
Going to leave this issue open, but assign it to @mdbenjam for now — @mdbenjam when you merge in your refactor, can you assign it back to me?
It's reproducible. I just crashed my machine completely by leaving reader open.
Also see:
Documents with around 100 pages can easily crash Reader; this was confirmed by users referencing 500+ page documents.
Need a user in UAT whose documents are all 500+ pages? Just Slack me.
We should watch this crash happen live at a user's desk and observe the patterns that are causing this problem.
Great idea, @tejans24. We are fortunate that we have the ability to just walk down the hall and talk to people.
Always happy to facilitate any field trips (or Lync calls), let me know!
Yep, as soon as Mark's refactor gets merged in and deployed to prod, we should jump on the next crash report and ask the user if we can shadow :)
Closing because Mark and team did a ton of refactoring after this ticket was opened and into early 2018.