OpenRefine: Fetching URLs in a large dataset ends with an OutOfMemoryError with no feedback for the user

Created on 20 Dec 2019 · 4 comments · Source: OpenRefine/OpenRefine

Describe the bug
An "Add column by fetching URLs" operation on a large dataset runs out of memory and gets stuck at 99%, giving no feedback to the user.

To Reproduce
Steps to reproduce the behavior:
Run an "Add column by fetching URLs" operation on ca. 30,000 rows.

Current Results
Once the progress indicator at the top of the page reaches 98 or 99%, it stays stuck there indefinitely, giving the user no feedback that something has gone wrong. When I check the console, I see:

```
10:38:08.848 [ project_utilities] Saved project '2337537703446' (2331ms)
Exception in thread "Thread-8" com.google.common.util.concurrent.ExecutionError: java.lang.OutOfMemoryError: GC overhead limit exceeded
    at com.google.common.cache.LocalCache$Segment.get(LocalCache.java:2205)
    at com.google.common.cache.LocalCache.get(LocalCache.java:3953)
    at com.google.common.cache.LocalCache.getOrLoad(LocalCache.java:3957)
    at com.google.common.cache.LocalCache$LocalLoadingCache.get(LocalCache.java:4875)
    at com.google.refine.operations.column.ColumnAdditionByFetchingURLsOperation$ColumnAdditionByFetchingURLsProcess.cachedFetch(ColumnAdditionByFetchingURLsOperation.java:331)
    at com.google.refine.operations.column.ColumnAdditionByFetchingURLsOperation$ColumnAdditionByFetchingURLsProcess.run(ColumnAdditionByFetchingURLsOperation.java:292)
    at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.OutOfMemoryError: GC overhead limit exceeded
```

Expected behavior
Either not run out of memory ;) or at least show an error message in the interface.

Desktop (please complete the following information):

  • OS: Ubuntu 18.04.3 LTS
  • Browser Version: Chrome Version 78.0.3904.87 (Official Build) (64-bit)
  • JRE or JDK Version:
    openjdk version "1.8.0_232"
    OpenJDK Runtime Environment (build 1.8.0_232-8u232-b09-0ubuntu1~18.04.1-b09)
    OpenJDK 64-Bit Server VM (build 25.232-b09, mixed mode)

OpenRefine (please complete the following information):

  • Version 3.3 Beta
Labels: bug, error handling, fetch urls, large project, support

All 4 comments

In this particular case I'm fetching substantial chunks of data, about 10 kB per row, so it's no wonder it gets heavy. I imagine https://github.com/OpenRefine/OpenRefine/issues/120#issuecomment-567915828 could help here, since I end up needing only a small part of the result anyway…

@Vesihiisi How large is your dataset? How many tables? How many rows per table? How many MB/GB? How much memory have you allocated to OpenRefine? (Have you changed it, or are you using the default?)

This is the sort of problem that should be solved by our new architecture, but that is still far from ready; in the meantime, allocating more memory is the simplest solution. If you don't intend to keep the full contents downloaded from the URLs, it would also help a lot to have a GREL function to fetch URLs: https://github.com/OpenRefine/OpenRefine/issues/120#issuecomment-567915828. At the moment you could do the same thing with a Jython expression, as sketched below.
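Something like the following, as a rough sketch, assuming the cells contain URLs that return JSON; the `"title"` key is just a placeholder for whichever field you actually need:

```python
# Jython expression for "Add column based on this column"
# (expression language set to Jython).
# `value` is the current cell, supplied by OpenRefine; here it is
# assumed to hold a URL returning JSON. Only the extracted field is
# returned, so the full response is never stored in the project.
import urllib2
import json

response = urllib2.urlopen(value, None, 30)  # 30-second timeout
try:
    data = json.loads(response.read())
finally:
    response.close()
return data.get("title", "")  # "title" is a hypothetical field name
```

Since only the extracted field ends up in the new column, the full responses never accumulate in memory. The trade-off is that the responses aren't cached, so re-running the expression re-fetches everything.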

30K rows × 10 KB is only 300 MB, which definitely shouldn't cause problems even with the default JVM heap allocation, so this seems like a bug.
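Making the units explicit, as a back-of-the-envelope check in plain Python:

```python
# 30k rows at ~10 KB each, converted to megabytes.
rows = 30_000
bytes_per_row = 10 * 1024                      # 10 KB
total_mb = rows * bytes_per_row / (1024.0 ** 2)
print("%.0f MB" % total_mb)                    # ~293 MB, i.e. roughly 300 MB
```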

The appearance of the Guava cache library in the stack trace is suspicious. I wonder whether there are multiple copies of the data lying around. Turning off the cache (it seems to be on by default) would be something to try.

OutOfMemory errors are, unfortunately, pretty difficult to handle gracefully in Java. By the time the situation gets that serious, there's little chance of recovery. #2792 would help provide visibility to the user that memory is running low (although for a long running operation like this, they may not be able to do much about it).
