Openrefine: Improve memory efficiency of Excel Importer

Created on 25 Apr 2020 · 13Comments · Source: OpenRefine/OpenRefine

I often have to work OUTSIDE of OpenRefine to down convert large Excel spreadsheets into CSV files that OpenRefine can import because our current ExcelImporter doesn't use memory efficient ways to handle this.

Example Large XLSX file in the public domain (46 MB , 917000 rows x 15 cols):
https://usace.contentdm.oclc.org/digital/api/collection/p16021coll2/id/4706/download

Many folks have this problem with Apache POI itself such as mentioned here: https://stackoverflow.com/questions/11891851/how-to-load-a-large-xlsx-file-with-apache-poi

There's also a guy who forked the original https://github.com/monitorjbl/excel-streaming-reader and maintains that now here: https://github.com/pjfanning/excel-streaming-reader

Apache POI recognizes that using an InputStream does use more memory than using File.
https://poi.apache.org/apidocs/dev/org/apache/poi/xssf/usermodel/XSSFWorkbook.html#XSSFWorkbook-java.io.InputStream-
but it's not the complete reason for large memory usage; that stems from the overhead of retaining all the large structures and the content of cells, instead of minimal structure and just the content.

In researching further, I came to know that there are examples in Apache POI package that Nick Burch wrote that uses the EventUserModel code to efficiently (although with fewer options) parse and convert XLS files to CSV (of course while throwing away structures and minimizing Java Objects creation in the process)

POI EXAMPLES:
XSSF (.xlsx and ooxml):
https://svn.apache.org/repos/asf/poi/trunk/src/examples/src/org/apache/poi/xssf/eventusermodel/
HSSF (.xls):
https://svn.apache.org/repos/asf/poi/trunk/src/examples/src/org/apache/poi/hssf/eventusermodel/examples/
SS (either):
https://svn.apache.org/repos/asf/poi/trunk/src/examples/src/org/apache/poi/ss/examples/

Further research lead me to look at the EventUserModel a bit more and I noticed that POI has support for Text Extraction with Excel files in constrained environments (at a cost of losing Java Objects that retain the data structure, but the benefit being less memory is consumed, but there's little to no structure, so its mostly useful for dropping into text mining or Lucene indexes)
https://poi.apache.org/text-extraction.html

I'm wondering 2 things now:

Should we refactor ExcelImporter to be more memory efficient? How, with EventUserModel as shown in some of those examples?
Or is it easier/better to just offer an ExcelImporter that can somehow detect OUT OF MEMORY while importing, and instead offer to optionally convert to CSV on the fly, similar to the 2CSV / ToCSV examples in the above POI examples?

XLS(X) enhancement import performance

Source

thadguidry

Most helpful comment

https://github.com/pjfanning/OpenRefine/commit/a28d96f29d7bd40a33e47507d9fb80919cbbdad9 is basically the change needed to use the streaming parser

pjfanning on 26 Apr 2020

👍2 ❤1

All 13 comments

Short version of the above: investigate if using Apache POI's streaming API would help us reduce memory usage, as suggested here:
https://stackoverflow.com/questions/11891851/how-to-load-a-large-xlsx-file-with-apache-poi

wetneb on 25 Apr 2020

As I mentioned above, I think the best code to look at is here: https://github.com/pjfanning/excel-streaming-reader Where PJ Fanning @pjfanning just happens to be a contributor also to Jackson @ FasterXML :-) Maybe he could help us quickly fix this issue and we'd give him lots of hugs.

thadguidry on 25 Apr 2020

Hi - I'm not really sure what your requirements are but I presume you'll want to support getting the data and being able to attach a schema to the data.

You might want to look at https://github.com/apache/drill/pull/2024
https://github.com/crealytics/spark-excel/blob/master/src/main/scala/com/crealytics/spark/excel/ExcelRelation.scala

Working with Excel files in Java applications can be difficult. If you need to support xls, you're probably stuck with POI's HSSFWorkbook but there are a few options for xlsx files.

Best of luck.

pjfanning on 25 Apr 2020

👍1

I had a quick look at the existing code and the export code could probably be easily changed to use POI SXSSFWorkbook instead of XSSFWorkbook. SXSSFWorkbook is used for writing out xlsx files with a lower JVM memory footprint.

pjfanning on 25 Apr 2020

❤1

@pjfanning Great, and what about our ExcelImporter ?

thadguidry on 25 Apr 2020

Try the excel-streaming-reader. The workbook instance it gives has the same API as XSSFWorkbook - all be it without the features of the full XSSFWorkbook class.

pjfanning on 25 Apr 2020

❤2 👍1

@pjfanning Thanks so much for taking a peek at us! :)

thadguidry on 25 Apr 2020

https://github.com/pjfanning/OpenRefine/commit/a28d96f29d7bd40a33e47507d9fb80919cbbdad9 is basically the change needed to use the streaming parser

pjfanning on 26 Apr 2020

👍2 ❤1

@pjfanning Thanks for the example. I tried it a bit, ran into an issue and left a comment

thadguidry on 26 Apr 2020

pjfanning@a28d96f is basically the change needed to use the streaming parser

Any chance you could convert that into a PR that could be reviewed and merged?

tfmorris on 25 Jun 2020

@thadguidry was looking at testing the sample code to see if it worked in practice

pjfanning on 25 Jun 2020

@pjfanning I'm not a core Java developer "at all" but worked with them all my life. Maybe @tfmorris could help you here. I'm better at digging into docs, research, and testing.

thadguidry on 26 Jun 2020

I don't know much about OpenRefine and wouldn't have much time to look at https://github.com/pjfanning/OpenRefine/commit/a28d96f29d7bd40a33e47507d9fb80919cbbdad9?diff=unified#commitcomment-38754764

pjfanning on 26 Jun 2020

Was this page helpful?

0 / 5 - 0 ratings