I often have to work OUTSIDE of OpenRefine to down convert large Excel spreadsheets into CSV files that OpenRefine can import because our current ExcelImporter doesn't use memory efficient ways to handle this.
Example Large XLSX file in the public domain (46 MB , 917000 rows x 15 cols):
https://usace.contentdm.oclc.org/digital/api/collection/p16021coll2/id/4706/download
Many folks have this problem with Apache POI itself such as mentioned here: https://stackoverflow.com/questions/11891851/how-to-load-a-large-xlsx-file-with-apache-poi
There's also a guy who forked the original https://github.com/monitorjbl/excel-streaming-reader and maintains that now here: https://github.com/pjfanning/excel-streaming-reader
Apache POI recognizes that using an InputStream does use more memory than using File.
https://poi.apache.org/apidocs/dev/org/apache/poi/xssf/usermodel/XSSFWorkbook.html#XSSFWorkbook-java.io.InputStream-
but it's not the complete reason for large memory usage; that stems from the overhead of retaining all the large structures and the content of cells, instead of minimal structure and just the content.
In researching further, I came to know that there are examples in Apache POI package that Nick Burch wrote that uses the EventUserModel code to efficiently (although with fewer options) parse and convert XLS files to CSV (of course while throwing away structures and minimizing Java Objects creation in the process)
POI EXAMPLES:
XSSF (.xlsx and ooxml):
https://svn.apache.org/repos/asf/poi/trunk/src/examples/src/org/apache/poi/xssf/eventusermodel/
HSSF (.xls):
https://svn.apache.org/repos/asf/poi/trunk/src/examples/src/org/apache/poi/hssf/eventusermodel/examples/
SS (either):
https://svn.apache.org/repos/asf/poi/trunk/src/examples/src/org/apache/poi/ss/examples/
Further research lead me to look at the EventUserModel a bit more and I noticed that POI has support for Text Extraction with Excel files in constrained environments (at a cost of losing Java Objects that retain the data structure, but the benefit being less memory is consumed, but there's little to no structure, so its mostly useful for dropping into text mining or Lucene indexes)
https://poi.apache.org/text-extraction.html
I'm wondering 2 things now:
Short version of the above: investigate if using Apache POI's streaming API would help us reduce memory usage, as suggested here:
https://stackoverflow.com/questions/11891851/how-to-load-a-large-xlsx-file-with-apache-poi
As I mentioned above, I think the best code to look at is here: https://github.com/pjfanning/excel-streaming-reader Where PJ Fanning @pjfanning just happens to be a contributor also to Jackson @ FasterXML :-) Maybe he could help us quickly fix this issue and we'd give him lots of hugs.
Hi - I'm not really sure what your requirements are but I presume you'll want to support getting the data and being able to attach a schema to the data.
Working with Excel files in Java applications can be difficult. If you need to support xls, you're probably stuck with POI's HSSFWorkbook but there are a few options for xlsx files.
Best of luck.
I had a quick look at the existing code and the export code could probably be easily changed to use POI SXSSFWorkbook instead of XSSFWorkbook. SXSSFWorkbook is used for writing out xlsx files with a lower JVM memory footprint.
@pjfanning Great, and what about our ExcelImporter ?
Try the excel-streaming-reader. The workbook instance it gives has the same API as XSSFWorkbook - all be it without the features of the full XSSFWorkbook class.
@pjfanning Thanks so much for taking a peek at us! :)
https://github.com/pjfanning/OpenRefine/commit/a28d96f29d7bd40a33e47507d9fb80919cbbdad9 is basically the change needed to use the streaming parser
@pjfanning Thanks for the example. I tried it a bit, ran into an issue and left a comment
pjfanning@a28d96f is basically the change needed to use the streaming parser
Any chance you could convert that into a PR that could be reviewed and merged?
@thadguidry was looking at testing the sample code to see if it worked in practice
@pjfanning I'm not a core Java developer "at all" but worked with them all my life. Maybe @tfmorris could help you here. I'm better at digging into docs, research, and testing.
I don't know much about OpenRefine and wouldn't have much time to look at https://github.com/pjfanning/OpenRefine/commit/a28d96f29d7bd40a33e47507d9fb80919cbbdad9?diff=unified#commitcomment-38754764
Most helpful comment
https://github.com/pjfanning/OpenRefine/commit/a28d96f29d7bd40a33e47507d9fb80919cbbdad9 is basically the change needed to use the streaming parser