We are currently using a custom snapshot of OpenCSV which is years old. The OpenCSV project is still active and a lot of releases have been published since then. We should look into upgrading to a newer version, which would also let us get rid of the locally stored .jar.
Version 5.0 does not seem to support multi-character separators, which we currently rely on at least in the smartSplit function.
Multi-character separators were proposed but the patch did not make it upstream: https://sourceforge.net/p/opencsv/patches/44/
We don't seem to be the only ones to require this though: https://stackoverflow.com/questions/8653797/java-csv-parser-with-string-separator-multi-character
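For reference, the core of what a smartSplit-style helper has to do is a quote-aware split on a multi-character separator. A minimal sketch in plain Java (the class and method names here are hypothetical, not our actual implementation):

```java
import java.util.ArrayList;
import java.util.List;

public class MultiCharSplit {
    // Split a line on a multi-character separator, ignoring separators
    // that appear inside double-quoted fields.
    public static List<String> smartSplit(String line, String sep) {
        List<String> fields = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        boolean inQuotes = false;
        int i = 0;
        while (i < line.length()) {
            char c = line.charAt(i);
            if (c == '"') {
                // Toggle quoted state; keep the quote in the field.
                inQuotes = !inQuotes;
                current.append(c);
                i++;
            } else if (!inQuotes && line.startsWith(sep, i)) {
                // Unquoted separator: close the current field.
                fields.add(current.toString());
                current.setLength(0);
                i += sep.length();
            } else {
                current.append(c);
                i++;
            }
        }
        fields.add(current.toString());
        return fields;
    }
}
```

For example, `smartSplit("a||b||\"c||d\"", "||")` yields three fields, with the quoted `||` left untouched.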
The OpenCSV project seems to prefer a strict RFC 4180 stance, which seems reasonable given their mission (i.e. they consider bits outside the "CSV standard" to not really be CSV). Even if you provided a new class/method, it would probably be rejected, but you never know and might want to ask them.
I still think investment in one of these would be better all around:
This might mean we should look at using Jackson CSV instead (I prefer it, since they acknowledge that CSV can be "messy" sometimes and support extensions; we would just need to implement something like CsvParser.Feature.ALLOW_MULTI_SEPARATORS)?
https://github.com/FasterXML/jackson-dataformats-text/tree/master/csv#configuring-csvschema
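To make the gap concrete, here is roughly what the Jackson CSV configuration looks like today (assuming the jackson-dataformat-csv dependency). Note that `withColumnSeparator` takes a single `char`, which is exactly why a multi-character separator would need a new feature like the hypothetical `ALLOW_MULTI_SEPARATORS` above:

```java
import com.fasterxml.jackson.dataformat.csv.CsvMapper;
import com.fasterxml.jackson.dataformat.csv.CsvSchema;

public class JacksonCsvConfig {
    public static void main(String[] args) throws Exception {
        CsvMapper mapper = new CsvMapper();
        // The separator is a single char here -- a multi-character
        // separator is not expressible in the current CsvSchema API.
        CsvSchema schema = CsvSchema.emptySchema()
                .withColumnSeparator(';')
                .withQuoteChar('"')
                .withHeader();
        Object rows = mapper.readerFor(java.util.Map.class)
                .with(schema)
                .readValues("a;b\n1;2\n")
                .readAll();
        System.out.println(rows);
    }
}
```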
or Apache Commons CSV (which at one time wanted to unify development with OpenCSV, but didn't get far with them either, if I recall)
OpenCSV does not stick to RFC 4180; they also have a more flexible parser which accommodates non-standard needs.
But yeah, switching to another parser could also be an option. There seems to be quite a lot of them actually! https://github.com/uniVocity/csv-parsers-comparison
I did not say they stick to it?
Well, you wrote that "The OpenCSV project seems to prefer a strict stance of RFC 4180 which seems reasonable given their mission", and I think that is not a very accurate description of OpenCSV, since their default parser is much more flexible and accepts CSVs that do not conform to RFC 4180. They also have an RFC 4180 parser, but it is not the default one.
Thanks for the info, but I also see flexibility in other parsers. So switching, although painful, might be wiser. Up to you.
With the migration to spark, Spark SQL's own CSV parser is a natural choice since it allows efficient partitioning (so, scales well to large datasets).
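A minimal sketch of what that would look like from Java (assuming a Spark environment; the app name, master, and input path are placeholders):

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class CsvIngest {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("csv-ingest")      // placeholder app name
                .master("local[*]")         // assumption: local run
                .getOrCreate();

        // Spark splits the input into partitions, so parsing scales
        // across cores/executors for large datasets.
        Dataset<Row> df = spark.read()
                .option("header", "true")
                .option("sep", ",")
                .csv("path/to/data.csv");   // placeholder path

        df.show();
        spark.stop();
    }
}
```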
OpenCSV rejected multi-character separators (a second time): https://sourceforge.net/p/opencsv/feature-requests/119/
There still don't seem to be any available Java CSV parsers which support multi-character separator strings.