It appears that .splitCsv does not handle quoted tokens containing commas correctly.
Given a token wrapped in quotes "1,234", I would expect .splitCsv to parse this as a single value [.., '1,234', ...]. However, it breaks this token into two, leaving a singular quote with each half [..., "1, 234", ...].
Using the following test case above as test.csv
value_1,value_2,value_3
100,10.1%,300
"9,600",98.9%,"24,155"
and parsing as
Channel
.fromPath("test.csv")
.splitCsv(header: true, quote: '\"')
.subscribe { println "$it" }
I've tried a variety of values of quote above to no avail (I'm new to Groovy so advice on how to best specify this option would be helpful).
[value_1: 100, value_2: 10.1%, value_3: 300]
[value_1: '9,600', value_2: 98.9%, value_3: '24,155']
[value_1: 100, value_2: 10.1%, value_3: 300]
[value_1: "9, value_2: 600", value_3: 98.9%]
I think the issue is related to how StringUtils.splitPreserveAllTokens is used in csvSplitter.groovy. The line is tokenized by , first then quotes are removed, but the tokenization by .splitPreserveAllTokens doesn't respect quotes:
import org.apache.commons.lang.StringUtils
line = '"9,600",86.3%,"23,000"'
println StringUtils.splitPreserveAllTokens(line, ",")
yields 5 tokens:
["9, 600", 86.3%, "23, 000"]
Perhaps regex is needed to extract the proper tokens (example), then quotes can be removed from them:
line = '"9,600",86.3%,"23,000"'
println line.split(",(?=(?:[^\"]*\"[^\"]*\")*[^\"]*\$)", -1)
yields the desired 3 tokens
["9,600", 86.3%, "23,000"]
I'm not sure why you want this behaviour...CSV stands for "Comma-separated values"
Maybe could be nice to be able to indicate the splitter character on the .splitCsv
But I'm not sure it makes sense to not split when a comma is found (if any other splitter char is defined)
@edgano Well, for one, the tools I'm using generate CSV files with quoted values. Further, it seems to be a widely accepted interpretation of the CSV specification.
Yep, the quotes are not the "problem"...
The problem I see, is to avoid splitting a value with comma. But I assume it will be possible to take the element inside the quotes as a literal
Quotation handling should take precedent over splitting on the separator, be it a comma or another sep value passed to .splitCsv, but also possibly over newline handling as well. Typical CSV parse implementations handle the latter differently, but nonetheless this is a more general problem with .splitCSV than just handling commas.
Anyway, comma handling might be solved just with the regex above, then stripping quotes if they exist from the tokens.
Yes, you are right. This is a bug.
@pditommaso Thank you for the quick response and fix. I can confirm this is working for my pipelines using
NXF_VER=19.04.0-SNAPSHOT nextflow run ...