Nextflow: splitCsv does not handle quoted values containing commas correctly

Created on 5 Apr 2019 · 6Comments · Source: nextflow-io/nextflow

Bug report

It appears that .splitCsv does not handle quoted tokens containing commas correctly.

Expected behavior and actual behavior

Given a token wrapped in quotes "1,234", I would expect .splitCsv to parse this as a single value [.., '1,234', ...]. However, it breaks this token into two, leaving a singular quote with each half [..., "1, 234", ...].

Steps to reproduce the problem

Using the following test case above as test.csv

value_1,value_2,value_3
100,10.1%,300
"9,600",98.9%,"24,155"

and parsing as

Channel
    .fromPath("test.csv")
    .splitCsv(header: true, quote: '\"')
    .subscribe { println "$it" }

I've tried a variety of values of quote above to no avail (I'm new to Groovy so advice on how to best specify this option would be helpful).

Program output

Expected output

[value_1: 100, value_2: 10.1%, value_3: 300]
[value_1: '9,600', value_2: 98.9%, value_3: '24,155']

Actual output

[value_1: 100, value_2: 10.1%, value_3: 300]
[value_1: "9, value_2: 600", value_3: 98.9%]

Environment

Nextflow version: 19.01.0.5050
Java version: 1.8.0_73
Operating system: CentOS 6.7

Additional context

I think the issue is related to how StringUtils.splitPreserveAllTokens is used in csvSplitter.groovy. The line is tokenized by , first then quotes are removed, but the tokenization by .splitPreserveAllTokens doesn't respect quotes:

import org.apache.commons.lang.StringUtils
line = '"9,600",86.3%,"23,000"'
println StringUtils.splitPreserveAllTokens(line, ",")

yields 5 tokens:

["9, 600", 86.3%, "23, 000"]

Perhaps regex is needed to extract the proper tokens (example), then quotes can be removed from them:

line = '"9,600",86.3%,"23,000"'
println line.split(",(?=(?:[^\"]*\"[^\"]*\")*[^\"]*\$)", -1)

yields the desired 3 tokens

["9,600", 86.3%, "23,000"]

kinbug

Source

wflynny

All 6 comments

I'm not sure why you want this behaviour...CSV stands for "Comma-separated values"
Maybe could be nice to be able to indicate the splitter character on the .splitCsv
But I'm not sure it makes sense to not split when a comma is found (if any other splitter char is defined)

edgano on 5 Apr 2019

@edgano Well, for one, the tools I'm using generate CSV files with quoted values. Further, it seems to be a widely accepted interpretation of the CSV specification.

wflynny on 5 Apr 2019

👍1

Yep, the quotes are not the "problem"...
The problem I see, is to avoid splitting a value with comma. But I assume it will be possible to take the element inside the quotes as a literal

edgano on 5 Apr 2019

Quotation handling should take precedent over splitting on the separator, be it a comma or another sep value passed to .splitCsv, but also possibly over newline handling as well. Typical CSV parse implementations handle the latter differently, but nonetheless this is a more general problem with .splitCSV than just handling commas.

Anyway, comma handling might be solved just with the regex above, then stripping quotes if they exist from the tokens.

wflynny on 5 Apr 2019

Yes, you are right. This is a bug.

pditommaso on 5 Apr 2019

👍1

@pditommaso Thank you for the quick response and fix. I can confirm this is working for my pipelines using

NXF_VER=19.04.0-SNAPSHOT nextflow run ...

wflynny on 10 Apr 2019

👍1

Was this page helpful?

0 / 5 - 0 ratings

Related issues

Channel join loses duplicate keys

stevekm · 5Comments

Singleton list conversion cause quirk behaviour

apeltzer · 7Comments

Allow access to manifest scope during workflow execution

ewels · 4Comments

Println operator output is not shown with ansi logging

olgabot · 7Comments

ignite error on aws EC2

rsuchecki · 3Comments