Nextflow: Add a keepHeader option to collectFile

Created on 11 Oct 2017  ·  16 Comments  ·  Source: nextflow-io/nextflow

In many cases, I want to collect several csv files that share the same structure. The ideal operator would skip their headers while collecting the files and then reattach the first header it encountered before emitting the collected file.
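
The requested behaviour can be sketched in plain shell, outside of Nextflow. This only illustrates the intended semantics and is not the operator's implementation; the function name `merge_keep_header` is hypothetical, and the sketch assumes every input file starts with exactly one header line.

```shell
#!/usr/bin/env bash
# Hypothetical helper: concatenate CSV files that share a one-line header,
# keeping only the header of the first file.
merge_keep_header() {
    # NR counts lines across all inputs; FNR restarts at 1 for each file.
    # "FNR == 1 && NR != 1" therefore matches the first line of every
    # file except the first one, and those lines are skipped.
    awk 'FNR == 1 && NR != 1 { next } { print }' "$@"
}
```

It could be called as `merge_keep_header part_*.csv > collected.csv`, where the file names are placeholders.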

All 16 comments

+1

+1. For some workloads, such as BLAST against the nr database, we can split the FASTA input and run the chunks separately. However, removing the redundant headers afterwards costs an extra process.

👍
Had to do this more times than I'd care to admit :)

"""
# write the header first, then append the collected (header-less) content
printf "${HEADER}" > out_temp
cat ${out} >> out_temp
"""

The only problem I see here is how to handle many CSV files split into many chunks that may have different headers. Which header is supposed to be applied to the resulting collected files: always the first one, or the last one processed?

Honestly, I don't know. I am personally interested in the case where I'm collecting several CSV files that all have the same header. Perhaps a first idea could be to collect all the headers, compare them, and emit a warning if they don't match.
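
The compare-and-warn idea could look roughly like this in shell. This is a sketch of the suggestion only, not something Nextflow does; `check_headers` is a hypothetical name, and a header is assumed to be the first line of each file.

```shell
#!/usr/bin/env bash
# Hypothetical helper: compare the first line of every file with the first
# line of the first file, and print a warning on stderr for each mismatch.
# Returns 0 if all headers match, 1 otherwise.
check_headers() {
    local reference header file status=0
    # "|| true" tolerates an empty file or a missing trailing newline
    IFS= read -r reference < "$1" || true
    for file in "$@"; do
        header=""
        IFS= read -r header < "$file" || true
        if [ "$header" != "$reference" ]; then
            printf 'WARNING: header of %s differs from %s\n' "$file" "$1" >&2
            status=1
        fi
    done
    return "$status"
}
```

A caller could run `check_headers part_*.csv` before merging and decide whether a non-zero status should stop the merge or just be logged.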

A bit too complicated. I would go with the assumption that they are equal, so only the first is retained and applied to the collected files.

I'd love that feature!

And now, the most difficult question: keepHeader or keepHeaders? 😄

keepHeader of course! We want to preserve one header after all..😉

I would use keepHeader :) But then you have to specify either the number of rows or the presence of a character able to identify the header (like #)

L

keepHeader won.

> you have to specify either the number of rows or the presence of a character able to identify the header

Not sure I understand.

Examples:

Name    Surname
Luca    Cozzuto
Paolo    Di Tommaso

So here the header is row 1

Again:

This is the wonderful format made by some crazy bioinfo
Name    Surname
Luca    Cozzuto
Paolo    Di Tommaso

So here the header is composed of rows 1-2

Finally

# Let's add some variable rows because you know there is
# nothing better than some strange text here
# Name    Surname
Luca    Cozzuto
Paolo    Di Tommaso

So here the variable header is described by a prefix (#).

PS: I hope there won't be a combination of the two situations :)

L
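
The two ways of identifying a header described in the comment above (a fixed number of rows, or a prefix character) can be sketched in shell as follows. Both function names are hypothetical, and this is an illustration of the idea rather than how `keepHeader` is implemented.

```shell
#!/usr/bin/env bash
# Hypothetical helper: the header is the first N lines of each file.
# All lines of the first file are kept; the first N lines of every
# other file are dropped.
merge_skip_lines() {
    local n="$1"
    shift
    # NR == FNR holds only while awk is reading the first file
    awk -v n="$n" 'NR == FNR || FNR > n { print }' "$@"
}

# Hypothetical helper: header lines are those starting with a prefix
# such as "#". All lines of the first file are kept; in every other file,
# any line starting with the prefix is dropped (not only leading ones).
merge_skip_prefix() {
    local prefix="$1"
    shift
    # index($0, p) == 1 means the line starts with the prefix
    awk -v p="$prefix" 'NR == FNR || index($0, p) != 1 { print }' "$@"
}
```

For the second example above, `merge_skip_lines 2 a.tsv b.tsv` would keep the two header rows of `a.tsv` only. The prefix variant handles headers of varying length, at the cost of also dropping prefixed lines that appear in the body of later files.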

ummmmmmmmmmm, multi-line headers !

I think I'm going to the beer session ..

Available for test in version 0.27.0-SNAPSHOT.

This was more tricky than expected. You guys owe me a 🍺 or a ☕️ at your choice. 😄
