Deno: [std/encoding/csv] Omit header row when specifying header options

Created on 11 Sep 2020  路  15Comments  路  Source: denoland/deno

It doesn't appear that this kind of behavior is documented: https://deno.land/[email protected]/encoding#csv

Example:

data.csv:

size,cost
m,12
s,10
l,14

main.ts:

import {parse} from 'https://deno.land/[email protected]/encoding/csv.ts';

const parseOptions = {
  header: [
    {
      name: 'size',
      parse: (string: string) => string.toUpperCase(),
    },
    {
      name: 'cost',
      parse: (string: string) => Number(string),
    },
  ],
};

const raw = await Deno.readTextFile('data.csv');

const result = await parse(raw, parseOptions);
console.log(result);

output:

% deno run --allow-read=. --allow-write=. main.ts
[
  { size: "SIZE", cost: NaN },
  { size: "M", cost: 12 },
  { size: "S", cost: 10 },
  { size: "L", cost: 14 }
]

As you can see, the resulting array includes the header row, but I can't figure out how to omit it when parsing using header options.

If this is not possible, please consider this a feature request. If I just missed it, please point me in the right direction. Thanks!

All 15 comments

You can resolve that problem using header: true and parse option:

import {parse} from 'https://deno.land/[email protected]/encoding/csv.ts';

const parseOptions = {
  header: true,
  parse(row: any) {
    return {
      size: row.size.toUpperCase(),
      cost: Number(row.cost),
    };
  }
};
const raw = await Deno.readTextFile('data.csv');

const result = await parse(raw, parseOptions);
console.log(result);

Output:

[ { size: "M", cost: 12 }, { size: "S", cost: 10 }, { size: "L", cost: 14 } ]

@uki00a Thank you for responding.

Using that method does not allow for transforming the header names. It seems like those are currently mutually exclusive possibilities. Is it possible to do both (parse the data and provide custom header names)? That is an option that I expected from this module.

@jsejcksn

Is it possible to do both (parse the data and provide custom header names)? That is an option that I expected from this module.

There is currently no option to do so, but I think it's seems not to be hard to implement it :thinking:
Do you have any ideas/suggestions for implementing it? (e.g. option name)

Do you have any ideas/suggestions for implementing it? (e.g. option name)

@uki00a I think exists is a good candidate for the option name. If the value of exists is false or undefined then the current behavior will apply. If the value is true then the first row of data will be skipped, and the value of name will be used (if name is undefined then it will default to the value in the first row).

@jsejcksn I'm sorry for the late reply and thanks for your suggestion! Does this match your suggestion?

const parseOptions = {
  header: [
    {
      name: 'size',
      parse: (string: string) => string.toUpperCase(),
      exists: true,
    },
    {
      name: 'cost',
      parse: (string: string) => Number(string),
      exists: true,
    },
  ],
};
const result = await parse(data, parseOptions);

IMHO, I think it would make the intent clearer to use a name such as renameHeader (like node-fast-csv) :thinking: What do you think?

Does this match your suggestion?

Yes, it does.

IMHO, I think it would make the intent clearer to use a name such as renameHeader (like node-fast-csv) 馃 What do you think?

I like it as a top-level option鈥擨 think is more logical. I also wonder about the case where you'd like to change only some of the header names (not all). Here is an example:

data.csv:

date,name,price,price_unit,qty
2019-02-11T00:00:00.000Z,globe,73.99,USD,1
2019-06-04T00:00:00.000Z,lamp,29.00,USD,1

_Parse data, rename first header only:_

const parseOptions = {
  headersExist: true,
  headers: [
    {
      name: 'purchased',
      parse: (str: string) => new Date(str),
    },
    {},
    {},
    {},
    { parse: (str: string) => Number(str) },
  ],
};

const data = await parse(await Deno.readTextFile('data.csv'), parseOptions);
console.log(JSON.stringify(data, null, 2));
[
  {
    "purchased": "2019-02-11T00:00:00.000Z",
    "name": "globe",
    "price": 73.99,
    "price_unit": "USD",
    "qty": 1
  },
  {
    "purchased": "2019-06-04T00:00:00.000Z",
    "name": "lamp",
    "price": 29,
    "price_unit": "USD",
    "qty": 1
  }
]

Is this a reasonable idea, or should the user be required to input all header names when renaming even one?

const parseOptions = {
  headersExist: true,
  headers: [
    {
      name: 'purchased',
      parse: (str: string) => new Date(str),
    },
    { name: 'name' },
    { name: 'price' },
    { name: 'price_unit' },
    {
      name: 'qty',
      parse: (str: string) => Number(str),
    },
  ],
};

Further: I think using the plural headers instead of header is better since it takes an array (or boolean). And in ReadOptions, the option comma should probably be called separator.

@jsejcksn

I like it as a top-level option鈥擨 think is more logical. I also wonder about the case where you'd like to change only some of the header names (not all). Here is an example:
Is this a reasonable idea, or should the user be required to input all header names when renaming even one?

Ah, I see. I agree with the top level option :+1:
Do you have time to work on this issue? If not, I'll do it this weekend :slightly_smiling_face:

@uki00a I'm not sure, but if I can start working on it, I'll post here to let you know.

@jsejcksn I recognize that the problems this issue tries to solve are:

  • skipping the first line.
  • treating header option as the header definition.
  • renaming the column name of the header.

I've tried to implement these based on your suggestion, but it seems difficult to achieve all of these without changing the current API :cry:

So I came up with the following proposal. I'd like to hear your opinion on this.

Proposal

I would propose a new class called CSVReader to provide more fine-grained control over reading the record:

import { CSVReader } from "https://deno.land/[email protected]/encoding/csv.ts";

const rawData = await Deno.readTextFile("data.csv");
const reader = new CSVReader(rawData, {
  header: [
    {
      name: 'size',
      parse: (string: string) => string.toUpperCase(),
    },
    {
      name: 'cost',
      parse: (string: string) => Number(string),
    },
  ],
});

await reader.read(); // Skips header
for await (const record of reader) { // Iterates each record
  processRecord(record);
}

I'd like to hear your opinion on this.

@uki00a It seems like this approach would be good for streaming, but if rawData must be a string, then the reader isn't useful for streaming, right? In that case, I'm not sure that the extra complication and asynchronous API are worth the change. It also seems that all of the header names must still be supplied, instead of just some. If I have overlooked something about the design, please share it with me.

Perhaps it will be simpler just to add a boolean like ParseOptions.skipFirstRow. That is definitely the most important point, and will still allow for providing custom header names when using along with an array of HeaderOptions. (It also makes no breaking changes to the API.) Keeping it synchronous is useful if the entire string must be provided anyway.


The only other feedback I have is the same from before:

  • ParseOptions.header 鉃★笍 ParseOptions.headers
  • ReadOptions.comma 鉃★笍 ReadOptions.separator

What do you think about that?

@jsejcksn

It seems like this approach would be good for streaming, but if rawData must be a string, then the reader isn't useful for streaming, right?

I think this problem would be solved by supporting not only string but also BufReader (like the current parse() function):

const buf = new BufReader(dataSource);
const reader = new CSVReader(buf, options);

Perhaps it will be simpler just to add a boolean like ParseOptions.skipFirstRow.
That is definitely the most important point, and will still allow for providing custom header names when using along with an array of HeaderOptions. (It also makes no breaking changes to the API.)

ParseOptions.skipFirstRow looks simple and good :+1: I'll try to implement it.

ParseOptions.header :arrow_right: ParseOptions.headers

I disagree with this. The reason for this is that I think the header means the first record (singular, not plural.) in the CSV records.

ReadOptions.comma :arrow_right: ReadOptions.separator

I agree with this. I think separator makes more sense to me too.

I disagree with this. The reason for this is that I think the header means the first record (singular, not plural.) in the CSV records.

I just wanted to let you know that I found it a surprise to provide an array to a variable name that was singular. (This violated my expectations based on using other modules.) I understand referring to a _record_ using a singular name, however, here the name is not for the record, but for the data being provided (boolean or Array).

I understand. So how about changing the ParseOptions signature as follows:

  • Make header option accept only boolean.
  • Add columns option to define the header definition.

Before (current definition):

export interface ParseOptions extends ReadOptions {
  ...
  header: boolean | string[] | HeaderOptions[];
  ...
}

After:

export interface ParseOptions extends ReadOptions {
  ...
  header: boolean;
  columns: string[] | HeaderOptions[];
  ...
}

I think columns is very clear! But now it seems like HeaderOptions should be named ColumnOptions. 馃槄

Also, this looks like it will probably change quite a bit in the documentation, so I'll be happy to review/update it after the changes are settled.

@uki00a Thanks! Much easier to use now.

Was this page helpful?
0 / 5 - 0 ratings

Related issues

ry picture ry  路  3Comments

metakeule picture metakeule  路  3Comments

JosephAkayesi picture JosephAkayesi  路  3Comments

benjamingr picture benjamingr  路  3Comments

kyeotic picture kyeotic  路  3Comments