When trying to parse data from https://www.ndbc.noaa.gov/data/latest_obs/latest_obs.txt , the provider substitutes `MM` for fields that are missing a value. The CSV parser cannot handle this data: when it hits an `MM`, it throws a field conversion error.
It would be nice to be able to define a set of values to "ignore" when parsing CSV.
There is currently no way to configure the parser to ignore missing-value placeholders.
What I'd like is a way to ignore specific values when parsing CSV data where a placeholder is substituted for a missing value, such as in the link above.
This would allow me to parse additional CSV data without needing to pre-process it or write a custom script.
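Until such an option exists, one workaround is to pre-process the file and blank out the placeholder values before the parser sees them. A minimal sketch (not part of the plugin; it assumes a comma-delimited file and uses `MM` as the placeholder, per the NDBC feed above):

```python
import csv
import io

# Placeholder values the hypothetical "ignore list" would cover.
IGNORE_VALUES = {"MM"}

def scrub(raw_text, ignore=IGNORE_VALUES):
    """Replace placeholder cells with empty strings so downstream
    type conversion can skip them instead of erroring."""
    out = io.StringIO()
    reader = csv.reader(io.StringIO(raw_text))
    writer = csv.writer(out)
    for row in reader:
        writer.writerow("" if cell in ignore else cell for cell in row)
    return out.getvalue()

print(scrub("WDIR,WSPD\n120,MM\n"))
```

This only moves the problem from "unparseable token" to "empty cell", which is exactly the case discussed below.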
Seconded!
@pierwill Would you be able to add a quick description of how this works in your dataset, if by luck it is a public dataset a link would be great too.
I'm using data from https://metrics.torproject.org/bandwidth.csv with the following config:
```toml
[[inputs.file]]
  files = ["data/raw/bandwidth.csv"]
  data_format = "csv"
  csv_header_row_count = 1
  csv_column_names = ["date", "advbw", "bwhist"]
  csv_column_types = []
  csv_skip_rows = 0
  csv_skip_columns = []
  csv_delimiter = ","
  csv_comment = "#"
  csv_trim_space = false
  csv_tag_columns = []
  name_override = "bandwidth"
  csv_timestamp_column = "date"
  csv_timestamp_format = "2006-01-02"
```
A typical error I'm getting as a result of missing data is:

```
[inputs.file] Error in plugin: column type: parse float error strconv.ParseFloat: parsing "": invalid syntax
```
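The error is simply strict numeric conversion rejecting an empty cell. A Python analogue of what Go's `strconv.ParseFloat` is doing here:

```python
# An empty cell cannot be converted to a float; this is the same
# failure mode the parser reports (Go's strconv.ParseFloat behaves
# the same way on "").
try:
    float("")
except ValueError as err:
    print(err)  # could not convert string to float: ''
```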
Your case is even more clear-cut to me. Any time a column is empty it shouldn't be an error; instead we should just skip the field on that line. I think we could do this safely across the board.
So this:

```
date,advbw,bwhist
2007-10-27,1.917726488,
2019-09-01,,196.773939248
2019-09-02,408.762870552,194.191607076
```

leaving off the timestamp, should look like:

```
bandwidth advbw=1.917726488
bandwidth bwhist=196.773939248
bandwidth advbw=408.762870552,bwhist=194.191607076
```
Would this resolve an issue that I'm currently encountering?
I'm parsing a CSV generated from an application's logs. Some cells are populated with a float, and when no data was recorded they're blank. As a result, when a row containing a blank cell is written to a field that is set to a float type, I'm getting the error below.

```
2020-04-01T17:01:50Z E! [outputs.influxdb] When writing to [http://database:8086]: received error partial write: field type conflict: input field "speed" on measurement "file" is type string, already exists as type float dropped=4; discarding points
```
@astro-arphid Yes, I believe so.