I think due to some recent changes default culture (and delimiter) is determined by CultureInfo.CurrentCulture and setting Configuration.CultureInfo = CultureInfo.InvariantCulture does not alter the ListSeparator value.
I recommend against such approach. Based on particular developer local settings ListSeparator may vary unless explicitly set in code. This is actually the issue we experienced. Using InvariantCulture instead of CurrentCulture in Configuration class constructor may be much better choice.
I want things to work without change by default. That means in cultures where the ListSeparator is not ,, it needs to read/write CSV files using that cultures separator. If you've made some changes to your default culture settings, then you'll have to configure CsvHelper to work that way.
There is now a constructor on Configuration where you can pass in the CultureInfo and the Delimiter is set to that. It's a one time default setting and changing the culture later won't affect it.
The point is CSV stands for Comma Separated Values. Although some developers may want to use another separator based on regional settings, I would not say it is predominant trend or recommended approach.
Given software solution typically uses predefined list separator. Setting the default value based on local user settings is highly error prone. I know from own experience.
I agree with you completely and would like to do this. Based on what people in other cultures have told me, the files they consume are all written with the culture specific list separator. I changed it so it would work for most people by default.
What about the CultureInfo setting? That determines formatting of strings. I would think this should be the CurrentCulture, right?
I guess, vox populi, vox dei. Maybe it is just me - usually involved with software used internationally while csv writing/reading is not part of the UI. Anyway, just my 3 cents. My problem is solved. From the API design perspective I would probably go for something like ParsingCulture but I agree, it sounds a bit awkward.
Do you have examples of any software that doesn't use ListSeparator? I'm trying to find some. Excel does. Apparently this is because other languages will use a comma for a decimal separator. With CSV this doesn't really matter because of quoted fields.
I'm trying to make an argument to change the default to be , and not based on any sort of culture.
I would like the library to use a , as default since that is what the spec says it should.
I was just fixing code parsing MS Azure usage CSV files. After upgrade involving packages nuget updates for some users software worked, for some it did not. The use of CurrentCulture for ListSeparator turned out to be the issue.
Hi Josh and Tomasz,
The example I have is that the decimal point is a comma, and the delimiter is a semi-colon in the de-DE culture. If I remember correctly from what my coworker has said, Excel would open such a delimited file. So using comma would possibly break that. That said, I'm curious about what Josh said about quoting fields and continuing to use comma anyway in such cultures.
I don't have time to for another few days because of the holidays, but I'd be curious to boot up a de-DE setup VM and see what happens: Does Excel open a comma delimited .csv file as long as values are quoted, particularly if they contain decimal points? (downside is that the file gets much larger if you quote all/most fields when you kind of don't have to).
If so, then comma as the default always, but possibly continuing to observe current culture but also setting Delimiter to comma, might be fair game. At the end of the day though, if you are only reading files and know your files will have a specific format always and forever, is it horrible to pass in a CultureInfo, such as InvariantCulture? Also, sometimes the hard part is producing files that will work in multiple cultures, not just reading files programmatically. My company has certainly struggled with producing comma delimited .csv files and having European customers complain that the file won't open, or doesn't open correctly, in Excel.
Hi, if this library, an excellent piece of work, main purpose is to play well with Excel then using local user settings by default is the right choice. However, 1) Excel does not stick to CSV RFC 4180 which causes a lot troubles in multinational environments. I know first hand. 2) There is big world outside Windows & MS Office.
I can appreciate that Excel isn't the only program used to view CSV files, but I'd speculate it is one that is quite frequently used, and is available on more than Windows. I know my customer base almost always has Excel. But I think we can all agree that file specs are more important than coding to a particular implementation. I did not mean to suggest that Excel is the only client program to code for, nor that it's a perfect program either. Sometimes I hate Excel.
I've looked up CSV RFC 4180, and whether or not it's the IETF or Wikipedia, they both are quoted to say that CSV is not an actual, locked down spec. That would make things too easy; one way to write, one way to read and every one is happy.
IETF:
While there are various specifications and implementations for the CSV format... there is no formal specification in existence, which allows for a wide variety of interpretations of CSV files.
…
Due to lack of a single specification, there are considerable differences among implementations. Implementors should "be conservative in what you do, be liberal in what you accept from others" when processing CSV files.
Wikipedia:
In addition, the term "CSV" also denotes some closely related delimiter-separated formats that use different field delimiters, for example, semicolons. These include tab-separated values and space-separated values. A delimiter that is not present in the field data (such as tab) keeps the format parsing simple. These alternate delimiter-separated files are often even given a .csv extension despite the use of a non-comma field separator. This loose terminology can cause problems in data exchange. Many applications that accept CSV files have options to select the delimiter character and the quotation character. Semicolons are often used in some European countries, such as Italy, instead of commas.
In short, CSV does not always mean comma separated values, for better or worse.
That unfortunately means Josh and other CSV library writers/maintainers have to sometimes make hard decisions, possibly aided by using various readers, including, but not limited to Excel. Do you have suggestions of programs to test? I think that would benefit the discussion.
To be liberal in what is accepted, as the IETF suggests, might mean probing a file with multiple delimiters until sensible data is found, or allowing users to specify the delimiter when _reading_ files. You might get the right delimiter eventually, but might still parse decimals wrong if you look for dot but got comma instead.
using System.Globalization;
> CultureInfo de = new CultureInfo("DE-de");
> de
[de-DE]
> string input = "21,1";
> double.Parse(input)
211
> double.Parse(input, de)
21.1
However, being conservative in what you do, as the IETF suggests, might in fact mean always using comma-delimiter when _writing_ files. But that won't solve decimal values using dot versus comma. Should European customers be forced to use a dot instead of comma? Should we respect the current culture, but bloat the file big time with lots of quotes so comma can mean decimal separator and delimiter? Honestly, _is CSV a good choice for internationalization support_? It's starting to look like it's not!
In the end, it depends... this discussion is for what the best default is, and I honestly don't know what the best default is. Each developer will know what's best for their app, but CsvHelper does need a default. CurrentCulture feels like a good choice, but certainly isn't always the right choice.
This message is getting too long. I can't say "do A, that's best", or "do B, that's best". But hopefully my comments can help someone decide. 🙂
Thanks for the input! I'm looking for a discussion to help me make a good decision __and stick with it__. This is one of the subjects I waffle on, and I need to just stick with a choice, but I need to have a good reason why I'm making that choice.
I actually have Conservative when writing, liberal when reading. as a listed feature of the library. That is why I changed it to use ListSeparator as the default. If a file was created in another program in another culture, more than likely it will have used that instead of a comma.
It should definitely output a comma when writing though, because it should conservative when writing. I want it to generate a spec valid file when writing. This means that when reading, it needs to also use a comma for a default. It needs to be able to read the file it just wrote with default settings.
I can't do any sort of detection because it's a forward only reader on top of a TextReader. It's possible to add another class/method that can try and do some detection though, which would be separate from reading.
Here is what I'm going to do.
__Default__
__Culture passed into constructor__
@AdamDotNet What should be done here? Since the default won't be using ListSeparator anymore, it feels odd that it's grabbing it in this case. It seems like it should be a second required parameter. I believe your original idea was that it should set everything culture specific in the system when you pass it in, but I believe the only culture specific thing will be the culture itself now after this. What are your thoughts?
$0.02 - I have worked in analytic software for the last 18 years with worldwide customers. In addition to our software, they work with CSVs in Excel, SAS, SPSS, R, etc. In all these cases, they're using files formatted with their local culture (semicolons if that is the case).
If you force a comma and make all non-US customers quote the fields, you risk other software packages interpreting all the fields as string instead of numbers and throwing errors.
Standards are good things, but useless if the industry doesn't use them. In this case the question is whether you want to appeal to a standard or what people and other software actually use.
If I were designing a CSV package like this, I would make the CultureInfo a required parameter when constructing the reader/writer and use that for formatting as well as field separation. If I had an overloaded version without a CultureInfo, then it would simply be defaulting to CultureInfo.InvariantCulture.
@JoshClose I'm going to mostly defer to @sparticus1701 on this one, regarding forcing comma delimiter. We have a few more software examples of CSV behavior, so I'd be a tad more hesitant to force comma delimiter in all cultures.
If you move forward anyway though, I just want to clarify what you're asking. Do you mean make CultureInfo a required parameter to the constructor? Or the delimiter? Not quite sure which one you meant by
It seems like it should be a second required parameter.
Also, if you move forward with always comma delimiter by default, I'd version the package to 13.x instead of a minor bump to 12.x, since that's a behavioral breaking change.
Note we are discussing the default behavior. Nobody is forced to use this or the other settings and changing them is a matter of adding two extra lines of code.
In any case, I suggest the tool by default is able to read what it wrote.
Forcing developer to make a conscious decision (constructor) may be a good solution.
Personally I prefer consistency (InvariantCulture over Local) to make my live easier and software less dependent on culture settings but for someone dealing with only one culture opposite may be true.
I agree that forcing a conscious decision by the developer might be a good solution. Alternatively, I'd prefer the defaults to not cause any surprises (just use InvariantCulture). The current defaults risk the developer unknowingly depending on system culture settings.
I just ran into this because my culture is set to nl-NL where the ListSeparator is ;. My file is comma separated.
The _Getting Started_ page on the project website is incorrect for any user where the ListSeparator is not a comma. It took me a while to figure out what was going on here. Following the first example and enumerating the records results in a CsvHelper.HeaderValidationException:
'Header with name 'Id' was not found. If you are expecting some headers to be missing and want to ignore this validation, set the configuration HeaderValidated to null. You can also change the functionality to do something else, like logging the issue.'
I'd also like to add that Excel indeed seems to also use the list separator by default, but if you open the file from inside Excel itself instead of with a double-click in Explorer it will ask you how to interpret the data (delimiter, decimal separator, quotation mark, date format, etc.). It's quite flexible. So using a comma delimiter by default is not necessarily a problem for Excel users in regions where the culture settings don't match InvariantCulture.
I suggest to have 3 different constructors.
Something like that. The default delimiter should be a comma, for very obvious reasons and it should not be the culture specific list separator by default.
My very simplistic thoughts...
For me, setting the delimiter to CultureInfo by default is not a good strategy.
I have cases where a unit test that reads a CSV file breaks depending on the culture of the build agent it runs under.
The other thing is, that if a back-end system needs to parse CSV files received from different vendors, the CSV Service should set the delimiter explicitly (if other than a comma) based on the document it needs to parse. A person in one culture can generate a CSV that is then parsed by a system running under a completely different culture.
I have also had Linux installations where the ListSeparator was blank by default, causing very hard to debug situations.
I also think reading the delimiter from the locale settings is a bad default. This pretty much prevents CSV files from being interoperable across locales. The default should be ,.
@mshindal I agree, unless CsvHelper is actually meant to be ExcelExportReader - as someone suggested.
We just stumbled upon this (again) because program failed on a machine with locale en-us while we use semicolon for most files (living in de-de). The default settings were used.
Normally we set Delimiter explicitly everywhere but sometimes a developer forgets about it. The trappy thing is that everything works "fine" on some machines (e.g. the developer's or the build server) if the locale is good and will fail elsewhere -- at least if you need identical input/output files (backend system). Therefore I personally would favor an invariant behavior of the delimiter.
But I see the user perspective (e.g. Excel) and there is another reason pro "current culture": Type mapping (like floating point or date time values) probably should use "current culture" by default -- at least this is what .NET does by default.
There probably will never be a good default. This brings me to another "rebellious" proposal:
We could make the Culture an obligatory argument for creating a reader or writer. I'm aware that this is a hard step and many developers would not like it. But I really think this would remove runtime errors. We can make them compile time errors (which IMHO is a good thing).
Example solution:
Make current default constructor(s) of reader/writer (and eventually configuration) without culture setting obsolete.
To make this less ugly (not forcing everyone to build a complicated CsvReader with Configuration with Culture explicitly) there could be two or three factory methods for the most common cases:
CsvReader.CreateUsingInvariantCulture(TextReader) // backend use, stable, comma, en-us
CsvReader.CreateUsingCurrentCulture(TextReader) // Excel, current culture
maybe CsvReader.Create(TextReader, CultureInfo)
I wouldn't mind another change in the library -- it changed before multiple times (IIRC we had it fixed to comma some releases ago). I'm aware that this could make it a bit more ugly when Configuration object comes into play. We should make its CultureInfo setting obsolete, then.
Alternative solution: Make it obligatory to give a Configuration object to constructor of CsvReader and make it obligatory to give culture decision to constructor of Configuration Supply some static factory methods (for ease of use / transition of current code):
static Configuration.UseCurrentCulture => new Configuration(CultureInfo.Current)
static Configuration.UseInvariantCulture => new Configuration(CultureInfo.Invariant)
Regarding constructors vs. factory methods: might be useful to make important properties obligatory in constructor and use factory methods for "short cuts".
What do you think?
I don't like having to create a configuration object ahead of time to pass in. I like being able to do
var csv = new CsvReader(textReader);
csv.Configuration.Foo = "bar";
A lot of the time you won't need any configuration setting changes.
I want the library to be able to write and read back it's own data by default. This would mean
, for delimiter.CultureInfo.InvariantCulture for the culture.This would ensure that no matter what your local culture settings are, you'll be able to, by default, write something one day then read it another on another computer in another part of the world.
I think there are 2 ways to achieve this.
CultureInfo constructor parameter.I don't really like the idea of having to always set a CultureInfo, so I'm leaning towards the first option.
@JoshClose that is even better
@JoshClose: I really understand your preference and if you want it very simple and basic (like now) this (fully invariant by default) is IMHO better than the current (fully currentculture).
But both variants will probably annoy a relevant part of users.
I agree that creating a reader/writer with/without configuration should be as easy as now. You would get this _and_ more clarity using the factory pattern:
// use comma / invariant culture
var csv = CsvReader.CreateInvariant(textReader);
csv.Configuration.Foo = ...
// use current culture
var csv = CsvReader.CreateUsingCurrentCulture(textReader);
csv.Configuration.Foo = ...
// be explicit new style
var csv = CsvReader.Create(textReader, CultureInfo.GetCulture("de-DE"));
csv.Configuration.Foo = ...
// be explicit old style
var configuration = new Configuration(CultureInfo.GetCulture("de-DE"));
var csv = new CsvReader(textReader, configuration);
Setting csv.Configuration.Delimiter should always be allowed afterwards.
But I think setting Configuration.CultureInfo should be marked Obsolete because it's ambivalent whether this should overwrite Delimiter (I remember this was an issue in the past). One should pass the Culture to the factory function somehow (or giving a configuration object with cultureinfo to constructor of CsvReader).
User should be discouraged (meaning [Obsolete] now, maybe remove later) to create a reader without making an explicit culture decision in one way (see above).
IMHO this solution would have these advantages:
What do you think?
BTW: Forcing users to make an explicit choice has another "pro" argument: Maybe many users just don't see that there might be a problem with culture (either way). So having different factory methods would make clear: "you have to make a choice to make things work as you intend".
Would the delimiter default to CultureInfo.ListSeparator then?
That's one possibility (probably making people of Excel party happy).
In "de-DE" local csv files with semicolon are probably more common. At least this makes sense because comma is our decimal separator and many tools are bad at quoting.
Sounds like that wouldn't be a consistent solution then.
I think it's easy enough for people to set the ListSeparator themselves.
@JoshClose, @StefanBertels both solutions (mandatory constructor or CultureInfo.Invariant) are way better than the current one, which is very error prone. That's why I got involved with this discussion. Which of the two proposed solutions is selected is less important. Flip the coin :)
This will be in 13.0.0 on NuGet.
Thanks a lot, looks good to me. As far as I can see you didn't add any shortcuts (like CsvReader.CreateInvariant) but this is totally fine. (If there are any, maybe add info here / to the docs).
I think this issue can be closed.
I may add shortcuts in the future. Didn't really seem to need them though.
Most helpful comment
This will be in 13.0.0 on NuGet.