I am writing a program that needs to parse a bunch of text files generated by some third-party software. Some of these files will be generated in France, where something like “1,5” means “one and a half”. Other files will be generated in the US, where “1,5” is not a number, and “one and a half” is “1.5”. Of course, “1,234.5” is a legitimate number in the US.
These are just examples; in reality, my program needs to deal with a variety of numbers in a variety of locales; it needs to handle things like “e-5” and “2e10”, etc. Unfortunately, there’s no way to know ahead of time which file comes from which locale.
Is there a commonly accepted solution to this problem in C#? I realize that I can write my own number-parsing code, but I’d prefer to avoid that unless there’s no other way.
Since each input file was generated in a single locale, you can treat the problem as detecting that locale from the file before actually parsing it. It’s an extra requirement that results from the inadequate input files (which should all use one agreed locale or have a field to specify the locale used).
Language detection is not a complete solution as number formatting is not language-specific but locale-specific. Here is an example: If you detect the language as Spanish, would that be es-ES (Spain) or es-MX (Mexico)? In the former case, the decimal separator is a comma (1,23). In the latter, the decimal separator is a period (1.23).
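To make the es-ES versus es-MX difference concrete, here is a minimal sketch that asks .NET for each culture's decimal separator. The class and method names (`SeparatorDemo`, `DecimalSeparatorOf`) are made up for illustration, and the values returned come from the culture data installed on the machine, so they can differ if the runtime is configured for invariant globalization.

```csharp
using System.Globalization;

public static class SeparatorDemo
{
    // Hypothetical helper: looks up the decimal separator that .NET
    // associates with a culture name such as "es-ES" or "es-MX".
    public static string DecimalSeparatorOf(string cultureName)
    {
        NumberFormatInfo format = CultureInfo.GetCultureInfo(cultureName).NumberFormat;
        return format.NumberDecimalSeparator;
    }
}
```

With standard culture data, `SeparatorDemo.DecimalSeparatorOf("es-ES")` should give a comma and `SeparatorDemo.DecimalSeparatorOf("es-MX")` a period, even though both cultures share the same language.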
The solution would be heuristics-based. The simplest is probably this: if you know what your locale generally is (e.g. most of your users use the period), keep an ordered list of culture identifiers and try them one after the other until you find one that can interpret all the numbers in the file. It could be as simple as starting with en-US and, failing that, trying a comma-decimal culture such as fr-FR or de-DE (en-GB would not help as a fallback, since it formats numbers the same way as en-US). For numbers there are only a handful of distinct formats, so the list can stay short.
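The ordered-list idea could be sketched roughly as follows. This is only an illustration under a few assumptions: the names `LocaleSniffer`, `DetectCulture` and `TryParseStrict` are invented here, the numbers are assumed to be already split into tokens, and groups are assumed to be three digits wide. The extra grouping check is there because `double.TryParse` with `NumberStyles.AllowThousands` does not validate where group separators sit, so “1,5” would otherwise parse as 15 under en-US and the US culture would wrongly match a French file.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;
using System.Linq;

public static class LocaleSniffer
{
    // Hypothetical helper: returns the first candidate culture under which
    // every token parses, or null if none fits. Order expresses preference.
    public static CultureInfo DetectCulture(IEnumerable<string> tokens, params string[] candidateNames)
    {
        List<string> list = tokens.ToList();
        foreach (string name in candidateNames)
        {
            CultureInfo culture = CultureInfo.GetCultureInfo(name);
            if (list.All(t => TryParseStrict(t, culture, out _)))
                return culture;
        }
        return null;
    }

    // Parses with the culture's separators, then rejects tokens whose group
    // separators are not placed every three digits (assumption: 3-digit groups).
    public static bool TryParseStrict(string token, CultureInfo culture, out double value)
    {
        const NumberStyles styles = NumberStyles.Float | NumberStyles.AllowThousands;
        if (!double.TryParse(token, styles, culture, out value))
            return false;

        NumberFormatInfo format = culture.NumberFormat;

        // Isolate the integer part: drop sign, exponent and fraction.
        string s = token.Trim().TrimStart('+', '-');
        int exponent = s.IndexOfAny(new[] { 'e', 'E' });
        if (exponent >= 0) s = s.Substring(0, exponent);
        int dec = s.IndexOf(format.NumberDecimalSeparator, StringComparison.Ordinal);
        if (dec >= 0) s = s.Substring(0, dec);

        string[] groups = s.Split(new[] { format.NumberGroupSeparator }, StringSplitOptions.None);
        if (groups.Length == 1)
            return true; // no grouping used, nothing more to check

        // Leading group has 1 to 3 digits; every later group has exactly 3.
        return groups[0].Length >= 1 && groups[0].Length <= 3
            && groups.Skip(1).All(g => g.Length == 3);
    }
}
```

For example, `LocaleSniffer.DetectCulture(new[] { "1,5", "2e10", "1.234,5" }, "en-US", "de-DE")` should settle on de-DE, while a file containing “1.5” and “1,234.5” should match en-US on the first try. Some inputs stay ambiguous whatever you do: a file whose only separator-bearing number is “1,234” is valid in both cultures, so the order of the candidate list decides.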