I need to be able to parse both CSV and TSV files. I can’t rely on the users to know the difference, so I would like to avoid asking the user to select the type. Is there a simple way to detect which delimiter is in use?
One way would be to read every line, count the tabs and commas in each, and see which character appears the same number of times on every line. Of course, the data itself could contain commas or tabs, so that may be easier said than done.
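A rough sketch of that counting idea in Python, in case it helps frame the question. The function name `guess_delimiter` and the scoring rule (the share of lines that agree on the most common count) are assumptions for illustration, not an established method, and quoted fields containing the delimiter are not handled:

```python
from collections import Counter


def guess_delimiter(lines, candidates=(",", "\t")):
    """Return the candidate whose per-line count is most consistent, or None."""
    best = None
    best_key = (0.0, 0)
    for delim in candidates:
        # Count occurrences of this delimiter on each non-blank line.
        counts = [line.count(delim) for line in lines if line.strip()]
        if not counts:
            continue
        # The most common per-line count, and how many lines share it.
        modal_count, freq = Counter(counts).most_common(1)[0]
        if modal_count == 0:
            continue  # most lines do not contain this character at all
        # Prefer higher agreement between lines; break ties by more fields.
        key = (freq / len(counts), modal_count)
        if key > best_key:
            best, best_key = delim, key
    return best


sample = [
    "name\tcity, state\tage",
    "Ann\tPortland, OR\t31",
    "Bob\tAustin\t45",
]
print(repr(guess_delimiter(sample)))
```

Here the tab count is 2 on every line while the comma count varies, so the tab wins even though the data contains commas. Python's standard library also has `csv.Sniffer().sniff(sample, delimiters=...)`, which makes a similar guess and may be worth trying first.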
Edit: Another fun aspect of this project is that I will also need to detect the schema of the file when I read it in because it could be one of many. This means that I won’t know how many fields I have until I can parse it.
You could show them the results in a preview window, similar to the way Excel does it. It's pretty clear in that case when the wrong delimiter is being used. You could then let them choose from a list of delimiters and have the preview update in real time.
Then you could just make a simple guess as to the delimiter to start with (e.g. whichever of a comma or a tab appears first in the file).
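That starting guess could look something like the following minimal Python sketch. The helper name `first_delimiter` is hypothetical, and it assumes you pass in the first line or first chunk of the file as a string:

```python
def first_delimiter(text, candidates=(",", "\t")):
    """Return whichever candidate appears earliest in text, or None if neither does."""
    # str.find returns -1 when the character is absent, so drop those.
    positions = {d: text.find(d) for d in candidates}
    found = {d: pos for d, pos in positions.items() if pos != -1}
    if not found:
        return None
    # The candidate with the smallest index came first.
    return min(found, key=found.get)


print(repr(first_delimiter("id\tname, title\n")))
```

This is only meant to seed the preview; the user's selection in the preview window would override it when the guess is wrong.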