I found this code, which splits CSV fields even when they contain double quotes.
But I don't really understand the regex pattern matching.
If someone could give me a step-by-step explanation of how this expression evaluates a pattern, it would be appreciated.
"([^\"]*)"|(?<=,|^)([^,]*)(?:,|$)
Thanks
====
Old posting
This is working well for me: it matches either "two quotes and whatever is between them" or "something between the start of the line or a comma and the end of the line or a comma". Iterating through the matches gets me all the fields, even if they are empty. For instance,
the quick,"brown, fox jumps",over,"the",,"lazy dog"
breaks down into six fields (the fifth is empty, and quoted fields come back without their quotes):
the quick | brown, fox jumps | over | the | | lazy dog
import java.util.ArrayList;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class CSVParser {
    /*
     * This Pattern will match on either quoted text or text between commas, including
     * whitespace, and accounting for beginning and end of line.
     */
    private final Pattern csvPattern = Pattern.compile("\"([^\"]*)\"|(?<=,|^)([^,]*)(?:,|$)");
    private ArrayList<String> allMatches = null;
    private Matcher matcher = null;
    private String match = null;
    private int size;

    public CSVParser() {
        allMatches = new ArrayList<String>();
        matcher = null;
        match = null;
    }

    public String[] parse(String csvLine) {
        matcher = csvPattern.matcher(csvLine);
        allMatches.clear();
        String match;
        while (matcher.find()) {
            match = matcher.group(1);
            if (match != null) {
                allMatches.add(match);
            }
            else {
                allMatches.add(matcher.group(2));
            }
        }
        size = allMatches.size();
        if (size > 0) {
            return allMatches.toArray(new String[size]);
        }
        else {
            return new String[0];
        }
    }

    public static void main(String[] args) {
        String lineinput = "the quick,\"brown, fox jumps\",over,\"the\",,\"lazy dog\"";
        CSVParser myCSV = new CSVParser();
        System.out.println("Testing CSVParser with: \n " + lineinput);
        for (String s : myCSV.parse(lineinput)) {
            System.out.println(s);
        }
    }
}
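To see how the expression walks through a line, it can help to print where each match starts and ends and which side of the alternation produced it. The following is only a sketch for experimenting, not part of the original posting: the class name CsvRegexTrace and the trace helper are made up for illustration, and the pattern is copied unchanged from CSVParser.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, used only to illustrate how the pattern evaluates.
public class CsvRegexTrace {
    // Same pattern as in CSVParser.
    private static final Pattern CSV =
            Pattern.compile("\"([^\"]*)\"|(?<=,|^)([^,]*)(?:,|$)");

    // Returns one line per match: its [start,end) range, which alternative
    // matched (quoted = group 1, unquoted = group 2), and the captured field.
    public static List<String> trace(String line) {
        List<String> steps = new ArrayList<>();
        Matcher m = CSV.matcher(line);
        while (m.find()) {
            boolean quoted = m.group(1) != null;
            String field = quoted ? m.group(1) : m.group(2);
            steps.add(String.format("[%d,%d) %s -> '%s'",
                    m.start(), m.end(), quoted ? "quoted" : "unquoted", field));
        }
        return steps;
    }

    public static void main(String[] args) {
        String line = "the quick,\"brown, fox jumps\",over,\"the\",,\"lazy dog\"";
        for (String step : trace(line)) {
            System.out.println(step);
        }
    }
}
```

Two things are worth watching in the trace. An unquoted match consumes its trailing comma, so the next search starts right at the following field. A quoted match stops at the closing quote, so the comma after it is skipped: the left alternative cannot start on a comma, and the lookbehind fails there because the preceding character is a quote rather than a comma.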
I'll try to give you some hints and the vocabulary you need to find very good explanations on regular-expressions.info:
- () is a group
- * is a quantifier
- If there is a ? right after the opening bracket then it's a special group, here (?<=,|^) is a lookbehind assertion.
- Square brackets declare a character class, e.g. [^\"]. This one is a special one, because of the ^ at the start. It is a negated character class.
- | denotes an alternation, i.e. an OR operator.
- (?:,|$) is a non-capturing group
- $ is a special character in regex, it is an anchor (which matches the end of the string)
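One way to tie this vocabulary back to the expression is to write the same pattern with Java's Pattern.COMMENTS flag, which ignores whitespace in the pattern and treats everything from # to the end of the line as a comment. This is just a sketch: the class name AnnotatedCsvPattern and the fields helper are invented for illustration, and the annotations are my reading of the pattern from the question.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical class: the pattern from the question, annotated piece by piece.
public class AnnotatedCsvPattern {
    static final Pattern CSV = Pattern.compile(
          "\"([^\"]*)\"  # a quote, group 1 = zero or more non-quote characters, a quote\n"
        + "|             # alternation: if the left side fails, try the right side\n"
        + "(?<=,|^)      # lookbehind: just before here is a comma or the start of the input\n"
        + "([^,]*)       # group 2 = zero or more non-comma characters (may be empty)\n"
        + "(?:,|$)       # non-capturing group: a comma or the end of the input\n",
        Pattern.COMMENTS);

    // Collects group 1 for quoted fields and group 2 for unquoted ones,
    // the same way CSVParser.parse does.
    public static List<String> fields(String line) {
        List<String> out = new ArrayList<>();
        Matcher m = CSV.matcher(line);
        while (m.find()) {
            out.add(m.group(1) != null ? m.group(1) : m.group(2));
        }
        return out;
    }

    public static void main(String[] args) {
        // Prints the four fields, including the empty third one.
        System.out.println(fields("a,\"b,c\",,d"));
    }
}
```

The lookbehind checks the text before the current position without consuming it, which is why the comma that ends one unquoted field can also serve as the "preceded by a comma" condition for the next one.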