I’m using this regex to parse lines of a CSV in APEX:
Pattern csvPattern = Pattern.compile('(?:^|,)(?:\"([^\"]+|\"\")*\"|([^,]+)*)');
It works great, but returns two groups for each match (one for the quoted values, and one for non-quoted values). See below:
Matcher csvMatcher = csvPattern.matcher('"hello",world');
Integer m = 1;
while (csvMatcher.find()) {
    System.debug('Match ' + m);
    for (Integer i = 1; i <= csvMatcher.groupCount(); i++) {
        System.debug('Capture group ' + i + ': ' + csvMatcher.group(i));
    }
    m++;
}
Running this code will return the following:
[5]|DEBUG|Match 1
[7]|DEBUG|Capture group 1: hello
[7]|DEBUG|Capture group 2: null
[5]|DEBUG|Match 2
[7]|DEBUG|Capture group 1: null
[7]|DEBUG|Capture group 2: world
I’d like for each match to only return the non-null capture. Is that possible?
This is actually a difficult thing to do.
It could be done with lookahead/behind assertions.
Not very intuitive though.
It looks something like this:
(?:^|,)(\s*"(?=(?:[^"]+|"")*"\s*(?:,|$)))?((?<=")(?:[^"]+|"")*(?="\s*(?:,|$))|[^,]*)
How it works is to line up the text body after the first quote " on a valid quoted field. If it's not a valid quoted field, it lines up on the quote itself. At that point the text body can be captured as either an un-quoted field, or as a quoted field minus the quotes, in a single capture buffer (group 2; group 1 only holds the optional opening quote).
This regex gives a precise solution without the need for residual code to pick between groups. I could be missing something, but I see no way to do this without lookaround assertions. So, your engine must support that. If not, you'll have to pick it out like your solution above.
Here is a prototype in Perl, with a commented expanded regex below it.
Good luck!
Output
Commented
Extension
If you do end up using this, it can be sped up by consuming a backreferenced quoted field instead of matching the quoted field twice. Backreferences usually resolve to a single string comparison API such as strncmp() in C, making it much faster.
As a side note, whitespace before/after the field body of non-quoted fields can be trimmed within the regex with a little extra notation.
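A rough Java sketch of the backreference variant, using the compressed pattern from this answer and again assuming the engine behaves like java.util.regex (captures made inside a lookahead are kept, and a backreference to an unset group simply fails). `CsvBackrefCapture` and `fields` are hypothetical names. Group 1 is filled inside the lookahead, and group 2 then re-consumes it with `\1`, so the field is still read from group 2.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, not part of the original answer.
class CsvBackrefCapture {
    // Group 1: quoted body, captured inside the lookahead.
    // Group 2: the field body, either the backreferenced quoted body or raw unquoted text.
    static final Pattern FIELD = Pattern.compile(
        "(?:^|,)(?:\\s*\"(?=((?:[^\"]+|\"\")*)\"\\s*(?:,|$)))?"
        + "((?<=\")\\1|[^,]*)");

    // Collects group 2 of every match on one CSV line.
    static List<String> fields(String line) {
        List<String> out = new ArrayList<>();
        Matcher m = FIELD.matcher(line);
        while (m.find()) {
            out.add(m.group(2));
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(fields("\"hello\",world")); // prints [hello, world]
    }
}
```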
Good luck!
Compressed
(?:^|,)(?:\s*"(?=((?:[^"]+|"")*)"\s*(?:,|$)))?((?<=")\1|[^,]*)
Expanded
Expanded with comments