I’m trying to write a parser for the EDI data format, which is just delimited text but where the delimiters are defined at the top of the file.
Essentially it’s a bunch of split() calls based on values I read at the top of the file.
The problem is that there’s also a custom ‘escape character’ that indicates that I need to ignore the following delimiter.
For example, assuming * is the delimiter and ? is the escape, I’m doing something like this:
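use strict;
use warnings;
use Data::Dumper;
my $delim = "*";
my $escape = "?";
my $edi = "foo*bar*baz*aster?*isk";
# quotemeta() escapes the delimiter in case it is a regexp special character
my @split = split(quotemeta($delim), $edi);
print Dumper(\@split);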
I need it to return “aster*isk” as the last element.
My original idea was to replace every instance of the escape character plus the character that follows it with some custom-mapped unprintable ASCII sequence before I call split(), then use another regexp to switch the placeholders back to the right values.
That is doable but feels like a hack, and it will get pretty ugly once I do it for all five potential delimiters. Each delimiter is also potentially a regexp special character, which leads to a lot of escaping in my own regular expressions.
Is there any way to avoid this, possibly with a special regexp passed to my split() calls?
This is a bit tricky if you want to handle the case where the escape character is the last character of a field correctly. Here’s one way:
Note that this assumes that neither the escape char nor the delimiter will be a digit, but it does support the full range of Unicode characters.
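As a rough illustration only (not necessarily the code this answer refers to), here is a minimal sketch of one approach that fits the constraints in the note above: encode every escaped character as its numeric code point wrapped in the escape character, split, then decode each field. The helper name split_escaped is hypothetical, and the sketch assumes the delimiter and the escape are single, distinct, non-digit characters.

```perl
use strict;
use warnings;

# Hypothetical helper (name is an assumption, not from the original post).
# Splits $text on $delim, treating <escape><char> as a literal <char>.
# Assumes $delim and $escape are single, distinct, non-digit characters.
sub split_escaped {
    my ($text, $delim, $escape) = @_;
    my ($d, $e) = (quotemeta $delim, quotemeta $escape);

    # Step 1: replace each escaped character with <escape><code point><escape>,
    # so an escaped delimiter no longer appears as a literal delimiter.
    (my $encoded = $text) =~ s/${e}(.)/$escape . ord($1) . $escape/gse;

    # Step 2: split on the delimiter; the -1 limit keeps trailing empty fields.
    my @fields = split /${d}/, $encoded, -1;

    # Step 3: turn each placeholder back into the character it stands for.
    s/${e}(\d+)${e}/chr($1)/ge for @fields;

    return @fields;
}

my @parts = split_escaped("foo*bar*baz*aster?*isk", "*", "?");
print join("|", @parts), "\n";    # prints foo|bar|baz|aster*isk
```

Because the escaped escape character is encoded too, an input such as "a??*b" yields the fields "a?" and "b". A plain negative lookbehind such as split /(?<!\?)\*/ would get that case wrong, since it would treat the * as escaped. For nested delimiters, the same idea should extend by encoding once, splitting at each level, and decoding only the innermost fields.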