My task is to extract some data from a given document using a Perl-style (or at least extended) regular expression. I have:
- a source document (as a file, as a variable – it doesn’t really matter), for example:
  Some text: 1234.55 value more text - 8863 value
- a Perl-style / extended regular expression as a string, for example:
  ^.*: ([0-9.]+) value .* - (\d+) value$
What is the best approach to extract the data in a UNIX shell script?
Let me define what I’d like to see in the best approach, in order of importance:
- Portability – ideally, it should work on most current OSes and environments – i.e. at least GNU/Linux, FreeBSD/OpenBSD, Mac OS X; Cygwin is probably the same as Linux, but not in all cases
- Minimal system requirements – i.e. asking to run some exotic interpreters / programs is generally a bad thing to do
- Fair use of resources – i.e. it shouldn’t take ages to process some simple regexp
- Clean, small, easy to understand code
I understand that it’s impossible to reach all these goals at once, so I’ve considered my alternatives:
- Using sed – probably it would be the best way to go, but, alas, POSIX sed supports only basic regexps, not extended and definitely not Perl-style. Various implementations add extensions, but they’re generally incompatible: GNU sed uses the -r or --regexp-extended option to switch to extended mode, and BSD sed (also on Mac OS X) uses -E.
- Convert extended regular expressions to basic and use original sed – seems somewhat awkward to me and I can’t find any decent algorithm proven to work properly for this task.
- Using awk – generally the same as sed, but even worse: there are myriads of implementations of awk with slight incompatibilities in the wild and support for extended regular expressions is even more obscure.
- Using perl – probably the easiest and sanest alternative, but, alas, Perl is not available everywhere as POSIX standard utilities are – i.e. as far as I remember, Perl is not in a core system in *BSD (and Mac OS X), it requires separate installation in Cygwin world, even some Linux distributions give a chance to omit it.
- Using php, python, ruby – the same situation as with perl, but they’re generally even more uncommon, as I see in the current world.
- Using grep – same as with sed; BSD uses GNU grep, but it doesn’t support -P AKA --perl-regexp, only -E AKA --extended-regexp on BSD systems. What’s even worse – it seems to be impossible to print out groups, not whole pattern matched – i.e. using grep -o (Show only the part of a matching line that matches) gives only the whole pattern, not distinct values of groups.
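To make the sed alternative concrete, here is a minimal sketch of what the extraction could look like with an extended-regexp sed. It assumes the local sed accepts -E (BSD sed does, and as far as I know recent GNU sed releases accept it too), and it rewrites \d as [0-9], since POSIX extended regular expressions have no \d shorthand. The variable names are only illustrative.

```shell
#!/bin/sh
# Sample input from the question.
line='Some text: 1234.55 value more text - 8863 value'
# The question's pattern with \d replaced by [0-9] (ERE has no \d).
pattern='^.*: ([0-9.]+) value .* - ([0-9]+) value$'

# -n plus the trailing p prints only lines where the substitution matched.
# Assumption: this sed understands -E; the pattern must not contain "/".
result=$(printf '%s\n' "$line" | sed -n -E "s/$pattern/\1 \2/p")
printf '%s\n' "$result"
```

Unlike grep -o, the substitution prints just the captured groups, separated here by a space.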
So, I’m kind of lost as to what would be the most portable and easiest-to-support way. Right now I’m choosing between:
- Make a wrapper over sed to check whether we’re using BSD or GNU sed and run relevant commands
- Insist on having perl installed to be able to run my script
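The first option could be sketched roughly as follows. Rather than trying to identify the sed flavour by name, this version probes which flag actually enables extended regexps. The helper name sed_ere_flag is hypothetical, and the probe assumes that an unsupported flag makes sed fail or leave the input unchanged.

```shell
#!/bin/sh
# sed_ere_flag (hypothetical helper): print the first flag that makes the
# local sed treat "+" as an ERE quantifier; return 1 if neither flag works.
sed_ere_flag() {
    for flag in -E -r; do
        # In basic-regexp mode "a+" does not match "aa", so only a working
        # extended mode turns the whole input into "X".
        if [ "$(printf 'aa\n' | sed "$flag" 's/a+/X/' 2>/dev/null)" = "X" ]; then
            printf '%s\n' "$flag"
            return 0
        fi
    done
    return 1
}

flag=$(sed_ere_flag) || { echo 'no extended-regexp sed found' >&2; exit 1; }
result=$(printf 'Some text: 1234.55 value more text - 8863 value\n' |
    sed -n "$flag" 's/^.*: ([0-9.]+) value .* - ([0-9]+) value$/\1 \2/p')
printf '%s\n' "$result"
```

Probing behaviour instead of parsing version strings means the wrapper does not need to know every sed implementation in advance.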
Is there something missing from this overview? What would be the best alternatives? Maybe there’s already a wrapper written for this task somewhere (i.e. autotools / some other mysterious projects that use shell script)?
Absolute portability is hard. How about doing it this way? I don’t know if it is a good idea…
In fact, the extracting part is easy, no matter which tool we use. The interesting part is deciding whether a tool is available and suitable on the current system.
You could create a list (array) of all the tools, then at the beginning of your script check each tool’s availability and detailed version; I think a simple grep is enough for that.
For example, use $? to check availability,
and a simple grep on the version output to check version details.
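A minimal sketch of those two checks, using sed as the tool under test. The variable names are made up for illustration, and the grep pattern is an assumption about what GNU sed prints for --version (other implementations may reject the option entirely, which is why errors are discarded).

```shell
#!/bin/sh
# Availability: command -v exits 0 when the tool is found; $? holds that status.
command -v sed >/dev/null 2>&1
sed_status=$?
if [ "$sed_status" -eq 0 ]; then
    echo 'sed is available'
else
    echo 'sed is missing'
fi

# Version details: grep the --version banner. The pattern is an assumption
# about GNU sed's banner text; anything else is classified as "other".
if sed --version 2>/dev/null | grep -E -q '^GNU sed|\(GNU sed\)'; then
    sed_flavor=gnu
else
    sed_flavor=other
fi
echo "sed flavor: $sed_flavor"
```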
Once you have found a tool that can do the job, use it by calling the corresponding script.
However, you have to prepare N solutions for the same problem using N tools.
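Put together, the idea might look something like the sketch below, with three hand-written solutions for the sample pattern from the question. The function name extract_values and the order of preference are my own assumptions. The last branch is the same pattern converted by hand to a basic regexp, so it should work with any POSIX sed.

```shell
#!/bin/sh
# extract_values (hypothetical name): read lines on stdin and print the two
# captured numbers, using the first suitable tool that is found.
extract_values() {
    if command -v perl >/dev/null 2>&1; then
        # Solution 1: perl understands the original pattern, including \d.
        perl -ne 'print "$1 $2\n" if /^.*: ([0-9.]+) value .* - (\d+) value$/'
    elif printf 'aa\n' | sed -E 's/a+/X/' 2>/dev/null | grep -q '^X$'; then
        # Solution 2: a sed with extended regexps via -E (\d becomes [0-9]).
        sed -n -E 's/^.*: ([0-9.]+) value .* - ([0-9]+) value$/\1 \2/p'
    else
        # Solution 3: plain POSIX sed; groups use \( \) and "+" is spelled
        # out as one occurrence followed by "*".
        sed -n 's/^.*: \([0-9.][0-9.]*\) value .* - \([0-9][0-9]*\) value$/\1 \2/p'
    fi
}

printf 'Some text: 1234.55 value more text - 8863 value\n' | extract_values
```

The obvious cost is the one mentioned above: every new pattern has to be written, and kept in sync, once per tool.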