I want to match against key/value assignments in shell scripts, config files, etc., which may or may not be single-, double- or backtick-quoted, and which may or may not have a line-ending comment. For example, I want:
RAILS_ENV=production
# => key: RAILS_ENV, value: production
listen_address = 127.0.0.1 # localhost only by default
# => key: listen_address, value: 127.0.0.1
PATH="/usr/local/bin"
# => key: PATH, value: "/usr/local/bin" (or /usr/local/bin would be fine)
HOSTNAME=`cat /etc/hostname`
# => key: HOSTNAME, value: `cat /etc/hostname`
If you feel fancy, it can handle escaped quotes and # inside the quotes, but I don’t think I’ll run into any. If you feel differently fancy, you can make it all named-capture expanded-style and pretty:
CONFIG_LINE = %r{
(?<export> export ){0}
(?<key> [\w-]+ ){0}
(?<value> \S* ){0}
(?<comment> \#.*$ ){0}
^\s*(\g<export>\s+)?\g<key>\s*=\s*\g<value>\s*(\g<comment>)?$
}x
but I think nobody really writes regexen like that.
I’ve seen Regex for quoted string with escaping quotes, but I’m not good enough to adapt any of those solutions to optional quotes; I don’t quite see how to do “expect an end quote, and therefore allow internal spaces, if I had a start quote.”
Edit: the Tin Man gave a practical answer, so now I’m looking for the purist answer. Throw some state machines at me, or tell me why it can’t be done.
It’s probably possible to do in one regex pattern, but I am a believer in keeping the patterns simple. Regex can be insidious and hide lots of little errors. Keep it simple to avoid that, then tweak afterwards.
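A minimal sketch of that keep-it-simple approach, assuming the config text is already in a string and using the pattern `/^([^=]+)=(.+)/`. The names `raw_pairs` and `sample` are made up for illustration:

```ruby
# Hypothetical helper: split each "key=value" line at the first "=".
# Nothing is trimmed yet; spaces and trailing comments stay in place.
def raw_pairs(text)
  text.scan(/^([^=]+)=(.+)/)
end

sample = <<~'CONFIG'
  RAILS_ENV=production
  listen_address = 127.0.0.1 # localhost only by default
CONFIG

raw_pairs(sample).each { |key, value| puts "#{key.inspect} => #{value.inspect}" }
# "RAILS_ENV" => "production"
# "listen_address " => " 127.0.0.1 # localhost only by default"
```

One thing to watch: `[^=]` also matches a newline, so a line with no `=` in it (a standalone comment, say) gets folded into the key of the next assignment. Filtering such lines out first avoids that.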
To trim off the trailing comment is easy in a subsequent map. If you want to normalize all your name/values so they have no extraneous spaces, you can do that in the map also.
What the regex “/^([^=]+)=(.+)/” is doing is:
“^” is “At the beginning of a line”, which is the character after a “\n”. This is not the same as the start of a string, which would be \A. There is an important difference, so if you don’t understand the two it is a good idea to learn when and why you’d want to use one over the other. That’s one of those places a regex can be insidious.
“([^=]+)” is “Capture everything that is not an equal-sign”.
“=” is obviously the equal-sign we were looking for in the previous step.
“(.+)” is going to capture everything after the equal-sign.
I purposely kept the above pattern simple. For production use I’d tighten up the patterns a little using a “non-greedy” quantifier, along with a trailing “$” anchor:
“+?” means find the first matching ‘=’. It’s already implied by the use of [^=], but +? makes my intent even more obvious. I can get away without the ?, but it’s more of a self-documentation thing for later maintenance. In your use-case it should be benign, but it is a worthy thing to keep in your Regex Bag o’ Tricks.
“$” means the end of the line, i.e., the place immediately preceding the EOL, AKA end-of-line or newline. (In Ruby the end of the string would be \z.) It’s implied also, but inserting it in the pattern makes it more obvious that’s what I’m searching for.
EDIT to track the OP’s added test:
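Putting those pieces together, here is a sketch that assumes the tightened pattern is `/^([^=]+?)=(.+)$/` (one reading of the description above) and that a comment starts at whitespace followed by `#`. The names `clean_pairs` and `sample` are made up, and a `#` preceded by a space inside a quoted value would be cut off too, so treat this as a starting point:

```ruby
# Hypothetical helper: scan for assignments, then clean each pair in a map.
def clean_pairs(text)
  text.scan(/^([^=]+?)=(.+)$/).map do |key, value|
    # Drop a trailing " # comment", then strip surrounding whitespace.
    [key.strip, value.sub(/\s+#.*/, "").strip]
  end
end

sample = <<~'CONFIG'
  RAILS_ENV=production
  listen_address = 127.0.0.1 # localhost only by default
  PATH="/usr/local/bin"
  HOSTNAME=`cat /etc/hostname`
CONFIG

clean_pairs(sample).each { |key, value| puts "key: #{key}, value: #{value}" }
# key: RAILS_ENV, value: production
# key: listen_address, value: 127.0.0.1
# key: PATH, value: "/usr/local/bin"
# key: HOSTNAME, value: `cat /etc/hostname`
```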
If I was writing this for myself I’d generate a hash for convenience:
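One way that could look, as a sketch: `config_hash` is a made-up name, and the pattern and comment-trimming rule are the same assumptions as before (non-greedy key, `$` anchor, comment starts at whitespace plus `#`).

```ruby
# Hypothetical helper: build a { "KEY" => "value" } hash from config text.
def config_hash(text)
  pairs = text.scan(/^([^=]+?)=(.+)$/).map do |key, value|
    # Trim the trailing comment and surrounding whitespace from each pair.
    [key.strip, value.sub(/\s+#.*/, "").strip]
  end
  pairs.to_h
end

config = config_hash(<<~'CONFIG')
  RAILS_ENV=production
  listen_address = 127.0.0.1 # localhost only by default
CONFIG

puts config["RAILS_ENV"]       # production
puts config["listen_address"]  # 127.0.0.1
```

If a key appears more than once, the last assignment wins, which happens to match how a shell would treat repeated assignments.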