I’m writing a regexp to pick out punctuation in strings and I’m getting some behavior I don’t expect:
ix = regexp('FGFR4','[~!@#$%^&*()-=+{}\|;:''",<.>/?\[]')
ix =
[5]
ix = regexp('FGFR4','[~!@#$%^&*()-+{}\|;:''",<.>/?\[]') %note, the '=' is gone
ix =
[]
So it appears that ‘=’ is matching the number 4. What I expect is for it to match only the ‘=’ sign, like so:
ix = regexp('FOO=SPAM','[~!@#$%^&*()-=+{}\|;:''",<.>/?\[]')
ix =
[4]
What’s going on here?
The problem is not the ‘=’ but the ‘-’ in front of it. It creates a range of all characters from ‘)’ to ‘=’ (in ASCII order), and that range includes the digits, so it matches the 4. The reason why this is not a problem if you remove the ‘=’ is that ‘+’ comes before ‘4’ in ASCII order, so the range no longer includes the 4. In fact it then only includes ‘)’, ‘*’ and ‘+’, and since you have ‘*’ in the class anyway, this would never have mattered.
Three solutions:
escape the hyphen:
'[~!@#$%^&*()\-=+{}\|;:''",<.>/?\[]'
or put it at the end of the character class:
'[~!@#$%^&*()=+{}\|;:''",<.>/?\[-]'
or, unless you need to match exactly this set of characters, use a negated class. If you would be alright with matching anything that is not a whitespace character, a letter, a digit, or an underscore, you could just as well use this:
'[^\w\s]'
This matches any non-underscore, non-letter, non-digit, non-whitespace character.
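To see the effect, here is a minimal sketch that lists the characters covered by the accidental range and then tries the original pattern and each fix on the strings from the question. The variable names (rangeChars, original, escaped, atEnd, generic) and the extra test string 'FOO-SPAM' are my own for illustration, and the result comments are what I would expect rather than a captured session:

```matlab
% Characters covered by the accidental range ")-=" (code points 41 to 61).
rangeChars = char(double(')'):double('='));
disp(rangeChars)                 % )*+,-./0123456789:;<=

% Hypothetical variable names; the patterns are the ones discussed above.
original = '[~!@#$%^&*()-=+{}\|;:''",<.>/?\[]';   % ")-=" is read as a range
escaped  = '[~!@#$%^&*()\-=+{}\|;:''",<.>/?\[]';  % hyphen escaped
atEnd    = '[~!@#$%^&*()=+{}\|;:''",<.>/?\[-]';   % hyphen moved to the end
generic  = '[^\w\s]';                              % not a word char, not whitespace

% The digit 4 sits inside the accidental range, so only the original matches it.
regexp('FGFR4', original)        % 5, as in the question
regexp('FGFR4', escaped)         % empty
regexp('FGFR4', atEnd)           % empty
regexp('FGFR4', generic)         % empty

% Real punctuation is still found by the fixed patterns.
regexp('FOO=SPAM', escaped)      % 4
regexp('FOO-SPAM', atEnd)        % 4, the literal hyphen
regexp('FOO=SPAM', generic)      % 4
```

Note that the negated class is broader than the hand-written list: it also matches characters such as ‘]’ or ‘_’-free symbols like ‘`’ that the original class does not contain, so use it only if that is acceptable.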