I’m trying to learn regular expressions. As practice, I’m trying to find every word that appears exactly once in my document; in linguistics this is called a hapax legomenon (http://en.wikipedia.org/wiki/Hapax_legomenon).
So I thought the following expression would give me the desired result:
\w{1}
But this doesn’t work. The \w matches a single character, not a whole word. It also doesn’t appear to be giving me characters that appear only once (it actually returns 25873 matches, which I assume are all the alphanumeric characters). Can someone give me an example of how to find a “hapax legomenon” with a regular expression?
If you’re trying to do this as a learning exercise, you picked a very hard problem 🙂
First of all, here is the solution:
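Assembled from the three pieces explained below, the pattern would look something like this. Treat it as a sketch: it assumes a regex flavor such as .NET that allows unbounded lookbehind and backreferences inside lookarounds, and a "dot matches newline" option if the text spans several lines.

```regex
(\b\w+\b)(?<!\b\1\b.*\b\1\b)(?!.*\b\1\b)
```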
Now, here is the explanation:
We want to match a word. This is \b\w+\b: a run of one or more (+) word characters (\w), with a ‘word break’ (\b) on either side. A word break happens between a word character and a non-word character, so this will match between (e.g.) a word character and a space, or at the beginning and the end of the string. We also capture the word into a backreference by using parentheses ((...)). This means we can refer to the match itself later on.
Next, we want to exclude the possibility that this word has already appeared in the string. This is done by using a negative lookbehind, (?<! ... ). A negative lookbehind doesn’t match if its contents match the string up to this point. So we want to not match if the word we have matched has already appeared. We do this by using a backreference (\1) to the already captured word. Because the lookbehind comes after the captured word, the string up to this point always ends with one copy of that word, so the lookbehind has to look for a second, earlier copy. The final match here is \b\1\b.*\b\1\b: two copies of the current match, separated by any amount of string (.*).
Finally, we don’t want to match if there is another copy of this word anywhere in the rest of the string. We do this by using a negative lookahead, (?! ... ). Negative lookaheads don’t match if their contents match at this point in the string. We want to match the current word after any amount of string, so we use (.*\b\1\b).
Here is an example (using C#):
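A minimal sketch of such an example, assuming the pattern is put together from the three pieces described above. The class name HapaxDemo, the helper FindHapaxes and the sample sentence are made up for illustration.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class HapaxDemo
{
    // Assumed pattern, built from the explanation: a captured word, a negative
    // lookbehind rejecting words already seen, and a negative lookahead
    // rejecting words that occur again later.
    private const string Pattern = @"(\b\w+\b)(?<!\b\1\b.*\b\1\b)(?!.*\b\1\b)";

    // Hypothetical helper: returns every word that occurs exactly once in text.
    public static string[] FindHapaxes(string text)
    {
        // Singleline lets '.' cross line breaks, so repeats on other lines count too.
        return Regex.Matches(text, Pattern, RegexOptions.Singleline)
                    .Cast<Match>()
                    .Select(m => m.Value)
                    .ToArray();
    }

    public static void Main()
    {
        // Made-up sample text: "the" occurs twice, every other word once.
        // Prints cat, saw and dog, each on its own line.
        foreach (string word in FindHapaxes("the cat saw the dog"))
        {
            Console.WriteLine(word);
        }
    }
}
```

As written the comparison is case-sensitive, so "The" and "the" count as different words. Each candidate word also triggers a scan of the rest of the text, so this is likely to be slow on large documents.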
Output: