Why does the following code return true (Saxon-EE 9.2 for .NET)?
matches('some text>', '^[\w ]{3,200}$')
There is no > symbol in the pattern.
Thanks.
XQuery:
<regexp-test>
<!-- why true? -->
<test1>{matches('some text>', '^[\w ]{3,200}$')}</test1>
<test2>{matches('some text>', '^[\w ]+$')}</test2>
<test3>{matches('< < >', '^[\w ]+$')}</test3>
<!-- valid: -->
<test4>{matches('some text!', '^[\w ]+$')}</test4>
<test5>{matches('.,', '^[\w ]+$')}</test5>
</regexp-test>
Output:
<regexp-test>
<!-- why true? -->
<test1>true</test1>
<test2>true</test2>
<test3>true</test3>
<!-- valid: -->
<test4>false</test4>
<test5>false</test5>
</regexp-test>
After some digging, experimentation, and help from others in the eXist community, I found that the Unicode character categories, on which the regular expressions in XPath and XML Schema are based, differ from the POSIX classes. In particular, the characters
$+<=>^|~
are in the Symbol category \p{S}, not the Punctuation category \p{P}. Since the definition of \w (from http://www.w3.org/TR/2004/REC-xmlschema-2-20041028/datatypes-with-errata.html) is
“[#x0000-#x10FFFF]-[\p{P}\p{Z}\p{C}] (all characters except the set of “punctuation”, “separator” and “other” characters)”
these characters are included in \w, which is why > and < match [\w ] in the tests above.
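A quick way to check this on your own processor is to test each character against the categories directly. This is only a sketch: the element names (category-check, char, symbol, punctuation, word) are made up for illustration, and '!' and ',' are added as ordinary punctuation for comparison.

```xquery
(: For each character, report whether it matches \p{S}, \p{P} and \w. :)
<category-check>{
  for $c in ('$', '+', '<', '=', '>', '^', '|', '~', '!', ',')
  return
    <char value="{$c}">
      <symbol>{matches($c, '^\p{S}$')}</symbol>
      <punctuation>{matches($c, '^\p{P}$')}</punctuation>
      <word>{matches($c, '^\w$')}</word>
    </char>
}</category-check>
```

Given the Unicode categories described above, the first eight characters should report symbol and word as true and punctuation as false, while '!' and ',' should report punctuation as true and word as false.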
This leads to a workaround: use [^\W\p{S}] in place of \w, which matches only those word characters that are not symbols.
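Applied to the original tests, the workaround might look like the sketch below. Since [^\W\p{S}] also excludes the space, the space is added back here through an alternation; that combination is my own assumption about how to keep the original "word characters or spaces" intent, and test6 is an extra positive control that was not in the question.

```xquery
<regexp-test>
  <!-- symbols such as > and < are no longer accepted -->
  <test1>{matches('some text>', '^([^\W\p{S}]| ){3,200}$')}</test1>
  <test2>{matches('some text>', '^([^\W\p{S}]| )+$')}</test2>
  <test3>{matches('< < >', '^([^\W\p{S}]| )+$')}</test3>
  <!-- punctuation is rejected as before -->
  <test4>{matches('some text!', '^([^\W\p{S}]| )+$')}</test4>
  <test5>{matches('.,', '^([^\W\p{S}]| )+$')}</test5>
  <!-- plain letters and spaces should still match -->
  <test6>{matches('some text', '^([^\W\p{S}]| )+$')}</test6>
</regexp-test>
```

With this pattern, test1 to test5 should all come out false and test6 true. Character class subtraction, as in [\w -[\p{S}]], should be an equivalent way to write the same class in the XML Schema regex syntax.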