Given a single Unicode character, I need to determine whether it is alphanumeric in any script. I don’t have access to regular expressions or any useful API that works with Unicode. I think my only option is to compare the character's code point against a set of ranges covering alphanumeric characters.
The problem is that I can’t find a list of such ranges.
Can anyone either suggest a better solution or else point me to a definitive list of alphanumeric ranges to compare against?
Thanks,
Tim
You can check the Unicode Character Database, in particular the PropList files (here’s a 5.0 example), which map code points to properties. Alternatively, you can parse the main listing file (this one for 5.0, for example; it’s huge), take all the code points whose General Category is one you need (L and N, I suppose), and build the ranges from that data.
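To make the second approach concrete, here is a rough sketch in Python (pick whatever language you actually have). It assumes the main listing file is UnicodeData.txt, with semicolon-separated fields where field 0 is the hex code point, field 1 the name and field 2 the General Category, and where large blocks are given as a pair of `<..., First>` / `<..., Last>` lines. The function names `build_ranges` and `is_alphanumeric` are made up for illustration, and treating every category that starts with L or N as "alphanumeric" is my assumption; note that N also includes things like Roman numerals and fractions.

```python
from bisect import bisect_right

def build_ranges(lines, wanted=("L", "N")):
    """Return sorted, merged (start, end) code point ranges whose
    General Category begins with one of the letters in `wanted`."""
    ranges = []
    range_start = None  # set when a "<..., First>" line has been seen
    for line in lines:
        fields = line.strip().split(";")
        if len(fields) < 3:
            continue  # blank or malformed line
        cp = int(fields[0], 16)
        name, category = fields[1], fields[2]
        if name.endswith(", First>"):
            range_start = cp  # wait for the matching "Last" line
            continue
        if name.endswith(", Last>") and range_start is not None:
            start, range_start = range_start, None
        else:
            start = cp
        if category[:1] not in wanted:
            continue
        if ranges and ranges[-1][1] + 1 >= start:
            # Adjacent to the previous range: extend it.
            ranges[-1] = (ranges[-1][0], max(ranges[-1][1], cp))
        else:
            ranges.append((start, cp))
    return ranges

def is_alphanumeric(cp, ranges):
    """Binary-search the range table for code point `cp`."""
    i = bisect_right(ranges, (cp, float("inf"))) - 1
    return i >= 0 and ranges[i][0] <= cp <= ranges[i][1]

# A few lines in the assumed file format, for demonstration only.
SAMPLE = [
    "0030;DIGIT ZERO;Nd;0;EN;;0;0;0;N;;;;;",
    "0031;DIGIT ONE;Nd;0;EN;;1;1;1;N;;;;;",
    "0041;LATIN CAPITAL LETTER A;Lu;0;L;;;;;N;;;;0061;",
    "005F;LOW LINE;Pc;0;ON;;;;;N;SPACING UNDERSCORE;;;;",
    "4E00;<CJK Ideograph, First>;Lo;0;L;;;;;N;;;;;",
    "9FBB;<CJK Ideograph, Last>;Lo;0;L;;;;;N;;;;;",
]
table = build_ranges(SAMPLE)
```

You would run `build_ranges` once over the real file, then hard-code the resulting table into your program so that only the binary search in `is_alphanumeric` is needed at run time.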
Also, you didn’t mention the tools you use, but I think referring to this Perl module (and the XS.xs file in its distribution package) might be helpful too.