How do I replace a Unicode subscript or superscript numeral (e.g., ₂) with the corresponding digit (i.e., 2) using regular expressions? I can of course replace each of them separately, but that is ten lines of code…
I am implementing this in Perl but that should not really matter.
Here from the unisupers script is a Perl function to convert to Unicode superscripts:
And from the unisubs script is one for subscripts:
You just have to go the other way, mapping each superscript or subscript character back to its plain digit.
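As a minimal sketch of that reverse direction (not the code from the scripts mentioned above), something like the following should work. The function name `digits_from_scripts` is made up for illustration. It uses `tr///` rather than a regex proper, since each character maps to exactly one digit. Note that the superscripts ¹, ² and ³ sit in the Latin-1 block rather than next to the other superscripts, so they have to be listed individually.

```perl
use strict;
use warnings;

# Hypothetical helper: turn superscript and subscript digits into ASCII digits.
sub digits_from_scripts {
    my ($text) = @_;

    # Superscripts 0..9. Only U+2070 and U+2074..U+2079 are contiguous;
    # superscript 1, 2, 3 are U+00B9, U+00B2, U+00B3.
    $text =~ tr/\x{2070}\x{00B9}\x{00B2}\x{00B3}\x{2074}-\x{2079}/0-9/;

    # Subscripts 0..9 are the contiguous range U+2080..U+2089.
    $text =~ tr/\x{2080}-\x{2089}/0-9/;

    return $text;
}

print digits_from_scripts("H\x{2082}O"), "\n";   # prints "H2O"
print digits_from_scripts("x\x{00B2}"), "\n";    # prints "x2"
```

If you would rather stay with a substitution, the subscripts alone can also be handled arithmetically, for example `s/([\x{2080}-\x{2089}])/chr(ord($1) - 0x2080 + ord("0"))/ge`, because that range is contiguous.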
Another, simpler approach is to use the Unicode compatibility normalizations (NFKD and NFKC), which return the base characters instead of their superscript/subscript variants. I haven't checked that they are exact inverses of the functions above. You can play with them using the nfkd and nfkc scripts.
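Inside a Perl program, the same normalizations are available from the core `Unicode::Normalize` module. The sketch below is only an illustration. Its helper `normalize_script_digits` is a made-up name, and it assumes you want to leave other compatibility characters alone: normalizing the whole string also rewrites things like ligatures and fullwidth forms, which may or may not be what you want.

```perl
use strict;
use warnings;
use Unicode::Normalize qw(NFKC NFKD);

# Normalizing the whole string converts every compatibility character,
# not only superscript and subscript digits.
my $formula = "H\x{2082}O";
my $plain   = NFKC($formula);    # "H2O"

# Hypothetical helper: apply NFKD only to superscript/subscript digits,
# leaving all other characters untouched.
sub normalize_script_digits {
    my ($text) = @_;
    $text =~ s/([\x{00B2}\x{00B3}\x{00B9}\x{2070}\x{2074}-\x{2079}\x{2080}-\x{2089}])/NFKD($1)/ge;
    return $text;
}

print $plain, "\n";                                  # prints "H2O"
print normalize_script_digits("x\x{00B2}"), "\n";    # prints "x2"
```

The restricted version costs one extra line but avoids surprises when the input contains other characters with compatibility decompositions.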