It’s my first foray into UTF-8 land. I’m an IIS Admin, so I’ve never gotten to touch this professionally. I’m trying to help a missionary who’s translated the Bible into an African language and now needs to do some global matching against large UTF-8 files. We’re specifically matching for accented characters.
We’re using older XP computers here, so I cobbled together a quick script in VBS knowing the language would be installed on their boxes already. After playing around for a few minutes, it appears VBS regexes handle UTF-8 by breaking each character up into 2 characters. To match a single â, my pattern is \u00c3\u00a2. Shouldn’t this be \u00e2?
Since I’m out of my depth I thought I’d seek a little guidance. It almost looks like UTF-8 simply requires this kind of double matching (and UTF-8 is required). Can someone tell me into which box canyon I’m coding? 🙂
Downloading and installing Perl or Java is probably outside this project’s bandwidth and technical know-how. The tool should be built in. MS Office is installed, so VBA is an option if there’s some library that offers specific support. JavaScript is installed as well, though I don’t know what versions.
Thanks
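Unless your regex relies on a dot standing for exactly one character (e.g. you have .. or ... in your regex but not .*), you can use any ASCII regex library on UTF-8 and expect it to work correctly. A byte-oriented engine counts each byte as its own character, so one accented letter takes two dots. The same caution applies to anything else that acts on a single character, such as a character class containing accented letters or a quantifier placed directly after one.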
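Here is a small sketch of the dot caveat in JavaScript, since you mentioned it is installed. It assumes the script sees the file one byte per character, which is what your \u00c3\u00a2 pattern suggests. The variable names are just for illustration.

```javascript
// "bâd" as a byte-per-character string: b, 0xC3, 0xA2, d (4 "characters").
var rawBytes = "b\u00c3\u00a2d";

// One dot consumes only the first byte of â, so this does not match.
var oneDot = /^b.d$/.test(rawBytes);             // false

// Two dots are needed to step over the two bytes of â.
var twoDots = /^b..d$/.test(rawBytes);           // true

// A literal byte pair needs no dot at all and matches as expected.
var literal = /b\u00c3\u00a2d/.test(rawBytes);   // true
```

So a pattern built only from literal text and ASCII metacharacters is safe, while each dot has to be counted in bytes rather than letters.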
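The trick is to know what you are looking for. UTF-8 stores each accented character as two or more bytes, and a tool that reads the file one byte per character shows you those bytes separately: â is U+00E2, which UTF-8 encodes as the bytes C3 A2, hence your \u00c3\u00a2. So write your regex in whatever you are familiar with, convert it to UTF-8, and it will work unless it contains “..”.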
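The "convert it to UTF-8" step can be sketched like this in JavaScript. The helper names toUtf8ByteForm and fromUtf8ByteForm are made up for this example, and the sketch assumes every byte arrives as the character with the same code value (some Windows code pages remap bytes 0x80 to 0x9F, so check that on your data).

```javascript
// Turn ordinary text into its UTF-8 bytes, one byte per character.
// encodeURIComponent percent-encodes the UTF-8 bytes; unescape turns
// each %XX back into a single character with that byte value.
function toUtf8ByteForm(text) {
  return unescape(encodeURIComponent(text));
}

// The reverse: turn a byte-per-character string back into real text.
function fromUtf8ByteForm(rawBytes) {
  return decodeURIComponent(escape(rawBytes));
}

// Write the pattern the natural way, then convert it.
var bytePattern = toUtf8ByteForm("\u00e2");        // "\u00c3\u00a2"
var re = new RegExp(bytePattern, "g");

// "pâté and â" as a byte-per-character string.
var rawLine = "p\u00c3\u00a2t\u00c3\u00a9 and \u00c3\u00a2";
var hits = rawLine.match(re);                      // two matches, both â

// Going the other way lets you match with \u00e2 directly.
var decoded = fromUtf8ByteForm(rawLine);           // "pâté and â"
var directHit = /\u00e2/.test(decoded);            // true
```

Either direction works for literal accented letters. Decoding the text first is the only one of the two that lets a single dot or a character class treat â as one character.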