I copied and pasted text from a PDF file, but the numbers weren't extracted correctly. If I view the exported txt file with less or more, I see the following:
"Christina, daughter of David Brodie, on <U+F735> November <U+F731><U+F736><U+F736><U+F735>. She was the sister of"
It should read:
“Christina, daughter of David Brodie, on 5 November 1665. She was the sister of”
Initially I thought it would be a simple search and replace, but the <U+F73n> numbers are encoded and I'm not sure how to extract them or even how they're encoded, although I did save the file as UTF-8 originally. I tried to use PHP's mbstring functions to see if I could extract the codes in some way, but I haven't been successful.
Has anyone else come across this problem and is there a simple solution that has eluded me?
Unfortunately, U+F7xx is in the Private Use Area of Unicode (U+E000 to U+F8FF), so these characters have no standard meaning. There is no automatic way to fix this, short of knowing the mapping ahead of time. Based on the code points in your sample (U+F735 for 5, U+F731 for 1, U+F736 for 6), I would venture to say that the digits sit at U+F730 through U+F739, so you could subtract 0xF730 from the character values and then add 0x30 to convert them to ASCII digits.
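
Here is a minimal sketch of that conversion in Python rather than PHP. It assumes the mapping inferred from the sample (U+F730 through U+F739 standing for 0 through 9), which should be checked against the original PDF before relying on it. The function name `fix_pua_digits` is made up for illustration.

```python
# Assumption: the PDF font maps digits 0-9 to U+F730..U+F739.
# This is inferred from the sample only, not from any standard.
PUA_DIGIT_FIRST = 0xF730
PUA_DIGIT_LAST = 0xF739


def fix_pua_digits(text: str) -> str:
    """Replace the assumed private-use digit characters with ASCII digits."""
    # Build a code point -> code point table: subtract 0xF730, add 0x30 ('0').
    table = {
        cp: cp - PUA_DIGIT_FIRST + 0x30
        for cp in range(PUA_DIGIT_FIRST, PUA_DIGIT_LAST + 1)
    }
    return text.translate(table)


sample = "on \uf735 November \uf731\uf736\uf736\uf735. She was the sister of"
print(fix_pua_digits(sample))
```

Any other private-use characters in the file are left untouched, so they can still be found and mapped by hand if the font uses them for something other than digits.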