I’m parsing RTF 1.5+ files generated by Word 2003+ that may have content from

Question


Asked: May 17, 2026


I’m parsing RTF 1.5+ files generated by Word 2003+ that may have content from other languages. This content is usually encoded as hex literals (\'xx). I would like to convert these literals to Unicode values.

I know my document’s code page by looking for ansicpg (\ansi\ansicpg1252).

When I use the ansicpg codepage to decode to Unicode, many languages (like French) seem to convert to the Unicode char values that I expect.

However, when I see Russian text (like below), code page 1252 decodes the content to gibberish.

\f277\lang1049\langfe1033\langnp1049\insrsid5989826\charrsid6817286
\'d1\'f2\'f0\'e0\'ed\'e8\'f6\'fb \'e1\'e5\'e7 \'ed\'e0\'e7\'e2\'e0\'ed\'e8\'ff. \'dd\'f2
\'e0 \'f1\'f2\'f0\'e0\'ed\'e8\'f6\'e0 \'ed\'e5 \'e4\'ee\'eb\'e6\'ed\'e0
\'ee\'f2\'ee\'e1\'f0\'e0\'e6\'e0\'f2\'fc\'f1\'ff \'e2 \'f2\'e0\'e1\'eb\'e8\'f6\'e5
\'e2 \'f1\'ee\'e4\'e5\'f0\'e6\'e0\'ed\'e8\'e8.

I assume that lang1049, langfe1033, langnp1049 should provide me clues so I can programmatically choose a different (non-default) code page for the text that they reference? If so, where can I find information that explains how to map a lang* code to a codepage? Or should I be looking for some other RTF command/directive to provide me with the information I’m looking for? (Or must I use \f277 as a font reference and see if it has an associated codepage?)


1 Answer

Editorial Team · Answer 1 · 2026-05-17T01:25:37+00:00

\lang really only marks up particular stretches of the text as being in a particular language, and shouldn’t impact what code page is to be used for the old non-Unicode \' escapes.

Putting an \ansicpg token in the header should perhaps do it, but it seems to be ignored by Word (for both raw bytes and \' escapes).
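To see why the header code page alone is not enough for the Russian run in the question, it helps to decode the same bytes under two code pages. This is only a minimal sketch: the helper name is made up for illustration, the escapes are assumed to use a straight apostrophe as in real RTF, and code page 1251 is assumed to be the right one for that run.

```python
import re

# Matches a single RTF hex escape such as \'d1 (straight apostrophe).
HEX_ESCAPE = re.compile(r"\\'([0-9a-fA-F]{2})")

def hex_escapes_to_bytes(rtf_fragment: str) -> bytes:
    """Collect the byte values of all \\'xx escapes (hypothetical helper).

    Everything that is not a hex escape is ignored, so this is only
    suitable for inspecting a single escaped word.
    """
    return bytes(int(h, 16) for h in HEX_ESCAPE.findall(rtf_fragment))

# First word of the Russian sample from the question.
sample = r"\'d1\'f2\'f0\'e0\'ed\'e8\'f6\'fb"
raw = hex_escapes_to_bytes(sample)

as_1252 = raw.decode("cp1252")  # what \ansicpg1252 would suggest
as_1251 = raw.decode("cp1251")  # assumed Cyrillic code page

print(as_1252)
print(as_1251)
```

The bytes are identical in both cases; only the code page used to interpret them changes, which is why the decoder has to be chosen per run rather than once per document.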

Or must I use \f277 as a font reference and see if it has an associated codepage?

It looks that way. Changing the \fcharset of the font assigned to a particular stretch of text is the only way I can get Word to change how it treats the bytes, anyway. The codes in this token (see e.g. here for a list) are, aggravatingly, different again from either the language ID or the code page number.
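The font-based lookup could be sketched roughly as follows. This is not a full RTF parser, and several things here are assumptions: the font table string is invented for illustration (the question does not show the real \fonttbl entry for \f277, so \fcharset204 is a guess), the function names are hypothetical, and the charset table is only a subset of the values listed in the RTF specification, mapped to Python codec names.

```python
import re

# Subset of \fcharset values mapped to Python codec names.
FCHARSET_TO_CODEC = {
    0: "cp1252",    # ANSI (Western European)
    128: "cp932",   # Shift-JIS
    129: "cp949",   # Hangul
    134: "cp936",   # GB2312
    136: "cp950",   # Big5
    161: "cp1253",  # Greek
    162: "cp1254",  # Turkish
    163: "cp1258",  # Vietnamese
    177: "cp1255",  # Hebrew
    178: "cp1256",  # Arabic
    186: "cp1257",  # Baltic
    204: "cp1251",  # Russian
    222: "cp874",   # Thai
    238: "cp1250",  # Eastern European
}

# \fN followed, within the same group, by \fcharsetM.
FONT_ENTRY = re.compile(r"\\f(\d+)[^{}]*?\\fcharset(\d+)")
# One or more consecutive \'xx escapes, decoded together as one byte string.
HEX_RUN = re.compile(r"(?:\\'[0-9a-fA-F]{2})+")

def font_charsets(font_table: str) -> dict[int, int]:
    """Map font number -> \\fcharset value (hypothetical helper)."""
    return {int(f): int(cs) for f, cs in FONT_ENTRY.findall(font_table)}

def decode_hex_escapes(text: str, codec: str) -> str:
    """Replace each run of \\'xx escapes with its decoded text."""
    def repl(match: re.Match) -> str:
        raw = bytes.fromhex(match.group(0).replace("\\'", ""))
        return raw.decode(codec, errors="replace")
    return HEX_RUN.sub(repl, text)

# Invented font table; the real entry for \f277 is an assumption.
font_table = (r"{\fonttbl{\f0\froman\fcharset0 Times New Roman;}"
              r"{\f277\froman\fcharset204 Times New Roman Cyr;}}")
run = r"\'d1\'f2\'f0\'e0\'ed\'e8\'f6\'fb \'e1\'e5\'e7"

charsets = font_charsets(font_table)
codec = FCHARSET_TO_CODEC.get(charsets[277], "cp1252")  # fall back to \ansicpg
decoded = decode_hex_escapes(run, codec)
print(decoded)
```

Decoding consecutive escapes together rather than byte by byte is a deliberate choice so that multi-byte code pages have a chance of working. Real files may still mix escaped and literal bytes, which this sketch does not handle.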
