The Archive Base

Editorial Team
Asked: May 13, 2026 (2026-05-13T06:36:58+00:00)

We are doing Natural Language Processing on a range of English language documents (mainly


We are doing Natural Language Processing on a range of English-language documents (mainly scientific) and have run into problems carrying non-ANSI characters through the various components. The documents may be “ASCII”, UNICODE, PDF, or HTML. We cannot predict at this stage what tools will be in our chain or whether they will allow character encodings other than ANSI. Even ISO-Latin characters expressed in UNICODE will give problems (e.g. displaying incorrectly in browsers). We are likely to encounter a range of symbols, including mathematical and Greek ones. We would like to “flatten” these into a text string that will survive multistep processing (including XML and regex tools) and then possibly reconstitute it in the last step (although we are concerned with the semantics rather than the typography, so this is a minor concern).

I appreciate that there is no absolute answer (any escaping can clash in some cases), but I am looking for something along the lines of XML’s <![CDATA[ ...]]>, which will survive most non-recursive XML operations. Characters such as [ are bad because they are common in regexes. So I’m wondering whether there is a generally adopted approach rather than inventing our own.

A typical example is the “degrees” symbol:

HTML Entity (decimal)               &#176;
HTML Entity (hex)                   &#xb0;
HTML Entity (named)                 &deg;
How to type in Microsoft Windows    Alt +00B0, Alt 0176, or Alt 248
UTF-8 (hex)                         0xC2 0xB0 (c2b0)
UTF-8 (binary)                      11000010:10110000
UTF-16 (hex)                        0x00B0 (00b0)
UTF-16 (decimal)                    176
UTF-32 (hex)                        0x000000B0 (00b0)
UTF-32 (decimal)                    176
C/C++/Java source code              "\u00B0"
Python source code                  u"\u00B0"
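
Several of the rows above can be reproduced with the Python standard library, which may help when checking what a tool in the chain actually emitted. This is only a sketch; the helper name `representations` is made up for illustration.

```python
import unicodedata


def representations(ch: str) -> dict:
    """Return several common representations of a single character."""
    cp = ord(ch)  # the Unicode code point as an integer
    return {
        "name": unicodedata.name(ch),
        "code point": f"U+{cp:04X}",
        "HTML decimal": f"&#{cp};",
        "HTML hex": f"&#x{cp:x};",
        # Big-endian variants are used so that no byte-order mark is added.
        "UTF-8 (hex)": ch.encode("utf-8").hex(),
        "UTF-16 (hex)": ch.encode("utf-16-be").hex(),
        "UTF-32 (hex)": ch.encode("utf-32-be").hex(),
    }


if __name__ == "__main__":
    for label, value in representations("\u00B0").items():
        print(f"{label:14} {value}")
```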

We are also likely to encounter TeX

$10\,^{\circ}{\rm C}$

or

\degree

so backslashes, curly braces, and dollar signs are poor choices for escape characters.

We could for example use markup like:

__deg__
__#176__

and this will probably work but I’d appreciate advice from those who have similar problems.
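
If an ASCII-only intermediate form does turn out to be necessary, one option that avoids a home-grown convention such as `__#176__` is the decimal numeric character reference (`&#176;`), which Python can produce with the standard `xmlcharrefreplace` error handler. The sketch below is illustrative only: the names `flatten` and `restore` are hypothetical, and it assumes the input does not already contain literal text of the form `&#NNN;`, since that would also be converted on the way back.

```python
import re

# Matches decimal numeric character references such as "&#176;".
_NUMERIC_REF = re.compile(r"&#(\d+);")


def flatten(text: str) -> str:
    """Replace every non-ASCII character with a decimal character reference."""
    return text.encode("ascii", "xmlcharrefreplace").decode("ascii")


def restore(text: str) -> str:
    """Turn decimal character references back into the characters they name."""
    return _NUMERIC_REF.sub(lambda m: chr(int(m.group(1))), text)


if __name__ == "__main__":
    sample = "10 \u00B0C, \u03B1-helix"  # degree sign and Greek small alpha
    flat = flatten(sample)
    print(flat)                    # 10 &#176;C, &#945;-helix
    assert restore(flat) == sample
```

Note that an XML parser will itself expand `&#176;` back into the character, so this form only stays flat through tools that treat the text as opaque ASCII.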

Update: I accept @MichaelB’s insistence that we use UTF-8 throughout. I am worried that some of our tools may not conform, and if so I’ll revisit this. Note that my original question is not well worded; read his answer and the link in it.

1 Answer

  1. Editorial Team
    Added an answer on May 13, 2026 at 6:36 am
    • Get someone to do this who really understands character encodings. It looks like you don’t, because you’re not using the terminology correctly. Alternatively, read this.
    • Do not brew up your own escape scheme – it will cause you more problems than it will solve. Instead, normalize the various source encodings to UTF-8 (which is really just one such escape scheme, except efficient and standardized) and handle character encodings correctly. Perhaps use UTF-7 if you’re really that scared of high bits.
    • In this day and age, not handling character encodings correctly is not acceptable. If a tool doesn’t, abandon it – it is most likely very bad quality code in many other ways as well and not worth the hassle of using.
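
As a rough illustration of "normalize the various source encodings to UTF-8", the sketch below decodes incoming bytes with the first encoding that works and re-encodes them as UTF-8. The function name `to_utf8` and the candidate list (`utf-8`, then `cp1252`) are assumptions made for this example. Trial decoding is only a fallback heuristic; where a document declares its encoding (an HTTP header, an XML declaration, PDF metadata), that declaration should be used instead.

```python
def to_utf8(raw: bytes, candidates=("utf-8", "cp1252")) -> bytes:
    """Decode raw bytes with the first candidate that succeeds, then emit UTF-8.

    The candidate order matters: UTF-8 is tried first because arbitrary
    legacy-encoded text is very unlikely to be valid UTF-8 by accident.
    """
    for encoding in candidates:
        try:
            text = raw.decode(encoding)
        except UnicodeDecodeError:
            continue  # not valid in this encoding, try the next one
        return text.encode("utf-8")
    raise ValueError("none of the candidate encodings could decode the input")


if __name__ == "__main__":
    legacy = b"10 \xb0C"       # degree sign as a single byte (Latin-1 / cp1252)
    modern = b"10 \xc2\xb0C"   # the same text already encoded as UTF-8
    assert to_utf8(legacy) == modern
    assert to_utf8(modern) == modern

    # UTF-7, mentioned above, keeps every byte below 0x80 and round-trips.
    seven_bit = "10 \u00B0C".encode("utf-7")
    assert seven_bit.isascii()
    assert seven_bit.decode("utf-7") == "10 \u00B0C"
```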