The Archive Base


Editorial Team
Asked: May 17, 2026 (2026-05-17T19:59:56+00:00)


I have a string that I have read in from a Word document. I think it is in “Cp1252” encoding. Java uses UTF8.

How do I search that string for those special characters in Cp1252 and replace them with an appropriate UTF8 character?

Specifically, I want to replace the “En Dash” character with a plain “-”.

The following code block takes projDateString, which comes from the Word document, and tries to do this:

    byte[] test = projDateString.getBytes("Cp1252");
    for (int i = 0; i < test.length; i++) {
        System.out.println("test[" + i + "] = " + Integer.toHexString((byte) test[i]));
    }
    String projDateString2 = new String(test);
    projDateString2.replaceAll("\0x96", "\u2013");
    System.out.println("projDateString2: " + projDateString);

I am not sure I am setting up projDateString2 correctly. As you can see, the hex value of that dash is ffffff96 when I getBytes on the string using Cp1252 encoding. If I getBytes with UTF8 it comes in as 3 hex values instead of one.

This gives me the following output:

test[0] = 30
test[1] = 38
test[2] = 2f
test[3] = 32
test[4] = 30
test[5] = 31
test[6] = 30
test[7] = 20
test[8] = ffffff96
test[9] = 20
test[10] = 50
test[11] = 72
test[12] = 65
test[13] = 73
test[14] = 65
test[15] = 6e
test[16] = 74
projDateString2: 08/2010 ΓÇô Present

As you can see, the replace did nothing, and the println still gives me garbage characters instead of a plain-text “-”.

1 Answer

  1. Editorial Team
    Added an answer on May 17, 2026 at 7:59 pm

    Java strings are always in UTF-16, at least as far as the API is concerned… but you can generally just think of them as being “Unicode”. The fact that they’re UTF-16 is only really relevant when it comes to characters outside the Basic Multilingual Plane, i.e. with Unicode values above U+FFFF. They have to be represented as surrogate pairs in Java. But I don’t think you need to worry about this in your case. So just think of the values in Strings as “Unicode text” without a specific encoding… in particular, definitely not in UTF-8 or CP1252. Those are the encodings used to convert binary data (e.g. a byte array) into text data (e.g. a string).
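
To make the distinction between "text in a String" and "bytes in an encoding" concrete, here is a minimal sketch. The class name `EnDashEncodings` and the helper `encodedLength` are made up for illustration; it assumes the JVM ships the windows-1252 charset (which Java also accepts under the alias "Cp1252"), as standard JDKs do.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class EnDashEncodings {
    // "windows-1252" is the canonical name; "Cp1252" is an alias for the same charset.
    static final Charset CP1252 = Charset.forName("windows-1252");

    // Hypothetical helper: how many bytes the text occupies in a given encoding.
    static int encodedLength(String text, Charset charset) {
        return text.getBytes(charset).length;
    }

    public static void main(String[] args) {
        String enDash = "\u2013"; // EN DASH, inside the Basic Multilingual Plane

        System.out.println(enDash.length());                               // 1 char in the String
        System.out.println(encodedLength(enDash, CP1252));                 // 1 byte (0x96)
        System.out.println(encodedLength(enDash, StandardCharsets.UTF_8)); // 3 bytes (E2 80 93)

        // A character above U+FFFF needs a surrogate pair, i.e. two chars.
        String emoji = new String(Character.toChars(0x1F600));
        System.out.println(emoji.length());                                // 2 chars
        System.out.println(emoji.codePointCount(0, emoji.length()));       // 1 code point
    }
}
```

The one-byte versus three-byte difference is the same one the question observes when calling getBytes with Cp1252 and with UTF-8: the String holds the same character either way, only the encoded form differs.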

    You shouldn’t be using String.getBytes() or new String(byte[]) without specifying the encoding – that’s the problem. Those always use the platform default encoding – which is almost always the wrong choice.

    You say you “have a string that I have read in from a Word document” – how did you read it in? How did it start off life?

    If you have the bytes and you know the relevant encoding, you should use:

    String text = new String(bytes, encoding);
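
As a rough illustration of that call, the sketch below decodes the exact bytes from the hex dump in the question, assuming they really are Cp1252. The class name `DecodeCp1252` and the helper `decode` are hypothetical, not part of any library.

```java
import java.nio.charset.Charset;

public class DecodeCp1252 {
    // Hypothetical helper: decode raw bytes using an explicitly named charset.
    static String decode(byte[] bytes, Charset charset) {
        return new String(bytes, charset);
    }

    public static void main(String[] args) {
        // Bytes taken from the question's output; 0x96 is the en dash in Cp1252.
        byte[] raw = {
            0x30, 0x38, 0x2f, 0x32, 0x30, 0x31, 0x30, 0x20,
            (byte) 0x96,
            0x20, 0x50, 0x72, 0x65, 0x73, 0x65, 0x6e, 0x74
        };

        String text = decode(raw, Charset.forName("windows-1252"));

        // The byte 0x96 has become the single character U+2013.
        System.out.println(text.equals("08/2010 \u2013 Present")); // true
        System.out.println(Integer.toHexString(text.charAt(8)));   // 2013
    }
}
```

Passing a Charset object rather than a charset name also avoids the checked UnsupportedEncodingException that the String-name overloads declare.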
    

    You should never have to deal with a string which has been created using the wrong encoding – if you get to that stage, you’re almost bound to be risking information loss. Tackle the problem as early as you possibly can, rather than trying to fix the data up later on.
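
The information loss is easy to reproduce. The following sketch (class and method names are invented for the example) decodes the Cp1252 en dash byte as if it were UTF-8 and then encodes it back; the original byte does not survive the round trip.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class WrongCharsetLoss {
    // Hypothetical helper: decode with a charset and encode straight back with the same one.
    static byte[] roundTrip(byte[] original, Charset charset) {
        String decoded = new String(original, charset);
        return decoded.getBytes(charset);
    }

    public static void main(String[] args) {
        byte[] original = {(byte) 0x96}; // en dash in Cp1252, but not valid UTF-8 on its own

        // Decoding as UTF-8 replaces the malformed byte with U+FFFD (the replacement character).
        String decoded = new String(original, StandardCharsets.UTF_8);
        System.out.println(decoded.equals("\uFFFD")); // true

        // Encoding back yields the bytes of U+FFFD, not the original 0x96.
        byte[] after = roundTrip(original, StandardCharsets.UTF_8);
        System.out.println(Arrays.equals(original, after)); // false
        System.out.println(after.length);                   // 3
    }
}
```

Once the replacement character is in the string, nothing records which byte it used to be, which is why fixing the decoding step is safer than patching the string afterwards.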

    The next thing to understand is that the String class in Java is immutable. Calling replaceAll on a string won’t change the existing string. It will instead return a new string with the replacements made.

    So this statement:

    projDateString2.replaceAll("\0x96", "\u2013");
    

    will never do what you want. Even if everything else is correct, you should be using:

    projDateString2 = projDateString2.replaceAll("\0x96", "\u2013");
    

    (or something similar). I don’t think that will actually do what you want anyway, but you need to be aware of it for when everything else is sorted out.
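
For what it's worth, here is a sketch of what the replacement could look like once the string has been decoded correctly, so that it contains a real en dash (U+2013). Two things differ from the snippet in the question: the search target is the Unicode character rather than a byte value, and the result is kept. Note also that in a Java string literal "\0x96" is the octal escape \0 (a NUL character) followed by the three characters x, 9 and 6, so it would not match a byte 0x96 in any case. The class name `ReplaceEnDash` and the method `normalizeDashes` are hypothetical.

```java
public class ReplaceEnDash {
    // Hypothetical helper: returns a new string with every en dash replaced by an ASCII hyphen.
    static String normalizeDashes(String text) {
        // String.replace(char, char) does a literal replacement; no regex is involved.
        return text.replace('\u2013', '-');
    }

    public static void main(String[] args) {
        String projDateString = "08/2010 \u2013 Present";

        // Strings are immutable, so the return value has to be used.
        String cleaned = normalizeDashes(projDateString);

        System.out.println(cleaned);                                 // 08/2010 - Present
        System.out.println(projDateString.equals(cleaned));          // false: the original is untouched
    }
}
```

Plain replace is enough here; replaceAll treats its first argument as a regular expression, which is not needed for a single literal character.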

© 2021 The Archive Base. All Rights Reserved