In my C# code, I am extracting text from a PDF document. When I


Asked: June 16, 2026


In my C# code, I am extracting text from a PDF document. When I do that, I get a string that’s in UTF-8 or Unicode encoding (I’m not sure which). When I use Encoding.UTF8.GetBytes(src); to convert it into a byte array, I notice that the whitespace is actually two characters with byte values of 194 and 160.

For example the string “CLE action” looks like

[67, 76, 69, 194, 160, 65, 99, 116, 105, 111, 110]

in a byte array, where the whitespace is 194 and 160… And because of this src.IndexOf("CLE action"); is returning -1 when I need it to return 1.

How can I fix the encoding of the string?


1 Answer

Editorial Team · Answer 1 · 2026-06-16T04:46:17+00:00


194 160 is the UTF-8 encoding of the NO-BREAK SPACE code point, U+00A0 (the same code point that HTML calls &nbsp;).

So it’s really not a space, even though it looks like one. (You’ll see it won’t word-wrap, for instance.) A regular expression match for \s would match it, but a plain comparison with a space won’t.
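As a rough illustration of that difference, the sketch below compares a \s match with a plain string comparison. The class name WhitespaceMatchDemo and the helper CollapseWhitespace are made up for this example, and the sample string is built by hand rather than taken from a real PDF. It assumes .NET's default regex behavior, where \s covers Unicode space separators.

```csharp
using System;
using System.Text.RegularExpressions;

public static class WhitespaceMatchDemo
{
    // Hypothetical helper: collapses each run of whitespace
    // (as .NET's \s defines it, which includes U+00A0) into one regular space.
    public static string CollapseWhitespace(string src)
    {
        return Regex.Replace(src, @"\s+", " ");
    }

    public static void Main()
    {
        // Sample input with an explicit NO-BREAK SPACE between the words.
        string src = "CLE\u00A0Action";

        // \s matches the NO-BREAK SPACE.
        Console.WriteLine(Regex.IsMatch(src, @"CLE\sAction")); // True

        // A plain comparison against a regular space does not.
        Console.WriteLine(src == "CLE Action"); // False

        // After collapsing whitespace the strings compare equal.
        Console.WriteLine(CollapseWhitespace(src) == "CLE Action"); // True
    }
}
```

Note that this also rewrites tabs and line breaks, so it is a blunter tool than replacing U+00A0 alone.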

To simply replace NO-BREAK spaces you can do the following:

src = src.Replace('\u00A0', ' ');
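To see the effect end to end, here is a minimal sketch that reproduces the byte values from the question and then runs the search again after the replacement. The class name NbspDemo and the helper NormalizeSpaces are hypothetical, and the input string is constructed by hand to match the byte array in the question (65 is an uppercase "A").

```csharp
using System;
using System.Text;

public static class NbspDemo
{
    // Hypothetical helper: replaces every NO-BREAK SPACE (U+00A0)
    // with an ordinary space (U+0020).
    public static string NormalizeSpaces(string src)
    {
        return src.Replace('\u00A0', ' ');
    }

    public static void Main()
    {
        // Sample input built with an explicit U+00A0 between the words.
        string src = "CLE\u00A0Action";

        // UTF-8 encodes U+00A0 as the two bytes 194, 160 (0xC2, 0xA0).
        byte[] bytes = Encoding.UTF8.GetBytes(src);
        Console.WriteLine(string.Join(", ", bytes));
        // 67, 76, 69, 194, 160, 65, 99, 116, 105, 111, 110

        // An ordinal search for a regular space fails before normalization.
        Console.WriteLine(src.IndexOf("CLE Action", StringComparison.Ordinal)); // -1

        // After normalization the same search finds the text at index 0.
        string normalized = NormalizeSpaces(src);
        Console.WriteLine(normalized.IndexOf("CLE Action", StringComparison.Ordinal)); // 0
    }
}
```

StringComparison.Ordinal is used here so the result does not depend on the current culture.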
