I use this code to read pdf content using iTextSharp. it works fine when

Question

Asked: June 2, 2026 (2026-06-02T04:04:05+00:00)

I use this code to read PDF content using iTextSharp. It works fine when the content is English, but it doesn’t work when the content is Persian or Arabic.

Here is a sample non-English PDF for testing.

The result is something like this:

ÙŽÙ›Ù†Ø§ ÙÙ”Ø¨Ù˜Ø·Ø« ÛŒØ¿ÛŒÙ›Ù˜ Ø²Ø¾Ø§ ÙÙ›ÙØÙ” Ù‚Ù›Ù…Ø
ÛŒÙ”Ø¨Ù•Ø³ Â© Karl Seguin foppersian.codeplex.com
http://www.codebetter.com 1 1 ÙÙ”Ø¨Ù˜Ø·Ø« ÙŽÙ›Ù†Ø§ ÛŒØ¿ÛŒÙ›Ù˜
Ù‡Ù…Ø§Ù†Ø±Ø¨ Ù„ÙˆØµØ§ ÛŒØ³ÛŒÙˆÙ†  Ù…Ø±Ù† Ø¯ÛŒÙ„ÙˆØª Ø±ØªÙ‡Ø¨ Ø±Ø§Ø²ÙØ§

What is the solution?

        public string ReadPdfFile(string fileName)
        {
            StringBuilder text = new StringBuilder();

            if (File.Exists(fileName))
            {
                PdfReader pdfReader = new PdfReader(fileName);

                for (int page = 1; page <= pdfReader.NumberOfPages; page++)
                {
                    ITextExtractionStrategy strategy = new SimpleTextExtractionStrategy();
                    string currentText = PdfTextExtractor.GetTextFromPage(pdfReader, page, strategy);

                    currentText = Encoding.UTF8.GetString(Encoding.Convert(Encoding.Default, Encoding.UTF8, Encoding.UTF8.GetBytes(currentText)));
                    text.Append(currentText);
                    pdfReader.Close();
                }
            }
            return text.ToString();
        }

1 Answer

Editorial Team · Answer 1 · 2026-06-02T04:04:06+00:00

In .NET, once you have a string, you have a string, and it is Unicode, always. The in-memory representation happens to be UTF-16, but that is an implementation detail you don’t need to care about. Never, ever decompose a string into bytes, reinterpret those bytes as a different encoding, and turn the result back into a string: that doesn’t make sense and will almost always corrupt the text.

Your problem is this line:

currentText = Encoding.UTF8.GetString(Encoding.Convert(Encoding.Default, Encoding.UTF8, Encoding.UTF8.GetBytes(currentText)));

I’m going to pull it apart into a couple of lines to illustrate:

// Assumes Encoding.Default is Windows-1252, as on a Western-locale .NET Framework install
byte[] bytes = Encoding.UTF8.GetBytes("ی"); // bytes now holds 0xDB 0x8C
byte[] converted = Encoding.Convert(Encoding.Default, Encoding.UTF8, bytes); // converted now holds 0xC3 0x9B 0xC5 0x92
string final = Encoding.UTF8.GetString(converted); // final now holds "ÛŒ"

This code will garble every character outside the 7-bit ASCII range (code points above 127), which covers all Persian and Arabic letters. Drop the re-encoding line and you should be good.
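To see the difference between ASCII and non-ASCII input, here is a minimal sketch of the same round trip with the source encoding passed in explicitly. The class name `ReencodingDemo` and the helper `Reencode` are made up for illustration, and ISO-8859-1 stands in for `Encoding.Default` so the result does not depend on the machine’s code page (with Windows-1252 the damaged characters would differ, but the text would be damaged all the same).

```csharp
using System;
using System.Text;

public static class ReencodingDemo
{
    // Hypothetical helper: repeats the question's re-encoding step, but with
    // the "assumed source" encoding made explicit instead of Encoding.Default.
    public static string Reencode(string s, Encoding assumedSource)
    {
        byte[] utf8Bytes = Encoding.UTF8.GetBytes(s);
        // Misreads the UTF-8 bytes as assumedSource, then re-encodes as UTF-8.
        byte[] converted = Encoding.Convert(assumedSource, Encoding.UTF8, utf8Bytes);
        return Encoding.UTF8.GetString(converted);
    }

    public static void Main()
    {
        // Assumption: ISO-8859-1 is used here in place of Encoding.Default.
        Encoding latin1 = Encoding.GetEncoding("iso-8859-1");

        // Pure ASCII survives, because ASCII bytes mean the same in both encodings.
        Console.WriteLine(Reencode("PDF", latin1) == "PDF");       // True

        // U+06CC (Persian yeh) is two bytes in UTF-8; each byte is misread
        // as a separate character, so one letter becomes two wrong ones.
        string damaged = Reencode("\u06CC", latin1);
        Console.WriteLine(damaged.Length);                          // 2
        Console.WriteLine(damaged == "\u06CC");                     // False
    }
}
```

This is why the original code appears to work for English content and only fails once Persian or Arabic text shows up.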

Side note: it is entirely possible that whatever creates a string does so incorrectly; that’s actually not too uncommon. But you need to fix that problem at the byte level, before the data becomes a string.
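As a rough illustration of what “at the byte level” means, the sketch below decodes the same two bytes once with the right encoding and once with a wrong one. The names `DecodeDemo` and `Decode` are hypothetical, and the byte array simply hard-codes the UTF-8 form of one Persian letter; in real code the bytes would come from a file or stream.

```csharp
using System;
using System.Text;

public static class DecodeDemo
{
    // Hypothetical helper: the encoding is chosen once, when bytes become a string.
    public static string Decode(byte[] raw, Encoding encoding)
    {
        return encoding.GetString(raw);
    }

    public static void Main()
    {
        byte[] raw = { 0xDB, 0x8C }; // UTF-8 bytes for U+06CC

        string right = Decode(raw, Encoding.UTF8);
        string wrong = Decode(raw, Encoding.GetEncoding("iso-8859-1"));

        Console.WriteLine(right == "\u06CC"); // True
        Console.WriteLine(wrong == "\u06CC"); // False
    }
}
```

Once `wrong` exists as a string, the fix is to go back and decode the original bytes with the correct encoding, not to re-encode the string.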

EDIT

The code should be exactly the same as yours above, except that the re-encoding line is removed and pdfReader.Close() is called once after the loop instead of inside it. Also, make sure that whatever you use to display the text supports Unicode. And, as @kuujinbo said, make sure that you’re using a recent version of iTextSharp. I tested this with 5.2.0.0.
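One way to rule out the display as the culprit is to write the extracted text to a file with an explicit UTF-8 encoding and open it in a Unicode-aware editor. The following is only a sketch: `Utf8OutputDemo` and `RoundTrip` are made-up names, the temp file is a scratch location, and the sample string stands in for whatever `ReadPdfFile` returns.

```csharp
using System;
using System.IO;
using System.Text;

public static class Utf8OutputDemo
{
    // Hypothetical helper: writes text as UTF-8 and reads it back,
    // so the round trip can be checked without relying on a console font.
    public static string RoundTrip(string text)
    {
        string path = Path.GetTempFileName(); // scratch file, deleted below
        try
        {
            File.WriteAllText(path, text, Encoding.UTF8);
            return File.ReadAllText(path, Encoding.UTF8);
        }
        finally
        {
            File.Delete(path);
        }
    }

    public static void Main()
    {
        // Ask the console to emit UTF-8; the console font must still
        // contain Arabic-script glyphs for the text to render.
        Console.OutputEncoding = Encoding.UTF8;

        string sample = "\u0633\u0644\u0627\u0645"; // stand-in for extracted text
        Console.WriteLine(RoundTrip(sample) == sample); // True
    }
}
```

If the file shows the text correctly but your UI does not, the problem is in the display layer rather than in the extraction.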

    public string ReadPdfFile(string fileName) {
        StringBuilder text = new StringBuilder();

        if (File.Exists(fileName)) {
            PdfReader pdfReader = new PdfReader(fileName);

            for (int page = 1; page <= pdfReader.NumberOfPages; page++) {
                ITextExtractionStrategy strategy = new SimpleTextExtractionStrategy();
                string currentText = PdfTextExtractor.GetTextFromPage(pdfReader, page, strategy);

                text.Append(currentText);
            }
            pdfReader.Close();
        }
        return text.ToString();
    }

EDIT 2

The above code fixes the encoding issue but doesn’t fix the order of the strings themselves. Unfortunately this problem appears to be at the PDF level itself.

Consequently, showing text in such right-to-left writing systems
requires either positioning each glyph individually (which is tedious
and costly) or representing text with show strings (see 9.2,
“Organization and Use of Fonts”) whose character codes are given in
reverse order.

PDF 2008 Spec – 14.8.2.3.3 – Reverse-Order Show Strings

When strings are reordered like this, the content is (if I understand the spec correctly) supposed to be wrapped in a “marked content” section, BMC. However, the few sample PDFs that I’ve looked at and generated don’t appear to actually do this. I could absolutely be wrong on this part because this is very much not my specialty, so you’ll have to poke around some more.
