Editorial Team
Asked: June 2, 2026 (2026-06-02T04:04:05+00:00)

I use this code to read PDF content using iTextSharp. It works fine when the content is English, but it doesn't work when the content is Persian or Arabic.

Here is a sample non-English PDF for testing. The result is something like this:

َٛنا Ùٔب٘طث یؿیٛ٘ زؾا ÙÙ›ÙØ­Ù” قٛمح
یٔبٕس © Karl Seguin foppersian.codeplex.com
http://www.codebetter.com 1 1 Ùٔب٘طث َٛنا یؿیٛ٘

همانرب لوصا یسیون  مرن دیلوت رتهب Ø±Ø§Ø²ÙØ§

What is the solution?

        public string ReadPdfFile(string fileName)
        {
            StringBuilder text = new StringBuilder();

            if (File.Exists(fileName))
            {
                PdfReader pdfReader = new PdfReader(fileName);

                for (int page = 1; page <= pdfReader.NumberOfPages; page++)
                {
                    ITextExtractionStrategy strategy = new SimpleTextExtractionStrategy();
                    string currentText = PdfTextExtractor.GetTextFromPage(pdfReader, page, strategy);

                    currentText = Encoding.UTF8.GetString(Encoding.Convert(Encoding.Default, Encoding.UTF8, Encoding.UTF8.GetBytes(currentText)));
                    text.Append(currentText);
                    pdfReader.Close();
                }
            }
            return text.ToString();
        }
1 Answer

  • Voted
  • Oldest
  • Recent
  • Random
  1. Editorial Team
    Added an answer on June 2, 2026 at 4:04 am (2026-06-02T04:04:06+00:00)

    In .NET, once you have a string, you have a string, and it is always Unicode. The in-memory representation happens to be UTF-16, but that doesn't matter here. Never decompose a string into bytes, reinterpret those bytes as a different encoding, and turn the result back into a string: that doesn't make sense and will almost always fail.

    Your problem is this line:

    currentText = Encoding.UTF8.GetString(Encoding.Convert(Encoding.Default, Encoding.UTF8, Encoding.UTF8.GetBytes(currentText)));
    

    I’m going to pull it apart into a couple of lines to illustrate:

    byte[] bytes = Encoding.UTF8.GetBytes("ی"); // bytes now holds 0xDB 0x8C
    byte[] converted = Encoding.Convert(Encoding.Default, Encoding.UTF8, bytes); // converted now holds 0xC3 0x9B 0xC5 0x92 (when Encoding.Default is Windows-1252)
    string final = Encoding.UTF8.GetString(converted); // final now holds "ÛŒ" instead of "ی"
    

    The code will mangle anything outside the 7-bit ASCII range (code points above 127). Drop the re-encoding line and you should be good.
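The three steps pulled apart above can be reproduced as a small program. This is only a sketch: the class and method names (`ReencodingDemo`, `Garble`) are made up for illustration, and Windows-1252 is spelled out explicitly because `Encoding.Default` depends on the machine and runtime. It assumes .NET Core 3.0 or later, where code pages have to be registered through `CodePagesEncodingProvider` first.

```csharp
using System;
using System.Text;

public static class ReencodingDemo
{
    // Hypothetical helper: repeats the question's re-encoding line,
    // with Windows-1252 standing in for Encoding.Default.
    public static string Garble(string input)
    {
        // Code pages such as 1252 are not available on .NET Core / .NET 5+
        // until this provider is registered (assumption: modern .NET).
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);
        Encoding windows1252 = Encoding.GetEncoding(1252);

        // 1. Encode the (already correct) string as UTF-8 bytes.
        byte[] utf8Bytes = Encoding.UTF8.GetBytes(input);
        // 2. Pretend those bytes are Windows-1252 and convert them to UTF-8.
        byte[] converted = Encoding.Convert(windows1252, Encoding.UTF8, utf8Bytes);
        // 3. Decode the doubly encoded bytes back into a string.
        return Encoding.UTF8.GetString(converted);
    }

    public static void Main()
    {
        // ASCII bytes mean the same thing in both encodings, so they survive.
        Console.WriteLine(Garble("PDF") == "PDF");               // True
        // U+06CC (Persian yeh) becomes U+00DB U+0152, i.e. "ÛŒ".
        Console.WriteLine(Garble("\u06CC") == "\u00DB\u0152");   // True
    }
}
```

The ASCII case is why the original code appeared to work for English PDFs: only bytes above 0x7F are interpreted differently by the two encodings.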

    Side note: it is entirely possible that whatever creates a string does so incorrectly; that's actually not too uncommon. But you need to fix that problem at the byte level, before the data becomes a string.
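To illustrate what fixing it "at the byte level" might look like, here is a minimal sketch that decodes the same raw bytes twice, once with the encoding they were written in and once with the wrong one. The names `ByteLevelDecodeDemo` and `Decode` are hypothetical, and the byte array stands in for whatever source actually produces your data; nothing here is iTextSharp-specific.

```csharp
using System;
using System.Text;

public static class ByteLevelDecodeDemo
{
    // Hypothetical helper: the encoding is chosen at the moment bytes
    // become a string, which is the only place the choice can be corrected.
    public static string Decode(byte[] raw, Encoding actualEncoding)
    {
        return actualEncoding.GetString(raw);
    }

    public static void Main()
    {
        // The UTF-8 bytes for U+06CC (Persian yeh).
        byte[] raw = { 0xDB, 0x8C };

        string right = Decode(raw, Encoding.UTF8);
        // ISO-8859-1 maps every byte to the code point with the same value.
        string wrong = Decode(raw, Encoding.GetEncoding("iso-8859-1"));

        Console.WriteLine(right == "\u06CC");        // True
        Console.WriteLine(wrong == "\u00DB\u008C");  // True: two unrelated characters
    }
}
```

Once `wrong` exists as a string, the information about which encoding should have been used is gone, which is why the correction belongs before this step rather than after it.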

    EDIT

    The code should be exactly the same as yours above, except that the re-encoding line is removed and pdfReader.Close() is called once after the loop instead of inside it. Also, make sure that whatever you're using to display the text supports Unicode. And, as @kuujinbo said, make sure that you're using a recent version of iTextSharp. I tested this with 5.2.0.0.

        public string ReadPdfFile(string fileName) {
            StringBuilder text = new StringBuilder();
    
            if (File.Exists(fileName)) {
                PdfReader pdfReader = new PdfReader(fileName);
    
                for (int page = 1; page <= pdfReader.NumberOfPages; page++) {
                    ITextExtractionStrategy strategy = new SimpleTextExtractionStrategy();
                    string currentText = PdfTextExtractor.GetTextFromPage(pdfReader, page, strategy);
    
                    text.Append(currentText);
                }
                pdfReader.Close();
            }
            return text.ToString();
        }
    

    EDIT 2

    The above code fixes the encoding issue but doesn’t fix the order of the strings themselves. Unfortunately this problem appears to be at the PDF level itself.

    Consequently, showing text in such right-to-left writing systems
    requires either positioning each glyph individually (which is tedious
    and costly) or representing text with show strings (see 9.2,
    “Organization and Use of Fonts”) whose character codes are given in
    reverse order.

    PDF 2008 Spec – 14.8.2.3.3 – Reverse-Order Show Strings

    When re-ordering strings as described above, content is (if I understand the spec correctly) supposed to use a “marked content” section, BMC. However, the few sample PDFs that I’ve looked at and generated don’t appear to actually do this. I could absolutely be wrong on this part because this is very much not my specialty, so you’ll have to poke around some more.
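If the extracted text really does come back with its characters in reverse order, one crude workaround is to reverse each affected line after extraction. The sketch below is an assumption-heavy illustration, not something iTextSharp provides: the names (`ReverseOrderDemo`, `ReverseLine`, `FixReversedLines`) are hypothetical, it only looks for characters in the basic Arabic block (U+0600 to U+06FF), it assumes `\n` line endings, and it will also reverse any digits or Latin text on the same line. It does nothing about chunks that the PDF emits out of order.

```csharp
using System;
using System.Globalization;
using System.Text;

public static class ReverseOrderDemo
{
    // True when the line has at least one character in U+0600..U+06FF,
    // the basic Arabic block (which also covers the Persian letters).
    private static bool ContainsArabicScript(string line)
    {
        foreach (char c in line)
        {
            if (c >= '\u0600' && c <= '\u06FF') return true;
        }
        return false;
    }

    // Reverses a line by text element rather than by char, so that
    // combining marks and surrogate pairs stay attached to their base.
    public static string ReverseLine(string line)
    {
        int[] starts = StringInfo.ParseCombiningCharacters(line);
        var reversed = new StringBuilder(line.Length);
        for (int i = starts.Length - 1; i >= 0; i--)
        {
            int end = (i + 1 < starts.Length) ? starts[i + 1] : line.Length;
            reversed.Append(line, starts[i], end - starts[i]);
        }
        return reversed.ToString();
    }

    // Reverses only the lines that contain Arabic-script characters.
    public static string FixReversedLines(string text)
    {
        string[] lines = text.Split('\n');
        for (int i = 0; i < lines.Length; i++)
        {
            if (ContainsArabicScript(lines[i])) lines[i] = ReverseLine(lines[i]);
        }
        return string.Join("\n", lines);
    }

    public static void Main()
    {
        // Second line is a Persian word with its letters in reverse order.
        string extracted = "abc\n\u0647\u0645\u0627\u0646\u0631\u0628";
        string expected = "abc\n\u0628\u0631\u0646\u0627\u0645\u0647";
        Console.WriteLine(FixReversedLines(extracted) == expected); // True
    }
}
```

A more robust approach would apply the Unicode bidirectional algorithm or a custom extraction strategy, but that goes beyond what this answer covers.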
