The Archive Base

Editorial Team
Asked: May 24, 2026

I am using iTextSharp to read the text content of a PDF, and that part works. But I am losing the text formatting, such as the font and color. Is there any way to get that formatting as well?

Below is the code segment I am using to extract the text:

PdfReader reader = new PdfReader("F:\\EBooks\\AspectsOfAjax.pdf");
textBox1.Text = ExtractTextFromPDFBytes(reader.GetPageContent(1));

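// Size of the look-behind buffer. The declaration is not shown in the original
// snippet; 15 is an assumed value.
private const int _numberOfCharsToKeep = 15;

private string ExtractTextFromPDFBytes(byte[] input)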
{
    if (input == null || input.Length == 0) return "";
    try
    {
        string resultString = "";
        // Flag showing whether we are currently inside a text object
        bool inTextObject = false;
        // Flag showing if the next character is literal  e.g. '\\' to get a '\' character or '\(' to get '('
        bool nextLiteral = false;
        // () Bracket nesting level. Text appears inside ()
        int bracketDepth = 0;
        // Keep previous chars so we can look back for operators, numbers etc.:
        char[] previousCharacters = new char[_numberOfCharsToKeep];
        for (int j = 0; j < _numberOfCharsToKeep; j++) previousCharacters[j] = ' ';
        for (int i = 0; i < input.Length; i++)
        {
            char c = (char)input[i];
            if (inTextObject)
            {
                // Position the text
                if (bracketDepth == 0)
                {
                    if (CheckToken(new string[] { "TD", "Td" }, previousCharacters))
                    {
                        resultString += "\n\r";
                    }
                    else
                    {
                        if (CheckToken(new string[] {"'", "T*", "\""}, previousCharacters))
                        {
                            resultString += "\n";
                        }
                        else
                        {
                            if (CheckToken(new string[] { "Tj" }, previousCharacters))
                            {
                                resultString += " ";
                            }
                        }
                    }
                }
                // End of a text object, also go to a new line.
                if (bracketDepth == 0 && CheckToken( new string[]{"ET"}, previousCharacters))
                {
                    inTextObject = false;
                    resultString += " ";
                }
                else
                {
                    // Start outputting text
                    if ((c == '(') && (bracketDepth == 0) && (!nextLiteral))
                    {
                        bracketDepth = 1;
                    }
                    else
                    {
                        // Stop outputting text
                        if ((c == ')') && (bracketDepth == 1) && (!nextLiteral))
                        {
                            bracketDepth = 0;
                        }
                        else
                        {
                            // Just a normal text character:
                            if (bracketDepth == 1)
                            {
                                // Only print out next character no matter what. 
                                // Do not interpret.
                                if (c == '\\' && !nextLiteral)
                                {
                                    nextLiteral = true;
                                }
                                else
                                {
                                    if (((c >= ' ') && (c <= '~')) || ((c >= 128) && (c < 255)))
                                    {
                                        resultString += c.ToString();
                                    }
                                    nextLiteral = false;
                                }
                            }
                        }
                    }
                }
            }
            // Store the recent characters for when we have to go back for a checking
            for (int j = 0; j < _numberOfCharsToKeep - 1; j++)
            {
                previousCharacters[j] = previousCharacters[j + 1];
            }
            previousCharacters[_numberOfCharsToKeep - 1] = c;

            // Start of a text object
            if (!inTextObject && CheckToken(new string[]{"BT"}, previousCharacters))
            {
                inTextObject = true;
            }
        }
        return resultString;
    }
    catch
    {
        return "";
    }
}

private bool CheckToken(string[] tokens, char[] recent)
{
    foreach(string token in tokens)
    {
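        // Only two-character operators fit this look-behind window; the length
        // check also stops token[1] from throwing on one-character tokens such as "'".
        if (token.Length == 2 &&
            (recent[_numberOfCharsToKeep - 3] == token[0]) &&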
            (recent[_numberOfCharsToKeep - 2] == token[1]) &&
            ((recent[_numberOfCharsToKeep - 1] == ' ') ||
            (recent[_numberOfCharsToKeep - 1] == 0x0d) ||
            (recent[_numberOfCharsToKeep - 1] == 0x0a)) &&
            ((recent[_numberOfCharsToKeep - 4] == ' ') ||
            (recent[_numberOfCharsToKeep - 4] == 0x0d) ||
            (recent[_numberOfCharsToKeep - 4] == 0x0a))
            )
        {
            return true;
        }
    }
    return false;
}
1 Answer

  1. Editorial Team
    Added an answer on May 24, 2026 at 3:52 am

    Let me try pointing you in a different direction. iTextSharp has a really beautiful and simple text extraction system that handles some of the basic tokens. Unfortunately it doesn’t handle color information, but according to @Mark Storer it might not be too hard to implement yourself.

    BEGIN EDIT

    I started work on implementing color information. See my blog post here for more details. (Sorry for the bad formatting, heading off to dinner now.)

    END EDIT
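
    Once a fill color is available (the extraction API used in this answer does not hand it to you directly), writing it into the span is the easy part. Below is a minimal sketch, assuming the color has already been reduced to 0-255 RGB components. `CssColor.FromRgb` is a hypothetical helper made up for illustration, not part of iTextSharp.

```csharp
using System;
using System.Globalization;

// Hypothetical helper (not an iTextSharp type): turns RGB components
// into a CSS declaration that can be appended to a span's style attribute.
public static class CssColor
{
    public static string FromRgb(int r, int g, int b)
    {
        // Clamp each component to the valid 0-255 range.
        r = Math.Max(0, Math.Min(255, r));
        g = Math.Max(0, Math.Min(255, g));
        b = Math.Max(0, Math.Min(255, b));

        // Two lowercase hex digits per component, independent of OS locale.
        return string.Format(CultureInfo.InvariantCulture,
            "color:#{0:x2}{1:x2}{2:x2}", r, g, b);
    }
}
```

    The result would be appended after the `font-size` entry, separated by a semicolon.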

    The code below combines several questions and answers here, including this one to get the font height (although it’s not exact) as well as another one (that for the life of me I can’t seem to find anymore) that shows how to detect faux bold.

    The PostscriptFontName returns some additional characters in front of the font name (for example the NJNSWD+ in NJNSWD+Papyrus-Regular). This is the tag that gets prepended when only a subset of a font is embedded: six uppercase letters followed by a plus sign.
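
    If you would rather have the plain font name in the output, the prefix can be removed with ordinary string handling. This is a minimal sketch, assuming the prefix always has the usual subset-tag shape (six uppercase ASCII letters, then `+`). `FontNames.StripSubsetTag` is a hypothetical helper, not something iTextSharp provides.

```csharp
using System;

// Hypothetical helper (not an iTextSharp type).
public static class FontNames
{
    // Returns the name without a leading subset tag such as "NJNSWD+".
    // Names that do not start with six uppercase letters and '+' are returned unchanged.
    public static string StripSubsetTag(string postscriptName)
    {
        if (postscriptName == null || postscriptName.Length < 8 || postscriptName[6] != '+')
        {
            return postscriptName;
        }
        for (int i = 0; i < 6; i++)
        {
            char c = postscriptName[i];
            if (c < 'A' || c > 'Z')
            {
                return postscriptName;
            }
        }
        // Skip the six letters and the plus sign.
        return postscriptName.Substring(7);
    }
}
```

    Keep in mind that the stripped name is only as useful as the fonts installed on the machine that renders the HTML.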

    Below is a complete WinForms application that targets iTextSharp 5.1.1.0 and extracts text as HTML.
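
    One caveat about producing HTML: the code in this answer appends the text from the PDF as-is, so a `<`, `>` or `&` in the document ends up unescaped in the markup. A minimal sketch of one way to guard against that, assuming .NET 4.0 or later so that `System.Net.WebUtility` is available. `SpanText.Encode` is a hypothetical wrapper name.

```csharp
using System;
using System.Net;

// Hypothetical wrapper (not an iTextSharp type) around the framework's HTML encoder.
public static class SpanText
{
    // Escapes characters that have a special meaning in HTML.
    // A null input is treated as an empty string.
    public static string Encode(string rawText)
    {
        return WebUtility.HtmlEncode(rawText ?? string.Empty);
    }
}
```

    In the strategy class this would wrap the value returned by `renderInfo.GetText()` before it is appended to the buffer.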

    Screenshot of sample PDF


    Sample text extracted as HTML

    <span style="font-family:NJNSWD+Papyrus-Regular;font-size:11.61407">Hello </span>
    <span style="font-family:NJNSWD+Papyrus-Regular-Bold;font-size:11.61407">w</span>
    <span style="font-family:NJNSWD+Papyrus-Regular-Bold;font-size:37.87201">o</span>
    <span style="font-family:NJNSWD+Papyrus-Regular-Bold;font-size:11.61407">rl</span>
    <span style="font-family:NJNSWD+Papyrus-Regular;font-size:11.61407">d </span>
    <br />
    <span style="font-family:NJNSWD+Papyrus-Regular;font-size:11.61407">Test </span>
    

    Code

    using System;
    using System.Collections.Generic;
    using System.Text;
    using System.Windows.Forms;
    using iTextSharp.text.pdf.parser;
    using iTextSharp.text.pdf;
    
    namespace WindowsFormsApplication2
    {
        public partial class Form1 : Form
        {
            public Form1()
            {
                InitializeComponent();
            }
    
            private void Form1_Load(object sender, EventArgs e)
            {
                PdfReader reader = new PdfReader(System.IO.Path.Combine(Environment.GetFolderPath(Environment.SpecialFolder.Desktop), "Document.pdf"));
                TextWithFontExtractionStategy S = new TextWithFontExtractionStategy();
                string F = iTextSharp.text.pdf.parser.PdfTextExtractor.GetTextFromPage(reader, 1, S);
                Console.WriteLine(F);
    
                this.Close();
            }
    
            public class TextWithFontExtractionStategy : iTextSharp.text.pdf.parser.ITextExtractionStrategy
            {
                //HTML buffer
                private StringBuilder result = new StringBuilder();
    
                //Store last used properties
                private Vector lastBaseLine;
                private string lastFont;
                private float lastFontSize;
    
                //http://api.itextpdf.com/itext/com/itextpdf/text/pdf/parser/TextRenderInfo.html
                private enum TextRenderMode
                {
                    FillText = 0,
                    StrokeText = 1,
                    FillThenStrokeText = 2,
                    Invisible = 3,
                    FillTextAndAddToPathForClipping = 4,
                    StrokeTextAndAddToPathForClipping = 5,
                    FillThenStrokeTextAndAddToPathForClipping = 6,
                    AddTextToPathForClipping = 7
                }
    
    
    
                public void RenderText(iTextSharp.text.pdf.parser.TextRenderInfo renderInfo)
                {
                    string curFont = renderInfo.GetFont().PostscriptFontName;
                    //Check if faux bold is used
                    if ((renderInfo.GetTextRenderMode() == (int)TextRenderMode.FillThenStrokeText))
                    {
                        curFont += "-Bold";
                    }
    
                    //This code assumes that if the baseline changes then we're on a newline
                    Vector curBaseline = renderInfo.GetBaseline().GetStartPoint();
                    Vector topRight = renderInfo.GetAscentLine().GetEndPoint();
                    iTextSharp.text.Rectangle rect = new iTextSharp.text.Rectangle(curBaseline[Vector.I1], curBaseline[Vector.I2], topRight[Vector.I1], topRight[Vector.I2]);
                    Single curFontSize = rect.Height;
    
                    //See if something has changed, either the baseline, the font or the font size
                    if ((this.lastBaseLine == null) || (curBaseline[Vector.I2] != lastBaseLine[Vector.I2]) || (curFontSize != lastFontSize) || (curFont != lastFont))
                    {
                        //if we've put down at least one span tag close it
                        if ((this.lastBaseLine != null))
                        {
                            this.result.AppendLine("</span>");
                        }
                        //If the baseline has changed then insert a line break
                        if ((this.lastBaseLine != null) && curBaseline[Vector.I2] != lastBaseLine[Vector.I2])
                        {
                            this.result.AppendLine("<br />");
                        }
                        //Create an HTML tag with appropriate styles
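                        //Invariant culture keeps the decimal point in font-size whatever the OS locale is
                        this.result.AppendFormat(System.Globalization.CultureInfo.InvariantCulture, "<span style=\"font-family:{0};font-size:{1}\">", curFont, curFontSize);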
                    }
    
                    //Append the current text
                    this.result.Append(renderInfo.GetText());
    
                    //Set currently used properties
                    this.lastBaseLine = curBaseline;
                    this.lastFontSize = curFontSize;
                    this.lastFont = curFont;
                }
    
                public string GetResultantText()
                {
                    //If we wrote anything then we'll always have a missing closing tag so close it here
                    if (result.Length > 0)
                    {
                        result.Append("</span>");
                    }
                    return result.ToString();
                }
    
                //Not needed
                public void BeginTextBlock() { }
                public void EndTextBlock() { }
                public void RenderImage(ImageRenderInfo renderInfo) { }
            }
        }
    }
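
    The strategy above treats any difference in the baseline Y coordinate as a new line, using an exact `!=` comparison on floats. If a PDF positions text on the same visual line with slightly different coordinates, that can produce spurious `<br />` tags. A minimal sketch of a tolerance-based comparison follows. `Baselines.SameLine` is a hypothetical helper, and the default tolerance of 0.5 units is an arbitrary assumption that would need tuning per document.

```csharp
using System;

// Hypothetical helper (not an iTextSharp type).
public static class Baselines
{
    // Treats two baseline Y coordinates as the same line when they differ
    // by no more than `tolerance`. The default of 0.5 is a guess, not a standard.
    public static bool SameLine(float y1, float y2, float tolerance = 0.5f)
    {
        return Math.Abs(y1 - y2) <= tolerance;
    }
}
```

    In `RenderText`, both `curBaseline[Vector.I2] != lastBaseLine[Vector.I2]` checks would become `!Baselines.SameLine(curBaseline[Vector.I2], lastBaseLine[Vector.I2])`. A fixed tolerance will still misjudge superscripts and very small fonts, so treat it as a starting point.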
    