I am using iTextSharp in my C# WinForms application. I want to extract a particular paragraph from a PDF file. Is this possible with iTextSharp?
Yes and no.
First the no. The PDF format has no concept of text structures such as paragraphs, sentences, or even words; it just has runs of text. The fact that two runs of text sit near each other, so that we read them as structured, is a human thing. When you see something that looks like a three-line paragraph in a PDF, the program that generated the PDF actually chopped the text into three unrelated lines and then drew each line at specific x,y coordinates. Even worse, depending on what the designer wants, each line of text could be composed of smaller runs that could be words or even single characters. So it might be

    draw "the cat in the hat" at 10,10

or it might be

    draw "t" at 10,10, then draw "h" at 14,10, then draw "e" at 18,10

and so on. This is actually pretty common with PDFs from heavily designed programs like Adobe InDesign.

Now the yes. Actually, it's a maybe. If you are willing to put in a little work, you might be able to get iTextSharp to do what you are looking for. There is a class called PdfTextExtractor that has a method called GetTextFromPage that will get all of the raw text from a page. The last parameter to this method is an object that implements the ITextExtractionStrategy interface. If you create your own class that implements this interface, you can process each run of text and perform your own logic.

In this interface there's a method called RenderText which gets called for every run of text. You'll be given an iTextSharp.text.pdf.parser.TextRenderInfo object from which you can get the raw text of the run as well as other things such as the coordinates it starts at, the current font, etc. Since a visual line of text can be composed of multiple runs, you can use this method to compare the run's baseline (specifically, the y coordinate of its starting point) to that of the previous run to determine whether it is part of the same visual line.

Below is an example of an implementation of that interface:
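What follows is a minimal sketch of such a strategy, assuming iTextSharp 5.x. The class name TextAsParagraphsExtractionStrategy is illustrative, and treating any change in the baseline's y value as a new line is a simplifying assumption (a small tolerance may work better on real files). The strings and baselines fields are the ones the rest of this answer inspects.

```csharp
using System.Collections.Generic;
using System.Text;
using iTextSharp.text.pdf.parser;

// Sketch only: assumes iTextSharp 5.x. The class name is illustrative.
public class TextAsParagraphsExtractionStrategy : ITextExtractionStrategy
{
    // Text of the visual line currently being assembled.
    private readonly StringBuilder result = new StringBuilder();

    // Baseline start point of the previous run; null until the first run arrives.
    private Vector lastBaseline;

    // Completed lines and their baseline y coordinates, index-aligned.
    // Public fields for brevity; properties would be cleaner.
    public List<string> strings = new List<string>();
    public List<float> baselines = new List<float>();

    // Called once for every run of text on the page.
    public void RenderText(TextRenderInfo renderInfo)
    {
        Vector curBaseline = renderInfo.GetBaseline().GetStartPoint();

        // Assumption: a different baseline y (Vector.I2) means a new visual line.
        // Exact float comparison is used here for simplicity.
        if (lastBaseline != null && curBaseline[Vector.I2] != lastBaseline[Vector.I2])
        {
            FlushLine();
        }

        result.Append(renderInfo.GetText());
        lastBaseline = curBaseline;
    }

    // Stores the buffered line (unless it is only whitespace) and resets the buffer.
    private void FlushLine()
    {
        if (!string.IsNullOrWhiteSpace(result.ToString()))
        {
            baselines.Add(lastBaseline[Vector.I2]);
            strings.Add(result.ToString());
        }
        result.Clear();
    }

    // Called at the end of the page; flushes whatever is still buffered.
    // The return value is not meaningful; callers read strings and baselines.
    public string GetResultantText()
    {
        if (lastBaseline != null)
        {
            FlushLine();
        }
        return string.Empty;
    }

    // Required by the interface but not needed here.
    public void BeginTextBlock() { }
    public void EndTextBlock() { }
    public void RenderImage(ImageRenderInfo renderInfo) { }
}
```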
To call it we’d do:
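A sketch of the call, again assuming iTextSharp 5.x. It assumes a strategy class named TextAsParagraphsExtractionStrategy that implements ITextExtractionStrategy and exposes public strings and baselines lists; the file path is a hypothetical placeholder.

```csharp
using System;
using iTextSharp.text.pdf;
using iTextSharp.text.pdf.parser;

public static class Program
{
    public static void Main(string[] args)
    {
        // Hypothetical input path; pass a real file as the first argument.
        string workingFile = args.Length > 0 ? args[0] : "test.pdf";

        PdfReader reader = new PdfReader(workingFile);
        try
        {
            // Assumed to be defined elsewhere, with public strings/baselines lists.
            var strategy = new TextAsParagraphsExtractionStrategy();

            // The return value is ignored; the strategy's fields hold the results.
            PdfTextExtractor.GetTextFromPage(reader, 1, strategy);

            // Print each visual line with the y coordinate of its baseline.
            for (int i = 0; i < strategy.strings.Count; i++)
            {
                Console.WriteLine("Line {0,-5}: {1}", strategy.baselines[i], strategy.strings[i]);
            }
        }
        finally
        {
            reader.Close();
        }
    }
}
```

This reads page 1 only; looping from 1 to reader.NumberOfPages with a fresh strategy per page would cover the whole file.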
We're actually throwing away the value from GetTextFromPage and instead inspecting the worker's baselines and strings fields. The next step would be to compare the baselines and try to determine how to group lines together into paragraphs.

I should note that not all paragraphs have spacing that differs from the spacing between individual lines of text. For instance, if you run the PDF created below through the code above, you'll see that every line of text is 18 points from the next, regardless of whether the line starts a new paragraph or not. If you open the PDF it creates in Acrobat and cover everything but the first letter of each line, you'll see that your eye can't even tell the difference between a line break and a paragraph break.
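A PDF with that property can be produced with a sketch like the following, assuming iTextSharp 5.x. The class name SamplePdf, the output path, and the filler text are all placeholders. It relies on iTextSharp's default font and leading and adds no extra spacing between paragraphs, so paragraph breaks and wrapped lines should end up equally spaced.

```csharp
using System.IO;
using System.Linq;
using iTextSharp.text;
using iTextSharp.text.pdf;

// Sketch only: assumes iTextSharp 5.x. Names and text are placeholders.
public static class SamplePdf
{
    public static void Create(string path)
    {
        // Filler long enough to wrap across several lines on a letter-size page.
        string filler = string.Join(" ",
            Enumerable.Repeat("Placeholder text for the sample document.", 12));

        using (var fs = new FileStream(path, FileMode.Create, FileAccess.Write, FileShare.None))
        {
            var doc = new Document(PageSize.LETTER);
            PdfWriter.GetInstance(doc, fs);
            doc.Open();

            // Two paragraphs with default font and leading and no spacing
            // before or after, so only the leading separates them.
            doc.Add(new Paragraph(filler));
            doc.Add(new Paragraph(filler));

            // Closing the document finishes the PDF and flushes it to the stream.
            doc.Close();
        }
    }
}
```

Calling SamplePdf.Create("test.pdf") and then running the extraction strategy over the result lets you compare the gaps between consecutive baselines.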