I’m working on a system that extracts text from PDF files using iTextSharp. I have already created a class that implements ITextExtractionStrategy, with methods such as RenderText() and GetResultantText(). I have also studied the LocationTextExtractionStrategy class that ships with iTextSharp.
The problem I’m facing is that, for one particular PDF document, RenderText() reports the horizontal position of a few text chunks incorrectly. This happens for around 15-20 chunks out of the 700+ text chunks on the page. I’m using the following simple code to get the text position in RenderText():
Vector curBaselineStart = renderInfo.GetBaseline().GetStartPoint();
LineSegment segment = renderInfo.GetBaseline();
TextChunk location = new TextChunk(renderInfo.GetText(), segment.GetStartPoint(), segment.GetEndPoint(), renderInfo.GetSingleSpaceWidth());
chunks.Add(location);
After collecting all the text chunks, I try to draw them on a bitmap, using the Graphics class and the following simple loop:
for (int k = 0; k < chunks.Count; k++)
{
    var ch = chunks[k];
    g.DrawString(ch.text, fnt, Brushes.Black, ch.startLocation[Vector.I1], bmp.Height - ch.startLocation[Vector.I2], StringFormat.GenericTypographic);
}
The problem affects only the X (horizontal) coordinate of these few text chunks: they are drawn slightly to the left of their actual position. I was wondering whether there’s something wrong with my code here.
Shujaat
I finally figured this out. In PDF, computing the actual text position is more complicated than simply reading the baseline coordinates: you also need to account for character and word spacing, horizontal and vertical scaling, and a few other factors. I corresponded with the iText developers, and they have now added a new method to the TextRenderInfo class that returns actual character-by-character positions, taking all of the above into account.
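To make the spacing factors concrete, here is a minimal sketch of how the horizontal advance of each glyph is built up according to the PDF specification's text-positioning rule, tx = (w0 * Tfs + Tc + Tw) * Th. This is not iTextSharp code: the GlyphAdvance class, the StartOffsets method and the glyphWidth1000 callback are hypothetical names used only for illustration. The sketch also ignores TJ kerning adjustments, text rise and the text matrix, and it treats every ' ' character as the single-byte space that word spacing applies to.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper, not part of iTextSharp.
static class GlyphAdvance
{
    // Returns the x offset (in text space, relative to the start of the
    // string) at which each glyph of 'text' begins.
    //   glyphWidth1000: glyph width in 1/1000 text space units, as stored in the font
    //   fontSize: Tfs, charSpacing: Tc, wordSpacing: Tw
    //   horizontalScalingPercent: the Tz operand (100 means no scaling)
    public static List<double> StartOffsets(
        string text,
        Func<char, double> glyphWidth1000,
        double fontSize,
        double charSpacing,
        double wordSpacing,
        double horizontalScalingPercent)
    {
        double th = horizontalScalingPercent / 100.0;
        var offsets = new List<double>();
        double x = 0.0;

        foreach (char c in text)
        {
            offsets.Add(x);

            double w0 = glyphWidth1000(c) / 1000.0;      // glyph width in text space
            double tw = (c == ' ') ? wordSpacing : 0.0;  // word spacing only after a space

            // Advance for this glyph: tx = (w0 * Tfs + Tc + Tw) * Th
            x += (w0 * fontSize + charSpacing + tw) * th;
        }

        return offsets;
    }
}
```

With non-zero character or word spacing, the offsets drift away from what you get by measuring the string with a GDI+ font alone, which is consistent with chunks appearing shifted horizontally. In iTextSharp 5.x the per-character method mentioned above is, as far as I can tell, TextRenderInfo.GetCharacterRenderInfos(); treat that name as my assumption and check it against the version you are using.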