I am trying to extract text with all information from the pdf using pdfbox. I got all the information i want, except color. I tried different ways to get the fontcolor (including Getting Text Colour with PDFBox). But not working. And now I copied code from PageDrawer class of pdfBox. But then also the RGB value is not correct.
protected void processTextPosition(TextPosition text) {
Composite com;
Color col;
switch(this.getGraphicsState().getTextState().getRenderingMode()) {
case PDTextState.RENDERING_MODE_FILL_TEXT:
com = this.getGraphicsState().getNonStrokeJavaComposite();
int r = this.getGraphicsState().getNonStrokingColor().getJavaColor().getRed();
int g = this.getGraphicsState().getNonStrokingColor().getJavaColor().getGreen();
int b = this.getGraphicsState().getNonStrokingColor().getJavaColor().getBlue();
int rgb = this.getGraphicsState().getNonStrokingColor().getJavaColor().getRGB();
float []cosp = this.getGraphicsState().getNonStrokingColor().getColorSpaceValue();
PDColorSpace pd = this.getGraphicsState().getNonStrokingColor().getColorSpace();
break;
case PDTextState.RENDERING_MODE_STROKE_TEXT:
System.out.println(this.getGraphicsState().getStrokeJavaComposite().toString());
System.out.println(this.getGraphicsState().getStrokingColor().getJavaColor().getRGB());
break;
case PDTextState.RENDERING_MODE_NEITHER_FILL_NOR_STROKE_TEXT:
//basic support for text rendering mode "invisible"
Color nsc = this.getGraphicsState().getStrokingColor().getJavaColor();
float[] components = {Color.black.getRed(),Color.black.getGreen(),Color.black.getBlue()};
Color c1 = new Color(nsc.getColorSpace(),components,0f);
System.out.println(this.getGraphicsState().getStrokeJavaComposite().toString());
break;
default:
System.out.println(this.getGraphicsState().getNonStrokeJavaComposite().toString());
System.out.println(this.getGraphicsState().getNonStrokingColor().getJavaColor().getRGB());
}
I am using the above code. The values getting are
r = 0, g = 0, b = 0, inside cosp object value is [0.0], inside pd object array = null and colorSpace = null. and RGB value is always -16777216. Please help me. Thanks in advance.
I tried the code in the link you posted and it worked for me. The colors I get back are 148.92, 179.01001 and 214.965. I wish I could give you my PDF to work with, maybe if I store it externally to SO? My PDF used a sort of palish blue color and that seems to match. It was just one page of text created in Word 2010 and exported, nothing too intense.
A couple of suggestions ….
That is all I can think of now, otherwise I have version of 1.7.1 of pdfbox and fontbox and like I said I pretty much followed the link you gave.
EDIT
Based upon my comments, here perhaps is a minorly invasive way of doing it for pdf files like
color.pdf?In
PDFStreamEngine.javain theprocessOperatormethod one can do inside the try blockThis will show you the information, then it is up to you to process the color information for each section according to your needs. Here is a snippet from the beginning of the output of the above code when run on
color.pdf…BT
rG
[COSInt(1), COSInt(0), CosInt(0)]
RG
[COSInt(1), COSInt(0), CosInt(0)]
ET
BT
ET
BT
rG
[COSFloat{0.573}, COSFloat{0.816}, COSFloat{0.314}]
RG
[COSFloat{0.573}, COSFloat{0.816}, COSFloat{0.314}]
ET
......
You see in the above output an empty BT ET section, this being a section which is marked DEVICEGRAY. All the other give you [0,1] values for the R, G and B components