I want to extract images from an PDF. I’m using iTextSharp right now.
Some images can be extracted correct, but most of them don’t have the right colors and are distorted.
I did some experiments with different PixelFormats, but I didn’t get a solution for my Problem…
This is the Code which separates the image-types:
if (filter == "/FlateDecode")
{
// ...
int w = int.Parse(width);
int h = int.Parse(height);
int bpp = tg.GetAsNumber(PdfName.BITSPERCOMPONENT).IntValue;
byte[] rawBytes = PdfReader.GetStreamBytesRaw((PRStream)tg);
byte[] decodedBytes = PdfReader.FlateDecode(rawBytes);
byte[] streamBytes = PdfReader.DecodePredictor(decodedBytes, tg.GetAsDict(PdfName.DECODEPARMS));
PixelFormat[] pixFormats = new PixelFormat[23] {
PixelFormat.Format24bppRgb,
// ... all Pixel Formats
};
for (int i = 0; i < pixFormats.Length; i++)
{
Program.ToPixelFormat(w, h, pixFormats[i], streamBytes, bpp, images));
}
}
This is the Code to save the Image in a MemoryStream. Saving the image in a folder is implemented later.
private static void ToPixelFormat(int width, int height, PixelFormat pixelformat, byte[] bytes, int bpp, IList<Image> images)
{
Bitmap bmp = new Bitmap(width, height, pixelformat);
BitmapData bmd = bmp.LockBits(new Rectangle(0, 0, width, height),
ImageLockMode.WriteOnly, pixelformat);
Marshal.Copy(bytes, 0, bmd.Scan0, bytes.Length);
bmp.UnlockBits(bmd);
using (var ms = new MemoryStream())
{
bmp.Save(ms, System.Drawing.Imaging.ImageFormat.Tiff);
bytes = ms.GetBuffer();
}
images.Add(bmp);
}
Please help me.
I found an solution for my own problem.
To extract all Images on all Pages, it is not necessary to implement different filters.
iTextSharp has an Image Renderer, which saves all Images in their original image type.
Just do the following found here: http://kuujinbo.info/iTextSharp/CCITTFaxDecodeExtract.aspx
You don’t need to implement HttpHandler…