I’m working on a project to automatically process scanned invoices. In order get a

Question

0

Asked: May 26, 20262026-05-26T22:37:48+00:00 2026-05-26T22:37:48+00:00

I’m working on a project to automatically process scanned invoices. In order get a

0

I’m working on a project to automatically process scanned invoices. In order get a better result the OCR engine, I’d like to first remove noise from images. Beside scratches, I’d also like to remove anything that was added to document after it was printed. Many invoice e.g. were ticked off and sometimes it makes parts of the invoice unreadable for OCR.

For example have a look at this image. The description of the second item won’t be readable and I’d like to remove “noise” like that.

So how can I remove handwritten regions like that and still maintain a high quality of the printed text beneath?

Report

Leave an answer
Cancel reply

You must login to add an answer.

Need An Account,

1 Answer

Editorial Team · Answer 1 · 2026-05-26T22:37:49+00:00

Scratches and other spots can fairly easily be filtered by just ignoring any pixels that are not of at least a certain colour intensity.

You have three options for dealing with the lines:

First important question, are the hand writing written in a different colour? A simple solution would be to give everyone blue or red pens and ban black pens from being used. You can then scan the documents in colour, then you can easily just use the Green buffer as your grey-scale image instead of all three buffer. That would be the easiest way to implement this, almost every scanner now days supports colour scanning.
Otherwise you’re going to have to write an algorithm that can detect
lines within an image, for this to work, you would need to first
calibrate the algorithm to first know what is the size of a
character usually, then find any lines that are longer than X
pixels, then remove the line from there. This is going to be very problematic and not going to work too well for you, and you will spend a long time trying to get it working and it still will never be 100%.
The other way is that after doing your OCR you should present your
data to an end user to verify it’s correct, you could then present
them with the scanned image and allow them to overwrite what was
scanned if it was incorrect.

Of the three options I would say your best option is just to prevent people from writing on invoices with a black pen. If you can’t do that, then scan the document best you can and provide it to an end user to clarify the problematic fields (you could even flag them as being a problem so the user doesn’t need to check the whole document all the time).

Edit: one thing that is worth pointing out, is that if you are receiving documents that have been written on and then faxed, you’re not going to be able to do very much with them other than option 3 (try your best and then present to a user).

Sign Up

Sign In

Forgot Password

The Archive Base Latest Questions

I’m working on a project to automatically process scanned invoices. In order get a

Leave an answerCancel reply

1 Answer

Leave an answer
Cancel reply