I have a 40 KB HTML page and I want to find certain patterns in it.
I can read it through a 1 KB buffer, but I want to avoid the situation where a pattern I'm searching for is split between two buffer reads.
How can I overcome this problem?
This is easy. Determine the length of the longest pattern you will look for, then either backtrack the file pointer by that amount after each read, or slide through the file, keeping the tail of the previous buffer and reading only the delta.
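Imagine the longest pattern is 26 bytes. A match can only straddle a buffer boundary if it starts within the last 25 bytes of the buffer, so carrying over those 25 bytes (one less than the pattern length) into the next search is enough.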
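The streaming variant could be sketched roughly as follows. This is a minimal illustration, not a definitive implementation: the function name `find_in_stream`, its parameters, and the choice of byte-string patterns are assumptions made for the example. It keeps the last `longest - 1` bytes of each window and prepends them to the next read, and it skips matches that lie entirely inside the carried-over tail so that shorter patterns are not reported twice.

```python
import io


def find_in_stream(stream, patterns, buf_size=1024):
    """Yield (absolute_offset, pattern) for every match in a non-seekable stream.

    Hypothetical helper: `patterns` is a list of non-empty bytes objects.
    """
    longest = max(len(p) for p in patterns)
    if buf_size < longest:
        raise ValueError("buffer must be at least as long as the longest pattern")
    keep = longest - 1   # bytes carried over from one read to the next
    tail = b""           # end of the previous window (already searched)
    base = 0             # absolute offset of window[0] in the stream
    while True:
        chunk = stream.read(buf_size)
        if not chunk:
            break
        window = tail + chunk
        for p in patterns:
            # Start late enough that every match ends past the old tail,
            # otherwise it was already reported in the previous window.
            start = window.find(p, max(0, len(tail) - len(p) + 1))
            while start != -1:
                yield base + start, p
                start = window.find(p, start + 1)
        tail = window[-keep:] if keep else b""
        base += len(window) - len(tail)


# A 6-byte pattern that straddles the 1024-byte boundary is still found.
data = b"x" * 1020 + b"NEEDLE" + b"y" * 100 + b"ab"
hits = sorted(find_in_stream(io.BytesIO(data), [b"NEEDLE", b"ab"]))
print(hits)
```

The extra memory cost is only `longest - 1` bytes on top of the read buffer, which is why this approach suits sources that cannot seek.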
Edit: Let me clarify. There are two methods to do this, and both have their merits. The one I documented above is best used if you are reading from a stream, meaning a data source that does not support seeking. If, however, your data source does support seeking (like a file on a filesystem), you can easily do the same with seeks: check for the pattern, and if it is not found, seek back by the size of your longest pattern and continue reading from there.
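For a seekable source, the seek-based variant might look like the sketch below. The name `find_in_seekable` and its parameters are hypothetical, and the code assumes a binary file object for which a short read means end of file. It seeks back by `longest - 1` bytes, which is the minimum overlap that still catches a split match; seeking back by the full pattern length works as well and merely rescans one extra byte. Matches are collected in a set because a short pattern inside the overlap is seen twice.

```python
import io
import os


def find_in_seekable(f, patterns, buf_size=1024):
    """Return sorted (absolute_offset, pattern) pairs found in a seekable binary file.

    Hypothetical helper: `patterns` is a list of non-empty bytes objects.
    """
    longest = max(len(p) for p in patterns)
    if buf_size < longest:
        raise ValueError("buffer must be at least as long as the longest pattern")
    overlap = longest - 1
    hits = set()                      # a set, because overlaps are scanned twice
    while True:
        pos = f.tell()                # absolute offset of buf[0]
        buf = f.read(buf_size)
        for p in patterns:
            start = buf.find(p)
            while start != -1:
                hits.add((pos + start, p))
                start = buf.find(p, start + 1)
        if len(buf) < buf_size:       # short read: assumed to mean end of file
            break
        f.seek(-overlap, os.SEEK_CUR)  # step back so a split match is re-read whole
    return sorted(hits)


data = b"x" * 1020 + b"NEEDLE" + b"y" * 100 + b"ab"
print(find_in_seekable(io.BytesIO(data), [b"NEEDLE", b"ab"]))
```

Compared with the streaming version, this needs no carried-over tail, at the price of re-reading the overlap from the file on every iteration.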
If, however, you want to support searching for patterns that are longer than your buffer size, you will need a much more clever algorithm. You would need a lookup table of all patterns that are currently “open” (partially matched) when you continue to read more data, which in turn costs more memory. You see the problem.
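One way to keep track of an "open" match across reads is a Knuth-Morris-Pratt matcher, sketched below for a single pattern. This is a swapped-in textbook technique, not necessarily what the answer had in mind, and the names `build_failure` and `kmp_stream` are hypothetical. The only state carried between chunks is the number of pattern bytes matched so far, so the pattern may be longer than any individual chunk; for several patterns you would run one matcher per pattern, which is where the extra memory goes.

```python
def build_failure(pattern):
    """KMP failure table: fail[i] is the length of the longest proper prefix
    of pattern[:i + 1] that is also a suffix of it."""
    fail = [0] * len(pattern)
    k = 0
    for i in range(1, len(pattern)):
        while k and pattern[i] != pattern[k]:
            k = fail[k - 1]
        if pattern[i] == pattern[k]:
            k += 1
        fail[i] = k
    return fail


def kmp_stream(chunks, pattern):
    """Yield start offsets of `pattern` in the concatenation of `chunks`.

    `chunks` is any iterable of bytes objects of arbitrary size.
    """
    if not pattern:
        raise ValueError("pattern must not be empty")
    fail = build_failure(pattern)
    k = 0        # pattern bytes matched so far: the "open" state
    offset = 0   # absolute position of the byte being examined
    for chunk in chunks:
        for byte in chunk:
            while k and byte != pattern[k]:
                k = fail[k - 1]
            if byte == pattern[k]:
                k += 1
            if k == len(pattern):
                yield offset - len(pattern) + 1
                k = fail[k - 1]   # allow overlapping matches
            offset += 1


# 4-byte chunks, 6-byte pattern: the match spans two reads.
data = b"xxabcabcabdyy"
chunks = [data[i:i + 4] for i in range(0, len(data), 4)]
print(list(kmp_stream(chunks, b"abcabd")))
```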