I am currently designing a focused web crawler. I had tested it on several websites until I encountered the anchor below (an <a href="…"> tag):
href="javascript: openDocument('DATA//PCP200803.pdf');"
My HTML parsing routine returns:
javascript: openDocument('DATA//PCP200803.pdf');
Does anyone have any idea on how to download the referenced document?
Thanks a lot.
In the case of the openDocument() command, you could just add "DATA/PCP200803.pdf" to your collection of other resources to fetch/crawl, same as any other hyperlink in the page. Other JavaScript methods, though (e.g., XMLHttpRequest's open()), may not be as straightforward.
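
As a rough illustration of that idea, here is a minimal Python sketch that pulls the path out of such an href and turns it into an absolute URL the crawler can queue. It assumes that openDocument() simply opens its argument as a path relative to the current page, which the page's own script would have to confirm. The function name resolve_open_document and the example.com page URL are hypothetical, made up for this example.

```python
import re
from urllib.parse import urljoin

# Matches "javascript: openDocument('<path>')" with straight or curly quotes.
# Assumption: the single argument is a plain path to the document.
OPEN_DOCUMENT_RE = re.compile(
    r"""^\s*javascript:\s*openDocument\(\s*['"‘“](?P<path>[^'"’”]+)['"’”]\s*\)""",
    re.IGNORECASE,
)


def resolve_open_document(href, page_url):
    """Return an absolute URL for an openDocument() href, or None if it does not match."""
    match = OPEN_DOCUMENT_RE.match(href)
    if match is None:
        return None
    path = match.group("path")
    # Collapse repeated slashes in relative paths only, so "DATA//x.pdf"
    # becomes "DATA/x.pdf" while a full "http://..." argument is left alone.
    if "://" not in path:
        path = re.sub(r"/{2,}", "/", path)
    # Resolve against the page URL, exactly as for an ordinary hyperlink.
    return urljoin(page_url, path)


if __name__ == "__main__":
    href = "javascript: openDocument('DATA//PCP200803.pdf');"
    # Hypothetical page URL, used only to show the resolution step.
    print(resolve_open_document(href, "http://example.com/reports/index.html"))
```

Any href that does not match the pattern returns None, so the crawler can fall back to its normal link handling or skip the link.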