I used the arachnode.net crawler to crawl a website, and the resulting crawl data has produced a database of more than 100 GB!
I looked around the arachnode.net database and found the table “webpages” to be the culprit. When I crawl a website I do not download images, media or anything like that; I only download the HTML. However, in this case I can see that the HTML pages contain a huge amount of hidden viewdata and JavaScript.
So I need to do the crawl again, and this time strip out the hidden viewdata and JavaScript code before saving to the webpages table.
Does anyone have an idea how to achieve this?
Thanks.
Yes, you can write a plugin which modifies the CrawlRequest.Data and CrawlRequest.DecodedHtml before the data is inserted into the database.
Create a PostRequest CrawlAction as shown here: http://arachnode.net/Content/CreatingPlugins.aspx
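In case it helps, the stripping step itself could look roughly like the sketch below. This is Python for illustration only, not arachnode.net plugin code: a real plugin would be written in C# and would apply equivalent logic to the CrawlRequest fields mentioned above. The function name strip_bloat and both regexes are hypothetical, and I am assuming the hidden data sits in <input type="hidden"> fields (for example ASP.NET's __VIEWSTATE) and that the JavaScript is in ordinary <script> blocks.

```python
import re

# Matches <script ...>...</script> blocks, including inline JavaScript.
# DOTALL lets ".*?" span multiple lines; the match is non-greedy so each
# script block is removed separately.
SCRIPT_RE = re.compile(r"<script\b[^>]*>.*?</script\s*>", re.IGNORECASE | re.DOTALL)

# Matches <input> tags whose type attribute is "hidden".
# Assumption: the bulky hidden data is stored in such fields.
HIDDEN_INPUT_RE = re.compile(
    r"<input\b[^>]*\btype\s*=\s*[\"']?hidden\b[\"']?[^>]*>", re.IGNORECASE
)


def strip_bloat(html: str) -> str:
    """Return html with <script> blocks and hidden <input> fields removed."""
    html = SCRIPT_RE.sub("", html)
    html = HIDDEN_INPUT_RE.sub("", html)
    return html


if __name__ == "__main__":
    page = (
        '<html><body>'
        '<input type="hidden" name="__VIEWSTATE" value="AAAA" />'
        '<script type="text/javascript">var x = 1;</script>'
        '<p>Hello</p>'
        '</body></html>'
    )
    print(strip_bloat(page))
```

Regular expressions are a blunt tool for HTML and can misfire on unusual markup (for example a ">" inside an attribute value), so if the pages are messy a proper HTML parser would be the safer choice for the same two removals.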