I wrote a program that crawls websites, processes the HTML pages and stores the results in a MySQL database. By ‘results’ I mean the HTML contents, all the links with their attributes, and the errors recorded when the crawler couldn’t fetch a page. I use this program for analytical purposes.
Everything works fine, but the main problem is that the data takes up far too much disk space. For every 100,000 websites crawled (20 pages per site maximum) I end up with about 5 MySQL tables totaling ~60 gigabytes, and I need to process 20-30 times more websites.
Of course I cannot process that much data on my home PC at once, so I am forced to process it in small chunks, which is time-consuming and inefficient.
So I am looking for advice or a solution that would:
1) give the same flexibility accessing data that relational DB does
2) allow smart and efficient saving of data
I doubt a different storage engine would be much more efficient than what you have now: if you store everything in one table, without any indexes, and use natural primary keys, almost no storage overhead is incurred, and even if you do add a bit of structure, the overhead should still remain sane.
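As a rough sanity check, here is a back-of-the-envelope sketch using only the figures from the question. It assumes every site really yields the maximum of 20 pages and that a gigabyte is 10^9 bytes, so the per-page number is a lower bound rather than a measurement.

```python
# Figures taken from the question; everything else is an assumption.
GB = 10**9
TB = 10**12

sites = 100_000
max_pages_per_site = 20          # upper limit, so real page counts may be lower
total_bytes = 60 * GB

# Average stored bytes per page if every site hit the 20-page limit.
bytes_per_page = total_bytes / (sites * max_pages_per_site)

# Projected storage for 20x and 30x as many websites, in terabytes.
projected_tb = [total_bytes * factor / TB for factor in (20, 30)]

print(f"~{bytes_per_page / 1000:.0f} KB per page at minimum")
print(f"{projected_tb[0]:.1f} to {projected_tb[1]:.1f} TB for 20-30x the sites")
```

If sites average fewer than 20 pages, the per-page figure is higher, which points at the stored content itself rather than at engine overhead.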
My guess would be that your problem is the sheer amount of data you collect, so you probably want to discard considerable portions of it before storing. For example, you could boil the page source down to a set of (normalized) keywords, skip heavy content (images etc.), and drop anything that doesn’t interest you (e.g. CSS stylesheets, JavaScript).
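A minimal sketch of that kind of reduction, using only Python's standard library. The names `TextExtractor` and `page_keywords` are made up for illustration, and "normalized" is assumed here to mean lowercased runs of ASCII letters and digits; a real crawler would likely want stemming, stop-word removal, or non-ASCII handling on top.

```python
import re
from collections import Counter
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects visible text and ignores the contents of <script> and <style>."""

    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep text only when we are not inside a skipped element.
        if not self._skip_depth:
            self.chunks.append(data)


def page_keywords(html):
    """Return a Counter of lowercased keywords found in the page text."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    text = " ".join(parser.chunks).lower()
    return Counter(re.findall(r"[a-z0-9]+", text))


sample = (
    "<html><head><style>p{color:red}</style>"
    "<script>var x = 1;</script></head>"
    "<body><p>Hello, hello World!</p></body></html>"
)
print(page_keywords(sample))
```

Storing the keyword counts instead of the raw source trades away the ability to re-parse pages later, so it only makes sense if the analysis never needs the original markup.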