I want to store crawled sites (their HTML code) in a database. There will be millions of sites, and I will be searching them for specific strings.
I am currently using PostgreSQL, but I doubt whether a relational database is the right choice. Maybe a NoSQL solution would fit better?
What solution do you recommend?
After you fetch a web page, strip out the parts that have no value (ads, unrelated text, …). This reduces the size of the page you store in the database and makes your search results more relevant.
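As a rough illustration of that cleanup step, here is a minimal sketch using only Python's standard `html.parser`. The `SKIP_TAGS` list and the names `TextExtractor` and `extract_text` are my own assumptions for the example; real ad removal usually needs site-specific rules on top of this.

```python
from html.parser import HTMLParser

# Tags whose content is treated as noise. This list is an assumption,
# adjust it to the sites you crawl.
SKIP_TAGS = {"script", "style", "noscript", "nav", "aside", "footer", "header"}


class TextExtractor(HTMLParser):
    """Collects visible text, ignoring anything nested inside SKIP_TAGS."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0  # > 0 while we are inside a skipped element
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            text = data.strip()
            if text:
                self._chunks.append(text)

    def text(self):
        return " ".join(self._chunks)


def extract_text(html):
    """Return the text of `html` with markup and skipped elements removed."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()
```

You would then store the output of `extract_text(page_html)` instead of (or next to) the raw page.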
I suggest writing a program that extracts the valuable information and stores only that in the database (if you don't need the original page). After that, you can build a Lucene index on top of it to search your information.
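To show what such a search index does conceptually, here is a toy inverted index in plain Python. It is only a stand-in for the idea and not Lucene's API; the class `InvertedIndex` and its methods are hypothetical names for this sketch, and a real engine adds ranking, stemming, and on-disk storage.

```python
import re
from collections import defaultdict


class InvertedIndex:
    """Maps each term to the set of document ids that contain it."""

    def __init__(self):
        self._postings = defaultdict(set)

    @staticmethod
    def tokenize(text):
        # Lowercase and split on non-word characters.
        return re.findall(r"\w+", text.lower())

    def add(self, doc_id, text):
        for term in self.tokenize(text):
            self._postings[term].add(doc_id)

    def search(self, query):
        """Return ids of documents containing every term of the query."""
        terms = self.tokenize(query)
        if not terms:
            return set()
        result = None
        for term in terms:
            docs = self._postings.get(term, set())
            result = set(docs) if result is None else result & docs
        return result
```

Usage would look like `index.add(page_id, extracted_text)` for each crawled page, then `index.search("special string")` to get the matching page ids.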
If you want more accurate results, you can analyze each page and store some features (content direction, category, links to external resources, ratio of valuable information to total text, …) to compute a rank for the page. This is a text mining technique.
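Two of those features are easy to compute while parsing. The sketch below is only an assumption about how you might do it: `page_features`, `PageStats`, and the `page_host` parameter are made-up names, and the text ratio here counts all character data (including script content), so you may want to run it on already cleaned pages.

```python
from html.parser import HTMLParser
from urllib.parse import urlparse


class PageStats(HTMLParser):
    """Counts text characters and links that point to other hosts."""

    def __init__(self, page_host):
        super().__init__()
        self.page_host = page_host
        self.text_chars = 0
        self.external_links = 0

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href") or ""
            host = urlparse(href).netloc
            # Relative links have no host and are treated as internal.
            if host and host != self.page_host:
                self.external_links += 1

    def handle_data(self, data):
        self.text_chars += len(data.strip())


def page_features(html, page_host):
    """Return simple ranking features for one page."""
    stats = PageStats(page_host)
    stats.feed(html)
    stats.close()
    ratio = stats.text_chars / len(html) if html else 0.0
    return {"text_ratio": ratio, "external_links": stats.external_links}
```

How you combine these numbers into a single rank is up to you, for example a weighted sum stored in a column next to the page.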