I wrote a simple web crawler. I fetched all the websites and put them on my HDD.
Now I want to analyse them so that I can write a simple interface like http://www.google.de and search for information in my fetched pages.
The problem is how to find the important information in a “fast” way, so the computation matters. It could happen in real time or after the fetch. My idea is to build a dictionary from a list of English words and count the entries... or what else should I do? I need reading material on how to extract information and compress it, but I don't know where to look.
The crawler is written in C++, with MySQL storing the links.
I hope my question is clear. 😀
BTW, sorry for my bad English, but there isn't a board like this in German 😛
The science of Information Retrieval (IR) is a complicated one.
Have you looked at any of the standard texts? For example:
Introduction to Information Retrieval by Christopher D. Manning, Prabhakar Raghavan and Hinrich Schütze (Jul 7, 2008) – http://www.amazon.com/Introduction-Information-Retrieval-Christopher-Manning/dp/0521865719/ref=sr_1_1?s=books&ie=UTF8&qid=1305573574&sr=1-1
Information Retrieval: Implementing and Evaluating Search Engines by Stefan Büttcher, Charles L. A. Clarke and Gordon V. Cormack (Jul 23, 2010) – http://www.amazon.com/Information-Retrieval-Implementing-Evaluating-Engines/dp/0262026511/ref=sr_1_3?s=books&ie=UTF8&qid=1305573574&sr=1-3
Search for ‘information retrieval’ on Amazon for more.
You might also take a look at my answer to "Design Question for Notification System", which outlines a general architecture for spidering websites for search.
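To make the word-counting idea from the question a bit more concrete: what you describe is close to what the IR texts call an inverted index, a map from each term to the documents that contain it, together with a count. Below is a minimal C++ sketch of that idea, not a finished design. The names `tokenize`, `InvertedIndex`, `add_document` and `search` are made up for illustration, the integer document ids are assumed to come from wherever you already track your pages (for example a row id in your MySQL links table), and ranking by raw term counts is a deliberate simplification; the books above cover better weighting schemes such as tf-idf.

```cpp
#include <algorithm>
#include <cctype>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

// Split text into lowercase alphanumeric tokens (ASCII only; a simplification).
std::vector<std::string> tokenize(const std::string& text) {
    std::vector<std::string> tokens;
    std::string current;
    for (unsigned char ch : text) {
        if (std::isalnum(ch)) {
            current.push_back(static_cast<char>(std::tolower(ch)));
        } else if (!current.empty()) {
            tokens.push_back(current);
            current.clear();
        }
    }
    if (!current.empty()) tokens.push_back(current);
    return tokens;
}

// Hypothetical in-memory inverted index: term -> (document id -> term count).
class InvertedIndex {
public:
    // Count every term of one fetched page under its document id.
    // doc_id is assumed to be an id you already have for the page.
    void add_document(int doc_id, const std::string& text) {
        for (const std::string& term : tokenize(text)) {
            ++postings_[term][doc_id];
        }
    }

    // Return (doc_id, score) pairs, best match first. The score is the
    // summed count of all query terms in that document (no tf-idf here).
    std::vector<std::pair<int, int>> search(const std::string& query) const {
        std::unordered_map<int, int> scores;
        for (const std::string& term : tokenize(query)) {
            auto it = postings_.find(term);
            if (it == postings_.end()) continue;  // term never seen
            for (const auto& entry : it->second) {
                scores[entry.first] += entry.second;
            }
        }
        std::vector<std::pair<int, int>> ranked(scores.begin(), scores.end());
        std::sort(ranked.begin(), ranked.end(),
                  [](const std::pair<int, int>& a, const std::pair<int, int>& b) {
                      if (a.second != b.second) return a.second > b.second;
                      return a.first < b.first;  // tie-break on document id
                  });
        return ranked;
    }

private:
    std::unordered_map<std::string, std::unordered_map<int, int>> postings_;
};
```

You would call `add_document` once per page after the fetch, so the expensive work happens ahead of time, and `search` at query time only touches the postings for the words in the query. For a real crawl the index will not fit comfortably in memory, and that is where the compression and on-disk layout chapters of the books come in.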