Is there any open-source library that can be used to search the Deep Web?
There is the Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH), which exchanges XML over HTTP. You can find a list of registered repositories at: http://www.openarchives.org/Register/BrowseSites
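To give an idea of what harvesting looks like, here is a minimal Python sketch using only the standard library. It assumes a repository that supports the ListRecords verb with the oai_dc metadata prefix. The function names and the example.org endpoint are made up for illustration, so substitute a real base URL from the registry above.

```python
from urllib.parse import urlencode
from urllib.request import urlopen
import xml.etree.ElementTree as ET

# XML namespaces used by OAI-PMH 2.0 responses and Dublin Core metadata.
NS = {
    "oai": "http://www.openarchives.org/OAI/2.0/",
    "dc": "http://purl.org/dc/elements/1.1/",
}

def build_list_records_url(base_url, metadata_prefix="oai_dc", resumption_token=None):
    """Build a ListRecords request URL (hypothetical helper)."""
    if resumption_token:
        # A resumptionToken is sent on its own, without metadataPrefix.
        params = {"verb": "ListRecords", "resumptionToken": resumption_token}
    else:
        params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    return base_url + "?" + urlencode(params)

def parse_list_records(xml_data):
    """Return ([(identifier, title), ...], next_token_or_None)."""
    root = ET.fromstring(xml_data)
    records = []
    for rec in root.iterfind(".//oai:record", NS):
        identifier = rec.findtext("oai:header/oai:identifier", default="", namespaces=NS)
        title = rec.findtext(".//dc:title", default="", namespaces=NS)
        records.append((identifier, title))
    token = root.findtext(".//oai:resumptionToken", default="", namespaces=NS)
    return records, (token.strip() or None)

def harvest(base_url):
    """Yield (identifier, title) pairs, following resumption tokens."""
    token = None
    while True:
        with urlopen(build_list_records_url(base_url, resumption_token=token)) as resp:
            records, token = parse_list_records(resp.read())
        yield from records
        if token is None:
            break

# A small hand-written response, so the parser can be tried offline.
SAMPLE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record>
      <header><identifier>oai:example.org:1</identifier></header>
      <metadata>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:title>Sample record</dc:title>
        </oai_dc:dc>
      </metadata>
    </record>
    <resumptionToken>page2</resumptionToken>
  </ListRecords>
</OAI-PMH>"""

if __name__ == "__main__":
    print(parse_list_records(SAMPLE))
```

Against a live repository you would loop over harvest("http://example.org/oai") instead of parsing SAMPLE. Error responses and deleted records are not handled here.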
Also, the deep Web (also called the Deepnet, the invisible Web, or the hidden Web, and sometimes conflated with the dark Web) refers to World Wide Web content that is not part of the surface Web, the surface Web being the part that standard search engines index.
Commercial search engines have begun exploring alternative methods to crawl the deep Web. The Sitemap Protocol (first developed by Google) and mod_oai are mechanisms that allow search engines and other interested parties to discover deep Web resources on particular Web servers. Both mechanisms allow Web servers to advertise the URLs that are accessible on them, thereby allowing automatic discovery of resources that are not directly linked to the surface Web.
Google's deep Web surfacing system pre-computes submissions for each HTML form and adds the resulting HTML pages to the Google search engine index. The surfaced results account for a thousand queries per second to deep Web content. In this system, the pre-computation of submissions is done using three algorithms:
(1) selecting input values for text search inputs that accept keywords,
(2) identifying inputs which accept only values of a specific type (e.g., date), and
(3) selecting a small number of input combinations that generate URLs suitable for inclusion into the Web search index.
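As a rough illustration of step (3) only, the Python sketch below turns a set of candidate input values for a GET form into a capped list of URLs. This is not Google's algorithm, which chooses combinations by how informative the resulting pages are. Here the cap is a plain truncation, and the function name, form fields, and example.org URL are all assumptions made for the example.

```python
from itertools import islice, product
from urllib.parse import urlencode

def candidate_urls(action_url, input_values, max_urls=10):
    """Build at most max_urls GET URLs from combinations of form input values.

    input_values maps each form input name to a list of candidate values.
    The cap is naive truncation, a stand-in for a real selection strategy.
    """
    names = sorted(input_values)  # fixed order so the output is reproducible
    combos = product(*(input_values[name] for name in names))
    return [
        action_url + "?" + urlencode(dict(zip(names, combo)))
        for combo in islice(combos, max_urls)
    ]

if __name__ == "__main__":
    # Hypothetical search form with a keyword input and a typed (year) input.
    form = {"q": ["python", "java"], "year": ["2009", "2010"]}
    for url in candidate_urls("http://example.org/search", form, max_urls=3):
        print(url)
```

A crawler could fetch each generated URL and pass the pages that come back with useful content on for indexing.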