Can we programmatically determine the components of a website by crawling its content?
I understand that this may seem nearly impossible, but I think anything is possible in code. I am trying to brainstorm ideas for determining the individual components of a website once I have crawled all of its data.
For example, in the case of an e-commerce website, I would want to identify components such as:
1. Login URL
2. Registration URL
3. Dashboard URL
4. Add order URL
5. Shopping cart URL
6. Logout URL
and more
The information we might have can be:
1. Session, cookies, metadata
2. Backlinks (internal and external)
3. Forms and fields on a page, etc.
Any ideas or pointers would be very helpful.
You can get the raw HTML by crawling the domain. As for your question about URLs: yes, you can identify the login, registration, and similar URLs from the URL itself and from the HTML elements on each page, using a rule-based system that you tune through experiments.
I worked on crawling gift pictures, prices, etc. from online shops, and it was doable. We assigned relevance points: for a price, for example, a text containing “price” got 2 points, a text containing “$” or “€” got 3 points, and so on. My point is that you need to experiment on the data.
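To make the point idea concrete, here is a minimal sketch in Python. The price rules use the point values mentioned above; the function names, the threshold of 3, and the login rules are my own assumptions for illustration, and you would tune them on your crawled data.

```python
# Each rule is (substrings, points): the points are awarded once if ANY of
# the substrings occurs in the text.
PRICE_RULES = [(("price",), 2), (("$", "€"), 3)]

# Hypothetical rules for spotting a login URL; the keywords and weights are
# assumptions for illustration, not tested values.
LOGIN_RULES = [(("login", "signin", "sign-in"), 3), (("account",), 1)]


def score(text, rules):
    """Sum the points of every rule that matches the text (case-insensitive)."""
    lowered = text.lower()
    return sum(
        points
        for needles, points in rules
        if any(needle in lowered for needle in needles)
    )


def best_candidate(texts, rules, threshold=3):
    """Return the highest-scoring text, or None if nothing reaches the threshold."""
    if not texts:
        return None
    top = max(texts, key=lambda text: score(text, rules))
    return top if score(top, rules) >= threshold else None
```

For example, `best_candidate(["About us", "Price: $19.99", "Contact"], PRICE_RULES)` picks the second text, because it matches both price rules. The same function can be run over crawled URLs with `LOGIN_RULES` to rank login page candidates.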
As far as I know, you can also extract forms, JavaScript lines, etc., and experiment on those too.
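As a rough sketch of the form idea, the following Python code collects every form on a crawled page with the standard library's `html.parser` and guesses its role from its password fields. The guessing rule (one password field suggests login, two or more suggest registration with a confirmation field) is only an assumption to start experimenting with, and the names `FormCollector`, `extract_forms`, and `guess_form_role` are hypothetical.

```python
from html.parser import HTMLParser


class FormCollector(HTMLParser):
    """Collects the action, method, and input types of each form in a page."""

    def __init__(self):
        super().__init__()
        self.forms = []
        self._current = None  # the form currently being parsed, if any

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "form":
            self._current = {
                "action": attrs.get("action") or "",
                "method": (attrs.get("method") or "get").lower(),
                "input_types": [],
            }
            self.forms.append(self._current)
        elif tag == "input" and self._current is not None:
            # An input without a type attribute defaults to "text" in HTML.
            self._current["input_types"].append((attrs.get("type") or "text").lower())

    def handle_endtag(self, tag):
        if tag == "form":
            self._current = None


def extract_forms(html):
    """Parse an HTML string and return a list of form descriptions."""
    collector = FormCollector()
    collector.feed(html)
    collector.close()
    return collector.forms


def guess_form_role(form):
    """Assumed heuristic: count the password fields to guess the form's purpose."""
    passwords = form["input_types"].count("password")
    if passwords == 1:
        return "login"
    if passwords >= 2:
        return "registration"
    return "unknown"
```

The `action` of a form guessed as `"login"` is then a candidate for the login URL. Many real sites break this rule (a registration form with a single password field, for instance), so treat the result as one signal to combine with others.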
I recommend Crawler4j if you will be working in Java. Apache Nutch is good too, and you can find information about “saving raw HTML with Nutch” in the questions on my profile, but it is a very big project and I don’t think it is worth dealing with all of that for your situation.