Does anyone know of a script, recipe, or library to find the most relevant contact information on a website?
Some possible cases:
- Find contact phone number on a personal web page
- Find owner email address on a blog
- Find the URL of the contact page
I’m not aware of any libraries that do this.
I would use regular expressions to match phone numbers and email addresses, combine that with a web spider that walks the site, and then add a method for ranking the contact information it finds.
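To make the regular-expression part concrete, here is a minimal Python sketch. The two patterns and the `extract_contacts` helper are my own assumptions for illustration: they are deliberately loose, will miss some valid formats, and will occasionally match things that are not phone numbers (long numeric IDs, for example).

```python
import re

# Loose, illustrative patterns (assumptions, not full validators).
# Email: local part, "@", domain, then a dot and a TLD of 2+ letters.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
# Phone: optional "+", then at least 8 digits/separators, starting and ending on a digit.
PHONE_RE = re.compile(r"\+?\d[\d\s().-]{6,}\d")


def extract_contacts(text):
    """Return the unique email addresses and phone-like strings found in text."""
    emails = sorted(set(EMAIL_RE.findall(text)))
    phones = sorted({match.strip() for match in PHONE_RE.findall(text)})
    return {"emails": emails, "phones": phones}


if __name__ == "__main__":
    sample = "Call us at +1 (555) 123-4567 or mail support@example.com."
    print(extract_contacts(sample))
```

In practice you would probably run this on the visible text of a page rather than the raw HTML, and tighten the phone pattern for the countries you care about.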
Typically, contact information is also paired with one of a few common labels such as “Support”, “Support email”, “Sales”, etc. My guess is that a dozen or so variants of these would cover something like 95% of all sites in English.
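One rough way to use those labels is to count how many of them appear in the text just before or after a match. The sketch below assumes a hypothetical `LABELS` list and a 60-character window. Both are guesses that you would tune against real pages.

```python
# Hypothetical label list: "support" and "sales" come from the examples above,
# the rest are assumptions added for illustration.
LABELS = ("support", "sales", "contact", "email", "phone")


def label_score(text, start, end, window=60):
    """Count labels appearing within `window` characters of text[start:end].

    The matched span itself is excluded, so an address such as
    support@example.com does not score just because of its own spelling.
    """
    before = text[max(0, start - window):start].lower()
    after = text[end:end + window].lower()
    context = before + " " + after
    return sum(1 for label in LABELS if label in context)


if __name__ == "__main__":
    page = "Support email: help@example.com"
    start = page.index("help@")
    print(label_score(page, start, len(page)))
```

A higher score means the match sits closer to something that looks like a contact label, so sorting candidates by this score gives a crude first ranking.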
So, basically, I would start by building a simple recursive web spider that walks all the publicly accessible pages in a given domain. It would parse the HTML of each page for email addresses and phone numbers and collect them in a list, then rank them based on whether they appear near any of the common labels.
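Here is a minimal sketch of the spider part, using only the Python standard library. It uses an explicit queue instead of recursion (same idea, but no recursion-depth problems on large sites). The names `crawl`, `fetch`, `LinkParser`, and the `max_pages` limit are assumptions of mine, and a real crawler should also honor robots.txt and rate-limit its requests.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin, urlparse
from urllib.request import urlopen


class LinkParser(HTMLParser):
    """Collect the href value of every <a> tag on a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def fetch(url):
    """Download a page and decode it as text (no robots.txt check, no retries)."""
    with urlopen(url, timeout=10) as response:
        return response.read().decode("utf-8", errors="replace")


def crawl(start_url, fetch_page=fetch, max_pages=50):
    """Breadth-first walk of same-domain pages; returns {url: html}."""
    domain = urlparse(start_url).netloc
    queue = deque([start_url])
    seen = {start_url}
    pages = {}
    while queue and len(pages) < max_pages:
        url = queue.popleft()
        try:
            html = fetch_page(url)
        except Exception:
            continue  # skip pages that fail to load
        pages[url] = html
        parser = LinkParser()
        parser.feed(html)
        for href in parser.links:
            # Resolve relative links and drop "#fragment" parts.
            link = urldefrag(urljoin(url, href)).url
            if urlparse(link).netloc == domain and link not in seen:
                seen.add(link)
                queue.append(link)
    return pages
```

Each page in the returned dictionary can then be scanned for email addresses and phone numbers, and the hits ranked by how close they sit to a contact label. Passing `fetch_page` as a parameter makes it easy to test the crawler against canned HTML without touching the network.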
It won’t be perfect, but that’s part of the value of the algorithm: you can make it smarter and tweak it over time until it gets better.