I’m trying to scrape an HTML page for its title using a regular expression. Here’s what I’m trying:
\<title\>\A\Z\</title\>
Any suggestions?
A pattern such as <title>(.*?)</title> should work. The brackets around .*? let you reference the capture group. Your regular expression library will probably have a way to return what is matched in capture groups. The group indexed 0 is the whole match, so you should probably pick group index 1, which belongs to the first opening bracket the engine comes across (there’s only one set of brackets here).
In some libraries, you need to pad the pattern with .* on both sides, as in .*<title>(.*?)</title>.*, because some require a complete match of the string.
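As a minimal sketch of the capture-group idea in Python: the sample HTML string is made up for illustration, and the pattern <title>(.*?)</title> is an assumption about the regex this answer has in mind. It also shows the difference between a search and a full-string match.

```python
import re

# Hypothetical sample page, used only for illustration.
html = "<html><head><title>My Page</title></head><body>Hello</body></html>"

# Assumed pattern: a non-greedy capture between the title tags.
pattern = r"<title>(.*?)</title>"
flags = re.IGNORECASE | re.DOTALL

# re.search scans the string, so the pattern may match anywhere in it.
found = re.search(pattern, html, flags)
whole_match = found.group(0)  # the entire match, tags included
title = found.group(1)        # only the text inside the brackets

# re.fullmatch must consume the whole string, so the bare pattern fails...
bare = re.fullmatch(pattern, html, flags)

# ...while padding it with .* on both sides succeeds.
padded = re.fullmatch(r".*" + pattern + r".*", html, flags)
padded_title = padded.group(1)

print(title, bare, padded_title)
```

Whether you need the padded form depends on which matching function your library exposes; in Python, re.search avoids the need for it.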
Be aware that this is not foolproof. A webpage can break your regular expression, for example by containing a fake title element in addition to the real one.
You can try to avoid this by making your regex more complicated before matching the title. However, that doesn’t really work, because the fake title could be in an HTML comment, <!-- <title></title> -->, or in a /* javascript */ comment. Thus, it is better to use an actual HTML parser. You can search Google to find many of these.
If you are using Ruby, you can use the Nokogiri gem – http://nokogiri.org/
For Java – http://htmlparser.sourceforge.net/
For Python – http://docs.python.org/library/htmlparser.html
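To illustrate why a parser is the safer choice, here is a rough sketch using Python's standard-library parser (the module is named html.parser in Python 3). The TitleParser class and the sample page are hypothetical names made up for this example. The page hides a fake title inside an HTML comment: the regex picks up the fake one, while the parser skips the comment and returns the real one.

```python
import re
from html.parser import HTMLParser


class TitleParser(HTMLParser):
    """Hypothetical helper: collects the text of the first <title> element."""

    def __init__(self):
        super().__init__()
        self.in_title = False
        self.done = False
        self.parts = []

    def handle_starttag(self, tag, attrs):
        # Only the first title element is recorded.
        if tag == "title" and not self.done:
            self.in_title = True

    def handle_endtag(self, tag):
        if tag == "title" and self.in_title:
            self.in_title = False
            self.done = True

    def handle_data(self, data):
        if self.in_title:
            self.parts.append(data)


# Made-up page with a fake title inside an HTML comment.
html = (
    "<html><head><!-- <title>Fake</title> -->"
    "<title>Real</title></head><body></body></html>"
)

# The regex has no notion of comments, so it matches the first title it sees.
regex_title = re.search(r"<title>(.*?)</title>", html).group(1)

# The parser reports the comment separately and never treats it as a tag.
parser = TitleParser()
parser.feed(html)
parser.close()
parsed_title = "".join(parser.parts)

print(regex_title, parsed_title)
```

This only covers the comment case; a third-party parser may cope better with badly malformed pages.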