Given an HTML document, what is the most correct and concise regular expression pattern to remove the query strings from each URL in the document?
You can’t usefully parse HTML with a regexp. If you know the format of the page in advance, then you can just about get away with it, but for general [X]HTML a regexp parser is unsuitable.
Depending on what language you’re using, you’d need to find either an HTML parser library (e.g. Python’s BeautifulSoup) or an HTML tidier combined with a standard XML parser. Then scan the document for <a> elements (and maybe others, e.g. <img>, if you’re interested in those), and split the URL attribute value (href, or src for images) on ‘?’, keeping the part before it.
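
As a rough illustration of the parser-based approach, here is a minimal Python sketch. It uses the standard library's html.parser rather than BeautifulSoup so that it runs without extra packages. The names URL_ATTRS, strip_query, UrlCollector and stripped_urls are made up for this example, and the choice of which elements and attributes to scan is an assumption. Note that it uses urlsplit instead of a plain split on ‘?’, so a trailing #fragment is kept; a plain split would drop it.

```python
from html.parser import HTMLParser
from urllib.parse import urlsplit, urlunsplit

# Which attribute holds the URL for each element we care about.
# Assumed subset for illustration; extend it (link, script, form, ...) as needed.
URL_ATTRS = {"a": "href", "img": "src"}


def strip_query(url):
    """Return url without its query string, keeping any #fragment."""
    parts = urlsplit(url)
    return urlunsplit(parts._replace(query=""))


class UrlCollector(HTMLParser):
    """Collects query-stripped URLs from the elements listed in URL_ATTRS."""

    def __init__(self):
        super().__init__()
        self.urls = []

    def handle_starttag(self, tag, attrs):
        # Tag and attribute names arrive lowercased; values arrive unescaped.
        wanted = URL_ATTRS.get(tag)
        for name, value in attrs:
            if name == wanted and value:
                self.urls.append(strip_query(value))


def stripped_urls(html):
    """Parse html and return the cleaned URLs in document order."""
    parser = UrlCollector()
    parser.feed(html)
    parser.close()
    return parser.urls
```

This only collects the cleaned URLs. Writing them back into the document is easier with a tree-building parser such as BeautifulSoup, where you can assign the new value to the attribute and serialise the tree again.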