I have a file, index.html, containing data like this:
<li><a href="/battered-fried-chicken-breast-no-skin.html">battered fried chicken breast, no skin</a></li>
<li><a href="/bbq-short-ribs-with-sauce.html">bbq short ribs with sauce</a></li>
<li><a href="/bbq-spareribs-&-sauce-eat-lean-&-fat.html">bbq spareribs & sauce (eat lean & fat)</a></li>
<li><a href="/bbq-spareribs-&-sauce-eat-lean-only.html">bbq spareribs & sauce (eat lean only)</a></li>
I need to strip the & symbols from the URLs, such that "/bbq-spareribs-&-sauce-eat-lean-&-fat.html" becomes "/bbq-spareribs--sauce-eat-lean--fat.html". However, I do not wish to remove the & symbol from the parts of the file which are not URLs, such as the text of the link, bbq spareribs & sauce (eat lean & fat).
How would I accomplish this on a standard Linux install? It doesn’t matter to me what specific tool/language is used to achieve the result so long as it works.
If you’re happy to install BeautifulSoup, this simple Python script may do what you want:
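A minimal sketch of such a script, assuming the `bs4` package is installed and using Python's built-in `html.parser` backend. The function name `strip_amp_from_hrefs` and the command-line handling are illustrative choices, not anything prescribed by BeautifulSoup:

```python
import sys

from bs4 import BeautifulSoup


def strip_amp_from_hrefs(html):
    """Return the HTML with '&' removed from every <a href="..."> value."""
    soup = BeautifulSoup(html, "html.parser")
    # href=True restricts the search to <a> tags that actually have an href.
    for link in soup.find_all("a", href=True):
        link["href"] = link["href"].replace("&", "")
    # Serialising the tree regenerates the markup, so formatting may differ
    # from the input (for example, a bare '&' in text is written as '&amp;').
    return str(soup)


if __name__ == "__main__" and len(sys.argv) > 1:
    # Hypothetical invocation: python strip_amp.py index.html > cleaned.html
    with open(sys.argv[1], encoding="utf-8") as f:
        sys.stdout.write(strip_amp_from_hrefs(f.read()))
```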
Example usage:
Caveat: Since we’re regenerating the output HTML from a parsed representation of it, the formatting may change. Other possible changes include the explicit closing of tags if your markup is not well-formed.
I may be wrong, but I suspect most solutions that use a proper XML/HTML parser will result in similar issues. To keep the file exactly as it is and only remove the offending characters, you will end up having to use a regex-based search and replace. Many will advise against parsing XML/HTML with regex except for really trivial patterns. In your case, that may be true, but I’m yet to be convinced.
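For comparison, here is a rough sketch of the regex-based route using only the standard library. It assumes every URL sits in a double-quoted `href="..."` attribute, as in the sample data; single-quoted or unquoted attributes would not match. The function name `strip_amp_in_hrefs` is made up for illustration:

```python
import re
import sys

# Capture the opening href=", the attribute value, and the closing quote.
HREF_RE = re.compile(r'(href=")([^"]*)(")')


def strip_amp_in_hrefs(text):
    """Remove '&' inside double-quoted href values; leave all other text alone."""
    def fix(match):
        return match.group(1) + match.group(2).replace("&", "") + match.group(3)
    return HREF_RE.sub(fix, text)


if __name__ == "__main__" and len(sys.argv) > 1:
    # Hypothetical invocation: python strip_amp_re.py index.html > cleaned.html
    with open(sys.argv[1], encoding="utf-8") as f:
        sys.stdout.write(strip_amp_in_hrefs(f.read()))
```

Because only the matched attribute values are rewritten, everything else in the file, including the `&` in the link text, is left exactly as it was.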