I would like to remove all HTML tags but leave links intact,
e.g. <a href="http://www.domain.com/">Link Title</a>
So far this works for me except that it removes the </a> part.
sed -e 's/<[^">]*>//g'
I would like to know if there is a better way to do this.
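Basically what you’ve written removes any block of the form <Stuff> where Stuff doesn’t contain any double quotes. That is why the opening <a href="..."> tag survives while the closing </a>, which has no quotes in it, is removed.
If, for example, there were a perfectly valid bit of HTML like: or even some odd HTML like:
it wouldn’t work for you.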
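To make that behaviour concrete, here is a minimal sketch that runs the expression from the question on a sample line and compares it with one possible alternative. The sample input, the variable names and the strip_except_links helper are assumptions made for illustration, not something from the original post. The alternative deletes every tag whose name is not exactly "a", and it assumes a sed that supports -E (GNU and BSD sed both do).

```shell
# Hypothetical sample input, modelled on the example in the question.
input='<p>Hello <a href="http://www.domain.com/">Link Title</a></p>'

# Expression from the question: only tags WITHOUT a double quote match,
# so <p>, </p> and </a> are deleted while <a href="..."> is left alone.
original=$(printf '%s\n' "$input" | sed -e 's/<[^">]*>//g')

# Sketch of an alternative (hypothetical helper): delete any tag whose
# name is not exactly "a". "#" is the s-command delimiter so that "/"
# needs no escaping.
#   [^aA/>][^>]*             tag name starts with something other than a
#   [aA][^[:space:]>/][^>]*  tag name starts with a but is longer (abbr, ...)
strip_except_links() {
    sed -E 's#</?([^aA/>][^>]*|[aA][^[:space:]>/][^>]*)>##g'
}
kept=$(printf '%s\n' "$input" | strip_except_links)

printf '%s\n' "$original" "$kept"
```

This is still a regular expression working line by line, so it will trip over tags that span several lines or attribute values that contain a ">" character. It is only reasonable when you know the input never contains such cases.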
Regular expressions are considered a notoriously bad way to process HTML, except in cases where you know exactly the full range of variations you might have to handle.
So read this viewpoint first.
I could suggest something like: