So I’m trying to use sed (it has to be sed on these systems, so please don’t just recommend Perl) to match an HTML tag and get the contents out of it. The HTML tags look roughly like this:
<div class="SectionText"> Received poor service or think your current mechanic is ripping you off? Get some help from <a href="http://www.union.umd.edu/gradlegalaid/index.htm" target="_blank">Graduate Legal Aid</a> or consult the <a href="http://www.oag.state.md.us/Consumer/index.htm" target="_blank">Maryland Attorney General Office of Consumer Protection</a> at <a href="mailto:consumer@oag.state.md.us">consumer@oag.state.md.us</a> or through their hotline at 410-528-8662 or 888-743-0023.<br /></div>
All on one line. So, I wrote this one… But it doesn’t work.
sed 's/<div class=\"SectionText\">\([^<\/div>]*\)<\/div>/\1/g'
This does not alter any text.
I tried to use this website as a guideline – http://www.ibm.com/developerworks/linux/library/l-sed2/index.html (under RegExp Snafus).
The most important thing is for this script NOT to be greedy and match all the way up to the last </div> on the line.
The bracket expression [^<\/div>]* does not do what you think it does. It matches any sequence of characters that are not <, /, d, i, v or >, so the match stops at the first i in “Received” and the closing </div> is never reached.

In Perl you could simply use the non-greedy .*? instead, but as sed does not support non-greedy matches, you’ll have to write out a beauty of an alternation that says “any sequence of characters that are not <, or are < not followed by /, or are </ not followed by d, and so on.”

Needless to say, this is an unreadable, unmaintainable and nearly unwritable piece of crap and you should almost certainly not be using it, but if you absolutely, positively must use regexes to parse HTML and absolutely, positively must use sed, then here you go.
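The alternation described above could look roughly like the sketch below. It is a reconstruction from that description, not necessarily the exact expression originally posted. It assumes a sed that accepts -E for extended regular expressions (GNU and BSD sed both do), and the sample input in html is made up for illustration. It uses # as the s delimiter so the slashes need no escaping.

```shell
# Hypothetical sample input: two SectionText divs on one line, with other markup between them.
html='<div class="SectionText">first <a href="x">link</a><br /></div><p>keep</p><div class="SectionText">second</div>'

# Each alternative consumes text that cannot be the start of "</div>":
#   [^<]        any character other than <
#   <[^/]       < not followed by /
#   </[^d]      </ not followed by d
#   </d[^i]     </d not followed by i
#   </di[^v]    </di not followed by v
#   </div[^>]   </div not followed by >
# The outer group \1 therefore ends at the first </div> after the opening tag.
result=$(printf '%s\n' "$html" |
  sed -E 's#<div class="SectionText">(([^<]|<[^/]|</[^d]|</d[^i]|</di[^v]|</div[^>])*)</div>#\1#g')

printf '%s\n' "$result"
```

On the sample line this strips both div wrappers and leaves the paragraph between them alone. Without -E, GNU sed accepts the same pattern with \( \) and \| in place of ( ) and |, but \| is a GNU extension. The sketch can still be thrown off by a stray < directly in front of </div>.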