I need a way to remove nodes like <footer>foobar</footer> and <div class="nav"></div> from many HTML files.
I want to dump a site to disk without the menus, footers and whatnot. Ideally I would do this with basic Unix tools like sed. Since it’s not XML, I can’t use xmlstarlet.
Could anyone please suggest recipes, so that I can ideally run a script like kill-node.sh 'div class="toplinks"' *.html to prune the bits I don’t want? Thank you,
Just to drive you regex haters nuts, try this on for size:
sed ':a;$!N;$!ba;s/B/-B/g;s/A/BB/g;s/<\/foo>/A/g;:b;s/<foo>[^A]*A//;tb;s/BB/A/g;s/-B/B/g' foo.html
With foo.html being:
Otherwise can someone do a cmdline HTML5 parser please. Thanks. x
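For anyone trying to follow the sed one-liner, here is a rough, commented sketch of the same trick: slurp the file, free up a single marker character, stand it in for the closing tag, then delete from each opening tag to the nearest marker. The function name kill_foo and the sample input are made up for illustration. The escaping is also swapped for a "-d"/"-a" scheme instead of the B/BB one, because as far as I can tell the B/BB version turns a literal "BA" into "-AB" on the way back. It assumes non-nested <foo> tags without attributes, so it is a sketch and not a real HTML parser.

```shell
#!/bin/sh
# kill_foo is a hypothetical helper name, not part of the original post.
# It reads HTML on stdin and writes it back without <foo>...</foo> elements.
# Assumes non-nested <foo> tags without attributes.
kill_foo() {
  sed '
    # Slurp the whole input into the pattern space.
    :a
    $!N
    $!ba
    # Free up the letter A: escape "-" first, then A itself.
    s/-/-d/g
    s/A/-a/g
    # Mark every closing tag with the now-unused A.
    s/<\/foo>/A/g
    # Delete from each <foo> to the nearest marker, repeating until none match.
    :b
    s/<foo>[^A]*A//
    tb
    # Put back any closing tag that had no matching opener.
    s/A/<\/foo>/g
    # Undo the escaping in reverse order.
    s/-a/A/g
    s/-d/-/g
  '
}

# Example input (made up for illustration).
printf 'keep<foo>drop</foo> BA -a\n<foo>x\ny</foo>end\n' | kill_foo
```

To target a different element, both tag patterns inside the script would need changing, which is about as far as this approach stretches before a real parser becomes the saner choice.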