I want to remove all text from an HTML page that I load with Nokogiri. For example, if a page has the following:
<body><script>var x = 10;</script><div>Hello</div><div><h1>Hi</h1></div></body>
I want to process it with Nokogiri and return HTML with the text stripped, like so:
<body><script>var x = 10;</script><div></div><div><h1></h1></div></body>
(That is, remove the actual h1 text, text between divs, text in p elements etc, but keep the tags. Also, don’t remove text in the script tags.)
Warning: As you did not specify how to handle a case like
<div>foo<h1>bar</h1></div>
the above may or may not do what you expect. Alternatively, the following may match your needs:
Update
Here’s a more elegant solution using a single XPath to select all text nodes not part of a <script> element. I’ve added more text nodes to show how it handles them.
For Ruby 1.9, the meat is more simply: