How can I get only the text of a <p> node that also contains other tags, like:
<p>hello my website is <a href="www.website.com">click here</a> <b>test</b></p>
I only want “hello my website is”
This is what I tried:
begin
  node = html_doc.css('p')
  node.each do |node|
    node.children.remove
  end
  return (node.nil?) ? '' : node.text
rescue
  return ''
end
Update 2: All right, you are removing all children with node.children.remove, including the text nodes. A possible solution is to remove only the element children of each <p> and keep its text nodes. For the markup in the question this returns “hello my website is”, but note that doc.css('p') also finds <p> tags within <p> tags.
Update: Sorry, I misread your question; you only want “hello my website is”, see the solution above. Original answer:
Not directly with nokogiri, but the sanitize gem might be an option: https://github.com/rgrove/sanitize/
FYI, it uses nokogiri internally.