I’ve got to use some HTML like this:
<li><a href="#">S:</a><a class="#"> (n) </a><a href="#">trial</a>, <a href="#">trial run</a>, <b>test</b>, <a href="#">tryout</a> (trying something to find out about it) <i>"a sample for ten days free trial"; "a trial of progesterone failed to relieve the pain"</i></li>
The problem is that I need the text from both the child elements (the <a> and <i> tags) and the text nodes (such as the ", " separators between children).
All I’ve managed so far is either to get the text of each child and join it (which gives me everything except the text nodes) or to get just the text nodes (without the text inside the <a> and <i> elements). Is there a way to get both?
The lxml changelog shows lxml v2.3 is compatible with Python 3.1.2 and newer.
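For what it's worth, here is a minimal sketch of getting both the child text and the in-between text nodes in one pass. It uses the standard library's `xml.etree.ElementTree` so that it runs without installing anything; this only works here because the sample happens to be well-formed. `lxml.etree` elements expose the same `itertext()` method, so the same approach should carry over. The `snippet` variable is a shortened copy of the HTML from the question, not anything from a real page.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the <li> from the question (assumed well-formed).
snippet = (
    '<li><a href="#">S:</a><a class="#"> (n) </a><a href="#">trial</a>, '
    '<a href="#">trial run</a>, <b>test</b>, <a href="#">tryout</a> '
    '(trying something to find out about it) '
    '<i>"a sample for ten days free trial"</i></li>'
)

li = ET.fromstring(snippet)

# itertext() walks the subtree in document order and yields every
# element's text plus every tail (the text that follows a child),
# so the ", " separators between the links are included.
full_text = "".join(li.itertext())
print(full_text)
```

For real-world HTML that is not well-formed, an HTML-aware parser such as `lxml.html` is probably the safer choice; the stdlib XML parser will raise an error on unclosed tags.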
You could also use a regular expression, as suggested in "Python’s equivalent to PHP’s strip_tags":
re.sub(r'<[^>]*?>', '', val)
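As a rough illustration of the regex approach, the sketch below wraps that `re.sub` call in a small helper. The name `strip_tags` and the sample string are just placeholders for this example. Note that this is a quick hack rather than a parser: it does not decode entities such as `&amp;`, and it will misfire if a `>` appears inside an attribute value.

```python
import re

def strip_tags(val):
    # Remove anything that looks like a tag: "<", then any run of
    # characters other than ">", then ">". Text between tags is kept.
    return re.sub(r'<[^>]*?>', '', val)

html = '<a href="#">trial</a>, <a href="#">trial run</a>, <b>test</b>'
text = strip_tags(html)
print(text)  # trial, trial run, test
```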