I have a large set of HTML files that contain text from a magazine in span nodes. My PDF-to-HTML converter inserted the character entity &nbsp; throughout the HTML. The problem is that in R, when I use the xmlValue function (in the XML package) to extract the text, the space between the words is eliminated wherever there was an &nbsp;. For example:
<span class="ft6">kids,&nbsp;and kids in your community,&nbsp;in DIY&nbsp;projects.</span>
will come out of the xmlValue function as:
"kids,and kids in your community,in DIYprojects."
I was thinking that the easiest way to resolve this would be to find every &nbsp; before running the span nodes through xmlValue and replace each one with " " (a plain space). How would I approach that?
I have rewritten the answer to reflect the original poster's problem of not being able to get the text from xmlValue. There are probably different ways to tackle this, but one way is to open the HTML files directly, replace the entities, and write the files back. Generally, tackling XML/HTML with regexes is A Bad Idea, but in this case we have a straightforward problem of unwanted non-breaking spaces, so it's likely not too much of an issue. The following code is an example of how to create a list of matching files and perform a gsub on the contents. It should be easy to modify or expand as needed.
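A minimal sketch of that approach, using only base R. The function names replace_nbsp and clean_html_dir are hypothetical helpers introduced for illustration, and the sketch assumes the entity appears literally as &nbsp; (not as &#160; or a raw non-breaking space character). It overwrites each file in place, so it is worth trying on a copy of the files first.

```r
# Replace every literal "&nbsp;" entity in one HTML file with a plain space.
# `path` is the file to rewrite; the file is overwritten in place.
replace_nbsp <- function(path) {
  lines <- readLines(path, warn = FALSE)
  # fixed = TRUE treats the pattern as a literal string, not a regex;
  # useBytes = TRUE keeps the replacement byte-wise so the file's
  # encoding is passed through untouched.
  cleaned <- gsub("&nbsp;", " ", lines, fixed = TRUE, useBytes = TRUE)
  writeLines(cleaned, path, useBytes = TRUE)
  invisible(path)
}

# Apply replace_nbsp() to every .htm/.html file in a directory.
# `dir` is an assumed location; adjust the pattern if the files
# use a different extension.
clean_html_dir <- function(dir) {
  files <- list.files(dir, pattern = "\\.html?$",
                      full.names = TRUE, ignore.case = TRUE)
  lapply(files, replace_nbsp)
  invisible(files)
}
```

Calling clean_html_dir("path/to/html") (a placeholder path) rewrites every matching file, after which the span nodes can be parsed and passed to xmlValue as before. If the files should stay untouched, the same gsub call can be applied to the lines in memory and the result handed to the parser as text instead of being written back.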