I have a large set of HTML files that contain text from a magazine

Question

Asked: June 17, 2026 (2026-06-17T07:09:31+00:00)

I have a large set of HTML files that contain text from a magazine in span nodes. My PDF-to-HTML converter inserted the character entity &nbsp; throughout the HTML. The problem is that in R, I use the xmlValue function (in the XML package) to extract the text, but wherever there was an &nbsp; the space between the words is eliminated. For example:

<span class="ft6">kids,&nbsp;and kids in your community,&nbsp;in                                   DIY&nbsp;projects.&nbsp;</span>

will come out of the xmlValue function as:

"kids,and kids in your community,in DIYprojects."

I was thinking that the easiest way to resolve this would be to find every &nbsp; before running the span nodes through xmlValue and replace it with a " " (space). How would I approach that?

1 Answer

Editorial Team · Answer 1 · 2026-06-17T07:09:32+00:00

I have rewritten this answer to address the original poster's actual problem: the text returned by xmlValue loses the space wherever the HTML contains &nbsp;. There are probably several ways to tackle this, but one is simply to read the HTML files, replace the entity, and write the result to new files. Parsing XML/HTML with regular expressions is generally a bad idea, but here the task is a plain substitution of unwanted non-breaking spaces, so it is unlikely to cause trouble. The following code builds a list of matching files and runs gsub on their contents. It should be easy to modify or expand as needed.

setwd("c:/test/")
# Create 'html' file to use with test
txt <- "<span class=ft6>kids,&nbsp;and kids in your community,&nbsp;in                                   DIY&nbsp;projects.&nbsp;</span>
<span class=ft6>kids,&nbsp;and kids in your community,&nbsp;in                                   DIY&nbsp;projects.&nbsp;</span>
<span class=ft6>kids,&nbsp;and kids in your community,&nbsp;in                                   DIY&nbsp;projects.&nbsp;</span>"
writeLines(txt, "file1.html")

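# Now list the HTML files - in this case only one.
# 'pattern' is a regular expression, so escape the dot and anchor the end.
html.files <- list.files(pattern = "\\.html$")
# Skip output from an earlier run so it is not processed a second time
html.files <- html.files[!grepl("^new_", html.files)]
html.files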

# Loop through the list of files
retval <- lapply(html.files, function(x) {
  in.lines <- readLines(x, n = -1)
  # Replace the non-breaking space entity with an ordinary space
  out.lines <- gsub("&nbsp;", " ", in.lines)
  # Write the corrected lines to a new file prefixed with "new_"
  writeLines(out.lines, paste0("new_", x))
})
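
If the converter's output is not consistent, the same non-breaking space may also appear as a numeric entity or as the literal character. The following is a minimal sketch in base R, not part of the original answer: replace_nbsp is a hypothetical helper name, and the assumption that &#160;, &#xA0; or a raw U+00A0 occur in your files should be checked against the actual HTML.

```r
# Hypothetical helper (an assumption, not from the original answer):
# turn the common spellings of a non-breaking space into an ordinary space.
replace_nbsp <- function(lines) {
  # Named, decimal and hexadecimal entity forms
  lines <- gsub("&nbsp;|&#160;|&#[xX][aA]0;", " ", lines)
  # The literal U+00A0 character, in case the entity was already decoded
  gsub("\u00a0", " ", lines, fixed = TRUE)
}

# Example input modelled on the span from the question
span <- "<span class=\"ft6\">kids,&nbsp;and kids in your community,&nbsp;in DIY&#160;projects.&#xA0;</span>"
cleaned <- replace_nbsp(span)
print(cleaned)
```

Because the helper works on a character vector, it could be dropped into the loop above in place of the gsub call, or applied to lines that are then handed straight to the parser without writing new files.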
