I’m trying to extract the text of a document so I can index it for search. The code below mostly works, except that some words and punctuation run together. When tags are removed, I need them replaced with spaces so that the surrounding text does not get joined. I have been trying to figure out the most efficient way to do this, but I’m coming up empty so far.
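require 'nokogiri'

doc = Nokogiri::HTML(html)
# Drop elements whose contents should not be indexed.
doc.xpath("//script").remove
doc.xpath("//style").remove
doc.xpath("//a").remove
# Collapse runs of whitespace into single spaces.
text = doc.text.gsub(/\s+/, ' ')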
Here is some sample text I extracted from http://www.washingtontimes.com/blog/redskins-watch/2012/oct/18/redskins-linemen-respond-jason-pierre-paul-rg3-com/:
Before the season it was New York Giants defensive end Osi Umenyiora
who made waves by saying he wouldn’t call Robert Griffin III by “RG3”
until he did something. Until then, it was “Bob Griffin.”After
Griffin’s 76-yard touchdown run in the Washington Redskins’ victory
over the Minnesota Vikings, fellow Giants defensive end Jason
Pierre-Paul was the one who had some comments about Griffin.“Don’t
bring it to my side,” Pierre-Paul told New York media. “Go the other
way. …“Yes, it’ll be a very good matchup. Not on my side, though. Not
on my side. Or the other side.”Griffin, asked jokingly Wednesday about
running for office, said: “I’ve got a lot other guys to be running
away from right now, Pierre-Paul, Osi, all those guys.”But according
to a couple of Redskins linemen, Griffin shouldn’t have much to worry
about Sunday if he gets into the open field.“If Robert gets into that
situation, I don’t think there’s many people that can run him down,”
right guard Chris Chester said. “I’m still going to go out there and
try to block and make sure no one touches Robert at all. But he’s a
plenty good athlete to be able to outrun a lot of people in this
league.”Prompted with Pierre-Paul’s comments, left tackle Trent
Williams responded: “What do you want me to say about that?”“Robert’s
my guy. I don’t know Pierre-Paul. I don’t know why he would say
something like that,” he said. “Maybe he knows something I don’t.”
You could try inserting a space before each p tag:
A better approach, though, is probably something like: