I know there are tons of docs and debates out there, but still:
This is my best shot at testing scraped data from various websites in Rails. The strange thing is that if I manually copy and paste the source of a URL, everything works.
What can I do?
# encoding: utf-8
require 'rubygems'
require 'iconv'
require 'nokogiri'
require 'open-uri'
require 'uri'
url = 'http://www.website.com/url/test'
sio = open(url)
@cur_encoding = sio.charset
doc = Nokogiri::HTML(sio, nil, @cur_encoding)
txtdoc = doc.to_s
# 1) String manipulation test
p doc.search('h1')[0].text # "Nove36 "
p doc.search('h1')[0].text.strip! # nil <- ERROR
# 2) Regex test
# txtdoc = "test test 44.00 € test test" # <- THIS WORKS
regex = "[0-9.]+ €"
p /#{regex}/i =~ txtdoc # integer expected
I realize that my OS (Ubuntu) plus my text editor is probably doing some helpful encoding conversion on what is probably a broken encoding. That's fine, but how can I fix this problem in my app while it is running live?
The problems you're having are caused by non-breaking space characters (Unicode U+00A0) in the page.
In your first problem, the string "Nove36 " actually ends with U+00A0, and String#strip! doesn't treat this character as whitespace to be removed, so it changes nothing and returns nil.
In your second problem, the space between the price and the euro sign is again a non-breaking space, so the regex doesn't match because it is looking for a normal space.
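Both symptoms can be reproduced without touching the network. The snippet below is only a sketch: `heading` and `price_text` are hypothetical stand-ins for the scraped values, with the non-breaking space written explicitly as `\u00a0`.

```ruby
# encoding: utf-8
# Hypothetical stand-ins for the scraped text; "\u00a0" is the non-breaking space.
heading    = "Nove36\u00a0".dup
price_text = "test test 44.00\u00a0€ test test"

# strip! only removes ASCII whitespace (and null), so nothing changes here
# and strip! signals "no change" by returning nil.
p heading.strip!                    # => nil
p heading.end_with?("\u00a0")       # => true

# A regex with a plain space does not match the non-breaking space...
p(/[0-9.]+ €/ =~ price_text)        # => nil
# ...while one that names U+00A0 explicitly does.
p(/[0-9.]+\u00a0€/ =~ price_text)   # => 10
```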
When you copy and paste the source, the browser probably normalises the non-breaking spaces, so you copy only normal space characters, which is why it works that way.
The simplest fix would be to globally replace every \u00a0 with a normal space before you start processing.
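A minimal sketch of that substitution might look like the following. The helper name `normalize_spaces` and the sample strings are assumptions made for illustration; in the script from the question the same call would be applied to the scraped text, for example to `doc.to_s` or to the result of `.text`.

```ruby
# encoding: utf-8
NBSP = "\u00a0"

# Hypothetical helper: replace every non-breaking space with a plain space.
def normalize_spaces(text)
  text.gsub(NBSP, " ")
end

txtdoc = normalize_spaces("test test 44.00\u00a0€ test test")
p(/[0-9.]+ €/i =~ txtdoc)                 # => 10

# After normalising, strip removes the trailing space as expected.
p normalize_spaces("Nove36\u00a0").strip  # => "Nove36"
```

If you would rather leave the text untouched, another option is to match with the POSIX bracket class `[[:space:]]` instead of a literal space, since in Ruby 1.9 and later it should also match U+00A0 in UTF-8 strings.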