I try to scrape a website but can’t handle this encoding issue:
# putting together the url:
search_str <- "allintitle:amphibian richness OR diversity"
url <- paste("http://scholar.google.at/scholar?q=",
search_str, "&hl=en&num=100&as_sdt=1,5&as_vis=1", sep = "")
# get content and parse it:
doc <- htmlParse(url)
# encoding isssue, like here..
xpathSApply(doc, '//div[@class="gs_a"]', xmlValue)
[1] "M Vences, M Thomas… - … of the Royal …, 2005 - rstb.royalsocietypublishing.org"
[2] "PB Pearman - Conservation Biology, 1997 - Wiley Online Library"
[3] "D Vallan - Biological Conservation, 2000 - Elsevier"
[4] "LB Buckley, W Jetz - Proceedings of the Royal …, 2007 - rspb.royalsocietypublishing.org"
[5] "MÃ RodrÃguez, JA Belmontes, BA Hawkins - Acta Oecologica, 2005 - Elsevier"
[6] "TJC Beebee - Biological Conservation, 1997 - Elsevier"
[7] "D Vallan - Journal of Tropical Ecology, 2002 - Cambridge Univ Press"
[8] "MO Rödel, R Ernst - Ecotropica, 2004 - gtoe.de"
# ...
any pointers?
> sessionInfo()
R version 2.15.1 (2012-06-22)
Platform: x86_64-pc-mingw32/x64 (64-bit)
locale:
[1] LC_COLLATE=German_Austria.1252 LC_CTYPE=German_Austria.1252
[3] LC_MONETARY=German_Austria.1252 LC_NUMERIC=C
[5] LC_TIME=German_Austria.1252
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] RCurl_1.91-1.1 bitops_1.0-4.1 XML_3.9-4.1
loaded via a namespace (and not attached):
[1] tools_2.15.1
> getOption("encoding")
[1] "native.enc"
This worked to some degree for me
thou
was displaying incorrectly on my windows box for example.
switching to Font
DotumCheusing GUI preferences however showed it displaying correctly so it may just be a display issue not a parsing one.