I’ve tried to write a reproducible example below. It is a mix of .Rmd and .r. Hopefully you can see why.
The problem I have is that non-English characters are treated differently depending on whether the code is run directly in the console or knitted to HTML.
In the example below I create a small data.frame containing the characters ü and ä, write it to CSV, then read it back in again.
If the writing and reading both take place inside or outside a chunk, then all is well.
But if the writing and the reading take place in different places, then a different encoding is used (I think) and the characters get mixed up.
This means that when reading in data I need a different encoding when compiling an .Rmd file than when working directly in R.
As far as I can see the locale is always the same, so I don’t understand what’s going on.
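The locale may not be the only setting involved. As a rough diagnostic sketch (not part of the original example), the snippet below prints the default connection encoding next to the locale, and shows how the same letter is stored under two encodings. The variable name u_utf8 is just an illustrative choice. Running it once in the console and once inside a knitted chunk should make any difference visible.

```r
# Run this both in the console and inside a knitted chunk, then compare the output.
print(getOption("encoding"))        # default encoding used by file connections
print(Sys.getlocale("LC_CTYPE"))    # character-type part of the locale

# The same letter occupies two bytes in UTF-8 and one byte in Latin-1,
# which is why a file written in one encoding looks garbled when read as the other.
u_utf8 <- enc2utf8("\u00fc")        # the letter u-umlaut, written as an escape
print(charToRaw(u_utf8))
print(charToRaw(iconv(u_utf8, from = "UTF-8", to = "latin1")))
```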
Any ideas?
Write and read csv directly to create new datafile
df2 <- data.frame(Cäl1 = c(1,2), Col2 = c("ü","a"))
write.csv(df2, file="df2.csv")
read.csv("df2.csv")
Sys.getlocale(category = "LC_ALL")
Now try knitting the whole document (just running the chunk on its own behaves differently):
```{r read_inside}
read.csv("df2.csv")
Sys.getlocale(category = "LC_ALL")
```
This second chunk will work because the data.frame is created, written and read inside the same chunk:
```{r write_read_inside}
df2 <- data.frame(Cäl1 = c(1,2), Col2 = c("ü","a"))
write.csv(df2, file="df2.csv")
read.csv("df2.csv")
Sys.getlocale(category = "LC_ALL")
```
Session Info:
R version 2.15.0 (2012-03-30)
Platform: x86_64-pc-mingw32/x64 (64-bit)
locale:
[1] LC_COLLATE=English_United Kingdom.1252 LC_CTYPE=English_United Kingdom.1252 LC_MONETARY=English_United Kingdom.1252
[4] LC_NUMERIC=C LC_TIME=English_United Kingdom.1252
attached base packages:
[1] stats graphics grDevices utils datasets methods base
loaded via a namespace (and not attached):
[1] tools_2.15.0
So the answer is to guarantee UTF-8 encoding, e.g. write.csv(..., fileEncoding = 'UTF-8').
The root problem was actually that RStudio uses UTF-8 by default, but R uses the native encoding of the OS by default. We can either ask R to use UTF-8 in write.csv, or ask RStudio to use the native encoding (options(encoding = 'native.enc')).
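A minimal sketch of the first option, assuming the file is both written and read with an explicit fileEncoding so that neither step depends on the session default. The file name df2_utf8.csv, the temporary directory and the ASCII column names are illustrative choices, not part of the original example.

```r
# Build the data with an escape sequence so the script parses the same
# way whatever encoding the source file itself is saved in.
df2 <- data.frame(Col1 = c(1, 2),
                  Col2 = c("\u00fc", "a"),
                  stringsAsFactors = FALSE)

# Hypothetical output path, kept in a temporary directory.
csv_path <- file.path(tempdir(), "df2_utf8.csv")

# State the encoding explicitly on the way out ...
write.csv(df2, file = csv_path, row.names = FALSE, fileEncoding = "UTF-8")

# ... and on the way back in, so console and knitted runs agree.
df2_back <- read.csv(csv_path, fileEncoding = "UTF-8", stringsAsFactors = FALSE)

print(df2_back)
print(identical(df2_back$Col2, df2$Col2))
```

With both calls naming the encoding, it should no longer matter whether the write happens in the console and the read in a knitted chunk, or the other way round.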