I’m working on reading transcripts of dialogue into R. However, I run into a bump with special characters such as curly quotes, en and em dashes, etc. Typically I replace these special characters in a Microsoft product first, using find and replace. Usually I replace them with plain text, but on some occasions I want to replace them with other characters (i.e. I replace “ ” with { }). This is tedious and not always thorough. If I could read the transcripts into R as is and then use Encoding to switch their encoding to a recognizable Unicode format, I could gsub them out and replace them with plain text versions. However, the file is read in in some way I don’t understand.
Here’s an xlsx of what my data may look like:
http://dl.dropbox.com/u/61803503/test.xlsx
This is what is in the .xlsx file
text num
“ ” curly quotes 1
en dash (–) and the em dash (—) 2
‘ ’ curly apostrophe-ugg 3
… ellipsis are uck in R 4
This can be read into R with:
URL <- "http://dl.dropbox.com/u/61803503/test.xlsx"
library(gdata)
z <- read.xls(URL, stringsAsFactors = FALSE)
The result is:
text num
1 “ †curly quotes 1
2 en dash (–) and the em dash (—) 2
3 ‘ ’ curly apostrophe-ugg 3
4 … ellipsis are uck in R 4
So I tried to use iconv to convert the encoding to UTF-8:
iconv(z[, 1], "latin1", "UTF-8")
This gives:
[1] "â\u0080\u009c â\u0080\u009d curly quotes" "en dash (â\u0080\u0093) and the em dash (â\u0080\u0094)"
[3] "â\u0080\u0098 â\u0080\u0099 curly apostrophe-ugg" "â\u0080¦ ellipsis are uck in R"
Which makes gsubing less useful.
What can I do to convert these special characters to distinguishable Unicode so I can gsub them out appropriately? To be more explicit, I was hoping to have z[1, 1] read:
\u201C \u201D curly quotes
To make my desired outcome even clearer: I will web-scrape the tables from a page like Wikipedia’s http://en.wikipedia.org/wiki/Quotation_mark_glyphs and use the Unicode reference chart to replace characters appropriately. So I need the characters to be in Unicode, or some standard format that I can systematically go through, to replace them. Maybe they already are and I’m missing it.
PS: I don’t save the files as .csv or plain text because the special characters are replaced with ?, hence the use of read.xls. I’m not attached to any particular method of reading in the file (i.e. read.xls) if you’ve got a better alternative.
Maybe this will help (I’ll have access to a Windows machine tomorrow and can probably play with it more at that point if SO doesn’t get you the answer first).
On my Linux system, when I do the following:
I get:
This is not UTF, but (I believe) ISO hex entities. Still, if you are able to get to this point also, then you should be able to use gsub the way you intend to.
See this page (reserved section in particular) for conversions.
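As a rough sketch of that gsub step: the sample vector, the replacement table, and the variable names below are illustrative assumptions, not the actual contents of z. The first part assumes the bytes that were read in are already UTF-8 and only lack the encoding mark, in which case declaring the encoding (rather than converting with iconv) may be enough.

```r
# Bytes e2 80 9c with no encoding mark, as they might arrive from a
# mislabeled read (assumption: the underlying file content is UTF-8).
raw_quote <- rawToChar(as.raw(c(0xe2, 0x80, 0x9c)))
Encoding(raw_quote) <- "UTF-8"   # declare the encoding, do not convert

# Hypothetical stand-in for z[, 1]; \u escapes keep this script ASCII-only.
x <- c("\u201C \u201D curly quotes",
       "en dash (\u2013) and the em dash (\u2014)",
       "\u2018 \u2019 curly apostrophe-ugg",
       "\u2026 ellipsis are uck in R")

# Illustrative lookup table: special character -> plain-text replacement.
from <- c("\u201C", "\u201D", "\u2018", "\u2019", "\u2013", "\u2014", "\u2026")
to   <- c("{",      "}",      "'",      "'",      "-",      "--",     "...")

# Replace each character literally (fixed = TRUE avoids regex interpretation).
clean <- x
for (i in seq_along(from)) {
  clean <- gsub(from[i], to[i], clean, fixed = TRUE)
}
print(clean)
```

Keeping the table as two parallel vectors makes it easy to fill `from` and `to` from a scraped Unicode reference chart later on.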
Update
You can also try converting to an encoding that doesn’t have those characters, like ASCII, and set sub to "byte". On my machine, that gives me:
It’s ugly, but UTF-8 (e2, 80, 9c) is a left curly quote (each character, I believe, is a set of three values in angled brackets). You can find conversions at this site where you can search by punctuation mark name.
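A minimal sketch of the sub = "byte" route, assuming the input string is valid UTF-8. The sample string and the { } replacements are made up for illustration; only iconv and gsub from base R are used.

```r
# Hypothetical stand-in for z[1, 1], written with \u escapes so it is UTF-8.
x <- "\u201C \u201D curly quotes"

# Convert to ASCII; each byte that cannot be represented becomes "<xx>" (hex).
bytes <- iconv(x, from = "UTF-8", to = "ASCII", sub = "byte")
print(bytes)

# The byte tokens are plain ASCII, so literal gsub calls can target them.
out <- gsub("<e2><80><9c>", "{", bytes, fixed = TRUE)  # U+201C, left double quote
out <- gsub("<e2><80><9d>", "}", out,   fixed = TRUE)  # U+201D, right double quote
print(out)
```

The upside of this approach is that everything after the iconv call is pure ASCII, so the result should not depend on the locale of the machine doing the replacing.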