I’m working on a Text Mining Solution with SQL and R.
First I import data into R from my SQL selection, and then I do data mining stuff with it.
Here is what I got:
rawData = sqlQuery(dwhConnect,sqlString)
a = data.frame(rawData$ENNOTE_NEU)
If I do a
a[[1]][1:3]
you see the structure:
[1] lorem ipsum li ld ee wö wo di dd
[2] la kdin di da dogs chicken
[3] kd good i need some help
Now I want to do some data cleaning with my own dictionary.
An example would be to replace li with lorem ipsum, and both kd and kdin with kunde.
My problem is how to do that for the whole data frame.
for(i in 1:(nrow(a)))
{
a[[1]][i]=gsub( " kd | kdin " , " kunde " ,a[[1]][i])
a[[1]][i]=gsub( " li " , " lorem ipsum " ,a[[1]][i])
...
}
This works, but it is slow for a lot of data.
Is there a better way to do that?
cheers The Captain
gsub is vectorised, so you don’t need the loop. Applying it to the whole column in one call is quicker.
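A minimal sketch of the vectorised call, using the patterns from the question. The small data frame here is only a stand-in for the SQL result (the column name text and stringsAsFactors = FALSE are assumptions for illustration):

```r
# Stand-in for the column pulled from SQL (sample rows from the question)
a <- data.frame(text = c("lorem ipsum li ld ee wö wo di dd",
                         "la kdin di da dogs chicken",
                         "kd good i need some help"),
                stringsAsFactors = FALSE)

# gsub() accepts the whole character vector, so no for loop is needed
a[[1]] <- gsub(" kd | kdin ", " kunde ", a[[1]])
a[[1]] <- gsub(" li ", " lorem ipsum ", a[[1]])
```

If the column arrives from sqlQuery as a factor, wrapping it in as.character() first would probably be needed before assigning the result back.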
Also, are you sure you want spaces inside your regexes? That way you won’t match words at the start or end of lines.
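One way to handle that, sketched here under assumptions, is to use the word-boundary marker \\b instead of literal spaces and to keep the dictionary in a named character vector. The names x and dict are hypothetical, and the sample strings are the ones from the question:

```r
# Sample rows from the question, as a plain character vector
x <- c("lorem ipsum li ld ee wö wo di dd",
       "la kdin di da dogs chicken",
       "kd good i need some help")

# Hypothetical dictionary: names are regex patterns, values are replacements
dict <- c("kd|kdin" = "kunde", "li" = "lorem ipsum")

for (p in names(dict)) {
  # \\b matches at a word edge, so words at the start or end of a line match too
  x <- gsub(paste0("\\b(", p, ")\\b"), dict[[p]], x)
}
```

The loop here runs once per dictionary entry rather than once per row, so each gsub call still works on the whole vector.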