I have a CSV that starts with 3 columns: a cumulative percentage column of Costs, a Cost column, and a keyword column. The R script works for small files but never finishes when I feed it the actual file, which has a million rows. Can you help me make this script more efficient? Token.Count is the column I’m having trouble creating. Thank you!
# Token Histogram
# Import CSV data from Report Downloader API Feed
Mydf <- read.csv("Output_test.csv.csv", sep=",", header = TRUE, stringsAsFactors=FALSE)
# Helps limit the dataframe according to the HTT
# Change number to:
# .99 for big picture
# .8 for HEAD
limitor <- Mydf$CumuCost <= .8
# De-comment to ONLY measure TORSO
#limitor <- (Mydf$CumuCost <= .95 & Mydf$CumuCost > .8)
# De-comment to ONLY measure TAIL
#limitor <- (Mydf$CumuCost <= 1 & Mydf$CumuCost > .95)
# De-comment to ONLY measure Non-HEAD
#limitor <- (Mydf$CumuCost <= 1 & Mydf$CumuCost > .8)
# Creates a column with HTT segmentation labels
# Creates a dataframe
HTT <- data.frame()
# Populates dataframe according to conditions
HTT <- ifelse(Mydf$CumuCost <= .8,"HEAD",ifelse(Mydf$CumuCost <= .95,"TORSO","TAIL"))
# Add the column to Mydf and rename it HTT
Mydf <- transform(Mydf, HTT = HTT)
# Count all KWs in account by using the dimension function
KWportfolioSize <- dim(Mydf)[1]
# Percent of portfolio
PercentofPortfolio <- sum(limitor)/KWportfolioSize
# Length of Keyword -- TOO SLOW
# Uses the Tau package
# My function takes the row number and returns the number of tokens
library(tau)
Myfun <- function(n) {
  sum(sapply(Mydf$Keyword.text[n], textcnt, split = "[[:space:][:punct:]]+", method = "string", n = 1L))
}
# Creates a dataframe to hold the results
Token.Count <- data.frame()
# Loops until last row and store it in data.frame
for (i in c(1:dim(Mydf)[1])) {Token.Count <- rbind(Token.Count,Myfun(i))}
# Add the column to Mydf
Mydf <- transform(Mydf, Token.Count = Token.Count)
# Not quite sure why but the column needs renaming in this case
colnames(Mydf)[dim(Mydf)[2]] <- "Token.Count"
Preallocate your storage before filling it in the loop. Never do what you are doing here, which is to concatenate or rbind/cbind objects inside a loop. R has to copy the object and allocate more storage at each iteration of the loop, and that overhead cripples your code.
Create Token.Count with enough rows and columns up front and fill it in the loop. Sorry I can’t be more specific, but I don’t know how many columns Myfun returns.
Update 1: Having taken a look at textcnt, I think you can probably avoid the loop altogether. You have something like a data frame with a Keyword.text column. Suppose we strip out the keywords and convert them to a list (called keywrds below).
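A minimal sketch of both points so far, on made-up data. The first part fills a preallocated vector instead of growing a data frame with rbind; it assumes Myfun returns a single number per row, which is what the version in the question appears to do. The second part builds the keywrds list that Update 1 works on. The example strings are invented for illustration.

```r
library(tau)

# Made-up example data; the real file has about a million rows
DF <- data.frame(
  Keyword.text = c("cheap red shoes", "buy shoes online", "shoes",
                   "red running shoes for men", "leather boots"),
  stringsAsFactors = FALSE
)

# Same idea as Myfun in the question: number of tokens in row n
Myfun <- function(n) {
  sum(textcnt(DF$Keyword.text[n], split = "[[:space:][:punct:]]+",
              method = "string", n = 1L))
}

# Preallocate the full result once, then fill it in place
Token.Count <- numeric(nrow(DF))
for (i in seq_len(nrow(DF))) {
  Token.Count[i] <- Myfun(i)
}

# Starting point for Update 1: one list component per keyword
keywrds <- as.list(DF$Keyword.text)
```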
Then we can call textcnt recursively on this list to count the words in each list component. This is almost what you had, except that I added recursive = TRUE to treat each of the input vectors separately. The final step is to sapply the sum function over countKeys to get the number of words, which appears to be what you are trying to achieve with the loop and the function. Have I got this right?
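The two steps just described can be sketched as follows. The keyword strings are invented; countKeys and keywrds are the names used in the text, and the split pattern is the one from the question.

```r
library(tau)

# Made-up keywords, one per list component
keywrds <- list("cheap red shoes", "buy shoes online", "shoes",
                "red running shoes for men", "leather boots")

# recursive = TRUE counts each list component separately
countKeys <- textcnt(keywrds, split = "[[:space:][:punct:]]+",
                     method = "string", n = 1L, recursive = TRUE)

# Sum the per-word counts within each component to get tokens per keyword
Token.Count <- sapply(countKeys, sum)
```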
Update 2: OK, if fixing the preallocation issue and using textcnt in a vectorized way still isn’t as quick as you would like, we can investigate other ways of counting words. It may well be that you don’t need all the functionality of textcnt to do what you want. [I can’t check whether the solution below will work for all your data, but it is a lot quicker.]
One potential solution is to split the Keyword.text vector into words using the built-in strsplit function, for example using keywrds generated above and only its first element. To use this idea it is perhaps easier to wrap it in a user function.
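A minimal sketch of both steps, assuming keywrds is a list with one keyword string per component (the strings here are invented, and fooFun is only a placeholder name):

```r
# Made-up keywords, one per list component
keywrds <- list("cheap red shoes", "buy shoes online", "shoes",
                "red running shoes for men", "leather boots")

# Split the first keyword on whitespace/punctuation and count the pieces
first_count <- length(unlist(strsplit(keywrds[[1]],
                                      split = "[[:space:][:punct:]]+")))

# The same idea wrapped in a function.
# Caveat: a string that starts with whitespace or punctuation yields an
# empty first piece, which this simple version would count as a token.
fooFun <- function(x) {
  length(unlist(strsplit(x, split = "[[:space:][:punct:]]+")))
}
```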
We can then apply that function to the keywrds list. For this simple example data set we get the same result. What about compute time? We can time the solution using textcnt first (combining two of the steps from Update 1) and then the solution in Update 2.
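One way to run that comparison, sketched on invented data. On an input this small the timings are mostly noise, so treat the equality check as the meaningful part and rerun the timing on a realistic sample.

```r
library(tau)

# Made-up keywords, one per list component
keywrds <- list("cheap red shoes", "buy shoes online", "shoes",
                "red running shoes for men", "leather boots")
pat <- "[[:space:][:punct:]]+"

# Placeholder name for the strsplit-based counter
fooFun <- function(x) length(unlist(strsplit(x, split = pat)))

# Update 1: textcnt with recursive = TRUE, then sum per component
time_textcnt <- system.time(
  res_textcnt <- sapply(textcnt(keywrds, split = pat, method = "string",
                                n = 1L, recursive = TRUE), sum)
)

# Update 2: strsplit-based function applied to each component
time_strsplit <- system.time(
  res_strsplit <- sapply(keywrds, fooFun)
)

# Both approaches should agree on this data
same_result <- all(res_textcnt == res_strsplit)
print(time_textcnt)
print(time_strsplit)
```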
So even for this small sample, there is considerable overhead involved in calling textcnt, but whether this difference holds when applying both approaches to the full data set remains to be seen.
Finally, we should note that the strsplit approach can be vectorised to work directly on the vector Keyword.text in DF, which gives the same results as the other two approaches and is marginally faster than the non-vectorized use of strsplit. Are any of these faster on your full data set?
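A minimal sketch of the vectorized form, assuming DF holds the keywords in a character column named Keyword.text (the rows below are invented):

```r
# Made-up example data
DF <- data.frame(
  Keyword.text = c("cheap red shoes", "buy shoes online", "shoes",
                   "red running shoes for men", "leather boots"),
  stringsAsFactors = FALSE
)

# strsplit() accepts the whole character vector and returns a list of
# word vectors, so one sapply(..., length) gives the token count per row
Token.Count <- sapply(strsplit(DF$Keyword.text,
                               split = "[[:space:][:punct:]]+"), length)

# Attach the result as a new column
DF$Token.Count <- Token.Count
```

On R 3.2.0 or later, lengths() applied to the strsplit result should give the same counts without the sapply call.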
Minor Update: replicating DF to give 130 rows of data and timing the three approaches suggests that the last (vectorized strsplit()) scales better.
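A sketch of how such a comparison could be set up. The five base rows are invented and repeated 26 times to reach 130 rows; DF2, keywrds2 and fooFun are placeholder names. No timings are quoted here because they depend on the machine and the data.

```r
library(tau)

# Made-up base data: 5 rows, replicated 26 times to give 130 rows
DF <- data.frame(
  Keyword.text = c("cheap red shoes", "buy shoes online", "shoes",
                   "red running shoes for men", "leather boots"),
  stringsAsFactors = FALSE
)
DF2 <- DF[rep(seq_len(nrow(DF)), times = 26), , drop = FALSE]
keywrds2 <- as.list(DF2$Keyword.text)
pat <- "[[:space:][:punct:]]+"
fooFun <- function(x) length(unlist(strsplit(x, split = pat)))

# 1. textcnt with recursive = TRUE
t1 <- system.time(
  r1 <- sapply(textcnt(keywrds2, split = pat, method = "string",
                       n = 1L, recursive = TRUE), sum)
)
# 2. strsplit applied component by component
t2 <- system.time(r2 <- sapply(keywrds2, fooFun))
# 3. strsplit applied to the whole vector at once
t3 <- system.time(r3 <- sapply(strsplit(DF2$Keyword.text, split = pat), length))

# All three should give identical counts
stopifnot(all(r1 == r2), all(r2 == r3))

# Elapsed seconds for each approach
elapsed <- c(textcnt = t1[["elapsed"]],
             strsplit_loop = t2[["elapsed"]],
             strsplit_vectorized = t3[["elapsed"]])
print(elapsed)
```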