I have a CSV that starts with 3 columns. A cumulative percentage column of

Asked: May 18, 2026 (2026-05-18T20:06:55+00:00)

I have a CSV that starts with 3 columns: a cumulative percentage column of costs, a Cost column, and a keyword column. The R script works for small files but never finishes when I feed it the actual file, which has a million rows. Can you help me make this script more efficient? The Token.Count column is the one I'm having trouble creating. Thank you!

# Token Histogram

# Import CSV data from Report Downloader API Feed
Mydf <- read.csv("Output_test.csv.csv", sep=",", header = TRUE, stringsAsFactors=FALSE)

# Helps limit the dataframe according the HTT
# Change number to:
# .99 for big picture
# .8 for HEAD
limitor <- Mydf$CumuCost <= .8
# De-comment to ONLY measure TORSO
#limitor <- (Mydf$CumuCost <= .95 & Mydf$CumuCost > .8)
# De-comment to ONLY measure TAIL
#limitor <- (Mydf$CumuCost <= 1 & Mydf$CumuCost > .95)
# De-comment to ONLY measure Non-HEAD
#limitor <- (Mydf$CumuCost <= 1 & Mydf$CumuCost > .8)

# Creates a column with HTT segmentation labels
# Creates a dataframe
HTT <- data.frame()
# Populates dataframe according to conditions
HTT <- ifelse(Mydf$CumuCost <= .8,"HEAD",ifelse(Mydf$CumuCost <= .95,"TORSO","TAIL"))
# Add the column to Mydf and rename it HTT
Mydf <- transform(Mydf, HTT = HTT)

# Count all KWs in account by using the dimension function
KWportfolioSize <- dim(Mydf)[1]

# Percent of portfolio
PercentofPortfolio <- sum(limitor)/KWportfolioSize

# Length of Keyword -- TOO SLOW
# Uses the Tau package
# My function takes the row number and returns the number of tokens
library(tau)
Myfun = function(n) {
  sum(sapply(Mydf$Keyword.text[n], textcnt, split = "[[:space:][:punct:]]+", method = "string", n = 1L))}
# Creates a dataframe to hold the results
Token.Count <- data.frame()
# Loops until last row and store it in data.frame
for (i in c(1:dim(Mydf)[1])) {Token.Count <- rbind(Token.Count,Myfun(i))}
# Add the column to Mydf
Mydf <- transform(Mydf, Token.Count = Token.Count)
# Not quite sure why but the column needs renaming in this case
colnames(Mydf)[dim(Mydf)[2]] <- "Token.Count"

1 Answer

Editorial Team · Answer 1 · 2026-05-18T20:06:56+00:00

Preallocate your storage before filling it in the loop. Never concatenate or rbind()/cbind() objects inside a loop, as you are doing now. R has to copy the object and allocate more storage at each iteration, and that overhead cripples your code.

Create Token.Count with enough rows and columns and fill it in the loop. Something like:

Token.Count <- matrix(ncol = ?, nrow = nrow(Mydf))
for (i in seq_len(nrow(Mydf))) {
    Token.Count[i, ] <- Myfun(i)
}
Token.Count <- data.frame(Token.Count)

Sorry I can’t be more specific, but I don’t know how many columns Myfun returns.
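
As posted, Myfun() wraps its result in sum(), so it appears to return a single number per row. If that holds, a plain vector is enough and no matrix is needed. The following is a minimal sketch under that assumption; count_one() is a hypothetical base-R stand-in for Myfun() so the example runs without the tau package, and keywords is sample data rather than the real file.

```r
# Hypothetical stand-in for Myfun(): returns one integer per keyword.
count_one <- function(x) {
  length(strsplit(x, split = "[[:space:][:punct:]]+")[[1]])
}

# Sample data for illustration only.
keywords <- c("north+face+outlet", "kinect sensor")

# Preallocate a vector of the final length, then fill it by index.
token_count <- integer(length(keywords))
for (i in seq_along(keywords)) {
  token_count[i] <- count_one(keywords[i])
}

# vapply() handles the allocation itself and checks that every
# result is a single integer.
token_count_v <- vapply(keywords, count_one, integer(1), USE.NAMES = FALSE)
```

Both versions allocate the result once, which avoids the repeated copying that rbind() inside the loop causes.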

Update 1: Having taken a look at textcnt, I think you can probably avoid the loop altogether. You have something like this data frame

DF <- data.frame(CumuCost = c(0.00439, 0.0067), Cost = c(1678, 880),
                 Keyword.text = c("north+face+outlet", "kinect sensor"),
                 stringsAsFactors = FALSE)

If we strip out the Keywords, and convert it to a list

keywrds <- with(DF, as.list(Keyword.text))
head(keywrds)

Then we can call textcnt recursively on this list to count the words in each list component:

countKeys <- textcnt(keywrds, split = "[[:space:][:punct:]]+", method = "string",
                     n = 1L, recursive = TRUE)
head(countKeys)

The above is almost what you had, except that I added recursive = TRUE to treat each of the input vectors separately. The final step is to sapply the sum function to countKeys to get the number of words:

> sapply(countKeys, sum)
[1] 3 2

Which appears to be what you are trying to achieve with the loop and the function. Have I got this right?

Update 2: OK, if having fixed the preallocation issue and used textcnt in a vectorized way still isn’t quite as quick as you would like, we can investigate other ways of counting words. It could well be possible that you don’t need all the functionality of textcnt to do what you want. [I can’t check if the solution below will work for all your data, but it is a lot quicker.]

One potential solution is to split the Keyword.text vector into words using the inbuilt strsplit function, for example using keywrds generated above and only the first element:

> length(unlist(strsplit(keywrds[[1]], split = "[[:space:][:punct:]]+")))
[1] 3

To use this idea it is perhaps easier to wrap it in a user function:

fooFun <- function(x) {
    length(unlist(strsplit(x, split = "[[:space:][:punct:]]+"),
                  use.names = FALSE, recursive = FALSE))
}

which we can then apply to the keywrds list:

> sapply(keywrds, fooFun)
[1] 3 2

For this simple example data set we get the same result. What about compute time? First for the solution using textcnt, combining two of the steps from Update 1:

> system.time(replicate(10000, sapply(textcnt(keywrds, 
+                                     split = "[[:space:][:punct:]]+", 
+                                     method = "string", n = 1L, 
+                                     recursive = TRUE), sum)))
   user  system elapsed 
  4.165   0.026   4.285

and then for the solution in Update 2:

> system.time(replicate(10000, sapply(keywrds, fooFun)))
   user  system elapsed 
  0.883   0.001   0.889

So even for this small sample, there is a considerable overhead involved in calling textcnt, but whether this difference holds when applying both approaches to the full data set remains to be seen.

Finally, we should note that the strsplit approach can be vectorised to work directly on the vector Keyword.text in DF:

> sapply(strsplit(DF$Keyword.text, split = "[[:space:][:punct:]]+"), length)
[1] 3 2

which gives the same results as the other two approaches, and is marginally faster than the non-vectorized use of strsplit:

> system.time(replicate(10000, sapply(strsplit(DF$Keyword.text, 
+                              split = "[[:space:][:punct:]]+"), length)))
   user  system elapsed 
  0.732   0.001   0.734

Are any of these faster on your full data set?
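
To get the counts back into the data frame, the vectorized result can be assigned directly as a new column, with no loop, rbind() or renaming step. A minimal sketch using the two-row DF example from this answer; lengths() is a base R function (R 3.2.0 or later) that should be equivalent here to sapply(..., length).

```r
DF <- data.frame(CumuCost = c(0.00439, 0.0067), Cost = c(1678, 880),
                 Keyword.text = c("north+face+outlet", "kinect sensor"),
                 stringsAsFactors = FALSE)

# Split every keyword once, then count the pieces per keyword.
tokens <- strsplit(DF$Keyword.text, split = "[[:space:][:punct:]]+")
DF$Token.Count <- lengths(tokens)

DF$Token.Count
```

In the original script the same two lines would operate on Mydf$Keyword.text, assuming that column is a character vector, as it is when the file is read with stringsAsFactors = FALSE.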

Minor Update: replicating DF to give 130 rows of data and timing the three approaches suggests that the last (vectorized strsplit()) scales better:

> DF2 <- do.call(rbind, replicate(65, DF, simplify = FALSE))
> dim(DF2)
[1] 130   3
> keywrds2 <- with(DF2, as.list(Keyword.text))
> system.time(replicate(10000, sapply(textcnt(keywrds2, split = "[[:space:][:punct:]]+", method = "string", n = 1L, recursive = TRUE), sum)))
   user  system elapsed 
238.266   1.790 241.404
> system.time(replicate(10000, sapply(keywrds2, fooFun)))
   user  system elapsed 
 28.405   0.007  28.511
> system.time(replicate(10000, sapply(strsplit(DF2$Keyword.text,split = "[[:space:][:punct:]]+"), length)))
   user  system elapsed 
  7.497   0.011   7.528
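
One thing worth checking before relying on the strsplit() count for the full file: strsplit() produces an empty first element when a string begins with a separator, and it returns NA (an element of length 1) for a missing value, so such keywords would be over-counted. The sketch below is one possible guard, not a drop-in replacement for textcnt; safe_token_count() is a hypothetical helper name, and the test strings are invented for illustration.

```r
safe_token_count <- function(x) {
  sep <- "[[:space:][:punct:]]+"
  # Drop leading separators so strsplit() does not emit an empty first token.
  trimmed <- sub(paste0("^", sep), "", x)
  counts <- lengths(strsplit(trimmed, split = sep))
  # strsplit() yields NA for missing input; report NA rather than 1.
  counts[is.na(x)] <- NA_integer_
  counts
}

# Invented examples: a normal keyword, a leading "+", an empty string, a missing value.
safe_token_count(c("north+face+outlet", "+kinect sensor", "", NA))
```

Whether the real keyword column contains such cases is unknown, so it may be simplest to compare this against the plain strsplit() count on a sample of rows first.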
