The Archive Base

Editorial Team
Asked: May 18, 2026 (2026-05-18T20:06:55+00:00)

I have a CSV that starts with 3 columns. A cumulative percentage column of

I have a CSV that starts with 3 columns: a cumulative percentage column of costs, a Cost column, and a keyword column. The R script works for small files but never finishes when I feed it the actual file, which has a million rows. Can you help me make this script more efficient? The Token.Count column is the one I'm having trouble creating. Thank you!

# Token Histogram

# Import CSV data from Report Downloader API Feed
Mydf <- read.csv("Output_test.csv.csv", sep=",", header = TRUE, stringsAsFactors=FALSE)

# Helps limit the data frame according to the HTT (HEAD/TORSO/TAIL) segment
# Change number to:
# .99 for big picture
# .8 for HEAD
limitor <- Mydf$CumuCost <= .8
# De-comment to ONLY measure TORSO
#limitor <- (Mydf$CumuCost <= .95 & Mydf$CumuCost > .8)
# De-comment to ONLY measure TAIL
#limitor <- (Mydf$CumuCost <= 1 & Mydf$CumuCost > .95)
# De-comment to ONLY measure Non-HEAD
#limitor <- (Mydf$CumuCost <= 1 & Mydf$CumuCost > .8)

# Creates a column with HTT segmentation labels
# Creates a dataframe
HTT <- data.frame()
# Populates dataframe according to conditions
HTT <- ifelse(Mydf$CumuCost <= .8,"HEAD",ifelse(Mydf$CumuCost <= .95,"TORSO","TAIL"))
# Add the column to Mydf and rename it HTT
Mydf <- transform(Mydf, HTT = HTT)

# Count all KWs in account by using the dimension function
KWportfolioSize <- dim(Mydf)[1]

# Percent of portfolio
PercentofPortfolio <- sum(limitor)/KWportfolioSize

# Length of Keyword -- TOO SLOW
# Uses the Tau package
# My function takes the row number and returns the number of tokens
library(tau)
Myfun = function(n) {
  sum(sapply(Mydf$Keyword.text[n], textcnt, split = "[[:space:][:punct:]]+", method = "string", n = 1L))}
# Creates a dataframe to hold the results
Token.Count <- data.frame()
# Loops until last row and store it in data.frame
for (i in c(1:dim(Mydf)[1])) {Token.Count <- rbind(Token.Count,Myfun(i))}
# Add the column to Mydf
Mydf <- transform(Mydf, Token.Count = Token.Count)
# Not quite sure why but the column needs renaming in this case
colnames(Mydf)[dim(Mydf)[2]] <- "Token.Count"

1 Answer

  1. Editorial Team
    Added an answer on May 18, 2026 at 8:06 pm (2026-05-18T20:06:56+00:00)

    Preallocate your storage before filling it in the loop. Never grow an object inside a loop by concatenating or by calling rbind() or cbind() on it, as you are doing now. R has to allocate more storage and copy the object at each iteration, and that overhead cripples your code.

    Create Token.Count with enough rows and columns and fill it in the loop. Something like:

    Token.Count <- matrix(ncol = ?, nrow = nrow(Mydf))
    for (i in seq_len(nrow(Mydf))) {
        Token.Count[i, ] <- Myfun(i)
    }
    Token.Count <- data.frame(Token.Count)
    

    Sorry I can’t be more specific, but I don’t know how many columns Myfun returns.
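
As a minimal sketch of the preallocation pattern, assuming Myfun returns a single number per row (its outer sum() suggests so, but that is an assumption), a plain numeric vector is enough. The toy data frame and the helper count_row below are hypothetical stand-ins so the example runs without the tau package:

```r
# Toy stand-in for the question's data; only Keyword.text is needed here.
Mydf <- data.frame(Keyword.text = c("north+face+outlet", "kinect sensor"),
                   stringsAsFactors = FALSE)

# Hypothetical stand-in for Myfun: returns ONE number for row i (assumption).
count_row <- function(i) {
    length(strsplit(Mydf$Keyword.text[i], "[[:space:][:punct:]]+")[[1]])
}

# Preallocate a vector of the final length, then fill it in place.
Token.Count <- numeric(nrow(Mydf))
for (i in seq_len(nrow(Mydf))) {
    Token.Count[i] <- count_row(i)
}

# Attach the finished vector as a column in a single step.
Mydf$Token.Count <- Token.Count
print(Mydf)
```

Because the column is assigned by name, no renaming with colnames() should be needed afterwards.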


    Update 1: Having taken a look at textcnt, I think you can probably avoid the loop altogether. You have something like this data frame:

    DF <- data.frame(CumuCost = c(0.00439, 0.0067), Cost = c(1678, 880),
                     Keyword.text = c("north+face+outlet", "kinect sensor"),
                     stringsAsFactors = FALSE)
    

    If we strip out the Keywords, and convert it to a list

    keywrds <- with(DF, as.list(Keyword.text))
    head(keywrds)
    

    Then we can call textcnt recursively on this list to count the words in each list component:

    countKeys <- textcnt(keywrds, split = "[[:space:][:punct:]]+", method = "string",
                         n = 1L, recursive = TRUE)
    head(countKeys)
    

    The above is almost what you had, except I added recursive = TRUE to treat each of the input vectors separately. The final step is to sapply the sum function to countKeys to get the number of words:

    > sapply(countKeys, sum)
    [1] 3 2
    

    Which appears to be what you are trying to achieve with the loop and the function. Have I got this right?


    Update 2: OK, if having fixed the preallocation issue and used textcnt in a vectorized way still isn’t quite as quick as you would like, we can investigate other ways of counting words. It could well be possible that you don’t need all the functionality of textcnt to do what you want. [I can’t check if the solution below will work for all your data, but it is a lot quicker.]

    One potential solution is to split the Keyword.text vector into words using the inbuilt strsplit function, for example using keywrds generated above and only the first element:

    > length(unlist(strsplit(keywrds[[1]], split = "[[:space:][:punct:]]+")))
    [1] 3
    

    To use this idea it is perhaps easier to wrap it in a user function:

    fooFun <- function(x) {
        length(unlist(strsplit(x, split = "[[:space:][:punct:]]+"),
                      use.names = FALSE, recursive = FALSE))
    }
    

    which we can then apply to the keywrds list:

    > sapply(keywrds, fooFun)
    [1] 3 2
    

    For this simple example data set we get the same result. What about compute time? First for the solution using textcnt, combining two of the steps from Update 1:

    > system.time(replicate(10000, sapply(textcnt(keywrds, 
    +                                     split = "[[:space:][:punct:]]+", 
    +                                     method = "string", n = 1L, 
    +                                     recursive = TRUE), sum)))
       user  system elapsed 
      4.165   0.026   4.285
    

    and then for the solution in Update 2:

    > system.time(replicate(10000, sapply(keywrds, fooFun)))
       user  system elapsed 
      0.883   0.001   0.889
    

    So even for this small sample, there is a considerable overhead involved in calling textcnt, but whether this difference holds when applying both approaches to the full data set remains to be seen.

    Finally, we should note that the strsplit approach can be vectorised to work directly on the vector Keyword.text in DF:

    > sapply(strsplit(DF$Keyword.text, split = "[[:space:][:punct:]]+"), length)
    [1] 3 2
    

    which gives the same results as the other two approaches, and is marginally faster than the non-vectorized use of strsplit:

    > system.time(replicate(10000, sapply(strsplit(DF$Keyword.text, 
    +                              split = "[[:space:][:punct:]]+"), length)))
       user  system elapsed 
      0.732   0.001   0.734
    

    Are any of these faster on your full data set?
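
A possible way to write the vectorized strsplit() result straight into the data frame is sketched below. The helper name count_tokens and the toy data are hypothetical. One thing to watch for: strsplit() returns an empty first element when a string begins with a delimiter, so the sketch counts only non-empty tokens. Whether that matches what textcnt reports for such keywords is not checked here.

```r
# Hypothetical helper: number of non-empty tokens in each element of x.
count_tokens <- function(x, split = "[[:space:][:punct:]]+") {
    tokens <- strsplit(x, split = split)
    # A keyword that starts with a delimiter yields a leading "" token;
    # nzchar() filters those out before counting.
    vapply(tokens, function(tok) sum(nzchar(tok)), integer(1))
}

# Toy data mirroring the example data frame used in this answer.
Mydf <- data.frame(Keyword.text = c("north+face+outlet", "kinect sensor"),
                   stringsAsFactors = FALSE)

# One vectorized call fills the whole column; no explicit loop is needed.
Mydf$Token.Count <- count_tokens(Mydf$Keyword.text)
print(Mydf)
```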

    Minor Update: replicating DF to give 130 rows of data and timing the three approaches suggests that the last (vectorized strsplit()) scales better:

    > DF2 <- DF[rep(seq_len(nrow(DF)), times = 65), ]  # 65 copies of the 2-row DF
    > keywrds2 <- with(DF2, as.list(Keyword.text))
    > dim(DF2)
    [1] 130   3
    > system.time(replicate(10000, sapply(textcnt(keywrds2, split = "[[:space:][:punct:]]+", method = "string", n = 1L, recursive = TRUE), sum)))
       user  system elapsed 
    238.266   1.790 241.404
    > system.time(replicate(10000, sapply(keywrds2, fooFun)))
       user  system elapsed 
     28.405   0.007  28.511
    > system.time(replicate(10000, sapply(strsplit(DF2$Keyword.text,split = "[[:space:][:punct:]]+"), length)))
       user  system elapsed 
      7.497   0.011   7.528
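
To get a rough feel for how the vectorized strsplit() approach behaves on a larger input without the real file, one option is to time it on synthetic keywords. This is only a sketch: the vocabulary, the size n and the way keywords are built are all made up for illustration, and timings will differ by machine.

```r
set.seed(1)   # reproducible synthetic data
n <- 1e5      # assumed size; raise towards 1e6 to mimic the real file
words <- c("north", "face", "outlet", "kinect", "sensor")  # made-up vocabulary

# Each synthetic keyword has 1 to 4 words joined by a space or a plus sign.
n_words <- sample(1:4, n, replace = TRUE)
keywords <- vapply(n_words, function(k) {
    paste(sample(words, k, replace = TRUE), collapse = sample(c(" ", "+"), 1))
}, character(1))

# Time a single vectorized pass over all keywords.
timing <- system.time(
    counts <- lengths(strsplit(keywords, split = "[[:space:][:punct:]]+"))
)
print(timing)
```

lengths() is the base R shortcut for sapply(x, length), so it performs the same count as the last approach timed above.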
    
© 2021 The Archive Base. All Rights Reserved