Sign Up

Sign Up to our social questions and Answers Engine to ask questions, answer people’s questions, and connect with other people.

Have an account? Sign In

Have an account? Sign In Now

Sign In

Login to our social questions & Answers Engine to ask questions answer people’s questions & connect with other people.

Sign Up Here

Forgot Password?

Don't have account, Sign Up Here

Forgot Password

Lost your password? Please enter your email address. You will receive a link and will create a new password via email.

Have an account? Sign In Now

You must login to ask a question.

Forgot Password?

Need An Account, Sign Up Here

Please briefly explain why you feel this question should be reported.

Please briefly explain why you feel this answer should be reported.

Please briefly explain why you feel this user should be reported.

Sign InSign Up

The Archive Base

The Archive Base Logo The Archive Base Logo

The Archive Base Navigation

  • SEARCH
  • Home
  • About Us
  • Blog
  • Contact Us
Search
Ask A Question

Mobile menu

Close
Ask a Question
  • Home
  • Add group
  • Groups page
  • Feed
  • User Profile
  • Communities
  • Questions
    • New Questions
    • Trending Questions
    • Must read Questions
    • Hot Questions
  • Polls
  • Tags
  • Badges
  • Buy Points
  • Users
  • Help
  • Buy Theme
  • SEARCH
Home/ Questions/Q 7729317
In Process

The Archive Base Latest Questions

Editorial Team
  • 0
Editorial Team
Asked: June 1, 20262026-06-01T05:56:19+00:00 2026-06-01T05:56:19+00:00

I’m writing a function in R to find formality statistics (a linguistic measure) on

  • 0

I’m writing a function in R to find formality statistics (a linguistic measure) on typed dialogue. I use openNLP‘s parts of speech tagger to tag words (amazing tool but slow because it’s doing some heavy duty stuff). Anyway time is an issue already with this function and I’m running into an issue that I want to make sur eruns a quickly as possible. I started thinking in convoluted terms and knew I needed some collective group think on this.

I have a list of vectors with tags in them like this:

G 
[[1]]
[1] "MD"  "DT"  "NN"  "VB"  "VBG" "TO"  "POS"

[[2]]
[1] "DT" "NN" "JJ" "RB"

[[3]]
[1] "RB"  "TO"  "PRP"

[[4]]
[1] "VBZ" "PRP" "VBG" "RB"  "TO"  "NN" 

[[5]]
[1] "NN" "NN"

For each vector I want to count the frequency of occurrences of all possible tags (a zero will be inserted of a vector does not contain a tag) and generate a data frame structure like this below:

  DT  JJ  MD  NN  POS PRP RB  TO  VB  VBG VBZ
1  1   0   1   1    1   0  0   1   1    1   0
2  1   1   0   1    0   0  1   0   0    0   0
3  0   0   0   0    0   1  1   1   0    0   0
4  0   0   0   1    0   1  1   1   1    1   1
5  0   0   0   2    0   0  0   0   0    0   0

I’ve put my begining thinking around it below as well as the fake data set. I initially thought to go with table on this but I’m not sure 9as I know this is slower than say the use of rle or match or indexing [ if any of these can be used. I also thought about using Reduce with merge on these vectors to do a multi merge, but know that the higher order functions in R may be slower than other methods (perhaps this can be done with some sweet indexing).

Any way I’d greatly appreciate help on this problem. The two paremeters I’m looking for are:

  1. A base solution
  2. Speed

The data and my initial thinking (table may be the wrong way to go:

G <- list(c("MD", "DT", "NN", "VB", "VBG", "TO", "POS"), c("DT", "NN", 
"JJ", "RB"), c("RB", "TO", "PRP"), c("VBZ", "PRP", "VBG", "RB", 
"TO", "NN"), c("NN", "NN"))

P <- lapply(G, function(x) table(sort(x)))  #to get frequencies on each word
sort(unique(names(unlist(P))))  #to get the column names and number

Apologies for the thread name as this one is a hard one to classify.

EDIT: (added bench marking results)

Very creative answers. I didn’t even think about the factor solution and specifying the levels. Smart. For speed Joran’s second answer winds (I just added the column names back using you’re already created lev. mdsummer’s response was the least amount of code and was tied-ish for second with speed. I’ll go with Joran’s second response as it will get me the best speed boost. Thank you all! Much appreciation 🙂 Comparison available as a gist https://gist.github.com/trinker/91802b8c4ba759034881

       expr        min         lq      mean     median        uq       max neval
   JORAN1()  648.04435  689.16756  714.9142  712.59122  732.4991  831.6623   100
   JORAN2()   86.83879   92.91911   98.7068   97.44690  101.6764  177.4228   100
   RINKER()   87.40797   94.07564  100.1154   98.39624  104.0887  177.3146   100
      TIM()  900.65847  964.23419  993.9475  988.89306 1023.0587 1137.6263   100
 MDSUMMER() 1395.95920 1487.45279 1527.3181 1527.92664 1571.0997 1685.3298   100
  • 1 1 Answer
  • 0 Views
  • 0 Followers
  • 0
Share
  • Facebook
  • Report

Leave an answer
Cancel reply

You must login to add an answer.

Forgot Password?

Need An Account, Sign Up Here

1 Answer

  • Voted
  • Oldest
  • Recent
  • Random
  1. Editorial Team
    Editorial Team
    2026-06-01T05:56:20+00:00Added an answer on June 1, 2026 at 5:56 am

    I’d do either this:

    lev <- sort(unique(unlist(G)))
    
    G1 <- do.call(rbind,lapply(G,function(x,lev){ table(factor(x,levels = lev,
                                                         ordered = TRUE))},lev = lev))
    
         DT JJ MD NN POS PRP RB TO VB VBG VBZ
    [1,]  1  0  1  1   1   0  0  1  1   1   0
    [2,]  1  1  0  1   0   0  1  0  0   0   0
    [3,]  0  0  0  0   0   1  1  1  0   0   0
    [4,]  0  0  0  1   0   1  1  1  0   1   1
    [5,]  0  0  0  2   0   0  0  0  0   0   0
    

    or for more speed (but losing the column names):

    G1 <- do.call(rbind,lapply(G,function(x,lev){ tabulate(factor(x,levels = lev,
                                    ordered = TRUE),nbins = length(lev))},lev = lev))
    
    • 0
    • Reply
    • Share
      Share
      • Share on Facebook
      • Share on Twitter
      • Share on LinkedIn
      • Share on WhatsApp
      • Report

Sidebar

Related Questions

I have a string like this: La Torre Eiffel paragonata all&#8217;Everest What PHP function
link Im having trouble converting the html entites into html characters, (&# 8217;) i
I am trying to understand how to use SyndicationItem to display feed which is
I want use html5's new tag to play a wav file (currently only supported
I'm parsing an RSS feed that has an &#8217; in it. SimpleXML turns this
Seemingly simple, but I cannot find anything relevant on the web. What is the
I need a function that will clean a strings' special characters. I do NOT
I want to construct a data frame in an Rcpp function, but when I
I am writing an app with both english and french support. The app requests
I'm trying to use string.replace('’','') to replace the dreaded weird single-quote character: ’ (aka

Explore

  • Home
  • Add group
  • Groups page
  • Communities
  • Questions
    • New Questions
    • Trending Questions
    • Must read Questions
    • Hot Questions
  • Polls
  • Tags
  • Badges
  • Users
  • Help
  • SEARCH

Footer

© 2021 The Archive Base. All Rights Reserved
With Love by The Archive Base

Insert/edit link

Enter the destination URL

Or link to existing content

    No search term specified. Showing recent items. Search or use up and down arrow keys to select an item.