The Archive Base

Editorial Team
Asked: May 31, 2026

First off, sorry about the long post. Figured it’s better to give context to get good answers (I hope!). Some time ago I wrote an R function that gets all pairwise interactions of variables in a data frame. This worked fine at the time, but now a colleague would like me to do this with a much larger dataset. They don’t know how many variables they are going to have in the end, but they are guessing approximately 2,500 to 3,000. My function below is way too slow for this (4 minutes for 100 variables). At the bottom of this post I have included some timings for various numbers of variables and total numbers of interactions. I have the results of calling Rprof() on the 100-variable run of my function, so if anyone wants to take a look at it, let me know. I don’t want to make a long post any longer than it needs to be.

What I’d like to know is whether there is anything I can do to speed this function up. I tried going directly to glm.fit, but as far as I understand, for that to be useful the computation of the design matrices (and all of the other stuff that I frankly don’t understand) needs to be the same for each model, which is not the case for my analysis, although perhaps I am wrong about this.

Any ideas on how to make this run faster would be greatly appreciated. I am planning on using parallelization to run the analysis in the end. I don’t know how many CPUs I am going to have access to, but I’d say it won’t be more than 8.

Thanks in advance,
Cheers
Davy.

getInteractions2 = function(data, fSNPcol, ccCol)
{
  # fSNPcol is the number of the column that contains the first SNP
  # ccCol is the number of the column that contains the outcome variable
  library(lmtest)
  snps = names(data)[fSNPcol:ncol(data)]
  names(data)[ccCol] = "PHENOTYPE"
  terms = as.data.frame(t(combn(snps, 2)), stringsAsFactors = FALSE)

  # Preallocate the p-value vector instead of growing it inside the loop
  pval = numeric(nrow(terms))

  for(i in seq_len(nrow(terms)))
  {
    # Build the formulas from the column names and pass data explicitly,
    # which avoids attach()/get(); backticks guard unusual column names
    v1 = paste0("`", terms$V1[i], "`")
    v2 = paste0("`", terms$V2[i], "`")
    fit1 = glm(as.formula(paste("PHENOTYPE ~", v1, "+", v2)), family = "binomial", data = data)
    fit2 = glm(as.formula(paste("PHENOTYPE ~", v1, "*", v2)), family = "binomial", data = data)
    # Likelihood-ratio test of the interaction term
    pval[i] = lrtest(fit1, fit2)[2, "Pr(>Chisq)"]
  }

  results = cbind(terms, pval)
  return(results)
}

The table below shows the system.time() results for increasing numbers of variables passed through the function. n is the number of variables, and Ints is the number of pairwise interactions given by that number of variables.

        n  Ints  user.self  sys.self  elapsed
time   10    45       1.20      0.00     1.30
time   15   105       3.40      0.00     3.43
time   20   190       6.62      0.00     6.85
...
time   90  4005     178.04      0.07   195.84
time   95  4465     199.97      0.13   218.30
time  100  4950     221.15      0.08   242.18

Some code to reproduce a data frame in case you want to look at timings or the Rprof() results. Please don’t run this unless your machine is super fast, or you’re prepared to wait for about 15-20 minutes.

# Simulate 2000 subjects with a binary status and 3000 SNPs coded 0/1/2
df = data.frame(paste("sid",1:2000,sep=""),rbinom(2000,1,.5))
gtypes = matrix(sample(0:2, 2000*3000, replace=TRUE), nrow=2000, ncol=3000)
snps = paste("rs", 1000:3999,sep="")
df = cbind(df,gtypes)
names(df) = c("sid", "status", snps)

times = c()
for(i in seq(10,100, by=5)){
  if(i==100){Rprof()}
  time = system.time((pvals = getInteractions2(df[,1:i], 3, 2)))
  print(time)
  times = rbind(times, time)
  if(i==100){Rprof(NULL)}
}
numI = function(n){return(((n^2)-n)/2)}
timings = cbind(seq(10,100,by=5), sapply(seq(10,100,by=5), numI),times)
1 Answer

  1. Editorial Team
    Added an answer on May 31, 2026 at 9:39 am

    So I have sort of solved this (with help from the R mailing lists) and am posting it in case it’s useful to anyone.

    Basically, where the SNPs or variables are independent (i.e. not in LD, not correlated) you can centre each SNP/variable at its mean like so:

    rs1cent <- rs1 - mean(rs1)
    rs2cent <- rs2 - mean(rs2)
    

    You can then test for correlation between the phenotype and the interaction as a screening step:

    rs12interaction <- rs1cent*rs2cent
    cor(PHENOTYPE, rs12interaction)
    

    You can then fully investigate any pairs that seem to be correlated using the full glm. The choice of cut-off is, as ever, arbitrary.
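The screening step above can be applied to every pair at once. The following is only a rough sketch under stated assumptions: the genotypes sit in a numeric matrix with column names, the helper names `screenInteractions` and `refitPair` are hypothetical, the cut-off of 0.1 is arbitrary, and the follow-up test uses `anova(..., test = "Chisq")` from base R rather than `lrtest` from lmtest.

```r
# Hypothetical helpers; only base R / stats functions are used.

# Screen every SNP pair by correlating the phenotype with the product of
# the two mean-centred SNPs. Assumes geno is a numeric matrix with colnames.
screenInteractions <- function(geno, phenotype, cutoff = 0.1) {
  cent <- scale(geno, center = TRUE, scale = FALSE)  # centre each column
  p <- ncol(cent)
  res <- lapply(seq_len(p - 1), function(j) {
    k <- (j + 1):p
    # Multiply SNP j into every later SNP (the vector recycles down each column)
    prods <- cent[, k, drop = FALSE] * cent[, j]
    data.frame(V1 = colnames(geno)[j], V2 = colnames(geno)[k],
               cor = as.vector(cor(phenotype, prods)),
               stringsAsFactors = FALSE)
  })
  res <- do.call(rbind, res)
  # which() drops NA correlations (e.g. from a constant product column)
  res[which(abs(res$cor) >= cutoff), ]
}

# Full likelihood-ratio test of the interaction for one shortlisted pair
refitPair <- function(geno, phenotype, v1, v2) {
  x1 <- geno[, v1]
  x2 <- geno[, v2]
  fit1 <- glm(phenotype ~ x1 + x2, family = binomial)
  fit2 <- glm(phenotype ~ x1 * x2, family = binomial)
  anova(fit1, fit2, test = "Chisq")[2, "Pr(>Chi)"]
}

# Small simulated example (made-up data with a planted rs1 x rs2 interaction)
set.seed(1)
n <- 500
geno <- matrix(sample(0:2, n * 20, replace = TRUE), nrow = n,
               dimnames = list(NULL, paste0("rs", 1:20)))
eta <- -0.5 + 1.0 * (geno[, "rs1"] - 1) * (geno[, "rs2"] - 1)
phenotype <- rbinom(n, 1, plogis(eta))

hits <- screenInteractions(geno, phenotype, cutoff = 0.1)
hits$pval <- vapply(seq_len(nrow(hits)),
                    function(i) refitPair(geno, phenotype, hits$V1[i], hits$V2[i]),
                    numeric(1))
print(hits)
```

Only the pairs that survive the correlation screen are passed to glm, so the expensive model fits are limited to the shortlist. How well the screen tracks the full test depends on the independence assumption mentioned above.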

    Other suggestions were to use a Rao score test, which involves fitting only the null hypothesis model, thus halving the computation time for this step, but I don’t really understand how this works (yet! More reading required).
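For what it is worth, one way the score test idea might look in R is sketched below. This is an assumption about how the suggestion could be implemented, not the code from the mailing list: the function name `scoreTestInteraction` is hypothetical. The sketch fits only the null (main effects) model iteratively, then runs a single weighted least-squares regression of that model's working residuals on the design matrix that includes the interaction column.

```r
# Hypothetical helper: Rao score test for the x1:x2 interaction in a
# logistic model, using only the fitted null model y ~ x1 + x2.
scoreTestInteraction <- function(y, x1, x2) {
  fit0 <- glm(y ~ x1 + x2, family = binomial)  # the only iterative fit
  r <- fit0$residuals   # working residuals of the null model
  w <- fit0$weights     # working weights of the null model
  # Design matrix of the alternative model, built by hand
  X <- cbind(1, x1, x2, x1 * x2)
  zz <- lm.wfit(X, r, w)                       # one weighted least-squares fit
  rss <- sum(w * zz$residuals^2)
  tss <- sum(w * (r - weighted.mean(r, w))^2)
  stat <- tss - rss                            # score statistic, 1 df here
  pchisq(stat, df = 1, lower.tail = FALSE)
}

# Made-up data for illustration
set.seed(42)
n <- 1000
x1 <- sample(0:2, n, replace = TRUE)
x2 <- sample(0:2, n, replace = TRUE)
y <- rbinom(n, 1, 0.5)

p_score <- scoreTestInteraction(y, x1, x2)
print(p_score)

# Cross-check against base R, which needs both fitted models
fit0 <- glm(y ~ x1 + x2, family = binomial)
fit1 <- glm(y ~ x1 * x2, family = binomial)
print(anova(fit0, fit1, test = "Rao"))
```

The saving comes from replacing the second iterative glm fit with a single least-squares step. Whether that amounts to a full halving of the run time will depend on the data and on how much of the time goes into formula handling rather than fitting.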

    Anyway, there you go. It may be of use to someone someday.
