The Archive Base

Editorial Team
Asked: May 21, 2026

This is a saga which began with the problem of how to do survey weighting. Now that I appear to be doing that correctly, I have hit a bit of a wall (see previous post for details on the import process and where the strata variable came from):

> require(foreign)
> ipums <- read.dta('/path/to/data.dta')
> require(survey)
> ipums.design <- svydesign(id=~serial, strata=~strata, data=ipums, weights=~perwt)
Error in if (nbins > .Machine$integer.max) stop("attempt to make a table with >= 2^31 elements") : 
  missing value where TRUE/FALSE needed
In addition: Warning messages:
1: In pd * (as.integer(cat) - 1L) : NAs produced by integer overflow
2: In pd * nl : NAs produced by integer overflow
> traceback()
9: tabulate(bin, pd)
8: as.vector(data)
7: array(tabulate(bin, pd), dims, dimnames = dn)
6: table(ids[, 1], strata[, 1])
5: inherits(x, "data.frame")
4: is.data.frame(x)
3: rowSums(table(ids[, 1], strata[, 1]) > 0)
2: svydesign.default(id = ~serial, weights = ~perwt, strata = ~strata, 
       data = ipums)
1: svydesign(id = ~serial, weights = ~perwt, strata = ~strata, data = ipums)

This error seems to come from the tabulate function, which I hoped would be straightforward enough to circumvent, first by changing .Machine$integer.max

> .Machine$integer.max <- 2^40

and, when that didn’t work, by redefining tabulate with a modified copy of its source:

> tabulate <- function(bin, nbins = max(1L, bin, na.rm=TRUE))
{
    if(!is.numeric(bin) && !is.factor(bin))
        stop("'bin' must be numeric or a factor")
    #if (nbins > .Machine$integer.max)
    if (nbins > 2^40) #replacement line
        stop("attempt to make a table with >= 2^31 elements")
    .C("R_tabulate",
       as.integer(bin),
       as.integer(length(bin)),
       as.integer(nbins),
       ans = integer(nbins),
       NAOK = TRUE,
       PACKAGE="base")$ans
}

Neither attempt worked. Apparently this is one reason why the ff package was created, but what worries me is the extent to which this is a problem I cannot avoid in R. This post seems to indicate that even if I were to use a package that avoids this problem, I would only be able to access 2^31 elements at a time. My hope was to use SQL (either SQLite or PostgreSQL) to get around the memory problems, but I’m afraid I’ll spend a while getting that to work, only to run into the same fundamental limit.
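The two warnings in the transcript appear to come from integer arithmetic inside table(), which needs one cell per (id, stratum) pair, so the product of the two counts can exceed what a 32-bit integer holds before tabulate is ever reached. The sketch below illustrates that overflow with made-up counts (n_ids and n_strata are hypothetical, not taken from the IPUMS extract), and shows why assigning to .Machine at the prompt has no effect on base R: the assignment only creates a modified copy.

```r
# Hypothetical counts, chosen only to illustrate the scale of the problem
n_ids    <- 5000000L   # distinct household ids (serial)
n_strata <- 1000L      # distinct strata

# Integer multiplication overflows to NA (with a warning, suppressed here)
cells_int <- suppressWarnings(n_ids * n_strata)
# Double arithmetic shows the real number of cells the table would need
cells_dbl <- as.numeric(n_ids) * n_strata

is.na(cells_int)                   # the integer product is NA
cells_dbl > .Machine$integer.max   # the real product exceeds the integer range

# Assigning to .Machine only creates a modified copy in the current
# environment; the value that base R itself uses is unchanged
local({
  .Machine$integer.max <- 2^40
})
base::.Machine$integer.max
```

An NA table size would then make the `nbins > .Machine$integer.max` comparison NA as well, which matches the "missing value where TRUE/FALSE needed" error.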

Attempting to switch back to Stata doesn’t solve the problem either. Again see the previous post for how I use svyset, but the calculation I would like to run causes Stata to hang:

svy: mean age, over(strata)

Whether throwing more memory at it will solve the problem, I don’t know. I run R on my desktop, which has 16 GB of RAM, and I use Stata through a Windows server, currently with the memory allocation set to 2000 MB, though I could experiment with increasing that.

So in sum:

  1. Is this a hard limit in R?
  2. Would SQL solve my R problems?
  3. If I split it up into many separate files would that fix it (a lot of work…)?
  4. Would throwing a lot of memory at Stata do it?
  5. Am I seriously barking up the wrong tree somehow?
1 Answer

  1. Editorial Team
    Added an answer on May 21, 2026 at 3:29 am

    Both @Gavin and @Martin deserve credit for this answer, or at least leading me in the right direction. I’m mostly answering it separately to make it easier to read.

    In the order I asked:

    1. Yes, 2^31 is a hard limit in R, though it seems to matter what type the data is. That is a bit strange, given that the stated problem is the length of the vector rather than the amount of memory (which I have plenty of). Do not convert strata or id variables to factors: that just fixes their number of levels and nullifies the effect of subsetting (which is the way to get around this problem).

    2. SQL could probably help, provided I learn how to use it correctly. I did the following test:

      library(survey)
      library(multicore) # make svy fast!

      # Rhode Island (FIPS 44) and New York (FIPS 36) together
      ri.ny <- subset(ipums, statefips_num %in% c(36, 44))
      ri.ny.design <- svydesign(id=~serial, weights=~perwt, strata=~strata, data=ri.ny)
      svyby(~incwage, ~strata, ri.ny.design, svymean, na.rm=TRUE, multicore=TRUE)

      # Rhode Island on its own
      ri <- subset(ri.ny, statefips_num==44)
      ri.design <- svydesign(id=~serial, weights=~perwt, strata=~strata, data=ri)
      ri.mean <- svymean(~incwage, ri.design, na.rm=TRUE)

      # New York on its own
      ny <- subset(ri.ny, statefips_num==36)
      ny.design <- svydesign(id=~serial, weights=~perwt, strata=~strata, data=ny)
      ny.mean <- svymean(~incwage, ny.design, na.rm=TRUE)
      

      I found the means to be the same, which seems like a reasonable test.

      So, in theory, provided I can split up the calculation using either plyr or SQL, the results should still be fine.

    3. See 2.

    4. Throwing a lot of memory at Stata definitely helps, but now I’m running into annoying formatting issues. I seem to be able to perform most of the calculation I want (much quicker and with more stability as well) but I can’t figure out how to get it into the form I want. Will probably ask a separate question on this. I think the short version here is that for big survey data, Stata is much better out of the box.

    5. In many ways, yes. Trying to do analysis with data this big is not something I should have taken on lightly, and I’m far from figuring it out even now. I was using the svydesign function correctly, but I didn’t really know what was going on. I have a (very slightly) better grasp now, and it’s heartening to know I was generally correct about how to solve the problem. @Gavin’s general suggestion of trying out small data with external results to compare against is invaluable, and something I should have started doing ages ago. Many thanks to both @Gavin and @Martin.
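Points 1 and 2 above can be illustrated on toy data with base R only. This is a minimal sketch, not the analysis itself: the data frame `toy`, its values, and the helper `strata_means` are all made up, and it assumes strata are nested within states. It checks point estimates only (weighted means); it says nothing about whether standard errors survive the split, which depends on the survey design.

```r
# Toy data: every name and value here is made up for illustration
set.seed(1)
toy <- data.frame(
  state   = rep(c(36L, 44L), each = 6),
  strata  = rep(c(101L, 102L, 201L, 202L), each = 3),
  perwt   = runif(12, 50, 150),
  incwage = round(runif(12, 1e4, 9e4))
)

# Weighted mean of incwage within each stratum of a data frame
strata_means <- function(d) {
  sapply(split(d, d$strata), function(g) weighted.mean(g$incwage, g$perwt))
}

# Whole data set at once versus one state at a time
full   <- strata_means(toy)
chunks <- lapply(split(toy, toy$state), strata_means)
pieced <- unlist(unname(chunks))

# The per-stratum means agree either way
all.equal(full[names(pieced)], pieced)

# A factor remembers every level even after subsetting, so anything
# that tabulates over its levels does not get smaller
f   <- factor(toy$strata)
sub <- f[toy$state == 44L]
nlevels(sub)              # all four levels are still present
nlevels(droplevels(sub))  # only the two levels actually in the subset
```

Keeping strata and id variables as plain integers, as in `toy`, means a subset carries only the values it actually contains.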

© 2021 The Archive Base. All Rights Reserved