Here is my problem. I have a dataset with 200k rows. Each row corresponds

Editorial Team

Asked: June 4, 2026 (2026-06-04T00:02:26+00:00)

Here is my problem. I have a dataset with 200k rows.

Each row corresponds to a test conducted on a subject.
Subjects have unequal numbers of tests.
Each test is dated.

I want to assign an index to each test. For example, the first test of subject 1 would be 1 and the second test of subject 1 would be 2; the first test of subject 2 would be 1, and so on.

My strategy is to get a list of unique Subject IDs, use lapply to subset the dataset into a list of dataframes using the unique Subject IDs, with each Subject having his/her own dataframe with the tests conducted. Ideally I would then be able to sort each dataframe of each subject and assign an index for each test.

However, doing this over a 200k x 32 dataframe made my laptop (i5, Sandy Bridge, 4 GB RAM) run out of memory quite quickly.

I have 2 questions:

1. Is there a better way to do this?
2. If there is not, my only option to overcome the memory limit is to break my unique SubjectID list into smaller sets (say, 1000 SubjectIDs per list), lapply each set through the dataset, and join the resulting lists together at the end. In that case, how do I write a function that breaks my SubjectID list given an integer number of partitions? For example, BreakPartition(Dataset, 5) would break the dataset into 5 equal partitions.

Here is code to generate some dummy data:

UniqueSubjectID <- sapply(1:500, function(i) paste(letters[sample(1:26, 5, replace = TRUE)], collapse =""))
UniqueSubjectID <- subset(UniqueSubjectID, !duplicated(UniqueSubjectID))
Dataset <- data.frame(SubID = sample(sapply(1:500, function(i) paste(letters[sample(1:26, 5, replace = TRUE)], collapse ="")),5000, replace = TRUE))
Dates <- sample(c(dates = format(seq(ISOdate(2010,1,1), by='day', length=365), format='%d.%m.%Y')), 5000, replace = TRUE)
Dataset <- cbind(Dataset, Dates)

1 Answer

Editorial Team · Answer 1 · 2026-06-04T00:02:27+00:00

I would guess that the splitting/lapply is what is using up the memory. You should consider a more vectorized approach. Starting with a slightly modified version of your example code:

n <- 200000
UniqueSubjectID <- replicate(500, paste(letters[sample(26, 5, replace=TRUE)], collapse =""))
UniqueSubjectID <- unique(UniqueSubjectID)
Dataset <- data.frame(SubID = sample(UniqueSubjectID , n, replace = TRUE))
Dataset$Dates <- sample(c(dates = format(seq(ISOdate(2010,1,1), by='day', length=365), format='%d.%m.%Y')), n, replace = TRUE)

And assuming that what you want is an index counting the tests by date order by subject, you could do the following.

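# Dates are dd.mm.yyyy strings, so convert them to Date for a chronological sort
Dataset <- Dataset[order(Dataset$SubID, as.Date(Dataset$Dates, format = "%d.%m.%Y")), ]
# Run lengths of the sorted IDs give the number of tests per subject
ids.rle <- rle(as.character(Dataset$SubID))
# sequence() concatenates 1:n for each run length
Dataset$SubIndex <- sequence(ids.rle$lengths)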

Now the ‘SubIndex’ column in ‘Dataset’ contains a by-subject numbered index of the tests. This takes a very small amount of memory and runs in a few seconds on my 4 GB Core 2 Duo laptop.
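
As an illustration of the same by-subject numbering, here is a minimal sketch using base R's ave() on a small toy data frame. The toy values and the extra Date column are assumptions made for this example, not part of the original data. The date strings are parsed with as.Date() first, because sorting dd.mm.yyyy text alphabetically would not give chronological order.

```r
# Toy data (hypothetical values) with dates stored as dd.mm.yyyy strings
toy <- data.frame(
  SubID = c("b", "a", "b", "a", "b"),
  Dates = c("05.03.2010", "12.01.2010", "20.02.2010", "02.01.2010", "01.12.2010"),
  stringsAsFactors = FALSE
)

# Parse the strings so that ordering is chronological rather than alphabetical
toy$Date <- as.Date(toy$Dates, format = "%d.%m.%Y")

# Sort by subject, then by date within each subject
toy <- toy[order(toy$SubID, toy$Date), ]

# Within each subject, number the rows 1, 2, 3, ... in their current order
toy$SubIndex <- ave(seq_len(nrow(toy)), toy$SubID, FUN = seq_along)

print(toy[, c("SubID", "Dates", "SubIndex")])
```

ave() numbers rows within each group in whatever order they currently appear, so the sort is only needed to make that order chronological. It works on the whole column at once, so no per-subject list of data frames is built.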
