I have a data frame with 150000 lines in long format with multiple occurrences

Question

Asked: June 17, 2026

I have a data frame with 150000 lines in long format with multiple occurrences of the same id variable. I’m using reshape() (from the stats package, rather than the reshape or reshape2 packages) to convert this to wide format. I am generating a variable that counts each occurrence of a given level of id, to use as an index.

I’ve got this working with a small dataframe using plyr, but it is far too slow for my full df. Can I programme this more efficiently?

I’ve struggled doing this with the reshape package as I have around 30 other variables. It may be best to reshape only what I’m looking at (rather than the whole df) for each individual analysis.

> # u=id variable with three value variables 
> u<-c(rep("a",4), rep("b", 3),rep("c", 6), rep("d", 5))
> u<-factor(u)
> v<-1:18
> w<-20:37
> x<-40:57
> df<-data.frame(u,v,w,x)
> df
   u  v  w  x
1  a  1 20 40
2  a  2 21 41
3  a  3 22 42
4  a  4 23 43
5  b  5 24 44
6  b  6 25 45
7  b  7 26 46
8  c  8 27 47
9  c  9 28 48
10 c 10 29 49
11 c 11 30 50
12 c 12 31 51
13 c 13 32 52
14 d 14 33 53
15 d 15 34 54
16 d 16 35 55
17 d 17 36 56
18 d 18 37 57
> 
> library(plyr)
> df2<-ddply(df, .(u), transform, count=rank(u, ties.method="first")) 
> df2
   u  v  w  x count
1  a  1 20 40     1
2  a  2 21 41     2
3  a  3 22 42     3
4  a  4 23 43     4
5  b  5 24 44     1
6  b  6 25 45     2
7  b  7 26 46     3
8  c  8 27 47     1
9  c  9 28 48     2
10 c 10 29 49     3
11 c 11 30 50     4
12 c 12 31 51     5
13 c 13 32 52     6
14 d 14 33 53     1
15 d 15 34 54     2
16 d 16 35 55     3
17 d 17 36 56     4
18 d 18 37 57     5
> reshape(df2, idvar="u", timevar="count", direction="wide")
   u v.1 w.1 x.1 v.2 w.2 x.2 v.3 w.3 x.3 v.4 w.4 x.4 v.5 w.5 x.5 v.6 w.6 x.6
1  a   1  20  40   2  21  41   3  22  42   4  23  43  NA  NA  NA  NA  NA  NA
5  b   5  24  44   6  25  45   7  26  46  NA  NA  NA  NA  NA  NA  NA  NA  NA
8  c   8  27  47   9  28  48  10  29  49  11  30  50  12  31  51  13  32  52
14 d  14  33  53  15  34  54  16  35  55  17  36  56  18  37  57  NA  NA  NA

1 Answer

Editorial Team · Answer 1 · 2026-06-17T20:14:07+00:00

I still can’t quite figure out why you would ultimately want to convert your dataset from long to wide, because with this many rows per id, the result seems like it would be an extremely unwieldy dataset to work with.

If you’re looking to speed up the enumeration of your factor levels, you can consider using ave() in base R, or .N from the “data.table” package. Considering that you are working with a lot of rows, you might want to consider the latter.
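As a quick illustration of the ave() idea on the small example from the question, here is a minimal base R sketch. It also reshapes only one value column, which is one way to follow up on the question's thought about reshaping just what a given analysis needs. The name wide_v is only illustrative.

```r
# Toy data from the question: id variable u plus three value variables
df <- data.frame(u = factor(c(rep("a", 4), rep("b", 3), rep("c", 6), rep("d", 5))),
                 v = 1:18, w = 20:37, x = 40:57)

# Number the rows within each level of u (1, 2, 3, ... per id)
df$count <- ave(seq_along(df$u), df$u, FUN = seq_along)

# Reshape only the columns needed for one analysis (here just v)
wide_v <- reshape(df[, c("u", "v", "count")],
                  idvar = "u", timevar = "count", direction = "wide")
wide_v
```

Passing a subset of columns keeps the wide result to one column per occurrence, rather than one per occurrence for every value variable.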

First, let’s make up some data:

set.seed(1)
df <- data.frame(u = sample(letters[1:6], 150000, replace = TRUE),
                 v = runif(150000, 0, 10),
                 w = runif(150000, 0, 100),
                 x = runif(150000, 0, 1000))
list(head(df), tail(df))
# [[1]]
#   u        v        w        x
# 1 b 6.368412 10.52822 223.6556
# 2 c 6.579344 75.28534 450.7643
# 3 d 6.573822 36.87630 283.3083
# 4 f 9.711164 66.99525 681.0157
# 5 b 5.337487 54.30291 137.0383
# 6 f 9.587560 44.81581 831.4087
# 
# [[2]]
#        u        v        w        x
# 149995 b 4.614894 52.77121 509.0054
# 149996 f 5.104273 87.43799 391.6819
# 149997 f 2.425936 60.06982 160.2324
# 149998 a 1.592130 66.76113 118.4327
# 149999 b 5.157081 36.90400 511.6446
# 150000 a 3.565323 92.33530 252.4982
table(df$u)
# 
#     a     b     c     d     e     f 
# 25332 24691 24993 24975 25114 24895

Load our required packages:

library(plyr)
library(data.table)

Create a “data.table” version of our dataset:

DT <- data.table(df, key = "u")
DT # Notice that the data are now automatically sorted
#         u         v         w        x
#      1: a 6.2378578 96.098294 643.2433
#      2: a 5.0322400 46.806132 544.6883
#      3: a 9.6289786 87.915303 334.6726
#      4: a 4.3393403  1.994383 753.0628
#      5: a 6.2300123 72.810359 579.7548
#     ---                               
# 149996: f 0.6268414 15.608049 669.3838
# 149997: f 2.3588955 40.380824 658.8667
# 149998: f 1.6383619 77.210309 250.7117
# 149999: f 5.1042725 87.437989 391.6819
# 150000: f 2.4259363 60.069820 160.2324
DT[, .N, by = key(DT)] # Like "table"
#    u     N
# 1: a 25332
# 2: b 24691
# 3: c 24993
# 4: d 24975
# 5: e 25114
# 6: f 24895

Now let’s run a few basic tests. The results from ave() aren’t sorted, but they are in “data.table” and “plyr”, so we should also test the timing for sorting when using ave().

system.time(AVE <- within(df, {
  # seq_along(u) rather than as.numeric(u), so this also works when u is character
  count <- ave(seq_along(u), u, FUN = seq_along)
}))
#    user  system elapsed 
#   0.024   0.000   0.027 

# Now time the sorting
system.time(AVE2 <- AVE[order(AVE$u, AVE$count), ])
#    user  system elapsed 
#   0.264   0.000   0.262 

system.time(DDPLY <- ddply(df, .(u), transform, 
                           count=rank(u, ties.method="first")))
#    user  system elapsed 
#   0.944   0.000   0.984 

system.time(DT[, count := 1:.N, by = key(DT)])
#    user  system elapsed 
#   0.008   0.000   0.004 

all(DDPLY == AVE2)
# [1] TRUE
all(data.frame(DT) == AVE2)
# [1] TRUE

That syntax for “data.table” sure is compact, and its speed is blazing!
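If you do still want the wide format, the count column can be fed straight into the “data.table” version of dcast(). The following is only a sketch on a smaller made-up dataset (1000 rows rather than 150000). It assumes a version of “data.table” recent enough to provide rowid() and to accept several value.var columns in dcast(), and it casts just two of the value columns rather than the whole table.

```r
library(data.table)

# Smaller made-up data with the same layout as above (sizes are arbitrary)
set.seed(1)
DT <- data.table(u = sample(letters[1:6], 1000, replace = TRUE),
                 v = runif(1000), w = runif(1000), x = runif(1000))

# rowid() numbers the rows within each value of u, like 1:.N by group,
# but without needing a key or a sort first
DT[, count := rowid(u)]

# Cast only the value columns needed for this analysis (here v and w);
# the result has one row per id and one column per variable/count pair
wide <- dcast(DT, u ~ count, value.var = c("v", "w"))
dim(wide)
```

The number of columns grows with the largest group, so with around 25000 rows per id the full 150000-row data would produce a very wide table.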
