I have a SQL table that maps, say, authors and books. I would like

Question

0

Asked: June 13, 20262026-06-13T00:12:53+00:00 2026-06-13T00:12:53+00:00

I have a SQL table that maps, say, authors and books. I would like

0

I have a SQL table that maps, say, authors and books. I would like to group linked authors and books (books written by the same author, and authors who co-wrote a book) together and ascertain how big these groups get. For example, if J.K. Rowling co-wrote with Junot Diaz, and Junot Diaz co-wrote a book with Zadie Smith, then I would want all three authors in the same group.

Here’s a toy data set (h/t Matthew Dowle) with some of the relationships I am talking about:

set.seed(1)
authors <- replicate(100,sample(1:3,1))
book_id <- rep(1:100,times=authors)
author_id <- c(lapply(authors,sample,x=1:100,replace=FALSE),recursive=TRUE)
aubk <- data.table(author_id = author_id,book_id = book_id)
aubk[order(book_id,author_id),]

Here one sees that authors 27 and 36 co-wrote book 2, so they should be in the same group. The same for authors 63 and 100 for 3; and D, F and L for 4. And so on.

I can’t think of a good way to do this other than a for-loop, which (as you can guess) is slow. I tried a bit of data.table to avoid unnecessary copying. Is there a better way of doing it?

aubk$group <- integer(dim(aubk)[1])
library(data.table)
aubk <- data.table(aubk)
#system.time({
for (x in 1:dim(aubk)[1]) {
    if(identical(x,1)) {
        value <- 1L
    } else {
        sb <- aubk[1:(x-1),]
        index <- match(aubk[x,author_id],sb[,author_id])
        if (identical(index,NA_integer_)) {
            index <- match(aubk[x,book_id],sb[,book_id])
            if (identical(index,NA_integer_)) {
                value <- x
            } else {
                value <- aubk[index,group]
            }
        } else {
            value <- aubk[index,group]
        }
    }
    aubk[x,group:=value]
}
#})

EDIT: As mentioned by @Josh O’Brien and @thelatemail, my problem can also be worded as looking for the connected components of a graph from a two-column list where every edge is a row, and the two columns are the nodes connected.

Report

Leave an answer
Cancel reply

You must login to add an answer.

Need An Account,

1 Answer

Editorial Team · Answer 1 · 2026-06-13T00:12:54+00:00

Converting 500K nodes into an adjacency matrix was too much for my computer’s memory, so I couldn’t use igraph. The RBGL package isn’t updated for R version 2.15.1, so that was out as well.

After writing a lot of dumb code that doesn’t seem to work, I think the following gets me to the right answer.

aubk[,grp := author_id]
num.grp.old <- aubk[,length(unique(grp))]
iterations <- 0
repeat {
    aubk[,grp := min(grp),by=author_id]
    aubk[,grp := min(grp), by=book_id]
    num.grp.new <- aubk[,length(unique(grp))] 
    if(num.grp.new == num.grp.old) {break}
    num.grp.old <- num.grp.new
    iterations <- iterations + 1
}

Sign Up

Sign In

Forgot Password

The Archive Base Latest Questions

I have a SQL table that maps, say, authors and books. I would like

Leave an answerCancel reply

1 Answer

Leave an answer
Cancel reply