Sign Up

Sign Up to our social questions and Answers Engine to ask questions, answer people’s questions, and connect with other people.

Have an account? Sign In

Have an account? Sign In Now

Sign In

Login to our social questions & Answers Engine to ask questions answer people’s questions & connect with other people.

Sign Up Here

Forgot Password?

Don't have account, Sign Up Here

Forgot Password

Lost your password? Please enter your email address. You will receive a link and will create a new password via email.

Have an account? Sign In Now

You must login to ask a question.

Forgot Password?

Need An Account, Sign Up Here

Please briefly explain why you feel this question should be reported.

Please briefly explain why you feel this answer should be reported.

Please briefly explain why you feel this user should be reported.

Sign InSign Up

The Archive Base

The Archive Base Logo The Archive Base Logo

The Archive Base Navigation

  • Home
  • SEARCH
  • About Us
  • Blog
  • Contact Us
Search
Ask A Question

Mobile menu

Close
Ask a Question
  • Home
  • Add group
  • Groups page
  • Feed
  • User Profile
  • Communities
  • Questions
    • New Questions
    • Trending Questions
    • Must read Questions
    • Hot Questions
  • Polls
  • Tags
  • Badges
  • Buy Points
  • Users
  • Help
  • Buy Theme
  • SEARCH
Home/ Questions/Q 9101615
In Process

The Archive Base Latest Questions

Editorial Team
  • 0
Editorial Team
Asked: June 17, 20262026-06-17T01:10:34+00:00 2026-06-17T01:10:34+00:00

I have described my (non-trivial) problem described below. This is my first post, now

  • 0

I have described my (non-trivial) problem described below. This is my first post, now in a modified version. Any inputs or proposed solutions would be helpful.

There are several dimensions to this: determination of an optimal solution of small scale problem (several suggestions already below), time (data.table solution below seems to check the box) and memory management. The problem is about tags that enumerated in one table and represented by clusters in another(same cluster if within 30bp on same strand).

The challenge boils do to determining an efficient procedure of allocating a given tag to the appropriate interval. We are dealing with genomic data which means thats that tag coordinates are determined by start position, end position (=start position + 1), chromosome (25 values in full data set) and strand (position is on either the plus or minus strand over the double-stranded DNA). Clusters are thus non-overlapping on the same strand, but cluster coordinates can overlap if their intervals are on different strands complicates things.

This is a modified version of my post from Jan 9th, that better encapsulates the inherent difficulty of the problem. A fast solution that solves the small scale problem is shown later.
If anyone want to work on the full dataset, let me know.
Many thanks in advance!

Regards,

Nick Clark

Background
The issue involves intervals and greatest n per group. I have two tables containing clustered gene coordinates (clusters) and the input data (tags). The clusters table contains the summed tags from each covered non-overlapping interval on the same strand from the tags table. The full clusters table has 1.6 mln rows. The tags table about 4 mln rows, so the solution should ideally be vectorised. See below for some sample data. The setting is a data set on transcription initiation sites (CAGE) for human.

As I am working in R, I am look for a solution based in R or SQL. I have made previous unsuccessful attempts both via the plyr and sqldf packages in R.

The Challenge
What I am is missing is a row in the clustered table identifying the start coordinate from the input data table associated with the largest tag contribution.

Note that
1) clusters from different strands can have overlapping coordinates,
2) chr / chr_clst can take on 25 different values (not shown in example),
3) a solution needs to account for both strand and chr / chr_clst.

My Ideal Solution:
Vectorized R code or improvement on SQL statement below. A version of the solution below, that accounts for the memory problem would do the trick. As would an improved sql statement that efficiently determines the appropriate row from the clusters table.

Status so far
Here is the apparently best solution so far. Hat tip and cool points to user1935457 for the code and to mnel for subsequent suggested modification. The snag here is that moving from the toy example to the fill scale tables crashes both R (and R Studio) due to excessive demands on memory.

# Convert sample data provided in question
clusters <- as.data.table(clusters)
tags <- as.data.table(tags)

# Rename chr and strand for easier joining
setnames(clusters, c("chr_clst", "strand_clst"), c("chr", "strand"))

# Set key on each table for next step
setkey(clusters, chr, strand)
setkey(tags, chr, strand)

# Merge on the keys
tmp <- merge(clusters, tags, by = c("chr", "strand"))

# Find index (in merged table, tmp) of largest tag_count in each
# group subject to start_clst <= end <= end_clst
idx <- tmp[between(end, start_clst, end_clst),
       list(IDX=.I[which.max(tag_count)]),
       by=list(chr, start_clst,end_clst,strand)]$IDX

# Get those rows from merged table
tmp[idx]

I initially created a basic SQL query using the sqldf package in R (this version finds the max and not the coordinate associated with the max). The query takes forever to run, despite placing (hopefully) appropriate indices on both tables.

output_tablename <- sqldf(c(
"create index ai1 on clusters(chr_clst, start_clst, end_clst, strand_clst)",
"create index ai2 on tags(chr, start, end, strand)",
"select a.chr_clst, a.start_clst, a.end_clst, a.strand_clst, sum(b.tags)
from main.clusters a
inner join main.tags b on a.chr_clst=b.chr and a.strand_clst = b.strand 
and b.end between a.start_clst and a.end_clst
group by a.chr_clst, a.start_clst, a.end_clst, a.strand_clst
order by a.chr_clst, a.start_clst, a.end_clst, a.strand_clst"
))

Table structure
clusters: chr_clst, start_clst, end_clst, strand_clst, tags_clst.
tags: chr, start, end, strand, tag_count.

Sample Data in R format
Let me know if anyone want to work on the full dataset, let me know.

clusters:

chr_clst <- c("chr1", "chr1", "chr1", "chr1", "chr1", "chr1", "chr1", "chr1", "chr1", "chr1")
start_clst <- c(568911, 569233, 569454, 569793, 569877, 569926, 569972, 570048, 570166, 713987)
end_clst <- c(568941, 569256, 569484, 569803, 569926, 569952, 569973, 570095, 570167, 714049)
strand_clst <- c("+", "+", "+", "+", "+", "-", "+", "+", "+", "-")
tags_clst <- c(37, 4, 6, 3, 80, 25, 1, 4, 1, 46)

clusters <- data.frame(cbind(chr_clst, start_clst, end_clst, strand_clst, tags_clst))
clusters$start_clst <- as.numeric(as.character(clusters$start_clst))
clusters$end_clst <- as.numeric(as.character(clusters$end_clst))
clusters$tags_clst <- as.numeric(as.character(clusters$tags_clst))
rm(chr_clst, start_clst, end_clst, start_clst, strand_clst, tags_clst)

tags:

chr <- c("chr1", "chr1", "chr1", "chr1", "chr1", "chr1", "chr1", "chr1", "chr1", 
"chr1", "chr1", "chr1", "chr1", "chr1", "chr1", "chr1", "chr1", "chr1", "chr1", 
"chr1", "chr1", "chr1", "chr1", "chr1", "chr1", "chr1", "chr1", "chr1", "chr1", 
"chr1", "chr1", "chr1", "chr1", "chr1", "chr1", "chr1", "chr1", "chr1", "chr1", 
"chr1", "chr1", "chr1", "chr1", "chr1", "chr1", "chr1", "chr1", "chr1", "chr1", 
"chr1", "chr1", "chr1", "chr1", "chr1", "chr1", "chr1", "chr1", "chr1", "chr1", 
"chr1", "chr1", "chr1", "chr1", "chr1", "chr1", "chr1", "chr1", "chr1", "chr1", 
"chr1", "chr1", "chr1", "chr1", "chr1", "chr1", "chr1", "chr1", "chr1", "chr1", 
"chr1", "chr1", "chr1", "chr1", "chr1", "chr1", "chr1", "chr1", "chr1", "chr1", 
"chr1", "chr1", "chr1", "chr1", "chr1", "chr1", "chr1", "chr1", "chr1", "chr1", 
"chr1", "chr1", "chr1", "chr1", "chr1", "chr1", "chr1", "chr1", "chr1", "chr1", 
"chr1", "chr1", "chr1", "chr1", "chr1", "chr1", "chr1", "chr1", "chr1", "chr1", 
"chr1")

start <- c(568911, 568912, 568913, 568913, 568914, 568916, 568917, 568918, 568929, 
568929, 568932, 568933, 568935, 568937, 568939, 568940, 568940, 569233, 569247, 
569255, 569454, 569469, 569471, 569475, 569483, 569793, 569802, 569877, 569880, 
569887, 569889, 569890, 569891, 569893, 569894, 569895, 569895, 569896, 569897, 
569898, 569898, 569899, 569900, 569900, 569901, 569901, 569903, 569905, 569906, 
569907, 569907, 569908, 569908, 569909, 569910, 569910, 569911, 569911, 569912, 
569914, 569914, 569915, 569916, 569917, 569918, 569919, 569920, 569920, 569925, 
569926, 569936, 569938, 569939, 569939, 569940, 569941, 569941, 569942, 569942, 
569943, 569944, 569948, 569948, 569951, 569972, 570048, 570057, 570078, 570094, 
570166, 713987, 713989, 713995, 714001, 714001, 714007, 714008, 714010, 714011, 
714011, 714011, 714013, 714015, 714015, 714017, 714018, 714019, 714023, 714025, 
714029, 714034, 714034, 714037, 714038, 714039, 714039, 714040, 714042, 714048, 
714048)

end <- c(568912, 568913, 568914, 568914, 568915, 568917, 568918, 568919, 568930, 
568930, 568933, 568934, 568936, 568938, 568940, 568941, 568941, 569234, 569248,
569256, 569455, 569470, 569472, 569476, 569484, 569794, 569803, 569878, 569881, 
569888, 569890, 569891, 569892, 569894, 569895, 569896, 569896, 569897, 569898, 
569899, 569899, 569900, 569901, 569901, 569902, 569902, 569904, 569906, 569907, 
569908, 569908, 569909, 569909, 569910, 569911, 569911, 569912, 569912, 569913, 
569915, 569915, 569916, 569917, 569918, 569919, 569920, 569921, 569921, 569926, 
569927, 569937, 569939, 569940, 569940, 569941, 569942, 569942, 569943, 569943, 
569944, 569945, 569949, 569949, 569952, 569973, 570049, 570058, 570079, 570095, 
570167, 713988, 713990, 713996, 714002, 714002, 714008, 714009, 714011, 714012, 
714012, 714012, 714014, 714016, 714016, 714018, 714019, 714020, 714024, 714026, 
714030, 714035, 714035, 714038, 714039, 714040, 714040, 714041, 714043, 714049, 
714049)

strand <- c("+", "+", "+", "+", "+", "+", "+", "+", "+", "+", "+", "+", "+", 
"+", "+", "+", "+", "+", "+", "+", "+", "+", "+", "+", "+", "+", "+", "+", "+", 
"+", "+", "+", "+", "+", "+", "+", "+", "+", "+", "+", "+", "+", "+", "+", "+", 
"+", "+", "+", "+", "+", "+", "+", "+", "+", "+", "+", "+", "+", "+", "+", "+", 
"+", "+", "+", "+", "+", "+", "+", "+", "-", "-", "-", "-", "-", "-", "-", "-", 
"-", "-", "-", "-", "-", "-", "-", "+", "+", "+", "+", "+", "+", "-", "-", "-", 
"-", "-", "-", "-", "-", "-", "-", "-", "-", "-", "-", "-", "-", "-", "-", "-", 
"-", "-", "-", "-", "-", "-", "-", "-", "-", "-", "-")

tag_count <- c(1, 1, 1, 2, 3, 2, 3, 1, 1, 1, 1, 1, 2, 1, 6, 2, 8, 1, 1, 2, 1, 1, 2, 
1, 1, 2, 1, 1, 1, 1, 1, 2, 2, 4, 4, 1, 1, 1, 1, 1, 3, 2, 1, 1, 2, 4, 2, 4, 2, 4, 
1, 1, 1, 1, 3, 2, 1, 3, 1, 2, 3, 1, 1, 3, 2, 1, 1, 1, 5, 1, 2, 1, 2, 1, 1, 2, 2, 
4, 3, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 1, 2, 3, 2, 4, 2, 1, 1, 1, 
2, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 4, 1, 2)

tags <- data.frame(cbind(chr, start, end, strand, tag_count))    
tags$start <- as.numeric(as.character(tags$start))
tags$end <- as.numeric(as.character(tags$end))
tags$tag_count <- as.numeric(as.character(tags$tag_count))
rm(chr, start, end, strand, tag_count)
  • 1 1 Answer
  • 0 Views
  • 0 Followers
  • 0
Share
  • Facebook
  • Report

Leave an answer
Cancel reply

You must login to add an answer.

Forgot Password?

Need An Account, Sign Up Here

1 Answer

  • Voted
  • Oldest
  • Recent
  • Random
  1. Editorial Team
    Editorial Team
    2026-06-17T01:10:35+00:00Added an answer on June 17, 2026 at 1:10 am

    Here’s an attempt with the data.table package:

    # Convert sample data provided in question
    clusters <- as.data.table(clusters)
    tags <- as.data.table(tags)
    
    # Rename chr and strand for easier joining
    setnames(clusters, c("chr_clst", "strand_clst"), c("chr", "strand"))
    
    # Set key on each table for next step
    setkey(clusters, chr, strand)
    setkey(tags, chr, strand)
    
    # Merge on the keys
    tmp <- merge(clusters, tags, by = c("chr", "strand"))
    
    # Find index (in merged table, tmp) of largest tag_count in each
    # group subject to start_clst <= end <= end_clst
    idx <- tmp[between(end, start_clst, end_clst),
               list(IDX=.I[which.max(tag_count)]),
               by=list(chr, start_clst,end_clst,strand)]$IDX
    
    # Get those rows from merged table
    tmp[idx]
    

    Output of last line:

         chr strand start_clst end_clst tags_clst  start    end tag_count
     1: chr1      -     569926   569952        25 569942 569943         4
     2: chr1      -     713987   714049        46 714011 714012         4
     3: chr1      +     568911   568941        37 568940 568941         8
     4: chr1      +     569233   569256         4 569255 569256         2
     5: chr1      +     569454   569484         6 569471 569472         2
     6: chr1      +     569793   569803         3 569793 569794         2
     7: chr1      +     569877   569926        80 569925 569926         5
     8: chr1      +     569972   569973         1 569972 569973         1
     9: chr1      +     570048   570095         4 570048 570049         1
    10: chr1      +     570166   570167         1 570166 570167         1
    

    Edit

    Based on the memory issues discussed in comments below, here’s another attempt. I use the intervals package to find overlapping intervals between the two tables. You could also explore parallelizing the for loop to gain speed.

    require(data.table)
    require(intervals)
    clusters <- data.table(clusters)
    tags <- data.table(tags)
    
    #  Find all unique combinations of chr and strand...
    setkey(clusters, chr_clst, strand_clst)
    setkey(tags, chr, strand)
    
    unique.keys <- unique(rbind(clusters[, key(clusters), with=FALSE],
                                tags[, key(tags), with=FALSE], use.names=FALSE))
    
    # ... and then work on each pair individually to avoid creating
    # enormous objects in memory
    result.list <- vector("list", nrow(unique.keys))
    for(i in seq_len(nrow(unique.keys))) {
      tmp.clst <- clusters[unique.keys[i]]
      tmp.tags <- tags[unique.keys[i]]
    
      # Keep track of each row for later
      tmp.clst[, row.id := seq_len(nrow(tmp.clst))]
      tmp.tags[, row.id := seq_len(nrow(tmp.tags))]
    
      # Use intervals package to find all overlapping [start, end] 
      # intervals between the two tables
      clst.intervals <- Intervals(tmp.clst[, list(start_clst, end_clst)],
                                  type = "Z")
      tags.intervals <- Intervals(tmp.tags[, list(start, end)],
                                  type = "Z")
      rownames(tags.intervals) <- tmp.tags$row.id
    
      # This goes to C++ code in intervals package; 
      # I didn't spend too much time looking over how it works
      overlaps <- interval_overlap(tags.intervals,
                                   clst.intervals,
                                   check_valid = FALSE)
    
      # Retrieve rows from clusters table with overlaps and add a column
      # indicating which intervals in tags table they overlapped with
      matches <- lapply(as.integer(names(overlaps)), function(n) {
        ans <- tmp.clst[overlaps[[n]]]
        ans[, match.in.tags := n]
      })
    
      # List back to one table...
      matches <- rbindlist(matches)
    
      # ... and join each match from tags to its relevant row from tags
      setkey(matches, match.in.tags)
      setkey(tmp.tags, row.id)
    
      # add the rows for max of tag_count by start_clst and
      # end_clst from this particular unique key to master list...
      result.list[[i]] <- tmp.tags[matches][, .SD[which.max(tag_count)],
                                            by = list(start_clst, end_clst)]
    }
    
    # and concatenate master list into none table,
    # getting rid of the helper columns
    rbindlist(result.list)[, c("row.id", "row.id.1") := NULL][]
    

    The last line gives:

        start_clst end_clst  chr strand  start    end tag_count chr_clst strand_clst tags_clst
     1:     569926   569952 chr1      - 569942 569943         4     chr1           -        25
     2:     713987   714049 chr1      - 714011 714012         4     chr1           -        46
     3:     568911   568941 chr1      + 568940 568941         8     chr1           +        37
     4:     569233   569256 chr1      + 569255 569256         2     chr1           +         4
     5:     569454   569484 chr1      + 569471 569472         2     chr1           +         6
     6:     569793   569803 chr1      + 569793 569794         2     chr1           +         3
     7:     569877   569926 chr1      + 569925 569926         5     chr1           +        80
     8:     569972   569973 chr1      + 569972 569973         1     chr1           +         1
     9:     570048   570095 chr1      + 570048 570049         1     chr1           +         4
    10:     570166   570167 chr1      + 570166 570167         1     chr1           +         1
    
    • 0
    • Reply
    • Share
      Share
      • Share on Facebook
      • Share on Twitter
      • Share on LinkedIn
      • Share on WhatsApp
      • Report

Sidebar

Related Questions

I have implemented a selection pattern similar to the one described in this post
I have two tables described below. What I need is a single query that
I have the same problem described here on the e-office blog , but in
I have the same problem as described in ( Last record of Join table
I have a question which is described below: What problems would arise for testing
I have the same question problem as described here How to purge a cached
I have a very complicated problem that occurs only for non-admin windows user. I
I have previously solved a similar problem in this question , where I asked
We have const and non- const function overloading in C++ as described here and
As described here http://www.ddj.com/cpp/184401688 I do not have time to write this from scratch.

Explore

  • Home
  • Add group
  • Groups page
  • Communities
  • Questions
    • New Questions
    • Trending Questions
    • Must read Questions
    • Hot Questions
  • Polls
  • Tags
  • Badges
  • Users
  • Help
  • SEARCH

Footer

© 2021 The Archive Base. All Rights Reserved
With Love by The Archive Base

Insert/edit link

Enter the destination URL

Or link to existing content

    No search term specified. Showing recent items. Search or use up and down arrow keys to select an item.