I have described my (non-trivial) problem described below. This is my first post, now in a modified version. Any inputs or proposed solutions would be helpful.
There are several dimensions to this: determination of an optimal solution of small scale problem (several suggestions already below), time (data.table solution below seems to check the box) and memory management. The problem is about tags that enumerated in one table and represented by clusters in another(same cluster if within 30bp on same strand).
The challenge boils do to determining an efficient procedure of allocating a given tag to the appropriate interval. We are dealing with genomic data which means thats that tag coordinates are determined by start position, end position (=start position + 1), chromosome (25 values in full data set) and strand (position is on either the plus or minus strand over the double-stranded DNA). Clusters are thus non-overlapping on the same strand, but cluster coordinates can overlap if their intervals are on different strands complicates things.
This is a modified version of my post from Jan 9th, that better encapsulates the inherent difficulty of the problem. A fast solution that solves the small scale problem is shown later.
If anyone want to work on the full dataset, let me know.
Many thanks in advance!
Regards,
Nick Clark
Background
The issue involves intervals and greatest n per group. I have two tables containing clustered gene coordinates (clusters) and the input data (tags). The clusters table contains the summed tags from each covered non-overlapping interval on the same strand from the tags table. The full clusters table has 1.6 mln rows. The tags table about 4 mln rows, so the solution should ideally be vectorised. See below for some sample data. The setting is a data set on transcription initiation sites (CAGE) for human.
As I am working in R, I am look for a solution based in R or SQL. I have made previous unsuccessful attempts both via the plyr and sqldf packages in R.
The Challenge
What I am is missing is a row in the clustered table identifying the start coordinate from the input data table associated with the largest tag contribution.
Note that
1) clusters from different strands can have overlapping coordinates,
2) chr / chr_clst can take on 25 different values (not shown in example),
3) a solution needs to account for both strand and chr / chr_clst.
My Ideal Solution:
Vectorized R code or improvement on SQL statement below. A version of the solution below, that accounts for the memory problem would do the trick. As would an improved sql statement that efficiently determines the appropriate row from the clusters table.
Status so far
Here is the apparently best solution so far. Hat tip and cool points to user1935457 for the code and to mnel for subsequent suggested modification. The snag here is that moving from the toy example to the fill scale tables crashes both R (and R Studio) due to excessive demands on memory.
# Convert sample data provided in question
clusters <- as.data.table(clusters)
tags <- as.data.table(tags)
# Rename chr and strand for easier joining
setnames(clusters, c("chr_clst", "strand_clst"), c("chr", "strand"))
# Set key on each table for next step
setkey(clusters, chr, strand)
setkey(tags, chr, strand)
# Merge on the keys
tmp <- merge(clusters, tags, by = c("chr", "strand"))
# Find index (in merged table, tmp) of largest tag_count in each
# group subject to start_clst <= end <= end_clst
idx <- tmp[between(end, start_clst, end_clst),
list(IDX=.I[which.max(tag_count)]),
by=list(chr, start_clst,end_clst,strand)]$IDX
# Get those rows from merged table
tmp[idx]
I initially created a basic SQL query using the sqldf package in R (this version finds the max and not the coordinate associated with the max). The query takes forever to run, despite placing (hopefully) appropriate indices on both tables.
output_tablename <- sqldf(c(
"create index ai1 on clusters(chr_clst, start_clst, end_clst, strand_clst)",
"create index ai2 on tags(chr, start, end, strand)",
"select a.chr_clst, a.start_clst, a.end_clst, a.strand_clst, sum(b.tags)
from main.clusters a
inner join main.tags b on a.chr_clst=b.chr and a.strand_clst = b.strand
and b.end between a.start_clst and a.end_clst
group by a.chr_clst, a.start_clst, a.end_clst, a.strand_clst
order by a.chr_clst, a.start_clst, a.end_clst, a.strand_clst"
))
Table structure
clusters: chr_clst, start_clst, end_clst, strand_clst, tags_clst.
tags: chr, start, end, strand, tag_count.
Sample Data in R format
Let me know if anyone want to work on the full dataset, let me know.
clusters:
chr_clst <- c("chr1", "chr1", "chr1", "chr1", "chr1", "chr1", "chr1", "chr1", "chr1", "chr1")
start_clst <- c(568911, 569233, 569454, 569793, 569877, 569926, 569972, 570048, 570166, 713987)
end_clst <- c(568941, 569256, 569484, 569803, 569926, 569952, 569973, 570095, 570167, 714049)
strand_clst <- c("+", "+", "+", "+", "+", "-", "+", "+", "+", "-")
tags_clst <- c(37, 4, 6, 3, 80, 25, 1, 4, 1, 46)
clusters <- data.frame(cbind(chr_clst, start_clst, end_clst, strand_clst, tags_clst))
clusters$start_clst <- as.numeric(as.character(clusters$start_clst))
clusters$end_clst <- as.numeric(as.character(clusters$end_clst))
clusters$tags_clst <- as.numeric(as.character(clusters$tags_clst))
rm(chr_clst, start_clst, end_clst, start_clst, strand_clst, tags_clst)
tags:
chr <- c("chr1", "chr1", "chr1", "chr1", "chr1", "chr1", "chr1", "chr1", "chr1",
"chr1", "chr1", "chr1", "chr1", "chr1", "chr1", "chr1", "chr1", "chr1", "chr1",
"chr1", "chr1", "chr1", "chr1", "chr1", "chr1", "chr1", "chr1", "chr1", "chr1",
"chr1", "chr1", "chr1", "chr1", "chr1", "chr1", "chr1", "chr1", "chr1", "chr1",
"chr1", "chr1", "chr1", "chr1", "chr1", "chr1", "chr1", "chr1", "chr1", "chr1",
"chr1", "chr1", "chr1", "chr1", "chr1", "chr1", "chr1", "chr1", "chr1", "chr1",
"chr1", "chr1", "chr1", "chr1", "chr1", "chr1", "chr1", "chr1", "chr1", "chr1",
"chr1", "chr1", "chr1", "chr1", "chr1", "chr1", "chr1", "chr1", "chr1", "chr1",
"chr1", "chr1", "chr1", "chr1", "chr1", "chr1", "chr1", "chr1", "chr1", "chr1",
"chr1", "chr1", "chr1", "chr1", "chr1", "chr1", "chr1", "chr1", "chr1", "chr1",
"chr1", "chr1", "chr1", "chr1", "chr1", "chr1", "chr1", "chr1", "chr1", "chr1",
"chr1", "chr1", "chr1", "chr1", "chr1", "chr1", "chr1", "chr1", "chr1", "chr1",
"chr1")
start <- c(568911, 568912, 568913, 568913, 568914, 568916, 568917, 568918, 568929,
568929, 568932, 568933, 568935, 568937, 568939, 568940, 568940, 569233, 569247,
569255, 569454, 569469, 569471, 569475, 569483, 569793, 569802, 569877, 569880,
569887, 569889, 569890, 569891, 569893, 569894, 569895, 569895, 569896, 569897,
569898, 569898, 569899, 569900, 569900, 569901, 569901, 569903, 569905, 569906,
569907, 569907, 569908, 569908, 569909, 569910, 569910, 569911, 569911, 569912,
569914, 569914, 569915, 569916, 569917, 569918, 569919, 569920, 569920, 569925,
569926, 569936, 569938, 569939, 569939, 569940, 569941, 569941, 569942, 569942,
569943, 569944, 569948, 569948, 569951, 569972, 570048, 570057, 570078, 570094,
570166, 713987, 713989, 713995, 714001, 714001, 714007, 714008, 714010, 714011,
714011, 714011, 714013, 714015, 714015, 714017, 714018, 714019, 714023, 714025,
714029, 714034, 714034, 714037, 714038, 714039, 714039, 714040, 714042, 714048,
714048)
end <- c(568912, 568913, 568914, 568914, 568915, 568917, 568918, 568919, 568930,
568930, 568933, 568934, 568936, 568938, 568940, 568941, 568941, 569234, 569248,
569256, 569455, 569470, 569472, 569476, 569484, 569794, 569803, 569878, 569881,
569888, 569890, 569891, 569892, 569894, 569895, 569896, 569896, 569897, 569898,
569899, 569899, 569900, 569901, 569901, 569902, 569902, 569904, 569906, 569907,
569908, 569908, 569909, 569909, 569910, 569911, 569911, 569912, 569912, 569913,
569915, 569915, 569916, 569917, 569918, 569919, 569920, 569921, 569921, 569926,
569927, 569937, 569939, 569940, 569940, 569941, 569942, 569942, 569943, 569943,
569944, 569945, 569949, 569949, 569952, 569973, 570049, 570058, 570079, 570095,
570167, 713988, 713990, 713996, 714002, 714002, 714008, 714009, 714011, 714012,
714012, 714012, 714014, 714016, 714016, 714018, 714019, 714020, 714024, 714026,
714030, 714035, 714035, 714038, 714039, 714040, 714040, 714041, 714043, 714049,
714049)
strand <- c("+", "+", "+", "+", "+", "+", "+", "+", "+", "+", "+", "+", "+",
"+", "+", "+", "+", "+", "+", "+", "+", "+", "+", "+", "+", "+", "+", "+", "+",
"+", "+", "+", "+", "+", "+", "+", "+", "+", "+", "+", "+", "+", "+", "+", "+",
"+", "+", "+", "+", "+", "+", "+", "+", "+", "+", "+", "+", "+", "+", "+", "+",
"+", "+", "+", "+", "+", "+", "+", "+", "-", "-", "-", "-", "-", "-", "-", "-",
"-", "-", "-", "-", "-", "-", "-", "+", "+", "+", "+", "+", "+", "-", "-", "-",
"-", "-", "-", "-", "-", "-", "-", "-", "-", "-", "-", "-", "-", "-", "-", "-",
"-", "-", "-", "-", "-", "-", "-", "-", "-", "-", "-")
tag_count <- c(1, 1, 1, 2, 3, 2, 3, 1, 1, 1, 1, 1, 2, 1, 6, 2, 8, 1, 1, 2, 1, 1, 2,
1, 1, 2, 1, 1, 1, 1, 1, 2, 2, 4, 4, 1, 1, 1, 1, 1, 3, 2, 1, 1, 2, 4, 2, 4, 2, 4,
1, 1, 1, 1, 3, 2, 1, 3, 1, 2, 3, 1, 1, 3, 2, 1, 1, 1, 5, 1, 2, 1, 2, 1, 1, 2, 2,
4, 3, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 1, 2, 3, 2, 4, 2, 1, 1, 1,
2, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 4, 1, 2)
tags <- data.frame(cbind(chr, start, end, strand, tag_count))
tags$start <- as.numeric(as.character(tags$start))
tags$end <- as.numeric(as.character(tags$end))
tags$tag_count <- as.numeric(as.character(tags$tag_count))
rm(chr, start, end, strand, tag_count)
Here’s an attempt with the
data.tablepackage:Output of last line:
Edit
Based on the memory issues discussed in comments below, here’s another attempt. I use the
intervalspackage to find overlapping intervals between the two tables. You could also explore parallelizing theforloop to gain speed.The last line gives: