Problem: find ids that are in one file but not in another. Each file

Question

0

Asked: May 24, 20262026-05-24T18:09:55+00:00 2026-05-24T18:09:55+00:00

Problem: find ids that are in one file but not in another. Each file

0

Problem: find ids that are in one file but not in another. Each file is about 6.5 GB. Specifically (for those in the bioinformatics domain), one file is a fastq file of sequencing reads and the other is a sam alignment file from a tophat run. I would like to determine which reads in the fastq file are not in the sam alignment file.

I am getting java.lang.OutOfMemory: Java heap space errors. As suggested (ref1, ref2) I am using lazy sequences. However, I am still running out of memory. I have looked at this tutorial, but I don’t quite understand it yet. So I am posting my less sophisticated attempt at a solution with the hope that I am only making a minor mistake.

My attempt:

Since neither file will fit into memory, the lines from the sam file are read a chunk at a time and the ids of each line in the chunk are put into a set. A lazy list of fastq ids are then filtered using the sam ids in the set keeping only those ids that are not in the set. This is repeated with the next chunk of sam lines and the remaining fastq ids.

(defn ids-not-in-sam 
  [ids samlines chunk-size]
  (lazy-seq
    (if (seq samlines)
      (ids-not-in-sam (not-in (into #{} (qnames (take chunk-size samlines))) ids)
                      (drop chunk-size samlines) chunk-size)
      ids)))

not-in determines which ids are not in the set.

(defn not-in 
  ; Return the elements x of xs which are not in the set s
  [s xs]
  (filter (complement s) xs))

qnames gets the id field from a line in the sam file.

(defn qnames [samlines]
  (map #(first (.split #"\t" %)) samlines))

Finally, its put together with io (using read-lines and write-lines from clojure.contrib.io.

(defn write-fq-not-in-sam [fqfile samfile fout chunk-size] 
    (io/write-lines fout (ids-not-in-sam (map fq-id (read-fastq fqfile))
                                         (read-sam samfile) chunk-size)))

I am pretty sure I am doing every thing in a lazy manner. But I may be holding onto the head of a sequence somewhere that I do not notice.

Is there an error in the code above that is causing the heap to fill up? More importantly, is my approach to the problem all wrong? Is this an appropriate use for lazy sequences, am I expecting too much?

(The errors could be in the read-sam and read-fastq functions, but my post is already a bit long. I can show those later if need be).

Report

Leave an answer
Cancel reply

You must login to add an answer.

Need An Account,

1 Answer

Editorial Team · Answer 1 · 2026-05-24T18:09:57+00:00

Editorial Team

2026-05-24T18:09:57+00:00Added an answer on May 24, 2026 at 6:09 pm

sort your data sets (read chunks into memory, sort, write to temp file, merge sorted files)
iterate over both sets to find intersecting/missing elements (i hope algorithm is clear)

0

Reply
Share
Share

- Report

Sign Up

Sign In

Forgot Password

The Archive Base Latest Questions

Problem: find ids that are in one file but not in another. Each file

Leave an answerCancel reply

1 Answer

Leave an answer
Cancel reply