Sign Up

Sign Up to our social questions and Answers Engine to ask questions, answer people’s questions, and connect with other people.

Have an account? Sign In

Have an account? Sign In Now

Sign In

Login to our social questions & Answers Engine to ask questions answer people’s questions & connect with other people.

Sign Up Here

Forgot Password?

Don't have account, Sign Up Here

Forgot Password

Lost your password? Please enter your email address. You will receive a link and will create a new password via email.

Have an account? Sign In Now

You must login to ask a question.

Forgot Password?

Need An Account, Sign Up Here

Please briefly explain why you feel this question should be reported.

Please briefly explain why you feel this answer should be reported.

Please briefly explain why you feel this user should be reported.

Sign InSign Up

The Archive Base

The Archive Base Logo The Archive Base Logo

The Archive Base Navigation

  • SEARCH
  • Home
  • About Us
  • Blog
  • Contact Us
Search
Ask A Question

Mobile menu

Close
Ask a Question
  • Home
  • Add group
  • Groups page
  • Feed
  • User Profile
  • Communities
  • Questions
    • New Questions
    • Trending Questions
    • Must read Questions
    • Hot Questions
  • Polls
  • Tags
  • Badges
  • Buy Points
  • Users
  • Help
  • Buy Theme
  • SEARCH
Home/ Questions/Q 6331933
In Process

The Archive Base Latest Questions

Editorial Team
  • 0
Editorial Team
Asked: May 24, 20262026-05-24T18:09:55+00:00 2026-05-24T18:09:55+00:00

Problem: find ids that are in one file but not in another. Each file

  • 0

Problem: find ids that are in one file but not in another. Each file is about 6.5 GB. Specifically (for those in the bioinformatics domain), one file is a fastq file of sequencing reads and the other is a sam alignment file from a tophat run. I would like to determine which reads in the fastq file are not in the sam alignment file.

I am getting java.lang.OutOfMemory: Java heap space errors. As suggested (ref1, ref2) I am using lazy sequences. However, I am still running out of memory. I have looked at this tutorial, but I don’t quite understand it yet. So I am posting my less sophisticated attempt at a solution with the hope that I am only making a minor mistake.

My attempt:

Since neither file will fit into memory, the lines from the sam file are read a chunk at a time and the ids of each line in the chunk are put into a set. A lazy list of fastq ids are then filtered using the sam ids in the set keeping only those ids that are not in the set. This is repeated with the next chunk of sam lines and the remaining fastq ids.

(defn ids-not-in-sam 
  [ids samlines chunk-size]
  (lazy-seq
    (if (seq samlines)
      (ids-not-in-sam (not-in (into #{} (qnames (take chunk-size samlines))) ids)
                      (drop chunk-size samlines) chunk-size)
      ids)))

not-in determines which ids are not in the set.

(defn not-in 
  ; Return the elements x of xs which are not in the set s
  [s xs]
  (filter (complement s) xs))

qnames gets the id field from a line in the sam file.

(defn qnames [samlines]
  (map #(first (.split #"\t" %)) samlines))

Finally, its put together with io (using read-lines and write-lines from clojure.contrib.io.

(defn write-fq-not-in-sam [fqfile samfile fout chunk-size] 
    (io/write-lines fout (ids-not-in-sam (map fq-id (read-fastq fqfile))
                                         (read-sam samfile) chunk-size))) 

I am pretty sure I am doing every thing in a lazy manner. But I may be holding onto the head of a sequence somewhere that I do not notice.

Is there an error in the code above that is causing the heap to fill up? More importantly, is my approach to the problem all wrong? Is this an appropriate use for lazy sequences, am I expecting too much?

(The errors could be in the read-sam and read-fastq functions, but my post is already a bit long. I can show those later if need be).

  • 1 1 Answer
  • 0 Views
  • 0 Followers
  • 0
Share
  • Facebook
  • Report

Leave an answer
Cancel reply

You must login to add an answer.

Forgot Password?

Need An Account, Sign Up Here

1 Answer

  • Voted
  • Oldest
  • Recent
  • Random
  1. Editorial Team
    Editorial Team
    2026-05-24T18:09:57+00:00Added an answer on May 24, 2026 at 6:09 pm
    1. sort your data sets (read chunks into memory, sort, write to temp file, merge sorted files)
    2. iterate over both sets to find intersecting/missing elements (i hope algorithm is clear)
    • 0
    • Reply
    • Share
      Share
      • Share on Facebook
      • Share on Twitter
      • Share on LinkedIn
      • Share on WhatsApp
      • Report

Sidebar

Related Questions

I have a webapp development problem that I've developed one solution for, but am
this post is a long one sorry for that but the problem is complex
I have a very simple problem but cannot find a nice solution. I have
This is probably really stupid but i can't find the problem with my code.
I find I've been confused by the problem that when I needn't to use
Sorry for not included any example data for my problem. I couldn’t find a
my first post here. I'm not new to JQuery, but I find complex coding
This should be a simple problem, but I just can't seem to find a
i have this problem to find a particular xml node l have post this
I cannot find the solution for this problem, here is the simplified example: On

Explore

  • Home
  • Add group
  • Groups page
  • Communities
  • Questions
    • New Questions
    • Trending Questions
    • Must read Questions
    • Hot Questions
  • Polls
  • Tags
  • Badges
  • Users
  • Help
  • SEARCH

Footer

© 2021 The Archive Base. All Rights Reserved
With Love by The Archive Base

Insert/edit link

Enter the destination URL

Or link to existing content

    No search term specified. Showing recent items. Search or use up and down arrow keys to select an item.