There are two large text files (Millions of lines) that my program uses. These


Asked: May 15, 2026 (2026-05-15T22:39:29+00:00)


There are two large text files (millions of lines each) that my program uses. These files are parsed and loaded into hashes so that the data can be accessed quickly. The problem is that parsing and loading is currently the slowest part of the program. Below is the code where this is done.

database = extractDatabase(@type).chomp("fasta") + "yml"
revDatabase = extractDatabase(@type + "-r").chomp("fasta.reverse") + "yml"
@proteins = Hash.new
@decoyProteins = Hash.new

File.open(database, "r").each_line do |line|
  parts = line.split(": ")
  @proteins[parts[0]] = parts[1]
end

File.open(revDatabase, "r").each_line do |line|
  parts = line.split(": ")
  @decoyProteins[parts[0]] = parts[1]
end

And the files look like the example below. It started off as a YAML file, but the format was modified to increase parsing speed.

MTMDK: P31946   Q14624  Q14624-2    B5BU24  B7ZKJ8  B7Z545  Q4VY19  B2RMS9  B7Z544  Q4VY20
MTMDKSELVQK: P31946 B5BU24  Q4VY19  Q4VY20
....

I’ve messed around with different ways of setting up the file and parsing them, and so far this is the fastest way, but it’s still awfully slow.

Is there a way to improve the speed of this, or is there a whole other approach I can take?

List of things that don’t work:

YAML.
Standard Ruby threads.
Forking off processes and then retrieving the hash through a pipe.


1 Answer

Editorial Team · Answer 1 · 2026-05-15T22:39:30+00:00


In my experience, reading all or part of the file into memory before parsing is usually faster. If the databases are small enough to fit in memory, this could be as simple as:

buffer = File.readlines(database)
buffer.each do |line|
    ...
end
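
To make the loop body concrete, here is a minimal sketch that assumes the "KEY: VALUE" line format shown in the question. The method name load_index is hypothetical, and whether this beats the original loop on your data is something to benchmark rather than take for granted.

```ruby
# Hypothetical helper: read the whole file in one call, then parse in memory.
# Assumes every line has the form "KEY: VALUE".
def load_index(path)
  index = {}
  # chomp: true strips the trailing newline from each line up front.
  File.readlines(path, chomp: true).each do |line|
    # Limit of 2 keeps any later ": " inside the value intact.
    key, value = line.split(": ", 2)
    index[key] = value
  end
  index
end
```

With the question's variables, this would be called as `@proteins = load_index(database)` and `@decoyProteins = load_index(revDatabase)`. Unlike the original loop, the stored values no longer carry a trailing newline.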

If they're too big to fit into memory, it gets more complicated: you have to set up block reads of data, each followed by a parse step, or use separate read and parse threads.
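
The block-read variant could look roughly like the sketch below. It reads fixed-size blocks and carries any incomplete trailing line over to the next block. The names load_index_chunked and CHUNK_SIZE and the 1 MiB block size are assumptions to tune, not measured recommendations, and the "KEY: VALUE" line format is taken from the question.

```ruby
# Hypothetical helper: read the file in fixed-size blocks and parse each block
# as it arrives, carrying a partial trailing line over to the next block.
CHUNK_SIZE = 1 << 20 # 1 MiB per read; an arbitrary starting point

def load_index_chunked(path, chunk_size = CHUNK_SIZE)
  index = {}
  leftover = String.new # empty binary string; IO#read(length) returns binary data
  File.open(path, "rb") do |file|
    while (chunk = file.read(chunk_size))
      data = leftover + chunk
      # Everything after the last newline is an incomplete line.
      cut = data.rindex("\n")
      if cut.nil?
        leftover = data
        next
      end
      leftover = data[(cut + 1)..-1]
      data[0..cut].each_line(chomp: true) do |line|
        key, value = line.split(": ", 2)
        index[key] = value
      end
    end
  end
  # A final line without a trailing newline is still pending here.
  unless leftover.empty?
    key, value = leftover.split(": ", 2)
    index[key] = value
  end
  index
end
```

Only one block plus one partial line of raw text is held at a time, although the resulting hash still has to fit in memory. Keys and values come back as binary (ASCII-8BIT) strings, which compare equal to ordinary strings as long as the data is plain ASCII, as in the sample lines.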
