I am a newbie working in a simple Rails app that translates a document

Asked: 2026-05-13T14:04:25+00:00

I am a newbie working on a simple Rails app that translates a document (a long string) from one language to another. The dictionary is a table of terms: each term has a regexp string to find, and a block that outputs the substituting string. The table is 1 million records long.

Each request is a document to be translated. In a first brute-force approach, I need to run the whole dictionary against each request/document.

Since the whole dictionary runs every time (from the first record to the last), I think that instead of reloading the dictionary's table of records for each document, it would be best to keep the whole dictionary as an array in memory.

I know it is not the most efficient, but the dictionary has to run whole at this point.

1.- If no efficiency can be gained by restructuring the document and dictionary (meaning it is not possible to create smaller subsets of the dictionary), what is the best design approach?

2.- Do you know of similar projects that I can learn from?

3.- Where should I look to learn how to load such a big table into memory (a cache?) at Rails startup?

Any answer to any of the posed questions will be greatly appreciated. Thank you very much!

1 Answer

Editorial Team · Answer 1 · 2026-05-13T14:04:26+00:00

I don’t think your hosting provider will be happy with a solution like this. This script

# Build a million regexp => replacement pairs entirely in memory
dict = {}
1_000_000.times do |num|
  dict[/#{num}/] = "#{num}_subst"
end

consumes about a gigabyte of RAM on my MacBook Pro just to store the hash table. Another approach is to store your substitutions marshaled in memcached, so that you can (at least) spread them across machines.

require 'memcached'

# Connect to a memcached server on the default port
@table = Memcached.new("localhost:11211")

# Store each [pattern, replacement] pair as a marshaled blob under its own key
1_000_000.times do |num|
  stored_blob = Marshal.dump([/#{num}/, "#{num}_subst"])
  @table.set("p#{num}", stored_blob)
end

You will have to worry about keeping the keys “hot”, since memcached evicts entries that have not been used recently once it runs out of memory.

The best approach for your case, however, is very simple: write your substitutions to a file (one line per substitution) and make a stream filter that reads the file line by line and applies each replacement to the document. You can also parallelize this by splitting the work, say, per letter of the substitution, and replacing markers.

But this should get you started:

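  require "base64"

  # Write one Base64-encoded, marshaled [pattern, replacement] pair per line.
  # strict_encode64 never inserts line breaks, so long records stay on one line.
  File.open("./dict.marshal", "wb") do |file|
    1_000_000.times do |num|
      stored_blob = Base64.strict_encode64(Marshal.dump([/#{num}/, "#{num}_subst"]))
      file.puts(stored_blob)
    end
  end

  puts "Table populated (should be a 35 meg file), now let's run substitutions"

  # Stream the file back, decoding one pair at a time
  File.open("./dict.marshal", "r") do |f|
    until f.eof?
      pattern, replacement = Marshal.load(Base64.decode64(f.gets))
      # Apply pattern/replacement to the document here
    end
  end

  puts "All replacements out"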
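
The read loop in that script only decodes each pair and then discards it. As a minimal sketch of the stream filter itself, the helper below applies every pair to a document as it is read. The name `translate` and its two parameters are hypothetical, and it assumes the one-record-per-line dictionary file format used above, with plain strings as replacements.

```ruby
require "base64"

# Hypothetical helper: streams the dictionary file and applies every
# substitution to a copy of the document, one record at a time, so only
# a single [pattern, replacement] pair is held in memory at once.
def translate(document, dict_path)
  text = document.dup # leave the caller's string untouched
  File.foreach(dict_path) do |line|
    # decode64 ignores the trailing newline left by the writer
    pattern, replacement = Marshal.load(Base64.decode64(line))
    text.gsub!(pattern, replacement)
  end
  text
end
```

Substitutions run in file order, so a later pattern can match text produced by an earlier replacement. Marshal.load should only ever be pointed at a dictionary file you generated yourself.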

Populating the file and then loading every substitution takes me:

 real    0m21.262s
 user    0m19.100s
 sys     0m0.502s

Just loading the regexps and strings from the file (all million of them, piece by piece) takes:

 real    0m7.855s
 user    0m7.645s
 sys     0m0.105s

So this costs about 7 to 8 seconds of loading overhead per pass, but it uses almost no memory: the RSIZE is about 3 megs, and there is huge room for improvement. You should easily be able to make it go faster if you do IO in bulk, or make one file per 10-50 substitutions and load each as a whole. Put the files on an SSD or a RAID and you've got a winner, and you get to keep your RAM.
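
One way to sketch the batching idea is a variation that keeps a single file but stores a whole batch of pairs on each line, rather than one file per batch. The helper names `write_batched` and `each_pair` and the `BATCH_SIZE` constant are hypothetical, and the batch size of 50 is only an assumption taken from the 10-50 range mentioned above. I have not timed this version.

```ruby
require "base64"

BATCH_SIZE = 50 # assumed value, picked from the 10-50 range suggested above

# Hypothetical writer: each line holds one marshaled batch of
# [pattern, replacement] pairs instead of a single pair.
def write_batched(pairs, path, batch_size = BATCH_SIZE)
  File.open(path, "wb") do |file|
    pairs.each_slice(batch_size) do |batch|
      # strict_encode64 emits no line breaks, so one batch is one line
      file.puts(Base64.strict_encode64(Marshal.dump(batch)))
    end
  end
end

# Hypothetical reader: decodes one batch per line and yields every pair,
# so at most batch_size pairs are in memory at any time.
def each_pair(path)
  File.foreach(path) do |line|
    Marshal.load(Base64.decode64(line)).each { |pair| yield pair }
  end
end
```

A caller would then loop with `each_pair("./dict.marshal") { |pattern, replacement| ... }` and apply each substitution inside the block. Fewer, larger records mean fewer decode and unmarshal calls per pass, at the price of holding one batch in memory.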
