When I read a text file into memory, each line comes in with a "\n" at the end because of the newlines.
["Hello\n", "my\n", "name\n", "is\n", "John\n"]
Here is how I am reading the text file:
array = File.readlines('text_file.txt')
I need to do a lot of processing on this text array, so I’m wondering, performance-wise, whether I should remove the "\n" when I first create the array, or when I process each element with regex.
I wrote some (admittedly bad) test code to remove the "\n":
array = []
File.open('text_file.txt', "r").each_line do |line|
  data = line.split(/\n/)
  array << data
end
array.flatten!
If I should remove the "\n" when I first create the array, is there a better way to do this?
If I wanted to read the file into a Set instead (for performance), is there a method similar to readlines to do that?
You need to run a benchmark, using Ruby’s built-in Benchmark, to figure out which choice is fastest.
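A minimal sketch of what such a benchmark could look like. The temp file, its 50,000 short lines, and the report labels are all assumptions made up for illustration, not the setup behind the numbers in this answer; point it at your own file to get meaningful timings.

```ruby
require 'benchmark'
require 'tempfile'

# Hypothetical test data: a temp file of short lines. Swap in your real file.
sample = Tempfile.new('benchmark_lines')
50_000.times { |i| sample.puts("line #{i}") }
sample.close
path = sample.path

# Keep each approach's output so the results can be compared afterwards.
results = {}

Benchmark.bm(22) do |bm|
  # Slurp every line, then strip the terminators.
  bm.report('readlines.map(&:chomp)') do
    results[:readlines] = File.readlines(path).map(&:chomp)
  end

  # Slurp the file as one string, then split on newlines.
  bm.report('read.split') do
    results[:split] = File.read(path).split("\n")
  end

  # Read line by line, stripping each terminator as it arrives.
  bm.report('foreach + chomp') do
    lines = []
    File.foreach(path) { |line| lines << line.chomp }
    results[:foreach] = lines
  end
end
```

Timings will vary with the Ruby version, the file size, and whether the OS has already cached the file, so run it a few times before drawing conclusions.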
However, from experience, I’ve found that “slurping” the file, i.e., reading it all in at once, is not any faster than using a loop with IO.foreach or File.foreach. This is because Ruby and the underlying OS do file buffering as the reads occur, allowing your loop to occur from memory, not directly from disk.
foreach will not strip the line terminators for you, like split would, so you’ll need to add a chomp, or a chomp! if you want to mutate the line read in.
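The two chomp variants can be sketched roughly like this. The temp file stands in for text_file.txt, and the variable names are only illustrative.

```ruby
require 'tempfile'

# Hypothetical sample file standing in for text_file.txt.
sample = Tempfile.new('text_file')
sample.write("Hello\nmy\nname\nis\nJohn\n")
sample.close

# Non-mutating: chomp returns a new string without the line terminator.
lines = []
File.foreach(sample.path) do |line|
  lines << line.chomp
end

# Mutating: chomp! strips the terminator from the line in place.
# chomp! returns nil when there is nothing to strip, so append the
# line itself rather than the return value of chomp!.
lines_in_place = []
File.foreach(sample.path) do |line|
  line.chomp!
  lines_in_place << line
end

p lines
p lines_in_place
```

On newer Rubies (2.4 and later, as far as I know) File.readlines also accepts chomp: true, which would avoid the separate chomp call if slurping is acceptable.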
Also, slurping doesn’t scale: you could end up trying to read a file bigger than memory, bringing your machine to its knees, while reading line by line will never do that.
Here are some performance numbers:
And the results:
On today’s machines a 42MB file can be read into RAM pretty safely. I have seen files a lot bigger than that which won’t fit into the memory of some of our production hosts. While foreach is slower, it’s also not going to take a machine to its knees by sucking up all memory if there isn’t enough memory.
On Ruby 1.9.3, using the map(&:chomp) method, instead of the older form of map { |s| s.chomp }, is a lot faster. That wasn’t true with older versions of Ruby, so caveat emptor.
Also, note that all the above processed the data in less than one second on my several years old Mac Pro. All in all I’d say that worrying about the load speed is premature optimization, and the real problem will be what is done after the data is loaded.
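As for the Set part of the question: as far as I know there is no readlines-style method that returns a Set directly, but the same line-by-line loop can fill one. A rough sketch, using a made-up sample file with one duplicate line:

```ruby
require 'set'
require 'tempfile'

# Hypothetical sample file; "John" appears twice on purpose.
sample = Tempfile.new('text_file')
sample.write("Hello\nmy\nname\nis\nJohn\nJohn\n")
sample.close

# Add each chomped line to the Set; duplicates collapse automatically.
words = Set.new
File.foreach(sample.path) { |line| words << line.chomp }

p words.size
p words.include?('John')
```

Whether a Set actually helps depends on the later processing: it pays off for membership lookups, but it drops duplicate lines, which may or may not be what you want.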