I want to split a txt file into multiple files where each file contains no more than 5 MB. I know there are tools for this, but I need this for a project and want to do it in Ruby. Also, I prefer to do this with File.open in block context if possible, but I fail miserably :o(
#!/usr/bin/env ruby
require 'pp'
MAX_BYTES = 5_000_000
file_num = 0
bytes = 0
File.open("test.txt", 'r') do |data_in|
  File.open("#{file_num}.txt", 'w') do |data_out|
    data_in.each_line do |line|
      data_out.puts line
      bytes += line.length
      if bytes > MAX_BYTES
        bytes = 0
        file_num += 1
        # next file
      end
    end
  end
end
This works, but I don’t think it is elegant. Also, I still wonder if it can be done with File.open in block context only.
#!/usr/bin/env ruby
require 'pp'
MAX_BYTES = 5_000_000
file_num = 0
bytes = 0
File.open("test.txt", 'r') do |data_in|
  data_out = File.open("#{file_num}.txt", 'w')
  data_in.each_line do |line|
    # Reopen under the next file name once the previous chunk has been closed.
    # (A closed File still responds to :write, so check closed? instead.)
    data_out = File.open("#{file_num}.txt", 'w') if data_out.closed?
    data_out.puts line
    bytes += line.bytesize # count bytes, not characters
    if bytes > MAX_BYTES
      bytes = 0
      file_num += 1
      data_out.close
    end
  end
  data_out.close unless data_out.closed?
end
Cheers,
Martin
[Updated] Wrote a short version without any helper variables and put everything in a method:
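A minimal sketch of what such a method could look like. The names `chunker`, `f_in`, `out_pref` and the `_NNNNN.txt` output pattern are assumptions made for illustration, not something the post confirms:

```ruby
# Hypothetical helper: split f_in into pieces of at most chunksize bytes.
# Method name, parameter names and output file pattern are assumptions.
def chunker(f_in, out_pref, chunksize = 5_000_000)
  File.open(f_in, 'rb') do |fh_in|
    until fh_in.eof?
      # The cursor position divided by the chunk size gives the chunk
      # index (0 for the first chunk), zero-padded to five digits.
      name = format('%s_%05d.txt', out_pref, fh_in.pos / chunksize)
      File.open(name, 'wb') do |fh_out|
        # read(chunksize) returns at most chunksize bytes and ignores
        # line boundaries, so a line may be cut in the middle.
        fh_out << fh_in.read(chunksize)
      end
    end
  end
end
```

Under these assumptions, `chunker('test.txt', 'chunk')` would write `chunk_00000.txt`, `chunk_00001.txt` and so on, using only `File.open` in block form.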
Instead of a line loop you can use `.read(length)` and do a loop only for the `EOF` marker and the file cursor.

This takes care that the chunky files are never bigger than your desired chunk size.

On the other hand it never takes care of line breaks (`\n`)!

Numbers for chunk files will be generated from integer division of the current file cursor position by chunksize, formatted with "%05d", which results in 5-digit numbers with leading zeros (`00001`).

This is only possible because `.read(chunksize)` is used. In the second example below, it could not be used!

Update: Splitting with line break recognition
If you really need complete lines with `\n` then use this modified code snippet:

I had to introduce a helper variable `line` because I want to ensure that the chunky file size is always below the `chunksize` limit! If you don’t do this extended check you will also get file sizes above the limit. The `while` statement only successfully checks in the next iteration step, when the line is already written. (Working with `.ungetc` or other complex calculations will make the code more unreadable and not shorter than this example.)

Unfortunately you have to have a second `EOF` check, because the last chunk iteration will mostly result in a smaller chunk.

Also two helper variables are needed: the `line` is described above, the `outfilenum` is needed because the resulting file sizes mostly do not match the exact `chunksize`.