Slow file reads on Win/Ruby 1.9: Use 'rb' mode (and maybe 1.9.2) to slurp big files quickly.

posted 2010-Nov-15

I'm forced to use Ruby on Windows at work. (I love Ruby. I'm not a big fan of Windows.) Recently I was processing large (150MB) log files. Each is a text file with reasonable line lengths (300-500 characters per line). In the process, I found out that File.open is very slow for large files on Windows unless you read it as a binary file.

Following are the results of a test I made using Ruby 1.9.1 on Windows 7 (64-bit) using smaller subsets of one of the log files:

File.foreach File.open(file) File.open(file,'rb')
2k.txt
2,000 lines
630kiB
~19k lines/sec
~6.0MB/sec
~11k lines/sec
~3.5MB/sec
~188k lines/sec
~60MB/sec
10k.txt
10,000 lines
3,136kiB
~19k lines/sec
~6.0MB/sec
~8k lines/sec
~2.5MB/sec
~188k lines/sec
~58MB/sec
20k.txt
20,000 lines
6,394kiB
~19k lines/sec
~6.0MB/sec
~5.6k lines/sec
~1.8MB/sec
~188k lines/sec
~58MB/sec
40k.txt
40,000 lines
12,787kiB
~19k lines/sec
~6.0MB/sec
~3.3k lines/sec
~1.0MB/sec
~188k lines/sec
~58MB/sec

The graphs above are the result of watching the middle test on an 80k file using Process Explorer. (To ensure that the file is in the disk cache, the same process is run twice; see below.) It's particularly surprising that a) memory appears to be being garbage collected in the middle of the file slurp, and b) much of the memory allocated by the first slurp isn't released as the second one starts. Although not shown in these tests, disabling garbage collection does not improve this problem noticeably.

Lesson: Don't try to slurp a really large file on Windows using File.open in text mode. At 150MB, I let my script run for almost half an hour on the first file before giving up and killing the process.

The tests were run on a 2.8GHz Intel i7 860 with 8GB of RAM and a 10kRPM hard drive. The machine was under no load and had plenty of RAM available. All tests max out the CPU on one of eight cores while running. The worst-performing test consumed only 30.4MB at its peak.

Here is the test code:

...and here is the output I got:

C:\>ruby --version
ruby 1.9.1p243 (2009-07-16 revision 24175) [i386-mingw32]

C:\>ruby foreach.rb 2k.txt
File.foreach read 2000 lines & 0.6MB in 0.103s (19415 lines/sec, 6.0MB/sec)
File.foreach read 2000 lines & 0.6MB in 0.103s (19415 lines/sec, 6.0MB/sec)

C:\>ruby foreach.rb 10k.txt
File.foreach read 10000 lines & 3.1MB in 0.507s (19721 lines/sec, 6.0MB/sec)
File.foreach read 10000 lines & 3.1MB in 0.513s (19491 lines/sec, 6.0MB/sec)

C:\>ruby foreach.rb 20k.txt
File.foreach read 20000 lines & 6.2MB in 1.033s (19359 lines/sec, 6.0MB/sec)
File.foreach read 20000 lines & 6.2MB in 1.044s (19155 lines/sec, 6.0MB/sec)

C:\>ruby foreach.rb 40k.txt
File.foreach read 40000 lines & 12.5MB in 2.070s (19321 lines/sec, 6.0MB/sec)
File.foreach read 40000 lines & 12.5MB in 2.091s (19127 lines/sec, 6.0MB/sec)

C:\>ruby readrb.rb 2k.txt
File.open(...,'rb') read 2000 lines & 0.6MB in 0.011s (181801 lines/sec, 55.7MB/sec)
File.open(...,'rb') read 2000 lines & 0.6MB in 0.010s (199980 lines/sec, 61.3MB/sec)

C:\>ruby readrb.rb 10k.txt
File.open(...,'rb') read 10000 lines & 3.1MB in 0.053s (188657 lines/sec, 57.6MB/sec)
File.open(...,'rb') read 10000 lines & 3.1MB in 0.054s (185168 lines/sec, 56.5MB/sec)

C:\>ruby readrb.rb 20k.txt
File.open(...,'rb') read 20000 lines & 6.2MB in 0.106s (188659 lines/sec, 58.7MB/sec)
File.open(...,'rb') read 20000 lines & 6.2MB in 0.107s (186896 lines/sec, 58.2MB/sec)

C:\>ruby readrb.rb 40k.txt
File.open(...,'rb') read 40000 lines & 12.4MB in 0.213s (187774 lines/sec, 58.4MB/sec)
File.open(...,'rb') read 40000 lines & 12.4MB in 0.213s (187774 lines/sec, 58.4MB/sec)

C:\>ruby read.rb 2k.txt
File.open(...) read 2000 lines & 0.6MB in 0.177s (11298 lines/sec, 3.5MB/sec)
File.open(...) read 2000 lines & 0.6MB in 0.177s (11298 lines/sec, 3.5MB/sec)

C:\>ruby read.rb 10k.txt
File.open(...) read 10000 lines & 3.1MB in 1.205s (8297 lines/sec, 2.5MB/sec)
File.open(...) read 10000 lines & 3.1MB in 1.208s (8277 lines/sec, 2.5MB/sec)

C:\>ruby read.rb 20k.txt
File.open(...) read 20000 lines & 6.2MB in 3.547s (5638 lines/sec, 1.8MB/sec)
File.open(...) read 20000 lines & 6.2MB in 3.559s (5618 lines/sec, 1.7MB/sec)

C:\>ruby read.rb 40k.txt
File.open(...) read 40000 lines & 12.4MB in 11.784s (3394 lines/sec, 1.1MB/sec)
File.open(...) read 40000 lines & 12.4MB in 11.870s (3369 lines/sec, 1.0MB/sec)

The clear answer should be to use File.open(file,'rb'){ |x| x.read } (when RAM is available), but a separate machine (running the same build of Ruby on 32-bit Windows 2003 Server) gave me this error when run on large files:

irb(main):003:0> contents = File.open(logfile,'rb'){ |x| x.read }
Errno::EINVAL: Invalid argument - my_really_big_logfile.txt
        from (irb):3:in `read'
        from (irb):3:in `block in irb_binding'
        from (irb):3:in `open'
        from (irb):3
        from C:/Ruby/bin/irb:12:in `
'

Upgrading to Ruby 1.9.2 on that machine fixed this problem.

net.mind details other résumé contact
Phrogz.net