Thursday, December 03, 2009

Subtleties of Perl - Reading files

I've recently begun a new job, and with it has come a whole new segment of the software development universe. The new job uses a lot of Perl, C, Bash, and various other languages to get stuff done. Today, I ran afoul of a Perl idiosyncrasy that's worth writing down, because I'm sure I'll stumble across this problem again and will need to refer back to this post. I should also note that I'm writing this while waiting for a rather large file to parse.

We have large log files that we parse daily to extract summary information about mechanical systems. We read the files and then output a per-second summary, one line at a time. The trouble was in how we read those files. There are any number of ways to read a file in Perl, and it turns out that for the longest time we've been using the wrong one. Previously, we had been using:


foreach my $line_of_log (<LOG>)
{
    # DO STUFF WITH $line_of_log
}


We thought that this read the log file one line at a time, processing each line before moving on to the next. What it was actually doing was reading (or "slurping") the whole file into memory: 'foreach' evaluates <LOG> in list context, so Perl hands back every line at once as a list of strings, which we then processed one at a time. After 10 minutes of cursory Googling, I ran across a tutorial which presented this:


while (<LOG>)
{
    my $line_of_log = $_;
    # DO STUFF WITH $line_of_log
}


The 'while' version of the file read actually does what we thought we were doing all along: reading one line from the file, and then doing stuff with it. The difference between the two methods is that in the 'foreach' version, the entire input file gets read into memory, whereas in the 'while' version, only a single line is held in memory at any given time. The practical consequence is that reading a 7 MB file resulted in Perl grabbing 34 MB of memory with the 'foreach' version, but only 2.2 MB with the 'while' version. That's an ENTIRE ORDER OF MAGNITUDE of difference! This matters a great deal when running Perl on memory-limited systems, as we are.
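
For future reference, here is a minimal sketch that puts the two kinds of read side by side. It is an illustration under some assumptions, not our actual parser: the temporary file and its five lines are made up, and it uses lexical filehandles ($slurp_fh, $log) with three-argument open instead of the bareword LOG from the snippets above. The counter is only a stand-in for the real per-line work.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Build a small stand-in log file (hypothetical contents, for illustration only).
my ($tmp_fh, $log_path) = tempfile(UNLINK => 1);
print {$tmp_fh} "line $_\n" for 1 .. 5;
close $tmp_fh or die "Cannot close $log_path: $!";

# List context: the readline operator returns every remaining line at once,
# so the whole file is held in memory (this is what 'foreach' triggers).
open my $slurp_fh, '<', $log_path or die "Cannot open $log_path: $!";
my @all_lines = <$slurp_fh>;
close $slurp_fh;

# Scalar context: each evaluation returns only the next line.
# Perl implicitly wraps this loop condition in defined(), so a line
# containing just "0" does not end the loop early.
open my $log, '<', $log_path or die "Cannot open $log_path: $!";
my $line_count = 0;
while (my $line_of_log = <$log>) {
    chomp $line_of_log;
    $line_count++;    # stand-in for the real per-line processing
}
close $log;

print "slurped ", scalar(@all_lines), " lines; streamed $line_count lines\n";
```

Both reads see the same five lines. The only difference is how many of them exist in memory at once. Assigning straight to a 'my' variable in the loop condition also skips the extra copy out of $_.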
