slurp mode - reading a file in one step
While in most cases we'd process a text file line-by-line, there are cases when it is easier to do the work if the whole content of the file is in memory in a single scalar variable.
For example, when we need to replace Java is Hot with Jabba the Hutt in a text file where the original text might be spread over more than one line:
... We think that Java is Hot. ....
(Probably this is going to be funny only to programmers who are Star Wars fans and who have a Hungarian accent in English as I do. Or maybe not even to them.)
In any case you can escape now and read more about Jabba the Hutt or about Java.
Before you go on reading, please note that this article first shows the "manual" way to slurp in a file. That works, but there are more modern and much more readable ways to do it using Path::Tiny.
Let's see an example. This is what we have in the data.txt file:
Java is Hot Java is Hot
examples/slurp_in_main.pl
    use strict;
    use warnings;
    use 5.010;

    my $file = 'data.txt';
    open my $fh, '<', $file or die;
    $/ = undef;
    my $data = <$fh>;
    close $fh;

    print $data;
    $data =~ s/Java\s+is\s+Hot/Jabba The Hutt/g;
    say '-' x 30;
    print $data;
Running the above Perl program we get the following output:
Java is Hot Java is Hot ------------------------------ Jabba The Hutt Jabba The Hutt
Explanation
The $/ variable is the Input Record Separator in Perl. When we put the read-line operator in scalar context, for example by assigning to a scalar variable $x = <$fh>, Perl will read from the file up to and including the Input Record Separator which is, by default, the new-line \n.
What we did here is assign undef to $/. With no record separator defined, the read-line operator has nothing to stop at, so it reads until the end of the file. This is what is called slurp mode, because of the sound the file makes when we read it.
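To see the difference between the two settings, here is a minimal sketch that reads the same text twice: once with the default separator and once in slurp mode. It uses an in-memory filehandle so that it runs without any file on disk. The first_read helper and the sample text are made up for this illustration.

```perl
use strict;
use warnings;

# Sample content; an in-memory filehandle stands in for data.txt.
my $text = "Java is\nHot\nJava is Hot\n";

# Hypothetical helper: open the text and return whatever the first
# call of the read-line operator gives us with the given separator.
sub first_read {
    my ($separator) = @_;
    open my $fh, '<', \$text or die "Could not open in-memory handle: $!";
    local $/ = $separator;
    my $chunk = <$fh>;
    close $fh;
    return $chunk;
}

my $line = first_read("\n");     # default separator: stops after the first newline
my $all  = first_read(undef);    # slurp mode: reads up to the end of the data

print "First read with newline separator: ", length($line), " characters\n";
print "First read in slurp mode: ", length($all), " characters\n";
```

With the newline separator the first read returns only the first line, while in slurp mode the very first read already returns the whole text.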
In case you are wondering about the regex part here is the quick recap provided by J.L. Bismarck Fuentes.
- =~ regex matches $data
- s substitution, its syntax is s/regex_to_match/substitution/modifiers
- \s+ One or more whitespaces
- g Globally match the pattern repeatedly in the string
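As a small illustration of the recap above, the following sketch runs the same substitution on a string in which one occurrence is split by a newline and another one by several spaces. The sample string is invented for this example. In scalar context s///g returns the number of substitutions it made, which we store in $count.

```perl
use strict;
use warnings;

# Invented sample: the first match spans a newline, the second has extra spaces.
my $data = "We think that Java is\nHot. Others say Java   is Hot too.\n";

# \s+ matches any run of whitespace, including the newline,
# and /g repeats the substitution for every match in the string.
my $count = ($data =~ s/Java\s+is\s+Hot/Jabba The Hutt/g);

print "Replaced $count occurrences\n";
print $data;
```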
The big problem with our first script is that $/ is a global variable. This means that if we change $/ in one place, it changes the behavior of Perl everywhere else in our code as well. It even affects third-party modules used in our application. That is certainly not good.
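The following sketch shows the kind of surprise this can cause. The careless_slurp and count_lines functions are hypothetical and the data lives in an in-memory filehandle, but the effect should be the same with real files.

```perl
use strict;
use warnings;

my $text = "one\ntwo\nthree\n";

# Hypothetical function that changes $/ and never restores it.
sub careless_slurp {
    open my $fh, '<', \$text or die "Could not open in-memory handle: $!";
    $/ = undef;
    my $all = <$fh>;
    close $fh;
    return $all;
}

# Hypothetical unrelated code that expects to read line by line.
sub count_lines {
    open my $fh, '<', \$text or die "Could not open in-memory handle: $!";
    my $count = 0;
    while (my $line = <$fh>) {
        $count++;
    }
    close $fh;
    return $count;
}

my $before = count_lines();
careless_slurp();
my $after = count_lines();

print "Records seen before the slurp: $before\n";
print "Records seen after the slurp:  $after\n";
```

Before the slurp the loop sees three lines. Afterwards it sees a single record holding the whole text, even though count_lines itself did not change.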
So it is better to localize it:
Localize the change
examples/slurp_localized.pl
    use strict;
    use warnings;
    use 5.010;

    my $file = 'data.txt';
    my $data;
    {
        open my $fh, '<', $file or die;
        local $/ = undef;
        $data = <$fh>;
        close $fh;
    }

    print $data;
    $data =~ s/Java\s+is\s+Hot/Jabba The Hutt/g;
    say '-' x 30;
    print $data;
We have 3 changes in this code:
- We put the local keyword in front of the assignment to $/. This makes sure that $/ gets its previous value back as soon as the enclosing block ends.
- For this we needed an enclosing block, so we added a pair of curly braces around the code-snippet dealing with the file.
- The third change is that we had to declare the $data variable outside of the block, or it would go out of scope when the block ends.
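To convince ourselves that the value is really restored, a short sketch along these lines can help. It reads from an in-memory filehandle instead of data.txt, and the $inside and $outside variables exist only for this illustration.

```perl
use strict;
use warnings;

my $text = "one\ntwo\n";
my $data;
my $inside;

{
    open my $fh, '<', \$text or die "Could not open in-memory handle: $!";
    local $/ = undef;
    # Record the state of $/ while the localized value is in effect.
    $inside = defined $/ ? 'defined' : 'undef';
    $data = <$fh>;
    close $fh;
}

# Once the block has ended, $/ holds its previous value again.
my $outside = defined $/ ? 'defined' : 'undef';

print "Inside the block \$/ was $inside\n";
print "After the block \$/ is $outside\n";
```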
Creating a slurp function
In the third iteration of the code, we create a separate function called slurp that will get the name of the file and return the content as a single string. This allows us to hide the code-snippet at the end of the program or even in a separate file. It also makes it reusable, so instead of copying it to other places where we might need the same functionality we can just call the slurp function.
This makes the main body of our code much nicer.
examples/slurp_in_function.pl
    use strict;
    use warnings;
    use 5.010;

    my $file = 'data.txt';
    my $data = slurp($file);

    print $data;
    $data =~ s/Java\s+is\s+Hot/Jabba The Hutt/g;
    say '-' x 30;
    print $data;

    sub slurp {
        my $file = shift;
        open my $fh, '<', $file or die;
        local $/ = undef;
        my $cont = <$fh>;
        close $fh;
        return $cont;
    }
Of course we could further improve our slurp function by setting the encoding to UTF-8 and by providing a better error message in case one of the system calls fails.
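A minimal sketch of such an improved version might look like the following. The function name slurp_utf8 and the wording of the error messages are our own choices for this illustration, not something fixed by Perl.

```perl
use strict;
use warnings;

# Read a whole file as UTF-8 text and return it as a single string.
# Dies with the file name and the system error if open or close fails.
sub slurp_utf8 {
    my ($file) = @_;

    open my $fh, '<:encoding(UTF-8)', $file
        or die "Could not open '$file' for reading: $!";
    local $/ = undef;
    my $content = <$fh>;
    close $fh
        or die "Could not close '$file': $!";

    return $content;
}
```

It is called the same way as before: my $data = slurp_utf8($file);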
File::Slurp
In the article replacing a string in a file we had a similar example, except that there we used the read_file function of the File::Slurp module.
Path::Tiny
An even better solution is to use the Path::Tiny module. It exports the path function that gets a path to a file as a parameter and returns an object. We can then call the slurp or slurp_utf8 methods on that object:
examples/slurp_path_tiny.pl
    use strict;
    use warnings;
    use 5.010;

    use Path::Tiny qw( path );

    my $file = 'data.txt';
    my $data = path($file)->slurp_utf8;
Installing the modules
Neither of these modules comes with the standard Perl distribution, so you will need to install them first. There are a number of ways to install a Perl module from CPAN.
Comments
As so far, it is not explained how this works: $data =~ s/Java\s+is\s+Hot/Jabba The Hutt/g;
=~ -> regex matches $data
s -> substitution, its syntax is s/
Once the file is slurped into $data is it possible to read line by line from $data?
You can split it by newline and do that, but I wonder, if you'd like to process it line-by-line then why read the whole file in memory?
Good point. I can explain why I wanted to do it this way: I'm not a "real" programmer and I mainly use R (and SQL), where I usually read files into a so-called dataframe (=table). From there I can work on this table, for example, selecting only the rows which fulfill some criteria. So, now I learned that I have to reconsider my habits. I will read line-by-line and create arrays or hashes according to the row criteria. By the way, thank you very much for your blog.
Please do not recommend File::Slurp. Use File::Slurper instead.
In examples 2 and 3 the close $fh isn't needed. Perl will close the $fh when it reaches end of the scope.
I prefer Path::Class slurp
use Path::Class qw{file};
my $content = file("filename")->slurp();
my @lines = file("filename")->slurp(chomp=>1);
Published on 2013-08-26