Given a large CSV file, how can we split it into smaller pieces?
There are many ways to split the file. We are going to see two here: Horizontally or vertically. Horizontally would mean that every N lines go into a separate files, but each line remains intact.
Vertically would mean that every few columns go into a separate file.
Splitting horizontally
If we can assume that the CSV file does not contain any embedded newlines, and if are running on a Linux
machine, then we don't even need Perl. We can just use the split
command of Linux.
If we are on Windows, we could write a simple script to read the original file and every N lines save it in a new file.
Unfortunately these solutions won't work correctly for a file like the following, where at least one of the fields has an embedded newline.
Tudor;Vidor;10;Hapci
Szundi;Morgo;7;Szende
Kuka;"Hofeherke;
alma";100;Kiralyno
Boszorkany;Herceg;9;Meselo
(Also in this example the fields are separated using semi-colon ;
, instead of comma, but that's just for the extra fun.)
This solution will work:
use strict;
use warnings;
use Text::CSV;
my $file = 'csv_with_newline.csv';
my $size = 3;
my $file_counter = 0;
my $line_counter = 0;
my $out;
my $csv = Text::CSV->new ({ binary => 1, auto_diag => 1, sep_char => ';' });
open my $in, "<:encoding(utf8)", $file or die "$file: $!";
while (my $row = $csv->getline($in)) {
$line_counter++;
if ($out and $line_counter > $size) {
close $out;
undef $out;
$line_counter = 0;
}
if (not $out) {
$file_counter++;
my $outfile = "output$file_counter.csv";
open $out, ">:encoding(utf8)", $outfile or die "$outfile: $!";
}
$csv->print($out, $row);
print $out "\n";
}
close $in;
close $out;
Here we read the CSV file by using the getline
method of Text::CSV.
This will read a "logical" line as it understands the internal structure of CSV files.
We use our own scalar variable called $line_counter
to count the logical lines read. This helps us decide when do
we need to close the current output file (if it is open and we crossed the planned size) if ($out and $line_counter > $size)
.
The variable $file_counter
is used to number the output files.
In a more generic solution, we'd probably get the name of the input file, the size, and the format of the output files as command line parameters, but in this example we can use hard-coded values.
If you are not yet familiar with Text::CSV, there is another article that shows how to read a CSV file
with some more explanation. In a nutshell $row
always containing a reference to an ARRAY and each element in that array is a field from the current line in the
CSV file.
Splitting vertically
use strict;
use warnings;
use Text::CSV;
my $file = 'csv_with_newline.csv';
my $size = 3;
my @files;
my $csv = Text::CSV->new ({ binary => 1, auto_diag => 1, sep_char => ';' });
open my $in, "<:encoding(utf8)", $file or die "$file: $!";
while (my $row = $csv->getline($in)) {
if (not @files) {
my $file_counter = int @$row / $size;
$file_counter++ if @$row % $size;
for my $i (1 .. $file_counter) {
my $outfile = "output$i.csv";
open my $out, ">:encoding(utf8)", $outfile or die "$outfile: $!";
push @files, $out;
}
}
my @fields = @$row;
foreach my $i (0 .. $#files) {
my $from = $i*$size;
my $to = $i*$size+$size-1;
$to = $to <= $#fields ? $to : $#fields;
my @data = @fields[$from .. $to];
$csv->print($files[$i], \@data);
print {$files[$i]} "\n";
}
}
close $in;
close $_ for @files;
The resulting files look like these two:
Tudor;Vidor;10
Szundi;Morgo;7
Kuka;"Hofeherke;
alma";100
Boszorkany;Herceg;9
Hapci
Szende
Kiralyno
Meselo
In the code above, we assume all the lines have the same number of fields. After reading the first line,
when @files
is still empty (if (not @files) {
is true), we open all the necessary output files and put the file-handles in
the @files
array. The number of files necessary is calculated in $file_counter
.
It is either the total number of columns divided by the number of columns in each file, or one more, if
there are some left-over columns. (If modulo is not 0.)
For more convenience we copy the content of the current $row
to the @fields
array, then for each file we calculate
the range of the fields that go into that file ($from
, $to
), and use the range operator to fetch the values
of the specific columns: my @data = @fields[$from .. $to];
.
In the print statement we had to wrap $files[$i]
in curly braces, in order to ensure perl will see the whole
expression as the filehandle.
Alternative code
In the above code, we used a couple of temporary variables, such as @fields
and @data
.
This is probably good as it makes the code more readable, but let's see alternative expressions that do the same:
This code:
$to = $to <= $#fields ? $to : $#fields;
my @data = @fields[$from .. $to];
can be replaced by this code:
$to = $to <= $#$row ? $to : $#$row;
my @data = @$row[$from .. $to];
eliminating the need for the @fields
array. I'd probably not write this
as this involved expressions such as $#$row
and @$row[$from .. $to];
,
but I am showing it here so when you do encounter such expressions, you have a better
chance to deal with them.
If you know that $#$row
is the same as $#{$row}
then it might be easier to
see why this does the same as the original code.
In order to eliminate the need for the @data
array we can replace this code:
my @data = @$row[$from .. $to];
$csv->print($files[$i], \@data);
with this line:
$csv->print($files[$i], [ @$row[$from .. $to] ]);
I'd probably avoid this too.