Binary search in a file

If you have a file that contains some sorted information, e.g. a list of words, and you are trying to look up a given word, using in-memory array and Binary search in a Perl array is probably the first thing I would try to do.

However, if the file is big, if it is bigger than the available free memory of the computer, then this is not possible. We need another solution.

Reading the words one-by-one sequentially would be a solution as the memory requirement of that is the size of a single word (or a single line if we are talking about something longer than words). However, reading them sequentially might be a waste. Especially if the lines are already sorted.

We already saw in the binary search in arrays that sequential search has a complexity of O(n) and binary search has a complexity of O(log2(n)). On large files it can have a huge impact.

At 10,000 items, the linear version does 10,000 iterations while the binary search does 14 iterations.

At 10,000 items even if the fixed cost of every binary iteration is 100 times bigger than the fixed cost of a single linear iteration, we are still better off with the binary search.

How could we implement a binary search without reading the whole file into memory?

Planets

For the example I am using a very small file listing some of the planets in the Solar system:

examples/data/planets.txt

Ceres
Charon
Earth
Jupiter
Mars
Mercury
Neptune
Pluto
Saturn
Uranus
Venus

For a larger text file you might want to search for "list of english words in plain text files".

Solution 0: In-memory search

For small files the best solution might be to read the whole file into memory into an array and then use the binary search on the array.

examples/read_file_into_array.pl

use 5.010;
use strict;
use warnings;
use Data::Dumper qw(Dumper);

my ($name, $file) = @ARGV;
die "Usage: $0 NAME FILE\n" if not $file;

open my $fh, '<', $file or die;
my @words = <$fh>;
chomp @words;

#print Dumper \@words;

my $res = binary_search($name, \@words);
if (not defined $res) {
    say "$name not found";
} else {
    say "$name found at $res";
}

sub binary_search {
    my ($str, $list) = @_;

    my $min = 0;
    my $max = @$list - 1;
    
    while ($min <= $max) {
        my $middle = int(($max+$min) / 2);
        # say "$min - $max ($middle)";
        if ($name lt $list->[$middle]) {
            $max = $middle-1;
            next;
        }
        if ($name gt $list->[$middle]) {
            $min = $middle+1;
            next;
        }
        return $middle;
    }

    return;
}

$ perl examples/read_file_into_array.pl Pluto examples/data/planets.txt
Pluto found at 7

and

$ perl examples/read_file_into_array.pl "Brodo Asogi" examples/data/planets.txt
Brodo Asogi not found

Solution 1: Linear search in file

examples/linear_search_in_file.pl

use 5.010;
use strict;
use warnings;

my ($name, $file) = @ARGV;
die "Usage: $0 NAME FILE\n" if not $file;

my $location = linear_search_in_file($name, $file);
if (defined $location) {
    say "$name found in row $location";
} else {
    say "$name not found";
}

sub linear_search_in_file {
    my ($name, $file) = @_;

    open my $fh, '<', $file or die;
    my $count = 0;
    while (my $line = <$fh>) {
        $count++;
        chomp $line;
        if ($line eq $name) {
            return $count;
        }
    }
    return;
}

$ perl examples/linear_search_in_file.pl Pluto examples/data/planets.txt
Pluto found in row 8

$ perl examples/linear_search_in_file.pl "Brodo Asogi" examples/data/planets.txt
Brodo Asogi not found

The results are off-by one. This is due to the fact that in arrays the first element has index 0. So element with index 7 in an array is the 8th element in the original list. Which number is the more correct one will depend on the situation. I'll leave that as an exercise to you, the reader.

Solution 2: Binary search in a file

examples/binary_search_in_file.pl

use 5.010;
use strict;
use warnings;
use Fcntl qw(SEEK_SET SEEK_CUR SEEK_END);

my ($name, $file) = @ARGV;
die "Usage: $0 NAME FILE\n" if not $file;

my $location = binary_search_in_file($name, $file);
if (defined $location) {
    say "$name found at byte $location";
} else {
    say "$name not found";
}

sub binary_search_in_file {
    my ($name, $file) = @_;

    open my $fh, '<', $file or die;
    my $min = 0;
    my $max = -s $file;

    while ($min < $max) {
        my $middle = int(($max+$min) / 2);
        seek $fh, $middle, SEEK_SET;
        <$fh>; # read to the end of line and throw this away

        # make sure we don't try to read from the $max
        my $start = tell($fh);
        if ($max <= $start) {
            $start = $min;
            seek $fh, $start, SEEK_SET;
        }

        my $line = <$fh>;
        chomp $line;

        if ($name lt $line) {
            $max = $start;
            next;
        }
        if ($name gt $line) {
            $min = tell($fh);
            next;
        }

        return $start;
    }

    #while (my $line = <$fh>) {
    #    $count++;
    #    chomp $line;
    #    if ($line eq $name) {
    #        return $count;
    #    }
    #}
    return;
}

Here we maintain a range of bytes using the variables $min and $max. On every iteration we go to the byte that's half way between the min and max places. As this might be in the a line we read to the end of the current line.

Then we check if we are not at the $max, if we are then we should go to $min and start reading from there.

Then we read the current line, chomp it, and compare it to the string we are looking for.

Then either we change the $min or the $max to make the sections smaller.

Testing the solutions

I wanted to make sure the solutions work and if later I find improvements I can easily verify that the changes to the code don't change the results. So I wrote a few tests:

examples/test_binary_search.t

use strict;
use warnings;
use feature 'say';
use File::Temp qw(tempdir);
use File::Spec::Functions qw(catfile);
use Test::More;

my $dir = tempdir( CLEANUP => 1);

my $filepath =  create_file();

subtest memory => sub {
    my $cmd = "$^X examples/read_file_into_array.pl Citrus $filepath";
    my $result = qx($cmd);
    is $result, "Citrus found at 2\n";

    $cmd = "$^X examples/read_file_into_array.pl orange $filepath";
    $result = qx($cmd);
    is $result, "orange not found\n";
};

subtest linear => sub {
    my $cmd = "$^X examples/linear_search_in_file.pl Citrus $filepath";
    my $result = qx($cmd);
    is $result, "Citrus found in row 3\n";

    $cmd = "$^X examples/linear_search_in_file.pl orange $filepath";
    $result = qx($cmd);
    is $result, "orange not found\n";
};

subtest binary => sub {
    my $cmd = "$^X examples/binary_search_in_file.pl Citrus $filepath";
    my $result = qx($cmd);
    is $result, "Citrus found at byte 13\n";

    $cmd = "$^X examples/binary_search_in_file.pl orange $filepath";
    $result = qx($cmd);
    is $result, "orange not found\n";
};

done_testing;

sub create_file {
    my $filepath = catfile($dir, 'data.txt');
    open(my $fh, '>', $filepath) or die;
    print $fh "Apple\n";
    print $fh "Banana\n";
    print $fh "Citrus\n";
    print $fh "Dates\n";
    print $fh "Figs\n";
    close($fh);
    return $filepath;
}

Run them as:

prove examples/test_binary_search.t