Fast lookup by name or by date - Array - Hash

We have a lot of data-structures that have a name, a date when each one was created and some payload. Once in a while we need to find and remove an element. Either by the name of the element or by picking the oldest one. How can we make this efficient given that we have a lot of these data-structures? (Some 10,000.)

The data structure

If we hold each data structure in a hash we have something like this:

{
    name => "some name",
    date => time,
    payload => { },
}

The payload itself has Perl objects in them and I think some of them are even open sockets.

We basically have two ways to hold all the elements:

In array acting as a queue

We can keep them in an array and push any new element to the end. Then the first one is the oldest, we can use shift to remove it and we don't even need to keep the "date" field. Finding the element by name has a complexity of O(n) as we have to go over all the elements using grep or better yet using first from List::Util.

In a hash based on name

We can keep them in hash based on the name: Accessing $h{$name} has a complexity of O(1), but then in order to find the oldest we need to keep a timestamp as well in every object and we have to go over all the elements to find the object. This is O(n).

Linked list

In a linked list every element is connected to the previous element and the next element as well and there is a link from the outside world to the first element. (There might be also a link to the last element for easier bookkeeping.) We can implement such a linked list inside a hash. That way we can access any individual element inside the data structure by looking up its name. As this is a hash operation it takes O(1). We can also easily get to the oldest (first) element as we have a direct link to it from the outside world. That too is O(1).

It seems this approach can be by far the fastest, but we have to take in account that

The code is a bit more complex and thus we need to invest extra effort in verifying it.
Each operation is now slower so depending on the actual pattern of access we might not have any overall speed gain.
The data now takes up more space in memory as we need the have the links to the previous and next element for each piece of data.

Implemented as an array

examples/fast_lookup/AsArray.pm

package AsArray;
use strict;
use warnings;

use Time::HiRes qw(time);
use List::Util qw(first);

sub new {
    my ($class) = @_;

    my $self = bless {}, $class;
    $self->{data} = [];

    return $self;
}

sub add {
    my ($self, $name, $payload) = @_;

    push @{ $self->{data} }, {
       name    => $name,
       payload => $payload,
       date    => time,
    };
    return;
}

sub remove_oldest {
    my ($self) = @_;
    return shift @{ $self->{data} }
}

sub remove_by_name {
    my ($self, $name) = @_;
    
    my $index = first { $name eq $self->{data}[$_]{name} } 0 .. @{ $self->{data} } - 1;
    my ($elem) = splice @{ $self->{data} }, $index, 1; 
    return $elem;
}

sub count {
    my ($self) = @_;
    return scalar @{ $self->{data} };
}


1;

The tricky part here is the implementation of remove_by_name. We go over the indexes of the array from 0 to the highest index which is the number of elements minus 1. Then we use the first function of List::Util which is similar to grep, but stops after it founds the first matching value.

We then use splice to remove the element from the array.

Implemented as an hash

examples/fast_lookup/AsHash.pm

package AsHash;
use strict;
use warnings;

use Time::HiRes qw(time);
use List::Util qw(reduce);

sub new {
    my ($class) = @_;

    my $self = bless {}, $class;
    $self->{data} = {};

    return $self;
}

sub add {
    my ($self, $name, $payload) = @_;

    $self->{data}{$name} = {
       name    => $name,
       payload => $payload,
       date    => time,
    };
    return;
}

sub remove_oldest {
    my ($self) = @_;

    my $min = reduce { $a->{date} < $b->{date} ? $a : $b } values %{ $self->{data} };
    return $self->remove_by_name($min->{name});
}

sub remove_by_name {
    my ($self, $name) = @_;

    return delete $self->{data}{$name};
}

sub count {
    my ($self) = @_;
    return scalar keys %{ $self->{data} };
}

1;

In this implementation the tricky part is finding the oldest element for the remove_oldest method. We go over all the data structures, the values of the "data" hash and we are looking the element with the smallest number in the date field. For that we use the reduce function supplied by List::Util with the ternary operator inside.

Implemented as an linked list

examples/fast_lookup/AsLinkedList.pm

package AsLinkedList;
use strict;
use warnings;

use Time::HiRes qw(time);
use List::Util qw(reduce);

sub new {
    my ($class) = @_;

    my $self = bless {}, $class;
    $self->{data} = {};
    $self->{_first} = undef;
    $self->{_last} = undef;

    return $self;
}

sub add {
    my ($self, $name, $payload) = @_;

    $self->{data}{$name} = {
       name    => $name,
       payload => $payload,
       date    => time,
       _next   => undef,
       _prev   => $self->{_last},
    };
    my $last  = $self->{_last};
    if ($last) {
        $self->{data}{$last}{_next} = $name;
    }
    if (not $self->{_first}) {
        $self->{_first} = $name;
    }

    $self->{_last} = $name;

    return;
}

sub remove_oldest {
    my ($self) = @_;

    my $first = $self->{_first};
    return if not $first;

    return $self->remove_by_name($first);
}

sub remove_by_name {
    my ($self, $name) = @_;

    my $element = delete $self->{data}{$name};
    my $next = $element->{_next};
    my $prev = $element->{_prev};
    $self->{data}{$next}{_prev} = $prev if $next;
    $self->{data}{$prev}{_next} = $next if $prev;

    if ($self->{_first} eq $name) {
        $self->{_first} = $next;
    }
    if ($self->{_last} eq $name) {
        $self->{_last} = $prev;
    }

    delete $element->{_next};
    delete $element->{_prev};
    return $element;
}

sub count {
    my ($self) = @_;
    return scalar keys %{ $self->{data} };
}


1;

In this implementation there is a lot more bookkeeping both when we add a new element and when we remove one. The class itself holds the name of the first and last elements in the "_first" and "_last" fields respectively. Each element in the "data" hash also has a field with the name of the previous and next elements in the "_prev" and "_next" fields respectively.

When we add a new elemen and when we remove an old element we need to update these fields to keep all the data structure up to date.

As an improvement we might want to keep a reference to the objects instead of the name of the field, though I am not sure it is really an improvement.

Tests

I felt that the code is already complex enough to warrant tests.

examples/fast_lookup/test.t

use strict;
use warnings;

use Test::More;

my @classes = qw(AsArray AsHash AsLinkedList);
plan tests => 14 * scalar @classes;

for my $class (@classes) {
    eval "use $class";
    is $@, '', "use $class";

    my $obj = $class->new;
    isa_ok $obj, $class;
    $obj->add('foo', 11);
    $obj->add('bar', 22);
    $obj->add('moo', 33);
    $obj->add('abc', 44);
    $obj->add('def', 55);
    $obj->add('ghi', 66);
    $obj->add('xyz', 77);
    $obj->add('qqq', 88);
    is $obj->count, 8, 'count 8';

    my $old = $obj->remove_oldest;
    is $old->{name}, 'foo', 'foo';

    $old = $obj->remove_by_name('abc');
    is $old->{name}, 'abc', 'abc';

    $old = $obj->remove_oldest;
    is $old->{name}, 'bar', 'bar';

    $old = $obj->remove_oldest;
    is $old->{name}, 'moo', 'moo';

    $old = $obj->remove_oldest;
    is $old->{name}, 'def', 'def';

    $old = $obj->remove_by_name('qqq');
    is $old->{name}, 'qqq', 'qqq';

    $obj->add('zzz', 99);
    is $obj->count, 3, 'count 3';

    $old = $obj->remove_by_name('xyz');
    is $old->{name}, 'xyz', 'xyz';

    $old = $obj->remove_by_name('ghi');
    is $old->{name}, 'ghi', 'ghi';

    $old = $obj->remove_oldest;
    is $old->{name}, 'zzz', 'zzz';
    is $obj->count, 0, 'count 0';
}

Conclusion - which is the fastest

OK so we have built the 3 solutions, but have not compared the speed yet. Let me leave that as an exercise for you now.

Comments

This is an interesting post, addressing a need that I encountered a few years ago. However, I think Gabor has unnecessarity constrained this by stating:

Once in a while we need to find and remove an element. Either by the name of the element or by picking the oldest one.

A useful generalization would be to allow searches on any date/time, or searches over a specified time interval.

Fast lookup by name or by date - Array - Hash - Linked List