Indexing e-mails in an mbox

After reading and answering e-mails I usually store them in a hierarchy of mbox-es. Unfortunately, finding something there is quite difficult. I have tons of messages, and sometimes I file messages under the sender, sometimes under the topic. I am sure there are also cases, probably lots of them, when I make a mistake and file a message in the wrong mbox.

So I decided I'll write a small application to index all the e-mails, put the data in a Mongo database and write a client to search among the messages. Let's start by processing the mailboxes.

There are probably other distributions on CPAN that can handle this job, but I found Email::Folder, currently maintained by RJBS. Let's give it a try.

Traversing the directory tree

Even before I do that I need a way to go over all the files. For this I used Path::Iterator::Rule, which I have used before.

examples/traversing_directory_tree.pl

use strict;
use warnings;
use 5.010;

use Path::Iterator::Rule;

my $path_to_dir = shift or die "Usage: $0 path/to/mail\n";

my $rule = Path::Iterator::Rule->new;
my $it = $rule->iter( $path_to_dir );
while ( my $file = $it->() ) {
    next if not -f $file;
    say $file;
}

I ran this using time so I could see how long it takes:

$ time perl bin/mboxer.pl /home/gabor/mail/
real  0m0.119s
user  0m0.065s
sys   0m0.032s

I did not have to wait for that.
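As a side note, the -f check can probably be moved into the rule itself. The sketch below assumes the file rule of Path::Iterator::Rule (the counterpart of the -f test) and its all method, which returns the full list instead of an iterator. The plain_files helper is a name I made up for illustration; it is not part of the module.

```perl
use strict;
use warnings;
use 5.010;

use Path::Iterator::Rule;

# Hypothetical helper: return every plain file under $dir, sorted by path.
sub plain_files {
    my ($dir) = @_;

    # ->file restricts the rule to plain files, so no -f check is needed later
    my $rule  = Path::Iterator::Rule->new->file;
    my @files = $rule->all($dir);
    return sort @files;
}

# Use the directory given on the command line, or the current one
my $path_to_dir = shift // '.';
say for plain_files($path_to_dir);
```

For a mail hierarchy as small as mine the iterator version is just as good. The list version is only convenient when the sorted paths are needed in one go.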

Going over the messages

The next step is to load each file using Email::Folder. Before I really start processing the data, let's see how long this takes.

Inside the while loop I added the following code to open the e-mail folder and just go through all the messages:

    my $folder = Email::Folder->new($file);
    while (my $msg = $folder->next_message) {  # Email::Simple objects
    }

examples/traversing_messages.pl

use strict;
use warnings;
use 5.010;

use Path::Iterator::Rule;
use Email::Folder;

my $path_to_dir = shift or die "Usage: $0 path/to/mail\n";

my $count = 0;

my $rule = Path::Iterator::Rule->new;
my $it = $rule->iter( $path_to_dir );
while ( my $file = $it->() ) {
    next if not -f $file;
    say $file;
    my $folder = Email::Folder->new($file);
    while (my $msg = $folder->next_message) {  # Email::Simple objects
        $count++;
    }
}
say $count;

I ran this again with time and it took almost 2 minutes. I even added a counter to see how many messages I have. There are 119,026 messages in my folders. Clearly I can't just write a script to search things in these folders. I'll need to index them and then search on those indexes.

$ time bin/mboxer.pl /home/gabor/mail/

119026

real  1m45.088s
user  1m11.283s
sys   0m1.501s

Processing the headers

In most cases I'd like to search based on some of the header fields, not so much on the content of the mail. Well, OK, I guess searching the body will be very important too, but that's a lot of data. Let's try to do something with the header first. But what headers are there? Surely there is a header called From. Let's see how I can fetch that.

I added the following line to the internal while-loop:

say $msg->header('From');

Running this would take another 2 minutes, and after a few tens of addresses I am sure the same formats would be repeated. So I added a counter that calls exit after about 20 entries.

say $msg->header('From');
exit if $main::cnt++ > 20;

I know I could have used the $count variable I added to the code previously, but when I first ran this experiment I did not have that counter yet. Besides, I find this counter a neat trick.

Instead of declaring a lexical variable I access a package variable in the main namespace. use strict disregards such variables when they are accessed with a fully qualified package name.

(If you recall, the error you get when you use a variable without declaring it with my says: Global symbol requires explicit package name. With $main::cnt we provided that explicit package name.)

In addition, use warnings does not complain when we call ++ on a variable that was undef. So this code can be used without additional ceremony.
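To see the trick in isolation, here is a minimal sketch that does not need any mailbox. The list of letters and the name $main::seen are made up for the demonstration. One caveat, as far as I know: if a package variable is mentioned only once in the whole program, perl running with warnings reports "used only once: possible typo". In this sketch the variable appears twice, so that warning is not triggered.

```perl
use strict;
use warnings;
use 5.010;

for my $item (qw(a b c d e)) {
    say $item;

    # $main::seen starts out undef; ++ treats that as 0 without a warning.
    # The post-increment compares the old value, so the loop ends once the
    # old value has reached 2, that is, after the third item.
    last if $main::seen++ >= 2;
}

say "seen is now $main::seen";
```

Because the comparison uses the value from before the increment, it is easy to be off by one or two with this idiom. That does not matter much for a quick experiment.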

The results had 3 different types of address:

Foo Bar <foo@bar.com>
"Foo Bar" <Foo@Bar.com>
=?ISO-8859-1?Q?Foo_B=F6r?= <foo@bar.com>

I'll have to unify them before inserting into the database.
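One possible way to unify the three shapes is sketched below. It uses the MIME-Header decoder of the core Encode module for the encoded name, and a deliberately simple regex for the rest. The normalize_from helper is hypothetical, the regex is only meant to cover the three formats shown above, and lowercasing the whole address is a simplification I am assuming is acceptable for indexing. A dedicated address parser from CPAN would likely be more robust.

```perl
use strict;
use warnings;
use 5.010;

use Encode qw(decode);

# Hypothetical helper: split a From header into (display name, address).
sub normalize_from {
    my ($raw) = @_;

    # Turn =?charset?Q?...?= encoded words into Perl characters
    my $decoded = decode('MIME-Header', $raw);

    my ($name, $address);
    if ($decoded =~ /^\s*(.*?)\s*<([^<>\s]+)>\s*$/) {
        ($name, $address) = ($1, $2);
    } else {
        # No angle brackets: assume the whole value is a bare address
        ($name, $address) = ('', $decoded);
        $address =~ s/^\s+|\s+$//g;
    }

    $name =~ s/^"(.*)"$/$1/;    # drop surrounding double quotes
    return ($name, lc $address);
}

binmode STDOUT, ':encoding(UTF-8)';
for my $from (
    'Foo Bar <foo@bar.com>',
    '"Foo Bar" <Foo@Bar.com>',
    '=?ISO-8859-1?Q?Foo_B=F6r?= <foo@bar.com>',
) {
    my ($name, $address) = normalize_from($from);
    say "$name | $address";
}
```

With this, all three sample values end up with the same address, foo@bar.com, which is what I would use as the key in the database.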

What headers are there?

Another thing that I wanted to know is what kinds of headers there are. The Email::Simple object we got back from Email::Folder has a headers method that returns the list of header names, such as From.

I replaced the say $msg->header('From'); line with a loop to go over the header names and print them:

    foreach my $h ($msg->headers) {
        say $h;
    }

Of course I got the same headers a lot of times, so I decided I'll collect the headers in a hash, and at the same time I'll also count how many times each header appears.

        foreach my $h ($msg->headers) {
            $main::count{$h}++;
        }

At the end I put:

foreach my $k (sort keys %main::count) {
    say "$k  $main::count{$k}";
}

and after the internal while loop I added a call to last. That way the process will stop just after the first file.

Even with that I got a lot of headers. Some of them only differ in case:

Accept-Language  11
Authentication-Results  74
CC  3
Cc  3
Content-Class  1
Content-Disposition  6
Content-Language  20
Content-Transfer-Encoding  34
Content-Type  101
Content-class  1
Content-type  1
DKIM-Signature  5
Date  103
Delivered-To  102
Delivery-date  7
Disposition-Notification-To  3
DomainKey-Signature  3
Envelope-to  7
From  103
Importance  9
In-Reply-To  38
MIME-Version  95
MIME-version  1
Message-ID  96
Message-Id  6
Message-id  1
Mime-Version  6
Organization  10
Priority  1
Received  102
Received-SPF  74
References  37
Reply-To  3
Reply-to  1
Return-Path  95
Return-path  7
Sender  9
Status  103
Subject  103
Thread-Index  26
Thread-Topic  13
To  102
User-Agent  17
X-AnalysisOut  1
X-AntiAbuse  4
X-Authenticated-Sender  2
X-Canit-Stats-ID  1
X-Gm-Message-State  6
X-Google-DKIM-Signature  6
X-Google-Sender-Auth  9
X-HDC-Scanned  2
X-IMAP  1
X-IronPort-AV  1
X-Keywords  102
X-MAIL-FROM  1
X-MIMEOLE  1
X-MS-Has-Attach  13
X-MS-TNEF-Correlator  13
X-MSMail-Priority  1
X-MSMail-priority  1
X-MXL-Hash  1
X-Mailer  24
X-MimeOLE  9
X-Original-To  95
X-OriginalArrivalTime  6
X-Originating-IP  13
X-Priority  6
X-Provags-ID  5
X-Received  4
X-SOURCE-IP  1
X-Scanned-By  1
X-Source  4
X-Source-Args  4
X-Source-Dir  4
X-Spam  1
X-Spam-CTCH-RefID  2
X-Spam-Checker-Version  1
X-Spam-Level  1
X-Spam-Report  2
X-Spam-Score  3
X-Spam-Status  1
X-Stationery  1
X-Status  102
X-UID  102
X-Virus-Scanned  7
acceptlanguage  7
thread-index  3
x-cr-hashedpuzzle  3
x-cr-puzzleid  3
x-mimeole  1
x-originating-ip  2
x-tm-as-product-ver  2
x-tm-as-result  2
x-tm-as-user-approved-sender  2
x-tm-as-user-blocked-sender  2

I think that's enough for the first step. I'll probably have to pick some of the most important header fields and start with those.

Probably: From, To, Date, CC (and Cc), Subject.
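Since several of these names differ only in case, one option is to fold the names to lowercase before counting. Below is a minimal sketch using a handful of the counts from the listing above. The %merged hash is just a name I picked for illustration.

```perl
use strict;
use warnings;
use 5.010;

# A few of the counts from the listing above
my %count = (
    'CC'           => 3,
    'Cc'           => 3,
    'Message-ID'   => 96,
    'Message-Id'   => 6,
    'Message-id'   => 1,
    'MIME-Version' => 95,
    'MIME-version' => 1,
    'Mime-Version' => 6,
);

# Fold each name to lowercase and add up the counts of its variants
my %merged;
for my $name (keys %count) {
    $merged{ lc $name } += $count{$name};
}

for my $name (sort keys %merged) {
    say "$name  $merged{$name}";
}
```

In the real script the same effect could be had by writing $main::count{ lc $h }++ in the loop over the header names.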

Written by
Gabor Szabo

Published on 2015-05-15
