Indexing e-mails in an mbox
After reading and answering e-mails I usually store them in a hierarchy of mboxes. Unfortunately, finding something there is quite difficult. I have tons of messages, and sometimes I file them under the sender, sometimes under the topic. I am sure there are also cases, probably lots of them, when I make a mistake and file a message in the wrong mbox.
So I decided I'll write a small application to index all the e-mails, put the data in a Mongo database and write a client to search among the messages. Let's start by processing the mailboxes.
There are probably other distributions on CPAN that can handle this job, but I found Email::Folder, currently maintained by RJBS. Let's give it a try.
Traversing the directory tree
Even before I do that, I need a way to go over all the files. For this I used Path::Iterator::Rule, which I have used before.
examples/traversing_directory_tree.pl
use strict;
use warnings;
use 5.010;

use Path::Iterator::Rule;

my $path_to_dir = shift or die "Usage: $0 path/to/mail\n";

my $rule = Path::Iterator::Rule->new;
my $it = $rule->iter( $path_to_dir );
while ( my $file = $it->() ) {
    next if not -f $file;
    say $file;
}
I run this using time so I can see how long it takes:
$ time perl bin/mboxer.pl /home/gabor/mail/

real    0m0.119s
user    0m0.065s
sys     0m0.032s
I did not have to wait for that.
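If Path::Iterator::Rule is not installed, roughly the same traversal can be sketched with the core File::Find module. This is a swapped-in technique, not what the script above uses, and the `list_files` helper name is made up for this illustration.

```perl
use strict;
use warnings;
use 5.010;

use File::Find qw(find);

# Hypothetical helper: return every plain file below $dir, sorted by path.
sub list_files {
    my ($dir) = @_;
    my @files;
    find(
        {
            no_chdir => 1,    # $_ then holds the full path of each entry
            wanted   => sub { push @files, $_ if -f $_ },
        },
        $dir
    );
    @files = sort @files;
    return @files;
}

# Print the files only when a directory was given on the command line.
if (@ARGV) {
    say for list_files( $ARGV[0] );
}
```

The `-f` test plays the same role as the `next if not -f $file` line in the loop above: directories are visited but not reported.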
Going over the messages
The next step is to load each file using Email::Folder. Before I go into really processing the data, let's see how long this takes.
Inside the while loop I added the following code to open the e-mail folder and just go through all the messages:
    my $folder = Email::Folder->new($file);
    while (my $msg = $folder->next_message) {  # Email::Simple objects
    }
examples/traversing_messages.pl
use strict;
use warnings;
use 5.010;

use Path::Iterator::Rule;
use Email::Folder;

my $path_to_dir = shift or die "Usage: $0 path/to/mail\n";

my $count = 0;

my $rule = Path::Iterator::Rule->new;
my $it = $rule->iter( $path_to_dir );
while ( my $file = $it->() ) {
    next if not -f $file;
    say $file;
    my $folder = Email::Folder->new($file);
    while (my $msg = $folder->next_message) {  # Email::Simple objects
        $count++;
    }
}
say $count;
I ran this again with time and it took almost 2 minutes. I even added a counter to see how many messages I have. There are 119,026 messages in my folders. Clearly I can't just write a script to search things in these folders. I'll need to index them and then search on those indexes.
$ time bin/mboxer.pl /home/gabor/mail/
119026

real    1m45.088s
user    1m11.283s
sys     0m1.501s
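To make the "index first, search later" idea concrete, here is a minimal in-memory sketch. It is only an illustration of the principle: the `add_to_index` and `lookup` helpers, the folder names and the addresses are all hypothetical, and a real index would live in a database rather than in a hash.

```perl
use strict;
use warnings;
use 5.010;

# Hypothetical in-memory index: lower-cased sender => list of locations.
my %index;

# Record that message number $position in $file was sent by $from.
sub add_to_index {
    my ($from, $file, $position) = @_;
    push @{ $index{ lc $from } }, { file => $file, position => $position };
    return;
}

# Return every recorded location for a sender (empty list if unknown).
sub lookup {
    my ($from) = @_;
    return @{ $index{ lc $from } // [] };
}

# Made-up sample data standing in for one slow pass over the mailboxes.
add_to_index('Foo@Bar.com',       'inbox',     0);
add_to_index('foo@bar.com',       'perl/2015', 7);
add_to_index('other@example.com', 'inbox',     1);

for my $hit ( lookup('foo@bar.com') ) {
    say "$hit->{file} message $hit->{position}";
}
```

The slow pass over all the mailboxes happens once, when the index is built. Each later search is then a single hash lookup instead of another two-minute scan.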
Processing the headers
In most cases I'd like to search based on some of the header fields, not so much on the content of the mail. Well, OK, I guess searching on the body will be very important, but that's a lot of data. Let's try to do something with the header. But what headers are there? Surely there is a header called From. Let's see how I can fetch that.
I added the following line to the internal while-loop:
say $msg->header('From');
Running this would take another 2 minutes, and I am sure that after a few tens of addresses the same formats would just repeat. So I added a counter that calls exit after 20 or so entries.
say $msg->header('From');
exit if ++$main::cnt >= 20;   # stop once 20 addresses have been printed
I know I could have used the $count variable I added to the code previously, but when I first ran this experiment I did not have that counter yet. Besides, I find this counter a neat trick.
Instead of declaring a lexical variable I access a package variable in the main namespace. use strict disregards such variables when they are accessed with a fully qualified package name.
(You may recall the error you get when you use a variable without declaring it with my: Global symbol requires explicit package name. With $main::cnt we provided that explicit package name.)
In addition, use warnings does not complain when we call ++ on a variable that was undef. So this code can be used without additional ceremony.
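The trick can be tried in isolation with a small stand-alone sketch. The list of senders is made up, and `last` is used instead of `exit` so that the summary line still gets printed.

```perl
use strict;
use warnings;
use 5.010;

# Hypothetical stand-in for the From headers of a mailbox.
my @senders = map { "user$_\@example.com" } 1 .. 100;

my @shown;
for my $from (@senders) {
    push @shown, $from;

    # $main::cnt is never declared. The full package name satisfies strict,
    # and ++ on an undefined value does not trigger an "uninitialized" warning.
    last if ++$main::cnt >= 20;
}

say "stopped after $main::cnt of ", scalar(@senders), " senders";
```

This compiles cleanly under use strict even though no my declaration for the counter appears anywhere in the file.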
The results had 3 different types of address:
Foo Bar <foo@bar.com>
"Foo Bar" <Foo@Bar.com>
=?ISO-8859-1?Q?Foo_B=F6r?= <foo@bar.com>
I'll have to unify them before inserting into the database.
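One possible way to unify the three formats is sketched below. It uses the core Encode module to decode the MIME encoded-word form. The `normalize_from` helper is hypothetical, its regex only covers the three shapes shown above, and lower-casing the whole address is an assumption that suits searching but ignores that the local part can technically be case-sensitive. A dedicated address parser from CPAN would be more robust.

```perl
use strict;
use warnings;
use 5.010;

use Encode qw(decode);

# Hypothetical helper (not part of the original script): split a From header
# into a decoded display name and a lower-cased address.
sub normalize_from {
    my ($raw) = @_;

    # Decode encoded words such as =?ISO-8859-1?Q?Foo_B=F6r?=
    my $decoded = decode('MIME-Header', $raw);

    # Optional quotes around the name, then the address in angle brackets.
    my ($name, $address) = $decoded =~ /^\s*"?(.*?)"?\s*<([^<>]+)>\s*$/
        ? ($1, $2)
        : ('', $decoded);    # assume a bare address with no display name

    $address =~ s/^\s+|\s+$//g;
    return { name => $name, address => lc $address };
}

for my $from (
    'Foo Bar <foo@bar.com>',
    '"Foo Bar" <Foo@Bar.com>',
    '=?ISO-8859-1?Q?Foo_B=F6r?= <foo@bar.com>',
) {
    my $n = normalize_from($from);
    say $n->{address};
}
```

All three sample lines end up with the same address key, so they could be stored and searched as one sender.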
What headers are there?
Another thing I wanted to know is what kinds of headers there are. The Email::Simple object we got back from Email::Folder has a headers method that returns the list of header names, such as From. (In Email::Simple the same method is also available under the name header_names.)
I replaced the say $msg->header('From'); line with a loop to go over the header names and print them:
foreach my $h ($msg->headers) {
    say $h;
}
Of course I got the same headers a lot of times, so I decided I'll collect the headers in a hash, and at the same time I'll also count how many times each header appears.
foreach my $h ($msg->headers) {
    $main::count{$h}++;
}
At the end I put:
foreach my $k (sort keys %main::count) {
    say "$k $main::count{$k}";
}
and after the internal while loop I added a call to last. That way the process will stop just after the first file.
Even with that I got a lot of headers. Some of them only differ in case:
Accept-Language 11
Authentication-Results 74
CC 3
Cc 3
Content-Class 1
Content-Disposition 6
Content-Language 20
Content-Transfer-Encoding 34
Content-Type 101
Content-class 1
Content-type 1
DKIM-Signature 5
Date 103
Delivered-To 102
Delivery-date 7
Disposition-Notification-To 3
DomainKey-Signature 3
Envelope-to 7
From 103
Importance 9
In-Reply-To 38
MIME-Version 95
MIME-version 1
Message-ID 96
Message-Id 6
Message-id 1
Mime-Version 6
Organization 10
Priority 1
Received 102
Received-SPF 74
References 37
Reply-To 3
Reply-to 1
Return-Path 95
Return-path 7
Sender 9
Status 103
Subject 103
Thread-Index 26
Thread-Topic 13
To 102
User-Agent 17
X-AnalysisOut 1
X-AntiAbuse 4
X-Authenticated-Sender 2
X-Canit-Stats-ID 1
X-Gm-Message-State 6
X-Google-DKIM-Signature 6
X-Google-Sender-Auth 9
X-HDC-Scanned 2
X-IMAP 1
X-IronPort-AV 1
X-Keywords 102
X-MAIL-FROM 1
X-MIMEOLE 1
X-MS-Has-Attach 13
X-MS-TNEF-Correlator 13
X-MSMail-Priority 1
X-MSMail-priority 1
X-MXL-Hash 1
X-Mailer 24
X-MimeOLE 9
X-Original-To 95
X-OriginalArrivalTime 6
X-Originating-IP 13
X-Priority 6
X-Provags-ID 5
X-Received 4
X-SOURCE-IP 1
X-Scanned-By 1
X-Source 4
X-Source-Args 4
X-Source-Dir 4
X-Spam 1
X-Spam-CTCH-RefID 2
X-Spam-Checker-Version 1
X-Spam-Level 1
X-Spam-Report 2
X-Spam-Score 3
X-Spam-Status 1
X-Stationery 1
X-Status 102
X-UID 102
X-Virus-Scanned 7
acceptlanguage 7
thread-index 3
x-cr-hashedpuzzle 3
x-cr-puzzleid 3
x-mimeole 1
x-originating-ip 2
x-tm-as-product-ver 2
x-tm-as-result 2
x-tm-as-user-approved-sender 2
x-tm-as-user-blocked-sender 2
I think that's enough for the first step. I'll probably have to pick some of the most important header fields and start with those.
Probably: From, To, Date, CC (and Cc), Subject.
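Since several of the names in the list differ only in case, one option is to fold the names to lower case before counting or storing them. Here is a minimal sketch of that idea. The `fold_header_counts` helper is hypothetical, and the sample numbers are copied from the output above.

```perl
use strict;
use warnings;
use 5.010;

# Hypothetical helper: merge counts whose header names differ only in case.
sub fold_header_counts {
    my (%count) = @_;
    my %folded;
    for my $name (keys %count) {
        $folded{ lc $name } += $count{$name};
    }
    return %folded;
}

my %folded = fold_header_counts(
    'CC'         => 3,
    'Cc'         => 3,
    'Message-ID' => 96,
    'Message-Id' => 6,
    'Message-id' => 1,
);

say "$_ $folded{$_}" for sort keys %folded;
```

With the sample data this prints `cc 6` and `message-id 103`, so CC and Cc become a single field and the three spellings of Message-ID add up to the 103 messages in the file.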
The whole series
- Indexing e-mails in an mbox (this article)
- Putting the email in MongoDB - part 1
- Refactoring the script and add logging
- Switching to Moo - adding command line parameters
- Adding the To: field to the MongoDB database
- Adding Date, Size, CC, and Message-ID
Published on 2015-05-15