Adding the To: field to the MongoDB database
The next step in indexing the mailbox will be to include the To: field.
Adding the To field
In the while-loop of the process()add_to(\%doc, $msg); immediately after the add_from call:
add_from(\%doc, $msg) or next; add_to(\%doc, $msg);
Implement the add_to function:
sub add_to { my ($doc, $msg) = @_; my $log = Log::Log4perl->get_logger('add_to'); my $to_string = $msg->header('To'); if (not defined $to_string) { $log->warn("There is no To field in this message"); return; } my @to = Email::Address->parse($to_string); #$log->info($to_string); $doc->{To} = [ map { { name => $_->name, address => $_->address, } } @to ]; return; }
It does check the number of addresses, instead there is a map expression that creates simple hashes from the Email::Address objects returned by the parse method. The new hashes are collected into an anonymous array, and a reference to that array is assigned to the To field of the $doc reference to hash.
If we now connect to the database using the command line client we can see how this structure is entered:
$ mongo > use mboxer switched to db mboxer > db.messages.find().limit(1) { "_id": ObjectId("5308aa0797e32fcb59000015"), "To": [ { "address": "foo@bar.com", "name": "Foo Bar" }, { "addess": "foobar@qux.com", "name": "foobar" } ], "From": { "address": "mongo@exmaple.co", "name": "Mongo User" }, "Subject": "Learning to use MongoDB?" } Fetched 1 record(s) in 4ms
We can see how the "From" field is a hash, while the "To" field is an array of hashes.
We can now play with the command line shell of mongodb and fetch some other messages. For example: db.messages.find( {'From.address': 'foobar@qux.com'} ) will return all the messages where the address of the From-field matches that value.
Respectively db.messages.find( {'To.address': 'foovar@qux.com'} ) will fetch all the messages where one of the addresses of the To field match.
Before executing a find() in the shell we might want to execute a db.messages.find( {'To.address': 'foovar@qux.com'} ).count() that will return the number of instances.
A couple of thins I noticed now.
- Some of the e-mail addresses were in mixed case. e.g FooBar@qux.com
- In cases where the "To" field was only an e-mail without a name, Email::Address helpfully provided the part before the @ as the name. That's probably not what I want in this case. I'd rather have the field left empty.
- There were cases where the name method returned the full e-mail address. Apparently there are emails that are addressed To: "foobar@qux.com" <foobar@qux.com>. We should recognize those too and exclude the name part.
Actually, in order to see these issues I added a couple of logging calls like this one: $log->info('name: ' . $to[0]->name);
We'll try to fix these issues now.
Lower case addresses
This part is easy. We just add a call to lc in front of the $_->address expression in the add_to function, and in front of the $from[0]->address expression in the add_from function.
Replace name by phrase
After looking at the documentation of Email::Address I saw that the name method is working harder than I wanted. I should use the phrase method. (And then there is also a comment section I have not deal with yet.)
If I replace the calls to name by calls to phrase, including in the code that prints the log: $log->info('phrase: ' . $to[0]->phrase);, and run the script, soon I'll get an exception:
Use of uninitialized value in concatenation (.) or string at bin/mbox-indexer.pl line 80,
Wait, we know that. That should not be an exception! What's going on here? Have we turned the warnings into FATAL errors by issuing use warning FATAL => 'all';?
Well, not us, but Moo did. When we write use Moo; it turns on use strict; and use warnings FATAL => 'all'; There was even an emotional discussion about the subject that I don't really agree with. Anyway, I could see it live now. This means the logging line needs to change to this code: $log->info('phrase: ' . ($to[0]->phrase // ''));
Running this did not throw any exception, but when I looked at the results in the database, I noticed that many entries had a null in the name.
"To": [ { "address": "foobar@qux.com", "name": null }
Of course, that's because in the map call we created the name field even when the value was undefined.
In order to this to work I had to change the whole section where we created the little hashes from the Email::Address object:
in the add_from function:
$doc->{From} = { address => lc $from[0]->address, }; if (defined $from[0]->phrase) { $doc->{From}{name} = $from[0]->phrase; }
in the add_to function:
foreach my $t (@to) { my %h = (address => lc $t->address); if (defined $t->phrase) { $h{name} = $t->phrase; } push @{ $doc->{To} }, \%h; }
I am not very happy with this code, but I did not have better idea how to write this.
Remove e-mail as name
The third issue I encountered was when the To field looked like this: To: "foobar@qux.com" <foobar@qux.com>.
That is when the "phrase" and the "address" are the same.
Let's eliminate that case:
Now actually the previous code change came handy. I just had to extend the conditions.
In the add_from function:
if (defined $from[0]->phrase and $from[0]->phrase ne $from[0]->address) { $doc->{From}{name} = $from[0]->phrase; }
In the add_to function:
if (defined $t->phrase and $t->phrase ne $t->address) { $h{name} = $t->phrase; }
The full code
The current version of the code:
use strict; use warnings; use 5.010; use Moo; use MooX::Options; use Path::Iterator::Rule; use Email::Folder; use Email::Address; use MongoDB; use Data::Dumper qw(Dumper); use Log::Log4perl; option path => (is => 'ro', required => 1, format => 's', doc => 'path/to/mail'); option limit => (is => 'ro', required => 0, format => 'i', doc => 'limit number of messages to be processed'); Log::Log4perl->init("log.conf"); main->new_with_options->process(); exit; sub process { my ($self) = @_; my $dir = $self->path; my $log = Log::Log4perl->get_logger('process'); $log->info("Starting to process in '$dir'"); my $client = MongoDB::MongoClient->new(host => 'localhost', port => 27017); my $database = $client->get_database( 'mboxer' ); $database->drop; my $collection = $database->get_collection( 'messages' ); my $count = 0; my $rule = Path::Iterator::Rule->new; my $it = $rule->iter( $dir ); while ( my $file = $it->() ) { next if not -f $file; $log->info("Processing $file"); my $folder = Email::Folder->new($file); while (my $msg = $folder->next_message) { # Email::Simple objects $count++; #say $msg->header; # Use of uninitialized value $field in lc at .../5.18.1/Email/Simple/Header.pm line 123, <GEN0> line 14. #say $msg->header('From'); my %doc; add_from(\%doc, $msg) or next; add_to(\%doc, $msg); #file => $file, $doc{Subject} = $msg->header('Subject'), $collection->insert(\%doc); exit if defined $self->limit and $count > $self->limit; } #last; #exit; } $log->info("Count: $count"); } sub add_to { my ($doc, $msg) = @_; my $log = Log::Log4perl->get_logger('add_to'); my $to_string = $msg->header('To'); if (not defined $to_string) { $log->warn("There is no To field in this message"); return; } $log->info("To: $to_string"); my @to = Email::Address->parse($to_string); if (not @to) { $log->warn("Very strange. No email recognized in the To field! " . $msg->header('To')); return; } $log->info('name: ' . $to[0]->name); $log->info('phrase: ' . ($to[0]->phrase // '')); $log->info('address: ' . $to[0]->address); foreach my $t (@to) { my %h = (address => lc $t->address); if (defined $t->phrase and $t->phrase ne $t->address) { $h{name} = $t->phrase; } push @{ $doc->{To} }, \%h; } return; } sub add_from { my ($doc, $msg) = @_; my $log = Log::Log4perl->get_logger('add_from'); my $from_string = $msg->header('From'); #$log->info("From: $from_string"); if (not defined $from_string) { $log->warn("There is no From field in this message"); return 1; } my @from = Email::Address->parse($from_string); #$log->info(Dumper \@from); if (@from > 1) { $log->warn("Strange, there were more than one emails recognized in the From field: " . $msg->header('From')); } if (not @from) { $log->warn("Very strange. No email in the From field! " . $msg->header('From')); return 1; } #say Dumper \@from; #say $from[0]->address; #say $from[0]->name; if ($from[0]->name eq 'Mail System Internal Data') { return; } $doc->{From} = { address => lc $from[0]->address, }; if (defined $from[0]->phrase and $from[0]->phrase ne $from[0]->address) { $doc->{From}{name} = $from[0]->phrase; } return 1; }

Published on 2015-06-03