stormbind.net - project-laura - tech doc

There are so many stones lying around that we just need some mortar to build a wall against the spam. Let's see how to do that in our very own home-grown system.

For people with big and unified setups I'd guess that DSPAM is a better and more consistent solution.

My current base for all the involved scripts and documentation can always be checked out with svn co svn://www.stormbind.net/stormbind-tools/project-laura/trunk.

An interesting read about a similar system is this Usenix paper from 2004.

The idea

Most users nowadays use IMAP anyway, so it's very easy for them to sort their mail into different folders. It would be only a minor hassle for them to sort mails into spam and non-spam, and we could then use those mails to train a statistical filter that tags new mail.

In this case we're using postfix and would add bogofilter through some custom procmail scripts which would also arrange the filtering and sorting for the user.

The concept

We'll add for every user four new subdirs in his IMAP Maildir directory:

Everything else will end up in the default inbox. Additional actions on mails tagged as Spam or Unsure should be configurable through the user .procmailrc file.

During the testing period we'll use a fifth mailbox called BKUPLAURA where we store every spam and ham mail after feeding it to the bogofilter database, just in case someone moved a mail to the Feed folder instead of creating a copy.

The user .procmailrc file in his $HOME should be as simple as possible, so it will hold only the configuration variables and an include statement for the main script, which holds the logic.

An example for the user .procmailrc file could look like this:
BAYES=no
KILL_SPAM=no
UNSURE_IN_INBOX=no

### DO NOT DELETE OR MODIFY ANYTHING BENEATH THIS LINE ###
INCLUDERC=/etc/bogofilter_procmailrc

The global procmail file in /etc/bogofilter_procmailrc won't be much more complex, because it only has to check the three variables and act accordingly. Be careful if you have mail-only users without a working shell: for those users you have to set the SHELL environment variable in the procmailrc to something useful like /bin/sh. I decided to do that in the bogofilter_procmailrc as well.

As a safety measure for people without a .procmailrc file it's important to also set up a /etc/procmailrc file with something like this in it:
MAILDIR=$HOME/Maildir
DEFAULT=$MAILDIR/new
$HOME is one of the environment variables passed through from the MTA to the MDA by default.

To get bogofilter working correctly we'll have to set up a .bogofilter/ directory with a kind of sample wordlist.db. It should be mostly OK to create such a db with a few spam mails and copy it in with your other skeleton files.

The creation of maildirs and most copy jobs for the various files can be arranged with a small shell script and helper tools like maildirmake. For newly created users most files should be part of their skeleton directory so they'll be pulled in automatically.
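The folder creation described above could be sketched roughly like this. It is only an illustration: it uses plain mkdir instead of maildirmake so that it doesn't depend on one particular maildirmake implementation, it assumes the Maildir++ convention of subfolders named with a leading dot, and the function names and the folder names you pass in are made up for the example.

```shell
#!/bin/sh
# Sketch: create Maildir++ style subfolders for one user.
# Function names and folder names are hypothetical placeholders.

# make_folder MAILDIR NAME: create MAILDIR/.NAME with cur, new and tmp inside.
make_folder() {
    maildir=$1
    name=$2
    for sub in cur new tmp; do
        mkdir -p "$maildir/.$name/$sub" || return 1
    done
}

# setup_user_folders MAILDIR NAME...: create every listed folder.
setup_user_folders() {
    maildir=$1
    shift
    for name in "$@"; do
        make_folder "$maildir" "$name" || return 1
    done
}
```

A call like setup_user_folders "$HOME/Maildir" FolderA FolderB would then create both folders in one go; the same call can be run against the skeleton directory.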

Feeding the sorted mails from the users' maildirs will be arranged by a script run via cron.hourly. This script will simply iterate over the users' $HOME directories and check whether a file wordlist.db exists within the .bogofilter directory. If it does, the script will try to feed the spam and ham mails to bogofilter and remove the mails afterwards. During the testing period all ham and spam mails are saved in a BKUPLAURA folder just in case someone moves a ham mail instead of making a copy.

An additional check is performed on the filesize of the wordlist.db. If it grows over a set limit it will be reduced with the bogocompress tool.
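The size check itself could be sketched as below. The helper names and the limit are assumptions for illustration; the compaction command is passed in as trailing arguments so the sketch doesn't have to guess how the real tool is invoked.

```shell
#!/bin/sh
# Sketch: run a compaction command only when a wordlist outgrows a limit.
# Helper names and the limit are placeholders, not the project's own.

# file_size FILE: print the size of FILE in bytes.
file_size() {
    wc -c < "$1" | tr -d ' '
}

# needs_compaction FILE LIMIT_BYTES: succeed if FILE is larger than the limit.
needs_compaction() {
    [ "$(file_size "$1")" -gt "$2" ]
}

# compact_if_needed FILE LIMIT_BYTES COMMAND...: run COMMAND only if FILE
# is larger than LIMIT_BYTES, otherwise do nothing.
compact_if_needed() {
    file=$1
    limit=$2
    shift 2
    if needs_compaction "$file" "$limit"; then
        "$@"
    fi
}
```

Keeping the check separate from the command makes it easy to try the limit with a harmless command like echo before wiring in the real compaction tool.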

Stuff missing

I'm still lacking any ideas on how to grant the users access to their personal .procmailrc, because we've separated ftp/shell access from the mail system. Maybe it's worth setting up a database with a web frontend and exporting the procmailrc files periodically from the database to the system? It might be possible to hack something into squirrelmail to allow the configuration.
This is done now for a black- and whitelist but not for free-form filtering.
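The export step from a database to the per-user files could look roughly like this. It is only a sketch of the rendering part: the function names are made up, and the mapping of a stored 1 to "yes" and anything else to "no" is an assumption about how the flags would be kept.

```shell
#!/bin/sh
# Sketch: render a user's .procmailrc from three stored flags.
# Assumption: the flags are stored as 0 or 1 and map to no / yes.

# yesno FLAG: print "yes" for 1 and "no" for anything else.
yesno() {
    if [ "$1" = 1 ]; then echo yes; else echo no; fi
}

# render_procmailrc BAYES KILLSPAM UNSURE_INBOX: print the file to stdout.
render_procmailrc() {
    cat <<EOF
BAYES=$(yesno "$1")
KILL_SPAM=$(yesno "$2")
UNSURE_IN_INBOX=$(yesno "$3")

### DO NOT DELETE OR MODIFY ANYTHING BENEATH THIS LINE ###
INCLUDERC=/etc/bogofilter_procmailrc
EOF
}
```

An exporter would redirect the output into a temporary file in the user's $HOME and rename it to .procmailrc, so procmail never sees a half-written file.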

It might be worth thinking about a fourth configuration option which allows users to simply get their mails tagged and delivered to their inbox without any special sorting. This could help users who are still part-time POP3 users.
I started to play with the idea in a branch, but that got nowhere near usable. One of the main problems is parsing forwarded mails so that we can get the original mail from the attic to train bogofilter with it.

Maybe it's worth keeping the BKUPLAURA folder and giving it some kind of retention timer, like 30 days, with automatic cleanups. A similar retention time could be useful for the T-SPAM folder. Because variable declarations in the user .procmailrc look so similar to those in normal shell scripts, it should even be possible to make this configurable through that file, or to use a database directly if we implement such a thing anyway.
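Such a retention cleanup could be sketched with find. The function name is hypothetical, and the sketch assumes that the file modification time is a good enough stand-in for the age of a mail.

```shell
#!/bin/sh
# Sketch: retention cleanup for one maildir folder.
# Assumption: the file mtime reflects when the mail was stored.

# purge_old_mail FOLDER DAYS: delete messages in FOLDER/cur and FOLDER/new
# that match find's "-mtime +DAYS", i.e. were modified more than DAYS
# full days ago.
purge_old_mail() {
    folder=$1
    days=$2
    for sub in cur new; do
        [ -d "$folder/$sub" ] || continue
        find "$folder/$sub" -type f -mtime +"$days" -exec rm -f {} +
    done
}
```

Because the retention time is just a number, it could come from a shell-style variable in the user's file or from a database column without changing the function.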

Documentation for the Squirrelmail plugin frontend

I finally managed to write a small plugin for the Squirrelmail webmail system. It's just as simple and naked as Squirrelmail itself. So don't expect something fancy, and be careful if you're running something other than the 1.4.x series. Things might have changed, so it might need some tweaking.

The database layout

I'm using MySQL or SQLite through the PHP database abstraction, so it should be just a matter of changing the $dsn variable to get it working with something different.
For the Tcl scripts I'm using xotcl with xosql until TDB is ready and widespread.

Currently the table has 8 fields:

fieldname     description                                                 possible values                                     type
username      Corresponds to the username you can login to squirrelmail.                                                      varchar(30)
bayes         dot_procmail option BAYES                                   0|1                                                 bool
killspam      dot_procmail option KILL_SPAM                               0|1                                                 bool
unsure_inbox  dot_procmail option UNSURE_IN_INBOX                         0|1                                                 bool
bkupmail      option for the bogofilter feeding script                    0|1                                                 bool
needsupdate   option for the dot_procmailrc exporter                      0|1                                                 bool
ltd           TIMESTAMP for last time touched                             auto-generated by mysql                             TIMESTAMP
ltt           checking value for last tool touched                        currently sqm-laura_conf-c/sqm-laura_conf-e/manual  varchar(20)

Thanks

Kudos to Ralf for his constant feedback and his patience.

Last changed: 2010-08-11 sven at timegate dot de