Howto install FuzzyOcr on Debian 5.0 (Lenny) / FuzzyOCR install tutorial on Debian Linux

Friday, March 5th, 2010

Recently I had to install FuzzyOCR on Debian Lenny in order to reduce the amount of “image spam” delivered to end users. Since there is no official install tutorial for Debian users, I decided to write this one in the hope that it might be useful to others.
Here are a few lines that explain what FuzzyOCR is:

FuzzyOcr is a plugin for SpamAssassin which is aimed at unsolicited bulk mail (also known as “Spam”) containing images as the main content carrier. Using different methods, it analyzes the content and properties of images to distinguish between normal mails (Ham) and spam mails.

I won’t go into further detail here; instead, let’s go straight to the concrete packages and configuration I used to get the software up and running.

1. Install the required Debian packages

debian-server# apt-get install netpbm gocr giftext giflib-tools libungif-bin \
libpng3 libungif4g gifsicle ocrad \
libstring-approx-perl libmldbm-perl libmldbm-sync-perl \
liblog-agent-perl libpng12-dev libtiff4-dev libsvga1-dev libx11-dev
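
After the packages are installed, it can be worth confirming that the helper programs are actually on the PATH. The snippet below is only a minimal sketch: the function name missing_tools is made up for this example, and the four binary names are my assumption about what the packages above provide.

```shell
# Print the name of every command given as an argument that is not on PATH.
# missing_tools is a hypothetical helper, not part of FuzzyOcr.
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
    done
    return 0
}

# Assumed binary names shipped by the packages installed above.
missing_tools gocr ocrad gifsicle giftext
```

If the call prints nothing, all of the listed tools were found; any name it prints still needs to be installed.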

2. Download and unpack the latest version of FuzzyOCR

debian-server# wget http://users.own-hero.net/~decoder/fuzzyocr/fuzzyocr-3.6.0.tar.gz
debian-server# tar -zxvf fuzzyocr-3.6.0.tar.gz

3. Copy the FuzzyOCR configuration and plugin files to /etc/mail/spamassassin/ (run these commands from the unpacked FuzzyOcr-3.6.0/ directory)

debian-server# cp -rpf FuzzyOcr.scansets /etc/mail/spamassassin/
debian-server# cp -rpf FuzzyOcr.preps /etc/mail/spamassassin/
debian-server# cp -rpf FuzzyOcr.pm /etc/mail/spamassassin/
debian-server# cp -rpf FuzzyOcr/ /etc/mail/spamassassin/
debian-server# cp -rpf FuzzyOcr.cf /etc/mail/spamassassin

4. Create the log file and the files FuzzyOCR needs for its hashing database.

debian-server# touch /var/log/qmail/FuzzyOcr.log
debian-server# chown vpopmail:vchkpw /var/log/qmail/FuzzyOcr.log
debian-server# touch /etc/mail/spamassassin/FuzzyOcr.db
debian-server# chown vpopmail:vchkpw /etc/mail/spamassassin/FuzzyOcr.db
debian-server# touch /etc/mail/spamassassin/FuzzyOcr.safe.db
debian-server# chown vpopmail:vchkpw /etc/mail/spamassassin/FuzzyOcr.safe.db
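
The six commands above repeat one touch/chown pattern, so they can also be written as a small loop. This is just a sketch: create_owned is a name I made up, and the vpopmail:vchkpw owner only makes sense on a qmail/vpopmail setup like the one used in this tutorial.

```shell
# Create each listed file if it is missing and, when an owner is given,
# hand the file to that owner. An empty owner skips the chown step.
# Usage: create_owned OWNER FILE...
# create_owned is a hypothetical helper for this tutorial only.
create_owned() {
    owner="$1"
    shift
    for file in "$@"; do
        touch "$file" || return 1
        if [ -n "$owner" ]; then
            chown "$owner" "$file" || return 1
        fi
    done
}

# Example call (run as root), using the paths and owner from the step above:
# create_owned vpopmail:vchkpw \
#     /var/log/qmail/FuzzyOcr.log \
#     /etc/mail/spamassassin/FuzzyOcr.db \
#     /etc/mail/spamassassin/FuzzyOcr.safe.db
```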

5. Edit the FuzzyOcr configuration file.

debian-server# vim /etc/mail/spamassassin/FuzzyOcr.cf

Add the following directives to it:

focr_enable_image_hashing 2
focr_db_hash /etc/mail/spamassassin/FuzzyOcr.db
focr_db_safe /etc/mail/spamassassin/FuzzyOcr.safe.db
focr_db_max_days 15
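
To double-check that none of the four directives was forgotten or left commented out, something like the following can help. It is a rough sketch, and missing_directives is a hypothetical helper name, not something that ships with FuzzyOcr.

```shell
# Print every hashing directive that has no active (uncommented) line
# in the given config file.
# Usage: missing_directives /etc/mail/spamassassin/FuzzyOcr.cf
missing_directives() {
    cf="$1"
    for key in focr_enable_image_hashing focr_db_hash focr_db_safe focr_db_max_days; do
        grep -Eq "^[[:space:]]*${key}[[:space:]]" "$cf" || printf '%s\n' "$key"
    done
    return 0
}

# Only run the check if the config file from this tutorial is present.
if [ -f /etc/mail/spamassassin/FuzzyOcr.cf ]; then
    missing_directives /etc/mail/spamassassin/FuzzyOcr.cf
fi
```

No output means all four directives are present. Running spamassassin --lint afterwards is a further way to catch syntax errors in the configuration.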

A few more things need to be done before the install is complete. First, we need to compile netpbm from source, because three of the binary executables required by FuzzyOcr are for some reason not bundled with the Debian Lenny netpbm package.
So we download and untar the latest version of netpbm:

debian-server# links "http://downloads.sourceforge.net/project/netpbm/super_stable/10.35.73/netpbm-10.35.73.tgz?use_mirror=sunet"
debian-server# tar -zxvvf netpbm-10.35.73.tgz

The following “hack” is needed for the source to compile properly:

debian-server# mkdir -p /usr/X11R6/lib
debian-server# ln -sf /usr/lib/libX11.so /usr/X11R6/lib/libX11.so

Next we compile the netpbm source and install it:


debian-server# cd netpbm-10.35.73
debian-server# make && make install

If the build fails during “make”, use the apt-file program to find out which Debian package contains the missing header files that caused the failure.
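
As a rough illustration of that apt-file workflow, the helper below pulls the names of missing headers out of a saved build log. It is a sketch under assumptions: missing_headers and build.log are names I made up, and the pattern assumes gcc-style "error: foo.h: No such file or directory" messages.

```shell
# Read a build log on standard input and print each missing header once.
# Assumes gcc-style "error: <header>: No such file or directory" lines.
# missing_headers is a hypothetical helper for this tutorial only.
missing_headers() {
    sed -n 's/.*error: \([^ :]*\.h\): No such file or directory.*/\1/p' | LC_ALL=C sort -u
}

# Possible usage (not run here; apt-file must be installed first):
#   apt-get install apt-file && apt-file update
#   make 2>&1 | tee build.log
#   missing_headers < build.log | while read -r header; do
#       apt-file search "$header"
#   done
```

Each header name that apt-file search resolves points to a package (usually a -dev package) that can then be installed with apt-get before re-running make.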
Next we proceed with the installation of Tesseract, one of the best open source OCR engines available nowadays.
We download and install it:

debian-server# wget http://tesseract-ocr.googlecode.com/files/tesseract-2.04.tar.gz
debian-server# tar -zxvvf tesseract-2.04.tar.gz
debian-server# cd tesseract-2.04
debian-server# ./configure && make && make install

In order to load FuzzyOcr in SpamAssassin we have to restart SpamAssassin:

debian-server# /etc/init.d/spamassassin restart

Note: If you run SpamAssassin under djb daemontools, restart it with the svc command instead.

The last thing to do is to check that FuzzyOcr is correctly loaded and that it checks new messages for image spam as they arrive. Here is how:

Change back to your FuzzyOcr-3.6.0/ directory:

debian-server# cd FuzzyOcr-3.6.0/
debian-server# cd samples
debian-server# spamassassin --debug FuzzyOcr < ocr-animated.eml >/dev/null

Look at the lines related to FuzzyOcr: the output should contain some lines reporting that FuzzyOcr has found spam in the ocr-animated.eml file.
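
Since the debug output is long, it can help to keep only the FuzzyOcr lines. The snippet below is a small sketch of one way to do that; focr_lines is a made-up helper name. It relies on the fact that spamassassin writes its debug messages to stderr, so stderr is sent into the pipe while the rewritten message on stdout is thrown away.

```shell
# Keep only the lines that mention FuzzyOcr (case-insensitive).
# "|| true" keeps the exit status at 0 when nothing matches.
focr_lines() {
    grep -i 'fuzzyocr' || true
}

# Run the check only if spamassassin and the sample message are available.
# "2>&1 >/dev/null" sends stderr into the pipe and discards stdout.
if command -v spamassassin >/dev/null 2>&1 && [ -f ocr-animated.eml ]; then
    spamassassin --debug FuzzyOcr < ocr-animated.eml 2>&1 >/dev/null | focr_lines
fi
```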
Another way to see what is happening in SpamAssassin is to run it in full debug mode:


debian-server# spamassassin -D

The above command prints detailed debugging information as SpamAssassin processes a message read from standard input.
This article is still pretty much in a beta stage, so I’ll be glad to receive any feedback that helps me improve it!
Thanks for reading!